Reimplement Data Ingest and Processing in TypeScript #52

Draft · wants to merge 4 commits into base: main

Conversation

@META-DREAMER (Contributor) commented Mar 23, 2025

A new TypeScript-based pipeline has been implemented, leveraging SQLite and Drizzle ORM for improved data management and processing:

Key Features

  • Normalized database schema for efficient storage of GitHub data (see the sketch below)
  • Highly configurable scoring and tagging system via TypeScript configuration
  • Advanced pattern matching for expertise recognition
  • Comprehensive CLI for managing the entire pipeline
  • Streamlined processing that eliminates the need for multiple scripts
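
As a rough illustration of the normalized schema mentioned above, one such table might be declared with Drizzle's SQLite helpers along these lines; the table and column names here are placeholders, not necessarily the ones in src/lib/data/schema.ts:

```ts
// Illustrative sketch only; see src/lib/data/schema.ts for the real schema.
import { sqliteTable, text, integer, index } from "drizzle-orm/sqlite-core";

export const rawPullRequestsExample = sqliteTable(
  "raw_pull_requests",
  {
    id: text("id").primaryKey(),
    repository: text("repository").notNull(),
    number: integer("number").notNull(),
    author: text("author").notNull(),
    createdAt: text("created_at").notNull(),
  },
  (table) => ({
    // Index the author column, since contributor queries filter on it.
    authorIdx: index("idx_pr_author").on(table.author),
  })
);
```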

Work in progress; still to do:

  • Migrate GitHub Actions to use the new TS pipeline
  • Implement AI summary steps in the pipeline
  • Output JSON files from the pipeline for processed contribution data
  • Update frontend code to use the SQLite DB directly to generate pages
  • Ensure feature parity with the Python version
  • Test the scoring algorithm to ensure it is accurate and beneficial

Summary by CodeRabbit

  • New Features

    • Launched a new TypeScript-based analytics pipeline for GitHub contributions with enhanced data fetching, processing, and scoring.
    • Expanded environment configuration by introducing additional variables for GitHub and API integrations.
  • Documentation

    • Updated user guides with streamlined instructions for database initialization and pipeline operations.
    • Added contributor guidelines covering modern development and build commands.
  • Chores / Refactor

    • Phased out legacy scripts and refined dependency setups for improved stability and performance.

- Introduced a new TypeScript-based analytics pipeline for improved data management and processing.
- Removed legacy init-db script and updated README to reflect new pipeline commands and configuration options.
- Deleted outdated SQL migration files and adjusted database schema to support new features.
- Enhanced README with detailed instructions for the new pipeline and its configuration.

coderabbitai bot commented Mar 23, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces significant updates across configuration, documentation, database schema, and data processing modules. New TypeScript rules and environment variable settings have been added along with revised guidelines in CLAUDE.md and an updated README. A new TypeScript analytics pipeline replaces legacy scripts with an enhanced CLI (using analyze-pipeline.ts), updated database schema via new SQL scripts and JSON snapshots, and refined modules for GitHub data ingestion, processing, and error handling.

Changes

File(s) and change summary:

  • .cursor/rules/typescript-rules.mdc, .envrc.example, .gitignore, CLAUDE.md, README.md, config/pipeline.config.ts: Added new configuration and documentation files for TypeScript rules, environment variables, pipeline configuration, and contributor guidelines; updated ignore patterns and README to reflect the new TS pipeline.
  • drizzle/0000_aromatic_slipstream.sql, drizzle/0000_serious_susan_delgado.sql, drizzle/meta/0000_snapshot.json, drizzle/meta/_journal.json, src/lib/data/schema.ts: Removed the legacy SQL schema and introduced a revised database schema with new tables, indexes, and modifications to tags and user statistics.
  • package.json, scripts/README.md, scripts/analyze-pipeline.ts, scripts/init-db.ts: Updated build scripts by removing the legacy init-db command, adding a new pipeline CLI command, and incorporating additional dependencies.
  • src/lib/data/db.ts, src/lib/data/github.ts, src/lib/data/ingest.ts, src/lib/data/ingestion.ts, src/lib/data/processing.ts, src/lib/data/queries.ts, src/lib/data/scoring.ts, src/lib/data/types.ts: Modified core data modules: introduced GitHubClient, DataIngestion, and ContributorPipeline classes; enhanced error handling and schema validation; removed outdated scoring and ingestion logic.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant CLI as Pipeline CLI
    participant DI as DataIngestion
    participant GH as GitHubClient
    participant CP as ContributorPipeline
    participant DB as Database

    U->>CLI: Run "pipeline" command
    CLI->>DI: Initialize ingestion process
    DI->>GH: Fetch GitHub data (PRs, Issues, Commits)
    GH-->>DI: Return data
    DI->>DB: Store raw data
    CLI->>CP: Process contributions
    CP->>DB: Query and update scores
    CP-->>CLI: Return results
    CLI->>U: Display analytics

Poem

In code’s vast realm, a pipeline’s born,
New rules and schemas greet the morn.
GitHub data flows with structured art,
Contributions scored right from the start.
A CLI dance in rhythmic delight,
Celebrating progress through code’s light.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@META-DREAMER (Contributor, Author) commented:

@coderabbitai review


coderabbitai bot commented Mar 23, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (31)
CLAUDE.md (3)

1-2: Header Clarity
The title clearly indicates the file’s purpose. Consider adding a one-line summary beneath the header to quickly inform contributors what aspects of the pipeline the document covers.


3-12: Build & Development Section
The commands are well-organized and cover essential tasks such as running the development server, building the production site, and initializing the database. Ensure these commands are fully aligned with the new TypeScript pipeline and the SQLite/Drizzle ORM setup. It might be useful to mention any required environment variables or dependencies that need to be set.


13-22: Code Style Guidelines
This section neatly outlines the coding conventions, from import order to component naming and the use of modern Next.js patterns. Consider linking to additional guidelines or examples for Next.js 15 and shadcn/ui to aid contributors in adhering to these standards.

.envrc.example (1)

1-1: Use a placeholder pattern instead of a realistic-looking token example.

While this provides a clear example, using a token format that resembles a real GitHub PAT could potentially lead to security concerns. Consider using a placeholder pattern that doesn't match the exact format.

-export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_1234567890abcdef1234567890abcdef12345678
+export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_your_github_personal_access_token_here

Additionally, consider adding a comment explaining what permissions this token needs.

src/lib/data/queries.ts (1)

41-71: Consider extracting the JSON parsing logic to a reusable helper function.

Both functions have nearly identical error handling code for JSON parsing. This could be extracted to reduce duplication.

+function safeJsonParse<T>(jsonString: string, errorMessage: string): T[] {
+  try {
+    return JSON.parse(jsonString);
+  } catch (error) {
+    console.error(errorMessage, error);
+    return [];
+  }
+}

 export async function getContributorRecentPRs(username: string, limit = 5) {
   const [summary] = await db
     .select({
       pullRequests: userDailySummaries.pullRequests,
     })
     .from(userDailySummaries)
     .where(eq(userDailySummaries.username, username))
     .orderBy(desc(userDailySummaries.date))
     .limit(1);

   if (!summary) return [];

-  try {
-    const prs = JSON.parse(summary.pullRequests);
-    return prs.slice(0, limit);
-  } catch (error) {
-    console.error("Failed to parse pull requests:", error);
-    return [];
-  }
+  const prs = safeJsonParse<any>(summary.pullRequests, "Failed to parse pull requests:");
+  return prs.slice(0, limit);
 }
src/lib/data/db.ts (1)

7-7: Use const instead of let if you're not reassigning.

Since sqlite is only assigned once, consider making it a const. Note that a const must be initialized at its declaration, so the assignment currently done inside the try block would need to move into the initializer, e.g. via a small helper that wraps the existing try/catch:

- let sqlite: Database;
+ const sqlite: Database = createSqliteConnection(); // hypothetical helper wrapping the existing try/catch
.cursor/rules/typescript-rules.mdc (1)

9-18: Correct spelling errors and improve clarity.

There are a few typos in these lines:

  • “invarients” → “invariants”
  • “explantion” → “explanation”
  • “implmenting” → “implementing”
  • “nessesary” → “necessary”
- specify clearly their inputs, outputs, invarients and types.
+ specify clearly their inputs, outputs, invariants and types.

- review your thought process and present... step by step explination
+ review your thought process and present... step by step explanation

- Before implmenting a feature...
+ Before implementing a feature...

- ...review your assumptions about what is nessesary...
+ ...review your assumptions about what is necessary...
scripts/analyze-pipeline.ts (1)

121-140: Consider extracting date & pagination logic
The loop that fetches data for each repository (lines 121-140) applies consistent date logic. It might be beneficial to extract repeated date-handling logic into a small helper function for reusability and clarity.
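
As a rough sketch of that suggestion, a date-window helper could look like the following; the option and config names are assumptions, not the actual ones in analyze-pipeline.ts:

```ts
import { subDays } from "date-fns";

// Hypothetical helper collecting the repeated date-window logic in one place.
function resolveDateRange(lookbackDays: number, until: Date = new Date()) {
  const since = subDays(until, lookbackDays);
  return { since, until };
}

// Inside the per-repository loop (sketch):
// const { since, until } = resolveDateRange(config.lookbackDays);
```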

src/lib/data/github.ts (2)

27-53: Ensure GH CLI availability
All GraphQL and REST calls assume gh is installed, configured, and accessible. Consider wrapping calls in additional checks or fallback logic to handle environments where gh might not be installed or logged in.
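
One possible guard, sketched under the assumption that the client shells out to the gh binary:

```ts
import { execSync } from "node:child_process";

// Fail fast with a clear message if gh is missing or not authenticated.
function assertGhAvailable(): void {
  try {
    execSync("gh auth status", { stdio: "ignore" });
  } catch {
    throw new Error(
      "GitHub CLI (`gh`) is not installed or not authenticated. " +
        "Run `gh auth login` before using the pipeline."
    );
  }
}
```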


501-549: Enhance concurrency for large commit histories
Current commit-fetching logic sequentially processes up to 100 commits per iteration. For large repositories, consider parallel requests or a chunk-based approach to improve performance.
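
A minimal sketch of the chunk-based approach; fetchCommitPage here is a stand-in for whatever per-page request GitHubClient actually issues:

```ts
// Fetch pages in small parallel batches to bound concurrency.
async function fetchInChunks<T>(
  pages: number[],
  fetchCommitPage: (page: number) => Promise<T[]>,
  chunkSize = 5
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < pages.length; i += chunkSize) {
    const chunk = pages.slice(i, i + chunkSize);
    const batches = await Promise.all(chunk.map((page) => fetchCommitPage(page)));
    results.push(...batches.flat());
  }
  return results;
}
```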

src/lib/data/ingestion.ts (1)

317-320: Inspect partial failures
When a fetch or store operation fails, the function rethrows an error. If partial data ingestion is acceptable, consider adding rollback or partial commit handling strategies to ensure integrity.
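
If atomicity per repository is the goal, one option is to wrap the inserts in a Drizzle transaction. This is only a sketch assuming the db instance from src/lib/data/db.ts; the exact callback shape depends on the SQLite driver in use:

```ts
import { db } from "./db";

// Run all inserts for a single repository inside one transaction so a
// failure rolls back the whole batch instead of leaving partial data.
function ingestRepositoryAtomically(inserts: Array<() => void>) {
  db.transaction(() => {
    for (const run of inserts) run();
  });
}
```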

README.md (5)

51-55: Secure newly introduced environment variables.
The new OpenAI/OpenRouter vars look good. Ensure .envrc or a similar approach is used so they aren't committed.


84-84: Confirm documentation references.
The step now references bun run generate-db. Verify other docs or scripts do not still reference the older init-db.


97-100: Add internal cross-reference to pipeline docs.
Linking to scripts/README.md might help new users quickly find extended instructions.


112-118: Highlight advanced config examples.
Consider adding a short code snippet showcasing how to tune scoring or tags in pipeline.config.ts.
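
For example, the README could include a short, hypothetical excerpt along these lines; the field names here are illustrative, and the authoritative shape lives in config/pipeline.config.ts:

```ts
export const pipelineConfig = {
  scoring: {
    pullRequest: { base: 7, mergedBonus: 3 },
    review: { base: 5 },
    issue: { base: 5 },
  },
  tags: [
    // Boost scores for TypeScript-heavy contributions.
    { name: "typescript", category: "TECH", patterns: ["\\.tsx?$"], weight: 1.5 },
  ],
};
```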


233-233: Clarify Bun version requirement.
If older Bun versions are unsupported, mention that explicitly here.

scripts/README.md (4)

33-41: Configuration details are concise.
Suggest referencing a sample config snippet to demonstrate advanced usage.


43-43: Legacy script references are helpful for context.
Ensure we cross-link them with the new pipeline where relevant.


112-113: Visually separate the data flow comparison.
A heading or a short note here might make the differences clearer at a glance.


129-141: New pipeline diagram is well-structured.
Consider detailing how data transitions from raw to analyzed.

src/lib/data/schema.ts (7)

21-58: Check indexing strategy for large PR data.
If you frequently query by mergedAt or closedAt, indexing might improve performance.


81-111: Align naming for created/updated fields.
The table uses createdAt, updatedAt, but the user table uses lastUpdated. Consistency helps.


141-161: Potential large patch storage.
Storing patch data as text can impact performance. Consider storing diffs externally if they’re large.


163-182: Optional reference to commit in PR reviews.
Might be valuable if you want to link reviews to specific commit diffs.


268-268: Added fields improve tagging flexibility.
If patterns grow large, consider normalizing them in a separate table.

Also applies to: 272-273


285-299: Strengthen user-tag relationship.
A unique compound index (username, tag) might help prevent duplicates.
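
In Drizzle's SQLite dialect that could look roughly like the sketch below; the column names follow the migration snapshot rather than the exact schema.ts code:

```ts
import { sqliteTable, text, real, uniqueIndex } from "drizzle-orm/sqlite-core";

export const userTagScores = sqliteTable(
  "user_tag_scores",
  {
    id: text("id").primaryKey(),
    username: text("username").notNull(),
    tag: text("tag").notNull(),
    score: real("score").notNull().default(0),
  },
  (t) => ({
    // Enforce one score row per (username, tag) pair.
    userTagUnique: uniqueIndex("unq_user_tag").on(t.username, t.tag),
  })
);
```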


314-323: Repositories table.
Maintaining lastFetchedAt is helpful. Consider storing a next fetch timestamp if scheduling is used.

src/lib/data/types.ts (3)

9-24: Validate default values in RawCommitSchema.

Everything appears consistent for commits. If certain fields (e.g., messageHeadline) remain unused, consider removing them to keep the schema lean.


33-40: Restrict state field if possible.

Currently, state is an unconstrained string. If you have known states (e.g., “APPROVED”, “CHANGES_REQUESTED”), consider a z.enum to ensure consistency.
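
For instance (the exact value set is an assumption based on GitHub's documented review states):

```ts
import { z } from "zod";

// Constrain review state to known GitHub values instead of a free-form string.
const ReviewStateSchema = z.enum([
  "PENDING",
  "COMMENTED",
  "APPROVED",
  "CHANGES_REQUESTED",
  "DISMISSED",
]);
```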


131-262: Processed data schemas look solid.

The newly added or updated fields (e.g., optional timestamps) improve flexibility. However, note the inconsistent naming (created_at vs. updatedAt). Uniform naming might simplify maintenance.

src/lib/data/processing.ts (1)

274-712: processContributor method is extensive but well-organized.

This method calculates a combined score from PRs, issues, reviews, and comments. The daily cap logic for PRs/reviews helps avoid inflated scoring, and the additive logic is handled well.

You might consider splitting each metric type’s logic (PR, issues, reviews, comments) into smaller helper functions for readability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 423aea2 and 4b77140.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • .cursor/rules/typescript-rules.mdc (1 hunks)
  • .envrc.example (1 hunks)
  • .gitignore (1 hunks)
  • CLAUDE.md (1 hunks)
  • README.md (5 hunks)
  • config/pipeline.config.ts (1 hunks)
  • drizzle/0000_aromatic_slipstream.sql (0 hunks)
  • drizzle/0000_serious_susan_delgado.sql (1 hunks)
  • drizzle/meta/0000_snapshot.json (4 hunks)
  • drizzle/meta/_journal.json (1 hunks)
  • package.json (2 hunks)
  • scripts/README.md (4 hunks)
  • scripts/analyze-pipeline.ts (1 hunks)
  • scripts/init-db.ts (0 hunks)
  • src/lib/data/db.ts (1 hunks)
  • src/lib/data/github.ts (1 hunks)
  • src/lib/data/ingest.ts (0 hunks)
  • src/lib/data/ingestion.ts (1 hunks)
  • src/lib/data/processing.ts (1 hunks)
  • src/lib/data/queries.ts (2 hunks)
  • src/lib/data/schema.ts (5 hunks)
  • src/lib/data/scoring.ts (0 hunks)
  • src/lib/data/types.ts (3 hunks)
💤 Files with no reviewable changes (4)
  • scripts/init-db.ts
  • src/lib/data/ingest.ts
  • drizzle/0000_aromatic_slipstream.sql
  • src/lib/data/scoring.ts
🧰 Additional context used
🧬 Code Definitions (4)
config/pipeline.config.ts (2)
src/lib/data/schema.ts (1)
  • pipelineConfig (306-312)
src/lib/data/types.ts (1)
  • PipelineConfig (352-352)
src/lib/data/github.ts (1)
src/lib/data/types.ts (4)
  • RepositoryConfig (355-355)
  • RawPullRequestSchema (58-105)
  • RawIssueSchema (107-129)
  • RawCommitSchema (10-24)
src/lib/data/ingestion.ts (3)
src/lib/data/types.ts (2)
  • PipelineConfig (352-352)
  • RepositoryConfig (355-355)
src/lib/data/schema.ts (9)
  • repositories (315-323)
  • users (12-19)
  • rawPullRequests (22-58)
  • rawPullRequestFiles (60-79)
  • rawCommits (113-139)
  • prReviews (163-182)
  • prComments (184-203)
  • rawIssues (81-111)
  • issueComments (205-224)
src/lib/data/github.ts (1)
  • githubClient (554-554)
src/lib/data/processing.ts (2)
src/lib/data/types.ts (1)
  • PipelineConfig (352-352)
src/lib/data/schema.ts (11)
  • rawPullRequests (22-58)
  • rawIssues (81-111)
  • prReviews (163-182)
  • prComments (184-203)
  • issueComments (205-224)
  • users (12-19)
  • rawPullRequestFiles (60-79)
  • tags (268-280)
  • userTagScores (282-303)
  • userDailySummaries (227-248)
  • userStats (250-266)
🪛 LanguageTool
scripts/README.md

[uncategorized] ~48-~48: Loose punctuation mark.
Context: ... Collection - scripts/fetch_github.sh: Fetches raw GitHub data (PRs, issues, c...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~53-~53: Loose punctuation mark.
Context: ...cessing - scripts/calculate_scores.py: Calculates contributor scores based on ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...ary Generation - scripts/summarize.py: Generates human-readable summaries of c...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~67-~67: Loose punctuation mark.
Context: ...nt - scripts/manage_thread_history.sh: Manages versioning and backup of discus...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~145-~145: Loose punctuation mark.
Context: ... G ``` ## Data Storage - data/daily/: Current day's data (legacy) - `data/dai...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~155-~155: Loose punctuation mark.
Context: ...red environment variables: - GH_TOKEN: GitHub access token - `OPENROUTER_API_K...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~158-~158: The singular proper name ‘Site’ must be used with a third-person or a past tense verb.
Context: ...egration (optional) - SITE_NAME: Site name for OpenRouter integration (optional) ...

(HE_VERB_AGR)


[style] ~162-~162: Style-wise, it’s not ideal to insert an adverb (‘eventually’) in the middle of an infinitive construction (‘to replace’). Try moving the adverb to avoid split infinitives.
Context: ...Notes - The new TypeScript pipeline is designed to eventually replace the Python scripts - Both systems can run side by side dur...

(SPLIT_INFINITIVE)

🔇 Additional comments (111)
drizzle/meta/_journal.json (1)

8-9: LGTM: Updated migration metadata.

The timestamp and tag have been updated appropriately to reflect the new migration.

.gitignore (1)

30-31:

❓ Verification inconclusive

Verify all environment file patterns are covered.

The change from .env* to explicit .envrc and .env entries means other environment files like .env.local or .env.development would no longer be ignored.


🏁 Script executed:

#!/bin/bash
# Check for other .env* files that might need to be added to .gitignore
fd -H "^\.env.*" --exclude ".envrc" --exclude ".env" --exclude ".envrc.example"

Length of output: 75


Action: Manually Verify .env Patterns

  • The automated search using the provided script returned no additional files beyond .envrc and .env.
  • However, the lack of output makes it unclear if environment files such as .env.local or .env.development might be present or expected in the project.
  • Please manually verify that no other .env* files are in use or planned, and update the ignore rules if necessary.
src/lib/data/queries.ts (2)

41-47: Good error handling improvement for PR parsing.

Adding try-catch for JSON parsing makes the function more robust against malformed data.


65-71: Good error handling improvement for commit parsing.

Similar to the PR parsing function, this properly handles JSON parsing errors.

src/lib/data/db.ts (1)

8-15: Consider re-enabling WAL mode and verifying concurrency.

Removing the WAL mode can reduce performance under concurrent writes. If concurrency is important, consider reapplying:

PRAGMA journal_mode=WAL;

or confirm your use case does not require it.

package.json (1)

12-13: Looks good!

The new "pipeline" script and the added dependencies (chalk, commander, date-fns, glob, yaml) appear valid. Ensure all of these libraries are indeed required to avoid bloat.

Also applies to: 24-24, 27-28, 31-31, 38-38

config/pipeline.config.ts (4)

11-41: Good config structure for basic pipeline settings.

The definition of repositories, lookbackDays, and botUsers is clear and straightforward. No issues.


43-185: Verify scoring configuration logic.

This block has elaborate scoring rules. Confirm they produce the desired results and avoid edge-case abuses (e.g., awarding excessive points for large auto-generated additions).


187-342: Tag definitions are well-structured.

The area/role/tech tags and multipliers promote clear categorization. This should help tailor scoring to domain expertise.


344-350: AI Summary toggle looks fine.

The optional AI summary feature is well-isolated and can be safely toggled. No concerns.

README.md (5)

46-48: Unify environment variables for clarity.
Using both GH_ACCESS_TOKEN and GH_TOKEN could be confusing. Consider removing GH_ACCESS_TOKEN references if the pipeline only needs GH_TOKEN.


87-87: Validate leftover references to removed commands.
Make sure no scripts or docs still mention the deprecated init-db command.


101-110: Pipeline usage steps look solid.
The stepwise approach to init, fetch, and process is clear. Good job.


179-198: Verify JSON fields match code usage.
The new keys are comprehensive. Confirm that each field is actually processed or displayed somewhere in the pipeline or UI.


215-222: Directory structure updates align with the new pipeline.
Good job reflecting the new SQLite DB and config directory.

scripts/README.md (6)

5-16: Intro to TypeScript pipeline is well explained.
This high-level summary is helpful.


17-31: Quick start section is straightforward.
Providing the pipeline commands in order is helpful for new contributors.


114-114: Legacy pipeline diagram is comprehensive.
No concerns here.


145-149: Ensure disclaimers about new DB location.
Remind users to exclude data/db.sqlite from version control if it’s not ephemeral.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~145-~145: Loose punctuation mark.
Context: ... G ``` ## Data Storage - data/daily/: Current day's data (legacy) - `data/dai...

(UNLIKELY_OPENING_PUNCTUATION)


156-158: Recheck environment var duplication.
GH_TOKEN is repeated across docs. Confirm they're consistent.

🧰 Tools
🪛 LanguageTool

[grammar] ~158-~158: The singular proper name ‘Site’ must be used with a third-person or a past tense verb.
Context: ...egration (optional) - SITE_NAME: Site name for OpenRouter integration (optional) ...

(HE_VERB_AGR)


162-165: Clear transition plan from Python to TypeScript.
Running both pipelines in parallel is smart.

🧰 Tools
🪛 LanguageTool

[style] ~162-~162: Style-wise, it’s not ideal to insert an adverb (‘eventually’) in the middle of an infinitive construction (‘to replace’). Try moving the adverb to avoid split infinitives.
Context: ...Notes - The new TypeScript pipeline is designed to eventually replace the Python scripts - Both systems can run side by side dur...

(SPLIT_INFINITIVE)

src/lib/data/schema.ts (7)

8-8: Nice addition of unique().
Ensures better data constraints.


11-11: Comment clarifies user table purpose.
No further issues.


60-79: Verify changeType usage.
Consider establishing a default or ensuring it is reliably set during ingestion to avoid nulls.


113-139: Ensure user existence on commits.
If commits come from unknown authors, handle them gracefully or auto-insert user rows (see the sketch below).
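
A sketch of the auto-insert option, assuming the users table from schema.ts keys on username (this is not the PR's actual ingestion code):

```ts
import { db } from "./db";
import { users } from "./schema";

// Upsert a minimal user row so commit foreign keys always resolve.
async function ensureUserExists(username: string) {
  await db.insert(users).values({ username }).onConflictDoNothing();
}
```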


184-203: PR comments table looks good.
Structure is consistent with prReviews.


205-224: Issue comments table is consistent.
No concerns here.


305-312: Pipeline config table.
Validate JSON structure on insertion or handle backward compatibility.
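
One way to do that, assuming PipelineConfigSchema is exported from src/lib/data/types.ts, is to run the stored JSON through Zod before use:

```ts
import { PipelineConfigSchema } from "./types";

// Parse and validate a pipeline_config row's JSON payload.
function parseStoredConfig(raw: string) {
  const result = PipelineConfigSchema.safeParse(JSON.parse(raw));
  if (!result.success) {
    throw new Error(`Invalid pipeline config: ${result.error.message}`);
  }
  return result.data;
}
```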

src/lib/data/types.ts (13)

3-7: Define GitHub user schema clearly.

The GithubUserSchema definition is straightforward and accommodates optional/nullable fields for avatarUrl. This looks good as-is.


26-31: RawPRFileSchema alignment is good.

The fields and their default values align well with typical file-level data. No glaring concerns.


42-49: RawCommentSchema usage is consistent.

Nullable/optional text fields make sense to handle missing or incomplete data. Good job.


51-56: RawLabelSchema is straightforward.

The schema covers essential fields for issue/PR labels. Looks fine.


58-105: Comprehensive RawPullRequestSchema.

This schema includes everything from labels to files, commits, reviews, and comments. Good structure for a robust ingestion.


107-129: RawIssueSchema is well-structured.

Optional fields (e.g., closedAt) precisely capture typical GitHub issue data.


264-310: ScoringConfigSchema is extensive yet clear.

The default values cover typical scoring logic, offering large configuration flexibility. Fine as-is.


312-312: TagTypeSchema provides clarity.

Enumerating tag categories as ["AREA", "ROLE", "TECH"] is a good approach to keep the code strongly typed.


314-320: TagConfigSchema extends tagging capabilities.

Storing multiple patterns in patterns fosters easy expansion later. This looks well-designed.


322-327: RepositoryConfigSchema is minimal and effective.

Captures the essential repository properties. No issues found.


328-347: PipelineConfigSchema handles nested configurations well.

The structure for repositories, tags, and AI summary is well-integrated, giving room for further extension.


349-349: Comment for type exports is harmless.

No concerns. This clarifies the upcoming type definitions.


352-357: Type exports are consistent.

The redundant alias ScoringRules (equal to ScoringConfig) may be helpful or extraneous depending on usage. Otherwise, no issues.

drizzle/meta/0000_snapshot.json (13)

4-4: Snapshot ID updated.

This indicates a new migration step/version. Looks standard for Drizzle snapshots.


7-109: issue_comments table introduction.

Fields, indexes, and foreign keys are well-structured. The last_updated defaulting to CURRENT_TIMESTAMP is standard and helpful for tracking modifications.


110-141: pipeline_config table.

Storing pipeline configuration in JSON allows dynamic updates. Straightforward approach, no immediate issues.


142-244: pr_comments table.

Similar structure to issue_comments. The indexing strategy on pr_id and author is consistent with usage patterns. Looks good.


245-347: pr_reviews table.

Handles a broad set of fields (state, body, submitted_at). The references to raw_pull_requests and users look correct.


348-446: raw_commit_files table.

Captures file-level commit data, referencing raw_commits. The partial unique constraints and indexing are well-structured.


447-614: raw_commits table.

Includes author references, message fields, and ties to raw_pull_requests. The indexing on author, repository, and committed_date should help queries. No issues.


615-763: raw_issues table.

Indexes and a unique constraint on (repository, number) effectively enforce uniqueness. Storing labels as a JSON string is suitable for dynamic usage.


764-855: raw_pr_files table.

Maintains file-level data for each pull request. The unq_pr_id_path ensures no duplicate records for a single path. Well done.


856-1049: raw_pull_requests table.

The unique (repository, number) constraint is standard, and indexing on author, repository, and created_at covers common queries. Solid design.


1050-1096: repositories table.

Stores minimal info about each repository, including owner and name. last_fetched_at for tracking synchronization is helpful.


1107-1137: Enhancements to tags table.

New columns (category, weight, patterns) expand tag flexibility, matching the pipeline’s tagging approach.


1413-1517: user_tag_scores improvements.

Switching username and tag to notNull is sensible. This table fosters robust tracking of user–tag relationships. Nicely done.

src/lib/data/processing.ts (18)

1-18: Import statements and type references established.

The combined imports for schema entities and config types provide a clear, centralized reference for the pipeline’s needs.


19-24: Introductory doc block sets context.

The comment clarifies the high-level purpose and scope of the pipeline. Helpful for future maintainers.


26-29: DateRange interface is straightforward.

Specifying start/end ensures a consistent approach to time-based queries.


31-79: ContributorMetrics captures wide-scope stats.

This structure thoroughly enumerates relevant contributor data, from PRs to comments. Suitable for robust analytics.


81-91: ProcessingResult clarifies returned data.

Separating metrics array and totals fosters clear usage. Nice organization.


93-101: ContributorPipeline constructor is simple.

Configuration is typed as PipelineConfig, ensuring compile-time checks on pipeline settings.


103-158: processTimeframe method logic.

Retrieves active contributors, computes metrics, sorts them, and saves daily summaries. Great structure for a top-level pipeline operation.


160-272: getActiveContributors fetches multiple contributor roles.

Pulling authors, reviewers, and commenters ensures a comprehensive active user set. Filtering out bots is a nice touch.


714-736: fetchPullRequests isolates PR queries nicely.

Clear conditions for user, date range, and repository. Straightforward approach.


738-760: fetchIssues parallels fetchPullRequests.

Keeps logic consistent across different resource types. Good strategic uniformity.


762-793: fetchGivenReviews merges review & PR data.

Inner join ensures we only retrieve relevant PR reviews. Good usage of the Drizzle ORM approach.


795-826: fetchPRComments is parallel in structure to review fetching.

Again, consistent approach for bridging comments with PR data. Looks good.


828-869: calculateCodeScore for added/deleted lines.

Capping line changes to maxLines helps avoid outliers. The test coverage bonus is a nice nudge for best practices.


871-898: calculateFocusAreas focuses on top-level directories.

Provides a quick overview of contributor focus. Straightforward logic slicing to top 5 areas.


900-926: calculateFileTypes clarifies extension-based distribution.

Using path.extname is typical. Sorting by count to get top 5 keeps the data manageable.


928-1003: calculateExpertiseAreas applies tag rules.

Combining file path and PR title checks covers relevant patterns. Logging results in DB with storeTagScore ensures persistent tracking.


1005-1056: storeTagScore ensures tag presence before upsert.

Inserting/updating user_tag_scores is a nice usage of onConflictDoUpdate. Code is neat.


1058-1143: saveDailySummaries finalizes data.

Compiles daily stats and merges them into persistent tables. Good plan for historical tracking.

drizzle/0000_serious_susan_delgado.sql (39)

1-10: Table 'issue_comments' structure is well-defined.
The columns, default values, and foreign keys (linking to raw_issues and users) are correctly set up. Ensure that using TEXT for identifiers and dates aligns with your overall schema strategy.


13-13: Index on issue_comments.issue_id is correctly added.
Optimizes queries filtering by issue reference.


14-14: Index on issue_comments.author is appropriately defined.
Helps with lookup queries by comment author.


15-19: Table 'pipeline_config' setup looks solid.
All required fields are present; consider validating the structure of the config content in your application logic.


21-31: Table 'pr_comments' is well structured.
The foreign keys to pull requests and users ensure referential integrity, and default values are appropriate.


33-33: Index on pr_comments.pr_id is well-defined.
This index will improve query performance when filtering by pull request identifiers.


34-34: Index on pr_comments.author is set up appropriately.
Ensures efficient retrieval of comments by author.


35-45: Table 'pr_reviews' is correctly implemented.
Includes necessary fields (review state, submission timestamp, etc.) with proper foreign key constraints.


47-47: Index on pr_reviews.pr_id is properly added.
Facilitates fast lookups of reviews by pull request.


48-48: Index on pr_reviews.author is appropriately defined.
Supports efficient queries filtering by review author.


49-60: Table 'raw_commit_files' is well defined.
Captures file-related commit metrics and the patch content; ensure the TEXT column for patch can handle large diffs as needed.


62-62: Index on raw_commit_files.sha is correctly set.
Enhances performance when joining commit files with commits.


63-80: Table 'raw_commits' structure is robust.
All essential commit details are present, and foreign keys to users and raw_pull_requests are properly enforced. Verify that storing dates and IDs as TEXT meets the broader system requirements.


82-82: Index on raw_commits.author is well-implemented.
This will improve query performance when filtering by commit author.


83-83: Index on raw_commits.repository is appropriately defined.
Optimizes searches based on repository reference.


84-84: Index on raw_commits.committed_date is correctly added.
Facilitates efficient time-based queries of commit records.


85-85: Index on raw_commits.pull_request_id is appropriately defined.
Improves join performance with pull request data.


86-101: Table 'raw_issues' is defined well.
Captures issue metadata with sensible defaults for fields like body and labels. Consider whether storing labels as a JSON string meets your query needs.


103-103: Index on raw_issues.author is properly created.
Helps ensure quick lookups by issue author.


104-104: Index on raw_issues.repository is accurately defined.
Optimizes filtering of issues by repository.


105-105: Index on raw_issues.created_at is well-set.
Boosts query performance for time-based issue queries.


106-106: Unique index on (repository, number) in 'raw_issues' is a sound design choice.
This maintains uniqueness of issue numbers within a repository.


107-116: Table 'raw_pr_files' is well-defined.
Ensures file metadata for pull requests is captured correctly, with a proper foreign key to raw_pull_requests.


118-118: Index on raw_pr_files.pr_id is set up correctly.
Improves performance for queries based on pull request file associations.


119-119: Unique index on (pr_id, path) in 'raw_pr_files' is effective.
Enforces that each file path is unique per pull request.


120-141: Table 'raw_pull_requests' is robustly defined.
It includes comprehensive metadata and enforces integrity through foreign keys. The default for labels as '[]' should be managed appropriately in your application.


143-143: Index on raw_pull_requests.author is correctly applied.
Facilitates efficient queries filtering by pull request authors.


144-144: Index on raw_pull_requests.repository is appropriately defined.
Ensures quick filtering based on the associated repository.


145-145: Index on raw_pull_requests.created_at is well-placed.
Optimizes queries needing time-based ordering of pull requests.


146-146: Unique index on (repository, number) for pull requests is a sound integrity constraint.
Prevents duplicate pull request numbers within the same repository.


147-153: Table 'repositories' is clearly defined.
The columns for owner, name, and timestamp management are appropriately set up.


155-163: Table 'tags' is set up effectively.
Storing tag metadata with a default for patterns (as a JSON string) is acceptable if your app parses it correctly.


165-180: Table 'user_daily_summaries' captures daily activity metrics well.
Usage of TEXT columns to store JSON arrays (for commits, pull requests, and issues) is valid if these are parsed as needed.


182-182: Index on user_daily_summaries.username is properly added.
Optimizes retrieval based on user identifiers.


183-183: Index on user_daily_summaries.date is appropriately set.
Facilitates efficient time-based queries in daily summaries.


184-197: Table 'user_stats' is well-structured.
The fields and JSON defaults for columns like files_by_type and prs_by_month are practical, assuming your processing handles them correctly.


199-210: Table 'user_tag_scores' is thoroughly defined.
Foreign key constraints to users and tags effectively maintain referential integrity.


212-212: Index on user_tag_scores.username is correctly added.
It supports efficient queries by username for tag scores.


213-218: Table 'users' is essential and correctly designed.
The primary key, default values (e.g., an empty string for avatar_url), and timestamp handling are all appropriate.

Comment on lines 23 to 28
const program = new Command();

program
  .name("analyze-pipeline")
  .description("GitHub Contribution Analytics Pipeline")
  .version("1.0.0");

💡 Verification agent

🧩 Analysis chain

Confirm CLI usage and version
Constructing the CLI with Commander is great. Verify that the declared version ("1.0.0") remains synchronized with package.json or other references to avoid confusion.

Would you like to run a script to compare the version field in package.json with this CLI version so we can ensure consistency?


🏁 Script executed:

#!/bin/bash
# Checking consistency of "version" in package.json vs "scripts/analyze-pipeline.ts"
rg '"version": "([^"]+)"' package.json
rg '\.version\("([^"]+)"\)' scripts/analyze-pipeline.ts

Length of output: 141


CLI version mismatch: synchronize package.json and script version

  • The CLI script in scripts/analyze-pipeline.ts uses version "1.0.0", while package.json declares version "0.1.0".
  • Please update either the CLI code or package.json to ensure the version stays consistent. Alternatively, consider dynamically importing the version from package.json into your CLI script.
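
A sketch of the dynamic-import option, assuming Bun's (or TypeScript's resolveJsonModule) JSON imports are available:

```ts
import { Command } from "commander";
import pkg from "../package.json";

// Keep the CLI version in lockstep with package.json.
const program = new Command()
  .name("analyze-pipeline")
  .description("GitHub Contribution Analytics Pipeline")
  .version(pkg.version);
```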

Comment on lines 275 to 278
await program.parseAsync(["analyze-pipeline", "fetch"]);

// Run process command
await program.parseAsync(["analyze-pipeline", "process"]);

🛠️ Refactor suggestion

Validate multiple parseAsync calls
Invoking program.parseAsync twice in a row (for "fetch" and "process") is functional but somewhat unconventional. Consider calling the underlying methods directly or handle commands in sequence to avoid potential confusion or side effects.

- await program.parseAsync(["analyze-pipeline", "fetch"]);
- await program.parseAsync(["analyze-pipeline", "process"]);
+ await fetchAction();
+ await processAction();

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +221 to +245
while (hasNextPage) {
  const variables: Record<string, string> = {};
  if (endCursor) {
    variables.endCursor = endCursor;
  }

  const response = await this.executeGraphQL(query, variables);

  const pullRequests = response.data.repository.pullRequests;
  allPullRequests = [...allPullRequests, ...pullRequests.nodes];

  hasNextPage = pullRequests.pageInfo.hasNextPage;
  endCursor = pullRequests.pageInfo.endCursor;

  // If we have date filters and already have some results,
  // check if we've gone past our start date to avoid fetching too much
  if (startDate && allPullRequests.length > 0) {
    const oldestPR = allPullRequests[allPullRequests.length - 1];
    if (new Date(oldestPR.createdAt) < startDate) {
      // We've gone far enough back
      hasNextPage = false;
    }
  }
}


🛠️ Refactor suggestion

Potential empty repository object
When parsing paginated pull requests, if response.data.repository is null or undefined for any reason, the code will fail. Consider adding a guard check before accessing pullRequests.

+ if (!response.data || !response.data.repository) {
+   console.error("Repository data is missing or invalid.");
+   break;
+ }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while (hasNextPage) {
  const variables: Record<string, string> = {};
  if (endCursor) {
    variables.endCursor = endCursor;
  }
  const response = await this.executeGraphQL(query, variables);
  if (!response.data || !response.data.repository) {
    console.error("Repository data is missing or invalid.");
    break;
  }
  const pullRequests = response.data.repository.pullRequests;
  allPullRequests = [...allPullRequests, ...pullRequests.nodes];
  hasNextPage = pullRequests.pageInfo.hasNextPage;
  endCursor = pullRequests.pageInfo.endCursor;
  // If we have date filters and already have some results,
  // check if we've gone past our start date to avoid fetching too much
  if (startDate && allPullRequests.length > 0) {
    const oldestPR = allPullRequests[allPullRequests.length - 1];
    if (new Date(oldestPR.createdAt) < startDate) {
      // We've gone far enough back
      hasNextPage = false;
    }
  }
}

Comment on lines 62 to 67

// Skip if username is in the bot users list
if (this.config.botUsers?.includes(username)) {
console.log(`${this.logPrefix} Skipping bot user: ${username}`);
return;
}

⚠️ Potential issue

Recheck foreign key constraints for bot users
If a bot user is skipped here, any PR or issue referencing that user might cause a foreign key violation unless the schema enforces cascading or optional references. Consider storing bot users with a distinct flag or assigning "unknown" to the PR author field.

@madjin (Contributor) commented Mar 26, 2025

Looks good. I noticed that scoring default values are defined in both src/lib/data/types.ts and config/pipeline.config.ts:
[screenshot]

Just wondering if that's intentional. I'll do some testing locally later today to see how this refactor runs on my PC.

- `bun run lint` - Run ESLint only
- `bunx tsc --noEmit` - Run TypeScript checks
- `bun run serve` - Serve built site
- `bun run init-db` - Initialize database
Contributor:

  • errors on bun run build and bun run check (schemas defined but never used, unexpected any, etc.)
  • error: Script not found "init-db"
  • errors with bun run lint
