Reimplement Data Ingest and Processing in Typescript #52
base: main
Conversation
- Introduced a new TypeScript-based analytics pipeline for improved data management and processing.
- Removed the legacy init-db script and updated the README to reflect new pipeline commands and configuration options.
- Deleted outdated SQL migration files and adjusted the database schema to support new features.
- Enhanced the README with detailed instructions for the new pipeline and its configuration.
Important: Review skipped. Auto reviews are disabled on this repository; check the settings in the CodeRabbit UI to re-enable them.

Walkthrough
This pull request introduces significant updates across configuration, documentation, database schema, and data processing modules. New TypeScript rules and environment variable settings have been added, along with revised guidelines in CLAUDE.md and an updated README. A new TypeScript analytics pipeline, driven by a Commander-based CLI, replaces the legacy scripts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant U as User
    participant CLI as Pipeline CLI
    participant DI as DataIngestion
    participant GH as GitHubClient
    participant CP as ContributorPipeline
    participant DB as Database
    U->>CLI: Run "pipeline" command
    CLI->>DI: Initialize ingestion process
    DI->>GH: Fetch GitHub data (PRs, Issues, Commits)
    GH-->>DI: Return data
    DI->>DB: Store raw data
    CLI->>CP: Process contributions
    CP->>DB: Query and update scores
    CP-->>CLI: Return results
    CLI->>U: Display analytics
```
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (31)
CLAUDE.md (3)
1-2: Header Clarity
The title clearly indicates the file’s purpose. Consider adding a one-line summary beneath the header to quickly inform contributors what aspects of the pipeline the document covers.
3-12: Build & Development Section
The commands are well-organized and cover essential tasks such as running the development server, building the production site, and initializing the database. Ensure these commands are fully aligned with the new TypeScript pipeline and the SQLite/Drizzle ORM setup. It might be useful to mention any required environment variables or dependencies that need to be set.
13-22: Code Style Guidelines
This section neatly outlines the coding conventions, from import order to component naming and the use of modern Next.js patterns. Consider linking to additional guidelines or examples for Next.js 15 and shadcn/ui to aid contributors in adhering to these standards.

.envrc.example (1)
1-1: Use a placeholder pattern instead of a realistic-looking token example.
While this provides a clear example, using a token format that resembles a real GitHub PAT could potentially lead to security concerns. Consider using a placeholder pattern that doesn't match the exact format.
```diff
-export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_1234567890abcdef1234567890abcdef12345678
+export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_your_github_personal_access_token_here
```

Additionally, consider adding a comment explaining what permissions this token needs.
src/lib/data/queries.ts (1)
41-71: Consider extracting the JSON parsing logic to a reusable helper function.
Both functions have nearly identical error-handling code for JSON parsing. This could be extracted to reduce duplication.
```diff
+function safeJsonParse<T>(jsonString: string, errorMessage: string): T[] {
+  try {
+    return JSON.parse(jsonString);
+  } catch (error) {
+    console.error(errorMessage, error);
+    return [];
+  }
+}
 export async function getContributorRecentPRs(username: string, limit = 5) {
   const [summary] = await db
     .select({
       pullRequests: userDailySummaries.pullRequests,
     })
     .from(userDailySummaries)
     .where(eq(userDailySummaries.username, username))
     .orderBy(desc(userDailySummaries.date))
     .limit(1);
   if (!summary) return [];
-  try {
-    const prs = JSON.parse(summary.pullRequests);
-    return prs.slice(0, limit);
-  } catch (error) {
-    console.error("Failed to parse pull requests:", error);
-    return [];
-  }
+  const prs = safeJsonParse<any>(summary.pullRequests, "Failed to parse pull requests:");
+  return prs.slice(0, limit);
 }
```

src/lib/data/db.ts (1)
7-7: Use `const` instead of `let` if you're not reassigning.
Since `sqlite` is only assigned once in the `try` block, consider making it a `const` to maintain immutability.

```diff
- let sqlite: Database;
+ const sqlite: Database;
```

.cursor/rules/typescript-rules.mdc (1)
9-18: Correct spelling errors and improve clarity.
There are a few typos in these lines:
- "invarients" → "invariants"
- "explantion" → "explanation"
- "implmenting" → "implementing"
- "nessesary" → "necessary"

```diff
- specify clearly their inputs, outputs, invarients and types.
+ specify clearly their inputs, outputs, invariants and types.
- review your thought process and present... step by step explination
+ review your thought process and present... step by step explanation
- Before implmenting a feature...
+ Before implementing a feature...
- ...review your assumptions about what is nessesary...
+ ...review your assumptions about what is necessary...
```

scripts/analyze-pipeline.ts (1)
121-140: Consider extracting date & pagination logic
The loop that fetches data for each repository (lines 121-140) applies consistent date logic. It might be beneficial to extract the repeated date-handling logic into a small helper function for reusability and clarity.
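For instance, a tiny helper could own the lookback-window math. This is a sketch under assumed names (`getFetchWindow` and `FetchWindow` are illustrative, not from the PR); `date-fns` is already a declared dependency:

```ts
import { subDays } from "date-fns";

// Hypothetical helper: one place to compute the ingestion window,
// instead of repeating the date math per repository.
export interface FetchWindow {
  startDate: Date;
  endDate: Date;
}

export function getFetchWindow(lookbackDays: number, now: Date = new Date()): FetchWindow {
  return {
    startDate: subDays(now, lookbackDays), // e.g. config.lookbackDays
    endDate: now,
  };
}
```

Each repository iteration could then call `getFetchWindow(config.lookbackDays)` once instead of recomputing dates inline.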
27-53: Ensure GH CLI availability
All GraphQL and REST calls assume `gh` is installed, configured, and accessible. Consider wrapping calls in additional checks or fallback logic to handle environments where `gh` might not be installed or logged in.
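A minimal fail-fast guard might look like this (a sketch, not the PR's code; `gh auth status` exits non-zero when the CLI is missing or unauthenticated):

```ts
import { execSync } from "node:child_process";

// Hypothetical guard: call once at pipeline startup, before any gh-backed fetches.
export function assertGhCliAvailable(): void {
  try {
    execSync("gh auth status", { stdio: "ignore" }); // throws on non-zero exit or missing binary
  } catch {
    throw new Error(
      "GitHub CLI (`gh`) is not installed or not authenticated. Run `gh auth login` first."
    );
  }
}
```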
501-549: Enhance concurrency for large commit histories
The current commit-fetching logic sequentially processes up to 100 commits per iteration. For large repositories, consider parallel requests or a chunk-based approach to improve performance.
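One hedged way to bound parallelism without hammering the API is a generic chunking helper (illustrative, not from the PR):

```ts
// Runs `fn` over `items` with at most `chunkSize` requests in flight:
// each chunk is awaited before the next starts, which keeps memory and
// rate-limit pressure predictable.
async function mapInChunks<T, R>(
  items: T[],
  chunkSize: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    results.push(...(await Promise.all(chunk.map(fn))));
  }
  return results;
}
```

For example, commit SHAs could be enriched with `mapInChunks(shas, 5, fetchCommitDetails)`, where `fetchCommitDetails` stands in for whatever per-commit call the client makes.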
317-320: Inspect partial failures
When a fetch or store operation fails, the function rethrows an error. If partial data ingestion is acceptable, consider adding rollback or partial-commit handling strategies to ensure integrity.
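If full rollback is preferred over partial commits, the writes for one repository could share a transaction. A sketch assuming a Drizzle `db` handle and the PR's table names (import paths are guesses):

```ts
import { db } from "../src/lib/data/db";
import { rawPullRequests, rawIssues } from "../src/lib/data/schema";

// Hypothetical wrapper: either both inserts land, or neither does.
function storeRepoSnapshot(
  prRows: (typeof rawPullRequests.$inferInsert)[],
  issueRows: (typeof rawIssues.$inferInsert)[]
) {
  db.transaction((tx) => {
    if (prRows.length) tx.insert(rawPullRequests).values(prRows).run();
    if (issueRows.length) tx.insert(rawIssues).values(issueRows).run();
    // Any throw inside this callback rolls back both inserts.
  });
}
```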
README.md (5)
51-55: Secure newly introduced environment variables.
The new OpenAI/OpenRouter vars look good. Ensure `.envrc` or a similar approach is used so they aren't committed.

84-84: Confirm documentation references.
The step now references `bun run generate-db`. Verify other docs or scripts do not still reference the older `init-db`.

97-100: Add internal cross-reference to pipeline docs.
Linking to `scripts/README.md` might help new users quickly find extended instructions.

112-118: Highlight advanced config examples.
Consider adding a short code snippet showcasing how to tune scoring or tags in `pipeline.config.ts`.

233-233: Clarify Bun version requirement.
If older Bun versions are unsupported, mention that explicitly here.

scripts/README.md (4)
33-41: Configuration details are concise.
Suggest referencing a sample config snippet to demonstrate advanced usage.

43-43: Legacy script references are helpful for context.
Ensure we cross-link them with the new pipeline where relevant.

112-113: Visually separate the data flow comparison.
A heading or a short note here might make the differences clearer at a glance.

129-141: New pipeline diagram is well-structured.
Consider detailing how data transitions from raw to analyzed.

src/lib/data/schema.ts (7)
21-58: Check indexing strategy for large PR data.
If you frequently query by `mergedAt` or `closedAt`, indexing might improve performance.

81-111: Align naming for created/updated fields.
The table uses `createdAt` and `updatedAt`, but the user table uses `lastUpdated`. Consistency helps.

141-161: Potential large patch storage.
Storing patch data as text can impact performance. Consider storing diffs externally if they're large.

163-182: Optional reference to commit in PR reviews.
Might be valuable if you want to link reviews to specific commit diffs.

268-268: Added fields improve tagging flexibility.
If patterns grow large, consider normalizing them in a separate table. Also applies to: 272-273.

285-299: Strengthen user-tag relationship.
A unique compound index on `(username, tag)` might help prevent duplicates, as sketched below.
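In Drizzle's SQLite schema DSL, that constraint could be declared like this (a sketch with assumed column names; the real table has more fields):

```ts
import { sqliteTable, text, integer, uniqueIndex } from "drizzle-orm/sqlite-core";

export const userTagScores = sqliteTable(
  "user_tag_scores",
  {
    username: text("username").notNull(),
    tag: text("tag").notNull(),
    score: integer("score").notNull().default(0),
  },
  (table) => ({
    // Prevents duplicate (username, tag) rows at the database level.
    unqUserTag: uniqueIndex("unq_user_tag").on(table.username, table.tag),
  })
);
```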
314-323: Repositories table.
Maintaining `lastFetchedAt` is helpful. Consider storing a next-fetch timestamp if scheduling is used.

src/lib/data/types.ts (3)
9-24: Validate default values in `RawCommitSchema`.
Everything appears consistent for commits. If certain fields (e.g., `messageHeadline`) remain unused, consider removing them to keep the schema lean.
33-40: Restrict the `state` field if possible.
Currently, `state` is an unconstrained string. If you have known states (e.g., "APPROVED", "CHANGES_REQUESTED"), consider a `z.enum` to ensure consistency.
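A hedged sketch of the enum approach (the state list is an assumption; GitHub's review states also include COMMENTED, DISMISSED, and PENDING):

```ts
import { z } from "zod";

export const ReviewStateSchema = z.enum([
  "APPROVED",
  "CHANGES_REQUESTED",
  "COMMENTED",
  "DISMISSED",
]);
export type ReviewState = z.infer<typeof ReviewStateSchema>;

// Then, inside the review schema, replace the plain string:
// state: ReviewStateSchema,
```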
131-262: Processed data schemas look solid.
The newly added or updated fields (e.g., optional timestamps) improve flexibility. However, note the inconsistent naming (`created_at` vs. `updatedAt`). Uniform naming might simplify maintenance.

src/lib/data/processing.ts (1)
274-712: The `processContributor` method is extensive but well-organized.
This method calculates a combined score from PRs, issues, reviews, and comments. The daily cap logic for PRs/reviews helps avoid inflated scoring. Good handle on the additive logic.
You might consider splitting each metric type's logic (PRs, issues, reviews, comments) into smaller helper functions for readability.
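A possible shape for that split (names are invented for illustration; this is not the PR's API):

```ts
interface DateRange { startDate: Date; endDate: Date; }
interface MetricSlice { score: number; count: number; }

// Each metric family gets its own small scorer...
async function scorePullRequests(user: string, range: DateRange): Promise<MetricSlice> {
  return { score: 0, count: 0 }; // ...query raw_pull_requests and score here
}

async function scoreIssues(user: string, range: DateRange): Promise<MetricSlice> {
  return { score: 0, count: 0 }; // ...query raw_issues and score here
}

// ...and processContributor just composes the slices.
async function processContributor(user: string, range: DateRange) {
  const [prs, issues] = await Promise.all([
    scorePullRequests(user, range),
    scoreIssues(user, range),
  ]);
  return { user, totalScore: prs.score + issues.score };
}
```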
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

⛔ Files ignored due to path filters (1)
- `bun.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (23)
- `.cursor/rules/typescript-rules.mdc` (1 hunks)
- `.envrc.example` (1 hunks)
- `.gitignore` (1 hunks)
- `CLAUDE.md` (1 hunks)
- `README.md` (5 hunks)
- `config/pipeline.config.ts` (1 hunks)
- `drizzle/0000_aromatic_slipstream.sql` (0 hunks)
- `drizzle/0000_serious_susan_delgado.sql` (1 hunks)
- `drizzle/meta/0000_snapshot.json` (4 hunks)
- `drizzle/meta/_journal.json` (1 hunks)
- `package.json` (2 hunks)
- `scripts/README.md` (4 hunks)
- `scripts/analyze-pipeline.ts` (1 hunks)
- `scripts/init-db.ts` (0 hunks)
- `src/lib/data/db.ts` (1 hunks)
- `src/lib/data/github.ts` (1 hunks)
- `src/lib/data/ingest.ts` (0 hunks)
- `src/lib/data/ingestion.ts` (1 hunks)
- `src/lib/data/processing.ts` (1 hunks)
- `src/lib/data/queries.ts` (2 hunks)
- `src/lib/data/schema.ts` (5 hunks)
- `src/lib/data/scoring.ts` (0 hunks)
- `src/lib/data/types.ts` (3 hunks)
💤 Files with no reviewable changes (4)
- scripts/init-db.ts
- src/lib/data/ingest.ts
- drizzle/0000_aromatic_slipstream.sql
- src/lib/data/scoring.ts
🧰 Additional context used
🧬 Code Definitions (4)

config/pipeline.config.ts (2)
- src/lib/data/schema.ts (1): `pipelineConfig` (306-312)
- src/lib/data/types.ts (1): `PipelineConfig` (352-352)

src/lib/data/github.ts (1)
- src/lib/data/types.ts (4): `RepositoryConfig` (355-355), `RawPullRequestSchema` (58-105), `RawIssueSchema` (107-129), `RawCommitSchema` (10-24)

src/lib/data/ingestion.ts (3)
- src/lib/data/types.ts (2): `PipelineConfig` (352-352), `RepositoryConfig` (355-355)
- src/lib/data/schema.ts (9): `repositories` (315-323), `users` (12-19), `rawPullRequests` (22-58), `rawPullRequestFiles` (60-79), `rawCommits` (113-139), `prReviews` (163-182), `prComments` (184-203), `rawIssues` (81-111), `issueComments` (205-224)
- src/lib/data/github.ts (1): `githubClient` (554-554)

src/lib/data/processing.ts (2)
- src/lib/data/types.ts (1): `PipelineConfig` (352-352)
- src/lib/data/schema.ts (11): `rawPullRequests` (22-58), `rawIssues` (81-111), `prReviews` (163-182), `prComments` (184-203), `issueComments` (205-224), `users` (12-19), `rawPullRequestFiles` (60-79), `tags` (268-280), `userTagScores` (282-303), `userDailySummaries` (227-248), `userStats` (250-266)
🪛 LanguageTool
scripts/README.md

- [uncategorized] ~48-~48: Loose punctuation mark.
  Context: "...Collection - `scripts/fetch_github.sh`: Fetches raw GitHub data (PRs, issues, c..." (UNLIKELY_OPENING_PUNCTUATION)
- [uncategorized] ~53-~53: Loose punctuation mark.
  Context: "...cessing - `scripts/calculate_scores.py`: Calculates contributor scores based on ..." (UNLIKELY_OPENING_PUNCTUATION)
- [uncategorized] ~60-~60: Loose punctuation mark.
  Context: "...ary Generation - `scripts/summarize.py`: Generates human-readable summaries of c..." (UNLIKELY_OPENING_PUNCTUATION)
- [uncategorized] ~67-~67: Loose punctuation mark.
  Context: "...nt - `scripts/manage_thread_history.sh`: Manages versioning and backup of discus..." (UNLIKELY_OPENING_PUNCTUATION)
- [uncategorized] ~145-~145: Loose punctuation mark.
  Context: "... ## Data Storage - `data/daily/`: Current day's data (legacy) - `data/dai..." (UNLIKELY_OPENING_PUNCTUATION)
- [uncategorized] ~155-~155: Loose punctuation mark.
  Context: "...red environment variables: - `GH_TOKEN`: GitHub access token - `OPENROUTER_API_K..." (UNLIKELY_OPENING_PUNCTUATION)
- [grammar] ~158-~158: The singular proper name 'Site' must be used with a third-person or a past tense verb.
  Context: "...egration (optional) - `SITE_NAME`: Site name for OpenRouter integration (optional)..." (HE_VERB_AGR)
- [style] ~162-~162: Style-wise, it's not ideal to insert an adverb ('eventually') in the middle of an infinitive construction ('to replace'). Try moving the adverb to avoid split infinitives.
  Context: "...Notes - The new TypeScript pipeline is designed to eventually replace the Python scripts - Both systems can run side by side dur..." (SPLIT_INFINITIVE)
🔇 Additional comments (111)
drizzle/meta/_journal.json (1)
8-9: LGTM: Updated migration metadata.
The timestamp and tag have been updated appropriately to reflect the new migration.
.gitignore (1)
30-31: ❓ Verification inconclusive
Verify all environment file patterns are covered.
The change from `.env*` to explicit `.envrc` and `.env` entries means other environment files like `.env.local` or `.env.development` would no longer be ignored.

🏁 Script executed:
```bash
#!/bin/bash
# Check for other .env* files that might need to be added to .gitignore
fd -H "^\.env.*" --exclude ".envrc" --exclude ".env" --exclude ".envrc.example"
```

Length of output: 75

Action: Manually verify .env patterns
- The automated search using the provided script returned no additional files beyond `.envrc` and `.env`.
- However, the lack of output makes it unclear whether environment files such as `.env.local` or `.env.development` might be present or expected in the project.
- Please manually verify that no other `.env*` files are in use or planned, and update the ignore rules if necessary.

src/lib/data/queries.ts (2)
41-47: Good error-handling improvement for PR parsing.
Adding a try-catch for JSON parsing makes the function more robust against malformed data.

65-71: Good error-handling improvement for commit parsing.
Similar to the PR parsing function, this properly handles JSON parsing errors.
src/lib/data/db.ts (1)
8-15: Consider re-enabling WAL mode and verifying concurrency.
Removing WAL mode can reduce performance under concurrent writes. If concurrency is important, consider reapplying `PRAGMA journal_mode=WAL;`, or confirm your use case does not require it.
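With `bun:sqlite` (the driver this Bun-based setup appears to use; an assumption), re-enabling WAL is one line at connection time:

```ts
import { Database } from "bun:sqlite";

const sqlite = new Database("data/db.sqlite");
// WAL lets readers proceed while a writer is active,
// at the cost of extra -wal/-shm sidecar files.
sqlite.exec("PRAGMA journal_mode = WAL;");
```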
package.json (1)
12-13: Looks good!
The new `"pipeline"` script and the added dependencies (`chalk`, `commander`, `date-fns`, `glob`, `yaml`) appear valid. Ensure all of these libraries are indeed required to avoid bloat.
Also applies to: 24-24, 27-28, 31-31, 38-38
11-41
: Good config structure for basic pipeline settings.The definition of repositories, lookbackDays, and botUsers is clear and straightforward. No issues.
43-185
: Verify scoring configuration logic.This block has elaborate scoring rules. Confirm they produce the desired results and avoid edge-case abuses (e.g., awarding excessive points for large auto-generated additions).
187-342
: Tag definitions are well-structured.The area/role/tech tags and multipliers promote clear categorization. This should help tailor scoring to domain expertise.
344-350
: AI Summary toggle looks fine.The optional AI summary feature is well-isolated and can be safely toggled. No concerns.
README.md (5)
46-48: Unify environment variables for clarity.
Using both `GH_ACCESS_TOKEN` and `GH_TOKEN` could be confusing. Consider removing `GH_ACCESS_TOKEN` references if the pipeline only needs `GH_TOKEN`.

87-87: Validate leftover references to removed commands.
Make sure no scripts or docs still mention the deprecated `init-db` command.

101-110: Pipeline usage steps look solid.
The stepwise approach to init, fetch, and process is clear. Good job.

179-198: Verify JSON fields match code usage.
The new keys are comprehensive. Confirm that each field is actually processed or displayed somewhere in the pipeline or UI.

215-222: Directory structure updates align with the new pipeline.
Good job reflecting the new SQLite DB and config directory.

scripts/README.md (6)
5-16: Intro to TypeScript pipeline is well explained.
This high-level summary is helpful.

17-31: Quick start section is straightforward.
Providing the pipeline commands in order is helpful for new contributors.

114-114: Legacy pipeline diagram is comprehensive.
No concerns here.

145-149: Ensure disclaimers about new DB location.
Remind users to exclude `data/db.sqlite` from version control if it's not ephemeral.

156-158: Recheck environment variable duplication.
`GH_TOKEN` is repeated across docs. Confirm the references are consistent.

162-165: Clear transition plan from Python to TypeScript.
Running both pipelines in parallel is smart.
src/lib/data/schema.ts (7)
8-8: Nice addition of `unique()`.
Ensures better data constraints.

11-11: Comment clarifies user table purpose.
No further issues.

60-79: Verify `changeType` usage.
Consider establishing a default, or ensure it's reliably set in data ingestion to avoid null.

113-139: Ensure user existence on commits.
If commits come from unknown authors, handle them gracefully or auto-insert user rows.

184-203: PR comments table looks good.
Structure is consistent with `prReviews`.

205-224: Issue comments table is consistent.
No concerns here.

305-312: Pipeline config table.
Validate the JSON structure on insertion, or handle backward compatibility.

src/lib/data/types.ts (13)
3-7: Define GitHub user schema clearly.
The `GithubUserSchema` definition is straightforward and accommodates optional/nullable fields for `avatarUrl`. This looks good as-is.

26-31: `RawPRFileSchema` alignment is good.
The fields and their default values align well with typical file-level data. No glaring concerns.

42-49: `RawCommentSchema` usage is consistent.
Nullable/optional text fields make sense to handle missing or incomplete data. Good job.

51-56: `RawLabelSchema` is straightforward.
The schema covers essential fields for issue/PR labels. Looks fine.

58-105: Comprehensive `RawPullRequestSchema`.
This schema includes everything from labels to files, commits, reviews, and comments. Good structure for robust ingestion.

107-129: `RawIssueSchema` is well-structured.
Optional fields (e.g., `closedAt`) precisely capture typical GitHub issue data.

264-310: `ScoringConfigSchema` is extensive yet clear.
The default values cover typical scoring logic, offering large configuration flexibility. Fine as-is.

312-312: `TagTypeSchema` provides clarity.
Enumerating tag categories as `["AREA", "ROLE", "TECH"]` is a good approach to keep code thoroughly typed.

314-320: `TagConfigSchema` extends tagging capabilities.
Storing multiple patterns in `patterns` fosters easy expansion later. This looks well-designed.

322-327: `RepositoryConfigSchema` is minimal and effective.
Captures the essential repository properties. No issues found.

328-347: `PipelineConfigSchema` handles nested configurations well.
The structure for repositories, tags, and AI summary is well-integrated, giving room for further extension.

349-349: Comment for type exports is harmless.
No concerns. This clarifies the upcoming type definitions.

352-357: Type exports are consistent.
The redundant alias `ScoringRules` equals `ScoringConfig`; it may be helpful or extraneous depending on usage. Otherwise, no issues.

drizzle/meta/0000_snapshot.json (13)
4-4: Snapshot ID updated.
This indicates a new migration step/version. Looks standard for Drizzle snapshots.

7-109: `issue_comments` table introduction.
Fields, indexes, and foreign keys are well-structured. The `last_updated` defaulting to `CURRENT_TIMESTAMP` is standard and helpful for tracking modifications.

110-141: `pipeline_config` table.
Storing pipeline configuration in JSON allows dynamic updates. Straightforward approach, no immediate issues.

142-244: `pr_comments` table.
Similar structure to `issue_comments`. The indexing strategy on `pr_id` and `author` is consistent with usage patterns. Looks good.

245-347: `pr_reviews` table.
Handles a broad set of fields (state, body, submitted_at). The references to `raw_pull_requests` and `users` look correct.

348-446: `raw_commit_files` table.
Captures file-level commit data, referencing `raw_commits`. The partial unique constraints and indexing are well-structured.

447-614: `raw_commits` table.
Includes author references, message fields, and ties to `raw_pull_requests`. The indexing on `author`, `repository`, and `committed_date` should help queries. No issues.

615-763: `raw_issues` table.
Indexes and a unique constraint on `(repository, number)` effectively enforce uniqueness. Storing `labels` as a JSON string is suitable for dynamic usage.

764-855: `raw_pr_files` table.
Maintains file-level data for each pull request. The `unq_pr_id_path` constraint ensures no duplicate records for a single path. Well done.

856-1049: `raw_pull_requests` table.
The unique `(repository, number)` constraint is standard, and indexing on `author`, `repository`, and `created_at` covers common queries. Solid design.

1050-1096: `repositories` table.
Stores minimal info about each repository, including `owner` and `name`. `last_fetched_at` for tracking synchronization is helpful.

1107-1137: Enhancements to `tags` table.
New columns (`category`, `weight`, `patterns`) expand tag flexibility, matching the pipeline's tagging approach.

1413-1517: `user_tag_scores` improvements.
Switching `username` and `tag` to `notNull` is sensible. This table fosters robust tracking of user-tag relationships. Nicely done.

src/lib/data/processing.ts (18)
1-18: Import statements and type references established.
The combined imports for schema entities and config types provide a clear, centralized reference for the pipeline's needs.

19-24: Introductory doc block sets context.
The comment clarifies the high-level purpose and scope of the pipeline. Helpful for future maintainers.

26-29: `DateRange` interface is straightforward.
Specifying start/end ensures a consistent approach to time-based queries.

31-79: `ContributorMetrics` captures wide-scope stats.
This structure thoroughly enumerates relevant contributor data, from PRs to comments. Suitable for robust analytics.

81-91: `ProcessingResult` clarifies returned data.
Separating the `metrics` array and `totals` fosters clear usage. Nice organization.

93-101: `ContributorPipeline` constructor is simple.
Configuration is typed as `PipelineConfig`, ensuring compile-time checks on pipeline settings.

103-158: `processTimeframe` method logic.
Retrieves active contributors, computes metrics, sorts them, and saves daily summaries. Great structure for a top-level pipeline operation.

160-272: `getActiveContributors` fetches multiple contributor roles.
Pulling authors, reviewers, and commenters ensures a comprehensive active-user set. Filtering out bots is a nice touch.

714-736: `fetchPullRequests` isolates PR queries nicely.
Clear conditions for user, date range, and repository. Straightforward approach.

738-760: `fetchIssues` parallels `fetchPullRequests`.
Keeps logic consistent across different resource types. Good strategic uniformity.

762-793: `fetchGivenReviews` merges review & PR data.
The inner join ensures we only retrieve relevant PR reviews. Good usage of the Drizzle ORM approach.

795-826: `fetchPRComments` is parallel in structure to review fetching.
Again, a consistent approach for bridging comments with PR data. Looks good.

828-869: `calculateCodeScore` for added/deleted lines.
Capping line changes to `maxLines` helps avoid outliers. The test-coverage bonus is a nice nudge toward best practices.

871-898: `calculateFocusAreas` focuses on top-level directories.
Provides a quick overview of contributor focus. Straightforward logic, slicing to the top 5 areas.

900-926: `calculateFileTypes` clarifies extension-based distribution.
Using `path.extname` is typical. Sorting by count to get the top 5 keeps the data manageable.

928-1003: `calculateExpertiseAreas` applies tag rules.
Combining file-path and PR-title checks covers relevant patterns. Logging results in the DB with `storeTagScore` ensures persistent tracking.
1005-1056: `storeTagScore` ensures tag presence before upsert.
Inserting/updating `user_tag_scores` is a nice usage of `onConflictDoUpdate`. The code is neat.
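For reference, the general shape of that upsert pattern in Drizzle (a sketch, not the PR's exact code; import paths and column names are assumed):

```ts
import { db } from "@/lib/data/db";
import { userTagScores } from "@/lib/data/schema";

async function upsertTagScore(username: string, tag: string, score: number) {
  await db
    .insert(userTagScores)
    .values({ username, tag, score })
    .onConflictDoUpdate({
      target: [userTagScores.username, userTagScores.tag], // compound conflict key
      set: { score },                                      // update the existing row
    });
}
```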
1058-1143: `saveDailySummaries` finalizes data.
Compiles daily stats and merges them into persistent tables. A good plan for historical tracking.
drizzle/0000_serious_susan_delgado.sql (39)
1-10: Table 'issue_comments' structure is well-defined.
The columns, default values, and foreign keys (linking to `raw_issues` and `users`) are correctly set up. Ensure that using TEXT for identifiers and dates aligns with your overall schema strategy.

13-13: Index on `issue_comments.issue_id` is correctly added.
Optimizes queries filtering by issue reference.

14-14: Index on `issue_comments.author` is appropriately defined.
Helps with lookup queries by comment author.

15-19: Table 'pipeline_config' setup looks solid.
All required fields are present; consider validating the structure of the `config` content in your application logic.

21-31: Table 'pr_comments' is well structured.
The foreign keys to pull requests and users ensure referential integrity, and default values are appropriate.

33-33: Index on `pr_comments.pr_id` is well-defined.
This index will improve query performance when filtering by pull request identifiers.

34-34: Index on `pr_comments.author` is set up appropriately.
Ensures efficient retrieval of comments by author.

35-45: Table 'pr_reviews' is correctly implemented.
Includes necessary fields (review state, submission timestamp, etc.) with proper foreign key constraints.

47-47: Index on `pr_reviews.pr_id` is properly added.
Facilitates fast lookups of reviews by pull request.

48-48: Index on `pr_reviews.author` is appropriately defined.
Supports efficient queries filtering by review author.

49-60: Table 'raw_commit_files' is well defined.
Captures file-related commit metrics and the patch content; ensure the TEXT column for `patch` can handle large diffs as needed.

62-62: Index on `raw_commit_files.sha` is correctly set.
Enhances performance when joining commit files with commits.

63-80: Table 'raw_commits' structure is robust.
All essential commit details are present, and foreign keys to `users` and `raw_pull_requests` are properly enforced. Verify that storing dates and IDs as TEXT meets the broader system requirements.

82-82: Index on `raw_commits.author` is well-implemented.
This will improve query performance when filtering by commit author.

83-83: Index on `raw_commits.repository` is appropriately defined.
Optimizes searches based on repository reference.

84-84: Index on `raw_commits.committed_date` is correctly added.
Facilitates efficient time-based queries of commit records.

85-85: Index on `raw_commits.pull_request_id` is appropriately defined.
Improves join performance with pull request data.

86-101: Table 'raw_issues' is defined well.
Captures issue metadata with sensible defaults for fields like `body` and `labels`. Consider whether storing `labels` as a JSON string meets your query needs.

103-103: Index on `raw_issues.author` is properly created.
Helps ensure quick lookups by issue author.

104-104: Index on `raw_issues.repository` is accurately defined.
Optimizes filtering of issues by repository.

105-105: Index on `raw_issues.created_at` is well-set.
Boosts query performance for time-based issue queries.

106-106: Unique index on (`repository`, `number`) in 'raw_issues' is a sound design choice.
This maintains uniqueness of issue numbers within a repository.

107-116: Table 'raw_pr_files' is well-defined.
Ensures file metadata for pull requests is captured correctly, with a proper foreign key to `raw_pull_requests`.

118-118: Index on `raw_pr_files.pr_id` is set up correctly.
Improves performance for queries based on pull request file associations.

119-119: Unique index on (`pr_id`, `path`) in 'raw_pr_files' is effective.
Enforces that each file path is unique per pull request.

120-141: Table 'raw_pull_requests' is robustly defined.
It includes comprehensive metadata and enforces integrity through foreign keys. The default for `labels` as `'[]'` should be managed appropriately in your application.

143-143: Index on `raw_pull_requests.author` is correctly applied.
Facilitates efficient queries filtering by pull request authors.

144-144: Index on `raw_pull_requests.repository` is appropriately defined.
Ensures quick filtering based on the associated repository.

145-145: Index on `raw_pull_requests.created_at` is well-placed.
Optimizes queries needing time-based ordering of pull requests.

146-146: Unique index on (`repository`, `number`) for pull requests is a sound integrity constraint.
Prevents duplicate pull request numbers within the same repository.

147-153: Table 'repositories' is clearly defined.
The columns for owner, name, and timestamp management are appropriately set up.

155-163: Table 'tags' is set up effectively.
Storing tag metadata with a default for `patterns` (as a JSON string) is acceptable if your app parses it correctly.

165-180: Table 'user_daily_summaries' captures daily activity metrics well.
Usage of TEXT columns to store JSON arrays (for commits, pull requests, and issues) is valid if these are parsed as needed.

182-182: Index on `user_daily_summaries.username` is properly added.
Optimizes retrieval based on user identifiers.

183-183: Index on `user_daily_summaries.date` is appropriately set.
Facilitates efficient time-based queries in daily summaries.

184-197: Table 'user_stats' is well-structured.
The fields and JSON defaults for columns like `files_by_type` and `prs_by_month` are practical, assuming your processing handles them correctly.

199-210: Table 'user_tag_scores' is thoroughly defined.
Foreign key constraints to `users` and `tags` effectively maintain referential integrity.

212-212: Index on `user_tag_scores.username` is correctly added.
It supports efficient queries by username for tag scores.

213-218: Table 'users' is essential and correctly designed.
The primary key, default values (e.g., an empty string for `avatar_url`), and timestamp handling are all appropriate.
scripts/analyze-pipeline.ts (Outdated)

```ts
const program = new Command();

program
  .name("analyze-pipeline")
  .description("GitHub Contribution Analytics Pipeline")
  .version("1.0.0");
```
💡 Verification agent
🧩 Analysis chain
Confirm CLI usage and version
Constructing the CLI with Commander is great. Verify that the declared version ("1.0.0") remains synchronized with package.json or other references to avoid confusion.
Would you like to run a script to compare the version field in package.json with this CLI version so we can ensure consistency?
🏁 Script executed:
```bash
#!/bin/bash
# Checking consistency of "version" in package.json vs "scripts/analyze-pipeline.ts"
rg '"version": "([^"]+)"' package.json
rg '\.version\("([^"]+)"\)' scripts/analyze-pipeline.ts
```

Length of output: 141
CLI version mismatch: synchronize package.json and script version
- The CLI script in `scripts/analyze-pipeline.ts` uses version `"1.0.0"`, while `package.json` declares version `"0.1.0"`.
- Please update either the CLI code or package.json to ensure the version stays consistent. Alternatively, consider dynamically importing the version from `package.json` into your CLI script, as sketched below.
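One way to keep the two in sync (a sketch; relies on JSON imports, which Bun and TypeScript's `resolveJsonModule` support):

```ts
import { Command } from "commander";
import { version } from "../package.json"; // single source of truth

const program = new Command();
program
  .name("analyze-pipeline")
  .description("GitHub Contribution Analytics Pipeline")
  .version(version); // now always matches package.json
```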
scripts/analyze-pipeline.ts (Outdated)

```ts
await program.parseAsync(["analyze-pipeline", "fetch"]);

// Run process command
await program.parseAsync(["analyze-pipeline", "process"]);
```
🛠️ Refactor suggestion
Validate multiple parseAsync calls
Invoking `program.parseAsync` twice in a row (for "fetch" and "process") is functional but somewhat unconventional. Consider calling the underlying methods directly, or handle the commands in sequence, to avoid potential confusion or side effects.

```diff
- await program.parseAsync(["analyze-pipeline", "fetch"]);
- await program.parseAsync(["analyze-pipeline", "process"]);
+ await fetchAction();
+ await processAction();
```

Committable suggestion skipped: line range outside the PR's diff.
```ts
while (hasNextPage) {
  const variables: Record<string, string> = {};
  if (endCursor) {
    variables.endCursor = endCursor;
  }

  const response = await this.executeGraphQL(query, variables);

  const pullRequests = response.data.repository.pullRequests;
  allPullRequests = [...allPullRequests, ...pullRequests.nodes];

  hasNextPage = pullRequests.pageInfo.hasNextPage;
  endCursor = pullRequests.pageInfo.endCursor;

  // If we have date filters and already have some results,
  // check if we've gone past our start date to avoid fetching too much
  if (startDate && allPullRequests.length > 0) {
    const oldestPR = allPullRequests[allPullRequests.length - 1];
    if (new Date(oldestPR.createdAt) < startDate) {
      // We've gone far enough back
      hasNextPage = false;
    }
  }
}
```
🛠️ Refactor suggestion
Potential empty repository object
When parsing paginated pull requests, if `response.data.repository` is null or undefined for any reason, the code will fail. Consider adding a guard check before accessing `pullRequests`.

```diff
+ if (!response.data || !response.data.repository) {
+   console.error("Repository data is missing or invalid.");
+   break;
+ }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```ts
while (hasNextPage) {
  const variables: Record<string, string> = {};
  if (endCursor) {
    variables.endCursor = endCursor;
  }

  const response = await this.executeGraphQL(query, variables);

  if (!response.data || !response.data.repository) {
    console.error("Repository data is missing or invalid.");
    break;
  }

  const pullRequests = response.data.repository.pullRequests;
  allPullRequests = [...allPullRequests, ...pullRequests.nodes];

  hasNextPage = pullRequests.pageInfo.hasNextPage;
  endCursor = pullRequests.pageInfo.endCursor;

  // If we have date filters and already have some results,
  // check if we've gone past our start date to avoid fetching too much
  if (startDate && allPullRequests.length > 0) {
    const oldestPR = allPullRequests[allPullRequests.length - 1];
    if (new Date(oldestPR.createdAt) < startDate) {
      // We've gone far enough back
      hasNextPage = false;
    }
  }
}
```
src/lib/data/ingestion.ts (Outdated)

```ts
// Skip if username is in the bot users list
if (this.config.botUsers?.includes(username)) {
  console.log(`${this.logPrefix} Skipping bot user: ${username}`);
  return;
}
```
Recheck foreign key constraints for bot users
If a bot user is skipped here, any PR or issue referencing that user might cause a foreign-key violation unless the schema enforces cascading or optional references. Consider storing bot users with a distinct flag, or assigning `"unknown"` to the PR author field.
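A hedged sketch of the flag approach (the `isBot` column is hypothetical; it does not exist in the PR's schema as reviewed):

```ts
import { db } from "@/lib/data/db";
import { users } from "@/lib/data/schema";

// Insert the account so foreign keys resolve, but mark it for exclusion
// from scoring. `onConflictDoNothing` keeps re-runs idempotent.
async function ensureUser(username: string, isBot: boolean) {
  await db
    .insert(users)
    .values({ username, isBot: isBot ? 1 : 0 }) // `isBot` is an assumed column
    .onConflictDoNothing();
}
```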
```
- `bun run lint` - Run ESLint only
- `bunx tsc --noEmit` - Run TypeScript checks
- `bun run serve` - Serve built site
- `bun run init-db` - Initialize database
```
- errors on `bun run build` and `bun run check` (schemas defined but never used, unexpected `any`, etc.)
- `error: Script not found "init-db"`
- errors with `bun run lint`
A new TypeScript-based pipeline has been implemented, leveraging SQLite and Drizzle ORM for improved data management and processing:
Key Features
Work in progress, still have to do:
Summary by CodeRabbit
New Features
Documentation
Chores / Refactor