feat: add GitHub as a first-class data source#136

Open
zerone0x wants to merge 1 commit into mvanhorn:main from zerone0x:feat/github-source

@zerone0x

Summary

Closes #134. Adds GitHub Issues and PRs as a structured search source using the GitHub Search API, following the existing 7-layer source pattern.

  • Source module (github.py): search via api.github.com/search/issues, parse responses, enrich top items with comment threads
  • Schema: GitHubItem dataclass with engagement metrics, labels, body snippets, and comment insights
  • Normalize/Score/Dedupe: full pipeline; the engagement formula weights comments (0.55) over reactions (0.45) since active discussion is a stronger signal on GitHub
  • Orchestration: parallel execution via ThreadPoolExecutor with per-depth timeouts (quick: 30s, default: 60s, deep: 90s)
  • Render: compact view, status line, context snippets, and full markdown output sections
  • Config: optional GITHUB_TOKEN env var for higher Search API rate limits (30 req/min authenticated vs 10 req/min unauthenticated); the source is always available, with or without a token
  • Query tiering: added as tier2 for product, concept, how_to, comparison query types
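The orchestration bullet above can be illustrated with a minimal sketch of parallel execution under per-depth timeouts. The timeout values come from the summary; the `run_sources` helper, its signature, and the error-reporting shape are hypothetical and are not taken from the PR's code.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Per-depth timeouts in seconds, as listed in the PR summary.
DEPTH_TIMEOUTS = {"quick": 30, "default": 60, "deep": 90}


def run_sources(sources, query, depth="default"):
    """Run source callables in parallel (hypothetical helper, not the PR's code).

    `sources` maps a source name to a callable that takes the query string.
    Returns (results, errors), both keyed by source name.
    """
    timeout = DEPTH_TIMEOUTS[depth]
    results, errors = {}, {}
    executor = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    try:
        # Submit everything first so the sources run concurrently.
        futures = {name: executor.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                # Wait up to `timeout` seconds for this source's result.
                results[name] = future.result(timeout=timeout)
            except FutureTimeout:
                errors[name] = f"timed out after {timeout}s"
            except Exception as exc:
                # One failing source should not take down the others.
                errors[name] = str(exc)
    finally:
        # Do not block on stragglers; drop queued work (Python 3.9+).
        executor.shutdown(wait=False, cancel_futures=True)
    return results, errors
```

A slow or failing source ends up in `errors` while the remaining sources still contribute results, which is one way an optional source such as GitHub could be kept from blocking a run.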

Design decisions

  • Modeled after the HackerNews source (closest analog: free API, similar engagement signals)
  • Progressive unlock: works without auth, better with optional token
  • Comments weighted higher than reactions in engagement scoring because active discussion threads are a stronger quality signal on GitHub than passive thumbs-up
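The weighting decision above can be sketched as a small scoring function. Only the 0.45 and 0.55 weights come from this PR; the `engagement_score` name and the `log1p` damping are assumptions made for illustration, and the PR's actual formula may scale its inputs differently.

```python
import math

# Weights from the PR description: comments outweigh reactions.
REACTION_WEIGHT = 0.45
COMMENT_WEIGHT = 0.55


def engagement_score(reactions: int, comments: int) -> float:
    """Weighted engagement (hypothetical helper, not the PR's code).

    ASSUMPTION: log1p damping is applied so that one very large thread
    does not dominate; the PR only states the two weights.
    """
    damped_reactions = math.log1p(max(reactions, 0))
    damped_comments = math.log1p(max(comments, 0))
    return REACTION_WEIGHT * damped_reactions + COMMENT_WEIGHT * damped_comments
```

Under these assumptions, an issue with 10 comments and no reactions scores higher than one with 10 reactions and no comments, which matches the stated intent of favoring active discussion.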

Test plan

  • 17 new unit tests covering parsing, normalization, scoring, sorting, and serialization (tests/test_github.py)
  • Full test suite passes (900/900 — 3 pre-existing failures unrelated to this PR)
  • Manual test with --search github flag
  • Manual test with GITHUB_TOKEN set for authenticated rate limits

🤖 Generated with Claude Code

Add GitHub Issues and PRs as a structured search source using the GitHub
Search API. Works without auth (10 req/min) with optional GITHUB_TOKEN
for higher rate limits (30 req/min).
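As a rough illustration of the optional-token behavior described above, the sketch below builds a request against the search endpoint and attaches an `Authorization` header only when a token is available. The `build_search_request` helper, its parameters, and the `User-Agent` value are hypothetical; the endpoint and the `GITHUB_TOKEN` variable are the ones named in this PR.

```python
import os
import urllib.parse
import urllib.request
from typing import Optional

SEARCH_URL = "https://api.github.com/search/issues"


def build_search_request(query: str, per_page: int = 10,
                         token: Optional[str] = None) -> urllib.request.Request:
    """Build a search request (hypothetical helper, not the PR's code)."""
    # Fall back to the environment only when no token argument is given.
    if token is None:
        token = os.environ.get("GITHUB_TOKEN")
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "example-research-tool",  # placeholder value
    }
    if token:
        # Authenticated requests get the higher search rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{SEARCH_URL}?{params}", headers=headers)
```

The request is only constructed here, not sent, so the same code path serves both the unauthenticated and the authenticated case.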

Implements the full 7-layer source pattern:
- Source module (github.py): search, parse, comment enrichment
- Schema: GitHubItem dataclass with engagement, labels, comments
- Normalize: normalize_github_items() with date confidence
- Score: engagement formula (0.45*reactions + 0.55*comments)
- Dedupe: dedupe_github() wrapper
- Orchestration: parallel execution with configurable timeouts
- Render: compact, status, context, and full markdown output
- Config: GITHUB_TOKEN env var, source tiering (tier2 for product/
  concept/how_to/comparison queries)

Includes 17 unit tests covering parsing, normalization, scoring,
sorting, and serialization.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Development

Successfully merging this pull request may close these issues.

Feature: Add GitHub as a first-class data source
