
feat(ingester): crawl Starknet blog 2025-2026 #100

Merged
enitrat merged 7 commits into main from feat/starknet-blog-crawler on Jan 4, 2026

Conversation

enitrat (Collaborator) commented on Jan 4, 2026

Summary

  • crawl Starknet blog posts (2025/2026) directly in the TS ingester
  • normalize + filter blog URLs, extract main content, and strip boilerplate sections
  • add bun tests covering sitemap discovery, year detection, and chunk metadata

Context

Moves Starknet blog ingestion to a single-step crawl (no pre-generated markdown), while keeping per-page source attribution and avoiding rate-limit issues.
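The normalize-and-filter step can be sketched roughly as below. The helper names, the `/blog/` path check, and the contents of `ALLOWED_YEARS` are illustrative assumptions, not the PR's actual code (the review only confirms that `ALLOWED_YEARS` is intentionally hardcoded):

```typescript
// Assumed hardcoded year allowlist, per the review discussion.
const ALLOWED_YEARS = new Set(["2025", "2026"]);

/** Strip query strings, fragments, and trailing slashes so each post maps to one canonical URL. */
function normalizeBlogUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a valid absolute URL
  }
  url.search = "";
  url.hash = "";
  // Only keep blog paths; everything else in the sitemap is filtered out.
  if (!url.pathname.startsWith("/blog/")) return null;
  return url.origin + url.pathname.replace(/\/+$/, "");
}

/** Detect the publication year, e.g. from an ISO date found in page metadata. */
function isAllowedYear(publishedAt: string): boolean {
  return ALLOWED_YEARS.has(publishedAt.slice(0, 4));
}
```

The canonical URL doubles as the per-page source attribution key, which is why normalization has to be deterministic.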

Test Plan

  • bun test

enitrat (Collaborator, Author) left a comment

[AUTOMATED]

Overall Assessment

This is a well-structured PR that simplifies the Starknet blog ingestion pipeline by consolidating a two-step Python→TS process into a single TypeScript crawler. The approach is sound and the implementation demonstrates good engineering practices (rate limiting, retry logic, proper URL normalization).

Strengths

  • Removes Python dependency for this ingester, simplifying deployment
  • Good test coverage for core functionality (sitemap discovery, year detection, chunk metadata)
  • Reuses existing patterns (RecursiveMarkdownSplitter, MarkdownIngester base class)
  • Proper error handling with retries and exponential backoff
  • Clean URL normalization and filtering logic

Areas for consideration

  • The ~700-line file with many module-level helper functions could benefit from being split into focused modules (e.g., crawler.ts, content-extractor.ts, url-utils.ts)
  • Some hardcoded values could be configuration-driven for flexibility
  • Test coverage for error paths would strengthen confidence

Overall, this is a solid contribution that improves code health. See inline comments for specific suggestions.
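As one illustration of the boilerplate stripping mentioned above, a minimal pass over converted markdown might drop sections by heading title. The section titles and helper name here are assumptions; the PR may well strip boilerplate at the HTML stage instead:

```typescript
// Hypothetical list of section headings to drop from extracted blog content.
const BOILERPLATE_HEADINGS = ["share this post", "subscribe", "related posts"];

/** Remove whole sections whose heading matches a known boilerplate title. */
function stripBoilerplateSections(markdown: string): string {
  const out: string[] = [];
  let skipping = false;
  for (const line of markdown.split("\n")) {
    const heading = line.match(/^(#{1,6})\s+(.*)$/);
    if (heading) {
      // A new heading either starts a boilerplate section or ends one.
      skipping = BOILERPLATE_HEADINGS.includes(heading[2].trim().toLowerCase());
    }
    if (!skipping) out.push(line);
  }
  return out.join("\n");
}
```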

Inline comment threads: ingesters/src/ingesters/StarknetBlogIngester.ts (5 threads, 1 outdated), ingesters/__tests__/StarknetBlogIngester.test.ts (outdated), ingesters/config/sources.json
enitrat (Collaborator, Author) left a comment

[AUTOMATED]

In-Depth Code Review - Bug Analysis

I've performed a detailed analysis of the ingester logic. The implementation is well-structured with good improvements in the latest commit (gzip support, JSON-LD parsing, global backoff). Here are my findings:

Summary

  • 1 potential bug that could cause issues in edge cases
  • 1 minor bug with graceful degradation
  • 2 design suggestions for robustness

See inline comments for details.

Inline comment threads: ingesters/src/ingesters/StarknetBlogIngester.ts (5 threads, 3 outdated), ingesters/__tests__/StarknetBlogIngester.test.ts
- Fix potential infinite recursion in parseSitemap by adding depth limit
  and visited URL tracking
- Return null explicitly when gzip decompression fails in fetchSitemap
- Move getTempDir import to top-level (remove require() inside method)
- Add debug logging for non-retryable HTTP status codes (404, etc.)
- Reset globalBackoffUntil at start of crawl for parallel safety
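The first bullet's recursion guard can be sketched as follows: a depth limit plus a visited-URL set means a sitemap index that references itself (directly or transitively) terminates instead of recursing forever. The names `collectSitemapUrls` and `MAX_SITEMAP_DEPTH`, and the injected fetcher, are illustrative, not the PR's actual `parseSitemap` signature:

```typescript
const MAX_SITEMAP_DEPTH = 5; // assumed limit, not necessarily the PR's value

interface SitemapEntry {
  childSitemaps: string[]; // nested <sitemap> index entries
  pageUrls: string[]; // leaf <url> entries
}

/** Walk a sitemap tree, ignoring already-visited URLs and over-deep nesting. */
async function collectSitemapUrls(
  url: string,
  fetchXml: (u: string) => Promise<SitemapEntry | null>,
  visited: Set<string> = new Set(),
  depth = 0,
): Promise<string[]> {
  if (depth > MAX_SITEMAP_DEPTH || visited.has(url)) return [];
  visited.add(url);
  const entry = await fetchXml(url);
  if (!entry) return []; // fetch/decode failure degrades to "no URLs"
  const urls = [...entry.pageUrls];
  for (const child of entry.childSitemaps) {
    urls.push(...(await collectSitemapUrls(child, fetchXml, visited, depth + 1)));
  }
  return urls;
}
```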
enitrat (Collaborator, Author) commented on Jan 4, 2026

[AUTOMATED] All review feedback has been addressed in commit 8e5b150:

Fixed:

  • ✅ Infinite recursion in parseSitemap (added depth limit + visited URL tracking)
  • ✅ Gzip decompression failure handling (now returns null explicitly)
  • ✅ Moved require() to top-level import
  • ✅ Added debug logging for non-retryable HTTP status codes
  • ✅ Reset globalBackoffUntil at start of crawl for parallel safety

Acknowledged (no changes needed):

  • ALLOWED_YEARS kept hardcoded intentionally for explicit control
  • useUrlMapping: true is intentional for per-page source attribution
  • OZ useUrlMapping fix is intentional
  • Regex patterns verified correct
  • Test coverage suggestions noted for follow-up PR

Ready for re-review!
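The `globalBackoffUntil` reset mentioned above addresses a shared-state hazard: a rate-limit deadline recorded by one crawl could otherwise stall a later run. A minimal sketch of the idea, with illustrative names (only the `globalBackoffUntil` identifier comes from the PR):

```typescript
// Module-level deadline (ms since epoch); 0 means no backoff is active.
let globalBackoffUntil = 0;

/** Called at the start of each crawl so stale state cannot stall a new run. */
function resetGlobalBackoff(): void {
  globalBackoffUntil = 0;
}

/** Record a rate-limit response; every worker waits until this deadline. */
function triggerGlobalBackoff(delayMs: number): void {
  globalBackoffUntil = Math.max(globalBackoffUntil, Date.now() + delayMs);
}

/** Wait out any active global backoff before issuing the next request. */
async function awaitGlobalBackoff(): Promise<void> {
  const wait = globalBackoffUntil - Date.now();
  if (wait > 0) await new Promise<void>((resolve) => setTimeout(resolve, wait));
}
```

A shared deadline (rather than per-request retry timers) keeps concurrent workers from hammering the server in lockstep once one of them has been throttled.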

- Fix axios mock implementation types in tests to properly handle url parameter
- Import AnyNode and Element types from domhandler for cheerio compatibility
- Refactor Buffer.from usage to handle ArrayBuffer types explicitly
- Update Cheerio<Element> type to Cheerio<AnyNode> for proper type inference

All TypeScript compilation errors resolved and tests passing.
enitrat (Collaborator, Author) commented on Jan 4, 2026

[AUTOMATED] Added URL cache feature in commit 5535290:

How it works:

  1. On first run, all URLs are fetched and checked
  2. URLs determined to be NOT 2025/2026 posts are saved to generated/starknet-blog-excluded-urls.json
  3. On subsequent runs, cached URLs are skipped entirely (no HTTP request, no parsing)

Expected output on re-run:

info: Loaded 247 excluded URLs from cache
info: Skipping 247 URLs from cache, processing 15 URLs

This should dramatically speed up subsequent pipeline runs!

Add a persistent cache that stores URLs confirmed NOT to be 2025/2026
blog posts. On subsequent runs, these URLs are skipped entirely,
avoiding redundant HTTP requests and content parsing.

Cache file: generated/starknet-blog-excluded-urls.json

This significantly speeds up re-runs when most blog posts are from
previous years.
enitrat force-pushed the feat/starknet-blog-crawler branch from 5535290 to fc2d6c3 on January 4, 2026 at 13:09
enitrat merged commit b1d3184 into main on Jan 4, 2026; 3 checks passed.