
feat(ingester): crawl Starknet blog 2025-2026 #100

Merged
enitrat merged 7 commits into main from feat/starknet-blog-crawler on Jan 4, 2026

Conversation

enitrat (Collaborator) commented on Jan 4, 2026

Summary

  • crawl Starknet blog posts (2025/2026) directly in the TS ingester
  • normalize + filter blog URLs, extract main content, and strip boilerplate sections
  • add bun tests covering sitemap discovery, year detection, and chunk metadata

Context

Moves Starknet blog ingestion to a single-step crawl (no pre-generated markdown), while keeping per-page source attribution and avoiding rate-limit issues.
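The normalize-and-filter step can be sketched roughly as below. The helper names, the `/blog/` path check, and the contents of `ALLOWED_YEARS` are illustrative assumptions, not the PR's actual code (the review only confirms that `ALLOWED_YEARS` is intentionally hardcoded):

```typescript
// Assumed hardcoded year allowlist, per the review discussion.
const ALLOWED_YEARS = new Set(["2025", "2026"]);

/** Strip query strings, fragments, and trailing slashes so each post maps to one canonical URL. */
function normalizeBlogUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a valid absolute URL
  }
  url.search = "";
  url.hash = "";
  // Only keep blog paths; everything else in the sitemap is filtered out.
  if (!url.pathname.startsWith("/blog/")) return null;
  return url.origin + url.pathname.replace(/\/+$/, "");
}

/** Detect the publication year, e.g. from an ISO date found in page metadata. */
function isAllowedYear(publishedAt: string): boolean {
  return ALLOWED_YEARS.has(publishedAt.slice(0, 4));
}
```

The canonical URL doubles as the per-page source attribution key, which is why normalization has to be deterministic.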

Test Plan

  • bun test

enitrat (Collaborator, Author) left a comment

[AUTOMATED]

Overall Assessment

This is a well-structured PR that simplifies the Starknet blog ingestion pipeline by consolidating a two-step Python→TS process into a single TypeScript crawler. The approach is sound and the implementation demonstrates good engineering practices (rate limiting, retry logic, proper URL normalization).

Strengths

  • Removes Python dependency for this ingester, simplifying deployment
  • Good test coverage for core functionality (sitemap discovery, year detection, chunk metadata)
  • Reuses existing patterns (RecursiveMarkdownSplitter, MarkdownIngester base class)
  • Proper error handling with retries and exponential backoff
  • Clean URL normalization and filtering logic

Areas for consideration

  • The ~700-line file with many module-level helper functions could benefit from being split into focused modules (e.g., crawler.ts, content-extractor.ts, url-utils.ts)
  • Some hardcoded values could be configuration-driven for flexibility
  • Test coverage for error paths would strengthen confidence

Overall, this is a solid contribution that improves code health. See inline comments for specific suggestions.
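As one illustration of the boilerplate stripping mentioned above, a minimal pass over converted markdown might drop sections by heading title. The section titles and helper name here are assumptions; the PR may well strip boilerplate at the HTML stage instead:

```typescript
// Hypothetical list of section headings to drop from extracted blog content.
const BOILERPLATE_HEADINGS = ["share this post", "subscribe", "related posts"];

/** Remove whole sections whose heading matches a known boilerplate title. */
function stripBoilerplateSections(markdown: string): string {
  const out: string[] = [];
  let skipping = false;
  for (const line of markdown.split("\n")) {
    const heading = line.match(/^(#{1,6})\s+(.*)$/);
    if (heading) {
      // A new heading either starts a boilerplate section or ends one.
      skipping = BOILERPLATE_HEADINGS.includes(heading[2].trim().toLowerCase());
    }
    if (!skipping) out.push(line);
  }
  return out.join("\n");
}
```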

Inline comment threads: ingesters/src/ingesters/StarknetBlogIngester.ts (5 threads, 1 outdated), ingesters/__tests__/StarknetBlogIngester.test.ts (outdated), ingesters/config/sources.json
enitrat (Collaborator, Author) left a comment

[AUTOMATED]

In-Depth Code Review - Bug Analysis

I've performed a detailed analysis of the ingester logic. The implementation is well-structured with good improvements in the latest commit (gzip support, JSON-LD parsing, global backoff). Here are my findings:

Summary

  • 1 potential bug that could cause issues in edge cases
  • 1 minor bug with graceful degradation
  • 2 design suggestions for robustness

See inline comments for details.

Inline comment threads: ingesters/src/ingesters/StarknetBlogIngester.ts (5 threads, 3 outdated), ingesters/__tests__/StarknetBlogIngester.test.ts
- Fix potential infinite recursion in parseSitemap by adding depth limit
  and visited URL tracking
- Return null explicitly when gzip decompression fails in fetchSitemap
- Move getTempDir import to top-level (remove require() inside method)
- Add debug logging for non-retryable HTTP status codes (404, etc.)
- Reset globalBackoffUntil at start of crawl for parallel safety
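The first bullet's recursion guard can be sketched as follows: a depth limit plus a visited-URL set means a sitemap index that references itself (directly or transitively) terminates instead of recursing forever. The names `collectSitemapUrls` and `MAX_SITEMAP_DEPTH`, and the injected fetcher, are illustrative, not the PR's actual `parseSitemap` signature:

```typescript
const MAX_SITEMAP_DEPTH = 5; // assumed limit, not necessarily the PR's value

interface SitemapEntry {
  childSitemaps: string[]; // nested <sitemap> index entries
  pageUrls: string[]; // leaf <url> entries
}

/** Walk a sitemap tree, ignoring already-visited URLs and over-deep nesting. */
async function collectSitemapUrls(
  url: string,
  fetchXml: (u: string) => Promise<SitemapEntry | null>,
  visited: Set<string> = new Set(),
  depth = 0,
): Promise<string[]> {
  if (depth > MAX_SITEMAP_DEPTH || visited.has(url)) return [];
  visited.add(url);
  const entry = await fetchXml(url);
  if (!entry) return []; // fetch/decode failure degrades to "no URLs"
  const urls = [...entry.pageUrls];
  for (const child of entry.childSitemaps) {
    urls.push(...(await collectSitemapUrls(child, fetchXml, visited, depth + 1)));
  }
  return urls;
}
```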
enitrat (Collaborator, Author) commented on Jan 4, 2026

[AUTOMATED] All review feedback has been addressed in commit 8e5b150:

Fixed:

  • ✅ Infinite recursion in parseSitemap (added depth limit + visited URL tracking)
  • ✅ Gzip decompression failure handling (now returns null explicitly)
  • ✅ Moved require() to top-level import
  • ✅ Added debug logging for non-retryable HTTP status codes
  • ✅ Reset globalBackoffUntil at start of crawl for parallel safety

Acknowledged (no changes needed):

  • ALLOWED_YEARS kept hardcoded intentionally for explicit control
  • useUrlMapping: true is intentional for per-page source attribution
  • OZ useUrlMapping fix is intentional
  • Regex patterns verified correct
  • Test coverage suggestions noted for follow-up PR

Ready for re-review!
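The `globalBackoffUntil` reset mentioned above addresses a shared-state hazard: a rate-limit deadline recorded by one crawl could otherwise stall a later run. A minimal sketch of the idea, with illustrative names (only the `globalBackoffUntil` identifier comes from the PR):

```typescript
// Module-level deadline (ms since epoch); 0 means no backoff is active.
let globalBackoffUntil = 0;

/** Called at the start of each crawl so stale state cannot stall a new run. */
function resetGlobalBackoff(): void {
  globalBackoffUntil = 0;
}

/** Record a rate-limit response; every worker waits until this deadline. */
function triggerGlobalBackoff(delayMs: number): void {
  globalBackoffUntil = Math.max(globalBackoffUntil, Date.now() + delayMs);
}

/** Wait out any active global backoff before issuing the next request. */
async function awaitGlobalBackoff(): Promise<void> {
  const wait = globalBackoffUntil - Date.now();
  if (wait > 0) await new Promise<void>((resolve) => setTimeout(resolve, wait));
}
```

A shared deadline (rather than per-request retry timers) keeps concurrent workers from hammering the server in lockstep once one of them has been throttled.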

- Fix axios mock implementation types in tests to properly handle url parameter
- Import AnyNode and Element types from domhandler for cheerio compatibility
- Refactor Buffer.from usage to handle ArrayBuffer types explicitly
- Update Cheerio<Element> type to Cheerio<AnyNode> for proper type inference

All TypeScript compilation errors resolved and tests passing.
enitrat (Collaborator, Author) commented on Jan 4, 2026

[AUTOMATED] Added URL cache feature in commit 5535290:

How it works:

  1. On first run, all URLs are fetched and checked
  2. URLs determined to be NOT 2025/2026 posts are saved to generated/starknet-blog-excluded-urls.json
  3. On subsequent runs, cached URLs are skipped entirely (no HTTP request, no parsing)

Expected output on re-run:

info: Loaded 247 excluded URLs from cache
info: Skipping 247 URLs from cache, processing 15 URLs

This should dramatically speed up subsequent pipeline runs!

Add a persistent cache that stores URLs confirmed NOT to be 2025/2026
blog posts. On subsequent runs, these URLs are skipped entirely,
avoiding redundant HTTP requests and content parsing.

Cache file: generated/starknet-blog-excluded-urls.json

This significantly speeds up re-runs when most blog posts are from
previous years.
enitrat force-pushed the feat/starknet-blog-crawler branch from 5535290 to fc2d6c3 on January 4, 2026 at 13:09
enitrat merged commit b1d3184 into main on Jan 4, 2026; 3 checks passed.