feat(ingester): crawl Starknet blog 2025-2026 #100
Conversation
enitrat
left a comment
[AUTOMATED]
Overall Assessment
This is a well-structured PR that simplifies the Starknet blog ingestion pipeline by consolidating a two-step Python→TS process into a single TypeScript crawler. The approach is sound and the implementation demonstrates good engineering practices (rate limiting, retry logic, proper URL normalization).
Strengths
- Removes Python dependency for this ingester, simplifying deployment
- Good test coverage for core functionality (sitemap discovery, year detection, chunk metadata)
- Reuses existing patterns (`RecursiveMarkdownSplitter`, `MarkdownIngester` base class)
- Proper error handling with retries and exponential backoff
- Clean URL normalization and filtering logic
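The normalization and filtering step called out above might be sketched as follows. This is illustrative only: the function name, the accepted path prefixes, and the exact rules are assumptions, not the PR's actual code.

```typescript
// Sketch of blog-URL normalization and filtering (illustrative names and
// path prefixes; not the PR's actual implementation).
function normalizeBlogUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a valid absolute URL
  }
  // Drop query strings and fragments; they don't identify a distinct post.
  url.search = "";
  url.hash = "";
  // Collapse trailing slashes so /posts/foo and /posts/foo/ dedupe to one key.
  const path = url.pathname.replace(/\/+$/, "");
  // Keep only blog-post paths (assumed prefixes).
  if (!path.startsWith("/posts/") && !path.startsWith("/blog/")) return null;
  return `${url.origin}${path}`;
}
```

Normalizing before deduplication matters because sitemaps and in-page links often disagree on trailing slashes and tracking parameters.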
Areas for consideration
- The ~700-line file with many module-level helper functions could benefit from being split into focused modules (e.g., `crawler.ts`, `content-extractor.ts`, `url-utils.ts`)
- Some hardcoded values could be configuration-driven for flexibility
- Test coverage for error paths would strengthen confidence
Overall, this is a solid contribution that improves code health. See inline comments for specific suggestions.
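The retry and rate-limiting behavior praised above can be sketched like this. The helper name, attempt counts, and delays are assumptions; `globalBackoffUntil` mirrors the variable named later in the review, and the shared gate is one plausible way to make parallel workers honor a single rate limit.

```typescript
// Sketch of retry with exponential backoff plus a shared backoff gate
// (illustrative; attempt counts and delays are assumptions).
let globalBackoffUntil = 0; // epoch ms before which all requests should wait

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function fetchWithRetry<T>(
  doFetch: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const wait = globalBackoffUntil - Date.now();
    if (wait > 0) await sleep(wait); // honor a backoff set by any parallel worker
    try {
      return await doFetch();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of retries
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      globalBackoffUntil = Math.max(globalBackoffUntil, Date.now() + delay);
      await sleep(delay);
    }
  }
}
```

Because `globalBackoffUntil` is module-level, one 429 response pauses every in-flight worker rather than just the one that hit it.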
enitrat
left a comment
[AUTOMATED]
In-Depth Code Review - Bug Analysis
I've performed a detailed analysis of the ingester logic. The implementation is well-structured with good improvements in the latest commit (gzip support, JSON-LD parsing, global backoff). Here are my findings:
Summary
- 1 potential bug that could cause issues in edge cases
- 1 minor bug with graceful degradation
- 2 design suggestions for robustness
See inline comments for details.
- Fix potential infinite recursion in parseSitemap by adding depth limit and visited URL tracking
- Return null explicitly when gzip decompression fails in fetchSitemap
- Move getTempDir import to top-level (remove require() inside method)
- Add debug logging for non-retryable HTTP status codes (404, etc.)
- Reset globalBackoffUntil at start of crawl for parallel safety
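The recursion fix can be illustrated with a cycle-safe traversal. This is a sketch, not the PR's `parseSitemap`: the depth limit, the `fetchUrls` callback, and its return shape are all stand-ins for the real HTTP + XML parsing step.

```typescript
const MAX_SITEMAP_DEPTH = 5; // assumed limit; sitemap indexes rarely nest deeper

// Sketch of cycle-safe sitemap traversal (illustrative, not the PR's code).
// `fetchUrls` stands in for fetching and parsing one sitemap, returning its
// <loc> entries split into child sitemaps and page URLs.
async function parseSitemap(
  url: string,
  fetchUrls: (u: string) => Promise<{ sitemaps: string[]; pages: string[] }>,
  visited: Set<string> = new Set(),
  depth = 0,
): Promise<string[]> {
  // Both guards are needed: `visited` stops cycles (a sitemap index that
  // points back at itself), the depth cap bounds pathological nesting.
  if (depth > MAX_SITEMAP_DEPTH || visited.has(url)) return [];
  visited.add(url);
  const { sitemaps, pages } = await fetchUrls(url);
  const nested = await Promise.all(
    sitemaps.map((child) => parseSitemap(child, fetchUrls, visited, depth + 1)),
  );
  return [...pages, ...nested.flat()];
}
```

Without the `visited` set, two sitemap index files referencing each other would recurse until the stack overflows.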
[AUTOMATED] All review feedback has been addressed in commit 8e5b150: Fixed:
Acknowledged (no changes needed):
Ready for re-review!
- Fix axios mock implementation types in tests to properly handle the url parameter
- Import AnyNode and Element types from domhandler for cheerio compatibility
- Refactor Buffer.from usage to handle ArrayBuffer types explicitly
- Update Cheerio&lt;Element&gt; type to Cheerio&lt;AnyNode&gt; for proper type inference

All TypeScript compilation errors resolved and tests passing.
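The `Buffer.from`/`ArrayBuffer` fix and the explicit-null gzip behavior mentioned in review can be sketched together. This is illustrative, assuming an axios response fetched with `responseType: "arraybuffer"`; the function name is not from the PR.

```typescript
import { gunzipSync } from "node:zlib";

// Sketch of sitemap body decoding (illustrative; not the PR's fetchSitemap).
// Typing the parameter as ArrayBuffer makes the Buffer.from overload
// resolution unambiguous, which is the class of TS error fixed above.
function decodeSitemapBody(body: ArrayBuffer, gzipped: boolean): string | null {
  const buf = Buffer.from(body); // Buffer.from accepts ArrayBuffer directly
  if (!gzipped) return buf.toString("utf8");
  try {
    return gunzipSync(buf).toString("utf8");
  } catch {
    return null; // corrupt .gz payload: signal failure explicitly, don't throw
  }
}
```

Returning `null` rather than throwing lets the caller log and skip a single bad sitemap instead of aborting the whole crawl.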
[AUTOMATED] Added URL cache feature in commit 5535290: How it works:
Expected output on re-run: This should dramatically speed up subsequent pipeline runs!
Add a persistent cache that stores URLs confirmed NOT to be 2025/2026 blog posts. On subsequent runs, these URLs are skipped entirely, avoiding redundant HTTP requests and content parsing. Cache file: generated/starknet-blog-excluded-urls.json This significantly speeds up re-runs when most blog posts are from previous years.
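A minimal sketch of such a persistent exclusion cache, assuming a JSON array on disk as the description's cache file suggests (the class and method names here are illustrative, not the PR's):

```typescript
import { readFileSync, writeFileSync, mkdirSync } from "node:fs";
import { dirname } from "node:path";

// Sketch of a persistent excluded-URL cache (illustrative names; the PR
// stores its cache at generated/starknet-blog-excluded-urls.json).
class ExcludedUrlCache {
  private urls: Set<string>;

  constructor(private path: string) {
    try {
      this.urls = new Set(JSON.parse(readFileSync(path, "utf8")));
    } catch {
      this.urls = new Set(); // first run or unreadable cache: start empty
    }
  }

  has(url: string): boolean {
    return this.urls.has(url);
  }

  // Record a URL confirmed NOT to be a 2025/2026 post, so later runs skip it.
  add(url: string): void {
    this.urls.add(url);
  }

  save(): void {
    mkdirSync(dirname(this.path), { recursive: true });
    writeFileSync(this.path, JSON.stringify([...this.urls].sort(), null, 2));
  }
}
```

Only negative results are safe to cache this way: a post's publish year never changes once known, whereas caching "is a 2025/2026 post" would go stale if content were edited or removed.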
Force-pushed from 5535290 to fc2d6c3
Summary
Context
Moves Starknet blog ingestion to a single-step crawl (no pre-generated markdown), while keeping per-page source attribution and avoiding rate-limit issues.
Test Plan