Fix/clean path no urlscan #1
Merged
Clean path now fetches DOM directly with a 5s timeout and 1MB cap rather than submitting every deployment to URLScan. Free-tier URLScan is ~2/min; at CT log volume this backed up the worker pool so ~93% of cards sat in pending forever.

- lib/metadata.js: fetchDom, extractTitle, extractMetaDescription
- lib/fingerprint.js: detectContentTags now accepts hostname; add detectSuspiciousHostname reusing PRIORITY_KEYWORDS (single source of truth)
- index.js: rewrite processEntry. URLhaus hit -> URLScan (flagged). Otherwise fetch DOM, fingerprint, check hostname. Suspicious hostname + URLSCAN_KEY -> escalate to URLScan for sandboxed evidence. Otherwise Playwright screenshot and mark clean.
- preflight404 removed; fetchDom surfaces a 404 through its status code
- AbuseIPDB auto-report call removed: scanner.js currently treats every non-Vercel IP as C2, so reporting would spam CDN operators with false positives. Will be re-enabled after scanner.js uses real URLScan verdicts.

Smoke test (60s run): 18/18 entries reached a final state, no pending backlog. Titles being extracted. deploysPerHour now shows a real rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
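For reviewers, a minimal sketch of what a direct DOM fetch with a 5s timeout and a 1MB cap could look like. It is not the code in lib/metadata.js: the option names (`timeoutMs`, `maxBytes`, `fetchImpl`), the return shape and the title regex are all assumptions, and it presumes Node 18+ for the global `fetch`.

```javascript
// Hypothetical sketch only; names, defaults and return shape are assumptions.
const DEFAULT_TIMEOUT_MS = 5000;        // 5s timeout, per the commit message
const DEFAULT_MAX_BYTES = 1024 * 1024;  // 1MB cap, per the commit message

async function fetchDom(url, opts = {}) {
  const {
    timeoutMs = DEFAULT_TIMEOUT_MS,
    maxBytes = DEFAULT_MAX_BYTES,
    fetchImpl = fetch, // injectable so the function can be tested offline
  } = opts;

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    // Non-2xx responses (including 404) are reported through `status`.
    if (!res.ok || !res.body) {
      return { status: res.status, html: '', truncated: false };
    }

    const reader = res.body.getReader();
    const chunks = [];
    let received = 0;
    let truncated = false;
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      if (received + value.byteLength > maxBytes) {
        // Keep only the bytes that fit under the cap, then stop reading.
        chunks.push(value.subarray(0, maxBytes - received));
        truncated = true;
        await reader.cancel();
        break;
      }
      chunks.push(value);
      received += value.byteLength;
    }
    return {
      status: res.status,
      html: Buffer.concat(chunks).toString('utf8'),
      truncated,
    };
  } catch (err) {
    // status 0 marks "no HTTP response" (timeout or network error).
    const reason = err && err.name === 'AbortError' ? 'timeout' : String(err);
    return { status: 0, html: '', truncated: false, error: reason };
  } finally {
    clearTimeout(timer);
  }
}

// Deliberately simple title extraction; a real page may need an HTML parser.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}
```

Under these assumptions a caller can read `status === 404` directly, which is presumably how the removal of preflight404 is covered.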
Previously extractScanData flagged every non-Vercel remote IP as C2 and every non-vercel.app asset as a redirect domain. For a 12-platform tool this meant every Netlify/CF/Render site had its own origin IPs classified as malicious, along with every CDN it loaded from (Google Fonts, jsdelivr, Cloudflare analytics). Paired with the AbuseIPDB auto-reporter that was previously wired in, this would have shipped libellous reports against CDN operators at scale.

- Remove VERCEL_IP_PREFIXES and isVercelIp. First-party boundary is now the actual page domain per URLScan's result.page.domain.
- c2Ips only populated when result.verdicts.overall.malicious is true.
- scriptSources/redirectDomains now compare against the scanned page's domain, not a hardcoded platform.
- .js query-string suffixes (/foo.js?v=1) now caught by the script regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
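The boundary change can be illustrated with a small sketch. It is not the scanner.js implementation: the `requests` shape (`{ url, ip }`), the rule that subdomains of the page domain count as first-party, and the helper name `isFirstParty` are assumptions. Only `result.page.domain` and `result.verdicts.overall.malicious` come from the commit message. The sketch also parses each URL and tests the pathname rather than using a single regex, which is a swapped-in way of handling `/foo.js?v=1`.

```javascript
// Hypothetical sketch; `requests` is assumed to be [{ url, ip }, ...].
function isFirstParty(hostname, pageDomain) {
  // Assumption: subdomains of the scanned page's domain are first-party.
  return hostname === pageDomain || hostname.endsWith('.' + pageDomain);
}

function extractScanData(result, requests) {
  const pageDomain = result.page.domain;
  const malicious = result.verdicts?.overall?.malicious === true;

  const scriptSources = new Set();
  const redirectDomains = new Set();
  const c2Ips = new Set();

  for (const { url, ip } of requests) {
    let parsed;
    try {
      parsed = new URL(url);
    } catch {
      continue; // skip anything that is not a valid absolute URL
    }
    if (isFirstParty(parsed.hostname, pageDomain)) continue;

    redirectDomains.add(parsed.hostname);
    // Testing the pathname means a query string (?v=1) cannot hide a .js file.
    if (/\.js$/i.test(parsed.pathname)) scriptSources.add(url);
    // Remote IPs are only treated as C2 when URLScan's verdict is malicious.
    if (malicious && ip) c2Ips.add(ip);
  }

  return {
    scriptSources: [...scriptSources],
    redirectDomains: [...redirectDomains],
    c2Ips: [...c2Ips],
  };
}
```

With a non-malicious verdict, a Netlify page that loads jsdelivr and Google Fonts yields two third-party domains and an empty `c2Ips`, which is the false-positive case the commit describes.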
Two UI bugs made working filters look broken:

1. Clean cards without a URLScan ID (now the norm, since commit 2724ca9 stops calling URLScan on clean sites) had no actionable links at all, just a bare hostname span. Fixed: card-host is now an <a> for non-flagged, non-suspicious entries. Flagged/suspicious stay as text-only so we never auto-link users to phishing.
2. Filters like Interesting / Flagged / Suspicious render blank when no entries match. The previous empty-state only showed when the feed was completely empty, so "no matches" looked identical to "server broken". Now any time visible === 0, show a filter-aware message: "Waiting for new deployments..." when the feed has never received a card, "No deployments match this filter" otherwise.

Verified in browser: clicking Flagged on a 7-card feed now shows the filter-match message; card-host links have correct href + target=_blank.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
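The two UI rules reduce to a pair of small pure functions, sketched here for illustration. The function names and the entry fields (`flagged`, `suspicious`) are assumptions; the two message strings are the ones quoted in the commit message.

```javascript
// Hypothetical sketch; function names and entry field names are assumptions.

// Returns the empty-state text to show, or null when cards are visible.
function emptyStateMessage(visibleCount, totalReceived) {
  if (visibleCount > 0) return null;
  return totalReceived === 0
    ? 'Waiting for new deployments...'
    : 'No deployments match this filter';
}

// Only clean entries get a clickable hostname; flagged or suspicious ones
// stay text-only so the UI never links straight to a phishing page.
function shouldLinkHost(entry) {
  return !entry.flagged && !entry.suspicious;
}
```

Keeping these as pure functions makes the "no matches" versus "nothing received yet" distinction testable without a browser.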
…tests

Cleanup bundle after the scan-pipeline rewrite:

- lib/platforms.js: drop 'brilliant-' from Netlify internalRe. It was one of Netlify's default adjective prefixes for auto-named sites; filtering it silently dropped roughly 1 in N free-tier Netlify deployments.
- lib/rss.js: feed title and description no longer claim this is Vercel-only.
- lib/github-pages.js: compare event IDs by equality instead of lexical <=. GitHub IDs are strings that can exceed Number.MAX_SAFE_INTEGER, and lexical compare breaks at digit-length boundaries; equality-stop catches up correctly with 100-per-page polling.
- index.js: startup log says 'deployment-feed' not 'vercel-feed'; add a 204-no-content favicon route to suppress the console 404.
- test/certstream.test.js: fix broken imports (isValidDeployment lives in lib/platforms.js, not lib/certstream.js) and drop tests that called extractHostnames with a structured object; the real signature is a base64 TLS leaf blob. Added Netlify brilliant-* coverage.
- test/queue.test.js: removed two tests that asserted a MAX_QUEUE_SIZE cap feature that was never actually implemented in JobQueue.push.

Full test suite now 66/66 passing. Previously the npm test glob failed to expand on Windows and those two files never ran cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
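To make the event-ID point concrete, here is a minimal sketch of an equality-stop scan. It is not the lib/github-pages.js code: the function name is hypothetical, and it assumes each polled page is ordered newest-first and that IDs arrive as strings.

```javascript
// Hypothetical sketch; assumes `events` is newest-first with string IDs.
function takeNewEvents(events, lastSeenId) {
  const fresh = [];
  for (const ev of events) {
    // Stop at the exact ID we saw last time; no ordering comparison needed.
    if (ev.id === lastSeenId) break;
    fresh.push(ev);
  }
  return fresh;
}

// Why lexical comparison is unsafe: strings compare character by character,
// so a longer (newer) ID can sort before a shorter (older) one.
const lexicalSaysOlder = '10' <= '9'; // true, although 10 > 9 numerically
```

If `lastSeenId` is not on the page at all (first run, or more new events than one page holds), the function returns the whole page, which is the behaviour a caller would want before fetching the next page.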
…dcards
Three interlocking changes to actually make the 12-platform story work:
1. CT log throughput bumped 4x. MAX_PER_POLL was 2048, which left Argon/Xenon
permanently behind real ingestion rate — platforms issuing Let's Encrypt
certs (Render, Railway, Fly.io, Deno) barely surfaced. Raised to 8192.
2. crt.sh poller is now resilient and selective:
- Only polls platforms flagged crtshSupplement=true (Render, Replit, Deno,
Railway). Previously polled all 12, including Fly.io (always 404s),
Vercel/Netlify/Glitch/Surge (wildcard-only, returns nothing useful).
Full cycle is now ~8min instead of ~24min.
- Exponential backoff on 502/504/429 (crt.sh is often flaky).
- After 3 consecutive failures, a platform is paused for 30min — stops
the poller hammering a dead endpoint.
- HTML-instead-of-JSON bodies (crt.sh's common error mode) no longer
silently swallowed — they're logged as the failure reason.
3. Each platform in lib/platforms.js now has an 'ingestion' field documenting
how its deployments become visible:
- 'ct-log' — per-deployment cert lands in CT logs
- 'events-api' — GitHub Events API
- 'wildcard' — single *.domain cert, no public firehose exists
map.html now says "Platforms tracked: 8" with a note that Vercel, Netlify,
Glitch and Surge use wildcard certs — setting accurate expectations rather
than showing empty planets users assume are broken.
Smoke test: startup log confirms 4-platform crt.sh rotation, Argon grabs
+4/poll vs +1 before, no 404/502 noise in logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
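The crt.sh resilience rules in item 2 above can be sketched as follows. This is an illustration, not the poller's code: the class and function names, the 1s base delay and the 60s cap are assumptions. The retryable status codes, the three-failure threshold and the 30min pause come from the commit message.

```javascript
// Hypothetical sketch; names, base delay and cap are assumptions.
const RETRYABLE_STATUS = new Set([429, 502, 504]); // per the commit message
const MAX_CONSECUTIVE_FAILURES = 3;                // per the commit message
const PAUSE_MS = 30 * 60 * 1000;                   // 30min pause

// Exponential backoff: base, 2x base, 4x base, ... up to a cap.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Tracks consecutive failures for one platform and pauses it when needed.
class PlatformHealth {
  constructor() {
    this.failures = 0;
    this.pausedUntil = 0;
  }
  isPaused(now = Date.now()) {
    return now < this.pausedUntil;
  }
  recordSuccess() {
    this.failures = 0;
  }
  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.failures >= MAX_CONSECUTIVE_FAILURES) {
      this.pausedUntil = now + PAUSE_MS;
      this.failures = 0; // start counting afresh once the pause ends
    }
  }
}

// Turns a raw response into either parsed data or a loggable failure reason.
function classifyResponse(status, text) {
  if (RETRYABLE_STATUS.has(status)) {
    return { ok: false, retry: true, reason: `HTTP ${status}` };
  }
  if (text.trimStart().startsWith('<')) {
    // An HTML error page served where JSON was expected.
    return { ok: false, retry: false, reason: 'HTML body instead of JSON' };
  }
  try {
    return { ok: true, data: JSON.parse(text) };
  } catch (err) {
    return { ok: false, retry: false, reason: `invalid JSON: ${err.message}` };
  }
}
```

A poller built on this would skip any platform whose `isPaused()` is true, and log `reason` on every failure instead of swallowing it.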
node --test doesn't expand globs itself; it relies on the shell. Bash without globstar and Windows bash both treat ** as *, so the previous test/**/*.test.js pattern matched only subdirectories of test/, and since all test files live directly in test/, the glob resolved to nothing and CI failed with "Could not find ...". All existing tests live at test/*.test.js so the simple glob catches them.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
No description provided.