Fix/clean path no urlscan by C-Moir · Pull Request #1 · C-Moir/deployment-feed

C-Moir · 2026-04-20T07:13:24Z

No description provided.

Clean path now fetches DOM directly with a 5s timeout and 1MB cap rather than submitting every deployment to URLScan. Free-tier URLScan is ~2/min; at CT log volume this backed up the worker pool so ~93% of cards sat in pending forever.

- lib/metadata.js: fetchDom, extractTitle, extractMetaDescription
- lib/fingerprint.js: detectContentTags now accepts hostname; add detectSuspiciousHostname reusing PRIORITY_KEYWORDS (single source of truth)
- index.js: rewrite processEntry. URLhaus hit -> URLScan (flagged). Otherwise fetch DOM, fingerprint, check hostname. Suspicious hostname + URLSCAN_KEY -> escalate to URLScan for sandboxed evidence. Otherwise Playwright screenshot and mark clean.
- preflight404 removed; fetchDom returns 404 via its status code
- AbuseIPDB auto-report call removed: scanner.js currently treats every non-Vercel IP as C2, so reporting would spam CDN operators with false positives. Will be re-enabled after scanner.js uses real URLScan verdicts.

Smoke test (60s run): 18/18 entries reached a final state, no pending backlog. Titles being extracted. deploysPerHour now shows a real rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
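A minimal sketch of how a direct DOM fetch with a 5s timeout and 1MB cap might look. This is not the code in lib/metadata.js: the helper `readCapped`, the constant names, the return shape of `fetchDom`, and the title regex are all assumptions made for illustration. It relies only on globals available in recent Node releases (`fetch`, `AbortSignal.timeout`, `Buffer`).

```javascript
// Hypothetical sketch; limits mirror the commit message, names are illustrative.
const DOM_TIMEOUT_MS = 5000;        // 5s timeout
const DOM_MAX_BYTES = 1024 * 1024;  // 1MB cap

// Read at most maxBytes from a fetch Response body, then stop downloading.
async function readCapped(response, maxBytes) {
  if (!response.body) return '';
  const reader = response.body.getReader();
  const chunks = [];
  let total = 0;
  while (total < maxBytes) {
    const { done, value } = await reader.read();
    if (done) break;
    // Trim the final chunk so the total never exceeds the cap.
    const slice = value.subarray(0, maxBytes - total);
    chunks.push(slice);
    total += slice.length;
  }
  // Release the connection; ignore errors from an already-closed stream.
  await reader.cancel().catch(() => {});
  return Buffer.concat(chunks).toString('utf8');
}

// Fetch a page and return its status code plus the (possibly truncated) HTML.
// A 404 is reported through `status`, so no separate preflight request is needed.
async function fetchDom(url) {
  const response = await fetch(url, {
    redirect: 'follow',
    signal: AbortSignal.timeout(DOM_TIMEOUT_MS),
  });
  const html = await readCapped(response, DOM_MAX_BYTES);
  return { status: response.status, html };
}

// Pull the <title> text out of raw HTML, collapsing whitespace.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}
```

A caller would await `fetchDom(url)`, branch on `status`, and pass `html` to `extractTitle` and the fingerprinting step.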

Previously extractScanData flagged every non-Vercel remote IP as C2 and every non-vercel.app asset as a redirect domain. For a 12-platform tool this meant every Netlify/CF/Render site had its own origin IPs and every CDN (Google Fonts, jsdelivr, Cloudflare analytics) classified as malicious. Paired with the AbuseIPDB auto-reporter that was previously wired in, this would have shipped libellous reports against CDN operators at scale.

- Remove VERCEL_IP_PREFIXES and isVercelIp. First-party boundary is now the actual page domain per URLScan's result.page.domain.
- c2Ips only populated when result.verdicts.overall.malicious is true.
- scriptSources/redirectDomains now compare against the scanned page's domain, not a hardcoded platform.
- .js query-string suffixes (/foo.js?v=1) now caught by the script regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
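The verdict-gated extraction described above could be sketched roughly as follows. Only `result.page.domain` and `result.verdicts.overall.malicious` come from the commit message; reading IPs and URLs from `result.lists.ips` and `result.lists.urls`, the helper `isFirstParty`, and the exact regex are assumptions, so treat this as an illustration rather than the real scanner.js.

```javascript
// Hypothetical sketch of verdict-gated extraction from a URLScan result object.
// Matches ".js" at the end of the path or followed by a query string or fragment.
const SCRIPT_RE = /\.js(?:[?#]|$)/i;

// A host is first-party if it is the scanned page's domain or a subdomain of it.
function isFirstParty(hostname, pageDomain) {
  return hostname === pageDomain || hostname.endsWith('.' + pageDomain);
}

function extractScanData(result) {
  const pageDomain = result?.page?.domain ?? '';
  const malicious = result?.verdicts?.overall?.malicious === true;
  const urls = result?.lists?.urls ?? []; // assumed field
  const ips = result?.lists?.ips ?? [];   // assumed field

  const scriptSources = [];
  const redirectDomains = new Set();
  for (const raw of urls) {
    let parsed;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip anything that is not a parseable absolute URL
    }
    if (isFirstParty(parsed.hostname, pageDomain)) continue;
    redirectDomains.add(parsed.hostname);
    if (SCRIPT_RE.test(parsed.pathname + parsed.search)) scriptSources.push(raw);
  }

  return {
    // IPs are only treated as C2 when URLScan's overall verdict is malicious.
    c2Ips: malicious ? [...ips] : [],
    scriptSources,
    redirectDomains: [...redirectDomains],
  };
}
```

With this shape, a clean Netlify page that loads Google Fonts yields an empty `c2Ips` list, and its own origin is never counted as a third-party domain.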

Two UI bugs made working filters look broken:

1. Clean cards without a URLScan ID (now the norm, since commit 2724ca9 stops calling URLScan on clean sites) had no actionable links at all, just a bare hostname span. Fixed: card-host is now an <a> for non-flagged, non-suspicious entries. Flagged/suspicious stay as text-only so we never auto-link users to phishing.
2. Filters like Interesting / Flagged / Suspicious render blank when no entries match. The previous empty-state only showed when the feed was completely empty, so "no matches" looked identical to "server broken". Now any time visible === 0, show a filter-aware message: "Waiting for new deployments..." when the feed has never received a card, "No deployments match this filter" otherwise.

Verified in browser: clicking Flagged on a 7-card feed now shows the filter-match message; card-host links have correct href + target=_blank.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
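The two UI rules above reduce to a pair of small decisions, sketched here as pure functions. The function names and the `flagged` / `suspicious` entry fields are hypothetical; only the message strings and the visible === 0 condition come from the commit message.

```javascript
// Hypothetical sketch of the two UI decisions; not the actual front-end code.

// Returns the empty-state message to show, or null when cards are visible.
function emptyStateMessage(visibleCount, everReceivedCard) {
  if (visibleCount > 0) return null;
  return everReceivedCard
    ? 'No deployments match this filter'
    : 'Waiting for new deployments...';
}

// Only clean entries get a clickable hostname; flagged or suspicious ones
// stay text-only so users are never auto-linked to a phishing page.
function hostIsLinkable(entry) {
  return !entry.flagged && !entry.suspicious;
}
```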

…tests

Cleanup bundle after the scan-pipeline rewrite:

- lib/platforms.js: drop 'brilliant-' from Netlify internalRe. It was one of Netlify's default adjective prefixes for auto-named sites; filtering it silently dropped roughly 1 in N free-tier Netlify deployments.
- lib/rss.js: feed title and description no longer claim this is Vercel-only.
- lib/github-pages.js: compare event IDs by equality instead of lexical <=. GitHub IDs are strings that can exceed Number.MAX_SAFE_INTEGER, and lexical compare breaks at digit-length boundaries; equality-stop catches up correctly with 100-per-page polling.
- index.js: startup log says 'deployment-feed' not 'vercel-feed'; add a 204-no-content favicon route to suppress the console 404.
- test/certstream.test.js: fix broken imports (isValidDeployment lives in lib/platforms.js, not lib/certstream.js) and drop tests that called extractHostnames with a structured object; the real signature is a base64 TLS leaf blob. Added Netlify brilliant-* coverage.
- test/queue.test.js: removed two tests that asserted a MAX_QUEUE_SIZE cap feature that was never actually implemented in JobQueue.push.

Full test suite now 66/66 passing. Previously the npm test glob failed to expand on Windows and those two files never ran cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
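To illustrate the event-ID point, here is a small sketch of an equality-stop scan next to the lexical comparison it replaces. The function name `takeNewEvents` and the event shape are assumptions; the sketch also assumes the events arrive newest first, as the GitHub Events API returns them.

```javascript
// Hypothetical sketch of equality-stop polling; not the code in lib/github-pages.js.
// `events` is assumed to be ordered newest first, with string IDs.
function takeNewEvents(events, lastSeenId) {
  const fresh = [];
  for (const event of events) {
    // Stop at the exact event we processed last time.
    if (event.id === lastSeenId) break;
    fresh.push(event);
  }
  return fresh;
}

// Why lexical <= fails: string comparison goes character by character,
// so an 11-digit ID sorts before a 10-digit one that starts with '9'.
const newer = '10000000000';
const older = '9999999999';
const lexicalSaysAlreadySeen = newer <= older; // true, although `newer` is the later ID
```

If `lastSeenId` is not on the page at all, every event on that page is returned, which is how the equality-stop approach catches up after a gap.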

…dcards

Three interlocking changes to actually make the 12-platform story work:

1. CT log throughput bumped 4x. MAX_PER_POLL was 2048, which left Argon/Xenon permanently behind real ingestion rate; platforms issuing Let's Encrypt certs (Render, Railway, Fly.io, Deno) barely surfaced. Raised to 8192.
2. crt.sh poller is now resilient and selective:
   - Only polls platforms flagged crtshSupplement=true (Render, Replit, Deno, Railway). Previously polled all 12, including Fly.io (always 404s), Vercel/Netlify/Glitch/Surge (wildcard-only, returns nothing useful). Full cycle is now ~8min instead of ~24min.
   - Exponential backoff on 502/504/429 (crt.sh is often flaky).
   - After 3 consecutive failures, a platform is paused for 30min, which stops the poller hammering a dead endpoint.
   - HTML-instead-of-JSON bodies (crt.sh's common error mode) no longer silently swallowed; they're logged as the failure reason.
3. Each platform in lib/platforms.js now has an 'ingestion' field documenting how its deployments become visible:
   - 'ct-log': per-deployment cert lands in CT logs
   - 'events-api': GitHub Events API
   - 'wildcard': single *.domain cert, no public firehose exists

map.html now says "Platforms tracked: 8" with a note that Vercel, Netlify, Glitch and Surge use wildcard certs, setting accurate expectations rather than showing empty planets users assume are broken.

Smoke test: startup log confirms 4-platform crt.sh rotation, Argon grabs +4/poll vs +1 before, no 404/502 noise in logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
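The crt.sh resilience rules above can be sketched as a few pure helpers. The status codes, the 3-failure threshold, and the 30min pause come from the commit message; the base delay, the delay cap, the state shape, and every function name here are assumptions chosen for illustration.

```javascript
// Hypothetical sketch of the poller's failure handling; not the actual poller code.
const RETRYABLE_STATUSES = new Set([429, 502, 504]);
const MAX_CONSECUTIVE_FAILURES = 3;
const PAUSE_MS = 30 * 60 * 1000; // 30min pause

function isRetryableStatus(status) {
  return RETRYABLE_STATUSES.has(status);
}

// Exponential backoff: base, 2x base, 4x base, ... up to a cap (assumed values).
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// crt.sh often answers with an HTML error page; surface that as a reason
// instead of letting JSON.parse fail silently.
function parseCrtShBody(text) {
  const trimmed = text.trimStart();
  if (trimmed.startsWith('<')) {
    return { ok: false, reason: 'HTML body instead of JSON' };
  }
  try {
    return { ok: true, rows: JSON.parse(trimmed) };
  } catch (err) {
    return { ok: false, reason: `invalid JSON: ${err.message}` };
  }
}

// Per-platform state transition: reset on success, pause after 3 failures in a row.
function recordResult(state, ok, now) {
  if (ok) return { failures: 0, pausedUntil: 0 };
  const failures = state.failures + 1;
  if (failures >= MAX_CONSECUTIVE_FAILURES) {
    return { failures: 0, pausedUntil: now + PAUSE_MS };
  }
  return { failures, pausedUntil: state.pausedUntil };
}

function isPaused(state, now) {
  return now < state.pausedUntil;
}
```

Keeping the state transition pure makes the pause rule easy to unit-test without a clock or network access.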

node --test doesn't expand globs itself; it relies on the shell. Bash without globstar and Windows bash both treat ** as *, so the previous test/**/*.test.js pattern matched only files inside subdirectories of test/. Since all test files live directly in test/, the glob resolved to nothing and CI failed with "Could not find ...". All existing tests live at test/*.test.js so the simple glob catches them.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Cam and others added 6 commits April 20, 2026 16:41

C-Moir merged commit 5cee20c into master Apr 20, 2026
1 check passed

C-Moir merged 6 commits into master from fix/clean-path-no-urlscan