
Fix/clean path no urlscan#1

Merged
C-Moir merged 6 commits into master from fix/clean-path-no-urlscan
Apr 20, 2026

Conversation

@C-Moir
Owner

@C-Moir C-Moir commented Apr 20, 2026

No description provided.

Cam and others added 6 commits April 20, 2026 16:41
The clean path now fetches the DOM directly, with a 5s timeout and a 1MB cap,
rather than submitting every deployment to URLScan. Free-tier URLScan allows
~2 scans/min; at CT log volume this backed up the worker pool, so ~93% of
cards sat in pending forever.
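
A minimal sketch of a direct fetch with these limits, assuming Node 18+ (global `fetch`, `AbortSignal.timeout`). The actual lib/metadata.js is not shown here, so the `{ status, html }` return shape, the optional `fetchImpl` parameter, and the regex-based `extractTitle` are illustrative assumptions rather than the merged implementation.

```javascript
const TIMEOUT_MS = 5000;        // 5s budget for the whole request
const MAX_BYTES = 1024 * 1024;  // 1MB cap on the body we keep

// Fetch a page and keep at most MAX_BYTES of its body.
// fetchImpl is injectable (hypothetical parameter) so callers can stub it.
async function fetchDom(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url, { signal: AbortSignal.timeout(TIMEOUT_MS) });
  const chunks = [];
  let received = 0;
  if (res.body) {
    const reader = res.body.getReader();
    while (received < MAX_BYTES) {
      const { done, value } = await reader.read();
      if (done) break;
      const room = MAX_BYTES - received;
      const slice = value.length > room ? value.subarray(0, room) : value;
      chunks.push(slice);
      received += slice.length;
    }
    // Stop downloading once the cap is hit (or the stream has ended).
    await reader.cancel().catch(() => {});
  }
  return { status: res.status, html: Buffer.concat(chunks).toString('utf8') };
}

// Pull the <title> text out of raw HTML, collapsing whitespace.
function extractTitle(html) {
  const m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].replace(/\s+/g, ' ').trim() : null;
}
```

Because the status code comes back alongside the HTML, a caller can treat a 404 as a dead deployment without a separate preflight request.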

- lib/metadata.js: fetchDom, extractTitle, extractMetaDescription
- lib/fingerprint.js: detectContentTags now accepts hostname; add
  detectSuspiciousHostname reusing PRIORITY_KEYWORDS (single source of truth)
- index.js: rewrite processEntry. URLhaus hit -> URLScan (flagged).
  Otherwise fetch DOM, fingerprint, check hostname. Suspicious hostname +
  URLSCAN_KEY -> escalate to URLScan for sandboxed evidence. Otherwise
  Playwright screenshot and mark clean.
- preflight404 removed; fetchDom returns 404 via its status code
- AbuseIPDB auto-report call removed — scanner.js currently treats every
  non-Vercel IP as C2, so reporting would spam CDN operators with false
  positives. Will be re-enabled after scanner.js uses real URLScan verdicts.
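
The routing described for processEntry can be sketched as a pure decision function. This is an assumption-laden outline, not the code from index.js: the keyword values in `PRIORITY_KEYWORDS`, the input field names, and the returned route labels are all hypothetical.

```javascript
// Placeholder keywords; the real PRIORITY_KEYWORDS list is not shown in this PR.
const PRIORITY_KEYWORDS = ['login', 'verify', 'wallet'];

// Hostname check reusing the same keyword list as the content fingerprint.
function detectSuspiciousHostname(hostname) {
  const host = hostname.toLowerCase();
  return PRIORITY_KEYWORDS.some((kw) => host.includes(kw));
}

// Decide which path an entry takes. Route names are illustrative.
function chooseScanPath({ urlhausHit, hostname, hasUrlscanKey }) {
  if (urlhausHit) return 'urlscan-flagged';          // known-bad: sandbox it
  if (detectSuspiciousHostname(hostname) && hasUrlscanKey) {
    return 'urlscan-escalate';                       // suspicious: gather evidence
  }
  return 'screenshot-clean';                         // default: local screenshot
}
```

With this shape, URLScan quota is spent only on entries that are already flagged or look suspicious, which is the point of the change.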

Smoke test (60s run): 18/18 entries reached a final state, no pending
backlog. Titles being extracted. deploysPerHour now shows a real rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Previously extractScanData flagged every non-Vercel remote IP as C2 and every
non-vercel.app asset as a redirect domain. For a 12-platform tool this meant
that every Netlify/CF/Render site had its own origin IPs classified as
malicious, along with every CDN it loaded (Google Fonts, jsdelivr, Cloudflare
analytics). Paired with the AbuseIPDB auto-reporter that was previously wired
in, this would have shipped libellous reports against CDN operators at scale.

- Remove VERCEL_IP_PREFIXES and isVercelIp. First-party boundary is now the
  actual page domain per URLScan's result.page.domain.
- c2Ips only populated when result.verdicts.overall.malicious is true.
- scriptSources/redirectDomains now compare against the scanned page's
  domain, not a hardcoded platform.
- .js query-string suffixes (/foo.js?v=1) now caught by the script regex.
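
A rough sketch of the new boundary logic, assuming the URLScan result object exposes `page.domain`, `verdicts.overall.malicious`, and `lists.urls` / `lists.domains` / `lists.ips`. Treating subdomains of the page domain as first-party is an added assumption, and the function body is illustrative rather than the merged scanner.js code.

```javascript
// Matches a path ending in .js, with or without a query string (/foo.js?v=1).
const SCRIPT_RE = /\.js(?:\?.*)?$/i;

// First-party means the scanned page's own domain (or a subdomain of it).
function isFirstParty(host, pageDomain) {
  return pageDomain !== '' &&
    (host === pageDomain || host.endsWith('.' + pageDomain));
}

function extractScanData(result) {
  const pageDomain = result?.page?.domain ?? '';
  const malicious = result?.verdicts?.overall?.malicious === true;

  const scriptSources = [];
  for (const raw of result?.lists?.urls ?? []) {
    let url;
    try { url = new URL(raw); } catch { continue; } // skip unparseable URLs
    if (isFirstParty(url.hostname, pageDomain)) continue;
    if (SCRIPT_RE.test(url.pathname + url.search)) scriptSources.push(raw);
  }

  const redirectDomains = (result?.lists?.domains ?? [])
    .filter((d) => !isFirstParty(d, pageDomain));

  // IPs are only reported as C2 when URLScan itself says the page is malicious.
  const c2Ips = malicious ? [...(result?.lists?.ips ?? [])] : [];

  return { pageDomain, scriptSources, redirectDomains, c2Ips };
}
```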

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two UI bugs made working filters look broken:

1. Clean cards without a URLScan ID (now the norm, since commit 2724ca9 stops
   calling URLScan on clean sites) had no actionable links at all — just a
   bare hostname span. Fixed: card-host is now an <a> for non-flagged,
   non-suspicious entries. Flagged/suspicious stay as text-only so we never
   auto-link users to phishing.

2. Filters like Interesting / Flagged / Suspicious render blank when no
   entries match. The previous empty-state only showed when the feed was
   completely empty, so "no matches" looked identical to "server broken".
   Now any time visible === 0, show a filter-aware message: "Waiting for new
   deployments..." when the feed has never received a card, "No deployments
   match this filter" otherwise.
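
Both fixes can be sketched as small pure helpers. The entry fields (`hostname`, `flagged`, `suspicious`), the `rel` attribute, and the helper names are assumptions for illustration; only the two message strings and the link/no-link rule come from the description above.

```javascript
// Escape text before placing it in HTML.
function escapeHtml(s) {
  return s.replace(/[&<>"']/g, (c) => (
    { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }[c]
  ));
}

// Link clean hosts; keep flagged/suspicious hosts as plain text so the UI
// never sends a user straight to a phishing page.
function cardHostMarkup(entry) {
  const host = escapeHtml(entry.hostname);
  if (entry.flagged || entry.suspicious) {
    return `<span class="card-host">${host}</span>`;
  }
  return `<a class="card-host" href="https://${host}" target="_blank" rel="noopener noreferrer">${host}</a>`;
}

// Pick the empty-state text, or null when at least one card is visible.
function emptyStateMessage(visibleCount, totalCardsReceived) {
  if (visibleCount > 0) return null;
  return totalCardsReceived === 0
    ? 'Waiting for new deployments...'
    : 'No deployments match this filter';
}
```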

Verified in browser: clicking Flagged on a 7-card feed now shows the
filter-match message; card-host links have correct href + target=_blank.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…tests

Cleanup bundle after the scan-pipeline rewrite:

- lib/platforms.js: drop 'brilliant-' from Netlify internalRe. It was one of
  Netlify's default adjective prefixes for auto-named sites — filtering it
  silently dropped roughly 1 in N free-tier Netlify deployments.
- lib/rss.js: feed title and description no longer claim this is Vercel-only.
- lib/github-pages.js: compare event IDs by equality instead of lexical <=.
  GitHub IDs are strings that can exceed Number.MAX_SAFE_INTEGER, and lexical
  compare breaks at digit-length boundaries; equality-stop catches up
  correctly with 100-per-page polling.
- index.js: startup log says 'deployment-feed' not 'vercel-feed'; add a
  204-no-content favicon route to suppress the console 404.
- test/certstream.test.js: fix broken imports (isValidDeployment lives in
  lib/platforms.js, not lib/certstream.js) and drop tests that called
  extractHostnames with a structured object — the real signature is a
  base64 TLS leaf blob. Added Netlify brilliant-* coverage.
- test/queue.test.js: removed two tests that asserted a MAX_QUEUE_SIZE cap
  feature that was never actually implemented in JobQueue.push.
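
To illustrate the lib/github-pages.js change, here is a small sketch of an equality-stop over a newest-first page of events. The function name and event shape are hypothetical; the point is only that string comparison with `<=` misorders IDs of different lengths, while stopping at the exact last-seen ID does not.

```javascript
// Walk a newest-first page of events and collect everything that arrived
// after lastSeenId. Stops at the first exact ID match.
function takeNewEvents(events, lastSeenId) {
  const fresh = [];
  for (const ev of events) {
    if (ev.id === lastSeenId) break; // caught up with the previous poll
    fresh.push(ev);
  }
  return fresh;
}

// Why lexical <= is unsafe: as strings, an 11-digit ID sorts before a
// 10-digit one because '1' < '9'.
const lexicalSaysOlder = '10000000000' <= '9999999999'; // true, but numerically false
```

If the last-seen ID has already scrolled off the page, this returns the whole page, which is the conservative outcome for a 100-per-page poll.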

Full test suite now 66/66 passing — previously the npm test glob failed to
expand on Windows and those two files never ran cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…dcards

Three interlocking changes to actually make the 12-platform story work:

1. CT log throughput bumped 4x. MAX_PER_POLL was 2048, which left Argon/Xenon
   permanently behind real ingestion rate — platforms issuing Let's Encrypt
   certs (Render, Railway, Fly.io, Deno) barely surfaced. Raised to 8192.

2. crt.sh poller is now resilient and selective:
   - Only polls platforms flagged crtshSupplement=true (Render, Replit, Deno,
     Railway). Previously polled all 12, including Fly.io (always 404s),
     Vercel/Netlify/Glitch/Surge (wildcard-only, returns nothing useful).
     Full cycle is now ~8min instead of ~24min.
   - Exponential backoff on 502/504/429 (crt.sh is often flaky).
   - After 3 consecutive failures, a platform is paused for 30min — stops
     the poller hammering a dead endpoint.
   - HTML-instead-of-JSON bodies (crt.sh's common error mode) no longer
     silently swallowed — they're logged as the failure reason.

3. Each platform in lib/platforms.js now has an 'ingestion' field documenting
   how its deployments become visible:
   - 'ct-log'     — per-deployment cert lands in CT logs
   - 'events-api' — GitHub Events API
   - 'wildcard'   — single *.domain cert, no public firehose exists
   map.html now says "Platforms tracked: 8" with a note that Vercel, Netlify,
   Glitch and Surge use wildcard certs — setting accurate expectations rather
   than showing empty planets users assume are broken.
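
A sketch of how the 'ingestion' field might drive both the crt.sh rotation and the tracked-platform count. The entries below are a hypothetical subset, not the contents of lib/platforms.js, and the helper names are invented for illustration.

```javascript
// Illustrative subset only; the real list has 12 platforms.
const PLATFORMS = [
  { name: 'Render',       ingestion: 'ct-log',     crtshSupplement: true },
  { name: 'Fly.io',       ingestion: 'ct-log',     crtshSupplement: false },
  { name: 'GitHub Pages', ingestion: 'events-api', crtshSupplement: false },
  { name: 'Vercel',       ingestion: 'wildcard',   crtshSupplement: false },
  { name: 'Surge',        ingestion: 'wildcard',   crtshSupplement: false },
];

// Platforms whose deployments can actually be observed.
function trackedPlatforms(platforms) {
  return platforms.filter((p) => p.ingestion !== 'wildcard');
}

// Platforms the crt.sh poller should rotate through.
function crtshTargets(platforms) {
  return platforms.filter((p) => p.crtshSupplement).map((p) => p.name);
}
```

Deriving the map.html count from the same field keeps the UI claim and the pollers from drifting apart.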

Smoke test: startup log confirms 4-platform crt.sh rotation, Argon grabs
+4/poll vs +1 before, no 404/502 noise in logs.
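
The crt.sh poller behaviour from item 2 can be sketched roughly as follows. The retryable statuses, the 3-failure threshold, and the 30min pause come from the description above; the base delay, the delay cap, the state shape, and all function names are assumptions.

```javascript
const RETRYABLE_STATUS = new Set([429, 502, 504]);
const MAX_CONSECUTIVE_FAILURES = 3;
const PAUSE_MS = 30 * 60 * 1000; // 30 minutes

function shouldRetry(status) {
  return RETRYABLE_STATUS.has(status);
}

// Exponential backoff: base * 2^attempt, capped. Base and cap are guesses.
function backoffDelayMs(attempt, baseMs = 2000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// crt.sh often answers with an HTML error page; report that as a failure
// reason instead of swallowing the JSON parse error.
function parseCrtshBody(text) {
  const trimmed = text.trimStart();
  if (trimmed.startsWith('<')) {
    return { ok: false, reason: 'HTML body instead of JSON' };
  }
  try {
    return { ok: true, rows: JSON.parse(trimmed) };
  } catch (err) {
    return { ok: false, reason: 'invalid JSON: ' + err.message };
  }
}

// Per-platform failure tracking; pauses the platform after repeated failures.
function recordOutcome(state, ok, now = Date.now()) {
  if (ok) return { failures: 0, pausedUntil: 0 };
  const failures = state.failures + 1;
  if (failures >= MAX_CONSECUTIVE_FAILURES) {
    return { failures: 0, pausedUntil: now + PAUSE_MS };
  }
  return { failures, pausedUntil: state.pausedUntil };
}

function isPaused(state, now = Date.now()) {
  return now < state.pausedUntil;
}
```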

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
node --test doesn't expand globs itself; it relies on the shell. Bash without
globstar and Windows bash both treat ** as *, so the previous
test/**/*.test.js pattern behaved like test/*/*.test.js and matched only files
one directory below test/. Since all test files live directly in test/, the
glob resolved to nothing and CI failed with "Could not find ...".

All existing tests live at test/*.test.js so the simple glob catches them.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@C-Moir C-Moir merged commit 5cee20c into master Apr 20, 2026
1 check passed