httpx blocked by Cloudflare TLS fingerprinting on Substack feeds — needs curl fallback #4

@11me

Bug Report

Description

fetch_rss() in herald/collect.py uses httpx.Client to fetch RSS feeds. For Substack RSS feeds (*.substack.com/feed), Cloudflare responds with 403 Forbidden and a "Just a moment..." challenge page, even when a browser-like User-Agent header is set.

The issue is that Cloudflare uses TLS fingerprinting (JA3/JA4) to distinguish automated HTTP clients from browsers. httpx has a distinctive TLS fingerprint that gets blocked, while curl from the same machine succeeds with a 200 response.
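A minimal sketch of how the challenge response described above could be recognised, so that a fallback fires only for this case and not for unrelated errors. The helper name looks_like_cloudflare_challenge is hypothetical and not part of Herald; the 403 status and the "Just a moment" marker are taken from the behaviour reported in this issue, and the 500-character window mirrors the check in the suggested fix.

```python
def looks_like_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for the Cloudflare challenge page reported in this issue.

    Assumption: the challenge is a 403 whose HTML contains "Just a moment"
    near the top of the body. Other challenge variants are not covered.
    """
    return status_code == 403 and "Just a moment" in body[:500]
```

A plain 403 without the marker is treated as an ordinary error, so the caller can still handle genuine permission failures separately.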

Expected Behavior

fetch_rss() should successfully retrieve RSS feeds from Substack and other Cloudflare-protected sites.

Current Behavior

All Substack feeds return 403. The _fetch_with_retry() function exhausts retries and returns None, resulting in 0 items collected from those sources.

Suggested Fix

Add a curl subprocess fallback in fetch_rss() when httpx returns a non-200 response. Example:

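import subprocess


def _fetch_with_curl(url: str, timeout: int = 10) -> str | None:
    """Fetch url with the system curl binary; return the body, or None on failure."""
    try:
        result = subprocess.run(
            # -s: silent, -f: non-zero exit on HTTP errors, -L: follow redirects
            ["curl", "-sfL", "--max-time", str(timeout),
             "-H", "User-Agent: Mozilla/5.0 ...",
             "-H", "Accept: application/rss+xml, application/xml, text/xml, */*",
             url],
            capture_output=True, text=True, timeout=timeout + 5,
        )
    except (OSError, subprocess.SubprocessError, UnicodeDecodeError):
        # curl is missing, the subprocess timed out, or the output could not be decoded
        return None
    # Reject empty bodies and Cloudflare challenge pages
    if result.returncode == 0 and result.stdout and "Just a moment" not in result.stdout[:500]:
        return result.stdout
    return None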
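One way the fallback could be wired in is sketched below. This is an illustration under assumptions, not Herald's actual code: fetch_with_fallback and CHALLENGE_MARKER are hypothetical names, and the two fetchers are passed in as callables so the sketch does not depend on how fetch_rss() or _fetch_with_retry() are really structured. In practice the primary callable would wrap the existing httpx path and the fallback would be _fetch_with_curl.

```python
from typing import Callable, Optional

# Marker taken from the challenge page described in this issue.
CHALLENGE_MARKER = "Just a moment"

# A fetcher takes a URL and returns the response body, or None on failure.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher) -> Optional[str]:
    """Return the primary fetcher's body unless it failed or was challenged.

    Hypothetical helper: the fallback is called only when the primary
    returns None or a body that looks like a Cloudflare challenge page.
    """
    body = primary(url)
    if body is not None and CHALLENGE_MARKER not in body[:500]:
        return body
    return fallback(url)
```

Keeping the fallback behind a single function means curl is only spawned for feeds that httpx cannot fetch, so unaffected sources keep their current code path.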

Reproduction

# Any Substack feed triggers this
python3 -c "
import httpx
r = httpx.get('https://example.substack.com/feed', follow_redirects=True)
print(r.status_code)  # 403 — Cloudflare challenge
"
# Same URL works with curl
curl -sL -o /dev/null -w '%{http_code}' 'https://example.substack.com/feed'
# 200

Frequency

Always — affects all Substack-hosted feeds.

Environment

  • Herald: 2.0.0
  • OS: Linux 6.8.0-90-generic
  • Python: 3.12.3
  • httpx: 0.28.1
  • curl: available, not affected by TLS fingerprinting
