Bug Report
Description
fetch_rss() in herald/collect.py uses httpx.Client to fetch RSS feeds. When fetching Substack RSS feeds (*.substack.com/feed), Cloudflare returns a 403 Forbidden with a "Just a moment..." challenge page, even with a proper User-Agent header set.
The issue is that Cloudflare uses TLS fingerprinting (JA3/JA4) to distinguish automated HTTP clients from browsers. httpx has a distinctive TLS fingerprint that gets blocked, while curl from the same machine succeeds with a 200 response.
Expected Behavior
fetch_rss() should successfully retrieve RSS feeds from Substack and other Cloudflare-protected sites.
Current Behavior
All Substack feeds return 403. The _fetch_with_retry() function exhausts retries and returns None, resulting in 0 items collected from those sources.
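If the block really is keyed to the client's TLS fingerprint, retrying with the same client seems unlikely to change the outcome. A minimal sketch of how a challenge response might be told apart from other failures so that a retry loop could stop early. `is_cloudflare_challenge` is a hypothetical helper, not part of Herald, and the marker string and 500-character window are simply the ones observed in this report:

```python
def is_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Heuristic: a 403 whose body opens with Cloudflare's interstitial text."""
    # Hypothetical helper (not in Herald). The marker is the string seen in
    # this report; other challenge pages may use different wording.
    return status_code == 403 and "Just a moment" in body[:500]
```

This is only a heuristic: it would miss challenge pages with different wording or a different status code.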
Suggested Fix
Add a curl subprocess fallback in fetch_rss() when httpx returns a non-200 response. Example:
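import subprocess

def _fetch_with_curl(url: str, timeout: int = 10) -> str | None:
    """Fetch url with the curl binary; return the body, or None on failure."""
    try:
        result = subprocess.run(
            # --fail makes curl exit non-zero on HTTP errors such as 403/404
            ["curl", "-sL", "--fail", "--max-time", str(timeout),
             "-H", "User-Agent: Mozilla/5.0 ...",
             "-H", "Accept: application/rss+xml, application/xml, text/xml, */*",
             url],
            capture_output=True, text=True, timeout=timeout + 5,
        )
    except (OSError, subprocess.SubprocessError):
        # curl is not installed, or the subprocess timed out
        return None
    # Reject challenge pages that still come back with a successful exit code
    if result.returncode == 0 and result.stdout and "Just a moment" not in result.stdout[:500]:
        return result.stdout
    return None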
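One way the fallback could be wired in, as a sketch only: the fetchers are passed in as callables so the ordering logic can be exercised without network access. `fetch_with_fallback` and the `Fetcher` alias are hypothetical names, not part of Herald, and how `fetch_rss()` actually structures its httpx call is not shown in this report.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns the response body, or None on failure.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher) -> Optional[str]:
    """Return primary(url); if that yields None, return fallback(url)."""
    body = primary(url)
    if body is not None:
        return body
    # Primary failed (for example, retries exhausted on a 403): try the fallback.
    return fallback(url)
```

In `fetch_rss()`, `primary` could wrap the existing httpx path and `fallback` could be `_fetch_with_curl`, so curl is only invoked for feeds that httpx cannot retrieve.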
Reproduction
# Any Substack feed triggers this
python3 -c "
import httpx
r = httpx.get('https://example.substack.com/feed', follow_redirects=True)
print(r.status_code) # 403 — Cloudflare challenge
"
# Same URL works with curl
curl -sL -o /dev/null -w '%{http_code}' 'https://example.substack.com/feed'
# 200
Frequency
Always — affects all Substack-hosted feeds.
Environment
- Herald: 2.0.0
- OS: Linux 6.8.0-90-generic
- Python: 3.12.3
- httpx: 0.28.1
- curl: available; its requests are not blocked by Cloudflare