
@aki-008 aki-008 commented Nov 22, 2025

This Pull Request addresses critical stability issues related to network failures and enhances the scraper's resilience against website blocks (403 Forbidden errors). The primary goal is to ensure the scraper runs reliably in various environments (local or cloud) without crashing due to temporary network issues or bot detection.

🛠️ Key Fixes & Changes
Implemented Robust HTTP Session with Retries:

Replaced direct requests.get() calls with a shared requests.Session object configured with a urllib3.util.retry.Retry policy.

This automatically retries failed requests (up to 3 times) on common transient server errors (5xx) and rate-limit responses (429), improving the overall success rate.
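A minimal sketch of this kind of session setup, using the standard requests/urllib3 retry machinery; the exact retry count, backoff, and status codes in the PR may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    # Retry up to 3 times on rate limiting (429) and transient server errors (5xx),
    # with exponential backoff between attempts.
    retry_policy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry_policy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```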

Added Browser Mimicking Headers:

The session now includes a User-Agent and a Referer header. This significantly reduces the likelihood of encountering 403 Forbidden errors, especially when running on cloud IPs or virtual machines.
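Continuing the sketch above; the specific User-Agent and Referer strings here are placeholders, not the values from the diff:

```python
session = build_session()
# Hypothetical browser-like headers; the real values live in pagescrape.py / mcqscrape.py.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.sanfoundry.com/",
})
```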

Enhanced Crash Prevention (Null Checks):

Introduced explicit checks for None values returned by BeautifulSoup (e.g., when searching for entry-content or data.body).

This prevents the application from crashing with an AttributeError (e.g., AttributeError: 'NoneType' object has no attribute 'insert_before') when an initial request is blocked or returns malformed HTML.
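A sketch of the guard pattern, assuming the article body sits in a div with class entry-content; the actual selectors and control flow are in pagescrape.py and mcqscrape.py:

```python
from bs4 import BeautifulSoup
import requests

def fetch_content(session: requests.Session, url: str):
    # Hypothetical helper illustrating the None check; not the exact code in the PR.
    response = session.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    content = soup.find("div", class_="entry-content")
    if content is None:
        # Blocked (403) or malformed page: bail out instead of raising
        # AttributeError: 'NoneType' object has no attribute 'insert_before'.
        print(f"Skipping {url}: no entry-content found (status {response.status_code})")
        return None
    return content
```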

Files Modified
sanfoundry_scraper/pagescrape.py

sanfoundry_scraper/mcqscrape.py

(These changes also introduce two new imports: requests.adapters.HTTPAdapter and urllib3.util.retry.Retry.)

Testing Notes
With the fixed code, the scraper no longer crashes when it hits a 403 error during testing. The added headers also significantly increase the chance of a successful scrape in cases where requests were previously blocked immediately.
