Feat/Fix: Implement Robust Retry Logic, Browser Headers, and Enhanced Error Handling #6
This Pull Request addresses critical stability issues related to network failures and enhances the scraper's resilience against website blocks (403 Forbidden errors). The primary goal is to ensure the scraper runs reliably in various environments (local or cloud) without crashing due to temporary network issues or bot detection.
## 🛠️ Key Fixes & Changes
**Implemented a Robust HTTP Session with Retries**
- Replaced direct `requests.get()` calls with a shared `requests.Session` configured with a `urllib3.util.retry.Retry` policy.
- Failed requests are now retried automatically (up to 3 times) on common transient server errors (5xx) and rate-limit responses (429), improving the overall success rate (see the sketch below).
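A minimal sketch of the session construction. The retry count (3) and status codes (429, 5xx) come from this PR; the backoff factor and the restriction to GET requests are assumptions for illustration:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    """Shared session that retries transient failures automatically."""
    session = requests.Session()
    retry_policy = Retry(
        total=3,                                      # retry up to 3 times (per the PR)
        backoff_factor=1,                             # exponential backoff: 1s, 2s, 4s (assumed)
        status_forcelist=[429, 500, 502, 503, 504],   # rate limits + transient 5xx errors
        allowed_methods=["GET"],                      # assumes the scraper only issues GETs
    )
    adapter = HTTPAdapter(max_retries=retry_policy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```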
**Added Browser-Mimicking Headers**
- The session now sends a `User-Agent` and a `Referer` header on every request. This significantly reduces the likelihood of 403 Forbidden errors, especially when running from cloud IPs or virtual machines (see the sketch below).
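Illustrative only: the exact header strings are assumptions (the PR does not quote them), and `build_session()` refers to the helper sketched above:

```python
# Hypothetical header values; the real strings in the PR may differ.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Referer": "https://www.sanfoundry.com/",  # assumed target site, per the repo name
}

session = build_session()
session.headers.update(BROWSER_HEADERS)  # applied to every request made via the session
```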
**Enhanced Crash Prevention (Null Checks)**
- Introduced explicit checks for `None` values returned by BeautifulSoup lookups (e.g., when searching for `entry-content` or `data.body`).
- This prevents the application from crashing with an `AttributeError` (e.g., `AttributeError: 'NoneType' object has no attribute 'insert_before'`) when a request is blocked or returns malformed HTML (see the sketch below).
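A sketch of the guard pattern; the function name and the skip-and-warn handling are illustrative assumptions, not the PR's exact code:

```python
from bs4 import BeautifulSoup

def extract_entry_content(html: str):
    """Return the main content div, or None if the page was blocked or malformed."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")
    if content is None:
        # Blocked response (e.g. a 403 page) or unexpected markup: bail out here
        # instead of letting a later .insert_before() call raise AttributeError.
        print("Warning: entry-content not found; skipping page.")
        return None
    return content
```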
## Files Modified
- `sanfoundry_scraper/pagescrape.py`
- `sanfoundry_scraper/mcqscrape.py`

Both files gain two new imports to support the retry policy: `requests.adapters.HTTPAdapter` and `urllib3.util.retry.Retry`.
## Testing Notes
With the fixed code, the scraper no longer crashes when it hits a 403 error, and the added headers markedly increase the success rate on requests that were previously blocked outright.