
@aki-008 aki-008 commented Nov 22, 2025

This Pull Request addresses critical stability issues related to network failures and enhances the scraper's resilience against website blocks (403 Forbidden errors). The primary goal is to ensure the scraper runs reliably in various environments (local or cloud) without crashing due to temporary network issues or bot detection.

🛠️ Key Fixes & Changes
Implemented Robust HTTP Session with Retries:

Replaced direct requests.get() calls with a shared requests.Session object configured with a urllib3.util.retry.Retry policy.

This automatically retries failed requests (up to 3 times) on common transient server errors (5xx) and rate-limit responses (429), improving the overall success rate.
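A minimal sketch of this kind of session setup, using the standard requests/urllib3 retry machinery; the exact retry count, backoff, and status codes in the PR may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session() -> requests.Session:
    # Retry up to 3 times on rate limiting (429) and transient server errors (5xx),
    # with exponential backoff between attempts.
    retry_policy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry_policy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```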

Added Browser Mimicking Headers:

The session now includes a User-Agent and a Referer header. This significantly reduces the likelihood of encountering 403 Forbidden errors, especially when running on cloud IPs or virtual machines.
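Continuing the sketch above; the specific User-Agent and Referer strings here are placeholders, not the values from the diff:

```python
session = build_session()
# Hypothetical browser-like headers; the real values live in pagescrape.py / mcqscrape.py.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://www.sanfoundry.com/",
})
```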

Enhanced Crash Prevention (Null Checks):

Introduced explicit checks for None values returned by BeautifulSoup (e.g., when searching for entry-content or data.body).

This prevents the application from crashing with an AttributeError (e.g., AttributeError: 'NoneType' object has no attribute 'insert_before') when an initial request is blocked or returns malformed HTML.
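A sketch of the guard pattern, assuming the article body sits in a div with class entry-content; the actual selectors and control flow are in pagescrape.py and mcqscrape.py:

```python
from bs4 import BeautifulSoup
import requests

def fetch_content(session: requests.Session, url: str):
    # Hypothetical helper illustrating the None check; not the exact code in the PR.
    response = session.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    content = soup.find("div", class_="entry-content")
    if content is None:
        # Blocked (403) or malformed page: bail out instead of raising
        # AttributeError: 'NoneType' object has no attribute 'insert_before'.
        print(f"Skipping {url}: no entry-content found (status {response.status_code})")
        return None
    return content
```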

Files Modified
sanfoundry_scraper/pagescrape.py

sanfoundry_scraper/mcqscrape.py

(These changes also introduce two new imports: requests.adapters.HTTPAdapter and urllib3.util.retry.Retry.)

Testing Notes
With the fixed code, the scraper no longer crashes when it hits a 403 error during testing. The added headers also significantly increase the chance of a successful scrape in cases where requests were previously blocked immediately.
