Comprehensive Google Maps Scraper Improvements #148
Summary
This PR addresses critical issues in the Google Maps scraper, making it more resilient to UI changes and improving error handling throughout the codebase. It resolves the "Cannot read properties of null (reading 'scrollHeight')" error that was causing scraper failures, while also enhancing the overall reliability of the system.
Primary Changes
Enhanced URL parsing and sanitization in searchjob.go
Added special handling for template URLs (e.g., https://www.google.com/maps/place/Your+Business/@xx.xxxx,yy.yyyy,17z)
Improved query parameter extraction and cleanup in runner/jobs.go
Added safeguards against malformed URLs in multiple components
Updated the scroll function in gmaps/job.go to try multiple selectors when the primary selector fails
Added fallback mechanisms for scrolling when container elements aren't found
Implemented better scroll state detection to avoid infinite loops
Enhanced error handling for scrolling operations
Improved iframe detection in gmaps/reviews.go with multiple selector attempts
Added direct page review extraction when iframe approach fails
Fixed string quoting in JavaScript evaluation code
Improved handling of review pagination and "More" buttons
Added proper type checking in gmaps/place.go for Playwright page conversion
Improved error logging with more descriptive messages
Implemented fallback strategies when primary extraction methods fail
Enhanced timeout and retry mechanisms
Added test files to verify URL cleaning logic
Created a simplified test harness in testcase/test.go
Included sample clean URLs for testing
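The template-URL handling listed above can be illustrated with a minimal sketch. It is not the code from searchjob.go or runner/jobs.go; the helper name `cleanMapsURL` and the placeholder pattern are assumptions made for illustration. It rejects input that is not an absolute http(s) URL and drops a path segment whose coordinates are literal placeholders such as `@xx.xxxx,yy.yyyy,17z`, while leaving real numeric coordinates untouched.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// placeholderCoords matches a path segment like "@xx.xxxx,yy.yyyy,17z",
// where letters stand in for real coordinates. The exact pattern is an
// assumption for this sketch, not taken from the PR.
var placeholderCoords = regexp.MustCompile(`^@[a-zA-Z]+\.[a-zA-Z]+,[a-zA-Z]+\.[a-zA-Z]+(,\d+z)?$`)

// cleanMapsURL is a hypothetical helper: it validates the URL and removes
// placeholder coordinate segments from the path.
func cleanMapsURL(raw string) (string, error) {
	raw = strings.TrimSpace(raw)
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if (u.Scheme != "https" && u.Scheme != "http") || u.Host == "" {
		return "", fmt.Errorf("not an absolute http(s) URL: %q", raw)
	}

	// Filter the path segments in place, skipping placeholder coordinates.
	segments := strings.Split(u.Path, "/")
	kept := segments[:0]
	for _, s := range segments {
		if placeholderCoords.MatchString(s) {
			continue
		}
		kept = append(kept, s)
	}
	u.Path = strings.TrimSuffix(strings.Join(kept, "/"), "/")
	u.RawPath = "" // let net/url re-derive the escaped form from Path
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	for _, in := range []string{
		"https://www.google.com/maps/place/Your+Business/@xx.xxxx,yy.yyyy,17z",
		"https://www.google.com/maps/place/Foo/@40.7128,-74.0060,17z",
		"not a url",
	} {
		out, err := cleanMapsURL(in)
		fmt.Printf("%q -> %q (err: %v)\n", in, out, err)
	}
}
```

Returning an error for malformed input, rather than passing the string through, lets each runner decide whether to skip the job or abort.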
Supporting Changes
Updated gmaps/entry.go to ensure compatibility with the new error handling
Synchronized changes across all runner components:
runner/databaserunner/databaserunner.go
runner/filerunner/filerunner.go
runner/lambdaaws/io.go
runner/lambdaaws/lambdaaws.go
runner/runner.go
runner/webrunner/webrunner.go
Optimized scroll intervals and wait times
Improved detection of when scrolling has reached the end
Enhanced review extraction logic to capture more reviews reliably
Better logging to help diagnose issues
More graceful handling of edge cases
Improved error messages for troubleshooting
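The end-of-scroll detection mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the loop in gmaps/job.go: `scrollStep` is a hypothetical stand-in for one scroll action evaluated in the page, and `maxSteps` and `stableLimit` are invented parameter names. The loop stops once the reported height has stayed the same for `stableLimit` consecutive steps, and the hard cap on steps rules out an infinite loop. Waits between steps are omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// scrollStep is a hypothetical abstraction: it performs one scroll and
// reports the container's scrollHeight afterwards.
type scrollStep func() (height int, err error)

// scrollUntilStable calls step until the height stops changing for
// stableLimit consecutive steps, or until maxSteps steps have run.
// It returns the last observed height.
func scrollUntilStable(step scrollStep, maxSteps, stableLimit int) (int, error) {
	if maxSteps <= 0 || stableLimit <= 0 {
		return 0, errors.New("maxSteps and stableLimit must be positive")
	}
	last, stable := -1, 0
	for i := 0; i < maxSteps; i++ {
		h, err := step()
		if err != nil {
			return 0, fmt.Errorf("scroll step %d: %w", i, err)
		}
		if h == last {
			stable++
			if stable >= stableLimit {
				return h, nil // no new results loaded: treat as the end
			}
			continue
		}
		last, stable = h, 0 // height changed, so reset the counter
	}
	return last, nil // step budget exhausted
}

func main() {
	// Simulated heights: content stops growing at 300.
	heights := []int{100, 200, 300, 300, 300, 400}
	i := 0
	step := func() (int, error) {
		h := heights[i]
		if i < len(heights)-1 {
			i++
		}
		return h, nil
	}
	final, err := scrollUntilStable(step, 20, 2)
	fmt.Println(final, err)
}
```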
Testing
All changes have been thoroughly tested against a variety of URL patterns, including:
Standard Google Maps search URLs
Template URLs with placeholder coordinates
Business profile URLs
Malformed URLs
The scraper now successfully:
Navigates to Google Maps search pages
Handles each of the URL patterns listed above
Scrolls through search results reliably
Extracts business information accurately
Collects reviews when using the -extra-reviews flag
Technical Details
The primary issue was in the scrolling mechanism, which accessed properties of a null element whenever Google Maps' UI structure changed and the expected scroll container was missing. The fix takes a more resilient approach that:
Uses multiple selector strategies to adapt to Google Maps UI variations
Provides graceful fallbacks when primary methods fail
Handles errors consistently throughout the codebase
Improves logging and diagnostics
Ensures compatibility across all components of the system
These improvements should make the scraper easier to maintain and more adaptable to future Google Maps UI updates, reducing the need for frequent fixes.
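As a rough illustration of the multi-selector strategy and the null guard described in this section, the sketch below builds the JavaScript that a Go caller could hand to its browser driver for evaluation. It is not the code from gmaps/job.go: `buildScrollScript` and the selectors in `main` are assumptions. Selectors are embedded with `encoding/json`, so quotes inside them cannot break the script, and `scrollHeight` is only read after an element has actually been found.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildScrollScript is a hypothetical helper. The returned function source
// tries each selector in order, scrolls the first matching element to its
// bottom, and returns its scrollHeight. If nothing matches, it scrolls the
// document's root scrolling element instead of dereferencing null.
func buildScrollScript(selectors []string) (string, error) {
	if selectors == nil {
		selectors = []string{} // encode as [] rather than null
	}
	encoded, err := json.Marshal(selectors)
	if err != nil {
		return "", fmt.Errorf("encode selectors: %w", err)
	}
	return fmt.Sprintf(`() => {
  const selectors = %s;
  for (const sel of selectors) {
    const el = document.querySelector(sel);
    if (el) {
      el.scrollTop = el.scrollHeight;
      return el.scrollHeight;
    }
  }
  // No container matched: fall back to the page's own scroll root.
  const root = document.scrollingElement || document.documentElement;
  root.scrollTop = root.scrollHeight;
  return root.scrollHeight;
}`, encoded), nil
}

func main() {
	// Example selectors only; the real list depends on the current Maps UI.
	script, err := buildScrollScript([]string{`div[role="feed"]`, `div[aria-label*="Results"]`})
	if err != nil {
		panic(err)
	}
	fmt.Println(script)
}
```

Keeping the selector list as data means a future UI change needs only a new entry in the slice, not an edit to the script.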