Comprehensive Google Maps Scraper Improvements #148

Open: wants to merge 3 commits into base: main

zhaoyangpp

Summary
This PR makes the Google Maps scraper more resilient to UI changes and improves error handling throughout the codebase. It resolves the "Cannot read properties of null (reading 'scrollHeight')" error that was causing scraper failures.
Primary Changes

  1. Fixed URL Handling
    Enhanced URL parsing and sanitization in searchjob.go
    Added special handling for template URLs (e.g., https://www.google.com/maps/place/Your+Business/@xx.xxxx,yy.yyyy,17z)
    Improved query parameter extraction and cleanup in runner/jobs.go
    Added safeguards against malformed URLs in multiple components
  2. Improved Scroll Functionality
    Updated the scroll function in gmaps/job.go to try multiple selectors when the primary selector fails
    Added fallback mechanisms for scrolling when container elements aren't found
    Implemented better scroll state detection to avoid infinite loops
    Enhanced error handling for scrolling operations
  3. Enhanced Review Extraction
    Improved iframe detection in gmaps/reviews.go with multiple selector attempts
    Added direct page review extraction when iframe approach fails
    Fixed string quoting in JavaScript evaluation code
    Improved handling of review pagination and "More" buttons
  4. Better Error Handling and Resilience
    Added proper type checking in gmaps/place.go for Playwright page conversion
    Improved error logging with more descriptive messages
    Implemented fallback strategies when primary extraction methods fail
    Enhanced timeout and retry mechanisms
  5. Test Cases
    Added test files to verify URL cleaning logic
    Created a simplified test harness in testcase/test.go
    Included sample clean URLs for testing
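
To illustrate the URL handling described in item 1, here is a minimal sketch of query extraction with guards against malformed input. The helper name `extractQuery` and the rule of preferring a `q` parameter over the path segment are assumptions made for illustration; they are not taken from searchjob.go or runner/jobs.go.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// extractQuery is a hypothetical helper (not the PR's actual code).
// It returns the search text carried by a Google Maps URL, or an error
// when the URL is malformed or carries no recognizable query.
func extractQuery(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("unsupported or missing scheme")
	}
	if u.Host == "" {
		return "", errors.New("missing host")
	}
	// Assumption: an explicit q= parameter wins over the path.
	if q := strings.TrimSpace(u.Query().Get("q")); q != "" {
		return q, nil
	}
	// Otherwise take the segment after /search/ or /place/.
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	for i, p := range parts {
		if (p == "search" || p == "place") && i+1 < len(parts) {
			// "+" is not decoded in URL paths, so convert it by hand.
			return strings.ReplaceAll(parts[i+1], "+", " "), nil
		}
	}
	return "", errors.New("no query found in URL")
}

func main() {
	for _, raw := range []string{
		"https://www.google.com/maps/search/coffee+shops+berlin/@52.5,13.4,12z",
		"https://www.google.com/maps?q=pizza+near+me",
		"not a url",
	} {
		q, err := extractQuery(raw)
		fmt.Printf("%q -> %q, err=%v\n", raw, q, err)
	}
}
```

Returning an error instead of an empty string lets each caller decide whether to skip the job or abort the run.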

Supporting Changes

  6. System-wide Integration
    Updated gmaps/entry.go to ensure compatibility with the new error handling
    Synchronized changes across all runner components:
      runner/databaserunner/databaserunner.go
      runner/filerunner/filerunner.go
      runner/lambdaaws/io.go
      runner/lambdaaws/lambdaaws.go
      runner/runner.go
      runner/webrunner/webrunner.go
  7. Performance Improvements
    Optimized scroll intervals and wait times
    Improved detection of when scrolling has reached the end
    Enhanced review extraction logic to capture more reviews reliably
  8. Usability Enhancements
    Better logging to help diagnose issues
    More graceful handling of edge cases
    Improved error messages for troubleshooting
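
The end-of-scroll detection mentioned under item 7 can be sketched as a bounded loop that stops once the reported height stops growing. The `scrollStep` type and the `maxSteps` and `stableLimit` parameters are hypothetical names for this illustration; the actual logic in gmaps/job.go runs in the browser and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// scrollStep is a hypothetical stand-in for one scroll action in the
// browser. It returns the container's scroll height after the step.
// A real step would also wait for new results to load.
type scrollStep func() (height int, err error)

// scrollUntilStable calls step until the height has not grown for
// stableLimit consecutive steps, or until maxSteps is reached.
// Both bounds guard against looping forever.
func scrollUntilStable(step scrollStep, maxSteps, stableLimit int) (int, error) {
	if step == nil {
		return 0, errors.New("nil scroll step")
	}
	last, stable := 0, 0
	for i := 0; i < maxSteps; i++ {
		h, err := step()
		if err != nil {
			return last, fmt.Errorf("scroll step %d: %w", i, err)
		}
		if h <= last {
			stable++
			if stable >= stableLimit {
				return last, nil // no growth: assume end of list
			}
			continue
		}
		last, stable = h, 0
	}
	return last, nil // hit the hard cap
}

func main() {
	// Simulated heights: the list grows three times, then stops.
	heights := []int{400, 900, 1300, 1300, 1300, 1300}
	i := 0
	step := func() (int, error) {
		h := heights[i]
		if i < len(heights)-1 {
			i++
		}
		return h, nil
	}
	final, err := scrollUntilStable(step, 50, 2)
	fmt.Println(final, err) // prints "1300 <nil>"
}
```

Requiring several unchanged readings rather than one tolerates a slow load that briefly leaves the height unchanged.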

Testing
All changes were tested against a variety of URL patterns, including:
    Standard Google Maps search URLs
    Template URLs with placeholder coordinates
    Business profile URLs
    Malformed URLs
The scraper now successfully:
    Navigates to Google Maps search pages
    Handles all of the URL types listed above
    Scrolls through search results reliably
    Extracts business information accurately
    Collects reviews when using the -extra-reviews flag
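
The URL patterns listed above lend themselves to a table-driven check. The sketch below classifies a URL into one of those categories and runs a small table through it. The `classify` function, the `urlKind` names, and the placeholder regular expression are assumptions for illustration and are not the contents of testcase/test.go.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

type urlKind string

const (
	kindSearch    urlKind = "search"
	kindPlace     urlKind = "place"
	kindTemplate  urlKind = "template"
	kindMalformed urlKind = "malformed"
	kindUnknown   urlKind = "unknown"
)

// placeholderCoords matches coordinates left as letters,
// e.g. "@xx.xxxx,yy.yyyy" in a template URL.
var placeholderCoords = regexp.MustCompile(`@[a-zA-Z]+\.[a-zA-Z]+,[a-zA-Z]+\.[a-zA-Z]+`)

// classify is a hypothetical helper that sorts a raw URL into one of
// the categories exercised during testing.
func classify(raw string) urlKind {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil || u.Host == "" || (u.Scheme != "http" && u.Scheme != "https") {
		return kindMalformed
	}
	switch {
	case placeholderCoords.MatchString(u.Path):
		return kindTemplate
	case strings.Contains(u.Path, "/maps/place/"):
		return kindPlace
	case strings.Contains(u.Path, "/maps/search/"):
		return kindSearch
	default:
		return kindUnknown
	}
}

func main() {
	cases := []struct {
		raw  string
		want urlKind
	}{
		{"https://www.google.com/maps/search/dentist+in+paris", kindSearch},
		{"https://www.google.com/maps/place/Your+Business/@xx.xxxx,yy.yyyy,17z", kindTemplate},
		{"https://www.google.com/maps/place/Some+Cafe/@48.85,2.35,17z", kindPlace},
		{"htp:/google", kindMalformed},
	}
	for _, c := range cases {
		got := classify(c.raw)
		fmt.Printf("%-9s ok=%v  %s\n", got, got == c.want, c.raw)
	}
}
```

Checking for placeholder coordinates before the `/maps/place/` test matters, because a template URL is otherwise indistinguishable from a business profile URL.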

Technical Details
The primary issue was in the scrolling mechanism, which attempted to access properties of null elements when Google Maps' UI structure changed. The fix takes a more resilient approach that:
    Uses multiple selector strategies to adapt to Google Maps UI variations
    Provides graceful fallbacks when primary methods fail
    Handles errors consistently throughout the codebase
    Improves logging and diagnostics
    Ensures compatibility across all components of the system
These improvements should make the scraper more maintainable and adaptable to future Google Maps UI updates, reducing the need for frequent fixes.
