# Google Search Scraper using Selenium

A simple and efficient Python script that uses Selenium WebDriver with XPath selectors to scrape Google search results.

## 🌟 Features

- ✅ **Scrapes Google search results** - Extracts the title, URL, and description of each result
- ✅ **XPath-only selectors** - Uses only XPath for element location
- ✅ **Anti-detection measures** - Bypasses bot detection with custom configurations
- ✅ **Smart result filtering** - Automatically skips empty/invalid results
- ✅ **Browser stays open** - View results in Chrome as long as you need
- ✅ **Clean terminal output** - Displays results with rank, title, and URL
- ✅ **Error handling** - Robust exception handling for stable scraping

## 📋 Prerequisites

- Python 3.7 or higher
- Google Chrome browser installed
- ChromeDriver (automatically managed by Selenium 4.6+)

## 🚀 Installation

1. **Install Selenium:**

```bash
pip install selenium
```

Or use the requirements file:

```bash
pip install -r requirements.txt
```

## 💻 Usage

### Basic Usage

Run the script:

```bash
python google_scraper.py
```

Follow the prompts:
1. **Enter search query**: Type what you want to search for
2. **Enter number of results**: Specify how many results (default: 10)
3. **View results**: Results appear both in browser and terminal
4. **Exit**: Press `Ctrl+C` to close the browser and exit

### Example

```
Search query: Python programming
Number of results (10): 10

Opening Google...
Searching...
Extracting results...
1. Welcome to Python.org
2. Python Tutorial - W3Schools
3. Learn Python - Free Interactive Python Tutorial
...
10. Python Programming Language

✓ Scraped 10 results
Browser will remain open. Press Ctrl+C in terminal to exit.

======================================================================
1. Welcome to Python.org
   https://www.python.org/

2. Python Tutorial - W3Schools
   https://www.w3schools.com/python/
...
======================================================================
```

## 🔧 How It Works

### 1. **Anti-Detection Setup**

- Custom user agent
- Disabled automation flags
- CDP command for user agent override
- Webdriver property masking

### 2. **XPath-Based Scraping**
The script uses XPath expressions to locate elements:
- Search box: `//textarea[@name='q']` or `//input[@name='q']`
- Results container: `(//div[contains(@class, 'g') and .//h3])[i]`
- Title: `//h3`
- URL: `//a[@href]`
- Description: `//div[contains(@class, 'VwiC3b')]`

### 3. **Smart Result Collection**
- Skips empty or invalid results
- Continues searching until the requested number of valid results is found
- Makes at most 20 extra attempts to avoid infinite loops
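
The collection loop above can be sketched as pure logic; `fetch` here is a hypothetical stand-in for the script's per-result extraction:

```python
def collect_valid_results(fetch, wanted, max_extra=20):
    """Collect `wanted` valid results, tolerating up to `max_extra` bad slots.

    `fetch(i)` returns a result dict for slot i, or None for an
    empty/invalid slot (ads, special blocks, missing titles).
    """
    results = []
    i = 1
    while len(results) < wanted and i <= wanted + max_extra:
        item = fetch(i)
        if item is not None:  # skip empty slots but keep counting attempts
            item["rank"] = len(results) + 1
            results.append(item)
        i += 1
    return results
```

The `wanted + max_extra` bound is what prevents an infinite loop when a page simply does not contain enough valid results.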

## 📊 Output Format

Results are displayed in terminal with:
- **Rank**: Position number (1, 2, 3, ...)
- **Title**: Page title from search result
- **URL**: Full webpage URL
- **Description**: Snippet from Google (if available)

## Configuration Options
## ⚙️ Configuration

### Headless Mode
To run without opening a visible browser:
```python
results = scrape_google(query, num_results, headless=True)
```

### Custom Number of Results
Default is 10, but you can specify any number:
```python
results = scrape_google("your query", 20)  # Get 20 results
```

## 🛠️ Code Structure

```
google_scraper.py
├── setup_driver()    # Configures Chrome with anti-detection
├── scrape_google()   # Main scraping function
└── main()            # User interface and result display
```

## ⚠️ Important Notes

### Browser Behavior
- Browser **stays open** after scraping completes
- Press `Ctrl+C` in terminal to close browser and exit
- Script keeps running in an infinite loop to maintain browser session

### Rate Limiting
- Built-in delays (3-5 seconds) between actions
- Respectful of Google's servers
- Avoid running too frequently
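
The 3-5 second delays can be implemented with a small helper like the following sketch (the script's own delay code may differ):

```python
import random
import time


def polite_pause(lo=3.0, hi=5.0):
    """Sleep for a random interval (default 3-5 s) between browser actions."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay
```

Randomizing the interval, rather than sleeping a fixed 3 seconds, makes the request pattern look less mechanical.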

### Legal & Ethical Use
- ⚠️ **Respect Google's Terms of Service**
- 🚫 Don't use for automated/commercial scraping at scale
- ✅ Use responsibly for educational/personal purposes only
- ⚠️ Consider Google's robots.txt policies

## Troubleshooting
## 🐛 Troubleshooting

### ChromeDriver Issues
If you get ChromeDriver errors:
```bash
pip install --upgrade selenium
```
Selenium 4.6+ automatically manages ChromeDriver.

### Empty Results
- Some results may be ads or special content
- Script automatically skips these and continues
- Searches up to 20 extra results to find valid ones

### Browser Closes Immediately
- Check the terminal for error messages
- The keep-alive loop should hold the browser open
- Check that Selenium is properly installed

### Import Errors
If you see "ModuleNotFoundError: No module named 'selenium'":
```bash
pip install selenium
```

### Google Blocking
If Google blocks your requests:
- Add delays between searches
- Use headless mode sparingly
- Don't scrape too frequently
- Consider using proxies for large-scale scraping


## 📝 Requirements

```
selenium>=4.6.0
```

## 🎯 Best Practices

1. **Don't scrape too frequently** - Add delays between runs
2. **Respect robots.txt** - Check Google's crawling policies
3. **Use for learning** - Great for understanding web scraping
4. **Be ethical** - Don't overwhelm servers with requests

## 📄 License

This project is for educational purposes only.

## 🤝 Contributing

Feel free to submit issues or pull requests for improvements.

---

**Note**: Web scraping should always be done responsibly and in compliance with the website's terms of service and applicable laws.