Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 86 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,16 @@
# Web Crawler

A minimal, real-time web search CLI that searches the internet for you. Enter a query and get search results as JSON (title, url, published_date), sorted by recency.
A minimal, real-time web search CLI that searches the internet for you. Enter a query and get search results as JSON (title, url, published_date), sorted by recency. Now with **LLM-powered content summarization** using GPT-4o-mini!

<img width="1162" height="628" alt="Screenshot 2025-08-25 at 12 31 22 PM" src="https://github.com/user-attachments/assets/12e05c97-4e46-4fd3-a467-3276f290b63d" />
<img width="1162" height="628" alt="Screenshot 2025-08-25 at 12 31 22 PM" src="https://github.com/user-attachments/assets/12e05c97-4e46-4fd3-a467-3f290b63d" />

## Features

- 🔍 **Real-time web search** across multiple sources
- 📰 **Content extraction** from JavaScript-heavy pages using Playwright
- 🤖 **High-quality LLM summarization** using GPT-4o-mini (one-shot processing)
- ⚡ **Fast and efficient** with caching and rate limiting
- 🎯 **Production-ready** with robust error handling

## Setup
**Prerequisites**: Python 3.12+ and [uv](https://docs.astral.sh/uv/)
Expand All @@ -16,19 +23,94 @@ git clone https://github.com/financial-datasets/web-crawler.git
cd web-crawler
```

### LLM Configuration (Optional)

To enable content summarization, you'll need an LLM API key:

1. **Copy the example config:**
```bash
cp config.env.example .env
```

2. **Edit `.env` and add your API key:**
```bash
# For OpenAI (default)
OPENAI_API_KEY=your_actual_api_key_here

# Or for Anthropic
# ANTHROPIC_API_KEY=your_actual_api_key_here
# LLM_PROVIDER=anthropic
```

3. **Load the environment variables:**
```bash
source .env
```

## How to Run
```bash
# From the repo root, run:
uv run web-crawler
```

- When prompted, enter your search (e.g., "AAPL latest earnings transcript").
- Results print as JSON. Enter another query to continue.
- If you have an LLM API key configured, you'll be asked if you want summaries.
- Results print as JSON with optional summaries. Enter another query to continue.
- Quit with `q`, `quit`, `exit`, or press Ctrl+C.

### Example Output with Summaries

```json
{
"query": "AAPL latest earnings transcript",
"summaries_generated": true,
"results": [
{
"title": "Apple Q4 2024 Earnings Call Transcript",
"url": "https://example.com/apple-earnings",
"published_date": "2024-10-28T20:00:00",
"summary": "Apple reported strong Q4 2024 results with revenue of $89.5 billion, up 8% year-over-year. iPhone sales grew 6% to $43.8 billion, while services revenue increased 16% to $22.3 billion. The company highlighted strong performance in emerging markets and continued growth in its services ecosystem.",
"content_length": 15420
}
]
}
```

## Roadmap
We'd love to get help on:
- [x] **Summarizing parsed content with LLMs** ✅
- [ ] Parsing content from JavaScript-heavy pages (e.g. MSN, Bloomberg, etc.)
- [ ] Summarizing parsed content with LLMs
- [ ] Adding more sources (Bing, Reddit, etc.)
- [ ] Parallelization for faster queries

## Architecture

The summarization feature works as follows:

1. **Search**: Query multiple web sources simultaneously
2. **Extract**: Use Playwright to extract readable content from URLs
3. **Summarize**: Send entire content to GPT-4o-mini via litellm for high-quality summaries
4. **Return**: Enhanced search results with AI-generated summaries

### Key Components

- **`ContentSummarizer`**: Core LLM integration using litellm with GPT-4o-mini
- **`SummarizationService`**: Orchestrates content extraction and summarization
- **`PageParser`**: Extracts readable content from web pages
- **`SearchEngine`**: Integrates search and summarization workflows

### Supported LLM Providers

- **OpenAI**: GPT-4o-mini (default), GPT-4, GPT-3.5-turbo
- **Anthropic**: Claude-3-Haiku, Claude-3-Sonnet
- **Azure OpenAI**: GPT-4o-mini, GPT-4, GPT-3.5-turbo
- **Local Models**: Via litellm's local model support

## Production Considerations

- **Rate Limiting**: Built-in semaphore-based concurrency control
- **Error Handling**: Graceful degradation when summarization fails
- **Caching**: 15-minute cache for search results
- **Timeout Management**: 60-second timeout for LLM calls
- **Content Processing**: One-shot summarization with smart content truncation (50K char limit)
- **Model Selection**: GPT-4o-mini for optimal quality/speed balance
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ dependencies = [
"googlenewsdecoder>=0.1.7",
"playwright>=1.54.0",
"requests>=2.32.4",
"litellm>=1.0.0",
]

[project.scripts]
Expand Down
48 changes: 46 additions & 2 deletions src/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,12 @@
import contextlib

from .search.engine import SearchEngine
from .summarizer.service import SummarizationService
from .utils import spinner


# Example usage
async def search(query: str):
async def search(query: str, include_summaries: bool = False, max_summaries: int = 5):
async with SearchEngine() as search_engine:
spinner_task = asyncio.create_task(spinner("Searching the web..."))
try:
Expand All @@ -17,6 +18,38 @@ async def search(query: str):
with contextlib.suppress(asyncio.CancelledError):
await spinner_task
print()

# If user wants summaries, process them
if include_summaries and results.get("results"):
print("🤖 Generating AI summaries...")
spinner_task = asyncio.create_task(spinner("Summarizing content..."))

try:
# Convert search results to format expected by SummarizationService
search_results = []
for result in results["results"]:
search_results.append({
"title": result["title"],
"url": result["url"],
"published_date": result["published_date"]
})

# Create summarization service and process results
summarization_service = SummarizationService()
summarized_results = await summarization_service.summarize_search_results(
search_results,
max_summaries=max_summaries
)

# Update results with summaries
results["results"] = summarized_results

finally:
spinner_task.cancel()
with contextlib.suppress(asyncio.CancelledError):
await spinner_task
print()

print("Search Results:")
print(json.dumps(results, indent=2, default=str))

Expand All @@ -32,7 +65,18 @@ def main():
if query.lower() in {"q", "quit", "exit"}:
print("Goodbye.")
return
asyncio.run(search(query))

# Ask user if they want summaries
want_summaries = input("Include AI summaries? (y/n): ").strip().lower()
include_summaries = want_summaries in {"y", "yes"}
if include_summaries:
max_summaries = input("Enter maximum number of summaries: ").strip()
if max_summaries:
max_summaries = int(max_summaries)
else:
max_summaries = 5

asyncio.run(search(query, include_summaries, max_summaries))
except KeyboardInterrupt:
print() # graceful newline on Ctrl+C

Expand Down
Loading