Web Owl is an intelligent multi-agent RAG (Retrieval-Augmented Generation) system that combines web crawling, knowledge graph construction, and AI-powered navigation to provide comprehensive, contextual answers with site navigation guidance.

- 🕷️ Intelligent Web Crawling - Multi-format content extraction (HTML, PDF, images, tables)
- 🧠 Knowledge Graph Database - Rich relationship modeling with Neo4j
- 🔍 Hybrid Vector Search - Semantic + graph-based retrieval
- 🤖 Multi-Agent Processing - 4 specialized AI agents for comprehensive analysis
- 🗺️ Smart Navigation - Automated path discovery and site mapping
- 📊 Interactive Visualizations - Plotly-based site structure visualization
- 🎯 Quality Assurance - Multi-layer verification and confidence scoring
```mermaid
graph TB
    A[Web Crawler] --> B[Neo4j Knowledge Graph]
    B --> C[FAISS Vector Index]
    C --> D[Multi-Agent Pipeline]
    D --> E[Information Structurer]
    D --> F[Site Mapping Agent]
    D --> G[Response Structurer]
    D --> H[Final Verifier]
    H --> I[Web Owl Response]
    J[Site Mapper] --> F
    K[Plotly Visualizer] --> I
```
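The flow above can be sketched as a simple chain: retrieve context, then pass it through each agent in order. This is an illustrative sketch only — `run_pipeline` and the field handling are hypothetical simplifications, not the orchestration in `main.py` — but the `WebOwlResponse` fields mirror what the Quick Start below reads off the response object.

```python
from dataclasses import dataclass, field

@dataclass
class WebOwlResponse:
    # Fields mirror the attributes accessed in the Quick Start examples.
    final_answer: str
    confidence_score: float
    sources_used: list = field(default_factory=list)
    navigation_path: list = field(default_factory=list)

def run_pipeline(query, retriever, agents):
    """Sketch of the agent chain: retrieve, then let each agent
    (info structurer -> site mapper -> response structurer -> verifier)
    refine a shared working context."""
    context = {"query": query, "chunks": retriever(query)}
    for agent in agents:
        context = agent(context)
    return WebOwlResponse(
        final_answer=context.get("answer", ""),
        confidence_score=context.get("confidence", 0.0),
        sources_used=context.get("sources", []),
    )
```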
- Python 3.8+
- Neo4j Database (Aura Cloud or local)
- Groq API Key
- 8GB+ RAM recommended
```bash
# Clone the repository
git clone https://github.com/your-username/webowl.git
cd webowl

# Install dependencies
pip install -r requirements.txt

# Install additional ML libraries
pip install sentence-transformers faiss-cpu torch

# Create .env file
cat > .env << EOF
NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
GROQ_API_KEY=your-groq-api-key
EOF
```

```python
from webowl import WebOwlMultiAgentRAG, KnowledgeRetriever
from neo4j import GraphDatabase

# Initialize database connection
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Create knowledge retriever
retriever = KnowledgeRetriever(driver)
retriever.build_vector_index()

# Initialize Web Owl system
web_owl = WebOwlMultiAgentRAG(retriever, GROQ_API_KEY)

# Process a query
response = web_owl.answer_query("What master programs are available?")

# Display results
print(f"🦉 Answer: {response.final_answer}")
print(f"📊 Confidence: {response.confidence_score:.2f}")
print(f"📚 Sources: {len(response.sources_used)}")
```

Create a `requirements.txt` file:
```text
# Web scraping and parsing
requests>=2.31.0
beautifulsoup4>=4.12.0
PyPDF2>=3.0.1
lxml>=4.9.3

# Database and graph processing
neo4j>=5.12.0
networkx>=3.1

# Machine learning and embeddings
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
numpy>=1.24.3
scikit-learn>=1.3.0

# LLM integration
langchain>=0.0.300
langchain-groq>=0.1.0

# Text processing
langchain-text-splitters>=0.0.1

# Data visualization
plotly>=5.15.0
matplotlib>=3.7.2

# Utilities
python-dotenv>=1.0.0
tqdm>=4.66.0
typing-extensions>=4.7.1
```
Create Neo4j Aura Instance

1. Visit https://neo4j.com/cloud/aura/
2. Create a free instance
3. Save the connection details

Configure Connection

```python
NEO4J_URI = "neo4j+s://your-id.databases.neo4j.io"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "your-generated-password"
```
- Get API key from Groq Console
- Add to environment variables
- Choose model (default: `llama3-70b-8192`)
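Since `python-dotenv` is already in `requirements.txt`, loading both the Neo4j and Groq credentials from the `.env` file created above might look like this (variable names match the `.env` keys; the fallback defaults are placeholders, not working credentials):

```python
import os
from dotenv import load_dotenv  # provided by python-dotenv

load_dotenv()  # reads the .env file in the project root

NEO4J_URI = os.getenv("NEO4J_URI", "neo4j+s://your-instance.databases.neo4j.io")
NEO4J_USER = os.getenv("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "your-password")
GROQ_API_KEY = os.getenv("GROQ_API_KEY", "your-groq-api-key")
```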
```python
# Crawl a website
url = "https://example-university.edu"
site_map = crawl_site_tree(url, max_pages=200)

# Ingest into Neo4j
ingest_site_map(site_map)
```

```python
retriever = KnowledgeRetriever(driver)
retriever.build_vector_index()  # One-time setup

# Semantic search
results = retriever.search("machine learning", SearchMode.SEMANTIC)

# Graph-based search
results = retriever.search("admissions", SearchMode.GRAPH_WALK)

# Hybrid search (recommended)
results = retriever.search("graduate programs", SearchMode.HYBRID)

# Multi-modal search
results = retriever.search("course catalog", SearchMode.MULTIMODAL)
```

```python
# Initialize Web Owl with all agents
web_owl = WebOwlMultiAgentRAG(retriever, groq_api_key)

# Process complex query
response = web_owl.answer_query(
    "How do I apply for a master's degree in computer science?"
)

# Access structured response
print("Direct Answer:", response.final_answer)
print("Navigation Path:", response.navigation_path)
print("Related Sources:", response.sources_used)
```

```python
# Visualize site structure
visualize_graph_plotly(edges_data)

# Generate site statistics
stats = web_owl.get_system_stats()
print("Site Summary:", stats['site_mapper'])
```
1. Information Structurer Agent
   - Categorizes retrieved content
   - Identifies key facts and gaps
   - Ranks source authority

2. Site Mapping Agent
   - Analyzes site topology
   - Discovers navigation paths
   - Recommends related content

3. Response Structurer Agent
   - Formats comprehensive answers
   - Creates navigation guides
   - Generates actionable steps

4. Final Verifier Agent
   - Validates factual accuracy
   - Assesses response completeness
   - Calculates confidence scores
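All four agents can share one minimal interface. The sketch below is hypothetical (the real framework lives in `agents/base_agent.py`); the toy confidence heuristic in `FinalVerifier` simply illustrates the "calculates confidence scores" responsibility, not the actual scoring logic.

```python
class BaseAgent:
    """Minimal shared interface: each agent consumes and returns a
    working-context dict, so agents can be chained in any order."""
    name = "base"

    def run(self, context: dict) -> dict:
        raise NotImplementedError

class FinalVerifier(BaseAgent):
    name = "final_verifier"

    def run(self, context: dict) -> dict:
        # Toy heuristic: more corroborating sources -> higher confidence,
        # capped at 1.0. The real verifier also checks factual accuracy.
        sources = context.get("sources", [])
        context["confidence"] = min(1.0, 0.5 + 0.1 * len(sources))
        return context
```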
- Semantic Search: Vector similarity with SentenceTransformers
- Graph Search: Relationship-aware traversal
- Hybrid Scoring: Weighted combination of methods
- Multi-modal: Cross-format content discovery
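The hybrid mode's "weighted combination" can be sketched as a convex blend of the semantic and graph scores. The weight `alpha` and the function names here are illustrative, not the values or API Web Owl ships with:

```python
def hybrid_score(semantic_sim: float, graph_score: float, alpha: float = 0.6) -> float:
    """Blend semantic similarity with graph-walk relevance.
    alpha weights the semantic side; 1 - alpha weights the graph side."""
    return alpha * semantic_sim + (1 - alpha) * graph_score

def rank_hybrid(candidates, alpha: float = 0.6):
    """candidates: list of (doc_id, semantic_sim, graph_score).
    Returns (doc_id, blended_score) pairs, best first."""
    scored = [(doc, hybrid_score(s, g, alpha)) for doc, s, g in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

A document that is only moderately similar semantically can still rank first if it is strongly connected in the graph, which is why hybrid mode is the recommended default.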
- Retrieval Accuracy: 85-92% precision on complex queries
- Response Time: 10-15 seconds average
- Scalability: Tested up to 1,000 pages
- Memory Usage: 2-4GB for typical sites
- Vector Index: Build once, query many times
- Batch Processing: Group similar queries
- Caching: Store frequent results
- Rate Limiting: Respect API limits
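The caching and rate-limiting tips can be combined in a small wrapper. This is a generic sketch (the `cached_answer` placeholder stands in for `web_owl.answer_query`; the interval is configurable, whereas the pipeline's built-in delay is 60 seconds):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(query: str) -> str:
    # Placeholder for web_owl.answer_query: caching skips the full
    # agent pipeline (and its API calls) for repeated questions.
    return f"answer:{query}"

def rate_limited(fn, min_interval: float = 1.0):
    """Wrap an API-bound callable so consecutive calls are spaced at
    least min_interval seconds apart."""
    last = [0.0]

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last[0])
        if wait > 0:
            time.sleep(wait)
        last[0] = time.monotonic()
        return fn(*args, **kwargs)

    return wrapper
```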
```text
webowl/
├── crawler/
│   ├── web_crawler.py        # Website crawling logic
│   └── content_extractor.py  # Multi-format extraction
├── knowledge/
│   ├── graph_builder.py      # Neo4j integration
│   ├── retriever.py          # Search system
│   └── embeddings.py         # Vector processing
├── agents/
│   ├── base_agent.py         # Agent framework
│   ├── info_structurer.py    # Information analysis
│   ├── site_mapper.py        # Navigation analysis
│   ├── response_builder.py   # Response formatting
│   └── verifier.py           # Quality assurance
├── visualization/
│   └── plotly_viz.py         # Interactive charts
├── utils/
│   └── helpers.py            # Utility functions
├── main.py                   # Main orchestrator
├── requirements.txt          # Dependencies
└── README.md                 # This file
```
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
```bash
# Unit tests
python -m pytest tests/

# Integration tests
python -m pytest tests/integration/

# Load testing
python tests/load_test.py
```
1. Neo4j Connection Failed
   - Solution: Check URI, credentials, and network access
   - Verify: `ping your-instance.databases.neo4j.io`

2. Vector Index Build Error
   - Solution: Ensure sufficient memory (4GB+)
   - Check: Text chunks exist in the database

3. Groq API Rate Limits
   - Solution: Implement delays between requests
   - Current: 60-second delays in the pipeline

4. Memory Issues
   - Solution: Process in smaller batches
   - Reduce: `chunk_size` and `max_pages` parameters
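"Process in smaller batches" can be as simple as slicing the page list before ingestion. A minimal sketch (the `batched` helper is illustrative, not a Web Owl API):

```python
def batched(items, batch_size: int = 50):
    """Yield fixed-size slices so a large site can be ingested without
    holding every page's content in memory at once."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# e.g. ingest 50 pages at a time instead of all max_pages at once:
# for batch in batched(pages, batch_size=50):
#     ingest_site_map(batch)
```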
```python
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Test individual components
retriever.semantic_search("test query", top_k=3)
```

- Real-time content monitoring
- Interactive web interface
- Multi-language support
- Advanced visualization dashboard
- API endpoint for external integration
- Mobile-optimized interface
- Collaborative filtering
- Advanced analytics
- Custom agent creation
- Enterprise deployment options
This project is licensed under the MIT License - see the LICENSE file for details.
- Neo4j for graph database technology
- Groq for fast LLM inference
- Sentence Transformers for embedding models
- LangChain for LLM integration
- Plotly for interactive visualizations
- 📧 Email: [email protected]
- 💬 Discord: WebOwl Community
- 📖 Documentation: docs.webowl.ai
- 🐛 Issues: GitHub Issues

Made with 🦉 by the Web Owl Team

Hoot hoot! Happy navigating! 🦉