The BrightData Database System is a Python library for querying and filtering data across multiple datasets through the BrightData API. It provides type-safe queries with built-in support for several e-commerce datasets, aimed at competitive intelligence and market research.
Purpose: Main interface for interacting with the BrightData API
Key Features:
- Multi-dataset support with automatic dataset ID resolution
- Type-aware filtering with runtime validation
- Smart deduplication to prevent duplicate queries
- Local record storage for all submissions
- Real-time status monitoring
- Cost-aware download management
Key Methods:
- `search_data(filter_obj, records_limit, description, title)`: Submit filter queries
- `get_snapshot_metadata(snapshot_id)`: Retrieve snapshot information
- `download_snapshot_content(snapshot_id, format, compress)`: Download snapshot data
- `deliver_snapshot(snapshot_id, delivery_config)`: Trigger snapshot delivery
Technical Details:
- Automatic API key loading from `secrets.yaml`
- Order-independent filter comparison for deduplication
- Support for JSON and CSV download formats
- Compression support for large datasets
- Comprehensive error handling and validation
Purpose: Type-aware filtering with automatic validation
Key Classes:
- `FilterField`: Base class for all filter fields
- `NumericalFilterField`: Handles numeric operations (`>`, `<`, `>=`, `<=`, `=`, `!=`)
- `StringFilterField`: Handles string operations (`contains`, `includes`, `in`, `not_in`)
- `BooleanFilterField`: Handles boolean operations (`is_true`, `is_false`)
- `ArrayFilterField`: Handles array operations (`includes`, `not_includes`, `array_includes`)
- `DatasetFilterFields`: Dynamic field generation based on dataset schema
Technical Details:
- Runtime field validation against dataset schemas
- Operator overloading for intuitive syntax (`field >= 4.5`)
- Support for complex logical operations (`&` for AND, `|` for OR)
- Automatic type coercion and validation
- Backward compatibility with legacy field names
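The operator-overloading syntax above can be sketched in a few lines. This is a simplified, hypothetical illustration of the pattern, not the actual implementation in `filter_criteria.py`; the class names and filter-dict layout are assumptions:

```python
class Filter:
    """Wraps a filter spec dict; & and | build nested AND/OR groups."""

    def __init__(self, spec):
        self.spec = spec

    def __and__(self, other):
        return Filter({"operator": "and", "filters": [self.spec, other.spec]})

    def __or__(self, other):
        return Filter({"operator": "or", "filters": [self.spec, other.spec]})


class NumericalFilterField:
    """A field whose comparison operators produce Filter objects."""

    def __init__(self, name):
        self.name = name

    def _make(self, op, value):
        # Runtime type validation, as described above
        if not isinstance(value, (int, float)):
            raise TypeError(f"{self.name} expects a number, got {type(value).__name__}")
        return Filter({"name": self.name, "operator": op, "value": value})

    def __ge__(self, value):
        return self._make(">=", value)

    def __lt__(self, value):
        return self._make("<", value)


# Usage: (rating >= 4.5) & (price < 50) builds a nested AND filter
rating = NumericalFilterField("rating")
price = NumericalFilterField("final_price")
combined = (rating >= 4.5) & (price < 50)
```

Parentheses matter in this style: `&` binds more tightly than `>=` in Python, so each comparison must be parenthesized before combining.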
Purpose: Centralized management of dataset schemas and metadata
Key Features:
- Dataset schema definitions with field types and operators
- Automatic dataset ID resolution from names
- Field validation and operator compatibility checking
- Support for multiple datasets (Amazon, Amazon-Walmart, Shopee)
Technical Details:
- JSON-based schema definitions
- Dynamic field generation based on schemas
- Comprehensive field metadata (type, operators, description)
- Extensible architecture for adding new datasets
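A registry entry of this kind might look as follows. The schema contents, dataset ID, and helper name are illustrative assumptions, not the real definitions in `dataset_registry.py`:

```python
# Hypothetical JSON-style schema: field types plus the operators each supports.
AMAZON_SCHEMA = {
    "dataset_id": "gd_amazon_products",  # placeholder ID, not a real one
    "fields": {
        "rating":       {"type": "number",  "operators": [">", "<", ">=", "<=", "=", "!="]},
        "title":        {"type": "string",  "operators": ["contains", "in", "not_in"]},
        "is_available": {"type": "boolean", "operators": ["is_true", "is_false"]},
    },
}


def validate_filter(schema, field, operator):
    """Check that a field exists in the schema and supports the operator."""
    meta = schema["fields"].get(field)
    if meta is None:
        raise ValueError(f"Unknown field '{field}' for dataset {schema['dataset_id']}")
    if operator not in meta["operators"]:
        raise ValueError(f"Field '{field}' ({meta['type']}) does not support '{operator}'")
    return True
```

Keeping types and operators in data rather than code is what makes the registry extensible: adding a dataset is a schema definition, not new validation logic.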
Purpose: Secure configuration and secret management
Key Features:
- YAML-based configuration loading
- Secure API key management
- Environment variable support
- Validation of required secrets
Technical Details:
- Automatic `secrets.yaml` loading
- Fallback to environment variables
- Comprehensive validation and error reporting
- Support for multiple configuration sources
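The load-with-fallback behavior can be sketched like this. The YAML key (`api_key`) and the environment-variable name are assumptions for illustration, not the project's actual configuration keys:

```python
import os


def load_api_key(path="secrets.yaml", env_var="BRIGHTDATA_API_KEY"):
    """Load the API key from a YAML file, falling back to an environment
    variable. Key names here are illustrative assumptions."""
    if os.path.exists(path):
        try:
            import yaml  # PyYAML, if installed
            with open(path) as f:
                data = yaml.safe_load(f) or {}
            if data.get("api_key"):
                return data["api_key"]
        except ImportError:
            pass  # fall through to the environment variable
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found in {path} or ${env_var}")
    return key


# Demo of the environment-variable fallback path
os.environ["BRIGHTDATA_API_KEY"] = "demo-key"
api_key = load_api_key(path="missing.yaml")
```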
Key Methods:
- `fetch_platform_products(platform, category, limit, min_rating)`: Fetch products from a platform
- `analyze_competitive_landscape(walmart_products, competitor_products, competitor_platform)`: Comprehensive analysis
- `generate_market_opportunity_report(analysis_results, output_file)`: Generate reports
- `save_analysis_results(results, filename)`: Save analysis results
User Input → Filter Validation → Deduplication Check → API Submission → Local Record Storage
- Filter Validation: Runtime validation against dataset schema
- Deduplication Check: Compare with existing snapshots to prevent duplicates
- API Submission: Submit to BrightData API with proper authentication
- Local Record Storage: Save submission details to JSON file
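The order-independent comparison used for the deduplication check can be sketched as a canonicalization step: normalize both filter specs, then compare. This is a simplified illustration; the real comparison in `brightdata_filter.py` may differ:

```python
import json


def canonicalize(filter_spec):
    """Return an order-independent canonical form of a filter spec, so two
    logically identical submissions compare equal."""
    if isinstance(filter_spec, dict):
        out = {k: canonicalize(v) for k, v in sorted(filter_spec.items())}
        # Operand order inside AND/OR groups is irrelevant, so sort it away
        if out.get("operator") in ("and", "or") and isinstance(out.get("filters"), list):
            out["filters"] = sorted(out["filters"],
                                    key=lambda f: json.dumps(f, sort_keys=True))
        return out
    if isinstance(filter_spec, list):
        return [canonicalize(v) for v in filter_spec]
    return filter_spec


a = {"operator": "and", "filters": [{"name": "rating", "operator": ">=", "value": 4.5},
                                    {"name": "brand", "operator": "=", "value": "Acme"}]}
b = {"operator": "and", "filters": [{"name": "brand", "operator": "=", "value": "Acme"},
                                    {"name": "rating", "operator": ">=", "value": 4.5}]}
assert canonicalize(a) == canonicalize(b)  # same query, different ordering
```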
Local Records → API Status Check → Record Update → UI Refresh
- Local Records: Read from `snapshot_records/*.json`
- API Status Check: Query BrightData API for current status
- Record Update: Update local records with latest information
- UI Refresh: Display updated status in user interface
Snapshot Ready → Download Request → Content Retrieval → Local Storage → UI Update
- Snapshot Ready: Check if snapshot is ready for download
- Download Request: Request content from BrightData API
- Content Retrieval: Download data in specified format
- Local Storage: Save to the `data/downloads/` directory
- UI Update: Update interface with download status
```
├── util/                      # Core utility modules
│   ├── __init__.py            # Package initialization and exports
│   ├── brightdata_filter.py   # Main API interface
│   ├── filter_criteria.py     # Type-aware filtering system
│   ├── dataset_registry.py    # Dataset schema management
│   └── config.py              # Configuration management
├── docs/                      # Documentation
│   ├── architecture.mermaid   # System architecture diagram
│   └── technical.md           # This file
├── tasks/                     # Task management
│   └── tasks.md               # Current development tasks
├── datasets/                  # Dataset schemas
│   ├── amazon_products_dataset.md
│   ├── amazon_walmart_dataset.md
│   └── shopee_dataset.md
├── snapshot_records/          # Local snapshot records
│   └── *.json                 # Individual snapshot records
├── data/downloads/            # Downloaded snapshot data
│   └── *.json                 # Downloaded snapshot files
├── snapshot_viewer.py         # Streamlit UI
├── snapshot_manager.py        # CLI management tool
└── secrets.yaml               # Configuration and secrets
```
- Filter Dataset: `POST /datasets/filter`
  - Submit filter queries
  - Returns snapshot ID for tracking
- Get Snapshot Metadata: `GET /datasets/snapshots/{snapshot_id}`
  - Retrieve snapshot status and metadata
  - Used for status monitoring
- Download Snapshot Content: `GET /datasets/snapshots/{snapshot_id}/content`
  - Download snapshot data
  - Supports multiple formats (JSON, CSV)
  - Supports compression
- Deliver Snapshot: `POST /datasets/snapshots/{snapshot_id}/deliver`
  - Trigger snapshot delivery
  - Used when the download URL is not available
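Putting the endpoints together, a request might be assembled as below. Only the paths come from the list above; the base URL, the `format`/`compress` query-parameter names, and the snapshot ID `s_123` are assumptions for illustration:

```python
BASE_URL = "https://api.brightdata.com"  # assumed base URL


def build_request(api_key, method, path, params=None, body=None):
    """Assemble the pieces of an authenticated request to one of the
    endpoints above, without sending it."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return {"method": method, "url": BASE_URL + path, "headers": headers,
            "params": params or {}, "json": body}


# Example: download a snapshot as compressed CSV ("s_123" is a made-up ID)
req = build_request("KEY", "GET", "/datasets/snapshots/s_123/content",
                    params={"format": "csv", "compress": "true"})
```

The resulting dict maps directly onto the keyword arguments of an HTTP client such as `requests.request`.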
- Bearer token authentication using the API key from `secrets.yaml`
- Automatic token loading and header management
- Comprehensive error handling for authentication failures
- HTTP status code validation
- Detailed error message extraction
- Graceful fallback for network issues
- Retry logic for transient failures
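The retry behavior can be sketched as a small wrapper with exponential backoff; the attempt count, delays, and exception types here are illustrative, not the project's actual tuning:

```python
import time


def with_retries(fn, attempts=4, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient failures with exponentially growing
    delays (base_delay, 2x, 4x, ...). Re-raises after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Demo with a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
```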
- Field existence validation
- Operator compatibility checking
- Type coercion with error reporting
- Comprehensive error messages with suggestions
- Safe file operations with proper error handling
- Directory creation with error checking
- JSON serialization/deserialization error handling
- Backup and recovery mechanisms
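One common way to make the JSON record writes safe is write-then-rename, sketched below; the function name and record layout are illustrative, not taken from the codebase:

```python
import json
import os
import tempfile


def save_record_atomic(record, path):
    """Write a snapshot record as JSON via write-then-rename, so a crash
    mid-write never leaves a truncated record file behind."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Temp file must live in the same directory for the rename to be atomic
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f, indent=2)
        os.replace(tmp, path)  # atomic rename over any existing file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the partial temp file
        raise


# Demo write into a throwaway directory
record_path = os.path.join(tempfile.mkdtemp(), "snap.json")
save_record_atomic({"snapshot_id": "s1", "status": "ready"}, record_path)
```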
- Order-independent filter comparison
- Efficient JSON comparison algorithms
- Caching of comparison results
- Minimal API calls for duplicate detection
- Lazy loading of large datasets
- Streaming for large file operations
- Efficient data structures for filter operations
- Garbage collection optimization
- Connection pooling for API requests
- Request batching where possible
- Timeout management for long-running operations
- Retry logic with exponential backoff
- Secure storage in `secrets.yaml`
- Environment variable fallback
- No hardcoded credentials
- Access control and validation
- Local data encryption where applicable
- Secure file permissions
- Input validation and sanitization
- Protection against injection attacks
- Create the dataset schema in `dataset_registry.py`
- Add field definitions with types and operators
- Update dataset documentation
- Test with sample queries
- Extend the `FilterField` base class
- Implement required operators
- Add validation logic
- Update documentation and tests
- Extend Streamlit interface
- Add new functionality to `snapshot_viewer.py`
- Update UI documentation
- Test user interactions
- Individual component testing
- Mock API responses
- Edge case validation
- Error condition testing
- End-to-end workflow testing
- API integration testing
- File system operations testing
- User interface testing
- Load testing with large datasets
- Memory usage profiling
- Network performance testing
- Concurrent operation testing
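As a sketch of unit testing with mocked API responses, the client can take an injected session object and be exercised without any network access. `SnapshotClient` here is a hypothetical stand-in for the real interface in `brightdata_filter.py`:

```python
from unittest.mock import Mock


class SnapshotClient:
    """Minimal illustrative client: the HTTP session is injected so tests
    can substitute a mock for real network calls."""

    def __init__(self, session):
        self.session = session

    def get_snapshot_status(self, snapshot_id):
        resp = self.session.get(f"/datasets/snapshots/{snapshot_id}")
        resp.raise_for_status()
        return resp.json()["status"]


def test_status_ready():
    # Mock the session: .get() returns a fake response object
    session = Mock()
    session.get.return_value = Mock(json=lambda: {"status": "ready"},
                                    raise_for_status=lambda: None)
    client = SnapshotClient(session)
    assert client.get_snapshot_status("s_123") == "ready"
    session.get.assert_called_once_with("/datasets/snapshots/s_123")


test_status_ready()
```

Injecting the session (rather than creating it inside the client) is what makes the mock-based unit tests above possible without patching globals.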