Advanced semantic analysis pipeline for reconciling ADGM regulatory documents with internal policies using natural language processing.
This system automatically analyzes regulatory documents against internal legal policies to identify matches, gaps, and control requirements. Built with modern NLP techniques, it provides comprehensive reconciliation analysis with interactive reporting and visualization.
- Semantic Document Analysis using sentence-transformers
- Intelligent Rule Parsing with hierarchical structure detection
- Similarity Matching with configurable thresholds
- Control Requirement Detection using heuristic classification
- Comprehensive Reporting in multiple formats (CSV, JSON, Excel)
- Interactive Dashboard with real-time visualizations
- Testing Framework with 70%+ coverage
- Complete Documentation and API reference
```bash
# Install dependencies
uv install

# Verify installation
uv run python test_framework_demo.py

# Process PDFs and generate embeddings
uv run python analyze_pdfs.py

# Parse rules and create unified format
uv run python detailed_analysis.py

# Run similarity analysis
uv run python run_classification_analysis.py

# Generate reports
uv run python run_report_generation.py

# Launch React dashboard
./start-full-stack.sh
```

- Pipeline Documentation: Complete usage guide and configuration
- API Reference: Detailed API documentation for all classes and methods
- Configuration Guide: Comprehensive configuration options and tuning
- Testing Summary: Testing framework documentation and coverage
```
PDF Documents → Text Extraction → Rule Parsing →
Embeddings → Similarity Analysis → Classification →
Control Detection → Reports → Dashboard
```
- PDF Processor: Extracts and cleans text from regulatory PDFs
- Rule Parser: Identifies individual rules with hierarchical structure
- Embedding Engine: Generates semantic vectors using all-MiniLM-L6-v2
- Similarity Engine: Calculates cosine similarity between rules
- Classification Engine: Categorizes matches based on thresholds
- Control Detector: Identifies operational control requirements
- Report Generator: Creates comprehensive analysis reports
- Dashboard: Interactive React interface for visualization
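The flow through these components can be sketched in a few lines. This is a minimal, self-contained illustration of the embed-score-match loop; the function names and result shape are stand-ins, not the actual `scraper_v2` API:

```python
def run_pipeline(regulatory_texts, policy_texts, embed, similarity):
    """Illustrative end-to-end flow: embed both corpora, score every
    regulatory rule against every internal policy, keep the best match.

    `embed` and `similarity` are injected so any embedding model and
    similarity metric can be plugged in.
    """
    reg_vecs = [embed(t) for t in regulatory_texts]
    pol_vecs = [embed(t) for t in policy_texts]

    results = []
    for rule, rule_vec in zip(regulatory_texts, reg_vecs):
        # Score this rule against every policy and keep the best one
        scores = [similarity(rule_vec, pv) for pv in pol_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append({
            "rule": rule,
            "best_policy": policy_texts[best],
            "score": scores[best],
        })
    return results
```

In the real pipeline `embed` would be the Embedding Engine and `similarity` the cosine calculation from the Similarity Engine.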
Based on processed ADGM regulatory documents:
| Metric | Count | Percentage |
|---|---|---|
| Total Rules Processed | 1,247 | 100% |
| Strong Matches (>0.8) | 55 | 4.4% |
| Partial Matches (0.5-0.8) | 0 | 0% |
| Compliance Gaps (<0.5) | 1,192 | 95.6% |
| Control Requirements | 38 | 3.0% |
- 95.6% compliance gaps indicate significant reconciliation work needed
- 38 urgent control implementations identified across regulatory areas
- High-priority actions concentrated in AML and banking regulation sections
- Python 3.11 with UV package manager
- PyMuPDF for PDF text extraction
- sentence-transformers for semantic embeddings
- ChromaDB for vector storage and similarity search
- pandas for data manipulation
- scikit-learn for similarity calculations
- React/Next.js for interactive dashboard
- Plotly for advanced visualizations
- openpyxl for Excel report generation
- pytest with comprehensive test coverage
- Mock-based testing for external dependencies
- Coverage reporting with HTML output
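The core similarity calculation in this stack is plain cosine similarity over embedding vectors. A minimal NumPy sketch (in the real pipeline the vectors come from `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` and are 384-dimensional; the toy 3-dimensional vectors below are just for illustration):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for rule and policy embeddings
rule_vec = np.array([0.2, 0.7, 0.1])
policy_vec = np.array([0.25, 0.65, 0.05])
print(round(cosine_sim(rule_vec, policy_vec), 3))
```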
```
scraper-v2/
├── src/scraper_v2/               # Core application modules
│   ├── pdf_processor.py          # PDF text extraction
│   ├── rule_parser.py            # Rule identification & parsing
│   ├── embedding_engine.py       # Sentence embeddings
│   ├── similarity_engine.py      # Similarity calculation
│   ├── classification_engine.py  # Match classification
│   ├── control_detector.py       # Control requirement detection
│   ├── report_generator.py       # Report generation
│   └── (dashboard now in React frontend/)
├── tests/                        # Comprehensive test suite
├── raw_inputs/                   # Source documents (PDFs, markdown)
├── reports/                      # Generated analysis reports
├── docs/                         # Complete documentation
└── run_*.py                      # Execution scripts
```
- Strong Match: >0.8 (High confidence matches)
- Partial Match: 0.5-0.8 (Medium confidence matches)
- No Match/Gap: <0.5 (Low confidence, requires attention)
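Applying these thresholds is a simple mapping from score to category. A sketch (the handling of the exact 0.8 boundary is an assumption, since the ranges above leave it ambiguous):

```python
def classify_match(score: float) -> str:
    """Map a cosine-similarity score to a match category.

    >0.8       -> strong match
    0.5 - 0.8  -> partial match
    <0.5       -> gap (requires attention)
    """
    if score > 0.8:
        return "strong_match"
    if score >= 0.5:
        return "partial_match"
    return "gap"
```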
- Embedding Model: all-MiniLM-L6-v2 (80MB, Apple Silicon optimized)
- Batch Processing: 32 items per batch
- Vector Database: ChromaDB with cosine similarity
- Memory Usage: Optimized for local execution
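Batch processing at 32 items per batch amounts to slicing the input into fixed-size chunks before handing each chunk to the embedding model. A minimal batching helper:

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size batches (the pipeline uses 32)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

For example, 70 rules would be embedded in batches of 32, 32, and 6.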
```bash
# Run comprehensive test suite
uv run python run_tests.py

# Run with coverage analysis
uv run python -m pytest tests/ --cov=src/scraper_v2 --cov-report=html

# Quick framework verification
uv run python test_framework_demo.py
```

- PDF Processing: Text extraction and metadata parsing
- Embedding Generation: Model loading and vector creation
- Similarity Analysis: Cosine similarity calculation and matching
- Classification Logic: Threshold-based match categorization
- Integration Tests: End-to-end workflow validation
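The mock-based approach means tests never load the real embedding model. A hypothetical sketch of the pattern (the `encode` call mirrors the sentence-transformers interface, but the test name and setup are illustrative, not taken from the actual suite):

```python
from unittest.mock import MagicMock

def test_similarity_uses_mocked_embeddings():
    # Stand in for the embedding model so the test runs without
    # downloading or loading all-MiniLM-L6-v2
    model = MagicMock()
    model.encode.return_value = [[1.0, 0.0], [1.0, 0.0]]

    vecs = model.encode(["rule text", "policy text"])

    # Identical mocked vectors should yield a perfect match downstream
    assert vecs[0] == vecs[1]
    model.encode.assert_called_once()
```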
Launch the interactive dashboard with `./start-full-stack.sh`:
- Overview: Executive summary and key metrics
- Similarity Analysis: Interactive similarity matrices and filtering
- Compliance Gaps: Priority gap identification with severity levels
- Control Requirements: Control mapping and implementation status
- Document Explorer: Detailed rule browser with search
- Export Tools: Report generation and download interface
- Real-time Filtering: By document type, similarity score, priority
- Advanced Search: Full-text search across rules and matches
- Dynamic Charts: Plotly visualizations with drill-down capability
- Export Options: CSV, JSON, Excel formats with customization
- Local Processing: All analysis occurs locally, no external API calls
- Data Privacy: Documents never leave the local environment
- Access Control: File system permissions and restricted database access
- Secure Storage: Encrypted embeddings and temporary file cleanup
To add support for a new document type:

1. Update the `DocumentType` enum in `data_schemas.py`
2. Add processing logic in `pdf_processor.py`
3. Update rule parsing patterns in `rule_parser.py`
4. Test with new document samples
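Step 1 amounts to adding a new enum member. A sketch (the existing member names below are assumptions; the real `DocumentType` enum lives in `data_schemas.py`):

```python
from enum import Enum

class DocumentType(Enum):
    # Existing members (illustrative names)
    ADGM_REGULATION = "adgm_regulation"
    INTERNAL_POLICY = "internal_policy"
    # Newly added document type
    GUIDANCE_NOTE = "guidance_note"
```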
Extend `SimilarityEngine` with custom similarity calculations:

```python
import numpy as np

class CustomSimilarityEngine(SimilarityEngine):
    def calculate_custom_similarity(self, embedding1, embedding2):
        # Custom similarity implementation; plain cosine shown as an example
        a, b = np.asarray(embedding1), np.asarray(embedding2)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

- Adjust batch sizes based on available memory
- Enable caching for repeated operations
- Use parallel processing for large document sets
- Monitor system resources during execution
- Update embedding models quarterly
- Validate accuracy with new regulatory releases
- Review and adjust similarity thresholds
- Clean up old report files
- Monitor system performance metrics
```bash
# Backup processed data
tar -czf backup_$(date +%Y%m%d).tar.gz src/data/

# Backup configuration
cp *.md *.py config_backup/
```

- Environment Setup & Document Collection
- PDF Text Extraction & Rule Parsing
- Data Structure Design & Processing
- Embedding Generation & Vector Database
- Similarity Analysis & Classification Logic
- Control Detection & Report Generation
- Dashboard Creation & Testing Framework
- Documentation (Task 15 - COMPLETED)
- Continuous integration setup
- Performance optimization for larger document sets
- Enhanced control detection with ML models
- Multi-language support expansion
- Setup Issues: See SETUP.md
- API Questions: See API_REFERENCE.md
- Configuration: See CONFIGURATION_GUIDE.md
- Testing: See TESTING_SUMMARY.md
```bash
# Verify installation
uv run python test_framework_demo.py

# Complete pipeline execution
uv run python analyze_pdfs.py && uv run python detailed_analysis.py

# Generate reports
uv run python run_report_generation.py

# Launch React dashboard
./start-full-stack.sh
```

NLP-Powered Regulatory Document Reconciliation System - Built for comprehensive compliance analysis with modern NLP technology.