xcarbo/scraper-v2

🔍 NLP-Powered Regulatory Document Reconciliation System

Advanced semantic analysis pipeline for reconciling ADGM regulatory documents with internal policies using natural language processing.

🎯 Overview

This system automatically analyzes regulatory documents against internal legal policies to identify matches, gaps, and control requirements. Built with modern NLP techniques, it provides comprehensive reconciliation analysis with interactive reporting and visualization.

Key Features

  • ✅ Semantic Document Analysis using sentence-transformers
  • ✅ Intelligent Rule Parsing with hierarchical structure detection
  • ✅ Similarity Matching with configurable thresholds
  • ✅ Control Requirement Detection using heuristic classification
  • ✅ Comprehensive Reporting in multiple formats (CSV, JSON, Excel)
  • ✅ Interactive Dashboard with real-time visualizations
  • ✅ Testing Framework with 70%+ coverage
  • ✅ Complete Documentation and API reference
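
To make "heuristic classification" for control requirements concrete, here is a minimal sketch of a keyword-based detector. The cue phrases and the `needs_control` function are hypothetical illustrations; the heuristics actually used in `control_detector.py` are not shown in this README.

```python
import re

# Hypothetical obligation cues; the project's real heuristics may differ.
CONTROL_CUES = (
    r"\bmust\b",
    r"\bshall\b",
    r"\bis required to\b",
    r"\bensure\b",
    r"\bmaintain\b",
)
CONTROL_PATTERN = re.compile("|".join(CONTROL_CUES), re.IGNORECASE)


def needs_control(rule_text: str) -> bool:
    """Return True if the rule text contains any obligation cue phrase."""
    return CONTROL_PATTERN.search(rule_text) is not None


print(needs_control("A firm must maintain records for six years."))  # True
print(needs_control("This chapter applies to authorised persons."))  # False
```

A cue list like this is easy to audit, which matters for compliance work, but it will miss obligations phrased without any of the listed words.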

🚀 Quick Start

1. Environment Setup

# Install dependencies (uv has no "install" subcommand; "uv sync" installs from pyproject.toml / uv.lock)
uv sync

# Verify installation
uv run python test_framework_demo.py

2. Document Processing

# Process PDFs and generate embeddings
uv run python analyze_pdfs.py

# Parse rules and create unified format
uv run python detailed_analysis.py

3. Analysis Pipeline

# Run similarity analysis
uv run python run_classification_analysis.py

# Generate reports
uv run python run_report_generation.py

# Launch React dashboard
./start-full-stack.sh

📚 Documentation

🏗️ Architecture

Pipeline Flow

📄 PDF Documents → 🔍 Text Extraction → 📝 Rule Parsing →
🧮 Embeddings → 🎯 Similarity Analysis → 📊 Classification →
🛡️ Control Detection → 📋 Reports → 📈 Dashboard

Core Components

  • PDF Processor: Extracts and cleans text from regulatory PDFs
  • Rule Parser: Identifies individual rules with hierarchical structure
  • Embedding Engine: Generates semantic vectors using all-MiniLM-L6-v2
  • Similarity Engine: Calculates cosine similarity between rules
  • Classification Engine: Categorizes matches based on thresholds
  • Control Detector: Identifies operational control requirements
  • Report Generator: Creates comprehensive analysis reports
  • Dashboard: Interactive React interface for visualization
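
The Similarity Engine's core operation, cosine similarity between two rule embeddings, can be sketched with the standard library alone. This is an illustration of the metric, not the project's implementation (the Tech Stack lists scikit-learn for this step); the zero-vector convention is an assumption made here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Parallel vectors score close to 1; orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, not magnitude, two rules of very different length can still match strongly if their embeddings point the same way.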

📊 Current Analysis Results

Based on processed ADGM regulatory documents:

| Metric | Count | Percentage |
|---|---|---|
| Total Rules Processed | 1,247 | 100% |
| Strong Matches (>0.8) | 55 | 4.4% |
| Partial Matches (0.5-0.8) | 0 | 0% |
| Compliance Gaps (<0.5) | 1,192 | 95.6% |
| Control Requirements | 38 | 3.0% |
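
The percentages follow directly from the counts. The short check below recomputes them from the figures reported above, and confirms that the three match categories add up to the total number of rules. Control requirements are counted separately from the match categories.

```python
TOTAL_RULES = 1247

# Counts as reported in the results above.
match_counts = {
    "Strong Matches": 55,
    "Partial Matches": 0,
    "Compliance Gaps": 1192,
}
control_requirements = 38

# The three match categories partition the full rule set.
match_total = sum(match_counts.values())

percentages = {
    name: round(100 * count / TOTAL_RULES, 1)
    for name, count in match_counts.items()
}
percentages["Control Requirements"] = round(100 * control_requirements / TOTAL_RULES, 1)

print(match_total)  # 1247
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```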

Key Findings

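  • 95.6% of rules (1,192 of 1,247) score below the 0.5 similarity threshold and count as compliance gaps, so significant reconciliation work remains
  • 38 control requirements identified across regulatory areas, flagged for urgent implementation
  • High-priority actions are concentrated in the AML and banking regulation sections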

🔧 Tech Stack

Core Technologies

  • Python 3.11 with the uv package manager
  • PyMuPDF for PDF text extraction
  • sentence-transformers for semantic embeddings
  • ChromaDB for vector storage and similarity search
  • pandas for data manipulation
  • scikit-learn for similarity calculations

Visualization & Interface

  • React/Next.js for interactive dashboard
  • Plotly for advanced visualizations
  • openpyxl for Excel report generation

Testing & Quality

  • pytest with comprehensive test coverage
  • Mock-based testing for external dependencies
  • Coverage reporting with HTML output
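
As an illustration of mock-based testing, the sketch below replaces the embedding model with a `MagicMock`, so the test runs without downloading model weights. `embed_rules` is a hypothetical helper written for this example; only the `encode()` method name mirrors the sentence-transformers API.

```python
from unittest.mock import MagicMock


def embed_rules(model, texts):
    """Hypothetical helper: delegate to model.encode() and return plain lists."""
    return [list(vec) for vec in model.encode(texts)]


def test_embed_rules_uses_model_once():
    # Stand-in for the real embedding model.
    fake_model = MagicMock()
    fake_model.encode.return_value = [[0.1, 0.2], [0.3, 0.4]]

    vectors = embed_rules(fake_model, ["rule A", "rule B"])

    # The helper should call encode() exactly once with the full batch.
    fake_model.encode.assert_called_once_with(["rule A", "rule B"])
    assert vectors == [[0.1, 0.2], [0.3, 0.4]]


test_embed_rules_uses_model_once()
```

Under pytest the final call is unnecessary, since functions named `test_*` are collected automatically.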

📁 Project Structure

scraper-v2/
├── src/scraper_v2/           # Core application modules
│   ├── pdf_processor.py     # PDF text extraction
│   ├── rule_parser.py       # Rule identification & parsing
│   ├── embedding_engine.py  # Sentence embeddings
│   ├── similarity_engine.py # Similarity calculation
│   ├── classification_engine.py # Match classification
│   ├── control_detector.py  # Control requirement detection
│   ├── report_generator.py  # Report generation
│   └── (dashboard now in React frontend/)
├── tests/                   # Comprehensive test suite
├── raw_inputs/             # Source documents (PDFs, markdown)
├── reports/                # Generated analysis reports
├── docs/                   # Complete documentation
└── run_*.py               # Execution scripts

⚙️ Configuration

Similarity Thresholds

  • Strong Match: >0.8 (High confidence matches)
  • Partial Match: 0.5-0.8 (Medium confidence matches)
  • No Match/Gap: <0.5 (Low confidence, requires attention)
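
The thresholds above translate into a small classification function. This is a sketch, not the code in `classification_engine.py`. The README does not say which category a score of exactly 0.8 or 0.5 belongs to, so the boundary handling here (both treated as partial) is an assumption, and the label strings are hypothetical.

```python
STRONG_THRESHOLD = 0.8
PARTIAL_THRESHOLD = 0.5


def classify_match(score: float) -> str:
    """Map a cosine similarity score to a match category."""
    if score > STRONG_THRESHOLD:
        return "strong"
    # Assumption: scores of exactly 0.8 or 0.5 fall in the partial band.
    if score >= PARTIAL_THRESHOLD:
        return "partial"
    return "gap"


for score in (0.91, 0.8, 0.63, 0.49):
    print(score, classify_match(score))
```

Keeping the thresholds as named constants makes the periodic threshold review a one-line change.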

Performance Settings

  • Embedding Model: all-MiniLM-L6-v2 (80MB, Apple Silicon optimized)
  • Batch Processing: 32 items per batch
  • Vector Database: ChromaDB with cosine similarity
  • Memory Usage: Optimized for local execution
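
Processing 32 items per batch means slicing the rule list into fixed-size chunks before embedding. A minimal sketch of that chunking follows; the `batched` helper is illustrative, and in practice sentence-transformers accepts a `batch_size` argument on `encode()` that does this internally.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 32


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


rules = [f"rule {i}" for i in range(70)]
print([len(batch) for batch in batched(rules)])  # [32, 32, 6]
```

Smaller batches lower peak memory at the cost of more model calls, which is the trade-off to tune on memory-constrained machines.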

🧪 Testing

Test Coverage

# Run comprehensive test suite
uv run python run_tests.py

# Run with coverage analysis
uv run python -m pytest tests/ --cov=src/scraper_v2 --cov-report=html

# Quick framework verification
uv run python test_framework_demo.py

Test Components

  • PDF Processing: Text extraction and metadata parsing
  • Embedding Generation: Model loading and vector creation
  • Similarity Analysis: Cosine similarity calculation and matching
  • Classification Logic: Threshold-based match categorization
  • Integration Tests: End-to-end workflow validation

📈 Dashboard Features

Launch interactive dashboard: ./start-full-stack.sh

Dashboard Pages

  1. Overview: Executive summary and key metrics
  2. Similarity Analysis: Interactive similarity matrices and filtering
  3. Compliance Gaps: Priority gap identification with severity levels
  4. Control Requirements: Control mapping and implementation status
  5. Document Explorer: Detailed rule browser with search
  6. Export Tools: Report generation and download interface

Interactive Features

  • Real-time Filtering: By document type, similarity score, priority
  • Advanced Search: Full-text search across rules and matches
  • Dynamic Charts: Plotly visualizations with drill-down capability
  • Export Options: CSV, JSON, Excel formats with customization
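
The filtering and export behaviour described above can be sketched on the data side in a few lines. The row fields (`rule_id`, `document_type`, `similarity`) and the sample values are made up for illustration and are not taken from the project's report schema.

```python
import csv
import io
import json

# Illustrative rows only; not real analysis output.
matches = [
    {"rule_id": "3.2.1", "document_type": "regulation", "similarity": 0.86},
    {"rule_id": "4.1", "document_type": "policy", "similarity": 0.42},
    {"rule_id": "5.3", "document_type": "regulation", "similarity": 0.31},
]


def filter_matches(rows, min_similarity=0.0, document_type=None):
    """Keep rows at or above a score and, optionally, of one document type."""
    return [
        row for row in rows
        if row["similarity"] >= min_similarity
        and (document_type is None or row["document_type"] == document_type)
    ]


def to_csv(rows) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["rule_id", "document_type", "similarity"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows) -> str:
    """Serialize rows to indented JSON text."""
    return json.dumps(rows, indent=2)


regulation_rows = filter_matches(matches, document_type="regulation")
print(to_csv(regulation_rows))
print(to_json(filter_matches(matches, min_similarity=0.8)))
```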

🔒 Security & Privacy

  • Local Processing: All analysis occurs locally, no external API calls
  • Data Privacy: Documents never leave the local environment
  • Access Control: File system permissions and restricted database access
  • Secure Storage: Encrypted embeddings and temporary file cleanup

🛠️ Development

Adding New Document Types

  1. Update DocumentType enum in data_schemas.py
  2. Add processing logic in pdf_processor.py
  3. Update rule parsing patterns in rule_parser.py
  4. Test with new document samples
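
Steps 1 and 3 above might look like the following sketch. The enum members, the `RULE_PATTERNS` mapping, the `parse_rule` function, and the numbering formats are all hypothetical; the real definitions in `data_schemas.py` and `rule_parser.py` are not shown in this README.

```python
import re
from enum import Enum


class DocumentType(Enum):
    # Hypothetical members; check data_schemas.py for the real ones.
    REGULATION = "regulation"
    INTERNAL_POLICY = "internal_policy"
    GUIDANCE_NOTE = "guidance_note"  # the newly added document type


# One rule-heading pattern per document type (assumed numbering formats).
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")
RULE_PATTERNS = {
    DocumentType.REGULATION: NUMBERED,
    DocumentType.INTERNAL_POLICY: NUMBERED,
    DocumentType.GUIDANCE_NOTE: re.compile(r"^G(\d+(?:\.\d+)*)\s+(.+)$"),
}


def parse_rule(doc_type: DocumentType, line: str):
    """Return (rule_number, rule_text), or None if the line is not a rule."""
    match = RULE_PATTERNS[doc_type].match(line.strip())
    return (match.group(1), match.group(2)) if match else None


print(parse_rule(DocumentType.REGULATION, "3.2.1 A firm must keep records."))
print(parse_rule(DocumentType.GUIDANCE_NOTE, "G4.1 Firms should review controls."))
print(parse_rule(DocumentType.REGULATION, "Introduction"))
```

The dotted rule number also carries the hierarchy: "3.2.1" is a child of "3.2", which is one way to recover parent-child structure.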

Custom Similarity Metrics

Extend SimilarityEngine with custom similarity calculations:

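import numpy as np
from scraper_v2.similarity_engine import SimilarityEngine  # assumes the src/scraper_v2 package is importable

class CustomSimilarityEngine(SimilarityEngine):
    def calculate_custom_similarity(self, embedding1, embedding2):
        # Example metric: negative Euclidean distance (higher means more similar)
        diff = np.asarray(embedding1) - np.asarray(embedding2)
        return -float(np.linalg.norm(diff))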

Performance Optimization

  • Adjust batch sizes based on available memory
  • Enable caching for repeated operations
  • Use parallel processing for large document sets
  • Monitor system resources during execution

📋 Maintenance

Regular Tasks

  • Update embedding models quarterly
  • Validate accuracy with new regulatory releases
  • Review and adjust similarity thresholds
  • Clean up old report files
  • Monitor system performance metrics
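
Cleaning up old report files is straightforward to automate. The sketch below deletes files whose modification time is older than a cutoff. The 30-day default and the `remove_old_reports` name are assumptions, and the demo runs in a temporary directory rather than the real `reports/` folder.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def remove_old_reports(reports_dir: Path, max_age_days: int = 30) -> list[Path]:
    """Delete files older than max_age_days and return the removed paths."""
    cutoff = time.time() - max_age_days * SECONDS_PER_DAY
    removed = []
    for path in reports_dir.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


# Demo in a throwaway directory: one stale file, one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    reports = Path(tmp)
    old = reports / "old_report.csv"
    new = reports / "new_report.csv"
    old.write_text("stale")
    new.write_text("fresh")
    forty_days_ago = time.time() - 40 * SECONDS_PER_DAY
    os.utime(old, (forty_days_ago, forty_days_ago))  # backdate the stale file
    removed_names = [p.name for p in remove_old_reports(reports)]
    remaining_names = sorted(p.name for p in reports.iterdir())

print(removed_names)    # ['old_report.csv']
print(remaining_names)  # ['new_report.csv']
```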

Data Backup

# Backup processed data
tar -czf backup_$(date +%Y%m%d).tar.gz src/data/

# Backup configuration (create the target directory first, or cp fails)
mkdir -p config_backup
cp *.md *.py config_backup/

🎯 Project Status

Completed Tasks (15/15)

  • ✅ Environment Setup & Document Collection
  • ✅ PDF Text Extraction & Rule Parsing
  • ✅ Data Structure Design & Processing
  • ✅ Embedding Generation & Vector Database
  • ✅ Similarity Analysis & Classification Logic
  • ✅ Control Detection & Report Generation
  • ✅ Dashboard Creation & Testing Framework
  • ✅ Documentation (Task 15 - COMPLETED)

Next Steps

  • Continuous integration setup
  • Performance optimization for larger document sets
  • Enhanced control detection with ML models
  • Multi-language support expansion

📞 Support

Quick Links

Common Commands

# Verify installation
uv run python test_framework_demo.py

# Complete pipeline execution (processing, rule parsing, similarity analysis)
uv run python analyze_pdfs.py && uv run python detailed_analysis.py && uv run python run_classification_analysis.py

# Generate reports
uv run python run_report_generation.py

# Launch React dashboard
./start-full-stack.sh

NLP-Powered Regulatory Document Reconciliation System - Built for comprehensive compliance analysis with modern NLP technology.
