Autoavanza is an application built with Streamlit that automates the extraction, classification, and validation of vehicle and official identification documents in Mexico. It is designed to optimize processes such as vehicle pawn loans, ensuring that submitted documents comply with regulations through intelligent processing and automated validation.
- Achieve at least 80% accuracy in document extraction, classification, and validation.
- Reduce document review time from 2 hours to less than 15 minutes.
- Generate a clear and precise ruling in natural language in at least 80% of cases.
The system consists of the following main modules:
- Text Extraction: OCR-based system to detect and extract textual content from documents.
- File Classification: Automatic classification system that identifies the type of document based on OCR results.
- Data Extraction: Module that extracts key data from the OCR content using an LLM API (such as Gemini).
- QR Code Detection & Web Scraping: Detects QR codes in documents and extracts official information from the SAT portal using web scraping.
- Signature Detection: Identifies and extracts signatures present in documents.
- Signature Comparison (in development): Compares detected signatures against a database or reference signature.
- Data Validation: Applies business rules for each document type, checking validity, data consistency, and more.
- Ruling: Generates a final validation ruling, useful for deciding whether to accept or reject the pawn loan process.
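
As a rough illustration of how the classification, validation, and ruling modules above could fit together, here is a minimal sketch. Everything in it is an assumption made for the example: the keyword lists, the function names (`classify`, `validate_ine`, `ruling`), and the field names (`expiry_year`, `name`, `invoice_name`) are hypothetical and do not describe the actual code in `src/`.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass
from datetime import date

# Hypothetical keyword lists; the real classifier may use a different approach.
KEYWORDS = {
    "INE": ("INSTITUTO NACIONAL ELECTORAL", "CREDENCIAL PARA VOTAR"),
    "Circulation Card": ("TARJETA DE CIRCULACION",),
    "Invoice": ("FACTURA", "CFDI"),
}


def normalize(text: str) -> str:
    """Strip accents, trim whitespace, and uppercase so OCR text compares reliably."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).strip().upper()


def classify(ocr_text: str) -> str:
    """Return the first document type whose keywords appear in the OCR text."""
    text = normalize(ocr_text)
    for doc_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return doc_type
    return "Unknown"


@dataclass
class CheckResult:
    name: str
    passed: bool


def validate_ine(data: dict, today: date) -> list[CheckResult]:
    """Two illustrative business rules: the ID is not expired and the name matches the invoice."""
    return [
        CheckResult("ine_not_expired", data["expiry_year"] >= today.year),
        CheckResult("name_matches_invoice", normalize(data["name"]) == normalize(data["invoice_name"])),
    ]


def ruling(checks: list[CheckResult]) -> str:
    """Turn the check results into a short plain-language ruling."""
    failed = [check.name for check in checks if not check.passed]
    if not failed:
        return "ACCEPT: all checks passed."
    return "REVIEW: failed checks: " + ", ".join(failed)


if __name__ == "__main__":
    sample = {"expiry_year": 2030, "name": "Ana Pérez", "invoice_name": "ANA PEREZ"}
    print(classify("Instituto Nacional Electoral"))
    print(ruling(validate_ine(sample, date(2025, 1, 1))))
```

Keeping each check as a named result makes it straightforward to list the failed rules in the ruling, which is one possible way to keep the final decision explainable.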
| Document | Accuracy |
|---|---|
| Invoice | 100% |
| Invoice Back | 80% |
| INE (ID card) | 100% |
| INE Back | 90% |
| Circulation Card | 100% |
| Circulation Card Back | 50% |
| Overall accuracy | 92.3% |
- Extraction rate: 91.7%
- Extracted values accuracy: 87.6%
- Completed checks: 94.4%
- Accuracy with correct values: 100%
- Accuracy with missing values: 70.6%
- Previous time: 2 hours
- With Autoavanza: 15 minutes
- 87.5% reduction
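
The reduction figure follows directly from the two times above. A small sketch of the arithmetic, where `percent_reduction` is a helper name introduced only for this example:

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Percentage of the original time that was saved."""
    return (before_minutes - after_minutes) / before_minutes * 100


# 2 hours = 120 minutes before, 15 minutes with Autoavanza.
print(percent_reduction(120, 15))  # 87.5
```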
- Python: Main programming language.
- Gemini API: LLM used for flexible data extraction.
- GitHub: Version control and collaboration.
- Streamlit: Framework for building the interactive interface.
Autoavanza/
├── README.md
├── assets/
│   ├── img/
│   │   └── logo.png  # Project logo with Monte de Piedad
│   └── videos/
│       └── DemoAutoavanza.mov  # Demonstration video
├── data/  # Test cases in .zip format
├── src/  # Processing and validation modules
│   ├── DataExtraction.py  # Extracts data from OCR text
│   ├── DataValidation.py  # Validates extracted data against business rules
│   ├── DocumentClassification.py  # Automatic document classification
│   ├── OCR.py  # OCR module
│   ├── QRExctraction.py  # QR detection + SAT scraping
│   ├── Ruling.py  # Automated ruling generation
│   ├── SignatureComparison.py  # Automatic signature comparison
│   ├── SignatureStampValidation.py  # Signature and stamp validation
│   ├── Staging.py  # Temporary storage and processing
│   ├── autoavanza.py  # Main Streamlit script
│   └── models/
│       └── best.pt  # Trained model (e.g., for signature detection)
└── temp/  # Temporary processed files
    ├── archivos/  # Decompressed documents
    ├── captchas/  # SAT captchas
    └── signatures/  # Extracted document signatures
- Format: documents must be uploaded as a .zip file.
- Minimum content: Invoice, INE, Circulation Card.
- Orientation: documents must be in vertical orientation.
- Manual intervention: required in case of classification, extraction, or signature errors.
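
The format and minimum-content requirements above could be checked before any processing starts. The sketch below uses only Python's standard `zipfile` module; the helper names (`list_zip_documents`, `missing_documents`) and the `REQUIRED_DOCS` set are assumptions for illustration, and the document types are assumed to come from the classification step rather than from file names.

```python
from __future__ import annotations

import io
import zipfile

# Minimum set of document types, as listed in the requirements above.
REQUIRED_DOCS = {"Invoice", "INE", "Circulation Card"}


def list_zip_documents(upload: bytes) -> list[str]:
    """Return the file names inside the upload, or raise ValueError if it is not a .zip."""
    buffer = io.BytesIO(upload)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("Upload must be a .zip file")
    with zipfile.ZipFile(buffer) as archive:
        # Skip directory entries; only actual files are documents.
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def missing_documents(classified_types: set[str]) -> set[str]:
    """Return the required document types that were not found after classification."""
    return REQUIRED_DOCS - classified_types
```

For example, `missing_documents({"Invoice", "INE"})` returns `{"Circulation Card"}`, which could be surfaced to the user as a request to upload the missing document.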
- Strengthen signature comparison with more data for production use.
- Define a robust confidence index for automatic acceptance/rejection.
- Improve date detection and validity checks.
- Add debt verification (REPUVE and TransUnion) and fiscal seal checks.
- Optimize interface with a smoother framework.
- Scale validation with a larger sample to strengthen the signature model.
- Design a confidence index for automated decisions.
- Add new rules and additional validation checks.
