Features • Quick Start • Architecture • API • Documentation
An end-to-end AI pipeline that processes documents and audio files through extraction, translation, summarization, and speech synthesis. Powered by state-of-the-art transformer models from Meta AI and OpenAI.
Input: PDF, DOCX, or Audio files
Output: Translated text, AI summaries, and synthesized speech
- Document Processing
- AI Translation & Summarization
- Speech Synthesis
- Performance & Integration
```bash
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech
uv pip install -r requirements.txt

# Configure Hugging Face token in main.py
# HF_TOKEN = "your_token_here"

# Launch application
python main.py
```

The Gradio interface will open automatically in your browser. Upload a document or audio file to begin processing.
| Component | Model | Parameters | Purpose |
|---|---|---|---|
| Translation | NLLB-200 | 3.3B | Multilingual neural machine translation |
| Summarization | LLaMA 3.1 Instruct | 8B (4-bit) | Instruction-tuned text generation |
| Transcription | Whisper Base | 74M | Audio-to-text conversion |
| Speech Synthesis | MMS-TTS | - | Text-to-speech for Marathi |
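As a rough guide to why these models fit the hardware requirements listed later in this README, the weight memory of each model can be estimated from parameter count and precision. This is a back-of-the-envelope sketch only; real usage also includes activations and KV cache:

```python
def model_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: parameter count x bytes per parameter."""
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9  # decimal GB

# NLLB-200 3.3B in FP16 (16 bits/param) ~ 6.6 GB of weights
print(round(model_memory_gb(3.3, 16), 1))
# LLaMA 3.1 8B quantized to 4-bit ~ 4.0 GB of weights
print(round(model_memory_gb(8.0, 4), 1))
```

This is why 4-bit quantization (see the performance section) is what makes the 8B summarizer fit alongside the translator on a single consumer GPU.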
```mermaid
graph TD
    A[Input File<br/>PDF/DOCX/Audio] --> B[Text Extraction<br/>Whisper/PyMuPDF/python-docx]
    B --> C[Translation to Marathi<br/>NLLB-200 3.3B]
    C --> D[English Summarization<br/>LLaMA 3.1 8B]
    D --> E[Summary Translation to Marathi<br/>NLLB-200 3.3B]
    E --> F[Text-to-Speech<br/>MMS-TTS Marathi]
    C -.-> G[Marathi Text Output]
    D -.-> H[English Summary Output]
    E -.-> I[Marathi Summary Output]
    F -.-> J[Audio File Output]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style B fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style C fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style D fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style E fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style F fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style G fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style H fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style I fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style J fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
```
What You Get:
- Full document translation in Marathi
- Concise English summary
- Marathi summary translation
- Audio file of the Marathi summary
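The stages above can be sketched as a single orchestration function. Function and parameter names here are illustrative, not the actual `main.py` API, and this sketch summarizes from the source text directly; the exact wiring between stages in `main.py` may differ:

```python
def run_pipeline(text, translate, summarize, synthesize):
    """One pass through the pipeline; every branch output is kept."""
    marathi_text = translate(text)                # full translation branch
    english_summary = summarize(text)             # English summary branch
    marathi_summary = translate(english_summary)  # summary translation branch
    audio_output = synthesize(marathi_summary)    # TTS branch
    return {
        "marathi_text": marathi_text,
        "english_summary": english_summary,
        "marathi_summary": marathi_summary,
        "audio_output": audio_output,
    }

# Stub models, to show only the data flow:
result = run_pipeline(
    "Hello world",
    translate=lambda s: f"mr({s})",
    summarize=lambda s: f"sum({s})",
    synthesize=lambda s: f"{s}.wav",
)
print(result["marathi_summary"])  # mr(sum(Hello world))
```

Note that the four returned keys mirror the four outputs listed above, which is also the shape of the REST API response.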
✓ Python 3.8 or higher
✓ CUDA-compatible GPU (recommended) or CPU
✓ 16GB RAM minimum
✓ 20GB free storage for models
✓ Hugging Face account
UV provides fast, reliable Python package management.
```bash
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Install dependencies
uv pip install -r requirements.txt
```

Alternatively, with pip and a virtual environment:

```bash
# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

To access the gated models, configure a Hugging Face token:

- Create a Hugging Face account at huggingface.co
- Navigate to Settings → Access Tokens
- Generate a new token with read permissions
- Update `main.py` with your token:

```python
HF_TOKEN = "your_huggingface_token_here"
```

Note: LLaMA 3.1 requires acceptance of Meta's license agreement on Hugging Face.
Launch the interactive web application:

```bash
python main.py
```

Features:
- Drag-and-drop file upload
- Real-time processing status
- Text display for translations and summaries
- Audio player for synthesized speech
- Shareable interface (optional)
The Flask API runs concurrently on port 5000.
Endpoint: `POST /api/process`

Request:

```bash
curl -X POST http://localhost:5000/api/process \
  -F "doc_file=@document.pdf"
```

Response:

```json
{
  "marathi_text": "संपूर्ण दस्तऐवज मराठी भाषांतर...",
  "english_summary": "Brief summary of the document...",
  "marathi_summary": "दस्तऐवजाचा संक्षिप्त सारांश...",
  "audio_output": "marathi_summary.wav"
}
```

Parameters:

- `audio_file`: Audio file (WAV, MP3, etc.) - optional
- `doc_file`: Document file (PDF, DOCX) - optional
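On the client side, the response can be consumed with the standard library alone. This sketch validates that all four documented fields are present (field names taken from the response shape above; the text values are elided here):

```python
import json

# Response body shape documented above (text fields elided):
raw = ('{"marathi_text": "...", "english_summary": "...", '
       '"marathi_summary": "...", "audio_output": "marathi_summary.wav"}')

def parse_response(body: str) -> dict:
    """Parse a /api/process response and check all four fields exist."""
    data = json.loads(body)
    expected = {"marathi_text", "english_summary",
                "marathi_summary", "audio_output"}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data

print(parse_response(raw)["audio_output"])  # marathi_summary.wav
```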
| Optimization | Description |
|---|---|
| 4-bit Quantization | LLaMA model compressed to 25% memory footprint using BitsAndBytes |
| Mixed Precision | FP16 inference on CUDA devices for 2x speedup |
| Dynamic Batching | Automatic batch size adjustment based on available GPU memory |
| Text Chunking | Smart document segmentation for processing long texts |
| Memory Management | Aggressive CUDA cache clearing between model operations |
| LRU Caching | Translation cache for repeated content processing |
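The text-chunking row above can be illustrated with a sentence-boundary splitter. This is a simplified sketch; the actual segmentation logic in `main.py` may differ:

```python
import re

def chunk_text(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Sentence boundaries stay intact, so the translator never sees a
    sentence cut mid-way; a lone sentence longer than max_chars becomes
    its own oversized chunk rather than being split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

doc = "First sentence. Second sentence! Third one? " * 10
chunks = chunk_text(doc, max_chars=100)
assert all(len(c) <= 100 for c in chunks)   # no chunk exceeds the limit
assert " ".join(chunks) == doc.strip()      # no text lost or reordered
```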
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU fallback) | NVIDIA RTX 3060+ (8GB VRAM) |
| Storage | 20GB free space | 30GB SSD |
| OS | Linux, Windows, macOS | Linux with CUDA 11.8+ |
Current Implementation:
- Source: English
- Target: Marathi (मराठी)
Extensibility:
The NLLB-200 model supports 200+ languages. Modify the `MARATHI_CODE` variable in `main.py` to enable other languages:
| Language | Code | Language | Code |
|---|---|---|---|
| Hindi | hin_Deva | Spanish | spa_Latn |
| French | fra_Latn | German | deu_Latn |
| Arabic | arb_Arab | Chinese | zho_Hans |
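Switching targets is then a one-line change. A small helper (illustrative, not part of `main.py`) makes the lookup explicit; the codes are the FLORES-200 identifiers NLLB-200 expects:

```python
# FLORES-200 language codes used by NLLB-200 (subset from the table above)
NLLB_CODES = {
    "Marathi": "mar_Deva",
    "Hindi": "hin_Deva",
    "Spanish": "spa_Latn",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Arabic": "arb_Arab",
    "Chinese": "zho_Hans",
}

def target_code(language: str) -> str:
    """Resolve a language name to its NLLB-200 code, failing loudly."""
    try:
        return NLLB_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None

# e.g. to retarget the pipeline, replace MARATHI_CODE with:
print(target_code("Hindi"))  # hin_Deva
```

Remember that the TTS stage is a separate constraint: MMS-TTS is loaded for Marathi, so retargeting translation also means swapping in the matching MMS-TTS checkpoint.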
```text
Document-Audio-Translator-Summarizer-with-speech/
│
├── main.py            # Main application (Gradio + Flask)
├── requirements.txt   # Python dependencies
└── README.md          # Documentation
```
Code Organization:
- Model initialization at startup for faster inference
- Cached translation function with LRU cache
- Separate extraction functions for each file type
- Dynamic batch processing for optimal GPU usage
- Concurrent Flask and Gradio interfaces
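The cached translation function mentioned in the list above can be sketched with `functools.lru_cache`. The body here is a stand-in; in the real application the cached call wraps the NLLB-200 forward pass:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=256)
def translate_cached(text, lang_code="mar_Deva"):
    """Repeated (text, lang_code) pairs skip the expensive model call."""
    calls["count"] += 1  # stands in for the NLLB-200 forward pass
    return f"[{lang_code}] {text}"

translate_cached("Hello world")
translate_cached("Hello world")  # served from the cache
print(calls["count"])  # 1
```

This is why documents with repeated boilerplate (headers, footers, legal text) translate faster on later chunks.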
- Model Access: LLaMA 3.1 requires Meta's license acceptance via Hugging Face
- Memory Usage: GPU memory requirements scale with document length
- Language Support: TTS currently supports Marathi only (extensible to other MMS-TTS languages)
- Processing Time: Large documents may take several minutes on CPU
- First Run: Initial model downloads require ~15GB bandwidth
Contributions are welcome! Here are areas where you can help:
- Language Support: Add support for additional languages and TTS models
- Batch Processing: Implement multi-document processing queue
- Testing: Create unit tests and integration tests
- Performance: Optimize inference speed and memory usage
- Documentation: Improve examples and tutorials
Getting Started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built on groundbreaking research from the AI community:
- Meta AI - NLLB-200 translation, LLaMA 3.1 language model, MMS-TTS synthesis
- OpenAI - Whisper automatic speech recognition
- Hugging Face - Transformers library and model hosting infrastructure
- Gradio - Interactive web interface framework
If you use this project in research or production, please cite the underlying models:
```bibtex
@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team and others},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}

@article{llama31,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  year={2024}
}

@article{whisper2022,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}
```

Built with ❤️ for the multilingual AI community