Features • Quick Start • Architecture • API • Documentation
An end-to-end AI pipeline that processes documents and audio files through extraction, translation, summarization, and speech synthesis. Powered by state-of-the-art transformer models from Meta AI and OpenAI.
Input: PDF, DOCX, or Audio files
Output: Translated text, AI summaries, and synthesized speech
- Document Processing
- AI Translation & Summarization
- Speech Synthesis
- Performance & Integration
```bash
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech
uv pip install -r requirements.txt

# Configure Hugging Face token in main.py
# HF_TOKEN = "your_token_here"

# Launch application
python main.py
```

The Gradio interface will open automatically in your browser. Upload a document or audio file to begin processing.
| Component | Model | Parameters | Purpose |
|---|---|---|---|
| Translation | NLLB-200 | 3.3B | Multilingual neural machine translation |
| Summarization | LLaMA 3.1 Instruct | 8B (4-bit) | Instruction-tuned text generation |
| Transcription | Whisper Base | 74M | Audio-to-text conversion |
| Speech Synthesis | MMS-TTS | - | Text-to-speech for Marathi |
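As a rough guide to why these models fit the hardware requirements listed later in this README, the weight memory of each model can be estimated from parameter count and precision. This is a back-of-the-envelope sketch only; real usage also includes activations and KV cache:

```python
def model_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: parameter count x bytes per parameter."""
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9  # decimal GB

# NLLB-200 3.3B in FP16 (16 bits/param) ~ 6.6 GB of weights
print(round(model_memory_gb(3.3, 16), 1))
# LLaMA 3.1 8B quantized to 4-bit ~ 4.0 GB of weights
print(round(model_memory_gb(8.0, 4), 1))
```

This is why 4-bit quantization (see the performance section) is what makes the 8B summarizer fit alongside the translator on a single consumer GPU.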
```mermaid
graph TD
    A[Input File<br/>PDF/DOCX/Audio] --> B[Text Extraction<br/>Whisper/PyMuPDF/python-docx]
    B --> C[Translation to Marathi<br/>NLLB-200 3.3B]
    C --> D[English Summarization<br/>LLaMA 3.1 8B]
    D --> E[Summary Translation to Marathi<br/>NLLB-200 3.3B]
    E --> F[Text-to-Speech<br/>MMS-TTS Marathi]
    C -.-> G[Marathi Text Output]
    D -.-> H[English Summary Output]
    E -.-> I[Marathi Summary Output]
    F -.-> J[Audio File Output]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style B fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style C fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style D fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style E fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style F fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style G fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style H fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style I fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style J fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
```
What You Get:
- Full document translation in Marathi
- Concise English summary
- Marathi summary translation
- Audio file of the Marathi summary
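The stages above can be sketched as a single orchestration function. Function and parameter names here are illustrative, not the actual `main.py` API, and this sketch summarizes from the source text directly; the exact wiring between stages in `main.py` may differ:

```python
def run_pipeline(text, translate, summarize, synthesize):
    """One pass through the pipeline; every branch output is kept."""
    marathi_text = translate(text)                # full translation branch
    english_summary = summarize(text)             # English summary branch
    marathi_summary = translate(english_summary)  # summary translation branch
    audio_output = synthesize(marathi_summary)    # TTS branch
    return {
        "marathi_text": marathi_text,
        "english_summary": english_summary,
        "marathi_summary": marathi_summary,
        "audio_output": audio_output,
    }

# Stub models, to show only the data flow:
result = run_pipeline(
    "Hello world",
    translate=lambda s: f"mr({s})",
    summarize=lambda s: f"sum({s})",
    synthesize=lambda s: f"{s}.wav",
)
print(result["marathi_summary"])  # mr(sum(Hello world))
```

Note that the four returned keys mirror the four outputs listed above, which is also the shape of the REST API response.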
✓ Python 3.8 or higher
✓ CUDA-compatible GPU (recommended) or CPU
✓ 16GB RAM minimum
✓ 20GB free storage for models
✓ Hugging Face account
UV provides fast, reliable Python package management.
```bash
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Install dependencies
uv pip install -r requirements.txt
```

Alternatively, with pip and a virtual environment:

```bash
# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

To access the gated models, configure a Hugging Face token:

- Create a Hugging Face account at huggingface.co
- Navigate to Settings → Access Tokens
- Generate a new token with read permissions
- Update `main.py` with your token:

```python
HF_TOKEN = "your_huggingface_token_here"
```

Note: LLaMA 3.1 requires acceptance of Meta's license agreement on Hugging Face.
Launch the interactive web application:

```bash
python main.py
```

Features:
- Drag-and-drop file upload
- Real-time processing status
- Text display for translations and summaries
- Audio player for synthesized speech
- Shareable interface (optional)
The Flask API runs concurrently on port 5000.
Endpoint: `POST /api/process`

Request:

```bash
curl -X POST http://localhost:5000/api/process \
  -F "doc_file=@document.pdf"
```

Response:

```json
{
  "marathi_text": "संपूर्ण दस्तऐवज मराठी भाषांतर...",
  "english_summary": "Brief summary of the document...",
  "marathi_summary": "दस्तऐवजाचा संक्षिप्त सारांश...",
  "audio_output": "marathi_summary.wav"
}
```

Parameters:

- `audio_file`: Audio file (WAV, MP3, etc.) - optional
- `doc_file`: Document file (PDF, DOCX) - optional
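On the client side, the response can be consumed with the standard library alone. This sketch validates that all four documented fields are present (field names taken from the response shape above; the text values are elided here):

```python
import json

# Response body shape documented above (text fields elided):
raw = ('{"marathi_text": "...", "english_summary": "...", '
       '"marathi_summary": "...", "audio_output": "marathi_summary.wav"}')

def parse_response(body: str) -> dict:
    """Parse a /api/process response and check all four fields exist."""
    data = json.loads(body)
    expected = {"marathi_text", "english_summary",
                "marathi_summary", "audio_output"}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data

print(parse_response(raw)["audio_output"])  # marathi_summary.wav
```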
| Optimization | Description |
|---|---|
| 4-bit Quantization | LLaMA model compressed to 25% memory footprint using BitsAndBytes |
| Mixed Precision | FP16 inference on CUDA devices for 2x speedup |
| Dynamic Batching | Automatic batch size adjustment based on available GPU memory |
| Text Chunking | Smart document segmentation for processing long texts |
| Memory Management | Aggressive CUDA cache clearing between model operations |
| LRU Caching | Translation cache for repeated content processing |
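The text-chunking row above can be illustrated with a sentence-boundary splitter. This is a simplified sketch; the actual segmentation logic in `main.py` may differ:

```python
import re

def chunk_text(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Sentence boundaries stay intact, so the translator never sees a
    sentence cut mid-way; a lone sentence longer than max_chars becomes
    its own oversized chunk rather than being split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

doc = "First sentence. Second sentence! Third one? " * 10
chunks = chunk_text(doc, max_chars=100)
assert all(len(c) <= 100 for c in chunks)   # no chunk exceeds the limit
assert " ".join(chunks) == doc.strip()      # no text lost or reordered
```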
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU fallback) | NVIDIA RTX 3060+ (8GB VRAM) |
| Storage | 20GB free space | 30GB SSD |
| OS | Linux, Windows, macOS | Linux with CUDA 11.8+ |
Current Implementation:
- Source: English
- Target: Marathi (मराठी)
Extensibility:
The NLLB-200 model supports 200+ languages. Modify the `MARATHI_CODE` variable in `main.py` to enable other languages:
| Language | Code | Language | Code |
|---|---|---|---|
| Hindi | hin_Deva | Spanish | spa_Latn |
| French | fra_Latn | German | deu_Latn |
| Arabic | arb_Arab | Chinese | zho_Hans |
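Switching targets is then a one-line change. A small helper (illustrative, not part of `main.py`) makes the lookup explicit; the codes are the FLORES-200 identifiers NLLB-200 expects:

```python
# FLORES-200 language codes used by NLLB-200 (subset from the table above)
NLLB_CODES = {
    "Marathi": "mar_Deva",
    "Hindi": "hin_Deva",
    "Spanish": "spa_Latn",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Arabic": "arb_Arab",
    "Chinese": "zho_Hans",
}

def target_code(language: str) -> str:
    """Resolve a language name to its NLLB-200 code, failing loudly."""
    try:
        return NLLB_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None

# e.g. to retarget the pipeline, replace MARATHI_CODE with:
print(target_code("Hindi"))  # hin_Deva
```

Remember that the TTS stage is a separate constraint: MMS-TTS is loaded for Marathi, so retargeting translation also means swapping in the matching MMS-TTS checkpoint.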
```text
Document-Audio-Translator-Summarizer-with-speech/
│
├── main.py            # Main application (Gradio + Flask)
├── requirements.txt   # Python dependencies
└── README.md          # Documentation
```
Code Organization:
- Model initialization at startup for faster inference
- Cached translation function with LRU cache
- Separate extraction functions for each file type
- Dynamic batch processing for optimal GPU usage
- Concurrent Flask and Gradio interfaces
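The cached translation function mentioned in the list above can be sketched with `functools.lru_cache`. The body here is a stand-in; in the real application the cached call wraps the NLLB-200 forward pass:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=256)
def translate_cached(text, lang_code="mar_Deva"):
    """Repeated (text, lang_code) pairs skip the expensive model call."""
    calls["count"] += 1  # stands in for the NLLB-200 forward pass
    return f"[{lang_code}] {text}"

translate_cached("Hello world")
translate_cached("Hello world")  # served from the cache
print(calls["count"])  # 1
```

This is why documents with repeated boilerplate (headers, footers, legal text) translate faster on later chunks.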
- Model Access: LLaMA 3.1 requires Meta's license acceptance via Hugging Face
- Memory Usage: GPU memory requirements scale with document length
- Language Support: TTS currently supports Marathi only (extensible to other MMS-TTS languages)
- Processing Time: Large documents may take several minutes on CPU
- First Run: Initial model downloads require ~15GB bandwidth
Contributions are welcome! Here are areas where you can help:
- Language Support: Add support for additional languages and TTS models
- Batch Processing: Implement multi-document processing queue
- Testing: Create unit tests and integration tests
- Performance: Optimize inference speed and memory usage
- Documentation: Improve examples and tutorials
Getting Started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built on groundbreaking research from the AI community:
- Meta AI - NLLB-200 translation, LLaMA 3.1 language model, MMS-TTS synthesis
- OpenAI - Whisper automatic speech recognition
- Hugging Face - Transformers library and model hosting infrastructure
- Gradio - Interactive web interface framework
If you use this project in research or production, please cite the underlying models:
```bibtex
@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team and others},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}

@article{llama31,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  year={2024}
}

@article{whisper2022,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}
```

Built with ❤️ for the multilingual AI community