AI-powered document and audio processing pipeline with multilingual translation, LLaMA summarization, and speech synthesis. Supports PDF, DOCX, audio files. Built with PyTorch, Transformers, Whisper, NLLB-200.


Prasaderp/AI-Multilingual-Document-Processor


Document Audio Translator & Summarizer with Speech Synthesis

Transform documents and audio into multilingual summaries with AI-powered speech synthesis

Python 3.8+ PyTorch Transformers Gradio License: MIT

Features · Quick Start · Architecture · API · Documentation


Overview

An end-to-end AI pipeline that processes documents and audio files through extraction, translation, summarization, and speech synthesis. Powered by state-of-the-art transformer models from Meta AI, Facebook Research, and OpenAI.

Input: PDF, DOCX, or Audio files
Output: Translated text, AI summaries, and synthesized speech


Features

Document Processing

  • PDF text extraction via PyMuPDF
  • DOCX document parsing
  • Audio transcription with Whisper
  • Handles multi-page documents
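
Routing an upload to the right extractor is the first step of the pipeline. A minimal sketch, assuming the libraries named above; `detect_kind` and `extract_text` are hypothetical helper names, not necessarily the ones used in `main.py`:

```python
from pathlib import Path

def detect_kind(filename: str) -> str:
    """Route an upload to an extractor by file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"        # handled by PyMuPDF (fitz)
    if suffix == ".docx":
        return "docx"       # handled by python-docx
    if suffix in {".wav", ".mp3", ".m4a", ".flac"}:
        return "audio"      # handled by Whisper
    raise ValueError(f"Unsupported file type: {suffix}")

def extract_text(filename: str) -> str:
    kind = detect_kind(filename)
    if kind == "pdf":
        import fitz  # PyMuPDF
        with fitz.open(filename) as doc:
            return "\n".join(page.get_text() for page in doc)
    if kind == "docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(filename).paragraphs)
    import whisper
    return whisper.load_model("base").transcribe(filename)["text"]
```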

AI Translation & Summarization

  • NLLB-200 (3.3B) neural translation
  • LLaMA 3.1 (8B) text summarization
  • Support for 200+ languages
  • Context-aware processing

Speech Synthesis

  • Natural-sounding TTS with MMS-TTS
  • Adjustable speech speed
  • High-quality audio output
  • Marathi language support
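
The adjustable speech speed can be implemented by resampling the synthesized waveform. A naive pure-Python sketch (the repo's actual method may differ; `change_speed` is a hypothetical helper, taking the MMS-TTS output as a flat list of samples):

```python
def change_speed(samples, speed: float):
    """Naive resampling: stepping through samples by `speed` shortens
    playback (speed > 1) or stretches it (speed < 1) at a fixed sample
    rate. Note this simple approach also shifts pitch."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    out, i = [], 0.0
    while int(i) < len(samples):
        out.append(samples[int(i)])
        i += speed
    return out
```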

Performance & Integration

  • GPU acceleration with CUDA
  • 4-bit quantization for efficiency
  • REST API + Gradio UI
  • Smart caching and batching
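
The 4-bit quantization is typically configured through BitsAndBytes when loading the model. A config sketch, assuming the standard Transformers API and the gated `meta-llama/Meta-Llama-3.1-8B-Instruct` checkpoint (the exact settings in `main.py` may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed settings: NF4 quantization with FP16 compute, the common
# BitsAndBytes recipe for fitting an 8B model in ~6GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated: requires license acceptance
    quantization_config=bnb_config,
    device_map="auto",
)
```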

Quick Start

# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech
uv pip install -r requirements.txt

# Configure Hugging Face token in main.py
# HF_TOKEN = "your_token_here"

# Launch application
python main.py

The Gradio interface will open automatically in your browser. Upload a document or audio file to begin processing.


Architecture

Core Models

| Component | Model | Parameters | Purpose |
|-----------|-------|------------|---------|
| Translation | NLLB-200 | 3.3B | Multilingual neural machine translation |
| Summarization | LLaMA 3.1 Instruct | 8B (4-bit) | Instruction-tuned text generation |
| Transcription | Whisper Base | 74M | Audio-to-text conversion |
| Speech Synthesis | MMS-TTS | - | Text-to-speech for Marathi |

Processing Pipeline

graph TD
    A[Input File<br/>PDF/DOCX/Audio] --> B[Text Extraction<br/>Whisper/PyMuPDF/python-docx]
    B --> C[Translation to Marathi<br/>NLLB-200 3.3B]
    C --> D[English Summarization<br/>LLaMA 3.1 8B]
    D --> E[Summary Translation to Marathi<br/>NLLB-200 3.3B]
    E --> F[Text-to-Speech<br/>MMS-TTS Marathi]
    
    C -.-> G[Marathi Text Output]
    D -.-> H[English Summary Output]
    E -.-> I[Marathi Summary Output]
    F -.-> J[Audio File Output]
    
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style B fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style C fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style D fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style E fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style F fill:#f5f5f5,stroke:#424242,stroke-width:2px,color:#000
    style G fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style H fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style I fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000
    style J fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000

What You Get:

  • Full document translation in Marathi
  • Concise English summary
  • Marathi summary translation
  • Audio file of the Marathi summary

Installation

Prerequisites

✓ Python 3.8 or higher
✓ CUDA-compatible GPU (recommended) or CPU
✓ 16GB RAM recommended (8GB minimum)
✓ 20GB free storage for models
✓ Hugging Face account

Using UV (Recommended)

UV provides fast, reliable Python package management.

# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Install dependencies
uv pip install -r requirements.txt

Using pip

# Clone repository
git clone https://github.com/yourusername/Document-Audio-Translator-Summarizer-with-speech.git
cd Document-Audio-Translator-Summarizer-with-speech

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

  1. Create a Hugging Face account at huggingface.co
  2. Navigate to Settings → Access Tokens
  3. Generate a new token with read permissions
  4. Update main.py with your token:
HF_TOKEN = "your_huggingface_token_here"

Note: LLaMA 3.1 requires acceptance of Meta's license agreement on Hugging Face.
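
Hardcoding the token works for local experiments, but reading it from the environment is safer and keeps secrets out of version control. A sketch (`load_hf_token` is a hypothetical helper, not part of the current `main.py`):

```python
import os

def load_hf_token() -> str:
    """Read the Hugging Face token from the environment instead of hardcoding it."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "Set HF_TOKEN first, e.g.  export HF_TOKEN=hf_xxx  before launching main.py"
        )
    return token
```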


Usage

Gradio Web Interface

Launch the interactive web application:

python main.py

Features:

  • Drag-and-drop file upload
  • Real-time processing status
  • Text display for translations and summaries
  • Audio player for synthesized speech
  • Shareable interface (optional)

REST API

The Flask API runs concurrently on port 5000.

Endpoint: POST /api/process

Request:

curl -X POST http://localhost:5000/api/process \
  -F "doc_file=@document.pdf"

Response:

{
  "marathi_text": "संपूर्ण दस्तऐवज मराठी भाषांतर...",
  "english_summary": "Brief summary of the document...",
  "marathi_summary": "दस्तऐवजाचा संक्षिप्त सारांश...",
  "audio_output": "marathi_summary.wav"
}

Parameters:

  • audio_file: Audio file (WAV, MP3, etc.) - optional
  • doc_file: Document file (PDF, DOCX) - optional
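
A minimal Python client; `pick_field` is a hypothetical helper that maps an upload to the `audio_file`/`doc_file` form fields listed above, and the endpoint URL assumes the default port 5000:

```python
from pathlib import Path

def pick_field(filename: str) -> str:
    """Choose the API form field for a given upload."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".pdf", ".docx"}:
        return "doc_file"
    if suffix in {".wav", ".mp3", ".m4a", ".flac"}:
        return "audio_file"
    raise ValueError(f"Unsupported upload: {suffix}")

def process(filename: str, url: str = "http://localhost:5000/api/process"):
    import requests  # pip install requests
    with open(filename, "rb") as fh:
        resp = requests.post(url, files={pick_field(filename): fh})
    resp.raise_for_status()
    # Expected keys: marathi_text, english_summary, marathi_summary, audio_output
    return resp.json()
```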

Performance Optimizations

  • 4-bit Quantization: LLaMA model compressed to roughly 25% of its memory footprint using BitsAndBytes
  • Mixed Precision: FP16 inference on CUDA devices for up to 2x speedup
  • Dynamic Batching: Automatic batch-size adjustment based on available GPU memory
  • Text Chunking: Smart document segmentation for processing long texts
  • Memory Management: Aggressive CUDA cache clearing between model operations
  • LRU Caching: Translation cache for repeated content processing
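
The chunking and caching strategies combine naturally: long documents are split on sentence boundaries, and repeated chunks (boilerplate headers, footers) hit the cache instead of the model. A sketch with `functools.lru_cache`, where `translate_chunk` is a placeholder for the actual NLLB-200 call:

```python
from functools import lru_cache

def chunk_text(text: str, max_chars: int = 400):
    """Split on sentence boundaries so no chunk exceeds max_chars."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 2 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current}. {s}" if current else s
    if current:
        chunks.append(current)
    return chunks

@lru_cache(maxsize=1024)
def translate_chunk(chunk: str) -> str:
    # Placeholder for the real NLLB-200 call; lru_cache means an
    # identical chunk is only ever translated once.
    return f"<mr>{chunk}</mr>"

def translate(text: str) -> str:
    return " ".join(translate_chunk(c) for c in chunk_text(text))
```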

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | 4 cores | 8+ cores |
| RAM | 8GB | 16GB+ |
| GPU | None (CPU fallback) | NVIDIA RTX 3060+ (8GB VRAM) |
| Storage | 20GB free space | 30GB SSD |
| OS | Linux, Windows, macOS | Linux with CUDA 11.8+ |

Supported Languages

Current Implementation:

  • Source: English
  • Target: Marathi (मराठी)

Extensibility:

The NLLB-200 model supports 200+ languages. Modify the MARATHI_CODE variable in main.py to enable other languages:

| Language | Code | Language | Code |
|----------|------|----------|------|
| Hindi | hin_Deva | Spanish | spa_Latn |
| French | fra_Latn | German | deu_Latn |
| Arabic | arb_Arab | Chinese | zho_Hans |

View full language list →
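
Switching targets is a one-line change. The codes from the table above as a lookup (a sketch; NLLB-200 uses the FLORES-200 convention of ISO 639-3 plus script, and the actual `main.py` keeps a single `MARATHI_CODE` constant):

```python
# NLLB-200 / FLORES-200 language codes: ISO 639-3 code + script suffix
NLLB_CODES = {
    "Marathi": "mar_Deva",
    "Hindi": "hin_Deva",
    "Spanish": "spa_Latn",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Arabic": "arb_Arab",
    "Chinese (Simplified)": "zho_Hans",
}

TARGET_CODE = NLLB_CODES["Marathi"]  # swap the key to change the pipeline's target
```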


Project Structure

Document-Audio-Translator-Summarizer-with-speech/
│
├── main.py                 # Main application (Gradio + Flask)
├── requirements.txt        # Python dependencies
└── README.md              # Documentation

Code Organization:

  • Model initialization at startup for faster inference
  • Cached translation function with LRU cache
  • Separate extraction functions for each file type
  • Dynamic batch processing for optimal GPU usage
  • Concurrent Flask and Gradio interfaces

Limitations & Considerations

  • Model Access: LLaMA 3.1 requires Meta's license acceptance via Hugging Face
  • Memory Usage: GPU memory requirements scale with document length
  • Language Support: TTS currently supports Marathi only (extensible to other MMS-TTS languages)
  • Processing Time: Large documents may take several minutes on CPU
  • First Run: Initial model downloads require ~15GB bandwidth

Contributing

Contributions are welcome! Here are areas where you can help:

  • Language Support: Add support for additional languages and TTS models
  • Batch Processing: Implement multi-document processing queue
  • Testing: Create unit tests and integration tests
  • Performance: Optimize inference speed and memory usage
  • Documentation: Improve examples and tutorials

Getting Started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

This project is built on groundbreaking research from the AI community:

  • Meta AI - NLLB-200 translation, LLaMA 3.1 language model, MMS-TTS synthesis
  • OpenAI - Whisper automatic speech recognition
  • Hugging Face - Transformers library and model hosting infrastructure
  • Gradio - Interactive web interface framework

Citation

If you use this project in research or production, please cite the underlying models:

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team and others},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}

@article{llama31,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  journal={arXiv preprint arXiv:2407.21783},
  year={2024}
}

@article{whisper2022,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Built with ❤️ for the multilingual AI community

Report Bug · Request Feature
