This project is now OFFICIALLY accepted for:
π Participating in GSSOC'25 & Hacktoberfest 2025! π
An open-source framework to build and deploy intelligent AI agents that can handle real-world phone calls using cutting-edge cloud APIs.
π Get Started
Β·
π Report a Bug
Β·
β¨ Request a Feature
| π Stars | π΄ Forks | π Issues | π Open PRs | π Closed PRs | β±οΈ Last Commit | π οΈ Languages | π₯ Contributors |
Voice Marketing Agents leverages the power of Google Gemini, Groq, and ElevenLabs to deliver production-ready voice AI capabilities. No local models, no GPU infrastructure - just powerful cloud APIs.
- Lightning-Fast Responses: Groq's ultra-fast inference + ElevenLabs' low-latency TTS = natural conversations
- Cloud-Powered AI: Gemini for intelligence, Groq for speed, ElevenLabs for studio-quality voice
- Developer-First: Fully containerized with Docker - one command to start everything
- Simple Management UI: Clean React dashboard for agent configuration
- Extensible: Built with modern tech stack for easy customization
- No Infrastructure Hassle: Everything via cloud APIs - no model management needed
| Component | Technology | Why |
|---|---|---|
| Frontend | React & Vite | Fast, modern UI development |
| Backend | Python & FastAPI | Async performance for AI tasks |
| STT | Google Gemini Voice API | High-accuracy speech recognition |
| LLM | Gemini & Groq | Smart + Fast conversation engine |
| TTS | ElevenLabs | Studio-quality voice synthesis |
| Database | PostgreSQL | Reliable data storage |
| Deploy | Docker Compose | One-command deployment |
- Docker & Docker Compose - Get it here
- API Keys from:
-
Clone:
git clone https://github.com/OpenVoiceX/Voice-Marketing-Agent.git cd Voice-Marketing-Agent -
Configure
.env:# Database DATABASE_URL=postgresql://user:password@db:5432/voicegenie_db # Gemini GEMINI_API_KEY=your_gemini_key GEMINI_MODEL=gemini-1.5-flash GEMINI_VOICE_MODEL=gemini-1.5-flash # Groq GROQ_API_KEY=your_groq_key GROQ_MODEL=llama-3.1-70b-versatile # LLM Provider (gemini or groq) LLM_PROVIDER=gemini # ElevenLabs ELEVENLABS_API_KEY=your_elevenlabs_key ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM ELEVENLABS_MODEL_ID=eleven_monolingual_v1 # Twilio TWILIO_ACCOUNT_SID=your_sid TWILIO_AUTH_TOKEN=your_token TWILIO_PHONE_NUMBER=your_number # App SECRET_KEY=your_secret_key AUDIO_DIR=/app/audio_files PUBLIC_URL=http://your-server:8000
-
Launch:
docker compose up --build -d
-
Access:
- Dashboard:
http://localhost:3000 - API Docs:
http://localhost:8000/docs
- Dashboard:
Gemini (LLM_PROVIDER=gemini)
- Advanced reasoning & multimodal
- ~100 tokens/sec
- Free tier available
Groq (LLM_PROVIDER=groq)
- Ultra-fast (up to 750 tokens/sec)
- Perfect for real-time conversations
- Free tier available
The platform is designed as a set of coordinated microservices, orchestrated by Docker Compose. This modular architecture allows for scalability, maintainability, and clear separation of concerns.
-
Telephony Gateway (External): A VoIP service handles the actual phone call connection. When it's the AI's turn to speak or listen, the VoIP server makes a webhook call to our backend.
-
Audio Ingestion: The VoIP server sends the user's speech as a
.wavfile in amultipart/form-datarequest to the/webhookendpoint of our FastAPI Backend. -
STT Micro-Task (Speech-to-Text):
- The backend receives the audio file.
- It calls the
STTService, which is powered by Google Gemini Voice API. - The API transcribes the audio to text in a few hundred milliseconds.
-
LLM Micro-Task (Reasoning & Response Generation):
- The transcribed text is passed to the
LLMService. - This service constructs a prompt and sends it to either Gemini or Groq.
- The LLM generates the text for the agent's response.
- The transcribed text is passed to the
-
TTS Micro-Task (Text-to-Speech):
- The LLM's text response is sent to the
TTSService. - ElevenLabs synthesizes this text into high-quality audio.
- The resulting audio is saved as a temporary file.
- The LLM's text response is sent to the
-
Webhook Response: The FastAPI backend responds to the initial webhook request from the Telephony Gateway, providing a URL to the newly generated audio file. The gateway then plays this audio to the user over the phone.
This entire end-to-end process is optimized to complete in under 2 seconds, which is crucial for maintaining a natural conversational rhythm.
We love contributions! Check our open issues and see the Contribution Guide.
- Visual call flow builder
- Campaign management UI
- Multi-language support
- Voice cloning
- CRM integrations
- Kubernetes deployment
- Analytics dashboard
Thanks to these wonderful people:
MIT License - See LICENSE file.
Built with β€οΈ and powered by βοΈ cloud AI for GSSoC'25
Let's democratize voice AI! π


