GPU TTS/STT latency optimization checkpoint (FA2 enabled, PM2 stabilized) #2
TacImpulse wants to merge 1 commit into
…nfig, docs added (2025-11-28)
❌ Deploy Preview for uvb failed.
Reviewer's Guide
Configures PM2 and environment to run the backend from a specific Flash Attention 2 Python virtualenv with GPU-enabled STT/TTS, stabilizes Windows/console behavior, and adds documentation for the GPU latency optimization work.
Sequence diagram for GPU-accelerated voice request with FA2 backend
sequenceDiagram
actor User
participant Browser
participant FrontendDev as Frontend_next_dev
participant Backend as Backend_main
participant STT as STT_service_FasterWhisper
participant TTS as TTS_service_VibeVoice_FA2
participant GPU as CUDA_GPU
User ->> Browser: Start voice chat
Browser ->> FrontendDev: POST /api/voice with audio
FrontendDev ->> Backend: HTTP request /voice with WAV
Backend ->> STT: Transcribe(audio_wav)
STT ->> GPU: Run_FasterWhisper_cuda(16k_mono)
GPU -->> STT: Text_transcript
STT -->> Backend: Transcript
Backend ->> TTS: Synthesize(transcript)
TTS ->> GPU: VibeVoice_flash_attention_2_inference
GPU -->> TTS: Generated_audio
TTS -->> Backend: WAV_response
Backend -->> FrontendDev: WAV audio
FrontendDev -->> Browser: Stream/return WAV
Browser -->> User: Play low_latency_audio
File-Level Changes
Walkthrough
The pull request consolidates development configuration files and adds optimization documentation. Changes include simplifying .gitignore patterns, updating PM2 ecosystem configuration with explicit binary paths and Windows compatibility settings, adjusting package.json dev scripts, and adding documentation for GPU latency optimization with Flash Attention 2.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Hey there - I've reviewed your changes - here's some feedback:
- The hard-coded absolute path to the FA2 virtualenv Python executable in `ecosystem.config.js` (`F:/Git/...`) makes the setup non-portable; consider using a relative path, environment variable, or PM2 `interpreter`/`interpreter_args` to select the venv dynamically.
- The Next dev script in `ecosystem.config.js` directly invokes `node_modules/next/dist/bin/next` with a relative path that assumes a specific working directory; you may want to reference the local `next` binary via `npx` or a package.json script to avoid path issues on different environments.
- The newly added dependency `"@modelcontextprotocol/create-server"` in `package.json` does not appear to be used in this diff; if it is not required for these changes, consider removing it to avoid unnecessary dependency bloat.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md:30` </location>
<code_context>
+## Next Steps (Planned)
+- Instrument per-stage timings; enable FP16/TF32, cuDNN benchmark, channels‑last
+- Prewarm kernels; cache embeddings; stream sentence chunks via `StreamingResponse`
+- Frontend consume streaming audio; reduce sample rate to 22.05 kHz mono for speed
+
+## Notes
</code_context>
<issue_to_address>
**nitpick (typo):** Consider fixing the verb form in "Frontend consume streaming audio" for grammatical correctness.
You could rephrase that bullet to something like “Frontend will consume streaming audio” or “Frontend consumes streaming audio” so the sentence reads grammatically correctly.
```suggestion
- Frontend consumes streaming audio; reduce sample rate to 22.05 kHz mono for speed
```
</issue_to_address>
Actionable comments posted: 2
🧹 Nitpick comments (4)
.gitignore (2)
2-4: Remove redundant specific path pattern.
Line 4 is redundant since line 3 (`**/.venv-*/`) already matches the specific virtualenv path. Hardcoded repository-specific paths in `.gitignore` are generally discouraged; use generic patterns instead.
Apply this diff:
 **/.venv/
 **/.venv-*/
-Ultimate_Voice_Bridge/backend/.venv-fa2-313/
11-13: Consider more specific environment file patterns.
The pattern `**/.env.*` on line 13 may be too broad: it will match commonly committed files like `.env.example` or `.env.template`. Consider explicit patterns like `**/.env.development`, `**/.env.production`, etc., or document that templates should use a different naming convention.
ecosystem.config.js (2)
10-10: Document Windows-specific configuration.
The `windowsHide: true` option on lines 10, 27, and 39 is Windows-specific. Consider documenting this in the README or adding a comment in the config to note that this configuration is tailored for Windows environments, which aligns with the PR's noted RTX 5090/Windows setup.
Also applies to: 27-27, 39-39
23-24: Simplify frontend script invocation.
Since `script` is already set to `node`, the `args` should point to the Next.js binary without redundancy. Consider using a more standard approach like running via npm script.
Apply this diff:
-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: 'npm',
+      args: 'run dev -- -p 3000',
Or, if you prefer to invoke the local Next.js binary directly:
-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: './Ultimate_Voice_Bridge/frontend/node_modules/.bin/next',
+      args: 'dev -p 3000',
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.gitignore` (1 hunks)
- `docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md` (1 hunks)
- `ecosystem.config.js` (2 hunks)
- `package.json` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Sourcery review
🔇 Additional comments (1)
docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md (1)
1-34: Clear and well-structured development documentation.
The documentation provides a comprehensive overview of the GPU latency optimization work, including rationale, changes, verification steps, and planned next steps. This is valuable context for the team.



UVB GPU Latency Optimization (Flash Attention 2)
Summary
Changes
- `ecosystem.config.js`: run backend via FA2 venv python; `windowsHide=true`; `watch=false`; `PYTHONIOENCODING=utf-8`
- `backend/main.py`: reloader off; ASCII logs; health verification
- `services/stt_service.py`: Faster-Whisper on CUDA with safe fallbacks and 16k mono conversion
- …`node`; `windowsHide=true`; disabled prefetch on voice route (avoid aborts)
- `.env`: `WHISPER_DEVICE=cuda`, `RELOAD=false`, `BACKEND_HOST=127.0.0.1`, `BACKEND_PORT=8001`
GPU/FA2
- `backend/.venv-fa2-313` with `torch 2.9.1+cu130` and `flash_attn 2.8.3`
- …`flash_attention_2` on CUDA at load; large model path honored if present
Verification
- `/health` shows `stt/tts/llm/gpu_acceleration/vibevoice_gpu` healthy
Next Steps (Planned)
- …`StreamingResponse`
Notes
- `.gitignore` excludes `.env` and local artifacts
Summary by Sourcery
Optimize PM2 process configuration for GPU-accelerated TTS/STT workflow and document the new FA2-based setup.
Enhancements:
Build:
Documentation:
Summary by CodeRabbit
Performance Improvements
Chores