
GPU TTS/STT latency optimization checkpoint (FA2 enabled, PM2 stabilized) #2

Open
TacImpulse wants to merge 1 commit into master from uvb-fa2-baseline-20251128

Conversation

TacImpulse (Owner) commented Nov 28, 2025

UVB GPU Latency Optimization (Flash Attention 2)

Summary

  • Enabled GPU TTS/STT end-to-end, pinned backend to FA2 venv
  • Reduced single-voice response latency to roughly one third of the previous value (~52 s → under 18 s)
  • Stabilized Windows console behavior (UTF-8, no emojis) and PM2 config

Changes

  • Backend
    • ecosystem.config.js: run backend via FA2 venv python; windowsHide=true; watch=false; PYTHONIOENCODING=utf-8
    • backend/main.py: reloader off; ASCII logs; health verification
    • services/stt_service.py: Faster-Whisper on CUDA with safe fallbacks and 16k mono conversion
  • Frontend
    • Next dev launch via node; windowsHide=true; disabled prefetch on voice route (avoid aborts)
  • Environment
    • .env: WHISPER_DEVICE=cuda, RELOAD=false, BACKEND_HOST=127.0.0.1, BACKEND_PORT=8001
    • Kept ONNX accel off to prioritize FA2
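
The "safe fallbacks" for STT can be pictured roughly as follows. This is a minimal sketch, not the code in services/stt_service.py: the candidate order, the helper names `candidate_configs` and `load_with_fallback`, and the CPU default when WHISPER_DEVICE is unset are all assumptions made for illustration.

```python
# Hypothetical preference order: (device, compute_type) pairs, fastest first.
CUDA_CANDIDATES = [("cuda", "float16"), ("cuda", "int8_float16")]
CPU_CANDIDATES = [("cpu", "int8")]


def candidate_configs(env):
    """Order the configs to try, honoring WHISPER_DEVICE from the environment."""
    wanted = env.get("WHISPER_DEVICE", "cpu").strip().lower()
    if wanted == "cuda":
        # Try the GPU first, but keep a CPU config as the last resort.
        return CUDA_CANDIDATES + CPU_CANDIDATES
    return list(CPU_CANDIDATES)


def load_with_fallback(loader, configs):
    """Try each (device, compute_type) in order; return the first model that loads."""
    errors = []
    for device, compute_type in configs:
        try:
            return loader(device, compute_type), (device, compute_type)
        except Exception as exc:  # e.g. missing CUDA libraries or unsupported dtype
            errors.append(f"{device}/{compute_type}: {exc}")
    raise RuntimeError("no STT configuration could be loaded: " + "; ".join(errors))
```

With faster-whisper installed, the loader could be something like `lambda d, c: WhisperModel("small", device=d, compute_type=c)` called with `candidate_configs(os.environ)`; the model size "small" is only a placeholder.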

GPU/FA2

  • Venv: backend/.venv-fa2-313 with torch 2.9.1+cu130 and flash_attn 2.8.3
  • VibeVoice selects flash_attention_2 on CUDA at load; large model path honored if present
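
A rough sketch of the attention-backend choice described above. The VibeVoice loader itself is not part of this PR, so the function names and the "sdpa" fallback are assumptions; "flash_attention_2" and "sdpa" are the values Hugging Face Transformers accepts for `attn_implementation`.

```python
import importlib.util


def pick_attn_implementation(cuda_available, flash_attn_installed):
    """Use Flash Attention 2 only when both CUDA and the flash_attn package are present."""
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    # Assumed fallback: PyTorch's built-in scaled-dot-product attention.
    return "sdpa"


def detect_environment():
    """Probe the running interpreter; torch is imported only if it is installed."""
    flash_ok = importlib.util.find_spec("flash_attn") is not None
    cuda_ok = False
    if importlib.util.find_spec("torch") is not None:
        import torch
        cuda_ok = torch.cuda.is_available()
    return cuda_ok, flash_ok
```

Calling `pick_attn_implementation(*detect_environment())` inside the FA2 venv would be expected to yield "flash_attention_2"; in a venv without flash_attn it falls back instead of failing at load time.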

Verification

  • Health: /health shows stt/tts/llm/gpu_acceleration/vibevoice_gpu healthy
  • Voice chat returns WAV; clone test produces WAV; multi‑speaker works
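
The two checks above could be scripted along these lines. The shape of the /health payload (each component name mapped to the string "healthy") is an assumption, since the PR only lists the component names; the WAV check uses the standard-library `wave` module.

```python
import io
import wave

# Component names taken from the Health bullet above.
EXPECTED_COMPONENTS = ("stt", "tts", "llm", "gpu_acceleration", "vibevoice_gpu")


def unhealthy_components(payload):
    """Return expected components that are missing or not reported as healthy."""
    return [name for name in EXPECTED_COMPONENTS if payload.get(name) != "healthy"]


def wav_summary(data):
    """Parse WAV bytes; wave.open raises wave.Error if the header is not valid RIFF/WAVE."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "frames": wav.getnframes(),
        }
```

An empty list from `unhealthy_components` plus a successful `wav_summary` on the voice-chat response would correspond to the verification described here.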

Next Steps (Planned)

  • Instrument per-stage timings; enable FP16/TF32, cuDNN benchmark, channels‑last
  • Prewarm kernels; cache embeddings; stream sentence chunks via StreamingResponse
  • Frontend: consume streaming audio; reduce sample rate to 22.05 kHz mono for speed
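
Two of the planned items, per-stage timings and sentence-level chunking, might look roughly like this. It is a sketch under assumptions: the helper names `stage` and `sentence_chunks` are hypothetical, and the sentence splitter is a simple punctuation-based regex rather than anything the backend currently uses.

```python
import re
import time
from contextlib import contextmanager


@contextmanager
def stage(timings, name):
    """Accumulate wall-clock seconds spent inside the block under timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def sentence_chunks(text):
    """Yield sentences one at a time so synthesis can start before the full reply is ready."""
    # Split on whitespace that follows '.', '!' or '?'; the punctuation stays with its sentence.
    for part in re.split(r"(?<=[.!?])\s+", text.strip()):
        if part:
            yield part
```

Wrapping the STT, LLM and TTS calls in `with stage(timings, "stt"):` style blocks would show where the remaining ~18 s goes. A generator that synthesizes each chunk from `sentence_chunks` and yields the audio bytes could then be passed to a FastAPI/Starlette-style `StreamingResponse`.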

Notes

  • Secrets are not committed; .gitignore excludes .env and local artifacts
  • All work performed on Windows with RTX 5090 (Blackwell) GPU

Summary by Sourcery

Optimize PM2 process configuration for GPU-accelerated TTS/STT workflow and document the new FA2-based setup.

Enhancements:

  • Update PM2 ecosystem config to run the backend with the FA2 Python virtualenv, hide Windows consoles, and ensure UTF-8 output.
  • Adjust frontend dev process to run Next directly via Node with PM2, disabling file watching in PM2 for more stable dev sessions.

Build:

  • Simplify the dev script to start PM2 without watch mode and add a new runtime dependency for the MCP create-server package.

Documentation:

  • Add developer notes documenting the FA2 GPU latency optimization setup, environment expectations, and verification steps.

Summary by CodeRabbit

  • Performance Improvements

    • Significantly reduced voice processing latency through GPU acceleration improvements, decreasing response times from approximately 52 seconds to under 18 seconds.
  • Chores

    • Updated development infrastructure and process management configuration.
    • Added new server dependency to support enhanced functionality.



netlify Bot commented Nov 28, 2025

Deploy Preview for uvb failed.

  • Latest commit: aca65ef
  • Latest deploy log: https://app.netlify.com/projects/uvb/deploys/6929a77a85bf010008a40c55

sourcery-ai Bot commented Nov 28, 2025

Reviewer's Guide

Configures PM2 and environment to run the backend from a specific Flash Attention 2 Python virtualenv with GPU-enabled STT/TTS, stabilizes Windows/console behavior, and adds documentation for the GPU latency optimization work.

Sequence diagram for GPU-accelerated voice request with FA2 backend

sequenceDiagram
  actor User
  participant Browser
  participant FrontendDev as Frontend_next_dev
  participant Backend as Backend_main
  participant STT as STT_service_FasterWhisper
  participant TTS as TTS_service_VibeVoice_FA2
  participant GPU as CUDA_GPU

  User ->> Browser: Start voice chat
  Browser ->> FrontendDev: POST /api/voice with audio
  FrontendDev ->> Backend: HTTP request /voice with WAV

  Backend ->> STT: Transcribe(audio_wav)
  STT ->> GPU: Run_FasterWhisper_cuda(16k_mono)
  GPU -->> STT: Text_transcript
  STT -->> Backend: Transcript

  Backend ->> TTS: Synthesize(transcript)
  TTS ->> GPU: VibeVoice_flash_attention_2_inference
  GPU -->> TTS: Generated_audio
  TTS -->> Backend: WAV_response

  Backend -->> FrontendDev: WAV audio
  FrontendDev -->> Browser: Stream/return WAV
  Browser -->> User: Play low_latency_audio
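
The backend portion of the diagram reduces to a two-step pipeline. The sketch below only mirrors the message flow shown above; `handle_voice_request` and its callable parameters are hypothetical names, not functions from backend/main.py, and the empty-transcript guard is an added assumption.

```python
def handle_voice_request(audio_wav, transcribe, synthesize):
    """Run STT then TTS, as in the diagram: WAV in, transcript, WAV out."""
    transcript = transcribe(audio_wav)      # Backend ->> STT ->> GPU
    if not transcript.strip():
        # Assumed guard: nothing was recognized, so there is nothing to synthesize.
        raise ValueError("empty transcript")
    return synthesize(transcript)           # Backend ->> TTS ->> GPU
```

Passing the STT and TTS services in as callables keeps the handler testable without a GPU.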

File-Level Changes

Change: Pin PM2 backend and frontend processes to stable, non-watching commands suitable for Windows and FA2 venv usage.
Details:
  • Run voice-backend via the FA2 virtualenv python executable instead of generic python.
  • Set windowsHide: true for all PM2 apps to avoid visible console windows on Windows.
  • Disable PM2 watch for the voice-backend and voice-frontend apps to prevent unnecessary restarts during development.
  • Run the Next.js dev server directly with node and the Next CLI entrypoint instead of cmd /c npm run dev.
  • Set PYTHONIOENCODING=utf-8 in the backend environment to enforce UTF-8 console encoding.
Files: ecosystem.config.js

Change: Adjust PM2-related npm scripts and add a new dependency.
Details:
  • Remove --watch from the pm2 start dev script to avoid auto-reload behavior.
  • Add @modelcontextprotocol/create-server as a new runtime dependency.
Files: package.json

Change: Document the GPU latency optimization setup and behavior for FA2, TTS/STT, and environment configuration.
Details:
  • Add a developer notes markdown file describing the FA2 virtualenv, CUDA/torch/flash-attn versions, and VibeVoice configuration.
  • Document environment variable expectations for GPU/STT/TTS and backend host/port.
  • Record verification steps and planned next steps for further latency optimizations.
  • Clarify that secrets are not committed and .env remains ignored.
Files: docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md


vercel Bot commented Nov 28, 2025

The latest updates on your projects.

  • Project: uvb-knight-tac
  • Deployment: Ready
  • Updated (UTC): Nov 28, 2025 1:45pm

coderabbitai Bot commented Nov 28, 2025

Walkthrough

The pull request consolidates development configuration files and adds optimization documentation. Changes include simplifying .gitignore patterns, updating PM2 ecosystem configuration with explicit binary paths and Windows compatibility settings, adjusting package.json dev scripts, and adding documentation for GPU latency optimization with Flash Attention 2.

Changes

Cohort: Configuration simplification
File(s): .gitignore
Summary: Rewrites ignore rules: removes broad multi-language blocks; consolidates Python venv patterns with targeted .venv paths; replaces extensive Node.js ignores with minimal .pm2/ entry; standardizes environment file patterns to **/.env*; adds audio file ignores (**/*.wav, **/*.mp3, **/*.ogg); reduces OS/IDE artifacts to *.DS_Store.

Cohort: PM2 ecosystem updates
File(s): ecosystem.config.js
Summary: Updates voice-backend with explicit Windows Python path, adds windowsHide: true, sets PYTHONIOENCODING: utf-8; replaces voice-frontend script from cmd to node with Next.js dev args, adds windowsHide: true and watch: false; adds new memory-playground app entry with python script and windowsHide: true.

Cohort: Dev workflow adjustments
File(s): package.json
Summary: Removes --watch flag from dev script (now: pm2 start ecosystem.config.js); adds new dependency @modelcontextprotocol/create-server@^0.3.1.

Cohort: Optimization documentation
File(s): docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md
Summary: Adds comprehensive guide documenting GPU latency optimization workflow using Flash Attention 2, covering setup steps, configuration changes across backend/frontend/environment, verification procedures, and planned next steps for further optimization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • ecosystem.config.js: Verify all three app entries are correctly configured; confirm Windows path handling and new environment variables don't introduce portability issues
  • package.json: Confirm removal of --watch flag aligns with updated ecosystem.config.js and doesn't break dev workflow

Poem

🐰 Configuration files refined with care,
The watches removed, the paths laid bare,
Flash Attention quickens the GPU's stride,
From fifty seconds—eighteen, my, what a ride!
Windows plays nice, the frontend springs to life,
All optimized, with minimal strife.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically summarizes the main changes: GPU TTS/STT latency optimization using Flash Attention 2 and PM2 stabilization, which aligns with the core objectives of the PR.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@TacImpulse TacImpulse added this to the Latency Reduction milestone Nov 28, 2025

@sourcery-ai sourcery-ai Bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The hard-coded absolute path to the FA2 virtualenv Python executable in ecosystem.config.js (F:/Git/...) makes the setup non-portable; consider using a relative path, environment variable, or PM2 interpreter/interpreter_args to select the venv dynamically.
  • The Next dev script in ecosystem.config.js directly invokes node_modules/next/dist/bin/next with a relative path that assumes a specific working directory; you may want to reference the local next binary via npx or a package.json script to avoid path issues on different environments.
  • The newly added dependency "@modelcontextprotocol/create-server" in package.json does not appear to be used in this diff; if it is not required for these changes, consider removing it to avoid unnecessary dependency bloat.
## Individual Comments

### Comment 1
<location> `docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md:30` </location>
<code_context>
+## Next Steps (Planned)
+- Instrument per-stage timings; enable FP16/TF32, cuDNN benchmark, channels‑last
+- Prewarm kernels; cache embeddings; stream sentence chunks via `StreamingResponse`
+- Frontend consume streaming audio; reduce sample rate to 22.05 kHz mono for speed
+
+## Notes
</code_context>

<issue_to_address>
**nitpick (typo):** Consider fixing the verb form in "Frontend consume streaming audio" for grammatical correctness.

You could rephrase that bullet to something like “Frontend will consume streaming audio” or “Frontend consumes streaming audio” so the sentence reads grammatically correctly.

```suggestion
- Frontend consumes streaming audio; reduce sample rate to 22.05 kHz mono for speed
```
</issue_to_address>


Comment thread docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md
TacImpulse added the performance, gpu, audio, tts, stt, streaming (Streaming audio/response work), and windows (Windows-specific behavior and fixes) labels Nov 28, 2025

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
.gitignore (2)

2-4: Remove redundant specific path pattern.

Line 4 is redundant since line 3 (**/.venv-*/) already matches the specific virtualenv path. Hardcoded repository-specific paths in .gitignore are generally discouraged—use generic patterns instead.

Apply this diff:

 **/.venv/
 **/.venv-*/
-Ultimate_Voice_Bridge/backend/.venv-fa2-313/

11-13: Consider more specific environment file patterns.

The pattern **/.env.* on line 13 may be too broad—it will match commonly-committed files like .env.example or .env.template. Consider explicit patterns like **/.env.development, **/.env.production, etc., or document that templates should use a different naming convention.
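
The concern can be checked with a small approximation of gitignore matching. This sketch assumes basename-only patterns (the `**/` prefix is dropped) and uses `fnmatchcase` as a stand-in for git's glob rules; it applies git's documented "last matching pattern wins" behavior, including `!` negation.

```python
from fnmatch import fnmatchcase


def is_ignored(basename, patterns):
    """Approximate gitignore for basenames: the last matching pattern decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(basename, glob):
            ignored = not negated
    return ignored


BROAD = [".env", ".env.*"]
WITH_EXCEPTION = BROAD + ["!.env.example"]
```

Under these assumptions `is_ignored(".env.example", BROAD)` is true, which is the reviewer's point, while adding the negated entry re-includes the template and still ignores `.env.production`.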

ecosystem.config.js (2)

10-10: Document Windows-specific configuration.

The windowsHide: true option on lines 10, 27, and 39 is Windows-specific. Consider documenting this in the README or adding a comment in the config to note that this configuration is tailored for Windows environments, which aligns with the PR's noted RTX 5090/Windows setup.

Also applies to: 27-27, 39-39


23-24: Simplify frontend script invocation.

Since script is already set to node, the args should point to the Next.js binary without redundancy. Consider using a more standard approach like running via npm script.

Apply this diff:

-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: 'npm',
+      args: 'run dev -- -p 3000',

Or if you prefer the direct node approach:

-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: './Ultimate_Voice_Bridge/frontend/node_modules/.bin/next',
+      args: 'dev -p 3000',
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70a4b9f and aca65ef.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md (1 hunks)
  • ecosystem.config.js (2 hunks)
  • package.json (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (1)
docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md (1)

1-34: Clear and well-structured development documentation.

The documentation provides a comprehensive overview of the GPU latency optimization work, including rationale, changes, verification steps, and planned next steps. This is valuable context for the team.

Comment thread ecosystem.config.js
Comment thread package.json