GPU TTS/STT latency optimization checkpoint (FA2 enabled, PM2 stabilized) by TacImpulse · Pull Request #2 · TacImpulse/UVB-Knight-Tac

TacImpulse · 2025-11-28T13:45:27Z

UVB GPU Latency Optimization (Flash Attention 2)

Summary

Enabled GPU TTS/STT end-to-end, pinned backend to FA2 venv
Reduced single-voice response latency to roughly one third of the previous value (~52 s → under 18 s)
Stabilized Windows console behavior (UTF-8, no emojis) and PM2 config

Changes

Backend
- ecosystem.config.js: run backend via FA2 venv python; windowsHide=true; watch=false; PYTHONIOENCODING=utf-8
- backend/main.py: reloader off; ASCII logs; health verification
- services/stt_service.py: Faster-Whisper on CUDA with safe fallbacks and 16k mono conversion
Frontend
- Next dev launch via node; windowsHide=true; disabled prefetch on voice route (avoid aborts)
Environment
- .env: WHISPER_DEVICE=cuda, RELOAD=false, BACKEND_HOST=127.0.0.1, BACKEND_PORT=8001
- Kept ONNX accel off to prioritize FA2
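
The "safe fallbacks" in `services/stt_service.py` are not shown in this PR. The following is a minimal sketch of how such a chain could work, not the author's implementation. It assumes a hypothetical `load_model(device, compute_type)` callable that raises on failure (for Faster-Whisper this could wrap the model constructor). The candidate order and compute types are illustrative assumptions.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical preference order: GPU half precision first, CPU int8 last.
DEFAULT_CANDIDATES = (
    ("cuda", "float16"),
    ("cuda", "int8_float16"),
    ("cpu", "int8"),
)


def candidates_for(device: str) -> Tuple[Tuple[str, str], ...]:
    """Return the (device, compute_type) pairs to try for the requested device."""
    if device == "cuda":
        return DEFAULT_CANDIDATES
    # Any other value: skip the CUDA attempts entirely.
    return tuple(c for c in DEFAULT_CANDIDATES if c[0] == "cpu")


def load_with_fallback(
    load_model: Callable[[str, str], T],
    candidates: Iterable[Tuple[str, str]],
) -> Tuple[T, str, str]:
    """Try each candidate in order and return the first model that loads."""
    errors = []
    for device, compute_type in candidates:
        try:
            return load_model(device, compute_type), device, compute_type
        except Exception as exc:  # broad on purpose: driver/runtime errors vary
            errors.append(f"{device}/{compute_type}: {exc}")
    raise RuntimeError("no STT backend could be loaded: " + "; ".join(errors))
```

With `WHISPER_DEVICE=cuda`, a caller would pass `candidates_for(os.environ.get("WHISPER_DEVICE", "cuda"))` and log which pair was finally selected.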

GPU/FA2

Venv: backend/.venv-fa2-313 with torch 2.9.1+cu130 and flash_attn 2.8.3
VibeVoice selects flash_attention_2 on CUDA at load; large model path honored if present
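
How VibeVoice makes this choice internally is not shown here. As a rough sketch, assuming the model loader accepts a Transformers-style `attn_implementation` string, the decision could look like the code below. The function names and the `"sdpa"` fallback are assumptions for illustration.

```python
import importlib.util
from typing import Tuple


def pick_attn_implementation(cuda_available: bool, flash_attn_installed: bool) -> str:
    """Use Flash Attention 2 only when both CUDA and the flash_attn package are usable."""
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"  # assumed fallback; the real code may choose differently


def detect_environment() -> Tuple[bool, bool]:
    """Probe the running interpreter without failing when packages are missing."""
    flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the sketch runs without torch

        cuda_available = torch.cuda.is_available()
    return cuda_available, flash_attn_installed
```

Calling `pick_attn_implementation(*detect_environment())` once at startup and logging the result makes it easy to confirm that the FA2 venv is actually the one running.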

Verification

Health: /health shows stt/tts/llm/gpu_acceleration/vibevoice_gpu healthy
Voice chat returns WAV; clone test produces WAV; multi‑speaker works
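
The health verification could be scripted along these lines. This is a sketch only: the component names come from the line above, but the JSON layout (`{"components": {name: "healthy"}}`) is an assumption, since the actual `/health` response body is not shown in this PR.

```python
import json
import urllib.request
from typing import List

# Component names from the Verification notes; the payload shape is assumed.
EXPECTED = ("stt", "tts", "llm", "gpu_acceleration", "vibevoice_gpu")


def unhealthy_components(payload: dict) -> List[str]:
    """Return the expected components that are missing or not reported healthy."""
    components = payload.get("components", {})
    return [name for name in EXPECTED if components.get(name) != "healthy"]


def check_health(url: str = "http://127.0.0.1:8001/health", timeout: float = 5.0) -> List[str]:
    """Fetch the health endpoint and list any unhealthy components."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return unhealthy_components(json.load(response))
```

An empty list from `check_health()` would correspond to the "all healthy" state described above. The host and port match the `.env` values listed under Changes.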

Next Steps (Planned)

Instrument per-stage timings; enable FP16/TF32, cuDNN benchmark, channels‑last
Prewarm kernels; cache embeddings; stream sentence chunks via StreamingResponse
Frontend: consume streaming audio; reduce sample rate to 22.05 kHz mono for speed
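
Two of these planned items, per-stage timings and sentence-level chunking, can be sketched with the standard library alone. `StageTimer` and `sentence_chunks` are hypothetical names, and the sentence splitter is deliberately naive (it splits on whitespace after `.`, `!`, or `?`).

```python
import re
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(text: str) -> Iterator[str]:
    """Yield sentences one at a time so synthesis can start on the first one."""
    for part in _SENTENCE_END.split(text.strip()):
        if part:
            yield part
```

A handler could wrap each step in `with timer.stage("stt"):` or `with timer.stage("tts"):` and log `timer.seconds` per request. A generator that synthesizes one chunk per sentence from `sentence_chunks(...)` is the kind of iterator FastAPI's `StreamingResponse` accepts.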

Notes

Secrets are not committed; .gitignore excludes .env and local artifacts
All work performed on Windows with RTX 5090 (Blackwell) GPU

Summary by Sourcery

Optimize PM2 process configuration for GPU-accelerated TTS/STT workflow and document the new FA2-based setup.

Enhancements:

Update PM2 ecosystem config to run the backend with the FA2 Python virtualenv, hide Windows consoles, and ensure UTF-8 output.
Adjust frontend dev process to run Next directly via Node with PM2, disabling file watching in PM2 for more stable dev sessions.

Build:

Simplify the dev script to start PM2 without watch mode and add a new runtime dependency for the MCP create-server package.

Documentation:

Add developer notes documenting the FA2 GPU latency optimization setup, environment expectations, and verification steps.

Summary by CodeRabbit

Performance Improvements
- Significantly reduced voice processing latency through GPU acceleration improvements, decreasing response times from approximately 52 seconds to under 18 seconds.
Chores
- Updated development infrastructure and process management configuration.
- Added new server dependency to support enhanced functionality.

netlify · 2025-11-28T13:45:32Z

❌ Deploy Preview for uvb failed.

Name	Link
🔨 Latest commit	`aca65ef`
🔍 Latest deploy log	https://app.netlify.com/projects/uvb/deploys/6929a77a85bf010008a40c55

sourcery-ai · 2025-11-28T13:45:33Z

Reviewer's Guide

Configures PM2 and environment to run the backend from a specific Flash Attention 2 Python virtualenv with GPU-enabled STT/TTS, stabilizes Windows/console behavior, and adds documentation for the GPU latency optimization work.

Sequence diagram for GPU-accelerated voice request with FA2 backend

sequenceDiagram
  actor User
  participant Browser
  participant FrontendDev as Frontend_next_dev
  participant Backend as Backend_main
  participant STT as STT_service_FasterWhisper
  participant TTS as TTS_service_VibeVoice_FA2
  participant GPU as CUDA_GPU

  User ->> Browser: Start voice chat
  Browser ->> FrontendDev: POST /api/voice with audio
  FrontendDev ->> Backend: HTTP request /voice with WAV

  Backend ->> STT: Transcribe(audio_wav)
  STT ->> GPU: Run_FasterWhisper_cuda(16k_mono)
  GPU -->> STT: Text_transcript
  STT -->> Backend: Transcript

  Backend ->> TTS: Synthesize(transcript)
  TTS ->> GPU: VibeVoice_flash_attention_2_inference
  GPU -->> TTS: Generated_audio
  TTS -->> Backend: WAV_response

  Backend -->> FrontendDev: WAV audio
  FrontendDev -->> Browser: Stream/return WAV
  Browser -->> User: Play low_latency_audio
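
The request flow in the diagram can be mirrored as a small orchestration sketch. The stages are injected as callables, so nothing here depends on the real STT or TTS services. `VoicePipeline` and its field names are hypothetical, and the sketch follows the diagram literally (the transcript goes straight to synthesis).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]  # WAV bytes -> transcript
    synthesize: Callable[[str], bytes]  # text -> WAV bytes

    def handle(self, audio_wav: bytes) -> bytes:
        """Run STT, then TTS, and return the synthesized WAV bytes."""
        transcript = self.transcribe(audio_wav)
        if not transcript.strip():
            # Nothing to synthesize; let the caller map this to an HTTP error.
            raise ValueError("empty transcript")
        return self.synthesize(transcript)
```

Injecting the two stages keeps the handler testable with stubs and leaves room to insert additional steps between transcription and synthesis.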

File-Level Changes

Change	Details	Files
Pin PM2 backend and frontend processes to stable, non-watching commands suitable for Windows and FA2 venv usage.	Run voice-backend via the FA2 virtualenv python executable instead of generic `python`. Set `windowsHide: true` for all PM2 apps to avoid visible console windows on Windows. Disable PM2 `watch` for the voice-backend and voice-frontend apps to prevent unnecessary restarts during development. Run the Next.js dev server directly with `node` and the Next CLI entrypoint instead of `cmd /c npm run dev`. Set `PYTHONIOENCODING=utf-8` in the backend environment to enforce UTF-8 console encoding.	`ecosystem.config.js`
Adjust PM2-related npm scripts and add a new dependency.	Remove `--watch` from the `pm2 start` dev script to avoid auto-reload behavior. Add `@modelcontextprotocol/create-server` as a new runtime dependency.	`package.json`
Document the GPU latency optimization setup and behavior for FA2, TTS/STT, and environment configuration.	Add a developer notes markdown file describing the FA2 virtualenv, CUDA/torch/flash-attn versions, and VibeVoice configuration. Document environment variable expectations for GPU/STT/TTS and backend host/port. Record verification steps and planned next steps for further latency optimizations. Clarify that secrets are not committed and `.env` remains ignored.	`docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md`

vercel · 2025-11-28T13:45:34Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
uvb-knight-tac	Ready	Preview	Comment	Nov 28, 2025 1:45pm

coderabbitai · 2025-11-28T13:45:35Z

Walkthrough

The pull request consolidates development configuration files and adds optimization documentation. Changes include simplifying .gitignore patterns, updating PM2 ecosystem configuration with explicit binary paths and Windows compatibility settings, adjusting package.json dev scripts, and adding documentation for GPU latency optimization with Flash Attention 2.

Changes

Cohort / File(s)	Summary
Configuration simplification `.gitignore`	Rewrites ignore rules: removes broad multi-language blocks; consolidates Python venv patterns with targeted `.venv` paths; replaces extensive Node.js ignores with minimal `.pm2/` entry; standardizes environment file patterns to `*/.env`; adds audio file ignores (`*/.wav`, `*/.mp3`, `*/.ogg`); reduces OS/IDE artifacts to `*.DS_Store`.
PM2 ecosystem updates `ecosystem.config.js`	Updates voice-backend with explicit Windows Python path, adds `windowsHide: true`, sets `PYTHONIOENCODING: utf-8`; replaces voice-frontend script from `cmd` to `node` with Next.js dev args, adds `windowsHide: true` and `watch: false`; adds new memory-playground app entry with `python` script and `windowsHide: true`.
Dev workflow adjustments `package.json`	Removes `--watch` flag from dev script (now: `pm2 start ecosystem.config.js`); adds new dependency `@modelcontextprotocol/create-server@^0.3.1`.
Optimization documentation `docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md`	Adds comprehensive guide documenting GPU latency optimization workflow using Flash Attention 2, covering setup steps, configuration changes across backend/frontend/environment, verification procedures, and planned next steps for further optimization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

ecosystem.config.js: Verify all three app entries are correctly configured; confirm Windows path handling and new environment variables don't introduce portability issues
package.json: Confirm removal of --watch flag aligns with updated ecosystem.config.js and doesn't break dev workflow

Poem

🐰 Configuration files refined with care,
The watches removed, the paths laid bare,
Flash Attention quickens the GPU's stride,
From fifty seconds—eighteen, my, what a ride!
Windows plays nice, the frontend springs to life,
All optimized, with minimal strife. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: GPU TTS/STT latency optimization using Flash Attention 2 and PM2 stabilization, which aligns with the core objectives of the PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

sonarqubecloud · 2025-11-28T13:46:03Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The hard-coded absolute path to the FA2 virtualenv Python executable in ecosystem.config.js (F:/Git/...) makes the setup non-portable; consider using a relative path, environment variable, or PM2 interpreter/interpreter_args to select the venv dynamically.
The Next dev script in ecosystem.config.js directly invokes node_modules/next/dist/bin/next with a relative path that assumes a specific working directory; you may want to reference the local next binary via npx or a package.json script to avoid path issues on different environments.
The newly added dependency "@modelcontextprotocol/create-server" in package.json does not appear to be used in this diff; if it is not required for these changes, consider removing it to avoid unnecessary dependency bloat.
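
The first point, avoiding a hard-coded interpreter path, can be illustrated with a short Python sketch (for example in a launcher script). In `ecosystem.config.js` the equivalent would be built from `process.env` and `path.join`. The `UVB_VENV` variable name and the default location relative to the repository root are assumptions, not part of this PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_venv_python(
    repo_root: Path,
    env: Optional[Mapping[str, str]] = None,
    windows: Optional[bool] = None,
) -> Path:
    """Build the venv interpreter path from the repo root and an optional override."""
    env = os.environ if env is None else env
    windows = (os.name == "nt") if windows is None else windows
    # UVB_VENV is a hypothetical override; the default mirrors the venv name in the PR.
    venv_dir = Path(env.get("UVB_VENV", "backend/.venv-fa2-313"))
    if not venv_dir.is_absolute():
        venv_dir = repo_root / venv_dir
    # Virtualenv layout differs between Windows and POSIX.
    relative = Path("Scripts") / "python.exe" if windows else Path("bin") / "python"
    return venv_dir / relative
```

Resolving the path at startup keeps the configuration portable across checkouts on different drives while still pinning the backend to the FA2 environment.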

## Individual Comments

### Comment 1
<location> `docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md:30` </location>
<code_context>
+## Next Steps (Planned)
+- Instrument per-stage timings; enable FP16/TF32, cuDNN benchmark, channels‑last
+- Prewarm kernels; cache embeddings; stream sentence chunks via `StreamingResponse`
+- Frontend consume streaming audio; reduce sample rate to 22.05 kHz mono for speed
+
+## Notes
</code_context>

<issue_to_address>
**nitpick (typo):** Consider fixing the verb form in "Frontend consume streaming audio" for grammatical correctness.

You could rephrase that bullet to something like “Frontend will consume streaming audio” or “Frontend consumes streaming audio” so the sentence reads grammatically correctly.

```suggestion
- Frontend consumes streaming audio; reduce sample rate to 22.05 kHz mono for speed
```
</issue_to_address>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

.gitignore (2)
2-4: Remove redundant specific path pattern.

Line 4 is redundant since line 3 (**/.venv-*/) already matches the specific virtualenv path. Hardcoded repository-specific paths in .gitignore are generally discouraged—use generic patterns instead.

Apply this diff:
 **/.venv/
 **/.venv-*/
-Ultimate_Voice_Bridge/backend/.venv-fa2-313/
11-13: Consider more specific environment file patterns.

The pattern **/.env.* on line 13 may be too broad—it will match commonly-committed files like .env.example or .env.template. Consider explicit patterns like **/.env.development, **/.env.production, etc., or document that templates should use a different naming convention.
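
The breadth of the `.env.*` glob is easy to check. The sketch below uses Python's `fnmatch`, which only approximates gitignore matching but behaves the same way for a simple basename pattern like this one. The file names are examples.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

PATTERN = ".env.*"  # basename part of the **/.env.* rule


def matched(names: Iterable[str], pattern: str = PATTERN) -> List[str]:
    """Return the names the glob would match (and therefore ignore)."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Template files are caught alongside real environment files.
    print(matched([".env", ".env.local", ".env.example", ".env.template"]))
```

Both template names match, which is the concern raised above. Besides explicit per-environment patterns, a gitignore negation rule such as `!.env.example` placed after the broad pattern is another common way to keep templates tracked.
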
ecosystem.config.js (2)
10-10: Document Windows-specific configuration.

The windowsHide: true option on lines 10, 27, and 39 is Windows-specific. Consider documenting this in the README or adding a comment in the config to note that this configuration is tailored for Windows environments, which aligns with the PR's noted RTX 5090/Windows setup.

Also applies to: 27-27, 39-39

23-24: Simplify frontend script invocation.

Since script is already set to node, the args should point to the Next.js binary without redundancy. Consider using a more standard approach like running via npm script.

Apply this diff:
-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: 'npm',
+      args: 'run dev -- -p 3000',
Or if you prefer the direct node approach:
-      script: 'node',
-      args: 'node_modules/next/dist/bin/next dev -p 3000',
+      script: './Ultimate_Voice_Bridge/frontend/node_modules/.bin/next',
+      args: 'dev -p 3000',

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70a4b9f and aca65ef.

📒 Files selected for processing (4)

.gitignore (1 hunks)
docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md (1 hunks)
ecosystem.config.js (2 hunks)
package.json (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Sourcery review

🔇 Additional comments (1)

docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md (1)

1-34: Clear and well-structured development documentation.

The documentation provides a comprehensive overview of the GPU latency optimization work, including rationale, changes, verification steps, and planned next steps. This is valuable context for the team.

checkpoint: GPU TTS/STT latency optimizations with FA2, stable PM2 config, docs added (2025-11-28)

aca65ef

TacImpulse added this to the Latency Reduction milestone Nov 28, 2025

sourcery-ai Bot reviewed Nov 28, 2025

Comment thread docs/dev-notes/2025-11-28-uvb-fa2-latency-optimization.md

TacImpulse added the labels performance, gpu, audio, tts, stt, streaming (Streaming audio/response work), and windows (Windows-specific behavior and fixes) Nov 28, 2025

coderabbitai Bot reviewed Nov 28, 2025

Comment thread ecosystem.config.js

Comment thread package.json

TacImpulse wants to merge 1 commit into master from uvb-fa2-baseline-20251128
