feat: support audio modality input, add voice input and voice attachment bubbles #453

Draft
luosc wants to merge 8 commits into Chevey339:master from luosc:feat/gemini-voice-input

luosc commented Apr 6, 2026

Scope

  • OUT OF SCOPE DECLARATION: Transcribing audio and then sending the transcribed text to an LLM is out of scope for this PR. This PR only captures audio and sends it to models that support the audio modality.
  • MODELS TARGETED: Models that accept audio as an input modality. This also includes transcription models such as gpt-4o-transcribe and whisper.

Stage 1: Proof of Concept

  • Add mobile voice input with press-and-hold recording and haptic feedback
  • Add a drag-to-cancel keep zone
  • Add localized voice attachment bubbles that show the voice duration instead of raw file names
  • Model scope: Gemini only
  • Platform scope: mobile
  • Test scope: tested on Android only

Stage 2

  • Support iPadOS (iOS untested) 0b5ccc8
  • Add UI support for macOS d2bec61
  • Linux support 0d1bd3d
  • Support voice playback 895926d
  • Windows support eeee747 (testing delegated to other contributors)

Stage 3

  • Add an audio modality support property to the model manifest abstract classes a6758fc
  • Extend model scope: OpenAI-compatible API for audio modality input (e.g. whisper and gpt-4o-transcribe)

Backlog

  • Voice cleanup strategy: voice files are currently categorized as normal files, but they should belong to "chat history"
  • Add settings entries
    • master switch for the feature
    • maximum recording time (default 1 min)
    • hotkey customization entries for desktop

luosc commented Apr 6, 2026

893cef7 is stable; cherry-picks are welcome.

luosc commented Apr 6, 2026

0b5ccc8 is confirmed to introduce iOS support. Testing on iPadOS passed. It has not been tested on an iPhone.

luosc commented Apr 6, 2026

d2bec61 introduces the desktop logic; testing on macOS passed. On macOS, recording is bound by default to press-and-hold Cmd+Shift+R.

luosc commented Apr 7, 2026

0d1bd3d introduces Linux support; parecord must be installed on the system.
For example, on Ubuntu/Debian:
sudo apt install pulseaudio-utils ffmpeg

luosc commented Apr 7, 2026

895926d introduces playback of recorded voice messages and adds a UI that shows playback progress. It has been tested on Android, macOS, and iPadOS.

luosc changed the title from "feat: add voice input and voice attachment bubbles" to "feat: support audio modality input, add voice input and voice attachment bubbles" on Apr 7, 2026
luosc commented Apr 7, 2026

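@Chevey339 Stage 2 is complete, so this PR can be merged, since Gemini support is now feature-complete. The remaining work, determining whether a model supports the audio modality and extending audio input to other APIs, can be done in a second PR.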

luosc marked this pull request as ready for review April 7, 2026 16:46
luosc marked this pull request as draft April 9, 2026 01:05
luosc commented Apr 9, 2026

While waiting, I'll push this a bit further.

luosc added 7 commits April 9, 2026 21:45
Add Gemini-native mobile voice input with press-and-hold recording, haptic feedback, drag-to-cancel keep zone, and localized voice attachment bubbles that show user voice duration instead of raw file names.

Files changed:
- android/app/src/main/AndroidManifest.xml: add RECORD_AUDIO permission for mobile voice input.
- lib/core/utils/multimodal_input_utils.dart: add Gemini native audio input model detection helpers.
- lib/features/chat/widgets/chat_message_widget.dart: render recorded voice attachments as audio bubbles labeled with localized duration instead of raw file names.
- lib/features/home/controllers/home_page_controller.dart: wire start/stop/cancel voice recording into the shared home controller and expose audio capability checks to the input UI.
- lib/features/home/pages/home_page.dart: pass voice recording state and callbacks into the shared input section.
- lib/features/home/services/message_builder_service.dart: append Gemini audio-input system guidance only when a request contains audio media.
- lib/features/home/services/message_generation_service.dart: gate audio attachments on Gemini native audio support and detect audio media paths for prompt injection.
- lib/features/home/services/voice_input_service.dart: add mobile WAV recording, raise the max recording duration to 1 minute, and rename recorded files with embedded duration metadata.
- lib/features/home/widgets/chat_input_bar.dart: add the press-and-hold mic button between plus/send, haptic feedback, cancel keep zone overlay, animated mic scaling, recording keep zone visuals, and localized voice attachment chips.
- lib/features/home/widgets/chat_input_section.dart: plumb voice-input gating and recording callbacks into the input bar.
- lib/icons/lucide_adapter.dart: expose Lucide.Mic and Lucide.AudioLines for the new voice UI.
- lib/l10n/app_en.arb: add localized voice-input and voice duration display strings.
- lib/l10n/app_localizations.dart: regenerate localization interface after adding voice-input strings.
- lib/l10n/app_localizations_en.dart: regenerate English localization output.
- lib/l10n/app_localizations_zh.dart: regenerate Chinese localization output.
- lib/l10n/app_zh.arb: add matching Simplified Chinese voice-input strings.
- lib/l10n/app_zh_Hans.arb: add matching zh_Hans voice-input strings.
- lib/l10n/app_zh_Hant.arb: add matching Traditional Chinese voice-input strings.
- lib/utils/voice_attachment_utils.dart: add helpers to build and parse recorded voice file names and format mm:ss labels.
- pubspec.yaml: add the record dependency for mobile voice capture.
- test/gemini_audio_input_support_test.dart: cover Gemini audio-capability gating behavior.
- test/voice_attachment_utils_test.dart: cover recorded voice filename metadata parsing and duration formatting.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
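
Illustrative note on the filename helpers mentioned for lib/utils/voice_attachment_utils.dart: the commit says recorded files are renamed with embedded duration metadata and that labels are formatted as mm:ss, but it does not show the naming scheme. The sketch below is written in Python for brevity (the PR itself is Dart) and assumes a hypothetical pattern `voice_<timestamp-ms>_<duration-ms>ms.wav`; the function names are also hypothetical.

```python
import re
from typing import Optional

# Hypothetical naming scheme (not taken from the PR):
#   voice_<unix-timestamp-ms>_<duration-ms>ms.wav
_VOICE_NAME = re.compile(r"^voice_(\d+)_(\d+)ms\.wav$")


def build_voice_file_name(recorded_at_ms: int, duration_ms: int) -> str:
    """Embed the recording duration in the file name."""
    return f"voice_{recorded_at_ms}_{duration_ms}ms.wav"


def parse_voice_duration_ms(file_name: str) -> Optional[int]:
    """Return the embedded duration, or None for ordinary attachments."""
    match = _VOICE_NAME.match(file_name)
    return int(match.group(2)) if match else None


def format_mm_ss(duration_ms: int) -> str:
    """Format a duration as mm:ss, truncating partial seconds."""
    total_seconds = max(0, duration_ms) // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"
```

With a scheme like this, a chat bubble can decide between the voice label and the raw file name by checking whether parsing returns `None`, without reading the audio file.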
… cap keep-zone radius to a fixed value

Enable the iOS microphone permission required for voice recording and keep the recording keep-zone consistently sized on wide layouts. This preserves the existing voice input interaction while fixing iOS permission handling and preventing the solid keep-zone circle from becoming excessively large.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
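
Illustrative note on the capped keep-zone: the commit describes keeping the recording keep-zone consistently sized on wide layouts by capping its radius to a fixed value. A minimal Python sketch of that idea follows (the PR is Dart); the cap of 120 logical pixels, the 0.25 width fraction, and all names are assumptions made for illustration only.

```python
import math
from typing import Tuple

# Hypothetical values; the PR does not state the actual cap or scaling.
MAX_KEEP_ZONE_RADIUS = 120.0
WIDTH_FRACTION = 0.25

Point = Tuple[float, float]


def keep_zone_radius(layout_width: float) -> float:
    """Scale with the layout width, but never exceed the fixed cap."""
    return min(layout_width * WIDTH_FRACTION, MAX_KEEP_ZONE_RADIUS)


def should_keep_recording(pointer: Point, center: Point, layout_width: float) -> bool:
    """True while the finger is inside the keep zone; leaving it cancels."""
    distance = math.hypot(pointer[0] - center[0], pointer[1] - center[1])
    return distance <= keep_zone_radius(layout_width)
```

Without the cap, a tablet-width layout would produce a circle several hundred pixels wide, which is the oversized keep-zone the commit message describes.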
…nd hotkey flow

Add macOS desktop voice input with a popover flow, including countdown start, stop-to-confirm send, and press-and-hold Cmd+Shift+R recording. This adds macOS-native microphone permission handling, the required audio-input entitlements, and proper shortcut event consumption so desktop recording starts reliably without invalid-key system beeps.

Files changed:
- lib/desktop/desktop_home_page.dart: cancel active desktop voice sessions when leaving the chat tab so the IndexedStack-kept page does not retain stale recording UI.
- lib/desktop/hotkeys/chat_action_bus.dart: add a cancelTransientUi chat action for desktop voice popover cleanup.
- lib/features/home/controllers/home_page_controller.dart: cancel desktop voice UI on desktop lifecycle changes and chat action bus cleanup events.
- lib/features/home/pages/home_page.dart: add macOS in-app Cmd+Shift+R press-and-hold handling and consume repeated shortcut events to avoid system invalid-key feedback.
- lib/features/home/services/voice_input_service.dart: enable macOS recording, switch macOS microphone permission checks to the record plugin, and open the macOS microphone privacy settings when permission is denied.
- lib/features/home/widgets/chat_input_bar.dart: add the desktop voice popover flow with countdown, recording, confirmation, Enter-to-send, Esc-to-cancel, and temporary file cleanup while preserving the existing mobile press-and-hold behavior.
- lib/features/home/widgets/chat_input_section.dart: expose the voice input entry on macOS desktop while keeping existing capability gating intact.
- lib/l10n/app_en.arb: add desktop voice popover strings for countdown, recording, confirmation, and actions.
- lib/l10n/app_localizations.dart: regenerate localization interface after adding macOS desktop voice strings.
- lib/l10n/app_localizations_en.dart: regenerate English localization output.
- lib/l10n/app_localizations_zh.dart: regenerate Chinese localization output.
- lib/l10n/app_zh.arb: add matching Chinese desktop voice popover strings.
- lib/l10n/app_zh_Hans.arb: add matching zh_Hans desktop voice popover strings.
- lib/l10n/app_zh_Hant.arb: add matching Traditional Chinese desktop voice popover strings.
- macos/Flutter/GeneratedPluginRegistrant.swift: register the macOS record plugin required for desktop voice capture.
- macos/Runner/DebugProfile.entitlements: enable the audio-input entitlement for debug/profile builds.
- macos/Runner/Info.plist: add the macOS microphone usage description required for desktop recording permission prompts.
- macos/Runner/Release.entitlements: enable the audio-input entitlement for release builds.
- test/chat_action_bus_test.dart: cover desktop transient UI cleanup event delivery.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
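
Illustrative note on the press-and-hold shortcut: the commit mentions consuming repeated shortcut events so that holding Cmd+Shift+R does not trigger invalid-key system beeps. One way to picture this is a small state machine that starts on the first key-down, swallows auto-repeat events, and stops on key-up. The Python sketch below is a simplified model, not the PR's Dart code; the class and method names are hypothetical, and the countdown and confirmation steps of the popover are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HoldToRecordHotkey:
    """Simplified model of a press-and-hold recording shortcut."""

    recording: bool = False
    log: List[str] = field(default_factory=list)

    def on_key_down(self, is_repeat: bool) -> bool:
        """Return True when the event is consumed by the shortcut."""
        if is_repeat or self.recording:
            # Swallow auto-repeat so the OS never sees an unhandled key.
            return True
        self.recording = True
        self.log.append("start")
        return True

    def on_key_up(self) -> bool:
        """Stop recording on release; ignore releases with no session."""
        if not self.recording:
            return False
        self.recording = False
        self.log.append("stop")
        return True
```

The key point is that repeat events return "consumed" without side effects, so a long hold produces exactly one start and one stop.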
…nd hotkey flow

Add Linux desktop voice input to the existing desktop popover recording flow, reusing the shared countdown, stop-to-confirm send, and attachment pipeline. This enables Linux recording in the shared input layer, adds explicit parecord/ffmpeg dependency checks with localized errors, and keeps Windows excluded while preserving existing mobile and macOS behavior.

Files changed:
- lib/features/home/pages/home_page.dart: extend the desktop voice hotkey handler to Linux and use Ctrl+Shift+R outside macOS.
- lib/features/home/services/voice_input_service.dart: allow Linux recording, skip unsupported Linux permission requests, and surface explicit missing-dependency errors for parecord/ffmpeg before recording starts.
- lib/features/home/utils/desktop_voice_input_utils.dart: centralize desktop voice platform support and shortcut matching for macOS and Linux.
- lib/features/home/widgets/chat_input_bar.dart: enable the desktop voice popover flow on Linux and show the correct platform-specific shortcut label.
- lib/features/home/widgets/chat_input_section.dart: expose desktop voice input on Linux while keeping the existing capability gating intact.
- lib/l10n/app_en.arb: add localized Linux voice dependency error text.
- lib/l10n/app_localizations.dart: regenerate localization interface after adding the Linux dependency error string.
- lib/l10n/app_localizations_en.dart: regenerate English localization output.
- lib/l10n/app_localizations_zh.dart: regenerate Chinese localization output.
- lib/l10n/app_zh.arb: add matching Chinese Linux voice dependency error text.
- lib/l10n/app_zh_Hans.arb: add matching zh_Hans Linux voice dependency error text.
- lib/l10n/app_zh_Hant.arb: add matching Traditional Chinese Linux voice dependency error text.
- test/linux_voice_input_support_test.dart: cover Linux desktop voice support, hotkey matching, and missing dependency detection.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
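
Illustrative note on the Linux dependency check: the commit surfaces explicit missing-dependency errors for parecord and ffmpeg before recording starts. A minimal sketch of such a pre-flight check is shown below in Python (the PR is Dart); the function name is hypothetical, and the lookup is injected so the check can be tested without touching the real PATH.

```python
import shutil
from typing import Callable, List, Optional

# Tools named in the commit message as required on Linux.
REQUIRED_TOOLS = ("parecord", "ffmpeg")


def missing_linux_voice_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]
```

Calling `missing_linux_voice_tools()` before starting the recorder lets the UI show a specific message (for example, naming `parecord`) instead of failing partway through a recording.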
Add in-app replay for recorded voice message bubbles so users can review sent voice notes without leaving chat. Show playback progress directly inside the bubble, update only the duration label inside the existing localized voice bubble text to a countdown while playing, stop active TTS before replay, and keep playback synchronized through a shared single-player controller across mobile and desktop.

Files changed:

- lib/core/providers/voice_message_playback_provider.dart: add shared playback state for sent voice bubbles, including stop-on-retap, progress tracking, remaining time, and completion cleanup.

- lib/features/chat/widgets/chat_message_widget.dart: route recorded voice attachments to in-app playback, render the in-bubble progress overlay, update only the duration portion of the localized voice bubble label to a countdown while playing, and keep normal files on the existing open-file path.

- lib/main.dart: register the shared voice message playback provider.

- lib/utils/voice_attachment_utils.dart: add a helper to format voice bubble labels from explicit durations so playback countdown can reuse the existing localized label structure.

- lib/l10n/app_en.arb: add localized voice playback failure text.

- lib/l10n/app_localizations.dart: regenerate localization interface after adding the voice playback failure string.

- lib/l10n/app_localizations_en.dart: regenerate English localization output.

- lib/l10n/app_localizations_zh.dart: regenerate Chinese localization output.

- lib/l10n/app_zh.arb: add matching Chinese voice playback failure text.

- lib/l10n/app_zh_Hans.arb: add matching zh_Hans voice playback failure text.

- lib/l10n/app_zh_Hant.arb: add matching Traditional Chinese voice playback failure text.

- test/voice_message_playback_provider_test.dart: cover shared playback activation, stop-on-retap, progress updates, playback switching, and failure cleanup.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
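
Illustrative note on the shared playback state: the commit describes a single-player controller with stop-on-retap, progress tracking, remaining time, playback switching, and completion cleanup. The Python sketch below models only that state logic (no audio backend), under the assumption that each bubble is identified by a message id; the class and its members are hypothetical and do not mirror the PR's Dart provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoicePlaybackState:
    """At most one voice bubble is active at a time."""

    active_id: Optional[str] = None
    position_ms: int = 0
    duration_ms: int = 0

    def tap(self, message_id: str, duration_ms: int) -> None:
        """Tapping the active bubble stops it; tapping another switches."""
        if self.active_id == message_id:
            self.stop()
            return
        self.active_id = message_id
        self.position_ms = 0
        self.duration_ms = duration_ms

    def on_progress(self, position_ms: int) -> None:
        """Clamp the reported position and clean up on completion."""
        if self.active_id is None:
            return
        self.position_ms = min(max(0, position_ms), self.duration_ms)
        if self.position_ms >= self.duration_ms:
            self.stop()

    def stop(self) -> None:
        self.active_id = None
        self.position_ms = 0
        self.duration_ms = 0

    @property
    def remaining_ms(self) -> int:
        return max(0, self.duration_ms - self.position_ms)

    @property
    def progress(self) -> float:
        return self.position_ms / self.duration_ms if self.duration_ms else 0.0
```

The countdown shown in the bubble label would be derived from `remaining_ms`, and the in-bubble overlay from `progress`.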
Allow Windows desktop sessions to enter the existing voice recording flow so audio-capable models expose the same in-app controls and shortcut path as Linux. This keeps the desktop voice UI aligned across supported platforms without changing the audio payload flow.

Files changed:
- lib/features/home/utils/desktop_voice_input_utils.dart: include Windows in desktop voice support, shortcut labels, and hotkey matching.
- lib/features/home/services/voice_input_service.dart: allow Windows to enter the existing recorder start flow.
- test/linux_voice_input_support_test.dart: cover Windows desktop support and hotkey behavior.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
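
Illustrative note on platform-specific shortcut matching: per the commits above, macOS uses Cmd+Shift+R, other desktop platforms use Ctrl+Shift+R, and Windows follows the same shortcut path as Linux. A compact Python sketch of that rule follows (the PR's helper lives in a Dart file); the function names and the lowercase platform strings are assumptions.

```python
DESKTOP_VOICE_PLATFORMS = ("macos", "linux", "windows")


def desktop_voice_supported(platform: str) -> bool:
    return platform in DESKTOP_VOICE_PLATFORMS


def shortcut_label(platform: str) -> str:
    """Label shown in the UI for the recording shortcut."""
    return "Cmd+Shift+R" if platform == "macos" else "Ctrl+Shift+R"


def matches_voice_shortcut(
    platform: str,
    key: str,
    *,
    meta: bool = False,
    ctrl: bool = False,
    shift: bool = False,
    alt: bool = False,
) -> bool:
    """True only for the exact modifier combination on a supported platform."""
    if not desktop_voice_supported(platform):
        return False
    if key.lower() != "r" or not shift or alt:
        return False
    if platform == "macos":
        return meta and not ctrl
    return ctrl and not meta
```

Centralizing the rule in one helper keeps the hotkey handler and the label in the popover from drifting apart when a platform is added.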
Mark supported audio-capable models with an explicit audio input modality so follow-up routing can rely on one manifest source of truth. Preserve that modality across override parsing, model tags, and both model editors so audio capability is displayed and saved correctly.

Files changed:

- lib/core/models/model_types.dart: add Modality.audio and a shared storage serializer for persisted modality values.

- lib/core/providers/model_provider.dart: infer audio input capability for supported Gemini, LongCat Omni, Whisper, and transcribe model ids.

- lib/core/services/model_override_resolver.dart: parse audio modality values from model override payloads.

- lib/shared/widgets/model_tag_wrap.dart: render audio modality with dedicated labels and icons instead of collapsing it into image tags.

- lib/features/model/widgets/model_detail_sheet.dart: expose audio as an input mode in the mobile model editor and serialize audio modalities correctly when saving overrides.

- lib/desktop/model_edit_dialog.dart: expose audio as an input mode in the desktop model editor and serialize audio modalities correctly when saving overrides.

- lib/l10n/app_en.arb: add the localized audio mode label.

- lib/l10n/app_localizations.dart: regenerate the localization interface for the new audio mode label.

- lib/l10n/app_localizations_en.dart: regenerate English localization output.

- lib/l10n/app_localizations_zh.dart: regenerate Chinese localization output, including zh_Hans and zh_Hant variants.

- lib/l10n/app_zh.arb: add the matching Chinese audio mode label.

- lib/l10n/app_zh_Hans.arb: add the matching zh_Hans audio mode label.

- lib/l10n/app_zh_Hant.arb: add the matching Traditional Chinese audio mode label.

- test/model_manifest_audio_support_test.dart: cover audio modality inference and override preservation for supported model manifests.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>
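
Illustrative note on audio modality inference: the commit infers audio input capability for supported Gemini, LongCat Omni, Whisper, and transcribe model ids, and adds a serializer for persisted modality values. The actual matching rules are not shown, so the Python sketch below uses simple substring checks as a stand-in; the hint list, the function names, and the string values of the enum are all assumptions, and the real rules are likely more selective (for example, not every Gemini model id necessarily qualifies).

```python
from enum import Enum
from typing import Iterable, List, Set


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"


# Hypothetical substring hints; the PR's real rules are not shown.
_AUDIO_HINTS = ("gemini", "whisper", "transcribe")


def supports_audio_input(model_id: str) -> bool:
    """Guess audio-input support from the model id alone."""
    lowered = model_id.lower()
    if any(hint in lowered for hint in _AUDIO_HINTS):
        return True
    return "longcat" in lowered and "omni" in lowered


def serialize_modalities(modalities: Iterable[Modality]) -> List[str]:
    """Stable storage form, so saved overrides compare equal."""
    return sorted(m.value for m in modalities)


def parse_modalities(values: Iterable[str]) -> Set[Modality]:
    """Ignore unknown strings instead of failing on older payloads."""
    known = {m.value: m for m in Modality}
    return {known[v] for v in values if v in known}
```

Keeping parsing tolerant of unknown values means an override saved by a newer build does not break an older one, which fits the commit's goal of preserving the modality across override parsing.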
luosc force-pushed the feat/gemini-voice-input branch from a6758fc to 6353467 April 10, 2026 01:47
Fix the mobile voice overlay parent-data structure and block message-list scrolling while a hold-to-record gesture is active so drag movement is reserved for cancel detection.

Files changed:
- lib/features/home/widgets/chat_input_bar.dart: fix the mobile voice overlay parent-data structure for the hold-to-record keep-zone.
- lib/features/home/widgets/message_list_view.dart: allow temporarily disabling user scrolling during hold-to-record.
- lib/features/home/pages/home_page.dart: disable mobile message-list scrolling while voice recording is active.

Signed-off-by: Shuchen Luo <nemo0806@gmail.com>