diff --git a/README.de.md b/README.de.md index 49fb922..c063c3c 100644 --- a/README.de.md +++ b/README.de.md @@ -38,7 +38,7 @@ - ✅ Telegram zum Steuern von Codex / Copilot CLI verwenden - ✅ Antworten und geänderte Dateien bequem in Code-Blöcken prüfen - ✅ Folgefragen während eines laufenden Agentenlaufs in die Queue stellen - - ✅ Unterstützt Text- und Bildeingaben + - ✅ Akzeptiert ✏️ Text-, 🌄 Bild- und 🎙️ Sprachnachrichten sowie Audiodateien ## 🔁 Nahtlos zwischen Geräten und Sessions wechseln @@ -99,6 +99,7 @@ Vor dem Start des Servers brauchst du: - Lokal installiertes Codex CLI und/oder Copilot CLI - [Codex CLI Installation](https://developers.openai.com/codex/cli) - [Copilot CLI Installation](https://github.com/features/copilot/cli) +- [Optional] Whisper, ffmpeg @@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Bot-Server starten +### 🌐 Bot-Server starten ##### Beim ersten Start legt die App die Env-Datei an und sagt dir, welche Felder du ausfüllen musst. ##### Nach dem Bearbeiten der Env-Datei starte erneut: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [Optional] Speech-to-Text-Funktion: lokale OpenAI-Whisper-Voraussetzungen vorbereiten + +Damit aktivierst du optional lokale Whisper-basierte Sprach-zu-Text-Unterstützung für Telegram-Sprachnotizen. Audiodateien sind auf maximal `20 MB` begrenzt. + +```bash +# wenn du per pip oder per Einzeiler install.sh installiert hast +coding-agent-telegram-stt-install + +# wenn du aus einem geklonten Repository startest +./install-stt.sh +``` + +Empfohlene Env-Einstellungen: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +Hinweise: + +- Whisper lädt das ausgewählte Modell beim ersten Aufruf automatisch nach `~/.cache/whisper` herunter. 
+- Wenn du `OPENAI_WHISPER_MODEL=turbo` wählst, ist es wahrscheinlicher, dass die erste Sprachnachricht das Zeitlimit erreicht, während `large-v3-turbo.pt` noch heruntergeladen wird. +- Nach der Transkription einer Sprachnachricht sendet der Bot das erkannte Transkript zuerst zurück an Telegram und gibt es danach an den Agenten weiter. So lassen sich Erkennungsfehler leichter prüfen. + ## 🔑 Telegram-Einrichtung ### Bot-Token holen @@ -175,6 +202,7 @@ Der Bot akzeptiert derzeit: - Textnachrichten - Fotos +- Sprachnachrichten und Audiodateien, wenn `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` gesetzt ist und die lokalen Whisper-Voraussetzungen installiert sind - Codex und Copilot unterstützen aktuell nur Text und Bilder, kein Video. ## 🤖 Telegram-Befehle @@ -329,6 +357,18 @@ Der Bot akzeptiert derzeit: ENABLE_SECRET_SCRUB_FILTER Tokens, Schlüssel, .env-Werte, Zertifikate und ähnliche geheime Ausgaben vor dem Senden an Telegram unkenntlich machen. Standard: true (dringend empfohlen). + + ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT + Standard: false. Wenn true, werden Audionachrichten und Sprachdateien erkannt. Das System prüft die Voraussetzungen für benötigte Binärdateien oder Bibliotheken und fordert bei Bedarf zur Installation auf. + + + OPENAI_WHISPER_MODEL + Modell für Whisper STT. Standard: base
Verfügbare Modelle: tiny ca. 72 MB, base ca. 139 MB, large-v3-turbo ca. 1.5 GB
Modelle werden bei der ersten Sprachnachricht automatisch heruntergeladen. Empfehlung: base für den allgemeinen Einsatz. Für bessere Genauigkeit und Qualität kannst du turbo ausprobieren. + + + OPENAI_WHISPER_TIMEOUT_SECONDS + Standard: 120. Zeitlimit für den STT-Prozess. Normalerweise ist die Verarbeitung schnell genug. Wenn du jedoch turbo wählst, kann der erste Download je nach Internetgeschwindigkeit das Zeitlimit überschreiten. + SNAPSHOT_INCLUDE_PATH_GLOBS Passende Pfade in Diffs immer einschließen. Beispiel: .github/*,.profile.test,.profile.prod diff --git a/README.fr.md b/README.fr.md index 2ee4349..b47b9f6 100644 --- a/README.fr.md +++ b/README.fr.md @@ -38,7 +38,7 @@ - ✅ Utiliser Telegram pour piloter Codex / Copilot CLI - ✅ Révision facile des réponses et des fichiers modifiés dans des blocs de code - ✅ Les messages de suivi peuvent être mis en file d’attente pendant qu’un agent travaille - - ✅ Prend en charge le texte et les images + - ✅ Accepte les messages ✏️ texte, 🌄 image et 🎙️ vocaux ainsi que les fichiers audio ## 🔁 Changement fluide entre appareils et sessions @@ -99,6 +99,7 @@ Avant de démarrer le serveur, assurez-vous d’avoir : - Codex CLI et/ou Copilot CLI installés localement - [Installation Codex CLI](https://developers.openai.com/codex/cli) - [Installation Copilot CLI](https://github.com/features/copilot/cli) +- [Optionnel] Whisper, ffmpeg @@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Démarrer le serveur du bot +### 🌐 Démarrer le serveur du bot ##### Au premier lancement, l’application crée le fichier env et vous indique quels champs remplir. ##### Après avoir mis à jour le fichier env, relancez : ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [Optionnel] Fonction Speech-to-Text : préparer les prérequis locaux OpenAI-Whisper + +Cela active la transcription locale optionnelle des notes vocales Telegram avec Whisper. Les fichiers audio sont limités à `20 MB` maximum. 
+ +```bash +# si vous avez installé avec pip ou avec l’install.sh en une ligne +coding-agent-telegram-stt-install + +# si vous utilisez un dépôt cloné +./install-stt.sh +``` + +Réglages env recommandés : + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +Remarques : + +- Whisper télécharge automatiquement le modèle sélectionné lors du premier usage dans `~/.cache/whisper`. +- Si vous choisissez `OPENAI_WHISPER_MODEL=turbo`, la première transcription vocale a davantage de chances d’atteindre le délai pendant que `large-v3-turbo.pt` est encore en cours de téléchargement. +- Après transcription d’un message vocal, le bot renvoie d’abord le texte reconnu dans Telegram avant de l’envoyer à l’agent. Cela aide à diagnostiquer les erreurs de reconnaissance. + ## 🔑 Configuration Telegram ### Obtenir un Bot Token @@ -171,6 +198,13 @@ Remarques : ## 📨 Types de messages pris en charge +Le bot accepte actuellement : + +- les messages texte +- les photos +- les messages vocaux et les fichiers audio quand `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` et que les prérequis locaux de Whisper sont installés +- Codex et Copilot prennent actuellement en charge uniquement le texte et les images, pas la vidéo + ## 🤖 Commandes Telegram diff --git a/README.ja.md b/README.ja.md index a196d37..6204cc7 100644 --- a/README.ja.md +++ b/README.ja.md @@ -38,7 +38,7 @@ - ✅ Telegram で Codex / Copilot CLI を操作できる - ✅ エージェントの回答や変更ファイルをコードブロックで確認しやすい - ✅ エージェント実行中でも追加入力をキューに積める - - ✅ テキストと画像入力に対応 + - ✅ ✏️ テキスト、🌄 画像、🎙️ 音声メッセージ、および音声ファイルに対応 ## 🔁 デバイス/セッションをシームレスに切り替え @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - ローカルにインストール済みの Codex CLI または Copilot CLI - [Codex CLI インストール](https://developers.openai.com/codex/cli) - [Copilot CLI インストール](https://github.com/features/copilot/cli) +- [任意] Whisper、ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Bot サーバーを起動 +### 🌐 Bot サーバーを起動 ##### 初回起動時にアプリが env ファイルを作成し、入力すべき項目を案内します。 ##### env ファイルを更新したら、次を再実行してください: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [任意] Speech-to-Text 機能: ローカル OpenAI-Whisper の前提条件を準備 + +これにより、Telegram のボイスノートに対するローカル Whisper ベースの音声文字起こしを任意で有効にできます。音声ファイルは最大 `20 MB` に制限されます。 + +```bash +# pip または one-liner install.sh でインストールした場合 +coding-agent-telegram-stt-install + +# クローンしたリポジトリから使う場合 +./install-stt.sh +``` + +推奨される env 設定: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +メモ: + +- Whisper は選択したモデルを初回利用時に `~/.cache/whisper` へ自動ダウンロードします。 +- `OPENAI_WHISPER_MODEL=turbo` を選ぶと、`large-v3-turbo.pt` のダウンロード中に最初の音声文字起こしがタイムアウトしやすくなります。 +- 音声メッセージを文字起こしした後、ボットはまず認識したテキストを Telegram に返し、その後でエージェントへ渡します。これにより認識ミスを確認しやすくなります。 + ## 🔑 Telegram セットアップ ### Bot Token を取得 @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 対応メッセージタイプ +このボットが現在受け付けるもの: + +- テキストメッセージ +- 写真 +- `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` が設定され、ローカル Whisper の前提条件がインストールされている場合の音声メッセージと音声ファイル +- Codex と Copilot は現在、テキストと画像のみをサポートしており、動画はサポートしていません + ## 🤖 Telegram コマンド diff --git a/README.ko.md b/README.ko.md index 9f68399..05348ea 100644 --- a/README.ko.md +++ b/README.ko.md @@ -38,7 +38,7 @@ - ✅ Telegram 으로 Codex / Copilot CLI 를 제어 - ✅ 에이전트 응답과 변경 파일을 코드 블록으로 쉽게 검토 - ✅ 에이전트가 작업 중일 때도 후속 질문을 큐에 저장 - - ✅ 텍스트와 이미지 입력 지원 + - ✅ ✏️ 텍스트, 🌄 이미지, 🎙️ 음성 메시지와 오디오 파일 지원 ## 🔁 기기/세션 간 자연스러운 전환 @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - 로컬에 설치된 Codex CLI 및/또는 Copilot CLI - [Codex CLI 설치](https://developers.openai.com/codex/cli) - [Copilot CLI 설치](https://github.com/features/copilot/cli) +- [선택 사항] Whisper, ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Bot 서버 시작 +### 🌐 Bot 서버 시작 ##### 첫 실행 시 앱이 env 파일을 만들고 어떤 항목을 채워야 하는지 알려줍니다. ##### env 파일을 수정한 뒤 다시 실행하세요: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [선택 사항] Speech-to-Text 기능: 로컬 OpenAI-Whisper 전제 조건 준비 + +이 기능을 사용하면 Telegram 음성 노트에 대해 로컬 Whisper 기반 음성-텍스트 기능을 선택적으로 활성화할 수 있습니다. 오디오 파일은 최대 `20 MB` 까지만 지원됩니다. + +```bash +# pip 으로 설치한 경우 +coding-agent-telegram-stt-install + +# 클론한 저장소에서 실행하는 경우 +./install-stt.sh +``` + +권장 env 설정: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +참고: + +- Whisper 는 선택한 모델을 처음 사용할 때 `~/.cache/whisper` 로 자동 다운로드합니다. +- `OPENAI_WHISPER_MODEL=turbo` 를 선택하면 `large-v3-turbo.pt` 를 다운로드하는 동안 첫 음성 전사가 시간 초과에 걸릴 가능성이 더 높습니다. +- 음성 메시지를 전사한 뒤 봇은 먼저 인식된 텍스트를 Telegram 에 다시 보여주고, 그 다음 에이전트에 전달합니다. 그래서 인식 오류를 확인하기 쉽습니다. + ## 🔑 Telegram 설정 ### Bot Token 받기 @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 지원되는 메시지 유형 +현재 이 봇이 받는 메시지: + +- 텍스트 메시지 +- 사진 +- `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` 로 설정되어 있고 로컬 Whisper 전제 조건이 설치된 경우의 음성 메시지와 오디오 파일 +- Codex 와 Copilot 은 현재 텍스트와 이미지만 지원하며, 비디오는 지원하지 않습니다 + ## 🤖 Telegram 명령어 diff --git a/README.md b/README.md index 2ddc80c..645bd09 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,7 @@ - ✅ Use Telegram to control Codex / Copilot CLI - ✅ Easily review files changed by agent in code block - ✅ Queue follow-up messages while the agent is working - - ✅ Accept Text and Image input + - ✅ Accept ✏️ Text, 🌄 Image, and 🎙️ Voice messages as well as Audio files ## 🔁 Seamless Device/Session Switching @@ -97,8 +97,8 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - Telegram bot token created from _@BotFather_ - Your Telegram chat ID - Codex CLI and/or Copilot CLI installed locally - - [Codex CLI install](https://developers.openai.com/codex/cli) - - [Copilot CLI 
install](https://github.com/features/copilot/cli) + - [Codex CLI install](https://developers.openai.com/codex/cli) / [Copilot CLI install](https://github.com/features/copilot/cli) + - [Optional] `Whisper`, `ffmpeg`
@@ -129,7 +129,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Start Bot Server +### 🌐 Start Bot Server ##### On first run, the app creates the env file, tells you what to fill in. ##### After updating the environment file then run: @@ -141,6 +141,40 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [Optional] Speech-to-Text Feature: prepare local OpenAI-Whisper prerequisites + +This enables optional local Whisper-based voice-message speech-to-text for Telegram voice notes. Voice files are capped to `20MB` max. + +```bash +# if you installed from pip or one-liner install.sh +coding-agent-telegram-stt-install + +# if you run from a cloned repository +./install-stt.sh +``` + +The installer writes the STT env flags automatically after prerequisites are ready. + +Estimated local footprint: + +- `openai-whisper`: about `50 MB` +- `ffmpeg` package: about `50 MB` +- Whisper model downloads vary by model: `tiny` about `72 MB`, `base` about `139 MB`, `large-v3-turbo` about `1.5 GB` + +Recommended env settings for the local Whisper backend: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +Notes: + +- Whisper downloads the selected model automatically on first use into `~/.cache/whisper`. +- If you choose `OPENAI_WHISPER_MODEL=turbo`, the first voice transcription is more likely to hit the timeout while `large-v3-turbo.pt` is still downloading. +- After a voice note is transcribed, the bot immediately sends the recognized transcript back to Telegram before the agent reply. If the run can start immediately it says “working on it”; if the project is busy it shows that the transcript was queued instead. 
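The `20MB` cap on voice files mentioned above can be enforced with a simple size pre-check before a download is handed to Whisper. A minimal sketch — the constant mirrors the documented limit, but the helper name is illustrative and not part of the codebase:

```python
# Hypothetical pre-check for the documented 20 MB voice-file cap.
MAX_VOICE_FILE_BYTES = 20 * 1024 * 1024

def voice_file_within_cap(size_bytes: int) -> bool:
    """Return True when a voice/audio payload fits under the 20 MB cap."""
    return 0 < size_bytes <= MAX_VOICE_FILE_BYTES

print(voice_file_within_cap(5 * 1024 * 1024))           # True: a typical voice note
print(voice_file_within_cap(MAX_VOICE_FILE_BYTES + 1))  # False: over the cap
```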
+ ## 🔑 Telegram Setup ### Get a Bot Token @@ -179,61 +213,62 @@ The bot currently accepts: - Text messages - photos +- voice messages when `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` and local Whisper prerequisites are installed - Codex and Copilot currently supports text and image only, video is not supported. ## 🤖 Telegram Commands - + - + - + - + - + - + - + - + - + - + - + - + - +
/provider Choose the provider for new sessions. The selection is stored per bot and chat until you change it.
/project <project_folder> Set the current project folder. If the folder does not exist, the app creates it and marks it trusted. If it already exists and is still untrusted, the app asks you to trust it explicitly.
/branch <new_branch> Prepare or switch a branch for the current project. If the branch already exists, the bot treats that branch as the source candidate. Otherwise it uses the repository default branch as the source candidate.
/branch <origin_branch> <new_branch> Prepare or switch a branch using <origin_branch> as the source candidate.
For both forms, the bot then offers the source choices that actually exist: local/<branch> or origin/<branch>.
If only one of those exists, only that option is shown. If neither exists, the bot tells you the branch source is missing.
/current Show the active session for the current bot and chat.
/new [session_name] Create a new session for the current project. If you omit the name, the bot uses the real session ID. If provider, project, or branch is missing, the bot guides you through the missing step.
/switch Show the latest sessions, newest first. The list includes both bot-managed sessions and local Codex/Copilot CLI sessions for the current project.
/switch page <number> Show another page of stored sessions.
/switch <session_id> Switch to a specific session by ID. If you choose a local CLI session, the bot imports it and continues from there.
/compact Create a fresh compacted session from the active session and switch to it.
/commit <git commands> Run validated git commit-related commands inside the active session project. Available only when ENABLE_COMMIT_COMMAND=true. Mutating git commands require a trusted project.
/push Push origin <branch> for the current active session. The bot asks for confirmation before pushing.
/abort Abort the current agent run for the current project. If queued questions are waiting, the bot asks whether to continue them.
@@ -260,7 +295,7 @@ The bot currently accepts: - + @@ -277,7 +312,7 @@ The bot currently accepts:
WORKSPACE_ROOT Parent folder that contains your project directories.
- + @@ -352,6 +387,24 @@ The bot currently accepts:
APP_LOCALE UI locale for shared bot messages and command descriptions. Supported values: en, de, fr, ja, ko, nl, th, vi, zh-CN, zh-HK, zh-TW.
+

Speech to Text

+ + + + + + + + + + + + + +
ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT Default: false. If true, the bot recognizes voice messages and audio files. On startup, the system checks the prerequisites for the required binaries and libraries and prompts you to install anything that is missing.
OPENAI_WHISPER_MODEL Model for the Whisper STT. Default: base
Available models: tiny about 72 MB, base about 139 MB, large-v3-turbo about 1.5 GB
+ Models will be automatically downloaded on your first voice message. Recommended: base for general usage. If you want better accuracy and quality, you can try turbo. +
OPENAI_WHISPER_TIMEOUT_SECONDS Default: 120. Timeout for the STT process. Usually STT processing is fast enough; if you choose turbo, however, the first model download may exceed the timeout depending on your internet speed.
+

State and Logs

diff --git a/README.nl.md b/README.nl.md index 953102e..8c01e3c 100644 --- a/README.nl.md +++ b/README.nl.md @@ -38,7 +38,7 @@ - ✅ Gebruik Telegram om Codex / Copilot CLI te bedienen - ✅ Antwoorden en gewijzigde bestanden eenvoudig beoordelen in codeblokken - ✅ Vervolgvragen kunnen in de wachtrij terwijl de agent werkt - - ✅ Ondersteunt tekst- en afbeeldingsinvoer + - ✅ Accepteert ✏️ tekst-, 🌄 afbeelding- en 🎙️ spraakberichten, evenals audiobestanden ## 🔁 Naadloos wisselen tussen apparaten en sessies @@ -99,6 +99,7 @@ Voordat je de server start, zorg dat je hebt: - Codex CLI en/of Copilot CLI lokaal geïnstalleerd - [Codex CLI installatie](https://developers.openai.com/codex/cli) - [Copilot CLI installatie](https://github.com/features/copilot/cli) +- [Optioneel] Whisper, ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Botserver starten +### 🌐 Botserver starten ##### Bij de eerste start maakt de app het env-bestand aan en vertelt welke velden je moet invullen. ##### Start na het bijwerken van het env-bestand opnieuw: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [Optioneel] Speech-to-Text-functie: lokale OpenAI-Whisper-vereisten voorbereiden + +Hiermee schakel je optionele lokale Whisper-gebaseerde spraak-naar-tekst in voor Telegram-spraaknotities. Audiobestanden zijn beperkt tot maximaal `20 MB`. + +```bash +# als je via pip hebt geïnstalleerd +coding-agent-telegram-stt-install + +# als je vanuit een gekloonde repository werkt +./install-stt.sh +``` + +Aanbevolen env-instellingen: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +Opmerkingen: + +- Whisper downloadt het gekozen model automatisch bij het eerste gebruik naar `~/.cache/whisper`. +- Als je `OPENAI_WHISPER_MODEL=turbo` kiest, is de kans groter dat de eerste spraaktranscriptie de time-out raakt terwijl `large-v3-turbo.pt` nog wordt gedownload. +- Nadat een spraakbericht is getranscribeerd, stuurt de bot eerst het herkende transcript terug naar Telegram en daarna pas naar de agent. Dat helpt om herkenningsfouten te controleren. 
+ ## 🔑 Telegram-instelling ### Een Bot Token krijgen @@ -171,7 +198,14 @@ Opmerkingen: ## 📨 Ondersteunde berichttypen -## 🤖 Telegram-commando’s +De bot accepteert momenteel: + +- tekstberichten +- foto’s +- spraakberichten en audiobestanden wanneer `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` is ingesteld en de lokale Whisper-vereisten zijn geïnstalleerd +- Codex en Copilot ondersteunen momenteel alleen tekst en afbeeldingen, geen video + +## 🤖 Telegram-commando's diff --git a/README.th.md b/README.th.md index 7ff78a6..c22641e 100644 --- a/README.th.md +++ b/README.th.md @@ -38,7 +38,7 @@ - ✅ ใช้ Telegram เพื่อควบคุม Codex / Copilot CLI - ✅ ตรวจคำตอบและไฟล์ที่ถูกแก้ได้ง่ายใน code block - ✅ ส่งคำถามต่อคิวไว้ได้ระหว่างที่ agent กำลังทำงาน - - ✅ รองรับข้อความและรูปภาพ + - ✅ รองรับ ✏️ ข้อความ, 🌄 รูปภาพ, 🎙️ ข้อความเสียง และไฟล์เสียง ## 🔁 สลับอุปกรณ์และเซสชันได้ลื่นไหล @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - ติดตั้ง Codex CLI และ/หรือ Copilot CLI ไว้ในเครื่องแล้ว - [ติดตั้ง Codex CLI](https://developers.openai.com/codex/cli) - [ติดตั้ง Copilot CLI](https://github.com/features/copilot/cli) +- [ทางเลือก] Whisper, ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### เริ่ม Bot Server +### 🌐 เริ่ม Bot Server ##### ครั้งแรกแอปจะสร้างไฟล์ env และบอกว่าต้องกรอกค่าใดบ้าง ##### หลังแก้ไฟล์ env แล้ว ให้รันอีกครั้ง: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [ทางเลือก] ฟีเจอร์ Speech-to-Text: เตรียมส่วนที่ OpenAI-Whisper ต้องใช้ในเครื่อง + +ส่วนนี้ใช้เปิดการแปลงข้อความจากข้อความเสียง Telegram ด้วย Whisper แบบโลคัลตามตัวเลือกของคุณ ไฟล์เสียงถูกจำกัดไว้ที่สูงสุด `20 MB` + +```bash +# ถ้าติดตั้งด้วย pip +coding-agent-telegram-stt-install + +# ถ้าใช้งานจาก repository ที่ clone มา +./install-stt.sh +``` + +ค่า env ที่แนะนำ: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +หมายเหตุ: + +- Whisper จะดาวน์โหลดโมเดลที่เลือกโดยอัตโนมัติครั้งแรกไปยัง `~/.cache/whisper` +- หากเลือก `OPENAI_WHISPER_MODEL=turbo` การถอดข้อความจากเสียงครั้งแรกมีโอกาสหมดเวลามากขึ้น ขณะ `large-v3-turbo.pt` ยังดาวน์โหลดไม่เสร็จ +- หลังจากถอดข้อความจากเสียงแล้ว บอตจะส่งข้อความที่รู้จำได้กลับไปใน Telegram ก่อน แล้วจึงส่งต่อให้เอเจนต์ เพื่อช่วยตรวจสอบความคลาดเคลื่อนของการรู้จำ + ## 🔑 ตั้งค่า Telegram ### รับ Bot Token @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 ประเภทข้อความที่รองรับ +บอตรองรับสิ่งต่อไปนี้ในตอนนี้: + +- ข้อความตัวอักษร +- รูปภาพ +- ข้อความเสียงและไฟล์เสียง เมื่อกำหนด `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` และติดตั้งส่วนที่ Whisper ต้องใช้ในเครื่องแล้ว +- ปัจจุบัน Codex และ Copilot รองรับเฉพาะข้อความและรูปภาพ ยังไม่รองรับวิดีโอ + ## 🤖 คำสั่ง Telegram diff --git a/README.vi.md b/README.vi.md index 96e6e04..7390cf7 100644 --- a/README.vi.md +++ b/README.vi.md @@ -38,7 +38,7 @@ - ✅ Dùng Telegram để điều khiển Codex / Copilot CLI - ✅ Dễ xem câu trả lời và các file đã thay đổi trong code block - ✅ Có thể xếp hàng câu hỏi tiếp theo khi agent đang làm việc - - ✅ Hỗ trợ đầu vào văn bản và hình ảnh + - ✅ Chấp nhận tin nhắn ✏️ văn bản, 🌄 hình ảnh, 🎙️ thoại và cả tệp âm 
thanh ## 🔁 Chuyển thiết bị/phiên liền mạch @@ -99,6 +99,7 @@ Trước khi khởi động server, hãy chuẩn bị: - Codex CLI và/hoặc Copilot CLI đã được cài cục bộ - [Cài Codex CLI](https://developers.openai.com/codex/cli) - [Cài Copilot CLI](https://github.com/features/copilot/cli) +- [Tùy chọn] Whisper, ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### Khởi động bot server +### 🌐 Khởi động bot server ##### Ở lần chạy đầu, app sẽ tạo file env và cho bạn biết cần điền trường nào. ##### Sau khi cập nhật file env, hãy chạy lại: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [Tùy chọn] Tính năng Speech-to-Text: chuẩn bị các điều kiện cần cục bộ của OpenAI-Whisper + +Phần này dùng để bật tùy chọn chuyển tin nhắn thoại Telegram thành văn bản bằng Whisper chạy cục bộ. Tệp âm thanh được giới hạn tối đa `20 MB`. + +```bash +# nếu bạn cài bằng pip +coding-agent-telegram-stt-install + +# nếu bạn chạy từ repository đã clone +./install-stt.sh +``` + +Thiết lập env được khuyến nghị: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +Lưu ý: + +- Whisper sẽ tự động tải model đã chọn vào `~/.cache/whisper` ở lần dùng đầu tiên. +- Nếu bạn chọn `OPENAI_WHISPER_MODEL=turbo`, lần chuyển giọng nói đầu tiên có khả năng chạm timeout cao hơn khi `large-v3-turbo.pt` vẫn đang được tải. +- Sau khi một tin nhắn thoại được chép lại, bot sẽ gửi lại bản transcript đã nhận dạng vào Telegram trước rồi mới chuyển cho tác nhân. Điều này giúp kiểm tra lỗi nhận dạng dễ hơn. 
+ ## 🔑 Thiết lập Telegram ### Lấy Bot Token @@ -171,6 +198,13 @@ Lưu ý: ## 📨 Loại tin nhắn được hỗ trợ +Hiện tại bot chấp nhận: + +- tin nhắn văn bản +- ảnh +- tin nhắn thoại và tệp âm thanh khi `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` và các điều kiện cần cục bộ của Whisper đã được cài đặt +- hiện tại Codex và Copilot chỉ hỗ trợ văn bản và hình ảnh, chưa hỗ trợ video + ## 🤖 Lệnh Telegram diff --git a/README.zh-CN.md b/README.zh-CN.md index 3b3206c..a2a21af 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -38,7 +38,7 @@ - ✅ 使用 Telegram 控制 Codex / Copilot CLI - ✅ 可以在代码块中轻松查看 agent 回复和改动文件 - ✅ agent 工作时也能继续排队后续问题 - - ✅ 支持文本和图片输入 + - ✅ 支持 ✏️ 文本、🌄 图片、🎙️ 语音消息以及音频文件 ## 🔁 设备与会话无缝切换 @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - 已在本地安装 Codex CLI 和/或 Copilot CLI - [安装 Codex CLI](https://developers.openai.com/codex/cli) - [安装 Copilot CLI](https://github.com/features/copilot/cli) +- [可选] Whisper、ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### 启动 Bot Server +### 🌐 启动 Bot Server ##### 首次运行时,应用会创建 env 文件,并告诉你需要填写哪些字段。 ##### 更新 env 文件后,再次运行: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [可选] Speech-to-Text 功能:准备本地 OpenAI-Whisper 依赖 + +这部分用于可选启用 Telegram 语音消息的本地 Whisper 语音转文字功能。音频文件最大限制为 `20 MB`。 + +```bash +# 如果你是通过 pip 安装 +coding-agent-telegram-stt-install + +# 如果你是从克隆的仓库运行 +./install-stt.sh +``` + +推荐的 env 设置: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +说明: + +- Whisper 会在首次使用时自动把所选模型下载到 `~/.cache/whisper`。 +- 如果你选择 `OPENAI_WHISPER_MODEL=turbo`,第一次语音转写更容易在 `large-v3-turbo.pt` 仍在下载时触发超时。 +- 语音消息转写完成后,bot 会先把识别出的文本回传到 Telegram,再把它交给 agent。这样更方便排查识别错误。 + ## 🔑 Telegram 设置 ### 获取 Bot Token @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 支持的消息类型 +bot 当前接受: + +- 文本消息 +- 图片 +- 当 `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` 且已安装本地 Whisper 依赖时的语音消息和音频文件 +- Codex 和 Copilot 当前只支持文本和图片,不支持视频 + ## 🤖 Telegram 命令 diff --git a/README.zh-HK.md b/README.zh-HK.md index 24a199e..81afc2e 100644 --- a/README.zh-HK.md +++ b/README.zh-HK.md @@ -38,7 +38,7 @@ - ✅ 使用 Telegram 控制 Codex / Copilot CLI - ✅ 可在 code block 中輕鬆查看 agent 回覆及改動檔案 - ✅ agent 執行中仍可把後續問題排入佇列 - - ✅ 支援文字與圖片輸入 + - ✅ 支援 ✏️ 文字、🌄 圖片、🎙️ 語音訊息以及音訊檔案 ## 🔁 裝置與工作階段無縫切換 @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - 已在本機安裝 Codex CLI 及/或 Copilot CLI - [安裝 Codex CLI](https://developers.openai.com/codex/cli) - [安裝 Copilot CLI](https://github.com/features/copilot/cli) +- [可選] Whisper、ffmpeg
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### 啟動 Bot Server +### 🌐 啟動 Bot Server ##### 第一次執行時,app 會建立 env 檔案,並告訴你需要填寫哪些欄位。 ##### 更新 env 檔案後,再次執行: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [可選] Speech-to-Text 功能:準備本機 OpenAI-Whisper 依賴 + +這部分可選啟用 Telegram 語音訊息的本機 Whisper 語音轉文字功能。音訊檔案最大限制為 `20 MB`。 + +```bash +# 如果你是用 pip 安裝 +coding-agent-telegram-stt-install + +# 如果你是從 clone 的 repository 執行 +./install-stt.sh +``` + +建議的 env 設定: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +說明: + +- Whisper 會在首次使用時自動把所選模型下載到 `~/.cache/whisper`。 +- 如果你選擇 `OPENAI_WHISPER_MODEL=turbo`,第一次語音轉錄更容易在 `large-v3-turbo.pt` 尚在下載時觸發逾時。 +- 語音訊息轉錄完成後,bot 會先把辨識出的文字回傳到 Telegram,再把它交給 agent。這樣更方便排查辨識錯誤。 + ## 🔑 Telegram 設定 ### 取得 Bot Token @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 支援的訊息類型 +bot 目前接受: + +- 文字訊息 +- 圖片 +- 當 `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` 且已安裝本機 Whisper 依賴時的語音訊息與音訊檔案 +- Codex 與 Copilot 目前只支援文字與圖片,不支援影片 + ## 🤖 Telegram 指令 @@ -323,6 +357,18 @@ https://api.telegram.org/bot/getUpdates + + + + + + + + + + + + diff --git a/README.zh-TW.md b/README.zh-TW.md index 689d0b0..6161ac6 100644 --- a/README.zh-TW.md +++ b/README.zh-TW.md @@ -38,7 +38,7 @@ - ✅ 使用 Telegram 控制 Codex / Copilot CLI - ✅ 可以在 code block 中輕鬆檢視 agent 回覆與改動檔案 - ✅ agent 執行期間也能把後續問題排入佇列 - - ✅ 支援文字與圖片輸入 + - ✅ 支援 ✏️ 文字、🌄 圖片、🎙️ 語音訊息以及音訊檔案 ## 🔁 裝置與工作階段無縫切換 @@ -99,6 +99,7 @@ curl -fsSL https://raw.githubusercontent.com/daocha/coding-agent-telegram/main/i - 已在本機安裝 Codex CLI 及/或 Copilot CLI - [安裝 Codex CLI](https://developers.openai.com/codex/cli) - [安裝 Copilot CLI](https://github.com/features/copilot/cli) +- [可選] Whisper、ffmpeg
ENABLE_SECRET_SCRUB_FILTER 在送往 Telegram 之前,對 tokens、keys、.env 值、certificates 及類似秘密輸出做遮罩。預設:true(強烈建議啟用)。
ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT預設:false。如果為 true,就會啟用音訊訊息與語音檔案識別。系統會檢查所需的 binary 或 library 依賴,缺少時會提示用戶安裝。
OPENAI_WHISPER_MODELWhisper STT 使用的模型。預設:base
可用模型:tiny 約 72 MB、base 約 139 MB、large-v3-turbo 約 1.5 GB
模型會在你第一次傳送語音訊息時自動下載。建議一般使用選 base。如果你想要更好的準確率與品質,可以嘗試 turbo
OPENAI_WHISPER_TIMEOUT_SECONDS預設:120。STT 進程的逾時時間。一般來說處理速度已足夠快,但如果你選擇 turbo,首次下載可能會視乎網速而超出逾時限制。
SNAPSHOT_INCLUDE_PATH_GLOBS 強制把符合條件的 path 納入 diff。例子:.github/*,.profile.test,.profile.prod
@@ -126,7 +127,7 @@ cd coding-agent-telegram ./startup.sh ``` -### 啟動 Bot Server +### 🌐 啟動 Bot Server ##### 第一次執行時,app 會建立 env 檔案,並告訴你需要填寫哪些欄位。 ##### 更新 env 檔案後,再次執行: ```bash @@ -137,6 +138,32 @@ coding-agent-telegram ./startup.sh ``` +## 🎙️ [可選] Speech-to-Text 功能:準備本機 OpenAI-Whisper 依賴 + +這部分可選啟用 Telegram 語音訊息的本機 Whisper 語音轉文字功能。音訊檔案最大限制為 `20 MB`。 + +```bash +# 如果你是用 pip 安裝 +coding-agent-telegram-stt-install + +# 如果你是從 clone 的 repository 執行 +./install-stt.sh +``` + +建議的 env 設定: + +```text +ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true +OPENAI_WHISPER_MODEL=base +OPENAI_WHISPER_TIMEOUT_SECONDS=120 +``` + +說明: + +- Whisper 會在首次使用時自動把所選模型下載到 `~/.cache/whisper`。 +- 如果你選擇 `OPENAI_WHISPER_MODEL=turbo`,第一次語音轉錄更容易在 `large-v3-turbo.pt` 尚在下載時觸發逾時。 +- 語音訊息轉錄完成後,bot 會先把辨識出的文字回傳到 Telegram,再把它交給 agent。這樣更方便排查辨識錯誤。 + ## 🔑 Telegram 設定 ### 取得 Bot Token @@ -171,6 +198,13 @@ https://api.telegram.org/bot/getUpdates ## 📨 支援的訊息類型 +bot 目前接受: + +- 文字訊息 +- 圖片 +- 當 `ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true` 且已安裝本機 Whisper 依賴時的語音訊息與音訊檔案 +- Codex 與 Copilot 目前只支援文字與圖片,不支援影片 + ## 🤖 Telegram 指令 @@ -323,6 +357,18 @@ https://api.telegram.org/bot/getUpdates + + + + + + + + + + + + diff --git a/install-stt.sh b/install-stt.sh new file mode 100755 index 0000000..a74dba7 --- /dev/null +++ b/install-stt.sh @@ -0,0 +1,37 @@ +#!/usr/bin/env bash + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "$SCRIPT_DIR" + +PYTHON_BIN="${PYTHON_BIN:-python3}" +VENV_DIR="${VENV_DIR:-.venv}" +ENV_FILE="${ENV_FILE:-}" +LOCAL_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION_FOR_CODING_AGENT_TELEGRAM:-0.0.dev0}" + +if ! command -v "$PYTHON_BIN" >/dev/null 2>&1; then + echo "Error: $PYTHON_BIN was not found in PATH." >&2 + exit 1 +fi + +if [[ ! -d "$VENV_DIR" ]]; then + "$PYTHON_BIN" -m venv "$VENV_DIR" +fi + +source "$VENV_DIR/bin/activate" +python -m pip install --upgrade pip >/dev/null + +if ! 
python -c "import coding_agent_telegram" >/dev/null 2>&1; then
+  echo "Installing local package into $VENV_DIR so the shared STT installer is available."
+  SETUPTOOLS_SCM_PRETEND_VERSION_FOR_CODING_AGENT_TELEGRAM="$LOCAL_PRETEND_VERSION" \
+    python -m pip install -e .
+fi
+
+ARGS=("install")
+if [[ -n "$ENV_FILE" ]]; then
+  ARGS+=("--env-file" "$ENV_FILE")
+fi
+ARGS+=("--python-bin" "$(command -v python)")
+
+exec python -m coding_agent_telegram.stt_setup "${ARGS[@]}"
diff --git a/install.sh b/install.sh
index 4ff3b92..4424b95 100644
--- a/install.sh
+++ b/install.sh
@@ -28,7 +28,4 @@ if [[ -z "$COMMAND_PATH" && ":$PATH:" != *":$SCRIPT_DIR:"* ]]; then
 fi
 
 echo "Starting coding-agent-telegram..."
-if [[ -n "$COMMAND_PATH" ]]; then
-  exec "$COMMAND_PATH"
-fi
 exec "$PYTHON_BIN" -m coding_agent_telegram
diff --git a/pyproject.toml b/pyproject.toml
index 20ff289..59db972 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -18,6 +18,7 @@ dependencies = [
 
 [project.scripts]
 coding-agent-telegram = "coding_agent_telegram.cli:main"
+coding-agent-telegram-stt-install = "coding_agent_telegram.stt_setup:main"
 
 [tool.setuptools]
 package-dir = {"" = "src"}
diff --git a/src/coding_agent_telegram/bot.py b/src/coding_agent_telegram/bot.py
index c0fd0aa..1d4e585 100644
--- a/src/coding_agent_telegram/bot.py
+++ b/src/coding_agent_telegram/bot.py
@@ -20,6 +20,25 @@
 TELEGRAM_GET_UPDATES_CONNECTION_POOL_SIZE = 2
 
 
+def _describe_message_types(message) -> list[str]:
+    types: list[str] = []
+    for field_name in (
+        "text",
+        "photo",
+        "voice",
+        "audio",
+        "document",
+        "video",
+        "video_note",
+        "animation",
+        "sticker",
+    ):
+        value = getattr(message, field_name, None)
+        if value:
+            types.append(field_name)
+    return types
+
+
 def default_bot_commands(*, enable_commit_command: bool, locale: str = DEFAULT_LOCALE) -> list[BotCommand]:
     commands = [
         BotCommand("provider", translate(locale, "bot.command.provider")),
@@ -106,9 +125,22 @@ def build_application(token: str, router: CommandRouter, *, allowed_chat_ids: se
         | tg_filters.Sticker.ALL
         | tg_filters.VIDEO
         | tg_filters.VIDEO_NOTE
-        | tg_filters.VOICE
     )
 
+    async def log_incoming_private_message(update, _context) -> None:
+        message = getattr(update, "message", None)
+        chat = getattr(update, "effective_chat", None)
+        if message is None or chat is None:
+            return
+        logger.info(
+            "Incoming Telegram message chat=%s message_id=%s types=%s text_preview=%.120r",
+            chat.id,
+            getattr(message, "message_id", None),
+            ",".join(_describe_message_types(message)) or "unknown",
+            getattr(message, "text", None) or "",
+        )
+
+    app.add_handler(MessageHandler(allowed_private, log_incoming_private_message, block=False), group=-1)
     app.add_handler(CommandHandler("provider", router.handle_provider, filters=allowed_private))
     app.add_handler(CommandHandler("project", router.handle_project, filters=allowed_private))
     app.add_handler(CommandHandler("branch", router.handle_branch, filters=allowed_private))
@@ -127,6 +159,8 @@ def build_application(token: str, router: CommandRouter, *, allowed_chat_ids: se
     app.add_handler(CallbackQueryHandler(router.handle_push_callback, pattern=r"^push:(confirm|cancel)$"))
     app.add_handler(CallbackQueryHandler(router.handle_trust_project_callback, pattern=r"^trustproject:(yes|no):"))
     app.add_handler(MessageHandler(allowed_private & tg_filters.PHOTO, router.handle_photo, block=False))
+    app.add_handler(MessageHandler(allowed_private & tg_filters.AUDIO, router.handle_audio, block=False))
+    app.add_handler(MessageHandler(allowed_private & tg_filters.VOICE, router.handle_voice, block=False))
     app.add_handler(MessageHandler(allowed_private & tg_filters.TEXT & ~tg_filters.COMMAND, router.handle_message, block=False))
     app.add_handler(MessageHandler(allowed_private & unsupported_media, router.handle_unsupported_message))
     app.add_error_handler(build_error_handler(router.deps.cfg.locale))
diff --git a/src/coding_agent_telegram/cli.py b/src/coding_agent_telegram/cli.py
index 2debcd3..3cf2299 100644
--- a/src/coding_agent_telegram/cli.py
+++ b/src/coding_agent_telegram/cli.py
@@ -14,6 +14,7 @@
 from coding_agent_telegram.i18n import translate
 from coding_agent_telegram.logging_utils import setup_logging
 from coding_agent_telegram.session_store import SessionStore
+from coding_agent_telegram.stt_setup import ensure_stt_runtime_or_exit, offer_stt_install_for_new_env
 
 logger = logging.getLogger(__name__)
 
@@ -123,6 +124,11 @@ def main() -> None:
             ),
             file=sys.stderr,
         )
+        offer_stt_install_for_new_env(
+            env_file=str(env_path),
+            python_bin=sys.executable,
+            installer_label="coding-agent-telegram-stt-install",
+        )
     try:
         cfg = load_config(env_path)
     except ValueError as exc:
@@ -140,6 +146,11 @@ def main() -> None:
 
     log_file = setup_logging(cfg.log_level, cfg.log_dir)
     logger.info("Logging to %s", log_file)
+    try:
+        ensure_stt_runtime_or_exit(cfg.enable_openai_whisper_speech_to_text)
+    except SystemExit as exc:
+        logger.error("%s", exc)
+        raise
 
     store = SessionStore(cfg.state_file, cfg.state_backup_file)
     runner = MultiAgentRunner(
diff --git a/src/coding_agent_telegram/config.py b/src/coding_agent_telegram/config.py
index 59c435a..be8eb09 100644
--- a/src/coding_agent_telegram/config.py
+++ b/src/coding_agent_telegram/config.py
@@ -21,6 +21,8 @@ DEFAULT_ENV_FILE_NAME = ".env_coding_agent_telegram"
 
 # 0 = disabled. Set to a positive value to kill runaway agent processes.
 DEFAULT_AGENT_HARD_TIMEOUT_SECONDS = 0
+DEFAULT_OPENAI_WHISPER_MODEL = "base"
+DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS = 120
 
 
 @dataclass(frozen=True)
@@ -51,6 +53,9 @@ class AppConfig:
     max_telegram_message_length: int
     enable_sensitive_diff_filter: bool
     enable_secret_scrub_filter: bool
+    enable_openai_whisper_speech_to_text: bool
+    openai_whisper_model: str
+    openai_whisper_timeout_seconds: int
     default_agent_provider: str
     agent_hard_timeout_seconds: int
     app_internal_root: Path
@@ -227,6 +232,15 @@ def load_config(env_file: Optional[Path] = None) -> AppConfig:
         ),
         enable_sensitive_diff_filter=_parse_bool(os.getenv("ENABLE_SENSITIVE_DIFF_FILTER", "true"), default=True),
         enable_secret_scrub_filter=_parse_bool(os.getenv("ENABLE_SECRET_SCRUB_FILTER", "true"), default=True),
+        enable_openai_whisper_speech_to_text=_parse_bool(
+            os.getenv("ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT", "false")
+        ),
+        openai_whisper_model=os.getenv("OPENAI_WHISPER_MODEL", DEFAULT_OPENAI_WHISPER_MODEL).strip()
+        or DEFAULT_OPENAI_WHISPER_MODEL,
+        openai_whisper_timeout_seconds=max(
+            1,
+            int(os.getenv("OPENAI_WHISPER_TIMEOUT_SECONDS", str(DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS))),
+        ),
         default_agent_provider=provider,
         agent_hard_timeout_seconds=int(
             os.getenv("AGENT_HARD_TIMEOUT_SECONDS", str(DEFAULT_AGENT_HARD_TIMEOUT_SECONDS))
diff --git a/src/coding_agent_telegram/resources/.env.example b/src/coding_agent_telegram/resources/.env.example
index eac4569..582d866 100644
--- a/src/coding_agent_telegram/resources/.env.example
+++ b/src/coding_agent_telegram/resources/.env.example
@@ -90,6 +90,22 @@
 ENABLE_SENSITIVE_DIFF_FILTER=true
 
 # Strongly recommended: keep this set to true.
 ENABLE_SECRET_SCRUB_FILTER=true
 
+# If true, enable Telegram voice-message speech-to-text through local openai-whisper.
+# Default: false. Run coding-agent-telegram-stt-install (pip install) or ./install-stt.sh (repo clone) first.
+# Estimated local footprint: openai-whisper package ~50 MB, ffmpeg ~50 MB, plus Whisper model downloads.
+# Example model cache sizes: tiny ~72 MB, base ~139 MB, large-v3-turbo ~1.5 GB.
+ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=false
+
+# Whisper model name to use for Telegram voice-message speech-to-text.
+# Recommended default: base. `turbo` downloads the large-v3-turbo model (~1.5 GB).
+# Models download automatically on first use into ~/.cache/whisper.
+# If the selected model is not cached yet, the first voice transcription may take longer.
+# With `turbo`, that first call is more likely to hit OPENAI_WHISPER_TIMEOUT_SECONDS before the download finishes.
+OPENAI_WHISPER_MODEL=base
+
+# Timeout for a single Whisper transcription call, in seconds.
+OPENAI_WHISPER_TIMEOUT_SECONDS=120
+
 # Default agent provider for new sessions: codex or copilot.
 DEFAULT_AGENT_PROVIDER=codex
diff --git a/src/coding_agent_telegram/resources/locales/de.json b/src/coding_agent_telegram/resources/locales/de.json
index 0efee11..1d48a75 100644
--- a/src/coding_agent_telegram/resources/locales/de.json
+++ b/src/coding_agent_telegram/resources/locales/de.json
@@ -26,7 +26,8 @@
   "git.usage_push": "Verwendung: /push",
   "message.photo_only_codex": "Fotoanhänge werden derzeit nur für Codex-Sitzungen unterstützt.",
   "message.question_queued": "Frage als Q{question_number} in die Warteschlange gestellt. Sie wird verarbeitet, sobald die aktuelle Agent-Aufgabe abgeschlossen ist.",
-  "message.unsupported_message_type": "Nicht unterstützter Nachrichtentyp.\nDieser Bot akzeptiert derzeit nur Textnachrichten und Fotos.",
+  "message.voice_speech_to_text_disabled": "Sprachnachrichten sind nicht aktiviert.\nSetze ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true und installiere zuerst die lokalen Whisper-Voraussetzungen.",
+  "message.unsupported_message_type": "Nicht unterstützter Nachrichtentyp.\nDieser Bot akzeptiert derzeit Textnachrichten, Fotos, Sprachnachrichten und Audiodateien.",
   "queue.button_group": "Fragen gruppieren",
   "queue.button_no": "Nein",
   "queue.button_single": "Einzeln verarbeiten",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "Das Fortsetzen ist fehlgeschlagen, daher wurde eine neue Sitzung erstellt.\nNeue Sitzungs-ID: {session_id}\nNeuer Sitzungsname: {session_name}",
   "runtime.resume_id_changed": "Das Fortsetzen war erfolgreich, aber die Sitzungs-ID hat sich geändert.\nNeue Sitzungs-ID: {session_id}\nNeuer Sitzungsname: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nDiese Datei enthält sensible Inhalte und wurde ausgelassen.",
+  "runtime.voice_conversion_failed": "Sprachumwandlung fehlgeschlagen.",
+  "runtime.voice_conversion_timed_out": "Zeitlimit für Sprachumwandlung erreicht.",
+  "runtime.voice_model_initial_download_note": "Das gewählte Whisper-Modell wird beim ersten Aufruf möglicherweise noch heruntergeladen. Größere Modelle wie turbo erreichen dieses Zeitlimit eher.",
+  "runtime.voice_transcript_preview": "Erkanntes Sprachtranskript:\n{transcript}\n\nWird bearbeitet...",
+  "runtime.voice_transcript_queued_preview": "Erkanntes Sprachtranskript:\n{transcript}\n\nAls Q{question_number} in die Warteschlange gestellt. Es wird verarbeitet, sobald die aktuelle Agent-Aufgabe abgeschlossen ist.",
   "runtime.working_on_it": "Wird bearbeitet...",
   "status.abort_signal_sent": "Abbruchsignal für den aktuellen Projektlauf gesendet.",
   "status.no_running_agent": "Für das aktuelle Projekt wurde kein laufender Agent-Prozess gefunden.",
diff --git a/src/coding_agent_telegram/resources/locales/en.json b/src/coding_agent_telegram/resources/locales/en.json
index 7748e71..c1e1ba0 100644
--- a/src/coding_agent_telegram/resources/locales/en.json
+++ b/src/coding_agent_telegram/resources/locales/en.json
@@ -26,7 +26,8 @@
   "git.usage_push": "Usage: /push",
   "message.photo_only_codex": "Photo attachments are currently supported only for codex sessions.",
   "message.question_queued": "Question queued as Q{question_number}. It will run after the current agent task finishes.",
-  "message.unsupported_message_type": "Unsupported message type.\nThis bot currently accepts only text messages and photos.",
+  "message.voice_speech_to_text_disabled": "Voice messages are not enabled.\nSet ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true and install the local Whisper prerequisites first.",
+  "message.unsupported_message_type": "Unsupported message type.\nThis bot currently accepts text messages, photos, voice messages, and audio files.",
   "queue.button_group": "Group the questions",
   "queue.button_cancel": "Cancel",
   "queue.button_no": "No",
@@ -56,6 +57,12 @@
   "runtime.resume_created_new": "Resume failed, so a new session was created.\nNew session ID: {session_id}\nNew session name: {session_name}",
   "runtime.resume_id_changed": "Resume succeeded, but the session ID changed.\nNew session ID: {session_id}\nNew session name: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nThis file contains sensitive content and was omitted.",
+  "runtime.voice_conversion_failed": "Voice conversion failed.",
+  "runtime.voice_conversion_timed_out": "Voice conversion timed out.",
+  "runtime.voice_audio_too_large": "Audio is too large for local speech-to-text. The maximum supported size is {max_size_mb} MB.",
+  "runtime.voice_model_initial_download_note": "The selected Whisper model may still be downloading on first use. Larger models such as turbo are more likely to hit this timeout.",
+  "runtime.voice_transcript_preview": "Recognized voice transcript:\n{transcript}\n\nWorking on it...",
+  "runtime.voice_transcript_queued_preview": "Recognized voice transcript:\n{transcript}\n\nQueued as Q{question_number}. It will run after the current agent task finishes.",
   "runtime.working_on_it": "Working on it...",
   "status.abort_signal_sent": "Abort signal sent for the current project run.",
   "status.no_running_agent": "No running agent process was found for the current project.",
diff --git a/src/coding_agent_telegram/resources/locales/fr.json b/src/coding_agent_telegram/resources/locales/fr.json
index 2fb4187..9700b88 100644
--- a/src/coding_agent_telegram/resources/locales/fr.json
+++ b/src/coding_agent_telegram/resources/locales/fr.json
@@ -26,7 +26,8 @@
   "git.usage_push": "Utilisation : /push",
   "message.photo_only_codex": "Les pièces jointes photo sont actuellement prises en charge uniquement pour les sessions Codex.",
   "message.question_queued": "Question mise en file d’attente sous Q{question_number}. Elle sera traitée une fois la tâche actuelle terminée.",
-  "message.unsupported_message_type": "Type de message non pris en charge.\nCe bot accepte actuellement uniquement les messages texte et les photos.",
+  "message.voice_speech_to_text_disabled": "Les messages vocaux ne sont pas activés.\nDéfinissez ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true et installez d'abord les prérequis locaux de Whisper.",
+  "message.unsupported_message_type": "Type de message non pris en charge.\nCe bot accepte actuellement les messages texte, les photos, les messages vocaux et les fichiers audio.",
   "queue.button_group": "Regrouper les questions",
   "queue.button_no": "Non",
   "queue.button_single": "Traiter une par une",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "La reprise a échoué, donc une nouvelle session a été créée.\nNouvel ID de session : {session_id}\nNouveau nom de session : {session_name}",
   "runtime.resume_id_changed": "La reprise a réussi, mais l’ID de session a changé.\nNouvel ID de session : {session_id}\nNouveau nom de session : {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nCe fichier contient des données sensibles et a été omis.",
+  "runtime.voice_conversion_failed": "La conversion vocale a échoué.",
+  "runtime.voice_conversion_timed_out": "La conversion vocale a dépassé le délai.",
+  "runtime.voice_model_initial_download_note": "Le modèle Whisper sélectionné est peut-être encore en cours de téléchargement lors du premier usage. Les modèles plus volumineux comme turbo risquent davantage d’atteindre ce délai.",
+  "runtime.voice_transcript_preview": "Transcription vocale reconnue :\n{transcript}\n\nTraitement en cours...",
+  "runtime.voice_transcript_queued_preview": "Transcription vocale reconnue :\n{transcript}\n\nMise en file d’attente sous Q{question_number}. Elle sera traitée une fois la tâche actuelle terminée.",
   "runtime.working_on_it": "Traitement en cours...",
   "status.abort_signal_sent": "Signal d’arrêt envoyé pour l’exécution actuelle du projet.",
   "status.no_running_agent": "Aucun processus d’agent en cours n’a été trouvé pour le projet actuel.",
diff --git a/src/coding_agent_telegram/resources/locales/ja.json b/src/coding_agent_telegram/resources/locales/ja.json
index c9aa776..6524730 100644
--- a/src/coding_agent_telegram/resources/locales/ja.json
+++ b/src/coding_agent_telegram/resources/locales/ja.json
@@ -26,7 +26,8 @@
   "git.usage_push": "使い方: /push",
   "message.photo_only_codex": "写真添付は現在 Codex セッションでのみサポートされています。",
   "message.question_queued": "質問は Q{question_number} としてキューに追加されました。現在のエージェント処理が終わった後に実行されます。",
-  "message.unsupported_message_type": "未対応のメッセージ種類です。\nこのボットは現在、テキストメッセージと写真のみ受け付けます。",
+  "message.voice_speech_to_text_disabled": "音声メッセージは有効になっていません。\nENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true を設定し、先にローカル Whisper の前提条件をインストールしてください。",
+  "message.unsupported_message_type": "未対応のメッセージ種類です。\nこのボットは現在、テキストメッセージ、写真、音声メッセージ、音声ファイルを受け付けます。",
   "queue.button_group": "質問をまとめる",
   "queue.button_no": "いいえ",
   "queue.button_single": "1つずつ処理",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "再開に失敗したため、新しいセッションを作成しました。\n新しいセッション ID: {session_id}\n新しいセッション名: {session_name}",
   "runtime.resume_id_changed": "再開には成功しましたが、セッション ID が変わりました。\n新しいセッション ID: {session_id}\n新しいセッション名: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nこのファイルには機密内容が含まれているため省略されました。",
+  "runtime.voice_conversion_failed": "音声の変換に失敗しました。",
+  "runtime.voice_conversion_timed_out": "音声の変換がタイムアウトしました。",
+  "runtime.voice_model_initial_download_note": "選択した Whisper モデルは初回利用時にまだダウンロード中の可能性があります。turbo のような大きなモデルはこのタイムアウトに達しやすくなります。",
+  "runtime.voice_transcript_preview": "認識された音声文字起こし:\n{transcript}\n\n処理中です...",
+  "runtime.voice_transcript_queued_preview": "認識された音声文字起こし:\n{transcript}\n\nQ{question_number} としてキューに追加されました。現在のエージェント処理が終わった後に実行されます。",
   "runtime.working_on_it": "処理中です...",
   "status.abort_signal_sent": "現在のプロジェクト実行に中止シグナルを送信しました。",
   "status.no_running_agent": "現在のプロジェクトで実行中のエージェントプロセスは見つかりませんでした。",
diff --git a/src/coding_agent_telegram/resources/locales/ko.json b/src/coding_agent_telegram/resources/locales/ko.json
index 5a18c0d..2680cd3 100644
--- a/src/coding_agent_telegram/resources/locales/ko.json
+++ b/src/coding_agent_telegram/resources/locales/ko.json
@@ -26,7 +26,8 @@
   "git.usage_push": "사용법: /push",
   "message.photo_only_codex": "사진 첨부는 현재 Codex 세션에서만 지원됩니다.",
   "message.question_queued": "질문이 Q{question_number} 로 대기열에 추가되었습니다. 현재 에이전트 작업이 끝난 뒤 처리됩니다.",
-  "message.unsupported_message_type": "지원되지 않는 메시지 유형입니다.\n이 봇은 현재 텍스트 메시지와 사진만 받습니다.",
+  "message.voice_speech_to_text_disabled": "음성 메시지가 활성화되어 있지 않습니다.\nENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true 를 설정하고 먼저 로컬 Whisper 필수 요소를 설치하세요.",
+  "message.unsupported_message_type": "지원되지 않는 메시지 유형입니다.\n이 봇은 현재 텍스트 메시지, 사진, 음성 메시지, 오디오 파일을 받습니다.",
   "queue.button_group": "질문 묶기",
   "queue.button_no": "아니요",
   "queue.button_single": "하나씩 처리",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "재개에 실패하여 새 세션이 생성되었습니다.\n새 세션 ID: {session_id}\n새 세션 이름: {session_name}",
   "runtime.resume_id_changed": "재개에는 성공했지만 세션 ID가 변경되었습니다.\n새 세션 ID: {session_id}\n새 세션 이름: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\n이 파일에는 민감한 내용이 포함되어 있어 생략되었습니다.",
+  "runtime.voice_conversion_failed": "음성 변환에 실패했습니다.",
+  "runtime.voice_conversion_timed_out": "음성 변환 시간이 초과되었습니다.",
+  "runtime.voice_model_initial_download_note": "선택한 Whisper 모델이 첫 사용 시 아직 다운로드 중일 수 있습니다. turbo 같은 큰 모델은 이 시간 제한에 더 걸리기 쉽습니다.",
+  "runtime.voice_transcript_preview": "인식된 음성 전사:\n{transcript}\n\n처리 중입니다...",
+  "runtime.voice_transcript_queued_preview": "인식된 음성 전사:\n{transcript}\n\nQ{question_number} 로 대기열에 추가되었습니다. 현재 에이전트 작업이 끝난 뒤 처리됩니다.",
   "runtime.working_on_it": "처리 중...",
   "status.abort_signal_sent": "현재 프로젝트 실행에 중단 신호를 보냈습니다.",
   "status.no_running_agent": "현재 프로젝트에서 실행 중인 에이전트 프로세스를 찾지 못했습니다.",
diff --git a/src/coding_agent_telegram/resources/locales/nl.json b/src/coding_agent_telegram/resources/locales/nl.json
index ca860b9..401388b 100644
--- a/src/coding_agent_telegram/resources/locales/nl.json
+++ b/src/coding_agent_telegram/resources/locales/nl.json
@@ -26,7 +26,8 @@
   "git.usage_push": "Gebruik: /push",
   "message.photo_only_codex": "Foto-bijlagen worden momenteel alleen ondersteund voor Codex-sessies.",
   "message.question_queued": "Vraag in de wachtrij geplaatst als Q{question_number}. Deze wordt verwerkt nadat de huidige agenttaak is voltooid.",
-  "message.unsupported_message_type": "Niet-ondersteund berichttype.\nDeze bot accepteert momenteel alleen tekstberichten en foto's.",
+  "message.voice_speech_to_text_disabled": "Spraakberichten zijn niet ingeschakeld.\nZet ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true en installeer eerst de lokale Whisper-vereisten.",
+  "message.unsupported_message_type": "Niet-ondersteund berichttype.\nDeze bot accepteert momenteel tekstberichten, foto's, spraakberichten en audiobestanden.",
   "queue.button_group": "Vragen groeperen",
   "queue.button_no": "Nee",
   "queue.button_single": "Eén voor één verwerken",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "Hervatten is mislukt, daarom is een nieuwe sessie gemaakt.\nNieuwe sessie-ID: {session_id}\nNieuwe sessienaam: {session_name}",
   "runtime.resume_id_changed": "Hervatten is gelukt, maar de sessie-ID is gewijzigd.\nNieuwe sessie-ID: {session_id}\nNieuwe sessienaam: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nDit bestand bevat gevoelige inhoud en is weggelaten.",
+  "runtime.voice_conversion_failed": "Spraakconversie mislukt.",
+  "runtime.voice_conversion_timed_out": "Time-out tijdens spraakconversie.",
+  "runtime.voice_model_initial_download_note": "Het gekozen Whisper-model wordt bij het eerste gebruik mogelijk nog gedownload. Grotere modellen zoals turbo lopen eerder tegen deze time-out aan.",
+  "runtime.voice_transcript_preview": "Herkend spraaktranscript:\n{transcript}\n\nBezig...",
+  "runtime.voice_transcript_queued_preview": "Herkend spraaktranscript:\n{transcript}\n\nIn de wachtrij geplaatst als Q{question_number}. Dit wordt verwerkt nadat de huidige agenttaak is voltooid.",
   "runtime.working_on_it": "Bezig...",
   "status.abort_signal_sent": "Afbreeksignaal verzonden voor de huidige projectrun.",
   "status.no_running_agent": "Er is geen draaiend agentproces gevonden voor het huidige project.",
diff --git a/src/coding_agent_telegram/resources/locales/th.json b/src/coding_agent_telegram/resources/locales/th.json
index 9c4416e..449a3e0 100644
--- a/src/coding_agent_telegram/resources/locales/th.json
+++ b/src/coding_agent_telegram/resources/locales/th.json
@@ -26,7 +26,8 @@
   "git.usage_push": "วิธีใช้: /push",
   "message.photo_only_codex": "ขณะนี้รองรับไฟล์แนบรูปภาพเฉพาะสำหรับเซสชัน Codex เท่านั้น",
   "message.question_queued": "จัดคิวคำถามเป็น Q{question_number} แล้ว จะประมวลผลหลังจากงานเอเจนต์ปัจจุบันเสร็จสิ้น",
-  "message.unsupported_message_type": "ประเภทข้อความไม่รองรับ\nขณะนี้บอตนี้รองรับเฉพาะข้อความตัวอักษรและรูปภาพเท่านั้น",
+  "message.voice_speech_to_text_disabled": "ยังไม่ได้เปิดใช้งานข้อความเสียง\nตั้งค่า ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true และติดตั้งส่วนที่ Whisper ต้องใช้ในเครื่องก่อน",
+  "message.unsupported_message_type": "ประเภทข้อความไม่รองรับ\nขณะนี้บอตนี้รองรับข้อความตัวอักษร รูปภาพ ข้อความเสียง และไฟล์เสียง",
   "queue.button_group": "รวมคำถาม",
   "queue.button_no": "ไม่",
   "queue.button_single": "ประมวลผลทีละข้อ",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "กลับมาทำงานต่อไม่สำเร็จ จึงสร้างเซสชันใหม่แทน\nSession ID ใหม่: {session_id}\nชื่อเซสชันใหม่: {session_name}",
   "runtime.resume_id_changed": "กลับมาทำงานต่อได้สำเร็จ แต่ session ID เปลี่ยนไป\nSession ID ใหม่: {session_id}\nชื่อเซสชันใหม่: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nไฟล์นี้มีข้อมูลสำคัญจึงถูกละไว้",
+  "runtime.voice_conversion_failed": "แปลงเสียงเป็นข้อความไม่สำเร็จ",
+  "runtime.voice_conversion_timed_out": "การแปลงเสียงเป็นข้อความหมดเวลา",
+  "runtime.voice_model_initial_download_note": "โมเดล Whisper ที่เลือกอาจกำลังดาวน์โหลดอยู่ในการใช้งานครั้งแรก โมเดลขนาดใหญ่เช่น turbo มีโอกาสเจอ timeout นี้มากกว่า",
+  "runtime.voice_transcript_preview": "ข้อความที่ถอดจากเสียง:\n{transcript}\n\nกำลังดำเนินการ...",
+  "runtime.voice_transcript_queued_preview": "ข้อความที่ถอดจากเสียง:\n{transcript}\n\nจัดคิวเป็น Q{question_number} แล้ว จะประมวลผลหลังจากงานเอเจนต์ปัจจุบันเสร็จสิ้น",
   "runtime.working_on_it": "กำลังดำเนินการ...",
   "status.abort_signal_sent": "ส่งสัญญาณยกเลิกสำหรับการทำงานของโปรเจ็กต์ปัจจุบันแล้ว",
   "status.no_running_agent": "ไม่พบโปรเซสเอเจนต์ที่กำลังทำงานสำหรับโปรเจ็กต์ปัจจุบัน",
diff --git a/src/coding_agent_telegram/resources/locales/vi.json b/src/coding_agent_telegram/resources/locales/vi.json
index 6553de1..80e7fd9 100644
--- a/src/coding_agent_telegram/resources/locales/vi.json
+++ b/src/coding_agent_telegram/resources/locales/vi.json
@@ -26,7 +26,8 @@
   "git.usage_push": "Cách dùng: /push",
   "message.photo_only_codex": "Hiện tại tệp đính kèm ảnh chỉ được hỗ trợ cho các phiên Codex.",
   "message.question_queued": "Câu hỏi đã được xếp hàng dưới dạng Q{question_number}. Nó sẽ được xử lý sau khi tác vụ hiện tại của tác nhân hoàn tất.",
-  "message.unsupported_message_type": "Loại tin nhắn không được hỗ trợ.\nBot này hiện chỉ chấp nhận tin nhắn văn bản và ảnh.",
+  "message.voice_speech_to_text_disabled": "Tin nhắn thoại chưa được bật.\nHãy đặt ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true và cài đặt trước các điều kiện cần cục bộ của Whisper.",
+  "message.unsupported_message_type": "Loại tin nhắn không được hỗ trợ.\nBot này hiện chấp nhận tin nhắn văn bản, ảnh, tin nhắn thoại và tệp âm thanh.",
   "queue.button_group": "Gộp các câu hỏi",
   "queue.button_no": "Không",
   "queue.button_single": "Xử lý từng câu một",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "Tiếp tục thất bại, vì vậy một phiên mới đã được tạo.\nID phiên mới: {session_id}\nTên phiên mới: {session_name}",
   "runtime.resume_id_changed": "Tiếp tục thành công, nhưng ID phiên đã thay đổi.\nID phiên mới: {session_id}\nTên phiên mới: {session_name}",
   "runtime.sensitive_diff_omitted": "{path}\nTệp này chứa nội dung nhạy cảm và đã bị lược bỏ.",
+  "runtime.voice_conversion_failed": "Chuyển giọng nói thành văn bản thất bại.",
+  "runtime.voice_conversion_timed_out": "Chuyển giọng nói thành văn bản đã hết thời gian chờ.",
+  "runtime.voice_model_initial_download_note": "Model Whisper đã chọn có thể vẫn đang được tải xuống ở lần dùng đầu tiên. Các model lớn như turbo dễ chạm mốc timeout này hơn.",
+  "runtime.voice_transcript_preview": "Bản chép lời giọng nói đã nhận dạng:\n{transcript}\n\nĐang xử lý...",
+  "runtime.voice_transcript_queued_preview": "Bản chép lời giọng nói đã nhận dạng:\n{transcript}\n\nCâu hỏi đã được xếp hàng dưới dạng Q{question_number}. Nó sẽ được xử lý sau khi tác vụ hiện tại của tác nhân hoàn tất.",
   "runtime.working_on_it": "Đang xử lý...",
   "status.abort_signal_sent": "Đã gửi tín hiệu hủy cho lần chạy hiện tại của dự án.",
   "status.no_running_agent": "Không tìm thấy tiến trình tác nhân đang chạy cho dự án hiện tại.",
diff --git a/src/coding_agent_telegram/resources/locales/zh-CN.json b/src/coding_agent_telegram/resources/locales/zh-CN.json
index 8268d6d..65c2959 100644
--- a/src/coding_agent_telegram/resources/locales/zh-CN.json
+++ b/src/coding_agent_telegram/resources/locales/zh-CN.json
@@ -26,7 +26,8 @@
   "git.usage_push": "用法:/push",
   "message.photo_only_codex": "当前仅 Codex 会话支持图片附件。",
   "message.question_queued": "问题已加入队列,编号为 Q{question_number}。当前代理任务完成后将开始处理。",
-  "message.unsupported_message_type": "不支持的消息类型。\n此 bot 当前仅接受文本消息和图片。",
+  "message.voice_speech_to_text_disabled": "语音消息功能尚未启用。\n请先设置 ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true,并安装本地 Whisper 依赖。",
+  "message.unsupported_message_type": "不支持的消息类型。\n此 bot 当前接受文本消息、图片、语音消息和音频文件。",
   "queue.button_group": "合并问题",
   "queue.button_no": "否",
   "queue.button_single": "逐个处理",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "恢复失败,因此已创建一个新会话。\n新的会话 ID:{session_id}\n新的会话名称:{session_name}",
   "runtime.resume_id_changed": "恢复成功,但会话 ID 已更改。\n新的会话 ID:{session_id}\n新的会话名称:{session_name}",
   "runtime.sensitive_diff_omitted": "{path}\n此文件包含敏感内容,已省略。",
+  "runtime.voice_conversion_failed": "语音转换失败。",
+  "runtime.voice_conversion_timed_out": "语音转换超时。",
+  "runtime.voice_model_initial_download_note": "所选 Whisper 模型在首次使用时可能仍在下载。像 turbo 这样更大的模型更容易触发这个超时。",
+  "runtime.voice_transcript_preview": "识别出的语音文本:\n{transcript}\n\n正在处理...",
+  "runtime.voice_transcript_queued_preview": "识别出的语音文本:\n{transcript}\n\n问题已加入队列,编号为 Q{question_number}。当前代理任务完成后将开始处理。",
   "runtime.working_on_it": "正在处理...",
   "status.abort_signal_sent": "已向当前项目运行发送中止信号。",
   "status.no_running_agent": "当前项目未找到正在运行的代理进程。",
diff --git a/src/coding_agent_telegram/resources/locales/zh-HK.json b/src/coding_agent_telegram/resources/locales/zh-HK.json
index 5aaf6b5..8c4332f 100644
--- a/src/coding_agent_telegram/resources/locales/zh-HK.json
+++ b/src/coding_agent_telegram/resources/locales/zh-HK.json
@@ -26,7 +26,8 @@
   "git.usage_push": "用法:/push",
   "message.photo_only_codex": "目前只有 Codex 工作階段支援圖片附件。",
   "message.question_queued": "問題已加入佇列,編號為 Q{question_number}。目前代理工作完成後會開始處理。",
-  "message.unsupported_message_type": "不支援的訊息類型。\n此 bot 目前只接受文字訊息與圖片。",
+  "message.voice_speech_to_text_disabled": "語音訊息功能尚未啟用。\n請先設定 ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true,並安裝本機 Whisper 依賴。",
+  "message.unsupported_message_type": "不支援的訊息類型。\n此 bot 目前接受文字訊息、圖片、語音訊息與音訊檔案。",
   "queue.button_group": "合併問題",
   "queue.button_no": "否",
   "queue.button_single": "逐一處理",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "恢復失敗,因此已建立新的工作階段。\n新的工作階段 ID:{session_id}\n新的工作階段名稱:{session_name}",
   "runtime.resume_id_changed": "恢復成功,但工作階段 ID 已變更。\n新的工作階段 ID:{session_id}\n新的工作階段名稱:{session_name}",
   "runtime.sensitive_diff_omitted": "{path}\n此檔案包含敏感內容,已省略。",
+  "runtime.voice_conversion_failed": "語音轉換失敗。",
+  "runtime.voice_conversion_timed_out": "語音轉換逾時。",
+  "runtime.voice_model_initial_download_note": "所選的 Whisper 模型在首次使用時可能仍在下載。像 turbo 這類較大的模型更容易觸發這個逾時。",
+  "runtime.voice_transcript_preview": "辨識出的語音文字:\n{transcript}\n\n處理中...",
+  "runtime.voice_transcript_queued_preview": "辨識出的語音文字:\n{transcript}\n\n問題已加入佇列,編號為 Q{question_number}。目前代理工作完成後會開始處理。",
   "runtime.working_on_it": "處理中...",
   "status.abort_signal_sent": "已向目前專案執行送出中止訊號。",
   "status.no_running_agent": "目前專案找不到正在執行的代理程序。",
diff --git a/src/coding_agent_telegram/resources/locales/zh-TW.json b/src/coding_agent_telegram/resources/locales/zh-TW.json
index 399a65a..ef3dc35 100644
--- a/src/coding_agent_telegram/resources/locales/zh-TW.json
+++ b/src/coding_agent_telegram/resources/locales/zh-TW.json
@@ -26,7 +26,8 @@
   "git.usage_push": "用法:/push",
   "message.photo_only_codex": "目前只有 Codex 工作階段支援圖片附件。",
   "message.question_queued": "問題已加入佇列,編號為 Q{question_number}。目前代理工作完成後會開始處理。",
-  "message.unsupported_message_type": "不支援的訊息類型。\n此 bot 目前只接受文字訊息與圖片。",
+  "message.voice_speech_to_text_disabled": "語音訊息功能尚未啟用。\n請先設定 ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true,並安裝本機 Whisper 依賴。",
+  "message.unsupported_message_type": "不支援的訊息類型。\n此 bot 目前接受文字訊息、圖片、語音訊息與音訊檔案。",
   "queue.button_group": "合併問題",
   "queue.button_no": "否",
   "queue.button_single": "逐一處理",
@@ -54,6 +55,11 @@
   "runtime.resume_created_new": "恢復失敗,因此已建立新的工作階段。\n新的工作階段 ID:{session_id}\n新的工作階段名稱:{session_name}",
   "runtime.resume_id_changed": "恢復成功,但工作階段 ID 已變更。\n新的工作階段 ID:{session_id}\n新的工作階段名稱:{session_name}",
   "runtime.sensitive_diff_omitted": "{path}\n此檔案包含敏感內容,已省略。",
+  "runtime.voice_conversion_failed": "語音轉換失敗。",
+  "runtime.voice_conversion_timed_out": "語音轉換逾時。",
+  "runtime.voice_model_initial_download_note": "所選的 Whisper 模型在首次使用時可能仍在下載。像 turbo 這類較大的模型更容易觸發這個逾時。",
+  "runtime.voice_transcript_preview": "辨識出的語音文字:\n{transcript}\n\n處理中...",
+  "runtime.voice_transcript_queued_preview": "辨識出的語音文字:\n{transcript}\n\n問題已加入佇列,編號為 Q{question_number}。目前代理工作完成後會開始處理。",
   "runtime.working_on_it": "處理中...",
   "status.abort_signal_sent": "已向目前專案執行送出中止訊號。",
   "status.no_running_agent": "目前專案找不到正在執行的代理程序。",
diff --git a/src/coding_agent_telegram/router/base.py b/src/coding_agent_telegram/router/base.py
index 16e7817..17866f4 100644
--- a/src/coding_agent_telegram/router/base.py
+++ b/src/coding_agent_telegram/router/base.py
@@ -25,6 +25,7 @@
 from coding_agent_telegram.i18n import translate
 from coding_agent_telegram.session_runtime import PhotoAttachmentStore, SessionRuntime
 from coding_agent_telegram.session_store import SessionStore
+from coding_agent_telegram.speech_to_text import WhisperSpeechToText
 from coding_agent_telegram.telegram_sender import send_text
 
 
@@ -108,6 +109,7 @@ def __init__(self, deps: RouterDeps) -> None:
         self.deps = deps
         self.git = GitWorkspaceManager()
         self.photo_attachments = PhotoAttachmentStore(deps.cfg.app_internal_root)
+        self.speech_to_text = WhisperSpeechToText(deps.cfg)
         self.runtime = SessionRuntime(
             cfg=deps.cfg,
             store=deps.store,
@@ -224,6 +226,7 @@ async def _notify_if_current_project_busy(self, update: Update, context: Context
             update,
             context,
             self._t(update, "common.project_busy", project_folder=project_folder),
+            reply_to_message_id=getattr(update.message, "message_id", None),
         )
         return True
@@ -245,6 +248,7 @@ async def _run_with_typing(self, update: Update, context: ContextTypes.DEFAULT_T
                 update,
                 context,
                 self._t(update, "common.agent_already_running", project_folder=workspace_lock_key),
+                reply_to_message_id=getattr(update.message, "message_id", None),
             )
             return None
         async with lock:
@@ -350,8 +354,18 @@ async def publish(info: AgentProgressInfo) -> None:
                     text=message_text,
                 )
             except BadRequest:
+                previous_message_id = progress_state["message_id"]
                 message = await context.bot.send_message(chat_id=chat.id, text=message_text)
                 message_id = getattr(message, "message_id", None)
+                if (
+                    previous_message_id is not None
+                    and previous_message_id != message_id
+                    and hasattr(context.bot, "delete_message")
+                ):
+                    try:
+                        await context.bot.delete_message(chat_id=chat.id, message_id=previous_message_id)
+                    except BadRequest:
+                        pass
             if progress_state.get("closed") and message_id is not None and hasattr(context.bot, "delete_message"):
                 try:
                     await context.bot.delete_message(chat_id=chat.id, message_id=message_id)
diff --git a/src/coding_agent_telegram/router/message_commands.py b/src/coding_agent_telegram/router/message_commands.py
index a514c35..23a7dd1 100644
--- a/src/coding_agent_telegram/router/message_commands.py
+++ b/src/coding_agent_telegram/router/message_commands.py
@@ -1,34 +1,61 @@
 from __future__ import annotations
 
+import logging
+import tempfile
+from pathlib import Path
+
 from telegram import Update
 from telegram.ext import ContextTypes
 
 from coding_agent_telegram.session_runtime import PhotoAttachmentError
+from coding_agent_telegram.speech_to_text import SpeechToTextError
 from coding_agent_telegram.telegram_sender import send_text
 
 from .base import require_allowed_chat
 
+logger = logging.getLogger(__name__)
+MAX_STT_AUDIO_BYTES = 20 * 1024 * 1024
+
 
 class MessageCommandMixin:
-    @require_allowed_chat()
-    async def handle_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
-        if update.message is None or not update.message.text:
-            return
-        user_message = update.message.text
+    async def _process_user_message(
+        self,
+        update: Update,
+        context: ContextTypes.DEFAULT_TYPE,
+        user_message: str,
+        *,
+        suppress_working_notice: bool = False,
+    ) -> None:
         chat_id = update.effective_chat.id
-        if self._is_project_busy(chat_id) or self._has_pending_queue_decision(chat_id):
-            _queue_file, question_number = self._enqueue_chat_message(chat_id, user_message)
+        pending_action = self._pending_action(chat_id)
+        message_pending = isinstance(pending_action, dict) and pending_action.get("kind") == "message"
+        if self._is_project_busy(chat_id) or self._has_pending_queue_decision(chat_id) or message_pending:
+            _queue_file, question_number = self._enqueue_chat_message(
+                chat_id,
+                user_message,
+                reply_to_message_id=getattr(update.message, "message_id", None),
+            )
+            logger.info(
+                "Queued user message for chat %s as Q%s. Preview: %.120r",
+                chat_id,
+                question_number,
+                user_message,
+            )
             await send_text(
                 update,
                 context,
                 self._t(update, "message.question_queued", question_number=question_number),
+                reply_to_message_id=getattr(update.message, "message_id", None),
             )
             return
+        logger.info("Processing user message immediately for chat %s. Preview: %.120r", chat_id, user_message)
         self._store_pending_action(
             chat_id,
             {
                 "kind": "message",
                 "user_message": user_message,
+                "suppress_working_notice": suppress_working_notice,
             },
         )
        try:
@@ -37,6 +64,12 @@ async def handle_message(self, update: Update, context: ContextTypes.DEFAULT_TYP
         finally:
             await self._drain_chat_message_queue(chat_id, context)
 
+    @require_allowed_chat()
+    async def handle_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
+        if update.message is None or not update.message.text:
+            return
+        await self._process_user_message(update, context, update.message.text)
+
     @require_allowed_chat()
     async def handle_photo(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
         if update.message is None or not update.message.photo:
@@ -60,8 +93,186 @@ async def handle_photo(self, update: Update, context: ContextTypes.DEFAULT_TYPE)
         prompt = self.photo_attachments.build_prompt(attachment_path, project_path, caption)
         await self.runtime.run_active_session(update, context, user_message=prompt, image_paths=(attachment_path,))
 
+    async def _handle_audio_like(
+        self,
+        update: Update,
+        context: ContextTypes.DEFAULT_TYPE,
+        telegram_media,
+        *,
+        media_kind: str,
+    ) -> None:
+        if update.message is None or telegram_media is None:
+            return
+        logger.info(
+            "Received Telegram %s message for speech-to-text in chat %s.",
+            media_kind,
+            update.effective_chat.id if update.effective_chat is not None else "unknown",
+        )
+        if not self.speech_to_text.enabled:
+            await send_text(update, context, self._t(update, "message.voice_speech_to_text_disabled"))
+            return
+
+        suffix = Path(
+            getattr(telegram_media, "file_name", "") or getattr(telegram_media, "file_unique_id", "") or media_kind
+        ).suffix or ".ogg"
+        telegram_file = await telegram_media.get_file()
+        logger.debug(
+            "Speech-to-text input prepared for chat %s: media_kind=%s file_path=%r initial_suffix=%r model=%s timeout=%ss",
+            update.effective_chat.id if update.effective_chat is not
None else "unknown", + media_kind, + getattr(telegram_file, "file_path", None), + suffix, + self.speech_to_text.model, + self.speech_to_text.timeout_seconds, + ) + if suffix == ".ogg" and getattr(telegram_file, "file_path", None): + resolved_suffix = Path(telegram_file.file_path).suffix.lower() + if resolved_suffix: + suffix = resolved_suffix + + declared_size = getattr(telegram_media, "file_size", None) + if isinstance(declared_size, int) and declared_size > MAX_STT_AUDIO_BYTES: + await send_text( + update, + context, + self._t( + update, + "runtime.voice_audio_too_large", + max_size_mb=MAX_STT_AUDIO_BYTES // (1024 * 1024), + ), + ) + return + + with tempfile.NamedTemporaryFile(prefix="coding-agent-telegram-voice-", suffix=suffix, delete=False) as handle: + temp_path = Path(handle.name) + try: + content = bytes(await telegram_file.download_as_bytearray()) + if len(content) > MAX_STT_AUDIO_BYTES: + await send_text( + update, + context, + self._t( + update, + "runtime.voice_audio_too_large", + max_size_mb=MAX_STT_AUDIO_BYTES // (1024 * 1024), + ), + ) + return + temp_path.write_bytes(content) + logger.debug( + "Downloaded Telegram %s message for chat %s to %s (%s bytes).", + media_kind, + update.effective_chat.id if update.effective_chat is not None else "unknown", + temp_path, + len(content), + ) + result = await self._run_with_typing( + update, + context, + self.speech_to_text.transcribe_file, + temp_path, + ) + except SpeechToTextError as exc: + logger.warning( + "Telegram %s speech-to-text failed for chat %s: code=%s detail=%s", + media_kind, + update.effective_chat.id if update.effective_chat is not None else "unknown", + exc.code, + exc.detail or "(none)", + ) + if exc.code == "timeout": + message = self._t(update, "runtime.voice_conversion_timed_out") + else: + message = self._t(update, "runtime.voice_conversion_failed") + if exc.likely_first_download: + message = f"{message}\n\n{self._t(update, 'runtime.voice_model_initial_download_note')}" + await 
send_text(update, context, message) + return + except Exception: + logger.exception( + "Unexpected Telegram %s speech-to-text failure for chat %s.", + media_kind, + update.effective_chat.id if update.effective_chat is not None else "unknown", + ) + await send_text(update, context, self._t(update, "runtime.voice_conversion_failed")) + return + finally: + temp_path.unlink(missing_ok=True) + + if result is None: + return + chat_id = update.effective_chat.id + logger.info( + "Speech-to-text succeeded for Telegram %s message in chat %s. Transcript preview: %.120r", + media_kind, + chat_id, + result.text, + ) + logger.debug( + "Transcript metadata for chat %s: media_kind=%s chars=%s reply_to_message_id=%s", + chat_id, + media_kind, + len(result.text), + getattr(update.message, "message_id", None), + ) + pending_action = self._pending_action(chat_id) + message_pending = isinstance(pending_action, dict) and pending_action.get("kind") == "message" + if self._is_project_busy(chat_id) or self._has_pending_queue_decision(chat_id) or message_pending: + _queue_file, question_number = self._enqueue_chat_message( + chat_id, + result.text, + reply_to_message_id=getattr(update.message, "message_id", None), + ) + logger.info( + "Queued transcript from Telegram %s message for chat %s as Q%s.", + media_kind, + chat_id, + question_number, + ) + await send_text( + update, + context, + self._t( + update, + "runtime.voice_transcript_queued_preview", + transcript=result.text, + question_number=question_number, + ), + ) + return + logger.info("Dispatching transcript from Telegram %s message immediately for chat %s.", media_kind, chat_id) + await send_text( + update, + context, + self._t(update, "runtime.voice_transcript_preview", transcript=result.text), + ) + await self._process_user_message(update, context, result.text, suppress_working_notice=True) + + @require_allowed_chat() + async def handle_voice(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None: + if update.message is 
None or not update.message.voice: + return + await self._handle_audio_like(update, context, update.message.voice, media_kind="voice") + + @require_allowed_chat() + async def handle_audio(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None: + if update.message is None or not update.message.audio: + return + await self._handle_audio_like(update, context, update.message.audio, media_kind="audio") + @require_allowed_chat() async def handle_unsupported_message(self, update: Update, context: ContextTypes.DEFAULT_TYPE) -> None: + if update.message is not None: + unsupported_types = [ + field_name + for field_name in ("animation", "audio", "document", "sticker", "video", "video_note") + if getattr(update.message, field_name, None) is not None + ] + logger.info( + "Unsupported Telegram message type from chat %s: %s", + update.effective_chat.id if update.effective_chat is not None else "unknown", + ", ".join(unsupported_types) or "unknown", + ) await send_text( update, context, diff --git a/src/coding_agent_telegram/router/queue_processing.py b/src/coding_agent_telegram/router/queue_processing.py index 00a44b5..a227f04 100644 --- a/src/coding_agent_telegram/router/queue_processing.py +++ b/src/coding_agent_telegram/router/queue_processing.py @@ -1,7 +1,9 @@ from __future__ import annotations +import logging import re from collections import deque +from dataclasses import dataclass from pathlib import Path from types import SimpleNamespace @@ -12,6 +14,13 @@ QUEUED_QUESTIONS_DIR = "queued_questions" +logger = logging.getLogger(__name__) + + +@dataclass(frozen=True) +class QueuedQuestion: + text: str + reply_to_message_id: int | None = None class QueueProcessingMixin: @@ -38,38 +47,89 @@ def _next_queue_file_path(self, chat_id: int) -> Path: session_id = self._sanitize_queue_session_id(str(chat_state.get("active_session_id") or "session")) return queue_dir / f"{session_id}-queue-{next_index}.txt" - def _read_queue_questions(self, queue_file: Path) -> list[str]: + 
def _read_queue_questions(self, queue_file: Path) -> list[QueuedQuestion]: if not queue_file.exists(): return [] raw = queue_file.read_text(encoding="utf-8") - pattern = re.compile(r"^\[Question (\d+)\]\n(.*?)\n\[End Question \1\]\s*$", re.MULTILINE | re.DOTALL) - return [match.group(2).strip() for match in pattern.finditer(raw) if match.group(2).strip()] + pattern = re.compile( + r"^\[Question (\d+)\]\n(?:\[ReplyToMessageId (\d+)\]\n)?(.*?)\n\[End Question \1\]\s*$", + re.MULTILINE | re.DOTALL, + ) + questions: list[QueuedQuestion] = [] + for match in pattern.finditer(raw): + text = match.group(3).strip() + if not text: + continue + questions.append( + QueuedQuestion( + text=text, + reply_to_message_id=int(match.group(2)) if match.group(2) else None, + ) + ) + logger.debug("Loaded %s queued question(s) from %s.", len(questions), queue_file) + return questions - def _append_question_to_queue_file(self, queue_file: Path, user_message: str) -> int: + def _append_question_to_queue_file( + self, + queue_file: Path, + user_message: str, + *, + reply_to_message_id: int | None = None, + ) -> int: questions = self._read_queue_questions(queue_file) next_number = len(questions) + 1 with queue_file.open("a", encoding="utf-8") as fh: if queue_file.stat().st_size > 0: fh.write("\n") - fh.write(f"[Question {next_number}]\n{user_message.strip()}\n[End Question {next_number}]\n") + fh.write(f"[Question {next_number}]\n") + if reply_to_message_id is not None: + fh.write(f"[ReplyToMessageId {reply_to_message_id}]\n") + fh.write(f"{user_message.strip()}\n[End Question {next_number}]\n") + logger.debug( + "Appended queued question Q%s to %s with reply_to_message_id=%s.", + next_number, + queue_file, + reply_to_message_id, + ) return next_number - def _write_queue_questions(self, queue_file: Path, questions: list[str]) -> None: + def _write_queue_questions(self, queue_file: Path, questions: list[QueuedQuestion]) -> None: with queue_file.open("w", encoding="utf-8") as fh: for index, 
question in enumerate(questions, start=1): if index > 1: fh.write("\n") - fh.write(f"[Question {index}]\n{question.strip()}\n[End Question {index}]\n") + fh.write(f"[Question {index}]\n") + if question.reply_to_message_id is not None: + fh.write(f"[ReplyToMessageId {question.reply_to_message_id}]\n") + fh.write(f"{question.text.strip()}\n[End Question {index}]\n") + logger.debug("Rewrote %s queued question(s) to %s.", len(questions), queue_file) - def _enqueue_chat_message(self, chat_id: int, user_message: str) -> tuple[Path, int]: + def _enqueue_chat_message( + self, + chat_id: int, + user_message: str, + *, + reply_to_message_id: int | None = None, + ) -> tuple[Path, int]: queue = self._chat_message_queue_files.setdefault(chat_id, deque()) queue_file = queue[-1] if queue else self._next_queue_file_path(chat_id) if not queue: queue.append(queue_file) - question_number = self._append_question_to_queue_file(queue_file, user_message) + question_number = self._append_question_to_queue_file( + queue_file, + user_message, + reply_to_message_id=reply_to_message_id, + ) + logger.debug( + "Queued message for chat %s in %s as Q%s with reply_to_message_id=%s.", + chat_id, + queue_file, + question_number, + reply_to_message_id, + ) return queue_file, question_number - def _dequeue_chat_message_file(self, chat_id: int) -> tuple[Path | None, list[str]]: + def _dequeue_chat_message_file(self, chat_id: int) -> tuple[Path | None, list[QueuedQuestion]]: queue = self._chat_message_queue_files.get(chat_id) if not queue: return None, [] @@ -81,12 +141,13 @@ def _dequeue_chat_message_file(self, chat_id: int) -> tuple[Path | None, list[st return None, [] if not queue: self._chat_message_queue_files.pop(chat_id, None) + logger.debug("Dequeued %s queued question(s) for chat %s from %s.", len(questions), chat_id, queue_file) return queue_file, questions - def _queued_batch_prompt(self, queued_messages: list[str]) -> str: + def _queued_batch_prompt(self, queued_messages: 
list[QueuedQuestion]) -> str: lines = ["Answer the following queued user questions in order."] for index, message in enumerate(queued_messages, start=1): - lines.extend(["", f"[Question {index}]", message.strip(), f"[End Question {index}]"]) + lines.extend(["", f"[Question {index}]", message.text.strip(), f"[End Question {index}]"]) return "\n".join(lines) def _preview_queued_message(self, message: str, *, max_chars: int = 100) -> str: @@ -97,10 +158,10 @@ def _preview_queued_message(self, message: str, *, max_chars: int = 100) -> str: return stripped[:max_chars] return f"{stripped[: max_chars - 3]}..." - def _queued_batch_notice(self, chat_id: int, queued_messages: list[str]) -> str: + def _queued_batch_notice(self, chat_id: int, queued_messages: list[QueuedQuestion]) -> str: lines = [translate(self._chat_locale(chat_id), "queue.working_on_queued")] for index, message in enumerate(queued_messages, start=1): - lines.append(f"{index}. {self._preview_queued_message(message)}") + lines.append(f"{index}. 
{self._preview_queued_message(message.text)}") return "\n".join(lines) def _has_pending_queue_decision(self, chat_id: int) -> bool: @@ -134,7 +195,7 @@ async def _prompt_queue_batch_decision( self, chat_id: int, context: ContextTypes.DEFAULT_TYPE, - queued_messages: list[str], + queued_messages: list[QueuedQuestion], ) -> None: if not hasattr(context.bot, "send_message"): return @@ -147,7 +208,7 @@ async def _prompt_queue_batch_decision( translate(locale, "queue.here_are_questions"), ] for index, message in enumerate(queued_messages, start=1): - lines.append(f"Q{index}: {self._preview_queued_message(message)}") + lines.append(f"Q{index}: {self._preview_queued_message(message.text)}") lines.extend( [ "", @@ -190,7 +251,7 @@ async def _dispatch_queued_questions( context: ContextTypes.DEFAULT_TYPE, *, queue_file: Path, - queued_messages: list[str], + queued_messages: list[QueuedQuestion], grouped: bool, ) -> bool: self._chat_processing_queue_files[chat_id] = queue_file @@ -199,16 +260,25 @@ async def _dispatch_queued_questions( queued_notice = self._queued_batch_notice(chat_id, current_batch) queued_update = SimpleNamespace( effective_chat=SimpleNamespace(id=chat_id, type="private"), - message=SimpleNamespace(text=queued_notice, photo=None, caption=None), + message=SimpleNamespace(text=queued_notice, photo=None, caption=None, message_id=None), ) await send_text(queued_update, context, queued_notice) if grouped: user_message = self._queued_batch_prompt(queued_messages) + reply_to_message_id = None else: - user_message = queued_messages[0] + user_message = queued_messages[0].text + reply_to_message_id = queued_messages[0].reply_to_message_id + logger.debug( + "Dispatching queued question(s) for chat %s grouped=%s count=%s reply_to_message_id=%s.", + chat_id, + grouped, + len(queued_messages), + reply_to_message_id, + ) queued_update = SimpleNamespace( effective_chat=SimpleNamespace(id=chat_id, type="private"), - message=SimpleNamespace(text=user_message, photo=None, 
caption=None), + message=SimpleNamespace(text=user_message, photo=None, caption=None, message_id=reply_to_message_id), ) self.deps.store.set_pending_action( self.deps.bot_id, @@ -234,11 +304,19 @@ async def _dispatch_queued_questions( async def _drain_chat_message_queue(self, chat_id: int, context: ContextTypes.DEFAULT_TYPE) -> None: if chat_id in self._chat_message_queue_draining: + logger.debug("Queue drain already active for chat %s; skipping nested call.", chat_id) return self._chat_message_queue_draining.add(chat_id) try: while True: if self._is_project_busy(chat_id): + logger.debug("Stopping queue drain for chat %s because project is busy.", chat_id) + return + if self._pending_action(chat_id): + logger.debug("Stopping queue drain for chat %s because a pending action is unresolved.", chat_id) + return + if self._has_pending_queue_decision(chat_id): + logger.debug("Stopping queue drain for chat %s because a queue batch decision is pending.", chat_id) return last_result = self._last_run_results.pop(chat_id, None) if self._run_result_was_aborted(last_result) and self._has_pending_queue_files(chat_id): @@ -256,6 +334,7 @@ async def _drain_chat_message_queue(self, chat_id: int, context: ContextTypes.DE self._chat_processing_queue_files.pop(chat_id, None) queue_file, queued_messages = self._dequeue_chat_message_file(chat_id) if queue_file is None or not queued_messages: + logger.debug("No queued messages remain for chat %s.", chat_id) if chat_id not in self._chat_processing_queue_files and chat_id not in self._chat_message_queue_files: self._chat_queue_batch_modes.pop(chat_id, None) self._chat_next_queue_file_index.pop(chat_id, None) diff --git a/src/coding_agent_telegram/router/session_lifecycle_commands.py b/src/coding_agent_telegram/router/session_lifecycle_commands.py index 92dde64..bcb7237 100644 --- a/src/coding_agent_telegram/router/session_lifecycle_commands.py +++ b/src/coding_agent_telegram/router/session_lifecycle_commands.py @@ -214,8 +214,15 @@ async 
def _continue_pending_action(self, update: Update, context: ContextTypes.D return False if not await self._ensure_active_session_ready_for_run(update, context): return False - self._store_pending_action(chat_id, None) - self._last_run_results[chat_id] = await self.runtime.run_active_session(update, context, user_message=user_message) + try: + self._last_run_results[chat_id] = await self.runtime.run_active_session( + update, + context, + user_message=user_message, + suppress_working_notice=bool(pending_action.get("suppress_working_notice")), + ) + finally: + self._store_pending_action(chat_id, None) return True self._store_pending_action(chat_id, None) diff --git a/src/coding_agent_telegram/session_runtime.py b/src/coding_agent_telegram/session_runtime.py index e5eea62..bab9478 100644 --- a/src/coding_agent_telegram/session_runtime.py +++ b/src/coding_agent_telegram/session_runtime.py @@ -58,6 +58,11 @@ _ABSOLUTE_PATH_RE = re.compile(r"(?:^|(?<=\s)|(?<=[\"'(]))((?:/[^\s\"',;)]+)+|[A-Za-z]:\\[^\s\"',;)]+)") +def _reply_to_message_id(update: Update) -> int | None: + message = getattr(update, "message", None) + return getattr(message, "message_id", None) + + def _load_secret_scrub_patterns() -> tuple[tuple[re.Pattern[str], str], ...]: resource = importlib.resources.files("coding_agent_telegram").joinpath("resources/secret_scrub_patterns.properties") compiled: list[tuple[re.Pattern[str], str]] = [] @@ -183,6 +188,11 @@ def _locale(self, update: Update | None) -> str: def _t(self, update: Update | None, key: str, **kwargs) -> str: return translate(self._locale(update), key, **kwargs) + def _take_reply_to_message_id(self, reply_state: dict[str, int | None]) -> int | None: + reply_to_message_id = reply_state.get("reply_to_message_id") + reply_state["reply_to_message_id"] = None + return reply_to_message_id + def _next_rotated_session_name(self, chat_id: int, base_name: str) -> str: existing = { data.get("name", "").strip().lower() @@ -234,6 +244,7 @@ async def 
run_active_session( *, user_message: str, image_paths: Sequence[Path] = (), + suppress_working_notice: bool = False, ) -> AgentRunResult | None: chat_id = update.effective_chat.id active_id, session, project_path = await self._active_session_or_notify(update, context) @@ -264,7 +275,14 @@ async def run_active_session( max_text_file_bytes=self.cfg.snapshot_text_file_max_bytes, ) before = set(changed_files(project_path)) - await send_text(update, context, self._t(update, "runtime.working_on_it")) + reply_to_message_id = _reply_to_message_id(update) + if not suppress_working_notice: + await send_text( + update, + context, + self._t(update, "runtime.working_on_it"), + reply_to_message_id=reply_to_message_id, + ) result = await self.run_with_typing( update, context, @@ -366,6 +384,7 @@ async def run_active_session( result=result, before_snapshot=before_snapshot, before=before, + reply_to_message_id=reply_to_message_id, ) return result @@ -589,6 +608,7 @@ async def _send_run_results( result, before_snapshot: dict[str, str | None], before: set[str], + reply_to_message_id: int | None, ) -> None: after_snapshot = snapshot_project_files( project_path, @@ -603,8 +623,15 @@ async def _send_run_results( for file_diff in collect_snapshot_diffs(before_snapshot, after_snapshot, files) } diffs = self._merge_snapshot_diffs(diffs, snapshot_diffs_by_path) + reply_state = {"reply_to_message_id": reply_to_message_id} - await self._send_assistant_chunks(update, context, result.assistant_text, provider=provider) + await self._send_assistant_chunks( + update, + context, + result.assistant_text, + provider=provider, + reply_state=reply_state, + ) logger.info( "Completed run for chat %s on session '%s' (%s); %d changed file(s).", update.effective_chat.id, @@ -622,8 +649,9 @@ async def _send_run_results( branch_name=branch_name or None, locale=self._locale(update), ), + reply_to_message_id=self._take_reply_to_message_id(reply_state), ) - await self._send_diffs(update, context, diffs) + await 
self._send_diffs(update, context, diffs, reply_state=reply_state) def _merge_snapshot_diffs(self, diffs, snapshot_diffs_by_path): if not snapshot_diffs_by_path: @@ -650,6 +678,7 @@ async def _send_assistant_chunks( assistant_text: str, *, provider: str, + reply_state: dict[str, int | None], ) -> None: if self.cfg.enable_secret_scrub_filter: assistant_text = _scrub_secrets(assistant_text) @@ -666,6 +695,7 @@ async def _send_assistant_chunks( f"{segment.header} ({index}/{total})", segment.text, language=segment.language, + reply_to_message_id=self._take_reply_to_message_id(reply_state), ) continue @@ -682,7 +712,12 @@ async def _send_assistant_chunks( ) ) for message in self._chunk_assistant_prose(title_prefix, segment.text): - await send_html_text(update, context, message) + await send_html_text( + update, + context, + message, + reply_to_message_id=self._take_reply_to_message_id(reply_state), + ) def _chunk_assistant_prose(self, title_prefix: str, text: str) -> list[str]: normalized = text.strip() @@ -726,10 +761,22 @@ def _split_assistant_body(self, body: str) -> tuple[str, str]: left = body[:-1].rstrip() or body[:1] return left, right - async def _send_diffs(self, update: Update, context: ContextTypes.DEFAULT_TYPE, diffs) -> None: + async def _send_diffs( + self, + update: Update, + context: ContextTypes.DEFAULT_TYPE, + diffs, + *, + reply_state: dict[str, int | None], + ) -> None: for file_diff in diffs: if self.cfg.enable_sensitive_diff_filter and is_sensitive_path(file_diff.path): - await send_text(update, context, self._t(update, "runtime.sensitive_diff_omitted", path=file_diff.path)) + await send_text( + update, + context, + self._t(update, "runtime.sensitive_diff_omitted", path=file_diff.path), + reply_to_message_id=self._take_reply_to_message_id(reply_state), + ) continue for chunk in chunk_fenced_diff( file_diff.path, @@ -737,4 +784,11 @@ async def _send_diffs(self, update: Update, context: ContextTypes.DEFAULT_TYPE, self.cfg.max_telegram_message_length, 
locale=self._locale(update), ): - await send_code_block(update, context, chunk.header, chunk.code, language=chunk.language) + await send_code_block( + update, + context, + chunk.header, + chunk.code, + language=chunk.language, + reply_to_message_id=self._take_reply_to_message_id(reply_state), + ) diff --git a/src/coding_agent_telegram/speech_to_text.py b/src/coding_agent_telegram/speech_to_text.py new file mode 100644 index 0000000..6c55105 --- /dev/null +++ b/src/coding_agent_telegram/speech_to_text.py @@ -0,0 +1,140 @@ +from __future__ import annotations + +import json +import logging +import os +import subprocess +import sys +import tempfile +from dataclasses import dataclass +from pathlib import Path + +from coding_agent_telegram.config import AppConfig, DEFAULT_OPENAI_WHISPER_MODEL + + +logger = logging.getLogger(__name__) +_MODEL_CACHE_FILENAMES = { + "tiny": "tiny.pt", + "tiny.en": "tiny.en.pt", + "base": "base.pt", + "base.en": "base.en.pt", + "small": "small.pt", + "small.en": "small.en.pt", + "medium": "medium.pt", + "medium.en": "medium.en.pt", + "large": "large-v3.pt", + "large-v1": "large-v1.pt", + "large-v2": "large-v2.pt", + "large-v3": "large-v3.pt", + "large-v3-turbo": "large-v3-turbo.pt", + "turbo": "large-v3-turbo.pt", +} + + +class SpeechToTextError(RuntimeError): + def __init__(self, code: str, *, likely_first_download: bool = False, detail: str | None = None) -> None: + super().__init__(code) + self.code = code + self.likely_first_download = likely_first_download + self.detail = detail + + +@dataclass(frozen=True) +class SpeechToTextResult: + text: str + model: str + + +class WhisperSpeechToText: + def __init__(self, cfg: AppConfig) -> None: + self.enabled = cfg.enable_openai_whisper_speech_to_text + self.model = cfg.openai_whisper_model or DEFAULT_OPENAI_WHISPER_MODEL + self.timeout_seconds = cfg.openai_whisper_timeout_seconds + + def _model_cache_path(self) -> Path: + cache_root = Path(os.getenv("XDG_CACHE_HOME", Path.home() / 
".cache")).expanduser() + file_name = _MODEL_CACHE_FILENAMES.get(self.model, f"{self.model}.pt") + return cache_root / "whisper" / file_name + + def _likely_first_download(self) -> bool: + return not self._model_cache_path().exists() + + def _summarize_process_output(self, result: subprocess.CompletedProcess[str]) -> str: + parts: list[str] = [f"whisper exited with status {result.returncode}"] + stderr = (result.stderr or "").strip() + stdout = (result.stdout or "").strip() + if stderr: + parts.append(f"stderr: {stderr[:500]}") + if stdout: + parts.append(f"stdout: {stdout[:500]}") + return "; ".join(parts) + + def transcribe_file(self, audio_path: Path) -> SpeechToTextResult: + likely_first_download = self._likely_first_download() + + with tempfile.TemporaryDirectory(prefix="coding-agent-telegram-whisper-") as output_dir: + command = [ + sys.executable, + "-m", + "whisper", + str(audio_path), + "--model", + self.model, + "--task", + "transcribe", + "--output_format", + "json", + "--output_dir", + output_dir, + "--verbose", + "False", + "--fp16", + "False", + "--condition_on_previous_text", + "False", + ] + try: + result = subprocess.run( + command, + check=False, + capture_output=True, + text=True, + timeout=self.timeout_seconds, + ) + except subprocess.TimeoutExpired as exc: + raise SpeechToTextError( + "timeout", + likely_first_download=likely_first_download, + detail=f"whisper timed out after {self.timeout_seconds} seconds", + ) from exc + + if result.returncode != 0: + detail = self._summarize_process_output(result) + logger.warning("Whisper transcription failed for %s using model %s: %s", audio_path, self.model, detail) + raise SpeechToTextError("failed", likely_first_download=likely_first_download, detail=detail) + + transcript_path = Path(output_dir) / f"{audio_path.stem}.json" + if not transcript_path.exists(): + raise SpeechToTextError( + "failed", + likely_first_download=likely_first_download, + detail=f"whisper finished without writing transcript json 
for {audio_path.name}", + ) + + try: + payload = json.loads(transcript_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as exc: + raise SpeechToTextError( + "failed", + likely_first_download=likely_first_download, + detail=f"failed to parse whisper transcript json for {audio_path.name}: {exc}", + ) from exc + + text = str(payload.get("text") or "").strip() + if not text: + raise SpeechToTextError( + "empty", + likely_first_download=likely_first_download, + detail=f"whisper returned an empty transcript for {audio_path.name}", + ) + return SpeechToTextResult(text=text, model=self.model) diff --git a/src/coding_agent_telegram/stt_setup.py b/src/coding_agent_telegram/stt_setup.py new file mode 100644 index 0000000..90867be --- /dev/null +++ b/src/coding_agent_telegram/stt_setup.py @@ -0,0 +1,303 @@ +from __future__ import annotations + +import argparse +import importlib +import importlib.util +import os +import shutil +import subprocess +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Optional + +from coding_agent_telegram.config import ( + DEFAULT_OPENAI_WHISPER_MODEL, + DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS, + create_initial_env_file, + resolve_env_file_path, +) + + +ENABLE_STT_ENV = "ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT" +STT_INSTALL_HINT_ENV = "CODING_AGENT_TELEGRAM_STT_INSTALL_HINT" +STT_SIZE_GUIDANCE = ( + "Estimated local footprint: openai-whisper package about 50 MB, ffmpeg about 50 MB, " + "and Whisper model downloads vary by model size " + "(tiny about 72 MB, base about 139 MB, large-v3-turbo about 1.5 GB)." 
+) + + +@dataclass(frozen=True) +class SttPrereqStatus: + ffmpeg: bool + whisper_module: bool + + @property + def missing(self) -> list[str]: + missing: list[str] = [] + if not self.ffmpeg: + missing.append("ffmpeg") + if not self.whisper_module: + missing.append("openai-whisper (Python module)") + return missing + + @property + def ready(self) -> bool: + return not self.missing + + +def _has_whisper_module(python_bin: str | None = None) -> bool: + if python_bin is None: + return importlib.util.find_spec("whisper") is not None + result = subprocess.run( + [python_bin, "-c", "import importlib.util, sys; raise SystemExit(0 if importlib.util.find_spec('whisper') else 1)"], + check=False, + capture_output=True, + text=True, + ) + return result.returncode == 0 + + +def detect_stt_prereqs(*, python_bin: str | None = None) -> SttPrereqStatus: + importlib.invalidate_caches() + return SttPrereqStatus( + ffmpeg=shutil.which("ffmpeg") is not None, + whisper_module=_has_whisper_module(python_bin), + ) + + +def ensure_stt_runtime_or_exit(enabled: bool, *, install_hint: Optional[str] = None) -> None: + if not enabled: + return + + status = detect_stt_prereqs() + if status.ready: + return + + resolved_hint = (install_hint or os.getenv(STT_INSTALL_HINT_ENV, "")).strip() or "coding-agent-telegram-stt-install" + missing_text = ", ".join(status.missing) + raise SystemExit( + "\n".join( + [ + f"Error: {ENABLE_STT_ENV}=true but speech-to-text prerequisites are missing: {missing_text}", + f"Run: {resolved_hint}", + STT_SIZE_GUIDANCE, + ] + ) + ) + + +def _resolve_env_path(explicit: str | None = None) -> Path: + env_path = resolve_env_file_path(Path(explicit).expanduser() if explicit else None) + env_path.parent.mkdir(parents=True, exist_ok=True) + if not env_path.exists(): + create_initial_env_file(env_path) + return env_path + + +def _set_env_flag(env_path: Path, enabled: bool) -> None: + lines = [] + if env_path.exists(): + lines = env_path.read_text(encoding="utf-8").splitlines() + + 
def upsert(key: str, value: str, comments: list[str] | None = None) -> None: + replacement = f"{key}={value}" + for index, line in enumerate(lines): + if line.startswith(f"{key}="): + lines[index] = replacement + return + if lines and lines[-1].strip(): + lines.append("") + if comments: + lines.extend(comments) + lines.append(replacement) + + upsert( + ENABLE_STT_ENV, + "true" if enabled else "false", + comments=[ + "# If true, enable Telegram voice-message speech-to-text with local openai-whisper.", + "# Estimated local footprint: package ~50 MB, ffmpeg ~50 MB, model downloads vary by model size.", + ], + ) + upsert( + "OPENAI_WHISPER_MODEL", + DEFAULT_OPENAI_WHISPER_MODEL, + comments=[ + "# Whisper model name for Telegram voice-message speech-to-text.", + "# `turbo` downloads the large-v3-turbo model (~1.5 GB) on first use into ~/.cache/whisper.", + "# If turbo is not cached yet, the first voice transcription is more likely to hit the timeout.", + ], + ) + upsert( + "OPENAI_WHISPER_TIMEOUT_SECONDS", + str(DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS), + comments=["# Timeout for a single Whisper transcription call, in seconds."], + ) + + env_path.write_text("\n".join(lines) + "\n", encoding="utf-8") + + +def _prompt_yes_no(prompt: str, *, default: bool = True) -> bool: + suffix = "[Y/n]" if default else "[y/N]" + while True: + try: + answer = input(f"{prompt} {suffix} ").strip().lower() + except EOFError: + return default + if not answer: + return default + if answer in {"y", "yes"}: + return True + if answer in {"n", "no"}: + return False + print("Please answer yes or no.") + + +def _package_manager() -> tuple[str, list[str]] | tuple[None, None]: + if sys.platform == "darwin" and shutil.which("brew"): + return "brew", ["brew", "install", "ffmpeg"] + if sys.platform.startswith("linux"): + if shutil.which("apt-get"): + prefix = ["sudo"] if hasattr(os, "geteuid") and os.geteuid() != 0 and shutil.which("sudo") else [] + return "apt-get", [*prefix, "apt-get", "update", 
"&&", *prefix, "apt-get", "install", "-y", "ffmpeg"]
+        if shutil.which("dnf"):
+            prefix = ["sudo"] if hasattr(os, "geteuid") and os.geteuid() != 0 and shutil.which("sudo") else []
+            return "dnf", [*prefix, "dnf", "install", "-y", "ffmpeg"]
+        if shutil.which("yum"):
+            prefix = ["sudo"] if hasattr(os, "geteuid") and os.geteuid() != 0 and shutil.which("sudo") else []
+            return "yum", [*prefix, "yum", "install", "-y", "ffmpeg"]
+    return None, None
+
+
+def _run_shell_command(command: str) -> bool:
+    print(f"Running: {command}")
+    result = subprocess.run(command, shell=True, check=False)
+    return result.returncode == 0
+
+
+def _ensure_ffmpeg_installed() -> bool:
+    while True:
+        status = detect_stt_prereqs()
+        if status.ffmpeg:
+            return True
+
+        print("Missing required system binary: ffmpeg")
+
+        manager, command_parts = _package_manager()
+        # Both package-manager branches build the same shell command; join once.
+        install_command = " ".join(command_parts) if command_parts else ""
+
+        if install_command:
+            if not _prompt_yes_no(f"Install ffmpeg now using {manager}?"):
+                return False
+            if _run_shell_command(install_command):
+                continue
+            print("Automatic ffmpeg installation did not complete successfully.")
+            if not _prompt_yes_no("Retry ffmpeg installation?"):
+                return False
+            continue
+
+        print("Automatic ffmpeg installation is not available on this OS/package-manager combination.")
+        print("Install ffmpeg manually, then return here and choose continue.")
+        if not _prompt_yes_no("Continue after manual installation?", default=False):
+            return False
+
+
+def _ensure_whisper_installed(python_bin: str) -> bool:
+    while True:
+        status = detect_stt_prereqs(python_bin=python_bin)
+        if status.whisper_module:
+            return True
+
+        print("Missing required Python package: openai-whisper")
+        if not _prompt_yes_no(f"Install openai-whisper with {python_bin} -m pip?"):
+            return False
+        command = f"{python_bin} -m pip install --upgrade openai-whisper"
+ 
if _run_shell_command(command): + continue + print("openai-whisper installation did not complete successfully.") + if not _prompt_yes_no("Retry openai-whisper installation?"): + return False + + +def install_stt_dependencies(*, env_file: str | None = None, python_bin: str | None = None) -> int: + env_path = _resolve_env_path(env_file) + resolved_python = python_bin or sys.executable + + print(STT_SIZE_GUIDANCE) + print(f"Using env file: {env_path}") + + if not _ensure_ffmpeg_installed(): + print("Speech-to-text installation aborted before ffmpeg prerequisites were satisfied.") + return 1 + if not _ensure_whisper_installed(resolved_python): + print("Speech-to-text installation aborted before openai-whisper was installed.") + return 1 + + _set_env_flag(env_path, True) + print(f"Speech-to-text prerequisites are ready. Enabled {ENABLE_STT_ENV}=true in {env_path}.") + return 0 + + +def offer_stt_install_for_new_env( + *, + env_file: str | None = None, + python_bin: str | None = None, + installer_label: str, +) -> int: + env_path = _resolve_env_path(env_file) + print("A new env file was created for coding-agent-telegram.") + print(STT_SIZE_GUIDANCE) + if not _prompt_yes_no( + f"Do you want to enable local Whisper speech-to-text now? This will run {installer_label}.", + default=False, + ): + print(f"Keeping {ENABLE_STT_ENV}=false in {env_path}.") + return 0 + + result = install_stt_dependencies(env_file=str(env_path), python_bin=python_bin) + if result != 0: + print(f"Speech-to-text setup did not complete. 
Keeping {ENABLE_STT_ENV}=false unless you enable it later.") + _set_env_flag(env_path, False) + return 0 + return 0 + + +def main(argv: Optional[list[str]] = None) -> int: + if argv is None: + argv = sys.argv[1:] + if not argv: + argv = ["install"] + + parser = argparse.ArgumentParser(description="Install or validate local Whisper speech-to-text support.") + subparsers = parser.add_subparsers(dest="command", required=True) + + install_parser = subparsers.add_parser("install", help="Install missing speech-to-text prerequisites.") + install_parser.add_argument("--env-file", help="Explicit env file path to update.") + install_parser.add_argument("--python-bin", help="Python executable to use for pip installation.") + offer_parser = subparsers.add_parser("offer", help="Prompt whether to enable speech-to-text for a new env file.") + offer_parser.add_argument("--env-file", help="Explicit env file path to update.") + offer_parser.add_argument("--python-bin", help="Python executable to use for pip installation.") + offer_parser.add_argument("--installer-label", required=True, help="User-facing installer command label.") + + args = parser.parse_args(argv) + + if args.command == "install": + return install_stt_dependencies(env_file=args.env_file, python_bin=args.python_bin) + if args.command == "offer": + return offer_stt_install_for_new_env( + env_file=args.env_file, + python_bin=args.python_bin, + installer_label=args.installer_label, + ) + return 1 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/coding_agent_telegram/telegram_sender.py b/src/coding_agent_telegram/telegram_sender.py index e8a243b..aa0d5a2 100644 --- a/src/coding_agent_telegram/telegram_sender.py +++ b/src/coding_agent_telegram/telegram_sender.py @@ -1,6 +1,7 @@ from __future__ import annotations import html +import logging import re from dataclasses import dataclass from typing import Optional @@ -56,6 +57,7 @@ ) SHELL_LANGUAGES = {"bash", "console", "shell", "sh", "zsh"} 
DEFAULT_TELEGRAM_MESSAGE_LENGTH = 3000 +logger = logging.getLogger(__name__) @dataclass(frozen=True) @@ -75,45 +77,92 @@ def _max_telegram_message_length(context: ContextTypes.DEFAULT_TYPE) -> int: return DEFAULT_TELEGRAM_MESSAGE_LENGTH -async def send_text(update: Update, context: ContextTypes.DEFAULT_TYPE, text: str) -> None: +def _default_reply_to_message_id(update: Update, explicit_reply_to_message_id: Optional[int] = None) -> Optional[int]: + return explicit_reply_to_message_id + + +async def send_text( + update: Update, + context: ContextTypes.DEFAULT_TYPE, + text: str, + *, + reply_to_message_id: Optional[int] = None, +) -> None: if update.effective_chat is None: return max_length = _max_telegram_message_length(context) - for chunk in _split_text_chunks(text, max_length=max_length): + resolved_reply_to_message_id = _default_reply_to_message_id(update, reply_to_message_id) + chunks = _split_text_chunks(text, max_length=max_length) + logger.debug( + "Sending Telegram text message chat=%s chunks=%s reply_to_message_id=%s preview=%.120r", + update.effective_chat.id, + len(chunks), + resolved_reply_to_message_id, + text, + ) + for index, chunk in enumerate(chunks): await context.bot.send_message( chat_id=update.effective_chat.id, text=html.escape(chunk), parse_mode=ParseMode.HTML, + reply_to_message_id=resolved_reply_to_message_id if index == 0 else None, ) -async def send_markdown_text(update: Update, context: ContextTypes.DEFAULT_TYPE, text: str) -> None: +async def send_markdown_text( + update: Update, + context: ContextTypes.DEFAULT_TYPE, + text: str, + *, + reply_to_message_id: Optional[int] = None, +) -> None: if update.effective_chat is None: return + logger.debug( + "Sending Telegram markdown message chat=%s reply_to_message_id=%s preview=%.120r", + update.effective_chat.id, + reply_to_message_id, + text, + ) await context.bot.send_message( chat_id=update.effective_chat.id, text=text, parse_mode=ParseMode.MARKDOWN, + 
reply_to_message_id=_default_reply_to_message_id(update, reply_to_message_id), ) -async def send_html_text(update: Update, context: ContextTypes.DEFAULT_TYPE, text: str) -> None: +async def send_html_text( + update: Update, + context: ContextTypes.DEFAULT_TYPE, + text: str, + *, + reply_to_message_id: Optional[int] = None, +) -> None: if update.effective_chat is None: return max_length = _max_telegram_message_length(context) + logger.debug( + "Sending Telegram HTML message chat=%s reply_to_message_id=%s length=%s preview=%.120r", + update.effective_chat.id, + reply_to_message_id, + len(text), + text, + ) if len(text) > max_length: - await send_text(update, context, _strip_html_tags(text)) + await send_text(update, context, _strip_html_tags(text), reply_to_message_id=reply_to_message_id) return try: await context.bot.send_message( chat_id=update.effective_chat.id, text=text, parse_mode=ParseMode.HTML, + reply_to_message_id=_default_reply_to_message_id(update, reply_to_message_id), ) except BadRequest as exc: if "Can't parse entities" not in str(exc): raise - await send_text(update, context, _strip_html_tags(text)) + await send_text(update, context, _strip_html_tags(text), reply_to_message_id=reply_to_message_id) def markdownish_to_html(text: str) -> str: @@ -276,12 +325,22 @@ async def send_code_block( code: str, *, language: Optional[str] = None, + reply_to_message_id: Optional[int] = None, ) -> None: if update.effective_chat is None: return max_length = _max_telegram_message_length(context) chunks = _split_code_chunks(code, language, max_length=max_length) total = len(chunks) + resolved_reply_to_message_id = _default_reply_to_message_id(update, reply_to_message_id) + logger.debug( + "Sending Telegram code block chat=%s header=%r chunks=%s reply_to_message_id=%s language=%r", + update.effective_chat.id, + header, + total, + resolved_reply_to_message_id, + language, + ) for index, chunk in enumerate(chunks, start=1): current_header = header if total == 1 else 
f"{header} ({index}/{total})" escaped_code = html.escape(chunk) @@ -289,6 +348,7 @@ async def send_code_block( chat_id=update.effective_chat.id, text=html.escape(current_header), parse_mode=ParseMode.HTML, + reply_to_message_id=resolved_reply_to_message_id if index == 1 else None, ) if language: text = f"
<pre><code class=\"language-{language}\">{escaped_code}</code></pre>
" @@ -298,4 +358,5 @@ async def send_code_block( chat_id=update.effective_chat.id, text=text, parse_mode=ParseMode.HTML, + reply_to_message_id=None, ) diff --git a/startup.sh b/startup.sh index 1e212cc..9b61be4 100755 --- a/startup.sh +++ b/startup.sh @@ -67,6 +67,7 @@ if [[ -z "$ENV_FILE" ]]; then fi fi +NEW_ENV_CREATED=0 if [[ ! -f "$ENV_FILE" ]]; then if [[ -f "$ENV_TEMPLATE_FILE" ]]; then ENV_FILE_TARGET="$ENV_FILE" ENV_TEMPLATE_SOURCE="$ENV_TEMPLATE_FILE" PYTHONPATH="$SCRIPT_DIR/src${PYTHONPATH:+:$PYTHONPATH}" "$PYTHON_BIN" - <<'PY' @@ -81,16 +82,13 @@ app_locale = create_initial_env_file(env_path, template_path) print(translate(app_locale, "bootstrap.env_created_locale_line", env_path=env_path, app_locale=app_locale)) print(translate(app_locale, "bootstrap.env_created_change_line", env_path=env_path)) PY + NEW_ENV_CREATED=1 else echo "Error: $ENV_FILE is missing and $ENV_TEMPLATE_FILE was not found." >&2 exit 1 fi fi -set -a -source "$ENV_FILE" -set +a - STATE_FILE="$STATE_FILE_DEFAULT" STATE_BACKUP_FILE="$STATE_BACKUP_FILE_DEFAULT" if [[ -f "$APP_HOME_DIR/state.json" ]]; then @@ -108,6 +106,49 @@ LOG_DIR="$LOG_DIR_DEFAULT" mkdir -p "$(dirname "$STATE_FILE")" "$(dirname "$STATE_BACKUP_FILE")" "$LOG_DIR" touch "$STATE_FILE" "$STATE_BACKUP_FILE" +if [[ ! -d "$VENV_DIR" ]]; then + "$PYTHON_BIN" -m venv "$VENV_DIR" +fi + +source "$VENV_DIR/bin/activate" + +python -m pip install --upgrade pip >/dev/null +INSTALL_STATE_FILE="$VENV_DIR/$INSTALL_STATE_FILE_NAME" +CURRENT_INSTALL_FINGERPRINT="$(compute_install_fingerprint)" +STORED_INSTALL_FINGERPRINT="" +if [[ -f "$INSTALL_STATE_FILE" ]]; then + STORED_INSTALL_FINGERPRINT="$(<"$INSTALL_STATE_FILE")" +fi + +NEEDS_REINSTALL=0 +if [[ "$FORCE_REINSTALL" == "1" ]]; then + NEEDS_REINSTALL=1 +elif ! 
python -c "import coding_agent_telegram" >/dev/null 2>&1; then + NEEDS_REINSTALL=1 +elif [[ "$CURRENT_INSTALL_FINGERPRINT" != "$STORED_INSTALL_FINGERPRINT" ]]; then + NEEDS_REINSTALL=1 +fi + +if [[ "$NEEDS_REINSTALL" == "1" ]]; then + echo "Installing local package into $VENV_DIR." + SETUPTOOLS_SCM_PRETEND_VERSION_FOR_CODING_AGENT_TELEGRAM="$LOCAL_PRETEND_VERSION" \ + python -m pip install -e . + printf '%s\n' "$CURRENT_INSTALL_FINGERPRINT" > "$INSTALL_STATE_FILE" +else + echo "Existing editable install detected; skipping reinstall." +fi + +if [[ "$NEW_ENV_CREATED" == "1" ]]; then + python -m coding_agent_telegram.stt_setup offer \ + --env-file "$ENV_FILE" \ + --python-bin "$VENV_DIR/bin/python" \ + --installer-label "./install-stt.sh" +fi + +set -a +source "$ENV_FILE" +set +a + required_vars=( WORKSPACE_ROOT TELEGRAM_BOT_TOKENS @@ -159,43 +200,13 @@ case "$DEFAULT_AGENT_PROVIDER" in ;; esac -if [[ ! -d "$VENV_DIR" ]]; then - "$PYTHON_BIN" -m venv "$VENV_DIR" -fi - -source "$VENV_DIR/bin/activate" - -python -m pip install --upgrade pip >/dev/null -INSTALL_STATE_FILE="$VENV_DIR/$INSTALL_STATE_FILE_NAME" -CURRENT_INSTALL_FINGERPRINT="$(compute_install_fingerprint)" -STORED_INSTALL_FINGERPRINT="" -if [[ -f "$INSTALL_STATE_FILE" ]]; then - STORED_INSTALL_FINGERPRINT="$(<"$INSTALL_STATE_FILE")" -fi - -NEEDS_REINSTALL=0 -if [[ "$FORCE_REINSTALL" == "1" ]]; then - NEEDS_REINSTALL=1 -elif ! python -c "import coding_agent_telegram" >/dev/null 2>&1; then - NEEDS_REINSTALL=1 -elif [[ "$CURRENT_INSTALL_FINGERPRINT" != "$STORED_INSTALL_FINGERPRINT" ]]; then - NEEDS_REINSTALL=1 -fi - -if [[ "$NEEDS_REINSTALL" == "1" ]]; then - echo "Installing local package into $VENV_DIR." - SETUPTOOLS_SCM_PRETEND_VERSION_FOR_CODING_AGENT_TELEGRAM="$LOCAL_PRETEND_VERSION" \ - python -m pip install -e . - printf '%s\n' "$CURRENT_INSTALL_FINGERPRINT" > "$INSTALL_STATE_FILE" -else - echo "Existing editable install detected; skipping reinstall." -fi - echo "Post-installation guide:" echo "1. 
Confirm $ENV_FILE contains WORKSPACE_ROOT, TELEGRAM_BOT_TOKENS, and ALLOWED_CHAT_IDS." echo "2. State files are ready at $STATE_FILE and $STATE_BACKUP_FILE." echo "3. Application logs will be written under $LOG_DIR." -echo "4. Start the server with: ./startup.sh" -echo "5. In Telegram, start conversations." +echo "4. Optional voice-to-text: run ./install-stt.sh if you want local Whisper support." +echo "5. Start the server with: ./startup.sh" +echo "6. In Telegram, start conversations." echo "Starting coding-agent-telegram..." +export CODING_AGENT_TELEGRAM_STT_INSTALL_HINT="./install-stt.sh" exec python -m coding_agent_telegram diff --git a/tests/test_command_router.py b/tests/test_command_router.py index 2fdcae1..1c22faa 100644 --- a/tests/test_command_router.py +++ b/tests/test_command_router.py @@ -2,6 +2,7 @@ import asyncio import html +import logging import sqlite3 import shlex import sys @@ -14,6 +15,8 @@ from coding_agent_telegram.command_router import CommandRouter, RouterDeps from coding_agent_telegram.config import AppConfig from coding_agent_telegram.session_store import SessionStore +from coding_agent_telegram.speech_to_text import SpeechToTextError +from telegram.error import BadRequest class DummyRunner: @@ -316,13 +319,23 @@ def resume_session( class FakeBot: def __init__(self): self.messages = [] + self.sent_messages = [] self.actions = [] self.deleted_messages = [] self.send_count = 0 self.edit_count = 0 - async def send_message(self, chat_id, text, parse_mode=None, reply_markup=None): + async def send_message(self, chat_id, text, parse_mode=None, reply_markup=None, reply_to_message_id=None): self.send_count += 1 + self.sent_messages.append( + { + "chat_id": chat_id, + "text": text, + "parse_mode": parse_mode, + "reply_markup": reply_markup, + "reply_to_message_id": reply_to_message_id, + } + ) self.messages.append((chat_id, text, parse_mode, reply_markup)) return SimpleNamespace(message_id=len(self.messages)) @@ -338,10 +351,21 @@ async def 
send_chat_action(self, chat_id, action): class SlowProgressBot(FakeBot): - async def send_message(self, chat_id, text, parse_mode=None, reply_markup=None): + async def send_message(self, chat_id, text, parse_mode=None, reply_markup=None, reply_to_message_id=None): if "Live agent output" in text: await asyncio.sleep(0.2) - return await super().send_message(chat_id, text, parse_mode=parse_mode, reply_markup=reply_markup) + return await super().send_message( + chat_id, + text, + parse_mode=parse_mode, + reply_markup=reply_markup, + reply_to_message_id=reply_to_message_id, + ) + + +class EditFailingProgressBot(FakeBot): + async def edit_message_text(self, chat_id, message_id, text, parse_mode=None, reply_markup=None): + raise BadRequest("message can't be edited") class FakeGitManager: @@ -436,6 +460,24 @@ async def download_as_bytearray(self): return bytearray(self._content) +class FakeVoiceMessage: + def __init__( + self, + telegram_file: FakeTelegramFile, + *, + file_unique_id: str = "voice.ogg", + file_size=None, + file_name: str | None = None, + ): + self.telegram_file = telegram_file + self.file_unique_id = file_unique_id + self.file_size = file_size if file_size is not None else len(getattr(telegram_file, "_content", b"")) + self.file_name = file_name + + async def get_file(self): + return self.telegram_file + + class FakePhotoSize: def __init__(self, telegram_file: FakeTelegramFile, *, file_size=None): self.telegram_file = telegram_file @@ -445,10 +487,10 @@ async def get_file(self): return self.telegram_file -def make_update(chat_id=123, chat_type="private", text="hello"): +def make_update(chat_id=123, chat_type="private", text="hello", message_id=1): return SimpleNamespace( effective_chat=SimpleNamespace(id=chat_id, type=chat_type), - message=SimpleNamespace(text=text, photo=None, caption=None), + message=SimpleNamespace(text=text, photo=None, caption=None, message_id=message_id), ) @@ -480,6 +522,9 @@ def make_config(tmp_path: Path, *, locale: str = "en") -> 
AppConfig: max_telegram_message_length=3000, enable_sensitive_diff_filter=True, enable_secret_scrub_filter=True, + enable_openai_whisper_speech_to_text=False, + openai_whisper_model="base", + openai_whisper_timeout_seconds=120, default_agent_provider="codex", agent_hard_timeout_seconds=0, app_internal_root=tmp_path / ".coding-agent-telegram", @@ -2078,6 +2123,336 @@ def test_photo_message_rejected_for_copilot_session(tmp_path: Path): assert "Photo attachments are currently supported only for codex sessions." in bot.messages[-1][1] +def test_voice_message_sends_transcript_preview_before_running_agent(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = DummyRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_voice", "voice-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + router.speech_to_text.transcribe_file = lambda _path: SimpleNamespace(text="fix the flaky test") + + update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + voice=FakeVoiceMessage(FakeTelegramFile(b"voice-bytes", "voice/note.ogg")), + ), + ) + bot = FakeBot() + context = SimpleNamespace(args=[], bot=bot) + + asyncio.run(router.handle_voice(update, context)) + + assert bot.messages[0][1] == "Recognized voice transcript:\nfix the flaky test\n\nWorking on it..." + assert runner.resume_calls[-1]["user_message"] == "fix the flaky test" + working_entries = [entry for entry in bot.sent_messages if "Working on it..." 
in entry["text"]] + assert len(working_entries) == 1 + + +def test_voice_message_sends_queued_transcript_notice_when_project_busy(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = DummyRunner() + runner.has_running_process = lambda _project_path: True + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_voice", "voice-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + router.speech_to_text.transcribe_file = lambda _path: SimpleNamespace(text="fix the flaky test") + + update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + voice=FakeVoiceMessage(FakeTelegramFile(b"voice-bytes", "voice/note.ogg")), + ), + ) + bot = FakeBot() + context = SimpleNamespace(args=[], bot=bot) + + asyncio.run(router.handle_voice(update, context)) + + assert "Recognized voice transcript:\nfix the flaky test\n\nQueued as Q1." 
in bot.messages[0][1] + assert runner.resume_calls == [] + + +def test_audio_message_is_transcribed_and_forwarded(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = DummyRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_audio", "audio-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + router.speech_to_text.transcribe_file = lambda _path: SimpleNamespace(text="summarize this meeting note") + + update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + voice=None, + audio=FakeVoiceMessage(FakeTelegramFile(b"audio-bytes", "audio/clip.mp3"), file_unique_id="clip.mp3"), + ), + ) + bot = FakeBot() + context = SimpleNamespace(args=[], bot=bot) + + asyncio.run(router.handle_audio(update, context)) + + assert runner.resume_calls[-1]["user_message"] == "summarize this meeting note" + + +def test_voice_message_logs_stt_error_details(tmp_path: Path, caplog: pytest.LogCaptureFixture): + backend = tmp_path / "backend" + backend.mkdir() + runner = DummyRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_voice", "voice-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + + def fail_transcription(_path): + raise SpeechToTextError("failed", detail="ffmpeg exited with status 1") + + router.speech_to_text.transcribe_file = fail_transcription + + update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + 
text=None, + photo=None, + caption=None, + voice=FakeVoiceMessage(FakeTelegramFile(b"voice-bytes", "voice/note.ogg")), + ), + ) + bot = FakeBot() + context = SimpleNamespace(args=[], bot=bot) + + with caplog.at_level(logging.WARNING): + asyncio.run(router.handle_voice(update, context)) + + assert bot.messages[-1][1] == "Voice conversion failed." + assert "ffmpeg exited with status 1" in caplog.text + + +def test_voice_message_is_queued_when_message_pending_before_runner_busy(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = BlockingRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_voice_pending", "voice-pending-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + router.speech_to_text.transcribe_file = lambda _path: SimpleNamespace(text="queued via voice") + + async def exercise(): + bot = FakeBot() + first_update = make_update(text="first text", message_id=101) + voice_update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + message_id=202, + voice=FakeVoiceMessage(FakeTelegramFile(b"voice-bytes", "voice/note.ogg")), + ), + ) + + first_task = asyncio.create_task(router.handle_message(first_update, SimpleNamespace(args=[], bot=bot))) + await asyncio.sleep(0) + await router.handle_voice(voice_update, SimpleNamespace(args=[], bot=bot)) + + assert any("Queued as Q1." in entry["text"] for entry in bot.sent_messages) + assert not any( + entry["text"] == "Recognized voice transcript:\nqueued via voice\n\nWorking on it..." 
+ for entry in bot.sent_messages + ) + + runner.release_next() + started_second = await asyncio.to_thread(runner.wait_started, 2, 1.0) + assert started_second is True + runner.release_next() + await first_task + + assert runner.resume_calls[0]["user_message"] == "first text" + assert runner.resume_calls[1]["user_message"] == "queued via voice" + + asyncio.run(exercise()) + + +def test_audio_message_rejected_when_declared_size_exceeds_stt_limit(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = DummyRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_audio_limit", "audio-limit-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + + update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + voice=None, + audio=FakeVoiceMessage( + FakeTelegramFile(b"small-audio", "audio/clip.mp3"), + file_unique_id="clip.mp3", + file_size=(20 * 1024 * 1024) + 1, + file_name="clip.mp3", + ), + ), + ) + bot = FakeBot() + context = SimpleNamespace(args=[], bot=bot) + + asyncio.run(router.handle_audio(update, context)) + + assert runner.resume_calls == [] + assert bot.messages[-1][1] == "Audio is too large for local speech-to-text. The maximum supported size is 20 MB." 
+ + +def test_text_message_is_processed_after_voice_triggered_run_finishes(tmp_path: Path): + backend = tmp_path / "backend" + backend.mkdir() + runner = BlockingRunner() + cfg = make_config(tmp_path) + store = SessionStore(cfg.state_file, cfg.state_backup_file) + store.create_session("bot-a", 123, "sess_voice", "voice-session", "backend", "codex") + router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a")) + router.git = FakeGitManager(is_git_repo=False) + router.speech_to_text.enabled = True + router.speech_to_text.transcribe_file = lambda _path: SimpleNamespace(text="first via voice") + + async def exercise(): + bot = FakeBot() + voice_update = SimpleNamespace( + effective_chat=SimpleNamespace(id=123, type="private"), + message=SimpleNamespace( + text=None, + photo=None, + caption=None, + voice=FakeVoiceMessage(FakeTelegramFile(b"voice-bytes", "voice/note.ogg")), + ), + ) + text_update = make_update(text="second via text") + + voice_task = asyncio.create_task(router.handle_voice(voice_update, SimpleNamespace(args=[], bot=bot))) + started = await asyncio.to_thread(runner.wait_started, 1, 1.0) + assert started is True + + await router.handle_message(text_update, SimpleNamespace(args=[], bot=bot)) + assert any("Question queued as Q1." 
in message for _, message, _, _ in bot.messages)
+
+        runner.release_next()
+        started_second = await asyncio.to_thread(runner.wait_started, 2, 1.0)
+        assert started_second is True
+        runner.release_next()
+        await voice_task
+
+        assert len(runner.resume_calls) == 2
+        assert runner.resume_calls[0]["user_message"] == "first via voice"
+        assert runner.resume_calls[1]["user_message"] == "second via text"
+
+    asyncio.run(exercise())
+
+
+def test_busy_queue_and_final_output_reply_to_original_message(tmp_path: Path):
+    backend = tmp_path / "backend"
+    backend.mkdir()
+    runner = BlockingRunner()
+    cfg = make_config(tmp_path)
+    store = SessionStore(cfg.state_file, cfg.state_backup_file)
+    store.create_session("bot-a", 123, "sess_reply", "reply-session", "backend", "codex")
+    router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a"))
+    router.git = FakeGitManager(is_git_repo=False)
+
+    async def exercise():
+        bot = FakeBot()
+        first_update = make_update(text="first question", message_id=101)
+        second_update = make_update(text="second question", message_id=202)
+
+        first_task = asyncio.create_task(router.handle_message(first_update, SimpleNamespace(args=[], bot=bot)))
+        started = await asyncio.to_thread(runner.wait_started, 1, 1.0)
+        assert started is True
+
+        await router.handle_message(second_update, SimpleNamespace(args=[], bot=bot))
+        queued_entries = [entry for entry in bot.sent_messages if "Question queued as Q1." in entry["text"]]
+        assert queued_entries
+        assert queued_entries[-1]["reply_to_message_id"] == 202
+
+        runner.release_next()
+        started_second = await asyncio.to_thread(runner.wait_started, 2, 1.0)
+        assert started_second is True
+        runner.release_next()
+        await first_task
+
+        working_entries = [entry for entry in bot.sent_messages if "Working on it..." in entry["text"]]
+        assert working_entries
+        assert working_entries[0]["reply_to_message_id"] == 101
+        assert working_entries[-1]["reply_to_message_id"] == 202
+
+        final_entries = [
+            entry
+            for entry in bot.sent_messages
+            if "Codex output" in entry["text"] or "Task completed." in entry["text"]
+        ]
+        assert final_entries
+        reply_targets = {entry["reply_to_message_id"] for entry in final_entries}
+        assert 101 in reply_targets
+        assert 202 in reply_targets
+
+    asyncio.run(exercise())
+
+
+def test_final_output_replies_only_on_first_message(tmp_path: Path):
+    backend = tmp_path / "backend"
+    backend.mkdir()
+    runner = CommandBlockRunner()
+    cfg = make_config(tmp_path)
+    store = SessionStore(cfg.state_file, cfg.state_backup_file)
+    store.create_session("bot-a", 123, "sess_final_reply", "final-reply-session", "backend", "codex")
+    router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a"))
+    router.git = FakeGitManager(is_git_repo=False)
+
+    bot = FakeBot()
+    update = make_update(text="show me the result", message_id=777)
+    context = SimpleNamespace(args=[], bot=bot)
+
+    asyncio.run(router.handle_message(update, context))
+
+    final_entries = [
+        entry
+        for entry in bot.sent_messages
+        if "Codex output" in entry["text"] or "Command" in entry["text"] or "Task completed." in entry["text"]
+    ]
+    assert len(final_entries) >= 3
+    assert final_entries[0]["reply_to_message_id"] == 777
+    assert all(entry["reply_to_message_id"] is None for entry in final_entries[1:])
+
+
 def test_photo_message_rejected_when_declared_size_exceeds_limit(tmp_path: Path):
     backend = tmp_path / "backend"
     backend.mkdir()
@@ -2236,6 +2611,33 @@ def test_message_prompts_for_provider_when_not_selected(tmp_path: Path):
     assert store.get_chat_state("bot-a", 123)["pending_action"]["kind"] == "message"
 
 
+def test_pending_action_blocks_queue_drain_until_prerequisites_are_resolved(tmp_path: Path):
+    backend = tmp_path / "backend"
+    backend.mkdir()
+    runner = DummyRunner()
+    cfg = make_config(tmp_path)
+    store = SessionStore(cfg.state_file, cfg.state_backup_file)
+    store.set_current_project_folder("bot-a", 123, "backend")
+    router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a"))
+
+    async def exercise():
+        bot = FakeBot()
+        first_update = make_update(text="first question", message_id=101)
+        second_update = make_update(text="second question", message_id=202)
+        context = SimpleNamespace(args=[], bot=bot)
+
+        await router.handle_message(first_update, context)
+        await router.handle_message(second_update, context)
+
+        state = store.get_chat_state("bot-a", 123)
+        assert state["pending_action"]["kind"] == "message"
+        assert state["pending_action"]["user_message"] == "first question"
+        assert any("Question queued as Q1." in entry["text"] for entry in bot.sent_messages)
+        assert runner.resume_calls == []
+
+    asyncio.run(exercise())
+
+
 def test_message_prompts_for_branch_discrepancy_before_running_bot_managed_session(tmp_path: Path):
     backend = tmp_path / "backend"
     backend.mkdir()
@@ -2972,6 +3374,27 @@ def test_active_session_deletes_live_progress_message_even_if_progress_send_is_s
     assert len(bot.deleted_messages) == 1
 
 
+def test_active_session_deletes_previous_live_progress_message_when_edit_falls_back_to_send(tmp_path: Path):
+    backend = tmp_path / "backend"
+    backend.mkdir()
+    runner = RapidProgressRunner()
+    cfg = make_config(tmp_path)
+    store = SessionStore(cfg.state_file, cfg.state_backup_file)
+    store.create_session("bot-a", 123, "sess_progress", "progress-session", "backend", "codex")
+    router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a"))
+    router.git = FakeGitManager(is_git_repo=False)
+
+    update = make_update(text="continue")
+    bot = EditFailingProgressBot()
+    context = SimpleNamespace(args=[], bot=bot)
+
+    asyncio.run(router.handle_message(update, context))
+
+    assert len(bot.deleted_messages) == 2
+    deleted_ids = [message_id for chat_id, message_id in bot.deleted_messages if chat_id == 123]
+    assert len(set(deleted_ids)) == 2
+
+
 def test_second_message_is_queued_while_first_run_is_still_running(tmp_path: Path):
     backend = tmp_path / "backend"
     backend.mkdir()
@@ -3015,6 +3438,42 @@ async def exercise():
     asyncio.run(exercise())
 
 
+def test_second_message_is_queued_even_before_runner_reports_busy(tmp_path: Path):
+    backend = tmp_path / "backend"
+    backend.mkdir()
+    runner = BlockingRunner()
+    cfg = make_config(tmp_path)
+    store = SessionStore(cfg.state_file, cfg.state_backup_file)
+    store.create_session("bot-a", 123, "sess_queue", "queue-session", "backend", "codex")
+    router = CommandRouter(RouterDeps(cfg=cfg, store=store, agent_runner=runner, bot_id="bot-a"))
+    router.git = FakeGitManager(is_git_repo=False)
+
+    async def exercise():
+        bot = FakeBot()
+        first_update = make_update(text="first question", message_id=101)
+        second_update = make_update(text="second question", message_id=202)
+
+        first_task = asyncio.create_task(router.handle_message(first_update, SimpleNamespace(args=[], bot=bot)))
+        await asyncio.sleep(0)
+        await router.handle_message(second_update, SimpleNamespace(args=[], bot=bot))
+
+        assert any("Question queued as Q1." in message for _, message, _, _ in bot.messages)
+
+        started = await asyncio.to_thread(runner.wait_started, 1, 1.0)
+        assert started is True
+        runner.release_next()
+        started_second = await asyncio.to_thread(runner.wait_started, 2, 1.0)
+        assert started_second is True
+        runner.release_next()
+        await first_task
+
+        assert len(runner.resume_calls) == 2
+        assert runner.resume_calls[0]["user_message"] == "first question"
+        assert runner.resume_calls[1]["user_message"] == "second question"
+
+    asyncio.run(exercise())
+
+
 def test_grouped_queue_batch_requires_user_decision_then_processes_remaining_queue(tmp_path: Path):
     backend = tmp_path / "backend"
     backend.mkdir()
@@ -3027,10 +3486,10 @@ def test_grouped_queue_batch_requires_user_decision_then_processes_remaining_que
 
     async def exercise():
         bot = FakeBot()
-        first_update = make_update(text="first question")
-        second_update = make_update(text="two")
-        third_update = make_update(text="three")
-        fourth_update = make_update(text="four four four four four four four")
+        first_update = make_update(text="first question", message_id=101)
+        second_update = make_update(text="two", message_id=202)
+        third_update = make_update(text="three", message_id=303)
+        fourth_update = make_update(text="four four four four four four four", message_id=404)
 
         first_context = SimpleNamespace(args=[], bot=bot)
         first_task = asyncio.create_task(router.handle_message(first_update, first_context))
@@ -3088,6 +3547,17 @@ async def fake_edit(text):
         queued_notices = [message for _, message, _, _ in bot.messages if "Working on queued questions:" in message]
         assert any("1. two" in message and "2. three" in message for message in queued_notices)
         assert any("1. four four four four four four four" in message for message in queued_notices)
+        working_entries = [entry for entry in bot.sent_messages if "Working on it..." in entry["text"]]
+        assert [entry["reply_to_message_id"] for entry in working_entries] == [101, None, 404]
+        final_entries = [
+            entry
+            for entry in bot.sent_messages
+            if "Codex output" in entry["text"] or "Task completed." in entry["text"]
+        ]
+        reply_targets = {entry["reply_to_message_id"] for entry in final_entries}
+        assert 101 in reply_targets
+        assert None in reply_targets
+        assert 404 in reply_targets
 
     asyncio.run(exercise())

diff --git a/tests/test_config.py b/tests/test_config.py
index 447fdaf..7d31aa5 100644
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -6,6 +6,8 @@ import coding_agent_telegram.config as config_module
 from coding_agent_telegram.config import (
     DEFAULT_MAX_TELEGRAM_MESSAGE_LENGTH,
+    DEFAULT_OPENAI_WHISPER_MODEL,
+    DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS,
     DEFAULT_SNAPSHOT_TEXT_FILE_MAX_BYTES,
     create_initial_env_file,
     detect_system_locale,
@@ -46,6 +48,9 @@ def _isolate_env(monkeypatch, tmp_path):
         "MAX_TELEGRAM_MESSAGE_LENGTH",
         "ENABLE_SENSITIVE_DIFF_FILTER",
         "ENABLE_SECRET_SCRUB_FILTER",
+        "ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT",
+        "OPENAI_WHISPER_MODEL",
+        "OPENAI_WHISPER_TIMEOUT_SECONDS",
         "APP_LOCALE",
         "DEFAULT_AGENT_PROVIDER",
     ):
@@ -89,6 +94,9 @@ def test_load_config_required(monkeypatch, tmp_path):
     assert cfg.snapshot_text_file_max_bytes == DEFAULT_SNAPSHOT_TEXT_FILE_MAX_BYTES
     assert cfg.max_telegram_message_length == DEFAULT_MAX_TELEGRAM_MESSAGE_LENGTH
     assert cfg.enable_secret_scrub_filter is True
+    assert cfg.enable_openai_whisper_speech_to_text is False
+    assert cfg.openai_whisper_model == DEFAULT_OPENAI_WHISPER_MODEL
+    assert cfg.openai_whisper_timeout_seconds == DEFAULT_OPENAI_WHISPER_TIMEOUT_SECONDS
     assert cfg.locale == "en"
     assert cfg.default_agent_provider == "codex"
     assert cfg.log_dir.name == "logs"
@@ -148,6 +156,32 @@ def test_load_config_secret_scrub_filter_can_be_disabled(monkeypatch, tmp_path):
     assert cfg.enable_secret_scrub_filter is False
 
 
+def test_load_config_whisper_speech_to_text_can_be_enabled(monkeypatch, tmp_path):
+    _isolate_env(monkeypatch, tmp_path)
+    monkeypatch.setenv("WORKSPACE_ROOT", "~/git")
+    monkeypatch.setenv("TELEGRAM_BOT_TOKENS", "token-a")
+    monkeypatch.setenv("ALLOWED_CHAT_IDS", "123")
+    monkeypatch.setenv("ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT", "true")
+
+    cfg = load_config()
+
+    assert cfg.enable_openai_whisper_speech_to_text is True
+
+
+def test_load_config_whisper_model_and_timeout_override(monkeypatch, tmp_path):
+    _isolate_env(monkeypatch, tmp_path)
+    monkeypatch.setenv("WORKSPACE_ROOT", "~/git")
+    monkeypatch.setenv("TELEGRAM_BOT_TOKENS", "token-a")
+    monkeypatch.setenv("ALLOWED_CHAT_IDS", "123")
+    monkeypatch.setenv("OPENAI_WHISPER_MODEL", "turbo")
+    monkeypatch.setenv("OPENAI_WHISPER_TIMEOUT_SECONDS", "300")
+
+    cfg = load_config()
+
+    assert cfg.openai_whisper_model == "turbo"
+    assert cfg.openai_whisper_timeout_seconds == 300
+
+
 def test_load_config_locale_override(monkeypatch, tmp_path):
     _isolate_env(monkeypatch, tmp_path)
     monkeypatch.setenv("WORKSPACE_ROOT", "~/git")
diff --git a/tests/test_speech_to_text.py b/tests/test_speech_to_text.py
new file mode 100644
index 0000000..e405d82
--- /dev/null
+++ b/tests/test_speech_to_text.py
@@ -0,0 +1,105 @@
+import json
+import subprocess
+from pathlib import Path
+
+import pytest
+
+from coding_agent_telegram.config import AppConfig
+from coding_agent_telegram.speech_to_text import SpeechToTextError, WhisperSpeechToText
+
+
+def _cfg(tmp_path: Path, *, model: str = "base", timeout: int = 120) -> AppConfig:
+    return AppConfig(
+        workspace_root=tmp_path,
+        state_file=tmp_path / "state.json",
+        state_backup_file=tmp_path / "state.json.bak",
+        log_level="INFO",
+        log_dir=tmp_path / "logs",
+        telegram_bot_tokens=("token",),
+        allowed_chat_ids={123},
+        codex_bin="codex",
+        copilot_bin="copilot",
+        codex_model="",
+        copilot_model="",
+        copilot_autopilot=True,
+        copilot_no_ask_user=True,
+        copilot_allow_all=True,
+        copilot_allow_all_tools=False,
+        copilot_allow_tools=(),
+        copilot_deny_tools=(),
+        copilot_available_tools=(),
+        codex_approval_policy="never",
+        codex_sandbox_mode="workspace-write",
+        codex_skip_git_repo_check=False,
+        enable_commit_command=False,
+        snapshot_text_file_max_bytes=200000,
+        max_telegram_message_length=3000,
+        enable_sensitive_diff_filter=True,
+        enable_secret_scrub_filter=True,
+        enable_openai_whisper_speech_to_text=True,
+        openai_whisper_model=model,
+        openai_whisper_timeout_seconds=timeout,
+        default_agent_provider="codex",
+        agent_hard_timeout_seconds=0,
+        app_internal_root=tmp_path / ".coding-agent-telegram",
+        locale="en",
+    )
+
+
+def test_model_cache_path_maps_turbo_alias(tmp_path):
+    transcriber = WhisperSpeechToText(_cfg(tmp_path, model="turbo"))
+
+    assert transcriber._model_cache_path().name == "large-v3-turbo.pt"
+
+
+def test_transcribe_file_returns_text(monkeypatch, tmp_path):
+    audio_path = tmp_path / "voice.ogg"
+    audio_path.write_bytes(b"voice")
+    transcriber = WhisperSpeechToText(_cfg(tmp_path))
+
+    def fake_run(command, **kwargs):
+        output_dir = Path(command[command.index("--output_dir") + 1])
+        (output_dir / "voice.json").write_text(json.dumps({"text": "hello world"}), encoding="utf-8")
+        return subprocess.CompletedProcess(command, 0, "", "")
+
+    monkeypatch.setattr("coding_agent_telegram.speech_to_text.subprocess.run", fake_run)
+
+    result = transcriber.transcribe_file(audio_path)
+
+    assert result.text == "hello world"
+    assert result.model == "base"
+
+
+def test_transcribe_file_timeout_marks_likely_first_download(monkeypatch, tmp_path):
+    audio_path = tmp_path / "voice.ogg"
+    audio_path.write_bytes(b"voice")
+    transcriber = WhisperSpeechToText(_cfg(tmp_path, model="turbo", timeout=1))
+    monkeypatch.setattr(WhisperSpeechToText, "_likely_first_download", lambda self: True)
+
+    def fake_run(command, **kwargs):
+        raise subprocess.TimeoutExpired(command, timeout=1)
+
+    monkeypatch.setattr("coding_agent_telegram.speech_to_text.subprocess.run", fake_run)
+
+    with pytest.raises(SpeechToTextError) as exc:
+        transcriber.transcribe_file(audio_path)
+
+    assert exc.value.code == "timeout"
+    assert exc.value.likely_first_download is True
+
+
+def test_transcribe_file_includes_process_detail_on_failure(monkeypatch, tmp_path):
+    audio_path = tmp_path / "voice.ogg"
+    audio_path.write_bytes(b"voice")
+    transcriber = WhisperSpeechToText(_cfg(tmp_path))
+
+    def fake_run(command, **kwargs):
+        return subprocess.CompletedProcess(command, 1, "stdout note", "stderr note")
+
+    monkeypatch.setattr("coding_agent_telegram.speech_to_text.subprocess.run", fake_run)
+
+    with pytest.raises(SpeechToTextError) as exc:
+        transcriber.transcribe_file(audio_path)
+
+    assert exc.value.code == "failed"
+    assert "stderr note" in (exc.value.detail or "")
diff --git a/tests/test_stt_setup.py b/tests/test_stt_setup.py
new file mode 100644
index 0000000..e2eee02
--- /dev/null
+++ b/tests/test_stt_setup.py
@@ -0,0 +1,81 @@
+from pathlib import Path
+
+import pytest
+
+from coding_agent_telegram import stt_setup
+
+
+def test_detect_stt_prereqs_reports_missing(monkeypatch):
+    monkeypatch.setattr(stt_setup.shutil, "which", lambda name: None)
+    monkeypatch.setattr(stt_setup.importlib.util, "find_spec", lambda name: None)
+
+    status = stt_setup.detect_stt_prereqs()
+
+    assert status.ready is False
+    assert status.missing == ["ffmpeg", "openai-whisper (Python module)"]
+
+
+def test_detect_stt_prereqs_checks_target_python_when_provided(monkeypatch):
+    monkeypatch.setattr(stt_setup.shutil, "which", lambda name: "/usr/bin/ffmpeg")
+    monkeypatch.setattr(
+        stt_setup.subprocess,
+        "run",
+        lambda *args, **kwargs: type("Result", (), {"returncode": 0})(),
+    )
+
+    status = stt_setup.detect_stt_prereqs(python_bin="/custom/python")
+
+    assert status.ready is True
+    assert status.whisper_module is True
+
+
+def test_ensure_stt_runtime_or_exit_uses_install_hint(monkeypatch):
+    monkeypatch.setattr(
+        stt_setup,
+        "detect_stt_prereqs",
+        lambda **kwargs: stt_setup.SttPrereqStatus(ffmpeg=True, whisper_module=False),
+    )
+
+    with pytest.raises(SystemExit) as exc:
+        stt_setup.ensure_stt_runtime_or_exit(True, install_hint="./install-stt.sh")
+
+    assert "./install-stt.sh" in str(exc.value)
+    assert "openai-whisper" in str(exc.value)
+
+
+def test_set_env_flag_appends_when_missing(tmp_path):
+    env_path = tmp_path / ".env_coding_agent_telegram"
+    env_path.write_text("WORKSPACE_ROOT=~/git\n", encoding="utf-8")
+
+    stt_setup._set_env_flag(env_path, True)
+
+    text = env_path.read_text(encoding="utf-8")
+    assert "ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true" in text
+    assert "openai-whisper" in text
+
+
+def test_set_env_flag_replaces_existing_value(tmp_path):
+    env_path = tmp_path / ".env_coding_agent_telegram"
+    env_path.write_text("ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=false\n", encoding="utf-8")
+
+    stt_setup._set_env_flag(env_path, True)
+
+    text = env_path.read_text(encoding="utf-8")
+    assert "ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true" in text
+    assert "OPENAI_WHISPER_MODEL=base" in text
+    assert "OPENAI_WHISPER_TIMEOUT_SECONDS=120" in text
+
+
+def test_offer_stt_install_for_new_env_keeps_false_when_declined(monkeypatch, tmp_path):
+    env_path = tmp_path / ".env_coding_agent_telegram"
+    env_path.write_text("ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=false\n", encoding="utf-8")
+    monkeypatch.setattr(stt_setup, "_prompt_yes_no", lambda *args, **kwargs: False)
+
+    result = stt_setup.offer_stt_install_for_new_env(
+        env_file=str(env_path),
+        python_bin="python3",
+        installer_label="coding-agent-telegram-stt-install",
+    )
+
+    assert result == 0
+    assert "ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=false" in env_path.read_text(encoding="utf-8")
ENABLE_SECRET_SCRUB_FILTER Masks tokens, keys, .env values, certificates, and similar secret output before it is sent to Telegram. Default: true (strongly recommended).
ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT Default: false. When true, recognition of audio messages and voice files is enabled. The system checks for the required binaries and libraries and prompts you to install any that are missing.
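The shape of such a prerequisite check can be sketched as follows. This is a minimal standalone illustration using only the standard library, not the project's actual `stt_setup` module; the function name is borrowed for clarity only:

```python
import importlib.util
import shutil


def detect_stt_prereqs() -> list[str]:
    """Return the names of any missing speech-to-text prerequisites."""
    missing = []
    # ffmpeg must be on PATH so Telegram voice files (e.g. .ogg) can be decoded.
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    # The openai-whisper package installs an importable "whisper" module.
    if importlib.util.find_spec("whisper") is None:
        missing.append("openai-whisper (Python module)")
    return missing


if __name__ == "__main__":
    missing = detect_stt_prereqs()
    print("STT ready" if not missing else "missing: " + ", ".join(missing))
```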
OPENAI_WHISPER_MODEL Model used for Whisper STT. Default: base
Available models: tiny (72 MB), base (139 MB), large-v3-turbo (1.5 GB)
The model is downloaded automatically the first time you send a voice message. base is recommended for general use; try turbo if you want better accuracy and quality.
OPENAI_WHISPER_TIMEOUT_SECONDS Default: 120. Timeout for the STT process. Processing is usually fast enough, but if you choose turbo, the first model download may exceed the timeout depending on your network speed.
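Taken together, a typical env-file fragment enabling the Whisper settings above looks like this:

```text
ENABLE_OPENAI_WHISPER_SPEECH_TO_TEXT=true
OPENAI_WHISPER_MODEL=base
OPENAI_WHISPER_TIMEOUT_SECONDS=120
```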
SNAPSHOT_INCLUDE_PATH_GLOBS Force paths matching these globs to be included in the diff. Example: .github/*,.profile.test,.profile.prod
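To illustrate how a comma-separated glob list like this behaves, here is a small sketch. The helper names are hypothetical and `fnmatch`-style matching is an assumption; the project's actual matching rules may differ:

```python
from fnmatch import fnmatch


def parse_include_globs(raw: str) -> list[str]:
    # Split the comma-separated env value into individual glob patterns.
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_force_included(path: str, globs: list[str]) -> bool:
    # A path is force-included when any configured glob matches it.
    return any(fnmatch(path, pattern) for pattern in globs)


globs = parse_include_globs(".github/*,.profile.test,.profile.prod")
print(is_force_included(".github/workflows.yml", globs))  # True
print(is_force_included("src/main.py", globs))            # False
```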