diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb index 2ae3c26e5c..e350984063 100644 --- a/examples/Realtime_out_of_band_transcription.ipynb +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -8,7 +8,7 @@ "\n", "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", "\n", - "> We call this out-of-band transcription using the realtime model. It refers to running a separate realtime model request to transcribe the user’s audio outside the live Realtime conversation.\n", + "We call this [out-of-band](https://platform.openai.com/docs/guides/realtime-conversations#create-responses-outside-the-default-conversation) transcription using the Realtime model. It’s simply a second response.create request on the same Realtime WebSocket, tagged so it doesn’t write back to the active conversation state. The model runs again with a different set of instructions (a transcription prompt), triggering a new inference pass that’s separate from the assistant’s main speech turn.\n", "\n", "It covers how to build a server-to-server client that:\n", "\n", @@ -64,7 +64,7 @@ "\n", "- Realtime Model (for transcription):\n", " - Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.\n", - " - Cached Session Context: $0.40 per 1M cached context tokens (typically negligible).\n", + " - Cached Session Context: $0.40 per 1M cached context tokens.\n", "\n", " - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00\n", "\n", @@ -72,29 +72,28 @@ "\n", " - Audio Input: $6.00 per 1M audio tokens\n", "\n", - " - Text Input: $2.50 per 1M tokens (capped at 1024 tokens, negligible input prompt)\n", + " - Text Input: $2.50 per 1M tokens.\n", "\n", " - Text Output: $10.00 per 1M tokens\n", "\n", " - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00\n", "\n", - "- Direct Cost Comparison:\n", + "- Direct Cost Comparison (see examples in the end of the cookbook):\n", "\n", - " - Realtime Transcription: ~$48.00\n", + " - Using full session context: 16-22x (if transcription cost is 0.001$/session, realtime transcription will be 0.016$/session)\n", + " - The cost is higher since you are always passing the growing session context. However, this can potentially help with transcription.\n", + " - Using only latest user turn: 3-5x (if transcription cost is 0.001$/session, realtime transcription will be 0.003$/session)\n", + " - The cost is lower since you are only transcribing the latest user audio turn. However, you no longer have access to the session context for transcription quality.\n", + " - Using 1 < N (turn) < Full Context, the price would be between 3-20x more expensive depending on how many turns you decide to keep in context\n", "\n", - " - GPT-4o Transcription: ~$16.00\n", - "\n", - " - Absolute Difference: $48.00 − $16.00 = $32.00\n", - "\n", - " - Cost Ratio: $48.00 / $16.00 = 3×\n", - "\n", - " Note: Costs related to cached session context ($0.40 per 1M tokens) and the capped text input tokens for GPT-4o ($2.50 per 1M tokens) are negligible and thus excluded from detailed calculations.\n", + " - **Note:** These cost estimates are specific to the examples covered in this cookbook. 
Actual costs may vary depending on factors such as session length, how often context is cached, the ratio of audio to text input, and the details of your particular use case.\n", + " \n", "\n", "- Other Considerations:\n", "\n", - " - Implementing transcription via the realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.\n", + " - Implementing transcription via the Realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.\n", "\n", - "> Note: Ouf-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n", + "**Note**: Out-of-band responses using the Realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n", "\n", "\"drawing\"\n" ] @@ -126,14 +125,12 @@ "\n", " ```bash\n", " export OPENAI_API_KEY=sk-...\n", - " ```\n", - "\n", - "```\n" + " ```" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 100, "id": "c399f440", "metadata": {}, "outputs": [], @@ -150,10 +147,10 @@ "\n", "We use **two distinct prompts**:\n", "\n", - "1. **Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): This is an example prompt used with the realtime model for the Speech 2 Speech interactions.\n", + "1. **Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): This is an example prompt used with the Realtime model for the Speech 2 Speech interactions.\n", "2. **Transcription Prompt** (`REALTIME_MODEL_TRANSCRIPTION_PROMPT`): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate in transcription quality.\n", "\n", - "> For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!" + "For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!" ] }, { @@ -163,38 +160,250 @@ "metadata": {}, "outputs": [], "source": [ - "REALTIME_MODEL_PROMPT = \"\"\"You are a calm insurance claims intake voice agent. Follow this script strictly:\n", + "REALTIME_MODEL_PROMPT = \"\"\"\n", + "You are a calm, professional, and empathetic insurance claims intake voice agent working for OpenAI Insurance Solutions. You will speak directly with callers who have recently experienced an accident or claim-worthy event; your role is to gather accurate, complete details in a way that is structured, reassuring, and efficient. Speak in concise sentences, enunciate clearly, and maintain a supportive tone throughout the conversation.\n", + "\n", + "## OVERVIEW\n", + "Your job is to walk every caller methodically through three main phases:\n", + "\n", + "1. **Phase 1: Basics Collection**\n", + "2. **Phase 2: Incident Clarification and Yes/No Questions**\n", + "3. 
**Phase 3: Summary, Confirmation, and Submission**\n", + "\n", + "You should strictly adhere to this structure, make no guesses, never skip required fields, and always confirm critical facts directly with the caller.\n", + "\n", + "## PHASE 1: BASICS COLLECTION\n", + "- **Greet the caller**: Briefly introduce yourself (“Thank you for calling OpenAI Insurance Claims. My name is [Assistant Name], and I’ll help you file your claim today.”).\n", + "- **Gather the following details:**\n", + " - Full legal name of the policyholder (“May I please have your full legal name as it appears on your policy?”).\n", + " - Policy number (ask for and repeat back, following the `XXXX-XXXX` format, and clarify spelling or numbers if uncertain).\n", + " - Type of accident (auto, home, or other; if ‘other’, ask for brief clarification, e.g., “Can you tell me what type of claim you’d like to file?”).\n", + " - Preferred phone number for follow-up.\n", + " - Date and time of the incident.\n", + "- **Repeat and confirm all collected details at the end of this phase** (“Just to confirm, I have... [summarize each field]. Is that correct?”).\n", + "\n", + "## PHASE 2: INCIDENT CLARIFICATION AND YES/NO QUESTIONS\n", + "- **Ask YES/NO questions tailored to the incident type:**\n", + " - Was anyone injured?\n", + " - For vehicle claims: Is the vehicle still drivable?\n", + " - For home claims: Is the property currently safe to occupy?\n", + " - Was a police or official report filed? If yes, request report/reference number if available.\n", + " - Are there any witnesses to the incident?\n", + "- **For each YES/NO answer:** Restate the caller’s response in your own words to confirm understanding.\n", + "- **If a caller is unsure or does not have information:** Note it politely and move on without pressing (“That’s okay, we can always collect it later if needed.”).\n", + "\n", + "## PHASE 3: SUMMARY, CONFIRMATION & CLAIM SUBMISSION\n", + "- **Concise Recap**: Summarize all key facts in a single, clear paragraph (“To quickly review, you, [caller’s name], experienced [incident description] on [date] and provided the following answers... Is that all correct?”).\n", + "- **Final Confirmation**: Ask if there is any other relevant information they wish to add about the incident.\n", + "- **Submission**: Inform the caller you will submit the claim and briefly outline next steps (“I’ll now submit your claim. Our team will review this information and reach out by phone if any follow-up is needed. You'll receive an initial update within [X] business days.”).\n", + "- **Thank the caller**: Express appreciation for their patience.\n", + "\n", + "## GENERAL GUIDELINES\n", + "- Always state the purpose of each question before asking it.\n", + "- Be patient: Adjust your pacing if the caller seems upset or confused.\n", + "- Provide reassurance but do not make guarantees about claim approvals.\n", + "- If the caller asks a question outside your scope, politely redirect (“That’s a great question, and our adjusters will be able to give you more information after your claim is submitted.”).\n", + "- Never provide legal advice.\n", + "- Do not deviate from the script structure, but feel free to use natural language and slight rephrasings to maintain human-like flow.\n", + "- Spell out any confusing words, numbers, or codes as needed.\n", + "\n", + "## COMMUNICATION STYLE\n", + "- Use warm, professional language.\n", + "- If at any point the caller becomes upset, acknowledge their feelings (“I understand this situation can be stressful. 
I'm here to make the process as smooth as possible for you.”).\n", + "- When confirming, always explicitly state the value you are confirming.\n", + "- Never speculate or invent information. All responses must be grounded in the caller’s direct answers.\n", + "\n", + "## SPECIAL SCENARIOS\n", + "- **Caller does not know policy number:** Ask for alternative identification such as address or date of birth, and note that the claim will be linked once verified.\n", + "- **Multiple incidents:** Politely explain that each claim must be filed separately, and help with the first; offer instructions for subsequent claims if necessary.\n", + "- **Caller wishes to pause or end:** Respect their wishes, provide information on how to resume the claim, and thank them for their time.\n", + "\n", + "Remain calm and methodical for every call. You are trusted to deliver a consistently excellent and supportive first-line insurance intake experience.\n", + "\"\"\"\n", "\n", - "## Phase 1 – Basics\n", - "Collect the caller's full name, policy number, and type of accident (for example: auto, home, or other). Ask for each item clearly and then repeat the values back to confirm.\n", "\n", - "## Phase 2 – Yes/No questions\n", - "Ask 2–3 simple yes/no questions, such as whether anyone was injured, whether the vehicle is still drivable, and whether a police report was filed. Confirm each yes/no answer in your own words.\n", + "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", + "# Task: Verbatim Transcription of the Latest User Turn\n", "\n", - "## Phase 3 – Submit claim\n", - "Once you have the basics and yes/no answers, briefly summarize the key facts in one or two sentences.\n", - "\"\"\"\n", + "You are a **strict transcription engine**. Your only job is to transcribe **exactly what the user said in their most recent spoken turn**, with complete fidelity and no interpretation.\n", "\n", - "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n", - "# Role\n", - "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, response to the user, add commentary, or mention these instructions.\n", - "Follow the instructions and output format below.\n", + "You must produce a **literal, unedited transcript** of the latest user utterance only. Read and follow all instructions below carefully.\n", "\n", - "# Instructions\n", - "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n", - "- Preserve every spoken detail: intent, tense, grammar quirks, filler words, repetitions, disfluencies, numbers, and casing.\n", - "- Keep timing words, partial words, hesitations (e.g., \"um\", \"uh\").\n", - "- Do not correct mistakes, infer meaning, answer questions, or insert punctuation beyond what the model already supplies.\n", - "- Do not invent or add any information that is not directly present in the user's latest turn.\n", "\n", - "# Output format\n", - "- Output the raw verbatim transcript as a single block of text. No labels, prefixes, quotes, bullets, or markdown.\n", - "- If the realtime model produced nothing for the latest turn, output nothing (empty response). Never fabricate content.\n", + "## 1. Scope of Your Task\n", "\n", - "## Policy Number Normalization\n", - "- All policy numbers should be 8 digits and of the format `XXXX-XXXX` for example `56B5-12C0`\n", + "1. 
**Only the latest user turn**\n", + " - Transcribe **only** the most recent spoken user turn.\n", + " - Do **not** include text from any earlier user turns or system / assistant messages.\n", + " - Do **not** summarize, merge, or stitch together content across multiple turns.\n", "\n", - "Do not summarize or paraphrase other turns beyond the latest user utterance. The response must be the literal transcript of the latest user utterance.\n", + "2. **Use past context only for disambiguation**\n", + " - You may look at earlier turns **only** to resolve ambiguity (e.g., a spelled word, a reference like “that thing I mentioned before”).\n", + " - Even when using context, the actual transcript must still contain **only the words spoken in the latest turn**.\n", + "\n", + "3. **No conversation management**\n", + " - You are **not** a dialogue agent.\n", + " - You do **not** answer questions, give advice, or continue the conversation.\n", + " - You only output the text of what the user just said.\n", + "\n", + "\n", + "## 2. Core Transcription Principles\n", + "\n", + "Your goal is to create a **perfectly faithful** transcript of the latest user turn.\n", + "\n", + "1. **Verbatim fidelity**\n", + " - Capture the user’s speech **exactly as spoken**.\n", + " - Preserve:\n", + " - All words (including incomplete or cut-off words)\n", + " - Mispronunciations\n", + " - Grammatical mistakes\n", + " - Slang and informal language\n", + " - Filler words (“um”, “uh”, “like”, “you know”, etc.)\n", + " - Self-corrections and restarts\n", + " - Repetitions and stutters\n", + "\n", + "2. **No rewriting or cleaning**\n", + " - Do **not**:\n", + " - Fix grammar or spelling\n", + " - Replace slang with formal language\n", + " - Reorder words\n", + " - Simplify or rewrite sentences\n", + " - “Smooth out” repetitions or disfluencies\n", + " - If the user says something awkward, incorrect, or incomplete, your transcript must **match that awkwardness or incompleteness exactly**.\n", + "\n", + "3. **Spelling and letter sequences**\n", + " - If the user spells a word (e.g., “That’s M-A-R-I-A.”), transcribe it exactly as spoken.\n", + " - If they spell something unclearly, still reflect what you received, even if it seems wrong.\n", + " - Do **not** infer the “intended” spelling; transcribe the letters as they were given.\n", + "\n", + "4. **Numerals and formatting**\n", + " - If the user says a number in words (e.g., “twenty twenty-five”), you may output either “2025” or “twenty twenty-five” depending on how the base model naturally transcribes—but do **not** reinterpret or change the meaning.\n", + " - Do **not**:\n", + " - Convert numbers into different units or formats.\n", + " - Expand abbreviations or acronyms beyond what was spoken.\n", + "\n", + "5. **Language and code-switching**\n", + " - If the user switches languages mid-sentence, reflect that in the transcript.\n", + " - Transcribe non-English content as accurately as possible.\n", + " - Do **not** translate; keep everything in the language(s) spoken.\n", + "\n", + "\n", + "## 3. Disfluencies, Non-Speech Sounds, and Ambiguity\n", + "\n", + "1. **Disfluencies**\n", + " - Always include:\n", + " - “Um”, “uh”, “er”\n", + " - Repeated words (“I I I think…”)\n", + " - False starts (“I went to the— I mean, I stayed home.”)\n", + " - Do not remove or compress them.\n", + "\n", + "2. 
**Non-speech vocalizations**\n", + " - If the model’s transcription capabilities represent non-speech sounds (e.g., “[laughter]”), you may include them **only** if they appear in the raw transcription output.\n", + " - Do **not** invent labels like “[cough]”, “[sigh]”, or “[laughs]” on your own.\n", + " - If the model does not explicitly provide such tokens, **omit them** rather than inventing them.\n", + "\n", + "3. **Unclear or ambiguous audio**\n", + " - If parts of the audio are unclear and the base transcription gives partial or uncertain tokens, you must **not** guess or fill in missing material.\n", + " - Do **not** replace unclear fragments with what you “think” the user meant.\n", + " - Your duty is to preserve exactly what the transcription model produced, even if it looks incomplete or strange.\n", + "\n", + "\n", + "## 4. Policy Numbers Format\n", + "\n", + "The user may sometimes mention **policy numbers**. These must be handled with extra care.\n", + "\n", + "1. **General rule**\n", + " - Always transcribe the policy number exactly as it was spoken.\n", + "\n", + "2. **Expected pattern**\n", + " - When the policy number fits the pattern `XXXX-XXXX`:\n", + " - `X` can be any letter (A–Z) or digit (0–9).\n", + " - Example: `56B5-12C0`\n", + " - If the user clearly speaks this pattern, preserve it exactly.\n", + "\n", + "3. **Do not “fix” policy numbers**\n", + " - If the spoken policy number does **not** match `XXXX-XXXX` (e.g., different length or missing hyphen), **do not**:\n", + " - Invent missing characters\n", + " - Add or remove hyphens\n", + " - Correct perceived mistakes\n", + " - Transcribe **exactly what was said**, even if it seems malformed.\n", + "\n", + "\n", + "## 5. Punctuation and Casing\n", + "\n", + "1. **Punctuation**\n", + " - Use the punctuation that the underlying transcription model naturally produces.\n", + " - Do **not**:\n", + " - Add extra punctuation for clarity or style.\n", + " - Re-punctuate sentences to “improve” them.\n", + " - If the transcription model emits text with **no punctuation**, leave it that way.\n", + "\n", + "2. **Casing**\n", + " - Preserve the casing (uppercase/lowercase) as the model output provides.\n", + " - Do not change “i” to “I” or adjust capitalization at sentence boundaries unless the model already did so.\n", + "\n", + "\n", + "## 6. Output Format Requirements\n", + "Your final output must be a **single, plain-text transcript** of the latest user turn.\n", + "\n", + "1. **Single block of text**\n", + " - Output only the transcript content.\n", + " - Do **not** include:\n", + " - Labels (e.g., “Transcript:”, “User said:”)\n", + " - Section headers\n", + " - Bullet points or numbering\n", + " - Markdown formatting or code fences\n", + " - Quotes or extra brackets\n", + "\n", + "2. **No additional commentary**\n", + " - Do not output:\n", + " - Explanations\n", + " - Apologies\n", + " - Notes about uncertainty\n", + " - References to these instructions\n", + " - The output must **only** be the words of the user’s last turn, as transcribed.\n", + "\n", + "3. **Empty turns**\n", + " - If the latest user turn contains **no transcribable content** (e.g., silence, noise, or the transcription model produces an empty string), you must:\n", + " - Return an **empty output** (no text at all).\n", + " - Do **not** insert placeholders like “[silence]”, “[no audio]”, or “(no transcript)”.\n", + "\n", + "## 7. What You Must Never Do\n", + "\n", + "1. 
**No responses or conversation**\n", + " - Do **not**:\n", + " - Address the user.\n", + " - Answer questions.\n", + " - Provide suggestions.\n", + " - Continue or extend the conversation.\n", + "\n", + "2. **No mention of rules or prompts**\n", + " - Do **not** refer to:\n", + " - These instructions\n", + " - The system prompt\n", + " - Internal reasoning or process\n", + " - The user should see **only** the transcript of their own speech.\n", + "\n", + "3. **No multi-turn aggregation**\n", + " - Do not combine the latest user turn with any previous turns.\n", + " - Do not produce summaries or overviews across turns.\n", + "\n", + "4. **No rewriting or “helpfulness”**\n", + " - Even if the user’s statement appears:\n", + " - Incorrect\n", + " - Confusing\n", + " - Impolite\n", + " - Incomplete\n", + " - Your job is **not** to fix or improve it. Your only job is to **transcribe** it exactly.\n", + "\n", + "\n", + "## 8. IMPORTANT REMINDER\n", + "\n", + "- You are **not** a chat assistant.\n", + "- You are **not** an editor, summarizer, or interpreter.\n", + "- You **are** a **verbatim transcription tool** for the latest user turn.\n", + "\n", + "Your output must be the **precise, literal, and complete transcript of the most recent user utterance**—with no additional content, no corrections, and no commentary.\n", "\"\"\"" ] }, @@ -214,7 +423,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 126, "id": "4b952a29", "metadata": {}, "outputs": [ @@ -222,7 +431,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", + "/var/folders/cn/p1ryy08146b7vvvhbh24j9b00000gn/T/ipykernel_48882/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", " from websockets.client import WebSocketClientProtocol\n" ] } @@ -251,7 +460,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 127, "id": "7254080a", "metadata": {}, "outputs": [], @@ -300,19 +509,19 @@ "- Audio input/output\n", "- Server‑side VAD\n", "- Set built‑in transcription (`input_audio_transcription_model`)\n", - " + We set this so that we can compare to the realtime model transcription\n", + " + We set this so that we can compare to the Realtime model transcription\n", "\n", - "The out‑of‑band transcription is a `response.create` trigerred after user input audio is committed `input_audio_buffer.committed`:\n", + "The out‑of‑band transcription is a `response.create` triggered after user input audio is committed `input_audio_buffer.committed`:\n", "\n", - "- `conversation: \"none\"` – use session state but don’t write to the main conversation session state\n", - "- `output_modalities: [\"text\"]` – get a text transcript only\n", + "- [`conversation: \"none\"`](https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-conversation) – use session state but don’t write to the main conversation session state\n", + "- [`output_modalities: [\"text\"]`](https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-output_modalities) – get a text transcript only\n", "\n", - "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" + "**Note**: The 
REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.\n" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 138, "id": "4baf1870", "metadata": {}, "outputs": [], @@ -377,17 +586,29 @@ " }\n", "\n", "\n", - "def build_transcription_request(transcription_instructions: str) -> dict[str, object]:\n", - " \"\"\"Ask the SAME Realtime model for an out-of-band transcript of the latest user turn.\"\"\"\n", + "def build_transcription_request(\n", + " transcription_instructions: str,\n", + " item_ids: list[str] | None = None,\n", + ") -> dict[str, object]:\n", + " \"\"\"Ask the SAME Realtime model for an out-of-band transcript of selected user turns.\n", + " If item_ids is provided, the model will only consider the turns with the given IDs. You can use this to limit the session context window.\n", + " \"\"\"\n", + "\n", + " response: dict[str, object] = {\n", + " \"conversation\": \"none\", # <--- out-of-band\n", + " \"output_modalities\": [\"text\"],\n", + " \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE}, # easier to identify in the logs\n", + " \"instructions\": transcription_instructions,\n", + " }\n", + "\n", + " if item_ids:\n", + " response[\"input\"] = [\n", + " {\"type\": \"item_reference\", \"id\": item_id} for item_id in item_ids\n", + " ]\n", "\n", " return {\n", " \"type\": \"response.create\",\n", - " \"response\": {\n", - " \"conversation\": \"none\", # <--- out-of-band\n", - " \"output_modalities\": [\"text\"],\n", - " \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE}, # <--- we add metadata so it is easier to identify the event in the logs\n", - " \"instructions\": transcription_instructions,\n", - " },\n", + " \"response\": response,\n", " }\n" ] }, @@ -408,7 +629,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 139, "id": "11218bbb", "metadata": {}, "outputs": [], @@ -527,7 +748,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 140, "id": "cb6acbf0", "metadata": {}, "outputs": [], @@ -537,6 +758,8 @@ "\n", " pending_prints: deque | None = shared_state.get(\"pending_transcription_prints\")\n", " input_transcripts: deque | None = shared_state.get(\"input_transcripts\")\n", + " transcription_model_costs: deque | None = shared_state.get(\"transcription_model_costs\")\n", + " debug_usage_and_cost: bool = bool(shared_state.get(\"debug_usage_and_cost\", False))\n", "\n", " if not pending_prints or not input_transcripts:\n", " return\n", @@ -547,10 +770,35 @@ " print(\"=== User turn (Transcription model) ===\")\n", " if comparison_text:\n", " print(comparison_text, flush=True)\n", - " print()\n", " else:\n", " print(\"\", flush=True)\n", - " print()\n" + "\n", + " # After printing the transcription text, print any stored granular cost.\n", + " cost_info = None\n", + " if transcription_model_costs:\n", + " cost_info = transcription_model_costs.popleft()\n", + "\n", + " if cost_info and debug_usage_and_cost:\n", + " audio_input_cost = cost_info.get(\"audio_input_cost\", 0.0)\n", + " text_input_cost = cost_info.get(\"text_input_cost\", 0.0)\n", + " text_output_cost = cost_info.get(\"text_output_cost\", 0.0)\n", + " total_cost = cost_info.get(\"total_cost\", 0.0)\n", + "\n", + " usage = cost_info.get(\"usage\")\n", + " if usage:\n", + " print(\"[Transcription model usage]\")\n", + " print(json.dumps(usage, indent=2))\n", + "\n", + " print(\n", + " \"[Transcription model cost estimate] \"\n", + " 
f\"audio_in=${audio_input_cost:.6f}, \"\n", + " f\"text_in=${text_input_cost:.6f}, \"\n", + " f\"text_out=${text_output_cost:.6f}, \"\n", + " f\"total=${total_cost:.6f}\",\n", + " flush=True,\n", + " )\n", + "\n", + " print()\n" ] }, { @@ -563,14 +811,137 @@ "`listen_for_events` drives the session:\n", "\n", "- Watches for `speech_started` / `speech_stopped` / `committed`\n", - "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`)\n", + "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`) when only_last_user_turn == False\n", + "- Sends the out‑of‑band transcription request when a user turn is added to conversation (`conversation.item.added\"`) when only_last_user_turn == True\n", + "- Calculates token usage and cost for both transcription methods\n", "- Streams assistant audio to the playback queue\n", "- Buffers text deltas per `response_id`" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 148, + "id": "32dc2aac", + "metadata": {}, + "outputs": [], + "source": [ + "# Pricing constants (USD per 1M tokens). See https://platform.openai.com/pricing.\n", + "# gpt-4o-transcribe\n", + "GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M = 6.00\n", + "GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M = 2.50\n", + "GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M = 10.00\n", + "\n", + "# gpt-realtime\n", + "REALTIME_TEXT_INPUT_PRICE_PER_1M = 4\n", + "REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M = 0.4\n", + "REALTIME_TEXT_OUTPUT_PRICE_PER_1M = 16.00\n", + "REALTIME_AUDIO_INPUT_PRICE_PER_1M = 32.00\n", + "REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M = 0.40\n", + "REALTIME_AUDIO_OUTPUT_PRICE_PER_1M = 64.00\n", + "\n", + "def _compute_transcription_model_cost(usage: dict | None) -> dict | None:\n", + " if not usage:\n", + " return None\n", + "\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " output_tokens = usage.get(\"output_tokens\") or 0\n", + "\n", + " audio_input_cost = (\n", + " audio_input_tokens * GPT4O_TRANSCRIBE_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_input_cost = (\n", + " text_input_tokens * GPT4O_TRANSCRIBE_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " output_tokens * GPT4O_TRANSCRIBE_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " total_cost = audio_input_cost + text_input_cost + text_output_cost\n", + "\n", + " return {\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"text_input_cost\": text_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }\n", + "\n", + "def _compute_realtime_oob_cost(usage: dict | None) -> dict | None:\n", + " if not usage:\n", + " return None\n", + "\n", + " input_details = usage.get(\"input_token_details\") or {}\n", + " output_details = usage.get(\"output_token_details\") or {}\n", + " cached_details = input_details.get(\"cached_tokens_details\") or {}\n", + "\n", + " text_input_tokens = input_details.get(\"text_tokens\") or 0\n", + " cached_text_tokens = (\n", + " cached_details.get(\"text_tokens\")\n", + " or input_details.get(\"cached_tokens\")\n", + " or 0\n", + " )\n", + " non_cached_text_input_tokens = max(text_input_tokens - cached_text_tokens, 0)\n", + "\n", + " audio_input_tokens = input_details.get(\"audio_tokens\") or 0\n", + 
" cached_audio_tokens = cached_details.get(\"audio_tokens\") or 0\n", + " non_cached_audio_input_tokens = max(audio_input_tokens - cached_audio_tokens, 0)\n", + "\n", + " text_output_tokens = output_details.get(\"text_tokens\") or 0\n", + " audio_output_tokens = output_details.get(\"audio_tokens\") or 0\n", + "\n", + " text_input_cost = (\n", + " non_cached_text_input_tokens * REALTIME_TEXT_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_text_input_cost = (\n", + " cached_text_tokens * REALTIME_TEXT_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_input_cost = (\n", + " non_cached_audio_input_tokens * REALTIME_AUDIO_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " cached_audio_input_cost = (\n", + " cached_audio_tokens * REALTIME_AUDIO_CACHED_INPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " text_output_cost = (\n", + " text_output_tokens * REALTIME_TEXT_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + " audio_output_cost = (\n", + " audio_output_tokens * REALTIME_AUDIO_OUTPUT_PRICE_PER_1M\n", + " / 1_000_000\n", + " )\n", + "\n", + " total_cost = (\n", + " text_input_cost\n", + " + cached_text_input_cost\n", + " + audio_input_cost\n", + " + cached_audio_input_cost\n", + " + text_output_cost\n", + " + audio_output_cost\n", + " )\n", + "\n", + " return {\n", + " \"text_input_cost\": text_input_cost,\n", + " \"cached_text_input_cost\": cached_text_input_cost,\n", + " \"audio_input_cost\": audio_input_cost,\n", + " \"cached_audio_input_cost\": cached_audio_input_cost,\n", + " \"text_output_cost\": text_output_cost,\n", + " \"audio_output_cost\": audio_output_cost,\n", + " \"total_cost\": total_cost,\n", + " \"usage\": usage,\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": 149, "id": "d099babd", "metadata": {}, "outputs": [], @@ -594,6 +965,12 @@ " pending_transcription_prints = shared_state.setdefault(\n", " \"pending_transcription_prints\", deque()\n", " )\n", + " transcription_model_costs = shared_state.setdefault(\n", + " \"transcription_model_costs\", deque()\n", + " )\n", + " debug_usage_and_cost: bool = bool(shared_state.get(\"debug_usage_and_cost\", False))\n", + " only_last_user_turn: bool = bool(shared_state.get(\"only_last_user_turn\", False))\n", + " last_user_audio_item_id: str | None = None\n", "\n", " async for raw in ws:\n", " if stop_event.is_set():\n", @@ -611,14 +988,42 @@ " if message_type == \"input_audio_buffer.speech_stopped\":\n", " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", "\n", - " # This is where the out-of-band transcription request is sent. 
<-------\n", - " if awaiting_transcription_prompt:\n", + " # Default behavior: trigger immediately after audio commit unless\n", + " # only_last_user_turn requires waiting for conversation.item.added.\n", + " if awaiting_transcription_prompt and not only_last_user_turn:\n", " request_payload = build_transcription_request(\n", - " transcription_instructions\n", + " transcription_instructions,\n", + " item_ids=None,\n", " )\n", " await ws.send(json.dumps(request_payload))\n", " awaiting_transcription_prompt = False\n", "\n", + " elif message_type == \"conversation.item.added\":\n", + " item = message.get(\"item\") or {}\n", + " item_id = item.get(\"id\")\n", + " role = item.get(\"role\")\n", + " status = item.get(\"status\")\n", + " content_blocks = item.get(\"content\") or []\n", + " has_user_audio = any(\n", + " block.get(\"type\") == \"input_audio\" for block in content_blocks\n", + " )\n", + "\n", + " if (\n", + " role == \"user\"\n", + " and status == \"completed\"\n", + " and has_user_audio\n", + " and item_id\n", + " ):\n", + " last_user_audio_item_id = item_id\n", + "\n", + " if only_last_user_turn and awaiting_transcription_prompt:\n", + " request_payload = build_transcription_request(\n", + " transcription_instructions,\n", + " item_ids=[item_id],\n", + " )\n", + " await ws.send(json.dumps(request_payload))\n", + " awaiting_transcription_prompt = False\n", + "\n", " # --- Built-in transcription model stream -------------------------------\n", " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", @@ -648,7 +1053,12 @@ " final_text = item.get(\"transcription\")\n", " final_text = final_text or \"\"\n", "\n", - " final_text = final_text.strip()\n", + " # Compute and store cost estimate for the transcription model (e.g., gpt-4o-transcribe).\n", + " usage = message.get(\"usage\") or {}\n", + " cost_info = _compute_transcription_model_cost(usage)\n", + " transcription_model_costs.append(cost_info)\n", + "\n", + " final_text = (final_text or \"\").strip()\n", " if final_text:\n", " input_transcripts.append(final_text)\n", " flush_pending_transcription_prints(shared_state)\n", @@ -706,11 +1116,34 @@ " responses[response_id][\"done\"] = True\n", "\n", " is_transcription = responses[response_id][\"is_transcription\"]\n", + "\n", + " # For out-of-band transcription responses, compute usage-based cost estimates.\n", + " usage = response.get(\"usage\") or {}\n", + " oob_cost_info: dict | None = None\n", + " if usage and is_transcription:\n", + " oob_cost_info = _compute_realtime_oob_cost(usage)\n", + "\n", " text = buffers.get(response_id, \"\").strip()\n", " if text:\n", " if is_transcription:\n", " print(\"\\n=== User turn (Realtime transcript) ===\")\n", " print(text, flush=True)\n", + " if debug_usage_and_cost and oob_cost_info:\n", + " usage_for_print = oob_cost_info.get(\"usage\")\n", + " if usage_for_print:\n", + " print(\"[Realtime out-of-band transcription usage]\")\n", + " print(json.dumps(usage_for_print, indent=2))\n", + " print(\n", + " \"[Realtime out-of-band transcription cost estimate] \"\n", + " f\"text_in=${oob_cost_info['text_input_cost']:.6f}, \"\n", + " f\"text_in_cached=${oob_cost_info['cached_text_input_cost']:.6f}, \"\n", + " f\"audio_in=${oob_cost_info['audio_input_cost']:.6f}, \"\n", + " f\"audio_in_cached=${oob_cost_info['cached_audio_input_cost']:.6f}, \"\n", + " f\"text_out=${oob_cost_info['text_output_cost']:.6f}, \"\n", + " 
f\"audio_out=${oob_cost_info['audio_output_cost']:.6f}, \"\n", + " f\"total=${oob_cost_info['total_cost']:.6f}\",\n", + " flush=True,\n", + " )\n", " print()\n", " pending_transcription_prints.append(object())\n", " flush_pending_transcription_prints(shared_state)\n", @@ -743,7 +1176,7 @@ "source": [ "# 9. Run Script\n", "\n", - "In this step, we run the the code which will allow us to view the realtime model transcription vs transcription model transcriptions. The code does the following:\n", + "In this step, we run the code which will allow us to view the Realtime model transcription vs transcription model transcriptions. The code does the following:\n", "\n", "- Loads configuration and prompts\n", "- Establishes a WebSocket connection\n", @@ -774,7 +1207,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 150, "id": "35c4d7b5", "metadata": {}, "outputs": [], @@ -793,6 +1226,8 @@ " idle_timeout_ms: int | None = None,\n", " max_turns: int | None = None,\n", " timeout_seconds: int = 0,\n", + " debug_usage_and_cost: bool = True,\n", + " only_last_user_turn: bool = False,\n", ") -> None:\n", " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", @@ -816,6 +1251,8 @@ " \"mute_mic\": False,\n", " \"input_transcripts\": deque(),\n", " \"pending_transcription_prints\": deque(),\n", + " \"debug_usage_and_cost\": debug_usage_and_cost,\n", + " \"only_last_user_turn\": only_last_user_turn,\n", " }\n", "\n", " async with websockets.connect(\n", @@ -858,7 +1295,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "id": "c9a2a33b", "metadata": {}, "outputs": [ @@ -966,7 +1403,7 @@ } ], "source": [ - "await run_realtime_session()" + "await run_realtime_session(debug_usage_and_cost=False)" ] }, { @@ -976,8 +1413,857 @@ "source": [ "From the above example, we can notice:\n", "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. In one of the turns, the transcription model misses \"this is important.\" while the realtime transcription gets it correctly.\n", - "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", - "- With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." + "- The Realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n", + "- With context from the entire session, including previous turns where I spelled out my name, the Realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., \"Minhaj ul Haq\")." + ] + }, + { + "cell_type": "markdown", + "id": "bd1f343b", + "metadata": {}, + "source": [ + "## Example with Cost Calculations\n", + "\n", + "There are significant price differences between the available methods for transcribing user audio. GPT-4o-Transcribe is by far the most cost-effective approach: it charges only for the raw audio input and a small amount of text output, resulting in transcripts that cost just fractions of a cent per turn. In contrast, using the Realtime model for out-of-band transcription is more expensive. If you transcribe only the latest user turn with Realtime, it typically costs about 3–5× more than GPT-4o-Transcribe. 
If you include the full session context in each transcription request, the cost can increase to about 16–20× higher. This is because each request to the Realtime model processes the entire session context again at higher pricing, and the cost grows as the conversation gets longer." + ] + }, + { + "cell_type": "markdown", + "id": "79819701", + "metadata": {}, + "source": [ + "### Cost for Transcribing Only the Latest Turn\n", + "Let's walk through an example that uses full session context for realtime out-of-band transcription:" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "4a0f9911", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_Cfpt8RCQdpsNsz2OZ4rxQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_Cfpt9JS3PCvlCxoO15mLt', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello. How can I help you today?\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1841,\n", + " \"input_tokens\": 1830,\n", + " \"output_tokens\": 11,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1830,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 11,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.007320, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.007496\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 19,\n", + " \"input_tokens\": 16,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 16\n", + " },\n", + " \"output_tokens\": 3\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1327,\n", + " \"input_tokens\": 1042,\n", + " \"output_tokens\": 285,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1026,\n", + " \"audio_tokens\": 16,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 66,\n", + " \"audio_tokens\": 219\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for calling OpenAI Insurance Claims. My name is Ava, and I’ll help you file your claim today. Let’s start with your full legal name as it appears on your policy. 
Could you share that with me, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfptNPygis1UcQYQMDh1f', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfptSg4tU6WnRkdiPvR3D', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full legal name would be M-I-N-H, H-O-Q-U-E.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2020,\n", + " \"input_tokens\": 2001,\n", + " \"output_tokens\": 19,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1906,\n", + " \"audio_tokens\": 95,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 19,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000200, text_in_cached=$0.000742, audio_in=$0.003040, audio_in_cached=$0.000000, text_out=$0.000304, audio_out=$0.000000, total=$0.004286\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My full legal name would be Minhajul Hoque.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 71,\n", + " \"input_tokens\": 57,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 57\n", + " },\n", + " \"output_tokens\": 14\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000140, total=$0.000482\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1675,\n", + " \"input_tokens\": 1394,\n", + " \"output_tokens\": 281,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1102,\n", + " \"audio_tokens\": 292,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1344,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1088,\n", + " \"audio_tokens\": 256,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 63,\n", + " \"audio_tokens\": 218\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you, Minhajul Hoque. I’ve got your full name noted. Next, may I have your policy number? 
Please share it in the format of four digits, a dash, and then four more digits.\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpthEQKfNqaoD86Iolvf', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfptnqCGAdlEXuAxGUvvK', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number is P-0-0-2-X-0-7-5.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2137,\n", + " \"input_tokens\": 2116,\n", + " \"output_tokens\": 21,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1963,\n", + " \"audio_tokens\": 153,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 21,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000428, text_in_cached=$0.000742, audio_in=$0.004896, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.006402\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My policy number is P002X075.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 70,\n", + " \"input_tokens\": 59,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 59\n", + " },\n", + " \"output_tokens\": 11\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000354, text_in=$0.000000, text_out=$0.000110, total=$0.000464\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1811,\n", + " \"input_tokens\": 1509,\n", + " \"output_tokens\": 302,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1159,\n", + " \"audio_tokens\": 350,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 57,\n", + " \"audio_tokens\": 245\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "I want to confirm I heard that correctly. It sounded like your policy number is P002-X075. 
Could you please confirm if that’s correct, or provide any clarification if needed?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_Cfpu59HqXhBMHvHmW0SvX', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_Cfpu8juH7cCWuQAxCsYUT', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "That is indeed correct.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2233,\n", + " \"input_tokens\": 2226,\n", + " \"output_tokens\": 7,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2014,\n", + " \"audio_tokens\": 212,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 7,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000632, text_in_cached=$0.000742, audio_in=$0.006784, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.008270\n", + "\n", + "=== User turn (Transcription model) ===\n", + "That is indeed correct.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 39,\n", + " \"input_tokens\": 32,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 32\n", + " },\n", + " \"output_tokens\": 7\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000192, text_in=$0.000000, text_out=$0.000070, total=$0.000262\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1818,\n", + " \"input_tokens\": 1619,\n", + " \"output_tokens\": 199,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1210,\n", + " \"audio_tokens\": 409,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 49,\n", + " \"audio_tokens\": 150\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you for confirming. Now, could you tell me the type of accident you’re filing this claim for—whether it’s auto, home, or something else?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpuJcnmWJEzfxS2MgHv0', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfpuPtFYTrlz1uQJBKMVF', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "It's an auto one, but I think you got my name wrong. 
Can you ask my name again?\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2255,\n", + " \"input_tokens\": 2232,\n", + " \"output_tokens\": 23,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2055,\n", + " \"audio_tokens\": 177,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 23,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000796, text_in_cached=$0.000742, audio_in=$0.005664, audio_in_cached=$0.000000, text_out=$0.000368, audio_out=$0.000000, total=$0.007570\n", + "\n", + "=== User turn (Transcription model) ===\n", + "It's a auto one, but I think you got my name wrong, can you ask my name again?\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 83,\n", + " \"input_tokens\": 60,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 60\n", + " },\n", + " \"output_tokens\": 23\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000360, text_in=$0.000000, text_out=$0.000230, total=$0.000590\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1779,\n", + " \"input_tokens\": 1625,\n", + " \"output_tokens\": 154,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1251,\n", + " \"audio_tokens\": 374,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 41,\n", + " \"audio_tokens\": 113\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Of course, let’s make sure I have it correct. 
Could you please spell out your full legal name for me again, carefully?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "conversation.item.added: {'id': 'item_CfpuYJBwNQubeb7uuHqQQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}\n", + "conversation.item.added: {'id': 'item_CfpuaI6ZvKBwZG6yXxE1l', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Minhajul Hoque.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 2261,\n", + " \"input_tokens\": 2252,\n", + " \"output_tokens\": 9,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 2092,\n", + " \"audio_tokens\": 160,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1856,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1856,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 9,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000944, text_in_cached=$0.000742, audio_in=$0.005120, audio_in_cached=$0.000000, text_out=$0.000144, audio_out=$0.000000, total=$0.006950\n", + "\n", + "=== User turn (Transcription model) ===\n", + "مينهاجو حق.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 27,\n", + " \"input_tokens\": 20,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 20\n", + " },\n", + " \"output_tokens\": 7\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000120, text_in=$0.000000, text_out=$0.000070, total=$0.000190\n", + "\n", + "[Realtime usage]\n", + "{\n", + " \"total_tokens\": 1902,\n", + " \"input_tokens\": 1645,\n", + " \"output_tokens\": 257,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1288,\n", + " \"audio_tokens\": 357,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 832,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 832,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 54,\n", + " \"audio_tokens\": 203\n", + " }\n", + "}\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let me confirm: your full legal name is spelled M-I-N-H-A-J-U-L, and the last name H-O-Q-U-E. 
Is that correct?\n",
+ "\n",
+ "Session cancelled; closing.\n"
+ ]
+ }
+ ],
+ "source": [
+ "await run_realtime_session(debug_usage_and_cost=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7567b84c",
+ "metadata": {},
+ "source": [
+ "#### Transcription Cost Comparison\n",
+ "\n",
+ "##### Costs Summary\n",
+ "\n",
+ "* **Realtime Out-of-Band (OOB):** $0.040974 total (~$0.006829 per turn)\n",
+ "* **Dedicated Transcription:** $0.002114 total (~$0.000352 per turn)\n",
+ "* **OOB is ~19× more expensive when using the full session context**\n",
+ "\n",
+ "##### Considerations\n",
+ "\n",
+ "* **Caching:** Because these conversations are short, you benefit little from caching beyond the initial system prompt.\n",
+ "* **Transcription System Prompt:** The dedicated transcription model runs with a minimal system prompt, while the OOB request carries the full transcription instructions on every turn, so its text input costs are typically higher.\n",
+ "\n",
+ "##### Recommended Cost-Saving Strategy\n",
+ "\n",
+ "* **Limit transcription to recent turns:** Minimizing the audio/text context sent with each request significantly reduces OOB transcription costs.\n",
+ "\n",
+ "##### Understanding Cache Behavior\n",
+ "\n",
+ "* Effective caching requires a stable instruction prefix (usually 1,024+ tokens).\n",
+ "* Because the OOB request and the main assistant response use different instruction prompts, they maintain separate caches."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "59f508c4",
+ "metadata": {},
+ "source": [
+ "### Cost for Transcribing Only the Latest Turn\n",
+ "\n",
+ "You can limit transcription to the latest user turn by supplying `item_reference` entries as the response `input` when building the out-of-band `response.create` event:\n",
+ "```python\n",
+ "# Inside the helper that builds the out-of-band transcription request.\n",
+ "# `response` already carries the out-of-band settings (no writes to the\n",
+ "# default conversation, text-only output, transcription instructions).\n",
+ "if item_ids:\n",
+ "    # Reference only the supplied items (e.g. the latest user audio turn)\n",
+ "    # instead of sending the full session context.\n",
+ "    response[\"input\"] = [\n",
+ "        {\"type\": \"item_reference\", \"id\": item_id} for item_id in item_ids\n",
+ "    ]\n",
+ "\n",
+ "return {\n",
+ "    \"type\": \"response.create\",\n",
+ "    \"response\": response,\n",
+ "}\n",
+ "```\n",
+ "\n",
+ "Transcribing just the most recent user turn lowers costs by restricting the session context sent to the model. However, this approach has trade-offs: the model won’t have access to previous conversation history to help resolve ambiguities or correct errors (for example, accurately recalling a name mentioned earlier). Additionally, because the referenced input changes every turn, you see little caching benefit beyond the static instruction prompt: the cache prefix stops at the instructions, so you don’t accumulate reusable conversation context.\n",
+ "\n",
+ "Now, let’s look at a second example that uses only the most recent user audio turn for Realtime out-of-band transcription:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 136,
+ "id": "7d42ceb8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1813,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 4,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 0,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 4,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.007236, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.007300\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 17,\n", + " \"input_tokens\": 14,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 14\n", + " },\n", + " \"output_tokens\": 3\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000084, text_in=$0.000000, text_out=$0.000030, total=$0.000114\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My full legal name is M-I-N-H A-J-U-L H-O-Q-U-E\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1829,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 20,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 20,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.001105\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My full legal name is Minhajul Hoque.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 87,\n", + " \"input_tokens\": 74,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 74\n", + " },\n", + " \"output_tokens\": 13\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000444, text_in=$0.000000, text_out=$0.000130, total=$0.000574\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you, Minhajul Hoque. I’ve noted your full legal name. Next, could you please provide your policy number? 
Remember, it's usually in a format like XXXX-XXXX.\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number is X007-PX75.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1821,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Sure, my policy number is AG007-PX75.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 102,\n", + " \"input_tokens\": 88,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 88\n", + " },\n", + " \"output_tokens\": 14\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000528, text_in=$0.000000, text_out=$0.000140, total=$0.000668\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Just to confirm, I heard your policy number as E G 0 0 7 - P X 7 5. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "No, I said X007-PX75.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1821,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 12,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 12,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977\n", + "\n", + "=== User turn (Transcription model) ===\n", + "No, I said X007-PX75.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 65,\n", + " \"input_tokens\": 53,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 53\n", + " },\n", + " \"output_tokens\": 12\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000318, text_in=$0.000000, text_out=$0.000120, total=$0.000438\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you for clarifying. I’ve got it now. Your policy number is E G 0 0 7 - P X 7 5. Let’s move on. 
Could you tell me the type of accident—is it auto, home, or something else?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "It's an auto, but I think you got my name wrong, can you ask me again?\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1830,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 21,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 21,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.001121\n", + "\n", + "=== User turn (Transcription model) ===\n", + "It's an auto, but I think you got my name wrong. Can you ask me again?\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 67,\n", + " \"input_tokens\": 46,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 46\n", + " },\n", + " \"output_tokens\": 21\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000276, text_in=$0.000000, text_out=$0.000210, total=$0.000486\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Of course, I’m happy to correct that. Let’s go back. Could you please spell your full legal name for me, so I can make sure I’ve got it exactly right?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yeah, my full legal name is Minhajul Haque.\n", + "[Realtime out-of-band transcription usage]\n", + "{\n", + " \"total_tokens\": 1824,\n", + " \"input_tokens\": 1809,\n", + " \"output_tokens\": 15,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 1809,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0,\n", + " \"cached_tokens\": 1792,\n", + " \"cached_tokens_details\": {\n", + " \"text_tokens\": 1792,\n", + " \"audio_tokens\": 0,\n", + " \"image_tokens\": 0\n", + " }\n", + " },\n", + " \"output_token_details\": {\n", + " \"text_tokens\": 15,\n", + " \"audio_tokens\": 0\n", + " }\n", + "}\n", + "[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000240, audio_out=$0.000000, total=$0.001025\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yeah, my full legal name is Minhajul Haque.\n", + "[Transcription model usage]\n", + "{\n", + " \"type\": \"tokens\",\n", + " \"total_tokens\": 60,\n", + " \"input_tokens\": 45,\n", + " \"input_token_details\": {\n", + " \"text_tokens\": 0,\n", + " \"audio_tokens\": 45\n", + " },\n", + " \"output_tokens\": 15\n", + "}\n", + "[Transcription model cost estimate] audio_in=$0.000270, text_in=$0.000000, text_out=$0.000150, total=$0.000420\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you for that. Just to confirm, your full legal name is Minhajul Hoque. 
Is that correct?\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session(debug_usage_and_cost=True, only_last_user_turn=True)" + ] + }, + { + "cell_type": "markdown", + "id": "820420e5", + "metadata": {}, + "source": [ + "#### Cost Analysis Summary\n", + "\n", + "Realtime Out-of-Band Transcription (OOB)\n", + "\n", + "* **Total Cost:** $0.013354\n", + "* **Average per Turn:** ~$0.001908\n", + "\n", + "Dedicated Transcription Model\n", + "\n", + "* **Total Cost:** $0.002630\n", + "* **Average per Turn:** ~$0.000376\n", + "\n", + "\n", + "Difference in Costs\n", + "\n", + "* **Additional cost using OOB:** **+$0.010724**\n", + "* **Cost Multiplier:** OOB is about **5×** more expensive than the dedicated transcription model.\n", + "\n", + "This approach costs significantly less than using the full session context. You should evaluate your use case to decide whether regular transcription, out-of-band transcription with full context, or transcribing only the latest turn best fits your needs. You can also choose an intermediate strategy, such as including just the last N turns in the input.\n" ] }, { @@ -993,10 +2279,19 @@ "* You need a more reliable and steerable method for generating transcriptions.\n", "* The current transcripts fail to normalize entities correctly, causing downstream issues.\n", "\n", + "Keep in mind the trade-offs:\n", + "- Cost: Out-of-band (OOB) transcription is more expensive. Be sure that the extra expense makes sense for your typical session lengths and business needs.\n", + "- Complexity: Implementing OOB transcription takes extra engineering effort to connect all the pieces correctly. Only choose this approach if its benefits are important for your use case.\n", + "\n", "If you decide to pursue this method, make sure you:\n", "\n", "* Set up the transcription trigger correctly, ensuring it activates after the audio commit.\n", - "* Carefully iterate and refine the prompt to align closely with your specific use case and needs.\n" + "* Carefully iterate and refine the prompt to align closely with your specific use case and needs.\n", + "\n", + "## Documentation:\n", + "- https://platform.openai.com/docs/guides/realtime-conversations#create-responses-outside-the-default-conversation\n", + "- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-conversation\n", + "- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-output_modalities" ] } ], @@ -1016,7 +2311,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.8" + "version": "3.12.9" } }, "nbformat": 4,