Conversation

@tanthcstt

Description

This PR adds support for the OpenAI-compatible `/v1/audio/transcriptions` speech-to-text endpoint, with request/response translation for GCP Vertex AI backends.

Key features

  • Streaming (SSE) and non-streaming response handling
  • SSE streaming chunk parsing and accumulation of transcription text
  • Token usage tracking (input/output/total) and metrics emission (see the mapping sketch below)
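
For illustration, a hedged sketch of how Gemini usage metadata could map onto the gateway's token accounting. The helper and struct here are hypothetical stand-ins, not the PR's actual `metrics.TokenUsage` definition:

```go
package translator

import "google.golang.org/genai"

// tokenUsage is a hypothetical local stand-in for the gateway's
// metrics.TokenUsage; its field names are assumptions for illustration.
type tokenUsage struct {
	InputTokens, OutputTokens, TotalTokens uint32
}

// extractTokenUsage maps Gemini usage metadata onto the stand-in struct.
func extractTokenUsage(resp *genai.GenerateContentResponse) tokenUsage {
	var u tokenUsage
	if md := resp.UsageMetadata; md != nil {
		u.InputTokens = uint32(md.PromptTokenCount)      // tokens in the audio prompt
		u.OutputTokens = uint32(md.CandidatesTokenCount) // transcription output tokens
		u.TotalTokens = uint32(md.TotalTokenCount)
	}
	return u
}
```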

Additional notes: clients may request server-sent-event (SSE) streaming by passing an explicit `stream` boolean parameter (see the table below). When `stream=true`, the gateway selects the backend's streaming endpoint and translates its SSE chunks into OpenAI-compatible streaming responses.

Supported request parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `model` | string | Model identifier (e.g., `whisper-1`) |
| `language` | string | Language of the audio (optional) |
| `prompt` | string | Custom transcription prompt (optional) |
| `response_format` | string | Output format (optional) |
| `temperature` | float | Sampling temperature (optional) |
| `stream` | bool | Request streaming (SSE) responses when true (optional) |
| `timestamp_granularities` | []string | Timestamp detail level (optional) |
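
For illustration, a minimal Go client sketch exercising these parameters (the gateway address `http://localhost:1975` is a placeholder, not something this PR defines):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Build a multipart/form-data body with the audio file and parameters.
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	fw, err := w.CreateFormFile("file", "sample.wav")
	if err != nil {
		panic(err)
	}
	f, err := os.Open("sample.wav")
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(fw, f); err != nil {
		panic(err)
	}
	f.Close()
	w.WriteField("model", "whisper-1")
	w.WriteField("stream", "true") // opt in to SSE streaming
	w.Close()

	resp, err := http.Post("http://localhost:1975/v1/audio/transcriptions",
		w.FormDataContentType(), &buf)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // SSE chunks when stream=true, JSON otherwise
}
```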

Backend support

  • OpenAI: a passthrough translator that forwards request bodies, with an optional model override.
  • GCP Vertex AI / Gemini: a translator that builds Gemini requests, handles streaming and non-streaming responses, and converts them back to OpenAI-compatible responses (a non-streaming sketch follows).
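
As a rough illustration of the non-streaming conversion path, a hedged sketch (not the PR's actual code; `geminiToOpenAITranscription` is a hypothetical helper, and OpenAI's default `json` response format is simply `{"text": "..."}`):

```go
package translator

import (
	"encoding/json"
	"strings"

	"google.golang.org/genai"
)

// geminiToOpenAITranscription collapses Gemini candidates into OpenAI's
// default json response shape {"text": "..."}. Hypothetical helper.
func geminiToOpenAITranscription(resp *genai.GenerateContentResponse) ([]byte, error) {
	var text strings.Builder
	for _, cand := range resp.Candidates {
		if cand.Content == nil {
			continue
		}
		for _, part := range cand.Content.Parts {
			text.WriteString(part.Text) // Gemini returns the transcription as text parts
		}
	}
	return json.Marshal(map[string]string{"text": text.String()})
}
```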

Related issues / PRs

…ptions

**Description**
Add audio transcription support for OpenAI and GCP Vertex AI.

Signed-off-by: tanthcstt <[email protected]>
@tanthcstt tanthcstt marked this pull request as ready for review December 1, 2025 02:27
@tanthcstt tanthcstt requested a review from a team as a code owner December 1, 2025 02:27
Copilot AI review requested due to automatic review settings December 1, 2025 02:27
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Dec 1, 2025
Copilot finished reviewing on behalf of tanthcstt December 1, 2025 02:31
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for the OpenAI-compatible /v1/audio/transcriptions endpoint, enabling speech-to-text functionality with translation support for GCP Vertex AI backends. The implementation follows the existing patterns in the codebase for handling multi-backend API translation with comprehensive metrics and testing.

Key changes:

  • Implements the audio transcription endpoint with multipart/form-data and JSON request support
  • Adds OpenAI-to-GCP Vertex AI translator with streaming SSE response handling
  • Introduces token usage tracking and GenAI operation metrics for audio transcription
  • Provides comprehensive test coverage for translators, processors, and metrics

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `internal/translator/audio_transcription_openai_gcpvertexai.go` | Core translator converting OpenAI audio transcription requests to the GCP Vertex AI Gemini API format, with SSE streaming support |
| `internal/translator/audio_transcription_openai_gcpvertexai_test.go` | Comprehensive test coverage for the GCP Vertex AI translator, including multipart parsing, streaming, and token usage |
| `internal/translator/audio_openai_openai.go` | Passthrough translator for OpenAI backends (minimal transformation) |
| `internal/translator/audio_openai_openai_test.go` | Tests for the OpenAI passthrough translator |
| `internal/extproc/audiotranscription_processor.go` | Request processor implementing the router and upstream filter logic for the audio transcription endpoint |
| `internal/extproc/audiotranscription_processor_test.go` | Extensive processor tests covering multipart parsing, backend selection, and error scenarios |
| `internal/apischema/openai/audio.go` | OpenAI-compatible schema definitions for audio transcription requests and responses |
| `internal/metrics/audio_transcription_metrics.go` | Metrics factory setup for audio transcription operations |
| `internal/metrics/audio_transcription_metrics_test.go` | Metrics tests with token usage tracking and multiple backend scenarios |
| `internal/metrics/genai.go` | Adds the GenAIOperationAudioTranscription operation constant to the existing metrics |
| `cmd/extproc/mainlib/main.go` | Registers the audio transcription endpoint and metrics factory in the main server |

Comment on lines +32 to +46
```go
ext := strings.ToLower(filename)
switch {
case strings.HasSuffix(ext, ".wav"):
	return "audio/wav"
case strings.HasSuffix(ext, ".mp3"):
	return "audio/mpeg"
case strings.HasSuffix(ext, ".m4a"):
	return "audio/mp4"
case strings.HasSuffix(ext, ".ogg"):
	return "audio/ogg"
case strings.HasSuffix(ext, ".flac"):
	return "audio/flac"
case strings.HasSuffix(ext, ".webm"):
	return "audio/webm"
case strings.HasSuffix(ext, ".aac"):
```

Copilot AI Dec 1, 2025

The variable name `ext` is misleading since it contains the entire filename (converted to lowercase), not just the extension. Consider renaming it to `lowerFilename` or `filename` for clarity:

```go
func detectAudioMimeType(filename string) string {
	lowerFilename := strings.ToLower(filename)
	switch {
	case strings.HasSuffix(lowerFilename, ".wav"):
		return "audio/wav"
	// ...
```
Suggested change:

```diff
-ext := strings.ToLower(filename)
+lowerFilename := strings.ToLower(filename)
 switch {
-case strings.HasSuffix(ext, ".wav"):
+case strings.HasSuffix(lowerFilename, ".wav"):
 	return "audio/wav"
-case strings.HasSuffix(ext, ".mp3"):
+case strings.HasSuffix(lowerFilename, ".mp3"):
 	return "audio/mpeg"
-case strings.HasSuffix(ext, ".m4a"):
+case strings.HasSuffix(lowerFilename, ".m4a"):
 	return "audio/mp4"
-case strings.HasSuffix(ext, ".ogg"):
+case strings.HasSuffix(lowerFilename, ".ogg"):
 	return "audio/ogg"
-case strings.HasSuffix(ext, ".flac"):
+case strings.HasSuffix(lowerFilename, ".flac"):
 	return "audio/flac"
-case strings.HasSuffix(ext, ".webm"):
+case strings.HasSuffix(lowerFilename, ".webm"):
 	return "audio/webm"
-case strings.HasSuffix(ext, ".aac"):
+case strings.HasSuffix(lowerFilename, ".aac"):
```

Comment on lines +213 to +214
"streaming", a.stream,
"body", string(geminiReqBody))

Copilot AI Dec 1, 2025

The log statement on line 214 logs the full `geminiReqBody`, which contains the base64-encoded audio data. This could result in logging large amounts of data and potential performance issues.

The debug logging on lines 193-208 already creates a sanitized version with an `<AUDIO_DATA_N_BYTES>` placeholder, but line 214 then logs the actual body anyway. Consider removing `"body", string(geminiReqBody)` from the log statement, or only including it at a more verbose level (e.g., `slog.Debug`) if needed for debugging.

Suggested change:

```diff
-"streaming", a.stream,
-"body", string(geminiReqBody))
+"streaming", a.stream)
```

```go
var geminiResp genai.GenerateContentResponse
if unmarshalErr := json.Unmarshal(responseBody, &geminiResp); unmarshalErr != nil {
	return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", err)
```

Copilot AI Dec 1, 2025

The error wrapping is incorrect here: the code checks `unmarshalErr` but wraps the wrong variable, `err`. It should be:

```go
return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", unmarshalErr)
```

Suggested change:

```diff
-return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", err)
+return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", unmarshalErr)
```

Comment on lines +85 to +86
```go
// Check if streaming should be enabled based on response_format
// For now, we'll default to streaming to support both modes
```

Copilot AI Dec 1, 2025

The comment suggests checking `response_format` to determine streaming mode, but the code unconditionally sets `a.stream = true`. Additionally, the PR description mentions a `stream` boolean parameter in the request, but the `AudioTranscriptionRequest` struct doesn't include this field.

Either:

  1. Add a `Stream` field to the `AudioTranscriptionRequest` struct and use it here (see the sketch below)
  2. Update the comment and PR description to clarify that streaming is always used for the GCP Vertex AI backend
Suggested change:

```diff
-// Check if streaming should be enabled based on response_format
-// For now, we'll default to streaming to support both modes
+// Streaming is always enabled for GCP Vertex AI backend.
+// The response_format and stream fields are not checked here.
```
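
A minimal sketch of option 1, assuming the schema lives in `internal/apischema/openai/audio.go`; the surrounding fields are inferred from the processor code quoted below, while the JSON tags and the `Stream` field itself are assumptions:

```go
// AudioTranscriptionRequest with a hypothetical Stream field mirroring
// OpenAI's stream parameter. The tags and exact field set are assumptions.
type AudioTranscriptionRequest struct {
	Model          string   `json:"model"`
	Language       string   `json:"language,omitempty"`
	Prompt         string   `json:"prompt,omitempty"`
	ResponseFormat string   `json:"response_format,omitempty"`
	Temperature    *float64 `json:"temperature,omitempty"`
	Stream         bool     `json:"stream,omitempty"` // hypothetical addition
}
```

The translator could then set `a.stream` from the parsed request instead of hardcoding `true`.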

Comment on lines +360 to +407
```go
part, err := reader.NextPart()
if err == io.EOF {
	break
}
if err != nil {
	return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}

formName := part.FormName()
switch formName {
case "model":
	modelBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read model field: %w", err)
	}
	req.Model = string(modelBytes)
case "language":
	langBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read language field: %w", err)
	}
	req.Language = string(langBytes)
case "prompt":
	promptBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read prompt field: %w", err)
	}
	req.Prompt = string(promptBytes)
case "response_format":
	formatBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read response_format field: %w", err)
	}
	req.ResponseFormat = string(formatBytes)
case "temperature":
	tempBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read temperature field: %w", err)
	}
	var temp float64
	if err := json.Unmarshal(tempBytes, &temp); err == nil {
		req.Temperature = &temp
	}
case "file":
	// Just skip the file part - the translator will extract it from rawBody
default:
}
part.Close()
```
part.Close()

Copilot AI Dec 1, 2025

The multipart part should be closed even if an error occurs during reading. Consider using `defer part.Close()` right after a successful `reader.NextPart()` to ensure proper cleanup:

```go
part, err := reader.NextPart()
if err == io.EOF {
	break
}
if err != nil {
	return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}
defer part.Close()
```

This ensures the part is closed even if the switch cases return early with an error.
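
Note that a `defer` inside a loop only runs when the enclosing function returns, so for bodies with many parts a per-iteration closure closes each part promptly. A hedged sketch of that variant, reusing the loop quoted above:

```go
for {
	part, err := reader.NextPart()
	if err == io.EOF {
		break
	}
	if err != nil {
		return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
	}
	// Handle each part in a closure so Close runs on every path, including
	// early error returns, without stacking defers across iterations.
	if err := func() error {
		defer part.Close()
		// ... existing switch on part.FormName() ...
		return nil
	}(); err != nil {
		return "", nil, err
	}
}
```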

Comment on lines +47 to +48
```go
assert.Equal(t, before, am.requestStart)
assert.Equal(t, after, am.requestStart)
```

Copilot AI Dec 1, 2025

This test assertion is incorrect. It asserts that both `before` and `after` equal `am.requestStart`, which can only hold if no time elapses between the two captures. The assertions should instead verify that `requestStart` falls within the time window:

```go
assert.GreaterOrEqual(t, am.requestStart, before)
assert.LessOrEqual(t, am.requestStart, after)
```

Or equivalently:

```go
assert.True(t, am.requestStart.After(before) || am.requestStart.Equal(before))
assert.True(t, am.requestStart.Before(after) || am.requestStart.Equal(after))
```

Suggested change:

```diff
-assert.Equal(t, before, am.requestStart)
-assert.Equal(t, after, am.requestStart)
+assert.GreaterOrEqual(t, am.requestStart, before)
+assert.LessOrEqual(t, am.requestStart, after)
```

```go
	return nil, nil, fmt.Errorf("error reading multipart: %w", partErr)
}
if part.FormName() == "file" {
	audioData, _ = io.ReadAll(part)
```

Copilot AI Dec 1, 2025

The error from `io.ReadAll(part)` is silently ignored. If reading the audio data fails, this could lead to empty or incomplete audio data being sent to the backend. Consider handling this error (using `=` rather than `:=` so the outer `audioData` is not shadowed):

```go
audioData, err = io.ReadAll(part)
if err != nil {
	return nil, nil, fmt.Errorf("error reading audio file: %w", err)
}
```

Suggested change:

```diff
-audioData, _ = io.ReadAll(part)
+audioData, err = io.ReadAll(part)
+if err != nil {
+	return nil, nil, fmt.Errorf("error reading audio file: %w", err)
+}
```

Comment on lines +38 to +42
```go
func (a *audioTranscriptionOpenAIToOpenAITranslator) RequestBody(rawBody []byte, _ *openai.AudioTranscriptionRequest, _ bool) (*extprocv3.HeaderMutation, *extprocv3.BodyMutation, error) {
	return nil, &extprocv3.BodyMutation{
		Mutation: &extprocv3.BodyMutation_Body{Body: rawBody},
	}, nil
}
```

Copilot AI Dec 1, 2025

The `modelNameOverride` field is stored but never used. Other OpenAI-to-OpenAI translators (e.g., chat completion, embeddings, completions) apply the model override by modifying the request body with `sjson.SetBytesOptions`.

For audio transcription, since the request can be multipart/form-data, implementing the model override is more complex and would require parsing and reconstructing the multipart body. Consider either:

  1. Implementing the model override for both JSON and multipart formats (a sketch of the JSON case follows)
  2. Documenting that model override is not supported for audio transcription endpoints with OpenAI backends
  3. Removing the unused `modelNameOverride` field if it's not meant to be supported
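
For the JSON-body case, a hedged sketch of option 1 that extends the quoted passthrough `RequestBody` (multipart bodies would still need to be parsed and rebuilt, which is not covered here; `modelNameOverride` is assumed to be string-valued):

```go
func (a *audioTranscriptionOpenAIToOpenAITranslator) RequestBody(rawBody []byte, _ *openai.AudioTranscriptionRequest, _ bool) (*extprocv3.HeaderMutation, *extprocv3.BodyMutation, error) {
	if a.modelNameOverride != "" {
		var err error
		// Rewrite the "model" key in the JSON body, as other translators do.
		rawBody, err = sjson.SetBytes(rawBody, "model", a.modelNameOverride)
		if err != nil {
			return nil, nil, fmt.Errorf("failed to override model: %w", err)
		}
	}
	return nil, &extprocv3.BodyMutation{
		Mutation: &extprocv3.BodyMutation_Body{Body: rawBody},
	}, nil
}
```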

nacx added a commit that referenced this pull request Dec 3, 2025
**Description**
This consolidates all the per-endpoint copy-pasted processors into one generic processor. It was made possible by the series of refactorings we landed over the past few weeks, primarily for the dynamic module work in #90.

Notably, adding an endpoint now mostly means writing a translator (for which we also have a shared generic interface) plus the type definitions; it no longer requires copy-pasting a huge processor.

**Related Issues/PRs (if applicable)**

Resolves #1083 
Blocker for #1429 #1584 #1592 #1594

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Co-authored-by: Ignasi Barrera <[email protected]>
@mathetake
Member

Per #1584 (review), let's work on one feature at a time. Once the first endpoint implementation is done, including all the user-facing docs, we can come back here and apply the same approach with the reviewed context and style, rather than repeating the exact same back-and-forth in parallel across multiple PRs.

@mathetake mathetake closed this Dec 3, 2025