Conversation

@tanthcstt

Description

This PR adds support for the OpenAI-compatible `/v1/audio/transcriptions` speech-to-text endpoint, with request/response translation for GCP Vertex AI backends.

Key features

  • Streaming (SSE) and non-streaming response handling
  • SSE streaming chunk parsing and accumulation of transcription text
  • Token usage tracking (input/output/total) and metrics emission (see the mapping sketch below)
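
For illustration, a hedged sketch of how Gemini usage metadata could map onto the gateway's token accounting. The helper and struct here are hypothetical stand-ins, not the PR's actual `metrics.TokenUsage` definition:

```go
package translator

import "google.golang.org/genai"

// tokenUsage is a hypothetical local stand-in for the gateway's
// metrics.TokenUsage; its field names are assumptions for illustration.
type tokenUsage struct {
	InputTokens, OutputTokens, TotalTokens uint32
}

// extractTokenUsage maps Gemini usage metadata onto the stand-in struct.
func extractTokenUsage(resp *genai.GenerateContentResponse) tokenUsage {
	var u tokenUsage
	if md := resp.UsageMetadata; md != nil {
		u.InputTokens = uint32(md.PromptTokenCount)      // tokens in the audio prompt
		u.OutputTokens = uint32(md.CandidatesTokenCount) // transcription output tokens
		u.TotalTokens = uint32(md.TotalTokenCount)
	}
	return u
}
```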

Additional notes: clients may request server-sent-event (SSE) streaming by passing an explicit `stream` boolean parameter (see the table below). When `stream=true`, the gateway selects the backend's streaming endpoint and translates its SSE chunks into OpenAI-compatible streaming responses.

Supported request parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `model` | string | Model identifier (e.g., `whisper-1`) |
| `language` | string | Language of the audio (optional) |
| `prompt` | string | Custom transcription prompt (optional) |
| `response_format` | string | Output format (optional) |
| `temperature` | float | Sampling temperature (optional) |
| `stream` | bool | Request streaming (SSE) responses when true (optional) |
| `timestamp_granularities` | []string | Timestamp detail level (optional) |
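
For illustration, a minimal Go client sketch exercising these parameters (the gateway address `http://localhost:1975` is a placeholder, not something this PR defines):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Build a multipart/form-data body with the audio file and parameters.
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	fw, err := w.CreateFormFile("file", "sample.wav")
	if err != nil {
		panic(err)
	}
	f, err := os.Open("sample.wav")
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(fw, f); err != nil {
		panic(err)
	}
	f.Close()
	w.WriteField("model", "whisper-1")
	w.WriteField("stream", "true") // opt in to SSE streaming
	w.Close()

	resp, err := http.Post("http://localhost:1975/v1/audio/transcriptions",
		w.FormDataContentType(), &buf)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // SSE chunks when stream=true, JSON otherwise
}
```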

Backend support

  • OpenAI: a passthrough translator that forwards request bodies, with an optional model override.
  • GCP Vertex AI / Gemini: a translator that builds Gemini requests, handles streaming and non-streaming responses, and converts them back to OpenAI-compatible responses (a non-streaming sketch follows).
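
As a rough illustration of the non-streaming conversion path, a hedged sketch (not the PR's actual code; `geminiToOpenAITranscription` is a hypothetical helper, and OpenAI's default `json` response format is simply `{"text": "..."}`):

```go
package translator

import (
	"encoding/json"
	"strings"

	"google.golang.org/genai"
)

// geminiToOpenAITranscription collapses Gemini candidates into OpenAI's
// default json response shape {"text": "..."}. Hypothetical helper.
func geminiToOpenAITranscription(resp *genai.GenerateContentResponse) ([]byte, error) {
	var text strings.Builder
	for _, cand := range resp.Candidates {
		if cand.Content == nil {
			continue
		}
		for _, part := range cand.Content.Parts {
			text.WriteString(part.Text) // Gemini returns the transcription as text parts
		}
	}
	return json.Marshal(map[string]string{"text": text.String()})
}
```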

Related issues / PRs

…ptions

**Description**
Add audio transcription support for OpenAI and GCP Vertex AI.

Signed-off-by: tanthcstt <[email protected]>
@tanthcstt tanthcstt marked this pull request as ready for review December 1, 2025 02:27
@tanthcstt tanthcstt requested a review from a team as a code owner December 1, 2025 02:27
Copilot AI review requested due to automatic review settings December 1, 2025 02:27
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Dec 1, 2025
Copilot finished reviewing on behalf of tanthcstt December 1, 2025 02:31
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for the OpenAI-compatible /v1/audio/transcriptions endpoint, enabling speech-to-text functionality with translation support for GCP Vertex AI backends. The implementation follows the existing patterns in the codebase for handling multi-backend API translation with comprehensive metrics and testing.

Key changes:

  • Implements the audio transcription endpoint with multipart/form-data and JSON request support
  • Adds OpenAI-to-GCP Vertex AI translator with streaming SSE response handling
  • Introduces token usage tracking and GenAI operation metrics for audio transcription
  • Provides comprehensive test coverage for translators, processors, and metrics

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `internal/translator/audio_transcription_openai_gcpvertexai.go` | Core translator converting OpenAI audio transcription requests to the GCP Vertex AI Gemini API format, with SSE streaming support |
| `internal/translator/audio_transcription_openai_gcpvertexai_test.go` | Comprehensive test coverage for the GCP Vertex AI translator, including multipart parsing, streaming, and token usage |
| `internal/translator/audio_openai_openai.go` | Passthrough translator for OpenAI backends (minimal transformation) |
| `internal/translator/audio_openai_openai_test.go` | Tests for the OpenAI passthrough translator |
| `internal/extproc/audiotranscription_processor.go` | Request processor implementing the router and upstream filter logic for the audio transcription endpoint |
| `internal/extproc/audiotranscription_processor_test.go` | Extensive processor tests covering multipart parsing, backend selection, and error scenarios |
| `internal/apischema/openai/audio.go` | OpenAI-compatible schema definitions for audio transcription requests and responses |
| `internal/metrics/audio_transcription_metrics.go` | Metrics factory setup for audio transcription operations |
| `internal/metrics/audio_transcription_metrics_test.go` | Metrics tests with token usage tracking and multiple backend scenarios |
| `internal/metrics/genai.go` | Adds the GenAIOperationAudioTranscription operation constant to the existing metrics |
| `cmd/extproc/mainlib/main.go` | Registers the audio transcription endpoint and metrics factory in the main server |

Comment on lines +32 to +46
```go
ext := strings.ToLower(filename)
switch {
case strings.HasSuffix(ext, ".wav"):
	return "audio/wav"
case strings.HasSuffix(ext, ".mp3"):
	return "audio/mpeg"
case strings.HasSuffix(ext, ".m4a"):
	return "audio/mp4"
case strings.HasSuffix(ext, ".ogg"):
	return "audio/ogg"
case strings.HasSuffix(ext, ".flac"):
	return "audio/flac"
case strings.HasSuffix(ext, ".webm"):
	return "audio/webm"
case strings.HasSuffix(ext, ".aac"):
```

Copilot AI Dec 1, 2025

The variable name `ext` is misleading since it contains the entire filename (converted to lowercase), not just the extension. Consider renaming it to `lowerFilename` or `filename` for clarity:

```go
func detectAudioMimeType(filename string) string {
	lowerFilename := strings.ToLower(filename)
	switch {
	case strings.HasSuffix(lowerFilename, ".wav"):
		return "audio/wav"
	// ...
```
Suggested change:

```diff
-ext := strings.ToLower(filename)
+lowerFilename := strings.ToLower(filename)
 switch {
-case strings.HasSuffix(ext, ".wav"):
+case strings.HasSuffix(lowerFilename, ".wav"):
 	return "audio/wav"
-case strings.HasSuffix(ext, ".mp3"):
+case strings.HasSuffix(lowerFilename, ".mp3"):
 	return "audio/mpeg"
-case strings.HasSuffix(ext, ".m4a"):
+case strings.HasSuffix(lowerFilename, ".m4a"):
 	return "audio/mp4"
-case strings.HasSuffix(ext, ".ogg"):
+case strings.HasSuffix(lowerFilename, ".ogg"):
 	return "audio/ogg"
-case strings.HasSuffix(ext, ".flac"):
+case strings.HasSuffix(lowerFilename, ".flac"):
 	return "audio/flac"
-case strings.HasSuffix(ext, ".webm"):
+case strings.HasSuffix(lowerFilename, ".webm"):
 	return "audio/webm"
-case strings.HasSuffix(ext, ".aac"):
+case strings.HasSuffix(lowerFilename, ".aac"):
```

Comment on lines +213 to +214
"streaming", a.stream,
"body", string(geminiReqBody))

Copilot AI Dec 1, 2025

The log statement on line 214 logs the full `geminiReqBody`, which contains the base64-encoded audio data. This could result in logging large amounts of data and potential performance issues.

The debug logging on lines 193-208 already creates a sanitized version with an `<AUDIO_DATA_N_BYTES>` placeholder, but line 214 then logs the actual body anyway. Consider removing `"body", string(geminiReqBody)` from the log statement, or only including it at a more verbose level (e.g., `slog.Debug`) if needed for debugging.

Suggested change:

```diff
-"streaming", a.stream,
-"body", string(geminiReqBody))
+"streaming", a.stream)
```

```go
var geminiResp genai.GenerateContentResponse
if unmarshalErr := json.Unmarshal(responseBody, &geminiResp); unmarshalErr != nil {
	return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", err)
```

Copilot AI Dec 1, 2025

The error wrapping is incorrect here: the code checks `unmarshalErr` but wraps the wrong variable, `err`. It should be:

```go
return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", unmarshalErr)
```

Suggested change:

```diff
-return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", err)
+return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", unmarshalErr)
```

Comment on lines +85 to +86
```go
// Check if streaming should be enabled based on response_format
// For now, we'll default to streaming to support both modes
```

Copilot AI Dec 1, 2025

The comment suggests checking `response_format` to determine streaming mode, but the code unconditionally sets `a.stream = true`. Additionally, the PR description mentions a `stream` boolean parameter in the request, but the `AudioTranscriptionRequest` struct doesn't include this field.

Either:

  1. Add a `Stream` field to the `AudioTranscriptionRequest` struct and use it here (see the sketch below)
  2. Update the comment and PR description to clarify that streaming is always used for the GCP Vertex AI backend
Suggested change:

```diff
-// Check if streaming should be enabled based on response_format
-// For now, we'll default to streaming to support both modes
+// Streaming is always enabled for GCP Vertex AI backend.
+// The response_format and stream fields are not checked here.
```
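
A minimal sketch of option 1, assuming the schema lives in `internal/apischema/openai/audio.go`; the surrounding fields are inferred from the processor code quoted below, while the JSON tags and the `Stream` field itself are assumptions:

```go
// AudioTranscriptionRequest with a hypothetical Stream field mirroring
// OpenAI's stream parameter. The tags and exact field set are assumptions.
type AudioTranscriptionRequest struct {
	Model          string   `json:"model"`
	Language       string   `json:"language,omitempty"`
	Prompt         string   `json:"prompt,omitempty"`
	ResponseFormat string   `json:"response_format,omitempty"`
	Temperature    *float64 `json:"temperature,omitempty"`
	Stream         bool     `json:"stream,omitempty"` // hypothetical addition
}
```

The translator could then set `a.stream` from the parsed request instead of hardcoding `true`.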

Comment on lines +360 to +407
```go
part, err := reader.NextPart()
if err == io.EOF {
	break
}
if err != nil {
	return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}

formName := part.FormName()
switch formName {
case "model":
	modelBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read model field: %w", err)
	}
	req.Model = string(modelBytes)
case "language":
	langBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read language field: %w", err)
	}
	req.Language = string(langBytes)
case "prompt":
	promptBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read prompt field: %w", err)
	}
	req.Prompt = string(promptBytes)
case "response_format":
	formatBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read response_format field: %w", err)
	}
	req.ResponseFormat = string(formatBytes)
case "temperature":
	tempBytes, err := io.ReadAll(part)
	if err != nil {
		return "", nil, fmt.Errorf("failed to read temperature field: %w", err)
	}
	var temp float64
	if err := json.Unmarshal(tempBytes, &temp); err == nil {
		req.Temperature = &temp
	}
case "file":
	// Just skip the file part - the translator will extract it from rawBody
default:
}
part.Close()
```
part.Close()

Copilot AI Dec 1, 2025

The multipart part should be closed even if an error occurs during reading. Consider using `defer part.Close()` right after a successful `reader.NextPart()` to ensure proper cleanup:

```go
part, err := reader.NextPart()
if err == io.EOF {
	break
}
if err != nil {
	return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}
defer part.Close()
```

This ensures the part is closed even if the switch cases return early with an error.
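
Note that a `defer` inside a loop only runs when the enclosing function returns, so for bodies with many parts a per-iteration closure closes each part promptly. A hedged sketch of that variant, reusing the loop quoted above:

```go
for {
	part, err := reader.NextPart()
	if err == io.EOF {
		break
	}
	if err != nil {
		return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
	}
	// Handle each part in a closure so Close runs on every path, including
	// early error returns, without stacking defers across iterations.
	if err := func() error {
		defer part.Close()
		// ... existing switch on part.FormName() ...
		return nil
	}(); err != nil {
		return "", nil, err
	}
}
```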

Comment on lines +47 to +48
```go
assert.Equal(t, before, am.requestStart)
assert.Equal(t, after, am.requestStart)
```

Copilot AI Dec 1, 2025

This test assertion is incorrect. It asserts that both `before` and `after` equal `am.requestStart`, which can only hold if no time elapses between the two captures. The assertions should instead verify that `requestStart` falls within the time window:

```go
assert.GreaterOrEqual(t, am.requestStart, before)
assert.LessOrEqual(t, am.requestStart, after)
```

Or equivalently:

```go
assert.True(t, am.requestStart.After(before) || am.requestStart.Equal(before))
assert.True(t, am.requestStart.Before(after) || am.requestStart.Equal(after))
```

Suggested change:

```diff
-assert.Equal(t, before, am.requestStart)
-assert.Equal(t, after, am.requestStart)
+assert.GreaterOrEqual(t, am.requestStart, before)
+assert.LessOrEqual(t, am.requestStart, after)
```

```go
	return nil, nil, fmt.Errorf("error reading multipart: %w", partErr)
}
if part.FormName() == "file" {
	audioData, _ = io.ReadAll(part)
```

Copilot AI Dec 1, 2025

The error from `io.ReadAll(part)` is silently ignored. If reading the audio data fails, this could lead to empty or incomplete audio data being sent to the backend. Consider handling this error (using `=` rather than `:=` so the outer `audioData` is not shadowed):

```go
audioData, err = io.ReadAll(part)
if err != nil {
	return nil, nil, fmt.Errorf("error reading audio file: %w", err)
}
```

Suggested change:

```diff
-audioData, _ = io.ReadAll(part)
+audioData, err = io.ReadAll(part)
+if err != nil {
+	return nil, nil, fmt.Errorf("error reading audio file: %w", err)
+}
```

Comment on lines +38 to +42
```go
func (a *audioTranscriptionOpenAIToOpenAITranslator) RequestBody(rawBody []byte, _ *openai.AudioTranscriptionRequest, _ bool) (*extprocv3.HeaderMutation, *extprocv3.BodyMutation, error) {
	return nil, &extprocv3.BodyMutation{
		Mutation: &extprocv3.BodyMutation_Body{Body: rawBody},
	}, nil
}
```

Copilot AI Dec 1, 2025

The `modelNameOverride` field is stored but never used. Other OpenAI-to-OpenAI translators (e.g., chat completion, embeddings, completions) apply the model override by modifying the request body with `sjson.SetBytesOptions`.

For audio transcription, since the request can be multipart/form-data, implementing the model override is more complex and would require parsing and reconstructing the multipart body. Consider either:

  1. Implementing the model override for both JSON and multipart formats (a sketch of the JSON case follows)
  2. Documenting that model override is not supported for audio transcription endpoints with OpenAI backends
  3. Removing the unused `modelNameOverride` field if it's not meant to be supported
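
For the JSON-body case, a hedged sketch of option 1 that extends the quoted passthrough `RequestBody` (multipart bodies would still need to be parsed and rebuilt, which is not covered here; `modelNameOverride` is assumed to be string-valued):

```go
func (a *audioTranscriptionOpenAIToOpenAITranslator) RequestBody(rawBody []byte, _ *openai.AudioTranscriptionRequest, _ bool) (*extprocv3.HeaderMutation, *extprocv3.BodyMutation, error) {
	if a.modelNameOverride != "" {
		var err error
		// Rewrite the "model" key in the JSON body, as other translators do.
		rawBody, err = sjson.SetBytes(rawBody, "model", a.modelNameOverride)
		if err != nil {
			return nil, nil, fmt.Errorf("failed to override model: %w", err)
		}
	}
	return nil, &extprocv3.BodyMutation{
		Mutation: &extprocv3.BodyMutation_Body{Body: rawBody},
	}, nil
}
```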

nacx added a commit that referenced this pull request Dec 3, 2025
**Description**
This consolidates all the per-endpoint copy-pasted processors into one generic processor. It was made possible by the series of refactorings we landed over the past few weeks, primarily for the dynamic module work in #90.

Notably, adding an endpoint now mostly means writing a translator (for which we also have a shared generic interface) plus the type definitions; it no longer requires copy-pasting a huge processor.

**Related Issues/PRs (if applicable)**

Resolves #1083 
Blocker for #1429 #1584 #1592 #1594

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Co-authored-by: Ignasi Barrera <[email protected]>
@mathetake
Member

Per #1584 (review), let's work on one feature at a time. Once the first endpoint implementation is done, including all the user-facing docs, we can come back here and apply the same approach with the reviewed context and style, rather than repeating the exact same back-and-forth in parallel across multiple PRs.

@mathetake mathetake closed this Dec 3, 2025