feat: add support for audio transcription endpoint /v1/audio/transcriptions #1592
Conversation
…ptions

**Description**

Add audio transcription support for OpenAI and GCP Vertex AI.

Signed-off-by: tanthcstt <[email protected]>
Pull request overview
This PR adds support for the OpenAI-compatible /v1/audio/transcriptions endpoint, enabling speech-to-text functionality with translation support for GCP Vertex AI backends. The implementation follows the existing patterns in the codebase for handling multi-backend API translation with comprehensive metrics and testing.
Key changes:
- Implements audio transcription endpoint with multipart/form-data and JSON request support (see the client-side request sketch after this list)
- Adds OpenAI-to-GCP Vertex AI translator with streaming SSE response handling
- Introduces token usage tracking and GenAI operation metrics for audio transcription
- Provides comprehensive test coverage for translators, processors, and metrics
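For orientation, here is a minimal client-side sketch of the kind of multipart request the new endpoint accepts. This is an illustration, not code from the PR: the gateway address and audio file are placeholders, and `whisper-1` is the example model name from the PR description.

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)

func main() {
    // Build a multipart/form-data body with the model field and the audio file.
    var body bytes.Buffer
    w := multipart.NewWriter(&body)
    _ = w.WriteField("model", "whisper-1")
    _ = w.WriteField("response_format", "json")

    part, _ := w.CreateFormFile("file", "sample.wav")
    f, err := os.Open("sample.wav") // placeholder audio file
    if err != nil {
        panic(err)
    }
    defer f.Close()
    _, _ = io.Copy(part, f)
    _ = w.Close()

    // POST to the gateway's OpenAI-compatible transcription endpoint (placeholder address).
    req, _ := http.NewRequest(http.MethodPost, "http://localhost:8080/v1/audio/transcriptions", &body)
    req.Header.Set("Content-Type", w.FormDataContentType())

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}
```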
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| internal/translator/audio_transcription_openai_gcpvertexai.go | Core translator converting OpenAI audio transcription requests to GCP Vertex AI Gemini API format with SSE streaming support |
| internal/translator/audio_transcription_openai_gcpvertexai_test.go | Comprehensive test coverage for GCP Vertex AI translator including multipart parsing, streaming, and token usage |
| internal/translator/audio_openai_openai.go | Passthrough translator for OpenAI backends (minimal transformation) |
| internal/translator/audio_openai_openai_test.go | Tests for OpenAI passthrough translator |
| internal/extproc/audiotranscription_processor.go | Request processor implementing router and upstream filter logic for audio transcription endpoint |
| internal/extproc/audiotranscription_processor_test.go | Extensive tests for processor covering multipart parsing, backend selection, and error scenarios |
| internal/apischema/openai/audio.go | OpenAI-compatible schema definitions for audio transcription requests and responses |
| internal/metrics/audio_transcription_metrics.go | Metrics factory setup for audio transcription operations |
| internal/metrics/audio_transcription_metrics_test.go | Metrics testing with token usage tracking and multiple backend scenarios |
| internal/metrics/genai.go | Adds GenAIOperationAudioTranscription operation constant to existing metrics |
| cmd/extproc/mainlib/main.go | Registers audio transcription endpoint and metrics factory in the main server |
```go
ext := strings.ToLower(filename)
switch {
case strings.HasSuffix(ext, ".wav"):
    return "audio/wav"
case strings.HasSuffix(ext, ".mp3"):
    return "audio/mpeg"
case strings.HasSuffix(ext, ".m4a"):
    return "audio/mp4"
case strings.HasSuffix(ext, ".ogg"):
    return "audio/ogg"
case strings.HasSuffix(ext, ".flac"):
    return "audio/flac"
case strings.HasSuffix(ext, ".webm"):
    return "audio/webm"
case strings.HasSuffix(ext, ".aac"):
```
**Copilot AI** commented on Dec 1, 2025
The variable name `ext` is misleading since it contains the entire filename (converted to lowercase), not just the extension. Consider renaming it to `lowerFilename` or `filename` for clarity:
```go
func detectAudioMimeType(filename string) string {
    lowerFilename := strings.ToLower(filename)
    switch {
    case strings.HasSuffix(lowerFilename, ".wav"):
        return "audio/wav"
    // ...
```

Suggested change:

```go
lowerFilename := strings.ToLower(filename)
switch {
case strings.HasSuffix(lowerFilename, ".wav"):
    return "audio/wav"
case strings.HasSuffix(lowerFilename, ".mp3"):
    return "audio/mpeg"
case strings.HasSuffix(lowerFilename, ".m4a"):
    return "audio/mp4"
case strings.HasSuffix(lowerFilename, ".ogg"):
    return "audio/ogg"
case strings.HasSuffix(lowerFilename, ".flac"):
    return "audio/flac"
case strings.HasSuffix(lowerFilename, ".webm"):
    return "audio/webm"
case strings.HasSuffix(lowerFilename, ".aac"):
```
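As a quick, hypothetical illustration of the renamed helper (these call sites are not from the PR; they just show that matching is case-insensitive because the filename is lowercased first):

```go
fmt.Println(detectAudioMimeType("Meeting.WAV")) // audio/wav
fmt.Println(detectAudioMimeType("podcast.mp3")) // audio/mpeg
fmt.Println(detectAudioMimeType("note.M4A"))    // audio/mp4
```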
| "streaming", a.stream, | ||
| "body", string(geminiReqBody)) |
**Copilot AI** commented on Dec 1, 2025
The log statement on line 214 logs the full `geminiReqBody`, which contains the base64-encoded audio data. This could result in logging large amounts of data and cause performance issues.

The debug logging on lines 193-208 already creates a sanitized version with an `<AUDIO_DATA_N_BYTES>` placeholder, but line 214 then logs the actual body anyway. Consider removing `"body", string(geminiReqBody)` from the log statement, or only including it at debug level (e.g., `slog.Debug`) if needed for debugging.
| "streaming", a.stream, | |
| "body", string(geminiReqBody)) | |
| "streaming", a.stream) |
```go
var geminiResp genai.GenerateContentResponse
if unmarshalErr := json.Unmarshal(responseBody, &geminiResp); unmarshalErr != nil {
    return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", err)
```
**Copilot AI** commented on Dec 1, 2025
The error wrapping is incorrect here: the code checks `unmarshalErr` but wraps the wrong error variable, `err`. This should be:

```go
return nil, nil, metrics.TokenUsage{}, "", fmt.Errorf("error unmarshaling Gemini response: %w", unmarshalErr)
```
```go
// Check if streaming should be enabled based on response_format
// For now, we'll default to streaming to support both modes
```
**Copilot AI** commented on Dec 1, 2025
The comment suggests checking `response_format` to determine streaming mode, but the code unconditionally sets `a.stream = true`. Additionally, the PR description mentions a `stream` boolean parameter in the request, but the `AudioTranscriptionRequest` struct doesn't include this field.

Either:
- Add a `Stream` field to the `AudioTranscriptionRequest` struct and use it here (see the sketch below)
- Update the comment and PR description to clarify that streaming is always used for the GCP Vertex AI backend

Suggested change:

```go
// Streaming is always enabled for GCP Vertex AI backend.
// The response_format and stream fields are not checked here.
```
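If the first option were taken, a minimal sketch could look like the following. The field name, struct tag, and usage are assumptions for illustration, not the PR's actual definitions.

```go
// Hypothetical sketch of option 1; not the PR's actual schema.
type AudioTranscriptionRequest struct {
    // ... existing fields (Model, Language, Prompt, ResponseFormat, Temperature, ...)

    // Stream lets clients explicitly opt in to SSE streaming.
    Stream bool `json:"stream,omitempty"`
}

// The GCP Vertex AI translator could then derive streaming from the request
// instead of unconditionally enabling it:
//
//    a.stream = req.Stream
```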
```go
part, err := reader.NextPart()
if err == io.EOF {
    break
}
if err != nil {
    return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}

formName := part.FormName()
switch formName {
case "model":
    modelBytes, err := io.ReadAll(part)
    if err != nil {
        return "", nil, fmt.Errorf("failed to read model field: %w", err)
    }
    req.Model = string(modelBytes)
case "language":
    langBytes, err := io.ReadAll(part)
    if err != nil {
        return "", nil, fmt.Errorf("failed to read language field: %w", err)
    }
    req.Language = string(langBytes)
case "prompt":
    promptBytes, err := io.ReadAll(part)
    if err != nil {
        return "", nil, fmt.Errorf("failed to read prompt field: %w", err)
    }
    req.Prompt = string(promptBytes)
case "response_format":
    formatBytes, err := io.ReadAll(part)
    if err != nil {
        return "", nil, fmt.Errorf("failed to read response_format field: %w", err)
    }
    req.ResponseFormat = string(formatBytes)
case "temperature":
    tempBytes, err := io.ReadAll(part)
    if err != nil {
        return "", nil, fmt.Errorf("failed to read temperature field: %w", err)
    }
    var temp float64
    if err := json.Unmarshal(tempBytes, &temp); err == nil {
        req.Temperature = &temp
    }
case "file":
    // Just skip the file part - the translator will extract it from rawBody
default:
}
part.Close()
```
**Copilot AI** commented on Dec 1, 2025
The multipart part should be closed even if an error occurs during reading. Consider using `defer part.Close()` right after a successful `reader.NextPart()` to ensure proper cleanup:

```go
part, err := reader.NextPart()
if err == io.EOF {
    break
}
if err != nil {
    return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
}
defer part.Close()
```

This ensures the part is closed even if the switch cases return early with an error.
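One caveat with placing `defer` directly in the parsing loop is that deferred calls only run when the enclosing function returns, so every part would stay open until the whole body has been parsed. A hedged variant that still guarantees per-part cleanup wraps each iteration's work in a small closure; the `handlePart` helper here is hypothetical, not code from the PR.

```go
for {
    part, err := reader.NextPart()
    if err == io.EOF {
        break
    }
    if err != nil {
        return "", nil, fmt.Errorf("failed to read multipart part: %w", err)
    }
    // Closing inside a per-iteration closure releases each part as soon as it is
    // processed, even if handlePart returns early with an error.
    if err := func() error {
        defer part.Close()
        return handlePart(part, &req) // hypothetical helper that fills req from the form field
    }(); err != nil {
        return "", nil, err
    }
}
```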
```go
assert.Equal(t, before, am.requestStart)
assert.Equal(t, after, am.requestStart)
```
**Copilot AI** commented on Dec 1, 2025
This test assertion is incorrect. The test asserts that both `before` and `after` are equal to `am.requestStart`, which is logically impossible unless time doesn't advance. The correct assertions should verify that `requestStart` is within the time window:

```go
assert.GreaterOrEqual(t, am.requestStart, before)
assert.LessOrEqual(t, am.requestStart, after)
```

Or simply:

```go
assert.True(t, am.requestStart.After(before) || am.requestStart.Equal(before))
assert.True(t, am.requestStart.Before(after) || am.requestStart.Equal(after))
```
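If the repository's testify version is v1.8.0 or newer (an assumption), the same check can also be expressed in one line:

```go
// Asserts that requestStart falls within the inclusive [before, after] window.
assert.WithinRange(t, am.requestStart, before, after)
```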
```go
    return nil, nil, fmt.Errorf("error reading multipart: %w", partErr)
}
if part.FormName() == "file" {
    audioData, _ = io.ReadAll(part)
```
**Copilot AI** commented on Dec 1, 2025
The error from `io.ReadAll(part)` is silently ignored. If reading the audio data fails, this could lead to empty or incomplete audio data being sent to the backend. Consider handling this error:

```go
audioData, err = io.ReadAll(part)
if err != nil {
    return nil, nil, fmt.Errorf("error reading audio file: %w", err)
}
```
```go
func (a *audioTranscriptionOpenAIToOpenAITranslator) RequestBody(rawBody []byte, _ *openai.AudioTranscriptionRequest, _ bool) (*extprocv3.HeaderMutation, *extprocv3.BodyMutation, error) {
    return nil, &extprocv3.BodyMutation{
        Mutation: &extprocv3.BodyMutation_Body{Body: rawBody},
    }, nil
}
```
**Copilot AI** commented on Dec 1, 2025
The `modelNameOverride` field is stored but never used. Other OpenAI-to-OpenAI translators (e.g., chat completion, embeddings, completions) apply the model override by modifying the request body using `sjson.SetBytesOptions`.

For audio transcription, since the request can be multipart/form-data, implementing model override is more complex and would require parsing and reconstructing the multipart body. Consider either:
- Implementing model override for both JSON and multipart formats (a sketch of the JSON case follows below)
- Documenting that model override is not supported for audio transcription endpoints with OpenAI backends
- Removing the unused `modelNameOverride` field if it's not meant to be supported
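For the JSON-body case, a minimal sketch of applying the override with the same sjson approach the other translators use might look like this; the helper name and wiring are illustrative, not code from the PR.

```go
import "github.com/tidwall/sjson"

// applyModelOverride is a hypothetical helper. For a JSON request body it rewrites
// the "model" field to the configured override, mirroring how the chat completion
// and embeddings translators use sjson.SetBytesOptions.
func applyModelOverride(body []byte, modelNameOverride string) ([]byte, error) {
    if modelNameOverride == "" {
        return body, nil // nothing to override
    }
    return sjson.SetBytesOptions(body, "model", modelNameOverride,
        &sjson.Options{Optimistic: true, ReplaceInPlace: true})
}
```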
**Description**

This consolidates all the copy-pasted processors that existed per endpoint we support into one generic processor. This was made possible thanks to the series of refactorings that we landed in the past few weeks, primarily for the dynamic module work #90. Notably, to add an endpoint now, the majority of the new code goes into the translator (where we also have a shared generic interface) as well as the type definition. It no longer requires the huge copy-paste of processors.

**Related Issues/PRs (if applicable)**

Resolves #1083
Blocker for #1429 #1584 #1592 #1594

Signed-off-by: Takeshi Yoneda <[email protected]>
Co-authored-by: Ignasi Barrera <[email protected]>
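To make "the majority of the new code goes into the translator" concrete, here is a purely illustrative sketch of the shape such a per-endpoint translator could take; the interface name and signatures are guesses for illustration only and do not reflect the repository's real definitions (the actual `RequestBody` signature used by this PR is visible in the quoted diff above).

```go
// Illustrative only: not the project's actual interface. The idea is that adding an
// endpoint mostly means supplying request/response conversions like these, while a
// single generic processor handles routing, metrics, and filter wiring.
type endpointTranslator interface {
    // RequestBody converts an OpenAI-compatible request into the backend's wire format.
    RequestBody(rawBody []byte) (mutated []byte, err error)
    // ResponseBody converts the backend response back into the OpenAI-compatible format.
    ResponseBody(respBody []byte) (mutated []byte, err error)
}
```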
Per #1584 (review), let's work on a feature step by step. The first endpoint implementation is done, including all the user-facing docs; then let's come back here and do the same thing with the reviewed context/style, etc., rather than doing exactly the same back-and-forth in parallel across multiple PRs.
Description
This PR adds support for the OpenAI-compatible `/v1/audio/transcriptions` speech-to-text endpoint with translation support for GCP Vertex AI backends.

Key features

Additional notes: clients may request server-sent-event (SSE) style streaming by passing an explicit `stream` boolean parameter (see table below). When `stream=true`, the gateway will select the streaming endpoint and translate SSE chunks into the OpenAI-compatible streaming behavior.

Supported request parameters

| Parameter | Notes |
|---|---|
| `model` | e.g. `whisper-1` |
| `language` | |
| `prompt` | |
| `response_format` | |
| `temperature` | |
| `stream` | `true` (optional) |
| `timestamp_granularities[]` | `string` |

Backend support
Related issues / PRs
Special notes for reviewers (if applicable)