[draft] Responses API proxy server #1576
Draft
Wauplin wants to merge 12 commits into `main` from `responses-server`
Conversation
PR on top of #1576.

I'm starting to think that it makes sense to define Zod schemas for inputs, since we need to validate users' inputs, but that for outputs we "only" need static type checking and could therefore reuse the types defined in https://github.com/openai/openai-node.

**Benefits:** no need to redefine everything manually. It's easy to make mistakes (a parameter that shouldn't be nullable, one that could be an array, etc.) when translating from the specs to our codebase. If static typing doesn't complain, we can assume "it's good". Also, less code to maintain.

**Drawback:** less flexibility. We don't own the stack and things might get updated in the wild. That's less of a problem in this context since it's a server, not a client (and therefore we manage the dependency updates).

Overall I do think it's better to import from openai. Since we won't implement everything at first, it's fine to use `Omit<..., "key-that-we-dont-implement">`, which **explicitly** removes a feature (better than implicit non-definition).

**EDIT:** it's fine to use them for now, and if it's ever blocking in the future, then we redefine things ourselves.
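A minimal sketch of that split, assuming openai-node exposes the Responses types under `openai/resources/responses/responses` (the schema fields are illustrative, not the PR's actual code):

```ts
// Sketch only: own the input validation with Zod, reuse openai-node's types
// for outputs. Import path and field subset are assumptions.
import { z } from "zod";
import type { Response } from "openai/resources/responses/responses";

// Input schema we define and validate at the API boundary.
const CreateResponseParams = z.object({
  model: z.string(),
  input: z.string(),
  temperature: z.number().nullable().default(1),
});

// Output type we reuse, explicitly dropping a feature we don't implement yet.
type ServedResponse = Omit<Response, "usage">;
```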
Built on top of #1576.

This PR adds support for streaming mode to the Responses API. Tested it using the [openai-responses-starter-app](https://github.com/openai/openai-responses-starter-app):

[Screencast from 02-07-2025 07:43:52.webm](https://github.com/user-attachments/assets/6eb77c9c-5796-4841-af55-f526da8da847)

```bash
pnpm run example streaming
```

```js
{ type: 'response.created', response: { object: 'response', id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258', status: 'in_progress', error: null, instructions: null, model: 'Qwen/Qwen2.5-VL-7B-Instruct', temperature: 1, top_p: 1, created_at: 1751383702199, output: [] }, sequence_number: 0 }
{ type: 'response.in_progress', response: { object: 'response', id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258', status: 'in_progress', error: null, instructions: null, model: 'Qwen/Qwen2.5-VL-7B-Instruct', temperature: 1, top_p: 1, created_at: 1751383702199, output: [] }, sequence_number: 1 }
{ type: 'response.output_item.added', output_index: 0, item: { id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', type: 'message', role: 'assistant', status: 'in_progress', content: [] }, sequence_number: 2 }
{ type: 'response.content_part.added', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, part: { type: 'output_text', text: '', annotations: [] }, sequence_number: 3 }
{ type: 'response.output_text.delta', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, delta: 'Double', sequence_number: 4 }
{ type: 'response.output_text.delta', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, delta: ' bubble', sequence_number: 5 }
...
{ type: 'response.output_text.delta', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, delta: '!', sequence_number: 43 }
{ type: 'response.output_text.done', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!', sequence_number: 44 }
{ type: 'response.content_part.done', item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', output_index: 0, content_index: 0, part: { type: 'output_text', text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!', annotations: [] }, sequence_number: 45 }
{ type: 'response.output_item.done', output_index: 0, item: { id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b', type: 'message', role: 'assistant', status: 'completed', content: [ [Object] ] }, sequence_number: 46 }
{ type: 'response.completed', response: { object: 'response', id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258', status: 'completed', error: null, instructions: null, model: 'Qwen/Qwen2.5-VL-7B-Instruct', temperature: 1, top_p: 1, created_at: 1751383702199, output: [ [Object] ] }, sequence_number: 47 }
```
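For reference, a minimal client-side call that would produce an event stream like the one above; the base URL, port, token variable, and prompt are assumptions, not taken from the PR:

```ts
// Sketch only: point the openai SDK at the local proxy and iterate the events.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:3000/v1", // assumed proxy address
  apiKey: process.env.HF_TOKEN, // assumed token variable
});

const stream = await openai.responses.create({
  model: "Qwen/Qwen2.5-VL-7B-Instruct",
  input: "Say 'double bubble bath' ten times fast.", // assumed prompt
  stream: true,
});

for await (const event of stream) {
  console.log(event); // response.created, response.output_text.delta, ...
}
```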
Built on top of #1576. Based on https://platform.openai.com/docs/guides/structured-outputs

Works both with and without streaming.

## Non-stream

**Run:**

```bash
pnpm run example structured_output
```

whose core logic is:

```js
(...)
const response = await openai.responses.parse({
  model: "Qwen/Qwen2.5-VL-72B-Instruct",
  provider: "nebius",
  input: [
    {
      role: "system",
      content: "You are a helpful math tutor. Guide the user through the solution step by step.",
    },
    { role: "user", content: "how can I solve 8x + 7 = -23" },
  ],
  text: {
    format: zodTextFormat(MathReasoning, "math_reasoning"),
  },
});
(...)
```

**Output:**

```js
{
  steps: [
    {
      explanation: 'To solve for x, we need to isolate it on one side of the equation. We start by subtracting 7 from both sides of the equation.',
      output: '8x + 7 - 7 = -23 - 7'
    },
    {
      explanation: 'Simplify the equation after performing the subtraction.',
      output: '8x = -30'
    },
    {
      explanation: 'Now that we have isolated the term with x, we divide both sides by 8 to get x by itself.',
      output: '8x / 8 = -30 / 8'
    },
    {
      explanation: 'Perform the division to find the value of x.',
      output: 'x = -30 / 8'
    },
    {
      explanation: 'Simplify the fraction if possible.',
      output: 'x = -15 / 4'
    }
  ],
  final_answer: 'The solution is x = -15/4 or x = -3.75.'
}
```

## Stream

**Run:**

```bash
pnpm run example structured_output_streaming
```

whose core logic is:

```js
const stream = openai.responses
  .stream({
    model: "Qwen/Qwen2.5-VL-72B-Instruct",
    provider: "nebius",
    instructions: "Extract the event information.",
    input: "Alice and Bob are going to a science fair on Friday.",
    text: {
      format: zodTextFormat(CalendarEvent, "calendar_event"),
    },
  })
  .on("response.refusal.delta", (event) => {
    process.stdout.write(event.delta);
  })
  .on("response.output_text.delta", (event) => {
    process.stdout.write(event.delta);
  })
  .on("response.output_text.done", () => {
    process.stdout.write("\n");
  })
  .on("response.error", (event) => {
    console.error(event.error);
  });

const result = await stream.finalResponse();
console.log(result.output_parsed);
```

**Output:**

```js
{ "name": "Science Fair", "date": "Friday", "participants": ["Alice", "Bob"] }
{ name: 'Science Fair', date: 'Friday', participants: [ 'Alice', 'Bob' ] }
```
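The `MathReasoning` and `CalendarEvent` Zod schemas are elided above; a plausible reconstruction inferred from the printed output shapes (an assumption, not the PR's actual code):

```ts
// Assumed schema shapes, inferred from the outputs shown above.
import { z } from "zod";

const MathReasoning = z.object({
  steps: z.array(z.object({ explanation: z.string(), output: z.string() })),
  final_answer: z.string(),
});

const CalendarEvent = z.object({
  name: z.string(),
  date: z.string(),
  participants: z.array(z.string()),
});
```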
Built on top of #1576. Based on https://platform.openai.com/docs/api-reference/responses/create and https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming

Works both with and without streaming.

**Note:** the implementation is starting to get messy, especially in streaming mode, and complexity increases as we add new event types. I do think a refactoring would be beneficial, e.g. with an internal state object that keeps track of the current state and "knows" what to emit and when (typically, to emit the "done"/"completed" events each time a new output/content is generated). Food for thought for a future PR.

## Non-stream

**Run:**

```bash
pnpm run example function
```

**Output:**

```js
{
  created_at: 1751467285177,
  error: null,
  id: 'resp_0b2ab98168a9813e0f7373f940221da4ef3211f43c9faac8',
  instructions: null,
  max_output_tokens: null,
  metadata: null,
  model: 'meta-llama/Llama-3.3-70B-Instruct',
  object: 'response',
  output: [
    {
      type: 'function_call',
      id: 'fc_f40ac964165602e2fcb2f955777acff8c4b9359d49eaf79b',
      call_id: '9cd167c7f',
      name: 'get_current_weather',
      arguments: '{"location": "Boston, MA", "unit": "fahrenheit"}',
      status: 'completed'
    }
  ],
  status: 'completed',
  tool_choice: 'auto',
  tools: [
    {
      name: 'get_current_weather',
      parameters: [Object],
      strict: true,
      type: 'function',
      description: 'Get the current weather in a given location'
    }
  ],
  temperature: 1,
  top_p: 1,
  output_text: ''
}
```

## Stream

**Run:**

```bash
pnpm run example function_streaming
```

**Output:**

```js
{ type: 'response.created', response: { created_at: 1751467334073, error: null, id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05', instructions: null, max_output_tokens: null, metadata: null, model: 'meta-llama/Llama-3.3-70B-Instruct', object: 'response', output: [], status: 'in_progress', tool_choice: 'auto', tools: [ [Object] ], temperature: 1, top_p: 1 }, sequence_number: 0 }
{ type: 'response.in_progress', response: { created_at: 1751467334073, error: null, id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05', instructions: null, max_output_tokens: null, metadata: null, model: 'meta-llama/Llama-3.3-70B-Instruct', object: 'response', output: [], status: 'in_progress', tool_choice: 'auto', tools: [ [Object] ], temperature: 1, top_p: 1 }, sequence_number: 1 }
{ type: 'response.output_item.added', output_index: 0, item: { type: 'function_call', id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454', call_id: '83a9d4baf', name: 'get_weather', arguments: '' }, sequence_number: 2 }
{ type: 'response.function_call_arguments.delta', item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454', output_index: 0, delta: '{"latitude": 48.8567, "longitude": 2.3508}', sequence_number: 3 }
{ type: 'response.function_call_arguments.done', item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454', output_index: 0, arguments: '{"latitude": 48.8567, "longitude": 2.3508}', sequence_number: 4 }
{ type: 'response.output_item.done', output_index: 0, item: { type: 'function_call', id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454', call_id: '83a9d4baf', name: 'get_weather', arguments: '{"latitude": 48.8567, "longitude": 2.3508}', status: 'completed' }, sequence_number: 5 }
{ type: 'response.completed', response: { created_at: 1751467334073, error: null, id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05', instructions: null, max_output_tokens: null, metadata: null, model: 'meta-llama/Llama-3.3-70B-Instruct', object: 'response', output: [ [Object] ], status: 'completed', tool_choice: 'auto', tools: [ [Object] ], temperature: 1, top_p: 1 }, sequence_number: 6 }
```
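In the printed outputs, the tools' `parameters` field is collapsed to `[Object]`. A plausible reconstruction of the tool passed in the non-stream example, in the Responses API's flat function format; only the fields visible in the output above are known, the JSON-schema body is a guess:

```ts
// Assumed reconstruction; the parameters schema is illustrative only.
const tools = [
  {
    type: "function" as const,
    name: "get_current_weather",
    description: "Get the current weather in a given location",
    strict: true,
    parameters: {
      type: "object",
      properties: {
        location: { type: "string", description: "City and state, e.g. Boston, MA" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["location", "unit"],
      additionalProperties: false,
    },
  },
];
```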
Some tweaks to make it work in a demo:

- the provider can be passed like this: `model="cohere@CohereLabs/c4ai-command-a-03-2025"` (see the sketch below)
- clean up the input messages when a message is empty (not supported by some providers)
- check that the list of tool calls is not empty
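As a sketch of the first tweak, the `provider@model` string could be split like this (function and field names are hypothetical, not the PR's actual code):

```ts
// Illustrative only: split an optional "provider@" prefix off the model id.
function parseModelString(model: string): { provider?: string; modelId: string } {
  const at = model.indexOf("@");
  if (at === -1) return { modelId: model };
  return { provider: model.slice(0, at), modelId: model.slice(at + 1) };
}

// parseModelString("cohere@CohereLabs/c4ai-command-a-03-2025")
// -> { provider: "cohere", modelId: "CohereLabs/c4ai-command-a-03-2025" }
```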
Closing in favor of its own repository: https://github.com/huggingface/responses.js
Very early draft. The goal is to run a server that implements OpenAI's Responses API on top of Inference Providers.
cc @Vaibhavs10 @julien-c @hanouticelina
No need to review for now; this is mostly a base to start from and try things.
Quick Start
Start server
```bash
cd packages/responses-server
pnpm install
pnpm dev
```
Run examples
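From the PR comments above, examples are run with `pnpm run example <name>`, for instance:

```bash
pnpm run example streaming
pnpm run example structured_output
pnpm run example function
```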
TODOs
This is a POC; most features are currently not implemented. Check `schemas.ts` for all fields that exist in the official API but are not supported yet. Here is a list of the main features of the API, in no particular order; we don't need all of them right now:
Example