[draft] Responses API proxy server #1576

Draft · wants to merge 12 commits into main

[draft] Responses API proxy server #1576

wants to merge 12 commits into from

Conversation

Wauplin
Copy link
Contributor

@Wauplin commented Jul 1, 2025

⚠️⚠️ Codebase moved to its own repository: https://github.com/huggingface/responses.js ⚠️⚠️


Very early draft. The goal is to run a server that implements OpenAI's Responses API on top of Inference Providers.

cc @Vaibhavs10 @julien-c @hanouticelina

No need to review for now. Mostly to start as a base + try things.

## Quick Start

**Start the server:**

```bash
cd packages/responses-server
pnpm install
pnpm dev
```

**Run examples:**

```bash
# Text input
pnpm run example text

# Multi-turn text input
pnpm run example multi_turn

# Text + image input
pnpm run example image

# Streaming
pnpm run example streaming

# Structured output
pnpm run example structured_output
pnpm run example structured_output_streaming

# Function calling
pnpm run example function
pnpm run example function_streaming
```

## TODOs

This is a POC. Most features are currently not implemented. Check schemas.ts for all fields that exist in the official API but are not yet supported.

Here is a list of the main features of the API, in no particular order. We don't need all of them right now:

## Example

```js
import OpenAI from "openai";

const openai = new OpenAI({ baseURL: "http://localhost:3000/v1", apiKey: process.env.HF_TOKEN });

const response = await openai.responses.create({
	model: "Qwen/Qwen2.5-VL-7B-Instruct",
	instructions: "You are a helpful assistant.",
	input: "Tell me a three sentence bedtime story about a unicorn.",
});

console.log(response);
console.log(response.output_text);
```
```bash
pnpm run example text_single
```

```
> @huggingface/responses-server@0.1.0 example /home/wauplin/projects/huggingface.js/packages/responses-server
> node scripts/run-example.js text_single
```

```js
{
  object: 'response',
  id: 'resp_ff152426c1c1ad544392e8ff512ef73dedc3a9cbab60e65a',
  status: 'completed',
  error: null,
  instructions: 'You are a helpful assistant.',
  model: 'Qwen/Qwen2.5-VL-7B-Instruct',
  temperature: 1,
  top_p: 1,
  created_at: 1751359624,
  output: [
    {
      id: 'msg_0d05372ec992169434028304842f1123b195e9bebdf1305f',
      type: 'message',
      role: 'assistant',
      status: 'completed',
      content: [Array]
    }
  ],
  output_text: "Once upon a time, in a faraway realm, lived a young unicorn named Starlight. Every night, Starlight would Sadly come down from a hill and heatpocumentt the Sleeping Forest. One Moonlight night, Starlight accidentally stumbled upon a secluded glade filled with silver stars and glowing pixies. Dazzled by the magical light, Starlight decided to let the pixies light Starlight. With the pixies' help, Starlight found a way to make the Severnvalsea never gonna gonna gonna stop glittering magnificently. As the last stars disappeared into the night, Starlight felt a sense of peace, knowing that the Gadondo's secret was now safe, forever encompassed in the magic of their icy depiction."
}
```

PR on top of #1576.

I'm starting to think that it makes sense to define Zod schemas for
inputs, since we need to validate users' inputs, but that for outputs we
"only" need static type checking and could therefore reuse the types
defined in https://github.com/openai/openai-node.

**Benefits:** no need to redefine everything manually. It's easy to make
mistakes (a parameter that shouldn't be nullable, that could be an
array, etc.) when translating from the specs to our codebase. If static
typing doesn't complain, we can assume "it's good".
Also, less code to maintain.

**Drawback:** less flexibility. We don't own the stack and things might
get updated in the wild. That's less of a problem in this context since
it's a server, not a client (and therefore we manage the dependency
updates).

Overall I do think it's better to import from openai. Since we won't
implement everything at first, it's fine to use `Omit<...,
"key-that-we-dont-implement">`, which **explicitly** removes a feature
(better than leaving it implicitly undefined). See the sketch below.
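A minimal sketch of the idea (the import path and the omitted key are assumptions for illustration, not actual code from this PR):

```ts
// Minimal sketch, not actual code from this PR.
// Assumption: the "openai" package exposes the Responses API types under
// the OpenAI namespace (exact path may differ across openai-node versions).
import type OpenAI from "openai";

// Reuse the official output type while explicitly removing a feature we
// don't implement yet ("previous_response_id" is a hypothetical example).
type ProxyResponse = Omit<OpenAI.Responses.Response, "previous_response_id">;
```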


---

**EDIT:** it's fine to use them for now, and if it ever becomes blocking
in the future we can redefine things ourselves.
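For the input side, here is a minimal sketch of what a Zod-validated request schema could look like (field list abridged and illustrative, not the actual schemas.ts):

```ts
import { z } from "zod";

// Illustrative input schema: validate user requests at the route boundary.
const CreateResponseParams = z.object({
	model: z.string(),
	instructions: z.string().nullable().default(null),
	input: z.string(), // the real API also accepts structured message arrays
	temperature: z.number().min(0).max(2).default(1),
	top_p: z.number().min(0).max(1).default(1),
	stream: z.boolean().default(false),
});

// `parse` throws a structured ZodError that can be turned into a 400 response:
// const params = CreateResponseParams.parse(req.body);
```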
Wauplin and others added 3 commits July 2, 2025 10:09
Built on top of #1576.

This PR adds support for streaming mode to the Responses API.

Tested it using the
[openai-responses-starter-app](https://github.com/openai/openai-responses-starter-app):

[Screencast from 02-07-2025
07:43:52.webm](https://github.com/user-attachments/assets/6eb77c9c-5796-4841-af55-f526da8da847)


```bash
pnpm run example streaming
```

```js
{
  type: 'response.created',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'in_progress',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: []
  },
  sequence_number: 0
}
{
  type: 'response.in_progress',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'in_progress',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: []
  },
  sequence_number: 1
}
{
  type: 'response.output_item.added',
  output_index: 0,
  item: {
    id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
    type: 'message',
    role: 'assistant',
    status: 'in_progress',
    content: []
  },
  sequence_number: 2
}
{
  type: 'response.content_part.added',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  part: { type: 'output_text', text: '', annotations: [] },
  sequence_number: 3
}
{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: 'Double',
  sequence_number: 4
}
{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: ' bubble',
  sequence_number: 5
}

...

{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: '!',
  sequence_number: 43
}
{
  type: 'response.output_text.done',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!',
  sequence_number: 44
}
{
  type: 'response.content_part.done',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  part: {
    type: 'output_text',
    text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!',
    annotations: []
  },
  sequence_number: 45
}
{
  type: 'response.output_item.done',
  output_index: 0,
  item: {
    id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
    type: 'message',
    role: 'assistant',
    status: 'completed',
    content: [ [Object] ]
  },
  sequence_number: 46
}
{
  type: 'response.completed',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'completed',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: [ [Object] ]
  },
  sequence_number: 47
}
```
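For reference, these events are delivered to the client as server-sent events. A minimal sketch of emitting one event, assuming an Express-style response object (the server's actual stack is an assumption here):

```ts
import type { Response as ExpressResponse } from "express";

// Sketch: serialize one streaming event as an SSE chunk, following the
// "event:"/"data:" framing used by the OpenAI Responses API.
function emitEvent(res: ExpressResponse, event: { type: string; sequence_number: number }): void {
	res.write(`event: ${event.type}\n`);
	res.write(`data: ${JSON.stringify(event)}\n\n`);
}
```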
Built on top of #1576.

Based on https://platform.openai.com/docs/guides/structured-outputs

Works both with and without streaming.

## Non-stream

**Run**
```bash
pnpm run example structured_output
```

(its core logic is:)
```js
(...)
const response = await openai.responses.parse({
	model: "Qwen/Qwen2.5-VL-72B-Instruct",
	provider: "nebius",
	input: [
		{
			role: "system",
			content: "You are a helpful math tutor. Guide the user through the solution step by step.",
		},
		{ role: "user", content: "how can I solve 8x + 7 = -23" },
	],
	text: {
		format: zodTextFormat(MathReasoning, "math_reasoning"),
	},
});
(...)
```

**Output:**
```js
{
  steps: [
    {
      explanation: 'To solve for x, we need to isolate it on one side of the equation. We start by subtracting 7 from both sides of the equation.',
      output: '8x + 7 - 7 = -23 - 7'
    },
    {
      explanation: 'Simplify the equation after performing the subtraction.',
      output: '8x = -30'
    },
    {
      explanation: 'Now that we have isolated the term with x, we divide both sides by 8 to get x by itself.',
      output: '8x / 8 = -30 / 8'
    },
    {
      explanation: 'Perform the division to find the value of x.',
      output: 'x = -30 / 8'
    },
    {
      explanation: 'Simplify the fraction if possible.',
      output: 'x = -15 / 4'
    }
  ],
  final_answer: 'The solution is x = -15/4 or x = -3.75.'
}
```
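The snippet references a `MathReasoning` Zod schema that isn't shown. A plausible definition, inferred from the output shape above (an assumption, not the actual example code):

```ts
import { z } from "zod";

// Inferred from the output shape; the schema in the example may differ.
const Step = z.object({
	explanation: z.string(),
	output: z.string(),
});

const MathReasoning = z.object({
	steps: z.array(Step),
	final_answer: z.string(),
});
```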

## Stream

**Run**
```bash
pnpm run example structured_output_streaming
```

(its core logic is:)

```js
const stream = openai.responses
	.stream({
		model: "Qwen/Qwen2.5-VL-72B-Instruct",
		provider: "nebius",
		instructions: "Extract the event information.",
		input: "Alice and Bob are going to a science fair on Friday.",
		text: {
			format: zodTextFormat(CalendarEvent, "calendar_event"),
		},
	})
	.on("response.refusal.delta", (event) => {
		process.stdout.write(event.delta);
	})
	.on("response.output_text.delta", (event) => {
		process.stdout.write(event.delta);
	})
	.on("response.output_text.done", () => {
		process.stdout.write("\n");
	})
	.on("response.error", (event) => {
		console.error(event.error);
	});

const result = await stream.finalResponse();
console.log(result.output_parsed);
```

**Output:**

```js
{
  "name": "Science Fair",
  "date": "Friday",
  "participants": ["Alice", "Bob"]
}
{
  name: 'Science Fair',
  date: 'Friday',
  participants: [ 'Alice', 'Bob' ]
}
```
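Likewise, the `CalendarEvent` schema referenced in the snippet isn't shown. A plausible definition, inferred from the parsed output (an assumption):

```ts
import { z } from "zod";

// Inferred from the parsed output; the schema in the example may differ.
const CalendarEvent = z.object({
	name: z.string(),
	date: z.string(),
	participants: z.array(z.string()),
});
```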
Wauplin and others added 3 commits July 2, 2025 16:52
Built on top of #1576.

Based on https://platform.openai.com/docs/api-reference/responses/create
and
https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming

Works both with and without streaming.

**Note:** the implementation is starting to get messy, especially in
streaming mode; complexity increases as we add new event types. I do
think a refactoring would be beneficial, e.g. an internal state object
that keeps track of the current state and "knows" what to emit and when
(typically, emitting the "done"/"completed" events each time a new
output/content is generated). Food for thought for a future PR; a rough
sketch of the idea follows.
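A rough sketch of that refactor, not what this PR implements (names and event shapes are illustrative):

```ts
// A small emitter that owns the sequence counter and the in-progress output
// item, so the route handler never forgets a pending "done" event.
interface OutputItem {
	id: string;
	type: "message" | "function_call";
}

class StreamState {
	private sequenceNumber = 0;
	private currentItem: OutputItem | null = null;

	// Close the previous item (if any) before announcing the new one.
	*startItem(item: OutputItem): Generator<Record<string, unknown>> {
		yield* this.finishItem();
		this.currentItem = item;
		yield { type: "response.output_item.added", item, sequence_number: this.sequenceNumber++ };
	}

	// Emit the trailing "done" event for the item in progress, if any.
	*finishItem(): Generator<Record<string, unknown>> {
		if (this.currentItem) {
			yield { type: "response.output_item.done", item: this.currentItem, sequence_number: this.sequenceNumber++ };
			this.currentItem = null;
		}
	}
}
```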

## Non-stream

**Run**
```bash
pnpm run example function
```

**Output**

```js
{
  created_at: 1751467285177,
  error: null,
  id: 'resp_0b2ab98168a9813e0f7373f940221da4ef3211f43c9faac8',
  instructions: null,
  max_output_tokens: null,
  metadata: null,
  model: 'meta-llama/Llama-3.3-70B-Instruct',
  object: 'response',
  output: [
    {
      type: 'function_call',
      id: 'fc_f40ac964165602e2fcb2f955777acff8c4b9359d49eaf79b',
      call_id: '9cd167c7f',
      name: 'get_current_weather',
      arguments: '{"location": "Boston, MA", "unit": "fahrenheit"}',
      status: 'completed'
    }
  ],
  status: 'completed',
  tool_choice: 'auto',
  tools: [
    {
      name: 'get_current_weather',
      parameters: [Object],
      strict: true,
      type: 'function',
      description: 'Get the current weather in a given location'
    }
  ],
  temperature: 1,
  top_p: 1,
  output_text: ''
}
```
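For context, a request producing an output like this passes the tool definition via the `tools` parameter. A sketch (the prompt and parameter schema are inferred from the output above, not taken from the example script):

```ts
import OpenAI from "openai";

const openai = new OpenAI({ baseURL: "http://localhost:3000/v1", apiKey: process.env.HF_TOKEN });

const response = await openai.responses.create({
	model: "meta-llama/Llama-3.3-70B-Instruct",
	input: "What is the weather like in Boston today?",
	tools: [
		{
			type: "function",
			name: "get_current_weather",
			description: "Get the current weather in a given location",
			strict: true,
			parameters: {
				type: "object",
				properties: {
					location: { type: "string", description: "City and state, e.g. Boston, MA" },
					unit: { type: "string", enum: ["celsius", "fahrenheit"] },
				},
				required: ["location", "unit"],
				additionalProperties: false,
			},
		},
	],
});

console.log(response.output);
```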

## Stream

**Run:**

```bash
pnpm run example function_streaming
```

**Output:**


```js
{
  type: 'response.created',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [],
    status: 'in_progress',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 0
}
{
  type: 'response.in_progress',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [],
    status: 'in_progress',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 1
}
{
  type: 'response.output_item.added',
  output_index: 0,
  item: {
    type: 'function_call',
    id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
    call_id: '83a9d4baf',
    name: 'get_weather',
    arguments: ''
  },
  sequence_number: 2
}
{
  type: 'response.function_call_arguments.delta',
  item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
  output_index: 0,
  delta: '{"latitude": 48.8567, "longitude": 2.3508}',
  sequence_number: 3
}
{
  type: 'response.function_call_arguments.done',
  item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
  output_index: 0,
  arguments: '{"latitude": 48.8567, "longitude": 2.3508}',
  sequence_number: 4
}
{
  type: 'response.output_item.done',
  output_index: 0,
  item: {
    type: 'function_call',
    id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
    call_id: '83a9d4baf',
    name: 'get_weather',
    arguments: '{"latitude": 48.8567, "longitude": 2.3508}',
    status: 'completed'
  },
  sequence_number: 5
}
{
  type: 'response.completed',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [ [Object] ],
    status: 'completed',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 6
}
```
Some tweaks to make it work in a demo:
- the provider can be passed like this:
`model="cohere@CohereLabs/c4ai-command-a-03-2025"` (see the sketch below)
- clean up the input messages when a message is empty (not supported by
some providers)
- check that the tool call list is not empty
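A minimal sketch of the `provider@model` convention (assuming the string is split on the first `@`; the actual parsing code may differ):

```ts
// Split "provider@model" into its two parts; no "@" means no provider override.
function parseModel(model: string): { provider?: string; model: string } {
	const at = model.indexOf("@");
	return at === -1 ? { model } : { provider: model.slice(0, at), model: model.slice(at + 1) };
}

// parseModel("cohere@CohereLabs/c4ai-command-a-03-2025")
// -> { provider: "cohere", model: "CohereLabs/c4ai-command-a-03-2025" }
```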
@Wauplin commented Jul 2, 2025

Closing in favor of its own repository: https://github.com/huggingface/responses.js
