[draft] Responses API proxy server #1576

Draft · wants to merge 12 commits into main

[draft] Responses API proxy server #1576

wants to merge 12 commits into from

Conversation

Wauplin
Copy link
Contributor

@Wauplin commented Jul 1, 2025

⚠️⚠️ Codebase moved to its own repository: https://github.com/huggingface/responses.js ⚠️⚠️


Very early draft. The goal is to run a server that implements OpenAI's Responses API on top of Inference Providers.

cc @Vaibhavs10 @julien-c @hanouticelina

No need to review for now. Mostly to start as a base + try things.

## Quick Start

**Start the server:**

```bash
cd packages/responses-server
pnpm install
pnpm dev
```

**Run examples:**

```bash
# Text input
pnpm run example text

# Multi-turn text input
pnpm run example multi_turn

# Text + image input
pnpm run example image

# Streaming
pnpm run example streaming

# Structured output
pnpm run example structured_output
pnpm run example structured_output_streaming

# Function calling
pnpm run example function
pnpm run example function_streaming
```

## TODOs

This is a POC. Most features are currently not implemented. Check schemas.ts for all fields that exist in the official API but are not yet supported.

Here is a list of the main features of the API, in no particular order. We don't need all of them right now:

## Example

```js
import OpenAI from "openai";

const openai = new OpenAI({ baseURL: "http://localhost:3000/v1", apiKey: process.env.HF_TOKEN });

const response = await openai.responses.create({
	model: "Qwen/Qwen2.5-VL-7B-Instruct",
	instructions: "You are a helpful assistant.",
	input: "Tell me a three sentence bedtime story about a unicorn.",
});

console.log(response);
console.log(response.output_text);
```
```bash
pnpm run example text_single
```

```
> @huggingface/responses-server@0.1.0 example /home/wauplin/projects/huggingface.js/packages/responses-server
> node scripts/run-example.js text_single
```

```js
{
  object: 'response',
  id: 'resp_ff152426c1c1ad544392e8ff512ef73dedc3a9cbab60e65a',
  status: 'completed',
  error: null,
  instructions: 'You are a helpful assistant.',
  model: 'Qwen/Qwen2.5-VL-7B-Instruct',
  temperature: 1,
  top_p: 1,
  created_at: 1751359624,
  output: [
    {
      id: 'msg_0d05372ec992169434028304842f1123b195e9bebdf1305f',
      type: 'message',
      role: 'assistant',
      status: 'completed',
      content: [Array]
    }
  ],
  output_text: "Once upon a time, in a faraway realm, lived a young unicorn named Starlight. Every night, Starlight would Sadly come down from a hill and heatpocumentt the Sleeping Forest. One Moonlight night, Starlight accidentally stumbled upon a secluded glade filled with silver stars and glowing pixies. Dazzled by the magical light, Starlight decided to let the pixies light Starlight. With the pixies' help, Starlight found a way to make the Severnvalsea never gonna gonna gonna stop glittering magnificently. As the last stars disappeared into the night, Starlight felt a sense of peace, knowing that the Gadondo's secret was now safe, forever encompassed in the magic of their icy depiction."
}
```

PR on top of #1576.

I'm starting to think that it makes sense to define Zod schemas for
inputs, since we need to validate users' inputs, but that for outputs we
"only" need static type checking and could therefore reuse the types
defined in https://github.com/openai/openai-node.

**Benefits:** no need to redefine everything manually. It's easy to make
mistakes (a parameter that shouldn't be nullable, that could be an
array, etc.) when translating from the specs to our codebase. If static
typing doesn't complain, we can assume "it's good".
Also, less code to maintain.

**Drawback:** less flexibility. We don't own the stack and things might
get updated in the wild. That's less of a problem in this context since
it's a server, not a client (and therefore we manage the dependency
updates).

Overall I do think it's better to import from openai. Since we won't
implement everything at first, it's fine to use `Omit<...,
"key-that-we-dont-implement">`, which **explicitly** removes a feature
(better than leaving it implicitly undefined). See the sketch below.
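A minimal sketch of the idea (the import path and the omitted key are assumptions for illustration, not actual code from this PR):

```ts
// Minimal sketch, not actual code from this PR.
// Assumption: the "openai" package exposes the Responses API types under
// the OpenAI namespace (exact path may differ across openai-node versions).
import type OpenAI from "openai";

// Reuse the official output type while explicitly removing a feature we
// don't implement yet ("previous_response_id" is a hypothetical example).
type ProxyResponse = Omit<OpenAI.Responses.Response, "previous_response_id">;
```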


---

**EDIT:** it's fine to use them for now, and if it ever becomes blocking
in the future we can redefine things ourselves.
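For the input side, here is a minimal sketch of what a Zod-validated request schema could look like (field list abridged and illustrative, not the actual schemas.ts):

```ts
import { z } from "zod";

// Illustrative input schema: validate user requests at the route boundary.
const CreateResponseParams = z.object({
	model: z.string(),
	instructions: z.string().nullable().default(null),
	input: z.string(), // the real API also accepts structured message arrays
	temperature: z.number().min(0).max(2).default(1),
	top_p: z.number().min(0).max(1).default(1),
	stream: z.boolean().default(false),
});

// `parse` throws a structured ZodError that can be turned into a 400 response:
// const params = CreateResponseParams.parse(req.body);
```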
Wauplin and others added 3 commits July 2, 2025 10:09
Built on top of #1576.

This PR adds support for streaming mode to the Responses API.

Tested it using the
[openai-responses-starter-app](https://github.com/openai/openai-responses-starter-app):

[Screencast from 02-07-2025
07:43:52.webm](https://github.com/user-attachments/assets/6eb77c9c-5796-4841-af55-f526da8da847)


```bash
pnpm run example streaming
```

```js
{
  type: 'response.created',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'in_progress',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: []
  },
  sequence_number: 0
}
{
  type: 'response.in_progress',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'in_progress',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: []
  },
  sequence_number: 1
}
{
  type: 'response.output_item.added',
  output_index: 0,
  item: {
    id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
    type: 'message',
    role: 'assistant',
    status: 'in_progress',
    content: []
  },
  sequence_number: 2
}
{
  type: 'response.content_part.added',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  part: { type: 'output_text', text: '', annotations: [] },
  sequence_number: 3
}
{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: 'Double',
  sequence_number: 4
}
{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: ' bubble',
  sequence_number: 5
}

...

{
  type: 'response.output_text.delta',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  delta: '!',
  sequence_number: 43
}
{
  type: 'response.output_text.done',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!',
  sequence_number: 44
}
{
  type: 'response.content_part.done',
  item_id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
  output_index: 0,
  content_index: 0,
  part: {
    type: 'output_text',
    text: 'Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath! Double bubble bath!',
    annotations: []
  },
  sequence_number: 45
}
{
  type: 'response.output_item.done',
  output_index: 0,
  item: {
    id: 'msg_def4b731a2654f7eab4fb2efdff217079da37154709c0f0b',
    type: 'message',
    role: 'assistant',
    status: 'completed',
    content: [ [Object] ]
  },
  sequence_number: 46
}
{
  type: 'response.completed',
  response: {
    object: 'response',
    id: 'resp_861131785bfb75f24f944aa7cbc4767b194a2ea320cff258',
    status: 'completed',
    error: null,
    instructions: null,
    model: 'Qwen/Qwen2.5-VL-7B-Instruct',
    temperature: 1,
    top_p: 1,
    created_at: 1751383702199,
    output: [ [Object] ]
  },
  sequence_number: 47
}
```
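For reference, these events are delivered to the client as server-sent events. A minimal sketch of emitting one event, assuming an Express-style response object (the server's actual stack is an assumption here):

```ts
import type { Response as ExpressResponse } from "express";

// Sketch: serialize one streaming event as an SSE chunk, following the
// "event:"/"data:" framing used by the OpenAI Responses API.
function emitEvent(res: ExpressResponse, event: { type: string; sequence_number: number }): void {
	res.write(`event: ${event.type}\n`);
	res.write(`data: ${JSON.stringify(event)}\n\n`);
}
```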
Built on top of #1576.

Based on https://platform.openai.com/docs/guides/structured-outputs

Works both with and without streaming.

## Non-stream

**Run**
```bash
pnpm run example structured_output
```

(its core logic is:)
```js
(...)
const response = await openai.responses.parse({
	model: "Qwen/Qwen2.5-VL-72B-Instruct",
	provider: "nebius",
	input: [
		{
			role: "system",
			content: "You are a helpful math tutor. Guide the user through the solution step by step.",
		},
		{ role: "user", content: "how can I solve 8x + 7 = -23" },
	],
	text: {
		format: zodTextFormat(MathReasoning, "math_reasoning"),
	},
});
(...)
```

**Output:**
```js
{
  steps: [
    {
      explanation: 'To solve for x, we need to isolate it on one side of the equation. We start by subtracting 7 from both sides of the equation.',
      output: '8x + 7 - 7 = -23 - 7'
    },
    {
      explanation: 'Simplify the equation after performing the subtraction.',
      output: '8x = -30'
    },
    {
      explanation: 'Now that we have isolated the term with x, we divide both sides by 8 to get x by itself.',
      output: '8x / 8 = -30 / 8'
    },
    {
      explanation: 'Perform the division to find the value of x.',
      output: 'x = -30 / 8'
    },
    {
      explanation: 'Simplify the fraction if possible.',
      output: 'x = -15 / 4'
    }
  ],
  final_answer: 'The solution is x = -15/4 or x = -3.75.'
}
```
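The snippet references a `MathReasoning` Zod schema that isn't shown. A plausible definition, inferred from the output shape above (an assumption, not the actual example code):

```ts
import { z } from "zod";

// Inferred from the output shape; the schema in the example may differ.
const Step = z.object({
	explanation: z.string(),
	output: z.string(),
});

const MathReasoning = z.object({
	steps: z.array(Step),
	final_answer: z.string(),
});
```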

## Stream

**Run**
```bash
pnpm run example structured_output_streaming
```

(its core logic is:)

```js
const stream = openai.responses
	.stream({
		model: "Qwen/Qwen2.5-VL-72B-Instruct",
		provider: "nebius",
		instructions: "Extract the event information.",
		input: "Alice and Bob are going to a science fair on Friday.",
		text: {
			format: zodTextFormat(CalendarEvent, "calendar_event"),
		},
	})
	.on("response.refusal.delta", (event) => {
		process.stdout.write(event.delta);
	})
	.on("response.output_text.delta", (event) => {
		process.stdout.write(event.delta);
	})
	.on("response.output_text.done", () => {
		process.stdout.write("\n");
	})
	.on("response.error", (event) => {
		console.error(event.error);
	});

const result = await stream.finalResponse();
console.log(result.output_parsed);
```

**Output:**

```js
{
  "name": "Science Fair",
  "date": "Friday",
  "participants": ["Alice", "Bob"]
}
{
  name: 'Science Fair',
  date: 'Friday',
  participants: [ 'Alice', 'Bob' ]
}
```
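Likewise, the `CalendarEvent` schema referenced in the snippet isn't shown. A plausible definition, inferred from the parsed output (an assumption):

```ts
import { z } from "zod";

// Inferred from the parsed output; the schema in the example may differ.
const CalendarEvent = z.object({
	name: z.string(),
	date: z.string(),
	participants: z.array(z.string()),
});
```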
Wauplin and others added 3 commits July 2, 2025 16:52
Built on top of #1576.

Based on https://platform.openai.com/docs/api-reference/responses/create
and
https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming

Works both with and without streaming.

**Note:** the implementation is starting to get messy, especially in
streaming mode; complexity increases as we add new event types. I do
think a refactoring would be beneficial, e.g. an internal state object
that keeps track of the current state and "knows" what to emit and when
(typically, emitting the "done"/"completed" events each time a new
output/content is generated). Food for thought for a future PR; a rough
sketch of the idea follows.
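A rough sketch of that refactor, not what this PR implements (names and event shapes are illustrative):

```ts
// A small emitter that owns the sequence counter and the in-progress output
// item, so the route handler never forgets a pending "done" event.
interface OutputItem {
	id: string;
	type: "message" | "function_call";
}

class StreamState {
	private sequenceNumber = 0;
	private currentItem: OutputItem | null = null;

	// Close the previous item (if any) before announcing the new one.
	*startItem(item: OutputItem): Generator<Record<string, unknown>> {
		yield* this.finishItem();
		this.currentItem = item;
		yield { type: "response.output_item.added", item, sequence_number: this.sequenceNumber++ };
	}

	// Emit the trailing "done" event for the item in progress, if any.
	*finishItem(): Generator<Record<string, unknown>> {
		if (this.currentItem) {
			yield { type: "response.output_item.done", item: this.currentItem, sequence_number: this.sequenceNumber++ };
			this.currentItem = null;
		}
	}
}
```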

## Non-stream

**Run**
```bash
pnpm run example function
```

**Output**

```js
{
  created_at: 1751467285177,
  error: null,
  id: 'resp_0b2ab98168a9813e0f7373f940221da4ef3211f43c9faac8',
  instructions: null,
  max_output_tokens: null,
  metadata: null,
  model: 'meta-llama/Llama-3.3-70B-Instruct',
  object: 'response',
  output: [
    {
      type: 'function_call',
      id: 'fc_f40ac964165602e2fcb2f955777acff8c4b9359d49eaf79b',
      call_id: '9cd167c7f',
      name: 'get_current_weather',
      arguments: '{"location": "Boston, MA", "unit": "fahrenheit"}',
      status: 'completed'
    }
  ],
  status: 'completed',
  tool_choice: 'auto',
  tools: [
    {
      name: 'get_current_weather',
      parameters: [Object],
      strict: true,
      type: 'function',
      description: 'Get the current weather in a given location'
    }
  ],
  temperature: 1,
  top_p: 1,
  output_text: ''
}
```
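For context, a request producing an output like this passes the tool definition via the `tools` parameter. A sketch (the prompt and parameter schema are inferred from the output above, not taken from the example script):

```ts
import OpenAI from "openai";

const openai = new OpenAI({ baseURL: "http://localhost:3000/v1", apiKey: process.env.HF_TOKEN });

const response = await openai.responses.create({
	model: "meta-llama/Llama-3.3-70B-Instruct",
	input: "What is the weather like in Boston today?",
	tools: [
		{
			type: "function",
			name: "get_current_weather",
			description: "Get the current weather in a given location",
			strict: true,
			parameters: {
				type: "object",
				properties: {
					location: { type: "string", description: "City and state, e.g. Boston, MA" },
					unit: { type: "string", enum: ["celsius", "fahrenheit"] },
				},
				required: ["location", "unit"],
				additionalProperties: false,
			},
		},
	],
});

console.log(response.output);
```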

## Stream

**Run:**

```bash
pnpm run example function_streaming
```

**Output:**


```js
{
  type: 'response.created',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [],
    status: 'in_progress',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 0
}
{
  type: 'response.in_progress',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [],
    status: 'in_progress',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 1
}
{
  type: 'response.output_item.added',
  output_index: 0,
  item: {
    type: 'function_call',
    id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
    call_id: '83a9d4baf',
    name: 'get_weather',
    arguments: ''
  },
  sequence_number: 2
}
{
  type: 'response.function_call_arguments.delta',
  item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
  output_index: 0,
  delta: '{"latitude": 48.8567, "longitude": 2.3508}',
  sequence_number: 3
}
{
  type: 'response.function_call_arguments.done',
  item_id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
  output_index: 0,
  arguments: '{"latitude": 48.8567, "longitude": 2.3508}',
  sequence_number: 4
}
{
  type: 'response.output_item.done',
  output_index: 0,
  item: {
    type: 'function_call',
    id: 'fc_9bdc8945b9cb6c95c5c248db4203f0707ba9fd338dee2454',
    call_id: '83a9d4baf',
    name: 'get_weather',
    arguments: '{"latitude": 48.8567, "longitude": 2.3508}',
    status: 'completed'
  },
  sequence_number: 5
}
{
  type: 'response.completed',
  response: {
    created_at: 1751467334073,
    error: null,
    id: 'resp_8d86745178f2b9fc0da000156655956181c76a7701712a05',
    instructions: null,
    max_output_tokens: null,
    metadata: null,
    model: 'meta-llama/Llama-3.3-70B-Instruct',
    object: 'response',
    output: [ [Object] ],
    status: 'completed',
    tool_choice: 'auto',
    tools: [ [Object] ],
    temperature: 1,
    top_p: 1
  },
  sequence_number: 6
}
```
Some tweaks to make it work in a demo:
- the provider can be passed like this:
`model="cohere@CohereLabs/c4ai-command-a-03-2025"` (see the sketch below)
- clean up the input messages when a message is empty (not supported by
some providers)
- check that the tool call list is not empty
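A minimal sketch of the `provider@model` convention (assuming the string is split on the first `@`; the actual parsing code may differ):

```ts
// Split "provider@model" into its two parts; no "@" means no provider override.
function parseModel(model: string): { provider?: string; model: string } {
	const at = model.indexOf("@");
	return at === -1 ? { model } : { provider: model.slice(0, at), model: model.slice(at + 1) };
}

// parseModel("cohere@CohereLabs/c4ai-command-a-03-2025")
// -> { provider: "cohere", model: "CohereLabs/c4ai-command-a-03-2025" }
```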
@Wauplin commented Jul 2, 2025

Closing in favor of its own repository: https://github.com/huggingface/responses.js
