Commit 6b3bfa2

(Feat) - return x-litellm-attempted-fallbacks in responses from litellm proxy (BerriAI#8558)
* add_fallback_headers_to_response
* test x-litellm-attempted-fallbacks
* unit test attempted fallbacks
* fix add_fallback_headers_to_response
* docs document response headers
* fix file name

1 parent a9276f2 commit 6b3bfa2

9 files changed: +200 −117 lines changed

+58 −11

@@ -1,24 +1,71 @@
-# Rate Limit Headers
+# Response Headers

-When you make a request to the proxy, the proxy will return the following [OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):
+When you make a request to the proxy, the proxy will return the following headers:

-- `x-ratelimit-remaining-requests` - Optional[int]: The remaining number of requests that are permitted before exhausting the rate limit.
-- `x-ratelimit-remaining-tokens` - Optional[int]: The remaining number of tokens that are permitted before exhausting the rate limit.
-- `x-ratelimit-limit-requests` - Optional[int]: The maximum number of requests that are permitted before exhausting the rate limit.
-- `x-ratelimit-limit-tokens` - Optional[int]: The maximum number of tokens that are permitted before exhausting the rate limit.
-- `x-ratelimit-reset-requests` - Optional[int]: The time at which the rate limit will reset.
-- `x-ratelimit-reset-tokens` - Optional[int]: The time at which the rate limit will reset.
+## Rate Limit Headers
+[OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):

-These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.
+| Header | Type | Description |
+|--------|------|-------------|
+| `x-ratelimit-remaining-requests` | Optional[int] | The remaining number of requests that are permitted before exhausting the rate limit |
+| `x-ratelimit-remaining-tokens` | Optional[int] | The remaining number of tokens that are permitted before exhausting the rate limit |
+| `x-ratelimit-limit-requests` | Optional[int] | The maximum number of requests that are permitted before exhausting the rate limit |
+| `x-ratelimit-limit-tokens` | Optional[int] | The maximum number of tokens that are permitted before exhausting the rate limit |
+| `x-ratelimit-reset-requests` | Optional[int] | The time at which the rate limit will reset |
+| `x-ratelimit-reset-tokens` | Optional[int] | The time at which the rate limit will reset |

-## How are these headers calculated?
+### How Rate Limit Headers work

 **If key has rate limits set**

 The proxy will return the [remaining rate limits for that key](https://github.com/BerriAI/litellm/blob/bfa95538190575f7f317db2d9598fc9a82275492/litellm/proxy/hooks/parallel_request_limiter.py#L778).

 **If key does not have rate limits set**

-The proxy returns the remaining requests/tokens returned by the backend provider.
+The proxy returns the remaining requests/tokens returned by the backend provider. (LiteLLM will standardize the backend provider's response headers to match the OpenAI format)

 If the backend provider does not return these headers, the value will be `None`.
+
+These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.
+
+
+## Latency Headers
+| Header | Type | Description |
+|--------|------|-------------|
+| `x-litellm-response-duration-ms` | float | Total duration of the API response in milliseconds |
+| `x-litellm-overhead-duration-ms` | float | LiteLLM processing overhead in milliseconds |
+
+## Retry, Fallback Headers
+| Header | Type | Description |
+|--------|------|-------------|
+| `x-litellm-attempted-retries` | int | Number of retry attempts made |
+| `x-litellm-attempted-fallbacks` | int | Number of fallback attempts made |
+| `x-litellm-max-fallbacks` | int | Maximum number of fallback attempts allowed |
+
+## Cost Tracking Headers
+| Header | Type | Description |
+|--------|------|-------------|
+| `x-litellm-response-cost` | float | Cost of the API call |
+| `x-litellm-key-spend` | float | Total spend for the API key |
+
+## LiteLLM Specific Headers
+| Header | Type | Description |
+|--------|------|-------------|
+| `x-litellm-call-id` | string | Unique identifier for the API call |
+| `x-litellm-model-id` | string | Unique identifier for the model used |
+| `x-litellm-model-api-base` | string | Base URL of the API endpoint |
+| `x-litellm-version` | string | Version of LiteLLM being used |
+| `x-litellm-model-group` | string | Model group identifier |
+
+## Response headers from LLM providers
+
+LiteLLM also returns the original response headers from the LLM provider. These headers are prefixed with `llm_provider-` to distinguish them from LiteLLM's headers.
+
+Example response headers:
+```
+llm_provider-openai-processing-ms: 256
+llm_provider-openai-version: 2020-10-01
+llm_provider-x-ratelimit-limit-requests: 30000
+llm_provider-x-ratelimit-limit-tokens: 150000000
+```
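
The headers documented above arrive as ordinary HTTP response headers on proxy responses. As a quick illustration (not part of this commit), here is a minimal sketch of reading them with the OpenAI Python SDK's raw-response interface; the base URL, API key, and model name are placeholder assumptions for a locally running proxy.

```python
# Minimal sketch: reading LiteLLM proxy response headers via the OpenAI Python SDK.
# Assumes a proxy at http://localhost:4000, a virtual key "sk-1234", and a model group
# named "gpt-4o" -- all placeholders, not values taken from this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)

# Headers set by the proxy (see the tables above); .get() leaves missing headers as None.
print(raw.headers.get("x-litellm-attempted-fallbacks"))
print(raw.headers.get("x-litellm-attempted-retries"))
print(raw.headers.get("x-ratelimit-remaining-requests"))

completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```

Which `x-litellm-*` headers are present can vary with proxy version and configuration, so treating them as optional (as with `.get()` above) is the safer client-side pattern.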

docs/my-website/sidebars.js (+1 −1)

@@ -65,8 +65,8 @@ const sidebars = {
       items: [
         "proxy/user_keys",
         "proxy/clientside_auth",
-        "proxy/response_headers",
         "proxy/request_headers",
+        "proxy/response_headers",
       ],
     },
     {

litellm/router.py (+8 −1)

@@ -57,7 +57,10 @@
 from litellm.router_strategy.lowest_tpm_rpm_v2 import LowestTPMLoggingHandler_v2
 from litellm.router_strategy.simple_shuffle import simple_shuffle
 from litellm.router_strategy.tag_based_routing import get_deployments_for_tag
-from litellm.router_utils.add_retry_headers import add_retry_headers_to_response
+from litellm.router_utils.add_retry_fallback_headers import (
+    add_fallback_headers_to_response,
+    add_retry_headers_to_response,
+)
 from litellm.router_utils.batch_utils import (
     _get_router_metadata_variable_name,
     replace_model_in_jsonl,
@@ -2888,6 +2891,10 @@ async def async_function_with_fallbacks(self, *args, **kwargs): # noqa: PLR0915
             else:
                 response = await self.async_function_with_retries(*args, **kwargs)
             verbose_router_logger.debug(f"Async Response: {response}")
+            response = add_fallback_headers_to_response(
+                response=response,
+                attempted_fallbacks=0,
+            )
             return response
         except Exception as e:
             verbose_router_logger.debug(f"Traceback{traceback.format_exc()}")
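
The hunk above also tags calls that succeed without falling back, so `x-litellm-attempted-fallbacks` is always present (and 0 in that case). A rough sketch of the caller-visible effect, using the same mock endpoint and deployment names as the tests added later in this diff (test fixtures, not a real provider):

```python
# Sketch of the effect of the router.py change above, assuming litellm is installed.
# "working-fake-endpoint" points at the mock OpenAI-compatible endpoint used in
# LiteLLM's own tests/config in this commit.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "working-fake-endpoint",
            "litellm_params": {
                "model": "openai/working-fake-endpoint",
                "api_key": "my-fake-key",
                "api_base": "https://exampleopenaiendpoint-production.up.railway.app",
            },
        },
    ],
)

resp = router.completion(
    model="working-fake-endpoint",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)

# No fallback was needed, so the stamped value is 0.
print(resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"])
```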

litellm/router_utils/add_retry_headers.py → litellm/router_utils/add_retry_fallback_headers.py (+43 −15)

@@ -5,24 +5,13 @@
 from litellm.types.utils import HiddenParams


-def add_retry_headers_to_response(
-    response: Any,
-    attempted_retries: int,
-    max_retries: Optional[int] = None,
-) -> Any:
+def _add_headers_to_response(response: Any, headers: dict) -> Any:
     """
-    Add retry headers to the request
+    Helper function to add headers to a response's hidden params
     """
-
     if response is None or not isinstance(response, BaseModel):
         return response

-    retry_headers = {
-        "x-litellm-attempted-retries": attempted_retries,
-    }
-    if max_retries is not None:
-        retry_headers["x-litellm-max-retries"] = max_retries
-
     hidden_params: Optional[Union[dict, HiddenParams]] = getattr(
         response, "_hidden_params", {}
     )
@@ -33,8 +22,47 @@ def add_retry_headers_to_response(
         hidden_params = hidden_params.model_dump()

     hidden_params.setdefault("additional_headers", {})
-    hidden_params["additional_headers"].update(retry_headers)
+    hidden_params["additional_headers"].update(headers)

     setattr(response, "_hidden_params", hidden_params)
-
     return response
+
+
+def add_retry_headers_to_response(
+    response: Any,
+    attempted_retries: int,
+    max_retries: Optional[int] = None,
+) -> Any:
+    """
+    Add retry headers to the request
+    """
+    retry_headers = {
+        "x-litellm-attempted-retries": attempted_retries,
+    }
+    if max_retries is not None:
+        retry_headers["x-litellm-max-retries"] = max_retries
+
+    return _add_headers_to_response(response, retry_headers)
+
+
+def add_fallback_headers_to_response(
+    response: Any,
+    attempted_fallbacks: int,
+) -> Any:
+    """
+    Add fallback headers to the response
+
+    Args:
+        response: The response to add the headers to
+        attempted_fallbacks: The number of fallbacks attempted
+
+    Returns:
+        The response with the headers added
+
+    Note: It's intentional that we don't add max_fallbacks in response headers
+    Want to avoid bloat in the response headers for performance.
+    """
+    fallback_headers = {
+        "x-litellm-attempted-fallbacks": attempted_fallbacks,
+    }
+    return _add_headers_to_response(response, fallback_headers)
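
For reference, a minimal sketch of the helpers above used in isolation (assumes `litellm` is installed; `ModelResponse` is used here simply because it is a pydantic `BaseModel`, which is what `_add_headers_to_response` requires):

```python
# Standalone sketch of the helpers above; the router calls them on real responses,
# and the proxy later surfaces the values as x-litellm-* HTTP response headers.
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
    add_retry_headers_to_response,
)
from litellm.types.utils import ModelResponse

response = ModelResponse()

response = add_fallback_headers_to_response(response=response, attempted_fallbacks=1)
response = add_retry_headers_to_response(
    response=response, attempted_retries=2, max_retries=3
)

print(response._hidden_params["additional_headers"])
# expected: {'x-litellm-attempted-fallbacks': 1,
#            'x-litellm-attempted-retries': 2, 'x-litellm-max-retries': 3}
```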

litellm/router_utils/fallback_event_handlers.py (+9 −50)

@@ -4,6 +4,9 @@
 import litellm
 from litellm._logging import verbose_router_logger
 from litellm.integrations.custom_logger import CustomLogger
+from litellm.router_utils.add_retry_fallback_headers import (
+    add_fallback_headers_to_response,
+)
 from litellm.types.router import LiteLLMParamsTypedDict

 if TYPE_CHECKING:
@@ -130,12 +133,17 @@ async def run_async_fallback(
             kwargs.setdefault("metadata", {}).update(
                 {"model_group": kwargs.get("model", None)}
             )  # update model_group used, if fallbacks are done
-            kwargs["fallback_depth"] = fallback_depth + 1
+            fallback_depth = fallback_depth + 1
+            kwargs["fallback_depth"] = fallback_depth
             kwargs["max_fallbacks"] = max_fallbacks
             response = await litellm_router.async_function_with_fallbacks(
                 *args, **kwargs
             )
             verbose_router_logger.info("Successful fallback b/w models.")
+            response = add_fallback_headers_to_response(
+                response=response,
+                attempted_fallbacks=fallback_depth,
+            )
             # callback for successfull_fallback_event():
             await log_success_fallback_event(
                 original_model_group=original_model_group,
@@ -153,55 +161,6 @@ async def run_async_fallback(
     raise error_from_fallbacks


-def run_sync_fallback(
-    litellm_router: LitellmRouter,
-    *args: Tuple[Any],
-    fallback_model_group: List[str],
-    original_model_group: str,
-    original_exception: Exception,
-    **kwargs,
-) -> Any:
-    """
-    Synchronous version of run_async_fallback.
-    Loops through all the fallback model groups and calls kwargs["original_function"] with the arguments and keyword arguments provided.
-
-    If the call is successful, returns the response.
-    If the call fails, continues to the next fallback model group.
-    If all fallback model groups fail, it raises the most recent exception.
-
-    Args:
-        litellm_router: The litellm router instance.
-        *args: Positional arguments.
-        fallback_model_group: List[str] of fallback model groups. example: ["gpt-4", "gpt-3.5-turbo"]
-        original_model_group: The original model group. example: "gpt-3.5-turbo"
-        original_exception: The original exception.
-        **kwargs: Keyword arguments.
-
-    Returns:
-        The response from the successful fallback model group.
-    Raises:
-        The most recent exception if all fallback model groups fail.
-    """
-    error_from_fallbacks = original_exception
-    for mg in fallback_model_group:
-        if mg == original_model_group:
-            continue
-        try:
-            # LOGGING
-            kwargs = litellm_router.log_retry(kwargs=kwargs, e=original_exception)
-            verbose_router_logger.info(f"Falling back to model_group = {mg}")
-            kwargs["model"] = mg
-            kwargs.setdefault("metadata", {}).update(
-                {"model_group": mg}
-            )  # update model_group used, if fallbacks are done
-            response = litellm_router.function_with_fallbacks(*args, **kwargs)
-            verbose_router_logger.info("Successful fallback b/w models.")
-            return response
-        except Exception as e:
-            error_from_fallbacks = e
-    raise error_from_fallbacks
-
-
 async def log_success_fallback_event(
     original_model_group: str, kwargs: dict, original_exception: Exception
 ):

proxy_server_config.yaml (+7 −0)

@@ -135,6 +135,13 @@ model_list:
       api_key: my-fake-key
       api_base: https://exampleopenaiendpoint-production.up.railway.app/
       timeout: 1
+  - model_name: badly-configured-openai-endpoint
+    litellm_params:
+      model: openai/my-fake-model
+      api_key: my-fake-key
+      api_base: https://exampleopenaiendpoint-production.up.railway.appxxxx/
+
+
 litellm_settings:
   # set_verbose: True # Uncomment this if you want to see verbose logs; not recommended in production
   drop_params: True

tests/local_testing/test_router_fallback_handlers.py (−39)

@@ -25,7 +25,6 @@

 from litellm.router_utils.fallback_event_handlers import (
     run_async_fallback,
-    run_sync_fallback,
     log_success_fallback_event,
     log_failure_fallback_event,
 )
@@ -109,44 +108,6 @@ async def test_run_async_fallback(original_function):
     assert isinstance(result, litellm.EmbeddingResponse)


-@pytest.mark.parametrize("original_function", [router._completion, router._embedding])
-def test_run_sync_fallback(original_function):
-    litellm.set_verbose = True
-    fallback_model_group = ["gpt-4"]
-    original_model_group = "gpt-3.5-turbo"
-    original_exception = litellm.exceptions.InternalServerError(
-        message="Simulated error",
-        llm_provider="openai",
-        model="gpt-3.5-turbo",
-    )
-
-    request_kwargs = {
-        "mock_response": "hello this is a test for run_async_fallback",
-        "metadata": {"previous_models": ["gpt-3.5-turbo"]},
-    }
-
-    if original_function == router._embedding:
-        request_kwargs["input"] = "hello this is a test for run_async_fallback"
-    elif original_function == router._completion:
-        request_kwargs["messages"] = [{"role": "user", "content": "Hello, world!"}]
-    result = run_sync_fallback(
-        router,
-        original_function=original_function,
-        num_retries=1,
-        fallback_model_group=fallback_model_group,
-        original_model_group=original_model_group,
-        original_exception=original_exception,
-        **request_kwargs
-    )
-
-    assert result is not None
-
-    if original_function == router._completion:
-        assert isinstance(result, litellm.ModelResponse)
-    elif original_function == router._embedding:
-        assert isinstance(result, litellm.EmbeddingResponse)
-
-
 class CustomTestLogger(CustomLogger):
     def __init__(self):
         super().__init__()

tests/local_testing/test_router_fallbacks.py (+51)

@@ -1604,3 +1604,54 @@ def test_fallbacks_with_different_messages():
     )

     print(resp)
+
+
+@pytest.mark.parametrize("expected_attempted_fallbacks", [0, 1, 3])
+@pytest.mark.asyncio
+async def test_router_attempted_fallbacks_in_response(expected_attempted_fallbacks):
+    """
+    Test that the router returns the correct number of attempted fallbacks in the response
+
+    - Test cases: works on first try, `x-litellm-attempted-fallbacks` is 0
+    - Works on 1st fallback, `x-litellm-attempted-fallbacks` is 1
+    - Works on 3rd fallback, `x-litellm-attempted-fallbacks` is 3
+    """
+    router = Router(
+        model_list=[
+            {
+                "model_name": "working-fake-endpoint",
+                "litellm_params": {
+                    "model": "openai/working-fake-endpoint",
+                    "api_key": "my-fake-key",
+                    "api_base": "https://exampleopenaiendpoint-production.up.railway.app",
+                },
+            },
+            {
+                "model_name": "badly-configured-openai-endpoint",
+                "litellm_params": {
+                    "model": "openai/my-fake-model",
+                    "api_base": "https://exampleopenaiendpoint-production.up.railway.appzzzzz",
+                },
+            },
+        ],
+        fallbacks=[{"badly-configured-openai-endpoint": ["working-fake-endpoint"]}],
+    )
+
+    if expected_attempted_fallbacks == 0:
+        resp = router.completion(
+            model="working-fake-endpoint",
+            messages=[{"role": "user", "content": "Hey, how's it going?"}],
+        )
+        assert (
+            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
+            == expected_attempted_fallbacks
+        )
+    elif expected_attempted_fallbacks == 1:
+        resp = router.completion(
+            model="badly-configured-openai-endpoint",
+            messages=[{"role": "user", "content": "Hey, how's it going?"}],
+        )
+        assert (
+            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
+            == expected_attempted_fallbacks
+        )
