forked from deepjavalibrary/djl-serving
Adds streaming docs (deepjavalibrary#1017)
This adds docs for the use of streaming. It also makes some updates to the plugin docs.
Showing 10 changed files with 220 additions and 7 deletions.
# DJL Serving - Cache Plugin

Allows the model server to use additional cache engine types for asynchronous requests and pagination:

- DynamoDB Cache
- S3 Cache

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
# DJL Serving - Static File Plugin

Allows the model server to also serve static files.

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
# DJL Serving Async and Caching

DJL Serving supports asynchronous requests and request caching. Asynchronous requests can be used when the model is large and responses may otherwise time out. An asynchronous request completes immediately without timeout concerns, and later requests can be used to retrieve the result or check whether it is complete. The cache choice is global and can be set using environment variables or system properties (see below).
It is also possible to use this as an LRU cache to avoid recomputing common inputs. To enable this, apply the multi-tenant cache configuration (see below). This use case is currently experimental.

### Initial Request

To run an asynchronous request, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this with SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax): the request attribute would be `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`, and the corresponding response is shown below.
```sh
curl -v -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -H "x-synchronous: false" \
  -d "..."
```

```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```
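The token can arrive either as a plain `x-next-token` header or packed inside the SageMaker custom-attributes header. A minimal Python sketch of extracting it (the helper name is illustrative, and it assumes custom attributes are semicolon-separated `key=value` pairs):

```python
# Hypothetical helper: pull the "x-next-token" value out of a response's
# headers, whether it arrives directly or inside the SageMaker
# X-Amzn-SageMaker-Custom-Attributes header (assumed here to hold
# semicolon-separated key=value pairs).
def extract_next_token(headers):
    lowered = {k.lower(): v for k, v in headers.items()}
    if "x-next-token" in lowered:
        return lowered["x-next-token"]
    attrs = lowered.get("x-amzn-sagemaker-custom-attributes", "")
    for pair in attrs.split(";"):
        key, _, value = pair.strip().partition("=")
        if key == "x-next-token":
            return value
    return None
```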
### Data Requests

After the initial request queues the job, you can use the following requests to access the data by passing the token in the `x-starting-token` header. Here are examples both directly and for SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

If the result is not yet available, you will receive a response with HTTP code `202`.
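A client can poll until the `202` responses stop. The sketch below keeps the HTTP call injectable so it stays self-contained; in practice `fetch` would POST to `/invocations` with the `x-starting-token` header:

```python
import time

# Hypothetical polling loop: retry the data request while the server answers
# 202 (result not ready). "fetch" is injected for illustration; a real client
# would issue the curl-style request shown above and return (status, body).
def poll_result(fetch, token, interval=1.0, max_attempts=30):
    for _ in range(max_attempts):
        status, body = fetch(token)
        if status != 202:  # 200 carries the (possibly partial) result
            return body
        time.sleep(interval)
    raise TimeoutError("result was not ready in time")
```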
## Cache Configuration

There are a number of options that can be used to configure the cache. In DJL, cache support is provided by implementations of `CacheEngine`. The first choice is which `CacheEngine` to use. The `MemoryCacheEngine` is the default and the only one that is always available; the other cache engines require the [DJL Serving Cache Plugin](http://docs.djl.ai/docs/serving/plugins/cache/index.html).

Keep in mind that if you run DJL Serving in a horizontally scaling service such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must share the same cache. This means you must use one of the external cache variants such as Amazon DynamoDB or Amazon S3.

There are also several properties (both environment variables and system properties) that apply to all engines:
`SERVING_CACHE_MULTITENANT` (default false) indicates whether to use a multi-tenant cache with keys based on the hash of the input. With this enabled, two users with the same input share the same cache entry, allowing the later request to avoid recomputation. The alternative assigns a UUID to each cache entry, ensuring entries are unique. This feature is still experimental.
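The two key strategies can be sketched as follows (names are illustrative, not the actual DJL Serving internals):

```python
import hashlib
import uuid

# Illustrative sketch of the two keying strategies described above.
# Multi-tenant keys hash the input, so identical inputs map to the same
# cache entry; the default strategy assigns a fresh UUID per request,
# so entries are never shared.
def cache_key(payload: bytes, multi_tenant: bool) -> str:
    if multi_tenant:
        return hashlib.sha256(payload).hexdigest()
    return str(uuid.uuid4())
```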
`SERVING_CACHE_BATCH` (default 1) indicates how many streaming items to store in the cache at once. It can be used to reduce writes for granular streaming such as text generation. For example, setting it to 5 with DynamoDB means each database save would include 5 characters instead of 1.
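The batching idea amounts to grouping a granular stream into chunks before each cache write, roughly like this sketch (not the actual implementation):

```python
# Illustrative sketch of the SERVING_CACHE_BATCH behavior: group a granular
# stream (e.g. one character per item) so each cache write stores batch_size
# items instead of one.
def batch_stream(items, batch_size):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield "".join(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield "".join(batch)
```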
### Memory Cache

The memory cache is the default cache and stores data in memory on the machine (in the JVM). It is not suitable for use cases with horizontal scaling. It supports the following properties (environment variables and system properties):

`SERVING_MEMORY_CACHE_CAPACITY` (default none) provides an optional maximum capacity for the cache. When set, the cache follows an LRU eviction strategy.
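The capacity-bounded LRU behavior can be sketched like this (a minimal illustration, not the actual `MemoryCacheEngine` implementation):

```python
from collections import OrderedDict

# Minimal sketch of an in-memory cache with an optional capacity, evicting
# the least recently used entry once full (mirroring what setting
# SERVING_MEMORY_CACHE_CAPACITY enables).
class MemoryCache:
    def __init__(self, capacity=None):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if self.capacity is not None and len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        return None
```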
### DDB Cache

The DDB cache is based on [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) and requires the cache plugin. It can be used for horizontal scaling and is recommended for smaller outputs such as text. It supports the following properties (environment variables and system properties):

`SERVING_DDB_CACHE` can be set to "true" to use the DDB cache.

`SERVING_DDB_TABLE_NAME` (default "djl-serving-pagination-table") sets the table to use for the cache.

A few final notes: the default `SERVING_CACHE_BATCH` when using the DDB cache is 5, and the DDB cache does not support multi-tenant mode.
### S3 Cache

The S3 cache is based on [Amazon S3](https://aws.amazon.com/s3/) and requires the cache plugin. It can be used for horizontal scaling and is recommended for larger outputs such as images. It supports the following properties (environment variables and system properties):

`SERVING_S3_CACHE` can be set to "true" to use the S3 cache.

`SERVING_S3_CACHE_BUCKET` (required) sets the name of the bucket to use.

`SERVING_S3_CACHE_KEY_PREFIX` (default "") sets a prefix for the caching path in the bucket. For example, a prefix of "serving/cache/" with an entry "xxx" would give the entry the combined path "serving/cache/xxx". It can be used to reuse a bucket or to share a bucket with other use cases.

`SERVING_S3_CACHE_AUTOCREATE` (default false) can be set to "true" to automatically create the S3 bucket if it does not exist.
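How the bucket and prefix settings combine into an object location can be sketched as follows (illustrative only; the real plugin's key layout may differ):

```python
import os

# Illustrative sketch: resolve the S3 bucket and object key for a cache
# entry from the environment variables described above.
def s3_cache_location(entry_id):
    bucket = os.environ["SERVING_S3_CACHE_BUCKET"]              # required
    prefix = os.environ.get("SERVING_S3_CACHE_KEY_PREFIX", "")  # default ""
    return bucket, prefix + entry_id
```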
# DJL Serving Streaming

When a model is run, the response is typically returned all at once. However, some cases benefit from returning the response as it is generated. For example, an LLM can return characters as they are generated. This allows for more dynamic UIs that feel faster, as users can read the output as it is written rather than waiting for it to finish.
## Model Support

The first step to supporting streaming is to use a model that supports it.

For a model using Python, see the [Streaming Python configuration guide](streaming_config.md). This provides instructions for modifying your `handle()` function to use the streaming output. After the model is modified to support streaming, you must also add `option.enable_streaming=true` to the `serving.properties` to enable the streaming support.

For a Java model, it must have a Block implementing [`StreamingBlock`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingBlock.html) and a Translator implementing the async [`StreamingTranslator`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingTranslator.html). Right now, this is available through the DJL API but is not yet in DJL Serving.
## HTTP Streaming

The simplest way to support streaming is with [HTTP Chunked Encoding](https://en.wikipedia.org/wiki/Chunked_transfer_encoding). This allows HTTP/1 to send data back in chunks rather than all at once. You can access support for this through whatever HTTP API you are using to make the request. As an example, see how it is handled by the [JavaScript fetch Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API/Using_readable_streams).
## Pagination Streaming

It is also possible to implement streaming using pagination. The main use case for pagination is when HTTP streaming is not available. For example, if you have a proxy in front of the model server, the proxy must also support streaming. For proxies that do not, such as Amazon SageMaker (currently), pagination enables streaming.

The support for pagination is built on top of the [DJL Serving Caching Support](cache.md). It works by having the first request run asynchronously and stream the results to the cache, returning a token to access the results. Subsequent requests using the token return the elements from the cache. See the diagram below:
```
// Request 1 streams to the cache and returns the access token
Request 1 [input data] -------> Cache
          Token <-
// Following requests return the currently available data from the cache
Request 2+ [token, start]
          Partial Output <----------- Cache
```

For good results, it is important to properly configure the cache. The available configuration options can be found on the [cache page](cache.md). Keep in mind that if you run DJL Serving in a horizontally scaling service such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must either share the same cache or route requests from the same user to the same instance of DJL Serving. To share the same cache, you must use one of the external cache variants such as Amazon DynamoDB or Amazon S3.
### Initial Request

To run a request with pagination, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this with SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax): the request attribute would be `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`, and the corresponding response is shown below.

```sh
curl -v -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -H "x-synchronous: false" \
  -d "..."
```
```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```
### Data Requests

After the initial request queues the job, you can use the following requests to access the data by passing the token in the `x-starting-token` header. Here are examples both directly and for SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```
The response will include the streamed data computed so far, concatenated together. It may also include the response header `x-next-token`, which can be used to retrieve the following page of computed results. If the `x-next-token` header is not present in the response, the computation has finished and the last of the data was sent in that response.

Depending on the size of the results, it may be difficult to get all of them at once. In that case, pass the `x-max-items` header to limit the number of items streamed back per request. If the maximum number of items is not available at the time of the request, the server sends as many as have already been computed. Note that these items count "streamed elements" from the model, not bytes; the number of bytes per item varies by model.