Adds streaming docs (deepjavalibrary#1017)
This adds docs for the use of streaming. It also makes some updates to the
plugin docs.
zachgk authored Aug 14, 2023
1 parent 4bc70f7 commit c4a9db9
Showing 10 changed files with 220 additions and 7 deletions.
8 changes: 6 additions & 2 deletions plugins/cache/README.md
@@ -1,6 +1,10 @@
-# DJL Serving - Cache Paginator Plugin
+# DJL Serving - Cache Plugin

-Allows the model server to use additional cache engine types:
+Allows the model server to use additional cache engine types for asynchronous requests and pagination:

- DynamoDB Cache
- S3 Cache

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
@@ -84,6 +84,16 @@ public S3CacheEngine(
        this.bucket = bucket;
        this.keyPrefix = keyPrefix;

        if (bucket == null) {
            throw new IllegalStateException(
                    "When using the S3CacheEngine, the bucket can't be null or missing. Try setting"
                            + " SERVING_S3_CACHE_BUCKET.");
        }

        if (keyPrefix == null) {
            this.keyPrefix = "";
        }

        if (asyncClient == null) {
            asyncClient = S3AsyncClient.builder().build();
        }
6 changes: 5 additions & 1 deletion plugins/kserve/README.md
@@ -19,4 +19,8 @@
`POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer`

### Reference from KServe
See [KServe Requirements](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md)

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
4 changes: 4 additions & 0 deletions plugins/management-console/README.md
@@ -21,3 +21,7 @@ npm run build

### Customize configuration
See [Configuration Reference](https://cli.vuejs.org/config/).

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
6 changes: 5 additions & 1 deletion plugins/plugin-management-plugin/README.md
@@ -6,4 +6,8 @@

`GET /plugins`

Returns a list of the currently added plugins

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
6 changes: 5 additions & 1 deletion plugins/static-file-plugin/README.md
@@ -1,3 +1,7 @@
# DJL Serving - Static File Plugin

Allows the model server to also serve static files

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
96 changes: 96 additions & 0 deletions serving/docs/cache.md
@@ -0,0 +1,96 @@
# DJL Serving Async and Caching

DJL Serving supports asynchronous requests and request caching. Asynchronous requests are useful when the model is large and computing the response could otherwise exceed request timeouts. An asynchronous request completes immediately without timeout concerns, and later requests can be used to get the result or check whether it has completed. The cache choice is global and can be set using environment variables or system properties (see below).

It is also possible to use this as an LRU cache to avoid recomputing common inputs. To enable this, apply the multi-tenant cache configuration (see below). This use case is currently experimental.

## Initial Request

To run an asynchronous request, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this as part of SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax). In that case, the request header becomes `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`; the corresponding response is shown below.

```sh
curl -v -X POST "http://localhost:8080/invocations" \
-H "content-type: application/json" \
-H "x-synchronous: false" \
-d "..."
```

```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```

## Data Requests

After the initial request queues the job, you can use the following requests to access the data. This is done by passing the `x-starting-token` as a header. Here are examples, both direct and through SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
-H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
-H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

If the result is not yet available, you will receive a response with HTTP code `202`.
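The request/poll flow above can be sketched as a small client loop. This is an illustrative sketch only, not DJL Serving client code; `post` is a hypothetical stand-in for an HTTP call such as the curl requests shown above.

```python
import time

def poll_for_result(post, token, interval=0.5, max_attempts=20):
    """Poll with the starting token until the async result is ready.

    `post` is any callable taking a headers dict and returning
    (status_code, body); in practice it would wrap an HTTP POST
    to the /invocations endpoint.
    """
    for _ in range(max_attempts):
        status, body = post({"x-starting-token": token})
        if status != 202:  # 202 means the result is not yet available
            return body
        time.sleep(interval)
    raise TimeoutError("result not ready after polling")

# Simulated server: two polls return 202, then the finished result.
responses = iter([(202, ""), (202, ""), (200, "model output")])
result = poll_for_result(
    lambda headers: next(responses),
    "df87d942-7f39-48c8-a9af-d4e762a2ab1d",
    interval=0)
print(result)  # model output
```

A real client would replace the lambda with an HTTP library call and keep a nonzero polling interval.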

## Cache Configuration

In addition, there are a number of options that can be used to configure the cache. In DJL, cache support is enabled by an implementation of the `CacheEngine`. The first option to choose is which `CacheEngine` to use. The `MemoryCacheEngine` is the default and the only one that is always available. The other cache engines require the use of the [DJL Serving Cache Plugin](http://docs.djl.ai/docs/serving/plugins/cache/index.html).

Keep in mind that if you use a horizontally scaling service with DJL Serving such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must share the same cache. This means you must enable one of the external cache variants such as Amazon DynamoDB or Amazon S3.

There are also several properties (both environment variables and system properties) that apply to all engines:

`SERVING_CACHE_MULTITENANT` (default false) indicates whether to use a multi-tenant cache with keys based on the hash of the input. With this, two users with the same input share the same cache entry, allowing the later one to avoid recomputing it. The alternative uses a UUID for each cache entry, ensuring they are unique. This feature is still experimental.
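The two key strategies can be illustrated as follows. This is a hypothetical sketch of the idea, not the actual `CacheEngine` key logic.

```python
import hashlib
import uuid

def multi_tenant_key(payload: bytes) -> str:
    # Key derived from a hash of the input: identical inputs share one entry.
    return hashlib.sha256(payload).hexdigest()

def unique_key(payload: bytes) -> str:
    # A fresh UUID per request: every cache entry is distinct.
    return str(uuid.uuid4())

# Two users sending the same input map to the same cache entry...
assert multi_tenant_key(b'{"inputs": "hi"}') == multi_tenant_key(b'{"inputs": "hi"}')
# ...while the UUID scheme never reuses an entry.
assert unique_key(b'{"inputs": "hi"}') != unique_key(b'{"inputs": "hi"}')
```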

`SERVING_CACHE_BATCH` (default 1) indicates how many streaming items to store in the cache at once. It can be used to reduce writes for granular streaming like text generation. For example, setting it to 5 with DDB would mean each DB save would include 5 characters instead of 1.
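The effect of the batch setting can be sketched as grouping streamed items before each write; `batched_writes` here is a hypothetical illustration, not the cache engine's actual write path.

```python
def batched_writes(stream, batch_size):
    """Group streamed items so each cache write carries up to batch_size items."""
    writes, buffer = [], []
    for item in stream:
        buffer.append(item)
        if len(buffer) == batch_size:
            writes.append("".join(buffer))  # one cache write per full batch
            buffer = []
    if buffer:
        writes.append("".join(buffer))  # flush the trailing partial batch
    return writes

# With a batch size of 5, 12 generated characters become 3 writes instead of 12.
print(batched_writes("hello world!", 5))  # ['hello', ' worl', 'd!']
```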

### Memory Cache

The memory cache is the default cache that stores data in memory on the machine (JVM). It is not suitable for use cases with horizontal scaling. It supports the following properties (environment variables and system properties):

`SERVING_MEMORY_CACHE_CAPACITY` (default none) provides an optional maximum capacity for the cache. When set, the cache follows an LRU eviction strategy.
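The capacity-bounded LRU behavior can be sketched with an `OrderedDict`; this mirrors the behavior described above and is not the actual `MemoryCacheEngine` implementation.

```python
from collections import OrderedDict

class LruSketch:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)         # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
        return self.entries.get(key)

cache = LruSketch(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touching "a" makes "b" the eviction candidate
cache.put("c", 3)      # exceeds capacity, so "b" is evicted
print(cache.get("b"))  # None
```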

### DDB Cache

The DDB cache is based on [Amazon DynamoDB](https://aws.amazon.com/dynamodb/). This cache requires the cache plugin. It can be used for horizontal scaling and is recommended for smaller outputs like text. It supports the following properties (environment variables and system properties):

`SERVING_DDB_CACHE` can be set to "true" to use the DDB cache.

`SERVING_DDB_TABLE_NAME` (default "djl-serving-pagination-table") sets the table to use for the cache.

A few final notes: the default `SERVING_CACHE_BATCH` when using the DDB cache is 5, and the DDB cache does not support the multi-tenant mode.

### S3 Cache

The S3 cache is based on [Amazon S3](https://aws.amazon.com/s3/). This cache requires the cache plugin. It can be used for horizontal scaling and is recommended for larger outputs like images. It supports the following properties (environment variables and system properties):

`SERVING_S3_CACHE` can be set to "true" to use the S3 cache.

`SERVING_S3_CACHE_BUCKET` (required) sets the name of the bucket to use.

`SERVING_S3_CACHE_KEY_PREFIX` (default "") sets a prefix for the caching path in the bucket. For example, a prefix of "serving/cache/" with an entry "xxx" would make the entry have the combined path "serving/cache/xxx". It can be used to reuse a bucket or share a bucket with other use cases.

`SERVING_S3_CACHE_AUTOCREATE` (default false) can be set to "true" to automatically create the S3 bucket if it does not exist.
3 changes: 2 additions & 1 deletion serving/docs/plugin_management.md
@@ -3,7 +3,8 @@
## Available Plugins

- [KServe plugin](../../plugins/kserve/README.md) - KServe V2 Protocol support
-- [Management console](../../plugins/management-console/README.md) - DJL Management console UI
+- [Management console plugin](../../plugins/management-console/README.md) - DJL Management console UI
+- [Cache plugin](../../plugins/cache/README.md) - Provides additional options for caches
- [Static File plugin](../../plugins/static-file-plugin/README.md) - Allows DJL Serving to also serve static files
- [Plugin Management plugin](../../plugins/plugin-management-plugin/README.md) - Adds plugin management to the management API

86 changes: 86 additions & 0 deletions serving/docs/streaming.md
@@ -0,0 +1,86 @@
# DJL Serving Streaming

When a model is run, the response is typically returned all at once. However, some cases benefit from having the response returned as it is generated. For example, an LLM can return characters as they are generated. This allows for more dynamic UIs that feel faster, since users can read the output as it is written rather than waiting for the whole response.

## Model Support

The first step to supporting streaming is to use a model that supports it.

For a model using Python, see the [Streaming Python configuration guide](streaming_config.md). This provides instructions for modifying your `handle()` function to use the streaming output. After the model is modified to support streaming, you must also add `option.enable_streaming=true` to the `serving.properties` to enable the streaming support.

For a Java model, it must have a Block implementing [`StreamingBlock`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingBlock.html) and a Translator implementing the asynchronous [`StreamingTranslator`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingTranslator.html). Right now, this is available through the DJL API but is not yet in DJL Serving.

## HTTP Streaming

The simplest way to support streaming is with [HTTP Chunked Encoding](https://en.wikipedia.org/wiki/Chunked_transfer_encoding). This allows HTTP/1.1 to send back data in chunks rather than all at once. You can access support for this through whatever HTTP API you are using to make the request. As an example, you can see how it is handled by the [JavaScript fetch Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API/Using_readable_streams).
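Consuming a chunked response looks roughly like the following sketch. The `io.BytesIO` object stands in for the response body; a real HTTP client (for example Python's `http.client`, or fetch in JavaScript) exposes the same kind of incremental read as data arrives.

```python
import io

# Stand-in for a chunked HTTP response body arriving over the network.
response_body = io.BytesIO(b"Once upon a time, tokens arrived one by one.")

pieces = []
for chunk in iter(lambda: response_body.read(8), b""):
    pieces.append(chunk)  # render each chunk as soon as it arrives
print(b"".join(pieces).decode())
```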

## Pagination Streaming

It is also possible to implement streaming using pagination. The main use case for pagination is when HTTP streaming is not available. For example, if you have a proxy in front of the model server, the proxy must also support streaming. For those that do not, such as Amazon SageMaker (currently), pagination still enables streaming.

The support for pagination is built on top of the [DJL Serving Caching Support](cache.md). It works by having the first request run asynchronously and stream the results to the cache. It will then provide a token to access the results. Subsequent requests using the token will return the elements from the cache. See the diagram below:

```
// Request 1 streams to the cache and returns the access token
Request 1 [input data] -------> Cache
Token <-----------------------
// Following requests return the currently available data from the cache
Request 2+ [token, start] ----> Cache
Partial Output <--------------- Cache
```

For good results, it is important to properly configure the cache. The available configuration options can be found on the [cache page](cache.md). Keep in mind that if you use a horizontally scaling service with DJL Serving such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must either share the same cache or pin requests from the same user to the same instance of DJL Serving. To share a cache, you must enable one of the external cache variants such as Amazon DynamoDB or Amazon S3.

### Initial Request

To run a request with pagination, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this as part of SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax). In that case, the request header becomes `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`; the corresponding response is shown below.

```sh
curl -v -X POST "http://localhost:8080/invocations" \
-H "content-type: application/json" \
-H "x-synchronous: false" \
-d "..."
```

```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```

### Data Requests

After the initial request queues the job, you can use the following requests to access the data. This is done by passing the `x-starting-token` as a header. Here are examples, both direct and through SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
-H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
-H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

The response will include the streamed data computed so far, concatenated together. It may also include the response header `x-next-token`, which can be used to retrieve the following page of computed results. If the header `x-next-token` is not present in the response, computation has finished and the last of the data was sent in that response.

Depending on the size of the results, it may be difficult to get all of them at once. In that case, pass the `x-max-items` header to limit the number of items streamed back in a single response. If the maximum number of items is not available at the time of the request, it will send as many as have already been computed. Note that these items count streamed elements from the model, not bytes; the number of bytes per item depends on the model and can vary.
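Putting the two request types together, a pagination client is essentially a loop that follows `x-next-token` until it disappears. This is an illustrative sketch; `post` is a hypothetical stand-in for the curl calls shown above.

```python
def read_paginated(post, first_token):
    """Collect streamed output by following x-next-token across pages."""
    output, token = [], first_token
    while token is not None:
        headers, body = post({"x-starting-token": token})
        output.append(body)
        # A missing x-next-token header means the stream is complete.
        token = headers.get("x-next-token")
    return "".join(output)

# Simulated server returning three pages of computed results.
pages = iter([
    ({"x-next-token": "t2"}, "Once upon "),
    ({"x-next-token": "t3"}, "a time, "),
    ({}, "the end."),
])
print(read_paginated(lambda headers: next(pages), "t1"))  # Once upon a time, the end.
```

A real client would also pass `x-max-items` in the headers when it wants to bound the size of each page.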
2 changes: 1 addition & 1 deletion serving/docs/streaming_config.md
@@ -1,4 +1,4 @@
-# Streaming configuration
+# Streaming Python configuration

We explain various options that can be configured when using response streaming in [Python mode](modes.md#python-mode). Response streaming can be enabled in djl-serving by setting the `enable_streaming` option in the `serving.properties` file.

