forked from deepjavalibrary/djl-serving
Adds streaming docs (deepjavalibrary#1017)
This adds docs for the use of streaming. It also makes some updates to the plugin docs.
Showing 10 changed files with 220 additions and 7 deletions.
# DJL Serving - Cache Plugin

Allows the model server to use additional cache engine types for asynchronous requests and pagination:

- DynamoDB Cache
- S3 Cache

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
# DJL Serving - Static File Plugin

Allows the model server to also serve static files.

## Instructions

Instructions for using plugins can be found on the [main plugins page](../../serving/docs/plugin_management.md).
# DJL Serving Async and Caching

DJL Serving supports asynchronous requests and request caching. Asynchronous requests can be used when the model is large and responses may otherwise time out. An asynchronous request completes immediately without timeout concerns, and later requests can be used to retrieve the result or check whether it is complete. The cache choice is global and can be set using environment variables or system properties (see below).
It is also possible to use this as an LRU cache to avoid recomputing common inputs. To enable this, apply the multi-tenant cache configuration (see below). This use case is currently experimental.

### Initial Request

To run an asynchronous request, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this with SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax): the request attribute would be `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`, and the corresponding response is shown below.
```sh
curl -v -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -H "x-synchronous: false" \
  -d "..."
```

```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```
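The token can arrive either as a plain `x-next-token` header or packed inside the SageMaker custom-attributes header. A minimal Python sketch of extracting it (the helper name is illustrative, and it assumes custom attributes are semicolon-separated `key=value` pairs):

```python
# Hypothetical helper: pull the "x-next-token" value out of a response's
# headers, whether it arrives directly or inside the SageMaker
# X-Amzn-SageMaker-Custom-Attributes header (assumed here to hold
# semicolon-separated key=value pairs).
def extract_next_token(headers):
    lowered = {k.lower(): v for k, v in headers.items()}
    if "x-next-token" in lowered:
        return lowered["x-next-token"]
    attrs = lowered.get("x-amzn-sagemaker-custom-attributes", "")
    for pair in attrs.split(";"):
        key, _, value = pair.strip().partition("=")
        if key == "x-next-token":
            return value
    return None
```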
### Data Requests

After the initial request queues the job, you can use the following requests to access the data by passing the token in the `x-starting-token` header. Here are examples both directly and for SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

If the result is not yet available, you will receive a response with HTTP code `202`.
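A client can poll until the `202` responses stop. The sketch below keeps the HTTP call injectable so it stays self-contained; in practice `fetch` would POST to `/invocations` with the `x-starting-token` header:

```python
import time

# Hypothetical polling loop: retry the data request while the server answers
# 202 (result not ready). "fetch" is injected for illustration; a real client
# would issue the curl-style request shown above and return (status, body).
def poll_result(fetch, token, interval=1.0, max_attempts=30):
    for _ in range(max_attempts):
        status, body = fetch(token)
        if status != 202:  # 200 carries the (possibly partial) result
            return body
        time.sleep(interval)
    raise TimeoutError("result was not ready in time")
```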
## Cache Configuration

There are a number of options that can be used to configure the cache. In DJL, cache support is provided by implementations of `CacheEngine`. The first choice is which `CacheEngine` to use. The `MemoryCacheEngine` is the default and the only one that is always available; the other cache engines require the [DJL Serving Cache Plugin](http://docs.djl.ai/docs/serving/plugins/cache/index.html).

Keep in mind that if you run DJL Serving in a horizontally scaling service such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must share the same cache. This means you must use one of the external cache variants such as Amazon DynamoDB or Amazon S3.

There are also several properties (both environment variables and system properties) that apply to all engines:
`SERVING_CACHE_MULTITENANT` (default false) indicates whether to use a multi-tenant cache with keys based on the hash of the input. With this enabled, two users with the same input share the same cache entry, allowing the later request to avoid recomputation. The alternative assigns a UUID to each cache entry, ensuring entries are unique. This feature is still experimental.
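The two key strategies can be sketched as follows (names are illustrative, not the actual DJL Serving internals):

```python
import hashlib
import uuid

# Illustrative sketch of the two keying strategies described above.
# Multi-tenant keys hash the input, so identical inputs map to the same
# cache entry; the default strategy assigns a fresh UUID per request,
# so entries are never shared.
def cache_key(payload: bytes, multi_tenant: bool) -> str:
    if multi_tenant:
        return hashlib.sha256(payload).hexdigest()
    return str(uuid.uuid4())
```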
`SERVING_CACHE_BATCH` (default 1) indicates how many streaming items to store in the cache at once. It can be used to reduce writes for granular streaming such as text generation. For example, setting it to 5 with DynamoDB means each database save would include 5 characters instead of 1.
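The batching idea amounts to grouping a granular stream into chunks before each cache write, roughly like this sketch (not the actual implementation):

```python
# Illustrative sketch of the SERVING_CACHE_BATCH behavior: group a granular
# stream (e.g. one character per item) so each cache write stores batch_size
# items instead of one.
def batch_stream(items, batch_size):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield "".join(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield "".join(batch)
```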
### Memory Cache

The memory cache is the default cache and stores data in memory on the machine (in the JVM). It is not suitable for use cases with horizontal scaling. It supports the following properties (environment variables and system properties):

`SERVING_MEMORY_CACHE_CAPACITY` (default none) provides an optional maximum capacity for the cache. When set, the cache follows an LRU eviction strategy.
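The capacity-bounded LRU behavior can be sketched like this (a minimal illustration, not the actual `MemoryCacheEngine` implementation):

```python
from collections import OrderedDict

# Minimal sketch of an in-memory cache with an optional capacity, evicting
# the least recently used entry once full (mirroring what setting
# SERVING_MEMORY_CACHE_CAPACITY enables).
class MemoryCache:
    def __init__(self, capacity=None):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if self.capacity is not None and len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        return None
```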
### DDB Cache

The DDB cache is based on [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) and requires the cache plugin. It can be used for horizontal scaling and is recommended for smaller outputs such as text. It supports the following properties (environment variables and system properties):

`SERVING_DDB_CACHE` can be set to "true" to use the DDB cache.

`SERVING_DDB_TABLE_NAME` (default "djl-serving-pagination-table") sets the table to use for the cache.

A few final notes: the default `SERVING_CACHE_BATCH` when using the DDB cache is 5, and the DDB cache does not support multi-tenant mode.
### S3 Cache

The S3 cache is based on [Amazon S3](https://aws.amazon.com/s3/) and requires the cache plugin. It can be used for horizontal scaling and is recommended for larger outputs such as images. It supports the following properties (environment variables and system properties):

`SERVING_S3_CACHE` can be set to "true" to use the S3 cache.

`SERVING_S3_CACHE_BUCKET` (required) sets the name of the bucket to use.

`SERVING_S3_CACHE_KEY_PREFIX` (default "") sets a prefix for the caching path in the bucket. For example, a prefix of "serving/cache/" with an entry "xxx" would give the entry the combined path "serving/cache/xxx". It can be used to reuse a bucket or to share a bucket with other use cases.

`SERVING_S3_CACHE_AUTOCREATE` (default false) can be set to "true" to automatically create the S3 bucket if it does not exist.
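How the bucket and prefix settings combine into an object location can be sketched as follows (illustrative only; the real plugin's key layout may differ):

```python
import os

# Illustrative sketch: resolve the S3 bucket and object key for a cache
# entry from the environment variables described above.
def s3_cache_location(entry_id):
    bucket = os.environ["SERVING_S3_CACHE_BUCKET"]              # required
    prefix = os.environ.get("SERVING_S3_CACHE_KEY_PREFIX", "")  # default ""
    return bucket, prefix + entry_id
```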
# DJL Serving Streaming

When a model is run, the response is typically returned all at once. However, some cases benefit from returning the response as it is generated. For example, an LLM can return characters as they are generated. This allows for more dynamic UIs that feel faster, as users can read the output as it is written rather than waiting for it to finish.
## Model Support

The first step to supporting streaming is to use a model that supports it.

For a model using Python, see the [Streaming Python configuration guide](streaming_config.md). This provides instructions for modifying your `handle()` function to use the streaming output. After the model is modified to support streaming, you must also add `option.enable_streaming=true` to the `serving.properties` to enable the streaming support.

For a Java model, it must have a Block implementing [`StreamingBlock`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingBlock.html) and a Translator implementing the async [`StreamingTranslator`](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/streaming/StreamingTranslator.html). Right now, this is available through the DJL API but is not yet in DJL Serving.
## HTTP Streaming

The simplest way to support streaming is with [HTTP Chunked Encoding](https://en.wikipedia.org/wiki/Chunked_transfer_encoding). This allows HTTP/1 to send data back in chunks rather than all at once. You can access support for this through whatever HTTP API you are using to make the request. As an example, see how it is handled by the [JavaScript fetch Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API/Using_readable_streams).
## Pagination Streaming

It is also possible to implement streaming using pagination. The main use case for pagination is when HTTP streaming is not available. For example, if you have a proxy in front of the model server, the proxy must also support streaming. For proxies that do not, such as Amazon SageMaker (currently), pagination enables streaming.

The support for pagination is built on top of the [DJL Serving Caching Support](cache.md). It works by having the first request run asynchronously and stream the results to the cache, returning a token to access the results. Subsequent requests using the token return the elements from the cache. See the diagram below:
```
// Request 1 streams to the cache and returns the access token
Request 1 [input data] -------> Cache
          Token <-
// Following requests return the currently available data from the cache
Request 2+ [token, start]
          Partial Output <----------- Cache
```

For good results, it is important to properly configure the cache. The available configuration options can be found on the [cache page](cache.md). Keep in mind that if you run DJL Serving in a horizontally scaling service such as Amazon Elastic Container Service, Amazon SageMaker, or Kubernetes, the instances must either share the same cache or route requests from the same user to the same instance of DJL Serving. To share the same cache, you must use one of the external cache variants such as Amazon DynamoDB or Amazon S3.
### Initial Request

To run a request with pagination, pass the request header `x-synchronous: false`. The response will then include the header `x-next-token`. Note that when using this with SageMaker, you will need to use [X-Amzn-SageMaker-Custom-Attributes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html#API_runtime_InvokeEndpoint_ResponseSyntax): the request attribute would be `X-Amzn-SageMaker-Custom-Attributes: x-synchronous=false`, and the corresponding response is shown below.

```sh
curl -v -X POST "http://localhost:8080/invocations" \
  -H "content-type: application/json" \
  -H "x-synchronous: false" \
  -d "..."
```
```
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> content-type: application/json
> x-synchronous: false
> Content-Length: 73
>
* upload completely sent off: 73 out of 73 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-Amzn-SageMaker-Custom-Attributes: x-next-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-next-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d
< x-request-id: 0e805b7d-f1b6-49af-a595-d8dc35aee338
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 0
< connection: keep-alive
<
* Connection #0 to host localhost left intact
```
### Data Requests

After the initial request queues the job, you can use the following requests to access the data by passing the token in the `x-starting-token` header. Here are examples both directly and for SageMaker:

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "x-starting-token: df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```

```sh
curl -v -X POST "http://127.0.0.1:8080/invocations" \
  -H "X-Amzn-SageMaker-Custom-Attributes: x-starting-token=df87d942-7f39-48c8-a9af-d4e762a2ab1d"
```
The response will include the streamed data computed so far, concatenated together. It may also include the response header `x-next-token`, which can be used to retrieve the following page of computed results. If the `x-next-token` header is not present in the response, the computation has finished and the last of the data was sent in that response.

Depending on the size of the results, it may be difficult to get all of them at once. In that case, pass the `x-max-items` header to limit the number of items streamed back per request. If the maximum number of items is not available at the time of the request, the server sends as many as have already been computed. Note that these items count "streamed elements" from the model, not bytes; the number of bytes per item varies by model.