oom management doc (deepjavalibrary#926)

david-sitsky · Jul 10, 2023 · 99d25b8 · 99d25b8
1 parent 882ac89
commit 99d25b8
Showing 1 changed file with 21 additions and 0 deletions.
diff --git a/serving/docs/out_of_memory_management.md b/serving/docs/out_of_memory_management.md
@@ -0,0 +1,21 @@
+# OutofMemory handling in djl-serving
+
+This document explains properties that can be configured in djl-serving to better handle OutOfMemory exceptions.
+
+The following properties can be configured in `serving.properties` file per each model
+
+* `required_memory_mb`: Required memory for CPU and GPU in MB to load the model. GPU required memory can be overridden by setting `gpu.required_memory_mb`.
+* `gpu.required_memory_mb`: Required GPU memory in MB to load the model. This allows user to set a different value for GPU required memory from CPU required memory. If this is not specified, `required_memory_mb` will be used for GPU as well if specified.
+* `reserved_memory_mb`: Memory to reserve in MB in addition to required memory to account for inference memory costs for CPU and GPU.
+* `gpu.reserved_memory_mb`: GPU memory to reserve in MB in addition to required memory to account for inference memory costs. This allows user to set a different value for GPU reserved memory from CPU reserved memory. If this is not specified, `reserved_memory_mb` will be used for GPU as well if specified.
+
+
+djl-serving will use `required_memory_mb`  and `reserved_memory_mb` to decide whether a model can be loaded and successful inference request can run. djl-serving will fetch free memory available on CPU and GPU and check whether free memory is greater than `required_memory_mb` plus `reserved_memory_mb` . If djl-serving cannot load the model due to inadequate free memory, it throws HTTP `507` error facilitating clients to handle the error for e.g by unloading few models and re-trying.
+
+This approach helps us with:
+
+* Failing fast without needing to download the model from an external repository
+* Prevents the need to create backend process and eventually leading to killed process
+
+
+In addition to user configurable properties, djl-serving’s python engine handles exceptions of types `OutOfMemoryError` (e.g `torch.cuda.OutOfMemoryError`), `MemoryError`  during both load and inference time and returns HTTP `507` error. Out of memory exception handling is best effort from djl-serving and there’s risk of python process getting killed already or the memory cannot freed correctly.