Skip to content

Commit

Permalink
oom management doc (deepjavalibrary#926)
Browse files Browse the repository at this point in the history
  • Loading branch information
rohithkrn authored Jul 10, 2023
1 parent 882ac89 commit 99d25b8
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions serving/docs/out_of_memory_management.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# OutofMemory handling in djl-serving

This document explains properties that can be configured in djl-serving to better handle OutOfMemory exceptions.

The following properties can be configured in `serving.properties` file per each model

* `required_memory_mb`: Required memory for CPU and GPU in MB to load the model. GPU required memory can be overridden by setting `gpu.required_memory_mb`.
* `gpu.required_memory_mb`: Required GPU memory in MB to load the model. This allows user to set a different value for GPU required memory from CPU required memory. If this is not specified, `required_memory_mb` will be used for GPU as well if specified.
* `reserved_memory_mb`: Memory to reserve in MB in addition to required memory to account for inference memory costs for CPU and GPU.
* `gpu.reserved_memory_mb`: GPU memory to reserve in MB in addition to required memory to account for inference memory costs. This allows user to set a different value for GPU reserved memory from CPU reserved memory. If this is not specified, `reserved_memory_mb` will be used for GPU as well if specified.


djl-serving will use `required_memory_mb` and `reserved_memory_mb` to decide whether a model can be loaded and successful inference request can run. djl-serving will fetch free memory available on CPU and GPU and check whether free memory is greater than `required_memory_mb` plus `reserved_memory_mb` . If djl-serving cannot load the model due to inadequate free memory, it throws HTTP `507` error facilitating clients to handle the error for e.g by unloading few models and re-trying.

This approach helps us with:

* Failing fast without needing to download the model from an external repository
* Prevents the need to create backend process and eventually leading to killed process


In addition to user configurable properties, djl-serving’s python engine handles exceptions of types `OutOfMemoryError` (e.g `torch.cuda.OutOfMemoryError`), `MemoryError` during both load and inference time and returns HTTP `507` error. Out of memory exception handling is best effort from djl-serving and there’s risk of python process getting killed already or the memory cannot freed correctly.

0 comments on commit 99d25b8

Please sign in to comment.