forked from deepjavalibrary/djl-serving
oom management doc (deepjavalibrary#926)
Showing 1 changed file with 21 additions and 0 deletions.
@@ -0,0 +1,21 @@
# OutOfMemory handling in djl-serving

This document explains the properties that can be configured in djl-serving to better handle OutOfMemory exceptions.

The following properties can be configured per model in the `serving.properties` file:

* `required_memory_mb`: Memory, in MB, required to load the model on CPU and GPU. The GPU value can be overridden by setting `gpu.required_memory_mb`.
* `gpu.required_memory_mb`: GPU memory, in MB, required to load the model. This lets the user set a GPU requirement that differs from the CPU requirement. If it is not specified, `required_memory_mb` (when set) is used for the GPU as well.
* `reserved_memory_mb`: Memory, in MB, to reserve in addition to the required memory to account for inference memory costs on CPU and GPU.
* `gpu.reserved_memory_mb`: GPU memory, in MB, to reserve in addition to the required memory to account for inference memory costs. This lets the user set a GPU reservation that differs from the CPU reservation. If it is not specified, `reserved_memory_mb` (when set) is used for the GPU as well.
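
The fallback between the `gpu.`-prefixed and unprefixed properties can be illustrated with a minimal Python sketch. This is not djl-serving's implementation: `resolve_memory_mb` is a hypothetical helper, the properties are modeled as a plain dictionary of strings, and the values are made up for illustration.

```python
from typing import Dict, Optional


def resolve_memory_mb(props: Dict[str, str], key: str, device: str) -> Optional[int]:
    """Return the effective value of `key` for `device` ("cpu" or "gpu").

    Hypothetical helper: for the GPU, a `gpu.`-prefixed property wins;
    otherwise the unprefixed property is used. Returns None if neither is set.
    """
    if device == "gpu":
        gpu_value = props.get(f"gpu.{key}")
        if gpu_value is not None:
            return int(gpu_value)
    value = props.get(key)
    return int(value) if value is not None else None


# Example values only; they stand in for entries in a serving.properties file.
props = {
    "required_memory_mb": "2048",
    "gpu.required_memory_mb": "4096",
    "reserved_memory_mb": "512",
}

cpu_required = resolve_memory_mb(props, "required_memory_mb", "cpu")
gpu_required = resolve_memory_mb(props, "required_memory_mb", "gpu")
gpu_reserved = resolve_memory_mb(props, "reserved_memory_mb", "gpu")
```

With these example values, the GPU requirement comes from `gpu.required_memory_mb`, while the GPU reservation falls back to `reserved_memory_mb` because no `gpu.reserved_memory_mb` is set.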

djl-serving uses `required_memory_mb` and `reserved_memory_mb` to decide whether a model can be loaded and can then serve inference requests successfully. It fetches the free memory available on the CPU and GPU and checks whether that free memory is greater than `required_memory_mb` plus `reserved_memory_mb`. If djl-serving cannot load the model because free memory is inadequate, it returns an HTTP `507` error, which lets clients handle the failure, for example by unloading a few models and retrying.
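
The check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not djl-serving's code: `can_load` and `load_status` are hypothetical names, the free-memory figure is passed in rather than queried from the system, and unset properties are treated as 0.

```python
from http import HTTPStatus


def can_load(free_mb: int, required_mb: int = 0, reserved_mb: int = 0) -> bool:
    """Hypothetical check: free memory must exceed required plus reserved."""
    return free_mb > required_mb + reserved_mb


def load_status(free_mb: int, required_mb: int = 0, reserved_mb: int = 0) -> HTTPStatus:
    """Map the check to an HTTP status: 507 when memory is inadequate."""
    if can_load(free_mb, required_mb, reserved_mb):
        return HTTPStatus.OK
    return HTTPStatus.INSUFFICIENT_STORAGE


# Example figures only: 3000 MB free against 2048 MB required + 512 MB reserved.
enough = load_status(free_mb=3000, required_mb=2048, reserved_mb=512)
too_little = load_status(free_mb=2500, required_mb=2048, reserved_mb=512)
```

Because the comparison is strict, free memory exactly equal to the sum of the two properties is treated here as inadequate, following the "greater than" wording above.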

This approach helps by:

* Failing fast, without needing to download the model from an external repository
* Avoiding the creation of a backend process that would eventually be killed

In addition to the user-configurable properties, djl-serving's Python engine handles exceptions of type `OutOfMemoryError` (e.g. `torch.cuda.OutOfMemoryError`) and `MemoryError` at both load and inference time and returns an HTTP `507` error. Out-of-memory exception handling is best effort on djl-serving's part: there is a risk that the Python process has already been killed or that the memory cannot be freed correctly.
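
The exception-to-status mapping can be approximated as follows. This is a hedged sketch rather than the Python engine's actual code: `is_oom` and `run_with_oom_handling` are hypothetical names, and the sketch matches `OutOfMemoryError` by class name so that it runs without PyTorch installed.

```python
from http import HTTPStatus
from typing import Any, Callable, Tuple


def is_oom(exc: BaseException) -> bool:
    """Hypothetical test for out-of-memory exceptions.

    Matches the built-in MemoryError, plus any exception class named
    OutOfMemoryError (the name used by torch.cuda.OutOfMemoryError).
    """
    return isinstance(exc, MemoryError) or type(exc).__name__ == "OutOfMemoryError"


def run_with_oom_handling(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[int, Any]:
    """Run a load or inference callable and map OOM failures to HTTP 507."""
    try:
        return int(HTTPStatus.OK), fn(*args, **kwargs)
    except Exception as exc:
        if is_oom(exc):
            return int(HTTPStatus.INSUFFICIENT_STORAGE), str(exc)
        raise  # other errors are not handled by this sketch


def failing_inference() -> None:
    # Stand-in for an inference call that runs out of memory.
    raise MemoryError("out of memory")


status, detail = run_with_oom_handling(failing_inference)
```

A sketch like this can only help while the Python process is still alive. If the operating system kills the process first, no handler runs, which is why the handling is described as best effort.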