From 99d25b87b35b419da7ae5f26eb31b035a0fafcb4 Mon Sep 17 00:00:00 2001
From: rohithkrn <rohith.nallamaddi@gmail.com>
Date: Mon, 10 Jul 2023 09:50:47 -0700
Subject: [PATCH] oom management doc (#926)

---
 serving/docs/out_of_memory_management.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 serving/docs/out_of_memory_management.md

diff --git a/serving/docs/out_of_memory_management.md b/serving/docs/out_of_memory_management.md
new file mode 100644
index 000000000..9a9d8bb52
--- /dev/null
+++ b/serving/docs/out_of_memory_management.md
@@ -0,0 +1,21 @@
+# OutOfMemory handling in djl-serving
+
+This document explains the properties that can be configured in djl-serving to better handle OutOfMemory exceptions.
+
+The following properties can be set per model in its `serving.properties` file:
+
+* `required_memory_mb`: Memory, in MB, required on CPU and GPU to load the model. The GPU value can be overridden by setting `gpu.required_memory_mb`.
+* `gpu.required_memory_mb`: GPU memory, in MB, required to load the model. This lets users set a GPU requirement that differs from the CPU requirement. If it is not set, `required_memory_mb` (when set) is used for the GPU as well.
+* `reserved_memory_mb`: Memory, in MB, to reserve on CPU and GPU in addition to the required memory, to account for inference memory costs.
+* `gpu.reserved_memory_mb`: GPU memory, in MB, to reserve in addition to the required memory, to account for inference memory costs. This lets users set a GPU reserve that differs from the CPU reserve. If it is not set, `reserved_memory_mb` (when set) is used for the GPU as well.
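+
+As an illustration, a `serving.properties` file that sets different CPU and GPU requirements might look like the sketch below. The values are arbitrary placeholders chosen for the example, not recommendations.
+
+```
+# Illustrative values only; size these for your own model
+required_memory_mb=8192
+gpu.required_memory_mb=12288
+reserved_memory_mb=1024
+```
+
+With these assumed values, `gpu.reserved_memory_mb` is unset, so the GPU reserve falls back to `reserved_memory_mb`. Under the check described in this document, loading would need more than 9216 MB free on CPU (8192 + 1024) and more than 13312 MB free on GPU (12288 + 1024).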
+
+
+djl-serving uses `required_memory_mb` and `reserved_memory_mb` to decide whether a model can be loaded and can then serve inference requests successfully. It fetches the free memory available on CPU and GPU and checks whether that free memory is greater than `required_memory_mb` plus `reserved_memory_mb`. If djl-serving cannot load the model because free memory is inadequate, it returns an HTTP `507` error, which lets clients handle the failure, for example by unloading some models and retrying.
+
+This approach helps us with:
+
+* Failing fast without needing to download the model from an external repository
+* Avoiding the creation of a backend process that would eventually be killed
+
+
+In addition to the user-configurable properties, djl-serving's Python engine handles exceptions of type `OutOfMemoryError` (e.g. `torch.cuda.OutOfMemoryError`) and `MemoryError` at both load time and inference time, and returns an HTTP `507` error. Out-of-memory exception handling is best effort: the Python process may already have been killed, or the memory may not be freed correctly.