diff --git a/docs/source/build-with-bentoml/distributed-services.rst b/docs/source/build-with-bentoml/distributed-services.rst
index 062268802bb..f00c0d1b9d6 100644
--- a/docs/source/build-with-bentoml/distributed-services.rst
+++ b/docs/source/build-with-bentoml/distributed-services.rst
@@ -232,7 +232,7 @@ Deploying a project with distributed Services to BentoCloud is similar to deploy
 
 To set custom configurations for each, we recommend you use a separate configuration file and reference it in the BentoML CLI command or Python API for deployment.
 
-The following is an example file that defines some custom configurations for the above two Services. You set configurations of each Service in the ``services`` field. Refer to :doc:`/bentocloud/how-tos/configure-deployments` to see the available configuration fields.
+The following is an example file that defines some custom configurations for the above two Services. You set configurations of each Service in the ``services`` field. Refer to :doc:`/scale-with-bentocloud/deployment/configure-deployments` to see the available configuration fields.
 
 .. code-block:: yaml
 
diff --git a/docs/source/build-with-bentoml/observability/metrics.rst b/docs/source/build-with-bentoml/observability/metrics.rst
index 57473faf63f..7804c628259 100644
--- a/docs/source/build-with-bentoml/observability/metrics.rst
+++ b/docs/source/build-with-bentoml/observability/metrics.rst
@@ -44,7 +44,7 @@ BentoML automatically collects a set of default metrics for each Service. These
 - ``request_in_progress``: The number of requests that are currently being processed by a Service.
 - ``request_total``: The total number of requests that a Service has processed.
 - ``request_duration_seconds``: The time taken to process requests, including the total sum of request processing time, count of requests processed, and distribution across specified duration buckets.
-- ``adaptive_batch_size``: The adaptive batch sizes used during Service execution, which is relevant for optimizing performance in batch processing scenarios. You need to enable :doc:`adaptive batching ` to collect this metric.
+- ``adaptive_batch_size``: The adaptive batch sizes used during Service execution, which is relevant for optimizing performance in batch processing scenarios. You need to enable :doc:`adaptive batching ` to collect this metric.
 
 Metric types
 ^^^^^^^^^^^^
diff --git a/docs/source/build-with-bentoml/parallelize-requests.rst b/docs/source/build-with-bentoml/parallelize-requests.rst
index 19220966735..fd7ab74569c 100644
--- a/docs/source/build-with-bentoml/parallelize-requests.rst
+++ b/docs/source/build-with-bentoml/parallelize-requests.rst
@@ -19,7 +19,21 @@ When you define a BentoML Service, use the ``workers`` parameter to set the numb
     class MyService:
         # Service implementation
 
-The number of workers isn't necessarily equivalent to the number of concurrent requests a BentoML Service can serve in parallel. With optimizations like :doc:`adaptable batching ` and continuous batching, each worker can potentially handle many requests simultaneously to enhance the throughput of your Service. To specify the ideal number of concurrent requests for a Service (namely, all workers within the Service), you can configure :doc:`concurrency `.
+The number of workers isn't necessarily equivalent to the number of concurrent requests a BentoML Service can serve in parallel. With optimizations like :doc:`adaptive batching ` and continuous batching, each worker can potentially handle many requests simultaneously to enhance the throughput of your Service. To specify the ideal number of concurrent requests for a Service (namely, all workers within the Service), you can configure :doc:`concurrency `.
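+
+For example, the following sketch (with illustrative values, not tuned recommendations) sets ``workers`` together with a ``concurrency`` target:
+
+.. code-block:: python
+
+    import bentoml
+
+    @bentoml.service(
+        workers=2,                    # worker processes started for this Service
+        traffic={"concurrency": 32},  # ideal concurrent requests across all workers
+    )
+    class MyService:
+        # Service implementation
+        ...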
 
 Use cases
 ---------
diff --git a/docs/source/examples/controlnet.rst b/docs/source/examples/controlnet.rst
index 704a4994250..1b6f77a16c2 100644
--- a/docs/source/examples/controlnet.rst
+++ b/docs/source/examples/controlnet.rst
@@ -120,7 +120,7 @@ Create BentoML :doc:`Services ` in a ``service.py`
         controlnet_conditioning_scale: float = 0.5
         num_inference_steps: int = 25
 
-This file defines a BentoML Service ``ControlNet`` with custom :doc:`configurations ` in timeout, worker count, and resources.
+This file defines a BentoML Service ``ControlNet`` with custom :doc:`configurations ` for timeout, worker count, and resources.
 
 - It loads the three pre-trained models and configures them to use GPU if available. The main pipeline (``StableDiffusionXLControlNetPipeline``) integrates these models.
 - It defines an asynchronous API endpoint ``generate``, which takes an image and a set of parameters as input. The parameters for the generation process are extracted from a ``Params`` instance, a Pydantic model that provides automatic data validation.
diff --git a/docs/source/examples/function-calling.rst b/docs/source/examples/function-calling.rst
index d9f3be7c03a..550a74a2c9e 100644
--- a/docs/source/examples/function-calling.rst
+++ b/docs/source/examples/function-calling.rst
@@ -70,7 +70,21 @@ The ``service.py`` file outlines the logic of the two required BentoML Services.
 
 2. Create a Python class (``Llama`` in the example) to initialize the model and tokenizer, and use the following decorators to add BentoML functionalities.
 
-   - ``@bentoml.service``: Converts this class into a BentoML Service. You can optionally set :doc:`configurations ` like timeout and GPU resources to use on BentoCloud. We recommend you use an NVIDIA A100 GPU of 80 GB for optimal performance.
+   - ``@bentoml.service``: Converts this class into a BentoML Service. You can optionally set :doc:`configurations ` like timeout and GPU resources to use on BentoCloud. We recommend an 80 GB NVIDIA A100 GPU for optimal performance.
    - ``@bentoml.mount_asgi_app``: Mounts an `existing ASGI application `_ defined in the ``openai_endpoints.py`` file to this class. It sets the base path to ``/v1``, making it accessible via HTTP requests. The mounted ASGI application provides OpenAI-compatible APIs and can be served side-by-side with the LLM Service. For more information, see :doc:`/build-with-bentoml/asgi`.
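+
+     For illustration, a minimal sketch of how the two decorators compose (the ``app`` object here is a bare FastAPI stand-in for the ASGI application defined in ``openai_endpoints.py``):
+
+     .. code-block:: python
+
+        import bentoml
+        from fastapi import FastAPI
+
+        app = FastAPI()  # stands in for the ASGI app from openai_endpoints.py
+
+        @bentoml.mount_asgi_app(app, path="/v1")
+        @bentoml.service
+        class Llama:
+            ...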
 
    .. code-block:: python
diff --git a/docs/source/examples/langgraph.rst b/docs/source/examples/langgraph.rst
index 14343fc1244..f1bca399b43 100644
--- a/docs/source/examples/langgraph.rst
+++ b/docs/source/examples/langgraph.rst
@@ -83,7 +83,7 @@ service.py
 
 The ``service.py`` file defines the ``SearchAgentService``, a BentoML Service that wraps around the LangGraph agent and calls the ``MistralService``.
 
-1. Create a Python class and decorate it with ``@bentoml.service``, which transforms it into a BentoML Service. You can optionally set :doc:`configurations ` like :doc:`workers ` and :doc:`concurrency `.
+1. Create a Python class and decorate it with ``@bentoml.service``, which transforms it into a BentoML Service. You can optionally set :doc:`configurations ` like :doc:`workers ` and :doc:`concurrency `.
 
    .. code-block:: python
 
diff --git a/docs/source/examples/mlflow.rst b/docs/source/examples/mlflow.rst
index 64f870b4de0..9599b1f0ce5 100644
--- a/docs/source/examples/mlflow.rst
+++ b/docs/source/examples/mlflow.rst
@@ -110,7 +110,7 @@ Create a separate ``service.py`` file where you define a BentoML :doc:`Service <
 
 The Service code:
 
-- Uses the ``@bentoml.service`` decorator to define a BentoML Service. Optionally, you can set additional :doc:`configurations ` like resource allocation and traffic timeout.
-- Retrieves the model from the Model Store and defines it a class variable.
+- Uses the ``@bentoml.service`` decorator to define a BentoML Service. Optionally, you can set additional :doc:`configurations ` like resource allocation and traffic timeout.
+- Retrieves the model from the Model Store and defines it as a class variable.
 - Uses the ``@bentoml.api`` decorator to expose the ``predict`` function as an API endpoint, which :doc:`takes a NumPy array as input and returns a NumPy array `.
 
diff --git a/docs/source/examples/shieldgemma.rst b/docs/source/examples/shieldgemma.rst
index dbe28c98ee7..34520b80392 100644
--- a/docs/source/examples/shieldgemma.rst
+++ b/docs/source/examples/shieldgemma.rst
@@ -67,7 +67,7 @@ The ``service.py`` file outlines the logic of the two required BentoML Services.
 
 2. Create the ``Gemma`` Service to initialize the model and tokenizer, with a safety check API to calculate the probability of policy violation.
 
-   - The ``Gemma`` class is decorated with ``@bentoml.service``, which converts it into a BentoML Service. You can optionally set :doc:`configurations ` like timeout, :doc:`concurrency ` and GPU resources to use on BentoCloud. We recommend you use an NVIDIA T4 GPU to host ShieldGemma 2B.
+   - The ``Gemma`` class is decorated with ``@bentoml.service``, which converts it into a BentoML Service. You can optionally set :doc:`configurations ` like timeout, :doc:`concurrency `, and GPU resources to use on BentoCloud. We recommend an NVIDIA T4 GPU to host ShieldGemma 2B.
    - The API ``check``, decorated with ``@bentoml.api``, functions as a web API endpoint. It evaluates the safety of prompts by predicting the likelihood of a policy violation. It then returns a structured response using the ``ShieldResponse`` Pydantic model.
 
    .. code-block:: python
diff --git a/docs/source/scale-with-bentocloud/scaling/autoscaling.rst b/docs/source/scale-with-bentocloud/scaling/autoscaling.rst
index 6958249fcf5..d3e1fff8d41 100644
--- a/docs/source/scale-with-bentocloud/scaling/autoscaling.rst
+++ b/docs/source/scale-with-bentocloud/scaling/autoscaling.rst
@@ -56,7 +56,22 @@ In general, the autoscaler will scale the number of replicas based on the follow
 Key points about concurrency:
 
 - By default, BentoML does not impose a limit on ``concurrency`` to avoid bottlenecks. To determine the optimal value for ``concurrency``, we recommend conducting a stress test on your Service using a load generation tool such as `Locust `_ either locally or on BentoCloud. The purpose of the stress test is to identify the maximum number of concurrent requests your Service can manage. After identifying this maximum, set the concurrency parameter to a value slightly below this threshold ensuring that the Service has adequate headroom to handle traffic fluctuations.
-- If your Service supports :doc:`adaptive batching ` or continuous batching, set ``concurrency`` to match the batch size. This aligns processing capacity with batch requirements, optimizing throughput.
+- If your Service supports :doc:`adaptive batching ` or continuous batching, set ``concurrency`` to match the batch size. This aligns processing capacity with batch requirements, optimizing throughput.
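+
+  For example, a minimal sketch (the Service name and batch size are illustrative) that matches the ``concurrency`` target to ``max_batch_size``:
+
+  .. code-block:: python
+
+     import bentoml
+     import numpy as np
+
+     @bentoml.service(traffic={"concurrency": 32})  # matches max_batch_size below
+     class BatchedService:
+
+         @bentoml.api(batchable=True, max_batch_size=32)
+         def predict(self, inputs: np.ndarray) -> np.ndarray:
+             # Replace with real batched inference; adaptive batching groups concurrent requests into one call
+             return inputs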
 - For Services designed to handle one request at a time, set ``concurrency`` to ``1``, ensuring that requests are processed sequentially without overlap.
 
 External queue