Deploying a Text-Generation Inference server on a Google Cloud TPU instance

Context

Text-Generation-Inference (TGI) is a highly optimized serving engine for Large Language Models (LLMs) that better leverages the underlying hardware, in this case Cloud TPU.

Deploy TGI on a Cloud TPU instance

We assume the reader already has a Cloud TPU instance up and running. If this is not the case, please see our guide on deploying one here.

Docker Container Build

Optimum-TPU provides a make tpu-tgi command at the root of the repository to help you build a local Docker image.
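
For example, a minimal build flow might look like the following (the repository URL is an assumption here; make tpu-tgi is the target mentioned above):

# Clone the Optimum-TPU repository and build the TGI image locally
git clone https://github.com/huggingface/optimum-tpu.git
cd optimum-tpu
make tpu-tgi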

Docker Container Run

# Hugging Face access token, required to download gated models such as google/gemma-2b
HF_TOKEN=<your_hf_token_here>
# Model to serve
MODEL_ID=google/gemma-2b

sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                huggingface/optimum-tpu:latest \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1

Executing requests against the service

You can query the model using either the /generate or /generate_stream routes:

curl localhost/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl localhost/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

Jetstream PyTorch and PyTorch XLA backends

Jetstream PyTorch is a highly optimized PyTorch engine for serving LLMs on Cloud TPU. This engine is selected by default if the dependency is available. If for some reason you want to use the PyTorch/XLA backend instead, you can set the JETSTREAM_PT_DISABLE=1 environment variable.
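
For example, when launching the container as shown above, the backend can be forced to PyTorch/XLA by passing the variable with -e (a sketch reusing the earlier HF_TOKEN and MODEL_ID values):

# Disable Jetstream PyTorch and fall back to the PyTorch/XLA backend
sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                -e JETSTREAM_PT_DISABLE=1 \
                huggingface/optimum-tpu:latest \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1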

When using the Jetstream PyTorch engine, it is possible to enable quantization to reduce the memory footprint and increase throughput. To enable quantization, set the QUANTIZATION=1 environment variable. For instance, on a 2x4 TPU v5e you can serve models of up to 70B parameters, such as Llama 3.3-70B.
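
As a sketch, quantization is enabled the same way, through the container environment (the model id and token limits below are illustrative only):

MODEL_ID=meta-llama/Llama-3.3-70B-Instruct  # illustrative model id

# Enable quantization to fit a 70B model on a 2x4 TPU v5e
sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                -e QUANTIZATION=1 \
                huggingface/optimum-tpu:latest \
                --model-id ${MODEL_ID} \
                --max-input-length 1024 \
                --max-total-tokens 2048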

Note: Quantization is still experimental and may produce lower quality results compared to the non-quantized version.