@michaelfeil michaelfeil commented Dec 10, 2025

What does this PR do?

Fixes # (issue)
Hides the startup latency of cloning tokenizers (~0.1s per clone). Since tokenization requests are queued anyway, deferring the clone has no functional effect. The tokenization workers are also initialized before the backend, and because the backend starts afterwards and itself takes far longer than 0.1s (tens of seconds), the clones now overlap with backend startup. This gives a net reduction of roughly 0.1s * num_tokenizers (typically 60-200 workers), saving up to ~20s.
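The idea can be sketched in a minimal, self-contained Rust example. This is an illustration, not the actual TEI code: `Tokenizer` and `spawn_workers` are hypothetical stand-ins, and the 50ms sleep simulates the ~0.1s clone cost measured in the PR. The point is that the expensive clone moves from the spawning thread into each worker thread, so all clones run concurrently with whatever the main thread does next (here, backend init).

```rust
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for `tokenizers::Tokenizer`; cloning a real loaded
// tokenizer is what costs ~0.1s. The cost is simulated with a sleep.
struct Tokenizer;

impl Clone for Tokenizer {
    fn clone(&self) -> Self {
        thread::sleep(Duration::from_millis(50)); // simulated clone cost
        Tokenizer
    }
}

// Spawn `n` workers, cloning the tokenizer *inside* each worker thread so
// the clones overlap with whatever the caller does next.
// Returns how long the caller was blocked, plus the worker handles.
fn spawn_workers(n: usize) -> (Duration, Vec<thread::JoinHandle<()>>) {
    let shared = Arc::new(Tokenizer);
    let start = Instant::now();
    let handles = (0..n)
        .map(|_| {
            let shared = Arc::clone(&shared); // cheap refcount bump, not a deep clone
            thread::spawn(move || {
                // Expensive deep clone, now off the startup critical path.
                let _local: Tokenizer = (*shared).clone();
                // ... a real worker would loop over queued tokenization
                // requests here ...
            })
        })
        .collect();
    (start.elapsed(), handles)
}

fn main() {
    let (blocked, handles) = spawn_workers(8);
    // The caller is blocked only for the thread spawns; backend init can
    // start now, overlapping the 8 concurrent ~50ms clones instead of
    // paying 8 * 50ms sequentially up front.
    println!("main thread blocked for {:?}", blocked);
    for h in handles {
        h.join().unwrap();
    }
}
```

With cloning done eagerly on the main thread instead, startup would block for roughly `n * clone_cost`, which is exactly the 0.1s-per-worker cost this PR hides.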

AFTER PR:

2025-12-10T02:20:09.338172Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "BAA*/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: Some("hf_m******************************wnj"), hostname: "michaelfeildns-dev-pod-h100-0", port: 3000, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-12-10T02:20:09.516212Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T02:20:09.516241Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T02:20:09.518213Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T02:20:09.519112Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T02:20:09.520179Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T02:20:09.521038Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T02:20:09.521956Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 5.747292ms
2025-12-10T02:20:09.543257Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T02:20:09.543277Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T02:20:09.543786Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 208 tokenization workers
2025-12-10T02:20:09.543895Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T02:20:09.543910Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T02:20:09.544861Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 950.751µs
2025-12-10T02:20:09.544879Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T02:20:09.546265Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 1.395499ms
2025-12-10T02:20:09.558937Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:249: Starting Bert model on Cpu
2025-12-10T02:20:09.741718Z  INFO text_embeddings_router: router/src/lib.rs:282: Warming up model
2025-12-10T02:20:10.050648Z  WARN text_embeddings_router: router/src/lib.rs:291: Backend does not support a batch size > 4
2025-12-10T02:20:10.050889Z  WARN text_embeddings_router: router/src/lib.rs:292: forcing `max_batch_requests=4`
2025-12-10T02:20:10.051131Z  WARN text_embeddings_router: router/src/lib.rs:341: Invalid hostname, defaulting to 0.0.0.0
2025-12-10T02:20:10.052194Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:3000
2025-12-10T02:20:10.052213Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready

BEFORE PR:

2025-12-10T03:58:19.968967Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T03:58:19.968997Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T03:58:19.975482Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T03:58:19.977307Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T03:58:19.979207Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T03:58:19.980129Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T03:58:19.981625Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 12.661192ms
2025-12-10T03:58:20.021686Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T03:58:20.021707Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T03:58:20.022883Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 512 tokenization workers
2025-12-10T03:58:22.529459Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T03:58:22.530967Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T03:58:22.533130Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 2.163728ms
2025-12-10T03:58:22.533163Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T03:58:22.536127Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 2.98543ms

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix" to "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" on Dec 10, 2025
@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" to "feat: startup time: add cloned tokenzier fix, saves ~1-20s cold start time" on Dec 10, 2025
