Dockerfile-cuda (5 changes: 4 additions & 1 deletion)
@@ -1,4 +1,4 @@
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS base-builder
FROM nvidia/cuda:12.9.0-devel-ubuntu22.04 AS base-builder
@alvarobartt (Member) on Oct 7, 2025

Is bumping CUDA required? It might eventually be a breaking change for instances running on older NVIDIA CUDA versions such as 12.2, 12.4 and 12.6; besides that, everything LGTM.

@danielealbano (Author) on Oct 7, 2025

@alvarobartt CUDA 12.8 is required to support GPUs like the 5080 and 5090. We could potentially downgrade to 12.8 and it should still work (I can test), but I don't think it would help much.

I understand that it might be a problem; however, CUDA 12.2 is two years old (July 2023) and would need to be upgraded at some point.

What if CUDA 12.9 is used with a :129-1.x Docker image tag? It doesn't feel like the right solution, but it wouldn't break any backward compatibility.
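For context, whether a given instance is affected can be checked from its GPU and driver before upgrading. A minimal sketch, assuming `nvidia-smi` is available on the host (the sample output values are illustrative):

```bash
# Show the GPU model and driver version (illustrative output below).
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# "NVIDIA GeForce RTX 5090, 575.57.08"  -> Blackwell, needs a CUDA 12.8+ build
# "NVIDIA A10G, 535.183.01"             -> already covered by the CUDA 12.2 image
```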

@alvarobartt (Member) on Oct 7, 2025

Hmm, fair enough. Then I think we could just create Dockerfile-cuda-blackwell with CUDA 12.8 in the meantime, whilst keeping the rest of the changes, adding it to the CI, and making sure we build with a different CUDA version for Blackwell; eventually, for TEI v1.9.0, we can think about bumping CUDA from 12.2 to 12.6.

In any case, given how recent Blackwell is, I guess it makes sense to keep it isolated for the moment so as not to break anything, but ideally all of these should live under the same Dockerfile in the future.
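If that route is taken, building the isolated image would presumably mirror the existing flow, just pointing at the new file. A hypothetical sketch (the Dockerfile-cuda-blackwell filename and image tag below are assumptions, since the file does not exist yet):

```bash
# Hypothetical: build a Blackwell-only image from the proposed Dockerfile,
# pinning the Blackwell compute capability at build time.
docker build . \
    -f Dockerfile-cuda-blackwell \
    --build-arg CUDA_COMPUTE_CAP=120 \
    -t text-embeddings-inference:cuda-blackwell-local
```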

@danielealbano (Author) on Oct 7, 2025

I will try to test with CUDA 12.8 to be certain there are no odd surprises; I'll need to figure out which packages to swap to downgrade the CUDA version on my test hardware.
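Purely as a hypothetical sketch of that kind of swap on an Ubuntu box with NVIDIA's apt repository configured (the exact package names depend on the repository set up on the machine):

```bash
# Hypothetical: install the older toolkit side by side and point the shell at it.
sudo apt-get install -y cuda-toolkit-12-8
export PATH=/usr/local/cuda-12.8/bin:$PATH
nvcc --version   # confirm the 12.8 toolchain is now the active one
```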

@alvarobartt (Member)

Awesome, thanks for the contribution @danielealbano! I'll try to test on my end too and add it to the CI to make sure the Dockerfile-cuda-blackwell image is built as experimental; later on we can consider bumping CUDA on the Dockerfile-cuda and Dockerfile-cuda-all images to make sure they support all of today's architectures.

@danielealbano (Author)

Sorry @alvarobartt, I haven't had the chance to run the test yet; I will try over the weekend. However, I was wondering whether we could just stick with 12.9, taking into account that this is going to be an ad-hoc Blackwell build.


ENV SCCACHE=0.10.0
ENV RUSTC_WRAPPER=/usr/local/bin/sccache
@@ -58,6 +58,9 @@ RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
elif [ ${CUDA_COMPUTE_CAP} -eq 90 ]; \
then \
nvprune --generate-code code=sm_90 /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a; \
elif [ ${CUDA_COMPUTE_CAP} -eq 120 ]; \
then \
nvprune --generate-code code=sm_120 /usr/local/cuda/lib64/libcublas_static.a -o /usr/local/cuda/lib64/libcublas_static.a; \
else \
echo "cuda compute cap ${CUDA_COMPUTE_CAP} is not supported"; exit 1; \
fi;
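As a side note, the effect of the nvprune step can be sanity-checked by listing which architectures remain embedded in the pruned static library. A sketch, assuming `cuobjdump` from the CUDA toolkit is on the PATH:

```bash
# List the embedded cubins; after pruning with CUDA_COMPUTE_CAP=120 only sm_120
# entries should remain, which keeps the final image size down.
cuobjdump --list-elf /usr/local/cuda/lib64/libcublas_static.a | head
```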
Dockerfile-cuda-all (22 changes: 20 additions & 2 deletions)
@@ -1,4 +1,4 @@
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS base-builder
FROM nvidia/cuda:12.9.0-devel-ubuntu22.04 AS base-builder

ENV SCCACHE=0.10.0
ENV RUSTC_WRAPPER=/usr/local/bin/sccache
@@ -85,6 +85,15 @@ RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
CUDA_COMPUTE_CAP=90 cargo chef cook --release --features candle-cuda --recipe-path recipe.json && sccache -s; \
fi;

RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
--mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN \
if [ $VERTEX = "true" ]; \
then \
CUDA_COMPUTE_CAP=120 cargo chef cook --release --features google --features candle-cuda --recipe-path recipe.json && sccache -s; \
else \
CUDA_COMPUTE_CAP=120 cargo chef cook --release --features candle-cuda --recipe-path recipe.json && sccache -s; \
fi;

COPY backends backends
COPY core core
COPY router router
@@ -122,9 +131,18 @@ RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
CUDA_COMPUTE_CAP=90 cargo build --release --bin text-embeddings-router -F candle-cuda && sccache -s; \
fi;

RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL \
--mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN \
if [ $VERTEX = "true" ]; \
then \
CUDA_COMPUTE_CAP=120 cargo build --release --bin text-embeddings-router -F candle-cuda -F google && sccache -s; \
else \
CUDA_COMPUTE_CAP=120 cargo build --release --bin text-embeddings-router -F candle-cuda && sccache -s; \
fi;

RUN mv /usr/src/target/release/text-embeddings-router /usr/src/target/release/text-embeddings-router-90

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04 AS base
FROM nvidia/cuda:12.9.0-runtime-ubuntu22.04 AS base

ARG DEFAULT_USE_FLASH_ATTENTION=True

README.md (3 changes: 3 additions & 0 deletions)
@@ -581,6 +581,9 @@ runtime_compute_cap=89
# Example for H100
runtime_compute_cap=90

# Example for Blackwell (RTX 5000 series, ...)
runtime_compute_cap=120

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```
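As a usage note, `runtime_compute_cap` does not have to be looked up by hand. A short sketch, assuming `nvidia-smi` is available (the RTX 5090 value in the comment is illustrative):

```bash
# Query the compute capability (e.g. "12.0" on an RTX 5090) and drop the dot
# to get the form expected by CUDA_COMPUTE_CAP (e.g. 120).
runtime_compute_cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1 | tr -d '.')

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```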

backends/candle/src/compute_cap.rs (2 changes: 2 additions & 0 deletions)
@@ -30,6 +30,7 @@ fn compute_cap_matching(runtime_compute_cap: usize, compile_compute_cap: usize)
(86..=89, 80..=86) => true,
(89, 89) => true,
(90, 90) => true,
(120, 120) => true,
(_, _) => false,
}
}
@@ -54,6 +55,7 @@ mod tests {
assert!(compute_cap_matching(86, 86));
assert!(compute_cap_matching(89, 89));
assert!(compute_cap_matching(90, 90));
assert!(compute_cap_matching(120, 120));

assert!(compute_cap_matching(86, 80));
assert!(compute_cap_matching(89, 80));
backends/candle/src/flash_attn.rs (2 changes: 1 addition & 1 deletion)
@@ -61,7 +61,7 @@ pub(crate) fn flash_attn_varlen(
}
#[cfg(not(feature = "flash-attn-v1"))]
candle::bail!("Flash attention v1 is not installed. Use `flash-attn-v1` feature.")
} else if (80..90).contains(&runtime_compute_cap) || runtime_compute_cap == 90 {
} else if (80..90).contains(&runtime_compute_cap) || runtime_compute_cap == 90 || runtime_compute_cap == 120 {
#[cfg(feature = "flash-attn")]
{
use candle_flash_attn::{flash_attn_varlen_alibi_windowed, flash_attn_varlen_windowed};
docs/source/en/custom_container.md (1 change: 1 addition & 0 deletions)
@@ -32,6 +32,7 @@ the examples of runtime compute capabilities for various GPU types:
- A10 - `runtime_compute_cap=86`
- Ada Lovelace (RTX 4000 series, ...) - `runtime_compute_cap=89`
- H100 - `runtime_compute_cap=90`
- Blackwell (RTX 5000 series, ...) - `runtime_compute_cap=120`

Once you have determined the compute capability, set it as the `runtime_compute_cap` variable and build
the container as shown in the example below:
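For Blackwell specifically, an end-to-end sketch might look as follows (the local image tag and the model id are placeholders, not part of this change):

```bash
# Build the image for compute capability 12.0 (Blackwell)...
runtime_compute_cap=120
docker build . -f Dockerfile-cuda \
    --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap \
    -t tei-cuda-sm120:local

# ...and run the router on the GPU with a placeholder embedding model.
docker run --gpus all -p 8080:80 tei-cuda-sm120:local --model-id <embedding-model-id>
```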