This page is a project tracker for getting halo models such as llama3, Flux.1, and Mistral working on one or more MI3xx GPUs using shark/iree.
- Shark V3.1.0 (Jan 6, 2025): llama3.1 405B sharded across 8 MI300X GPUs, performant at the level of vLLM PyTorch; Flux.1 dev
- Shark V3.2.0 (Feb 2025): Grok-1 and Mixtral 8x7B performant
TPn: Tensor Parallel across n GPUs, where a large tensor is sharded across multiple GPUs using sharktank and the scatter/gather to/from GPUs is expressed in a single MLIR module
TTFT: Time To First Token (time from the start of prompt processing to the first token produced by the prefill stage)
ITL: Inter-Token Latency (average time between each new token generated in the decode phase, second token onwards)
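For example, if the first token is produced 1.8 s after the prompt is submitted and the next 99 decode tokens arrive over the following 5.0 s, TTFT is 1.8 s and ITL is roughly 5.0 s / 99 ≈ 50 ms.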
- Read the cookbooks for user-like inference run instructions.
Model | Tracy Profile | Dump File | Comments |
---|---|---|---|
llama3.1 8B Fp16 prefill TP1 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 decode TP1 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 prefill TP8 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 decode TP8 | Tracy Profile | Dump File | |
llama3.1 70B Fp16 prefill TP1 | Tracy Profile | Dump File | |
llama3.1 70B Fp16 decode TP1 | Tracy Profile | Dump File | |
llama3.1 405B Fp16 prefill TP8 | Tracy Profile | Issue 19571 | |
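For reference, profiles like those above are typically captured by pointing the Tracy capture tool at a workload running on a Tracy-instrumented IREE runtime. A minimal sketch, assuming a tracing-enabled IREE build and using placeholder module/parameter/function names (the real artifacts are the ones linked in the tables on this page):
# Terminal 1: wait for the instrumented process to connect, then write the trace to disk.
iree-tracy-capture -o llama3.1_8b_fp16_prefill_tp1.tracy
# Terminal 2: run the workload on the HIP device; TRACY_NO_EXIT=1 keeps a
# short-running process alive until the capture finishes. Names below are placeholders.
TRACY_NO_EXIT=1 iree-run-module \
--device=hip \
--module=llama3.1_8b_fp16_nondecomposed_tp1_bs4.vmfb \
--parameters=model=llama3.1_8b_fp16.irpa \
--function=prefill_bs4 \
--input=@prefill_arg0.npy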
- See Testing Status
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | Previous Week (Jan 13-17) | Next Week (Jan 20-24) |
---|---|---|
Sharktank Modeling | - @Ean get all controlnet pipeline models running (Done 1/14) - @Boian debugging numerics in flux-transformer (ETA 1/16) - @Rob look at padding top matmul in prefill (ETA 1/14) - @Rob multi-device tracing still serial and debugging (Done 1/16) - @Dan Adding fp8 kernel tests (ETA 1/14) - @Dan continuing fp8 70b llama model numeric debugging (ETA 1/24) - @Chi Helping debug fp8 llama compilation (ETA 1/17) | - @Rob triaging numerical errors in llama (Done 1/20) - @Rob generating tracy profiles with new multi-device for 405b (ETA 1/20) - @Stephen Verifying latest llama numerics (ETA 1/20) - @Stephen mlperf harness for llama (ETA 1/24) - @Dan prefill numerics compiling and good with HF rotary embedding, verifying ours now (ETA 1/20) - @Dan fp8 attention llama (ETA 1/23) - @Boian refactoring flux to use the rotary embedding from llama (ETA 1/21) |
IREE code generation | | |
Serving | - @Ean Controlnet pipeline functionality in shortfin (Done 1/15) - @Ean Debugging controlnet numerics (ETA 1/16) - @Kyle fixing up shortfin flux support (ETA 1/14) - @Stephen triaging iree compile regression (ETA 1/14) - @Stephen triaging potential shortfin regression with 70b (Done 1/16) - @Archana shortfin PPL debugging (ETA 1/14) - @Xida debugging concurrent request numeric issue (ETA 1/16) - @Jinchen mlperf loadgen integration with shortfin (ETA 1/17) - @Vinayak helping mlperf integration (ETA 1/17) - @Phaneesh flux performance | - @Jinchen remove extra client connection per request and keep the connection open (ETA 1/22) - @Vinayak help with event pool client connection issues (ETA 1/21) - @Archana likely found root cause of the PPL issue, working on a fix now (ETA 1/20) - @Archana debugging llama regression from masked attention (ETA 1/21) - @Ean help out on mlperf and client handoff (ETA 1/22) - @Kyle cleanup and refactor flux pipeline (ETA 1/21) - @Phaneesh flux perf burndown (ETA 1/22) |
Test Automation | | |
Performance Tuning | | |
See the latest CI/Nightly Test Report. Use the Nod.AI Lab page to ssh into the SharkMi300X machine to find logs and artifacts for triaging the failures. File an issue (if not already filed/listed) and add it to the Issues table below.
category | issue link | assigned to | status |
---|---|---|---|
quark quantization | QUARK-71 | Bowen Bow | FP8 matmul should be used in attention |
shark-sre | N/A | Sai Enduri | Need more systems for CI/dev of 405b |
runtime | 19812 | Stephen Baione | Llama3.1_fp16_8b_tp8 fails prefill for long prompts when using async allocator |
runtime | 19832 | Stephen Baione | Llama_405b_tp8 OOM w/ Long Input Prompt |
iree runtime python package | 19886 | Avinash Sharma | TP8 benchmarking hits segfault with clang-17 |
iree tracy profiling | 19571 | | hits assertion for tracing 405B sharded |
The following naming convention should be used for weights and artifacts (on SharkMI300X and other similar machines):
Unsharded Weights:
/data/<model_name>/weights/<model_size>/<dtype>/<modelname_modelsize_datatype>.irpa
Example: /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
Sharded Weights:
/data/<model_name>/weights/<model_size>/<dtype>/<shard_size>/<modelname_modelsize_datatype_shardsize>_parameters.<rank_suffix>.irpa
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
Artifacts:
/data/<model_name>/artifacts/<model_size>/<model_name>_<model_size>_<data_type>_<attention_kind>_<sharding>_<batch_size>.[mlir | vmfb]
Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
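As an illustration (an assumption based on the convention and the rank0 example above, not a listing of actual files), a TP8-sharded 405B FP16 weights directory would be expected to hold one rank-suffixed parameter file per GPU:
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank1.irpa
...
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank7.irpa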
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 bs4 TP1 (prefill) | PASS mlir_tp1 irpa | PASS compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-8B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS | PASS | tbd | tbd |
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-405B-FP16 bs4 TP8 (decode) | PASS | PASS | FAIL (Segfault) | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | yes | tbd | tbd | tbd |
llama3.1-70B-FP8 | ETA: 11/1 | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | ETA: 11/5 | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16 (this is the real production model) | MLIR IRPA | tbd | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-schnell--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-schnell--black-forest-labs-transformer-bf16 | MLIR IRPA | tbd | tbd | tbd | tbd |
Schnell is almost the same as Dev. Dev has a guidance layer and guidance parameter, while Schnell does not.
black-forest-labs--FLUX.1-<schnell/dev>--transformer-single-layer-bf16 is a single layer with random weights.
It is meant to enable faster iteration when working with the model.
The actual models black-forest-labs--FLUX.1-<dev/schnell>--black-forest-labs-transformer-bf16 have real pretrained parameters and 19 MMDiT layers.
iree-compile \
black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16.mlir \
-o black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16-hip.vmfb \
--iree-hal-target-device=hip \
--iree-hip-target=gfx942 \
--iree-opt-const-eval=false \
--iree-opt-strip-assertions=true \
--iree-global-opt-propagate-transposes=true \
--iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-outer-dim-concat=true \
--iree-vm-target-truncate-unsupported-floats \
--iree-llvmgpu-enable-prefetch=true \
--iree-opt-data-tiling=false \
--iree-codegen-gpu-native-math-precision=true \
--iree-codegen-llvmgpu-use-vector-distribution \
--iree-hip-waves-per-eu=2 \
--iree-execution-model=async-external \
"--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"
Only the xxl variant is actually used in FLUX. The small variant is provided for faster iteration if needed.
iree-compile \
google__t5_v1_1_xxl_encoder_fp32.mlir \
--iree-hal-target-device=hip \
--iree-hip-target=gfx942 \
-o google__t5_v1_1_xxl_encoder_fp32.vmfb
iree-run-module \
--device=hip \
--module=google__t5_v1_1_xxl_encoder_fp32.vmfb \
--parameters=model=google__t5_v1_1_xxl_encoder_fp32.irpa \
--function=forward_bs4 \
--input=@google__t5_v1_1_xxl_iree_forward_bs4_arg0.npy
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
t5-v1.1-small-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-xxl-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-small-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
t5-v1.1-xxl-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
Mixtral 8x7B ONNX | tbd | tbd | tbd | tbd | tbd |
Generate IR
python3 -m sharktank.examples.export_paged_llm_v1 \
--irpa-file <input_irpa path with correct sharding and dtype> --output-mlir <output-mlir> \
--bs <batch size> --tensor-parallelism-size <TP size if sharding> \
--attention-kernel <decomposed or torch_sdpa> [--no-fake-quant]  # --no-fake-quant is only needed for fp8
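For example, exporting the 8B FP16 model at batch size 4 with the torch_sdpa attention kernel might look like the following. The paths are assumptions that follow the naming convention above (no --tensor-parallelism-size is passed since this is TP1, and --no-fake-quant is omitted since the weights are not fp8):
python3 -m sharktank.examples.export_paged_llm_v1 \
--irpa-file /data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa \
--output-mlir /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
--bs 4 \
--attention-kernel torch_sdpa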
Generate vmfb
iree-compile <input-mlir path> --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o <output-vmfb path>
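Continuing the sketch above, compiling the exported MLIR for an MI300X (gfx942) might look like this, with the output path again an assumption following the naming convention:
iree-compile /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
--iree-hal-target-backends=rocm \
--iree-hip-target=gfx942 \
-o /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.vmfb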
Follow the steps here
In the browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or, use the command line by first installing the az cli:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" in Azure Services or search for "sharkblobs" in the top search bar, then click on sharkblobs. In the left sidebar, under "Security + networking", click on "Access keys". Copy the account key and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace sharkblobs in the instructions above with sharkpublic and get your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf \
--file llama8b_f16.gguf --account-key <key string>
Follow the steps here.
Follow the steps here
Feature | Description | Enabled | Enablement Requirements | Reference(s) |
---|---|---|---|---|
gen | Generate shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
streaming | Stream shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
run_batch | Run batch of disjoint requests with continuous batching | Yes | Enabled | Batch Docs |
fork | Launch parallel prompts | Yes | Enabled | Fork Docs |
choices | Given set of choices, generate response based on best log probs | No | Should work with greedy. Needs backend implementation | Greedy Token Selection OpenAI Implementation |
image | Pass image as part of multi-modal prompt | No | Multi-Modal not supported by SF | sgl.image Docs |
regex | Specify regular expression as decoding constraint | No | Only supported for local models | Regex Docs |
The latest benchmark results for the SGLang integration can be found here
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | ||
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: Update Schedule-Numerics table for llama3.1 artifacts instead of this table (10/20/2024 onwards))
- Check small files and MLIR files into llm-dev
- Upload large files to the sharkblobs "halo-models" container on Azure and put a link to them in the table(s) below
- Store very large files on the GPU server and note the machine name and file location in the table(s) below
Note: If a link to Azure sharkblobs below gives you an error, either use the az cli to download (see the section on accessing sharkblobs on Azure) or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | Attention IRs | |||
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |