This page is a project tracker for getting halo models such as llama3, Flux.1, and Mistral working on one or more MI3xx GPUs using shark/iree.
- Shark V3.1.0 (Jan 6, 2025): llama3.1 405B sharded across 8 MI300X GPUs, performant at the level of vLLM PyTorch; Flux.1 dev
- Shark V3.2.0 (Feb 2025): Grok-1 and Mixtral 8x7B performant
TPn: Tensor Parallel across n GPUs, where a large tensor is sharded across multiple GPUs using sharktank and the scatter/gather to/from GPUs is expressed in a single MLIR module
TTFT: Time To First Token (time from the start of prompt processing to the first token produced by the prefill stage)
ITL: Inter-Token Latency (average time between each new token generated in the decode phase, second token onwards)
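For example, if the first token is produced 1.8 s after the prompt is submitted and the next 99 decode tokens arrive over the following 5.0 s, TTFT is 1.8 s and ITL is roughly 5.0 s / 99 ≈ 50 ms.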
- Read the cookbooks for user-like inference run instructions.
Model | Tracy Profile | Dump File | Comments |
---|---|---|---|
llama3.1 8B Fp16 prefill TP1 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 decode TP1 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 prefill TP8 | Tracy Profile | Dump File | |
llama3.1 8B Fp16 decode TP8 | Tracy Profile | Dump File | |
llama3.1 70B Fp16 prefill TP1 | Tracy Profile | Dump File | |
llama3.1 70B Fp16 decode TP1 | Tracy Profile | Dump File | |
llama3.1 405B Fp16 prefill TP8 | Tracy Profile | Issue 19571 | |
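For reference, profiles like those above are typically captured by pointing the Tracy capture tool at a workload running on a Tracy-instrumented IREE runtime. A minimal sketch, assuming a tracing-enabled IREE build and using placeholder module/parameter/function names (the real artifacts are the ones linked in the tables on this page):
# Terminal 1: wait for the instrumented process to connect, then write the trace to disk.
iree-tracy-capture -o llama3.1_8b_fp16_prefill_tp1.tracy
# Terminal 2: run the workload on the HIP device; TRACY_NO_EXIT=1 keeps a
# short-running process alive until the capture finishes. Names below are placeholders.
TRACY_NO_EXIT=1 iree-run-module \
--device=hip \
--module=llama3.1_8b_fp16_nondecomposed_tp1_bs4.vmfb \
--parameters=model=llama3.1_8b_fp16.irpa \
--function=prefill_bs4 \
--input=@prefill_arg0.npy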
- See Testing Status
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | Previous Week (Jan 13-17) | Next Week (Jan 20-24) |
---|---|---|
Sharktank Modeling | - @Ean get all controlnet pipeline models running (Done 1/14) - @Boian debugging numerics in flux-transformer (ETA 1/16) - @Rob look at padding top matmul in prefill (ETA 1/14) - @Rob multi-device tracing still serial and debugging (Done 1/16) - @Dan Adding fp8 kernel tests (ETA 1/14) - @Dan continuing fp8 70b llama model numeric debugging (ETA 1/24) - @Chi Helping debug fp8 llama compilation (ETA 1/17) | - @Rob triaging numerical errors in llama (Done 1/20) - @Rob generating tracy profiles with new multi-device for 405b (ETA 1/20) - @Stephen Verifying latest llama numerics (ETA 1/20) - @Stephen mlperf harness for llama (ETA 1/24) - @Dan prefill numerics compiling and good with HF rotary embedding, verifying ours now (ETA 1/20) - @Dan fp8 attention llama (ETA 1/23) - @Boian refactoring flux to use the rotary embedding from llama (ETA 1/21) |
IREE code generation | | |
Serving | - @Ean Controlnet pipeline functionality in shortfin (Done 1/15) - @Ean Debugging controlnet numerics (ETA 1/16) - @Kyle fixing up shortfin flux support (ETA 1/14) - @Stephen triaging iree compile regression (ETA 1/14) - @Stephen triaging potential shortfin regression with 70b (Done 1/16) - @Archana shortfin PPL debugging (ETA 1/14) - @Xida debugging concurrent request numeric issue (ETA 1/16) - @Jinchen mlperf loadgen integration with shortfin (ETA 1/17) - @Vinayak helping mlperf integration (ETA 1/17) - @Phaneesh flux performance | - @Jinchen remove extra client connection per request and keep the connection open (ETA 1/22) - @Vinayak help with event pool client connection issues (ETA 1/21) - @Archana likely found root cause of the PPL issue, working on a fix now (ETA 1/20) - @Archana debugging llama regression from masked attention (ETA 1/21) - @Ean help out on mlperf and client handoff (ETA 1/22) - @Kyle cleanup and refactor flux pipeline (ETA 1/21) - @Phaneesh flux perf burndown (ETA 1/22) |
Test Automation | | |
Performance Tuning | | |
See the latest CI/Nightly Test Report. Use the Nod.AI Lab page to ssh into the SharkMi300X machine to find logs and artifacts for triaging the failures. File an issue (if not already filed/listed) and add it to the Issues table below.
category | issue link | assigned to | status |
---|---|---|---|
quark quantization | QUARK-71 | Bowen Bow | FP8 matmul should be used in attention |
shark-sre | N/A | Sai Enduri | Need more systems for CI/dev of 405b |
runtime | 19812 | Stephen Baione | Llama3.1_fp16_8b_tp8 fails prefill for long prompts when using async allocator |
runtime | 19832 | Stephen Baione | Llama_405b_tp8 OOM w/ Long Input Prompt |
iree runtime python package | 19886 | Avinash Sharma | TP8 benchmarking hits segfault with clang-17 |
iree tracy profiling | 19571 | | hits assertion for tracing 405B sharded |
The following naming convention should be used for weights and artifacts (on SharkMI300X and other similar machines):
Unsharded Weights:
/data/<model_name>/weights/<model_size>/<dtype>/<modelname_modelsize_datatype>.irpa
Example: /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
Sharded Weights:
/data/<model_name>/weights/<model_size>/<dtype>/<shard_size>/<modelname_modelsize_datatype_shardsize>_parameters.<rank_suffix>.irpa
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
Artifacts:
/data/<model_name>/artifacts/<model_size>/<model_name>_<model_size>_<data_type>_<attention_kind>_<sharding>_<batch_size>.[mlir | vmfb]
Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
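As an illustration (an assumption based on the convention and the rank0 example above, not a listing of actual files), a TP8-sharded 405B FP16 weights directory would be expected to hold one rank-suffixed parameter file per GPU:
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank1.irpa
...
/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank7.irpa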
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 bs4 TP1 (prefill) | PASS mlir_tp1 irpa | PASS compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-8B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS | PASS | tbd | tbd |
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-405B-FP16 bs4 TP8 (decode) | PASS | PASS | FAIL (Segfault) | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | yes | tbd | tbd | tbd |
llama3.1-70B-FP8 | ETA: 11/1 | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | ETA: 11/5 | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16 (this is the real production model) | MLIR IRPA | tbd | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-schnell--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-schnell--black-forest-labs-transformer-bf16 | MLIR IRPA | tbd | tbd | tbd | tbd |
Schnell is almost the same as Dev. Dev has a guidance layer and guidance parameter, while Schnell does not.
black-forest-labs--FLUX.1-<schnell/dev>--transformer-single-layer-bf16 is a single layer with random weights.
It is meant to enable faster iteration when working with the model.
The actual models black-forest-labs--FLUX.1-<dev/schnell>--black-forest-labs-transformer-bf16 have real pretrained parameters and 19 MMDiT layers.
iree-compile \
black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16.mlir \
-o black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16-hip.vmfb \
--iree-hal-target-device=hip \
--iree-hip-target=gfx942 \
--iree-opt-const-eval=false \
--iree-opt-strip-assertions=true \
--iree-global-opt-propagate-transposes=true \
--iree-dispatch-creation-enable-fuse-horizontal-contractions=true \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-outer-dim-concat=true \
--iree-vm-target-truncate-unsupported-floats \
--iree-llvmgpu-enable-prefetch=true \
--iree-opt-data-tiling=false \
--iree-codegen-gpu-native-math-precision=true \
--iree-codegen-llvmgpu-use-vector-distribution \
--iree-hip-waves-per-eu=2 \
--iree-execution-model=async-external \
"--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)"
Only the xxl variant is actually used in FLUX. The small variant is provided for faster iteration if needed.
iree-compile \
google__t5_v1_1_xxl_encoder_fp32.mlir \
--iree-hal-target-device=hip \
--iree-hip-target=gfx942 \
-o google__t5_v1_1_xxl_encoder_fp32.vmfb
iree-run-module \
--device=hip \
--module=google__t5_v1_1_xxl_encoder_fp32.vmfb \
--parameters=model=google__t5_v1_1_xxl_encoder_fp32.irpa \
--function=forward_bs4 \
--input=@google__t5_v1_1_xxl_iree_forward_bs4_arg0.npy
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
t5-v1.1-small-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-xxl-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-small-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
t5-v1.1-xxl-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
Mixtral 8x7B ONNX | tbd | tbd | tbd | tbd | tbd |
Generate IR
python3 -m sharktank.examples.export_paged_llm_v1 \
--irpa-file <input_irpa path with correct sharding and dtype> --output-mlir <output-mlir> \
--bs <batch size> --tensor-parallelism-size <TP size if sharding> \
--attention-kernel <decomposed or torch_sdpa> [--no-fake-quant]  # --no-fake-quant is only needed for fp8
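For example, exporting the 8B FP16 model at batch size 4 with the torch_sdpa attention kernel might look like the following. The paths are assumptions that follow the naming convention above (no --tensor-parallelism-size is passed since this is TP1, and --no-fake-quant is omitted since the weights are not fp8):
python3 -m sharktank.examples.export_paged_llm_v1 \
--irpa-file /data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa \
--output-mlir /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
--bs 4 \
--attention-kernel torch_sdpa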
Generate vmfb
iree-compile <input-mlir path> --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o <output-vmfb path>
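Continuing the sketch above, compiling the exported MLIR for an MI300X (gfx942) might look like this, with the output path again an assumption following the naming convention:
iree-compile /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
--iree-hal-target-backends=rocm \
--iree-hip-target=gfx942 \
-o /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.vmfb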
Follow the steps here
In the browser, click on sharkblobs, then click on "Blob containers", and then click on "halo-models".
Or, use the command line by first installing the az cli:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then get the account key for the storage account: click on "Storage Accounts" in Azure Services or search for "sharkblobs" in the top search bar, then click on sharkblobs. In the left sidebar, under "Security + networking", click on "Access keys". Copy the account key and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace sharkblobs in the instructions above with sharkpublic and get your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf \
--file llama8b_f16.gguf --account-key <key string>
Follow the steps here.
Follow the steps here
Feature | Description | Enabled | Enablement Requirements | Reference(s) |
---|---|---|---|---|
gen | Generate shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
streaming | Stream shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
run_batch | Run batch of disjoint requests with continuous batching | Yes | Enabled | Batch Docs |
fork | Launch parallel prompts | Yes | Enabled | Fork Docs |
choices | Given set of choices, generate response based on best log probs | No | Should work with greedy. Needs backend implementation | Greedy Token Selection OpenAI Implementation |
image | Pass image as part of multi-modal prompt | No | Multi-Modal not supported by SF | sgl.image Docs |
regex | Specify regular expression as decoding constraint | No | Only supported for local models | Regex Docs |
The latest benchmark results for the SGLang integration can be found here
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | ||
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: Update Schedule-Numerics table for llama3.1 artifacts instead of this table (10/20/2024 onwards))
- Check small files and MLIR files into llm-dev
- Upload large files to the sharkblobs "halo-models" container on Azure and put a link to them in the table(s) below
- Store very large files on the GPU server and note the machine name and file location in the table(s) below
Note: If a link to Azure sharkblobs below gives you an error, either use the az cli to download (see the section on accessing sharkblobs on Azure) or click on sharkblobs, then click on "Blob containers", and navigate to the file manually to download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | Attention IRs | |||
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |