[sycl-free-inference-for-llms] Port and evaluate Llama3-8B and Granite-8B
Meta has published the following PyTorch blog post: https://pytorch.org/blog/cuda-free-inference-for-llms
They have evaluated Llama3-8B using Triton on A100 and H100 GPUs. We should do the same for PVC after porting the code.
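Before porting, it may be worth confirming that Triton kernels launch on PVC at all. Below is a minimal vector-add smoke test, a sketch based on the standard Triton tutorial kernel (not taken from the blog post), assuming a PyTorch build with XPU support and the Intel XPU backend for Triton are installed:

# Minimal Triton smoke test for PVC (Intel XPU).
# Assumptions: PyTorch with XPU support and the Intel XPU backend
# for Triton are installed; this is the standard tutorial kernel,
# not code from the blog post or the foundation-model-stack repo.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def main():
    device = "xpu"  # PVC; use "cuda" to reproduce the A100/H100 runs
    n = 98432
    x = torch.rand(n, device=device)
    y = torch.rand(n, device=device)
    out = torch.empty_like(x)
    # One program per BLOCK_SIZE-sized chunk of the input.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    torch.testing.assert_close(out, x + y)
    print("Triton kernel ran correctly on", device)

if __name__ == "__main__":
    main()

If this runs, the Triton toolchain on PVC is functional and the failure surface for the Llama3-8B port narrows to the model's own kernels.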
The instructions are as follows:
1 - Get the code:
git clone https://github.com/AdnanHoque/foundation-model-stack.git
cd foundation-model-stack
git checkout amd_attn
pip install -e .
cd scripts/
2 - Download the weights and tokenizer from: https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main
3 - To run (update the model path and tokenizer to your local drive; a possible PVC variant is sketched after this list):
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python inference.py --architecture=llama --variant=3-8b --tokenizer="/net/storage149/autofs/css22/nmg/models/llama3-8b/base" --model_path="/net/storage149/autofs/css22/nmg/models/llama3-8b/base" --device_type cuda --model_source hf --compile
4 - Script options are controlled in: https://github.com/.../blob/amd_attn/scripts/inference.py
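Once the port lands, the PVC invocation would presumably look like the following. The xpu device type and the ZE_AFFINITY_MASK device-selection variable (the Level Zero counterpart of CUDA_VISIBLE_DEVICES) are assumptions on my part; the ported script may expose different flags:

# Assumed PVC run command, not yet verified against the ported script:
ZE_AFFINITY_MASK=0 python inference.py --architecture=llama --variant=3-8b --tokenizer="/net/storage149/autofs/css22/nmg/models/llama3-8b/base" --model_path="/net/storage149/autofs/css22/nmg/models/llama3-8b/base" --device_type xpu --model_source hf --compile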