From 75d0e27c8d31b72dfe3ae407e016756f774bac81 Mon Sep 17 00:00:00 2001 From: anandhu-eng Date: Mon, 3 Mar 2025 16:50:55 +0530 Subject: [PATCH 1/4] Updated with mlc commands for model,dataset,accuracy,submission --- automotive/3d-object-detection/README.md | 4 + graph/R-GAT/README.md | 4 + language/bert/README.md | 43 ++++++++++ language/gpt-j/README.md | 32 +++++++- language/llama2-70b/README.md | 49 +++++++++++- language/llama3.1-405b/README.md | 50 +++++++----- language/mixtral-8x7b/README.md | 41 +++++++++- recommendation/dlrm_v2/pytorch/README.md | 32 ++++++-- text_to_image/README.md | 12 +++ vision/classification_and_detection/README.md | 80 ++++++++++++++++++- .../medical_imaging/3d-unet-kits19/README.md | 53 ++++++++++++ 11 files changed, 369 insertions(+), 31 deletions(-) diff --git a/automotive/3d-object-detection/README.md b/automotive/3d-object-detection/README.md index e1190e8132..d0430d444c 100644 --- a/automotive/3d-object-detection/README.md +++ b/automotive/3d-object-detection/README.md @@ -101,3 +101,7 @@ Please click [here](https://github.com/mlcommons/inference/blob/master/automotiv ``` python accuracy_waymo.py --mlperf-accuracy-file /mlperf_log_accuracy.json --waymo-dir /waymo/kitti_format/ ``` + +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. \ No newline at end of file diff --git a/graph/R-GAT/README.md b/graph/R-GAT/README.md index 1380cb047d..d9f5fafa44 100644 --- a/graph/R-GAT/README.md +++ b/graph/R-GAT/README.md @@ -181,6 +181,10 @@ mlcr process,mlperf,accuracy,_igbh --result_dir= -j +``` + +**Onnx Framework** + +``` +mlcr get,ml-model,bert-large,_onnx --outdirname= -j +``` + +**TensorFlow Framework** + +``` +mlcr get,ml-model,bert-large,_tensorflow --outdirname= -j +``` + +### Download dataset through MLCFlow Automation + +``` +mlcr get,dataset,squad,validation --outdirname= -j +``` + ## Commands Please run the following commands: @@ -45,6 +77,17 @@ Please run the following commands: - The script [tf_freeze_bert.py] freezes the TensorFlow model into pb file. - The script [bert_tf_to_pytorch.py] converts the TensorFlow model into the PyTorch `BertForQuestionAnswering` module in [HuggingFace Transformers](https://github.com/huggingface/transformers) and also exports the model to [ONNX](https://github.com/onnx/onnx) format. +### Evaluate the accuracy through MLCFlow Automation +```bash +mlcr process,mlperf,accuracy,_squad --result_dir= +``` + +Please click [here](https://github.com/mlcommons/inference/blob/master/language/bert/accuracy-squad.py) to view the Python script for evaluating accuracy for the squad dataset. + +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. + ## Loadgen over the Network ``` diff --git a/language/gpt-j/README.md b/language/gpt-j/README.md index cfcf068791..9c952b65db 100644 --- a/language/gpt-j/README.md +++ b/language/gpt-j/README.md @@ -1,9 +1,28 @@ # GPT-J Reference Implementation -Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/gpt-j) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. 
+## Automated command to run the benchmark via MLCFlow Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/gpt-j/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. +You can also do `pip install mlc-scripts` and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. + +### Download model through MLCFlow Automation + +``` +mlcr get,ml-model,gptj,_pytorch --outdirname= -j +``` + +### Download dataset through MLCFlow Automation + +**Validation Dataset** +``` +mlcr get,dataset,cnndm,_validation --outdirname= -j +``` + +**Calibration Dataset** +``` +mlcr get,dataset,cnndm,_calibration --outdirname= -j +``` ### Setup Instructions @@ -113,6 +132,13 @@ Evaluates the ROGUE scores from the accuracy logs. Only applicable when specifyi python evaluation.py --mlperf-accuracy-file ./build/logs/mlperf_log_accuracy.json --dataset-file ./data/cnn_eval.json ``` +### Evaluate the accuracy through MLCFlow Automation +```bash +mlcr process,mlperf,accuracy,_cnndm --result_dir= +``` + +Please click [here](https://github.com/mlcommons/inference/blob/master/language/gpt-j/evaluation.py) to view the Python script for evaluating accuracy for the cnndm dataset. + ### Reference Model - ROUGE scores The following are the rouge scores obtained when evaluating the GPT-J fp32 model on the entire validation set (13368 samples) using beam search, beam_size=4 @@ -122,6 +148,10 @@ ROUGE 2 - 20.1235 ROUGE L - 29.9881 +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. + ### License: Apache License Version 2.0. diff --git a/language/llama2-70b/README.md b/language/llama2-70b/README.md index 506423cc2c..bbd9889564 100644 --- a/language/llama2-70b/README.md +++ b/language/llama2-70b/README.md @@ -7,6 +7,9 @@ - For server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query. This way the first token will be reported and it's latency will be measured. - For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each of the elements in response is a `lg.QuerySampleResponse` that contains the number of tokens (can be create this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match with the number of tokens on your answer and this will be checked in [TEST06](../../compliance/nvidia/TEST06/) + +## Automated command to run the benchmark via MLCFlow + Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/llama2-70b) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. You can also do `pip install mlc-scripts` and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. @@ -65,9 +68,11 @@ CPU-only setup, as well as any GPU versions for applicable libraries like PyTorc ### MLCommons Members Download MLCommons hosts the model and preprocessed dataset for download **exclusively by MLCommons Members**. You must first agree to the [confidentiality notice](https://llama2.mlcommons.org) using your organizational email address, then you will receive a link to a directory containing Rclone download instructions. 
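+
+Once access is granted, the emailed steps generally follow the same Rclone pattern used for the public artifacts elsewhere in this repository. A minimal sketch, assuming placeholder credentials, remote name, and bucket path taken from your email (none of the values below are real):
+
+```bash
+# Create an Rclone remote with the credentials provided in the download email (placeholders only)
+rclone config create llama2-download s3 provider=Cloudflare access_key_id=<access_key_id> secret_access_key=<secret_access_key> endpoint=<endpoint_url>
+# Copy the checkpoint into ${CHECKPOINT_PATH}, showing progress
+rclone copy llama2-download:<bucket>/Llama-2-70b-chat-hf ${CHECKPOINT_PATH} -P
+```
+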
_If you cannot access the form but you are part of a MLCommons Member organization, submit the [MLCommons subscription form](https://mlcommons.org/community/subscribe/) with your organizational email address and [associate a Google account](https://accounts.google.com/SignUpWithoutGmail) with your organizational email address._ -Once you have the access, you can download the model automatically via the below command + +### Download model through MLCFlow Automation + ``` -mlcr get,ml-model,llama2 --outdirname=${CHECKPOINT_PATH} -j +mlcr get,ml-model,llama2-70b,_pytorch -j --outdirname= -j ``` ### External Download (Not recommended for official submission) @@ -82,6 +87,34 @@ git clone https://huggingface.co/meta-llama/Llama-2-70b-chat-hf ${CHECKPOINT_PAT ## Get Dataset +### Download Preprocessed dataset through MLCFlow Automation + +**Validation** + +``` +mlcr get,dataset,preprocessed,openorca,_validation --outdirname= -j +``` + +**Calibration** + +``` +mlcr get,dataset,preprocessed,openorca,_calibration --outdirname= -j +``` + +### Download Unprocessed dataset through MLCFlow Automation + +**Validation** + +``` +mlcr get,dataset,openorca,_validation --outdirname= -j +``` + +**Calibration** + +``` +mlcr get,dataset,openorca,_calibration --outdirname= -j +``` + ### Preprocessed You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket. @@ -244,6 +277,18 @@ scale from a 0.0-1.0 scale): This was run on a DGX-H100 node. Total runtime was ~4.5 days. +### Evaluate the accuracy through MLCFlow Automation +```bash +mlcr process,mlperf,accuracy,_openorca --result_dir= +``` + +Please click [here](https://github.com/mlcommons/inference/blob/master/language/llama2-70b/evaluate-accuracy.py) to view the Python script for evaluating accuracy for the Waymo dataset. + +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. + + # Run llama2-70b-interactive benchmark For official, Llama2-70b submissions it is also possible to submit in the interactive category. This sets a more strict latency requirements for Time to First Token (ttft) and Time per Output Token (tpot). Specifically, the interactive category requires loadgen to enforce `ttft <= 450ms` and `ttft <= 40ms` diff --git a/language/llama3.1-405b/README.md b/language/llama3.1-405b/README.md index 50668263c4..65cf226d03 100644 --- a/language/llama3.1-405b/README.md +++ b/language/llama3.1-405b/README.md @@ -7,11 +7,12 @@ - For server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query. This way the first token will be reported and it's latency will be measured. - For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each of the elements in response is a `lg.QuerySampleResponse` that contains the number of tokens (can be create this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match with the number of tokens on your answer and this will be checked in [TEST06](../../compliance/nvidia/TEST06/) -## Automated command to run the benchmark via MLFlow +## Automated command to run the benchmark via MLCFlow Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/llama3_1-405b/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. 
-You can also do pip install mlc-scripts and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. +You can also do `pip install mlc-scripts` and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. + ## Prepare environment @@ -99,11 +100,24 @@ pip install -e ../../loadgen ## Get Model ### MLCommons Members Download (Recommended for official submission) -You need to request for access to [MLcommons](http://llama3-1.mlcommons.org/) and you'll receive an email with the download instructions. You can download the model automatically via the below command +You need to request for access to [MLcommons](http://llama3-1.mlcommons.org/) and you'll receive an email with the download instructions. + +### Download model through MLCFlow Automation + +**From MLCOMMONS Google Drive** + ``` mlcr get,ml-model,llama3 --outdirname=${CHECKPOINT_PATH} -j ``` +**From HuggingFace** + +``` +mlcr get,ml-model,llama3,_hf --outdirname=${CHECKPOINT_PATH} --hf_token= -j +``` + +**Note:** +Downloading llama3.1-405B model from Hugging Face will require an [**access token**](https://huggingface.co/settings/tokens) which could be generated for your account. Additionally, ensure that your account has access to the [llama3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model. ### External Download (Not recommended for official submission) + First go to [llama3.1-request-link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request, sign in to HuggingFace (if you don't have account, you'll need to create one). **Please note your authentication credentials** as you may be required to provide them when cloning below. @@ -115,16 +129,22 @@ git clone https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct ${CHECKPOINT cd ${CHECKPOINT_PATH} && git checkout be673f326cab4cd22ccfef76109faf68e41aa5f1 ``` -### Download huggingface model through MLC + +## Get Dataset + +### Download dataset through MLCFlow Automation + +**Validation** ``` -mlcr get,ml-model,llama3,_hf --outdirname=${CHECKPOINT_PATH} --hf_token= -j +mlcr get,dataset,mlperf,inference,llama3,_validation --outdirname= -j ``` -**Note:** -Downloading llama3.1-405B model from Hugging Face will require an [**access token**](https://huggingface.co/settings/tokens) which could be generated for your account. Additionally, ensure that your account has access to the [llama3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model. 
+**Calibration** -## Get Dataset +``` +mlcr get,dataset,mlperf,inference,llama3,_calibration --outdirname= -j +``` ### Preprocessed @@ -144,11 +164,6 @@ You can then navigate in the terminal to your desired download directory and run ``` rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_dataset_8313_processed_fp16_eval.pkl ./ -P ``` -**MLC Command** - -``` -mlcr get,dataset,mlperf,inference,llama3,_validation --outdirname= -j -``` You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command: @@ -156,11 +171,6 @@ You can also download the calibration dataset from the Cloudflare R2 bucket by r rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_405b/mlperf_llama3.1_405b_calibration_dataset_512_processed_fp16_eval.pkl ./ -P ``` -**MLC Command** -``` -mlcr get,dataset,mlperf,inference,llama3,_calibration --outdirname= -j -``` - ## Run Performance Benchmarks @@ -267,3 +277,7 @@ Running the GPU implementation in FP16 precision resulted in the following FP16 } ``` The accuracy target is 99% for rougeL and exact_match, and 90% for tokens_per_sample + +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. \ No newline at end of file diff --git a/language/mixtral-8x7b/README.md b/language/mixtral-8x7b/README.md index 74935a4dc2..2bdcceb12c 100644 --- a/language/mixtral-8x7b/README.md +++ b/language/mixtral-8x7b/README.md @@ -9,7 +9,11 @@ - For all scenarios, when calling `lg.QuerySamplesComplete(response)`, it is necessary that each of the elements in response is a `lg.QuerySampleResponse` that contains the number of tokens (can be create this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match with the number of tokens on your answer and this will be checked in [TEST06](../../compliance/nvidia/TEST06/) -Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/mixtral-8x7b) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. +## Automated command to run the benchmark via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/mixtral-8x7b/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. + +You can also do `pip install mlc-scripts` and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. ## Prepare environment @@ -66,6 +70,12 @@ CPU-only setup, as well as any GPU versions for applicable libraries like PyTorc **Important Note:** Files and configurations of the model have changed, and might change in the future. If you are going to get the model from Hugging Face or any external source, use a version of the model that exactly matches the one in this [commit](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commit/a60832cb6c88d5cb6e507680d0e9996fbad77050). 
We strongly recommend to get the model following the steps in the next section: +### Download model through MLCFlow Automation + +``` +mlcr get,ml-model,mixtral --outdirname= -j +``` + ### Get Checkpoint #### Using Rclone @@ -87,6 +97,22 @@ rclone copy mlc-inference:mlcommons-inference-wg-public/mixtral_8x7b/mixtral-8x7 ## Get Dataset +### Download Preprocessed dataset through MLCFlow Automation + +**Validation** + +``` +mlcr get,dataset-mixtral,openorca-mbxp-gsm8k-combined,_validation --outdirname= -j +``` + +**Calibration** + +``` +mlcr get,dataset-mixtral,openorca-mbxp-gsm8k-combined,_calibration --outdirname= -j +``` + +- Adding `_wget` tag to the run command will change the download tool from `rclone` to `wget`. + ### Preprocessed #### Using Rclone @@ -228,6 +254,15 @@ fi The ServerSUT was not tested for GPU runs. +## Accuracy Evaluation + +### Evaluate the accuracy through MLCFlow Automation +```bash +mlcr process,mlperf,accuracy,_openorca-gsm8k-mbxp-combined --result_dir= +``` + +Please click [here](https://github.com/mlcommons/inference/blob/master/language/mixtral-8x7b/evaluate-accuracy.py) to view the Python script for evaluating accuracy for the Waymo dataset. + ### Evaluation Recreating the enviroment for evaluating the quality metrics can be quite tedious. Therefore we provide a dockerfile and recommend using docker for this task. 1. Build the evaluation container @@ -269,3 +304,7 @@ For official submissions, 99% of each reference score is enforced. Additionally, ```json {'tokens_per_sample': 144.84} ``` + +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. \ No newline at end of file diff --git a/recommendation/dlrm_v2/pytorch/README.md b/recommendation/dlrm_v2/pytorch/README.md index 6f09e26ded..1c0c6a615e 100755 --- a/recommendation/dlrm_v2/pytorch/README.md +++ b/recommendation/dlrm_v2/pytorch/README.md @@ -2,8 +2,12 @@ This is the reference implementation for MLCommons Inference benchmarks. +## Automated command to run the benchmark via MLCFlow + Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/recommendation/dlrm-v2/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. +You can also do `pip install mlc-scripts` and then use `mlcr` commands for downloading the model and datasets using the commands given in the later sections. + ### Supported Models **TODO: Decide benchmark name** @@ -71,7 +75,13 @@ CFLAGS="-std=c++14" python setup.py develop --user ### Download preprocessed Dataset -Download the preprocessed dataset using Rclone. +#### Download dataset through MLCFlow Automation + +``` +mlcr get,preprocessed,dataset,criteo,_validation --outdirname= -j +``` + +#### Download the preprocessed dataset using Rclone. To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows). To install Rclone on Linux/macOS/BSD systems, run: @@ -102,13 +112,10 @@ framework | Size in bytes (`du *`) | MD5 hash (`md5sum *`) N/A | pytorch | <2GB | - pytorch | 97.31GB | - -#### MLC method - -The following MLCommons MLC commands can be used to programmatically download the model checkpoint. 
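+
+The size and hash columns in the table above come from `du *` and `md5sum *`; after downloading the weights (via either method below) you can reproduce the same checks locally. A minimal sketch, assuming the checkpoint was placed under ./model/model_weights (an illustrative path, adjust it to wherever you downloaded the weights):
+
+```bash
+# Report the total on-disk size of the downloaded checkpoint directory
+du -sh ./model/model_weights
+# Compute per-file MD5 hashes for comparison against any published values
+md5sum ./model/model_weights/*
+```
+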
+#### Download model through MLCFlow Automation ``` -pip install mlc-scripts -mlcr get,ml-model,dlrm,_pytorch,_weight_sharded,_rclone -j +mlcr get,ml-model,get,ml-model,dlrm,_pytorch,weight_sharded,_rclone --outdirname= -j ``` #### Manual method @@ -312,6 +319,15 @@ In the reference implementation, each sample is mapped to 100-700 user-item pair ### Running accuracy script +#### Evaluate the accuracy through MLCFlow Automation + +```bash +mlcr process,mlperf,accuracy,_terabyte --result_dir= +``` + +Please click [here](https://github.com/mlcommons/inference/blob/master/recommendation/dlrm_v2/pytorch/tools/accuracy-dlrm.py) to view the Python script for evaluating accuracy for the Waymo dataset. + + To get the accuracy from a LoadGen accuracy json log file, 1. If your SUT outputs the predictions and the ground truth labels in a packed format like the reference implementation then run @@ -414,6 +430,10 @@ usage: main.py [-h] `--find-peak-performance` determine the maximum QPS for the Server, while not applicable to other scenarios. +## Automated command for submission generation via MLCFlow + +Please see the [new docs site](https://docs.mlcommons.org/inference/submission/) for an automated way to generate submission through MLCFlow. + ## License [Apache License 2.0](LICENSE) diff --git a/text_to_image/README.md b/text_to_image/README.md index b00595785b..b11873049e 100644 --- a/text_to_image/README.md +++ b/text_to_image/README.md @@ -164,3 +164,15 @@ Add the `--accuracy` to the command to run the benchmark ```bash python3 main.py --dataset "coco-1024" --dataset-path coco2014 --profile stable-diffusion-xl-pytorch --accuracy --model-path model/ [--dtype ] [--device ] [--time