</div>

# ByteMLPerf Benchmark Tool

ByteMLPerf is an AI Accelerator Benchmark that focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware. ByteMLPerf has the following characteristics:
- Models and runtime environments are more closely aligned with practical business use cases.
- For ASIC hardware evaluation, besides evaluating performance and accuracy, it also measures metrics like compiler usability and coverage.
- Performance and accuracy results obtained from testing on the open Model Zoo serve as reference metrics for evaluating ASIC hardware integration.

## Category

The ByteMLPerf benchmark is structured into three main categories: Inference, Training, and Micro, each targeting different aspects of AI accelerator performance:
- Inference: This category is subdivided into two distinct sections to cater to different types of models:
  - General Performance: This section is dedicated to evaluating the inference capabilities of accelerators using common models such as ResNet-50 and BERT. It aims to provide a broad understanding of the accelerator's performance across a range of typical tasks. Vendors can refer to this document for guidance on building a General Perf backend: [ByteMLPerf General Perf Guide](https://bytedance.us.feishu.cn/docx/L98Mdw3J6obMtJxeRBzuHeRbsof) [[中文版](https://bytedance.feishu.cn/docs/doccno9eLS3OseTA5aMBeeQf2cf#TDK8of)]
  - Large Language Model (LLM) Performance: Specifically designed to assess the capabilities of accelerators in handling large language models, this section addresses the unique challenges posed by the size and complexity of these models. Vendors can refer to this document for guidance on building an LLM Perf backend: [ByteMLPerf LLM Perf Guide](https://bytedance.larkoffice.com/docx/ZoU7dkPXYoKtJtxlrRMcNGMwnTc) [[中文版](https://bytedance.larkoffice.com/docx/ZoU7dkPXYoKtJtxlrRMcNGMwnTc)]
- Micro: The Micro category focuses on the performance of specific operations or "ops" that are fundamental to AI computations, such as Gemm, Softmax, and various communication operations. This granular level of testing is crucial for understanding the capabilities and limitations of accelerators at a more detailed operational level. Vendors can refer to this document for guidance on building a Micro Perf backend: [ByteMLPerf Micro Perf Guide](https://bytedance.us.larkoffice.com/docx/EpjFdSpRsoOIHWxtKgjuRsMPsFB) [[中文版](https://bytedance.us.larkoffice.com/docx/LJWvdGVAzoxXkTxF9h9uIETbsWc)]
- Training: Currently under development, this category aims to evaluate the performance of AI accelerators in training scenarios. It will provide insights into how well accelerators can handle the computationally intensive process of training AI models, which is vital for the development of new and more advanced AI systems.
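To make the Micro category concrete, the sketch below times a single Gemm op and reports achieved throughput. This is an illustrative stand-alone micro-benchmark, not ByteMLPerf's actual Micro Perf harness; the shapes, warmup/iteration counts, and use of NumPy are assumptions for the example only.

```python
# Illustrative single-op (Gemm) micro-benchmark in the spirit of the Micro
# category. Not ByteMLPerf's harness; shapes and timing loop are hypothetical.
import time
import numpy as np

def bench_gemm(m, n, k, warmup=2, iters=10):
    """Time an (m,k) x (k,n) fp32 matmul and return achieved GFLOP/s."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    for _ in range(warmup):            # warm caches / BLAS thread pools
        a @ b
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = (time.perf_counter() - start) / iters
    flops = 2.0 * m * n * k            # one multiply + one add per MAC
    return flops / elapsed / 1e9       # GFLOP/s

if __name__ == "__main__":
    print(f"512x512x512 gemm: {bench_gemm(512, 512, 512):.1f} GFLOP/s")
```

A real Micro Perf backend would run the same op on the accelerator under test and sweep shapes and dtypes; the structure (warmup, timed loop, derived throughput) stays the same.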
Vendors looking to evaluate and improve their AI accelerators can use the ByteMLPerf benchmark as a comprehensive guide. The benchmark not only offers a detailed framework for performance and accuracy evaluation but also includes considerations for compiler usability and coverage for ASIC hardware, ensuring a holistic assessment approach.
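As a concrete starting point, earlier revisions of this README documented a `launch.py` entry point for General Perf runs, where `--task` must match a workload file name under `general_perf/workloads` (format `model-framework-precision`) and `--hardware_type` must match a backend folder name under `general_perf/backends`. A typical invocation, per those notes, looked like:

```shell
# Evaluate the workload general_perf/workloads/bert-tf-fp16.json
# on the Habana Goya backend (general_perf/backends/GOYA).
python3 launch.py --task bert-tf-fp16 --hardware_type GOYA
```

Check the per-backend guides linked above for the current entry point and flags for each category.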
For more details, you can visit our official website here: [bytemlperf.ai](https://bytemlperf.ai/)

## Vendor List

The ByteMLPerf vendor backend list is shown below:
| Vendor | SKU | Key Parameters | Inference (General Perf) | Inference (LLM Perf) |
| :---- | :---- | :---- | :---- | :---- |
| Intel | Xeon | - | - | - |
| Stream Computing | STC P920 | <li>Computation Power: 128 TFLOPS@FP16 <li>Last Level Buffer: 8MB, 256GB/s <li>Level 1 Buffer: 1.25MB, 512GB/s <li>Memory: 16GB, 119.4GB/s <li>Host Interface: PCIe 4, 16x, 32GB/s <li>TDP: 160W | [STC Introduction](byte_infer_perf/general_perf/backends/STC/README.md) | - |
| Graphcore | Graphcore® C600 | <li>Compute: 280 TFLOPS@FP16, 560 TFLOPS@FP8 <li>In-Processor Memory: 900 MB, 52 TB/s <li>Host Interface: Dual PCIe Gen4 8-lane interfaces, 32GB/s <li>TDP: 185W | [IPU Introduction](byte_infer_perf/general_perf/backends/IPU/README.md) | - |
| Moffett-AI | Moffett-AI S30 | <li>Compute: 1440 (32x-Sparse) TFLOPS@BF16, 2880 (32x-Sparse) TOPS@INT8 <li>Memory: 60 GB <li>Host Interface: Dual PCIe Gen4 8-lane interfaces, 32GB/s <li>TDP: 250W | [SPU Introduction](byte_infer_perf/general_perf/backends/SPU/README.md) | - |
| Habana | Gaudi2 | <li>24 Tensor Processor Cores, dual matrix multiplication engines <li>Memory: 96 GB HBM2E, 48MB SRAM | [HPU Introduction](byte_infer_perf/general_perf/backends/HPU/README.md) | - |

## Statement

[ASF Statement on Compliance with US Export Regulations and Entity List](https://news.apache.org/foundation/entry/statement-by-the-apache-software)