Intel Extension for Transformers provides multiple reference deployments: 1) Neural Engine; 2) IPEX.
Neural Engine delivers optimal performance for highly compressed transformer-based models, with optimizations in both hardware and software. It is a reference deployment for Intel Extension for Transformers; additional backends will be enabled in the future.
Supported Examples
| Question-Answering | Text-Classification |
|---|---|
| Bert-large (SQUAD) | Bert-mini (SST2), MiniLM (SST2), Distilbert (SST2), Distilbert (Emotion), Bert-base (MRPC), Bert-mini (MRPC), Distilbert (MRPC), Roberta-base (MRPC) |
Windows and Linux are supported.
# prepare your env
conda create -n <env name> python=3.7
conda install cmake --yes
conda install absl-py --yes
Install Intel Neural Compressor as a prerequisite; the recommended version is 1.14.2.
pip install neural-compressor==1.14.2
cd <project folder>/intel_extension_for_transformers/
python setup.py install/develop
If you only need the backends, add "--backends" when installing. The resulting package is named intel_extension_for_transformers_backends, and Intel Neural Compressor is not required for it.
python3 setup.py install/develop --backends
Note: Please check that only one of intel_extension_for_transformers and intel_extension_for_transformers_backends is installed in your environment to prevent possible conflicts. You can pip uninstall intel_extension_for_transformers/intel_extension_for_transformers_backends before installing.
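As a sanity check, the sketch below prints which of the two packages is currently installed; the distribution names are taken from the package names above, and importlib.metadata requires Python 3.8+ (on 3.7, use the importlib_metadata backport instead).
# A minimal sketch for checking which of the two packages is present before installing;
# the distribution names are assumptions based on the package names above.
from importlib.metadata import version, PackageNotFoundError
for pkg in ("intel_extension_for_transformers", "intel_extension_for_transformers_backends"):
    try:
        print(f"{pkg} {version(pkg)} is installed")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")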
from intel_extension_for_transformers.backends.neural_engine.compile import compile
# compile a TensorFlow or ONNX model into the Neural Engine graph
model = compile('/path/to/your/model')
# save the Neural Engine IR to the given directory
model.save('/ir/path')
Note that Neural Engine supports TensorFlow and ONNX models.
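If you start from a PyTorch checkpoint, one possible path is to export it to ONNX first and then compile the exported file. The sketch below is only an illustration, not part of the Neural Engine API; the model name, file paths, and opset version are assumptions.
# A rough sketch of exporting a Hugging Face PyTorch model to ONNX so it can be compiled;
# the model name, file paths, and opset version are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)
sample = tokenizer("hello world", return_tensors="pt")

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=13,
)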
./neural_engine --config=<path to yaml file> --weight=<path to bin file> --batch_size=32 --iterations=20
You can use the numactl command to bind CPU cores and run multiple instances:
OMP_NUM_THREADS=4 numactl -C '0-3' ./neural_engine ...
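For example, a small launcher script could start several instances, each pinned to its own cores; the core layout, instance count, and file paths below are assumptions.
# A minimal sketch that launches several Neural Engine instances, each bound to its own
# cores with numactl; core layout, instance count, and file paths are assumptions.
import os
import subprocess

cores_per_instance = 4
num_instances = 2
procs = []
for i in range(num_instances):
    start = i * cores_per_instance
    end = start + cores_per_instance - 1
    env = dict(os.environ, OMP_NUM_THREADS=str(cores_per_instance))
    cmd = ["numactl", "-C", f"{start}-{end}",
           "./neural_engine",
           "--config=conf.yaml", "--weight=model.bin",
           "--batch_size=32", "--iterations=20"]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()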
Enable or disable logging with GLOG_minloglevel (GLOG_minloglevel=1 enables logging, GLOG_minloglevel=2 disables it):
export GLOG_minloglevel=2
./neural_engine ...
If you used python setup.py install to install Neural Engine in your current folder, you can use the Python API as follows.
from intel_extension_for_transformers.backends.neural_engine.compile import compile
# load the model
graph = compile('./model_and_tokenizer/int8-model.onnx')
# use graph.inference to do inference
out = graph.inference([input_ids, segment_ids, input_mask])
# dump the neural engine IR to file
graph.save('./ir')
The input_ids, segment_ids and input_mask are the model's input NumPy arrays, and the input dimensions are variable. Note that out is a dict that maps each model output tensor name to its NumPy data (out={output name : numpy data}).
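For reference, here is a minimal sketch of producing those NumPy inputs with a Hugging Face tokenizer and feeding them to graph.inference; the tokenizer name and model path are illustrative, and your model may expect a specific integer dtype.
# A minimal sketch of building the NumPy inputs with a Hugging Face tokenizer and running
# inference; the tokenizer name and model path are illustrative assumptions.
from transformers import AutoTokenizer
from intel_extension_for_transformers.backends.neural_engine.compile import compile

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("hello world", return_tensors="np")

graph = compile("./model_and_tokenizer/int8-model.onnx")
out = graph.inference([enc["input_ids"],        # input_ids
                       enc["token_type_ids"],   # segment_ids
                       enc["attention_mask"]])  # input_mask
for name, data in out.items():
    print(name, data.shape)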
If you want to analyze the performance of each operator, just export ENGINE_PROFILING=1 and export INST_NUM=<inst_num>. The latency of each operator will be dumped to <curr_path>/engine_profiling/profiling_<inst_count>.csv and the latency of each iteration to <curr_path>/engine_profiling/profiling_<inst_count>.json.
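The dumped CSV files can then be inspected with any tool you like; the sketch below simply loads them with pandas and prints a generic summary, since the exact columns depend on the dump.
# A minimal sketch for inspecting the dumped per-operator profiling CSVs with pandas;
# the exact columns depend on the dump, so only a generic summary is printed here.
import glob
import pandas as pd

for path in sorted(glob.glob("engine_profiling/profiling_*.csv")):
    df = pd.read_csv(path)
    print(f"{path}: {len(df)} operator records")
    print(df.head())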
Intel® Extension for PyTorch* extends PyTorch with optimizations for an extra performance boost on Intel hardware.
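Below is a minimal sketch of the typical IPEX inference flow; it assumes intel_extension_for_pytorch and torchvision are installed, and the model and input shape are purely illustrative.
# A minimal sketch of the typical IPEX inference flow; the ResNet-50 model and input
# shape are used purely for illustration.
import torch
import intel_extension_for_pytorch as ipex
import torchvision.models as models

model = models.resnet50(weights=None).eval()
data = torch.rand(1, 3, 224, 224)

model = ipex.optimize(model)  # apply IPEX optimizations for inference
with torch.no_grad():
    output = model(data)
print(output.shape)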