The Python (PyTorch) example performs end-to-end inference of the model with streaming output, using the tokenizer from the `transformers` library.
Please refer to Installation. This example can also run directly from source: you don't need to install xFasterTransformer through pip. Just build the xFasterTransformer library, and the example will search for it in the src directory.
Please refer to Prepare model.
- Please refer to Prepare Environment to install oneCCL.
- Python dependencies.
# requirements.txt in root directory.
pip install -r requirements.txt
PS: Due to potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.
# Preloading `libiomp5.so` is recommended for better performance.
# Alternatively, set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
# run single instance like
python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
# run multi-rank like
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH} : \
-n 1 numactl -N 1 -m 1 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
More parameter options:
- `-h`, `--help`: Show help message and exit.
- `-t`, `--token_path`: Path to tokenizer directory.
- `-m`, `--model_path`: Path to model directory.
- `-d`, `--dtype`: Data type, defaults to `fp16`, supports `{fp16, bf16, int8, w8a8, int4, nf4, bf16_fp16, bf16_int8, bf16_w8a8, bf16_int4, bf16_nf4, w8a8_int8, w8a8_int4, w8a8_nf4}`.
- `--streaming`: Streaming output, defaults to True.
- `--num_beams`: Number of beams, defaults to 1, which is greedy search.
- `--output_len`: Maximum number of tokens to generate, excluding the input.
- `--padding`: Enable tokenizer padding, defaults to True.
- `--chat`: Enable chat mode for ChatGLM models, defaults to False.
- `--do_sample`: Enable sampling search, defaults to False.
- `--temperature`: Value used to modulate the next-token probabilities.
- `--top_p`: Retain the minimal set of tokens whose cumulative probability reaches the topP threshold.
- `--top_k`: Number of highest-probability tokens to keep for generation.
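
To make the sampling options concrete, here is a minimal pure-Python sketch of how `--temperature`, `--top_k` and `--top_p` are commonly combined when `--do_sample` is enabled. It illustrates the general technique only and is not taken from xFasterTransformer's implementation. The function names `filter_probs` and `sample` are hypothetical, and treating `top_k=0` as "no top-k filter" is an assumption of this sketch.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration; not part of xFasterTransformer.
    """
    # Temperature rescales the logits: <1 sharpens the distribution, >1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most probable tokens (0 disables it in this sketch).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so their probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
    print(filter_probs(logits, top_k=2))
    print(sample(logits, random.Random(0), temperature=0.7, top_p=0.9))
```

With `top_k=1` the sketch always returns the most probable token, which matches greedy search; larger `top_k`, larger `top_p`, or a higher `temperature` each widen the set of tokens that can be drawn.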