- Default models list:
- meta-llama/Meta-Llama-3.1-8B-Instruct
- microsoft/Phi-3-mini-4k-instruct
- Qwen/Qwen2-7B
- mistralai/Mistral-7B-Instruct-v0.2
- openbmb/MiniCPM-1B-sft-bf16
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Users can input any model they like (see the sketch below this list)
- There is no guarantee that every model will compile for the NPU, though
- Here is a list of models likely to run on the NPU
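As an illustration, a custom Hugging Face model can be exported to OpenVINO IR and compiled for the NPU with optimum-intel. The snippet below is only a minimal sketch of that flow, not this project's code; the model ID and quantization settings are placeholders.

```python
# Minimal sketch (not this project's code): export a Hugging Face model to
# OpenVINO IR with INT4 weights and try to compile it for the NPU.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder: any Hugging Face model ID

# NPUs generally prefer symmetric, channel-wise INT4 weights (group_size=-1).
wq_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=-1, ratio=1.0)

model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=wq_config
)
model.to("NPU")  # NPU compilation happens lazily on first use; not every architecture will succeed

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello from the NPU!", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```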
- One-Time Setup: The script downloads the model, quantizes it, converts it to OpenVINO IR format, compiles it for the NPU, and caches the result for future use (see the sketch after this list). 💡⌛
- Performance: Surprisingly fast inference speeds, even on devices with modest computational power (e.g., my Meteor Lake's 13 TOPS NPU). ⚡⏳
- Power Efficiency: While inference might be faster on a CPU or GPU for some devices, the NPU is significantly more energy-efficient, making it ideal for laptops. 🔋🌐
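To make the one-time setup and caching idea concrete, here is a rough sketch, assuming the model folder was already exported to OpenVINO IR (e.g. with `optimum-cli export openvino`, which also converts the tokenizer). The paths and the CACHE_DIR property are illustrative, not the script's actual values.

```python
# Rough sketch of compile-once-then-reuse on the NPU with OpenVINO GenAI.
import openvino_genai as ov_genai

model_dir = "ov_models/TinyLlama-1.1B-int4"  # placeholder: exported IR + tokenizer files
cache_dir = "npu_cache"                      # placeholder: where compiled blobs are kept

# First run: OpenVINO compiles the model for the NPU (slow) and stores the blob in
# CACHE_DIR. Subsequent runs load the cached blob instead, so start-up is much faster.
pipe = ov_genai.LLMPipeline(model_dir, "NPU", CACHE_DIR=cache_dir)

print(pipe.generate("Explain what an NPU is in one sentence.", max_new_tokens=64))
```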
- Python 3.9 to 3.12
- An Intel processor with an NPU:
- Meteor Lake (Core Ultra Series 1, i.e., 1XX chips)
- Arrow Lake (Core Ultra Series 2, i.e., 2XX chips)
- Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
- The latest Intel NPU driver
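If you are unsure whether the driver is set up correctly, OpenVINO can report which devices it sees; a quick check could look like this:

```python
# Quick check that the NPU driver is installed and the NPU is visible to OpenVINO.
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']
if "NPU" in core.available_devices:
    print("NPU:", core.get_property("NPU", "FULL_DEVICE_NAME"))
else:
    print("No NPU found - check your processor and driver version.")
```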
git clone https://github.com/justADeni/intel-npu-llm.git
cd intel-npu-llm
python -m venv npu_venv
- On Windows:
npu_venv\Scripts\activate
- On Linux:
source npu_venv/bin/activate
pip install -r requirements.txt
python intel_npu_llm.py
- Resource-Intensive Compilation: The quantization and compilation steps can be time-consuming, taking up to tens of minutes depending on your hardware. However, they run only once per model, and the results are cached for future use. ⌛⚙️
- But wait, why does the context fill up and then reset? Continuous batching has not yet been implemented for NPUs by Intel's OpenVINO engineers. You can check the API coverage percentage here. 🚧🛠️
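As a hedged illustration of living with that limitation, a chat loop can track a rough token budget and restart the conversation when it is spent. The budget, the characters-per-token estimate, and the model path below are invented placeholders, not the project's values.

```python
# Illustrative only: without continuous batching the context window is finite,
# so the chat history has to be reset once it fills up.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_models/TinyLlama-1.1B-int4", "NPU")  # placeholder path

MAX_CONTEXT_TOKENS = 1024  # made-up budget for the sketch
used_tokens = 0

pipe.start_chat()
while True:
    prompt = input("You: ")
    if prompt.strip().lower() in {"exit", "quit"}:
        break
    reply = pipe.generate(prompt, max_new_tokens=128)
    print("Model:", reply)

    # Very rough token estimate; once the budget is spent, drop the chat history.
    used_tokens += (len(prompt) + len(str(reply))) // 4
    if used_tokens > MAX_CONTEXT_TOKENS:
        pipe.finish_chat()  # this is the "context fills up and then resets" moment
        pipe.start_chat()
        used_tokens = 0
pipe.finish_chat()
```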
Contributions, bug reports, and feature requests are welcome! Feel free to open an issue or submit a pull request. 🔨✍️
This project is licensed under the MIT License. 🔒✨
Enjoy using intel-npu-llm! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧