- Efficient Exploration for LLMs, arXiv, 2402.00396, arxiv, pdf, citations: -1 · Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
- A Comprehensive Survey of Compression Algorithms for Language Models, arXiv, 2401.15347, arxiv, pdf, citations: -1 · Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang
- A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, citations: -1 · Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang · (efficient_foundation_model_survey - ubiquitouslearning)
- Understanding LLMs: A Comprehensive Overview from Training to Inference, arXiv, 2401.02038, arxiv, pdf, citations: -1 · Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong
- Efficient Large Language Models: A Survey, arXiv, 2312.03863, arxiv, pdf, citations: -1 · Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab)
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, arXiv, 2312.15234, arxiv, pdf, citations: -1 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, arXiv, 2312.00678, arxiv, pdf, citations: -1 · Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, arXiv, 2303.15647, arxiv, pdf, citations: -1 · Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky
- Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, arXiv, 2401.00788, arxiv, pdf, citations: -1 · Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff
- Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion, arXiv, 2312.14327, arxiv, pdf, citations: -1 · Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) · (jiqizhixin)
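Background for the LoRA entries that follow: LoRA freezes the pretrained weight W and trains a low-rank update ΔW = BA, so a linear layer computes Wx + (α/r)·BAx. A minimal PyTorch sketch of the idea (the class name and the rank/alpha defaults are illustrative, not taken from any linked post):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # only the adapter is trained
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(2, 512))  # trains 2*8*512 adapter params instead of 512*512
```

Merging the adapter back is just W ← W + (α/r)·BA, which is why a merged LoRA adds no inference latency.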
- MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, arXiv, 2311.11501, arxiv, pdf, citations: -1 · Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying, arXiv, 2311.09578, arxiv, pdf, citations: -1 · Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev
- SiRA: Sparse Mixture of Low Rank Adaptation, arXiv, 2311.09179, arxiv, pdf, citations: -1 · Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, citations: -1 · Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)
- Punica: Multi-Tenant LoRA Serving, arXiv, 2310.18547, arxiv, pdf, citations: -1 · Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai)
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters, arXiv, 2311.03285, arxiv, pdf, citations: -1 · Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora)
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, arXiv, 2310.18356, arxiv, pdf, citations: -1 · Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang
- VeRA: Vector-based Random Matrix Adaptation, arXiv, 2310.11454, arxiv, pdf, citations: -1 · Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, arXiv, 2310.08659, arxiv, pdf, citations: 1 · Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao · (peft - huggingface)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, citations: -1 · Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993)
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, arXiv, 2309.12307, arxiv, pdf, citations: 5 · Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research)
- LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, arXiv, 2307.13269, arxiv, pdf, citations: 6 · Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin
- Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, citations: 2 · Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet)
- LLaMA-Efficient-Tuning - hiyouga: fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)
- InRank: Incremental Low-Rank Learning, arXiv, 2306.11250, arxiv, pdf, citations: 2 · Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao)
- One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning, arXiv, 2306.07967, arxiv, pdf, citations: -1 · Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources, arXiv, 2306.09782, arxiv, pdf, citations: -1 · Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket, arXiv, 2310.17752, arxiv, pdf, citations: -1 · Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
- AI and Memory Wall, by Amir Gholami (riselab, Medium)
- Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning, arXiv, 2311.11077, arxiv, pdf, citations: -1 · Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arXiv, 2305.15023, arxiv, pdf, citations: 18 · Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996) · (mp.weixin.qq)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arXiv, 2304.15010, arxiv, pdf, citations: 82 · Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)
- A Performance Evaluation of a Quantized Large Language Model on Various Smartphones, arXiv, 2312.12472, arxiv, pdf, citations: -1 · Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb
- A Survey on Model Compression for Large Language Models, arXiv, 2308.07633, arxiv, pdf, citations: -1 · Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, arXiv, 2401.14112, arxiv, pdf, citations: -1 · Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks, arXiv, 2312.08583, arxiv, pdf, citations: -1 · Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, arXiv, 2311.12023, arxiv, pdf, citations: -1 · Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models, arXiv, 2310.09259, arxiv, pdf, citations: -1 · Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab)
- FP8-LM: Training FP8 Large Language Models, arXiv, 2310.18313, arxiv, pdf, citations: -1 · Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv, 2310.16836, arxiv, pdf, citations: -1 · Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv, 2310.16795, arxiv, pdf, citations: -1 · Elias Frantar, Dan Alistarh · (mp.weixin.qq)
- BitNet: Scaling 1-bit Transformers for Large Language Models, arXiv, 2310.11453, arxiv, pdf, citations: -1 · Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, arXiv, 2310.19102, arxiv, pdf, citations: -1 · Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs, arXiv, 2310.10944, arxiv, pdf, citations: 1 · Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
- Efficient Post-training Quantization with FP8 Formats, arXiv, 2309.14592, arxiv, pdf, citations: -1 · Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, citations: -1 · Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian
- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv, 2309.05516, arxiv, pdf, citations: -1 · Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv
- Memory Efficient Optimizers with 4-bit States, arXiv, 2309.01507, arxiv, pdf, citations: 1 · Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, arXiv, 2308.13137, arxiv, pdf, citations: 2 · Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search, arXiv, 2308.03290, arxiv, pdf, citations: -1 · Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees, arXiv, 2307.13304, arxiv, pdf, citations: -1 · Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, arXiv, 2306.12929, arxiv, pdf, citations: -1 · Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
- Training Transformers with 4-bit Integers, arXiv, 2306.11987, arxiv, pdf, citations: -1 · Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv, 2306.03078, arxiv, pdf, citations: -1 · Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv, 2306.00978, arxiv, pdf, citations: -1 · Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, arXiv, 2210.17323, arxiv, pdf, citations: -1 · Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, arXiv, 2208.07339 · (bitsandbytes - timdettmers)
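LLM.int8() and several methods above start from row-wise absmax quantization: each weight row is scaled into int8 range by its largest absolute value. A minimal round-trip sketch (illustrative only; it omits LLM.int8()'s mixed-precision decomposition for outlier features):

```python
import torch

def quantize_rowwise_int8(w: torch.Tensor):
    """Per-row absmax quantization: int8 weights plus one float scale per row."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_rowwise_int8(w)
max_err = (dequantize_int8(q, scale) - w).abs().max()  # small unless a row has outliers
```

Outlier channels are what break this naive scheme at scale; handling them is precisely what LLM.int8(), AWQ, and SpQR above tackle in different ways.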
- exllamav2 - turboderp: a fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)
- PB-LLM - hahnyuan: PB-LLM: Partially Binarized Large Language Models
- AttentionIsOFFByOne - kyegomez: implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)
- llama.cpp - ggerganov: port of Facebook's LLaMA model in C/C++ · (finbarr)
- llama2-webui - liltom-eth: run Llama 2 locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama-2-7B/13B/70B with 8-bit and 4-bit quantization, GPU inference (6 GB VRAM), and CPU inference.
- neural-compressor - intel: unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation, on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime · (neural-compressor - intel) · (mp.weixin.qq)
- exllama - turboderp: a more memory-efficient rewrite of the HF Transformers implementation of Llama for use with quantized weights.
- squeezellm - squeezeailab: SqueezeLLM: Dense-and-Sparse Quantization
- Overview of natively supported quantization schemes in 🤗 Transformers
- Making LLMs lighter with AutoGPTQ and transformers
- TheBloke (Tom Jobbins)
- Quantization
- Scavenging Hyena: Distilling Transformers into Long Convolution Models, arXiv, 2401.17574, arxiv, pdf, citations: -1 · Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang
- Initializing Models with Larger Ones, arXiv, 2311.18823, arxiv, pdf, citations: -1 · Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq)
- Tailoring Self-Rationalizers with Multi-Reward Distillation, arXiv, 2311.02805, arxiv, pdf, citations: -1 · Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models, arXiv, 2311.02849, arxiv, pdf, citations: -1 · Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
- TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise, arXiv, 2310.19019, arxiv, pdf, citations: -1 · Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao
- Farzi Data: Autoregressive Data Distillation, arXiv, 2310.09983, arxiv, pdf, citations: -1 · Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, arXiv, 2305.02301, arxiv, pdf, citations: 48 · Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
- Composable Function-preserving Expansions for Transformer Architectures, arXiv, 2308.06103, arxiv, pdf, citations: 1 · Andrea Gesmundo, Kaitlin Maile
- UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, arXiv, 2308.03279, arxiv, pdf, citations: 2 · Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon
- Generalized Knowledge Distillation for Auto-regressive Language Models, arXiv, 2306.13649, arxiv, pdf, citations: -1 · Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
- Knowledge Distillation of Large Language Models, arXiv, 2306.08543, arxiv, pdf, citations: -1 · Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
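The common baseline behind these distillation papers is a token-level, temperature-scaled KL divergence between teacher and student next-token distributions; the papers above then argue over which divergence direction and which data to use. A minimal sketch of that baseline loss (forward KL; the temperature and logit shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0):
    """Token-level distillation loss: KL(teacher || student) at temperature T.

    Logits: (batch, seq, vocab). The T^2 factor keeps gradient magnitudes
    comparable across temperatures (Hinton et al., 2015).
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

loss = kd_loss(torch.randn(2, 16, 32000), torch.randn(2, 16, 32000))
```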
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns, arXiv, 2401.15024, arxiv, pdf, citations: -1 · Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
- Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic
- The LLM Surgeon, arXiv, 2312.17244, arxiv, pdf, citations: -1 · Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort
- Mini-GPTs: Efficient Large Language Models through Contextual Pruning, arXiv, 2312.12682, arxiv, pdf, citations: -1 · Tim Valicenti, Justice Vidal, Ritik Patnaik
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, arXiv, 2310.06694, arxiv, pdf, citations: 2 · Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp)
- wanda - locuslab: a simple and effective LLM pruning approach.
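Wanda's criterion is compact enough to state in a few lines: score each weight by |W_ij| times the L2 norm of input channel j measured on calibration data, then zero the lowest-scoring weights in each output row. A minimal sketch (the calibration tensor and the 50% unstructured sparsity are illustrative):

```python
import torch

def wanda_prune(w: torch.Tensor, x_calib: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero the lowest-importance weights per output row.

    w: (out_features, in_features); x_calib: (n_samples, in_features).
    Importance = |W| * per-input-channel activation norm, as in the Wanda paper.
    """
    importance = w.abs() * x_calib.norm(p=2, dim=0)      # broadcasts over rows
    n_prune = int(w.shape[1] * sparsity)
    prune_idx = importance.argsort(dim=1)[:, :n_prune]   # lowest scores per row
    mask = torch.ones_like(w)
    mask.scatter_(1, prune_idx, 0.0)
    return w * mask

pruned = wanda_prune(torch.randn(8, 32), torch.randn(128, 32))
```

Unlike magnitude pruning, the activation norm term lets small weights on high-traffic input channels survive, which is the paper's key observation.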
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2401.15077, arxiv, pdf, citations: -1 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, arXiv, 2401.12522, arxiv, pdf, citations: -1 · Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2401.10774, arxiv, pdf, citations: -1 · Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao · (medusa - fasterdecoding)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference, arXiv, 2401.08671, arxiv, pdf, citations: -1 · Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, arXiv, 2401.08294, arxiv, pdf, citations: -1 · Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow)
- PainlessInferenceAcceleration - alipay
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, 2401.07851, arxiv, pdf, citations: -1 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
- Efficient LLM inference solution on Intel GPU, arXiv, 2401.05391, arxiv, pdf, citations: -1 · Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu
- SwiftInfer - hpcaitech: efficient AI inference & serving · (qbitai)
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, arXiv, 2401.02669, arxiv, pdf, citations: -1 · Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li
- nitro - janhq: a fast, lightweight, embeddable inference engine to supercharge your apps with local AI; OpenAI-compatible API
- jan - janhq: Jan is an open-source alternative to ChatGPT that runs 100% offline on your computer
- Fairness in Serving Large Language Models, arXiv, 2401.00588, arxiv, pdf, citations: -1 · Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora)
- tricksy - austinsilveria: fast approximate inference on a single GPU with sparsity-aware offloading
- mixtral-offloading - dvmazur: run Mixtral-8x7B models in Colab or on consumer desktops
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, 2312.11514, arxiv, pdf, citations: -1 · Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
- Efficiently Programming Large Language Models using SGLang, arXiv, 2312.07104, arxiv, pdf, citations: -1 · Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project) · (lmsys)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv, 2312.12456, arxiv, pdf, citations: -1 · Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS)
- Cascade Speculative Drafting for Even Faster LLM Inference, arXiv, 2312.11462, arxiv, pdf, citations: -1 · Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
- LLMLingua - microsoft: compresses prompts and the KV cache to speed up LLM inference and help the model focus on key information, achieving up to 20x compression with minimal performance loss.
- vllm - vllm-project: a high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)
- SparQ Attention: Bandwidth-Efficient LLM Inference, arXiv, 2312.04985, arxiv, pdf, citations: -1 · Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
- LLM inference notes by Yao Fu (Notion) · (yaofu.notion)
- Optimum-NVIDIA: unlocking blazingly fast LLM inference in just 1 line of code
- PaSS: Parallel Speculative Sampling, arXiv, 2311.13581, arxiv, pdf, citations: -1 · Giovanni Monea, Armand Joulin, Edouard Grave
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org · (LookaheadDecoding - hao-ai-lab)
- Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, arXiv, 2311.03687, arxiv, pdf, citations: -1 · Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi · (jiqizhixin)
- FlashDecoding++: Faster Large Language Model Inference on GPUs, arXiv, 2311.01282, arxiv, pdf, citations: -1 · Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML, 2023, arxiv, pdf, citations: 16 · Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re
- TensorRT-LLM - NVIDIA: provides users with an easy-to-use Python API to define Large Language Models (LLMs)
- Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv, 2310.10837, arxiv, pdf, citations: -1 · Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
- deepsparse - neuralmagic: inference runtime offering GPU-class performance on CPUs, with APIs to integrate ML into your application · (huggingface)
- attention_sinks - tomaarsen: extend existing LLMs well beyond their original training length, with constant memory usage and without retraining
- Efficient Streaming Language Models with Attention Sinks, arXiv, 2309.17453, arxiv, pdf, citations: 3 · Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis · (streaming-llm - mit-han-lab) · (mp.weixin.qq)
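The attention-sink observation yields a constant-memory cache policy: keep the KV entries of the first few tokens (the "sinks") plus a sliding window of recent tokens, and evict everything in between. A toy sketch of that eviction rule (shapes and the sink/window sizes are illustrative; real implementations such as streaming-llm also re-index positions inside the shrunken cache):

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             n_sink: int = 4, window: int = 1020):
    """Keep the first n_sink tokens plus the most recent `window` tokens.

    keys/values: (seq_len, n_heads, head_dim). Cache size stays bounded by
    n_sink + window no matter how long generation runs.
    """
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([torch.arange(n_sink),
                      torch.arange(seq_len - window, seq_len)])
    return keys[keep], values[keep]

k = torch.randn(5000, 32, 128)
v = torch.randn(5000, 32, 128)
k, v = evict_kv(k, v)   # k.shape[0] == 1024 from here on
```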
- Efficient Memory Management for Large Language Model Serving with PagedAttention, Proceedings of the 29th Symposium on Operating Systems Principles, 2023, arxiv, pdf, citations: 21 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)
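PagedAttention's core trick is to stop storing each sequence's KV cache contiguously: the cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to blocks, so memory is allocated on demand and shared prefixes can point at the same blocks. A toy allocator sketch (the block size and data layout are illustrative, not vLLM's actual implementation):

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block

@dataclass
class BlockAllocator:
    """Hands out physical KV blocks; sequences reference them via block tables."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

def append_token(block_table: list, seq_len: int, alloc: BlockAllocator):
    """Map the next logical token to (physical_block, offset), growing the
    table only when the current block fills up."""
    if seq_len % BLOCK_SIZE == 0:        # table empty or last block full
        block_table.append(alloc.alloc())
    return block_table[-1], seq_len % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=1024)
table = []
for pos in range(40):                    # 40 tokens occupy ceil(40/16) = 3 blocks
    block, offset = append_token(table, pos, alloc)
assert len(table) == 3                   # no contiguous 40-token region needed
```

Because unused block slots are the only waste, internal fragmentation is capped at one block per sequence instead of a whole pre-reserved max-length buffer.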
- llama2.mojo - tairov: inference Llama 2 in one file of pure 🔥 · (qbitai)
- fastllm - ztxz16: a pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile phones
- flexflow - flexflow: a distributed deep learning framework.
- Accelerating LLM Inference with Staged Speculative Decoding, arXiv, 2308.04623, arxiv, pdf, citations: 3 · Benjamin Spector, Chris Re
- CTranslate2 - OpenNMT: fast inference engine for Transformer models
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, arXiv, 2307.15337, arxiv, pdf, citations: 4 · Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, arXiv, 2307.02628, arxiv, pdf, citations: -1 · Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee
- An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs, arXiv, 2306.16601, arxiv, pdf, citations: -1 · Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang
- NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes, arXiv, 2306.16869, arxiv, pdf, citations: -1 · Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv, 2306.14048, arxiv, pdf, citations: -1 · Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference) · (mp.weixin.qq)
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification, arXiv, 2305.09781, arxiv, pdf, citations: -1 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow) · (mp.weixin.qq)
- llama.cpp - ggerganov: port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov)
- LLM Inference Provider Leaderboard · (jiqizhixin)
- Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive · (mp.weixin.qq)
- Speculative execution for LLMs is an excellent inference-time optimization; see the sketch below.
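The idea in one paragraph: a small draft model proposes a short run of tokens autoregressively, and the large target model checks all of them with a single forward pass; whenever the two agree, you get several tokens for one expensive call. A greedy-verification sketch (the papers above use stochastic acceptance to preserve the target distribution exactly; `draft` and `target` are assumed to be callables returning logits of shape (batch, seq, vocab)):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round of greedy speculative decoding.

    The cheap draft model proposes k tokens; the expensive target model scores
    the extended sequence in one pass; we keep the longest agreed prefix plus
    one token from the target itself.
    """
    proposal = ids
    for _ in range(k):                                   # k cheap draft steps
        next_tok = draft(proposal).argmax(dim=-1)[:, -1:]
        proposal = torch.cat([proposal, next_tok], dim=-1)

    target_pred = target(proposal).argmax(dim=-1)        # one expensive pass
    n = ids.shape[1]
    accepted = 0
    for i in range(k):                                   # longest agreed prefix
        if proposal[0, n + i] != target_pred[0, n + i - 1]:
            break
        accepted += 1
    keep = proposal[:, : n + accepted]
    bonus = target_pred[:, n + accepted - 1 : n + accepted]  # target's own next token
    return torch.cat([keep, bonus], dim=-1)
```

Every returned token is one the target model would have produced greedily on its own, so the speedup costs nothing in output quality; how often the draft agrees with the target is what determines the gain.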
- tvm_mlir_learn - BBuf: a collection of compiler learning resources · (mp.weixin.qq)
- No need for four H100s: the 34B-parameter Code Llama runs on a Mac at 20 tokens per second and is at its best on code generation (retweeted by Karpathy)
- MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, arXiv, 2312.16886, arxiv, pdf, citations: -1 · Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML)
- mlc-llm - mlc-ai: enable everyone to develop, optimize, and deploy AI models natively on everyone's devices · (jiqizhixin) · (jiqizhixin)
- lorax - predibase: multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
- Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
- gigaGPT - Cerebras: a small code base for training large models · (cerebras)
- EAGLE - SafeAILab: EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google) · (jiqizhixin)
- optimum-nvidia - huggingface
- unsloth - unslothai: 5x faster, 50% less memory LLM fine-tuning
- lit-gpt - Lightning-AI: hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
- gpt-fast - pytorch-labs: simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.
- MS-AMP - Azure: Microsoft Automatic Mixed Precision Library
- DeepSpeed - microsoft: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- flash-linear-attention - sustcsonglin: fast implementations of causal linear attention for autoregressive language modeling (PyTorch)
- PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation, arXiv, 2312.17276, arxiv, pdf, citations: -1 · Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang
- Agent Attention: On the Integration of Softmax and Linear Attention, arXiv, 2312.08874, arxiv, pdf, citations: -1 · Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang · (agent-attention - leaplabthu)
- Weight subcloning: direct initialization of transformers using larger pretrained ones, arXiv, 2312.09299, arxiv, pdf, citations: -1 · Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
- Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models, arXiv, 2312.07046, arxiv, pdf, citations: -1 · Arnav Chavan, Nahush Lele, Deepak Gupta
- Efficient Monotonic Multihead Attention, arXiv, 2312.04515, arxiv, pdf, citations: -1 · Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv, 2312.00752, arxiv, pdf, citations: -1 · Albert Gu, Tri Dao · (qbitai)
- Simplifying Transformer Blocks, arXiv, 2311.01906, arxiv, pdf, citations: -1 · Bobby He, Thomas Hofmann · (jiqizhixin)
- Exponentially Faster Language Modelling, arXiv, 2311.10770, arxiv, pdf, citations: -1 · Peter Belcak, Roger Wattenhofer
- Alternating Updates for Efficient Transformers, arXiv, 2301.13310, arxiv, pdf, citations: -1 · Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022, arxiv, pdf, citations: 278 · Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- Fast Transformer Decoding: One Write-Head is All You Need, arXiv, 1911.02150, arxiv, pdf, citations: 61 · Noam Shazeer · (zhuanlan.zhihu)
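Shazeer's multi-query attention keeps many query heads but a single shared key/value head, cutting KV-cache size (and decode-time memory bandwidth) by the head count; grouped-query attention later interpolated between this and full multi-head. A minimal sketch (shapes are illustrative; the causal mask is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads: int):
    """MQA: n_heads query heads attend against ONE shared K/V head.

    x: (batch, seq, d_model); w_q: (d_model, d_model);
    w_k, w_v: (d_model, d_head) -- a single head's worth of K/V.
    """
    b, s, d = x.shape
    d_head = d // n_heads
    q = (x @ w_q).view(b, s, n_heads, d_head).transpose(1, 2)  # (b, h, s, d_head)
    k = (x @ w_k).unsqueeze(1)                                 # (b, 1, s, d_head), shared
    v = (x @ w_v).unsqueeze(1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5           # broadcasts over heads
    out = F.softmax(scores, dim=-1) @ v                        # (b, h, s, d_head)
    return out.transpose(1, 2).reshape(b, s, d)

out = multi_query_attention(torch.randn(2, 10, 512),
                            torch.randn(512, 512),
                            torch.randn(512, 64), torch.randn(512, 64),
                            n_heads=8)  # KV cache is 1/8 the multi-head size
```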
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, arXiv, 2401.03868, arxiv, pdf, citations: -1 · Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang · (jiqizhixin)