Awesome Efficient LLM

Survey

  • Efficient Exploration for LLMs, arXiv, 2402.00396, arxiv, pdf, cication: -1

    Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

  • A Comprehensive Survey of Compression Algorithms for Language Models, arXiv, 2401.15347, arxiv, pdf, cication: -1

    Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, cication: -1

    Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang

    · (efficient_foundation_model_survey - ubiquitouslearning) Star

  • Understanding LLMs: A Comprehensive Overview from Training to Inference, arXiv, 2401.02038, arxiv, pdf, cication: -1

    Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong

  • Efficient Large Language Models: A Survey, arXiv, 2312.03863, arxiv, pdf, cication: -1

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab) Star

  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, arXiv, 2312.15234, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)

  • The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, arXiv, 2312.00678, arxiv, pdf, cication: -1

    Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang

  • Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, arXiv, 2303.15647, arxiv, pdf, cication: -1

    Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky

Efficient finetuning

  • Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, arXiv, 2401.00788, arxiv, pdf, cication: -1

    Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff

  • Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion, arXiv, 2312.14327, arxiv, pdf, cication: -1

    Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan

  • Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)

    · (jiqizhixin)

  • MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, arXiv, 2311.11501, arxiv, pdf, cication: -1

    Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang

  • Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying, arXiv, 2311.09578, arxiv, pdf, cication: -1

    Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev

  • SiRA: Sparse Mixture of Low Rank Adaptation, arXiv, 2311.09179, arxiv, pdf, cication: -1

    Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen

  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, cication: -1

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)

  • Punica: Multi-Tenant LoRA Serving, arXiv, 2310.18547, arxiv, pdf, cication: -1

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai) Star

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters, arXiv, 2311.03285, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora) Star

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, arXiv, 2310.18356, arxiv, pdf, cication: -1

    Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang

  • VeRA: Vector-based Random Matrix Adaptation, arXiv, 2310.11454, arxiv, pdf, cication: -1

    Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, arXiv, 2310.08659, arxiv, pdf, cication: 1

    Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

    · (peft - huggingface) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993) Star

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, arXiv, 2309.12307, arxiv, pdf, cication: 5

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research) Star

  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, arXiv, 2307.13269, arxiv, pdf, cication: 6

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin

  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, cication: 2

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet) Star

  • LLaMA-Efficient-Tuning - hiyouga Star

    Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)

  • InRank: Incremental Low-Rank Learning, arXiv, 2306.11250, arxiv, pdf, cication: 2

    Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao) Star

  • One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning, arXiv, 2306.07967, arxiv, pdf, cication: -1

    Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400) Star

  • Full Parameter Fine-tuning for Large Language Models with Limited Resources, arXiv, 2306.09782, arxiv, pdf, cication: -1

    Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab) Star

  • PockEngine: Sparse and Efficient Fine-tuning in a Pocket, arXiv, 2310.17752, arxiv, pdf, cication: -1

    Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

  • AI and Memory Wall | by Amir Gholami | riselab | Medium
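
Most of the fine-tuning entries above build on LoRA: the base weights stay frozen and only small low-rank adapter matrices are trained. Below is a minimal sketch of attaching LoRA adapters with Hugging Face peft; the checkpoint name, rank, and target modules are illustrative assumptions, not a recommended recipe.

```python
# Minimal LoRA sketch with Hugging Face peft.
# Checkpoint, rank and target_modules are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base parameters
# Train with a normal Trainer / training loop; only the adapter weights update.
```

QLoRA-style setups additionally quantize the frozen base model to 4-bit before attaching the adapters, while entries above such as VeRA and Tied-LoRA shrink the adapters further by sharing or freezing parts of them.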

Adapter

  • Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning, arXiv, 2311.11077, arxiv, pdf, cication: -1

    Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)

  • Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arXiv, 2305.15023, arxiv, pdf, cication: 18

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996) Star · (mp.weixin.qq)

  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arXiv, 2304.15010, arxiv, pdf, cication: 82

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)

Quantization

Survey

  • A Performance Evaluation of a Quantized Large Language Model on Various Smartphones, arXiv, 2312.12472, arxiv, pdf, cication: -1

    Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb

  • A Survey on Model Compression for Large Language Models, arXiv, 2308.07633, arxiv, pdf, cication: -1

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)

Papers

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, arXiv, 2401.14112, arxiv, pdf, cication: -1

    Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou

  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks, arXiv, 2312.08583, arxiv, pdf, cication: -1

    Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song

  • LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, arXiv, 2311.12023, arxiv, pdf, cication: -1

    Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97) Star

  • QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models, arXiv, 2310.09259, arxiv, pdf, cication: -1

    Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab) Star

  • FP8-LM: Training FP8 Large Language Models, arXiv, 2310.18313, arxiv, pdf, cication: -1

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv, 2310.16836, arxiv, pdf, cication: -1

    Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng

  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv, 2310.16795, arxiv, pdf, cication: -1

    Elias Frantar, Dan Alistarh · (mp.weixin.qq)

  • BitNet: Scaling 1-bit Transformers for Large Language Models, arXiv, 2310.11453, arxiv, pdf, cication: -1

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, arXiv, 2310.19102, arxiv, pdf, cication: -1

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab) Star

  • TEQ: Trainable Equivalent Transformation for Quantization of LLMs, arXiv, 2310.10944, arxiv, pdf, cication: 1

    Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen

  • Efficient Post-training Quantization with FP8 Formats, arXiv, 2309.14592, arxiv, pdf, cication: -1

    Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel) Star

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, cication: -1

    Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian

  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv, 2309.05516, arxiv, pdf, cication: -1

    Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv

  • Memory Efficient Optimizers with 4-bit States, arXiv, 2309.01507, arxiv, pdf, cication: 1

    Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, arXiv, 2308.13137, arxiv, pdf, cication: 2

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab) Star

  • FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search, arXiv, 2308.03290, arxiv, pdf, cication: -1

    Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees, arXiv, 2307.13304, arxiv, pdf, cication: -1

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee) Star

  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, arXiv, 2306.12929, arxiv, pdf, cication: -1

    Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

  • Training Transformers with 4-bit Integers, arXiv, 2306.11987, arxiv, pdf, cication: -1

    Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv, 2306.03078, arxiv, pdf, cication: -1

    Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv, 2306.00978, arxiv, pdf, cication: -1

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, arXiv, 2210.17323, arxiv, pdf, cication: -1

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab) Star

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, arXiv, 2208.07339, arxiv, pdf, cication: -1

    · (bitsandbytes - timdettmers) Star
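
The post-training quantization papers above (GPTQ, AWQ, SpQR, LLM.int8()) all aim at loading and running models with low-bit weights. A minimal sketch of 4-bit NF4 loading through transformers and bitsandbytes follows; the checkpoint name is an assumption, and GPTQ/AWQ use their own calibration, packing, and kernels rather than this load-time path.

```python
# Minimal 4-bit (NF4) loading sketch via transformers + bitsandbytes.
# Checkpoint name is an assumption; this is plain load-time weight quantization,
# not the GPTQ/AWQ calibration pipelines listed above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```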

Projects

  • exllamav2 - turboderp Star

    A fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)

  • PB-LLM - hahnyuan Star

    PB-LLM: Partially Binarized Large Language Models

  • AttentionIsOFFByOne - kyegomez Star

    Implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (finbarr)

  • llama2-webui - liltom-eth Star

    Run Llama 2 locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama-2-7B/13B/70B with 8-bit and 4-bit quantization, GPU inference (6 GB VRAM), and CPU inference.

  • neural-compressor - intel Star

    Provides unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation, on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime. · (mp.weixin.qq)

  • exllama - turboderp Star

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  • squeezellm - squeezeailab Star

    SqueezeLLM: Dense-and-Sparse Quantization
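
Under the hood, the projects above map full-precision weights onto a low-bit grid; methods differ mainly in how scales are chosen and how outliers are handled. A minimal round-to-nearest absmax int8 sketch in plain PyTorch, purely for illustration:

```python
# Round-to-nearest absmax int8 weight quantization with one scale per output row.
# Purely illustrative: real toolkits add calibration data, group-wise scales,
# outlier handling and packed low-bit kernels.
import torch

def quantize_int8(w: torch.Tensor):
    # w: [out_features, in_features]
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs reconstruction error: {err:.5f}")
```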

Other

Distillation

  • Scavenging Hyena: Distilling Transformers into Long Convolution Models, arXiv, 2401.17574, arxiv, pdf, cication: -1

    Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang

  • Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning - ACL Anthology

    · (twitter)

  • Initializing Models with Larger Ones, arXiv, 2311.18823, arxiv, pdf, cication: -1

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq) Star

  • Tailoring Self-Rationalizers with Multi-Reward Distillation, arXiv, 2311.02805, arxiv, pdf, cication: -1

    Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

  • Co-training and Co-distillation for Quality Improvement and Compression of Language Models, arXiv, 2311.02849, arxiv, pdf, cication: -1

    Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min

  • TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise, arXiv, 2310.19019, arxiv, pdf, cication: -1

    Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao

  • Farzi Data: Autoregressive Data Distillation, arXiv, 2310.09983, arxiv, pdf, cication: -1

    Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, arXiv, 2305.02301, arxiv, pdf, cication: 48

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

  • Composable Function-preserving Expansions for Transformer Architectures, arXiv, 2308.06103, arxiv, pdf, cication: 1

    Andrea Gesmundo, Kaitlin Maile

  • UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, arXiv, 2308.03279, arxiv, pdf, cication: 2

    Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon

  • Generalized Knowledge Distillation for Auto-regressive Language Models, arXiv, 2306.13649, arxiv, pdf, cication: -1

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

  • Knowledge Distillation of Large Language Models, arXiv, 2306.08543, arxiv, pdf, cication: -1

    Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
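
A shared ingredient across these distillation papers is matching the student's next-token distribution to the teacher's, usually via a temperature-scaled KL term. A minimal forward-KL sketch in PyTorch follows; note that the last entry above (Gu et al.) argues for reverse KL instead of the forward KL shown here, and sequence-level methods distill from sampled generations rather than per-token logits.

```python
# Temperature-scaled forward-KL distillation loss over next-token logits.
# A minimal sketch; the papers above differ in divergence, data and schedule.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # logits: [batch, seq_len, vocab_size]
    t = temperature
    s = (student_logits / t).reshape(-1, student_logits.size(-1))
    p = F.softmax((teacher_logits / t).reshape(-1, teacher_logits.size(-1)), dim=-1)
    # KL(teacher || student), averaged per token; t^2 keeps gradient scale stable
    return F.kl_div(F.log_softmax(s, dim=-1), p, reduction="batchmean") * (t * t)

student_logits = torch.randn(2, 8, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32000)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```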

Pruning

  • SliceGPT: Compress Large Language Models by Deleting Rows and Columns, arXiv, 2401.15024, arxiv, pdf, cication: -1

    Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

  • Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic

  • The LLM Surgeon, arXiv, 2312.17244, arxiv, pdf, cication: -1

    Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning, arXiv, 2312.12682, arxiv, pdf, cication: -1

    Tim Valicenti, Justice Vidal, Ritik Patnaik

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, arXiv, 2310.06694, arxiv, pdf, cication: 2

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp) Star

  • wanda - locuslab Star

    A simple and effective LLM pruning approach.
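
Wanda and the other pruning entries above start from the same primitive: score weights and zero out the lowest-scoring ones. A minimal unstructured magnitude-pruning sketch in PyTorch follows; Wanda itself weights the score by input activation norms, and structured methods such as Sheared LLaMA or SliceGPT remove whole rows, columns, or layers instead of individual weights.

```python
# Unstructured magnitude pruning: zero the smallest-|w| entries of every Linear
# layer at a target sparsity. Illustrative only; Wanda scores by |w| * activation
# norm, and structured pruning removes whole rows/columns for real speedups.
import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_(w.abs() > threshold)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
magnitude_prune(model, sparsity=0.5)
zeros = sum((m.weight == 0).sum().item()
            for m in model.modules() if isinstance(m, nn.Linear))
print("zeroed weights:", zeros)
```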

Efficient Inference

  • EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2401.15077, arxiv, pdf, cication: -1

    Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

  • BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, arXiv, 2401.12522, arxiv, pdf, cication: -1

    Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2401.10774, arxiv, pdf, cication: -1

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

    · (medusa - fasterdecoding) Star

  • DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference, arXiv, 2401.08671, arxiv, pdf, cication: -1

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko

  • Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, arXiv, 2401.08294, arxiv, pdf, cication: -1

    Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow) Star

  • PainlessInferenceAcceleration - alipay Star

  • Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, 2401.07851, arxiv, pdf, cication: -1

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

  • Efficient LLM inference solution on Intel GPU, arXiv, 2401.05391, arxiv, pdf, cication: -1

    Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu

  • SwiftInfer - hpcaitech Star

    Efficient AI Inference & Serving · (qbitai)

  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, arXiv, 2401.02669, arxiv, pdf, cication: -1

    Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li

  • nitro - janhq Star

    A fast, lightweight, embeddable inference engine to supercharge your apps with local AI. OpenAI-compatible API

  • jan - janhq Star

    Jan is an open source alternative to ChatGPT that runs 100% offline on your computer

  • Fairness in Serving Large Language Models, arXiv, 2401.00588, arxiv, pdf, cication: -1

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora) Star

  • tricksy - austinsilveria Star

    Fast approximate inference on a single GPU with sparsity aware offloading

  • mixtral-offloading - dvmazur Star

    Run Mixtral-8x7B models in Colab or consumer desktops

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, 2312.11514, arxiv, pdf, cication: -1

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar

  • Efficiently Programming Large Language Models using SGLang, arXiv, 2312.07104, arxiv, pdf, cication: -1

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project) Star · (lmsys)

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv, 2312.12456, arxiv, pdf, cication: -1

    Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS) Star

  • Cascade Speculative Drafting for Even Faster LLM Inference, arXiv, 2312.11462, arxiv, pdf, cication: -1

    Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang

  • LLMLingua - microsoft Star

    Compresses prompts and the KV-cache to speed up LLM inference and help the model focus on key information, achieving up to 20x compression with minimal performance loss.

  • SparQ Attention: Bandwidth-Efficient LLM Inference, arXiv, 2312.04985, arxiv, pdf, cication: -1

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

  • Yao Fu's Notion post on full-stack transformer inference optimization

    · (yaofu.notion)

  • Optimum-NVIDIA: Unlocking blazingly fast LLM inference in just 1 line of code

  • PaSS: Parallel Speculative Sampling, arXiv, 2311.13581, arxiv, pdf, cication: -1

    Giovanni Monea, Armand Joulin, Edouard Grave

  • Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org

    · (LookaheadDecoding - hao-ai-lab) Star

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, arXiv, 2311.03687, arxiv, pdf, cication: -1

    Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi

    · (jiqizhixin)

  • FlashDecoding++: Faster Large Language Model Inference on GPUs, arXiv, 2311.01282, arxiv, pdf, cication: -1

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML, 2023, arxiv, pdf, cication: 16

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re

  • TensorRT-LLM - NVIDIA Star

    TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs)

  • Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv, 2310.10837, arxiv, pdf, cication: -1

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

  • deepsparse - neuralmagic Star

    Inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application · (huggingface)

  • attention_sinks - tomaarsen Star

    Extend existing LLMs way beyond the original training length with constant memory usage, and without retraining

  • Efficient Streaming Language Models with Attention Sinks, arXiv, 2309.17453, arxiv, pdf, cication: 3

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

    · (streaming-llm - mit-han-lab) Star

    · (mp.weixin.qq)

  • Efficient Memory Management for Large Language Model Serving with PagedAttention, Proceedings of the 29th Symposium on Operating Systems Principles, 2023, arxiv, pdf, cication: 21

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)

  • llama2.mojo - tairov Star

    Inference Llama 2 in one file of pure 🔥 · (qbitai)

  • fastllm - ztxz16 Star

    A pure C++ LLM acceleration library for all platforms, with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.

  • flexflow - flexflow Star

    A distributed deep learning framework.

  • Accelerating LLM Inference with Staged Speculative Decoding, arXiv, 2308.04623, arxiv, pdf, cication: 3

    Benjamin Spector, Chris Re

  • CTranslate2 - OpenNMT Star

    Fast inference engine for Transformer models

  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, arXiv, 2307.15337, arxiv, pdf, cication: 4

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, arXiv, 2307.02628, arxiv, pdf, cication: -1

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee

  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs, arXiv, 2306.16601, arxiv, pdf, cication: -1

    Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang

  • NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes, arXiv, 2306.16869, arxiv, pdf, cication: -1

    Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho

  • H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv, 2306.14048, arxiv, pdf, cication: -1

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference) Star

    · (mp.weixin.qq)

  • DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication - Microsoft Research

    · (zhuanlan.zhihu)

  • vllm - vllm-project Star

    A high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)

  • SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification, arXiv, 2305.09781, arxiv, pdf, cication: -1

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow) Star · (mp.weixin.qq)

  • llama.cpp - ggerganov Star

    Port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov) Star
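
Many entries above (speculative decoding, Medusa, EAGLE, lookahead decoding, SpecInfer) share the same draft-then-verify loop: a cheap drafter proposes several tokens and the target model checks them in a single forward pass. A minimal greedy variant with Hugging Face transformers follows; the model pair is an assumption, and the actual papers use learned draft heads and probabilistic acceptance rules rather than exact greedy matching.

```python
# Greedy speculative decoding sketch: a small drafter proposes k tokens and the
# target keeps the longest prefix it agrees with, plus one "free" token of its own.
# Model names are assumptions; real methods use learned drafters and
# probabilistic acceptance, not exact greedy matching.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def speculative_greedy(prompt: str, max_new_tokens: int = 48, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) drafter proposes up to k tokens greedily
        proposal = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                  pad_token_id=tok.eos_token_id)[:, ids.shape[1]:]
        # 2) target scores prompt + proposal in one forward pass
        logits = target(torch.cat([ids, proposal], dim=1)).logits
        target_greedy = logits[:, ids.shape[1] - 1:-1, :].argmax(-1)
        # 3) accept the longest prefix where the target agrees with the drafter
        agree = (target_greedy[0] == proposal[0]).tolist()
        n = 0
        while n < len(agree) and agree[n]:
            n += 1
        # 4) append the accepted tokens plus one token from the target itself
        bonus = logits[:, ids.shape[1] - 1 + n, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, proposal[:, :n], bonus], dim=1)
    return tok.decode(ids[0, prompt_len:])

print(speculative_greedy("Efficient LLM inference is"))
```

Each iteration costs one drafter pass plus one target pass but can emit up to k + 1 tokens, which is where the wall-clock speedup comes from whenever the drafter agrees with the target often.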

Other

Mobile

  • MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, arXiv, 2312.16886, arxiv, pdf, cication: -1

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML) Star

  • mlc-llm - mlc-ai Star

    Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. · (jiqizhixin) · (jiqizhixin)

Toolkits

  • vllm - vllm-project Star

  • lorax - predibase Star

    Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

  • Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day

  • gigaGPT - Cerebras Star

    a small code base for training large models · (cerebras)

  • EAGLE - SafeAILab Star

    EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google)

    · (jiqizhixin)

  • optimum-nvidia - huggingface Star

  • unsloth - unslothai Star

    5X faster, 50% less memory LLM finetuning

  • lit-gpt - Lightning-AI Star

    Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

  • gpt-fast - pytorch-labs Star

    Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

  • MS-AMP - Azure Star

    Microsoft Automatic Mixed Precision Library

  • DeepSpeed - microsoft Star

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
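
Several of the toolkits above are drop-in engines rather than papers. A minimal offline batched-generation sketch with vLLM's Python API follows; the checkpoint name is an assumption, and the same engine also backs vLLM's OpenAI-compatible server.

```python
# Minimal offline batched generation with vLLM (PagedAttention-backed engine).
# The checkpoint name is an assumption; swap in any supported model.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(
    ["Summarize PagedAttention in one sentence.",
     "List two ways to reduce LLM serving memory."],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```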

Efficient transformer

Hardware

  • FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, arXiv, 2401.03868, arxiv, pdf, cication: -1

    Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang

    · (jiqizhixin)

Other

Courses

EfficientML