- Efficient Exploration for LLMs, arXiv, 2402.00396, arxiv, pdf, citations: -1 · Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
- A Comprehensive Survey of Compression Algorithms for Language Models, arXiv, 2401.15347, arxiv, pdf, citations: -1 · Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang
- A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv, 2401.08092, arxiv, pdf, citations: -1 · Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang · (efficient_foundation_model_survey - ubiquitouslearning)
- Understanding LLMs: A Comprehensive Overview from Training to Inference, arXiv, 2401.02038, arxiv, pdf, citations: -1 · Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong
- Efficient Large Language Models: A Survey, arXiv, 2312.03863, arxiv, pdf, citations: -1 · Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury · (Efficient-LLMs-Survey - AIoT-MLSys-Lab)
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, arXiv, 2312.15234, arxiv, pdf, citations: -1 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · (mp.weixin.qq)
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, arXiv, 2312.00678, arxiv, pdf, citations: -1 · Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning, arXiv, 2303.15647, arxiv, pdf, citations: -1 · Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky
- Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models, arXiv, 2401.00788, arxiv, pdf, citations: -1 · Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, Niklas Muennighoff
- Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion, arXiv, 2312.14327, arxiv, pdf, citations: -1 · Katrin Tomanek, Shanqing Cai, Subhashini Venugopalan
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) · (jiqizhixin)
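Background for the LoRA entries that follow: LoRA freezes the pretrained weight W and trains a low-rank update ΔW = BA, so a linear layer computes Wx + (α/r)·BAx. A minimal PyTorch sketch of the idea (the class name and the rank/alpha defaults are illustrative, not taken from any linked post):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # only the adapter is trained
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
y = layer(torch.randn(2, 512))  # trains 2*8*512 adapter params instead of 512*512
```

Merging the adapter back is just W ← W + (α/r)·BA, which is why a merged LoRA adds no inference latency.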
- MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, arXiv, 2311.11501, arxiv, pdf, citations: -1 · Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying, arXiv, 2311.09578, arxiv, pdf, citations: -1 · Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev
- SiRA: Sparse Mixture of Low Rank Adaptation, arXiv, 2311.09179, arxiv, pdf, citations: -1 · Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, citations: -1 · Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng · (boft.wyliu)
- Punica: Multi-Tenant LoRA Serving, arXiv, 2310.18547, arxiv, pdf, citations: -1 · Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · (punica - punica-ai)
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters, arXiv, 2311.03285, arxiv, pdf, citations: -1 · Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer · (s-lora - s-lora)
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery, arXiv, 2310.18356, arxiv, pdf, citations: -1 · Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang
- VeRA: Vector-based Random Matrix Adaptation, arXiv, 2310.11454, arxiv, pdf, citations: -1 · Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano · (mp.weixin.qq)
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, arXiv, 2310.08659, arxiv, pdf, citations: 1 · Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao · (peft - huggingface)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, citations: -1 · Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian · (qa-lora - yuhuixu1993)
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, arXiv, 2309.12307, arxiv, pdf, citations: 5 · Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia · (LongLoRA - dvlab-research)
- LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition, arXiv, 2307.13269, arxiv, pdf, citations: 6 · Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin
- Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, citations: 2 · Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky · (peft_pretraining - guitaricet)
- LLaMA-Efficient-Tuning - hiyouga: fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA)
- InRank: Incremental Low-Rank Learning, arXiv, 2306.11250, arxiv, pdf, citations: 2 · Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar · (inrank - jiaweizzhao)
- One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning, arXiv, 2306.07967, arxiv, pdf, citations: -1 · Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen · (ViT-Slim - Arnav0400)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources, arXiv, 2306.09782, arxiv, pdf, citations: -1 · Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, Xipeng Qiu · (LOMO - OpenLMLab)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket, arXiv, 2310.17752, arxiv, pdf, citations: -1 · Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
- AI and Memory Wall, by Amir Gholami (riselab, Medium)
- Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning, arXiv, 2311.11077, arxiv, pdf, citations: -1 · Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer · (adapterhub)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, arXiv, 2305.15023, arxiv, pdf, citations: 18 · Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji · (LaVIN - luogen1996) · (mp.weixin.qq)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, arXiv, 2304.15010, arxiv, pdf, citations: 82 · Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue · (mp.weixin.qq)
- A Performance Evaluation of a Quantized Large Language Model on Various Smartphones, arXiv, 2312.12472, arxiv, pdf, citations: -1 · Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J. Bouw, Stephen Cobb
- A Survey on Model Compression for Large Language Models, arXiv, 2308.07633, arxiv, pdf, citations: -1 · Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang · (jiqizhixin)
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design, arXiv, 2401.14112, arxiv, pdf, citations: -1 · Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks, arXiv, 2312.08583, arxiv, pdf, citations: -1 · Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, arXiv, 2311.12023, arxiv, pdf, citations: -1 · Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim · (lq-lora - hanguo97)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models, arXiv, 2310.09259, arxiv, pdf, citations: -1 · Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · (quik - ist-daslab)
- FP8-LM: Training FP8 Large Language Models, arXiv, 2310.18313, arxiv, pdf, citations: -1 · Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers, arXiv, 2310.16836, arxiv, pdf, citations: -1 · Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models, arXiv, 2310.16795, arxiv, pdf, citations: -1 · Elias Frantar, Dan Alistarh · (mp.weixin.qq)
- BitNet: Scaling 1-bit Transformers for Large Language Models, arXiv, 2310.11453, arxiv, pdf, citations: -1 · Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, arXiv, 2310.19102, arxiv, pdf, citations: -1 · Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · (atom - efeslab)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs, arXiv, 2310.10944, arxiv, pdf, citations: 1 · Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
- Efficient Post-training Quantization with FP8 Formats, arXiv, 2309.14592, arxiv, pdf, citations: -1 · Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang · (neural-compressor - intel)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models, arXiv, 2309.14717, arxiv, pdf, citations: -1 · Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian
- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs, arXiv, 2309.05516, arxiv, pdf, citations: -1 · Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv
- Memory Efficient Optimizers with 4-bit States, arXiv, 2309.01507, arxiv, pdf, citations: 1 · Bingrui Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models, arXiv, 2308.13137, arxiv, pdf, citations: 2 · Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo · (OmniQuant - OpenGVLab)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search, arXiv, 2308.03290, arxiv, pdf, citations: -1 · Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees, arXiv, 2307.13304, arxiv, pdf, citations: -1 · Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · (quip - jerry-chee)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, arXiv, 2306.12929, arxiv, pdf, citations: -1 · Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
- Training Transformers with 4-bit Integers, arXiv, 2306.11987, arxiv, pdf, citations: -1 · Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu · (jiqizhixin)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv, 2306.03078, arxiv, pdf, citations: -1 · Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh · (jiqizhixin)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv, 2306.00978, arxiv, pdf, citations: -1 · Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, arXiv, 2210.17323, arxiv, pdf, citations: -1 · Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh · (gptq - IST-DASLab)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, arXiv, 2208.07339 · (bitsandbytes - timdettmers)
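LLM.int8() and several methods above start from row-wise absmax quantization: each weight row is scaled into int8 range by its largest absolute value. A minimal round-trip sketch (illustrative only; it omits LLM.int8()'s mixed-precision decomposition for outlier features):

```python
import torch

def quantize_rowwise_int8(w: torch.Tensor):
    """Per-row absmax quantization: int8 weights plus one float scale per row."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_rowwise_int8(w)
max_err = (dequantize_int8(q, scale) - w).abs().max()  # small unless a row has outliers
```

Outlier channels are what break this naive scheme at scale; handling them is precisely what LLM.int8(), AWQ, and SpQR above tackle in different ways.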
- exllamav2 - turboderp: a fast inference library for running LLMs locally on modern consumer-class GPUs · (mp.weixin.qq)
- PB-LLM - hahnyuan: PB-LLM: Partially Binarized Large Language Models
- AttentionIsOFFByOne - kyegomez: implementation of "Attention Is Off By One" by Evan Miller · (evanmiller) · (jiqizhixin)
- llama.cpp - ggerganov: port of Facebook's LLaMA model in C/C++ · (finbarr)
- llama2-webui - liltom-eth: run Llama 2 locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama-2-7B/13B/70B with 8-bit and 4-bit quantization, GPU inference (6 GB VRAM), and CPU inference.
- neural-compressor - intel: unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation, on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime · (neural-compressor - intel) · (mp.weixin.qq)
- exllama - turboderp: a more memory-efficient rewrite of the HF Transformers implementation of Llama for use with quantized weights.
- squeezellm - squeezeailab: SqueezeLLM: Dense-and-Sparse Quantization
- Overview of natively supported quantization schemes in 🤗 Transformers
- Making LLMs lighter with AutoGPTQ and transformers
- TheBloke (Tom Jobbins)
- Quantization
- Scavenging Hyena: Distilling Transformers into Long Convolution Models, arXiv, 2401.17574, arxiv, pdf, citations: -1 · Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang
- Initializing Models with Larger Ones, arXiv, 2311.18823, arxiv, pdf, citations: -1 · Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu · (weight-selection - oscarxzq)
- Tailoring Self-Rationalizers with Multi-Reward Distillation, arXiv, 2311.02805, arxiv, pdf, citations: -1 · Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models, arXiv, 2311.02849, arxiv, pdf, citations: -1 · Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min
- TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise, arXiv, 2310.19019, arxiv, pdf, citations: -1 · Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao
- Farzi Data: Autoregressive Data Distillation, arXiv, 2310.09983, arxiv, pdf, citations: -1 · Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, arXiv, 2305.02301, arxiv, pdf, citations: 48 · Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
- Composable Function-preserving Expansions for Transformer Architectures, arXiv, 2308.06103, arxiv, pdf, citations: 1 · Andrea Gesmundo, Kaitlin Maile
- UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, arXiv, 2308.03279, arxiv, pdf, citations: 2 · Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon
- Generalized Knowledge Distillation for Auto-regressive Language Models, arXiv, 2306.13649, arxiv, pdf, citations: -1 · Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
- Knowledge Distillation of Large Language Models, arXiv, 2306.08543, arxiv, pdf, citations: -1 · Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
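The common baseline behind these distillation papers is a token-level, temperature-scaled KL divergence between teacher and student next-token distributions; the papers above then argue over which divergence direction and which data to use. A minimal sketch of that baseline loss (forward KL; the temperature and logit shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0):
    """Token-level distillation loss: KL(teacher || student) at temperature T.

    Logits: (batch, seq, vocab). The T^2 factor keeps gradient magnitudes
    comparable across temperatures (Hinton et al., 2015).
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

loss = kd_loss(torch.randn(2, 16, 32000), torch.randn(2, 16, 32000))
```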
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns, arXiv, 2401.15024, arxiv, pdf, citations: -1 · Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
- Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse - Neural Magic
- The LLM Surgeon, arXiv, 2312.17244, arxiv, pdf, citations: -1 · Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort
- Mini-GPTs: Efficient Large Language Models through Contextual Pruning, arXiv, 2312.12682, arxiv, pdf, citations: -1 · Tim Valicenti, Justice Vidal, Ritik Patnaik
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, arXiv, 2310.06694, arxiv, pdf, citations: 2 · Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · (qbitai) · (xiamengzhou.github) · (llm-shearing - princeton-nlp)
- wanda - locuslab: a simple and effective LLM pruning approach.
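Wanda's criterion is compact enough to state in a few lines: score each weight by |W_ij| times the L2 norm of input channel j measured on calibration data, then zero the lowest-scoring weights in each output row. A minimal sketch (the calibration tensor and the 50% unstructured sparsity are illustrative):

```python
import torch

def wanda_prune(w: torch.Tensor, x_calib: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero the lowest-importance weights per output row.

    w: (out_features, in_features); x_calib: (n_samples, in_features).
    Importance = |W| * per-input-channel activation norm, as in the Wanda paper.
    """
    importance = w.abs() * x_calib.norm(p=2, dim=0)      # broadcasts over rows
    n_prune = int(w.shape[1] * sparsity)
    prune_idx = importance.argsort(dim=1)[:, :n_prune]   # lowest scores per row
    mask = torch.ones_like(w)
    mask.scatter_(1, prune_idx, 0.0)
    return w * mask

pruned = wanda_prune(torch.randn(8, 32), torch.randn(128, 32))
```

Unlike magnitude pruning, the activation norm term lets small weights on high-traffic input channels survive, which is the paper's key observation.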
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, 2401.15077, arxiv, pdf, citations: -1 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models, arXiv, 2401.12522, arxiv, pdf, citations: -1 · Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, 2401.10774, arxiv, pdf, citations: -1 · Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao · (medusa - fasterdecoding)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference, arXiv, 2401.08671, arxiv, pdf, citations: -1 · Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, arXiv, 2401.08294, arxiv, pdf, citations: -1 · Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li · (inferflow - inferflow)
- PainlessInferenceAcceleration - alipay
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, 2401.07851, arxiv, pdf, citations: -1 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
- Efficient LLM inference solution on Intel GPU, arXiv, 2401.05391, arxiv, pdf, citations: -1 · Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu
- SwiftInfer - hpcaitech: efficient AI inference & serving · (qbitai)
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache, arXiv, 2401.02669, arxiv, pdf, citations: -1 · Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li
- nitro - janhq: a fast, lightweight, embeddable inference engine to supercharge your apps with local AI; OpenAI-compatible API
- jan - janhq: Jan is an open-source alternative to ChatGPT that runs 100% offline on your computer
- Fairness in Serving Large Language Models, arXiv, 2401.00588, arxiv, pdf, citations: -1 · Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica · (s-lora - s-lora)
- tricksy - austinsilveria: fast approximate inference on a single GPU with sparsity-aware offloading
- mixtral-offloading - dvmazur: run Mixtral-8x7B models in Colab or on consumer desktops
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, 2312.11514, arxiv, pdf, citations: -1 · Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
- Efficiently Programming Large Language Models using SGLang, arXiv, 2312.07104, arxiv, pdf, citations: -1 · Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez · (sglang - sgl-project) · (lmsys)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv, 2312.12456, arxiv, pdf, citations: -1 · Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · (PowerInfer - SJTU-IPADS)
- Cascade Speculative Drafting for Even Faster LLM Inference, arXiv, 2312.11462, arxiv, pdf, citations: -1 · Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
- LLMLingua - microsoft: compresses prompts and the KV cache to speed up LLM inference and help the model focus on key information, achieving up to 20x compression with minimal performance loss.
- vllm - vllm-project: a high-throughput and memory-efficient inference and serving engine for LLMs · (mp.weixin.qq) · (jiqizhixin)
- SparQ Attention: Bandwidth-Efficient LLM Inference, arXiv, 2312.04985, arxiv, pdf, citations: -1 · Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
- LLM inference notes by Yao Fu (Notion) · (yaofu.notion)
- Optimum-NVIDIA: unlocking blazingly fast LLM inference in just 1 line of code
- PaSS: Parallel Speculative Sampling, arXiv, 2311.13581, arxiv, pdf, citations: -1 · Giovanni Monea, Armand Joulin, Edouard Grave
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | LMSYS Org · (LookaheadDecoding - hao-ai-lab)
- Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, arXiv, 2311.03687, arxiv, pdf, citations: -1 · Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi · (jiqizhixin)
- FlashDecoding++: Faster Large Language Model Inference on GPUs, arXiv, 2311.01282, arxiv, pdf, citations: -1 · Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML, 2023, arxiv, pdf, citations: 16 · Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re
- TensorRT-LLM - NVIDIA: provides users with an easy-to-use Python API to define Large Language Models (LLMs)
- Approximating Two-Layer Feedforward Networks for Efficient Transformers, arXiv, 2310.10837, arxiv, pdf, citations: -1 · Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
- deepsparse - neuralmagic: inference runtime offering GPU-class performance on CPUs, with APIs to integrate ML into your application · (huggingface)
- attention_sinks - tomaarsen: extend existing LLMs well beyond their original training length, with constant memory usage and without retraining
- Efficient Streaming Language Models with Attention Sinks, arXiv, 2309.17453, arxiv, pdf, citations: 3 · Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis · (streaming-llm - mit-han-lab) · (mp.weixin.qq)
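The attention-sink observation yields a constant-memory cache policy: keep the KV entries of the first few tokens (the "sinks") plus a sliding window of recent tokens, and evict everything in between. A toy sketch of that eviction rule (shapes and the sink/window sizes are illustrative; real implementations such as streaming-llm also re-index positions inside the shrunken cache):

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             n_sink: int = 4, window: int = 1020):
    """Keep the first n_sink tokens plus the most recent `window` tokens.

    keys/values: (seq_len, n_heads, head_dim). Cache size stays bounded by
    n_sink + window no matter how long generation runs.
    """
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([torch.arange(n_sink),
                      torch.arange(seq_len - window, seq_len)])
    return keys[keep], values[keep]

k = torch.randn(5000, 32, 128)
v = torch.randn(5000, 32, 128)
k, v = evict_kv(k, v)   # k.shape[0] == 1024 from here on
```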
- Efficient Memory Management for Large Language Model Serving with PagedAttention, Proceedings of the 29th Symposium on Operating Systems Principles, 2023, arxiv, pdf, citations: 21 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · (jiqizhixin)
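PagedAttention's core trick is to stop storing each sequence's KV cache contiguously: the cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to blocks, so memory is allocated on demand and shared prefixes can point at the same blocks. A toy allocator sketch (the block size and data layout are illustrative, not vLLM's actual implementation):

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block

@dataclass
class BlockAllocator:
    """Hands out physical KV blocks; sequences reference them via block tables."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

def append_token(block_table: list, seq_len: int, alloc: BlockAllocator):
    """Map the next logical token to (physical_block, offset), growing the
    table only when the current block fills up."""
    if seq_len % BLOCK_SIZE == 0:        # table empty or last block full
        block_table.append(alloc.alloc())
    return block_table[-1], seq_len % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=1024)
table = []
for pos in range(40):                    # 40 tokens occupy ceil(40/16) = 3 blocks
    block, offset = append_token(table, pos, alloc)
assert len(table) == 3                   # no contiguous 40-token region needed
```

Because unused block slots are the only waste, internal fragmentation is capped at one block per sequence instead of a whole pre-reserved max-length buffer.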
- llama2.mojo - tairov: inference Llama 2 in one file of pure 🔥 · (qbitai)
- fastllm - ztxz16: a pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile phones
- flexflow - flexflow: a distributed deep learning framework.
- Accelerating LLM Inference with Staged Speculative Decoding, arXiv, 2308.04623, arxiv, pdf, citations: 3 · Benjamin Spector, Chris Re
- CTranslate2 - OpenNMT: fast inference engine for Transformer models
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding, arXiv, 2307.15337, arxiv, pdf, citations: 4 · Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, Yu Wang
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, arXiv, 2307.02628, arxiv, pdf, citations: -1 · Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee
- An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs, arXiv, 2306.16601, arxiv, pdf, citations: -1 · Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang
- NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes, arXiv, 2306.16869, arxiv, pdf, citations: -1 · Hao-Lun Sun, Lei Hsiung, Nandhini Chandramoorthy, Pin-Yu Chen, Tsung-Yi Ho
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv, 2306.14048, arxiv, pdf, citations: -1 · Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett · (H2O - FMInference) · (mp.weixin.qq)
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification, arXiv, 2305.09781, arxiv, pdf, citations: -1 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi · (FlexFlow - flexflow) · (mp.weixin.qq)
- llama.cpp - ggerganov: port of Facebook's LLaMA model in C/C++ · (ggml) · (llama.cpp - ggerganov)
- LLM Inference Provider Leaderboard · (jiqizhixin)
- Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive · (mp.weixin.qq)
- Speculative execution for LLMs is an excellent inference-time optimization; see the sketch below.
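The idea in one paragraph: a small draft model proposes a short run of tokens autoregressively, and the large target model checks all of them with a single forward pass; whenever the two agree, you get several tokens for one expensive call. A greedy-verification sketch (the papers above use stochastic acceptance to preserve the target distribution exactly; `draft` and `target` are assumed to be callables returning logits of shape (batch, seq, vocab)):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round of greedy speculative decoding.

    The cheap draft model proposes k tokens; the expensive target model scores
    the extended sequence in one pass; we keep the longest agreed prefix plus
    one token from the target itself.
    """
    proposal = ids
    for _ in range(k):                                   # k cheap draft steps
        next_tok = draft(proposal).argmax(dim=-1)[:, -1:]
        proposal = torch.cat([proposal, next_tok], dim=-1)

    target_pred = target(proposal).argmax(dim=-1)        # one expensive pass
    n = ids.shape[1]
    accepted = 0
    for i in range(k):                                   # longest agreed prefix
        if proposal[0, n + i] != target_pred[0, n + i - 1]:
            break
        accepted += 1
    keep = proposal[:, : n + accepted]
    bonus = target_pred[:, n + accepted - 1 : n + accepted]  # target's own next token
    return torch.cat([keep, bonus], dim=-1)
```

Every returned token is one the target model would have produced greedily on its own, so the speedup costs nothing in output quality; how often the draft agrees with the target is what determines the gain.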
- tvm_mlir_learn - BBuf: a collection of compiler learning resources · (mp.weixin.qq)
- No need for four H100s: the 34B-parameter Code Llama runs on a Mac at 20 tokens per second and is at its best on code generation (retweeted by Karpathy)
- MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, arXiv, 2312.16886, arxiv, pdf, citations: -1 · Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei · (MobileVLM - Meituan-AutoML)
- mlc-llm - mlc-ai: enable everyone to develop, optimize, and deploy AI models natively on everyone's devices · (jiqizhixin) · (jiqizhixin)
- lorax - predibase: multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
- Winners 🏆 | NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
- gigaGPT - Cerebras: a small code base for training large models · (cerebras)
- EAGLE - SafeAILab: EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation · (sites.google) · (jiqizhixin)
- optimum-nvidia - huggingface
- unsloth - unslothai: 5x faster, 50% less memory LLM fine-tuning
- lit-gpt - Lightning-AI: hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
- gpt-fast - pytorch-labs: simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.
- MS-AMP - Azure: Microsoft Automatic Mixed Precision Library
- DeepSpeed - microsoft: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- flash-linear-attention - sustcsonglin: fast implementations of causal linear attention for autoregressive language modeling (PyTorch)
- PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation, arXiv, 2312.17276, arxiv, pdf, citations: -1 · Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang
- Agent Attention: On the Integration of Softmax and Linear Attention, arXiv, 2312.08874, arxiv, pdf, citations: -1 · Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang · (agent-attention - leaplabthu)
- Weight subcloning: direct initialization of transformers using larger pretrained ones, arXiv, 2312.09299, arxiv, pdf, citations: -1 · Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
- Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models, arXiv, 2312.07046, arxiv, pdf, citations: -1 · Arnav Chavan, Nahush Lele, Deepak Gupta
- Efficient Monotonic Multihead Attention, arXiv, 2312.04515, arxiv, pdf, citations: -1 · Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv, 2312.00752, arxiv, pdf, citations: -1 · Albert Gu, Tri Dao · (qbitai)
- Simplifying Transformer Blocks, arXiv, 2311.01906, arxiv, pdf, citations: -1 · Bobby He, Thomas Hofmann · (jiqizhixin)
- Exponentially Faster Language Modelling, arXiv, 2311.10770, arxiv, pdf, citations: -1 · Peter Belcak, Roger Wattenhofer
- Alternating Updates for Efficient Transformers, arXiv, 2301.13310, arxiv, pdf, citations: -1 · Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022, arxiv, pdf, citations: 278 · Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- Fast Transformer Decoding: One Write-Head is All You Need, arXiv, 1911.02150, arxiv, pdf, citations: 61 · Noam Shazeer · (zhuanlan.zhihu)
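Shazeer's multi-query attention keeps many query heads but a single shared key/value head, cutting KV-cache size (and decode-time memory bandwidth) by the head count; grouped-query attention later interpolated between this and full multi-head. A minimal sketch (shapes are illustrative; the causal mask is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads: int):
    """MQA: n_heads query heads attend against ONE shared K/V head.

    x: (batch, seq, d_model); w_q: (d_model, d_model);
    w_k, w_v: (d_model, d_head) -- a single head's worth of K/V.
    """
    b, s, d = x.shape
    d_head = d // n_heads
    q = (x @ w_q).view(b, s, n_heads, d_head).transpose(1, 2)  # (b, h, s, d_head)
    k = (x @ w_k).unsqueeze(1)                                 # (b, 1, s, d_head), shared
    v = (x @ w_v).unsqueeze(1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5           # broadcasts over heads
    out = F.softmax(scores, dim=-1) @ v                        # (b, h, s, d_head)
    return out.transpose(1, 2).reshape(b, s, d)

out = multi_query_attention(torch.randn(2, 10, 512),
                            torch.randn(512, 512),
                            torch.randn(512, 64), torch.randn(512, 64),
                            n_heads=8)  # KV cache is 1/8 the multi-head size
```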
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, arXiv, 2401.03868, arxiv, pdf, citations: -1 · Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang · (jiqizhixin)