[1]: Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
[2]: OPT: Open Pre-trained Transformer Language Models (Zhang et al., 2022)
[3]: Release blog posts for MPT-7B (May 2023) and MPT-30B (June 2023)
[4]: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (BigScience, 2023)
[5]: Scaling Laws for Neural Language Models (Kaplan et al., 2020)
[6]: Mistral 7B (Jiang et al., 2023)
[7]: Efficient Streaming Language Models with Attention Sinks (Xiao et al., 2023) + GitHub repository
[8]: H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Zhang et al., 2023) + GitHub repository
[9]: Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (Liu et al., 2023)
[10]: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (Ge et al., 2023)
[11]: Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)
[12]: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)
[13]: PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
[14]: The Falcon Series of Open Language Models (Almazrouei et al., 2023)
[15]: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) + GitHub repository
[16]: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) + GitHub repository
[17]: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) + GitHub repository
[18]: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., 2022) + GitHub repository
[19]: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Sheng et al., 2023) + GitHub repository
[20]: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) + GitHub repository
[21]: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (Kwon et al., 2023)
[22]: Efficiently Programming Large Language Models using SGLang (Zheng et al., 2023) + blog post
[23]: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang et al., 2018)
[24]: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Narayanan et al., 2021)
[25]: Efficiently Scaling Transformer Inference (Pope et al., 2022)
[26]: Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (Lin et al., 2024)