
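A New Breakthrough in LLM Inference Acceleration: FlashDecoding++
Large language models (LLMs) are changing the world: from generating text to translating languages to writing code, their range of applications keeps growing. Inference speed, however, remains a key constraint on deployment, and researchers have been exploring many ways to accelerate LLM inference.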

This article introduces the paper "FlashDecoding++: Faster Large Language Model Inference on GPUs", which proposes a new technique for accelerating LLM inference on GPUs.

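Challenges in Accelerating LLM Inference

Accelerating LLM inference faces three main challenges:
1. Synchronized partial softmax update: the softmax operation requires a synchronized update across the partial softmax results, which accounts for roughly 20% of the overhead of attention computation in LLMs.
2. Under-utilized computation of flat GEMM: the matrices involved in GEMM during LLM inference are flat, which leads to low compute utilization. In previous designs, padding with zeros causes a performance loss of more than 50%.
3. Performance loss from static dataflow: kernel performance in LLMs depends on input data characteristics, hardware configuration, and other factors. A single, static dataflow can cause a 50.25% performance loss for GEMMs of different shapes in LLM inference.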
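
The cost of zero padding in the second challenge can be illustrated with a little arithmetic. The sketch below assumes a hypothetical tile height of 64 rows (the article does not state a tile size) and reports what fraction of the computed rows carry real data. It is a rough proxy for wasted work, not the paper's measurement.

```python
import math


def padded_utilization(m: int, tile_m: int = 64) -> float:
    """Fraction of computed rows that hold real data after padding M.

    tile_m is a hypothetical tile height chosen for illustration only.
    """
    if m <= 0 or tile_m <= 0:
        raise ValueError("m and tile_m must be positive")
    # Round M up to the next multiple of the tile height.
    padded_m = math.ceil(m / tile_m) * tile_m
    return m / padded_m


if __name__ == "__main__":
    for m in (1, 8, 32, 64):
        print(f"M={m:>2}: useful fraction = {padded_utilization(m):.3f}")
```

Under this assumption, any flat GEMM with fewer than 32 rows spends more than half of its computed rows on padding.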
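
FlashDecoding++'s Solutions

FlashDecoding++ proposes the following solutions to these challenges:
1. Asynchronized softmax with a unified max value: FlashDecoding++ introduces a unified max value shared across the different partial softmax computations, which avoids synchronization.
2. Flat GEMM optimization with double buffering: FlashDecoding++ points out that flat GEMMs of different shapes face different bottlenecks, and introduces techniques such as double buffering to address them.
3. Heuristic dataflow with hardware resource adaptation: FlashDecoding++ heuristically optimizes the dataflow, taking input dynamics into account and using different hardware resources.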
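
Performance Gains

FlashDecoding++'s optimizations deliver significant speedups on both NVIDIA and AMD GPUs: up to 4.86× and 2.18×, respectively, over the Hugging Face implementation. Compared with state-of-the-art LLM inference engines on mainstream LLMs, FlashDecoding++ achieves an average speedup of 1.37×.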

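Summary

FlashDecoding++ offers a comprehensive approach to accelerating LLM inference that addresses the three challenges above. Its results on mainstream LLMs and hardware platforms support wider deployment of LLMs.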

References
- [2311.01282] FlashDecoding++: Faster Large Language Model Inference on GPUs
- Stanford CRFM
- GitHub – opengear-project/GEAR: GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
- DistServe/distserve at main · LLMServe/DistServe · GitHub

Note: this article gives only a brief introduction to the FlashDecoding++ paper; see the original paper for details.

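Inference performance for large language models (LLMs) on GPUs is critical. FlashDecoding++ is a fast LLM inference engine that achieves significant speedups by addressing three challenges: synchronized partial softmax updates, under-utilized flat GEMM computation, and static dataflow.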

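Addressing the synchronized partial softmax update:
FlashDecoding++ introduces asynchronized softmax with a unified max value, which removes the need for synchronized updates when computing partial softmax results. Each partial softmax result can be processed independently, without a synchronization step, which reduces overhead.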
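
To make the idea concrete, here is a minimal pure-Python sketch, not the paper's kernel. It contrasts a partial softmax that tracks a running maximum (and therefore has to rescale earlier partial sums) with one that subtracts a single shared constant. The function names and the value of `phi` are illustrative assumptions.

```python
import math


def softmax_synchronized(chunks):
    """Partial softmax with per-chunk maxima; earlier sums are rescaled."""
    running_max = -math.inf
    running_sum = 0.0
    for chunk in chunks:
        new_max = max(running_max, max(chunk))
        # Synchronized update: rescale what was accumulated so far.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return [[math.exp(x - running_max) / running_sum for x in chunk]
            for chunk in chunks]


def softmax_unified_max(chunks, phi):
    """Partial softmax with one shared constant phi; no rescaling needed."""
    # Each partial sum depends only on its own chunk and on phi.
    partial_sums = [sum(math.exp(x - phi) for x in chunk) for chunk in chunks]
    total = sum(partial_sums)
    return [[math.exp(x - phi) / total for x in chunk] for chunk in chunks]


if __name__ == "__main__":
    scores = [[0.5, 1.2, -0.3], [2.0, 0.1, 0.7]]
    print(softmax_synchronized(scores))
    print(softmax_unified_max(scores, phi=1.0))
```

Both functions return the same probabilities because the subtracted constant cancels in the ratio. The shared constant only works while `x - phi` stays in a range where `exp` neither overflows nor underflows, so a real kernel would need a fallback for out-of-range inputs; that handling is outside this sketch.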

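Addressing under-utilized flat GEMM computation:
FlashDecoding++ optimizes flat GEMM with double buffering, which hides memory-access latency and raises compute utilization. It allocates two separate buffers in shared memory: one is used for the current GEMM computation, while the other is loaded with the data needed for the next GEMM operation. Computation and memory access thus proceed at the same time, overlapping compute with data movement.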
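
The ping-pong pattern behind double buffering can be sketched in plain Python. This is only an illustration of the indexing scheme: Python runs the load and the compute one after the other, whereas on a GPU the prefetch would overlap with computation. The tile size and the `load` helper are hypothetical.

```python
def tiled_dot_double_buffered(a, b, tile=4):
    """Dot product over tiles, alternating between two buffers."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    starts = list(range(0, len(a), tile))
    if not starts:
        return 0.0

    def load(start):
        # Stand-in for a copy from global memory into shared memory.
        return a[start:start + tile], b[start:start + tile]

    buffers = [None, None]
    buffers[0] = load(starts[0])  # prologue: fill the first buffer
    total = 0.0
    for step in range(len(starts)):
        cur = step % 2
        nxt = 1 - cur
        if step + 1 < len(starts):
            # Prefetch the next tile into the buffer that is not in use.
            buffers[nxt] = load(starts[step + 1])
        tile_a, tile_b = buffers[cur]
        # Compute on the tile that was loaded during the previous step.
        total += sum(x * y for x, y in zip(tile_a, tile_b))
    return total


if __name__ == "__main__":
    print(tiled_dot_double_buffered(list(range(10)), [2] * 10))
```

The key property is that the buffer being computed on is never the one being written, so the two activities have no data dependency and can run concurrently on hardware that supports it.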

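Addressing static dataflow:
FlashDecoding++ uses a heuristic dataflow with hardware resource adaptation. Based on input dynamics and hardware configuration, it optimizes the dataflow dynamically for different linear workloads and selects the best-performing implementation. Adapting to the characteristics of each workload in this way improves inference performance.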
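
A heuristic dispatch of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the implementation names, the hardware labels, and the threshold values are hypothetical, and in practice such thresholds would come from profiling each kernel on the target hardware.

```python
# Hypothetical per-hardware thresholds on M (the number of rows of the
# activation matrix). Real values would be found by offline profiling.
HYPOTHETICAL_THRESHOLDS = {
    "gpu_a": (4, 64),
    "gpu_b": (8, 128),
}


def choose_linear_impl(m: int, hardware: str) -> str:
    """Pick an implementation for a linear workload from its shape."""
    if m <= 0:
        raise ValueError("m must be positive")
    small_m, large_m = HYPOTHETICAL_THRESHOLDS[hardware]
    if m < small_m:
        return "gemv_style"         # very flat: treat as matrix-vector
    if m < large_m:
        return "flat_gemm"          # moderately flat: flat-GEMM kernel
    return "conventional_gemm"      # tall enough for a standard GEMM


if __name__ == "__main__":
    for m in (1, 16, 64):
        print(m, choose_linear_impl(m, "gpu_a"), choose_linear_impl(m, "gpu_b"))
```

The same shape can map to different implementations on different hardware, which is the point of adapting the dataflow rather than fixing it.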

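Performance evaluation:
FlashDecoding++ was evaluated on multiple hardware platforms, including NVIDIA and AMD GPUs, and compared against LLM inference engines such as Hugging Face, vLLM, DeepSpeed, TensorRT-LLM, OpenPPL, and FlashDecoding. The results show significant speedups over these baselines, with inference up to 4.86× faster.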
June 16, 2024
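Learning to Count Before Understanding Language: Contextual Position Encoding in Large Language Models
When processing sequential data such as text, audio, or code, large language models (LLMs) often need to understand order. To understand a passage of text, for example, we need to know where each word sits before we can interpret it correctly. However, the attention mechanism on its own cannot directly capture order information in a sequence, so position encoding (PE) is introduced to solve this.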

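Conventional PE methods typically encode each token's position as a vector and add it to the token's representation. This is simple and effective, but it has a drawback: the position cannot adapt to context. Positions are counted in tokens, so if we want to attend to the i-th word or sentence, conventional PE has no direct way to address it, because the number of tokens in each word or sentence varies.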

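To address this, this article introduces a new position encoding method: Contextual Position Encoding (CoPE). The core idea of CoPE is to combine position information with context, adjusting the position encoding dynamically according to context.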

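Why Do We Need Contextual Position Encoding?

Imagine you are reading a long novel and want to know how many times a character appears. You would probably read through it word by word, keeping a tally. But if you want to know how many times the character appears in each chapter, you first need to find where each chapter begins and ends before you can count.

Conventional PE is like reading word by word: it can only encode a token by its position in the token sequence. CoPE is like first finding the chapter boundaries, then adjusting the position encoding dynamically according to context.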

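How CoPE Works

CoPE works in the following steps:
1. Compute gate values: for each token, CoPE computes a gate value from its context. The gate is a number between 0 and 1 that indicates whether the token should be counted toward the position.
2. Compute position values: CoPE computes each token's position value from the gates. A token with a gate of 1 is counted; a token with a gate of 0 is not.
3. Interpolate position embeddings: because position values can be fractional, CoPE computes the position embedding by interpolation.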
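
The first two steps can be illustrated with a toy example. The sketch below uses hand-set gates of exactly 0 or 1 that fire on sentence-ending punctuation, which is an assumption made purely for illustration: in CoPE the gates are computed by the model from context. It takes the last token as the query and sums the gates from each token up to the query.

```python
def contextual_positions(gates):
    """Position of each token relative to the last token (the query).

    The position of token j is the sum of the gates from j to the end.
    """
    positions = []
    running = 0.0
    for gate in reversed(gates):
        running += gate
        positions.append(running)
    positions.reverse()
    return positions


if __name__ == "__main__":
    tokens = ["Hi", ".", "How", "are", "you", "?", "Fine"]
    # Hypothetical hard gates: count only sentence-ending punctuation.
    gates = [1.0 if t in {".", "?"} else 0.0 for t in tokens]
    for token, pos in zip(tokens, contextual_positions(gates)):
        print(f"{token!r}: {pos}")
```

With these gates the positions are `[2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 0.0]`: every token in the same sentence shares a position, so the position counts sentences rather than tokens.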
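
Advantages of CoPE

CoPE has the following advantages:
1. Context awareness: CoPE adjusts the position encoding dynamically according to context, so it reflects a token's position in the sequence more accurately.
2. Multiple levels of abstraction: CoPE can represent different levels of abstraction at once, such as words, sentences, and paragraphs.
3. Flexibility: CoPE's gates can adapt to the needs of different tasks, yielding different position-encoding strategies.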
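
Experimental Results

The paper evaluates CoPE on several tasks:
- Flip-Flop task: the model must remember the most recent write operation in a sequence. CoPE brings a significant improvement on this task, especially in generalization.
- Selective copy task: the model must selectively copy some tokens from a sequence. CoPE also improves significantly here, especially on sequences containing many blank tokens.
- Counting task: the model must count the tokens of a particular type in a sequence. CoPE improves significantly on this task, especially on sequences involving multiple variables.
- Language modeling: CoPE achieves better language modeling results on the Wikitext-103 dataset.
- Code modeling: CoPE achieves better modeling results on a code dataset.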
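
To show what the Flip-Flop task asks of a model, here is a small reference solver. The sequence encoding is an assumption made for illustration: each step is a pair of an instruction (`"w"` for write, `"i"` for ignore, `"r"` for read) and a bit, and the expected answer to a read is the bit of the most recent write.

```python
def flip_flop_answers(sequence):
    """Return the expected value for every read in the sequence."""
    last_written = None
    answers = []
    for op, bit in sequence:
        if op == "w":
            last_written = bit          # remember the latest write
        elif op == "r":
            if last_written is None:
                raise ValueError("read before any write")
            answers.append(last_written)
        elif op != "i":                 # "i" steps are distractors
            raise ValueError(f"unknown instruction: {op!r}")
    return answers


if __name__ == "__main__":
    seq = [("w", 1), ("i", 0), ("i", 1), ("w", 0), ("i", 1), ("r", None)]
    print(flip_flop_answers(seq))
```

The solver needs only one variable of state, but a model has to locate the last write among an arbitrary number of ignore steps, which is why the distance to that write varies and position handling matters.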
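
Summary

CoPE is a new position encoding method that adjusts position encoding dynamically according to context, so that it reflects a token's position in the sequence more accurately. Its significant improvements across several tasks suggest strong practical value.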

References
https://arxiv.org/pdf/2405.18719

Here’s a breakdown of the paper’s key points:

Problem:
- Traditional Position Encoding Limitations: Existing position encoding methods, like absolute and relative PE, rely on token counts as the unit of measurement. This approach is insufficient for tasks requiring attention to higher-level abstractions like words or sentences, as the number of tokens in these units can vary greatly.
- Inability to Generalize: Standard PE methods struggle to generalize to out-of-distribution scenarios where the token distribution differs from the training data.
Proposed Solution: CoPE

CoPE addresses these limitations by making position encoding context-dependent. Here’s how it works:
1. Gate Calculation: For each query token, CoPE computes a gate value for every preceding token in the sequence. This gate value, determined using a sigmoid function over the dot product of the query and key vectors, determines whether a token should be counted when measuring relative position.
   - A gate value close to 1 indicates the token should be counted.
   - A gate value close to 0 indicates the token should be ignored.
2. Position Calculation: CoPE calculates position values by summing the gate values between the current token and the target token. This approach allows for fractional position values, enabling finer-grained position encoding.
3. Position Embedding Interpolation: As fractional position values don’t have direct embeddings, CoPE interpolates between embeddings of the two nearest integer positions.
4. Attention Calculation: Finally, CoPE incorporates the interpolated position embeddings into the attention mechanism, allowing for context-aware position-based attention.
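
The four steps can be put together in a minimal pure-Python sketch for a single query. It is an illustration under stated assumptions, not the paper's implementation: the function name is hypothetical, positions are clipped to the largest index in the embedding table, and the position term is added to the attention logit as `q · e[p]`, which is one plausible way to incorporate it.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cope_attention_row(q, keys, pos_emb):
    """Attention weights of one query over its keys (last key = current token)."""
    p_max = len(pos_emb) - 1
    # Step 1: one gate per key, from the query-key dot product.
    gates = [sigmoid(dot(q, k)) for k in keys]
    # Step 2: position of key j = sum of gates from j up to the query.
    positions, running = [], 0.0
    for gate in reversed(gates):
        running += gate
        positions.append(min(running, float(p_max)))  # clipping is an assumption
    positions.reverse()
    logits = []
    for k, p in zip(keys, positions):
        # Step 3: interpolate between the two nearest integer embeddings.
        lo = math.floor(p)
        hi = min(lo + 1, p_max)
        frac = p - lo
        e_p = [(1.0 - frac) * a + frac * b
               for a, b in zip(pos_emb[lo], pos_emb[hi])]
        # Step 4: add the position term to the content term.
        logits.append(dot(q, k) + dot(q, e_p))
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    norm = sum(exps)
    return [x / norm for x in exps], positions


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[0.5, 1.0], [-1.0, 0.3], [2.0, 0.0]]
    pos_emb = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]
    weights, positions = cope_attention_row(q, keys, pos_emb)
    print(weights)
    print(positions)
```

Because the gates are sigmoids, the positions are generally fractional and never increase from the earliest key to the current token, which is what makes the interpolation in step 3 necessary.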
Advantages of CoPE:
- Contextualized Position Encoding: CoPE enables the model to learn different position encodings based on the context, allowing it to attend to various levels of abstraction (e.g., words, sentences).
- Improved Generalization: CoPE demonstrates superior generalization capabilities compared to traditional methods, especially in out-of-distribution scenarios.
Experimental Results:

The paper showcases CoPE’s effectiveness on various tasks:
- Flip-Flop Task: CoPE achieves near-perfect accuracy on both in-distribution and out-of-distribution settings, outperforming existing PE methods.
- Selective Copy Task: CoPE successfully learns to copy relevant tokens while ignoring blanks, demonstrating its ability to handle variable-length units.
- Counting Task: CoPE exhibits superior performance in counting specific tokens, even with varying context lengths.
- Language Modeling: CoPE shows improved perplexity on the WikiText-103 benchmark compared to absolute PE.
Conclusion:

CoPE presents a significant advancement in position encoding for attention mechanisms. By making position encoding context-dependent, CoPE allows models to learn more nuanced and generalizable representations of positions within sequences, leading to improved performance on a variety of tasks.
June 16, 2024
