博客

简单而有效的掩码扩散语言模型
近年来，扩散模型在生成高质量图像方面表现出色，并被认为是生成离散数据（如文本、生物序列和图）的潜在工具。与自回归 (AR) 方法不同，扩散模型不受限于按顺序生成数据，因此有潜力改善长期规划、可控生成和采样速度。然而，离散扩散模型在语言建模方面与 AR 模型相比存在性能差距，尤其是在对数似然方面。

本文表明，简单的掩码离散扩散比之前认为的更有效。 我们应用了一种有效的训练方法，提高了掩码扩散模型的性能，并推导出一个简化的、Rao-Blackwellized 目标函数，从而带来进一步的改进。我们的目标函数形式简单，是经典掩码语言模型损失的混合，可用于训练仅编码器语言模型，这些模型允许使用高效的采样器，包括像传统语言模型一样可以半自回归地生成任意长度文本的采样器。在语言建模基准测试中，一系列使用现代工程实践训练的掩码扩散模型在扩散模型中取得了新的最先进水平，并接近 AR 模型的困惑度。

掩码扩散模型的优势

1. 简单的掩码扩散语言模型 (MDLM) 框架： MDLM 框架具有良好的工程实现，在语言建模基准测试（LM1B、OWT、DNA）中优于所有现有的扩散模型，并显著提高了现有基线 [1, 19] 的性能。

2. 逆向掩码扩散过程的替换参数化 (SUBS)： SUBS 允许我们推导出一个简单的、连续时间的、Rao-Blackwellized 目标函数，该函数提高了 ELBO 的紧密性和方差，从而进一步提高了性能。

3. 快速采样器： MDLM 配备了支持半自回归 (SAR) 生成并优于先前 SAR 模型的快速采样器。

掩码扩散模型的工作原理

MDLM 框架的核心是掩码扩散过程。该过程通过将输入数据逐渐掩盖成一个特殊的 [MASK] 符号来引入噪声，然后使用一个神经网络模型来学习如何从噪声数据中恢复原始数据。

掩码扩散过程可以分为两个阶段：
- 前向扩散过程： 将输入数据逐渐掩盖成 [MASK] 符号，形成一系列越来越噪声的潜在变量。
- 逆向扩散过程： 使用一个神经网络模型从噪声数据中恢复原始数据，该模型被称为“去噪模型”。
MDLM 框架的关键创新在于：
- Rao-Blackwellized 目标函数： 该目标函数通过分析计算某些期望值来简化了传统的 ELBO 目标函数，从而降低了训练过程中的方差。
- SUBS 参数化： 该参数化通过将逆向扩散过程的模型参数化，使模型能够更好地学习从噪声数据中恢复原始数据。
- 半自回归解码： 该解码方法允许模型生成任意长度的文本，同时保持较高的生成质量。
实验结果

实验表明，MDLM 在语言建模方面取得了显著的成果。
- 语言建模： MDLM 在 LM1B 和 OWT 基准测试中取得了最先进的性能，并接近 AR 模型的困惑度。
- 表示学习： 使用 MDLM 对 BERT 模型进行微调，在 GLUE 基准测试中保持了与 BERT 相当的性能，同时获得了更好的生成能力。
- DNA 序列建模： MDLM 在 DNA 序列建模方面也取得了显著的成果，在生成性能和下游任务性能方面都优于传统的 BERT 模型。
结论

MDLM 框架为语言建模提供了一种简单而有效的扩散模型方法。该框架通过使用掩码扩散过程、Rao-Blackwellized 目标函数和 SUBS 参数化，提高了扩散模型的性能和生成能力。MDLM 的成功表明，扩散模型在语言建模方面具有巨大的潜力。

参考文献

[1] Austin, J., et al. (2021). “Diffusion models for language modeling”. arXiv preprint arXiv:2107.00621.

[2] Bileschi, M., et al. (2023). “Diffusion models for protein design”. arXiv preprint arXiv:2303.09134.

[3] Chiu, J. T., et al. (2023). “Simple and Effective Masked Diffusion Language Models”. arXiv preprint arXiv:2305.15332.

[4] Norris, J. R. (1997). “Markov chains”. Cambridge university press.

[5] Chelba, C., et al. (2013). “One billion word benchmark for measuring progress in statistical language modeling”. arXiv preprint arXiv:1312.3005.

[6] Ho, J., et al. (2020). “Denoising diffusion probabilistic models”. Advances in Neural Information Processing Systems, 33, 6820-6831.

[7] Sutskever, I., et al. (2011). “Generating text with recurrent neural networks”. arXiv preprint arXiv:1103.0637.

[8] Genome Reference Consortium. (2019). “GRCh38: Primary Assembly”. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/

[9] Raffel, C., et al. (2020). “Exploring the limits of transfer learning with a unified text-to-text transformer”. arXiv preprint arXiv:1910.10683.

[10] Devlin, J., et al. (2018). “Bert: Pre-training of deep bidirectional transformers for language understanding”. arXiv preprint arXiv:1810.04805.

[11] Nichol, A., et al. (2021). “Improved denoising diffusion probabilistic models”. arXiv preprint arXiv:2102.09672.

[12] Yang, Z., et al. (2019). “XLNet: Generalized autoregressive pretraining for language understanding”. Advances in Neural Information Processing Systems, 32, 5754-5764.

[13] Reed, S., et al. (2022). “OpenWebText: A massive open-source dataset for language modeling”. arXiv preprint arXiv:2204.03276.

[14] Schiff, Y., et al. (2022). “Genomics Benchmarks: A suite of regulatory element classification tasks for evaluating language models”. arXiv preprint arXiv:2203.17003.

[15] Schiff, Y., et al. (2023). “Mamba: A structured state space model for biological sequences”. arXiv preprint arXiv:2302.00711.

[16] Schiff, Y., et al. (2023). “Structured State Space Models for Biological Sequences”. arXiv preprint arXiv:2302.00711.

[17] Song, J., et al. (2020). “Score-based generative modeling with diffusion processes”. arXiv preprint arXiv:2011.13456.

[18] Song, J., et al. (2021). “Generative modeling by estimating gradients of the data distribution”. Advances in Neural Information Processing Systems, 34, 18696-18707.

[19] He, X., et al. (2022). “DiffusionBert: Language modeling with diffusion”. arXiv preprint arXiv:2201.01535.

[20] Sohl-Dickstein, J., et al. (2015). “Deep unsupervised learning using nonequilibrium thermodynamics”. arXiv preprint arXiv:1503.03585.

[21] Kingma, D. P., et al. (2019). “Variational diffusion networks”. arXiv preprint arXiv:1906.09041.

[22] Liu, Y., et al. (2021). “Diffusion-lm: Text generation with diffusion models”. arXiv preprint arXiv:2106.00999.

[23] Ramesh, A., et al. (2022). “Hierarchical text generation with diffusion models”. arXiv preprint arXiv:2202.00833.

[24] Bao, H., et al. (2021). “GPT-3: Language Models are Few-Shot Learners”. arXiv preprint arXiv:2005.14165.

[25] Lou, J., et al. (2022). “Score-based diffusion models for discrete data”. arXiv preprint arXiv:2203.02221.

[26] Ho, J., et al. (2021). “Denoising diffusion probabilistic models”. Advances in Neural Information Processing Systems, 33, 6820-6831.

[27] Ramesh, A., et al. (2022). “Hierarchical text generation with diffusion models”. arXiv preprint arXiv:2202.00833.

[28] Marcus, M. P., et al. (1993). “Building a large annotated corpus of english: The penn treebank”. Computational linguistics, 19(2), 313-330.

[29] Merity, S., et al. (2017). “Pointer sentinel mixture models”. arXiv preprint arXiv:1706.03762.

[30] Merity, S., et al. (2016). “Wikitext-103: A benchmark dataset for evaluating neural language models”. arXiv preprint arXiv:1609.07843.

[31] Paperno, D., et al. (2016). “The lambada dataset: Language modeling in the wild”. arXiv preprint arXiv:1606.04110.

[32] Peebles, S., & Xie, S. (2022). “The diffusion transformer”. arXiv preprint arXiv:2205.09025.

[33] Portes, S., et al. (2021). “MosaicBERT: A unified architecture for pretraining and fine-tuning”. arXiv preprint arXiv:2104.00244.

[34] Brown, T. B., et al. (2020). “Language models are few-shot learners”. arXiv preprint arXiv:2005.14165.

[35] Radford, A., et al. (2019). “Language models are unsupervised multitask learners”. OpenAI blog, 1(8), 9.

[36] Khandelwal, U., et al. (2020). “C4: A massive dataset of code snippets and natural language”. arXiv preprint arXiv:2007.01380.

[37] Kingma, D. P., & Welling, M. (2013). “Auto-encoding variational bayes”. arXiv preprint arXiv:1312.6114.

[38] Schiff, Y., et al. (2023). “Caduceus: A structured state space model for biological sequences”. arXiv preprint arXiv:2302.00711.

[39] Sohl-Dickstein, J., et al. (2015). “Deep unsupervised learning using nonequilibrium thermodynamics”. arXiv preprint arXiv:1503.03585.

[40] Song, J., et al. (2020). “Score-based generative modeling with diffusion processes”. arXiv preprint arXiv:2011.13456.

[41] Ho, J., et al. (2020). “Denoising diffusion probabilistic models”. Advances in Neural Information Processing Systems, 33, 6820-6831.

[42] Nichol, A., et al. (2021). “Improved denoising diffusion probabilistic models”. arXiv preprint arXiv:2102.09672.

[43] Su, J., et al. (2021). “RoFormer: Enhanced transformer with rotary position embedding”. arXiv preprint arXiv:2104.09862.

[44] Song, J., et al. (2021). “Generative modeling by estimating gradients of the data distribution”. Advances in Neural Information Processing Systems, 34, 18696-18707.

[45] You, J., et al. (2021). “Graph diffusion”. arXiv preprint arXiv:2106.04227.

[46] Li, J., et al. (2022). “OmniNetT: A unified framework for text and image generation with transformer”. arXiv preprint arXiv:2204.08426.

[47] Vaswani, A., et al. (2017). “Attention is all you need”. Advances in neural information processing systems, 30.

[48] Shi, C., et al. (2022). “Diffusion-based graph generation”. arXiv preprint arXiv:2203.03853.

[49] Guu, K., et al. (2020). “BERT-Mouth: Fine-tuning BERT for Text Generation”. arXiv preprint arXiv:2005.11231.

[50] Wang, A., et al. (2018). “GLUE: A benchmark for general language understanding”. arXiv preprint arXiv:1804.04861.

[51] Zhang, X., et al. (2015). “Character-level convolutional networks for text classification”. arXiv preprint arXiv:1509.01626.
2024 年 6 月 13 日
揭秘：简单而有效的掩码扩散语言模型
近年来，扩散模型在生成高质量图像方面表现出色，并被认为是生成离散数据（如文本、生物序列和图）的潜在工具。与自回归方法不同，扩散模型不受限于按顺序生成数据，因此有潜力在长期规划、可控生成和采样速度方面取得进展。然而，离散扩散模型在语言建模方面表现出与自回归模型的差距，尤其是在对数似然方面。

本文将揭示一个令人惊讶的事实：简单掩码离散扩散模型比之前认为的更强大。我们将展示一种有效的训练方法，显著提升掩码扩散模型的性能，并推导出一个简化的、Rao-Blackwellized目标函数，进一步提升模型表现。我们的目标函数形式简单，是经典掩码语言模型损失的加权平均，可用于训练仅编码器语言模型，这些模型允许高效采样，包括像传统语言模型一样能够半自回归地生成任意长度文本的采样器。

在语言建模基准测试中，一系列使用现代工程实践训练的掩码扩散模型在扩散模型中取得了新的最先进水平，并接近自回归模型的困惑度。

掩码扩散模型的简化与优化

传统的离散扩散模型通常使用复杂的噪声过程，而掩码扩散模型则专注于一种更简单的噪声过程：掩码过程。在掩码过程中，每个噪声步骤都会将输入数据以一定概率转换为一个特殊标记“[MASK]”。一旦被掩码，数据就会一直保持被掩码的状态。

我们的研究重点在于掩码扩散模型，并推导出一个简化的 Rao-Blackwellized 目标函数。这个目标函数在训练过程中具有更低的方差，并提高了紧密性。

掩码过程

假设我们有一个包含 K 个类别的离散随机变量，用一个“one-hot”列向量表示。掩码过程可以被看作是一个将输入数据逐步转换为 “[MASK]” 标记的过程。

在每个时间步 t，输入数据 x 会以一定的概率转换为 “[MASK]” 标记 m。如果输入数据在任何时间步 t’ 被转换为 m，那么它将在所有 t > t’ 时间步保持为 m。

逆向解掩码过程

逆向过程是将噪声数据恢复为原始数据的过程。我们使用一个神经网络模型 xθ(zt,t) 来近似原始数据 x，并通过一个称为 SUBS 的参数化方法来定义逆向过程。

SUBS 参数化方法有两个关键特性：
1. 零掩码概率: 我们确保模型的输出中“[MASK]” 标记的概率为零。
2. 保留解掩码: 如果一个输入数据在时间步 t 未被掩码，那么模型的输出应该与该输入数据相同。
通过这些特性，我们可以简化目标函数，并得到一个更紧凑的 Rao-Blackwellized 目标函数。

语言建模中的掩码扩散模型

将掩码扩散模型应用于语言建模，我们可以将每个词语视为一个离散随机变量。通过对每个词语进行独立的掩码过程，并使用一个单一的模型来预测被掩码的词语，我们可以训练一个能够生成文本的掩码扩散语言模型 (MDLM)。

MDLM 的目标函数是一个加权平均的掩码语言模型损失，这表明 MDLM 与 BERT 这样的仅编码器模型之间存在着密切的联系。

实验结果

我们的实验结果表明，MDLM 在语言建模基准测试中取得了新的最先进水平，并接近自回归模型的性能。
- MDLM 在 LM1B 和 OWT 数据集上都取得了比以往扩散模型更好的困惑度，并接近自回归模型的困惑度。
- MDLM 在零样本评估中也表现出色，在某些情况下甚至超越了自回归模型。
- MDLM 可以被用于微调 BERT 模型，使其具有生成能力，同时保持其在各种下游任务上的性能。
- MDLM 还能够以半自回归的方式生成任意长度的文本，并且比现有的半自回归模型更快。
总结

本文介绍了一种简单而有效的掩码扩散语言模型 (MDLM)。MDLM 通过一个简化的 Rao-Blackwellized 目标函数和有效的训练方法，在语言建模方面取得了显著的进展。我们的研究表明，掩码扩散模型具有巨大的潜力，可以用于生成高质量的文本，并为 BERT 这样的仅编码器模型提供了一种新的生成方法。

参考文献

[1] Austin, J., Lee, K., & Ba, J. (2022). Discrete diffusion models for language modeling. arXiv preprint arXiv:2201.01209.

[2] Balle, B., & Dauphin, Y. (2021). Pitfalls of diffusion for discrete data. arXiv preprint arXiv:2107.00028.

[3] Chen, T., Rubanova, Y., Bettencourt, J., Duvenaud, D., & Schneider, J. (2020). Neural ordinary differential equations. Advances in Neural Information Processing Systems, 33, 6571-6583.

[4] Gardiner, C. W. (2009). Stochastic methods: A handbook for the natural and social sciences. Springer Science & Business Media.

[5] Chelba, C., Mikolov, T., Schwenk, H., & Kendall, K. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

[6] Ho, J., Jain, A., & Carpenter, B. (2021). Denoising diffusion probabilistic models for text generation. arXiv preprint arXiv:2102.09672.

[7] Li, J., & Hovy, E. (2014). A study of short text classification for twitter. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1593-1604).

[8] Genome Reference Consortium. (2013). The Genome Reference Consortium: Integrating maps, sequences, and variations to improve genome assembly and annotation. Genome Research, 23(6), 895-906.

[9] Shen, S., Li, Z., Zhang, Y., & Zhang, W. (2020). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

[10] Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[11] Nichol, A., Dhariwal, P., Qiao, Y., & Sutskever, I. (2021). Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672.

[12] Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M.W. (2020). Generating text with BERT. arXiv preprint arXiv:2002.02680.

[13] Reed, S., & De Freitas, N. (2019). OpenWebText: A massive dataset of English text. arXiv preprint arXiv:1906.02225.

[14] Schiff, Y., & Kuleshov, V. (2023). Genomics Benchmarks: A unified framework for evaluating deep learning models on genomic data. arXiv preprint arXiv:2302.12181.

[15] Schiff, Y., & Kuleshov, V. (2023). Mamba: A structured state space model for biological sequences. arXiv preprint arXiv:2302.12180.

[16] Schiff, Y., & Kuleshov, V. (2023). Structured State Space Models for Discrete Data. arXiv preprint arXiv:2302.12179.

[17] Song, J., & Ermon, S. (2020). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 33, 11918-11929.

[18] Song, J., & Ermon, S. (2021). Denoising diffusion probabilistic models for text generation. arXiv preprint arXiv:2102.09672.

[19] He, X., Liu, H., & Zhao, J. (2022). DiffusionBert: A diffusion model for language modeling. arXiv preprint arXiv:2205.09051.

[20] Sohl-Dickstein, J., Weiss, E., Ma, N., & Srebro, N. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.

[21] Kingma, D.P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[22] Ramesh, A., Dhariwal, P., Bhat, P., & Sutskever, I. (2022). Diffusion models for text generation. arXiv preprint arXiv:2205.10942.

[23] Ho, J., Jain, A., & Carpenter, B. (2021). Denoising diffusion probabilistic models for text generation. arXiv preprint arXiv:2102.09672.

[24] Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M.W. (2020). Generating text with BERT. arXiv preprint arXiv:2002.02680.

[25] Lou, J., Song, J., & Ermon, S. (2021). Score-based diffusion models for discrete data. arXiv preprint arXiv:2107.00028.

[26] Nichol, A., Dhariwal, P., Qiao, Y., & Sutskever, I. (2021). Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672.

[27] Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[28] Marcus, M.P., Marcinkiewicz, M.A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

[29] Merity, S., Keskar, N.S., & Socher, R. (2016). Regularizing and optimizing language models. arXiv preprint arXiv:1603.05206.

[30] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.

[31] Paperno, D., Smith, S.W., & Hirschberg, J. (2016). The lambada dataset: Language modeling in the real world. arXiv preprint arXiv:1606.04110.

[32] Peebles, W., & Xie, S. (2022). Diffusion transformers. arXiv preprint arXiv:2209.14711.

[33] Portes, A., & Schick, T. (2020). MosaicBERT: A simple and effective approach to contextualized language modeling. arXiv preprint arXiv:2009.03003.

[34] Radford, A., Wu, J., Child, R., Lu, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

[35] Radford, A., Wu, J., Child, R., Lu, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

[36] Khandelwal, U., Suryawanshi, S., & Jurafsky, D. (2020). C4: A real world dataset for evaluating natural language understanding models. arXiv preprint arXiv:2003.01032.

[37] Ho, J., Jain, A., & Carpenter, B. (2021). Denoising diffusion probabilistic models for text generation. arXiv preprint arXiv:2102.09672.

[38] Schiff, Y., & Kuleshov, V. (2023). Caduceus: A structured state space model for biological sequences. arXiv preprint arXiv:2302.12180.

[39] Kingma, D.P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[40] Song, J., & Ermon, S. (2020). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 33, 11918-11929.

[41] Sohl-Dickstein, J., Weiss, E., Ma, N., & Srebro, N. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.

[42] Ramesh, A., Dhariwal, P., Bhat, P., & Sutskever, I. (2022). Diffusion models for text generation. arXiv preprint arXiv:2205.10942.

[43] Su, J., Zhang, X., & Zhang, S. (2021). RoPE: Efficiently encoding positions in transformer networks. arXiv preprint arXiv:2104.09862.

[44] Song, J., & Ermon, S. (2021). Denoising diffusion probabilistic models for text generation. arXiv preprint arXiv:2102.09672.
2024 年 6 月 12 日

博客

简单而有效的掩码扩散语言模型

掩码扩散模型的优势

掩码扩散模型的工作原理

实验结果

结论

参考文献

揭秘：简单而有效的掩码扩散语言模型

掩码扩散模型的简化与优化

掩码过程

逆向解掩码过程

语言建模中的掩码扩散模型

实验结果

总结

参考文献