标签： AGI

FILM-7B: A Large Language Model that Makes Full Use of Context
Large language models (LLMs) are becoming increasingly powerful, but they still struggle to fully utilize information within long contexts. This “lost-in-the-middle” challenge can hinder the development of LLMs, as they may fail to understand the full meaning of long texts.

This blog article will discuss a new approach called FILM-7B (FILl-in-the-Middle) that addresses this challenge. FILM-7B is based on Mistral-7B and utilizes information-intensive (IN2) training, a data-driven solution that emphasizes the importance of every position in a long context.

The Lost-in-the-Middle Challenge

LLMs often struggle to understand the full meaning of long texts because they fail to recognize the importance of information in the middle of the context. This can lead to errors in tasks such as question answering and summarization.

The “lost-in-the-middle” challenge is caused by a lack of explicit supervision during training. LLMs are not explicitly taught that every position in a long context can hold crucial information.

FILM-7B: A Data-Driven Solution

FILM-7B addresses the “lost-in-the-middle” challenge through IN2 training. This training method uses a synthesized long-context question-answer dataset, where the answer requires:
- Fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens).
- Integration and reasoning of information from two or more short segments.
By applying IN2 training to Mistral-7B, FILM-7B is able to effectively utilize information from different positions in its 32K context window.

Evaluation and Results

FILM-7B was evaluated on three probing tasks that encompass various context styles and information retrieval patterns. The results demonstrate that FILM-7B can robustly retrieve information from different positions in its long context window.

Furthermore, FILM-7B significantly improves the performance on real-world long-context tasks, while maintaining a comparable performance on short-context tasks. These results indicate that IN2 training can generalize to real-world scenarios and that FILM-7B does not compromise short-text capabilities during training.

Conclusion

FILM-7B is a promising LLM that addresses the “lost-in-the-middle” challenge through IN2 training. This data-driven approach allows FILM-7B to effectively utilize information from different positions in long contexts, leading to improved performance on both probing tasks and real-world long-context tasks.

Further Research

Several areas for further research are identified in the paper, including:
- Exploring the diversity of training data.
- Optimizing training strategies.
- Investigating the impact of different model architectures.
- Enhancing the model’s cross-lingual capabilities.
- Exploring real-time performance and robustness.
These research directions will help to further improve the capabilities of FILM-7B and other LLMs in handling long contexts.

Additional Resources
- GitHub Link: https://github.com/microsoft/FILM
- Paper: https://arxiv.org/abs/2310.05389
2024 年 4 月 26 日
如何让大型语言模型(LLMs)充分利用长文本信息?——微软提出的FILM方法

大家好,相信不少人已经体验过ChatGPT等大型语言模型(LLMs)强大的对话和写作能力。但你可能不知道,目前的LLMs在处理长文本(如长篇小说、学术论文等)时,还面临着一个棘手的问题,那就是”迷失在中间”(Lost-in-the-Middle)。

什么是”迷失在中间”?简单来说,就是模型在阅读一篇很长的文章时,往往能很好地理解文章开头和结尾的内容,但对中间段落的重要信息却视而不见。这就像我们看一部电影,只记住了精彩的开场和结局,但对中间情节毫无印象。

微软的研究人员推测,造成这个问题的原因,可能是目前用于训练LLMs的长文本数据存在偏差——它们没有明确告诉模型:文章的每个部分都可能包含关键信息,要认真对待!这就导致模型养成了”重两头、轻中间”的坏习惯。

为了纠正这个偏差,研究人员提出了一种名为”信息密集型训练”(Information-Intensive Training,简称IN2)的新方法。它的核心思想是:人工合成一批长文本问答数据,其中的问题都需要模型在长文本的不同部分准确定位信息,并将它们联系起来进行推理。通过在这样的数据集上反复训练,模型就能学会关注长文本的每个细节。

研究人员以Mistral-7B模型为基础,应用IN2训练方法,得到了一个名为FILM-7B的新模型。为了全面测试它的长文本理解能力,他们还精心设计了多个探测任务,覆盖不同的文本类型(如文档、代码、表格数据等)和信息检索模式(如串联、跳跃、双向等)。

在这些探测任务上,FILM-7B展现了出色的表现,证明它能够灵活地在长达32,000词的超长文本中准确定位关键信息。更令人兴奋的是,在现实世界的长文本应用中,如长篇问答(NarrativeQA)任务,FILM-7B的F1分数也从23.5大幅提高到26.9,而在需要推理的常识问答(CSQA)等短文本任务上,性能并未下降反而小幅提升(59.3%->59.2%),可见IN2方法的有效性。

此外,研究人员还将FILM-7B与其他知名的开源长文本模型(如ChatGLM、LongChat等)和商业模型(如GPT-3.5/4)进行了比较,结果显示FILM-7B在大多数长文本任务上都实现了最佳表现,充分证明了IN2训练的潜力。

当然,FILM-7B还有进一步改进的空间。例如,研究人员分析发现,在训练过程中合理使用”滑动窗口”和”位置编码”等技巧,有望进一步提高模型性能。未来,他们还计划在更大规模、更多样化的真实数据上应用IN2方法,以进一步提升FILM系列模型的长文本理解能力。

总之,这项研究为LLMs在长文本处理上的瓶颈问题提供了一种简单有效的解决思路,相信通过更多研究者的努力,LLMs必将在各类长文本应用场景中发挥更大的价值。感兴趣的读者可以访问论文 [项目网站](https://github.com/microsoft/FILM) 了解技术细节并动手实践。

以上就是我对这篇论文的通俗解读,不知你觉得如何?欢迎在评论区交流你的想法!

2024 年 4 月 26 日

标签： AGI

FILM-7B: A Large Language Model that Makes Full Use of Context

Large language models (LLMs) are becoming increasingly powerful, but they still struggle to fully utilize information within long contexts. This “lost-in-the-middle” challenge can hinder the development of LLMs, as they may fail to understand the full meaning of long texts.

如何让大型语言模型(LLMs)充分利用长文本信息?——微软提出的FILM方法