Biderman, S. et al. (2023). Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
Chase, H. (2022). LangChain. https://github.com/hwchase17/langchain.
Chiang, W.-L. et al. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/.
Clark, E. et al. (2023). SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation. arXiv preprint arXiv:2305.13194.
Fabbri, A. R. et al. (2021). SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9, 391–409.
Fernandes, P. et al. (2023). The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Proceedings of the Eighth Conference on Machine Translation.
Freitag, M. et al. (2021). Results of the WMT21 Metrics Shared Task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation.
Freitag, M. et al. (2022). Results of WMT22 Metrics Shared Task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation.
Freitag, M. et al. (2023). Results of WMT23 Metrics Shared Task: Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Conference on Machine Translation.
Fu, J. et al. (2023). GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Gao, L. et al. (2024a). A survey of large language model based automatic metrics for natural language generation. arXiv preprint arXiv:2404.14012.
Gao, L. et al. (2024b). Retrieval augmentation for large language model based evaluation metrics. arXiv preprint arXiv:2405.12504.
Iyer, S. et al. (2022). OPT-IML: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
Khattab, O. et al. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
Kocmi, T. and Federmann, C. (2023a). Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation.
Kocmi, T. and Federmann, C. (2023b). GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation.
Kojima, T. et al. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
Köpf, B. et al. (2023). OpenAssistant Conversations—democratizing large language model alignment. arXiv preprint arXiv:2304.07327.
Lee, A. N. et al. (2023a). Platypus: Quick, cheap, and powerful refinement of LLMs. arXiv preprint arXiv:2308.07317.
Leidinger, A. et al. (2023). The language of prompting: What linguistic properties make a prompt successful? In Findings of the Association for Computational Linguistics: EMNLP 2023.
Leiter, C. et al. (2023). The Eval4NLP 2023 shared task on prompting large language models as explainable metrics. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems.
Li, C. et al. (2023). Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760.
Li, S. et al. (2024a). Unbabel’s submission to the WMT23 metrics shared task: Prompting large language models for machine translation quality estimation. In Proceedings of the Eighth Conference on Machine Translation.
Li, Y. et al. (2024b). A survey of automatic metrics based on large language models for natural language generation. arXiv preprint arXiv:2404.00774.
Liu, P. et al. (2023a). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1–35.