Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jul 24, 2024
Jacob Haimes
Cenny Wenner
Kunvar Thaman
Vassil Tashev
Clement Neo
Esben Kran
Jason Schreiber
Abstract
Public benchmarks are compromised, as the training data for many Large Language Models (LLMs) is contaminated with test data, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the sufficient indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset’s public availability. Applying these methods to TruthfulQA, we construct and release Retro-TruthfulQA, on which we evaluate twenty LLMs and find that some have inflated scores by more than 10 percentage points. Our results demonstrate that public benchmark scores do not accurately assess model properties, and underscore the importance of improved data and evaluation practices in the field.
Type
Publication
In The 5th Workshop on Data-Centric Machine Learning Research at The Forty-first International Conference on Machine Learning and The 1st Workshop on Data Contamination at The 62nd Annual Meeting of the Association for Computational Linguistics
Summary
Public LLM benchmarks are compromised. To assess the impact that evaluation gaming is having on benchmark scores, we present a methodology for crafting retro-holdout datasets. Leveraging this strategy, we construct Retro-TruthfulQA, a retro-holdout for the TruthfulQA benchmark. Comparing LLM performance on these two datasets reveals undeniable evidence that developer practices are indeed undermining benchmarks.
Preliminary Results
We conduct an inflation assessment of 20 open-release and closed-source models on TruthfulQA using the newly constructed Retro-TruthfulQA. Our results clearly indicate that developer practices have inflated benchmark scores, compromising the validity of LLM evaluations.
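At its core, the inflation assessment compares a model's accuracy on the public benchmark with its accuracy on the retro-holdout. The sketch below illustrates that comparison only; `score_model` is a hypothetical placeholder, and this is not the evaluation harness used to produce the results above.

```python
# Minimal sketch of a benchmark-inflation check: compare a model's accuracy
# on the public dataset against its accuracy on the retro-holdout.
# `score_model` is a hypothetical placeholder for whatever evaluation routine
# produces per-entry correctness (e.g., multiple-choice accuracy).

from typing import Callable, Sequence


def accuracy(model_is_correct: Sequence[bool]) -> float:
    """Fraction of entries the model answered correctly."""
    return sum(model_is_correct) / len(model_is_correct)


def inflation_gap(
    score_model: Callable[[Sequence[dict]], Sequence[bool]],
    public_entries: Sequence[dict],
    retro_entries: Sequence[dict],
) -> float:
    """Percentage-point gap between public-benchmark and retro-holdout accuracy.

    A large positive gap suggests the public score is inflated, e.g., by
    contamination or indirect optimization against the benchmark.
    """
    public_acc = accuracy(score_model(public_entries))
    retro_acc = accuracy(score_model(retro_entries))
    return 100.0 * (public_acc - retro_acc)
```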
Methodology
To make these bold claims, we must first have a dataset that addresses the same evaluation task as the target dataset but has never been publicly available. In practice, this means our new dataset must be difficult to distinguish from the original, within some margin.
We design four tests against which a proposed retro-holdout is evaluated:
- Similarity of Difficulty: Are the entries in both datasets comparably challenging?
- Semantic Embedding Similarity: Is the distribution of cosine similarities between entry embeddings in the retro-holdout plausibly similar to that of the original dataset? (See the first sketch after this list.)
- Prediction Accuracy: Can a machine learning classifier predict which set an entry belongs to? (See the second sketch below.)
- Human Distinguishability: Can humans identify an entry from the retro-holdout hidden among two samples from the original dataset?
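As an illustration of the Semantic Embedding Similarity check, the sketch below compares the distribution of pairwise cosine similarities within each dataset using a two-sample Kolmogorov-Smirnov test. The embedding model name and significance threshold are placeholder assumptions, not the configuration used in the paper.

```python
# Sketch of the Semantic Embedding Similarity check: compare the distribution
# of pairwise cosine similarities within each dataset. The embedding model
# and the significance threshold below are placeholder assumptions.

from itertools import combinations

import numpy as np
from scipy.stats import ks_2samp
from sentence_transformers import SentenceTransformer


def pairwise_cosine_similarities(texts: list[str], model: SentenceTransformer) -> np.ndarray:
    """All pairwise cosine similarities between entry embeddings."""
    embeddings = model.encode(texts, normalize_embeddings=True)
    return np.array([float(np.dot(a, b)) for a, b in combinations(embeddings, 2)])


def embedding_distributions_match(original: list[str], retro: list[str], alpha: float = 0.05) -> bool:
    """True if a KS test cannot distinguish the two similarity distributions."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    original_sims = pairwise_cosine_similarities(original, model)
    retro_sims = pairwise_cosine_similarities(retro, model)
    return ks_2samp(original_sims, retro_sims).pvalue > alpha
```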
To aid in the construction and iteration of the retro-holdout, we also introduce multiple tools and enumerate our lessons learned.
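In the same spirit, the Prediction Accuracy test can be approximated by training a simple distinguisher and checking that its cross-validated accuracy stays near chance. The TF-IDF plus logistic-regression pipeline below is an illustrative stand-in, not the classifier used in the paper.

```python
# Sketch of the Prediction Accuracy test: if a classifier cannot tell which
# dataset an entry came from (accuracy near 50%), the retro-holdout is not
# trivially distinguishable. The TF-IDF + logistic-regression pipeline is an
# illustrative stand-in for whatever classifier is actually used.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def distinguisher_accuracy(original: list[str], retro: list[str]) -> float:
    """Cross-validated accuracy of a classifier predicting dataset membership."""
    texts = original + retro
    labels = np.array([0] * len(original) + [1] * len(retro))
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return float(cross_val_score(clf, texts, labels, cv=5, scoring="accuracy").mean())


# Usage: an accuracy well above chance (0.5) flags distinguishable datasets.
# if distinguisher_accuracy(truthfulqa_questions, retro_questions) > 0.6: ...
```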
Takeaways
- Developer practices are undermining LLM benchmarks
- Benchmark scores should be taken with substantial scepticism when evaluation data have been publicly available for some time
- Dataset creators should keep a private holdout dataset, and decommission their benchmarks once significant benchmark inflation has been measured
Correspondence
Please send all inquiries to jacob.d.haimes@gmail.com and cwenner@gmail.com.
Citation
@manuscript{haimes2024benchmark,
author = {Haimes, Jacob and Wenner, Cenny and
Thaman, Kunvar and Tashev, Vassil and
Neo, Clement and Kran, Esben and Schreiber, Jason},
title = {Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts},
year = 2024,
status = {forthcoming},
language = {en}
}