Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Jul 24, 2024
Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Abstract
Public benchmarks are compromised, as the training data for many Large Language Models (LLMs) is contaminated with test data, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the sufficient indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset’s public availability. Applying these methods to TruthfulQA, we construct and release Retro-TruthfulQA, on which we evaluate twenty LLMs and find that some have inflated scores by more than 10 percentage points. Our results demonstrate that public benchmark scores do not accurately assess model properties, and underscore the importance of improved data and evaluation practices in the field.
Type
Publication
In The 5th Workshop on Data-Centric Machine Learning Research at The Forty-first International Conference on Machine Learning and The 1st Workshop on Data Contamination at The 62nd Annual Meeting of the Association for Computational Linguistics

Summary

Public LLM benchmarks are compromised. To assess the impact that evaluation gaming is having on benchmark scores, we present a methodology for crafting retro-holdout datasets. Leveraging this strategy, we construct Retro-TruthfulQA, a retro-holdout for the TruthfulQA benchmark. Comparing LLM performance on these two datasets reveals undeniable evidence that developer practices are indeed undermining benchmarks.

Preliminary Results

We conduct an inflation assessment of 20 Open Release and Closed Source models on TruthfulQA using the newly constructed Retro-TruthfulQA. Our results clearly indicate that developer practices have inflated benchmark scores, compromising the validity of LLM evaluations.
Figure 4: Contemporary models, ordered by their TruthfulQA Misconceptions inflation rank. Models are grouped by color according to their respective developers. Uncertainty is shown with both one-sigma error bars and p-values; models with a p-value below 0.05 are marked with an *.
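
For intuition, the inflation shown in Figure 4 can be thought of as the gap between a model's accuracy on the public benchmark and its accuracy on the retro-holdout, together with a significance test on that gap. The sketch below is illustrative only and is not the paper's exact statistical procedure; the counts and the two-proportion z-test are assumptions for demonstration.

# Illustrative sketch (not the paper's exact procedure): estimate benchmark
# inflation as the accuracy gap between the public benchmark and its
# retro-holdout, with a one-sided two-proportion z-test for significance.
from statistics import NormalDist

def inflation_estimate(correct_public, n_public, correct_retro, n_retro):
    """Return (inflation in percentage points, one-sided p-value)."""
    acc_public = correct_public / n_public
    acc_retro = correct_retro / n_retro
    inflation = (acc_public - acc_retro) * 100

    # Pooled two-proportion z-test: is public-benchmark accuracy
    # significantly higher than retro-holdout accuracy?
    pooled = (correct_public + correct_retro) / (n_public + n_retro)
    se = (pooled * (1 - pooled) * (1 / n_public + 1 / n_retro)) ** 0.5
    z = (acc_public - acc_retro) / se
    p_value = 1 - NormalDist().cdf(z)
    return inflation, p_value

# Made-up counts for illustration: 75% accuracy on 800 public questions
# vs. 64% on 300 retro-holdout questions -> roughly 11 points of inflation.
print(inflation_estimate(600, 800, 192, 300))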

Methodology

To make these bold claims, we first need a dataset that addresses the same evaluation task as the target dataset but has never been publicly available. In practice, this means the new dataset must be indistinguishable from the original, within some margin.

We design four tests that a proposed retro-holdout must pass:

  • Similarity of Difficulty: Are the entries in both datasets comparably challenging?
  • Semantic Embedding Similarity: Does the distribution of cosine similarities between embeddings of our dataset seem plausible?
  • Prediction Accuracy: Can a machine learning classifier predict which set an entry belongs to? (A sketch of this test follows the list.)
  • Human Distinguishability: Can humans identify an entry from the retro-holdout when it is hidden among two samples from the original dataset?
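
As a concrete illustration of the Prediction Accuracy test, a simple classifier can be trained to guess which dataset each entry came from; cross-validated accuracy near chance (50%) indicates the two sets are hard to tell apart. The pipeline below is a minimal sketch assuming TF-IDF features and logistic regression, not the paper's exact implementation.

# Minimal sketch of the Prediction Accuracy test (assumed pipeline, not the
# paper's implementation): a TF-IDF + logistic-regression classifier tries to
# predict whether each question comes from the original dataset (label 0) or
# the candidate retro-holdout (label 1).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def distinguishability_score(original_questions, retro_questions, folds=5):
    """Cross-validated accuracy at predicting which set an entry belongs to."""
    texts = list(original_questions) + list(retro_questions)
    labels = [0] * len(original_questions) + [1] * len(retro_questions)
    classifier = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(classifier, texts, labels, cv=folds).mean()

A score well above 0.5 would suggest the candidate retro-holdout is still distinguishable from the original and needs further iteration.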
Figure 1: Diagram of the methodology for creating a retro-holdout dataset, and leveraging it to assess benchmark inflation in contemporary models.
To aid in the construction and iteration of the retro-holdout, we also introduce multiple tools and enumerate our lessons learned.

Takeaways

  • Developer practices are undermining LLM benchmarks
  • Benchmark scores should be treated with substantial skepticism when evaluation data have been publicly available for some time
  • Dataset creators should keep a private holdout dataset and decommission their benchmarks once significant inflation has been measured

Additional Figures

Figure 2: Results of the difficulty distribution test. All models used were trained prior to the release of the original TruthfulQA benchmark. Note that all entries fall within the 95% confidence band.
Figure 3: Contemporary model accuracy on Retro-TruthfulQA Misconceptions vs. TruthfulQA Misconceptions Non-Adversarial.

Correspondence

Please send all inquiries to jacob.d.haimes@gmail.com and cwenner@gmail.com.

Citation

@manuscript{haimes2024benchmark,
  author     = {Haimes, Jacob and Wenner, Cenny and
                Thaman, Kunvar and Tashev, Vassil and
                Neo, Clement and Kran, Esben and Schreiber, Jason},
  title      = {Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts},
  year       = {2024},
  status     = {forthcoming},
  language   = {en}
}