Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Jul 24, 2024
Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Abstract
Public benchmarks are compromised, as the training data for many Large Language Models (LLMs) is contaminated with test data, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the sufficient indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset’s public availability. Applying these methods to TruthfulQA, we construct and release Retro-TruthfulQA, on which we evaluate twenty LLMs and find that some have inflated scores by more than 10 percentage points. Our results demonstrate that public benchmark scores do not accurately assess model properties, and underscore the importance of improved data and evaluation practices in the field.
Type
Publication
In The 5th Workshop on Data-Centric Machine Learning Research at The Forty-first International Conference on Machine Learning and The 1st Workshop on Data Contamination at The 62nd Annual Meeting of the Association for Computational Linguistics

Summary

Public LLM benchmarks are compromised. To assess the impact that evaluation gaming is having on benchmark scores, we present a methodology for crafting retro-holdout datasets. Leveraging this strategy, we construct Retro-TruthfulQA, a retro-holdout for the TruthfulQA benchmark. Comparing LLM performance on these two datasets reveals undeniable evidence that developer practices are indeed undermining benchmarks.

Preliminary Results

We conduct an inflation assessment of 20 Open Release and Closed Source models on TruthfulQA using the newly constructed Retro-TruthfulQA. Our results clearly indicate that developer practices have inflated benchmark scores, compromising the validity of LLM evaluations.
Figure 4: Contemporary models, ordered by their TruthfulQA Misconceptions inflation rank. Models are grouped by color according to their respective developers. Uncertainty is shown with both one-sigma error bars and p-values; models with a p-value below 0.05 are marked with an *.
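
For intuition, the inflation shown in Figure 4 can be thought of as the gap between a model's accuracy on the public benchmark and its accuracy on the retro-holdout, together with a significance test on that gap. The sketch below is illustrative only and is not the paper's exact statistical procedure; the counts and the two-proportion z-test are assumptions for demonstration.

# Illustrative sketch (not the paper's exact procedure): estimate benchmark
# inflation as the accuracy gap between the public benchmark and its
# retro-holdout, with a one-sided two-proportion z-test for significance.
from statistics import NormalDist

def inflation_estimate(correct_public, n_public, correct_retro, n_retro):
    """Return (inflation in percentage points, one-sided p-value)."""
    acc_public = correct_public / n_public
    acc_retro = correct_retro / n_retro
    inflation = (acc_public - acc_retro) * 100

    # Pooled two-proportion z-test: is public-benchmark accuracy
    # significantly higher than retro-holdout accuracy?
    pooled = (correct_public + correct_retro) / (n_public + n_retro)
    se = (pooled * (1 - pooled) * (1 / n_public + 1 / n_retro)) ** 0.5
    z = (acc_public - acc_retro) / se
    p_value = 1 - NormalDist().cdf(z)
    return inflation, p_value

# Made-up counts for illustration: 75% accuracy on 800 public questions
# vs. 64% on 300 retro-holdout questions -> roughly 11 points of inflation.
print(inflation_estimate(600, 800, 192, 300))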

Methodology

To make these bold claims, we first need a dataset that addresses the same evaluation task as the target dataset but has never been publicly available. In practice, this means the new dataset must be indistinguishable from the original, within some margin.

We design four tests that a proposed retro-holdout must pass:

  • Similarity of Difficulty: Are the entries in both datasets comparably challenging?
  • Semantic Embedding Similarity: Does the distribution of cosine similarities between embeddings of our dataset seem plausible?
  • Prediction Accuracy: Can a machine learning classifier predict which set an entry belongs to? (A sketch of this test follows the list.)
  • Human Distinguishability: Can humans identify an entry from the retro-holdout when it is hidden among two samples from the original dataset?
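
As a concrete illustration of the Prediction Accuracy test, a simple classifier can be trained to guess which dataset each entry came from; cross-validated accuracy near chance (50%) indicates the two sets are hard to tell apart. The pipeline below is a minimal sketch assuming TF-IDF features and logistic regression, not the paper's exact implementation.

# Minimal sketch of the Prediction Accuracy test (assumed pipeline, not the
# paper's implementation): a TF-IDF + logistic-regression classifier tries to
# predict whether each question comes from the original dataset (label 0) or
# the candidate retro-holdout (label 1).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def distinguishability_score(original_questions, retro_questions, folds=5):
    """Cross-validated accuracy at predicting which set an entry belongs to."""
    texts = list(original_questions) + list(retro_questions)
    labels = [0] * len(original_questions) + [1] * len(retro_questions)
    classifier = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(classifier, texts, labels, cv=folds).mean()

A score well above 0.5 would suggest the candidate retro-holdout is still distinguishable from the original and needs further iteration.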
Figure 1: Diagram of the methodology for creating a retro-holdout dataset, and leveraging it to assess benchmark inflation in contemporary models.
To aid in the construction and iteration of the retro-holdout, we also introduce multiple tools and enumerate our lessons learned.

Takeaways

  • Developer practices are undermining LLM benchmarks
  • Benchmark scores should be treated with substantial skepticism when evaluation data have been publicly available for some time
  • Dataset creators should keep a private holdout dataset and decommission their benchmarks once significant inflation has been measured

Additional Figures

Figure 2: Results of the difficulty distribution test. All models used were trained prior to the release of the original TruthfulQA benchmark. Note that all entries fall within the 95% confidence band.
Figure 3: Contemporary model accuracy on Retro-TruthfulQA Misconceptions vs. TruthfulQA Misconceptions Non-Adversarial.

Correspondence

Please send all inquiries to jacob.d.haimes@gmail.com and cwenner@gmail.com.

Citation

@manuscript{haimes2024benchmark,
  author     = {Haimes, Jacob and Wenner, Cenny and
                Thaman, Kunvar and Tashev, Vassil and
                Neo, Clement and Kran, Esben and Schreiber, Jason},
  title      = {Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts},
  year       = {2024},
  status     = {forthcoming},
  language   = {en}
}