Leaderboard illusion: How big tech skewed AI rankings on Chatbot Arena https://t.co/28bOypYe5k
"The Leaderboard Illusion" paper raises an important problem — fair evaluation for every model. Here's what changes it suggests for Chatbot Arena: • No more deleting of low scores and hiding of poor results - all results should stay public. • Limit the number of models that https://t.co/rrBhaU1Eep
Leaderboard Illusion: "We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release & retract scores if desired..the ability of these providers to choose the best score leads to biased Arena scores" https://t.co/0FSYgQXMc2
A recent study titled "The Leaderboard Illusion" has raised concerns about potential bias in LM Arena's AI leaderboard, a prominent benchmarking platform for chatbot models. The research indicates that private testing practices allow a handful of large technology companies, including Meta, Google, and OpenAI, to gain an advantage by testing multiple model variants before public release and retracting the scores of weaker ones. This selective, best-of-N reporting distorts rankings in favor of those providers. The study calls for increased transparency, recommending that all test results remain public with no deletion of low scores, and suggests limiting the number of models each provider can submit to ensure fairer evaluation. These findings add to growing scrutiny of the fairness and trustworthiness of AI benchmarking across the industry.
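To make the mechanism concrete, here is a minimal Monte Carlo sketch in Python (my own illustration, not code from the paper): if a provider privately tests N variants of essentially the same model and publishes only the best noisy score, the published number sits above the model's true rating. All constants below (TRUE_SKILL, NOISE_SD, TRIALS) are assumed values chosen for illustration.

# Minimal sketch (assumed numbers, not values from "The Leaderboard Illusion"):
# how best-of-N score selection inflates a published leaderboard rating.
import random
import statistics

TRUE_SKILL = 1200.0   # hypothetical true Arena-style rating of the model
NOISE_SD = 30.0       # assumed measurement noise in one private test run
TRIALS = 10_000       # Monte Carlo repetitions

def measured_score() -> float:
    # One noisy rating estimate from a finite batch of pairwise battles.
    return random.gauss(TRUE_SKILL, NOISE_SD)

def best_of(n: int) -> float:
    # Provider privately tests n variants and publishes only the best score.
    return max(measured_score() for _ in range(n))

for n in (1, 3, 10, 30):
    scores = [best_of(n) for _ in range(TRIALS)]
    inflation = statistics.mean(scores) - TRUE_SKILL
    print(f"variants tested: {n:>2}  "
          f"mean published score: {statistics.mean(scores):7.1f}  "
          f"inflation: +{inflation:5.1f}")

Running this shows the inflation growing with N: for Gaussian noise, the expected maximum of N independent estimates exceeds the true rating by roughly NOISE_SD * sqrt(2 ln N), so even statistically identical models look stronger when a provider can cherry-pick which result to keep.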