Recent discussions among AI researchers highlight how easy it is to game benchmarks for large language models (LLMs) such as ChatGPT. Experts point out that training on paraphrased examples of test sets can lead to artificially high scores. Even LMSys, often treated as a gold standard for LLM benchmarking, is susceptible to manipulation. Critics argue that benchmarks like MMLU are not trustworthy, and emphasize the need for well-curated, secret test sets to maintain the integrity of evaluations. Trusted third-party evaluations, such as those from Scale AI, are recommended for more reliable assessments. This issue has come into focus again in light of the 'Reflection' saga, where a 70B model was claimed to outperform GPT-4.
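The paraphrase-contamination point is concrete enough to sketch. Exact n-gram matching misses paraphrased test questions, so a rough check can use fuzzy string similarity instead. The function names, threshold, and toy data below are illustrative assumptions, not a validated contamination detector:

```python
# Minimal sketch: flag training examples that look like (possibly paraphrased)
# copies of benchmark test questions. Exact matching misses paraphrases, so
# this uses a character-level fuzzy similarity ratio from the standard library.
# The 0.7 threshold and the toy data are arbitrary, illustrative choices.
from difflib import SequenceMatcher

def fuzzy_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flag_contamination(train_examples, test_questions, threshold=0.7):
    """Yield (train_idx, test_idx, score) for pairs above the threshold."""
    for i, tr in enumerate(train_examples):
        for j, te in enumerate(test_questions):
            score = fuzzy_similarity(tr, te)
            if score >= threshold:
                yield i, j, score

# Toy example: the second training item is a light paraphrase of the test question.
test_set = ["What is the capital of Australia?"]
train_set = [
    "Explain how photosynthesis works.",
    "What's the capital city of Australia?",
]

for i, j, score in flag_contamination(train_set, test_set):
    print(f"train[{i}] ~ test[{j}] (similarity={score:.2f})")
```

A check like this only catches surface-level paraphrases; heavier rewording would need embedding-based similarity, which is one reason secret test sets remain the more robust safeguard.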
LLM "hallucinations" are not always bad and it really depends on the context and use case. One thing is for sure, we see all kinds of hallucinations in the domains we work in (mostly code and knowledge). One thing I've noticed more recently with most of the advanced LLMs is… https://t.co/ZygwOiscoO
It's *incredibly* easy to game LLM benchmarks, and you don't even have to train on the test set for that. Something to keep in mind again in light of the "Reflection" saga. https://t.co/Negcjoi6H3
There is at least one good thing that has come out of the recent debacle. It is now pretty clear that there is no way to trust MMLU and other such benchmarks. In fact, you can also game lmsys: 1. Some portion of the lmsys data is open-source, and there are benchmarks available that show…
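The mechanism behind "you can also game lmsys" is straightforward once part of the arena data is public: if incoming prompts repeat prompts already in the released data, a model can recognize them and serve pre-polished answers. The sketch below illustrates that idea only; PUBLIC_ARENA_PROMPTS and polished_answers are hypothetical stand-ins, and no real dataset is loaded:

```python
# Minimal sketch of why publicly released arena prompts undermine a leaderboard:
# match incoming prompts against known public prompts and return cached,
# handcrafted answers on a hit. All names and data here are hypothetical.
from difflib import SequenceMatcher

PUBLIC_ARENA_PROMPTS = [
    "write a haiku about the ocean",
    "explain quantum entanglement to a 10 year old",
]
polished_answers = {
    0: "Waves fold into foam / ...",  # pre-written, human-reviewed answer
    1: "Imagine two magic coins that always agree...",
}

def best_match(prompt: str, corpus: list[str]) -> tuple[int, float]:
    """Return (index, similarity) of the closest known public prompt."""
    scores = [SequenceMatcher(None, prompt.lower(), c).ratio() for c in corpus]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]

def answer(prompt: str, threshold: float = 0.9) -> str:
    idx, score = best_match(prompt, PUBLIC_ARENA_PROMPTS)
    if score >= threshold:
        return polished_answers[idx]          # "gamed" path: cached response
    return "<fall back to the actual model>"  # normal generation path

print(answer("Write a haiku about the ocean"))
```

The point is not that anyone ships a lookup table like this, but that any overlap between public data and live evaluation traffic gives the same advantage in a softer form, e.g. by fine-tuning on the released conversations.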