Scale has introduced a new public leaderboard, SEAL, aimed at improving the evaluation of large language models (LLMs). It addresses key weaknesses in current evaluation methods, notably contamination of evaluation sets during model training and uneven quality among human raters. To preserve integrity, the leaderboard's evaluation prompts are kept private, and a model can only be featured the first time its organization encounters those prompts. The initiative is seen as a significant step toward higher-quality evaluations and benchmarks in the AI industry.
OMG did OpenAI game the Lmsys eval? Results from @scale_AI's new private SEAL leaderboard! https://t.co/phuhocqKnE
We're going to need a lot more investment in high-quality evals and benchmarks to help us understand the actual comparative utility of the various models. This new set of private evals and leaderboard from Scale are great to see. https://t.co/opRWuokcyV
LLM evals are the hot topic in AI right now, and the work @scale_AI is doing is helping shape the frontier! @danielxberrios @summeryue0 https://t.co/4yzURdupQd