🚨 New paper! Evaluating LLMs using closed-source LLMs has limited transparency, controllability, and affordability. Incredible work by @seungonekim significantly improves all these factors, w/ open models for either relative or absolute response scoring. ⬇️ https://t.co/RBVdas3dAb
An Open Source LM Specialized in Evaluating Other LMs. Open-source Prometheus 2 (7B & 8x7B): state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments. They support both direct assessment and pairwise ranking formats grouped with user-defined… https://t.co/DiHHcYHYZh
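
For a feel of how the direct-assessment (absolute scoring) mode can be driven, here is a minimal sketch using the transformers library. The Hugging Face model id and the simplified prompt layout are assumptions; the official prometheus-eval package and its released prompt templates may differ in detail.

```python
# Hedged sketch: absolute (direct-assessment) scoring with a Prometheus 2 model.
# Assumed: the model id "prometheus-eval/prometheus-7b-v2.0" and this simplified
# prompt layout; the official templates from the prometheus-eval repo may differ.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",  # assumed model id
    device_map="auto",
)

prompt = (
    "###Task Description:\n"
    "Score the response on a 1-5 scale against the rubric, then explain.\n\n"
    "###Instruction:\nExplain why the sky is blue.\n\n"
    "###Response:\nRayleigh scattering makes shorter (blue) wavelengths scatter "
    "more strongly than longer ones, so the sky appears blue.\n\n"
    "###Score Rubric:\nIs the explanation factually correct and clear?\n\n"
    "###Feedback:"
)

out = judge(prompt, max_new_tokens=256, do_sample=False)
# The generated feedback typically ends with a final verdict line such as
# "[RESULT] 4" (format per the Prometheus-style evaluator convention).
print(out[0]["generated_text"])
```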
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1K https://t.co/YqN4rVEPU9
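
The diagnostic GSM1K enables is simple to state: compare a model's accuracy on the public GSM8K test set with its accuracy on the fresh, unseen GSM1K items; a large gap points to memorization rather than reasoning. A hedged sketch of that comparison, with purely hypothetical answers and numbers:

```python
# Illustrative sketch of the overfitting signal GSM1K is meant to expose:
# accuracy on the public GSM8K test set vs. accuracy on held-out GSM1K-style
# items. All answers below are hypothetical placeholders, not paper results.
def accuracy(model_answers, gold_answers):
    """Fraction of exact matches between model answers and gold answers."""
    correct = sum(a == g for a, g in zip(model_answers, gold_answers))
    return correct / len(gold_answers)

gsm8k_acc = accuracy(["72", "5", "18"], ["72", "5", "19"])  # public benchmark
gsm1k_acc = accuracy(["72", "4", "11"], ["70", "4", "19"])  # fresh, unseen items

# A large positive gap suggests the model has overfit to (or memorized) GSM8K.
overfit_gap = gsm8k_acc - gsm1k_acc
print(f"GSM8K: {gsm8k_acc:.2%}  GSM1K: {gsm1k_acc:.2%}  gap: {overfit_gap:+.2%}")
```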

Recent research has focused on improving the evaluation of large language models (LLMs). A new paper led by @pat_verga proposes judging model outputs with an ensemble of smaller LLMs, termed a Panel of LLM Evaluators (PoLL), which is less biased, faster, and seven times cheaper than relying on a single large judge model, and which proved effective on QA and Arena-Hard evaluations. Separately, @scale_AI introduced GSM1K, a new grade-school-math benchmark designed to measure how much popular LLMs have overfit to public test sets such as GSM8K. Another notable release is the open-source Prometheus 2 family (7B & 8x7B), evaluator models designed to closely mirror human and GPT-4 judgments; they support both direct assessment and pairwise ranking with user-defined evaluation criteria, improving the transparency, controllability, and affordability of LLM evaluation.
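A rough sketch of the PoLL idea summarized above: several small judge models each label a candidate answer, and the panel's verdict comes from pooling their votes (majority vote here; the choice of pooling function depends on the evaluation setting). The judge callables below are illustrative stand-ins, not the paper's actual judge models or prompts.

```python
# Hedged sketch of a Panel of LLM Evaluators (PoLL): pool verdicts from several
# small judges instead of trusting one large judge. The lambda judges are
# illustrative stand-ins for real API or local-model calls.
from collections import Counter
from typing import Callable, List

def poll_verdict(
    question: str,
    answer: str,
    judges: List[Callable[[str, str], str]],
) -> str:
    """Collect a correct/incorrect label from each judge and majority-vote."""
    votes = [judge(question, answer) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Placeholder judges; in practice each would wrap a different small LLM family.
judges = [
    lambda q, a: "correct",
    lambda q, a: "correct",
    lambda q, a: "incorrect",
]

print(poll_verdict("What is 2 + 2?", "4", judges))  # -> "correct"
```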
