A new benchmark called Preference Proxy Evaluations (PPE) has been introduced to evaluate reward models (RMs) and large language models (LLMs) in the context of reinforcement learning from human feedback (RLHF). The benchmark addresses two central questions: how effective RMs are at guiding RLHF, and whether LLM judges can substitute for human evaluation. PPE draws on real-world human preferences gathered from Chatbot Arena, encompassing over 16,000 prompts and 32,000 diverse model responses. The initiative, led by Evan Frick and his team, aims to provide a framework for meta-evaluating these models, thereby supporting better model alignment and performance assessment in AI systems. The release follows ongoing discussion in the AI community about improving evaluation methodologies for machine learning models.
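To make the meta-evaluation idea concrete, the sketch below shows one simple way a reward model (or an LLM judge wrapped as a scoring function) could be ranked against human preference labels: by measuring how often its pairwise preferences agree with the human choice. This is an illustrative sketch only; the data fields and function names are hypothetical and not PPE's actual schema or API.

```python
# Minimal sketch of the core meta-evaluation idea: score a reward model by how
# often its pairwise preferences agree with crowdsourced human labels.
# The data format and field names below are hypothetical, not PPE's real schema.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    human_prefers_a: bool  # preference label collected from real users


def pairwise_accuracy(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],  # (prompt, response) -> scalar reward
) -> float:
    """Fraction of pairs where the reward model agrees with the human label."""
    correct = 0
    for pair in pairs:
        rm_prefers_a = reward_fn(pair.prompt, pair.response_a) > reward_fn(
            pair.prompt, pair.response_b
        )
        correct += int(rm_prefers_a == pair.human_prefers_a)
    return correct / len(pairs)


# Usage sketch: rank candidate reward models or LLM judges by their agreement
# with the human-preference split.
# scores = {name: pairwise_accuracy(pairs, rm) for name, rm in candidates.items()}
```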
Excited to release this benchmark we've been working on. It started with the question: How can I choose the best RM for RLHF? Since then, it's snowballed into a framework for meta evaluation, hence "preference proxy". I hope those modeling human preference find this useful! https://t.co/zV4gtFPSzh
Our latest benchmark PPE is finally out! Evaluating reward models is a real challenge. We curate real-world human preference and verifiable benchmarks at scale to rank RM/LLM judges. Solid work led by @evan_a_frick and team! https://t.co/Hyz7cfvSxZ