DeepSeek has announced a new approach to enhancing large language model (LLM) reward models that focuses on inference-time scaling. The resulting model, GRM-27B, reportedly outperforms GPT-4o in reward modeling. The advance rests on a training method called Self-Principled Critique Tuning (SPCT), which aims to improve the quality and scalability of generalist reward models (GRMs) without introducing severe biases. The models are slated to be open-sourced soon, and the paper has gained traction on alphaXiv, a platform for research dissemination. The work underscores the role of inference compute in improving reward generation behaviors in LLMs: scaling computation at inference time, rather than only scaling model size, may yield better overall performance in AI systems.
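To make the inference-time-scaling idea concrete, here is a minimal sketch of how a generative reward model's judgments can be scaled at inference: sample several independent critiques of the same response, parse a score from each, and aggregate them (e.g., by voting). This is an illustration of the general technique only, not DeepSeek's implementation; the `generate_critique` callable, the "Score:" output format, and the stub model are hypothetical placeholders.

```python
import re
from collections import Counter
from statistics import mean
from typing import Callable, List

def sample_rewards(
    generate_critique: Callable[[str], str],  # hypothetical: any LLM call returning a critique ending in "Score: <int>"
    prompt: str,
    k: int = 8,
) -> List[int]:
    """Sample k independent critiques of the same (query, response) prompt
    and parse one integer score from each -- the parallel-sampling step."""
    scores = []
    for _ in range(k):
        critique = generate_critique(prompt)
        match = re.search(r"Score:\s*(\d+)", critique)
        if match:
            scores.append(int(match.group(1)))
    return scores

def aggregate_by_vote(scores: List[int]) -> float:
    """Aggregate sampled scores by majority vote, falling back to the mean on ties."""
    if not scores:
        raise ValueError("no parseable scores were sampled")
    counts = Counter(scores).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return float(counts[0][0])
    return mean(scores)  # tie: average the sampled scores

if __name__ == "__main__":
    # Toy usage with a stubbed "model" so the sketch runs end to end.
    import random
    def fake_model(prompt: str) -> str:
        return f"Principle: be concise. Critique: mostly fine. Score: {random.choice([7, 7, 8])}"
    samples = sample_rewards(fake_model, "Rate the response to: ...", k=8)
    print("samples:", samples, "-> aggregated reward:", aggregate_by_vote(samples))
```

The point of the sketch is that reward quality can improve with more samples at inference time (larger k) without changing the underlying model, which is the contrast the tweets below draw between inference-time scaling and model scaling.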
[CL] Inference-Time Scaling for Generalist Reward Modeling. Z. Liu, P. Wang, R. Xu, S. Ma, et al. [DeepSeek-AI] (2025) https://t.co/V7nOI6JqfM https://t.co/FeBhGrhWqm
DeepSeek just released a new framework for improving LLM reward models, applied at inference time. Their model, GRM-27B, outperforms GPT-4o in reward modeling 🚀
Generalist reward generation
Inference-time scaling > model scaling
Open-sourcing models soon
Trending on alphaXiv 📈
https://t.co/cFIWGSLBco
DeepSeek just announced Inference-Time Scaling for Generalist Reward Modeling on Hugging Face, showing that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and could achieve https://t.co/VxHXZVtAal