Recent work on large language models (LLMs) has focused on their decision-making capabilities and on reinforcement learning techniques for training them. A paper from Google DeepMind shows how LLMs can learn optimal exploration strategies via algorithm distillation and inference-time support, addressing the challenge of making decisions under uncertainty. Reverse Curriculum Reinforcement Learning (R3) aims to improve LLM reasoning without extensive process annotations, tackling sparse rewards and high annotation costs. Researchers at Imperial College London have built a benchmark for multi-hop reasoning that reveals how much LLMs still struggle in this area. Preference Proxy Evaluations (PPE), a new benchmark with over 16,000 prompts and 32,000 diverse model responses, evaluates reward models and measures how well their scores predict downstream reinforcement learning from human feedback (RLHF) performance. Finally, a new method called in-context preference learning (ICPL) demonstrates a roughly 30-fold improvement in query efficiency for RLHF-style reward design. Together, these studies mark a substantial step forward in LLM reasoning and decision-making.
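The reverse-curriculum idea behind R3 can be illustrated with a short sketch. This is a minimal, hypothetical illustration, assuming the curriculum works by starting rollouts near the end of a correct demonstration and sliding the start point earlier as the policy improves, so the sparse outcome reward always gives a usable signal; `sample_completion` and `outcome_reward` below are placeholders, not the paper's implementation.

```python
import random

# Hypothetical stand-ins: outcome_reward() checks only the final answer
# (a sparse signal, no per-step annotations), and sample_completion()
# fakes the policy finishing a solution from a given demonstration prefix.
def outcome_reward(answer: str, gold: str) -> float:
    return 1.0 if answer == gold else 0.0

def sample_completion(prefix: list[str]) -> str:
    # In a real setup this would sample from the LLM policy; longer prefixes
    # (starts closer to the end of the demonstration) are easier to finish.
    return "42" if random.random() < 0.3 + 0.1 * len(prefix) else "wrong"

def reverse_curriculum(demo_steps: list[str], gold: str,
                       rollouts: int = 32, threshold: float = 0.7) -> None:
    """Slide the rollout start point from the end of a correct demonstration
    toward its beginning, advancing only once the policy succeeds often
    enough from the current start point."""
    start = len(demo_steps)                  # begin with almost nothing left to do
    while start > 0:
        start -= 1                           # expose one more step to the policy
        prefix = demo_steps[:start]
        wins = sum(outcome_reward(sample_completion(prefix), gold)
                   for _ in range(rollouts))
        success_rate = wins / rollouts
        print(f"start={start:2d}  success_rate={success_rate:.2f}")
        if success_rate < threshold:
            # Not reliable yet: a real run would keep doing RL updates at this
            # stage (still using only the sparse outcome reward) before moving on.
            break

if __name__ == "__main__":
    demo = [f"step {i}" for i in range(6)]
    reverse_curriculum(demo, gold="42")
```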
This paper teaches an LLM to recognize dead-ends in its thinking process, just as humans do when solving problems. Paper - "CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks" 🔍 Aims to enhance LLM generalization across reasoning tasks 🌳 Uses… https://t.co/L0k6HIRtnX
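The tweet is truncated, so the mechanism is not spelled out here. Purely as a conceptual sketch, one way to operationalize "recognizing a dead-end" is to back up value estimates through a search tree of partial plans and flag branches whose backed-up value is low; the node structure, scoring, and threshold below are hypothetical illustrations, not CPL's actual algorithm.

```python
from dataclasses import dataclass, field

# Illustrative only: a tiny search tree over partial reasoning "plans",
# where a branch is flagged as a dead-end when its backed-up value is low.
@dataclass
class PlanNode:
    text: str
    value: float                      # estimate of eventual success from this node
    children: list["PlanNode"] = field(default_factory=list)

def backed_up_value(node: PlanNode) -> float:
    """A node's value is its own estimate if it is a leaf, otherwise the best
    value reachable through its children (max-backup, as in tree search)."""
    if not node.children:
        return node.value
    return max(backed_up_value(c) for c in node.children)

def find_dead_ends(node: PlanNode, threshold: float = 0.2,
                   path: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Collect branches whose backed-up value falls below the threshold,
    i.e. partial plans from which nothing promising is reachable."""
    here = path + (node.text,)
    if backed_up_value(node) < threshold:
        return [here]                 # dead-end: prune and back off
    dead = []
    for child in node.children:
        dead.extend(find_dead_ends(child, threshold, here))
    return dead

if __name__ == "__main__":
    root = PlanNode("solve equation", 0.0, [
        PlanNode("guess and check", 0.1, [PlanNode("try x=1", 0.05)]),
        PlanNode("isolate x", 0.0, [PlanNode("divide both sides", 0.9)]),
    ])
    for branch in find_dead_ends(root):
        print(" -> ".join(branch))
```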
In our new paper, we find that LLMs can efficiently do RLHF in-context! Our method, in-context preference learning (ICPL), iterates between having the LLM write reward functions, training agents with them, and putting the resulting preferences back into context. We see a 30x boost in query efficiency over baseline RLHF! https://t.co/FIqghEouZh
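The loop the tweet describes can be sketched roughly as follows. This is a minimal sketch of the propose-train-prefer cycle as stated in the tweet, not the authors' code; `llm_propose_reward_fns`, `train_agent`, and `human_preference` are hypothetical placeholders for the LLM call, the RL training run, and the human query.

```python
import random

# Hypothetical stand-ins for the three stages named in the tweet: the LLM
# writes candidate reward functions, agents are trained with them, and human
# preferences over the resulting behaviors are folded back into context.
def llm_propose_reward_fns(context: str, n: int = 4) -> list[str]:
    """Stand-in for prompting the LLM to write n candidate reward functions,
    conditioned on the task description and prior preference feedback."""
    return [f"reward_v{random.randint(0, 999)}" for _ in range(n)]

def train_agent(reward_fn: str) -> str:
    """Stand-in for running RL with one candidate reward and returning a
    rollout of the trained agent's behavior."""
    return f"behavior_from_{reward_fn}"

def human_preference(behaviors: list[str]) -> int:
    """Stand-in for a single human query: which behavior looks best?"""
    return random.randrange(len(behaviors))

def icpl(task_description: str, iterations: int = 5) -> str:
    """Iterate: propose reward functions -> train agents -> query one human
    preference -> put the preferred and rejected candidates back into the
    LLM's context, so each human query improves the next round of proposals."""
    context = task_description
    best_reward = ""
    for it in range(iterations):
        candidates = llm_propose_reward_fns(context)
        behaviors = [train_agent(r) for r in candidates]
        chosen = human_preference(behaviors)          # one query per iteration
        best_reward = candidates[chosen]
        rejected = [r for i, r in enumerate(candidates) if i != chosen]
        context += f"\nIteration {it}: preferred {best_reward}, rejected {rejected}"
    return best_reward

if __name__ == "__main__":
    print(icpl("Make the humanoid run forward smoothly."))
```

The design choice to spend one preference query per full iteration, rather than per pairwise comparison, is what the claimed query-efficiency gain hinges on in this sketch.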
On Designing Effective RL Reward at Training Time for LLM Reasoning. https://t.co/qNSr8lWklb