
Meta has recently developed and applied a new method called Iterative Reasoning Preference Optimization (Iterative RPO) to enhance the reasoning capabilities of its AI models, specifically Llama-2-70B-Chat. The method generates chain-of-thought candidate solutions with the model, constructs preference pairs that contrast candidates whose final answers are correct against those whose answers are incorrect, trains on those pairs, and repeats the cycle with the updated model. The reported accuracy gains are substantial across benchmarks: GSM8K (from 55.6% to 81.6%), MATH (from 12.5% to 20.8%), and ARC-Challenge (from 77.8% to 86.7%). Separately, the LLM2Vec approach was applied to the Meta-Llama-3-8B model, enhancing its performance on embedding tasks.
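The core generate / filter-by-correctness / preference-train cycle is straightforward to outline. The sketch below is a minimal illustration, not Meta's implementation: the helper names (`sample_cot_candidates`, `extract_answer`, `train_dpo_nll`) and the candidate count `k` are placeholders, and the training step is stubbed out. The paper reports using a DPO-style preference loss with an additional negative log-likelihood term on the chosen (correct) completions, which the stub only gestures at.

```python
# Minimal sketch of an Iterative RPO-style loop (illustrative placeholders only).
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # chain-of-thought candidate with a correct final answer
    rejected: str  # chain-of-thought candidate with an incorrect final answer


def sample_cot_candidates(model, prompt, k=8):
    """Placeholder: sample k chain-of-thought completions from the current model."""
    return [model(prompt) for _ in range(k)]


def extract_answer(completion):
    """Placeholder: pull the final answer out of a chain-of-thought completion
    (here assuming a GSM8K-style '####' answer delimiter)."""
    return completion.rsplit("####", 1)[-1].strip()


def build_pairs(model, dataset, k=8):
    """Generate candidates per question and pair correct against incorrect ones."""
    pairs = []
    for prompt, gold_answer in dataset:
        candidates = sample_cot_candidates(model, prompt, k)
        correct = [c for c in candidates if extract_answer(c) == gold_answer]
        wrong = [c for c in candidates if extract_answer(c) != gold_answer]
        # Only questions yielding at least one correct and one incorrect
        # candidate produce usable preference pairs.
        for c in correct:
            for w in wrong:
                pairs.append(PreferencePair(prompt, c, w))
    return pairs


def train_dpo_nll(model, pairs):
    """Placeholder: one round of DPO-style preference training, plus an NLL
    term on the chosen sequences (as the paper reports). Returns the new
    checkpoint; here it just returns the model unchanged."""
    random.shuffle(pairs)
    return model


def iterative_rpo(model, dataset, iterations=3):
    """Repeat: generate candidates -> filter by answer correctness -> train."""
    for _ in range(iterations):
        pairs = build_pairs(model, dataset)
        model = train_dpo_nll(model, pairs)  # next iteration seeds from this checkpoint
    return model
```

The key design point this sketch tries to capture is that preferences come for free from answer correctness: no human preference labels are needed, only problems with known final answers, and each iteration retrains on pairs generated by the previous iteration's model.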
[CL] Iterative Reasoning Preference Optimization R Y Pang, W Yuan, K Cho, H He… [FAIR at Meta & New York University] (2024) https://t.co/YRGV1eGqew - The paper proposes an iterative reasoning preference optimization (Iterative RPO) method to improve reasoning ability of large… https://t.co/VXXVzufN4d
Meta announces Iterative Reasoning Preference Optimization Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). https://t.co/tTHy1sbr2I
Meta presents Iterative Reasoning Preference Optimization Increasing accuracy for Llama-2-70B-Chat: - 55.6% -> 81.6% on GSM8K - 12.5% -> 20.8% on MATH - 77.8% -> 86.7% on ARC-Challenge https://t.co/xp5a2LhFNN https://t.co/jAp4T0MBOg
