Researchers from UCLA and Meta AI have introduced d1, a framework designed to enhance reasoning in diffusion-based large language models (LLMs) by combining supervised fine-tuning (SFT) with reinforcement learning (RL). The approach is a two-stage pipeline: masked diffusion LLMs are first fine-tuned on a curated set of 1,000 reasoning examples, then trained with diffu-GRPO, a critic-free policy-gradient method. diffu-GRPO estimates sequence log-probabilities in a single forward pass rather than through costly Monte Carlo rollouts, which makes policy-gradient updates practical for masked diffusion models and improves the accuracy of their step-by-step reasoning.

Concurrently, work on retrieval-augmented generation (RAG) emphasizes the core simplicity of the pattern: retrieve context relevant to a query, then condition generation on it. Agentic RAG pipelines extend this with query analysis and reranking stages to improve the retrieved results, as sketched in the example below. Additional studies examine scaling LLM output lengths, multilingual reasoning, and transferring reasoning skills from large models to smaller ones through knowledge distillation and chain-of-thought fine-tuning. Together, these efforts aim to improve the reasoning accuracy and efficiency of LLMs across applications and languages.
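A minimal sketch of that agentic RAG flow (query analysis, retrieval, reranking, generation), assuming placeholder embed/rerank/generate helpers; the function names and toy scoring here are illustrative and not taken from any of the cited papers:

```python
# Sketch of an agentic RAG pipeline: query analysis -> retrieval -> reranking -> generation.
# embed(), rerank(), and generate() are hypothetical stand-ins for a real embedding model,
# cross-encoder reranker, and LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder: normalized bag-of-characters vector."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(hash(ch) + i) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def analyze_query(query: str) -> str:
    """Query-analysis step: rewrite or expand the query before retrieval."""
    return query.strip().rstrip("?")

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Dense retrieval: rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    return sorted(corpus, key=lambda d: float(embed(d) @ q), reverse=True)[:k]

def rerank(query: str, docs: list[str]) -> list[str]:
    """Reranking step: a toy lexical-overlap score standing in for a cross-encoder."""
    q_tokens = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_tokens & set(d.lower().split())), reverse=True)

def generate(query: str, context: list[str]) -> str:
    """Placeholder generator: a real pipeline would prompt an LLM with the reranked context."""
    return f"Answer to '{query}' grounded in: {context[0]}"

corpus = [
    "Diffusion LLMs denoise masked tokens over several steps.",
    "GRPO is a critic-free policy-gradient method.",
    "RAG retrieves relevant context before generation.",
]
query = "How does RAG work?"
docs = rerank(query, retrieve(analyze_query(query), corpus))
print(generate(query, docs))
```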
This survey paper explores efficient Small Reasoning Models (SRMs, <10B parameters). 📌 Knowledge distillation effectively transfers complex reasoning from Large Models to efficient Small Reasoning Models. 📌 Combining Chain-of-Thought Supervised Fine-Tuning and Reinforcement Learning https://t.co/I6OSjkjKcl
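As a rough illustration of the distillation recipe the survey points to, here is a minimal sketch of a combined soft-target/hard-target loss in PyTorch; the temperature, mixing weight, and tensor shapes are illustrative assumptions, not details from the paper.

```python
# Minimal knowledge-distillation sketch: the student is trained on a blend of
# hard-label cross-entropy (e.g., chain-of-thought SFT targets) and a KL term
# matching the teacher's temperature-softened token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft targets: KL(teacher || student) at temperature T, scaled by T^2.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth (CoT) tokens.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 4 token positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```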
[CL] Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time W Yang, X Yue, V Chaudhary, X Han [Case Western Reserve University & CMU] (2025) https://t.co/zlf2DdpkKI https://t.co/Op7GYyG7R5
[LG] Sleep-time Compute: Beyond Inference Scaling at Test-time K Lin, C Snell, Y Wang, C Packer... [Letta & UC Berkeley] (2025) https://t.co/An2IQ3trRG https://t.co/Sl7eXOPmrl