The recent advancement of large language models (LLMs) has led to an evolution from single-model applications to complex, multi-model ecosystems. Early implementations relied on a solitary LLM executing tasks, but as teams and customer sophistication increased, so did the need… https://t.co/6KoFYAcud3
🏷️:Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages 🔗:https://t.co/fd1pzVcsaY https://t.co/uGmm93u6eK
🏷️:Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models 🔗:https://t.co/fET5WC1J25 https://t.co/l8wJ1T1RDJ
A new large language diffusion model, LLaDA-8B, has been introduced, showcasing advances in natural language processing. Trained from scratch on 2.3 trillion tokens using 0.13 million GPU hours, and then supervised fine-tuned on 4.5 million pairs, LLaDA-8B reportedly surpasses Llama-2 7B on nearly all 15 standard zero- and few-shot benchmarks, and it achieves competitive results against LLaMA3 8B despite using roughly 7x fewer pre-training tokens. Rather than generating text left to right, LLaDA employs a masked diffusion approach, diverging from traditional autoregressive generation. This may enable the model to match or exceed leading autoregressive language models on various tasks, potentially paving the way for new methodologies in large-scale language modeling. The paper also addresses the challenge of enhancing reasoning in language models without significantly increasing model size or relying on specialized training data.
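To make the contrast with left-to-right generation concrete, here is a minimal, toy-scale sketch of a masked-diffusion-style training step: sample a mask ratio t, mask tokens independently with probability t, and have a bidirectional model predict the original tokens at the masked positions, with the loss reweighted by 1/t. The model architecture, sizes, vocabulary, and exact loss weighting below are illustrative assumptions, not LLaDA's actual implementation.

```python
# Toy masked-diffusion language modeling step (illustrative sketch, not LLaDA's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1000        # toy vocabulary size (assumption)
MASK_ID = VOCAB_SIZE     # reserve one extra id for the [MASK] token
SEQ_LEN = 32
D_MODEL = 128

class MaskPredictor(nn.Module):
    """Bidirectional transformer that predicts original tokens at masked positions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, D_MODEL)  # +1 for [MASK]
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.head(self.encoder(h))  # (batch, seq, vocab) logits

def masked_diffusion_loss(model, x0):
    """Sample a mask ratio t per sequence, corrupt x0 by masking, and score the
    model only on masked positions, reweighted by 1/t (toy version of the objective)."""
    batch = x0.size(0)
    t = torch.rand(batch, 1).clamp(min=1e-3)              # mask ratio per sequence
    masked = torch.rand(x0.shape) < t                      # independent masking
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (batch, seq)
    per_seq = (ce * masked).sum(dim=1) / t.squeeze(1)      # only masked tokens count
    return per_seq.mean() / SEQ_LEN

if __name__ == "__main__":
    model = MaskPredictor()
    x0 = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN))        # fake batch of token ids
    loss = masked_diffusion_loss(model, x0)
    loss.backward()
    print(f"toy masked-diffusion loss: {loss.item():.3f}")
```

The key design difference from autoregressive training is visible here: the model sees the whole (partially masked) sequence at once and is supervised only on the corrupted positions, rather than predicting each next token from a strictly left-to-right context.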