DeepSeek has released a new research paper detailing the scaling challenges and hardware architecture considerations involved in training and running large language models (LLMs), focusing on its DeepSeek-V3 model. The model was trained on 2,048 NVIDIA H800 GPUs, and FP8-precision training kept the relative accuracy loss below 0.25%. Because the mixture-of-experts design activates only a fraction of the model's parameters for each token, the training cost comes to roughly 250 GFLOPs per token, far below the roughly 2.45 TFLOPs per token required by a dense 405-billion-parameter model. A key innovation highlighted is the Multi-head Latent Attention mechanism, which shrinks the key-value (KV) cache to 70 KB per token, about one-seventh the size of LLaMA-3.1's. The paper argues that hardware limitations such as memory capacity, compute throughput, and network bandwidth are major bottlenecks as LLMs scale, and that matching model design to hardware capabilities can alleviate these constraints. The research includes contributions from Wenfeng Liang, among others, and the paper is publicly available on Hugging Face.

Additionally, companies like WEKA are developing AI infrastructure solutions, such as the Augmented Memory Grid, which the company says delivers 41 times faster time to first token and scalable KV-cache performance to support future AI inference needs. Other industry players like CoreWeave are expanding GPU resources and automating infrastructure management to support large-scale AI model development. These developments underscore the growing importance of specialized hardware and optimized architectures in advancing AI training and deployment efficiency.
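To put those cost figures in context, here is a minimal back-of-the-envelope sketch. It assumes the widely used ~6N FLOPs-per-token training heuristic (N = parameters active per token) and the per-token KV-cache sizes reported above; the 37B active-parameter count is DeepSeek-V3's published activation figure, and the 128,000-token context length is an illustrative assumption, not a number from the article.

```python
# Back-of-the-envelope check of the per-token figures above.
# Assumptions: the common ~6N training-FLOPs-per-token heuristic
# (N = parameters ACTIVE per token) and the reported KV-cache sizes.
# This is a sketch, not DeepSeek's exact accounting.

def train_flops_per_token(active_params: float) -> float:
    """Approximate training FLOPs per token via the ~6N heuristic."""
    return 6 * active_params

# DeepSeek-V3 activates ~37B of its 671B parameters per token;
# a dense 405B model (e.g. LLaMA-3.1 405B) activates all 405B.
moe_flops = train_flops_per_token(37e9)     # on the order of the 250 GFLOPs cited
dense_flops = train_flops_per_token(405e9)  # on the order of the 2.45 TFLOPs cited

print(f"MoE (37B active):  {moe_flops / 1e9:,.0f} GFLOPs/token")
print(f"Dense 405B:        {dense_flops / 1e12:.2f} TFLOPs/token")
print(f"Dense / MoE ratio: {dense_flops / moe_flops:.1f}x")

# KV-cache footprint: serving memory scales linearly with
# per-token cache size, so a 7x smaller cache matters at long contexts.
MLA_KV_BYTES = 70 * 1024          # DeepSeek-V3 (Multi-head Latent Attention)
MHA_KV_BYTES = 7 * MLA_KV_BYTES   # LLaMA-3.1-class cache, ~7x larger per the paper

context_len = 128_000  # illustrative context length (tokens)
for name, per_token in [("MLA", MLA_KV_BYTES), ("~7x cache", MHA_KV_BYTES)]:
    total_gib = per_token * context_len / 2**30
    print(f"{name}: {total_gib:.1f} GiB of KV cache for a {context_len:,}-token context")
```

Running the sketch shows why the comparison is so stark: the heuristic lands near the cited 222-250 GFLOPs versus ~2.4 TFLOPs per token (roughly an 11x gap), and the smaller MLA cache keeps a 128K-token context under ~9 GiB of KV memory where a 7x larger cache would need nearly 60 GiB.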
Scaling AI above and beyond: Trillion Labs x CoreWeave
✅ 320 H100 GPUs for larger compute scale
✅ Greater hands-on support with seasoned engineers
✅ Automated infrastructure management, freeing resources for AI model development
Read our case study: https://t.co/k2kYrjwl9P https://t.co/2oVNz4O5fa
💪 Traditional infra breaks at scale. WEKA just gets stronger. Legacy stacks can’t keep up with today’s AI — massive data, parallel workloads, and GPU speed demand a new approach. WEKA was built for scale from the ground up 👇 https://t.co/lt69lTyjlY
Deploying LLMs? Infrastructure isn't just a detail; it's your competitive advantage. We've partnered with @iusztinpaul to benchmark DeepSeek-R1 across three popular deployment platforms, so you don't have to:
→ GCP: Flexible, scalable, and production-ready, perfect for managed