Great research proving out the Muon optimizer at larger model scale: "Scaling law experiments indicate that Muon achieves ~2x computational efficiency compared to AdamW with compute optimal training." https://t.co/cYvHVyOvym
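For context on what the Muon update actually does, here is a minimal sketch assuming the publicly described recipe: accumulate momentum on the gradient of a 2D weight matrix, then approximately orthogonalize that accumulated update with a few Newton-Schulz iterations before applying it. The function names, the simple heavy-ball momentum, and the iteration coefficients are illustrative assumptions taken from the public reference description, not Moonlight's exact distributed Megatron-LM implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz
    iteration. Coefficients follow the public Muon reference description;
    treat them as illustrative here."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]  # iterate on the "wide" orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon step for a 2D weight matrix: momentum, then
    orthogonalize the accumulated update and apply it."""
    momentum_buf.mul_(beta).add_(grad)                 # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
    return param

# Toy usage on a random weight matrix (gradient is a stand-in).
W = torch.nn.Parameter(torch.randn(256, 512) * 0.02)
buf = torch.zeros_like(W)
g = torch.randn_like(W)
muon_step(W, g, buf)
```

The orthogonalization step is what distinguishes Muon from plain SGD with momentum, and it is also the source of Muon's extra per-step matrix-multiply FLOPs that fair comparisons against AdamW need to account for.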
Discussed with @YouJiacheng: under the current Megatron-LM context, the Distributed Muon vs. AdamW factor should be 1.25 rather than 1.5 for a fair comparison, and it can be reduced even further! We'll update the details in a revised version of the paper soon! https://t.co/xoOzyYbp1l
Moonshot AI and UCLA Researchers Release Moonlight: A 3B/16B-Parameter Mixture-of-Experts (MoE) Model Trained with 5.7T Tokens Using the Muon Optimizer. Moonlight is offered in two configurations: a version with 3 billion activated parameters and a total of 16 billion parameters,… https://t.co/3GGFbzCfKN
Convergence AI has unveiled its LM2 Large Memory Models, which feature an unprecedented memory capacity aimed at enhancing complex problem-solving and advanced reasoning in artificial intelligence. In a related development, Moonshot AI, in collaboration with UCLA researchers, introduced the Moonlight model, a mixture-of-experts (MoE) architecture with 3 billion activated parameters and 16 billion total parameters. The model was trained on 5.7 trillion tokens using the Muon optimizer, which has demonstrated roughly twice the computational efficiency of the AdamW optimizer. Moonlight is expected to push performance further while requiring fewer training floating-point operations (FLOPs). Additionally, intermediate checkpoints for the model have been released, and further insights into the Muon optimizer's scalability and efficiency are anticipated in upcoming research updates.
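As a back-of-the-envelope illustration of the "activated vs. total parameters" distinction (and why fewer activated parameters means fewer FLOPs per token), the sketch below runs the standard top-k routing arithmetic. Only the 16B-total / ~3B-activated headline figures come from the announcement; the expert count, top-k value, and shared/expert split are hypothetical placeholders, not Moonlight's published configuration.

```python
# Back-of-the-envelope MoE parameter accounting. Only the 16B-total and
# ~3B-activated headline figures come from the announcement; everything
# else below is a hypothetical placeholder used to show the arithmetic.

TOTAL_PARAMS = 16e9            # every weight stored in the checkpoint

shared_params = 1.4e9          # hypothetical: embeddings, attention, shared FFN
expert_params = TOTAL_PARAMS - shared_params   # weights spread across routed experts
num_experts = 64               # hypothetical number of routed experts
top_k = 7                      # hypothetical experts selected per token

# Per token, only top_k of num_experts expert blocks are executed,
# so only that fraction of the expert weights counts as "activated".
activated = shared_params + expert_params * top_k / num_experts
print(f"activated per token ≈ {activated / 1e9:.1f}B of {TOTAL_PARAMS / 1e9:.0f}B total")
# -> activated per token ≈ 3.0B of 16B total
```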