Meta has introduced MoMa, a sparse early-fusion architecture for mixed-modal language modeling that employs a mixture-of-experts (MoE) framework with modality-specific expert groups and significantly boosts pre-training efficiency. The 1.4-billion-parameter MoMa model uses 4 text experts and 4 image experts and achieves substantial FLOPs savings over dense baselines: 3.7× overall, 2.6× for text, and 5.2× for image processing. This work builds on Meta's earlier Chameleon, which showed that early-fusion mixed-modal large language models outperform unimodal and late-fusion alternatives. MoMa also incorporates adaptive compute along three dimensions: modality, width, and depth.
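To make the "modality-specific expert groups" idea concrete, here is a minimal sketch of a modality-aware MoE layer, assuming a PyTorch setting; the layer sizes, top-k routing choice, and names such as `ModalityAwareMoE` and the 0/1 modality ids are illustrative assumptions, not MoMa's actual implementation.

```python
# Minimal sketch of modality-aware MoE routing in the spirit of MoMa.
# Assumes PyTorch; all sizes and names are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_text_experts=4, n_image_experts=4, top_k=1):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Separate expert groups and routers per modality (the core MoMa idea).
        self.experts = nn.ModuleDict({
            "text": nn.ModuleList([make_expert() for _ in range(n_text_experts)]),
            "image": nn.ModuleList([make_expert() for _ in range(n_image_experts)]),
        })
        self.routers = nn.ModuleDict({
            "text": nn.Linear(d_model, n_text_experts),
            "image": nn.Linear(d_model, n_image_experts),
        })
        self.top_k = top_k

    def forward(self, x, modality):
        # x: (n_tokens, d_model); modality: (n_tokens,) with 0 = text, 1 = image.
        out = torch.zeros_like(x)
        for mod_id, name in enumerate(["text", "image"]):
            mask = modality == mod_id
            if not mask.any():
                continue
            tokens = x[mask]
            # Route each token to its top-k experts within its own modality group.
            gate_logits = self.routers[name](tokens)
            weights, indices = torch.topk(F.softmax(gate_logits, dim=-1), self.top_k, dim=-1)
            mixed = torch.zeros_like(tokens)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts[name]):
                    sel = indices[:, k] == e
                    if sel.any():
                        mixed[sel] += weights[sel, k:k+1] * expert(tokens[sel])
            out[mask] = mixed
        return out

# Usage: interleaved text/image tokens in a single sequence (early fusion).
tokens = torch.randn(10, 512)
modality = torch.tensor([0, 0, 1, 1, 1, 0, 1, 0, 0, 1])
layer = ModalityAwareMoE()
y = layer(tokens, modality)
```

The key design point is that routing never mixes modalities: each token only competes with, and is assigned to, experts in its own modality group, which is what enables the asymmetric FLOPs savings for text versus image tokens.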
MoD + Modality specific MoE + early fusion?! I would be so happy if L4 (or at least a large model variant) ends up adopting this architecture (i know, i know, one thing at a time, etc, but we could still stand to get a bit more creative in the attention layer ... ) https://t.co/di0mSa3jfV
With Chameleon we showed that early fusion mixed modal LLMs can deliver strong improvements over unimodal and late fusion alternatives; with this paradigm shift, however, how do we rethink our core model architecture to optimize for native multimodality and efficiency? We… https://t.co/CV5CzoxZ11
If you were interested in my cryptic posts on how to train Chameleon-like models up to 4x faster, check out our MoMa paper, which gives a detailed overview of most of our architectural improvements. tl;dr adaptive compute in 3 dimensions: modality, width, depth. https://t.co/fvffkIjs2I
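The "depth" dimension referenced above corresponds to mixture-of-depths (MoD) style routing, where a learned router lets only a fraction of tokens pass through each block while the rest skip it. Below is a minimal sketch of that idea, assuming PyTorch; the 25% capacity ratio, the use of `nn.TransformerEncoderLayer`, and the `MoDBlock` name are illustrative assumptions rather than the paper's implementation.

```python
# Minimal mixture-of-depths style sketch: only top-scoring tokens pass through
# the heavy block; the rest skip it. Assumes PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, d_model=512, capacity=0.25):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity

    def forward(self, x):
        # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)                      # (batch, seq)
        k = max(1, int(self.capacity * x.size(1)))
        idx = torch.topk(scores, k, dim=-1).indices              # (batch, k)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, gather_idx)                # (batch, k, d_model)
        # A full mixture-of-depths layer would also weight the block output by the
        # router score so the router receives gradients; omitted here for brevity.
        processed = self.block(selected)
        out = x.clone()
        out.scatter_(1, gather_idx, processed)                   # unselected tokens skip the block
        return out

# Usage: 16-token sequences, only ~25% of tokens go through the heavy block.
x = torch.randn(2, 16, 512)
y = MoDBlock()(x)
```

Combined with width-wise expert sparsity and modality-specific expert groups, this per-layer token selection is what gives the "3 dimensions" of adaptive compute mentioned in the tweet.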