Playground v3 is a new model aimed at improving text-to-image alignment. It takes a deep-fusion approach, combining a DiT image transformer with a pretrained Llama-3-8B text transformer through image-text joint attention, in a design intended to be simpler than MMDiT. Playground v3 also features a 16-channel VAE at 512x512 resolution and is reported to be a better captioner than GPT-4. With 24B total parameters, the model is reported to reach state-of-the-art (SOTA) results among image diffusion models. Development was led by Suhail.
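To make "image-text joint attention" concrete, here is a minimal single-head NumPy sketch. It is not Playground v3's actual code. It assumes that image tokens produce the queries and attend over the concatenation of text and image keys/values. The function name, weight names, and all dimensions are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(img_h, txt_h, w_q, w_k_img, w_v_img, w_k_txt, w_v_txt):
    """Single-head joint attention (illustrative assumption, not the real model).

    img_h: (n_img, d_img) image-token hidden states (DiT side).
    txt_h: (n_txt, d_txt) text-token hidden states (LLM side).
    Returns the attended image features and the attention weights.
    """
    q = img_h @ w_q                                   # (n_img, d)
    # Keys and values come from BOTH modalities, concatenated along tokens.
    k = np.concatenate([txt_h @ w_k_txt, img_h @ w_k_img], axis=0)
    v = np.concatenate([txt_h @ w_v_txt, img_h @ w_v_img], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (n_img, n_txt + n_img)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy sizes, all hypothetical.
rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, d = 16, 5, 8, 12, 8
img_h = rng.normal(size=(n_img, d_img))
txt_h = rng.normal(size=(n_txt, d_txt))
w_q = rng.normal(size=(d_img, d))
w_k_img = rng.normal(size=(d_img, d))
w_v_img = rng.normal(size=(d_img, d))
w_k_txt = rng.normal(size=(d_txt, d))
w_v_txt = rng.normal(size=(d_txt, d))

out, weights = joint_attention(img_h, txt_h, w_q, w_k_img, w_v_img, w_k_txt, w_v_txt)
print(out.shape, weights.shape)
```

In this sketch each image token spreads its attention across the text tokens and the other image tokens in a single softmax, so the text conditions the image features without a separate cross-attention layer.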
Here's the real story on how Playground v3 got to SOTA on image diffusion models. Amazing work @Suhail https://t.co/Q5ImkUm3al https://t.co/Bpsq5Q3A2e
Playground v1 vs Playground v3. Image models get better so quickly. https://t.co/uu86YsZjla
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models. https://t.co/sjdFoYrq2y