
A new study introduces Recap-DataComp-1B, a dataset of 1.3 billion web images from DataComp-1B that have been recaptioned with a LLaMA-3-powered LLaVA model. The work, led by researchers including X Li, H Tu, M Hui, and Z Wang of UC Santa Cruz, aims to improve the quality of the text-image pairs used to train vision-language models. To generate the captions, the authors finetune a LLaVA-1.5 model whose language backbone is LLaMA-3 and run it over the entire DataComp-1B corpus. The resulting dataset is open-sourced, and the study reports substantial gains when it is used to train advanced vision-language models, both discriminative models such as CLIP and generative text-to-image Diffusion Transformers. The research was published in 2024.
[CV] What If We Recaption Billions of Web Images with LLaMA-3? X Li, H Tu, M Hui, Z Wang... [UC Santa Cruz] (2024) https://t.co/xk69SkzjGt - The paper presents Recap-DataComp-1B, a dataset of 1.3 billion web images recaptioned with a LLaMA-3-powered LLaVA model.
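To make the recaptioning step concrete, below is a minimal sketch of generating a new caption for a single web image with a LLaVA-1.5-style model via Hugging Face transformers. The checkpoint name and image URL are placeholders rather than the authors' release: the paper's captioner swaps the language backbone for LLaMA-3-8B, whereas the public llava-hf/llava-1.5-7b-hf checkpoint shown here keeps the original backbone.

```python
# Minimal recaptioning sketch; checkpoint and URL are stand-ins, not the
# authors' released model or data pipeline.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Public LLaVA-1.5 checkpoint used for illustration; the paper's captioner
# replaces the language backbone with LLaMA-3, which this checkpoint lacks.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL; the real pipeline would iterate over DataComp-1B images.
image_url = "https://example.com/image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# LLaVA-1.5 conversation format: the model answers after "ASSISTANT:".
prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# The generated sequence echoes the prompt, so keep only the answer part.
caption = processor.decode(output_ids[0], skip_special_tokens=True)
caption = caption.split("ASSISTANT:")[-1].strip()
print(caption)  # the recaption that would replace the original noisy alt-text
```

Run at scale over the full corpus, captions produced this way replace the noisy alt-text of DataComp-1B, which is what the study credits for the improved CLIP and Diffusion Transformer training.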
"What If We Recaption Billions of Web Images with LLaMA-3?"🤯 And the results confirm that this enhanced dataset, Recap-DataComp-1B generated this way, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe… https://t.co/QCCZil11bW
What If We Recaption Billions of Web Images with LLaMA-3 ? ◼ A new study enhances text-image datasets using LLaMA-3, improving model training for visual-language tasks. With the open-source Recap-DataComp-1B dataset, models like CLIP & Diffusion Transformers show better… https://t.co/DqlrY5pkYa


