
Hugging Face has released Docmatix, a large-scale dataset designed to enhance vision-language models (VLMs) for document understanding. Docmatix comprises 9,500,000 question/answer pairs across 2,444,750 document images, aiming to close the gap between open-source VLMs and proprietary models such as GPT-4 on document visual question answering, a task where open-source models have lagged largely due to limited training data. Separately, document retrieval has advanced with the introduction of ColPali, a new model that leverages VLMs to extract high-quality embeddings directly from document images, outperforming existing retrieval systems.
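The headline numbers work out to roughly 3.89 Q/A pairs per image. As a minimal sketch of what a document-VQA dataset of this shape might look like in code — note that the `images`/`texts` field names and the sample records below are illustrative assumptions, not the confirmed Docmatix schema; in practice one would load the real data via the Hugging Face `datasets` library:

```python
# Minimal, self-contained sketch of a document-VQA record layout.
# NOTE: the "images"/"texts" field names and all record contents are
# hypothetical, for illustration only — not the confirmed Docmatix schema.
records = [
    {
        "images": ["invoice_page_1.png"],  # placeholder file names, not real pixels
        "texts": [
            {"user": "What is the invoice total?", "assistant": "$1,250.00"},
            {"user": "Who issued the invoice?", "assistant": "Acme Corp."},
        ],
    },
    {
        "images": ["report_page_7.png"],
        "texts": [
            {"user": "On what date was the report filed?", "assistant": "2023-11-02"},
        ],
    },
]

# The two quantities the release highlights: Q/A pairs and images.
n_pairs = sum(len(r["texts"]) for r in records)
n_images = sum(len(r["images"]) for r in records)
print(n_pairs, n_images)  # prints: 3 2

# At Docmatix scale, the same ratio is about 3.89 Q/A pairs per image.
pairs_per_image = 9_500_000 / 2_444_750
print(round(pairs_per_image, 2))  # prints: 3.89
```

One Q/A turn per question keeps each record self-describing, so multiple questions about the same page share a single stored image rather than duplicating it.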



Let's solve semantic OCR with VLMs from PDFs and Documents! The Hugging Face science team just released Docmatix with 9,500,000 Q/A pairs on 2,444,750 Images. 👀 https://t.co/e3Rocq68oO
Introducing Docmatix: a gigantic document understanding dataset 📑 Closed models outperformed open-source models in document tasks so far due to lack of data coverage 💔 but @huggingface M4 is here to change that! keep reading ⥥ https://t.co/g1ZxvLqQIm
We release Docmatix, a huge dataset to enhance vision-language models for document understanding https://t.co/AgBtGJEnXR