Hugging Face has released FineWeb2, a comprehensive dataset aimed at improving multilingual AI training. It comprises 8 terabytes of compressed text, nearly 3 trillion words spanning more than 1,000 languages and 1,893 language-script pairs, including 486 languages with over 1MB of data and 80 with over 1GB. FineWeb2 outperforms other publicly available multilingual datasets such as CC-100, mC4, and HPLT, and offers both filtered and unfiltered subsets for different research needs. It was built with a data-driven approach to pretraining dataset design, processing 96 CommonCrawl snapshots spanning 2013-2024 through the datatrove library. The release is part of a broader effort to bridge cultural and linguistic gaps in AI, as evidenced by the concurrent introduction of Global-MMLU, a benchmark for evaluating multilingual AI across 42 languages.
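For readers who want to poke at the data, here is a minimal sketch of streaming one language subset with the `datasets` library. The repo id `HuggingFaceFW/fineweb-2`, the config name `fra_Latn`, and the `text` column are assumptions based on the release notes, so check the dataset card for the exact schema.

```python
from datasets import load_dataset

fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",  # assumed repo id for the FineWeb2 release
    name="fra_Latn",            # assumed config: one language-script pair
    split="train",
    streaming=True,             # stream to avoid downloading terabytes to disk
)

# Peek at a few records; the "text" column name is an assumption.
for row in fw2.take(3):
    print(row["text"][:200])
```

Streaming matters here: with subsets this large, materializing a full split locally is rarely what you want for a first look.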
Let’s make December 2024 the month when Open Source AI becomes multilingual! Last week, we released Global-MMLU to evaluate multilingual LLMs and mitigate Western-centric biases in MMLU. https://t.co/ZYljzcvl6E Yesterday, the FineWeb 2 dataset covering 1000s of languages… https://t.co/3xzLSLwj9b
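Global-MMLU is likewise a few lines away. A sketch under the assumption that the benchmark lives at `CohereForAI/Global-MMLU` with per-language configs and MMLU-style columns:

```python
from datasets import load_dataset

# Assumed repo id and config name; the column names are MMLU-style guesses,
# so verify against the dataset card before wiring this into an eval harness.
gmmlu = load_dataset("CohereForAI/Global-MMLU", "fr", split="test")

example = gmmlu[0]
print(example["question"])  # assumed field: the question text for one item
```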
FineWeb2: The Future of Multilingual AI 🌍✨
🥂 High-quality pretraining data for 1,000+ languages
🌐 Covers 1,893 language-script pairs
📊 Validated with hundreds of experiments
✅ Outperforms datasets like CC-100, mC4, and HPLT
🔄 Provides filtered and unfiltered subsets for… https://t.co/8qXQtiM4sC
🥳We have released InternVL2.5, ranging from 1B to 78B, on @huggingface . 😉InternVL2_5-78B is the first open-source #MLLM to achieve over 70% on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. 🤗HF Space:… https://t.co/V5TEUq1SYG
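And a sketch of loading one of the smaller InternVL2.5 checkpoints with `transformers`. The 1B repo id below is an assumption; InternVL ships custom modeling code on the Hub, hence `trust_remote_code=True`.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "OpenGVLab/InternVL2_5-1B" is an assumed repo id for the smallest variant;
# swap in the 78B checkpoint on hardware that can actually hold it.
path = "OpenGVLab/InternVL2_5-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,     # half-precision to cut memory use
    trust_remote_code=True,         # InternVL uses custom modeling code
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```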