Hugging Face has released FineWeb2, a comprehensive dataset aimed at improving multilingual AI training. It comprises 8 terabytes of compressed text, nearly 3 trillion words spanning more than 1,000 languages and 1,893 language-script pairs, including 486 languages with over 1MB of data and 80 with over 1GB. FineWeb2 outperforms other publicly available multilingual datasets such as CC-100, mC4, and HPLT, and offers both filtered and unfiltered subsets for different research needs. It was built with a data-driven approach to pretraining dataset design, processing 96 CommonCrawl snapshots spanning 2013-2024 through the datatrove library. The release is part of a broader effort to bridge cultural and linguistic gaps in AI, as evidenced by the concurrent introduction of Global-MMLU, a benchmark for evaluating multilingual AI across 42 languages.
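For readers who want to poke at the data, here is a minimal sketch of streaming one language subset with the `datasets` library. The repo id `HuggingFaceFW/fineweb-2`, the config name `fra_Latn`, and the `text` column are assumptions based on the release notes, so check the dataset card for the exact schema.

```python
from datasets import load_dataset

fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",  # assumed repo id for the FineWeb2 release
    name="fra_Latn",            # assumed config: one language-script pair
    split="train",
    streaming=True,             # stream to avoid downloading terabytes to disk
)

# Peek at a few records; the "text" column name is an assumption.
for row in fw2.take(3):
    print(row["text"][:200])
```

Streaming matters here: with subsets this large, materializing a full split locally is rarely what you want for a first look.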
Let’s make December 2024 the month when Open Source AI becomes multilingual! Last week, we released Global-MMLU to evaluate multilingual LLMs and mitigate Western-centric biases in MMLU. https://t.co/ZYljzcvl6E Yesterday, the FineWeb 2 dataset covering 1000s of languages… https://t.co/3xzLSLwj9b
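Global-MMLU is likewise a few lines away. A sketch under the assumption that the benchmark lives at `CohereForAI/Global-MMLU` with per-language configs and MMLU-style columns:

```python
from datasets import load_dataset

# Assumed repo id and config name; the column names are MMLU-style guesses,
# so verify against the dataset card before wiring this into an eval harness.
gmmlu = load_dataset("CohereForAI/Global-MMLU", "fr", split="test")

example = gmmlu[0]
print(example["question"])  # assumed field: the question text for one item
```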
FineWeb2: The Future of Multilingual AI 🌍✨
🥂 High-quality pretraining data for 1,000+ languages
🌐 Covers 1,893 language-script pairs
📊 Validated with hundreds of experiments
✅ Outperforms datasets like CC-100, mC4, and HPLT
🔄 Provides filtered and unfiltered subsets for… https://t.co/8qXQtiM4sC
🥳We have released InternVL2.5, ranging from 1B to 78B, on @huggingface . 😉InternVL2_5-78B is the first open-source #MLLM to achieve over 70% on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. 🤗HF Space:… https://t.co/V5TEUq1SYG
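And a sketch of loading one of the smaller InternVL2.5 checkpoints with `transformers`. The 1B repo id below is an assumption; InternVL ships custom modeling code on the Hub, hence `trust_remote_code=True`.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "OpenGVLab/InternVL2_5-1B" is an assumed repo id for the smallest variant;
# swap in the 78B checkpoint on hardware that can actually hold it.
path = "OpenGVLab/InternVL2_5-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,     # half-precision to cut memory use
    trust_remote_code=True,         # InternVL uses custom modeling code
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```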