The team behind the FineWeb dataset has launched FineTasks, a multilingual evaluation suite and the first step toward extending FineWeb to over 1,000 languages. Building the suite took over 75,000 GPU hours, testing nearly 200 candidate tasks to identify those that provide the strongest signal during training. FineTasks is designed to make evaluating large language models (LLMs) reliable across a diverse set of languages, including low-resource ones; the @huggingface team validated the framework across nine languages. The broader goal is to apply the same data-driven filtering approach used to create the FineWeb and FineWeb-edu datasets at multilingual scale.
How can we evaluate LLMs across 1000+ languages? 🌎 The first step towards FineWeb Multilingual was creating FineTasks, a data-driven evaluation framework that helps select reliable evaluation tasks for any language. The @huggingface Team validated it across 9 different languages… https://t.co/VjK9IZezOT
We want to extend the data-driven filtering approach we used to create the *FineWeb* and *FineWeb-edu* large-scale pretraining datasets to 1000+ languages. The first step, which proved surprisingly difficult, was to find reliable high-early-signal evaluations in many languages… https://t.co/pcijjWrreN
Today we’re taking the first step towards extending FineWeb 🍷 to 1000+ languages with the launch of our multilingual evaluation suite, FineTasks 🌍. We invested over 75k GPU hours 🏎️ to select tasks that deliver the strongest signal during training, testing nearly 200 tasks!…
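To make "strongest signal during training" concrete: a useful evaluation task should score above the random baseline early, improve steadily across pretraining checkpoints, and not be swamped by run-to-run noise. The sketch below illustrates two plausible criteria of this kind on made-up checkpoint scores; the helper names `monotonicity` and `signal_to_noise` are hypothetical illustrations, not FineTasks' actual implementation.

```python
# Hedged sketch: scoring a benchmark task's "early signal" during pretraining,
# in the spirit of FineTasks' task selection. All numbers are illustrative.
import numpy as np
from scipy.stats import spearmanr


def monotonicity(steps, scores):
    """Spearman rank correlation between training step and task score.

    A high-signal task improves steadily as training progresses,
    giving a correlation close to 1.0.
    """
    corr, _ = spearmanr(steps, scores)
    return corr


def signal_to_noise(scores, random_baseline):
    """Improvement over the random baseline relative to score variability."""
    gain = np.mean(scores) - random_baseline
    noise = np.std(scores) + 1e-9  # avoid division by zero
    return gain / noise


steps = [1_000, 2_000, 4_000, 8_000, 16_000]
scores = [0.26, 0.29, 0.33, 0.38, 0.44]  # accuracy at each checkpoint (made up)
baseline = 0.25                           # e.g. 4-way multiple choice

print(f"monotonicity: {monotonicity(steps, scores):.3f}")
print(f"signal/noise: {signal_to_noise(scores, baseline):.3f}")
```

Tasks that score poorly on criteria like these (flat or noisy curves, or scores stuck at chance) would be dropped, which is one way nearly 200 candidates could be winnowed down to a reliable suite per language.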