Recent developments in the AI research community highlight significant contributions to open science and dataset availability. Notably, Common Corpus has been released: 2 trillion tokens of clean, permissively licensed data across more than 30 languages, described as the first fully documented training dataset that does not rely on copyrighted or web-scraped content. Microsoft Research has released 1 million synthetic instruction pairs covering capabilities such as text editing, creative writing, coding, and reading comprehension, all under a permissive license. And Cerebras' partner MBZUAI has announced TxT360 (Trillion eXtracted Text), described as the first globally deduplicated dataset spanning the most widely used LLM pretraining sources, together with an optimized upsampling recipe intended to expand it to more than 15 trillion tokens of high-quality open-source data. These initiatives underscore the growing commitment of major labs to open science and to strengthening AI research capabilities.
Cerebras’ partner @mbzuai has announced TxT360 (Trillion eXtracted Text) — the first globally deduplicated dataset across most used data sources for LLM pretraining, and an optimized upsampling recipe to expand to 15T+ tokens of high-quality open-source data for pretraining LLMs.… https://t.co/4jj3U8VxkO
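The tweet above names two pipeline ideas: deduplicating globally (across all source corpora at once, not within each source separately) and upsampling sources to expand the corpus. MBZUAI's actual methods are not described here; the following is a minimal Python sketch under stated assumptions — the source names, weights, and the exact-hash deduplication approach are all illustrative, not TxT360's implementation:

```python
import hashlib
import random

def global_dedup(sources):
    """Exact global deduplication: keep only the first occurrence of each
    document across ALL sources, not merely within each source."""
    seen = set()
    kept = []
    for source_name, docs in sources:
        for doc in docs:
            # Hash whitespace-normalized, lowercased text so identical
            # documents collide even when they come from different corpora.
            key = " ".join(doc.split()).lower().encode("utf-8")
            h = hashlib.sha256(key).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append((source_name, doc))
    return kept

def upsample(docs_by_source, weights, target_docs, seed=0):
    """Naive weighted upsampling: repeatedly draw documents from each
    source in proportion to its weight until the target count is met."""
    rng = random.Random(seed)
    names = list(weights)
    out = []
    while len(out) < target_docs:
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        out.append((name, rng.choice(docs_by_source[name])))
    return out

# Hypothetical mini-corpora: one document appears in both sources.
sources = [
    ("web_crawl", ["The sky is blue.", "LLMs need data."]),
    ("wiki",      ["The sky is blue.", "Deduplication matters."]),
]
deduped = global_dedup(sources)  # the duplicate survives only once

docs_by_source = {}
for name, doc in deduped:
    docs_by_source.setdefault(name, []).append(doc)

# Upweight one source when expanding the corpus beyond its raw size.
expanded = upsample(docs_by_source, {"web_crawl": 1.0, "wiki": 2.0},
                    target_docs=10)
```

At real scale, exact SHA-256 matching would typically be replaced by near-duplicate detection (e.g., MinHash), but the global-versus-per-source distinction this sketch illustrates is the same.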
Oh wow! @MSFTResearch released 1 MILLION synthetic instruction pairs covering different capabilities, such as text editing, creative writing, coding, reading comprehension, etc - permissively licensed 🔥 Explore it directly on the Hugging Face Hub! Kudos MSFT! Let the… https://t.co/Y1BmQpi64z
MASSIVE Dataset just released. Common Corpus provides 2 trillion tokens of clean, permissively licensed data across 30+ languages for training LLMs. First fully-documented training dataset that doesn't rely on copyrighted or web-scraped content. 🔍 What is Common Corpus - 2… https://t.co/vRunkF5tj0