Recent developments in the AI research community highlight significant contributions to open science and dataset availability. Notably, Common Corpus has been released: 2 trillion tokens of clean, permissively licensed data across more than 30 languages, described as the first fully documented training dataset that does not rely on copyrighted or web-scraped content. Microsoft Research has released 1 million synthetic instruction pairs covering capabilities such as text editing, creative writing, coding, and reading comprehension, all under a permissive license. And Cerebras' partner MBZUAI has announced TxT360 (Trillion eXtracted Text), described as the first globally deduplicated dataset spanning the most widely used LLM pretraining sources, together with an optimized upsampling recipe intended to expand it to more than 15 trillion tokens of high-quality open-source data. These initiatives underscore the growing commitment of major labs to open science and to strengthening AI research capabilities.
Cerebras’ partner @mbzuai has announced TxT360 (Trillion eXtracted Text) — the first globally deduplicated dataset across most used data sources for LLM pretraining, and an optimized upsampling recipe to expand to 15T+ tokens of high-quality open-source data for pretraining LLMs.… https://t.co/4jj3U8VxkO
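The tweet above names two pipeline ideas: deduplicating globally (across all source corpora at once, not within each source separately) and upsampling sources to expand the corpus. MBZUAI's actual methods are not described here; the following is a minimal Python sketch under stated assumptions — the source names, weights, and the exact-hash deduplication approach are all illustrative, not TxT360's implementation:

```python
import hashlib
import random

def global_dedup(sources):
    """Exact global deduplication: keep only the first occurrence of each
    document across ALL sources, not merely within each source."""
    seen = set()
    kept = []
    for source_name, docs in sources:
        for doc in docs:
            # Hash whitespace-normalized, lowercased text so identical
            # documents collide even when they come from different corpora.
            key = " ".join(doc.split()).lower().encode("utf-8")
            h = hashlib.sha256(key).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append((source_name, doc))
    return kept

def upsample(docs_by_source, weights, target_docs, seed=0):
    """Naive weighted upsampling: repeatedly draw documents from each
    source in proportion to its weight until the target count is met."""
    rng = random.Random(seed)
    names = list(weights)
    out = []
    while len(out) < target_docs:
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        out.append((name, rng.choice(docs_by_source[name])))
    return out

# Hypothetical mini-corpora: one document appears in both sources.
sources = [
    ("web_crawl", ["The sky is blue.", "LLMs need data."]),
    ("wiki",      ["The sky is blue.", "Deduplication matters."]),
]
deduped = global_dedup(sources)  # the duplicate survives only once

docs_by_source = {}
for name, doc in deduped:
    docs_by_source.setdefault(name, []).append(doc)

# Upweight one source when expanding the corpus beyond its raw size.
expanded = upsample(docs_by_source, {"web_crawl": 1.0, "wiki": 2.0},
                    target_docs=10)
```

At real scale, exact SHA-256 matching would typically be replaced by near-duplicate detection (e.g., MinHash), but the global-versus-per-source distinction this sketch illustrates is the same.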
Oh wow! @MSFTResearch released 1 MILLION synthetic instruction pairs covering different capabilities, such as text editing, creative writing, coding, reading comprehension, etc - permissively licensed 🔥 Explore it directly on the Hugging Face Hub! Kudos MSFT! Let the… https://t.co/Y1BmQpi64z
MASSIVE Dataset just released. Common Corpus provides 2 trillion tokens of clean, permissively licensed data across 30+ languages for training LLMs. First fully-documented training dataset that doesn't rely on copyrighted or web-scraped content. 🔍 What is Common Corpus - 2… https://t.co/vRunkF5tj0