Jun 7, 01:00 PM

ZyphraAI Introduces Zyda, a 1.3T Token Open Dataset for Training Language Models

ZyphraAI has introduced Zyda, a 1.3 trillion token open dataset for training large language models (LLMs). The open-source dataset aims to close the gap between the rapid growth of LLMs and the availability of high-quality open training data. Zyda combines data from RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv, and the company claims it outperforms existing datasets such as the Pile, C4, and arXiv.

#ZyphraAI #Zyda #RefinedWeb #Starcoder #C4 #Pile #Slimpajama

Written with ChatGPT (GPT-4o).

Sources

SiliconANGLE@SiliconANGLE
2 years ago
Zyphra debuts Zyda LLM training dataset with 1.3T tokens https://t.co/2GcXLh4C2g
The Tech News Roundup@TechNewsRoundup
2 years ago
Zyphra debuts Zyda, a 1.3T language modeling dataset it claims outperforms Pile, C4, arxiv: Zyphra's Zyda is a 1.3T open dataset combining RefinedWeb, Starcoder, C4, Pile, Slimpajama, pe2so, and arxiv to help train large… https://t.co/VHrDA8FMUd #AI #categoryBusinessIndustrial
VentureBeat@VentureBeat
2 years ago
Zyphra debuts Zyda, a 1.3T language modeling dataset it claims outperforms Pile, C4, arxiv https://t.co/6lwu2FV4kQ https://t.co/svdZ0edowy

