
Zyphra has introduced Zyda, a 1.3-trillion-token open dataset for training large language models (LLMs). The dataset aims to bridge the gap between the rapid growth of LLMs and the availability of high-quality open-source training data. Zyda combines data from RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv, and Zyphra claims that it outperforms existing datasets such as the Pile, C4, and arXiv.
Zyphra debuts Zyda LLM training dataset with 1.3T tokens https://t.co/2GcXLh4C2g
Zyphra debuts Zyda, a 1.3T language modeling dataset it claims outperforms Pile, C4, arXiv: Zyphra's Zyda is a 1.3T open dataset combining RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv to help train large… https://t.co/VHrDA8FMUd #AI #categoryBusinessIndustrial
Zyphra debuts Zyda, a 1.3T language modeling dataset it claims outperforms Pile, C4, arXiv https://t.co/6lwu2FV4kQ https://t.co/svdZ0edowy
