The SmolTalk dataset, a 1-million-sample synthetic dataset, has been released under Apache 2.0. It was used to train SmolLM2 and is noted for improving instruction following, reasoning, rewriting, summarization, and function calling. The release has been praised for its careful curation and for its significant contribution to the performance of the SmolLM2 Instruct models, which show better results than training on the Orca AgentInstruct 1M dataset. The new synthetic subsets in SmolTalk include Smol-Magpie-Ultra. Separately, the RedPajama dataset, containing over 100 billion text documents, has been released by @togethercompute to support open-source large language models (LLMs) like OLMo from @allen_ai.
NEW: 1M samples instruct dataset 📚✨

>>> from datasets import load_dataset
>>> ds = load_dataset("HuggingFaceTB/smoltalk", "all")

Training on it for SmolLM2 Instruct models even shows better performance than training on Orca AgentInstruct 1M! Dataset link + viewer below :) https://t.co/dOkVCPwXyo
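For readers who want to poke at the data, here is a minimal sketch that expands the snippet in the tweet: it loads the full "all" mixture and prints one conversation. The "messages" column name (a list of role/content dicts) is an assumption based on the standard chat format; check the dataset card or `ds.column_names` if the schema differs.

```python
from datasets import load_dataset

# Load the full SmolTalk mixture (the "all" config from the tweet above).
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Inspect the schema and one conversation. The "messages" column
# (a list of {"role": ..., "content": ...} dicts) is assumed here.
print(ds.column_names)
example = ds[0]
for turn in example.get("messages", []):
    print(turn["role"], ":", turn["content"][:80])
```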
Another excellent release from @huggingface. My main question is: what if you combine Smol-Talk with Orca-AgentInstruct-1M? I'd bet it works; they don't seem super redundant. https://t.co/03Gv1SUYTM
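One rough way to try that combination would be to normalize both datasets to a shared chat column and concatenate them into a single SFT mixture. The sketch below assumes the Orca dataset ID is "microsoft/orca-agentinstruct-1M-v1", that it ships multiple per-skill splits, and that both datasets expose chat-style data under a "messages" column; verify all of these against the dataset cards before relying on it.

```python
from datasets import load_dataset, concatenate_datasets

# SmolTalk: the "all" config mixes every subset (per the release tweet).
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Orca AgentInstruct: dataset ID and per-skill splits are assumptions;
# concatenating all splits should give the full ~1M samples.
orca_splits = load_dataset("microsoft/orca-agentinstruct-1M-v1")
orca = concatenate_datasets(list(orca_splits.values()))

# Align schemas on a single "messages" column (assumed name in both).
# If one dataset stores messages as a JSON string instead of a list of
# dicts, parse it with a .map(...) step before concatenating.
smoltalk = smoltalk.select_columns(["messages"])
orca = orca.select_columns(["messages"])

# Concatenate and shuffle to get one mixed instruction-tuning corpus.
combined = concatenate_datasets([smoltalk, orca]).shuffle(seed=42)
print(f"Combined SFT mixture: {len(combined):,} samples")
```

Whether the blend actually helps would still need an ablation against each dataset alone, since the two mixtures may overlap in task coverage even if they are not "super redundant."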
SmolTalk: The dataset powering SmolLM2 Instruct's superior performance https://t.co/gY54fj7nEi