Microsoft Research has released a significant dataset named AgentInstruct-1M-v1, comprising 1 million synthetic instruction-response pairs. The dataset is designed to enhance the training of large language models (LLMs) across a range of capabilities, including text editing, creative writing, coding, and reading comprehension. Notably, when the dataset was used to fine-tune the Mistral-7B model, the resulting model demonstrated substantial performance improvements over the base model: a 19% increase on the MMLU benchmark, a 40% improvement on AGIEval, a 54% gain on GSM8K, and a 45% boost on AlpacaEval. The dataset is open-source and was generated entirely from publicly available web content, reflecting Microsoft's commitment to advancing artificial intelligence research.
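For readers who want to inspect the data, a minimal sketch of loading an instruction-response dataset like this via the Hugging Face `datasets` library is shown below. The repository id, split layout, and field names are assumptions for illustration and are not confirmed by the announcement.

```python
from datasets import load_dataset

# Minimal sketch: pull the dataset and peek at one record.
# The repo id "microsoft/orca-agentinstruct-1M-v1" and the record schema
# are assumptions here; check the Hugging Face Hub page for the actual names.
ds = load_dataset("microsoft/orca-agentinstruct-1M-v1")

print(ds)                # inspect the available splits (presumably one per capability)
first_split = next(iter(ds.values()))
print(first_split[0])    # a single synthetic instruction-response record
```

From there, the records could be mapped into a chat template and fed to any standard supervised fine-tuning pipeline (e.g., TRL's SFTTrainer), which is how the reported Mistral-7B improvements would typically be reproduced.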