Alibaba has introduced Marco-o1, an advanced reasoning model designed to handle complex, open-ended language problems. Developed by the AIDC-AI team, the model is fine-tuned on Chain-of-Thought (CoT) data and uses Monte Carlo Tree Search (MCTS) together with reflection mechanisms to expand its solution space. Marco-o1 demonstrates notable accuracy gains over Qwen2-7B-Instruct: +6.17% on MGSM (English) and +5.60% on MGSM (Chinese). The model excels in disciplines with standard answers, such as mathematics, physics, and coding, as well as in open-ended problems without clear-cut solutions. It also shows proficiency in translating slang.
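To make the MCTS idea concrete, here is a minimal, hedged sketch of the classic MCTS loop (selection via UCT, expansion, simulation, backpropagation). The `candidate_steps` and `rollout_score` callbacks are entirely hypothetical stand-ins: in Marco-o1 the search actually operates over model-generated reasoning steps scored by the model's own confidence, which this toy version does not attempt to reproduce.

```python
import math
import random

class Node:
    """One node in the search tree; `state` is a list of steps taken so far."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(node, c=1.41):
    # Upper Confidence bound for Trees: trades off exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts(root_state, candidate_steps, rollout_score, iterations=100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: add one child per candidate next step (if any remain).
        for step in candidate_steps(node.state):
            node.children.append(Node(node.state + [step], parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: score the partial solution at this node.
        reward = rollout_score(node.state)
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the most-visited first step.
    return max(root.children, key=lambda n: n.visits).state
```

As a toy usage example, searching for the sequence `[1, 1, 0]` with a per-position match score steers the search toward the correct first step within a couple hundred iterations; swapping the callbacks for an LLM's step proposer and confidence scorer gives the Marco-o1-style setup described above.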
Livebench AI is fast becoming the go-to benchmark for evaluating LLMs on real-world problems. You can't game it because we change the questions every month. For instance, step-2 seems unexpectedly performant and as good as Sonnet and o1! We will know when we evaluate it in the… https://t.co/VnOJbnjfQn
This is amazing! Evolving benchmarks are super necessary; otherwise, people can just train their models to perform on selected benchmarks and fake competence. Kudos to Livebench AI https://t.co/LlE6oCWkbR
Deepseek's new reasoning engine is SUPER interesting and worth a try... most interesting, you get to see how the AI is thinking, and just how surprised the AI is to discover the third 'r' in "strawberry". Link to @deepseek_ai's model below... https://t.co/utunYyyyXR