OpenAI has introduced MLE-bench, a new benchmark that evaluates how well AI agents perform machine learning engineering tasks compared with human data scientists. It arrives amid a broader wave of research probing the reasoning capabilities of large language models (LLMs). Recent findings indicate that top human performers outscored AI agents in a rigorous Q3 benchmark, with statistical significance (p=0.036), and researchers are watching whether AI can close the gap in the upcoming Q4 series, which offers a $30,000 prize pool. Meanwhile, WecoAI's AIDE has been recognized as the leading machine learning engineer agent, outperforming competing agent scaffolds in several competitions. Concerns persist about LLMs' reasoning abilities, however: studies show they struggle with mathematical tasks and perform inconsistently when problems are varied only slightly. Researchers continue to probe these cognitive limitations, asking whether the models genuinely reason or merely produce sophisticated pattern matching.
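
The MLE-Bench post under Sources below cites medal rates at pass@1 and pass@8. As a rough illustration of how such a figure can be computed from repeated agent runs, here is a minimal sketch in Python, using the standard unbiased pass@k estimator from Chen et al. (2021) and invented per-competition numbers; MLE-bench's own aggregation may differ.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k attempts, drawn without replacement from n attempts
    of which c succeeded, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical per-competition results: how many agent runs (seeds) were
# made and how many of them earned a medal. Numbers are illustrative only,
# not taken from the MLE-bench paper.
results = {
    "competition-a": {"attempts": 8, "medals": 3},
    "competition-b": {"attempts": 8, "medals": 0},
    "competition-c": {"attempts": 8, "medals": 1},
}

for k in (1, 8):
    # Medal rate at pass@k: average over competitions of the probability
    # that at least one of k sampled runs earns a medal.
    rate = sum(
        pass_at_k(r["attempts"], r["medals"], k) for r in results.values()
    ) / len(results)
    print(f"pass@{k} medal rate: {rate:.1%}")
```

Under this kind of aggregation it is unsurprising that the reported medal rate roughly doubles from pass@1 to pass@8: giving an agent more independent attempts per competition can only raise the chance that at least one of them medals.
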
Sources

Apple is writing what Gary Marcus often repeats, but that doesn't make it any more correct. I wonder how anyone can still support such claims after OpenAI's o1. LLMs can reason, and they can already do so today. https://t.co/hetI2yD4gf

MLE-Bench - Now AI is coming for Kaggle GrandMasters and machine learning engineering skills in general.
**Results** 📊:
• o1-preview (AIDE): Achieves medals in 16.9% of competitions
• GPT-4o (AIDE): Medals in 8.7% of competitions
• Performance doubles from pass@1 to pass@8…
https://t.co/QnuXvmVbVv https://t.co/g5QSssOzMJ

Researchers question AI’s ‘reasoning’ ability as models stumble on math problems with trivial changes: https://t.co/jzLVUJd2xR by TechCrunch #infosec #cybersecurity #technology #news