Current models are actually much better at math than the benchmarks show. Now, more than ever, we need accurate math evaluations. The parsing commonly used for these benchmarks absolutely SUCKS, but @HKydlicek has fixed it 👇 https://t.co/8bttjaJBQy
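To make the parsing complaint concrete, here is a minimal sketch (plain sympy, not the linked fix, whose internals the post does not describe) of why exact-string grading under-scores models: "0.5" and "1/2" are the same answer, but a string comparison marks one of them wrong. The names `naive_match` and `symbolic_match` are illustrative, not from any existing harness.

```python
from sympy import simplify, sympify

def naive_match(pred: str, gold: str) -> bool:
    # Typical harness behavior: strip whitespace and compare strings exactly.
    return pred.strip() == gold.strip()

def symbolic_match(pred: str, gold: str) -> bool:
    # Parse both answers into symbolic expressions and test equivalence,
    # so mathematically equal forms ("1/2", "0.5", "2/4") all score correct.
    try:
        return simplify(sympify(pred) - sympify(gold)) == 0
    except Exception:
        return False

print(naive_match("0.5", "1/2"))     # False: a correct answer marked wrong
print(symbolic_match("0.5", "1/2"))  # True: equivalence recovered
```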
🎉 Our dataset mixing results are in: SN9, our pretraining subnet, is producing models competitive with SOTA rivals from DeepSeek AI, Mistral, and Google on prominent benchmarks (1/4) https://t.co/Q8mKHcWyAb
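For readers unfamiliar with dataset mixing, the sketch below shows the basic idea: pretraining batches are drawn from several corpora in proportion to fixed mixture weights. The corpora names and weights here are assumptions for illustration, not the actual SN9 recipe, which the post does not disclose.

```python
import random

# Hypothetical corpora and mixture weights (illustrative, not the real recipe).
corpora = {
    "web":  ["web doc 1", "web doc 2"],
    "code": ["code file 1", "code file 2"],
    "math": ["math doc 1", "math doc 2"],
}
weights = {"web": 0.6, "code": 0.25, "math": 0.15}

def sample_document(rng: random.Random) -> str:
    """Pick a corpus proportionally to its weight, then a document from it."""
    names = list(weights)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(corpora[name])

rng = random.Random(0)
print([sample_document(rng) for _ in range(5)])
```

Tuning these weights (rather than the model architecture) is what "dataset mixing results" typically refers to: different mixtures produce measurably different benchmark performance from the same training budget.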
I think academic benchmarks are falling behind. Private human evals show real progress. @scale_AI's SEAL matches our internal testing for Aya Expanse. Crazy strong multilingual performance for a small model :) https://t.co/YIsDjkSnY7
Aya Expanse has been recognized as the best open-weights model on Scale AI's private multilingual evaluation, reportedly outperforming proprietary models, larger models, and models built by better-resourced teams in certain languages, despite its smaller size. The result feeds an ongoing debate in the AI community about evaluation methodology: some researchers argue that traditional academic benchmarks no longer reflect the capabilities of modern models, and that benchmarks built from randomized or freshly generated test items are needed to capture practical performance, in part to guard against test-set contamination.
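A minimal sketch of what a randomized benchmark could look like, assuming the generate-then-grade pattern: test items are produced fresh from a seed, so they cannot have leaked into training data. The functions `make_item` and `grade` are hypothetical names, and the arithmetic task is a toy stand-in for real item generators.

```python
import random

def make_item(rng: random.Random) -> tuple[str, int]:
    """Generate one arithmetic question and its gold answer."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"What is {a} * {b}?", a * b

def grade(prediction: str, gold: int) -> bool:
    """Accept the answer if the gold number is the final token."""
    tokens = prediction.strip().split()
    return bool(tokens) and tokens[-1].rstrip(".") == str(gold)

rng = random.Random(42)
question, gold = make_item(rng)
print(question)                              # a freshly generated item
print(grade(f"The answer is {gold}", gold))  # True
```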