Sources
Guilherme Penedo: Current models are actually much better at math than the benchmarks show. Now, more than ever, we need accurate math evaluations. The parsing commonly used for these benchmarks absolutely SUCKS, but @HKydlicek has fixed it 👇 https://t.co/8bttjaJBQy
Macrocosmos: 🎉 Our dataset mixing results are in: SN9, pretraining, is producing models competitive with SOTA rivals from companies like DeepSeek AI, Mistral, and Google on prominent benchmarks (1/4) https://t.co/Q8mKHcWyAb
Nick Frosst: I think academic benchmarks are falling behind. Private human evals show real progress. @scale_AI's SEAL matches our internal testing for Aya Expanse. Crazy strong multilingual performance for a small model :) https://t.co/YIsDjkSnY7