Results are in! 🔥 o1-preview scores marginally higher (27%) than GPT-4 Turbo (26.4%), the previous #1 model in the Bigcodebench Hard Instruct eval. I expected better tbh! For comparison, a model mix synthesizing responses from Claude 3.5 Sonnet and GPT-4 Turbo recently scored… https://t.co/F6IS7fZyE4
The results are in! ✨ O1-Preview scored 27% on the Bigcodebench Hard Instruct subset, run via the provided Docker. It's the highest-scoring single model we have evaluated (pass@1):
→ GPT-4 Turbo: 26.4%
→ GPT-4o: 25.0%
→ Claude 3.5 Sonnet: 24.3%
https://t.co/b6kkLoBEqJ https://t.co/fHKFd9XEMz
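The pass@1 figure cited above is the standard functional-correctness metric for code benchmarks: the fraction of tasks for which a sampled solution passes all tests. As a rough illustration (not the exact BigCodeBench harness), the sketch below computes the unbiased pass@k estimator from the Codex/HumanEval paper; the `results` data and variable names are illustrative assumptions, not values from this eval.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = generations sampled for a task, c = generations passing all tests,
    k = evaluation budget. With n == k == 1 this reduces to plain accuracy."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Hypothetical per-task tallies: (samples generated, samples passing).
results = [(1, 1), (1, 0), (1, 0), (1, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1%}")  # 50.0% on this toy set
```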
o1-preview-2024-09-12 on BigCodeBench-Hard:
Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
Instruct: 23.0% (far below other top models)
Average: 28.8%
o1-preview may follow detailed instructions reasonably well, but not the brief ones. Not sure how consistent… https://t.co/cnnMmoLpFB https://t.co/u9cWUGF8lJ
The results for the o1-preview model's performance on the BigCodeBench-Hard Instruct evaluation are in. o1-preview scored 27%, slightly higher than GPT-4 Turbo's 26.4%, making it the highest-scoring single model evaluated by Catena Labs; the evaluation was conducted using the provided Docker setup. Other notable scores from that run include GPT-4o at 25.0% and Claude 3.5 Sonnet at 24.3%. Results elsewhere were more mixed: on the official BigCodeBench-Hard leaderboard, o1-preview-2024-09-12 scored 34.5% on the Complete split (detailed docstring prompts, slightly ahead of Claude 3.5 Sonnet) but only 23.0% on the Instruct split (brief natural-language instructions, well below other top models), for an average of 28.8%. Additionally, o1-preview scored only 10% on an internal codegen benchmark where GPT-4o scored 38%.