Results are in! 🔥 o1-preview scores marginally higher (27%) than GPT-4 Turbo (26.4%), the previous #1 model in the Bigcodebench Hard Instruct eval. I expected better tbh! For comparison, a model mix synthesizing responses from Claude 3.5 Sonnet and GPT-4 Turbo recently scored… https://t.co/F6IS7fZyE4
The results are in! ✨ O1-Preview scored 27% on the Bigcodebench Hard Instruct subset, run via the provided Docker. It's the highest-scoring single model we have evaluated (pass@1):
→ GPT-4 Turbo: 26.4%
→ GPT-4o: 25.0%
→ Claude 3.5 Sonnet: 24.3%
https://t.co/b6kkLoBEqJ https://t.co/fHKFd9XEMz
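The pass@1 figure cited above is the standard functional-correctness metric for code benchmarks: the fraction of tasks for which a sampled solution passes all tests. As a rough illustration (not the exact BigCodeBench harness), the sketch below computes the unbiased pass@k estimator from the Codex/HumanEval paper; the `results` data and variable names are illustrative assumptions, not values from this eval.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = generations sampled for a task, c = generations passing all tests,
    k = evaluation budget. With n == k == 1 this reduces to plain accuracy."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Hypothetical per-task tallies: (samples generated, samples passing).
results = [(1, 1), (1, 0), (1, 0), (1, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1%}")  # 50.0% on this toy set
```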
o1-preview-2024-09-12 on BigCodeBench-Hard:
Complete: 34.5% (slightly better than Claude-3.5-Sonnet-20240620)
Instruct: 23.0% (far below other top models)
Average: 28.8%
o1-preview may follow detailed instructions reasonably well, but not the brief ones. Not sure how consistent… https://t.co/cnnMmoLpFB https://t.co/u9cWUGF8lJ
The results for the o1-preview model's performance on the BigCodeBench-Hard Instruct evaluation are in. o1-preview scored 27%, slightly higher than GPT-4 Turbo's 26.4%, making it the highest-scoring single model evaluated by Catena Labs; the evaluation was conducted using the provided Docker setup. Other notable scores from that run include GPT-4o at 25.0% and Claude 3.5 Sonnet at 24.3%. Results elsewhere were more mixed: on the official BigCodeBench-Hard leaderboard, o1-preview-2024-09-12 scored 34.5% on the Complete split (detailed docstring prompts, slightly ahead of Claude 3.5 Sonnet) but only 23.0% on the Instruct split (brief natural-language instructions, well below other top models), for an average of 28.8%. Additionally, o1-preview scored only 10% on an internal codegen benchmark where GPT-4o scored 38%.