Recent developments in artificial intelligence suggest that AI-assisted physicians now outperform physicians working alone. In September 2024, physician-AI collaboration surpassed the performance of either physicians or AI alone on the HealthBench doctor benchmark. By April 2025, newer models such as OpenAI's o3 and GPT-4.1 had advanced to the point where physicians could no longer improve on the AI-generated medical responses. The performance lead is substantial: o3 surpasses GPT-4o by 0.28 points on HealthBench, a larger gap than the one separating GPT-4o from GPT-3.5 Turbo. Error rates have also declined with these newer models. The accuracy of AI medical advice has prompted discussion of whether AI should be authorized to bill evaluation and management (E&M) or chronic care management (CCM) codes, given that it is nearly free and infinitely scalable.
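To make the point gaps above concrete, here is a minimal sketch of how a HealthBench-style rubric score could be computed, assuming the scheme described in the benchmark's publication: each example carries physician-written criteria with point values (negative for undesirable behaviors), a grader marks which criteria a response meets, and the example score is achieved points over total positive points, clipped at zero. The `Criterion` type, function names, and example numbers are hypothetical illustrations, not OpenAI's actual grading code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str     # physician-written rubric item
    points: int   # positive = desirable behavior, negative = undesirable

def score_example(criteria: list[Criterion], met: list[bool]) -> float:
    """Score one response: points earned on met criteria,
    normalized by the total positive points available, floored at 0."""
    achieved = sum(c.points for c, m in zip(criteria, met) if m)
    possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, achieved / possible) if possible else 0.0

def benchmark_score(per_example: list[float]) -> float:
    """Overall benchmark score: mean of per-example scores (0 to 1)."""
    return sum(per_example) / len(per_example)

if __name__ == "__main__":
    # Hypothetical rubric for a single medical-advice prompt.
    criteria = [
        Criterion("Advises urgent evaluation for red-flag symptoms", 5),
        Criterion("Asks about current medications", 3),
        Criterion("Recommends an unnecessary test", -4),
    ]
    print(score_example(criteria, met=[True, True, False]))  # 1.0
    print(score_example(criteria, met=[True, False, True]))  # 0.125
```

On this 0-to-1 scale, a 0.28-point model-to-model gap of the kind the text describes is large: it corresponds to meeting, on average, more than a quarter of the available rubric points that the weaker model misses.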
In September 2024, physicians working with AI did better on the HealthBench doctor benchmark than either AI or physicians alone. With the release of o3 and GPT-4.1, AI answers are no longer improved on by physicians. Also, error rates appear to be dropping for newer AI models. https://t.co/HI2GBXiLS8
State-of-the-art models (o3, GPT-4.1) are now providing more accurate medical advice than physicians. When do we say AI should be able to start billing, e.g., E&M or CCM codes, like humans? Nearly-free and infinitely scalable medical advice is one of the few things that could https://t.co/fLTcczJ1Jb