Recent developments in artificial intelligence suggest that AI-assisted physicians now outperform physicians working alone. In September 2024, physician-AI collaboration surpassed the performance of either physicians or AI alone on the HealthBench doctor benchmark. By April 2025, newer models such as OpenAI's o3 and GPT-4.1 had advanced to the point where physicians could no longer improve on the AI-generated medical responses. The performance lead is substantial: o3 surpasses GPT-4o by 0.28 points on HealthBench, a larger gap than the one separating GPT-4o from GPT-3.5 Turbo. Error rates have also declined with these newer models. The accuracy of AI medical advice has prompted discussion of whether AI should be authorized to bill evaluation and management (E&M) or chronic care management (CCM) codes, given that it is nearly free and infinitely scalable.
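To make the point gaps above concrete, here is a minimal sketch of how a HealthBench-style rubric score could be computed, assuming the scheme described in the benchmark's publication: each example carries physician-written criteria with point values (negative for undesirable behaviors), a grader marks which criteria a response meets, and the example score is achieved points over total positive points, clipped at zero. The `Criterion` type, function names, and example numbers are hypothetical illustrations, not OpenAI's actual grading code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str     # physician-written rubric item
    points: int   # positive = desirable behavior, negative = undesirable

def score_example(criteria: list[Criterion], met: list[bool]) -> float:
    """Score one response: points earned on met criteria,
    normalized by the total positive points available, floored at 0."""
    achieved = sum(c.points for c, m in zip(criteria, met) if m)
    possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, achieved / possible) if possible else 0.0

def benchmark_score(per_example: list[float]) -> float:
    """Overall benchmark score: mean of per-example scores (0 to 1)."""
    return sum(per_example) / len(per_example)

if __name__ == "__main__":
    # Hypothetical rubric for a single medical-advice prompt.
    criteria = [
        Criterion("Advises urgent evaluation for red-flag symptoms", 5),
        Criterion("Asks about current medications", 3),
        Criterion("Recommends an unnecessary test", -4),
    ]
    print(score_example(criteria, met=[True, True, False]))  # 1.0
    print(score_example(criteria, met=[True, False, True]))  # 0.125
```

On this 0-to-1 scale, a 0.28-point model-to-model gap of the kind the text describes is large: it corresponds to meeting, on average, more than a quarter of the available rubric points that the weaker model misses.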
In September 2024, physicians working with AI did better on the HealthBench doctor benchmark than either AI or physicians alone. With the release of o3 and GPT-4.1, AI answers are no longer improved on by physicians. Also, error rates appear to be dropping for newer AI models. https://t.co/HI2GBXiLS8
State-of-the-art models (o3, GPT-4.1) are now providing more accurate medical advice than physicians. When do we say AI should be able to start billing, e.g., E&M or CCM codes, like humans? Nearly-free and infinitely scalable medical advice is one of the few things that could https://t.co/fLTcczJ1Jb