
AI from OpenAI outperforms doctors in study

Researchers from Harvard Medical School and Stanford University put OpenAI’s o1-preview AI system through a series of medical diagnosis tests. The results show the AI has made remarkable progress compared to previous versions. In the study, o1-preview correctly diagnosed 78.3% of all cases it examined.

In a direct comparison on 70 specific cases, the system accurately diagnosed 88.6% of them, significantly outperforming its predecessor GPT-4, which managed 72.9%. The AI system's medical reasoning was even more impressive.

Using the R-IDEA scale, a standard measure for evaluating medical reasoning quality, o1-preview achieved perfect scores in 78 out of 80 cases. Experienced doctors reached perfect scores in only 28 cases, while medical residents managed it in just 16 cases. The researchers acknowledge that some test cases might have been included in o1-preview’s training data.

However, when they tested the system on newer cases it had never encountered, its performance dropped only slightly. o1-preview truly shone when tackling complex management cases that 25 specialists had specifically designed to be difficult. In these tough cases, o1-preview scored 86% of possible points.

OpenAI’s latest medical diagnosis advancements

That’s more than double what doctors achieved using GPT-4 (41%) or traditional tools (34%). The system isn’t perfect, though.

It struggled with probability assessments, showing no real improvement over older models. The researchers found that while the system excels at tasks requiring critical thinking, like making diagnoses and recommending treatments, it has trouble with more abstract challenges like estimating probabilities. They also point out that o1-preview tends to give detailed answers, which might have boosted its scores.


Plus, the study only looked at o1-preview working alone—not how well it might work alongside human doctors. Some critics argue that having a more capable AI system doesn’t automatically solve the challenge of making it work in real-world healthcare settings. The researchers emphasize the need for better evaluation methods.

They say multiple-choice tests don’t capture the complexity of real medical decision-making. The researchers are calling for new, more practical testing methods, real-world clinical trials, better technical infrastructure, and improved ways for humans and AI to work together. While AI is making significant strides in medical diagnostics, replacing human doctors is a multifaceted challenge that goes beyond mere accuracy.

AI will likely serve as an aid to enhance human medical practice rather than a replacement.

Noah Nguyen is a multi-talented developer who brings a unique perspective to his craft. Initially a creative writing professor, he turned to Dev work for the ability to work remotely. He now lives in Seattle, spending time hiking and drinking craft beer with his fiancee.
