For the First Time, GPT Outperforms Resident Physicians on Board Exams
As the various types of artificial intelligence (AI) evolve in patient care organizations, in operational, financial, clinician workflow, and clinical settings, it's clear that real advances are being made. And now, an article published in the New England Journal of Medicine's new journal NEJM AI reports that the leap from GPT-3.5 to GPT-4 represents a major advance in large language models (LLMs), with GPT-4 matching or outperforming resident physicians on medical board examinations.
On April 12, in NEJM AI, a large team of researchers reported those medical board exam results in an article entitled "GPT versus Resident Physicians — A Benchmark Based on Official Board Scores." The authors are Uriel Katz, M.D., Eran Cohen, M.D., Eliya Shachar, M.D., Jonathan Somer, B.Sc., Adam Fink, M.D., Eli Morse, M.D., Beki Shreiber, B.Sc., and Ido Wolf, M.D.
The authors note at the start: "Artificial intelligence (AI) is a burgeoning technological advancement, with considerable promise for influencing the field of medicine. As a preliminary step toward integrating AI into medical practice, it is imperative to ascertain whether model performance is comparable with that of physicians. We present a systematic comparison of performance by a large language model (LLM) versus that of a large cohort of physicians. The cohort includes all residents who took the medical specialist license examination in Israel in 2022 across the core medical disciplines: internal medicine, general surgery, pediatrics, psychiatry, and obstetrics and gynecology (OB/GYN). We provide the examinations as an accessible benchmark dataset for the medical machine learning and natural language processing communities, which may be adapted for future LLM studies."
Here's what the researchers did: “We evaluated the performance of generative pretrained transformer 3.5 (GPT-3.5) and GPT-4 on the 2022 Israeli board residency examinations and compared the results with those of 849 practicing physicians. Official physician scores were obtained from the Israeli Medical Association. To compare GPT and physician performance, we computed model percentiles among physicians in each examination. We accounted for model stochasticity by applying the model to each examination 120 times.”
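For readers curious about the mechanics, this kind of percentile comparison can be sketched in a few lines of Python. The sketch below is a hypothetical illustration, not the authors' actual code: the physician scores and per-run model scores are simulated placeholders, and the study's design is reduced to its two essentials, a cohort of 849 physician scores and 120 repeated model runs to account for stochasticity.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder physician scores for one examination (the study used
# official scores for 849 physicians from the Israeli Medical Association).
physician_scores = rng.normal(loc=70, scale=8, size=849).clip(0, 100)

# The study applied the model to each examination 120 times to account
# for stochasticity; here we simulate 120 per-run scores.
model_scores = rng.normal(loc=72, scale=3, size=120).clip(0, 100)

# Summarize the model's runs with its median score.
median_model_score = np.median(model_scores)

# Percentile of the model among physicians: the fraction of physicians
# scoring below the model's median score.
percentile = (physician_scores < median_model_score).mean() * 100

passing_score = 65  # official passing score cited in the article
print(f"Model median score: {median_model_score:.1f}")
print(f"Model percentile among physicians: {percentile:.0f}")
print(f"Passed examination: {median_model_score >= passing_score}")
```

Repeating this per examination, and per discipline, yields the kind of model-versus-physician rankings the authors report; one could equally compute a percentile for each of the 120 runs and report the resulting distribution rather than a single median.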
And what did they find? "GPT-4 ranked higher than the majority of physicians in psychiatry, and it performed similarly to the median physician in general surgery and internal medicine," though "GPT-4 performance was lower in pediatrics and OB/GYN but remained higher than a considerable fraction of practicing physicians." In comparison, "GPT-3.5 did not pass the examination in any discipline and was inferior to the majority of physicians in the five disciplines. Overall, GPT-4 passed the board residency examination in four of five specialties, revealing a median score higher than the official passing score of 65 percent."
And what does all this mean? “This work showed that GPT-4 performance is comparable with that of physicians on official medical board residency examinations,” the article’s authors write. “Model performance was near or above the official passing rate in all medical specialties tested. Given the maturity of this rapidly improving technology, the adoption of LLMs in clinical medical practice is imminent. Although the integration of AI poses challenges, the potential synergy between AI and physicians holds tremendous promise. This juncture represents an opportunity to reshape physician training and capabilities in tandem with the advancements in AI.”