As the adoption of generative AI (artificial intelligence) advances across a vast array of clinical areas, including specialized diagnostics, researchers are beginning to scientifically test the efficacy of tools such as AI chatbots.
Reporting in JAMA Network Open, with the study also featured on JAMA+ AI, the recently created channel of the Journal of the American Medical Association (JAMA) Network dedicated to AI-related topics, a team of researchers has published a study of the accuracy and effectiveness of output from text-only chatbots versus multimodal chatbots, which make use of both text and images. The article, “Performance of Multimodal Artificial Intelligence Chatbots Evaluated on Clinical Oncology Cases,” was published on Oct. 23 and authored by a large team of researchers affiliated with either the Princess Margaret Cancer Centre in Toronto or the University of Toronto. David Chen, BMSc, led the team, and was joined by Ryan S. Huang, MSc; Jane Jomy, MSc; Philip Wong, M.D., MSc; Michael Yan, M.D., M.P.H.; Jennifer Croke, M.D., M.H.P.E.; Daniel Tong, M.D.; Andrew Hope, M.D.; Lawson Eng; and Srinivas Raman, M.D., MASc.
The researchers stated in the abstract of their article that their purpose was “To evaluate the utility of prompt engineering (zero-shot chain-of-thought) and compare the competency of multimodal and unimodal AI chatbots to generate medically accurate responses to questions about clinical oncology cases.”
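The article does not reproduce the exact prompt template the team used, but zero-shot chain-of-thought prompting generally works by appending a reasoning cue to the question, with no worked examples included. The following is a minimal illustrative sketch in Python; the case text and the query_chatbot function are hypothetical placeholders, not part of the study's methods.

    # Minimal sketch of zero-shot chain-of-thought prompting.
    # `query_chatbot` is a hypothetical stand-in for whatever API
    # a given chatbot exposes; it is not the study's actual setup.

    def build_zero_shot_cot_prompt(case_description: str, question: str) -> str:
        """Assemble a prompt that asks the model to reason step by step
        before answering, with no worked examples provided (zero-shot)."""
        return (
            f"Clinical case:\n{case_description}\n\n"
            f"Question: {question}\n\n"
            # The chain-of-thought cue: elicit intermediate reasoning.
            "Let's think step by step, then state the final answer."
        )

    def query_chatbot(prompt: str) -> str:
        # Hypothetical placeholder for a real chatbot API call.
        raise NotImplementedError("Replace with a real chatbot API call.")

    if __name__ == "__main__":
        prompt = build_zero_shot_cot_prompt(
            "A 62-year-old patient presents with ...",  # illustrative case text
            "What is the most appropriate next step in management?",
        )
        print(prompt)  # inspect the assembled prompt before sending it

The "step by step" cue is the defining feature of zero-shot chain-of-thought: it asks the model to produce intermediate reasoning without supplying any example answers.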
As the article’s authors wrote, “This study evaluated 10 chatbots, including three multimodal and seven unimodal chatbots. On the multiple-choice evaluation, the top-performing chatbot was chatbot 10 (57 of 79 [72.15 percent]), followed by the multimodal chatbot 2 (56 of 79 [70.89 percent]) and chatbot 5 (54 of 79 [68.35 percent]). On the free-text evaluation, the top-performing chatbots were chatbot 5, chatbot 7, and the multimodal chatbot 2 (30 of 79 [37.97 percent]), followed by chatbot 10 (29 of 79 [36.71 percent]) and chatbot 8 and the multimodal chatbot 3 (25 of 79 [31.65 percent]). The accuracy of multimodal chatbots decreased when tested on cases with multiple images compared with questions with single images. Nine out of 10 chatbots, including all three multimodal chatbots, demonstrated decreased accuracy of their free-text responses compared with multiple-choice responses to questions about cancer cases.”
What does all of that mean? “In this cross-sectional study of chatbot accuracy tested on clinical oncology cases,” the authors wrote, “multimodal chatbots were not consistently more accurate than unimodal chatbots. These results,” they wrote, “suggest that further research is required to optimize multimodal chatbots to make more use of information from images to improve oncology-specific medical accuracy and reliability.”
Indeed, they wrote, “In this cross-sectional study of chatbot accuracy, we observed that multimodal chatbots were comparable to unimodal chatbots when evaluated based on accuracy in response to questions about clinical oncology cases and were less accurate when tested on cases with multiple images compared with a single image. Chatbots were generally less accurate when evaluated based on free-text responses compared with multiple-choice responses. Further research is required to improve the reliability of prompt engineering methods to increase accuracy of multimodal chatbots in oncology settings and evaluate the utility of AI chatbots as useful decision support tools in clinical oncology workflows.”