Study: GPT-4 Use Helped Reduce Clinician Documentation Burnout

April 19, 2024
Researchers at Stanford Medicine have examined the use of GPT-4 to draft initial responses to patient portal messages—and found hopeful results.

Could the use of generative artificial intelligence (GPT-4), applied to inbox-message management, reduce the administrative burden on physicians and nurses in patient care organizations? A new study suggests such potential, though with qualifications and caveats.

The results of a recent study were published in JAMA Network Open on March 20, under the title “Artificial Intelligence-Generated Draft Replies to Patient Inbox Messages.” The research article was authored by Patricia Garcia, M.D., Stephen P. Ma, M.D., Ph.D., Shreya Shah, M.D., and 13 other clinical researchers. The key question? “What is the adoption of and clinician experience with clinical practice deployment of a large language model used to draft responses to patient inbox messages?” The researchers write that, “In this five-week, single-group, quality improvement study of 162 clinicians, the mean draft utilization rate was 20 percent, there were statistically significant reductions in burden and burnout score derivatives, and there was no change in time.” As a result, they write, “These findings suggest that the use of large language models in clinical workflows was spontaneously adopted, usable, and associated with improvement in clinician well-being.” The study was conducted among clinicians at Stanford Medicine in Palo Alto, California.

Given that early testing of GPT-4 found that the technology can help clinicians quickly generate empathetic draft replies to patient messages, leaders at Stanford Medicine decided to try applying the technology “to help address patient portal messaging, which has seen a 157-percent increase during the COVID-19 pandemic compared with prepandemic levels. This rapidly growing modality for care has strained health system capacity and become a leading factor in clinician burnout,” the researchers note. And, “Although several strategies have been proposed for inbox management, including automated message categorization and triage, team optimization, and billing for patient messages, more effective solutions are required,” they write.

So the Stanford Medicine leaders decided to test the implementation of that technology “using an evaluation guided by the Reach, Effectiveness, Adoption, Implementation, Maintenance/Practical, Robust Implementation, and Sustainability Model (RE-AIM/PRISM). The developmental assessment of readiness for adoption and clinician experience included a primary outcome of utilization, with secondary outcomes evaluating time, usability, utility, and impact on clinician well-being.” And so “Stanford Medicine routed select patient messages upon arrival to the inbox messaging pool to EHR [electronic health record] developer Epic (Epic Systems) for categorization via GPT-3.5 Turbo (selected by Epic to minimize cost and compute) and GPT-4 for draft reply generation.” What’s more, “Messages were categorized into 1 of 4 categories (general, results, medications, and paperwork), which triggered a corresponding prompt that included the patient message, selected structured data elements (eg, name, age, department, allergies, and so forth) and the last clinic note. Patient messages and draft replies were displayed within the EHR with options to start with draft or start blank reply. Technical reliability was tested with 14 ambulatory clinician beta users before study initiation. Education was provided via email and brief presentations at clinician group meetings.”
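For readers who want a concrete picture of that two-step flow, categorization followed by draft generation, here is a minimal sketch in Python using the public OpenAI SDK. The production system ran entirely within Epic's EHR integration, whose interfaces are not public, so every prompt, model identifier, and helper function below is an assumption for illustration rather than the study's actual implementation.

```python
# Illustrative sketch only. The study's pipeline ran inside Epic's EHR
# integration, which is proprietary; the model names, prompts, and helper
# functions below are assumptions for demonstration, not the actual system.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

CATEGORIES = ("general", "results", "medications", "paperwork")

def categorize_message(message_text: str) -> str:
    """Step 1: a lower-cost model (the study used GPT-3.5 Turbo) assigns
    the incoming message to 1 of the 4 categories named in the article."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify the patient message as exactly one of: "
                        + ", ".join(CATEGORIES)
                        + ". Reply with the category only."},
            {"role": "user", "content": message_text},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "general"

def draft_reply(message_text: str, category: str,
                structured_data: dict, last_clinic_note: str) -> str:
    """Step 2: GPT-4 drafts a reply from a category-specific prompt bundling
    the message, selected structured EHR fields, and the last clinic note."""
    prompt = (
        f"Message category: {category}\n"
        f"Patient data (name, age, department, allergies): {structured_data}\n"
        f"Last clinic note: {last_clinic_note}\n"
        f"Patient message: {message_text}\n"
        "Draft an empathetic reply for the clinician to review and edit."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # The clinician sees this draft in the EHR and can "start with draft"
    # or "start blank reply."
    return resp.choices[0].message.content
```

Note the cost-conscious split the researchers describe: the cheaper model handles only categorization, while GPT-4 is reserved for generating the draft itself.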

What happened? Well, it turned out that “the mean cumulative draft utilization after only 5 weeks was 20 percent.” “This is remarkable,” the researchers write, “given that (1) these versions of GPT were not trained on medical literature or fine-tuned for this task specifically, (2) limited context was available from the patient’s EHR for draft generation, and (3) minimal user education was necessary for adoption.” Furthermore, “Improvements in task load and emotional exhaustion scores suggest that generated draft replies have the potential to impact cognitive burden and burnout. Similarly, users expressed high expectations about utility, quality, and time that were either met or exceeded at the end of the pilot. Given the evidence that burnout is associated with turnover, reductions in clinical activity, and quality, even a modest improvement may have a substantial impact.”

That said, “[N]o changes in overall reply time, read time, or write time were found when comparing prepilot and pilot periods. It may be that switching from writing to editing is less cognitively taxing despite taking the same amount of time. That said, survey respondents showed optimism about time saved, suggesting that perceptions of time may be different from time captured via EHR metadata, both of which may be different than real time. Finally, although future iterations may focus on measurable time-savings, considering other relevant outcomes including volume of follow-up messages and patient experience will provide a more complete picture.”

To summarize, “In this quality improvement study of generative AI in health care, using GPT-4 to generate draft responses to patient messages, we found adoption, usability, and notable improvements in mental task load and work exhaustion. There were no adverse safety signals, and qualitative feedback suggested high expectations for future use. These findings are especially remarkable given minimal user education and the use of a large language model without domain-specific training. That said, we did not find time-savings, and feedback highlighted the need for improvements in tone, brevity, and personalization. Ongoing code-to-bedside testing is needed to inform future development and strategic organizational decision-making. In the case of generated draft messages, there is a cost each time GPT-4 is used to generate a draft response; multiplying that cost by millions of patient messages could represent a substantial expense to the US health care delivery system that must be justified with clinical practice data. Although the transformative potential of generative AI is evident, understanding when these tools have reached a maturity level to warrant costly investments and widespread use is paramount,” the article’s authors conclude.
