NEW YORK, March 11, 2024 /PRNewswire/ — An artificial intelligence tool can convert doctor-written notes that summarize patients’ hospital visits into accurate, lay language, a new study found.
The research focuses on discharge notes written by doctors to capture patients’ health status in the medical record as they are discharged from the hospital. Effective summaries are essential for patient safety during these transitions in care, but most are filled with technical language and abbreviations that are hard to understand and increase patient anxiety, say the study authors.
To address the problem, NYU Langone Health has been testing the capabilities of a form of artificial intelligence (AI) called generative AI, which develops likely options for the next word in any sentence based on how billions of people use words in context on the internet. A result of this next-word prediction is that such generative AI “chatbots” have become good at replying to questions in realistic, simple language, and at producing clear summaries of complex texts. However, AI programs, which work based on probabilities instead of “thinking,” may produce inaccurate summaries and so are meant to assist, not replace, human providers.
To explore generative AI, NYU Langone Health in March 2023 received access to GPT-4, the latest tool from OpenAI, the company that created the famous ChatGPT chatbot. NYU Langone Health licensed one of the first “private instances” of the tool, which freed hundreds of its frontline clinicians to experiment with AI-based solutions to clinical problems using real patient data, while adhering to federal standards that protect patient privacy.
One of the first studies by researchers using GPT-4, publishing online March 11 in JAMA Network Open, looked at how well the tool could convert 50 patient discharge notes into patient-friendly language. Specifically, running discharge notes through generative AI dropped the reports from an eleventh-grade reading level on average to a sixth-grade level, the gold standard for patient education materials.
The team also rated the AI discharge report translations using the Patient Education Materials Assessment Tool (PEMAT), which generates a percentage score, based on 19 factors, for how well patients can understand a piece of reading material. GPT-4 translation raised PEMAT understandability scores to 81 percent, up from the 13 percent seen with the original doctor-written discharge reports from the medical record.
The research team designed the study to look at AI performance by itself as a scientific question: How far could it go independently when translating discharge reports?
“GPT-4 worked well alone with some gaps in accuracy and completeness, but did more than well enough to be highly effective when combined with physician oversight, the way it would be used in the real world,” says senior study author Jonah Feldman, MD, medical director of Clinical Transformation and Informatics within NYU Langone Health’s Medical Center Information Technology (MCIT) Department of Health Informatics. “One focus of the study was on how much work physicians must do to oversee the tool, and the answer is very little. Such tools could reduce patient anxiety even as they save providers hours each week in medical paperwork, a major source of burnout.”
To measure the accuracy of the AI tool translations, the authors also asked two physicians to rate the AI discharge summaries on a 6-point scale. The reviewing physicians awarded just 54% of the AI-generated discharge notes the best possible accuracy rating. They also found that just 56% of notes created by AI were entirely complete. These results, however, must be considered in context, say the authors. For instance, they say, the results signify that, even at the current performance level, providers would not have to make a single change in more than half of the AI summaries reviewed.
Feldman notes that generative AI tools are sensitive, and asking a question of the tool in two subtly different ways may yield divergent answers. The skill required to frame the questions asked of chatbots in a way that elicits the desired response, called prompt engineering, combines intuition and experimentation. Physicians and nurses, with their deep understanding of individual cases and nuanced medical contexts, are best positioned to engineer prompts, say the authors, and can do so without learning to write computer code.
Within weeks, the research team will be launching a program to ask patients waiting to be discharged whether AI-generated reports are clear and helpful after physician review. By the summer, the team expects to launch a pilot program to provide GPT-4-generated, physician-reviewed lay-language discharge summaries to patients on a larger scale.
“Having more than half of the AI reports generated being accurate and complete is an amazing start,” says first study author Jonah Zaretsky, MD, Associate Chief of Medicine at NYU Langone Hospital—Brooklyn. “Even at the current level of performance, which we expect to improve shortly, the scores achieved by the AI tool suggest that it can be taught to recognize subtleties.”
Along with Feldman and Zaretsky, NYU Langone study authors were Jonathan Austrian and Yindalon Aphinyanaphongs from the MCIT Department of Health Informatics; Jeong Min Kim and Saul Blecker from the Department of Medicine, Division of Hospital Medicine; Yunan Zhao from the Department of Population Health; Samuel Baskharoun from the Department of Medicine at NYU Grossman Long Island School of Medicine; and Ravi Gupta from NYU Langone Health’s Long Island Community Hospital.
Contact: Gregory Williams, [email protected]
SOURCE NYU Langone Health System