A popular artificial intelligence chatbot fell short of the standard expected of physicians when tested on common medical questions often posed by patients visiting a urology practice, a new University of Florida College of Medicine study shows.
The study's authors believe the research is the first of its kind in the specialty of urology.
The research on the popular ChatGPT chatbot highlights the risk of asking AI engines for medical information even as they grow in accuracy and conversational ability. While this and other chatbots warn users that the programs are a work in progress, physicians believe some people will undoubtedly still rely on them.
One of the more dangerous characteristics of chatbots is that they can answer a patient’s inquiry with all the confidence of a veteran physician, even when completely wrong, the study said.
“I am not discouraging people from using chatbots,” said Russell S. Terry, M.D., an assistant professor in the UF College of Medicine’s department of urology and the study’s senior author. “But don’t treat what you see as the final answer. Chatbots are not a substitute for a doctor.”
The researchers generated 13 questions on common urologic topics often posed by patients. Each question was asked three times, since ChatGPT can formulate different answers to identical queries.
The researchers evaluated the answers based on guidelines produced by the three leading professional groups for urologists in the United States, Canada and Europe, including the American Urological Association.
Five UF Health urologists independently assessed the appropriateness of the chatbot’s answers using standardized methods.
Questions included topics such as vasectomies, overactive bladder, infertility, kidney stones, trauma and recurrent urinary tract infections, or UTIs, in women.
Of the 39 responses evaluated, the chatbot provided appropriate answers 60% of the time. Otherwise, the study said, “it misinterprets clinical care guidelines, dismisses important contextual information, conceals its sources and provides inappropriate references.”
ChatGPT does not cite its sources of information by default. When the researchers asked it to provide them, it was almost uniformly unable to do so.
“It provided sources that were either completely made up or completely irrelevant,” Terry said. “Transparency is important so patients can assess what they’re being told.”
In only one of the evaluated responses did the AI note it “cannot give medical advice,” the study said. The chatbot recommended consulting with a doctor or medical adviser in only 62% of its responses.
At times, the chatbot omitted key details or misinterpreted their meaning, as when it failed to recognize the importance of pain from scar tissue in Peyronie’s disease. As a result, the paper said, the AI made an improper treatment recommendation.
The urologists queried ChatGPT in February, and because the chatbot is continually updated, its performance today might differ from that seen in the study, Terry said.
ChatGPT developers tell users the chatbot can provide bad information and warn users after logging in that ChatGPT “is not intended to give advice.”
The chatbot, Terry said, performed well on some topics, such as hypogonadism, infertility and overactive bladder. On others, like recurrent UTIs in women, it provided little correct information.
“It’s always a good thing when patients take ownership of their health care and do research to get information on their own,” Terry said. “And that’s great. But just as when you use Google, don’t accept anything at face value without checking with your health care provider.”