Health Advice From A.I. Chatbots Is Frequently Wrong, in part due to how users ask their questions (Feb 2026, n=1,298). Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Michael Harrop

https://www.nytimes.com/2026/02/09/well/chatgpt-health-advice.html
https://www.nature.com/articles/s41591-025-04074-y

The experiment found that the chatbots were no better than Google — already a flawed source of health information — at guiding users toward the correct diagnoses or helping them determine what they should do next. And the technology posed unique risks, sometimes presenting false information or dramatically changing its advice depending on slight changes in the wording of the questions.

Abstract

Global healthcare providers are exploring the use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings.

We tested whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control).

Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants.

Moving forward, we recommend systematic human user testing to evaluate interactive capabilities before public deployments in healthcare.
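To make the "tested alone" benchmark concrete, here is a minimal sketch of how one might score an LLM directly on a single vignette, assuming the OpenAI Python client; the vignette text, answer keywords, and prompt are illustrative placeholders, not the study's actual scenarios or scoring protocol:

```python
# Minimal sketch (not the paper's protocol): ask an LLM for the most likely
# condition and a disposition for one vignette, then check the reply against
# placeholder answer keywords. Vignette, keywords, and prompt are made up here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vignette = (
    "A 30-year-old in the northeastern US has had a swollen, mildly painful "
    "knee for two weeks, with fatigue but no fever and no injury."
)
expected_condition = ["lyme"]                       # placeholder answer key
expected_disposition = ["see a doctor", "primary care", "non-urgent"]

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a medical triage assistant."},
        {"role": "user", "content": vignette + "\n\nWhat is the most likely "
         "condition, and what should this person do next?"},
    ],
).choices[0].message.content.lower()

condition_hit = any(k in reply for k in expected_condition)
disposition_hit = any(k in reply for k in expected_disposition)
print(f"condition identified: {condition_hit}, disposition matched: {disposition_hit}")
```

The gap the paper reports sits exactly between this kind of direct evaluation (94.9% for conditions, 56.3% for disposition) and what happens when real participants have to relay their own situation to the same models in their own words.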
 
I've been very happy with my coaching by ChatGPT. I am pain-free for the first time in years. And my labs support that what ChatGPT diagnosed is correct, and the changes it suggested were also correct. Beats being diagnosed with anxiety, which is what I got from actual doctors for years.
 
I did a quick experiment, presenting to ChatGPT the symptoms from when I had my first case of knee swelling due to Lyme, along with the blood test result I remember being abnormal, and it generated a list of 5-6 possible conditions. ChatGPT did suggest Lyme disease as the most likely condition, although it mentioned the need to rule out septic arthritis, which is serious and can have permanent complications. It also suggested reactive arthritis as a less likely possibility, which is along the lines of the toxic synovitis that the rheumatologist whom I saw suggested it was. It said that Lyme was likely "because of the location being the US", which I didn't actually mention; it just hallucinated that detail (or possibly has access to the country of my IP address). The description it gave of Lyme arthritis DID fit very well, including that the pain is often mild relative to the degree of swelling.

It asked some follow-up questions, one of which was about another test that was probably performed but I don't recall the result, and another of which was about the location. The others were additional details about symptoms, including one about a rash. I answered all the questions except for the test result, including that the location was California, and that there HAD been a rash recently, which I described (I'm not actually sure anymore whether the rash had already happened in the year leading up to the arthritis, or whether it happened a few months after the arthritis resolved, but I pretended that it had happened "recently"). This rash was unlike the typical Lyme rash.

The AI said that the new symptom information ruled out septic arthritis and that Lyme was unlikely (though "not impossible") given that it was in California. It said that Lyme "still could be tested for", but it was no longer near the top of the list. This parallels what the doctors said when I actually DID get a positive Lyme test years later--that Lyme was rare in CA, to the point where one STILL doubted I had it even given the test results (which are two-tier; it's not just one test, there is a confirmation step). It suggested that the rash was likely unrelated to the arthritis (I still don't know myself whether that rash was in fact connected to Lyme or not).

Interestingly, none of the follow-up questions asked about sore throat, which had been mentioned as a symptom commonly preceding reactive arthritis. The rheumatologist did ask about recent illness history, and upon hearing that I'd had a cold with an ear infection several months earlier, that's what sealed the deal for him in terms of the toxic synovitis. Had I volunteered that detail, rather than sticking to the questions it asked me, it's possible it would have led it even farther from Lyme.

So, as that research demonstrates, what information is provided vs. left out can make a big difference. Also, this was 30 years ago, and I don't know what a similar AI would have been able to do with the clinical data available at that time. Of course there was no ChatGPT then, but that isn't a reason to fault it now--the point is that the epidemiology of Lyme disease has had three decades to accumulate more prevalence data since then, and yet it still suggests that Lyme is unlikely for such a case given the location.

I bet it would have gone back to putting Lyme at the top if I'd mentioned that the joint swelling recurred several times over the next 5 years, with lymph node tenderness and fatigue the last time. That was when my mom, using the resources available at the time, began to strongly suspect Lyme, even though the doctors still didn't. It would have been hard to do this justice in one ChatGPT conversation, though, because over those 5-6 years I of course had many interactions with doctors that were NOT about this--so retroactively cherry-picking all of it together would have amounted to leading the model if not done carefully.
 
I got RMSF and Lyme 10 years ago. The RMSF went undiagnosed for 9 years and the Lyme undiagnosed for 8 years. I had ER visits and labs at the time. I was consistently diagnosed with anxiety despite fainting, lesions, joint pain, Hashimoto's, epilepsy (I also had Bartonella), etc. Last year, I fed all my symptoms and timeline at the time of infection, ER labs, and doctors' comments into ChatGPT. It took two additional questions and five minutes to come up with "tick-borne" disease, which I had never heard of when I was infected. Having a lot of experience with medical issues (my husband died of cancer), I know how to ask my questions. I have been very satisfied with the help I have received for my post-infection immune dysregulation. I don't take anything it says as gospel, but it certainly has helped me figure things out and points me in the right direction, and I feel it has been FAR more useful than real doctors.
 