ChatGPT, Gemini and similar tools are increasingly being used as health advisors. Questions like “I have a headache – what could be the cause?” or “My shoulder hurts – when should I see a doctor?” are now routine for these chatbots. But a new study from the Massachusetts Institute of Technology (MIT) shows that not all users receive the same answers to these common queries.
Published on June 23, the study titled "The Medium is the Message: How Non-Clinical Information Shapes Clinical Decisions in LLMs" explores how seemingly irrelevant factors – like tone, writing style or formatting – can influence the medical advice given by AI systems.
To measure how much language and style affect AI chatbot decisions, the researchers built a "perturbation framework." This tool allowed them to create different versions of the same medical query – altered to include elements like uncertainty, dramatic wording, typos or inconsistent capitalization. They then tested these variations on four large language models: GPT-4, LLaMA-3-70B, LLaMA-3-8B and Palmyra-Med – a model designed specifically for medical use.
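To make that setup concrete, here is a minimal, hypothetical sketch of what such a perturbation pipeline could look like. The example query, the perturbation names and the helper functions are illustrative assumptions for this article, not the study's actual code.

```python
import random

# Hypothetical sketch of a perturbation framework like the one described above.
# The base query and the specific perturbations are illustrative only.

BASE_QUERY = "I have had a headache for three days. Should I see a doctor?"

def add_typos(text, rate=0.05, seed=0):
    """Swap adjacent characters in a small fraction of words to simulate typos."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def add_uncertainty(text):
    """Prepend hedging language to mimic an unsure writer."""
    return "I'm not sure if this matters, but " + text[0].lower() + text[1:]

def dramatize(text):
    """Append emotionally charged framing."""
    return text + " I'm really scared something is seriously wrong."

def scramble_case(text, seed=0):
    """Randomly flip character case to simulate inconsistent capitalization."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < 0.2 else c.lower() for c in text)

PERTURBATIONS = {
    "baseline": lambda t: t,
    "typos": add_typos,
    "uncertain": add_uncertainty,
    "dramatic": dramatize,
    "odd_caps": scramble_case,
}

def build_variants(query):
    """Return one perturbed variant of the query per perturbation."""
    return {name: fn(query) for name, fn in PERTURBATIONS.items()}

if __name__ == "__main__":
    for name, variant in build_variants(BASE_QUERY).items():
        # Each variant would then be sent to the models under test
        # (e.g. GPT-4, LLaMA-3-70B) and the triage recommendations compared.
        print(f"[{name}] {variant}")
```

The point of such a pipeline is that every variant carries the same medical content; only the surface form changes, so any difference in the models' recommendations must come from non-clinical cues.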
Particularly affected: Women, non-binary people, non-tech users and non-native speakers
The findings of the MIT study are clear: the way a person writes can significantly affect the medical advice they receive from AI chatbots. Depending on their writing style or tone, some users were more likely to be steered toward self-care and away from professional treatment. One of the most striking results: women were more often told to manage symptoms on their own and were less frequently advised to see a doctor, even when the medical content of their query was identical.
People who write in a hesitant tone, use simple language or make occasional typos also seem to be at a disadvantage. This often affects non-experts, those with limited health knowledge or individuals with weaker language skills, especially non-native speakers.
The researchers emphasize that before AI systems can be widely used in healthcare, they must be thoroughly tested – not just on average, but across different user groups. Average accuracy alone says little about a model's fairness or reliability, especially when users express themselves in ways that differ from the norm.
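A small, purely illustrative calculation shows why average accuracy can be misleading. The numbers below are invented for this article, not results from the study; they only demonstrate how a respectable aggregate score can hide a sizeable gap between writing-style groups.

```python
# Invented numbers: a decent-looking average masking a per-group disparity.
group_accuracy = {"neutral tone": 0.90, "hesitant tone": 0.78, "with typos": 0.75}
group_counts = {"neutral tone": 600, "hesitant tone": 250, "with typos": 150}

average = sum(group_accuracy[g] * group_counts[g] for g in group_counts) / sum(group_counts.values())
print(f"average accuracy: {average:.2f}")   # looks acceptable in aggregate
for group, acc in group_accuracy.items():
    print(f"{group:>14}: {acc:.2f}")        # the per-group view reveals the gap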
YouTube: Between praise and unease
In an accompanying YouTube video, the study is praised for its smart and realistic design – but the findings are described as "disturbing" and even "chilling." The idea that superficial factors like tone or formatting can influence medical advice runs counter to the common belief that AI is objective and neutral.