Two leading artificial intelligence assistants fell short on most questions about women’s health, according to a test designed by medical professionals. The evaluation found that popular chatbots gave inadequate guidance in a majority of cases, raising fresh concerns over the use of general-purpose AI for medical information.
The assessment, conducted recently by a team of clinicians, compared answers from widely used systems to medically sound guidance. The team reported high rates of incomplete or unsuitable advice on common topics. The finding arrives as more people turn to chatbots for quick health answers at home and on mobile devices.
According to the evaluation, AI models such as ChatGPT and Gemini failed to give adequate advice for 60 per cent of queries relating to women’s health in the test created by medical professionals.
A High Error Rate on Key Topics
The headline figure, missed or inadequate advice on 60 per cent of questions, suggests a gap between consumer expectations and what chatbots can safely deliver. The test did not classify the systems as unsafe overall, but it showed clear weaknesses when questions touched on symptoms, treatment options, and when to seek in-person care.
Experts say this result may reflect how these tools are built. Large language models can summarize broad information but are not tailored to individual medical contexts. They can appear confident even when evidence is thin or contradictory.
Women’s health is a wide field, ranging from menstrual pain and fertility to menopause and the way heart disease presents in women. Many of these areas already face historic data gaps in research and clinical trials, and those gaps may compound the limits of training data and of the guidance general-purpose AI generates.
Why It Matters for Patients
People often use chatbots for reassurance or next steps. If the guidance is incomplete, patients may delay care or try unsuitable remedies. In urgent situations, that delay can raise risks.
Consumer-facing AI tools usually include warnings that they are not medical devices and that users should consult clinicians. Those notices can be easy to miss. A plain-language reminder at the start of every medical exchange could help set safer expectations.
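One way to make that reminder concrete is a thin wrapper around whatever model generates replies. The Python sketch below is purely illustrative: generate_reply, the keyword list, and the reminder text are all hypothetical stand-ins, and the substring check is a crude placeholder for a real topic classifier, not any vendor's actual behavior.

```python
# Illustrative sketch only: prepend a plain-language safety reminder to
# every exchange a crude keyword check flags as medical. The keyword
# list and generate_reply are hypothetical stand-ins.

MEDICAL_KEYWORDS = ("symptom", "pain", "medication", "pregnan", "bleeding", "dose")

SAFETY_REMINDER = (
    "Reminder: this is general information, not medical advice. "
    "For personal health decisions, consult a qualified clinician."
)

def looks_medical(question: str) -> bool:
    """Crude substring check standing in for a real topic classifier.

    "pregnan" deliberately matches both "pregnant" and "pregnancy".
    """
    lowered = question.lower()
    return any(keyword in lowered for keyword in MEDICAL_KEYWORDS)

def answer_with_reminder(question: str, generate_reply) -> str:
    """Wrap any reply generator so medical exchanges open with the reminder."""
    reply = generate_reply(question)
    if looks_medical(question):
        return f"{SAFETY_REMINDER}\n\n{reply}"
    return reply
```

Because the reminder is attached outside the model, it appears even when the generated answer itself omits any caveat.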
Clinicians who reviewed the results say high-level explanations are sometimes helpful for education. But complex conditions, medication interactions, pregnancy-related issues, and red-flag symptoms require personal medical advice.
How Developers and Clinicians Respond
Health systems and AI developers have taken different approaches. Some hospitals test custom chatbots trained on vetted guidelines and local referral rules. These systems often restrict answers to approved content and flag emergencies.
General-purpose tools prioritize wide usability. They improve with updates but are not consistently tuned for sensitive health topics. Developers can reduce risk by tightening refusals, adding citations, and routing high-risk questions to standard clinical advice.
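That routing step can be sketched in a few lines. The example below is a toy illustration, not any developer's real safety layer: the red-flag phrases and escalation text are invented for the sketch and are not validated triage criteria. It shows only the shape of the control: match a high-risk phrase, return fixed clinical advice instead of generated text.

```python
# Toy sketch of high-risk routing: red-flag queries get a fixed
# escalation message rather than generated text. The phrase list and
# message are illustrative assumptions, not clinical triage criteria.

RED_FLAGS = (
    "chest pain", "heavy bleeding", "severe headache",
    "can't breathe", "fainted", "suicidal",
)

ESCALATION_MESSAGE = (
    "These symptoms may need urgent, in-person care. Please contact "
    "emergency services or a clinician now instead of relying on this chat."
)

def route_query(question: str, generate_reply):
    """Return (reply, escalated) so callers can audit high-risk routing."""
    lowered = question.lower()
    if any(flag in lowered for flag in RED_FLAGS):
        return ESCALATION_MESSAGE, True  # refuse to speculate; escalate
    return generate_reply(question), False
```

Returning an escalation flag alongside the reply also gives developers a simple audit trail of how often high-risk questions are diverted.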
Clinicians recommend closer collaboration with professional bodies. Medical reviewers can help define safe output formats, such as step-by-step triage prompts, plain warnings, and links to trusted resources.
Trends and What to Watch
Regulators are studying how best to oversee health-related AI without blocking useful education tools. Clear labels, audit trails, and post-release monitoring are under discussion in many markets. The balance between innovation and safety will shape how consumers use these tools.
Industry groups are also exploring benchmarks specific to women’s health. Transparent test sets and public scorecards could push improvements and help users compare tools.
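In code, such a scorecard could be little more than a shared question set plus rubric checks. The harness below is a hypothetical sketch: the two sample cases and their "must mention" terms are invented placeholders for what would, in practice, be clinician-written criteria.

```python
# Hypothetical benchmark harness: run a shared test set through a tool
# and report the share of answers satisfying every rubric check. The
# cases below are invented placeholders, not a real clinical rubric.

from typing import Callable

TEST_SET = [
    {"question": "When should heavy periods be checked by a doctor?",
     "must_mention": ["doctor", "anaemia"]},
    {"question": "What are red-flag symptoms in early pregnancy?",
     "must_mention": ["bleeding", "pain"]},
]

def score_tool(ask: Callable[[str], str]) -> float:
    """Return the fraction of answers meeting all rubric terms."""
    passed = 0
    for case in TEST_SET:
        answer = ask(case["question"]).lower()
        if all(term in answer for term in case["must_mention"]):
            passed += 1
    return passed / len(TEST_SET)
```

On a harness like this, a score of 0.4 would mirror the study's finding that 60 per cent of answers fell short; publishing both the test set and each tool's score would let users compare systems on equal footing.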
Clinicians and industry groups point to several practical safeguards:
- Build domain-specific datasets that reflect women’s health priorities.
- Require visible disclaimers and clear “seek care now” triggers.
- Provide citations to clinical guidelines where possible.
- Limit speculative diagnoses and default to in-person care for red flags.
- Enable easy reporting of harmful or misleading answers.
The new findings send a practical message. General chatbots can help answer simple questions and explain terms, but they should not guide personal medical decisions. Clinicians advise users to treat AI as a starting point, not a final answer.
Next steps will likely include targeted training on women’s health, stronger safety controls, and independent audits. Readers should watch for transparent updates from developers and for external evaluations using clear, public criteria. Until then, the safest path is simple: use AI for education, and confirm care decisions with a qualified professional.
Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With his deep expertise in the tech industry and his passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation. Reach out to Rashan at [email protected]