
As artificial intelligence continues to permeate various sectors, its application in healthcare has garnered significant attention. A recent study published in Frontiers in Digital Health assessed the accuracy of several AI chatbots (ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity) in providing evidence-based health advice, specifically focusing on lumbosacral radicular pain.
Study Overview
The researchers posed nine clinical questions about lumbosacral radicular pain, each derived from established clinical practice guidelines (CPGs), to the six chatbots. Responses were evaluated on three dimensions: text consistency, intra- and inter-rater reliability of the scoring, and the match rate between each chatbot's advice and CPG recommendations.
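To illustrate the match-rate metric, the sketch below scores a set of hypothetical per-question ratings against CPG recommendations. The ratings and the binary scoring rubric are assumptions made for demonstration, not the study's actual data or methodology.

```python
# Minimal sketch of a CPG match-rate calculation.
# The ratings below are hypothetical examples, not the study's data,
# and the real scoring rubric may be more nuanced than a binary match.

# For each of the nine clinical questions, a rater marks whether the
# chatbot's answer matched the CPG recommendation (True) or not (False).
ratings = {
    "Q1": True,  "Q2": False, "Q3": True,
    "Q4": True,  "Q5": False, "Q6": True,
    "Q7": False, "Q8": True,  "Q9": True,
}

match_rate = sum(ratings.values()) / len(ratings)
print(f"Match rate: {match_rate:.0%}")  # 6 of 9 questions -> 67%
```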
Key Findings
- Perplexity demonstrated the highest adherence to CPGs, with a match rate of 67%.
- Google Gemini followed closely with a 63% match rate.
- Microsoft Copilot achieved a 44% match rate.
- ChatGPT-3.5, ChatGPT-4o, and Claude each had a 33% match rate, indicating a substantial gap between their advice and established guidelines.
The study also highlighted variability in the internal consistency of AI-generated responses, ranging from 26% to 68%. Intra-rater reliability was generally high, with ratings varying from “almost perfect” to “substantial.” Inter-rater reliability also showed variability, ranging from “almost perfect” to “moderate.”
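The qualitative labels quoted above ("almost perfect," "substantial," "moderate") correspond to the Landis and Koch benchmarks commonly used to interpret Cohen's kappa. The sketch below shows how inter-rater agreement might be computed and mapped to those labels; the two raters' scores are illustrative, not taken from the study.

```python
# Sketch of inter-rater agreement via Cohen's kappa, mapped to the
# Landis & Koch (1977) labels quoted above. The two raters' scores
# are illustrative, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_a = ["match", "no match", "match", "match", "no match",
           "match", "no match", "match", "match"]
rater_b = ["match", "no match", "match", "no match", "no match",
           "match", "no match", "match", "match"]

kappa = cohen_kappa_score(rater_a, rater_b)

# Landis & Koch benchmarks for interpreting kappa
bands = [(0.81, "almost perfect"), (0.61, "substantial"),
         (0.41, "moderate"), (0.21, "fair"), (0.00, "slight")]
label = next(name for cutoff, name in bands if kappa >= cutoff)
print(f"kappa = {kappa:.2f} ({label})")  # kappa = 0.77 (substantial)
```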
Implications for Healthcare Professionals
The findings underscore the necessity for healthcare professionals to exercise caution when considering AI-generated health advice. While AI chatbots can serve as supplementary tools, they should not replace professional judgment. The variability in accuracy and adherence to clinical guidelines suggests that AI-generated recommendations may not always be reliable.
For allied health professionals, including speech-language pathologists, occupational therapists, and physical therapists, AI chatbots can provide valuable information. However, it is crucial to critically evaluate AI-generated content and cross-reference it with current clinical guidelines and one's own professional expertise.
Conclusion
While AI chatbots have the potential to enhance healthcare delivery by providing quick access to information, their current limitations in aligning with evidence-based guidelines necessitate a cautious approach. Healthcare professionals should leverage AI tools to augment their practice, ensuring that AI-generated advice is used responsibly and in conjunction with clinical expertise.
