Researchers design a new way to more reliably evaluate AI models’ ability to make clinical decisions in scenarios that closely mimic real-life patient interactions.
The analysis finds that large language models excel at making diagnoses from exam-style questions but struggle to do so from conversational notes.
The researchers propose a set of guidelines to optimize AI tools’ performance and align them with real-world practice before integrating them into the clinic.