Could You Be Talking to an AI Doctor?

Think back to your last telehealth visit with a doctor. Perhaps your kid had a persistently high fever, or you had worrying chest pain.

Are you sure you were interacting with a human? What makes you sure?

Perhaps the doctor listened attentively to your symptoms, asked pertinent questions, and even picked up on subtle cues in your language that hinted at the severity of your condition. 

Google's AMIE (Articulate Medical Intelligence Explorer) went head-to-head with primary care physicians (PCPs) in hundreds of telehealth visits. Some visits were handled by PCPs and some by AMIE. Compared to human doctors, AMIE was rated as equally accurate while being more empathetic.

AMIE represents a subtle but important shift from Google's previous medical question-answering systems, Med-PaLM and Med-PaLM 2. Those earlier models were trained on vast amounts of medical literature and could give direct, accurate answers to a wide range of medical questions, but they were more akin to sophisticated medical search engines. They could give you the facts, yet they lacked the ability to engage in the kind of back-and-forth conversation that's so essential to doctor-patient interactions.

AMIE aims to change that. It has been trained to engage in genuine back-and-forth dialogue. It doesn't just dispense facts; it asks follow-up questions, guides the conversation, and works towards a potential diagnosis, much like a real physician would. This interactive approach opens up possibilities for AI-assisted tasks like taking patient histories and making preliminary diagnoses.

The AMIE system is built on PaLM 2 (one of Google’s flagship LLMs), then fine-tuned with medical datasets. These datasets contain a mix of the following (a rough sketch of how such a mixture might be assembled appears after the list):

  • Medical Reasoning: Questions (like those on the US Medical Licensing Exam), plus expert-crafted explanations.

  • Long-form Medical QA: Expert answers to open-ended questions.

  • Medical Summarization: Summarizing clinician notes from electronic health records.

  • Real-world Dialogue: Transcripts of patient-doctor visits.
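
To make the shape of that mixture concrete, here is a minimal sketch of how examples from these four sources might be sampled into fine-tuning batches. The component names, sampling weights, and example strings are illustrative assumptions, not details published for AMIE.

```python
# Hypothetical sketch of a fine-tuning data mixture; names, weights, and
# examples are illustrative assumptions, not values from the AMIE paper.
import random
from dataclasses import dataclass

@dataclass
class MixtureComponent:
    name: str                          # dataset type, e.g. medical reasoning
    examples: list[tuple[str, str]]    # (prompt, target) text pairs
    weight: float                      # relative sampling probability

def sample_batch(components: list[MixtureComponent], batch_size: int) -> list[tuple[str, str]]:
    """Draw a training batch according to the mixture weights."""
    weights = [c.weight for c in components]
    return [
        random.choice(random.choices(components, weights=weights, k=1)[0].examples)
        for _ in range(batch_size)
    ]

mixture = [
    MixtureComponent("medical_reasoning",   [("USMLE-style question ...", "expert explanation ...")], 0.3),
    MixtureComponent("long_form_qa",        [("open-ended health question ...", "expert answer ...")], 0.2),
    MixtureComponent("note_summarization",  [("clinician note from an EHR ...", "summary ...")], 0.2),
    MixtureComponent("real_world_dialogue", [("patient-doctor transcript ...", "next doctor turn ...")], 0.3),
]
batch = sample_batch(mixture, batch_size=8)
```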

AI Doctors Learning to Diagnose

It's hard to scale up training on real doctor-patient chats — they're messy and can have privacy issues. To get around this, AMIE's developers created a simulated environment where it plays both doctor and patient in thousands of conversations.

Here's how it works.

First, AMIE generates realistic patient scenarios, called "vignettes". These vignettes include details like the patient's age, gender, medical history, and current symptoms. For example, AMIE might create a vignette for a 45-year-old woman with a history of high blood pressure who's experiencing chest pain and shortness of breath.
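
One way to picture a vignette is as a small structured record like the one below. The field names and example values are illustrative guesses at the kind of information involved, not the researchers' actual schema.

```python
# Hypothetical vignette record; field names and values are illustrative.
from dataclasses import dataclass

@dataclass
class PatientVignette:
    age: int
    gender: str
    medical_history: list[str]
    current_symptoms: list[str]
    ground_truth_diagnosis: str   # the known answer, used later to critique the diagnostic conversation

vignette = PatientVignette(
    age=45,
    gender="female",
    medical_history=["high blood pressure"],
    current_symptoms=["chest pain", "shortness of breath"],
    ground_truth_diagnosis="acute coronary syndrome",  # example label only
)
```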

Next, multiple instantiations of the same AMIE model take on different roles to simulate a diagnostic conversation (a rough sketch of this loop follows the list):

  • Patient AMIE responds to questions based on the details in the vignette.

  • Doctor AMIE asks questions and tries to figure out what's wrong.

  • Moderator AMIE keeps the conversation on track and decides when it should end.

  • Critic AMIE knows the correct diagnosis for the vignette and critiques the interaction.
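
Putting the first three roles together, the inner loop of self-play might look something like the sketch below. Everything here is an assumption for illustration: `query_model` stands in for a call to the AMIE model, and the role prompts, stopping rule, and feedback hook (used in the refinement loop later) are not taken from the paper.

```python
# Minimal sketch of the inner self-play loop. query_model is a hypothetical
# placeholder for a call to the underlying model; prompts and stopping rule
# are illustrative assumptions.
def query_model(role_prompt: str, transcript: list[str]) -> str:
    """Placeholder: prompt the model with a role description plus the dialogue so far."""
    raise NotImplementedError

def simulate_consultation(vignette, feedback: str = "", max_turns: int = 20) -> list[str]:
    """Play out one Doctor-Patient conversation, moderated until it reaches a natural end."""
    doctor_prompt = (
        "You are a primary care physician taking a history and working toward a diagnosis. "
        + (f"Apply this critique of your previous attempt: {feedback}" if feedback else "")
    )
    patient_prompt = f"You are the patient described by this vignette: {vignette}"
    moderator_prompt = "If the consultation has reached a natural end, answer END; otherwise CONTINUE."

    transcript: list[str] = []
    for _ in range(max_turns):
        transcript.append("Doctor: " + query_model(doctor_prompt, transcript))    # Doctor AMIE
        transcript.append("Patient: " + query_model(patient_prompt, transcript))  # Patient AMIE
        if query_model(moderator_prompt, transcript).strip() == "END":            # Moderator AMIE
            break
    return transcript
```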

Once a conversation has played out, the Critic AMIE analyzes the Doctor AMIE's performance, providing specific feedback on questions such as:

  • Did the Doctor AMIE show empathy and build rapport with the patient?

  • Were the Doctor AMIE's questions clear and relevant, or were there unnecessary repetitions?

  • Is the Doctor AMIE asking the right questions to home in on the correct diagnosis?

Armed with this feedback, the Doctor AMIE gets to try again with the same patient vignette. And then again. Each time, it learns from its mistakes and incorporates the Critic AMIE's advice to become a better diagnostician.

This self-play process allows AMIE to practice and refine its conversational and diagnostic skills in an automated way. By the end of this training, AMIE has developed an ability to ask the right questions, interpret the answers, and arrive at accurate diagnoses.
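
A rough sketch of that refinement loop, reusing `query_model` and `simulate_consultation` from the earlier snippet, might look like this. The critic prompt and the number of rounds are again illustrative assumptions rather than the paper's implementation.

```python
# Rough sketch of the outer self-play refinement loop; the critic prompt and
# round count are illustrative. Reuses query_model and simulate_consultation
# from the earlier sketch.
def refine_on_vignette(vignette, rounds: int = 3) -> list[list[str]]:
    critic_prompt = (
        "You know the correct diagnosis for this vignette. Critique the doctor's "
        "empathy, question quality, and diagnostic reasoning, and suggest improvements."
    )
    attempts: list[list[str]] = []
    feedback = ""
    for _ in range(rounds):
        transcript = simulate_consultation(vignette, feedback)  # Doctor retries, conditioned on the critique
        feedback = query_model(critic_prompt, transcript)       # Critic AMIE reviews the dialogue
        attempts.append(transcript)
    # In the real system, dialogues like these are folded back into fine-tuning
    # data, so the Doctor role improves across many vignettes over time.
    return attempts
```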

Putting AMIE to the Test

So, how well does AMIE actually perform as a diagnostician? To find out, the researchers put it through its paces in an online version of an Objective Structured Clinical Examination (OSCE).

In a typical OSCE, medical students or residents rotate through a series of stations where they interact with simulated patients (actors trained to portray specific clinical scenarios). The students are evaluated on their ability to gather information, make diagnoses, and communicate effectively with patients. It's a rigorous test of clinical skills that's widely used in medical education and licensing.

For AMIE's OSCE, the researchers created a chat-based platform where human doctors and AMIE could interact with patient actors across 149 clinical scenarios (effectively simulating a telehealth visit). These scenarios spanned various medical specialties and levels of diagnostic complexity. After the test, specialist physicians blindly rated both AMIE's and the physicians' consultations across multiple axes.
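
As a rough picture of how such a blinded comparison might be scored, the snippet below averages specialist ratings per axis for each arm. The axis names, the 1-to-5 scale, and every score shown are placeholder values, not data from the study.

```python
# Illustrative aggregation of blinded specialist ratings.
# Axis names, the 1-5 scale, and all scores are placeholders, not study data.
from collections import defaultdict
from statistics import mean

# Each record: (arm, axis, score), with the rater blinded to which arm produced the transcript.
ratings = [
    ("AMIE", "diagnosis_quality", 4), ("PCP", "diagnosis_quality", 4),
    ("AMIE", "empathy", 5),           ("PCP", "empathy", 4),
    ("AMIE", "management_plan", 4),   ("PCP", "management_plan", 4),
]

per_axis: defaultdict[tuple[str, str], list[int]] = defaultdict(list)
for arm, axis, score in ratings:
    per_axis[(arm, axis)].append(score)

for (arm, axis), scores in sorted(per_axis.items()):
    print(f"{arm:4s} {axis:18s} mean rating = {mean(scores):.1f}")
```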

The results? AMIE held its own against the human physicians, achieving diagnostic accuracy on par with the doctors' average performance. In some cases, AMIE even outperformed the physicians, particularly in scenarios involving rare diseases or complex presentations.

But AMIE's performance wasn't just about getting the diagnosis right. The researchers also found that AMIE was able to ask relevant, focused questions to efficiently gather key information. It provided clear explanations of its reasoning and offered thoughtful next steps for evaluation and treatment. In other words, AMIE demonstrated many of the hallmarks of a skilled clinician.

AMIE's OSCE success comes with caveats. Patient actors aren't the same as real patients. Doctors using an unfamiliar chat interface might not perform their best. And text-only interaction misses crucial diagnostic information from physical exams, lab tests, imaging studies, and nonverbal cues that doctors rely on every day.

But even with these limitations, AMIE's performance in this rigorous test is nothing short of remarkable. It's a testament to the incredible potential of artificial intelligence in healthcare — a field where the stakes couldn't be higher and the need for innovation has never been greater.

Imagine a future where AI systems like AMIE are integrated into every aspect of patient care. Where AI-powered triage systems guide patients to the right level of care, whether that's self-care at home, a telehealth visit with an AI or human physician, or an in-person appointment for more serious concerns. Where AI assistants work alongside doctors, taking histories, suggesting differential diagnoses, and helping to create personalized treatment plans. Where AI monitors patients remotely, alerting doctors to early signs of complications or deterioration.

This is a future that AMIE brings us one step closer to realizing. But it's also a future that raises profound questions about the role of AI in medicine. How do we ensure that AI systems are safe, effective, and unbiased? These are not easy questions to answer. But they are questions we must grapple with as AI continues to transform the practice of medicine. AMIE is just the beginning; the version we see today is the worst this technology will ever be. It's a future that's both exciting and humbling.

– Vishnu Bashyam, ML Researcher @ Hop