From GPT-4 to Med-PaLM, from vocabulary adaptation to agentic AI — understanding where LLMs work, where they fail, and how to build on them for your own medical AI research.
Large Language Models are everywhere in AI research right now. If you’re attending any top-tier conference — NeurIPS, ACL, AAAI, MICCAI — you’ll find that a significant chunk of accepted papers involve LLMs in some capacity.
But here’s the thing: the hype and the reality are often very different, especially in medicine.
I’ve been working at the intersection of NLP and medicine throughout my PhD at IIT Kharagpur and now as a postdoc at Stanford Medicine. When I co-delivered a tutorial on “Building Trustworthy AI Models for Medicine” at WSDM 2025 in Hannover, one theme dominated the discussion: the gap between impressive benchmark results and clinical deployment. That gap — and how to navigate it — is what this article is about.
This article is for AI researchers who want to understand the LLM landscape in clinical medicine — not just what models exist, but what problems they actually solve, where the open research gaps are, and how you can contribute.
Table of Contents
- Why Medical LLMs Are Different: The Domain Adaptation Problem
- The Development Landscape: Five Approaches to Building Medical LLMs
- The Key Models You Should Know
- Clinical Tasks Where LLMs Show Real Promise
- The Hard Problems: Where LLMs Still Struggle
- The Shift Toward Reliability, Transparency, and Agentic AI
- Evaluation Challenges Specific to Medical LLMs
- How to Start Your Own LLM-Based Medical AI Research
- Datasets and Benchmarks to Know
- A Research Idea to Get You Started
- Common Pitfalls to Avoid
- Final Words
Why Medical LLMs Are Different: The Domain Adaptation Problem
Before we dive in, let me address the core challenge that makes medical LLM research fundamentally different from general-domain work.
In my WSDM 2025 tutorial, I spent significant time on this exact point. The issues with medical downstream tasks are structural:
Different data distribution. Clinical text doesn’t look like Wikipedia or Reddit. It’s full of abbreviations, jargon that varies across hospitals and even departments, and temporal reasoning that general models handle poorly. “SOB” in a clinical note means “shortness of breath,” not what you might think.
High vocabulary mismatch. This is something I’ve studied directly in my own research. Open-domain language models tokenize medical terminology poorly. Take a word like “erythromycin” — BERT, BART, and PEGASUS all fragment it into 4-5 subword tokens (like “er ##yt ##hr ##omy ##cin”). This excessive splitting has real consequences: in the encoder, it leads to poor representations; in the decoder, the model wastes generation capacity reconstructing medical words token by token instead of generating content.
This vocabulary mismatch was the motivation for our MEDVOC work, published at IJCAI 2024, where we showed that adapting the model vocabulary during fine-tuning significantly improves medical text summarization.
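To see the mismatch concretely, here is a toy, greedy WordPiece-style tokenizer. The two vocabularies are made up for illustration (they are not any real model's vocabulary), but the fragmentation behavior mirrors what you observe with BERT-family tokenizers on drug names:

```python
# Toy illustration of the vocabulary-mismatch problem: a greedy,
# longest-match (WordPiece-style) tokenizer with a general-domain
# vocabulary fragments medical terms into many subword pieces.

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match subword tokenization (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # no piece matches at all
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# A general-domain vocabulary knows common English but not drug names.
general_vocab = {"head", "##ache", "er", "##yt", "##hr", "##omy", "##cin",
                 "fever", "the", "patient"}
# A medically adapted vocabulary contains the full term as one token.
medical_vocab = general_vocab | {"erythromycin"}

print(wordpiece_tokenize("erythromycin", general_vocab))
# → ['er', '##yt', '##hr', '##omy', '##cin']  (5 fragments)
print(wordpiece_tokenize("erythromycin", medical_vocab))
# → ['erythromycin']  (1 token after vocabulary adaptation)
```

The single-token result on the right is exactly what vocabulary adaptation buys you: the model spends its capacity on content, not on reassembling the drug name.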
Limited or no labeled datasets. Medical datasets are small (often 700-2,000 data points for a specific task), annotation requires clinical experts, and data access requires CITI training and IRB approval. See my article on working with clinical data for more on navigating this.
Evaluation requires medical experts. You can’t just compute ROUGE scores. Clinical accuracy assessment needs physicians — which means building relationships with clinical collaborators.
The Development Landscape: Five Approaches to Building Medical LLMs
One of the most useful frameworks I present in my talks is a taxonomy of the different techniques for developing AI models for medicine. Let me walk you through each.
1. Prompting (No Training Required)
The simplest approach. You use a general-purpose LLM (GPT-4, Claude, Gemini) and craft prompts with medical context. The key advancement here is MedPrompt, which showed that an ensemble of prompting strategies can sometimes match or beat fine-tuned medical models.
MedPrompt uses three techniques together: dynamic few-shot selection (finding the most relevant examples via cosine similarity of embeddings), self-generated chain-of-thought (having the LLM produce its own reasoning chains), and choice-shuffling ensemble (reducing position bias in multiple-choice tasks by shuffling answer options and checking consistency).
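The choice-shuffling idea is simple enough to sketch end to end. In this minimal version, `ask_llm` is a stand-in for a real model call, and the toy model below is deliberately position-invariant so the vote is unanimous; this is an illustration of the mechanics, not the MedPrompt implementation:

```python
# Sketch of MedPrompt-style choice shuffling: ask the same multiple-
# choice question with the options in several random orders, map each
# answer back to its original label, and take a majority vote.
import random
from collections import Counter

def choice_shuffle_ensemble(ask_llm, question, options, n_shuffles=5, seed=0):
    """options: dict like {'A': 'aspirin', 'B': 'erythromycin', ...}."""
    rng = random.Random(seed)
    labels = sorted(options)
    votes = []
    for _ in range(n_shuffles):
        texts = [options[label] for label in labels]
        rng.shuffle(texts)                       # new ordering each round
        shuffled = dict(zip(labels, texts))
        picked_label = ask_llm(question, shuffled)
        picked_text = shuffled[picked_label]
        # Map the chosen answer text back to its original label.
        original = next(l for l, t in options.items() if t == picked_text)
        votes.append(original)
    return Counter(votes).most_common(1)[0][0]

# Toy "model": always picks the option containing "erythromycin",
# regardless of where it appears in the shuffled list.
def toy_llm(question, shuffled):
    return next(l for l, t in shuffled.items() if "erythromycin" in t)

answer = choice_shuffle_ensemble(
    toy_llm, "Which drug is a macrolide antibiotic?",
    {"A": "aspirin", "B": "erythromycin", "C": "warfarin", "D": "insulin"})
print(answer)  # → 'B'
```

With a real LLM, disagreement across shuffles is the interesting signal: it tells you the model's answer depends on option position, which the ensemble vote then averages out.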
The finding that surprised many people: CoT rationales generated by GPT-4 were actually longer and provided finer-grained step-by-step reasoning than the examples handcrafted by clinical experts for Med-PaLM 2.
When to use prompting: When you have limited compute, need quick baselines, or are working with proprietary models. It’s also a strong baseline that every medical LLM paper should compare against.
2. Instruction Tuning
Instruction tuning means fine-tuning a language model on a collection of tasks described via instructions. The insight from the ICLR 2022 paper by Wei et al. is that instruction-tuned models improve their zero-shot performance on unseen tasks.
Med-PaLM used instruction prompt-tuning on Flan-PaLM, with instructions and exemplars from a panel of qualified clinicians. This was a data-efficient alignment strategy that didn’t require massive amounts of medical data.
3. Continual Pretraining
Here you take a general-purpose LLM and continue pretraining it on a large medical corpus. MEDITRON (from EPFL) is the flagship example — they continued pretraining Llama on 48.1 billion tokens from clinical guidelines (WHO, WikiDoc, NICE, CDC), 16.1 million PubMed abstracts, and 5 million full-text PubMed Central papers.
A critical detail that many researchers miss: you need a replay dataset to avoid catastrophic forgetting. MEDITRON included 420 million tokens randomly sampled from the general-domain RedPajama dataset (just 1% of the total corpus) to maintain the model’s general capabilities. Learning rate re-warming and re-decaying is another important technique for stable continual pretraining.
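A minimal sketch of what replay mixing looks like in practice, assuming document-level mixing (real pipelines operate on token streams, and the corpus names below are just placeholders):

```python
# Sketch of replay mixing for continual pretraining: interleave a small
# fraction of general-domain data (1% here, matching MEDITRON's ratio)
# into the medical training stream to mitigate catastrophic forgetting.
import random

def mix_with_replay(medical_docs, general_docs, replay_ratio=0.01, seed=0):
    rng = random.Random(seed)
    mixed = []
    for doc in medical_docs:
        mixed.append(("medical", doc))
        # Insert a replay doc with probability r/(1-r) so that replay
        # docs make up fraction r of the final stream in expectation.
        if rng.random() < replay_ratio / (1 - replay_ratio):
            mixed.append(("general", rng.choice(general_docs)))
    return mixed

stream = mix_with_replay(
    medical_docs=[f"pubmed_{i}" for i in range(10_000)],
    general_docs=[f"redpajama_{i}" for i in range(1_000)])
replay_share = sum(src == "general" for src, _ in stream) / len(stream)
print(f"replay share ≈ {replay_share:.3f}")  # close to 0.01
```

The same principle applies at token granularity; what matters is that the general-domain fraction is small but nonzero throughout training, not concentrated at the start or end.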
4. Fine-tuning (Full and Parameter-Efficient)
Full fine-tuning of 7B or 13B models is often computationally infeasible for academic labs. LoRA (Low-Rank Adaptation) is the practical solution: instead of updating the entire weight matrix W, you represent the update as W = W₀ + ΔW, where ΔW = A·B (low-rank matrices). Only A and B are updated during fine-tuning, dramatically reducing memory requirements.
In our vocabulary adaptation work for LLMs, we used parameter-efficient fine-tuning of LoRA adapter layers combined with embedding matrix updates. This worked for Llama 2 7B, Mistral 7B, and Llama 3.1 8B models on medical summarization tasks.
5. Retrieval-Augmented Generation (RAG)
RAG combines a retriever with a generator — the model retrieves relevant medical knowledge from an external corpus before generating its response. This is a compound AI system rather than a monolithic model.
MedRAG (ACL Findings 2024) is the toolkit to know here. It improved the accuracy of six different LLMs by up to 18% over chain-of-thought prompting alone, and elevated GPT-3.5 and Mixtral performance to GPT-4 levels on medical QA benchmarks.
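The retrieve-then-generate loop can be sketched with nothing but bag-of-words cosine similarity. Real systems such as MedRAG use dense retrievers over large corpora; here the three-document corpus is invented and `prompt` is what you would hand to an actual LLM:

```python
# Minimal RAG sketch: retrieve the most relevant snippet from a small
# medical corpus via bag-of-words cosine similarity, then prepend it
# to the question as grounding context for generation.
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, corpus, k=2):
    q = Counter(query.lower().split())
    scored = sorted(corpus, reverse=True,
                    key=lambda d: cosine(q, Counter(d.lower().split())))
    return scored[:k]

corpus = [
    "Erythromycin is a macrolide antibiotic used for respiratory infections.",
    "Warfarin is an anticoagulant requiring INR monitoring.",
    "Metformin is first-line therapy for type 2 diabetes.",
]
query = "Which antibiotic class does erythromycin belong to?"
context = retrieve(query, corpus, k=1)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

Even this crude retriever grounds the answer in a verifiable source, which is the core reason RAG reduces (though does not eliminate) hallucination in medical QA.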
My practical advice: Don’t think of these approaches as competing alternatives. The best systems often combine them — for example, continual pretraining + vocabulary adaptation + LoRA fine-tuning, or RAG + prompting strategies.
The Key Models You Should Know
GPT-4 and GPT-4V (OpenAI)
GPT-4 demonstrated strong performance on medical licensing exams (USMLE), scoring well above the passing threshold. GPT-4V extended this to multimodal tasks. However, performance on actual clinical tasks — discharge summary generation, clinical coding, treatment recommendation — is more mixed. The model hallucinates medical facts, and its performance degrades on rare conditions.
Med-PaLM and Med-PaLM 2 (Google)
Med-PaLM 2 achieved expert-level performance on several medical QA benchmarks (published in Nature Medicine 2025). The key contribution was the evaluation methodology: Google worked with clinicians to assess not just accuracy but also potential for harm. Med-PaLM used instruction prompt-tuning with clinician-crafted exemplars — a data-efficient alignment strategy given the paucity of medical data for full fine-tuning.
MEDITRON (EPFL)
The leading example of continual pretraining for medicine. MEDITRON-70B was trained on 48.1B medical tokens with careful data curation across clinical guidelines, abstracts, and full-text papers, plus a replay dataset to prevent catastrophic forgetting.
Me-LLaMA (University of Florida)
A medical foundation LLM that incorporates high-quality domain-specific metadata. Their approach emphasizes the incorporation of structured medical knowledge during training.
Open-Source Clinical Models
Several open-source efforts worth your attention: ClinicalBERT/BioClinicalBERT (still relevant for classification and NER), BioMistral, OpenBioLLM, and PMC-LLaMA (pre-trained on PubMed Central). For research purposes, open-source models give you far more flexibility for fine-tuning, ablation studies, and inspecting model behavior.
Clinical Tasks Where LLMs Show Real Promise
Based on both the literature and what I covered in my WSDM 2025 tutorial, I organize medical NLP tasks into two categories:
AI for Automation:
- Medical note-taking and clinical documentation
- Reducing administrative overload for nurses and caregivers
- Literature and clinical trial search
- Medical question-answering and summarization
- Patient monitoring and alert generation
AI for Discovery:
- Drug discovery acceleration
- Novel disease subtyping (like our Parkinson’s disease work)
- Clinical trial design and patient matching
- Biomarker identification from clinical text
Clinical note summarization remains one of the most practically impactful applications — physicians spend a disproportionate amount of their day on documentation. In my own work on perspective-aware medical summarization, I found that even fine-tuned models struggle with maintaining clinical accuracy while producing fluent summaries. This remains an open problem with real clinical value.
The Hard Problems: Where LLMs Still Struggle
This is the section that matters most for researchers looking for meaningful problems to work on.
Hallucination in Clinical Contexts
LLMs generate plausible but fabricated medical facts. In clinical settings, this is unacceptable. Current mitigation strategies include RAG and grounding in clinical guidelines, but neither fully solves the problem. Reasoning hallucinations — where the model’s chain-of-thought appears sound but leads to a wrong conclusion — are particularly insidious and harder to detect.
Reasoning Instruction Following
This is a cutting-edge challenge. Recent work from Stanford (ReasonIF, October 2025) showed that fewer than 25% of reasoning traces in open-source large reasoning models actually comply with the given instructions. This is deeply problematic for clinical applications where following specific clinical guidelines and protocols is essential. Approaches being explored include multi-turn reasoning and reasoning instruction finetuning using synthetic data.
The Implementation Gap
As our WSDM 2025 tutorial co-presenter Dominik highlighted, there’s a stark gap between AI papers published and AI systems deployed in clinical practice. A systematic review of AI in the ICU found that most models remain in the testing and prototyping environment — patients don’t benefit because the majority never reach the bedside. This gap exists because of legal regulations, financial obstacles, missing interoperability between clinical data systems, and clinician acceptance challenges.
The cautionary tale of MYCIN (1970s) is instructive here: this AI system for diagnosing bacterial infections outperformed five general practitioners in a blinded evaluation (65% vs. 42.5-62.5% correct decisions) — yet it was never used in practice. Ethical concerns, lack of interoperability, missing user-friendliness, and fear that doctors would be replaced all contributed.
Integration with Structured Data
Clinical decision-making relies heavily on structured data — vital signs, lab values, medication lists. LLMs are fundamentally designed for text, and integrating structured clinical data remains challenging. Tools like MetaMap, QuickUMLS, and SemRep for extracting structured information from unstructured medical text (e.g., UMLS concepts) are important bridges, along with Python packages like clinspacy and medspacy.
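To give a feel for what these bridging tools do, here is a toy dictionary-based concept extractor in the spirit of QuickUMLS-style matching. The tiny lexicon and the CUI-like codes are illustrative only, not real UMLS content, and this is not the QuickUMLS API:

```python
# Toy concept extraction: scan clinical text for known terms and
# abbreviations and map them to concept identifiers. Real tools
# (MetaMap, QuickUMLS, medspacy) use the full UMLS Metathesaurus
# plus approximate matching; this sketch uses exact matches only.
import re

LEXICON = {
    "sob": ("C0013404", "Dyspnea (shortness of breath)"),
    "shortness of breath": ("C0013404", "Dyspnea (shortness of breath)"),
    "erythromycin": ("C0014806", "Erythromycin"),
    "chest pain": ("C0008031", "Chest pain"),
}

def extract_concepts(text):
    """Return (matched span, concept id, preferred name) triples."""
    hits = []
    lowered = text.lower()
    for term, (cui, name) in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((text[m.start():m.end()], cui, name))
    return hits

note = "Pt presents with SOB and chest pain; started on erythromycin."
for span, cui, name in extract_concepts(note):
    print(f"{span!r} -> {cui} ({name})")
```

Once free text is lifted into concept identifiers like these, it can be joined against structured fields (medication lists, problem lists), which is exactly the bridge the LLM-plus-structured-data problem needs.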
The Shift Toward Reliability, Transparency, and Agentic AI
In my January 2026 talk at GDG on Campus Kalyani Government Engineering College, I emphasized a fundamental shift in the AI landscape: the focus is moving from efficiency and accuracy to reliability and transparency.
The EU AI Act (2025) is a major driver. It classifies AI systems for healthcare as high-risk, requiring compliance with security, transparency, and quality obligations, plus conformity assessments. Transparency means that AI systems must allow appropriate traceability and explainability. This has direct implications for how we design and evaluate medical LLMs.
The rise of agentic AI is the other major trend. We’re moving from monolithic models that generate text to compound AI systems that can use tools, retrieve information, and take multi-step actions. For medicine, this means AI systems that can search clinical guidelines, check drug interactions, and integrate patient-specific data — all within a single reasoning chain.
But agentic AI also introduces new challenges. Tasks where AI excels (content creation, information extraction, summarization, chat interfaces) are very different from tasks where it completely fails (instruction following during reasoning, consistent decision logic, working with noisy or incomplete data, and medical decision-making for high-risk scenarios). Recognizing this distinction is essential for building reliable clinical AI systems.
Evaluation Challenges Specific to Medical LLMs
Standard NLP metrics aren’t enough. BLEU, ROUGE, and F1 scores don’t capture clinical correctness. A summary that’s fluent but contains a factual error about a patient’s medication is dangerous.
What you need instead:
- Clinical accuracy: Does the output contain factually correct medical information?
- Completeness: Does it capture all clinically relevant findings?
- Harm potential: Could the output lead to a wrong clinical decision?
- Expert evaluation: You will need clinician reviewers — budget time and resources for this
The FURM Framework (published in NEJM Catalyst 2024) provides a structured approach: evaluate Fairness, Usefulness, Reliability, and Maintainability across three stages — what/why, how, and impact. Similarly, the FAVES principles (Functional, Accessible, Valuable, Engaging, Safe) provide high-level ethical guidelines. By integrating both, you can build AI systems that are technically sound and ethically responsible.
Practical advice: If you’re starting a medical LLM project, budget time and resources for clinical evaluation. I’ve discussed how to build clinical collaborations in my article on working with clinical data.
How to Start Your Own LLM-Based Medical AI Research
Based on the research lifecycle I present in both my talks and tutorials, here’s a realistic path:
Step 1: Problem Ideation. Read papers from top AI and NLP conferences, especially the “NLP Applications” research area. Follow key labs: van der Schaar Lab (Cambridge), Zitnik Lab (Harvard), Boussard Lab (Stanford), Butte Lab (UCSF), Rajpurkar Lab (Harvard). Also track healthcare-focused venues: IEEE ICHI, CHIL, ACL BioNLP workshop, and ML4H.
Step 2: Verification. Is the codebase of prior work reproducible? Has the area already matured? Are datasets available? Are you conversant with open-source medical NLP tools (MetaMap, QuickUMLS, SemRep, clinspacy, medspacy)?
Step 3: Baseline Setup. Reproduce paper results. Run error analysis and ablation studies. Key skills here: resource management (do you have enough compute?), coding knowledge (can you debug codebases and identify gaps?), and critique/problem-solving skills.
Step 4: Novelty. Add a component to the SOTA model to alleviate one or more errors observed. Repeat until performance improves. This is the iterative core of research.
Step 5: Target the right venue. Clinical NLP work fits well at ACL (ClinicalNLP workshop), EMNLP, NAACL, AMIA, and ML4H. Check my conference deadlines article for timing.
Datasets and Benchmarks to Know
| Dataset | Task | Access |
|---|---|---|
| MIMIC-III/IV | Clinical NLP (notes, labs, etc.) | PhysioNet (requires CITI training) |
| MIMIC-CXR | Radiology report generation | PhysioNet |
| PubMedQA | Biomedical question answering | Open |
| MedQA (USMLE) | Medical exam QA | Open |
| n2c2 (various years) | Clinical NLP challenges | Application required |
| MeQSum | Medical question summarization | Open |
| emrQA | Clinical QA from EMRs | Application required |
| PubMed Abstract Collection | Continual pretraining / vocabulary adaptation | Open |
For a comprehensive list beyond LLM tasks, see my Ultimate Guide to ML & AI Datasets and Part 2: Clinical & Social Datasets.
A Research Idea to Get You Started
In the spirit of my Side-Hustle Scientist article, let me sketch a concrete research idea:
Problem: Medical text summarization models generate fluent summaries but often fragment medical terminology, leading to inaccurate clinical content. Can vocabulary adaptation improve the factual accuracy of LLM-generated medical summaries?
Approach: Based on our MEDVOC work (IJCAI 2024), extend vocabulary adaptation to newer open-source LLMs (Llama 3, Mistral). Use two-stage fine-tuning: first continual pretraining with extended vocabulary on PubMed abstracts, then LoRA fine-tuning on the target summarization task. Evaluate both automated metrics and clinical accuracy with physician reviewers.
Dataset: MeQSum (open), or MIMIC-III discharge summaries (PhysioNet, CITI training required)
Target venue: ACL ClinicalNLP workshop or ML4H at NeurIPS
Timeline: 4-6 months, feasible alongside a full-time position
Common Pitfalls to Avoid
Pitfall 1: Evaluating only on benchmarks. Benchmark performance doesn’t equal clinical utility. The implementation gap is real — most AI models never leave the prototyping environment. Always discuss how your model would fit into actual clinical workflows.
Pitfall 2: Ignoring prompt sensitivity. LLM performance is highly sensitive to prompt design. MedPrompt’s choice-shuffling showed that answer ordering alone can shift results significantly. Report results across multiple prompts.
Pitfall 3: Not comparing against simple baselines. If prompting GPT-4 matches your fine-tuned 7B model, that’s an important finding — not a failure. MedPrompt showed that a generalist foundation model with smart prompting can outcompete specialized tuning.
Pitfall 4: Overlooking vocabulary mismatch. If your model fragments “erythromycin” into five tokens, it’s spending generation capacity on reconstruction instead of content. Check your tokenizer’s fertility score (the average number of subword tokens per word) on your target domain.
Pitfall 5: Neglecting the regulatory dimension. The EU AI Act classifies medical AI as high-risk. The shift from accuracy to reliability and transparency is real. If you’re building models for clinical use, trustworthiness isn’t optional.
Final Words
LLMs are not going to replace clinicians. But as I emphasized in my GDG talk, we’re moving toward hybrid agentic AI systems where AI handles what it does best (content creation, information extraction, summarization) and humans handle what AI still fails at (reasoning under uncertainty, working with noisy data, high-stakes medical decisions).
The opportunity for AI researchers — especially those from India’s growing AI ecosystem — is enormous. The field needs people who understand both the technology and the clinical context. If that’s the kind of researcher you want to be, I hope this guide gives you a solid starting point.
To stay updated on AI trends, I recommend resources like Sebastian Raschka’s State of LLMs reports, the Stanford HAI AI Index Report, NeurIPS best paper announcements, and the Microsoft Research Forum. Follow top AI experts on LinkedIn and subscribe to mailing lists from key labs.
Related articles that may be of interest to you
- How to Get Started with Research in AI for Medicine
- The Side-Hustle Scientist: Publish AI Papers While Working Full-Time
- How to Start Working with Clinical Data as an AI Researcher
- The Ultimate Guide to ML & AI Datasets for 2025
- ML & AI Conference Deadlines 2025–2026
Tutorials and Talks Referenced
- WSDM 2025 Tutorial: Building Trustworthy AI Models for Medicine — Slides
- GDG Talk Jan 2026: Building a Career in AI for Industry and Academia — Slides
- Guest Lecture Dec 2024: Foundation Models for Medical Text — Leibniz University Hannover