Explainability in Medical AI: Why Your Black-Box Model Won’t Reach the Clinic

A researcher’s guide to trustworthy, interpretable ML for healthcare — the frameworks, the methods, the trade-offs, and what clinicians actually need from your model.


Here’s a conversation I’ve had more times than I can count, in slightly different forms:

Clinician: “Your model says this patient is high-risk. Why?”
AI Researcher: “Well, the model captures complex non-linear interactions across 200 features…”
Clinician: “So you don’t know why.”

That’s the explainability problem in a nutshell.

During my PhD work on Parkinson’s Disease subtyping at L3S Research Center in Germany, I realized early on that clinical collaborators didn’t just want accurate predictions — they wanted to understand why a patient was assigned to a particular subtype. When I co-delivered our WSDM 2025 tutorial on “Building Trustworthy AI Models for Medicine” in Hannover, explainability was one of the central pillars. This article distills what I’ve learned — from my own research, from the tutorial, and from working alongside clinicians at Stanford Medicine.

If you’re doing AI for Medicine research and plan to publish at clinical venues (Frontiers, JAMIA, JMIR) or present to clinical audiences, explainability and trustworthiness are no longer optional. Even at CS conferences like MICCAI and CHIL, reviewers increasingly ask about interpretability.


Table of Contents

  • Why Explainability Matters in Medicine (More Than Anywhere Else)
  • Trustworthy AI: The Bigger Picture
  • The FAVES and FURM Frameworks: Designing for Trust
  • Fairness: The Bias Taxonomy Every Medical AI Researcher Must Know
  • The Spectrum: Inherently Interpretable vs. Post-Hoc Explanations
  • Method 1: SHAP — The Current Standard
  • Method 2: LIME — Quick and Intuitive
  • Method 3: Attention-Based Explanations (and Their Limitations)
  • Method 4: Grad-CAM for Medical Imaging
  • Method 5: Concept-Based Explanations
  • The Implementation Gap: Why Good Models Still Don’t Reach Patients
  • What Clinicians Actually Want (It’s Not What You Think)
  • How to Include Explainability in Your Paper
  • A Practical Example: Explainable Parkinson’s Disease Subtyping
  • Common Mistakes
  • Final Words

Why Explainability Matters in Medicine (More Than Anywhere Else)

Let me give you three concrete reasons, beyond the philosophical.

1. Trust and adoption. A model that cannot be explained will not be used by clinicians. The cautionary tale of MYCIN (1970s) is instructive here — as I discussed in our WSDM tutorial. MYCIN was a rule-based expert system for diagnosing bacterial infections and recommending antibiotic treatment. In a blinded evaluation by eight independent experts, MYCIN’s recommendations were judged acceptable 65% of the time, outperforming all five of the physicians it was compared against (who ranged from 42.5% to 62.5%). Yet MYCIN was never used in clinical practice. The reasons? Ethical and regulatory concerns, lack of interoperability with clinical workflows, poor usability, and the perception that doctors would be replaced.

The lesson from the 1970s: a system that outperforms clinicians but can’t explain itself — and doesn’t engage its target users in its design — will fail at adoption.

2. Regulatory requirements. The EU AI Act (2025) classifies most AI systems for healthcare as “high-risk” and requires transparency — meaning AI systems must allow appropriate traceability and explainability, and humans must be aware when they interact with an AI system. The U.S. Department of Health and Human Services (HHS) has published a Trustworthy AI Playbook with similar principles. If your model is a black box, it faces higher regulatory barriers.

3. Scientific understanding. If your model discovers that a particular combination of biomarkers predicts disease progression, that’s a scientifically interesting finding — but only if you can extract and communicate it.


Trustworthy AI: The Bigger Picture

Explainability is one dimension of a broader concept: Trustworthy AI. In our WSDM 2025 tutorial, we defined trustworthy AI as artificial intelligence systems that are developed and deployed to operate reliably, ethically, and transparently while maintaining user confidence and minimizing potential harms.

The HHS Trustworthy AI Playbook identifies six pillars:

  1. Fair / Impartial: Ensuring equitable decision-making and preventing bias
  2. Transparent / Explainable: Making AI processes understandable to stakeholders
  3. Responsible / Accountable: Establishing clear oversight and responsibility
  4. Safe / Secure: Protecting against vulnerabilities and threats
  5. Privacy: Safeguarding sensitive data and personal information
  6. Robust / Reliable: Ensuring consistent performance and accuracy

Explainability is pillar #2, but it’s deeply interconnected with fairness (#1), safety (#4), and reliability (#6). A model that’s explainable but unfair, or transparent but unreliable, still isn’t trustworthy.


The FAVES and FURM Frameworks: Designing for Trust

In our tutorial, we covered two frameworks that provide structured approaches to trustworthy medical AI. Understanding these will help you design better systems and write stronger papers.

The FAVES Principles (from the U.S. Office of the National Coordinator for Health IT):

  • Functional: AI systems must perform their intended tasks accurately and reliably
  • Accessible: Ensure AI tools are usable by a diverse range of healthcare professionals
  • Valuable: AI should provide tangible benefits like improving patient outcomes or operational efficiency
  • Engaging: Design intuitive interfaces that encourage active participation and adoption
  • Safe: Prioritize patient privacy and data security in all AI applications

The FURM Framework (published in NEJM Catalyst, 2024):

  • Fairness: Are there subgroups where the model performs significantly worse? Have you tested for bias across demographics?
  • Usefulness: Does this model address a real clinical need? How will it integrate into existing workflows?
  • Reliability: How does the model perform on different datasets or in different clinical settings? Is performance consistent over time?
  • Maintainability: How easy is it to update the model with new data? Can you monitor performance degradation?

Each FURM assessment has three stages that evaluate the what and why motivating an AI use case, how a given model will be formulated and evaluated, and its impact upon implementation.

My practical advice: Reference these frameworks in your paper’s evaluation section. Reviewers at clinical venues (AMIA, JAMIA, CHIL) increasingly expect structured assessments of trustworthiness, not just accuracy metrics.


Fairness: The Bias Taxonomy Every Medical AI Researcher Must Know

In our WSDM tutorial, we covered a comprehensive bias taxonomy from the Japanese Journal of Radiology (2023) that I find invaluable for medical AI research. Let me summarize the four categories:

1. Data Biases:

  • Minority bias: Data not representative of minority groups, leading to unfair outcomes
  • Missing data bias: Certain data points systematically missing, leading to skewed results
  • Informativeness bias: Available data lacking relevant information for accurate conclusions
  • Training-serving skew: Differences between data pipelines during training and deployment

2. Algorithmic Biases:

  • Label bias: Training labels that are biased or inaccurate (common in medical imaging where labels are NLP-extracted from reports)
  • Cohort bias: Training cohort not representative of the broader population

3. Clinician Interaction-Related Biases:

  • Automation bias: Tendency to favor AI suggestions even when incorrect
  • Feedback loop: Model results affecting downstream processes and future data collection
  • Allocation discrepancy: Automated systems allocating resources inequitably

4. Patient Interaction-Related Biases:

  • Privilege bias: Certain groups receiving preferential treatment
  • Informed mistrust: Patients distrusting AI-assisted medical advice
  • Agency bias: Power imbalances between patients and AI-assisted healthcare providers

Why this matters for your paper: At minimum, report your primary metrics disaggregated by available demographic variables. Discuss which biases from this taxonomy are most relevant to your specific clinical application and how you’ve mitigated them.
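To make that recommendation concrete, here is a minimal sketch of disaggregated reporting, using synthetic data and scikit-learn; the `sex` column, the 0.5 decision threshold, and the choice of AUROC and sensitivity are all illustrative, not a prescription:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score

# Hypothetical predictions and labels, with one demographic column ("sex").
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 400),
    "y_score": rng.random(400),
    "sex": rng.choice(["F", "M"], 400),
})
df["y_pred"] = (df["y_score"] >= 0.5).astype(int)  # illustrative threshold

# Report the primary metrics per subgroup, not just overall.
rows = []
for group, sub in df.groupby("sex"):
    rows.append({
        "group": group,
        "n": len(sub),
        "auroc": roc_auc_score(sub["y_true"], sub["y_score"]),
        "sensitivity": recall_score(sub["y_true"], sub["y_pred"]),
    })
report = pd.DataFrame(rows)
print(report)
```

A large gap between subgroups in such a table is exactly the kind of finding the taxonomy above asks you to surface and discuss, rather than hide behind a single aggregate number.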


The Spectrum: Inherently Interpretable vs. Post-Hoc Explanations

There are two broad approaches:

Inherently interpretable models are transparent by design — logistic regression, decision trees, rule lists, and generalized additive models (GAMs). Their advantage is that the explanation is the model.

Post-hoc explanation methods are applied to existing black-box models to explain their predictions after the fact — SHAP, LIME, Grad-CAM, and attention visualization. They let you use powerful models while still providing explanations, but the explanations are approximations.

My pragmatic advice: Start with an interpretable model as a baseline. Then build your complex model and use post-hoc methods to explain it. If the explanations contradict the interpretable model’s behavior, investigate — it usually indicates a problem.
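That sanity check can be sketched as follows, assuming a tabular dataset and scikit-learn (the synthetic data and the choice of gradient boosting as the "complex" model are illustrative): compare the top features of the interpretable baseline against the permutation importances of the black box.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a clinical tabular dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Interpretable baseline: the (standardized) coefficients ARE the explanation.
scaler = StandardScaler().fit(X_tr)
lr = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
lr_top = set(np.argsort(-np.abs(lr.coef_[0]))[:4])

# 2) Complex model, explained post hoc via permutation importance.
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(gbm, X_te, y_te, n_repeats=10, random_state=0)
gbm_top = set(np.argsort(-imp.importances_mean)[:4])

# Sanity check: the two views should largely agree on which features matter.
print("baseline top features: ", sorted(lr_top))
print("black-box top features:", sorted(gbm_top))
print("overlap:", len(lr_top & gbm_top), "of 4")
```

Low overlap is not automatically wrong, but it is exactly the signal that warrants investigation before you trust either explanation.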


Method 1: SHAP — The Current Standard

SHAP (SHapley Additive exPlanations) assigns each feature a “Shapley value” — a measure of that feature’s marginal contribution to a prediction.

Why researchers love it: theoretically sound (based on cooperative game theory), works with any model, provides both local (per-prediction) and global (across-dataset) explanations, and has excellent built-in visualizations.

Practical tips for clinical ML:

  • Use TreeSHAP for tree-based models (XGBoost, LightGBM) — exact and fast
  • Use KernelSHAP or DeepSHAP for neural networks
  • The “beeswarm plot” is extremely effective for clinical papers

Key limitation to acknowledge: For correlated features (common in clinical data — think systolic and diastolic blood pressure), SHAP values can be misleading. This connects directly to the data bias issues from the taxonomy above.
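To make the underlying idea concrete, here is a pure-NumPy sketch that computes exact Shapley values by brute-force coalition enumeration. This is only feasible for a handful of features; the `shap` library uses far faster methods (TreeSHAP, KernelSHAP). The linear model and background values are toys chosen so the result can be checked by hand.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values: average marginal contribution of each feature
    over all coalitions, with out-of-coalition features set to the background."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        z = background.copy()
        z[list(subset)] = x[list(subset)]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

# For a linear model, feature i's Shapley value is w_i * (x_i - background_i).
w = np.array([2.0, -1.0, 0.5])
predict = lambda z: float(w @ z)
x = np.array([1.0, 3.0, -2.0])
mean = np.array([0.0, 1.0, 0.0])

phi = shapley_values(predict, x, mean)
print(phi)  # matches w * (x - mean) = [2.0, -2.0, -1.0]
# "Efficiency": Shapley values sum to f(x) - f(background).
print(phi.sum(), predict(x) - predict(mean))
```

The efficiency property shown in the last line is what makes SHAP attractive for clinical reporting: the per-feature contributions add up exactly to the difference between this patient's prediction and the baseline.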


Method 2: LIME — Quick and Intuitive

LIME (Local Interpretable Model-agnostic Explanations) explains individual predictions by fitting a simple linear surrogate model to the black box’s behavior in the neighborhood of the instance being explained.

When to use LIME over SHAP: for quick intuitive explanations, for text data (LIME works naturally with token-level perturbations), or when computational resources are limited.

Limitations: Explanations are local and can vary significantly across similar inputs. The choice of perturbation method affects results.
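The core idea can be sketched in a few lines. This is a toy stand-in for the `lime` library, not its actual API: the Gaussian perturbation scale, the proximity kernel, and the Ridge surrogate are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_tabular(predict, x, n_samples=2000, kernel_width=1.0, seed=0):
    """LIME-style local surrogate: perturb the instance, weight samples by
    proximity, and fit a weighted linear model as the local explanation."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))  # perturbations
    y = predict(Z)
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)       # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_

# Toy nonlinear black box: only the first two features matter near x.
black_box = lambda Z: Z[:, 0] ** 2 + 3 * Z[:, 1]
x = np.array([1.0, 0.5, -2.0])
coefs = lime_tabular(black_box, x)
print(coefs)  # roughly [2, 3, 0]: the local slope of the black box at x
```

The limitation noted above is visible right here: change the perturbation scale or kernel width and the coefficients shift, which is why LIME explanations should be reported with their settings.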


Method 3: Attention-Based Explanations (and Their Limitations)

If you’re working with transformer-based models, attention weights are tempting to use as explanations.

The appeal: Attention weights are computed naturally during inference. If the model attends heavily to “edema” when predicting heart failure, that seems meaningful.

The problem: Research has shown that attention weights are not reliable explanations. Models can attend to a token for reasons unrelated to the final prediction.

What to do instead: Use integrated gradients or input × gradient methods. If you report attention patterns, caveat them explicitly and combine with more rigorous methods.
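For intuition, here is a minimal NumPy sketch of integrated gradients on a toy function with an analytic gradient; a real deep model would supply the gradients via autodiff (e.g., a framework's autograd) rather than a hand-written `grad_f`.

```python
import numpy as np

def integrated_gradients(grad_f, x, x0, steps=256):
    """Average the gradient along the straight path from baseline x0 to input
    x, then scale by (x - x0). Attributions satisfy "completeness": they sum
    to f(x) - f(x0), a guarantee attention weights do not offer."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoint rule
    path = x0 + alphas[:, None] * (x - x0)         # points along the path
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - x0) * avg_grad

# Toy differentiable "model" with a known gradient.
f = lambda z: z[0] ** 2 + np.sin(z[1])
grad_f = lambda z: np.array([2 * z[0], np.cos(z[1])])

x, x0 = np.array([1.0, 2.0]), np.zeros(2)
attr = integrated_gradients(grad_f, x, x0)
print(attr, "sum:", attr.sum(), "f(x) - f(x0):", f(x) - f(x0))
```

The completeness check in the last line is the practical test to run: if your attributions don't approximately sum to the prediction difference, something in the pipeline is off.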


Method 4: Grad-CAM for Medical Imaging

For medical imaging (chest X-rays, histopathology, retinal images), Grad-CAM produces localization maps highlighting important image regions.

Clinical value: A heatmap showing that the model focuses on the right lower lobe when predicting pneumonia is clinically meaningful and builds trust.

Best practices: Use Grad-CAM++ or Score-CAM for more precise localizations. Validate against expert annotations. Report failure cases — this is as informative as success cases.
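For reference, the Grad-CAM combination step itself is only a few lines. This sketch uses random stand-in arrays where a real pipeline would capture the last conv layer's activations, and the gradients of the class score with respect to them, via framework hooks:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: channel weights = globally average-pooled gradients;
    heatmap = ReLU of the weighted sum of feature maps, normalized to [0, 1].
    activations, gradients: arrays of shape (channels, H, W)."""
    weights = gradients.mean(axis=(1, 2))  # global-average-pool the gradients
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam  # upsample to input resolution before overlaying on the image

rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))        # stand-in feature maps
G = rng.normal(size=(8, 7, 7))   # stand-in gradients of the class score
heatmap = grad_cam(A, G)
print(heatmap.shape)  # (7, 7), values in [0, 1]
```

Note that the heatmap lives at feature-map resolution (7×7 here), which is why the coarse localization should be validated against expert annotations rather than read as a precise segmentation.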


Method 5: Concept-Based Explanations

This is a newer approach that I believe will become increasingly important.

Instead of explaining predictions in terms of raw features, explain them in terms of human-understandable concepts. TCAV (Testing with Concept Activation Vectors) tests sensitivity to high-level concepts. Concept Bottleneck Models force the model to predict intermediate concepts first.

Why this matters: Clinicians think in medical concepts, not feature vectors. “This patient is high-risk because of elevated troponin, low ejection fraction, and ST-segment elevation” is actionable. “Features 47, 92, and 156 have high SHAP values” is not.
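Here is a minimal sketch of the concept-bottleneck idea on synthetic data; the concept names ("tremor", "rigidity", "sleep_disturbance") and the data-generating process are purely illustrative, not drawn from any real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic setup: raw measurements -> three named concepts -> label.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 12))                        # raw measurements
concepts = (X[:, :3] + 0.3 * rng.normal(size=(800, 3)) > 0).astype(int)
y = (concepts.sum(axis=1) >= 2).astype(int)           # label depends on concepts

X_tr, X_te, c_tr, c_te, y_tr, y_te = train_test_split(
    X, concepts, y, random_state=0)
concept_names = ["tremor", "rigidity", "sleep_disturbance"]

# Stage 1: one interpretable predictor per concept.
concept_models = [LogisticRegression(max_iter=1000).fit(X_tr, c_tr[:, j])
                  for j in range(3)]
c_hat_tr = np.column_stack([m.predict_proba(X_tr)[:, 1] for m in concept_models])
c_hat_te = np.column_stack([m.predict_proba(X_te)[:, 1] for m in concept_models])

# Stage 2: the label model sees ONLY the concepts, so its coefficients
# are an explanation in clinical vocabulary.
label_model = LogisticRegression().fit(c_hat_tr, y_tr)
for name, coef in zip(concept_names, label_model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
print("test accuracy:", label_model.score(c_hat_te, y_te))
```

The bottleneck also enables intervention: a clinician who disagrees with a predicted concept can correct it and see how the final prediction changes, which is a form of interaction raw feature attributions cannot support.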


The Implementation Gap: Why Good Models Still Don’t Reach Patients

This is something we covered extensively in our WSDM 2025 tutorial. The data is sobering: a systematic review of AI in the ICU found that the vast majority of models remain in the testing and prototyping environment. Most patients do not benefit because AI never moves from bytes to bedside.

The challenges include:

  • Legal regulations: Healthcare is heavily regulated — HIPAA in the US, GDPR and the AI Act in Europe. In India, while there is no healthcare-specific data law yet, the DPDP Act 2023 and the IT Rules 2011 apply. Certification for Germany’s DiGA (Digital Health Applications), for example, can cost 3 to 8 million euros.
  • Financial obstacles: The cost of clinical validation, regulatory compliance, and deployment infrastructure is prohibitive for most academic labs.
  • Acceptance: Clinicians need to trust and want to use the system. This requires engaging the target group from the design phase — the lesson from MYCIN still applies 50 years later.
  • Generalizability: Models trained on data from one institution often don’t transfer to another due to differences in data standards, populations, and clinical practices. Interoperability standards (FHIR, openEHR) help but are not yet widely adopted.
  • Knowledge drift: Clinical guidelines and medical knowledge evolve. A model that was accurate in 2023 may be outdated by 2025.

The evaluation perspective: True clinical evaluation goes beyond “human vs. machine” comparisons. A more realistic framework is “human vs. human + machine” — focusing on whether the AI adds value to the clinician’s decision-making, which is closer to how AI would actually be deployed.


What Clinicians Actually Want (It’s Not What You Think)

After working with clinicians across three countries, here’s what I’ve learned:

They want case-level explanations, not model-level. “Why is this patient high-risk?” beats “what features are generally important.”

They want confirmatory evidence, not surprises. If your model flags a patient for reasons the clinician can verify in the chart, that builds trust.

They want actionable information. “Declining renal function” leads to a clear action. “Complex feature interactions” does not.

They want to understand failure modes. “The model is less reliable for patients under 30” is honest and useful.

They want to be engaged, not replaced. The FAVES principle of “Engaging” — designing intuitive interfaces that encourage active participation — directly addresses this.


How to Include Explainability in Your Paper

Here’s a practical checklist that integrates trustworthiness frameworks:

  1. Describe your explanation method in the Methods section. Justify your choice.
  2. Include global explanations — feature importance rankings (SHAP summary plot) for the full model
  3. Include local explanations — 2-3 example cases showing why the model made specific predictions
  4. Discuss alignment with clinical knowledge — Do the important features make clinical sense?
  5. Assess fairness — Report metrics disaggregated by demographics. Reference the bias taxonomy.
  6. Address trustworthiness — Use FURM or FAVES to structure your discussion of the model’s reliability, usefulness, and safety
  7. Acknowledge limitations — No explanation method is perfect. Be upfront.
  8. Show failure cases — Where do explanations break down?

A Practical Example: Explainable Parkinson’s Disease Subtyping

In our Parkinson’s Disease subtyping work (published in Frontiers in AI, impact factor 4.7), we used unsupervised clustering to identify patient subtypes from the PPMI dataset.

Our approach to explainability:

  • We computed SHAP values for the supervised model that predicted cluster assignments
  • The SHAP analysis revealed that motor symptoms (tremor, rigidity, bradykinesia) and non-motor symptoms (sleep disturbances, cognitive decline) contributed differently to different subtypes
  • This aligned with clinical literature on Parkinson’s subtypes, validating our computational findings

Interoperability was a significant challenge, as our WSDM tutorial co-presenter Dominik discussed using our Parkinson’s work as a case study. We worked with two datasets — LRRK2 (397 patients from Tunisia) and MDS (402 patients from UK/US) — where most features were mappable but feature names were completely different (e.g., “NP2SPCH” vs. “mdsupdrs2_1” for the same speech impairment measure). This data interoperability challenge directly impacts model transferability and the ability to validate findings across populations.

This clinical validation and cross-dataset evaluation were key factors in the paper’s acceptance.


Common Mistakes

Mistake 1: Adding explainability as an afterthought. If you only think about explainability the night before submission, it shows. Plan for it from the start — and frame it within trustworthiness, not just as a technical add-on.

Mistake 2: Claiming attention is explanation. Attention weights are not reliable explanations. Don’t present them as such without caveats.

Mistake 3: Not validating explanations with domain experts. If your model says “age” is the most important feature for a disease without a strong age association, something is wrong.

Mistake 4: Ignoring the implementation gap. Even if your model is explainable on paper, consider: would a clinician actually use this explanation in practice? Would it change their decision?

Mistake 5: Only showing explanations that look good. Cherry-picking examples where explanations are clean is dishonest. Include cases where explanations are confusing or contradictory.


Final Words

Explainability in medical AI isn’t just a nice-to-have — it’s becoming a requirement for publication, deployment, and regulatory approval. But more importantly, it’s part of a broader shift toward trustworthy AI that I believe will define the next decade of medical AI research.

As I’ve argued in our WSDM 2025 tutorial and my GDG talk: the focus is moving from efficiency and accuracy to reliability and transparency. The EU AI Act mandates it. The HHS Trustworthy AI Playbook guides it. And clinicians demand it.

The good news is that the tools are mature and the frameworks are available. SHAP is a few lines of Python. FAVES and FURM provide structured evaluation templates. The real challenge is methodological and cultural — designing for trust from the start, engaging clinicians as partners, and honestly communicating what your model can and cannot do.

