How to start working with clinical data as an AI researcher

Using our paper on Parkinson’s Disease subtyping as a case study, I share the essential steps for navigating the complex world of medical research.


Compared with the wide accessibility of medical imaging and NLP datasets and the benchmarking efforts around LLMs, clinical research with patient data still has higher barriers to entry.

The barriers to entry relate partly to training requirements, but I feel they also stem from the limited tutorials and guidance on how to navigate or plan the journey when starting out in this research domain.

In this article, I will share my learnings as an AI researcher working with Parkinson’s Disease and publishing at the Frontiers in Artificial Intelligence Journal under the section “Medicine and Public Health” (website, impact factor: 4.7).

Our paper is open-access, so feel free to go through it at https://doi.org/10.3389/frai.2025.1668206

Learning Outcomes

This article offers insights and guidelines on the following questions:

  • As an AI researcher, what are the first practical steps to gain access to clinical datasets, and what are the key ethical and privacy certifications to be aware of?
  • How do you establish a strong clinical motivation for your research, and why is the “outcome-action pairing” framework essential for a successful project?
  • What are the key questions to discuss with medical experts at the start of a project to ensure the clinical utility of your AI model?
  • How do you navigate common machine learning challenges specific to clinical data, such as high data sparsity, the curse of dimensionality, and the need for interpretability?
  • Why is it often more important to prioritize explainability and clinical impact over state-of-the-art novelty when developing AI models for medicine?
  • What are the fundamental differences in writing and structuring a research paper for a clinical journal versus a traditional AI conference?
  • How can you effectively frame the Results and Discussion sections of a manuscript to highlight the clinical findings and their real-world implications?

The good news is that sincere efforts are being made to increase the accessibility of such clinical datasets, as well as to make them inherently AI-ready.

So, let’s start.

First, let me quickly brief you about our AI-based patient subtyping approach.

Highlights from our journal paper

Paper title: Decision tree-based approach to robust Parkinson’s disease subtyping using clinical data of the Michael J. Fox Foundation LRRK2 cross-sectional study

Open-access paper link: https://doi.org/10.3389/frai.2025.1668206

Codebase: https://github.com/roysoumya/decision-tree-subtyping-Parkinson

Team

This work was done at the International Leibniz Future Laboratory for Artificial Intelligence (Leibniz AI Lab), L3S Research Center, Germany, in collaboration with Medizinische Hochschule Hannover, Germany.

Further details can be found at: https://leibniz-ai-lab.de/research/

Summary

We developed a novel, interpretable decision-tree-based method to identify six clinically meaningful PD subtypes — validated across two independent cohorts (LRRK2 and MDS). Unlike traditional clustering approaches, our model offers robust reproducibility and clear diagnostic rules.

🧠 Key findings

– Subtypes are defined by features like persistent asymmetry, long disease duration (≥10 years), and postural instability.
– Early onset subtype (E4) shows a favorable prognosis.
– Mixed onset subtypes (M3, M7) are linked to faster decline — highlighting the need for proactive care.
– Late onset subtypes (L2, L4) correlate with reduced quality of life and higher care dependency, while L1 appears more benign.

Relevance

These insights could pave the way for more personalized therapeutic strategies and better prognostic guidance in PD care.

First Step: Data Access and Clinical Motivation

Given the sensitive nature of clinical data, various safeguards are put in place to prevent harmful or unethical use.

As an AI researcher, my suggestion is to focus on non-PHI data.

PHI stands for “Protected Health Information”. There are a total of 18 identifiers, like name, geographical subdivisions smaller than a state, phone numbers, etc.

You can find more details at https://cphs.berkeley.edu/hipaa/hipaa18.html

Working with PHI data involves completion of various training requirements:

  1. Protecting patient privacy: knowledge of HIPAA and country-specific regulations
  2. Compliance for working with human subjects research. For example: https://researchcompliance.stanford.edu/panels/hs/for-all-researchers

Therefore, my practical suggestion would be to work with de-identified non-PHI data. If you are affiliated with an academic institution, it is usually more straightforward.

Dataset Access

The following three types of approaches primarily exist:

  1. Requires approval from a senior researcher or your research supervisor. It may also involve getting signatures from a representative of your academic institution. This may take a few months to even a year.
  2. Does not require approval but requires completion of CITI training: for example, the well-known MIMIC ICU datasets require completion of the CITI Program’s “Data or Specimens Only Research” training course.
  3. The Michael J. Fox Foundation has made the process very smooth for researchers. It will involve signing the “Data Use Agreement” and agreeing to follow the Publications Policy.

Parkinson’s Disease Datasets

Michael J. Fox Foundation: https://www.michaeljfox.org/data-resources

  • LRRK2 Cohort Consortium: contains both clinical and genomics data (in the form of SNPs), covering both cross-sectional (single time point) and longitudinal (multiple time points in the patient journey) study designs.
  • Parkinson’s Progression Markers Initiative study
  • The Fox Investigation for New Discovery of Biomarkers (BioFIND)

International Parkinson and Movement Disorder Society (https://www.movementdisorders.org/)

Clinical Motivation

Once you have reasonable clarity about dataset access, the next fundamental issue to address is “clinical motivation”.

This is formally framed as “outcome-action pairing”: a systematic, detailed discussion with clinicians and medical professionals at the start of the project, or during the problem formulation stage, to clarify the clinical utility of the research project.

Specifically, one needs to broadly answer the following questions:

  • How do the AI model’s predictions fit in the clinical workflow? How much lead time is required to achieve clinical utility?
  • Who are the key stakeholders — clinicians, caregivers (nurses), hospital administrators, insurance providers?
  • What are the key clinical findings that you aim to investigate that AI model predictions can be used to answer? For a clinical paper, these types of findings are most important, and without clear research questions at the start of the project, it becomes quite difficult down the line.
  • How do you plan to evaluate the AI model? Which metrics need to be optimized? The metric an AI researcher optimizes is usually not the one clinicians prefer.

Having multiple rounds of discussion at the start of the research project helps build the right foundation for the project. It helps to get clarity about:

  • Clinical Utility and Communication: How much clinical feedback can you expect from your collaborators in the future? What form of communication works best for getting specific feedback?
  • Problem relevance: How well grounded is the motivation in recent literature?
  • Given your background as an AI researcher, it makes more sense to target some publications in the intersection of AI and medicine, such as Nature Scientific Reports, Frontiers in AI, ACM Transactions on Computing for Healthcare, and the Applications track for AI/NLP.
  • You can check out my Google Scholar profile to get some more ideas. This helps you to get an idea of the type of AI or machine learning models and evaluation metrics that are acceptable to that medical research community. I advise choosing a problem statement that has at least two research articles with available GitHub codebases.
  • Once you identify some relevant conferences and research papers, please read my previous Medium article on that topic

Second Step: How to start with AI for Medicine research

In my previous blog, I covered how to get started with the general AI for Medicine research domain, not just clinical data (which is the topic of this blog).

Third Step: Problem Formulation and AI Feasibility

Once you are through with the first two critical steps, we slowly start getting into the AI part, which is your domain of expertise.

The ideal output of the previous stages is:

  • 2 to 3 research ideas with clear clinical findings
  • How to evaluate? What metrics need to be optimized?
  • 2 related research publications. Type of publication venue to target?

As an AI researcher, you need to consider whether the necessary resources are available for developing an AI model.

For example, if it is a classification model, as in our case, how do we determine the class labels?

What evaluation metric to optimize?

What can be the baseline models to compare against, or what is the state-of-the-art AI model for the same task?

Case Study — Our paper

We had to decide how to define the class labels for juvenile, early-onset, and late-onset Parkinson’s Disease patients. We used a research paper as a reference to justify our choice:

  • Juvenile: age of onset of less than 21 years old
  • Early-onset: > 21 years but < 50 years old
  • Late-onset: > 50 years old

We removed juvenile PD patients and patients with an age of onset equal to 50 years. However, we also had to decide whether to remove patients over a wider window at this juncture (say, 47 to 53). This is commonly referred to as a ‘gray area’ and sometimes helps to improve the quality of class labels.

We also performed sensitivity analysis with different thresholds of age of onset and observed the model performance differences.
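
As a sketch, the label construction and gray-area filtering described above might look like the following (the column names and example ages are hypothetical, not taken from the actual LRRK2 dataset):

```python
import pandas as pd

# Hypothetical example rows; column names are illustrative only.
df = pd.DataFrame({"patient_id": [1, 2, 3, 4, 5],
                   "age_of_onset": [18, 35, 50, 52, 67]})

def label_onset(age):
    """Assign an onset class using the thresholds described above."""
    if age < 21:
        return "juvenile"
    if age < 50:
        return "early_onset"
    return "late_onset"

df["onset_class"] = df["age_of_onset"].apply(label_onset)

# Remove juvenile patients and the ambiguous boundary case (onset == 50);
# widening the excluded window (e.g. 47 to 53) implements the 'gray area' idea.
filtered = df[(df["onset_class"] != "juvenile") & (df["age_of_onset"] != 50)]
```

Re-running such a pipeline with different thresholds (e.g. 45 or 55 instead of 50) is one simple way to carry out the sensitivity analysis mentioned below.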

We had to decide between the standard machine learning approach of clustering for patient subtyping and deep learning. The lack of subtype labels, limited interpretability, the limited number of patients, and the wide adoption of decision-tree-based models by medical experts made a strong case for pursuing a decision-tree-based patient subtyping approach in our project.

Fourth Step: Machine Learning Challenges

I will only highlight the research challenges specific to developing AI models with clinical data.


Data Sparsity or Missing Data Issues

How much missing data exists, and how to deal with it?

If a feature is missing for more than 30% of the data points, we can consider removing it.

For the remaining features, we can impute the missing values using the two most popular methods:

  • Median: Replace the missing value with the median value of that column. Straightforward and clinically meaningful, but not personalized.
  • MICE algorithm: Iteratively imputes the missing values until convergence. More complex and personalized, but may introduce clinically implausible values.
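
Both strategies are available in scikit-learn; a minimal sketch on toy data (`IterativeImputer` is scikit-learn's MICE-style imputer):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy feature matrix (rows = patients, columns = clinical features).
X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, np.nan]])

# Median imputation: column-wise, simple, not personalized.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# MICE-style imputation: models each feature from the others and
# iterates to convergence; personalized, but check plausibility.
X_mice = IterativeImputer(random_state=0).fit_transform(X)
```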

There are some more caveats that need to be kept in mind.

The above discussion assumed the data were missing at random. However, missingness may also have systematic causes.

For example, if the clinical data is collected from multiple institutions, some particular tests or types of data may be missing for all data points of that institution.

In the LRRK2 Cross-sectional dataset, the genetic data were collected from only one institute. Similarly, the MDS-UPDRS values were not collected from all institutions.
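
A quick way to surface such systematic, institution-level missingness is to compute per-site missingness rates. A sketch with a hypothetical `site` column and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical multi-site data; 'site' marks the collecting institution.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "updrs_total": [32.0, 28.0, np.nan, np.nan],  # absent at site B
    "age_at_visit": [61, 55, 70, 66],
})

# Fraction of missing values per feature, per site. A feature that is
# 100% missing at one site but present elsewhere signals systematic,
# institution-level missingness rather than missingness at random.
missing_by_site = df.drop(columns="site").isna().groupby(df["site"]).mean()
```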

Curse of Dimensionality — Few Patients, Many Features

This is a classic and frequent challenge when working with clinical data: you may have a dataset with hundreds of patients but thousands of potential features (e.g., clinical measurements, survey responses, genetic markers).

This scenario, often called the “small n, large p” problem (where ’n’ is the number of samples and ‘p’ is the number of features), makes it very difficult to train robust machine learning models.

The high dimensionality can lead to model overfitting, where the model learns the noise in the training data rather than the true underlying patterns.

Some common strategies to address this challenge are:

1. Feature Selection

The goal of feature selection is to identify and retain only the most relevant features for your predictive task, discarding the rest. This simplifies the model and can improve its generalizability.

  • Embedded Methods: Techniques like L1 regularization (Lasso) are highly effective. During model training, L1 regularization adds a penalty proportional to the absolute value of the feature weights. This forces the weights of less important features to become exactly zero, effectively removing them from the model.
  • Filter and Wrapper Methods: You can also use statistical tests (filter methods) to score and rank features based on their correlation with the outcome, or use wrapper methods that train multiple models to find the optimal feature subset.
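
A minimal sketch of L1-based embedded feature selection on synthetic "small n, large p" data (the dataset here is generated, not clinical):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic "small n, large p" setup: 100 patients, 200 features,
# only a few of them informative.
X, y = make_classification(n_samples=100, n_features=200,
                           n_informative=5, random_state=0)

# An L1 (Lasso-style) penalty drives the weights of uninformative
# features to exactly zero; SelectFromModel keeps the survivors.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
X_reduced = selector.transform(X)
```

The regularization strength `C` controls how aggressively features are pruned and is worth tuning via cross-validation.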

Multicollinearity Issue: Many machine learning models assume that the features are not strongly correlated with one another, so highly collinear features that add no extra information should be removed from the feature list.
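
One common heuristic is correlation-based pruning: drop one feature from every near-duplicate pair. A sketch on synthetic data (the 0.95 threshold is a typical but arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.01, size=100)  # near-duplicate of 'a'
df["c"] = rng.normal(size=100)

# Drop one feature from every pair with |correlation| > 0.95, using
# only the upper triangle so each pair is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_pruned = df.drop(columns=to_drop)
```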

2. Dimensionality Reduction

Instead of just selecting features, dimensionality reduction transforms the existing features into a new, smaller set of features while aiming to preserve as much of the original information as possible.

  • Principal Component Analysis (PCA): This is a widely used linear technique that creates a set of uncorrelated principal components from the original data, ordered by the amount of variance they explain. You can then use the top few components as your new features.
  • UMAP and t-SNE: For complex, non-linear relationships in the data, techniques like Uniform Manifold Approximation and Projection (UMAP) or t-SNE are excellent. They are particularly powerful for visualizing high-dimensional data and can be used to create low-dimensional embeddings for clustering or classification tasks.
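
A PCA sketch on synthetic data, asking for however many components are needed to explain a target fraction of variance rather than a fixed number:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 patients, 200 features.
X, _ = make_classification(n_samples=100, n_features=200, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

# A float n_components keeps the smallest number of components that
# together explain at least 90% of the variance.
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)
```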

3. A New Paradigm: Leveraging Foundation Models

While the techniques above are powerful, the development of foundation models pretrained on massive datasets offers a new approach. In the context of healthcare, these models are trained on millions of de-identified electronic health records (EHRs).

An example is CLMBR-T-Base (CLMBR-Transformer-Base), a 141-million-parameter autoregressive foundation model pretrained on 2.57 million EHRs from Stanford Medicine. By pre-training on such a vast dataset, these models learn rich, general-purpose representations of patient data.

You can then fine-tune these models on your smaller, specific dataset. This approach, known as transfer learning, significantly reduces the dependency on having a large number of labeled patients for your specific task, as the model has already learned the fundamental patterns of clinical data.

AI Model Interpretability

Given the high-risk nature of working with clinical data and the scope of potential harm, AI models need to be interpretable and explainable, providing transparency into the model’s reasoning process. This is necessary to answer the physician’s queries and build trust in the model predictions.

Decision trees stand out as a clear favorite of the medical experts, given the interpretable-by-design nature and tree-like structure that closely mimics the diagnostic reasoning process of a clinician.

Random Forests are also gaining acceptance, as they deliver better model performance at the cost of transparency. Feature importance scores (typically permutation-based) provide some global, model-level interpretability, but not instance-level explanations.
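
As an illustration of "interpretable by design", a shallow decision tree can be printed as plain if/then rules. A sketch on scikit-learn's built-in breast cancer dataset (a public stand-in, not our PD data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A depth-limited tree yields a handful of readable rules
# that a clinician can audit feature by feature.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Capping `max_depth` trades a little accuracy for rules short enough to discuss in a clinical meeting.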

Fifth Step: Iteration, Improvement, and Clinical Findings

Iteration and Improvement

Final Step: Writing the Paper

Congratulations on making it so far and persevering to this stage of your project. You are now left with the final milestone of writing the manuscript and getting it accepted at a conference or a journal.

Journals are usually preferred for clinical papers, but with the high cost associated with open-access publishing and funding cuts, conferences are also being considered.

Major Changes — Clinical versus AI paper

  • Structured Abstract: Introduction, Methods, Results, Discussion
  • Results and Discussion sections receive more importance than the Methods section.
  • Section ordering change: Abstract — Introduction — Results — Discussion — Methods — Conclusion

As this is my first clinical paper, I am also in the learning stage. The following articles give some more ideas:

Writing Guidelines

How-to-write-a-great-MLHC-paper – Machine Learning for Healthcare

For some general help on how to use LaTeX, you can read my previous blog about it.

Writing Guidelines — Results section

Reference: Paraphrased from my lab’s medwiki

  • Population Characteristics: Present descriptive details of the study sample (commonly in Table 1), including overall size and distribution of key demographic or predictor variables.
  • Study Flow: Include a diagram (often Figure 1) that illustrates patient or hospital inclusion and exclusion, showing how the final study population was derived.
  • Participation and Data Quality: Report participation metrics, such as response rates or the proportion of eligible institutions included, along with information on data completeness.
  • Primary Outcomes: Provide findings related to the main outcome(s) of interest, highlighting the central results of the study.
  • Secondary Outcomes: Summarize results for additional outcomes beyond the primary focus, offering a broader view of the study’s impact.
  • Sensitivity Analysis: Present analyses conducted to test the robustness of findings under alternative assumptions or conditions.

Writing Guidelines — Discussion section

Reference: https://pmc.ncbi.nlm.nih.gov/articles/PMC10676253

  • Start by mentioning the major findings
  • Elaborate on what the findings mean and why they are important
  • Contextualize the research findings with relevant literature
  • Explore alternative explanations for the research findings
  • Limitations of the study need to be acknowledged. Mention the thought process behind this study design and how this is the best that is feasible or possible.
  • Suggest future research directions

Learning Summary

Key Research Ideas and Practical Tips

In this blog article, I covered the fundamentals and provided some practical tips for a successful clinical paper as an AI researcher:

  1. Access to the dataset and the necessary certifications
  2. A clear problem statement and hypothesis about clinical findings
  3. Why AI is required, and where domain experts feel AI can help
  4. Focus on a single disease. Depth instead of breadth.
  5. Outcome-Action Pairing
  6. Explainability and Interpretability > Accuracy. But predictive/generative performance should be sufficiently high. What metrics to optimize?
  7. AI challenges: high data sparsity or missingness, small numbers of patients (hundreds to a few thousand), a high number of patient features, and a lack of connected data such as genetics or longitudinal records
  8. Communication and obtaining feedback from Medical Experts
  9. Writing a clinical paper versus an AI/NLP paper: more weight on the Results and Discussion sections (clinical impact), and methods are checked more for soundness than for novelty or state-of-the-art performance
  10. Rebuttal and perseverance: Our paper took over two years to get published.


What is your take on this topic?
