Why 98% accuracy can mean a useless model — and how to detect, measure, and fix class imbalance with resampling, SMOTE, and cost-sensitive learning.
Updated for 2025 — originally published August 2018
Class imbalance is one of the most common real-world problems in machine learning — and it’s especially severe in medical datasets. Here’s a complete, code-first guide to detecting and handling it properly.
If you’ve ever trained a classifier that achieved 98% accuracy and thought “this is too good to be true” — you were probably right. More often than not, that number reflects a heavily imbalanced dataset where your model learned to predict the majority class almost exclusively.
This problem is pervasive in clinical research. Disease prevalence is low. Fraud is rare. Equipment failure is uncommon. Your dataset naturally has far more “normal” examples than “abnormal” ones, and naive classifiers exploit that asymmetry.
In this article, I’ll explain the theory clearly and then walk through practical code solutions — the same approaches I use in my medical AI research.
Table of Contents
- What Is Class Imbalance?
- Why Standard Accuracy Fails
- How to Detect Imbalance in Your Dataset
- Strategy 1: Better Evaluation Metrics
- Strategy 2: Data-Level Approaches (Resampling)
- Strategy 3: Algorithm-Level Approaches
- Strategy 4: Cost-Sensitive Learning
- Medical AI Context: Specific Considerations
- Complete Code Example
- When NOT to Resample
What Is Class Imbalance?
Imbalanced datasets occur when one class (the minority, or positive class) has significantly fewer samples than another (the majority, or negative class).
Common imbalance ratios in real-world problems:
| Domain | Typical Imbalance |
|---|---|
| Medical diagnosis (rare disease) | 1:100 to 1:1000 |
| Fraud detection | 1:100 to 1:500 |
| Fault detection (manufacturing) | 1:10 to 1:100 |
| Clinical trial outcomes | 1:10 to 1:50 |
| Parkinson’s disease subtyping (my research) | 1:3 to 1:8 |
The degree of imbalance matters. A 60/40 split is usually fine. A 99/1 split requires careful handling.
Why Standard Accuracy Fails
Consider a disease screening dataset: 950 healthy patients (class 0) and 50 sick patients (class 1).
A model that always predicts “healthy” achieves 95% accuracy while being completely useless clinically. It would miss every single sick patient.
This is why accuracy is the wrong metric for imbalanced problems. You need metrics that account for class-level performance separately.
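To make this concrete, here is a minimal sketch using scikit-learn's DummyClassifier on a synthetic 950/50 split matching the example above. The baseline that always predicts "healthy" looks excellent on accuracy and is useless on recall:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic screening data: 950 healthy (0), 50 sick (1)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features don't matter for this baseline

clf = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
clf.fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")      # 0.95 -- looks great
print(f"Recall (sick): {recall_score(y, y_pred):.2f}")   # 0.00 -- misses every sick patient
```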
How to Detect Imbalance in Your Dataset
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("clinical_data.csv")
# Check class distribution
print(df["diagnosis"].value_counts())
print(df["diagnosis"].value_counts(normalize=True)) # percentages
# Visualize
df["diagnosis"].value_counts().plot(kind="bar", color=["steelblue", "tomato"])
plt.title("Class Distribution")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
A ratio above 4:1 warrants attention. Above 10:1, you should implement at least one of the strategies below.
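If you want a single number to check against those thresholds, the majority-to-minority ratio follows directly from the value counts. A small sketch with a made-up label column standing in for df["diagnosis"]:

```python
import pandas as pd

# Stand-in for df["diagnosis"]; substitute your own label column
diagnosis = pd.Series([0] * 920 + [1] * 80)

counts = diagnosis.value_counts()
ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {ratio:.1f}:1")  # 11.5:1

if ratio > 10:
    print("Severe imbalance -- apply one of the strategies below")
elif ratio > 4:
    print("Moderate imbalance -- worth attention")
```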
Strategy 1: Better Evaluation Metrics
Before touching your data or model, fix your evaluation. These are the metrics you should use:
Balanced Accuracy — the mean of per-class accuracy:
from sklearn.metrics import balanced_accuracy_score
print(balanced_accuracy_score(y_test, y_pred))
F1 Score (especially macro or weighted):
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_test, y_pred)) # shows per-class precision, recall, F1
AUC-ROC — measures discriminative ability across all thresholds:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_proba[:, 1]))
AUC-PR (Precision-Recall Curve) — often more informative than ROC for severe imbalance:
from sklearn.metrics import average_precision_score
print(average_precision_score(y_test, y_proba[:, 1]))
Strategy 2: Data-Level Approaches (Resampling)
Random Undersampling
Remove samples from the majority class until the classes are balanced. Fast, simple, but discards real data.
from imblearn.under_sampling import RandomUnderSampler
import numpy as np  # needed for np.unique below
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print(f"Before: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
When to use: When you have a large majority class and losing data is acceptable.
Random Oversampling
Duplicate minority class samples at random. Risk of overfitting because you’re just repeating examples.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic minority samples by interpolating between existing ones. It’s smarter than random oversampling because it creates new, plausible data points rather than duplicates.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
How SMOTE works: For each minority sample, it finds the k nearest neighbors among the minority samples, picks one at random, and creates a synthetic point along the line segment connecting the sample to that neighbor.
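The interpolation step can be sketched in a few lines of NumPy. This is a toy illustration of the mechanism, not the library implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority samples in a 2-D feature space
minority = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 1.8], [1.2, 2.2]])

def smote_point(X, i, k=3, rng=rng):
    """Create one synthetic sample from minority sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # distances to all minority samples
    neighbors = np.argsort(dists)[1:k + 1]     # k nearest, skipping the point itself
    j = rng.choice(neighbors)                  # pick one neighbor at random
    gap = rng.random()                         # random position on the segment
    return X[i] + gap * (X[j] - X[i])          # interpolate between the two points

synthetic = smote_point(minority, i=0)
print(synthetic)  # lies on a segment between minority[0] and one of its neighbors
```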
Variants worth knowing:
- SMOTE — standard
- SMOTEENN — SMOTE + Edited Nearest Neighbours (cleans noisy majority samples)
- SMOTETomek — SMOTE + Tomek links (removes borderline ambiguous samples)
- BorderlineSMOTE — only oversamples minority samples near the decision boundary
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)
Important: Only apply resampling to the training set. Never resample validation or test data — that would give you misleading evaluation results.
Strategy 3: Algorithm-Level Approaches
Some models handle imbalance natively through the class_weight parameter:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# All of these support class_weight='balanced'
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
Setting class_weight="balanced" weights each class inversely proportional to its frequency, so the model penalizes mistakes on the minority class more heavily during training.
For scikit-learn, you can also pass a custom dictionary: class_weight={0: 1, 1: 10} to make the model 10x more sensitive to class 1 errors.
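If you want to see what "balanced" resolves to for your data, scikit-learn exposes the underlying computation directly. A sketch with a hypothetical 9:1 label array:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 100)  # hypothetical 9:1 training labels

# "balanced" assigns n_samples / (n_classes * count) to each class
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_train)
print(dict(zip([0, 1], weights)))  # {0: 0.555..., 1: 5.0}
```

The minority class ends up weighted 9x more than the majority class, matching the 9:1 imbalance.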
Strategy 4: Cost-Sensitive Learning
This is an extension of class weighting where you explicitly define the cost of different types of errors. In clinical settings, a false negative (missing a sick patient) is far more costly than a false positive (incorrectly flagging a healthy patient).
# Threshold adjustment: lower threshold to catch more positives
y_proba = clf.predict_proba(X_test)[:, 1]
threshold = 0.3 # default is 0.5; lower catches more positives at cost of precision
y_pred_adjusted = (y_proba >= threshold).astype(int)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred_adjusted))
Adjusting your classification threshold is often the most clinically meaningful intervention — it directly controls the sensitivity/specificity tradeoff.
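Rather than hard-coding 0.3, you can derive the threshold from a clinical requirement such as "catch at least 90% of positives." A sketch using sklearn's precision_recall_curve on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up ground truth and predicted probabilities for illustration
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.6, 0.3, 0.55, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

target_recall = 0.9
# thresholds[i] corresponds to recall[i]; pick the highest threshold
# that still meets the sensitivity requirement
ok = np.where(recall[:-1] >= target_recall)[0]
threshold = thresholds[ok[-1]]
print(f"Chosen threshold: {threshold:.2f}")

y_pred = (y_proba >= threshold).astype(int)
print(f"Recall at that threshold: {(y_pred[y_test == 1] == 1).mean():.2f}")
```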
Medical AI Context: Specific Considerations
In my own research on Parkinson’s disease subtyping, class imbalance was a constant challenge. Here are medical-AI-specific considerations I’ve learned:
1. Consider clinical prevalence when choosing your target ratio. Oversampling to a perfect 50:50 ratio creates a synthetic distribution that doesn’t reflect clinical reality. A reasonable target might be 3:1 or 4:1, not 1:1.
2. Feature-wise SMOTE can generate clinically invalid samples. If you’re generating synthetic patients, make sure interpolated values remain clinically plausible (e.g., age can’t be negative, lab values shouldn’t exceed physiological bounds). Post-process SMOTE outputs to enforce range constraints.
3. Report both balanced and unbalanced metrics. In your paper’s results section, report accuracy (for comparison), balanced accuracy, F1-macro, and AUC-ROC. Reviewers will ask for all of them.
4. Imbalance in multi-class settings is harder. In patient subtyping, you may have 3–5 subtypes with wildly different frequencies. SMOTE can still work, but consider the SMOTEN variant for nominal features.
Complete Code Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score, roc_auc_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# Load data
df = pd.read_csv("clinical_data.csv")
X = df.drop("diagnosis", axis=1)
y = df["diagnosis"]
# Stratified split — preserves class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Pipeline: SMOTE + classifier (SMOTE only applied to training data)
pipeline = Pipeline([
("smote", SMOTE(random_state=42)),
("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42))
])
# Cross-validate with stratified folds
from sklearn.model_selection import cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="balanced_accuracy")
print(f"CV Balanced Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Final evaluation
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
Key design choices in this example:
- stratify=y in train_test_split ensures the imbalance ratio is preserved in both splits
- SMOTE sits inside a Pipeline so it is fit only on training folds during cross-validation
- StratifiedKFold ensures each fold maintains the class ratio
When NOT to Resample
Resampling isn’t always the answer. Skip it when:
- Your imbalance is mild (less than 4:1 ratio) — class weighting is usually sufficient
- You have very few minority samples — SMOTE can’t interpolate meaningfully with fewer than ~10 examples per class
- Your test set is your priority — if you need the model to perform realistically in deployment, keep the natural class distribution and adjust thresholds instead
- The paper you’re comparing against didn’t resample — for fair comparison, maintain consistent preprocessing
Key Takeaways
- Accuracy is a misleading metric for imbalanced datasets — switch to balanced accuracy, F1-macro, and AUC-ROC.
- SMOTE is the most reliable default oversampling strategy; SMOTETomek is better for noisy boundaries.
- Always put resampling inside a cross-validation pipeline — never resample before splitting.
- For medical AI: lower your classification threshold to increase sensitivity, and report the tradeoff explicitly.
- class_weight="balanced" is often enough — try it before reaching for SMOTE.
Reading resources
A good resource from Analytics Vidhya on handling imbalanced datasets. You can also read this KDD 2018 paper, in which the authors make their deep learning architecture robust to class imbalance.