Why 98% accuracy can mean a useless model — and how to detect, measure, and fix class imbalance with resampling, SMOTE, and cost-sensitive learning.
Updated for 2025 — originally published August 2018
Class imbalance is one of the most common real-world problems in machine learning — and it’s especially severe in medical datasets. Here’s a complete, code-first guide to detecting and handling it properly.
If you’ve ever trained a classifier that achieved 98% accuracy and thought “this is too good to be true” — you were probably right. More often than not, that number reflects a heavily imbalanced dataset where your model learned to predict the majority class almost exclusively.
This problem is pervasive in clinical research. Disease prevalence is low. Fraud is rare. Equipment failure is uncommon. Your dataset naturally has far more “normal” examples than “abnormal” ones, and naive classifiers exploit that asymmetry.
In this article, I’ll explain the theory clearly and then walk through practical code solutions — the same approaches I use in my medical AI research.
Table of Contents
- What Is Class Imbalance?
- Why Standard Accuracy Fails
- How to Detect Imbalance in Your Dataset
- Strategy 1: Better Evaluation Metrics
- Strategy 2: Data-Level Approaches (Resampling)
- Strategy 3: Algorithm-Level Approaches
- Strategy 4: Cost-Sensitive Learning
- Medical AI Context: Specific Considerations
- Complete Code Example
- When NOT to Resample
What Is Class Imbalance?
Imbalanced datasets occur when one class (the minority, or positive class) has significantly fewer samples than another (the majority, or negative class).
Common imbalance ratios in real-world problems:
| Domain | Typical Imbalance |
|---|---|
| Medical diagnosis (rare disease) | 1:100 to 1:1000 |
| Fraud detection | 1:100 to 1:500 |
| Fault detection (manufacturing) | 1:10 to 1:100 |
| Clinical trial outcomes | 1:10 to 1:50 |
| Parkinson’s disease subtyping (my research) | 1:3 to 1:8 |
The degree of imbalance matters. A 60/40 split is usually fine. A 99/1 split requires careful handling.
Why Standard Accuracy Fails
Consider a disease screening dataset: 950 healthy patients (class 0) and 50 sick patients (class 1).
A model that always predicts “healthy” achieves 95% accuracy while being completely useless clinically. It would miss every single sick patient.
This is why accuracy is the wrong metric for imbalanced problems. You need metrics that account for class-level performance separately.
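To make this concrete, here is a minimal sketch using scikit-learn's DummyClassifier on a synthetic 950/50 split matching the example above. The baseline that always predicts "healthy" looks excellent on accuracy and is useless on recall:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic screening data: 950 healthy (0), 50 sick (1)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features don't matter for this baseline

clf = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
clf.fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")      # 0.95 -- looks great
print(f"Recall (sick): {recall_score(y, y_pred):.2f}")   # 0.00 -- misses every sick patient
```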
How to Detect Imbalance in Your Dataset
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("clinical_data.csv")
# Check class distribution
print(df["diagnosis"].value_counts())
print(df["diagnosis"].value_counts(normalize=True)) # percentages
# Visualize
df["diagnosis"].value_counts().plot(kind="bar", color=["steelblue", "tomato"])
plt.title("Class Distribution")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
A ratio above 4:1 warrants attention. Above 10:1, you should implement at least one of the strategies below.
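If you want a single number to check against those thresholds, the majority-to-minority ratio follows directly from the value counts. A small sketch with a made-up label column standing in for df["diagnosis"]:

```python
import pandas as pd

# Stand-in for df["diagnosis"]; substitute your own label column
diagnosis = pd.Series([0] * 920 + [1] * 80)

counts = diagnosis.value_counts()
ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {ratio:.1f}:1")  # 11.5:1

if ratio > 10:
    print("Severe imbalance -- apply one of the strategies below")
elif ratio > 4:
    print("Moderate imbalance -- worth attention")
```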
Strategy 1: Better Evaluation Metrics
Before touching your data or model, fix your evaluation. These are the metrics you should use:
Balanced Accuracy — the mean of per-class accuracy:
from sklearn.metrics import balanced_accuracy_score
print(balanced_accuracy_score(y_test, y_pred))
F1 Score (especially macro or weighted):
from sklearn.metrics import f1_score, classification_report
print(classification_report(y_test, y_pred)) # shows per-class precision, recall, F1
AUC-ROC — measures discriminative ability across all thresholds:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_proba[:, 1]))
AUC-PR (Precision-Recall Curve) — often more informative than ROC for severe imbalance:
from sklearn.metrics import average_precision_score
print(average_precision_score(y_test, y_proba[:, 1]))
Strategy 2: Data-Level Approaches (Resampling)
Random Undersampling
Remove samples from the majority class until the classes are balanced. Fast, simple, but discards real data.
from imblearn.under_sampling import RandomUnderSampler
import numpy as np  # needed for np.unique below
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print(f"Before: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
When to use: When you have a large majority class and losing data is acceptable.
Random Oversampling
Duplicate minority class samples at random. Risk of overfitting because you’re just repeating examples.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic minority samples by interpolating between existing ones. It’s smarter than random oversampling because it creates new, plausible data points rather than duplicates.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
How SMOTE works: For each minority sample, it finds the k nearest neighbors among the minority samples, picks one at random, and creates a synthetic point along the line segment connecting the sample to that neighbor.
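The interpolation step can be sketched in a few lines of NumPy. This is a toy illustration of the mechanism, not the library implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority samples in a 2-D feature space
minority = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 1.8], [1.2, 2.2]])

def smote_point(X, i, k=3, rng=rng):
    """Create one synthetic sample from minority sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # distances to all minority samples
    neighbors = np.argsort(dists)[1:k + 1]     # k nearest, skipping the point itself
    j = rng.choice(neighbors)                  # pick one neighbor at random
    gap = rng.random()                         # random position on the segment
    return X[i] + gap * (X[j] - X[i])          # interpolate between the two points

synthetic = smote_point(minority, i=0)
print(synthetic)  # lies on a segment between minority[0] and one of its neighbors
```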
Variants worth knowing:
- SMOTE — standard
- SMOTEENN — SMOTE + Edited Nearest Neighbours (cleans noisy majority samples)
- SMOTETomek — SMOTE + Tomek links (removes borderline ambiguous samples)
- BorderlineSMOTE — only oversamples minority samples near the decision boundary
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)
Important: Only apply resampling to the training set. Never resample validation or test data — that would give you misleading evaluation results.
Strategy 3: Algorithm-Level Approaches
Some models handle imbalance natively through the class_weight parameter:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# All of these support class_weight='balanced'
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
Setting class_weight="balanced" weights each class inversely proportional to its frequency, so the model penalizes mistakes on the minority class more heavily during training.
For scikit-learn, you can also pass a custom dictionary: class_weight={0: 1, 1: 10} to make the model 10x more sensitive to class 1 errors.
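If you want to see what "balanced" resolves to for your data, scikit-learn exposes the underlying computation directly. A sketch with a hypothetical 9:1 label array:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 100)  # hypothetical 9:1 training labels

# "balanced" assigns n_samples / (n_classes * count) to each class
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_train)
print(dict(zip([0, 1], weights)))  # {0: 0.555..., 1: 5.0}
```

The minority class ends up weighted 9x more than the majority class, matching the 9:1 imbalance.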
Strategy 4: Cost-Sensitive Learning
This is an extension of class weighting where you explicitly define the cost of different types of errors. In clinical settings, a false negative (missing a sick patient) is far more costly than a false positive (incorrectly flagging a healthy patient).
# Threshold adjustment: lower threshold to catch more positives
y_proba = clf.predict_proba(X_test)[:, 1]
threshold = 0.3 # default is 0.5; lower catches more positives at cost of precision
y_pred_adjusted = (y_proba >= threshold).astype(int)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred_adjusted))
Adjusting your classification threshold is often the most clinically meaningful intervention — it directly controls the sensitivity/specificity tradeoff.
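Rather than hard-coding 0.3, you can derive the threshold from a clinical requirement such as "catch at least 90% of positives." A sketch using sklearn's precision_recall_curve on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up ground truth and predicted probabilities for illustration
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.6, 0.3, 0.55, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

target_recall = 0.9
# thresholds[i] corresponds to recall[i]; pick the highest threshold
# that still meets the sensitivity requirement
ok = np.where(recall[:-1] >= target_recall)[0]
threshold = thresholds[ok[-1]]
print(f"Chosen threshold: {threshold:.2f}")

y_pred = (y_proba >= threshold).astype(int)
print(f"Recall at that threshold: {(y_pred[y_test == 1] == 1).mean():.2f}")
```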
Medical AI Context: Specific Considerations
In my own research on Parkinson’s disease subtyping, class imbalance was a constant challenge. Here are medical-AI-specific considerations I’ve learned:
1. Consider clinical prevalence when choosing your target ratio. Oversampling to a perfect 50:50 ratio creates a synthetic distribution that doesn’t reflect clinical reality. A reasonable target might be 3:1 or 4:1, not 1:1.
2. Feature-wise SMOTE can generate clinically invalid samples. If you’re generating synthetic patients, make sure interpolated values remain clinically plausible (e.g., age can’t be negative, lab values shouldn’t exceed physiological bounds). Post-process SMOTE outputs to enforce range constraints.
3. Report both balanced and unbalanced metrics. In your paper’s results section, report accuracy (for comparison), balanced accuracy, F1-macro, and AUC-ROC. Reviewers will ask for all of them.
4. Imbalance in multi-class settings is harder. In patient subtyping, you may have 3–5 subtypes with wildly different frequencies. SMOTE can still work, but consider the SMOTEN variant for nominal features.
Complete Code Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score, roc_auc_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# Load data
df = pd.read_csv("clinical_data.csv")
X = df.drop("diagnosis", axis=1)
y = df["diagnosis"]
# Stratified split — preserves class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Pipeline: SMOTE + classifier (SMOTE only applied to training data)
pipeline = Pipeline([
("smote", SMOTE(random_state=42)),
("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42))
])
# Cross-validate with stratified folds
from sklearn.model_selection import cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="balanced_accuracy")
print(f"CV Balanced Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Final evaluation
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
Key design choices in this example:
- stratify=y in train_test_split ensures the imbalance ratio is preserved in both splits
- SMOTE sits inside a Pipeline so it is fit only on training folds during cross-validation
- StratifiedKFold ensures each fold maintains the class ratio
When NOT to Resample
Resampling isn’t always the answer. Skip it when:
- Your imbalance is mild (less than 4:1 ratio) — class weighting is usually sufficient
- You have very few minority samples — SMOTE can’t interpolate meaningfully with fewer than ~10 examples per class
- Your test set is your priority — if you need the model to perform realistically in deployment, keep the natural class distribution and adjust thresholds instead
- The paper you’re comparing against didn’t resample — for fair comparison, maintain consistent preprocessing
Key Takeaways
- Accuracy is a misleading metric for imbalanced datasets — switch to balanced accuracy, F1-macro, and AUC-ROC.
- SMOTE is the most reliable default oversampling strategy; SMOTETomek is better for noisy boundaries.
- Always put resampling inside a cross-validation pipeline — never resample before splitting.
- For medical AI: lower your classification threshold to increase sensitivity, and report the tradeoff explicitly.
- class_weight="balanced" is often enough — try it before reaching for SMOTE.
Reading resources
A good resource from Analytics Vidhya on handling imbalanced datasets. You can also read this KDD 2018 paper, in which the authors make their deep learning architecture robust to class imbalance.