Working with Imbalanced Datasets: A Practical Guide for ML and Medical AI Researchers

Why 98% accuracy can mean a useless model, and how to detect, measure, and fix class imbalance with resampling, SMOTE, and cost-sensitive learning. (Updated for 2025; originally published August 2018.) Class imbalance is one of the most common real-world problems in machine learning, and it’s especially severe in medical datasets. Here’s a …

The Ultimate Guide to Machine Learning & AI Datasets for 2025

Back when I first wrote this post in 2022, the AI landscape was a different place. Finding a good dataset felt like a treasure hunt, piecing together links from university pages, old GitHub repos, and forum posts. I started this list to keep track of the gems I stumbled upon during literature reviews or through …

Bengali Datasets in 2025 for Named Entity Recognition and Sentiment Analysis

(Updated in 2025) A major limitation of current AI research is the overemphasis on English and the under-representation of the Bengali or Bangla language. Despite being the seventh most spoken language worldwide, Bengali remains a low-resource language in the field of Natural Language Processing (NLP). This scarcity of resources poses significant challenges for developing robust …

AI and ML Datasets to Start With — Part 2: Social Networks, Industry Collections, and Medical Data

Beyond the UCI Repository: the best datasets for graph analysis, clinical NLP, and medical imaging, from open access to application-required tiers. (Updated 2025; originally published January 2019.) Part 2 of the dataset series covers social network analysis datasets, open-source collections from Google and Microsoft, and an expanded section on medical AI datasets …