AI and ML Datasets to Start With — Part 2: Social Networks, Industry Collections, and Medical Data

Beyond the UCI Repository — the best datasets for graph analysis, clinical NLP, and medical imaging, from open access to application-required tiers.

Updated 2025 — originally published January 2019


Part 2 of the dataset series covers social network analysis datasets, open-source collections from Google and Microsoft, and an expanded section on medical AI datasets — the resources you need when the standard UCI Repository isn’t enough.

In Part 1 (now updated as the Ultimate Guide to ML & AI Datasets for 2025), I covered the major dataset platforms and NLP/vision collections. This post goes deeper on specific domains: social network analysis, datasets open-sourced by major tech companies, and medical/clinical data resources.

The landscape has shifted from “model-centric” to “data-centric” AI — meaning finding the right dataset for your problem is now as important as designing the right model. This guide helps you navigate that search.


Social Network Analysis Datasets

Stanford SNAP (Stanford Network Analysis Project)

Link: https://snap.stanford.edu/data/

The gold standard for network analysis datasets. Covers:

Network Type | Example | Use Cases
--- | --- | ---
Social networks | Facebook ego networks, Twitter follows | Community detection, influence propagation
Ground-truth communities | Amazon product networks | Benchmarking community detection algorithms
Communication networks | Email networks (Enron, EU email) | Information flow analysis
Citation networks | High-energy physics citations | Academic network analysis
Collaboration networks | DBLP co-authorship | Scientific collaboration patterns
Web graphs | Google web graph | PageRank, web mining
Amazon networks | Product co-purchasing | Recommendation systems
Road networks | Road network graphs | Routing algorithms
Temporal networks | Messages with timestamps | Dynamic network analysis
Wikipedia networks | Talk pages, edit history | Collaborative dynamics

Why SNAP is essential: Every dataset is well-documented, includes basic statistics, and has associated papers that establish baseline results. If you’re doing any graph ML research, start here.
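SNAP datasets ship as plain-text edge lists: "#"-prefixed comment lines followed by one whitespace-separated source/destination pair per line. A minimal stdlib-only sketch of parsing that format and computing node degrees (the sample edges below are made up for illustration, not taken from a real SNAP file):

```python
from collections import Counter

def parse_edge_list(text):
    """Parse a SNAP-style edge list: '#'-prefixed comment lines,
    then one whitespace-separated 'src dst' pair per line."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges

def degree_counts(edges):
    """Undirected degree of each node (each endpoint counted once per edge)."""
    deg = Counter()
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    return deg

# Toy data in the same layout as SNAP files such as facebook_combined.txt
sample = """\
# Nodes: 4 Edges: 4
0 1
0 2
1 2
2 3
"""
deg = degree_counts(parse_edge_list(sample))
print(deg)  # node 2 touches three edges
```

The same parsing works for the larger files; for serious work you would hand the edge list to a graph library rather than roll your own traversals.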

AMiner Academic Graph

Link: https://aminer.org/

A comprehensive academic social network:

  • 100+ million papers across CS, medicine, physics
  • Author profiles with affiliations, publications, citations
  • Research topic classification
  • Collaboration networks between researchers

Useful for: citation network analysis, author disambiguation, topic evolution over time, academic influence measurement.
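Academic influence measurement is often approximated with PageRank over the citation graph ("paper A cites paper B" becomes a directed edge A → B). A toy power-iteration sketch over a hypothetical three-paper graph:

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed edge list
    (e.g. paper A cites paper B -> edge (A, B))."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Toy citation graph: paper C is cited by both A and B
edges = [("A", "C"), ("B", "C"), ("A", "B")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # "C" accumulates the most rank
```

On AMiner-scale graphs you would use a sparse-matrix or distributed implementation, but the fixed-point iteration is the same.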


GLUE and SuperGLUE Benchmarks

Link: https://gluebenchmark.com/

GLUE (General Language Understanding Evaluation) is a collection of NLP tasks designed to test general language understanding. Includes:

  • CoLA (grammatical acceptability)
  • SST-2 (sentiment analysis)
  • MRPC (paraphrase detection)
  • STS-B (semantic similarity)
  • QQP (question pair similarity)
  • MNLI (multi-genre NLI)
  • QNLI (question NLI)
  • RTE (textual entailment)
  • WNLI (Winograd NLI)

SuperGLUE is the harder successor, designed after models surpassed human performance on GLUE.

When to use: As a standard evaluation suite when you’re developing a new language model or fine-tuning approach and need to show generalization across tasks.
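Note that GLUE tasks are scored with different metrics; CoLA, for instance, is reported with the Matthews correlation coefficient, which behaves better than accuracy on its imbalanced labels. A pure-Python sketch of that metric:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels,
    the metric used for CoLA in GLUE. Ranges from -1 to +1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate case (a confusion-matrix row/column is empty): define as 0
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

print(matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement -> 1.0
```

In practice you would take this from scikit-learn or the benchmark's official evaluation script rather than re-implement it, but it is worth knowing what the leaderboard number measures.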


Datasets Open-Sourced by Industry

Google Research

  • Google Books N-grams: Frequency of words and phrases in millions of books (useful for language modeling, historical linguistics)
  • Open Images: 9M+ images with bounding boxes, segmentation masks, and visual relationships
  • Natural Questions: Real Google search queries with Wikipedia-sourced answers
  • TyDi QA: Question answering across typologically diverse languages (includes Bengali)

Access: Most are available on the Google Research GitHub and via the Hugging Face Hub

Microsoft Research

  • MS MARCO: 1M+ real Bing queries with human-annotated relevant passages. Widely used for document ranking and passage retrieval research.
  • MultiWOZ: Multi-domain, multi-turn dialogue dataset for task-oriented dialog systems
  • SQuAD 2.0 (co-developed with Stanford): Reading comprehension with unanswerable questions
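To make the MS MARCO passage-retrieval setup concrete, here is a deliberately naive lexical ranker over hypothetical passages. Real MS MARCO baselines use BM25 or learned neural rankers, so treat this only as an illustration of the task shape (query in, ranked passages out):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, passage):
    """Toy lexical relevance: total occurrences of query terms in the passage."""
    q = set(tokenize(query))
    return sum(c for tok, c in Counter(tokenize(passage)).items() if tok in q)

def rank(query, passages):
    """Return passages sorted by descending relevance to the query."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)

# Hypothetical passage pool (not real MS MARCO data)
passages = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
    "Paris hosted the 2024 Olympic Games in France.",
]
print(rank("capital of France", passages)[0])
```

Swapping this scorer for BM25, and then for a cross-encoder re-ranker, is essentially the progression the MS MARCO leaderboard has followed.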

Meta AI Research

  • NLLB-200: Parallel text in 200 languages for translation research (includes many Indian languages)
  • FLORES-200: Evaluation benchmark for multilingual translation across 200 languages
  • ReALBench: Reasoning evaluation dataset

Medical and Clinical AI Datasets (Expanded)

For researchers specifically entering medical AI, here’s the progression from “easy to access” to “requires significant approval process”:

Tier 1: Open Access (Start Here)

ISIC (International Skin Imaging Collaboration)

  • 25,000+ dermatoscopy images
  • Multi-class skin lesion classification
  • Annual Kaggle competition
  • Access: Kaggle (open)

VinDr-CXR

  • 18,000 chest X-rays annotated by Vietnamese radiologists
  • 14 thoracic abnormalities
  • Access: Kaggle (open)

MedMNIST

  • 18 standardized medical image datasets (28×28 and 224×224 resolution)
  • Covers pathology, radiology, dermatology, ophthalmology
  • Access: medmnist.com (open)

Tier 2: Registration Required

PhysioNet datasets (MIMIC-CXR, MIMIC-III, eICU)

  • Requires CITI Human Research training (~4 hours) and a credentialing process
  • Once approved, access to the largest collection of clinical data available for research
  • Access: physionet.org

CheXpert

  • 224,316 chest X-rays from Stanford Hospital
  • Requires Stanford registration
  • Access: stanfordmlgroup.github.io/competitions/chexpert/

Tier 3: Application Required

UK Biobank

  • 500,000 participants, multi-modal data (genetics, imaging, wearables, EHR)
  • Requires a formal research application
  • Access: ukbiobank.ac.uk

PPMI (Parkinson’s Progression Markers Initiative)

  • Longitudinal clinical data for Parkinson’s disease research
  • Requires registration and research proposal
  • Access: ppmi-info.org

i2b2 Shared Task Datasets (Clinical NLP)

The i2b2 (Informatics for Integrating Biology & the Bedside) datasets are the foundational benchmarks for clinical NLP. Available tasks:

Year | Task | Size
--- | --- | ---
2006 | De-identification of clinical notes | 889 documents
2009 | Medication extraction | 268 discharge summaries
2010 | Relations between medical concepts | 394 documents
2011 | Coreference resolution | 427 documents
2012 | Temporal relation extraction | 310 documents
2014 | De-identification (updated) | 1,304 documents

Access: https://www.i2b2.org/NLP/DataSets/ (the datasets are now distributed as the n2c2 challenges via the Harvard DBMI Data Portal)
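To illustrate what the de-identification tasks involve, here is a toy regex redactor for a few PHI categories. The patterns and the sample note are hypothetical; real i2b2/n2c2 systems cover many more categories and use statistical or neural sequence taggers, not three regexes:

```python
import re

# Hypothetical minimal patterns for a few PHI categories targeted
# by the i2b2 de-identification tasks (dates, phone numbers, record IDs)
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b"),
}

def deidentify(note):
    """Replace each matched span with a bracketed category placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

# Fabricated example note (no real patient data)
note = "Seen on 03/14/2019, MRN: 123456, callback 555-867-5309."
print(deidentify(note))
```

The i2b2 shared tasks exist precisely because rule-based redaction like this misses names, addresses, and free-text date formats; the annotated corpora let you measure how much.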


Data-Centric AI: The 2025 Perspective

The key shift in ML practice over the last 3 years: the bottleneck in most real-world ML projects is data quality, not model architecture.

Practical implications:

  • A clean dataset with 10,000 examples often outperforms a noisy dataset with 100,000 examples
  • Systematic analysis of dataset errors yields more improvement than most architectural innovations
  • “Slice-based evaluation” — checking model performance on specific subgroups — catches failures that aggregate metrics miss

When you’re building a medical AI model, spend as much time on data quality assessment as on model design. The investment pays off.
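The slice-based evaluation mentioned above can be sketched in a few lines: group predictions by a metadata attribute (here, a hypothetical scanner site) and report accuracy per group instead of one aggregate number:

```python
from collections import defaultdict

def slice_accuracy(records):
    """Per-slice accuracy. Each record is a (slice_name, y_true, y_pred)
    triple; aggregate metrics can hide a slice that fails badly."""
    hits, totals = defaultdict(int), defaultdict(int)
    for name, y_true, y_pred in records:
        totals[name] += 1
        hits[name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical chest X-ray predictions grouped by scanner site
records = [
    ("site_A", 1, 1), ("site_A", 0, 0), ("site_A", 1, 1),
    ("site_B", 1, 0), ("site_B", 0, 0),
]
print(slice_accuracy(records))  # site_A: 1.0, site_B: 0.5
```

An overall accuracy of 0.8 on this toy data would hide the fact that one site performs at coin-flip level; in medical imaging, that hidden slice is often a particular scanner, hospital, or demographic group.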


Final Words

We hope you found some useful datasets aligned with your area of interest.

If I’ve missed any links worth including, please share them in the comments. Feedback on the guide itself is always welcome.

