Beyond the UCI Repository — the best datasets for graph analysis, clinical NLP, and medical imaging, from open access to application-required tiers.
Updated 2025 — originally published January 2019
Part 2 of the dataset series covers social network analysis datasets, open-source collections from Google and Microsoft, and an expanded section on medical AI datasets — the resources you need when the standard UCI Repository isn’t enough.
In Part 1 (now updated as the Ultimate Guide to ML & AI Datasets for 2025), I covered the major dataset platforms and NLP/vision collections. This post goes deeper on specific domains: social network analysis, datasets open-sourced by major tech companies, and medical/clinical data resources.
The landscape has shifted from “model-centric” to “data-centric” AI — meaning finding the right dataset for your problem is now as important as designing the right model. This guide helps you navigate that search.
Social Network Analysis Datasets
Stanford SNAP (Stanford Network Analysis Project)
Link: https://snap.stanford.edu/data/
The gold standard for network analysis datasets. Covers:
| Network Type | Example | Use Cases |
|---|---|---|
| Social networks | Facebook ego networks, Twitter follows | Community detection, influence propagation |
| Ground-truth communities | Amazon product networks | Benchmarking community detection algorithms |
| Communication networks | Email networks (Enron, EU email) | Information flow analysis |
| Citation networks | High-energy physics citations | Academic network analysis |
| Collaboration networks | DBLP co-authorship | Scientific collaboration patterns |
| Web graphs | Google web graph | PageRank, web mining |
| Amazon networks | Product co-purchasing | Recommendation systems |
| Road networks | Road network graphs | Routing algorithms |
| Temporal networks | Messages with timestamps | Dynamic network analysis |
| Wikipedia networks | Talk pages, edit history | Collaborative dynamics |
Why SNAP is essential: Every dataset is well-documented, includes basic statistics, and has associated papers that establish baseline results. If you’re doing any graph ML research, start here.
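Most SNAP graphs are distributed as plain-text edge lists: `#`-prefixed comment lines followed by one tab-separated node pair per line. A minimal loader, sketched here against a made-up sample rather than a real SNAP file, might look like this:

```python
from collections import defaultdict

# Tiny sample in the SNAP edge-list format (hypothetical data,
# not an actual SNAP download).
SAMPLE = """\
# Directed graph
# FromNodeId\tToNodeId
0\t1
0\t2
1\t2
2\t3
"""

def load_snap_edges(text):
    """Parse a SNAP-style edge list, skipping '#' comment lines."""
    edges = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        u, v = map(int, line.split())
        edges.append((u, v))
    return edges

def out_degrees(edges):
    """Out-degree per node for a directed edge list."""
    deg = defaultdict(int)
    for u, _ in edges:
        deg[u] += 1
    return dict(deg)

edges = load_snap_edges(SAMPLE)
degrees = out_degrees(edges)
```

For serious work you would hand the parsed edges to a graph library (e.g. NetworkX or SNAP's own Python bindings) rather than compute statistics by hand; the point here is only the file format.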
AMiner Academic Graph
Link: https://aminer.org/
A comprehensive academic social network:
- 100+ million papers across CS, medicine, physics
- Author profiles with affiliations, publications, citations
- Research topic classification
- Collaboration networks between researchers
Useful for: citation network analysis, author disambiguation, topic evolution over time, academic influence measurement.
GLUE and SuperGLUE Benchmarks
Link: https://gluebenchmark.com/
GLUE (General Language Understanding Evaluation) is a collection of NLP tasks designed to test general language understanding. Includes:
- CoLA (linguistic acceptability)
- SST-2 (sentiment analysis)
- MRPC (paraphrase detection)
- STS-B (semantic similarity)
- QQP (duplicate question detection)
- MNLI (multi-genre NLI)
- QNLI (question NLI)
- RTE (textual entailment)
- WNLI (Winograd NLI)
SuperGLUE is the harder successor, designed after models surpassed human performance on GLUE.
When to use: As a standard evaluation suite when you’re developing a new language model or fine-tuning approach and need to show generalization across tasks.
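Each GLUE task feeds the model either one sentence or a sentence pair, and the column names differ per task. The mapping below follows the conventions used by the Hugging Face `datasets` hub; treat the names as conventions to verify against the loaded dataset, not guarantees:

```python
# Input columns per GLUE task (Hugging Face hub naming conventions).
GLUE_COLUMNS = {
    "cola": ("sentence",),
    "sst2": ("sentence",),
    "mrpc": ("sentence1", "sentence2"),
    "stsb": ("sentence1", "sentence2"),
    "qqp":  ("question1", "question2"),
    "mnli": ("premise", "hypothesis"),
    "qnli": ("question", "sentence"),
    "rte":  ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def to_model_inputs(task, example):
    """Pull the text field(s) a tokenizer would consume for a GLUE task."""
    cols = GLUE_COLUMNS[task]
    return tuple(example[c] for c in cols)

# An MNLI-style record (made-up example)
pair = to_model_inputs("mnli", {"premise": "A cat sat.",
                                "hypothesis": "An animal sat."})
```

A dispatch table like this keeps one fine-tuning script working across all nine tasks instead of hard-coding column names per task.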
Datasets Open-Sourced by Industry
Google Research
- Google Books N-grams: Frequency of words and phrases in millions of books (useful for language modeling, historical linguistics)
- Open Images: 9M+ images with bounding boxes, segmentation masks, and visual relationships
- Natural Questions: Real Google search queries with Wikipedia-sourced answers
- TyDi QA: Question answering across typologically diverse languages (includes Bengali)
Access: Most available on Google Research GitHub and via Hugging Face Hub
Microsoft Research
- MS MARCO: 1M+ real Bing queries with human-annotated relevant passages. Widely used for document ranking and passage retrieval research.
- MultiWOZ: Multi-domain, multi-turn dialogue dataset for task-oriented dialog systems
- SQuAD 2.0 (co-developed with Stanford): Reading comprehension with unanswerable questions
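MS MARCO passage ranking is scored with MRR@10: the reciprocal rank of the first relevant passage in the top 10, averaged over queries. The metric itself fits in a few lines (the query and passage ids below are made up):

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k,
    or 0.0 if none appears there."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_mrr(runs, qrels, k=10):
    """Average MRR@k over queries. `runs` maps query id -> ranked passage
    ids; `qrels` maps query id -> set of relevant passage ids."""
    return sum(mrr_at_k(runs[q], qrels[q], k) for q in runs) / len(runs)

# Toy run: q1 finds its relevant passage at rank 2, q2 misses entirely.
runs = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
qrels = {"q1": {"p7"}, "q2": {"p4"}}
score = mean_mrr(runs, qrels)  # (1/2 + 0) / 2 = 0.25
```

Implementing the metric yourself is useful for debugging, but for leaderboard submissions use the official MS MARCO evaluation script so that tie-breaking and formatting match.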
Meta AI Research
- NLLB-200: Parallel text in 200 languages for translation research (includes many Indian languages)
- FLORES-200: Evaluation benchmark for multilingual translation across 200 languages
- ReALBench: Reasoning evaluation dataset
Medical and Clinical AI Datasets (Expanded)
For researchers specifically entering medical AI, here’s the progression from “easy to access” to “requires significant approval process”:
Tier 1: Open Access (Start Here)
ISIC (International Skin Imaging Collaboration)
- 25,000+ dermoscopy images
- Multi-class skin lesion classification
- Annual Kaggle competition
- Access: Kaggle (open)
VinDr-CXR
- 18,000 chest X-rays annotated by Vietnamese radiologists
- 14 thoracic abnormalities
- Access: Kaggle (open)
MedMNIST
- 18 standardized medical image datasets (28×28 and 224×224 resolution)
- Covers pathology, radiology, dermatology, ophthalmology
- Access: medmnist.com (open)
Tier 2: Registration Required
PhysioNet datasets (MIMIC-CXR, MIMIC-III, eICU)
- Requires CITI Human Research training (~4 hours) and a credentialing process
- Once approved, you get access to one of the largest collections of de-identified clinical data available for research
- Access: physionet.org
CheXpert
- 224,316 chest X-rays from Stanford Hospital
- Requires Stanford registration
- Access: stanfordmlgroup.github.io/competitions/chexpert/
Tier 3: Application Required
UK Biobank
- 500,000 participants, multi-modal data (genetics, imaging, wearables, EHR)
- Requires a formal research application
- Access: ukbiobank.ac.uk
PPMI (Parkinson’s Progression Markers Initiative)
- Longitudinal clinical data for Parkinson’s disease research
- Requires registration and research proposal
- Access: ppmi-info.org
i2b2 Shared Task Datasets (Clinical NLP)
The i2b2 (Informatics for Integrating Biology & the Bedside) datasets are the foundational benchmarks for clinical NLP. Available tasks:
| Year | Task | Size |
|---|---|---|
| 2006 | De-identification of clinical notes | 889 documents |
| 2009 | Medication extraction | 268 discharge summaries |
| 2010 | Relations between medical concepts | 394 documents |
| 2011 | Coreference resolution | 427 documents |
| 2012 | Temporal relation extraction | 310 documents |
| 2014 | De-identification (updated) | 1,304 documents |
Access: https://www.i2b2.org/NLP/DataSets/ (now maintained under the n2c2 name; registration and a data use agreement are required)
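To make the de-identification tasks concrete, here is a toy rule-based de-identifier in the spirit of the 2006/2014 shared tasks: replace a few PHI patterns with category tags. Real systems combine many more patterns with statistical or neural models; these three regexes are purely illustrative, and the clinical note is invented:

```python
import re

# Minimal PHI patterns: dates, US-style phone numbers, doctor names.
PHI_PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[DOCTOR]"),
]

def deidentify(note):
    """Replace each matched PHI span with its category tag."""
    for pattern, tag in PHI_PATTERNS:
        note = pattern.sub(tag, note)
    return note

note = "Seen by Dr. Rahman on 01/15/2014; call 555-123-4567 to follow up."
clean = deidentify(note)
# clean == "Seen by [DOCTOR] on [DATE]; call [PHONE] to follow up."
```

Surrogate tags like `[DATE]` preserve document structure for downstream NLP while removing the identifying surface forms, which is exactly how the i2b2 corpora themselves are released.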
Data-Centric AI: The 2025 Perspective
The key shift in ML practice over the last 3 years: the bottleneck in most real-world ML projects is data quality, not model architecture.
Practical implications:
- A clean dataset with 10,000 examples often outperforms a noisy dataset with 100,000 examples
- Systematic analysis of labeling errors in your dataset often yields more improvement than architectural innovations
- “Slice-based evaluation” — checking model performance on specific subgroups — catches failures that aggregate metrics miss
When you’re building a medical AI model, spend as much time on data quality assessment as on model design. The investment pays off.
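Slice-based evaluation from the list above takes only a few lines of plain Python; the subgroup labels and predictions here are made up to show how an acceptable aggregate score can hide a failing slice:

```python
from collections import defaultdict

def slice_accuracy(y_true, y_pred, slices):
    """Per-subgroup accuracy: slices[i] names the subgroup of example i."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, s in zip(y_true, y_pred, slices):
        totals[s] += 1
        hits[s] += int(t == p)
    return {s: hits[s] / totals[s] for s in totals}

# Overall accuracy is 62.5%, which may look tolerable --
# but the "pediatric" slice is at 25%.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
slices = ["adult"] * 4 + ["pediatric"] * 4
per_slice = slice_accuracy(y_true, y_pred, slices)
# per_slice == {"adult": 1.0, "pediatric": 0.25}
```

In practice the slices come from metadata you already have (age group, scanner type, hospital site); the hard part is deciding which slices matter clinically, not computing the metric.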
Final Words
We hope you found some useful datasets aligned with your area of interest.
If you know of useful datasets I have missed, please share them in the comments. Feedback on the guide itself is also welcome.