Datasets to start with – Part 2

Article Summary

In this blog article, we add a new list of dataset resources covering applications such as social network analysis, recommendation systems, information retrieval and medical NLP. In addition, we provide a list of datasets open-sourced by leading companies like Google and Microsoft.

In Part 1 of this series of dataset posts, I shared a similar list of datasets, most of which are freely available to everyone. However, to access certain medical datasets such as i2b2 and MIMIC III, you will need to sign a data use and confidentiality agreement.

Social Network Analysis

1. AMiner
2. General Language Understanding Evaluation (GLUE) benchmark

This is a set of benchmark datasets covering a number of NLP tasks. The associated GitHub repository provides instructions for downloading the training, validation and test datasets.

3. Stanford Large Network Dataset Collection
Social networks : online social networks, edges represent interactions between people
3.1 Networks with ground-truth communities : ground-truth network communities in social and information networks
3.2 Communication networks : email communication networks with edges representing communication
3.3 Citation networks : nodes represent papers, edges represent citations
3.4 Collaboration networks : nodes represent scientists, edges represent collaborations (co-authoring a paper)
3.5 Web graphs : nodes represent webpages and edges are hyperlinks
3.6 Amazon networks : nodes represent products and edges link commonly co-purchased products
3.7 Internet networks : nodes represent computers and edges communication
3.8 Road networks : nodes represent intersections and edges roads connecting the intersections
3.9 Autonomous systems : graphs of the internet
3.10 Signed networks : networks with positive and negative edges (friend/foe, trust/distrust)
3.11 Location-based online social networks : Social networks with geographic check-ins
3.12 Wikipedia networks, articles, and metadata : Talk, editing, voting, and article data from Wikipedia
3.13 Temporal networks : networks where edges have timestamps
3.14 Twitter and Memetracker : Memetracker phrases, links and 467 million Tweets
3.15 Online communities : Data from online communities such as Reddit and Flickr
3.16 Online reviews : Data from online review systems such as BeerAdvocate and Amazon

The list above is taken verbatim from the Stanford SNAP website.
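As far as I know, many of the graphs in this collection are distributed as plain-text edge lists: one "source target" pair per line, with header lines starting with "#". Below is a minimal Python sketch of reading such a file into an adjacency map. The function name read_edge_list and the sample data are made up for illustration, and the file layout is an assumption, so please check the description page of the dataset you download.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_edge_list(lines: Iterable[str], directed: bool = True) -> Dict[str, Set[str]]:
    """Build an adjacency map from whitespace-separated 'source target' lines.

    Hypothetical helper: assumes '#' marks comment lines and that the first
    two fields on every other line are node ids.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        adjacency[source].add(target)
        # Register the target as a node even if it has no outgoing edges.
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return dict(adjacency)


# Tiny made-up sample, not real SNAP data.
SAMPLE = """\
# Hypothetical directed graph
# FromNodeId ToNodeId
0 1
0 2
1 2
2 0
"""

graph = read_edge_list(SAMPLE.splitlines())
num_nodes = len(graph)
num_edges = sum(len(neighbors) for neighbors in graph.values())
print(num_nodes, num_edges)
```

For a real download, you could pass an open file handle (for example from gzip.open(path, "rt")) instead of SAMPLE.splitlines(). Use directed=False for undirected graphs such as collaboration networks.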

4. Recommendation datasets

4.1 Librec
4.2 ReDial (Recommendation Dialogues)

It is an annotated dataset of dialogues, where users recommend movies to each other.

5. Information Retrieval

5.1 GitHub repo by lsminoula

Contains links to numerous IR datasets.

6. Medical datasets (NLP based)

6.1 i2b2 NLP datasets

i2b2 stands for Informatics for Integrating Biology and the Bedside.

6.2 CLEF eHealth datasets
6.3 SNAP Biomedical Network Dataset Collection
6.4 The CRAFT (Colorado Richly Annotated Full-Text) Corpus
6.6 MedNLI
6.7 Medical Data for Machine Learning on GitHub by beamandrew
6.8 FDA Adverse Event Reporting System (FAERS)
6.9 Commercial Health Insurance Payer Database
6.10 Adverse Drug Event database
6.11 Papers with code website
6.12 TREC 2017 medical LiveQA

7. Industry datasets

7.1 DeepMind open-source datasets
7.2 Microsoft/PointerSQL
7.3 Microsoft/Learning to Represent Programs with Graphs dataset – ICLR 2018 [Download link]
7.4 Microsoft Information-Seeking Conversation (MISC) dataset [Download link]
7.5 Microsoft Research Open Data

I hope you found some useful datasets aligned with your area of interest. If you know of any links that I have missed, kindly add them in the comments below. Also, please share your feedback so that I can improve further.


Hello everyone. I am Soumyadeep. I have been working on machine learning projects for the last 4 years. I am now pursuing a Ph.D. in the Computer Science Department at IIT Kharagpur. I completed my M.S. (Research) in the same department in November 2019. My research interests involve applying Machine Learning, NLP and Deep Learning to solve Online Reputation Monitoring and Consumer Health Search problems.
