In this blog article, we add a new list of dataset resources covering applications such as social network analysis, recommendation systems, information retrieval, and medical NLP. We also list datasets open-sourced by industry leaders such as Google and Microsoft.
In Part 1 of this series of
Social Network Analysis
2. General Language Understanding Evaluation (GLUE) benchmark
This is a collection of benchmark datasets spanning a number of NLP tasks. The associated GitHub repository provides instructions for downloading the training, validation, and test datasets.
3. Stanford Large Network Dataset Collection
Social networks : online social networks, edges represent interactions between people
3.1 Networks with ground-truth
3.11 Location-based online social
3.12 Wikipedia networks, articles, and
3.14 Twitter and
3.16 Online reviews : Data from online review systems such as BeerAdvocate and Amazon
Taken verbatim from the Stanford SNAP website.
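Since the networks above are described in terms of edges between people, here is a minimal sketch of how such a graph could be loaded. It assumes a plain-text edge list where each line holds two whitespace-separated node IDs and header lines start with `#`, which is a common layout but should be checked against the description of each dataset. The function name `read_edge_list` and the sample data are hypothetical and used only for illustration.

```python
from collections import defaultdict
from io import StringIO


def read_edge_list(lines):
    """Build an undirected adjacency map from whitespace-separated node pairs."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment headers (assumed file convention).
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        # Treat every edge as undirected: record it in both directions.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


# Hypothetical sample standing in for a downloaded edge-list file.
sample = StringIO("# Nodes: 4 Edges: 3\n1\t2\n1\t3\n3\t4\n")
graph = read_edge_list(sample)

# Degree of each node, i.e. how many distinct people it interacts with.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)
```

To use it on a real file, pass an open file handle instead of the `StringIO` sample. For directed networks, drop the second `add` call so that each edge is stored in one direction only.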
4. Recommendation datasets
It is an annotated dataset of dialogues, where users recommend movies to each other.
5. Information Retrieval
Contains links to numerous IR datasets.
6. Medical datasets (NLP based)
6.1 i2b2 NLP datasets –
Informatics for Integrating Biology and the Bedside
6.7 Medical Data for Machine Learning on GitHub by
6.11 Papers with code website
7. Industry datasets
7.3 Microsoft/Learning to represent programs with Graphs dataset – ICLR 2018 [Download link]
7.4 Microsoft Information-Seeking Conversation (MISC) dataset [Download link]
I hope you found some useful datasets aligned with your area of interest. If you know of any links that I have missed, please add them in the comments below. Your feedback is also welcome so that I can improve this list further.