AI and ML Datasets to start with – Part 2

Datasets form the foundation of Artificial Intelligence and Machine Learning. Over the last few years, the field's emphasis has shifted from model-centric to data-centric AI.

Supervised learning depends heavily on labeled data (i.e., features with ground-truth labels), where a mapping function is learned between the features and labels.
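As a minimal sketch of what "learning a mapping from features to labels" can look like, here is a 1-nearest-neighbor classifier written with the Python standard library only. The toy data, the feature meanings, and the function names (`fit_nearest_neighbor`, `predict`) are made up for illustration, and this is just one of many possible supervised methods.

```python
from math import dist


def fit_nearest_neighbor(features, labels):
    """'Training' for 1-nearest-neighbor: memorize the labeled examples."""
    if len(features) != len(labels):
        raise ValueError("each feature vector needs exactly one label")
    return list(zip(features, labels))


def predict(model, x):
    """Return the label of the stored example closest to x (Euclidean distance)."""
    _, label = min(model, key=lambda pair: dist(pair[0], x))
    return label


# Toy labeled data (made up): [height_cm, weight_kg] -> size label
X_train = [[150.0, 50.0], [160.0, 55.0], [180.0, 85.0], [190.0, 95.0]]
y_train = ["small", "small", "large", "large"]

model = fit_nearest_neighbor(X_train, y_train)
prediction = predict(model, [186.0, 92.0])
print(prediction)
```

The learned "mapping function" here is simply a lookup of the closest labeled example, which makes the dependence on good labels easy to see: a wrong label in `y_train` is copied straight into the predictions.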

In unsupervised learning, we depend heavily on unlabeled data where the underlying data distribution is learned.
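To make "learning the underlying data distribution" concrete, the sketch below fits a single Gaussian to unlabeled numbers using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sample values are invented, and a single Gaussian is an assumption chosen for brevity; real unsupervised methods such as clustering or density estimation are usually richer.

```python
from statistics import NormalDist

# Unlabeled observations (made-up values): no ground-truth labels are attached.
samples = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]

# "Learning the distribution" here means estimating a mean and standard deviation.
estimated = NormalDist.from_samples(samples)

# The fitted density scores how typical a new observation is under the model.
typical_score = estimated.pdf(5.0)
unusual_score = estimated.pdf(8.0)

print(f"mean={estimated.mean:.2f}, stdev={estimated.stdev:.2f}")
print(f"density at 5.0: {typical_score:.4f}, density at 8.0: {unusual_score:.4g}")
```

A value far from the bulk of the data receives a much lower density, which is the basic idea behind simple anomaly detection on unlabeled data.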

Artificial Intelligence and Machine Learning have become quite popular, and so have the corresponding job roles, such as data engineer, data analyst, and data scientist.

It is also commonly reported that around 70-80% of the effort in a machine learning project goes into data collection and preparation, while only 20-30% is spent on model selection and training.

What will you learn from this article?

In this blog article, we add a new list of dataset resources spanning social network analysis, recommendation systems, information retrieval, and medical applications.

In addition, we provide a list of datasets open-sourced by leading companies such as Google and Microsoft.

In Part 1 of this series of dataset posts, I shared more such datasets, most of which are freely available for everyone.

However, you must sign a data use and confidentiality agreement to access certain medical datasets, such as i2b2 and MIMIC-III.

AI and ML Datasets on Social Network Analysis

1. AMiner
2. General Language Understanding Evaluation (GLUE) benchmark

GLUE is a collection of benchmark datasets covering several NLP tasks. The GitHub repository provides instructions for downloading the training, validation, and test datasets.

3. Stanford Large Network Dataset Collection
  • Social networks: online social networks, edges represent interactions between people
  • Networks with ground-truth communities: ground-truth network communities in social and information networks
  • Communication networks: email communication networks with edges representing communication
  • Citation networks: nodes represent papers, edges represent citations
  • Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
  • Web graphs: nodes represent webpages, and edges are hyperlinks
  • Amazon networks: nodes represent products, and edges link commonly co-purchased products
  • Internet networks: nodes represent computers and edges communication
  • Road networks: nodes represent intersections and edges roads connecting the intersections
  • Autonomous systems: graphs of the Internet
  • Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)
  • Location-based online social networks: Social networks with geographic check-ins
  • Wikipedia networks, articles, and metadata: Talk, editing, voting, and article data from Wikipedia
  • Temporal networks: networks where edges have timestamps
  • Twitter and Memetracker: Memetracker phrases, links, and 467 million Tweets
  • Online communities: Data from online communities such as Reddit and Flickr
  • Online reviews: Data from online review systems such as BeerAdvocate and Amazon

Taken verbatim from the Stanford SNAP website
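Many graph datasets of this kind are distributed as plain-text edge lists. The sketch below assumes one common layout: one edge per line, whitespace-separated source and target IDs, and header lines starting with `#`. The sample content and the `parse_edge_list` helper are hypothetical, so check the description page of each dataset for its actual format before relying on this.

```python
def parse_edge_list(lines, directed=True):
    """Build an adjacency map from whitespace-separated 'source target' lines."""
    adjacency = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Keep the first two columns; extra columns (e.g. a sign or a
        # timestamp) are ignored in this sketch.
        source, target = line.split()[:2]
        adjacency.setdefault(source, set()).add(target)
        # Register the target too, so nodes without outgoing edges still appear.
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return adjacency


# Hypothetical sample in the assumed layout (not a real dataset file).
sample = """\
# Hypothetical directed graph
# FromNodeId\tToNodeId
0\t1
0\t2
1\t2
2\t0
"""

graph = parse_edge_list(sample.splitlines())
out_degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(out_degree)
```

For a downloaded file, the same function can be fed an open file handle instead of `sample.splitlines()`, since it only iterates over lines. Node IDs are kept as strings here; convert them to integers if the dataset guarantees numeric IDs.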

4. Recommendation AI and ML Datasets

4.1 Librec
4.2 ReDial (Recommendation Dialogues)

ReDial is an annotated dataset of dialogues in which users recommend movies to each other.

5. Information Retrieval Datasets

5.1 GitHub repo by lsminoula

Contains links to numerous IR datasets.

6. Medical datasets (NLP based)

6.1 i2b2 NLP datasets

i2b2 stands for Informatics for Integrating Biology and the Bedside.

6.2 CLEF eHealth datasets
6.3 SNAP Biomedical Network Dataset Collection
6.4 The CRAFT (Colorado Richly Annotated Full-Text) Corpus
6.5 MIMIC-III
6.6 MedNLI
6.7 Medical Data for Machine Learning on GitHub by beamandrew
6.8 FDA Adverse Event Reporting System (FAERS)
6.9 Commercial Health Insurance Payer Database
6.10 Adverse Drug Event database
6.11 Papers with Code website
6.12 TREC 2017 medical LiveQA

7. Industry datasets

7.1 DeepMind open-source datasets
7.2 Microsoft/PointerSQL
7.3 Microsoft/Learning to Represent Programs with Graphs dataset – ICLR 2018 [Download link]
7.4 Microsoft Information-Seeking Conversation (MISC) dataset [Download link]
7.5 Microsoft Research Open Data

Final Thoughts on AI and ML datasets

We hope you found some useful datasets aligned with your area of interest.

If you want to add any more links that I have missed, kindly comment below. Also, do provide your feedback so that I can improve further.


