Datasets to start with – Part 1

In this article, I compile a list of datasets and codebases from recent papers and from diverse domains that I have come across. I stumbled upon most of them during the literature survey for one of my projects, and the most recent ones came mostly from Twitter.
I have covered a list of data challenge competitions in another blog post, where the problem statement is provided along with the dataset. This makes it really easy to get started.

List of recently published datasets

  1. ICLR OpenReview 2019 web pages [Link1]
  2. FIRE (Forum for Information Retrieval Evaluation), an India-based forum covering multilingual data [Link]
  3. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html
  4. Carnegie Mellon University, Machine Learning course projects: Spring 2015 (http://www.cs.cmu.edu/~ninamf/courses/601sp15/projects.html) and Fall 2010 (http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/projects.html)
  5. Awesome-public-datasets and Awesome-machine-learning on GitHub
  6. ACL Knowledge datasets and collections
  7. Linguistic Data Consortium Catalog
  8. The Dryad Digital Data Repository
  9. Kaggle
  10. Google: Machine Learning student projects based on Natural Language Processing
  11. Avocado Research Email Collection and Enron Email Dataset
  12. FiveThirtyEight Data
  13. This blog post gives links to many available datasets
