In this article, I compile a list of datasets and codebases from recent papers and diverse domains that I have come across. I usually stumbled upon them during the literature survey for one of my works, and found the most recent ones mostly through Twitter.
I have covered a list of data challenge competitions in another blog post, where the problem statement is provided along with the dataset. This makes it really easy to get started.
List of recently published datasets
- ICLR OpenReview 2019 web pages [Link1]
- FIRE (Forum for Information Retrieval Evaluation), India, covering multiple languages [Link]
- UCI Machine Learning Repository
- Carnegie Mellon University – Machine Learning Course Projects: Spring 2015 (http://www.cs.cmu.edu/~ninamf/courses/601sp15/projects.html) and Fall 2010 (http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/projects.html)
- Awesome-public-datasets and Awesome-machine-learning on GitHub
- ACL Knowledge datasets and collections
- Linguistic Data Consortium Catalog
- The Dryad Digital Data Repository
- Google: Machine Learning student projects based on Natural Language Processing
- Avocado Research Email Collection and Enron Email Dataset
- FiveThirtyEight Data
- This blog post gives links to many available datasets