In this article, I compile a list of datasets and codebases, from recent papers and from diverse domains, that I have come across. I stumbled upon most of them during literature surveys for my own work; the most recent ones came mostly from Twitter.
I have covered a list of data challenge competitions in another blog post, where the problem statement is provided along with the dataset. This makes it really easy to get started.
List of recently published datasets
- ICLR OpenReview 2019 web pages [Link1]
- FIRE (Forum for Information Retrieval Evaluation), India, covering multilingual data [Link]
- UCI Machine Learning Repository
- Carnegie Mellon University – Machine Learning Course Projects: Spring 2015 (http://www.cs.cmu.edu/~ninamf/courses/601sp15/projects.html) and Fall 2010 (http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/projects.html)
- Awesome-public-datasets and Awesome-machine-learning on GitHub
- ACL Knowledge datasets and collections
- Linguistic Data Consortium Catalog
- The Dryad Digital Data Repository
- Google: Machine Learning Student projects based on Natural Language Processing.
- Avocado Research Email Collection and Enron Email Dataset
- FiveThirtyEight Data
- This blog post gives links to many available datasets