Bengali Datasets in 2025 for Named Entity Recognition and Sentiment Analysis

(Updated in 2025) A major limitation of current AI research is its overemphasis on English and the under-representation of the Bengali (Bangla) language. Despite being the seventh most spoken language worldwide, Bengali remains a low-resource language in the field of Natural Language Processing (NLP). This scarcity of resources poses significant challenges for developing robust … Read more

AI and ML Datasets to start with – Part 2

Datasets form the basis of Artificial Intelligence and Machine Learning. Over the last few years, the focus of AI has shifted from model-centric to data-centric approaches. Supervised learning depends heavily on labeled data (i.e., features with ground-truth labels), where a mapping function is learned between the features and labels. In unsupervised learning, we depend … Read more

Working with Imbalanced Data sets

Theory: Imbalanced data sets, in the context of supervised classification problems, are those in which the class distribution is highly skewed or disproportionate. Because general supervised learning algorithms assume the classes to be balanced, they maximize overall accuracy. This, in turn, biases the model toward the majority class, a bias that can be addressed to some extent when we … Read more
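To make the accuracy-maximization point concrete, here is a minimal sketch in plain Python. The 95/5 label split and the helper names (`accuracy`, `recall`) are assumptions chosen for illustration and do not come from the article itself. It shows that a classifier which always predicts the majority class scores high accuracy while never detecting the minority class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted correctly."""
    preds_for_class = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_class:
        return 0.0
    return sum(p == positive for p in preds_for_class) / len(preds_for_class)


# Hypothetical toy labels: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
# Balanced accuracy: the mean of the per-class recalls.
balanced_acc = (recall(y_true, y_pred, positive=0) + minority_recall) / 2

print(f"accuracy          = {acc:.2f}")
print(f"minority recall   = {minority_recall:.2f}")
print(f"balanced accuracy = {balanced_acc:.2f}")
```

On this toy split the plain accuracy looks strong even though every minority sample is missed, which is why per-class metrics such as recall or balanced accuracy are usually reported alongside it.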

Datasets to start with – Part 1

In this article, I compile a list of datasets and codebases from recent papers across diverse domains that I have come across. I usually stumbled upon them during the literature survey for one of my works, and the most recent ones mostly on Twitter. I have covered a list of data challenge competitions … Read more