(Updated in 2025) A major limitation of current AI research is the overemphasis on English and the under-representation of the Bengali or Bangla language.
Despite being the seventh most spoken language worldwide, Bengali remains a low-resource language in the field of Natural Language Processing (NLP). This scarcity of resources poses significant challenges for developing robust AI models tailored to Bengali text.
In this article, we will discuss the resources available in the Bangla language for NLP tasks, specifically focusing on Named Entity Recognition (NER) and Sentiment Analysis. These two tasks are fundamental for many applications such as information extraction, opinion mining, and social media analytics, yet they suffer from a lack of high-quality, publicly available datasets in Bengali.
Table of Contents
- The Challenge of Low-resource Languages
- Named Entity Recognition and Sentiment Analysis in Bengali
- Noteworthy Researchers and Resources
- Practical Applications and Future Directions
- More datasets for other Bangla NLP tasks
- Final Thoughts
- Buy Artificial Intelligence Book written in Bengali
The Challenge of Low-Resource Languages
The dominance of English in NLP research has led to a disproportionate allocation of resources, models, and datasets, leaving languages like Bengali underrepresented.
However, recent advances in multilingual transformer models such as Multilingual BERT (M-BERT), pretrained on the top 104 languages by Wikipedia size, have started bridging this gap by enabling transfer learning across languages.
Still, the effectiveness of these models depends heavily on the availability of annotated datasets in the target language for fine-tuning and evaluation.
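One practical detail of fine-tuning a subword model such as M-BERT on word-level annotations is that labels have to be aligned to subword tokens. The sketch below is a minimal illustration, not a full training recipe. It assumes `word_ids` has the shape returned by the `word_ids()` method of Hugging Face fast tokenizers (one source-word index per subword, `None` for special tokens), and it assumes the common convention of labeling only the first subword of each word. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The example ids are made up.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword positions.

    word_ids maps each subword position to the index of its source word,
    or None for special tokens. Only the first subword of each word keeps
    the word's label; every other position gets IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)           # continuation subword
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [1, 0, 2]  # made-up label ids
print(align_labels(example_word_ids, example_labels))
# prints [-100, 1, 0, -100, 2, -100]
```

Labeling every subword instead of only the first is an equally valid choice; what matters is that training and evaluation use the same convention.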
Issue recognized by a top-tier international conference
The 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), a top-tier conference in computational linguistics and NLP held in Dublin, Ireland, had a dedicated theme track, “Language Diversity: from Low-Resource to Endangered Languages.”
… ACL 2022 will have a new theme to commemorate the 60th anniversary of ACL with the goal of reflecting and stimulating discussion about how the advances in computational linguistics and natural language processing can be used for promoting language diversity from low-resource to endangered languages… [Quoted from ACL 2022 website]
Named Entity Recognition and Sentiment Analysis in Bengali
Named Entity Recognition (NER)
NER involves identifying and classifying key entities in text, such as person names, locations, organizations, and dates. For Bengali, the development of NER datasets has been limited but is gradually progressing.
- One notable resource is the Bengali Language NER dataset available on GitHub, which provides annotated text for entity recognition tasks.
- Google Research’s multilingual BERT paper (ACL 2019) mentions an in-house Bengali NER dataset, indicating ongoing efforts to build such corpora.
- Additionally, the repository by Foysal87 on GitHub offers a comprehensive Bangla NLP dataset collection that includes NER annotations alongside POS tagging and sentiment analysis data.
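To make the task concrete, here is a small sketch of how token-level NER annotations can be turned into entity spans. It assumes the widely used BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity) and the tag names PER and LOC. The datasets listed above may use a different scheme or label set, so treat both as assumptions. The sentence is a hand-made example, not taken from any dataset.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start = None
    current = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                spans.append((current, start, i))
                start, current = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # A stray I- tag without a matching B- is treated as a new entity.
                start, current = i, tag[2:]
    return spans


# "Rabindranath Tagore was born in Kolkata."
tokens = ["রবীন্দ্রনাথ", "ঠাকুর", "কলকাতায়", "জন্মগ্রহণ", "করেন", "।"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
for entity_type, start, end in bio_to_spans(tags):
    print(entity_type, " ".join(tokens[start:end]))
```

Span-level output like this is also what entity-level precision and recall are usually computed on, rather than on individual tags.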
Sentiment Analysis
Sentiment Analysis, or opinion mining, aims to detect the emotional tone behind a text, classifying it as positive, negative, neutral, or sometimes mixed. For Bengali, several datasets have been curated recently to support this task:
- The BnSentMix dataset is a large-scale Bengali-English code-mixed sentiment analysis corpus containing 20,000 samples labeled with four sentiment classes: positive, negative, neutral, and mixed. It includes data from YouTube, Facebook, and e-commerce reviews, reflecting real-world usage and linguistic diversity.
- The MUltiplatform BAngla SEntiment (MUBASE) and SentNoB datasets provide multi-platform Bengali social media posts with sentiment annotations, covering domains such as politics, education, and agriculture.
- Kaggle hosts a Bengali sentiment analysis dataset based on microblog posts, which can be a valuable resource for training models.
- Recent research also explores hybrid feature extraction techniques and learning algorithms tailored for Bengali document-level sentiment analysis, contributing to improved accuracy.
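When comparing models on a four-class sentiment task like the one described above, accuracy alone can hide poor performance on rare classes, so a macro-averaged F1 score is a common complement. The sketch below is a plain-Python illustration. The label strings follow the four classes named above, but real datasets may encode them differently, and the toy data is made up.

```python
from typing import Dict, List

# The four classes named above; real datasets may use different label strings.
LABELS = ["positive", "negative", "neutral", "mixed"]


def per_class_f1(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Compute F1 for each label from paired gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in LABELS:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denominator = 2 * tp + fp + fn
        # A class that is never gold and never predicted scores 0.0 here.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(gold: List[str], predicted: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, predicted)
    return sum(scores.values()) / len(scores)


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy data (made up): a model that predicts "positive" for everything.
gold = ["positive", "positive", "positive", "negative", "neutral", "mixed"]
predicted = ["positive"] * len(gold)
print(f"accuracy={accuracy(gold, predicted):.2f} macro_f1={macro_f1(gold, predicted):.2f}")
# prints accuracy=0.50 macro_f1=0.17
```

The always-positive model gets half the examples right, yet its macro-F1 is low because three of the four classes are never predicted.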
Noteworthy Researchers and Resources
- Monojit Choudhury of MBZUAI, Abu Dhabi (previously at Microsoft India), is a key researcher in low-resource NLP and multilingual language models, with several publications involving Bengali text.
- Professors at IIT Kharagpur have contributed papers on Bengali NLP tasks dating back to 2014 and 2016, laying foundational work for the community.
- The BEmoC corpus (2022) is designed for identifying emotions in Bengali texts, enriching sentiment analysis beyond simple polarity classification.
- The ACM Transactions on Asian and Low-Resource Language Information Processing journal regularly publishes research relevant to Bengali NLP, supporting the growth of this field.
Practical Applications and Future Directions
The availability of these datasets opens doors for various applications:
- Social Media Monitoring: Detecting public sentiment and named entities in Bengali social media posts to understand trends and public opinion.
- Customer Feedback Analysis: Analyzing Bengali reviews on e-commerce platforms to improve products and services.
- Healthcare and Medical AI: Extracting named entities such as diseases, medications, and symptoms from Bengali medical texts for better healthcare analytics.
- Multilingual AI Systems: Enhancing multilingual models to better handle code-mixed Bengali-English content, which is prevalent online.
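As a rough first pass at the code-mixed content mentioned in the last item, text can be screened by script using the Bengali Unicode block (U+0980 to U+09FF). This is only a heuristic sketch with hypothetical function names. It works at the script level, so Bengali written in Latin letters (romanized Bengali) will look like English to it, and a trained language-identification model would be needed for that case.

```python
from collections import Counter


def script_of(token: str) -> str:
    """Label a token as 'bengali', 'latin', 'mixed', or 'other' by its characters."""
    has_bengali = any("\u0980" <= ch <= "\u09FF" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_bengali and has_latin:
        return "mixed"
    if has_bengali:
        return "bengali"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, other scripts


def is_code_mixed(text: str) -> bool:
    """True if whitespace-separated tokens use both Bengali and Latin script."""
    counts = Counter(script_of(token) for token in text.split())
    return counts["mixed"] > 0 or (counts["bengali"] > 0 and counts["latin"] > 0)


print(is_code_mixed("এই phone টা খুব ভালো"))   # Bengali sentence with an English word
print(is_code_mixed("এই ফোনটা খুব ভালো"))      # Bengali script only
```

A filter like this can help route posts to a code-mixed model or estimate how much of a corpus is code-mixed before annotation.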
By focusing on these resources and research directions, we can help bridge the gap in Bengali language AI research and build tools that truly serve the Bengali-speaking population.
Additional Resources
- For those interested, Google Dataset Search lists more Bengali sentiment datasets to explore.
- You can also find a curated list of Bangla NLP datasets on GitHub repositories such as banglanlp/bnlp-resources.
- For a comprehensive introduction to AI in Bengali, consider reading one of the few AI books written in the Bengali language, which covers fundamentals and practical concerns relevant to the Bengali tech community.
More datasets for other Bangla NLP tasks
BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts (LREC 2022). The authors collected 50,281 comments from social media sites such as Facebook, YouTube, and TikTok and labeled them with a 3-level hierarchical annotation scheme:
- Hate Speech Identification
- Target Identification of Hate Speech
- Identifying Hate Speech Categories
Please check out their paper and GitHub repository for more details.
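A hierarchical scheme like this can be represented as a small record type in which the deeper levels are only filled in when the first level flags hate speech. The sketch below is an illustration under that assumption. The class name, field names, and label strings are hypothetical and are not taken from the BD-SHS release, so check the authors' repository for the actual format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HateSpeechAnnotation:
    """One comment's labels across three levels (hypothetical field names)."""
    is_hate_speech: bool               # level 1: hate speech or not
    target: Optional[str] = None       # level 2: who the hate speech targets
    categories: Tuple[str, ...] = ()   # level 3: one or more hate speech categories

    def __post_init__(self):
        # Assumption: levels 2 and 3 are only meaningful for hate speech.
        if not self.is_hate_speech and (self.target is not None or self.categories):
            raise ValueError("target and categories apply only to hate speech")


# Placeholder label strings, for illustration only.
benign = HateSpeechAnnotation(is_hate_speech=False)
flagged = HateSpeechAnnotation(True, target="group", categories=("religion",))
print(benign)
print(flagged)
```

Validating the hierarchy at construction time catches inconsistent rows early, for example a comment marked as not hate speech that still carries a target label.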
Bengali.AI is an initiative for creating open-source datasets for research in the Bengali language. This initiative is supported by Kaggle, Google, and others. It is currently in Beta.
It currently supports three datasets:
- Bengali Text-to-Speech Dataset
- Bengali Automatic Speech Recognition Dataset
- Numta Handwritten Bengali Digits
Bhashini is a recent initiative developed by the Ministry of Electronics & Information Technology, Government of India. The aim is to create an ecosystem involving government, industry, academia, and others to create open-source datasets and models for Indian languages.

You can contribute to this initiative in the following ways:
- Suno India — Write down the audio that you hear
- Bolo India — Read a sentence and contribute your voice
- Likho India — Translate the provided text to your language
- Dekho India — Label the images
- Validate other participants' contributions to any of the four tasks above
Final Thoughts
There is a growing research interest in developing NLP models that work effectively on non-English and low-resource languages like Bengali. Two main approaches are emerging:
- Creation of High-Quality Labeled Data: Supervised learning depends on annotated datasets, so expanding and improving Bengali NER and sentiment datasets is crucial.
- Transfer Learning and Multilingual Models: Leveraging pretrained models like M-BERT and fine-tuning them on Bengali-specific data while accounting for language-specific nuances.
The combined effort of dataset creation, model development, and community engagement will accelerate the progress of Bengali NLP, making AI tools more inclusive and effective for Bengali speakers worldwide.
Buy Artificial Intelligence Book written in Bengali on Amazon India

As one of the few comprehensive books on artificial intelligence written in Bengali, this guide offers an accessible introduction to AI fundamentals for the Bengali-speaking tech community. The book explains core AI concepts — from machine learning basics to real-world applications like voice assistants and recommendation systems — while addressing practical concerns about privacy, job market impacts, and responsible AI development.