Bangla Language Datasets for Sentiment Analysis and NER

A major limitation of current research in Machine Learning and Natural Language Processing (NLP) is that it is focused on only a few languages, particularly English. In this article, we will look at resources available in the Bangla language for the NLP tasks of Named Entity Recognition (NER) and Sentiment Analysis.

Bangla (many sources also call it “Bengali”) is the 7th most spoken language in the world, with 272.7 million native speakers worldwide, according to the Ethnologue website.

We will use the term “Bangla NLP” here, which can be used interchangeably with “Bengali NLP”.

Image source: http://www.ethnologue.com/guides/ethnologue200

However, Bangla is still considered a low-resource language in the natural language processing community (Alam et al., arXiv 2021).

There are many initiatives to create more datasets and open-source projects for the Bangla language.

Related work in NLP research on low-resource languages like Bangla

ACL 2022, a top-tier conference in computational linguistics and NLP, has a dedicated theme track, “Language Diversity: from Low-Resource to Endangered Languages,” this year.

… ACL 2022 will have a new theme to commemorate the 60th anniversary of ACL with the goal of reflecting and stimulating discussion about how the advances in computational linguistics and natural language processing can be used for promoting language diversity from low-resource to endangered languages… [Quoted from ACL 2022 website]

Multilingual transformer models like Multilingual BERT (M-BERT) help to reduce this wide gap. M-BERT is pretrained on the 104 languages with the largest Wikipedias, using a masked language modeling (MLM) objective.

For more details about M-BERT, please go through the paper and the Hugging Face model card.
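To make the MLM objective concrete, here is a minimal, self-contained sketch of the token-masking step: a fraction of tokens is replaced by a `[MASK]` token, and the model is trained to predict the originals at those positions. This is a simplified illustration only; real BERT pretraining operates on subword tokens and also replaces some selected tokens with random tokens or leaves them unchanged (the 80/10/10 scheme), which is omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly corrupt a token sequence for MLM-style training.

    Returns the masked sequence and a dict mapping masked positions
    to the original tokens the model must learn to recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # training target at this position
        else:
            masked.append(tok)
    return masked, targets

# "আমি বাংলায় গান গাই" ("I sing in Bangla"); whitespace tokenization
# is used here only to keep the example simple.
tokens = "আমি বাংলায় গান গাই".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(targets)
```

The loss during pretraining is computed only at the masked positions, which is what lets the model learn from raw, unlabeled text in all 104 languages at once.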


ACM Transactions on Asian and Low-Resource Language Information Processing is a well-reputed journal in this domain.

Let’s now focus on the Bangla language for this article!

Photo by Gautam Ganguly on Unsplash

Named Entity Recognition and Sentiment Analysis resources for Bangla NLP

1. You can follow the works (publications list) of Monojit Choudhury from Turing India (bio), whose research interests include low-resource NLP, specifically understanding large multilingual language models.

2. You can also check the publications of the following IIT KGP professor involving Bangla text (2014 and 2016 papers)

3. BEmoC: A Corpus for Identifying Emotion in Bengali Texts (DOI), 2022

4. Bengali Language NER — https://github.com/SuchandraDatta/bengali_language_NER

5. The following ACL 2019 paper from Google Research on multilingual BERT mentions that they have an in-house Bengali NER dataset. https://aclanthology.org/P19-1493.pdf

6. The following Kaggle dataset may be helpful: https://www.kaggle.com/datasets/wchowdhu/bengali-sentiment-analysis-microblog-posts

7. Google Dataset Search returns some more datasets: https://datasetsearch.research.google.com/search?src=0&query=bengali%20sentiment
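NER datasets like the ones above typically store token-level labels in the BIO scheme: `B-X` begins an entity of type `X`, `I-X` continues it, and `O` marks tokens outside any entity. The sketch below shows what such an annotation looks like and how entity spans are recovered from it. The example sentence (“Rabindranath Tagore was born in Kolkata.”) and its labels are illustrative assumptions of mine, not drawn from any specific dataset.

```python
# One annotated sentence in the BIO (Begin/Inside/Outside) tagging scheme.
sentence = ["রবীন্দ্রনাথ", "ঠাকুর", "কলকাতায়", "জন্মগ্রহণ", "করেন", "।"]
labels   = ["B-PER",       "I-PER", "B-LOC",   "O",         "O",    "O"]

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any span already in progress
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)  # continue the open span
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:  # flush a span that runs to the end of the sentence
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(sentence, labels))
# → [('রবীন্দ্রনাথ ঠাকুর', 'PER'), ('কলকাতায়', 'LOC')]
```

Knowing this format makes it easy to mix and match the datasets above, since most token-classification toolkits (including Hugging Face Transformers) expect exactly this kind of token/tag alignment.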

More datasets for other Bangla NLP tasks

BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts (LREC 2022) – The authors collected 50,281 comments from social media sites such as Facebook, YouTube, and TikTok and labeled them using a 3-level hierarchical annotation scheme:

Hate Speech Identification

Target Identification of Hate Speech

Identifying Hate Speech Categories

Please check out their paper and GitHub repository for more details.
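One way to picture such a 3-level hierarchical scheme is as a nested record per comment, where the lower levels only apply when the top level is positive. The field names and label values below are my own illustrative assumptions, not the actual BD-SHS schema; see their paper for the real label set.

```python
# A sketch of one annotated comment under a 3-level hierarchical scheme
# like BD-SHS's: (1) is it hate speech, (2) who is targeted,
# (3) which categories of hate. Field and label names are hypothetical.
annotation = {
    "text": "<a social media comment>",
    "hate_speech": True,            # level 1: hate speech identification
    "target": "individual",         # level 2: target identification
    "categories": ["slander"],      # level 3: hate speech categories
}

def is_fully_labeled(a):
    """Levels 2 and 3 are only required when level 1 is positive."""
    if not a["hate_speech"]:
        return True
    return a.get("target") is not None and bool(a.get("categories"))

print(is_fully_labeled(annotation))
```

The practical upshot of a hierarchical scheme is that the same dataset supports several tasks at once: binary hate-speech detection from level 1 alone, and finer-grained target or category classification on the positive subset.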

Bengali.AI is an initiative for creating open-source datasets for research in the Bengali language. This initiative is supported by Kaggle, Google and others. It is currently in Beta.

It currently provides three datasets:

  1. Bengali Text-to-Speech Dataset
  2. Bengali Automatic Speech Recognition Dataset
  3. Numta Handwritten Bengali Digits

Bhashini is a recent initiative developed by the Ministry of Electronics & Information Technology, Government of India. The aim is to create an ecosystem involving government, industry, academia, and others, to create open source datasets and models for Indian languages.

Image source: Bhashini website

You can contribute to this initiative by:

  1. Suno India – Transcribe the audio that you hear
  2. Bolo India – Read a sentence aloud and contribute your voice
  3. Likho India – Translate the provided text into your language
  4. Dekho India – Label the images
  5. Validate the contributions of other participants for the above four tasks

Final thoughts

There is a growing research interest in developing NLP models that work on non-English and low-resource languages.

One approach is the creation of high-quality labeled data for supervised learning methods.

A second approach is to apply transfer learning using models trained on high-resource languages, while keeping a check on language-specific characteristics of the target low-resource language.


💚 I plan to write one post a month on Medium. To get updates directly to your email, please subscribe at https://medium.com/subscribe/@soumyadeeproy

💚 30+ free articles already available at datanalytics101.com

💚 Your feedback is critical to improving the content, so please feel free to share your take on this topic

💚 Follow me on Twitter @roysoumya1 for getting updates on “AI in Healthcare.”

What is your take on this topic?