Bangla Language Datasets for Sentiment Analysis and NER

A major limitation of current research in Machine Learning and Natural Language Processing (NLP) is its focus on only a few languages, particularly English. In this article, we look at the resources available for the Bangla language for two NLP tasks: Named Entity Recognition (NER) and Sentiment Analysis.

Related work in NLP research on low-resource languages like Bangla

ACL 2022, a top-tier conference in computational linguistics and NLP, has a dedicated theme track, “Language Diversity: from Low-Resource to Endangered Languages,” this year.

… ACL 2022 will have a new theme to commemorate the 60th anniversary of ACL with the goal of reflecting and stimulating discussion about how the advances in computational linguistics and natural language processing can be used for promoting language diversity from low-resource to endangered languages… [Quoted from ACL 2022 website]

Multilingual transformer models like Multilingual BERT (M-BERT) help narrow the gap between English and lower-resource languages. M-BERT is pretrained on the 104 languages with the largest Wikipedias, using a masked language modeling (MLM) objective.

For more details about M-BERT, please see the paper and the model page on Hugging Face.
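As a minimal sketch of what the MLM objective looks like in practice, the snippet below hides one word of a sentence and asks M-BERT to guess it. It assumes the Hugging Face `transformers` library (with a backend such as PyTorch) is installed and uses the public `bert-base-multilingual-cased` checkpoint. The helper names `mask_word` and `predict_masked_word` and the Bangla example sentence are illustrative choices, not something taken from the M-BERT paper.

```python
MODEL_ID = "bert-base-multilingual-cased"  # public M-BERT checkpoint on the Hugging Face Hub


def mask_word(sentence: str, position: int, mask_token: str = "[MASK]") -> str:
    """Replace the whitespace-separated word at `position` with the mask token."""
    words = sentence.split()
    if not 0 <= position < len(words):
        raise IndexError(f"position {position} is outside the sentence")
    words[position] = mask_token
    return " ".join(words)


def predict_masked_word(masked_sentence: str, top_k: int = 3):
    """Ask M-BERT to fill in the mask; the model is downloaded on first use."""
    # Imported lazily so that mask_word works even without the library installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Illustrative sentence: "আমি বাংলায় গান গাই" ("I sing songs in Bangla").
masked = mask_word("আমি বাংলায় গান গাই", position=2)
print(masked)
```

Calling `predict_masked_word(masked)` returns the model's top guesses for the hidden word together with their scores. It needs network access the first time so the checkpoint can be downloaded, and the guesses may be sub-word pieces, not whole words.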


ACM Transactions on Asian and Low-Resource Language Information Processing is a well-reputed journal in this domain.

Let’s now focus on the Bangla language for this article!

Photo by Gautam Ganguly on Unsplash

Named Entity Recognition and Sentiment Analysis resources for Bangla NLP

1. You can follow the work (publications list) of Monojit Choudhury from Turing India (bio), whose research interests are in low-resource NLP, specifically in understanding large multilingual language models.

2. You can also check the following IIT KGP professor’s publications involving Bangla text (2014, 2016 papers)

3. BEmoC: A Corpus for Identifying Emotion in Bengali Texts (DOI), 2022

4. Bengali Language NER — https://github.com/SuchandraDatta/bengali_language_NER

5. The following ACL 2019 paper from Google Research on multilingual BERT mentions that they have an in-house Bengali NER dataset. https://aclanthology.org/P19-1493.pdf

6. The following Kaggle dataset of Bengali microblog posts for sentiment analysis may be helpful: https://www.kaggle.com/datasets/wchowdhu/bengali-sentiment-analysis-microblog-posts

7. The Google Dataset Search gives some more datasets: https://datasetsearch.research.google.com/search?src=0&query=bengali%20sentiment

Final thoughts

There is growing research interest in developing NLP models that work on non-English and low-resource languages.

One approach is the creation of high-quality labeled data for supervised learning methods.

The second approach may be to apply transfer learning using models trained on high-resource languages while also keeping a check on language-specific artifacts of the target low-resource language.
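The transfer-learning route can be sketched as follows: start from a multilingual checkpoint such as M-BERT and attach a new classification head that is then fine-tuned on labeled Bangla text. This is only a starting point under stated assumptions: the three-way label set is hypothetical (the datasets listed above may use different labels), `load_sentiment_model` is an illustrative helper name, and the Hugging Face `transformers` library with PyTorch is assumed to be installed.

```python
LABELS = ["negative", "neutral", "positive"]  # hypothetical label set, adjust to your dataset
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def load_sentiment_model(model_id: str = "bert-base-multilingual-cased"):
    """Load M-BERT with a freshly initialised sequence-classification head."""
    # Imported lazily; requires `transformers` and PyTorch, plus network access
    # the first time the checkpoint is downloaded.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

The classification head returned here is randomly initialised, so its predictions are meaningless until the model has been fine-tuned on labeled Bangla examples. Evaluating on held-out Bangla data is one way to keep a check on the language-specific artifacts mentioned above.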


💚 I plan to write one post a month on Medium. To get updates directly to your email, please subscribe at https://medium.com/subscribe/@soumyadeeproy

💚 30+ free articles already available at datanalytics101.com

💚 Your feedback is critical to improving the content, so please feel free to share your take on this topic

💚 Follow me on Twitter @roysoumya1 for getting updates on “AI in Healthcare.”
