About me


My name is Soumyadeep Roy. I am currently a PhD candidate in the Department of Computer Science and Engineering at IIT Kharagpur, West Bengal, India. My thesis topic is “Incorporating Domain Knowledge in Medical NLP Applications”. I work under the supervision of Prof. Niloy Ganguly and Prof. Shamik Sural. I am also a part of the Complex Networks Research Group (CNeRG), IIT Kharagpur.

I completed my Master's (MS by research) at IIT Kharagpur, where I defended my thesis titled “Computational Approaches for Online Reputation Monitoring [Modeling, Analysis and Recommendation]”. My detailed research profile is provided here.

Recent News

3rd November, 2020 – Paper accepted at ACM Transactions on the Web (TWEB) titled “An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation” (arXiv preprint)

2nd November, 2020 – Defended my PhD Registration Seminar. Thesis topic: Incorporating Domain Knowledge in Medical NLP Applications

Research publications

1. Soumyadeep Roy, Shamik Sural, Niyati Chhaya, Anandhavelu Natarajan, and Niloy Ganguly, An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation, in ACM Transactions on the Web (TWEB), 25 pages, November 2020 (Journal) [arXiv]

2. Soumyadeep Roy, Koustav Rudra, Nikhil Agrawal, Shamik Sural, Niloy Ganguly, Towards an Aspect-based Ranking Model for Clinical Trial Search, in the 8th International Conference on Computational Data and Social Networks (CSoNet 2019), November 18 – 20, 2019, Ho Chi Minh City, Vietnam (Conference Full paper) [PDF] [DOI] [Code] [Dataset] [Slides]

3. Soumyadeep Roy, Niloy Ganguly, Shamik Sural, Niyati Chhaya, and Anandhavelu Natarajan, Understanding Brand Consistency from Web Content, in Proceedings of the 10th ACM Conference on Web Science, WebSci 19, (Boston, MA, USA), pp. 245–253, ACM, 2019. (Conference full paper) [DOI] [PDF] [Slides][Dataset details]

4. Soumyadeep Roy, Nibir Pal, Kousik Dasgupta, Binay Gupta; Understanding Email Interactivity and Predicting User Response to email, In Methodologies and Application Issues of Contemporary Computing Framework, pp. 69-79. Springer, Singapore, 2018 (Book Chapter) [DOI][Paper][Code][Slides]

5. Soumyadeep Roy; Automated EBM-oriented Summarization of Active or Recruiting Clinical Trials at COMSNETS 2018 held at Bangalore, India on January 3 – 7, 2018 [Link] Best Graduate Forum Presentation Award

Research Projects

Project A : Developing an integrated framework for understanding brand consistency from online content

Publications : WebSci 2019 Full paper

Abstract : A consumer-dependent (business-to-consumer) organization tends to present itself as possessing a set of human qualities, which is termed the brand perception of a company. This perception is impressed upon the consumer through the content (be it in the form of advertisements, blogs, or magazines) produced by the organization. In this era of digital marketing, a continuous stream of web content is needed to keep consumers engaged and thus leave a lasting impression on them. However, authoring content at this scale introduces challenges in maintaining consistency in a brand’s messaging tone. Thus, the first task is to develop a quantitative technique to check whether the desired brand personality is maintained in a published article.

In the first work, we quantify brand personality and formulate its linguistic features. We develop five independent classification models to score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness and sophistication. We perform a large-scale data collection and gather around 300K web pages covering around 650 Fortune 1000 companies, and develop a novel deep learning architecture that leverages transfer learning to further improve classifier performance. We also study the effect of directly adding the linguistic features to our neural architecture. The classifier automatically identifies web articles that are not consistent with the mission and vision of a company. A consistent brand generates trust and retains customers over time, since consumers look for regularity and common patterns. Prior studies propose various strategies for maintaining brand image over time and examine the impact of major company-related events, such as brand extensions and mergers, on it. However, there is no standard measure that quantifies the extent of brand inconsistency for a given company.

Research Challenges related to Brand Consistency (ACM WebSci 2019)

Thus, in the second work, we first quantify brand consistency and study company-level contributing factors such as the effect of brand extensions. We observe that promotion posts primarily portray competence as their brand personality across all brands, which favours the brand consistency of companies that portray competence in most of their posts (their primary trait). We find that the presence of brand extension-related posts reduces the brand consistency score of a company, and that it is more challenging to maintain consistency in the posts of a company that has low topical consistency with the brand personality it wants to evoke. We identify companies that post consistently and find that financially affluent companies are better at maintaining consistency. To address the brand inconsistency issue, we develop a helper tool that recommends to content writers the sentences that need to be modified to make a web article more consistent with the brand perception.

Keywords: online reputation management, brand image, brand personality, brand consistency, text classification, transfer learning, sentence ranking

Project B : Developing an Aspect-based Search System for Clinical Trials Search

Publications : CSoNet 2019 Full paper

Abstract : Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments. Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials. However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects: Relevancy, Adversity, Recency, and Popularity. We aim to develop a clinical trial search system that covers multiple disease classes, instead of focusing only on the retrieval of oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries across five disease categories. Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a high negative Spearman’s rank correlation coefficient between popularity and recency.

Keywords : Clinical trial search, Aspect-based ranking, Biomedical information retrieval
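To make the retrieval idea in the Project B abstract concrete, here is a minimal sketch in Python. It is not the implementation from the paper: it assumes each query and trial has already been mapped to a set of concept identifiers, uses Jaccard overlap as a stand-in scoring function, and computes Spearman's rank correlation for two tie-free rankings of the same trials. The function names (`concept_overlap`, `retrieve`, `spearman`), the identifiers such as `CUI_A`, and the trial IDs are all hypothetical.

```python
from typing import Dict, List, Set


def concept_overlap(query_concepts: Set[str], trial_concepts: Set[str]) -> float:
    """Jaccard overlap between two sets of concept identifiers (an assumed scoring choice)."""
    if not query_concepts or not trial_concepts:
        return 0.0
    shared = query_concepts & trial_concepts
    return len(shared) / len(query_concepts | trial_concepts)


def retrieve(query_concepts: Set[str], trials: Dict[str, Set[str]]) -> List[str]:
    """Return IDs of trials sharing at least one concept with the query, best match first."""
    scored = [(tid, concept_overlap(query_concepts, c)) for tid, c in trials.items()]
    kept = [(tid, score) for tid, score in scored if score > 0.0]
    # Sort by descending score; break ties by trial ID for a stable order.
    return [tid for tid, _ in sorted(kept, key=lambda p: (-p[1], p[0]))]


def spearman(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b):
        raise ValueError("rankings must contain the same items (at least two)")
    position_b = {item: i for i, item in enumerate(rank_b)}
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


if __name__ == "__main__":
    # Hypothetical toy data: concept sets per trial and one query.
    trials = {
        "T1": {"CUI_A", "CUI_B", "CUI_C"},
        "T2": {"CUI_A"},
        "T3": {"CUI_D"},
    }
    query = {"CUI_A", "CUI_B"}
    print(retrieve(query, trials))

    # Two hypothetical aspect-based rankings of the same trials.
    by_recency = ["T1", "T2", "T3"]
    by_popularity = ["T3", "T2", "T1"]
    print(spearman(by_recency, by_popularity))
```

In this toy setup, a strongly negative coefficient means the two aspects order the trials in nearly opposite ways, which is the kind of relationship the abstract reports between popularity and recency.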

Brief Intro

I have been working on machine learning problems for the last 2 years. Through this page, I will first share my experience of how to start working on real-life datasets. I have worked on projects as diverse as weather, email, event extraction from text, emotion mining, Twitter, online reputation monitoring and digital health. I am no expert, but I feel it will really help you get started.

Secondly, I will share resources and news with the aim of making Computer Science research accessible and understandable to the general public. For now, I am targeting undergraduates and those who, like me, are starting their careers in research.

My past projects range from scraping dynamic websites using Python to data cleaning using “dplyr” in R. Extensive preprocessing of NOAA climate data in R and beginner Python projects from Automate the Boring Stuff With Python are available with fully documented code in my GitHub account. Take a look if you are interested.

In the Contact section, please mention any topic you would like me to write about or any question you have. I will get back to you within a week.