About me

My name is Soumyadeep Roy. I am a PhD Candidate in the Department of Computer Science and Engineering at IIT Kharagpur, West Bengal, India. My thesis topic is “Incorporating Domain Knowledge in Medical NLP Applications”. I work under the supervision of Prof. Niloy Ganguly and Prof. Shamik Sural. I am also part of the Complex Networks Research Group (CNeRG), IIT Kharagpur.

I joined the Leibniz AI Future Lab, L3S Research Center, Leibniz University Hannover, Germany, as a Research Associate in January 2021, where I work under the supervision of Prof. Niloy Ganguly and Prof. Wolfgang Nejdl. Here, I work on the “Big Data in Psychiatric Disorders” project, where we study patient subtyping of Parkinson’s disease and schizophrenia using clinical and genomic data (SNPs).

I completed my Master’s (MS research) at IIT Kharagpur, where I defended my thesis titled “Computational Approaches for Online Reputation Monitoring [Modeling, Analysis and Recommendation]”. This work resulted from a collaboration with Adobe Research, Bangalore, India. Here, we analyzed the textual content generated by brands on corporate websites and developed text classification and sentence-level recommendation solutions that assist reputation management experts.

More details are provided in the Projects section below.

Recent News

April 2022 – Teaching Assistant for the Foundations of Information Retrieval course at Leibniz University Hannover

Oct 2021 – Teaching Assistant for the Natural Language Processing course at Leibniz University Hannover; volunteering at the Indian Symposium for Machine Learning (IndoML) 2021 (indoml.in), to be held virtually from Dec. 16 – 18, 2021

August 2021 – Short paper “Developing Knowledge-Aware Neural Models for Medical Forum Question Classification” was accepted at CIKM 2021 (acceptance rate – 28%). Joint work with Leibniz AI Future Lab and Adobe Research, India

January 2021 – Joined Leibniz AI Future Lab at L3S Research Center, Hannover, Germany as a Research Associate

3rd November, 2020 – Paper accepted at ACM Transactions on the Web (TWEB) titled “An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation” (arXiv preprint). Joint work with Big Data Experience Lab, Adobe Research, Bangalore, India. Code available too!

2nd November, 2020 – Defended my PhD Registration Seminar. Thesis topic: Incorporating Domain Knowledge in Medical NLP Applications

Research publications

Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, Prakhar Sharma, Gunjan Balde, Anandhavelu Natarajan, Megha Khosla, Shamik Sural and Niloy Ganguly, Developing Knowledge-Aware Neural Models for Medical Forum Question Classification, in the 30th ACM International Conference on Information and Knowledge Management (CIKM), 1-5 November 2021, Online (Short Paper) (to appear) [Code][arXiv][DOI][Slides][Video]

Soumyadeep Roy, Shamik Sural, Niyati Chhaya, Anandhavelu Natarajan, and Niloy Ganguly, An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation, in ACM Transactions on the Web (TWEB), 25 pages, November 2020 (Journal) [arXiv][DOI] [Slides-MS Thesis version] [Code and Data]

Soumyadeep Roy, Koustav Rudra, Nikhil Agrawal, Shamik Sural, Niloy Ganguly, Towards an Aspect-based Ranking Model for Clinical Trial Search, in the 8th International Conference on Computational Data and Social Networks (CSoNet 2019), November 18 – 20, 2019, Ho Chi Minh City, Vietnam (Full Paper) [PDF] [DOI] [Code] [Data][Slides]

Soumyadeep Roy, Niloy Ganguly, Shamik Sural, Niyati Chhaya, and Anandhavelu Natarajan, Understanding Brand Consistency from Web Content, in Proceedings of the 10th ACM Conference on Web Science, WebSci 19, (Boston, MA, USA), pp. 245–253, ACM, 2019. (Full Paper) [DOI] [PDF] [Slides][Data]

Soumyadeep Roy, Nibir Pal, Kousik Dasgupta, Binay Gupta; Understanding Email Interactivity and Predicting User Response to email, In Methodologies and Application Issues of Contemporary Computing Framework, pp. 69-79. Springer, Singapore, 2018 (Book Chapter) [DOI][Paper][Code][Slides]

Soumyadeep Roy; Automated EBM-oriented Summarization of Active or Recruiting Clinical Trials at COMSNETS 2018 held at Bangalore, India on January 3 – 7, 2018 [Link] Best Graduate Forum Presentation Award

Research Projects

Project A : Patient subtyping in Parkinson’s disease and Schizophrenia using clinical and genomic data

Motivation : Parkinson’s Disease (PD) is a complex neurodegenerative disorder with high heterogeneity in clinical symptoms (motor and non-motor), progression course, treatment response, and genetic factors. Patient subtyping helps improve the understanding of disease mechanisms and facilitates targeted interventions or treatment regimes.

Current Situation : Most PD subtypes are based on motor symptoms and do not focus on non-motor symptoms. General phenotype-based approaches are not personalized, and approaches that consider phenotype and genotype data together are not well explored.

Solution and Vision : We aim to develop data-driven patient subtyping methods that integrate both motor and non-motor characteristics of PD and jointly utilize clinical and genotype data. These automatically learnt subtypes will be examined to identify potential markers for neurodegenerative diseases like PD. With the help of these predictive markers, early therapeutic intervention in neurodegenerative diseases could be realised. We are working in close collaboration with Prof. Dr. Helge Frieling of the Department of Psychiatry, Social Psychiatry and Psychotherapy (MHH) and other biomedical partners at MHH associated with the Leibniz AI Lab. Here, we will initially focus on young-onset PD patients and patients with comorbidities such as schizophrenia and severe depression. Our final goal is to develop personalized AI-based solutions that assist doctors in their day-to-day clinical practice.

Project B : Knowledge-Guided Efficient Representation Learning for Genomic Applications

Understanding the gene regulatory code and developing deep learning models for gene sequence representation learning have become an active area of research. Learning better representations is difficult due to polysemy and distant semantic relationships, which prior methods often fail to capture, especially in data-scarce scenarios. Further, these models do not utilize gene-related biomedical domain knowledge. Each downstream task requires a separate fine-tuned model, which may lead to issues like catastrophic forgetting. We will discuss how we try to address some of these challenges and our initial results. We evaluate on gene sequence-level classification tasks like promoter region prediction, chromatin profile prediction, and promoter-enhancer interaction prediction.

Project C : Developing an Aspect-based Search System for Clinical Trials Search

Publications : CSoNet 2019 Full paper

Abstract : Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments. Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials. However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects – Relevancy, Adversity, Recency, and Popularity. We aim to develop a clinical trial search system which covers multiple disease classes, instead of focusing only on retrieval of oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries, across five disease categories. Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a high negative Spearman’s rank correlation coefficient between popularity and recency.

Keywords : Clinical trial search, Aspect-based ranking, Biomedical information retrieval
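To make the two ideas in the abstract above a little more concrete, here is a minimal sketch in Python. It is not the implementation from the paper. It assumes that UMLS concepts have already been extracted for the query and for each trial, uses placeholder concept IDs ("C1", "C2", ...) and trial IDs, and ranks by the raw count of shared concepts, which is my simplifying assumption. The function names are hypothetical. The second function computes Spearman's rank correlation between two rankings of the same items (without ties), which is the kind of measure used to compare aspect-based ranking lists.

```python
from typing import Dict, List, Set


def concept_overlap(query_concepts: Set[str], trial_concepts: Set[str]) -> int:
    """Number of concept IDs shared by the query and a trial."""
    return len(query_concepts & trial_concepts)


def retrieve(query_concepts: Set[str], trials: Dict[str, Set[str]]) -> List[str]:
    """Return IDs of trials sharing at least one concept with the query,
    ordered by decreasing overlap (ties broken by trial ID).
    Ranking by raw overlap count is an assumption made for illustration."""
    scored = [(concept_overlap(query_concepts, concepts), trial_id)
              for trial_id, concepts in trials.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [trial_id for score, trial_id in ordered if score > 0]


def spearman_rho(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman's rank correlation between two rankings of the same items.
    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(rank_a)
    if n < 2 or len(set(rank_a)) != n or set(rank_a) != set(rank_b) or len(rank_b) != n:
        raise ValueError("rankings must contain the same distinct items (at least two)")
    position_in_b = {item: i for i, item in enumerate(rank_b)}
    sum_d2 = sum((i - position_in_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Placeholder data: hypothetical trial IDs and concept IDs.
    trials = {"T1": {"C1", "C2"}, "T2": {"C2"}, "T3": {"C9"}}
    print(retrieve({"C1", "C2"}, trials))
    # Two hypothetical aspect-based rankings of the same retrieved trials.
    print(spearman_rho(["T1", "T2", "T3"], ["T3", "T2", "T1"]))
```

A coefficient close to -1 means the two aspect rankings order the trials in nearly opposite ways, which is how the negative correlation between popularity and recency mentioned above should be read.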

Completed Projects

Developing an integrated framework for understanding brand consistency from online content

Publications : WebSci 2019 Full paper, TWEB Journal Paper 2021

Abstract : A consumer-dependent (business-to-consumer) organization tends to present itself as possessing a set of human qualities, which is termed the brand perception of a company. The perception is impressed upon the consumer through the content (be it in the form of advertisements, blogs, or magazines) produced by the organization. In this era of digital marketing, a continuous generation of web content is needed to keep up the engagement with consumers and thus leave a lasting impression on them. However, content authoring at such scale introduces challenges in maintaining consistency in a brand’s messaging tone. Thus the first task is to develop a quantitative technique to check whether the desired brand personality is maintained in a published article.

In the first work, we quantify brand personality and formulate its linguistic features. We develop five independent classification models to score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness and sophistication. We perform large-scale data collection, gathering content from around 300K web pages covering around 650 Fortune 1000 companies, and develop a novel deep learning architecture that leverages transfer learning to further improve classifier performance. We also study the effect of directly adding the linguistic features to our neural architecture. The classifier automatically identifies the web articles which are not consistent with the mission and vision of a company. A consistent brand will generate trust and retain customers over time since consumers look for regularity and common patterns. There have been studies that provide various strategies for maintaining the brand image over time and that examine the impact on it of major company-related events such as brand extensions and mergers. However, there is no standard measure that quantifies the extent of brand inconsistency for a given company.

Research Challenges related to Brand Consistency (ACM WebSci 2019)

Thus, in the second work, we first quantify brand consistency and study company-level contributing factors such as the effect of brand extensions. We observe that promotion posts primarily portray competence as their brand personality across all brands, which favours the brand consistency of companies that portray competence in most of their posts (primary trait). We find that the presence of brand extension-related posts reduces the brand consistency score of a company and that it is more challenging to maintain consistency in the posts of a company that has low topical consistency with the brand personality it wants to evoke. We discover companies which post consistently and find that financially affluent companies are better at maintaining consistency. To address the brand inconsistency issue, we develop a helper tool that recommends to content writers the sentences that need to be modified to make a web article more consistent with the brand perception.

Keywords: online reputation management, brand image, brand personality, brand consistency, text classification, transfer learning, sentence ranking

Brief Intro

I have been working on Machine Learning problems for the last 2 years. Through this page, firstly, I will share my experience on how to start working on real-life datasets. I have worked on projects as diverse as Weather, Email, Event Extraction from text, Emotion mining, Twitter, Online Reputation Monitoring and Digital Health. I am no expert, but I feel it will really help you to get started.

Secondly, I will share resources and news with the aim of making Computer Science research accessible and understandable to the general population. For now, I am targeting undergraduates and those starting their career in research, like me.

My past projects range from scraping dynamic websites using Python to data cleaning using “dplyr” in R. Extensive preprocessing of climate data from NOAA in R and beginner Python projects from Automate the Boring Stuff With Python are available with fully documented code in my GitHub account. You can take a look, if interested.

Please mention the topic about which you want me to write or any random question you have in the Contact section. I will surely get back to you within a week.