About me

My name is Soumyadeep Roy. I am a PhD candidate in the Department of Computer Science and Engineering at IIT Kharagpur, West Bengal, India. My thesis topic is “Incorporating Domain Knowledge in Medical NLP Applications”. I work under the supervision of Prof. Niloy Ganguly and Prof. Shamik Sural. I am also a member of the Complex Networks Research Group (CNeRG), IIT Kharagpur.

I joined the Leibniz AI Future Lab, L3S Research Center, Leibniz University Hannover, Germany, as a Research Associate in January 2021, where I work under the supervision of Prof. Niloy Ganguly and Prof. Wolfgang Nejdl. Here, I work on the “Big Data in Psychiatric Disorders” project, where we study patient subtyping in Parkinson’s disease and schizophrenia using clinical and genomic data (SNPs).

I completed my Master’s (MS by research) at IIT Kharagpur, where I defended my thesis titled “Computational Approaches for Online Reputation Monitoring [Modeling, Analysis and Recommendation]”. This work was the result of a collaboration with Adobe Research, Bangalore, India. Here, we analyzed the textual content generated by brands on corporate websites and developed text classification and sentence-level recommendation solutions that assist reputation management experts.

More details are provided in the Projects section below.

Recent News

August 2021 – Short paper “Developing Knowledge-Aware Neural Models for Medical Forum Question Classification” accepted at CIKM 2021 (acceptance rate – 28%). Joint work with Leibniz AI Future Lab and Adobe Research, India.

January 2021 – Joined Leibniz AI Future Lab at L3S Research Center, Hannover, Germany as a Research Associate

3rd November, 2020 – Paper accepted at ACM Transactions on the Web (TWEB) titled “An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation” (arXiv preprint). Joint work with Big Data Experience Lab, Adobe Research, Bangalore, India. Code available too!

2nd November, 2020 – Defended my PhD Registration Seminar. Thesis topic: Incorporating Domain Knowledge in Medical NLP Applications

Research publications

Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, Prakhar Sharma, Gunjan Balde, Anandhavelu Natarajan, Megha Khosla, Shamik Sural and Niloy Ganguly, Developing Knowledge-Aware Neural Models for Medical Forum Question Classification, in the 30th ACM International Conference on Information and Knowledge Management (CIKM), 1-5 November 2021, Online (Short Paper) (to appear) [Code][arXiv][DOI][Slides][Video]

Soumyadeep Roy, Shamik Sural, Niyati Chhaya, Anandhavelu Natarajan, and Niloy Ganguly, An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation, in ACM Transactions on the Web (TWEB), 25 pages, November 2020 (Journal) [arXiv][DOI][Code and Data]

Soumyadeep Roy, Koustav Rudra, Nikhil Agrawal, Shamik Sural, Niloy Ganguly, Towards an Aspect-based Ranking Model for Clinical Trial Search, in the 8th International Conference on Computational Data and Social Networks (CSoNet 2019), November 18 – 20, 2019, Ho Chi Minh City, Vietnam (Full Paper) [PDF] [DOI] [Code] [Data][Slides]

Soumyadeep Roy, Niloy Ganguly, Shamik Sural, Niyati Chhaya, and Anandhavelu Natarajan, Understanding Brand Consistency from Web Content, in Proceedings of the 10th ACM Conference on Web Science, WebSci 19, (Boston, MA, USA), pp. 245–253, ACM, 2019. (Full Paper) [DOI] [PDF] [Slides][Data]

Soumyadeep Roy, Nibir Pal, Kousik Dasgupta, Binay Gupta; Understanding Email Interactivity and Predicting User Response to email, In Methodologies and Application Issues of Contemporary Computing Framework, pp. 69-79. Springer, Singapore, 2018 (Book Chapter) [DOI][Paper][Code][Slides]

Soumyadeep Roy; Automated EBM-oriented Summarization of Active or Recruiting Clinical Trials at COMSNETS 2018 held at Bangalore, India on January 3 – 7, 2018 [Link] Best Graduate Forum Presentation Award

Research Projects

Project A : Patient subtyping in Parkinson’s disease and Schizophrenia using clinical and genomic data

Motivation: Psychiatric Disorders (PDs) rank 5th in terms of prevalence and account for 6.7% of “Disability Adjusted Life Years”; they are my research area of interest. PDs are polygenic, which means that many genetic loci contribute to risk. In particular, we aim to investigate schizophrenia because it has a very heterogeneous set of symptoms and nearly all syndromes lack pathological or biological defining features; thus, there is a need to understand its genetic basis (Psychiatric Genetics). Presently, treatment is still at the symptomatic level. The genetic architectures of different PDs strongly overlap, and thus deep knowledge of functional genomic architecture is beneficial.

Problem: We investigate the “reverse phenotyping of psychosis” problem, where we aim to identify genetically homogeneous subgroups among people suffering from schizophrenia. Presently, schizophrenia is associated with a highly heterogeneous set of symptoms, and is thus thought to belong to a broader category of diseases that are associated with psychosis risk. Although many patients present a similar set of symptoms, they react differently to the same medications, which may be attributed to variations in genetic, environmental or other factors. Here, we will focus on grouping the patients based on their genotype and phenotype data. We observe a similar setting in patients with Parkinson’s disease and have started analyzing their clinical data.

Project B : Developing an Aspect-based Search System for Clinical Trials Search

Publications : CSoNet 2019 Full paper

Abstract : Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments. Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials. However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects – Relevancy, Adversity, Recency, and Popularity. We aim to develop a clinical trial search system that covers multiple disease classes, instead of focusing only on retrieval of oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries across five disease categories. Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a high negative Spearman’s rank correlation coefficient between popularity and recency.

Keywords : Clinical trial search, Aspect-based ranking, Biomedical information retrieval
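
The idea in the abstract can be illustrated with a minimal sketch. This is not the code from the paper: the toy trials, the CUI-style concept strings, the `views` field used as a stand-in for popularity, and the function names `retrieve`, `rank_by` and `spearman` are all assumptions made up for illustration. Trials are retrieved by concept overlap with the query, re-ranked by two of the four aspects, and the two ranking lists are compared with Spearman's rank correlation.

```python
from datetime import date

# Hypothetical toy data: each trial has a set of CUI-style concept strings,
# a start date (recency) and a view count (assumed stand-in for popularity).
TRIALS = {
    "T1": {"concepts": {"C0011849", "C0020538"}, "start": date(2015, 3, 1), "views": 900},
    "T2": {"concepts": {"C0011849", "C0027051"}, "start": date(2018, 6, 1), "views": 400},
    "T3": {"concepts": {"C0006826"}, "start": date(2019, 1, 1), "views": 50},
    "T4": {"concepts": {"C0011849"}, "start": date(2017, 9, 1), "views": 650},
}


def retrieve(query_concepts, trials):
    """Return IDs of trials sharing at least one concept with the query,
    ordered by overlap size (ties broken by trial ID)."""
    scored = [(len(query_concepts & t["concepts"]), tid) for tid, t in trials.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [tid for overlap, tid in ordered if overlap > 0]


def rank_by(trial_ids, trials, key):
    """Re-rank the retrieved trials by one aspect, highest value first."""
    return sorted(trial_ids, key=lambda tid: key(trials[tid]), reverse=True)


def spearman(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items (assumes no ties)."""
    n = len(rank_a)
    pos_b = {tid: i for i, tid in enumerate(rank_b)}
    d2 = sum((i - pos_b[tid]) ** 2 for i, tid in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


retrieved = retrieve({"C0011849"}, TRIALS)
by_recency = rank_by(retrieved, TRIALS, key=lambda t: t["start"])
by_popularity = rank_by(retrieved, TRIALS, key=lambda t: t["views"])
rho = spearman(by_recency, by_popularity)
print(retrieved, by_recency, by_popularity, rho)
```

In this toy data the older trials have more views, so the recency and popularity lists come out in opposite order by construction. The sketch only shows how such a correlation would be computed and does not reproduce the result reported in the paper.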

Project C : Developing an integrated framework for understanding brand consistency from online content

Publications : WebSci 2019 Full paper, TWEB Journal Paper 2021

Abstract : A consumer-dependent (business-to-consumer) organization tends to present itself as possessing a set of human qualities, which is termed the brand perception of a company. The perception is impressed upon the consumer through the content (be it in the form of advertisements, blogs or magazines) produced by the organization. In this era of digital marketing, a continuous generation of web content is needed to keep up the engagement with consumers and thus leave a lasting impression on them. However, this kind of content authoring at scale introduces challenges in maintaining consistency in a brand’s messaging tone. Thus, the first task is to develop a quantitative technique to check whether the desired brand personality is maintained in a published article.

In the first work, we quantify brand personality and formulate its linguistic features. We develop five independent classification models to score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness and sophistication. We perform a large-scale data collection activity, collecting the content of around 300K web pages that cover around 650 Fortune 1000 companies, and develop a novel deep learning architecture that leverages transfer learning to further improve classifier performance. We also study the effect of directly adding the linguistic features to our neural architecture. The classifier automatically identifies the web articles that are not consistent with the mission and vision of a company. A consistent brand will generate trust and retain customers over time, since consumers look for regularity and common patterns. There have been studies that provide various strategies for maintaining the brand image over time and that examine the impact on it of major company-related events such as brand extensions and mergers. However, there does not exist any standard measure that quantifies the extent of brand inconsistency for a given company.

Research Challenges related to Brand Consistency (ACM WebSci 2019)

Thus, in the second work, we first quantify brand consistency and study the company-level contributing factors, such as the effect of brand extensions. We observe that promotion posts primarily portray competence as their brand personality across all brands and favour the brand consistency of companies that portray competence in most of their posts (primary trait). We find that the presence of brand extension-related posts reduces the brand consistency score of a company, and that it is more challenging to maintain the posts of a company that has low topical consistency with the brand personality it wants to evoke. We identify companies that post consistently and find that financially affluent companies are better at maintaining consistency. To address the brand inconsistency issue, we develop a helper tool that recommends to content writers the sentences that need to be modified to make a web article more consistent with the brand perception.

Keywords: online reputation management, brand image, brand personality, brand consistency, text classification, transfer learning, sentence ranking

Brief Intro

I have been working on Machine Learning problems for the last 2 years. Through this page, firstly, I will share my experience on how to start working on real-life datasets. I have worked on projects as diverse as Weather, Email, Event Extraction from text, Emotion mining, Twitter, Online Reputation Monitoring and Digital Health. I am no expert, but I feel it will really help you to get started.

Secondly, I will share resources and news with the aim of making Computer Science research accessible and understandable to the general population. For now, I am targeting undergraduates and those starting their research careers, like me.

My past projects range from scraping dynamic websites using Python to data cleaning using “dplyr” in R. Extensive preprocessing of climate data from NOAA in R and beginner Python projects from Automate the Boring Stuff With Python are available with fully documented code in my GitHub account. You can take a look, if interested.

Please mention in the Contact section the topic you would like me to write about, or any question you have. I will get back to you within a week.