My name is Soumyadeep Roy. I am a Ph.D. Candidate in the Computer Science and Engineering Department at IIT Kharagpur, West Bengal, India. My thesis topic is “Incorporating Domain Knowledge in Medical NLP Applications.” I work under the supervision of Prof. Niloy Ganguly and Prof. Shamik Sural. I am also part of the Complex Networks Research Group (CNeRG), IIT Kharagpur.
I joined as a Research Associate in January 2021 at the Leibniz AI Future Lab, L3S Research Center, Leibniz University Hannover, Germany, where I work under Prof. Niloy Ganguly and Prof. Wolfgang Nejdl. Here, I work on the “Big Data in Psychiatric Disorders” project, where we study patient subtyping of Parkinson’s disease and schizophrenia using clinical and genomic data (SNPs).
I completed my Master’s (MS research) from IIT Kharagpur. I defended my thesis titled Computational Approaches for Online Reputation Monitoring [Modeling, Analysis, and Recommendation]. This work resulted from a collaboration with Adobe Research, Bangalore, India. Here, we analyzed the textual content generated by brands on corporate websites and developed text classification and sentence-level recommendation solutions that assist reputation management experts.
More details are provided in the Projects section below.
Recent News
July 2023 – Full paper “GeneMask: Fast Pretraining of Gene Sequences to Enable Few-shot Learning” accepted at the 26th European Conference on Artificial Intelligence (ECAI) 2023, Core A conference.
Serving as reviewer for AAAI 2024 and EMNLP 2023
May 2023 – Full paper “Interpretable Clinical Trial Search using Pubmed Citation Network” accepted at the IEEE International Conference on Digital Health (ICDH) 2023. Attended the conference in person in Chicago, United States. Our paper was selected as one of the three candidates for the “Best Student Paper” award. Slides, code, and DOI details are provided below in the “Research publications” section.
March 2023 – Served as Reviewer for ACL 2023
October 2022 – Serving as Reviewer in EACL 2023 and IEEE Transactions on Cognitive and Developmental Systems
August 2022 – Serving as Reviewer in AAAI 2023 (Main) Track. (conference link)
Volunteering at the Indian Symposium for Machine Learning (IndoML) 2022 Datathon. The task description and dataset are already released, so please register to participate. The submission deadline is 7th October 2022.
July 2022 – Served as Reviewer in EMNLP 2022 for the tracks “Unsupervised and Weakly-Supervised Methods in NLP” and “NLP Applications.” (conference link)
April 2022 – Teaching Assistant for the Foundations of Information Retrieval course at Leibniz University Hannover
Oct 2021 – Teaching Assistant for the Natural Language Processing course at Leibniz University Hannover. Volunteering at the Indian Symposium for Machine Learning (IndoML) 2021, to be held virtually from Dec. 16 – 18, 2021
August 2021 – Short paper “Developing Knowledge-Aware Neural Models for Medical Forum Question Classification” accepted at CIKM 2021 (acceptance rate – 28%). Joint work with Leibniz AI Future Lab and Adobe Research, India
January 2021 – Joined Leibniz AI Future Lab at L3S Research Center, Hannover, Germany as a Research Associate
3rd November 2020 – Paper accepted at ACM Transactions on the Web (TWEB) titled “An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis, and Recommendation” (arXiv preprint). Joint work with Big Data Experience Lab, Adobe Research, Bangalore, India. Code available too!
2nd November 2020 – Defended my Ph.D. Registration Seminar. Thesis topic: Incorporating Domain Knowledge in Medical NLP Applications
Research publications
Soumyadeep Roy, Jonas Wallat, Sowmya S Sundaram, Wolfgang Nejdl, Niloy Ganguly, GeneMask: Fast Pretraining of Gene Sequences to Enable Few-shot Learning, in the 26th European Conference on Artificial Intelligence (ECAI) 2023, September 30 – October 5, 2023, Krakow, Poland, (Core A conference) [Code][arXiv]
Soumyadeep Roy, Niloy Ganguly, Shamik Sural, Koustav Rudra, Interpretable Clinical Trial Search using Pubmed Citation Network, in the 2023 IEEE International Conference on Digital Health (ICDH), July 2 to 8, 2023, Chicago, United States, doi: 10.1109/ICDH60066.2023.00056 [Paper][Code][Slides] Candidate for Best Student Paper Award
Soumyadeep Roy, Sudip Chakraborty, Aishik Mandal, Prakhar Sharma, Gunjan Balde, Anandhavelu Natarajan, Megha Khosla, Shamik Sural and Niloy Ganguly, Developing Knowledge-Aware Neural Models for Medical Forum Question Classification, in the 30th ACM International Conference on Information and Knowledge Management (CIKM), 1-5 November 2021, Online (Short Paper) [Code][arXiv][DOI][Slides][Video]
Soumyadeep Roy, Shamik Sural, Niyati Chhaya, Anandhavelu Natarajan, and Niloy Ganguly, An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis, and Recommendation, in ACM Transactions on the Web (TWEB), 25 pages, November 2020 (Journal) [arXiv][DOI] [Slides-MS Thesis version] [Code and Data]
Soumyadeep Roy, Koustav Rudra, Nikhil Agrawal, Shamik Sural, Niloy Ganguly, Towards an Aspect-based Ranking Model for Clinical Trial Search, in the 8th International Conference on Computational Data and Social Networks (CSoNet 2019), November 18 – 20, 2019, Ho Chi Minh City, Vietnam (Full Paper) [PDF] [DOI] [Code] [Data][Slides]
Soumyadeep Roy, Niloy Ganguly, Shamik Sural, Niyati Chhaya, and Anandhavelu Natarajan, Understanding Brand Consistency from Web Content, in Proceedings of the 10th ACM Conference on Web Science, WebSci 19, (Boston, MA, USA), pp. 245–253, ACM, 2019. (Full Paper) [DOI] [PDF] [Slides][Data]
Soumyadeep Roy, Nibir Pal, Kousik Dasgupta, Binay Gupta; Understanding Email Interactivity and Predicting User Response to email, In Methodologies and Application Issues of Contemporary Computing Framework, pp. 69-79. Springer, Singapore, 2018 (Book Chapter) [DOI][Paper][Code][Slides]
Soumyadeep Roy; Automated EBM-oriented Summarization of Active or Recruiting Clinical Trials at COMSNETS 2018 held at Bangalore, India on January 3 – 7, 2018 [Link] Best Graduate Forum Presentation Award
Research Projects
Project A: Patient subtyping in Parkinson’s disease and Schizophrenia using clinical and genomic data
Motivation: Parkinson’s Disease (PD) is a complex neurodegenerative disorder with high heterogeneity in clinical symptoms (motor and non-motor), progression course, treatment response, and genetic factors. Patient subtyping helps improve disease mechanism understanding and facilitates targeted interventions or treatment regimes.
Current Situation: Most PD subtypes are based on motor symptoms and do not focus on non-motor symptoms. General phenotype-based approaches are not personalized, and approaches that consider phenotype and genotype data together are not well explored.
Solution and Vision: We aim to develop data-driven patient subtyping methods that integrate both motor and non-motor characteristics of PD and jointly utilize clinical and genotype data. These automatically learned subtypes would be examined to identify potential markers for neurodegenerative diseases like PD.
With the help of these predictive markers, early therapeutic intervention in neurodegenerative diseases could be realized. We are working closely with Prof. Dr. Helge Frieling of the Department of Psychiatry, Social Psychiatry and Psychotherapy (MHH) and other biomedical partners at MHH associated with the Leibniz AI Lab.
Here, we will initially focus on young-onset PD patients and patients with comorbidities like schizophrenia and severe depression. Our final goal will be to develop personalized AI-based solutions to assist doctors with their day-to-day clinical practice.
Project B: Knowledge-Guided Efficient Representation Learning for Genomic Applications
Understanding gene regulatory code and developing deep learning models for gene sequence representation learning have become active areas of research.
Learning better representations is difficult because of polysemy and distant semantic relationships, which prior methods often fail to capture, especially in data-scarce scenarios.
Further, these models do not utilize gene-related biomedical domain knowledge. Each downstream task requires a separate fine-tuned model, which may lead to issues like catastrophic forgetting.
We will discuss how we try to address some of these challenges and our initial results. We evaluate on gene sequence-level classification tasks like promoter region prediction, chromatin profile prediction, and promoter-enhancer interaction prediction.
Project C: Developing an Aspect-based Search System for Clinical Trials Search
Publications: CSoNet 2019 Full paper
Abstract: Clinical trials are crucial for the practice of evidence-based medicine. They provide updated and essential health-related information for patients. Sometimes, clinical trials are the first source of information about new drugs and treatments.
Different stakeholders, such as trial volunteers, trial investigators, and meta-analysis researchers, often need to search for trials. In this paper, we propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between the user query and clinical trials.
However, different stakeholders may have different information needs, and accordingly, we rank the retrieved clinical trials based on the following four aspects – Relevancy, Adversity, Recency, and Popularity.
We aim to develop a clinical trial search system that covers multiple disease classes instead of only focusing on retrieving oncology-based clinical trials. We follow a rigorous annotation scheme and create an annotated retrieval set for 25 queries across five disease categories.
Our proposed method performs better than the baseline model in almost 90% of cases. We also measure the correlation between the different aspect-based ranking lists and observe a high negative Spearman’s rank correlation coefficient between popularity and recency.
Keywords: Clinical trial search, Aspect-based ranking, Biomedical information retrieval
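To make the retrieval and correlation steps in the abstract above more concrete, here is a minimal sketch. It is not the implementation from the paper. It assumes that UMLS concept IDs have already been extracted for the query and for each trial, it uses Jaccard overlap as the scoring function (an assumed choice), and it uses hypothetical `last_updated` and `citations` fields as stand-ins for the Recency and Popularity aspects. The concept IDs and trials are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Trial:
    trial_id: str
    concepts: frozenset   # UMLS concept IDs, assumed to be pre-extracted
    last_updated: date    # hypothetical proxy for the Recency aspect
    citations: int        # hypothetical proxy for the Popularity aspect


def concept_overlap(query_concepts, trial_concepts):
    """Jaccard overlap between two concept sets (an assumed scoring choice)."""
    union = query_concepts | trial_concepts
    return len(query_concepts & trial_concepts) / len(union) if union else 0.0


def retrieve(query_concepts, trials, min_overlap=0.0):
    """Return trials whose overlap with the query exceeds min_overlap, best first."""
    scored = [(concept_overlap(query_concepts, t.concepts), t) for t in trials]
    scored.sort(key=lambda pair: -pair[0])
    return [t for score, t in scored if score > min_overlap]


def ranks(trials, key):
    """Map trial_id -> rank (1 = best) under one aspect; assumes no ties."""
    ordered = sorted(trials, key=key, reverse=True)
    return {t.trial_id: i + 1 for i, t in enumerate(ordered)}


def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings over the same items."""
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d2 = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Placeholder data: concept IDs and trials are invented for illustration.
trials = [
    Trial("T1", frozenset({"CUI-A", "CUI-B"}), date(2015, 1, 1), 120),
    Trial("T2", frozenset({"CUI-A"}), date(2018, 6, 1), 40),
    Trial("T3", frozenset({"CUI-A", "CUI-C"}), date(2021, 3, 1), 5),
    Trial("T4", frozenset({"CUI-D"}), date(2020, 1, 1), 9),
]
query = frozenset({"CUI-A"})

hits = retrieve(query, trials)
by_recency = ranks(hits, key=lambda t: t.last_updated)
by_popularity = ranks(hits, key=lambda t: t.citations)
rho = spearman(by_recency, by_popularity)
```

In this toy data the older trials have more citations, so the recency and popularity rankings are exact reverses of each other and `rho` is negative. Real aspect scores would contain ties, in which case a tie-aware correlation routine would be the safer choice.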
Completed Projects
Developing an integrated framework for understanding brand consistency from online content
Publications: WebSci 2019 Full paper, TWEB Journal Paper 2021
Abstract: A consumer-dependent (business-to-consumer) organization tends to present itself as possessing a set of human qualities, which is a company’s brand perception.
The perception is impressed upon the consumer through the content (be it in the form of advertisements, blogs, or magazines) produced by the organization. In this digital marketing era, a continuous generation of web content is needed to keep up consumer engagement and thus make a lasting impression on them.
However, such content authoring at scale introduces challenges in maintaining consistency in a brand’s messaging tone.
Thus the first task is to develop a quantitative technique to check whether the desired brand personality is maintained in a published article.
In the first work, we quantify brand personality and formulate its linguistic features. We develop five independent classification models to score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness, and sophistication.
We perform a large-scale data collection activity and gather content from around 300K web pages covering around 650 Fortune 1000 companies. We also develop a novel deep learning architecture that leverages transfer learning to improve our classifier performance further.
We also study the effect of directly adding linguistic features to our neural architecture. The classifier automatically identifies the web articles which are not consistent with the mission and vision of a company.
A consistent brand will generate trust and retain customers over time since consumers look for regularity and common patterns.
Studies have provided various strategies for maintaining the brand image over time and the impact of major company-related events like brand extension and mergers on it.
However, there is no standard measure that quantifies the extent of brand inconsistency for a given company.

Thus, in the second work, we quantify brand consistency and study company-level contributing factors like the effect of brand extensions.
We observe that promotion posts primarily portray competence as their brand personality across all brands and favor the brand consistency of companies that portray competence in most of their posts (primary trait).
We find that the presence of brand extension-related posts reduces the brand consistency score of a company and that it is more challenging for a company with low topical consistency to keep its posts aligned with the brand personality it wants to evoke.
We discover companies that post consistently and find that financially affluent companies are better at maintaining consistency.
To address the brand inconsistency issue, we developed a helper tool that recommends the sentences that need to be modified to make a web article more consistent with the brand perception intended by the content writers.
Keywords: online reputation management, brand image, brand personality, brand consistency, text classification, transfer learning, sentence ranking
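The "five independent classification models" setup from the abstract above can be sketched as one binary classifier per personality dimension. This is only an illustration of the structure. The tiny bag-of-words Naive Bayes below is a stand-in chosen to keep the example dependency-free, not the linguistic-feature or deep learning models from the papers, and the training texts and labels are invented toy data.

```python
import math
from collections import Counter

DIMENSIONS = ["sincerity", "excitement", "competence", "ruggedness", "sophistication"]


class BinaryNaiveBayes:
    """Tiny bag-of-words Naive Bayes; a stand-in for any per-dimension classifier."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_joint(self, words, label):
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Add-one smoothing on both the class prior and the word likelihoods.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for w in words:
            if w in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def score(self, text):
        """Return P(label = 1 | text) under the Naive Bayes model."""
        words = text.lower().split()
        pos = self._log_joint(words, 1)
        neg = self._log_joint(words, 0)
        return 1 / (1 + math.exp(neg - pos))


def train_models(texts, labels_by_dimension):
    """Fit one independent binary classifier per personality dimension."""
    return {
        dim: BinaryNaiveBayes().fit(texts, labels_by_dimension[dim])
        for dim in DIMENSIONS
    }


def personality_scores(models, text):
    """Score one article on all five dimensions."""
    return {dim: model.score(text) for dim, model in models.items()}


# Invented toy data: one short "article" per dimension.
texts = [
    "honest family values",
    "thrilling bold adventure",
    "reliable expert engineering",
    "tough outdoor durable",
    "elegant luxury charm",
]
labels_by_dimension = {
    dim: [1 if j == i else 0 for j in range(len(texts))]
    for i, dim in enumerate(DIMENSIONS)
}

models = train_models(texts, labels_by_dimension)
scores = personality_scores(models, "reliable engineering")
```

Because each dimension has its own model, an article can score high on several dimensions at once, which a single five-way classifier would not allow. On this toy data, `scores` is highest for competence.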
Brief Intro
I have been working on Machine Learning problems for the last five years. Through this page, firstly, I will share my experience on how to start working on real-life datasets. I have worked on projects as diverse as Weather, Email, Event Extraction from text, Emotion mining, Twitter, Online Reputation Monitoring, and Digital Health. I am no expert, but I feel it will help you to get started.
Secondly, I will share resources and news to make Computer Science research accessible and understandable to the general population. As of now, I will be targeting undergraduates and those starting their careers in research like me.
Some of my past projects range from scraping dynamic websites using Python to data cleaning using “dplyr” in R. Extensive preprocessing of climate data from NOAA in R and beginner Python projects from Automate the Boring Stuff With Python are available with fully documented code in my GitHub account. You can take a look if interested.
Please mention the topic you want me to write about, or any question you have, in the Contact section. I will get back to you as soon as possible.