A deep dive into our research on vocabulary adaptation for reliable medical summarization — from IJCAI 2024 to ACL 2025, with practical insights for anyone building NLP models for healthcare.
In December 2025, I had the privilege of delivering a talk at the Breakfast Talk series of Microsoft Research India in Bangalore. The topic was something my collaborators and I have been working on for the past three years: vocabulary adaptation strategies for fine-tuning medical language models, and what it means for reliable medical summarization.
This article is a detailed companion to that talk. If you were in the room, this gives you the full written version with all the references and technical details. If you weren’t, this is the article I wish I had when I first started thinking about the vocabulary problem in medical NLP.
Here’s the core insight that drives all of this work: when you ask a pre-trained language model to process the word “erythromycin” — a common antibiotic — BERT splits it into five subword tokens (“er ##yt ##hr ##omy ##cin”). BART and PEGASUS fare no better. This excessive fragmentation isn’t just a tokenization curiosity. It has real, measurable consequences for the quality and faithfulness of medical text generated by these models. And as we’ll see, fixing this problem opens a surprising window into how we can make medical AI more reliable.
Table of Contents
- Why Vocabulary Adaptation Matters for Generative AI Reliability
- The Fundamentals: How Tokenization Breaks Medical Text
- Our Approach: MEDVOC (IJCAI 2024)
- How to Actually Add the Extension Vocabulary: AdaptBPE (EMNLP 2024 Findings)
- Scaling to Large Language Models (ACL 2025 Findings)
- Key Results Across All Three Papers
- Open Questions and Future Directions
- Learning Outcomes
- Final Words
Why Vocabulary Adaptation Matters for Generative AI Reliability
Let me start with the problem.
Pre-trained language models — whether small (BERT, BART, PEGASUS) or large (Llama, Mistral, Qwen) — are trained on general-domain text. Their vocabularies are optimized for web text, news articles, and Wikipedia. When you fine-tune these models on medical text, you inherit their vocabulary, and that vocabulary doesn’t know medical language.
The consequence is what we call a low compression rate: the model’s tokenizer splits medical terminology into an excessive number of subword tokens. This isn’t a minor inconvenience — it has cascading effects:
In the encoder: Excessive splitting leads to poor representation of medical concepts. If “erythromycin” is represented as five disconnected subword tokens, the model has to reconstruct the meaning from fragments rather than processing it as a single semantic unit.
In the decoder: The model must generate more tokens to produce each medical word. This means fewer medical words can be generated within a fixed output length, directly reducing the quality and informativeness of generated summaries.
For reliability: When the model fragments important medical terms, it’s more likely to produce summaries that miss key clinical concepts, substitute incorrect terms, or generate unfaithful content. In medical AI, faithfulness isn’t optional — an unfaithful summary can be dangerous.
This is why we position vocabulary adaptation as a reliability intervention, not just a performance optimization. It’s part of the broader shift in AI from chasing accuracy to ensuring trustworthiness — a theme I also covered in our WSDM 2025 tutorial on Trustworthy AI for Medicine.
The Fundamentals: How Tokenization Breaks Medical Text
Let me explain the vocabulary mechanics, because understanding them is essential to understanding our solution.
Vocabulary basics: A pre-trained language model has a finite vocabulary — a set of tokens that serve as its basic input/output units. A tokenizer decomposes any input word into tokens from this vocabulary. These are subword units: basic units smaller than words. The vocabulary is “open” in the sense that all words can be defined as a concatenation of subwords. Each vocabulary item has a learned embedding — a dense vector representation that the model uses during processing.
The fertility problem: We measure the severity of vocabulary mismatch using fertility (also called fragment score): the average number of tokens per word. For general English text, fertility is typically in the 1.0-1.5 range. For medical text processed by a general-domain tokenizer, it can be much higher.
Take “erythromycin” through different tokenizers:
- BERT (WordPiece): er ##yt ##hr ##omy ##cin — fertility of 5
- BART (ByteLevelBPE): ery th romy cin — fertility of 4
- PEGASUS (SentencePiece Unigram): ery th r omycin — fertility of 4
None of these tokenizers produce “erythromycin” as a single token, because the word was rare or absent in their training corpora.
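To make the fertility numbers concrete, here is a minimal sketch of greedy longest-match-first (WordPiece-style) tokenization and the fertility metric. The vocabulary below is a toy set chosen to reproduce the five-way split of "erythromycin" quoted above; it is not BERT's actual vocabulary, and real tokenizers add normalization and unknown-token handling omitted here.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in WordPiece.
    Non-initial pieces carry the '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def fertility(words, vocab):
    """Average number of subword tokens per word (lower is better)."""
    return sum(len(wordpiece_tokenize(w, vocab)) for w in words) / len(words)

# Toy vocabulary reproducing the BERT-style split from the text
vocab = {"er", "##yt", "##hr", "##omy", "##cin"}
```

Adding "erythromycin" itself to the vocabulary drops its fertility from 5 to 1, which is exactly the effect vocabulary adaptation is after.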
But isn’t this well-known? Yes, the general principle is well-known. Vocabulary adaptation has been widely studied for adapting language models to new languages — for example, adding Hindi or Chinese tokens to an English-trained model. The standard method is: introduce new tokens, initialize their embeddings, and apply continual pre-training on target-language data. Recent works like OFA (NAACL 2024 Findings) and HYPEROFA (ACL 2025 SRW) focus on novel embedding initialization strategies.
What’s different about our work: We use vocabulary adaptation to adapt models to a new domain (medicine) within the same source language (English). This setting has distinct challenges — the vocabulary mismatch is subtler, the available training data is smaller, and the downstream tasks (summarization) are more sensitive to vocabulary choices than classification tasks.
Our Approach: MEDVOC (IJCAI 2024)
MEDVOC was our first paper in this line of research, published at IJCAI 2024 as a main track paper. This was joint work with Gunjan Balde (PhD candidate at IIT Kharagpur), Prof. Niloy Ganguly, and Prof. Mainack Mondal.
The core idea: Update the pre-trained language model’s vocabulary during fine-tuning by adding domain-specific subwords, then train these new vocabulary items through an intermediate fine-tuning step on biomedical text.
Let me walk through the three components.
Component 1: Constructing the Extension Vocabulary
We don’t just add every medical word we can find. The construction is principled:
Phase 1 — Candidate Subword Selection: We identify medical words that are split more than once by the model tokenizer. To determine whether a word is medical, we use QuickUMLS — given a word, it outputs a similarity score between 0 and 1 against the Unified Medical Language System, and we consider words with similarity > 0.95 as medical terms.
We then train the PLM’s tokenization scheme (WordPiece for BertSumAbs, ByteLevelBPE for BART, SentencePiece Unigram for PEGASUS) on two corpora to obtain candidate subwords:
- V_TGT-TEMP: From the target downstream task dataset
- V_PAC: From the PubMed Abstract Collection (a large biomedical corpus)
Why two sources? The target dataset is small (700 to 2,000 data points) — it alone wouldn’t produce a robust vocabulary. The PubMed Abstract Collection provides broader coverage of medical terminology.
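The candidate-selection logic of Phase 1 can be sketched as follows. This is a simplified illustration, not MEDVOC's actual code: `tokenize` stands for the PLM's tokenizer, and `medical_similarity` is a hypothetical stand-in for the QuickUMLS lookup (QuickUMLS's real interface matches text spans against UMLS and returns candidate concepts with scores).

```python
def select_candidates(words, tokenize, medical_similarity, sim_threshold=0.95):
    """Pick medical words that the PLM tokenizer over-fragments.

    tokenize: word -> list of subword tokens (the PLM's tokenizer)
    medical_similarity: word -> score in [0, 1] (stand-in for QuickUMLS
    matching against UMLS)
    """
    candidates = set()
    for word in set(words):
        pieces = tokenize(word)
        # "split more than once" = fragmented into three or more subwords
        if len(pieces) > 2 and medical_similarity(word) > sim_threshold:
            candidates.add(word)
    return candidates

# Toy example: a fragmented medical term qualifies; a whole-word token
# and a fragmented-but-non-medical word do not.
toy_tok = {
    "erythromycin": ["er", "##yt", "##hr", "##omy", "##cin"],
    "cholesterol": ["chol", "##est", "##erol"],
    "patient": ["patient"],
    "running": ["runn", "##ing"],
}
sims = {"erythromycin": 0.99, "cholesterol": 0.98, "patient": 0.99, "running": 0.1}
selected = select_candidates(list(toy_tok), toy_tok.__getitem__, sims.__getitem__)
```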
Component 2: Hyperparameter Search for Optimal Vocabulary Size
This is where MEDVOC gets interesting. We have two hyperparameters:
- K: The size of V_PAC used to filter out infrequent subwords from V_TGT-TEMP
- A: The optimal subset of V_PAC to add to V_TGT to increase its size
The key insight: optimizing fragment score is an effective proxy for optimizing medical summarization performance. Running the full training pipeline (intermediate fine-tuning + target fine-tuning) for each vocabulary configuration would require 240 combinations × 45 hours each, roughly 450 days of compute; instead, we optimize the computationally cheap fragment score as a proxy metric.
The maximum time for hyperparameter search across all 240 settings was 5 hours 45 minutes (for PEGASUS on the EBM dataset). This makes MEDVOC practical for academic research budgets.
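The proxy search reduces to a simple minimization. The sketch below assumes a hypothetical `build_tokenizer` helper that constructs the tokenizer for a given (K, A) configuration; the point is that scoring each configuration only requires tokenizing the target text, never training a model.

```python
def fragment_score(words, tokenize):
    """Average subwords per word on the target dataset (lower is better)."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def search_vocab_config(configs, target_words, build_tokenizer):
    """Pick the vocabulary configuration minimizing fragment score.

    build_tokenizer(config) -> word-level tokenize function; a cheap
    proxy for running the full fine-tuning pipeline per configuration.
    """
    return min(configs,
               key=lambda cfg: fragment_score(target_words, build_tokenizer(cfg)))
```

For instance, a configuration whose extended vocabulary keeps medical words whole scores lower than one that falls back to fine-grained splits, so the search selects it without any training.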
Component 3: Training the Extension Vocabulary
The critical question: how do you learn good embeddings for the newly added vocabulary items?
We can’t just add tokens and fine-tune on the small target dataset — there isn’t enough data. Instead, we use biomedical article title generation from PubMed abstracts as an intermediate fine-tuning task. This works because:
- The source document length of medical summarization datasets (EBM: 276 words, BioASQ: 505 words) is more similar to PubMed abstracts (276 words) than to CNN/DailyMail (700 words)
- The reference summary length of consumer health question summarization (MeQSum, CHQSum: ~12 words) is close to PubMed titles (~15 words) rather than CNN/DailyMail summaries (~57 words)
This also addresses a known limitation: CNN/DailyMail is not suited for medical domain intermediate fine-tuning. Beyond the length mismatch, CNN/DailyMail suffers from lead bias (reference summaries are mostly present in the top 1-5 sentences) and has low vocabulary overlap with medical datasets.
Results
MEDVOC achieved consistent improvements across four medical summarization datasets (EBM, BioASQ, MeQSum, CHQSum) and three model architectures (BertSumAbs, BART, PEGASUS). The compute overhead was marginal: only 0.15% increase in parameters for PEGASUS, 1.15% for BART, and 1.59% for BertSumAbs.
Critically, the improvements were largest where they mattered most:
- High OOV concentration: 17.29% improvement
- Longer reference summaries (>30 tokens): 10.71% improvement
Human evaluation by medical experts hired through Prolific showed that MEDVOC generates more faithful summaries (88% vs. 59%) and more relevant summaries (76% vs. 50%), measured as the fraction of summaries receiving positive scores (above 3 on a Likert scale).
How to Actually Add the Extension Vocabulary: AdaptBPE (EMNLP 2024 Findings)
After MEDVOC, we discovered a subtle but important problem with how extension vocabulary is integrated into existing models.
The ill-tokenization problem: When you add new vocabulary items and their associated merge rules to a BPE-based tokenizer, they’re appended at the end of the existing vocabulary. This means the priority of domain merge rules is always lower than the priority of PLM merge rules. In practice, the added merge rules are never utilized for many instances, resulting in ill-tokenization of the very words the vocabulary was extended to fix.
Let me illustrate. Suppose we add “cholesterol” as a new vocabulary item. In standard BPE, the tokenizer first applies its original merge rules (which have higher priority). These original rules might merge character sequences in a way that prevents “cholesterol” from ever forming as a token — because the constituent characters get consumed by other, higher-priority merges.
Our solution — AdaptBPE: We proposed a fundamental change to the BPE algorithm. Instead of starting by splitting text to the character level (standard BPE’s first step), AdaptBPE first finds the longest substring match in the domain vocabulary iteratively and preserves these matches. Only then does it run the standard merge operations.
This ensures that domain-specific tokens like “cholesterol” are correctly captured in the longest substring match phase, mitigating the ill-tokenization issue.
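Here is a minimal sketch of that pre-pass, under simplifying assumptions: it only models the "longest substring match" phase, and represents the subsequent standard BPE step by a plain character split (standard BPE's starting point), rather than applying real merge rules.

```python
def adaptbpe_split(word, domain_vocab):
    """AdaptBPE-style pre-pass sketch: iteratively carve out the longest
    substring of the word that appears in the added domain vocabulary,
    recursing on what remains to its left and right. Segments with no
    domain match fall back to characters, standing in for standard BPE."""
    if not word:
        return []
    best = None
    # scan lengths from longest to shortest; leftmost match wins ties
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            sub = word[start:start + length]
            if sub in domain_vocab:
                best = (start, sub)
                break
        if best:
            break
    if best is None:
        return list(word)  # no domain token found: character-level fallback
    start, sub = best
    left = adaptbpe_split(word[:start], domain_vocab)
    right = adaptbpe_split(word[start + len(sub):], domain_vocab)
    return left + [sub] + right
```

Note the contrast with standard BPE: there, higher-priority original merges could consume the characters of "cholesterol" before the appended domain rule ever fires, whereas here the domain token is preserved before any merging happens.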
Results: AdaptBPE improved classification accuracy (when integrated with AVocaDo, the state-of-the-art vocabulary adaptation method for classification) across four domains — biomedicine, scientific text, news, and reviews — with an overall improvement of 3.57%. For medical summarization (integrated with MEDVOC), it improved ROUGE-L across all four datasets, with particularly strong gains in high OOV concentration settings.
Scaling to Large Language Models (ACL 2025 Findings)
The natural question after MEDVOC and AdaptBPE was: does vocabulary adaptation work for large language models?
This matters because LLMs like Llama-3.1 (a 128K-token vocabulary) and Qwen2 (151,646 tokens) have vastly larger vocabularies than small models. You might expect a 128K vocabulary to cover most medical terminology. But as we showed, even LLMs with huge vocabularies still face vocabulary mismatch: over-fragmentation persists for specialized medical terms.
Adapting MEDVOC for LLMs
Scaling MEDVOC to LLMs required several modifications:
MEDVOC-LLM — filtering out noise: We discovered that some tokens added by the original MEDVOC method were not useful for generation. Several tokens populated from the PubMed corpus were infrequent in the downstream task. Others were non-standard mixtures of numerals, punctuation, and letters (e.g., “▁95%”, “=1.”) that don’t conform to LLM tokenizer conventions. MEDVOC-LLM filters these out.
ScafFix — removing scaffolding tokens: When you add a medical word like “cholesterol” to an LLM’s vocabulary via BPE, the addition procedure also adds intermediate subwords incrementally (scaffolding tokens). These scaffolding tokens are infrequent and rarely used once the full word is added. ScafFix leverages AdaptBPE to add only whole medical words, eliminating the scaffolding overhead.
Training procedure: We replaced intermediate fine-tuning (which required 300K data points) with continual pretraining on 20K PubMed abstracts — more efficient for LLMs. We used LoRA for parameter-efficient fine-tuning, with two variants:
- End-to-End: Freeze all base model layers except embedding layers, train LoRA adapters
- Two-Stage: First train only embedding layers (short duration), then unfreeze LoRA adapters and train both
New vocabulary embeddings are initialized as the average of existing subword embeddings — for example, EMBED_cholesterol = AVG[EMBED_ch, EMBED_ol, EMBED_ester, EMBED_ol].
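The averaging scheme is straightforward; a plain-Python sketch (no framework, with a toy two-dimensional embedding table rather than a real model's):

```python
def init_new_embedding(new_word, tokenize, embedding_table):
    """Initialize a new token's embedding as the mean of the embeddings
    of the subwords the original tokenizer splits it into."""
    pieces = tokenize(new_word)
    vecs = [embedding_table[p] for p in pieces]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 2-d table; real embedding tables have thousands of dimensions
table = {"ch": [1.0, 0.0], "ol": [0.0, 1.0], "ester": [1.0, 1.0]}
emb = init_new_embedding("cholesterol", lambda w: ["ch", "ol", "ester", "ol"],
                         table)
```

This gives the new token a sensible starting point in embedding space; continual pretraining then refines it.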
Fine-Grained Evaluation
We defined three challenging evaluation settings (the top ten percent of each):
- DifficultRS: High difficult-OOV concentration in reference summaries
- DifficultSD: High difficult-OOV concentration in source documents
- NovelRS: High novel word concentration in reference summaries
Where “difficult-OOV” means medical words split more than thrice by the model tokenizer, and “novel” means words in the summary not present in the source document.
Key Results
On full data: At least one vocabulary adaptation strategy outperformed the BASE model in 5 out of 8 settings (across Llama-2, Llama-3.1, and Mistral on multiple datasets). ScafFix was the best-performing method.
On challenging scenarios: Vocabulary adaptation outperformed BASE in 7 out of 8 settings (averaging across DifficultRS, DifficultSD, and NovelRS). This is the key finding — VA’s benefits are most pronounced exactly where they’re needed most.
Factual consistency: At least one vocabulary adaptation strategy improved factual consistency over BASE in 5 out of 6 settings. The average improvement was 18.75% for Llama-2 and 14.82% for Llama-3.1. For medical summarization, this improvement in faithfulness is arguably more important than ROUGE improvements.
Human evaluation: Medical experts evaluated 30 data points with high difficult-OOV concentration (BASE vs. ScafFix) on relevance and coherence (Likert 1-5) and faithfulness (binary). ScafFix generated more faithful and more relevant summaries.
Key Results Across All Three Papers
Let me step back and summarize the trajectory:
2024 (IJCAI): We established that vocabulary adaptation works for small language models on medical summarization — with MEDVOC achieving consistent improvements across three model architectures and four datasets, using an efficient fragment-score-based hyperparameter search.
2024 (EMNLP Findings): We identified and fixed the ill-tokenization problem with AdaptBPE — showing that how you integrate extension vocabulary matters as much as which vocabulary you add.
2025 (ACL Findings): We demonstrated that vocabulary adaptation scales to LLMs (7B-8B parameter models) and that even models with 128K+ vocabularies benefit from domain-specific vocabulary adaptation, particularly in challenging high-OOV settings.
The consistent thread: vocabulary adaptation is most valuable in the hardest cases — high OOV concentration, high novelty requirements, and long-form generation settings. These are precisely the settings where standard LLM pipelines (zero/few-shot prompting, parameter-efficient fine-tuning without vocabulary changes) struggle most.
Open Questions and Future Directions
I closed my Microsoft Research talk with several open questions that I believe represent important directions for the field:
Improving Efficiency
Currently, MEDVOC requires a separate extension vocabulary creation and continual pretraining step for each combination of LLM and downstream dataset. The PubMed Abstract Collection alone, while valuable, limits diversity. Future work should explore more diverse medical training corpora and reduce the dependence on specific downstream tasks.
Beyond Surface-Level Factors
Our current evaluation focuses on surface-level factors: high OOV, high novelty, and long reference summaries. But these don’t encompass the semantic and reasoning issues that vocabulary mismatch can cause.
How do LLMs manage conflicting evidence between source and target domains? How does an LLM handle lack of evidence in the target domain — does it simply substitute evidence from its pre-training knowledge? These deeper questions connect vocabulary adaptation to the broader challenge of LLM faithfulness and hallucination.
End-to-End Vocabulary Adaptation
All current vocabulary adaptation approaches, including ours, separately create an extension vocabulary and then train the model. A more elegant solution would make extension vocabulary creation part of the model training process — an iterative framework where the model identifies its own vocabulary gaps and fills them during training.
VA for Compound AI Systems
Vocabulary adaptation has not yet been applied to compound AI setups — RAG systems, agentic AI, and other multi-component architectures. As the field moves toward these compound systems for clinical applications, understanding how vocabulary mismatch affects retrieval quality, tool use, and multi-step reasoning is an important open question.
Learning Outcomes
If you take away four things from this article, let them be these:
- Vocabulary mismatch severely affects AI reliability, and vocabulary adaptation is an effective mitigation strategy — particularly for high OOV concentration, high novelty, and long-form generation settings.
- VA works best in high vocabulary mismatch scenarios, and may lead to performance drops for general cases. It’s a targeted intervention, not a universal one.
- VA can be achieved with low compute overhead, but separate models need to be built for each domain. The hyperparameter search (fragment score optimization) is fast; the training (continual pretraining + fine-tuning) is the main cost.
- Integrating extension vocabulary requires careful training mechanisms. For LLMs, continual pretraining is essential — you can’t just add tokens and fine-tune. And how you integrate the vocabulary (standard BPE append vs. AdaptBPE) matters significantly.
Final Words
When I started this line of research during my PhD at IIT Kharagpur, vocabulary adaptation for domain-specific NLP felt like a niche technical problem. Over three years and three papers, I’ve come to see it as something more fundamental: a window into how language models represent and process specialized knowledge, and a practical lever for improving the reliability of medical AI systems.
The fact that a model with billions of parameters can be meaningfully improved by updating 0.15% to 1.59% of them, just the vocabulary embeddings, tells us something important about how these models work. The vocabulary is the interface between the model and the world. If that interface is poorly calibrated for a domain, everything downstream suffers.
As the field moves toward deploying LLMs in clinical settings — for note summarization, patient communication, clinical coding, and decision support — ensuring that these models can reliably process and generate medical terminology isn’t a luxury. It’s a prerequisite.
I want to thank my collaborators Gunjan Balde, Prof. Niloy Ganguly, and Prof. Mainack Mondal at IIT Kharagpur for making this work possible. And I want to thank Microsoft Research India for the invitation to present this work — the feedback from that session helped sharpen several of the ideas discussed here.
Papers Referenced
- MEDVOC: Balde*, Roy*, Mondal, Ganguly. “MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization.” IJCAI 2024 (Main Track). Paper
- AdaptBPE: Balde*, Roy*, Mondal, Ganguly. “Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models.” EMNLP 2024 (Findings). Paper
- ScafFix / VA for LLMs: Balde, Roy, Mondal, Ganguly. “Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings.” ACL 2025 (Findings). Paper
Related Articles
- How to Get Started with Research in AI for Medicine
- The Side-Hustle Scientist: Publish AI Papers While Working Full-Time
- How to Start Working with Clinical Data as an AI Researcher
- A Self-Help Guide to Starting Your Own ML Research Project
Talk Slides
- Microsoft Research India Breakfast Talk, December 2025
- WSDM 2025 Tutorial: Building Trustworthy AI Models for Medicine