Python for NLP: A Complete Tutorial with Pandas, NLTK, and spaCy (2025)

From raw text to model-ready features — a hands-on guide to the NLP libraries every data scientist and researcher needs to know.


NLP in 2025 is dominated by large language models, but the fundamentals haven’t changed: before you fine-tune a BERT model, you need to understand tokenization, stopwords, POS tagging, and text normalization. And for most real-world NLP projects, knowing how to process text with pandas, NLTK, and spaCy is still essential.

This tutorial takes you from a raw string to a cleaned, feature-rich representation ready for ML modeling. I’ll use a healthcare text example throughout — relevant for those of us working in medical AI.

Python is the most popular programming language for machine learning and natural language processing (NLP). Its concise syntax, together with an enormous ecosystem of packages such as pandas and NLTK and deep learning frameworks like PyTorch and TensorFlow, has made it the go-to language for data scientists and newcomers to programming alike.

Complete tutorial on how to use Python for NLP

The corresponding IPython notebook used in the above lecture can be downloaded from here.


Table of Contents

  • Setting Up Your NLP Environment
  • Text Preprocessing Pipeline
  • NLTK: Classical NLP Tools
  • spaCy: Modern, Production-Ready NLP
  • pandas: Handling Text at Scale
  • A Complete Pipeline: From Raw Clinical Text to Features
  • Modern Extension: HuggingFace Tokenizers
  • Resources and Next Steps

Setting Up Your NLP Environment
pip install pandas nltk spacy transformers
python -m spacy download en_core_web_sm

# For biomedical NLP
pip install scispacy
python -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz

Download NLTK data (run once):

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize on NLTK 3.9+
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

Text Preprocessing Pipeline

A standard NLP preprocessing pipeline:

Raw text → Lowercase → Remove noise → Tokenize → 
Remove stopwords → Stem/Lemmatize → Feature extraction

Let’s implement each step:

import re
import string
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sample clinical text
text = """Patient presents with shortness of breath (SOB) and chest pain 
since yesterday. BP: 130/85 mmHg. Prescribed Metformin 500mg twice daily.
No known drug allergies (NKDA). Follow-up in 2 weeks."""

# Step 1: Lowercase
text_lower = text.lower()

# Step 2: Remove noise (numbers, punctuation, extra spaces)
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)           # remove numbers
    text = re.sub(r'[^\w\s]', ' ', text)      # replace punctuation with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
    return text

cleaned = clean_text(text)

# Step 3: Tokenize
tokens = word_tokenize(cleaned)
sentences = sent_tokenize(text)

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
tokens_filtered = [t for t in tokens if t not in stop_words and len(t) > 2]

# Step 5: Lemmatize (preferred over stemming for clinical text)
lemmatizer = WordNetLemmatizer()
tokens_lemmatized = [lemmatizer.lemmatize(t) for t in tokens_filtered]

print("Original:", text[:80])
print("Cleaned tokens:", tokens_lemmatized[:10])

NLTK: Classical NLP Tools

NLTK provides a comprehensive suite of classical NLP components:

Part-of-Speech (POS) Tagging

from nltk import pos_tag

tokens = word_tokenize("The patient was diagnosed with Type 2 Diabetes Mellitus.")
pos_tags = pos_tag(tokens)
print(pos_tags)
# [('The', 'DT'), ('patient', 'NN'), ('was', 'VBD'), ('diagnosed', 'VBN'), ...]
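A common use of POS tags is to keep only content words. Here is a minimal sketch that filters a tagged list by Penn Treebank tag prefix, reusing the output shown above:

```python
# Tagged tokens mirroring the pos_tag output above
pos_tags = [('The', 'DT'), ('patient', 'NN'), ('was', 'VBD'),
            ('diagnosed', 'VBN'), ('with', 'IN'), ('Type', 'NN'),
            ('2', 'CD'), ('Diabetes', 'NNP'), ('Mellitus', 'NNP'), ('.', '.')]

# Penn Treebank tag prefixes for nouns, verbs, and adjectives
CONTENT_TAGS = ('NN', 'VB', 'JJ')

# str.startswith accepts a tuple, so one check covers all prefixes
content_words = [tok for tok, tag in pos_tags if tag.startswith(CONTENT_TAGS)]
print(content_words)
# ['patient', 'was', 'diagnosed', 'Type', 'Diabetes', 'Mellitus']
```

Dropping determiners, prepositions, and punctuation this way is a cheap alternative to a full stopword list when you already have tags.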

N-gram Language Models

from nltk.util import ngrams
from collections import Counter

tokens = word_tokenize("chest pain shortness of breath fever nausea")
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

bigram_freq = Counter(bigrams)
print(bigram_freq.most_common(5))
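Under the hood, `ngrams` is just a sliding window over the token list. An equivalent pure-Python version, handy when you want n-grams without the NLTK dependency, is:

```python
def ngrams_plain(tokens, n):
    """Sliding-window n-grams, equivalent to nltk.util.ngrams for lists."""
    # zip over n staggered views of the list: tokens, tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "chest pain shortness of breath fever nausea".split()
print(ngrams_plain(tokens, 2)[:3])
# [('chest', 'pain'), ('pain', 'shortness'), ('shortness', 'of')]
```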

Frequency Distribution

from nltk import FreqDist

fdist = FreqDist(tokens_lemmatized)
fdist.plot(30, cumulative=False)  # plots top 30 words (requires matplotlib)
print(fdist.most_common(10))

spaCy: Modern, Production-Ready NLP

spaCy is generally faster than NLTK and ships with pretrained statistical models for tagging, parsing, and NER, which makes it the better choice for most production and research pipelines:

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Patient John Smith, 45, was admitted to Apollo Hospital in Chennai with acute MI."
doc = nlp(text)

# Named Entity Recognition (NER)
print("Entities:")
for ent in doc.ents:
    print(f"  {ent.text:<20} → {ent.label_} ({spacy.explain(ent.label_)})")

# POS tagging + dependency parsing
print("\nTokens:")
for token in doc:
    print(f"  {token.text:<15} POS: {token.pos_:<8} DEP: {token.dep_}")

# Sentence boundary detection
print("\nSentences:")
for sent in doc.sents:
    print(f"  {sent.text}")

For Biomedical NER (scispaCy)

import spacy

# Load biomedical model
nlp_bio = spacy.load("en_core_sci_sm")

clinical_text = "The patient was treated with aspirin 100mg for atrial fibrillation."
doc = nlp_bio(clinical_text)

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# en_core_sci_sm tags all biomedical mentions with a single generic ENTITY label;
# for separate CHEMICAL and DISEASE labels, load the en_ner_bc5cdr_md model instead.

pandas: Handling Text at Scale

When you have thousands or millions of text samples, you need to process them efficiently:

import pandas as pd

# Load a clinical notes dataset
df = pd.read_csv("clinical_notes.csv")

# Apply preprocessing to an entire column at once
df['notes_clean'] = df['clinical_notes'].apply(clean_text)

# Vectorized string operations (faster than apply for simple tasks)
df['notes_lower'] = df['clinical_notes'].str.lower()
df['word_count'] = df['clinical_notes'].str.split().str.len()
df['has_diagnosis'] = df['clinical_notes'].str.contains('diagnosis|diagnosed', case=False)

# Filter rows containing specific terms
chest_pain_cases = df[df['clinical_notes'].str.contains('chest pain', case=False)]

# Extract all mentions of a pattern
df['blood_pressure'] = df['clinical_notes'].str.extract(r'BP:\s*(\d+/\d+)')
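Note that `str.extract` keeps only the first match per row. When a note can mention a value several times, `str.extractall` collects every match, indexed by row and match number. A minimal sketch on a small inline frame (hypothetical notes standing in for the CSV above):

```python
import pandas as pd

# Tiny inline frame with made-up notes for illustration
df = pd.DataFrame({'clinical_notes': [
    "BP: 130/85 mmHg at admission. BP: 120/80 mmHg after rest.",
    "Patient denies chest pain. BP: 140/90 mmHg.",
]})

# extractall returns one row per match, with a MultiIndex of
# (original row index, match number within that row)
all_bp = df['clinical_notes'].str.extractall(r'BP:\s*(\d+/\d+)')
print(all_bp)
```

Because the capture group is unnamed, the resulting column is labeled `0`; name the group (`(?P<bp>...)`) to get a named column instead.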

A Complete Pipeline: From Raw Clinical Text to Features

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Full preprocessing pipeline for a text string."""
    doc = nlp(text.lower())
    tokens = [
        token.lemma_ for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.like_num
        and len(token.text) > 2
    ]
    return " ".join(tokens)

# Load data
df = pd.read_csv("medical_qa_dataset.csv")  # columns: text, label

# Preprocess
df['text_processed'] = df['text'].apply(preprocess)

# Vectorize with TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df['text_processed'])
y = df['label']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train classifier
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

Modern Extension: HuggingFace Tokenizers

For anyone building on top of transformer models, HuggingFace tokenizers are now the standard:

from transformers import AutoTokenizer

# Clinical BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

text = "The patient presented with acute myocardial infarction."
tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

print("Token IDs:", tokens['input_ids'])
print("Decoded:", tokenizer.decode(tokens['input_ids'][0]))

The key difference from NLTK/spaCy: transformer tokenizers use subword tokenization (BPE or WordPiece), which handles out-of-vocabulary words gracefully — essential for medical terminology.
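To see why subwords help, here is a toy sketch of WordPiece's greedy longest-match-first algorithm over a tiny hypothetical vocabulary. Real tokenizers learn roughly 30,000 subwords from data, but the matching logic is the same idea:

```python
# Toy vocabulary for illustration; "##" marks a piece that continues a word
VOCAB = {"myo", "##cardial", "##card", "##ial", "infarct", "##ion"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1  # shrink the candidate and try again
        else:
            return [unk]  # no subword matched at this position
    return pieces

print(wordpiece("myocardial"))  # ['myo', '##cardial']
print(wordpiece("infarction"))  # ['infarct', '##ion']
```

An unseen word like "myocardial" never becomes an unknown token as long as its pieces are in the vocabulary, which is exactly what rare medical terminology needs.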


Resources and Next Steps

  • Complete NLP tutorial video: https://youtu.be/-Zt3eyZ9Rhg (NLTK and pandas walkthrough, IPython notebook available on my GitHub)
  • spaCy documentation (spacy.io) — excellent tutorials for every task
  • Hugging Face Course (huggingface.co/learn/nlp-course) — free, covers everything from transformers to fine-tuning
  • CS224N (Stanford NLP) — free lectures and notes at web.stanford.edu/class/cs224n/
  • Automate the Boring Stuff with Python — builds Python intuition before you tackle NLP

Related articles that may be of interest to you

For a comprehensive list of academic conferences in AI and machine learning, see my earlier article on the topic.

If you are new to writing papers using Latex for academic conferences, you can visit the following articles:

  1. How to set up a TeX environment on your local machine (article link)
  2. Conference or journal paper templates: the individual files and how to use them (article link)
  3. How to correctly write references and perform cross-referencing in your paper (article link)

Please comment below if you would like to add something.

