Exploratory Data Analysis in R: A Practical Tutorial for Beginners (2026)

From raw data to publication-quality visualizations — master EDA with dplyr and ggplot2 before you build your first model.

Updated 2026 edition — originally published August 2018


Exploratory Data Analysis (EDA) is the most important step before any machine learning model — and R makes it fast, visual, and intuitive. This tutorial covers everything from loading your first dataset to publication-quality visualizations.

Before you build any model, you need to understand your data. EDA is not optional — it’s where you catch data quality issues, identify patterns, and form hypotheses about which features matter. Skipping EDA and jumping straight to model training is one of the most common mistakes I see in student research projects.

R is particularly well-suited for EDA. Its core libraries (dplyr, ggplot2) are elegant and composable in ways that make exploratory work feel natural. I still reach for R when I’m doing preliminary data exploration, even when my final model ends up in Python.


Table of Contents

  • Setting Up R and RStudio
  • Loading and Inspecting Your Data
  • Handling Missing Values
  • Univariate Analysis: Understanding Each Variable
  • Bivariate and Multivariate Analysis
  • Feature Engineering with dplyr
  • Handling Imbalanced Data in R
  • Exporting Cleaned Data for Python
  • Quick Reference Code Snippets
  • Dealing with Large Datasets with data.table

Setting Up R and RStudio

Installation

Windows and Mac: Download R from https://cran.r-project.org/ and RStudio from https://posit.co/download/rstudio-desktop/

Ubuntu/Linux:

sudo apt-get update
sudo apt-get install r-base r-base-dev
# Then install RStudio from https://posit.co/download/rstudio-desktop/

Essential packages to install immediately

install.packages(c("dplyr", "ggplot2", "tidyr", "readr", "corrplot", "ROSE", "caret"))

Loading and Inspecting Your Data

library(dplyr)
library(ggplot2)

# Load dataset (we'll use the UCI Dresses Attribute Sales dataset)
dress_data <- read.csv("~/Downloads/Attribute_DataSet.csv", header=TRUE, stringsAsFactors=FALSE)

# First look
dim(dress_data)        # rows × columns
str(dress_data)        # data types and first few values
head(dress_data, 10)   # first 10 rows
summary(dress_data)    # min, max, mean, quartiles, NA count

The summary() function is your best first diagnostic. It tells you:

  • The range of continuous variables (catch absurd values like negative ages)
  • The number of NAs per column
  • Factor level counts for categorical variables (since we loaded with stringsAsFactors=FALSE, character columns only show their class — convert with as.factor() to see the counts)
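summary() surfaces these problems visually, but it helps to turn the eyeball check into an explicit count. A minimal sketch with a toy vector — the same expressions apply directly to dress_data$Rating:

```r
# Toy ratings vector standing in for dress_data$Rating
ratings <- c(4.6, 0, -1, 3.2, NA, 5.1)

# Count values outside the plausible 0-5 range (these deserve a closer look)
sum(ratings < 0 | ratings > 5, na.rm = TRUE)   # -> 2

# Count exact zeros, which are often a sentinel for "missing"
sum(ratings == 0, na.rm = TRUE)                # -> 1
```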

Handling Missing Values

# Count NAs per column
colSums(is.na(dress_data))

# Percentage missing per column
round(colMeans(is.na(dress_data)) * 100, 2)

# Remove rows with any NA (use cautiously — losing data)
dress_clean <- na.omit(dress_data)

# Impute numeric columns with median (safer for skewed data)
dress_data$Rating[is.na(dress_data$Rating)] <- median(dress_data$Rating, na.rm=TRUE)

# For categorical: impute with mode
mode_material <- names(sort(table(dress_data$Material), decreasing=TRUE))[1]
dress_data$Material[is.na(dress_data$Material) | dress_data$Material == "null"] <- mode_material
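Imputing column by column gets tedious on wide data. With dplyr's across() (dplyr 1.0+) you can median-impute every numeric column in one pass — a sketch on a toy frame standing in for dress_data:

```r
library(dplyr)

# Toy frame standing in for dress_data; both numeric columns have gaps
df <- data.frame(Rating = c(4, NA, 5, 3), Size = c(1, 2, NA, 2))

# Median-impute every numeric column in a single mutate()
df_imputed <- df %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

colSums(is.na(df_imputed))   # both columns now report 0 NAs
```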

Univariate Analysis: Understanding Each Variable

Continuous variables

# Histogram with density curve
ggplot(dress_data, aes(x=Rating)) +
  geom_histogram(aes(y=after_stat(density)), bins=20, fill="steelblue", alpha=0.7) +
  geom_density(color="darkred", linewidth=1.2) +
  labs(title="Distribution of Product Ratings", x="Rating (0–5)", y="Density") +
  theme_minimal()

# Box plot — shows outliers clearly
ggplot(dress_data, aes(y=Rating)) +
  geom_boxplot(fill="lightblue", outlier.color="red") +
  theme_minimal()

Categorical variables

# Bar chart for Style distribution
dress_data %>%
  count(Style, sort=TRUE) %>%
  ggplot(aes(x=reorder(Style, n), y=n)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Distribution of Dress Styles", x="Style", y="Count") +
  theme_minimal()

Bivariate and Multivariate Analysis

Correlation heatmap

library(corrplot)

# Select only numeric columns
numeric_cols <- dress_data %>% select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(numeric_cols, use="pairwise.complete.obs")
corrplot(cor_matrix, method="color", type="upper", tl.cex=0.8,
         col=colorRampPalette(c("blue", "white", "red"))(200))

Scatter plot using ggplot2

ggplot(dress_data, aes(x=Rating, y=Recommendation)) +
  geom_jitter(alpha=0.3, height=0.05) +
  geom_smooth(method="loess", color="red") +
  labs(title="Rating vs. Recommendation Probability") +
  theme_minimal()

Group comparison

# Does price category affect rating?
dress_data %>%
  group_by(Price) %>%
  summarise(
    mean_rating = mean(Rating, na.rm=TRUE),
    median_rating = median(Rating, na.rm=TRUE),
    count = n()
  ) %>%
  arrange(desc(mean_rating))

# Visualize group differences
ggplot(dress_data, aes(x=Price, y=Rating, fill=Price)) +
  geom_boxplot() +
  labs(title="Rating Distribution by Price Category") +
  theme_minimal() +
  theme(legend.position="none")
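The box plots suggest group differences, but it is worth backing the picture with a formal test. A Kruskal-Wallis test (base R, no normality assumption) is a reasonable choice for bounded ratings; on the real data that would be kruskal.test(Rating ~ Price, data=dress_data). A self-contained sketch with simulated groups:

```r
# Simulated stand-in for dress_data: three price groups with shifted means
set.seed(1)
toy <- data.frame(
  Price  = rep(c("Low", "Average", "High"), each = 30),
  Rating = c(rnorm(30, 4.2, 0.4), rnorm(30, 4.0, 0.4), rnorm(30, 3.6, 0.4))
)

# Small p-value -> at least one group's rating distribution differs
kruskal.test(Rating ~ Price, data = toy)
```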

Feature Engineering with dplyr

library(dplyr)

# Work on a copy so the original data frame stays untouched
DF <- dress_data

# Select specific columns
DF_subset <- DF %>% select(Rating, Price, Recommendation)

# Remove a column
DF_no_style <- DF %>% select(-Style)

# Sort by multiple columns (Rating descending, then Price ascending)
DF_sorted <- DF %>% arrange(desc(Rating), Price)

# Create new columns
DF <- DF %>% mutate(
  high_rating = ifelse(Rating >= 4.0, 1, 0),
  rating_normalized = (Rating - min(Rating, na.rm=TRUE)) /
                      (max(Rating, na.rm=TRUE) - min(Rating, na.rm=TRUE))
)

# Group summary
DF %>%
  group_by(Season) %>%
  summarise(
    avg_rating = mean(Rating, na.rm=TRUE),
    total_recommended = sum(Recommendation, na.rm=TRUE),
    n = n()
  )

# Filter rows
high_rated <- DF %>% filter(Rating >= 4.0, Recommendation == 1)
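ifelse() handles two-way splits; for multi-level bins, dplyr's case_when() reads much better because conditions are evaluated top to bottom. A sketch with a toy column (the cut points are illustrative, not from the dataset):

```r
library(dplyr)

# Toy stand-in for dress_data with a few edge cases
df <- data.frame(Rating = c(4.8, 4.1, 3.2, 0, NA))

# First matching condition wins, so handle NA explicitly at the top
df <- df %>%
  mutate(rating_band = case_when(
    is.na(Rating)  ~ "unknown",
    Rating >= 4.5  ~ "excellent",
    Rating >= 3.5  ~ "good",
    TRUE           ~ "poor"
  ))

df$rating_band   # "excellent" "good" "poor" "poor" "unknown"
```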

Handling Imbalanced Data in R

library(ROSE)

# Check class balance
table(dress_data$Recommendation)

# Oversample minority class
DF_over <- ovun.sample(Recommendation ~ ., data=dress_data,
                        method="over", N=nrow(dress_data)*2, seed=42)$data
table(DF_over$Recommendation)

# Both oversampling and undersampling
DF_both <- ovun.sample(Recommendation ~ ., data=dress_data,
                        method="both", p=0.5, N=nrow(dress_data), seed=42)$data
table(DF_both$Recommendation)

# ROSE synthetic data generation
DF_rose <- ROSE(Recommendation ~ ., data=dress_data, seed=42)$data
table(DF_rose$Recommendation)

Exporting Cleaned Data for Python

When your EDA is done in R and your modeling happens in Python:

library(readr)
write_csv(dress_data, "dress_data_cleaned.csv")

In Python:

import pandas as pd
df = pd.read_csv("dress_data_cleaned.csv")

Quick Reference: Shuffling and Train/Test Splitting

# Shuffle dataset
DFindex <- seq(1, nrow(DF))
DFindex_shuffled <- sample(DFindex, length(DFindex), replace=FALSE)
DF_shuffled <- DF[DFindex_shuffled, ]
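The index-based shuffle above is explicit; base R also lets you do the same thing in one line:

```r
set.seed(42)

# Toy frame standing in for DF
DF <- data.frame(x = 1:10, y = letters[1:10])

# sample(nrow(DF)) is a random permutation of the row indices
DF_shuffled <- DF[sample(nrow(DF)), , drop = FALSE]

nrow(DF_shuffled)   # still 10 rows, just reordered
```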

# Stratified train/test split (preserves class ratio)
library(caret)
set.seed(42)
train_idx <- createDataPartition(as.factor(DF$Recommendation), p=0.8, list=FALSE)
DF_train <- DF[train_idx, ]
DF_test  <- DF[-train_idx, ]
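It is worth verifying that the split really preserved the class ratio. A self-contained sketch with a deliberately imbalanced toy outcome (requires the caret package):

```r
library(caret)

set.seed(42)
# Toy imbalanced outcome (80/20) standing in for DF$Recommendation
DF <- data.frame(Recommendation = factor(rep(c(0, 1), times = c(80, 20))),
                 x = rnorm(100))

train_idx <- createDataPartition(DF$Recommendation, p = 0.8, list = FALSE)
DF_train <- DF[train_idx, ]
DF_test  <- DF[-train_idx, ]

# Both splits should show roughly the original 0.8 / 0.2 proportions
prop.table(table(DF_train$Recommendation))
prop.table(table(DF_test$Recommendation))
```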

Dealing with Large Datasets with the data.table Package

library(data.table)

# Preview the first 1,000 rows before committing to a full read
sample_data <- fread("bigfile.csv", nrows = 1000)

# Declare column types up front: faster parsing, no type-guessing surprises
col_classes <- list(character = c("id", "category"), integer = c("count"), numeric = c("score"))
full_data <- fread("large_dataset.csv", colClasses = col_classes)
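fread can also skip whole columns at parse time via its select argument, which saves substantial memory on wide files. A self-contained sketch that writes a small demo file first:

```r
library(data.table)

# Write a small demo CSV so the example runs on its own
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:5, category = letters[1:5],
                  count = 5:1, score = runif(5)), tmp)

# Read only the columns you need; the rest are never parsed
slim <- fread(tmp, select = c("id", "score"))

names(slim)   # "id" "score"
```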

Recommended Learning Resources

  • MIT Analytics Edge on edX — the course that got me started with R. Very applied, uses real business datasets.
  • R for Data Science by Hadley Wickham — free at https://r4ds.had.co.nz/ — the definitive guide to the tidyverse
  • R-bloggers (https://www.r-bloggers.com/) — daily tutorials and examples from the R community
  • ggplot2 documentation (https://ggplot2.tidyverse.org/) — excellent reference with examples


Discover more from Medical AI Insights

Subscribe to get the latest posts sent to your email.
