Exploratory Data Analysis in R: A Practical Tutorial for Beginners (2025)

Category: R | Tags: R for data science, exploratory analysis, ggplot2, data visualisation, dplyr, machine learning preprocessing

Updated 2025 edition — originally published August 2018


Exploratory Data Analysis (EDA) is the most important step before any machine learning model — and R makes it fast, visual, and intuitive. This tutorial covers everything from loading your first dataset to publication-quality visualizations.

Before you build any model, you need to understand your data. EDA is not optional — it’s where you catch data quality issues, identify patterns, and form hypotheses about which features matter. Skipping EDA and jumping straight to model training is one of the most common mistakes I see in student research projects.

R is particularly well-suited for EDA. Its core libraries (dplyr, ggplot2) are elegant and composable in ways that make exploratory work feel natural. I still reach for R when I’m doing preliminary data exploration, even when my final model ends up in Python.


Table of Contents

  • Setting Up R and RStudio
  • Loading and Inspecting Your Data
  • Handling Missing Values
  • Univariate Analysis: Understanding Each Variable
  • Bivariate and Multivariate Analysis
  • Feature Engineering with dplyr
  • Handling Imbalanced Data in R
  • Exporting Cleaned Data for Python
  • Quick Reference Code Snippets

Setting Up R and RStudio

Installation

Windows and Mac: Download R from https://cran.r-project.org/ and RStudio from https://posit.co/download/rstudio-desktop/

Ubuntu/Linux:

sudo apt-get update
sudo apt-get install r-base r-base-dev
# Then install RStudio from https://posit.co/download/rstudio-desktop/

Essential packages to install immediately

install.packages(c("dplyr", "ggplot2", "tidyr", "readr", "corrplot", "ROSE", "caret"))

Loading and Inspecting Your Data

library(dplyr)
library(ggplot2)

# Load dataset (we'll use the UCI Dresses Attribute Sales dataset)
dress_data <- read.csv("~/Downloads/Attribute_DataSet.csv", header=TRUE, stringsAsFactors=FALSE)

# First look
dim(dress_data)        # rows × columns
str(dress_data)        # data types and first few values
head(dress_data, 10)   # first 10 rows
summary(dress_data)    # min, max, mean, quartiles, NA count

The summary() function is your best first diagnostic. It tells you:

  • The range of continuous variables (catch absurd values like negative ages)
  • The number of NAs per numeric column
  • Level counts for categorical variables — but only if they were read as factors; with stringsAsFactors=FALSE they load as character columns, for which summary() shows just length and class (use table() to see the levels)
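To act on those diagnostics, it helps to filter for the suspicious values directly. A quick sketch, assuming Rating is on a documented 0–5 scale (in this dataset, zeros are often "not rated" placeholders rather than genuine scores):

```r
library(dplyr)

# Count impossible ratings (outside the documented 0-5 range)
dress_data %>% filter(Rating < 0 | Rating > 5) %>% nrow()

# Zero ratings are frequently "not rated" placeholders -- count them too
sum(dress_data$Rating == 0, na.rm = TRUE)
```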

Handling Missing Values

# Count NAs per column
colSums(is.na(dress_data))

# Percentage missing per column
round(colMeans(is.na(dress_data)) * 100, 2)

# Remove rows with any NA (use cautiously — this drops entire rows)
dress_clean <- na.omit(dress_data)

# Impute numeric columns with median (safer for skewed data)
dress_data$Rating[is.na(dress_data$Rating)] <- median(dress_data$Rating, na.rm=TRUE)

# For categorical: impute with mode
mode_material <- names(sort(table(dress_data$Material), decreasing=TRUE))[1]
dress_data$Material[is.na(dress_data$Material) | dress_data$Material == "null"] <- mode_material
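Imputing columns one at a time gets tedious when several have gaps. If you are on dplyr 1.0+, a one-pass sketch with across() and coalesce() handles every numeric column at once (assuming median imputation is appropriate for all of them):

```r
library(dplyr)

# Median-impute every numeric column in a single mutate() call.
# coalesce(x, y) returns x where x is not NA, otherwise y.
dress_data <- dress_data %>%
  mutate(across(where(is.numeric),
                ~ coalesce(.x, median(.x, na.rm = TRUE))))
```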

Univariate Analysis: Understanding Each Variable

Continuous variables

# Histogram with density curve
ggplot(dress_data, aes(x=Rating)) +
  geom_histogram(aes(y=after_stat(density)), bins=20, fill="steelblue", alpha=0.7) +
  geom_density(color="darkred", linewidth=1.2) +
  labs(title="Distribution of Product Ratings", x="Rating (0–5)", y="Density") +
  theme_minimal()

# Box plot — shows outliers clearly
ggplot(dress_data, aes(y=Rating)) +
  geom_boxplot(fill="lightblue", outlier.color="red") +
  theme_minimal()

Categorical variables

# Bar chart for Style distribution
dress_data %>%
  count(Style, sort=TRUE) %>%
  ggplot(aes(x=reorder(Style, n), y=n)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Distribution of Dress Styles", x="Style", y="Count") +
  theme_minimal()

Bivariate and Multivariate Analysis

Correlation heatmap

library(corrplot)

# Select only numeric columns
numeric_cols <- dress_data %>% select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(numeric_cols, use="pairwise.complete.obs")
corrplot(cor_matrix, method="color", type="upper", tl.cex=0.8,
         col=colorRampPalette(c("blue", "white", "red"))(200))

Scatter plot with trend

ggplot(dress_data, aes(x=Rating, y=Recommendation)) +
  geom_jitter(alpha=0.3, height=0.05) +
  geom_smooth(method="loess", color="red") +
  labs(title="Rating vs. Recommendation Probability") +
  theme_minimal()

Group comparison

# Does price category affect rating?
dress_data %>%
  group_by(Price) %>%
  summarise(
    mean_rating = mean(Rating, na.rm=TRUE),
    median_rating = median(Rating, na.rm=TRUE),
    count = n()
  ) %>%
  arrange(desc(mean_rating))

# Visualize group differences
ggplot(dress_data, aes(x=Price, y=Rating, fill=Price)) +
  geom_boxplot() +
  labs(title="Rating Distribution by Price Category") +
  theme_minimal() +
  theme(legend.position="none")
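Box plots show the pattern; a statistical test tells you whether it is likely real. Base R's Kruskal–Wallis test (a rank-based alternative to one-way ANOVA that does not assume normality) is a reasonable default here — a sketch, assuming Price is a categorical grouping variable:

```r
# Rank-based test: do rating distributions differ across price groups?
kruskal.test(Rating ~ Price, data = dress_data)
# A small p-value (e.g. < 0.05) suggests at least one group differs
```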

Feature Engineering with dplyr

library(dplyr)

# Work on a copy so the original data frame stays untouched
DF <- dress_data

# Select specific columns
DF_subset <- DF %>% select(Rating, Price, Recommendation)

# Remove a column
DF_no_style <- DF %>% select(-Style)

# Sort by multiple columns (Rating descending, then Price ascending)
DF_sorted <- DF %>% arrange(desc(Rating), Price)

# Create new columns
DF <- DF %>% mutate(
  high_rating = ifelse(Rating >= 4.0, 1, 0),
  rating_normalized = (Rating - min(Rating, na.rm=TRUE)) /
                      (max(Rating, na.rm=TRUE) - min(Rating, na.rm=TRUE))
)

# Group summary
DF %>%
  group_by(Season) %>%
  summarise(
    avg_rating = mean(Rating, na.rm=TRUE),
    total_recommended = sum(Recommendation, na.rm=TRUE),
    n = n()
  )

# Filter rows
high_rated <- DF %>% filter(Rating >= 4.0, Recommendation == 1)

Handling Imbalanced Data in R

library(ROSE)

# Check class balance
table(dress_data$Recommendation)

# Oversample the minority class (N = desired total rows after resampling)
DF_over <- ovun.sample(Recommendation ~ ., data=dress_data,
                        method="over", N=nrow(dress_data)*2, seed=42)$data
table(DF_over$Recommendation)

# Both oversampling and undersampling
DF_both <- ovun.sample(Recommendation ~ ., data=dress_data,
                        method="both", p=0.5, N=nrow(dress_data), seed=42)$data
table(DF_both$Recommendation)

# ROSE synthetic data generation
DF_rose <- ROSE(Recommendation ~ ., data=dress_data, seed=42)$data
table(DF_rose$Recommendation)
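One caveat worth making explicit: resample only the training split, never the test set, or your evaluation metrics will be inflated by synthetic or duplicated rows. A sketch, assuming a DF_train produced by a prior train/test split (like the one in the quick-reference section below):

```r
library(ROSE)

# Balance only the training data; the test set stays untouched
DF_train_bal <- ovun.sample(Recommendation ~ ., data = DF_train,
                            method = "both", p = 0.5,
                            N = nrow(DF_train), seed = 42)$data
table(DF_train_bal$Recommendation)
```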

Exporting Cleaned Data for Python

When your EDA is done in R and your modeling happens in Python:

library(readr)
write_csv(dress_data, "dress_data_cleaned.csv")

In Python:

import pandas as pd
df = pd.read_csv("dress_data_cleaned.csv")

Quick Reference: Shuffling, Oversampling, Indexing

# Shuffle dataset (permute row indices without replacement)
DF_shuffled <- DF[sample(nrow(DF)), ]

# Stratified train/test split (preserves class ratio)
library(caret)
set.seed(42)
# Treat the 0/1 outcome as a factor so caret stratifies on classes
train_idx <- createDataPartition(factor(DF$Recommendation), p=0.8, list=FALSE)
DF_train <- DF[train_idx, ]
DF_test  <- DF[-train_idx, ]
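It is cheap to confirm the stratification worked — the full data, train set, and test set should show near-identical class proportions:

```r
# Class proportions before and after the split should closely agree
prop.table(table(DF$Recommendation))
prop.table(table(DF_train$Recommendation))
prop.table(table(DF_test$Recommendation))
```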

Recommended Learning Resources

  • MIT Analytics Edge on edX — the course that got me started with R. Very applied, uses real business datasets.
  • R for Data Science by Hadley Wickham — free at https://r4ds.had.co.nz/ — the definitive guide to the tidyverse
  • R-bloggers (https://www.r-bloggers.com/) — daily tutorials and examples from the R community
  • ggplot2 documentation (https://ggplot2.tidyverse.org/) — excellent reference with examples


Discover more from Medical AI Insights | Datanalytics101
