Category: R | Tags: R for data science, exploratory analysis, ggplot2, data visualisation, dplyr, machine learning preprocessing
Updated 2025 edition — originally published August 2018
Exploratory Data Analysis (EDA) is the most important step before any machine learning model — and R makes it fast, visual, and intuitive. This tutorial covers everything from loading your first dataset to publication-quality visualizations.
Before you build any model, you need to understand your data. EDA is not optional — it’s where you catch data quality issues, identify patterns, and form hypotheses about which features matter. Skipping EDA and jumping straight to model training is one of the most common mistakes I see in student research projects.
R is particularly well-suited for EDA. Its tidyverse packages (dplyr, ggplot2) are elegant and composable in ways that make exploratory work feel natural. I still reach for R when I’m doing preliminary data exploration, even when my final model ends up in Python.
Table of Contents
- Setting Up R and RStudio
- Loading and Inspecting Your Data
- Handling Missing Values
- Univariate Analysis: Understanding Each Variable
- Bivariate and Multivariate Analysis
- Feature Engineering with dplyr
- Handling Imbalanced Data in R
- Exporting Cleaned Data for Python
- Quick Reference Code Snippets
Setting Up R and RStudio
Installation
Windows and Mac: Download R from https://cran.r-project.org/ and RStudio from https://posit.co/download/rstudio-desktop/
Ubuntu/Linux:
sudo apt-get update
sudo apt-get install r-base r-base-dev
# Then install RStudio from https://posit.co/download/rstudio-desktop/
Essential packages to install immediately
install.packages(c("dplyr", "ggplot2", "tidyr", "readr", "corrplot", "ROSE", "caret"))
Loading and Inspecting Your Data
library(dplyr)
library(ggplot2)
# Load dataset (we'll use the UCI Dresses Attribute Sales dataset)
dress_data <- read.csv("~/Downloads/Attribute_DataSet.csv", header=TRUE, stringsAsFactors=FALSE)
# First look
dim(dress_data) # rows × columns
str(dress_data) # data types and first few values
head(dress_data, 10) # first 10 rows
summary(dress_data) # min, max, mean, quartiles, NA count
The summary() function is your best first diagnostic. It tells you:
- The range of continuous variables (catch absurd values like negative ages)
- The number of NAs per column
- Level counts for categorical variables (only if they are stored as factors; since we loaded with stringsAsFactors=FALSE, character columns just report their class and length)
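A concrete follow-up to summary() is pulling the offending rows directly. A minimal sketch on toy data (the real call would run on dress_data, assuming Rating lives on a 0–5 scale):

```r
# Toy stand-in for dress_data with two impossible ratings
toy <- data.frame(Rating = c(4.5, -1.0, 3.2, 6.0))

# Base-R equivalent of dplyr::filter(toy, Rating < 0 | Rating > 5)
bad_rows <- subset(toy, Rating < 0 | Rating > 5)
nrow(bad_rows)  # 2 rows violate the 0-5 constraint
```

Inspecting the violating rows (rather than just counting them) often reveals whether they are data-entry errors or a systematic encoding like -1 for "missing".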
Handling Missing Values
# Count NAs per column
colSums(is.na(dress_data))
# Percentage missing per column
round(colMeans(is.na(dress_data)) * 100, 2)
# Remove rows with any NA (use cautiously — this discards entire rows)
dress_clean <- na.omit(dress_data)
# Impute numeric columns with median (safer for skewed data)
dress_data$Rating[is.na(dress_data$Rating)] <- median(dress_data$Rating, na.rm=TRUE)
# For categorical: impute with mode
mode_material <- names(sort(table(dress_data$Material), decreasing=TRUE))[1]
dress_data$Material[is.na(dress_data$Material) | dress_data$Material == "null"] <- mode_material
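The mode trick above (tabulate, sort, take the first name) is compact enough to miss; here it is in isolation on toy data:

```r
# Toy material vector; table() drops NA by default
materials <- c("cotton", "silk", "cotton", "linen", "cotton", NA)
freq <- sort(table(materials), decreasing = TRUE)
mode_val <- names(freq)[1]
mode_val  # "cotton"
```

Base R has no built-in mode function for categorical data, which is why this table-and-sort idiom shows up so often.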
Univariate Analysis: Understanding Each Variable
Continuous variables
# Histogram with density curve
ggplot(dress_data, aes(x=Rating)) +
  geom_histogram(aes(y=after_stat(density)), bins=20, fill="steelblue", alpha=0.7) +
  geom_density(color="darkred", linewidth=1.2) +
  labs(title="Distribution of Product Ratings", x="Rating (0–5)", y="Density") +
  theme_minimal()
# Box plot — shows outliers clearly
ggplot(dress_data, aes(y=Rating)) +
  geom_boxplot(fill="lightblue", outlier.color="red") +
  theme_minimal()
Categorical variables
# Bar chart for Style distribution
dress_data %>%
  count(Style, sort=TRUE) %>%
  ggplot(aes(x=reorder(Style, n), y=n)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Distribution of Dress Styles", x="Style", y="Count") +
  theme_minimal()
Bivariate and Multivariate Analysis
Correlation heatmap
library(corrplot)
# Select only numeric columns
numeric_cols <- dress_data %>% select(where(is.numeric))
# Correlation matrix
cor_matrix <- cor(numeric_cols, use="pairwise.complete.obs")
corrplot(cor_matrix, method="color", type="upper", tl.cex=0.8,
         col=colorRampPalette(c("blue", "white", "red"))(200))
Scatter plot with trend
ggplot(dress_data, aes(x=Rating, y=Recommendation)) +
  geom_jitter(alpha=0.3, height=0.05) +
  geom_smooth(method="loess", color="red") +
  labs(title="Rating vs. Recommendation Probability") +
  theme_minimal()
Group comparison
# Does price category affect rating?
dress_data %>%
  group_by(Price) %>%
  summarise(
    mean_rating = mean(Rating, na.rm=TRUE),
    median_rating = median(Rating, na.rm=TRUE),
    count = n()
  ) %>%
  arrange(desc(mean_rating))
# Visualize group differences
ggplot(dress_data, aes(x=Price, y=Rating, fill=Price)) +
  geom_boxplot() +
  labs(title="Rating Distribution by Price Category") +
  theme_minimal() +
  theme(legend.position="none")
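If you want a number to go with the boxplot, a Kruskal–Wallis test (base R, no extra packages) checks whether rating distributions differ across price groups without assuming normality. A sketch on toy data; with the real data you would call kruskal.test(Rating ~ Price, data = dress_data):

```r
# Toy ratings for three hypothetical price groups
toy <- data.frame(
  Price  = rep(c("Low", "Average", "High"), each = 4),
  Rating = c(4.5, 4.2, 4.8, 4.6,   # Low
             3.9, 4.0, 4.1, 3.8,   # Average
             3.0, 3.2, 2.9, 3.1)   # High
)

# Non-parametric test for a difference in Rating across Price groups
result <- kruskal.test(Rating ~ Price, data = toy)
result$p.value  # well below 0.05 here: the groups plausibly differ
```

A small p-value only says the distributions differ somewhere; the grouped summarise() above tells you which price levels drive the difference.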
Feature Engineering with dplyr
library(dplyr)
# Work on a copy so the original data frame stays untouched
DF <- dress_data
# Select specific columns
DF_subset <- DF %>% select(Rating, Price, Recommendation)
# Remove a column
DF_no_style <- DF %>% select(-Style)
# Sort by multiple columns (Rating descending, then Price ascending)
DF_sorted <- DF %>% arrange(desc(Rating), Price)
# Create new columns
DF <- DF %>% mutate(
  high_rating = ifelse(Rating >= 4.0, 1, 0),
  rating_normalized = (Rating - min(Rating, na.rm=TRUE)) /
                      (max(Rating, na.rm=TRUE) - min(Rating, na.rm=TRUE))
)
# Group summary
DF %>%
  group_by(Season) %>%
  summarise(
    avg_rating = mean(Rating, na.rm=TRUE),
    total_recommended = sum(Recommendation, na.rm=TRUE),
    n = n()
  )
# Filter rows
high_rated <- DF %>% filter(Rating >= 4.0, Recommendation == 1)
Handling Imbalanced Data in R
library(ROSE)
# Check class balance
table(dress_data$Recommendation)
# Oversample minority class
DF_over <- ovun.sample(Recommendation ~ ., data=dress_data,
                       method="over", N=nrow(dress_data)*2, seed=42)$data
table(DF_over$Recommendation)
# Both oversampling and undersampling
DF_both <- ovun.sample(Recommendation ~ ., data=dress_data,
                       method="both", p=0.5, N=nrow(dress_data), seed=42)$data
table(DF_both$Recommendation)
# ROSE synthetic data generation
DF_rose <- ROSE(Recommendation ~ ., data=dress_data, seed=42)$data
table(DF_rose$Recommendation)
Exporting Cleaned Data for Python
When your EDA is done in R and your modeling happens in Python:
library(readr)
write_csv(dress_data, "dress_data_cleaned.csv")
In Python:
import pandas as pd
df = pd.read_csv("dress_data_cleaned.csv")
Quick Reference: Shuffling and Train/Test Splitting
# Shuffle dataset
# sample(nrow(DF)) returns a random permutation of the row indices
DF_shuffled <- DF[sample(nrow(DF)), ]
# Stratified train/test split (preserves class ratio)
library(caret)
set.seed(42)
train_idx <- createDataPartition(as.factor(DF$Recommendation), p=0.8, list=FALSE)
DF_train <- DF[train_idx, ]
DF_test <- DF[-train_idx, ]
Recommended Learning Resources
- MIT Analytics Edge on edX — the course that got me started with R. Very applied, uses real business datasets.
- R for Data Science by Hadley Wickham — free at https://r4ds.had.co.nz/ — the definitive guide to the tidyverse
- R-bloggers (https://www.r-bloggers.com/) — daily tutorials and examples from the R community
- ggplot2 documentation (https://ggplot2.tidyverse.org/) — excellent reference with examples