From zero to your first working analysis — install R, set up RStudio, and learn the core tools every data scientist needs.
Last Updated 2026 — originally published July 2018
R is still one of the most powerful tools in a data scientist’s toolkit — especially for statistical modeling, EDA, and clinical research. This guide covers everything from installation to your first working analysis.
Python gets more headlines, but R remains indispensable for several categories of work: statistical modeling, clinical trial analysis, bioinformatics, and exploratory data visualization. Many clinical researchers and biostatisticians work primarily in R, which means knowing R makes you a better collaborator in medical AI projects.
This guide is for anyone starting with R from scratch, or refreshing their knowledge after a gap.
Table of Contents
- Should You Learn R or Python? (Honest Answer)
- Installing R and RStudio
- Essential Packages to Install First
- Understanding R’s Core Data Structures
- Your First Real Analysis: Loading, Exploring, Cleaning
- Data Manipulation with dplyr
- Data Visualization with ggplot2
- Connecting R to Python Workflows
- Learning Resources
Should You Learn R or Python? (Honest Answer)
Learn both, but prioritize based on your immediate needs:
| Task | Better choice |
|---|---|
| Deep learning / neural networks | Python (PyTorch, TensorFlow) |
| Quick statistical modeling (linear regression, mixed models) | R |
| Exploratory data visualization | R (ggplot2 is superior) |
| Clinical trial analysis | R (gold standard in pharma) |
| Bioinformatics (RNA-seq, GWAS) | R (Bioconductor ecosystem) |
| Production ML systems | Python |
| Rapid prototyping of ML models | Python |
For researchers at the intersection of AI and medicine: learn Python for model building, and R for statistical analysis and clinical collaboration. I use both in my own research.
Installing R and RStudio
Step 1: Install R
Windows and Mac:
Download from https://cran.r-project.org/ — choose your OS, download the installer, run it.
Ubuntu/Linux:
sudo apt update
sudo apt install r-base r-base-dev
For a more current version on Ubuntu:
# Add CRAN repository for latest R
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" | sudo tee /etc/apt/sources.list.d/r-project.list
sudo apt update
sudo apt install r-base
Step 2: Install RStudio
RStudio is the de facto standard IDE for R. Download it from: https://posit.co/download/rstudio-desktop/
RStudio provides:
- Script editor with syntax highlighting and autocomplete
- Interactive console
- Integrated plots panel
- Environment viewer (see all variables and their values)
- Help panel
Essential Packages to Install First
Open RStudio and run this in the console:
# The tidyverse: a collection of packages that work together
install.packages("tidyverse") # includes dplyr, ggplot2, tidyr, readr, and more
# Statistical modeling
install.packages("caret") # ML training and evaluation
install.packages("glmnet") # Regularized regression
install.packages("survival") # Survival analysis (essential for clinical data)
# Handling imbalanced data
install.packages("ROSE")
# Visualization extras
install.packages("corrplot") # Correlation heatmaps
install.packages("ggpubr") # Publication-ready plots
# Working with Excel/CSV
install.packages("readxl")
install.packages("openxlsx")
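Reinstalling everything on a fresh machine gets tedious; a common pattern (a sketch, not part of the original guide) is to install only the packages that are missing:

```r
# Packages used throughout this guide
pkgs <- c("tidyverse", "caret", "glmnet", "survival",
          "ROSE", "corrplot", "ggpubr", "readxl", "openxlsx")

# installed.packages() lists what's already present; keep only the gaps
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```

Running this repeatedly is safe: already-installed packages are skipped rather than reinstalled.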
Understanding R’s Core Data Structures
Unlike Python, R was designed for statistical computing, so its data structures reflect that:
# Vectors — the atomic unit (like Python lists but typed)
nums <- c(1, 2, 3, 4, 5)
chars <- c("hello", "world")
logical <- c(TRUE, FALSE, TRUE)
# Data frames — the equivalent of pandas DataFrames
df <- data.frame(
name = c("Alice", "Bob", "Carol"),
age = c(28, 35, 22),
score = c(0.87, 0.91, 0.79)
)
# Tibbles — modern, improved data frames (from tidyverse)
library(tibble)
tb <- tibble(name = c("Alice", "Bob"), age = c(28, 35))
# Factors — categorical variables with defined levels
treatment <- factor(c("control", "treatment", "control"),
levels = c("control", "treatment"))
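To see how these structures behave in practice, here is a minimal sketch of common inspection and indexing operations, re-creating the `df` and `treatment` objects from above so it runs on its own:

```r
df <- data.frame(name = c("Alice", "Bob", "Carol"),
                 age  = c(28, 35, 22))
treatment <- factor(c("control", "treatment", "control"),
                    levels = c("control", "treatment"))

df$age                       # a data frame column is just a vector: 28 35 22
adults <- df[df$age > 25, ]  # logical row indexing keeps Alice and Bob
levels(treatment)            # "control" "treatment"
table(treatment)             # counts per level: control 2, treatment 1
```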
The most important thing to understand: R is vectorized. Most operations apply to entire vectors without explicit loops:
nums * 2 # [1] 2 4 6 8 10 — no for loop needed
nums > 3 # [1] FALSE FALSE FALSE TRUE TRUE
sum(nums > 3) # [1] 2 — count of values greater than 3
Your First Real Analysis
Let’s work through a real example using the built-in mtcars dataset (car performance data):
# Load data
data(mtcars)
# Step 1: Inspect
dim(mtcars) # 32 rows, 11 columns
str(mtcars) # data types
summary(mtcars) # min, max, mean, quartiles
head(mtcars, 6) # first 6 rows
# Step 2: Check missing values
colSums(is.na(mtcars)) # none in this dataset
# Step 3: Distribution of key variable
hist(mtcars$mpg,
main = "Distribution of Miles Per Gallon",
xlab = "MPG", col = "steelblue", breaks = 10)
# Step 4: Correlation
cor(mtcars$mpg, mtcars$wt) # negative: heavier cars get fewer mpg
# Step 5: Simple linear model
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
# Shows coefficients, p-values, R-squared
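The fitted model is an ordinary R object, so you can extract its pieces programmatically rather than reading them off the printed summary. A short sketch (the "new car" values are made up for illustration):

```r
data(mtcars)
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

coef(model)               # named vector: intercept plus one slope per predictor
summary(model)$r.squared  # R-squared as a plain number

# Predict mpg for a hypothetical car: 3000 lbs, 110 hp, 6 cylinders
new_car <- data.frame(wt = 3.0, hp = 110, cyl = 6)
predict(model, newdata = new_car)
```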
Data Manipulation with dplyr
dplyr is the R equivalent of pandas for data manipulation. Its “verb” syntax is clean and composable:
library(dplyr)
# The pipe operator %>% (or base R's native |>, available since R 4.1) chains operations
mtcars %>%
filter(cyl == 6) %>% # keep only 6-cylinder cars
select(mpg, wt, hp) %>% # keep only these columns
mutate(wt_kg = wt * 453.6) %>% # add new column
arrange(desc(mpg)) %>% # sort by mpg descending
head(5) # first 5 rows
# Group and summarize
mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg),
avg_hp = mean(hp),
count = n()
)
# Join two data frames (like SQL JOIN)
left_join(df1, df2, by = "id")
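Since `df1` and `df2` above are placeholders, here is a self-contained sketch showing what `left_join` actually does with mismatched keys (the example data is made up):

```r
library(dplyr)

patients <- data.frame(id = c(1, 2, 3),
                       name = c("Alice", "Bob", "Carol"))
scores   <- data.frame(id = c(1, 3),
                       score = c(0.87, 0.79))

# All rows of patients are kept; Bob has no match, so his score is NA
joined <- left_join(patients, scores, by = "id")
joined
```

This is the same semantics as a SQL LEFT JOIN: the left table drives the result, and unmatched rows get NA in the joined columns.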
Data Visualization with ggplot2
ggplot2 is built on the “grammar of graphics” — you build plots layer by layer:
library(ggplot2)
# Scatter plot
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Car Weight vs Fuel Efficiency",
x = "Weight (1000 lbs)", y = "Miles per Gallon",
color = "Cylinders") +
theme_minimal()
# Box plot comparing groups
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
geom_boxplot(alpha = 0.7, outlier.color = "red") +
labs(title = "MPG by Number of Cylinders",
x = "Cylinders", y = "MPG") +
theme_minimal() +
theme(legend.position = "none")
# Histogram with density (after_stat() replaces the deprecated ..density.. syntax,
# and linewidth replaces size for line geoms in ggplot2 >= 3.4)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(aes(y = after_stat(density)), bins = 15,
fill = "steelblue", alpha = 0.7) +
geom_density(color = "darkred", linewidth = 1.2) +
labs(title = "MPG Distribution") +
theme_minimal()
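To get any of these plots out of RStudio and into a paper or slide deck, `ggsave()` writes a plot object to disk; dimensions are in inches by default. A minimal sketch (the filename is arbitrary):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

# Write a 6 x 4 inch PNG at 300 dpi (journal-quality resolution)
ggsave("weight_vs_mpg.png", plot = p, width = 6, height = 4, dpi = 300)
```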
Connecting R to Python Workflows
If your modeling happens in Python but your statistics or visualization happens in R, you have a few good options:
Option 1: Export/import via CSV
# Export from R
write.csv(df_cleaned, "data_for_python.csv", row.names = FALSE)
# Import in Python
import pandas as pd
df = pd.read_csv("data_for_python.csv")
Option 2: Use reticulate (R package that calls Python)
install.packages("reticulate")
library(reticulate)
py_run_string("import pandas as pd; df = pd.read_csv('data.csv')")
Option 3: RMarkdown / Quarto
If you write research reports, Quarto documents can contain both R and Python code chunks in the same document — ideal for reproducible research.
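A minimal Quarto document mixing both languages might look like this sketch (the YAML header and chunk contents are illustrative):

````markdown
---
title: "Combined R and Python analysis"
format: html
---

```{r}
# R chunk: visualization with ggplot2
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
```

```{python}
# Python chunk: data handling with pandas
import pandas as pd
df = pd.read_csv("data.csv")
df.describe()
```
````

Rendering with `quarto render` executes both chunks and weaves the output into a single HTML report.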
Learning Resources
- R for Data Science (r4ds.had.co.nz) — free online book by Hadley Wickham; the definitive tidyverse guide
- MIT Analytics Edge on edX — the course that got me started; uses R for real business analytics problems
- R-bloggers (r-bloggers.com) — daily community tutorials
- Swirl — learn R interactively inside R itself: install.packages("swirl"); swirl()
- TidyTuesday — weekly social data project using R; great for building visualization skills