Getting Started with R for Data Science in 2026: Installation, Setup, and Your First Analysis

From zero to your first working analysis — install R, set up RStudio, and learn the core tools every data scientist needs.

Last Updated 2026 — originally published July 2018


R is still one of the most powerful tools in a data scientist’s toolkit, especially for statistical modeling, exploratory data analysis (EDA), and clinical research. This guide covers everything from installation to your first working analysis, updated for 2026.

Python gets more headlines, but R remains indispensable for several categories of work: statistical modeling, clinical trial analysis, bioinformatics, and exploratory data visualization. Many clinical researchers and biostatisticians work primarily in R, which means knowing R makes you a better collaborator in medical AI projects.

This guide is for anyone starting with R from scratch, or refreshing their knowledge after a gap.


Table of Contents

  • Should You Learn R or Python? (Honest Answer)
  • Installing R and RStudio
  • Essential Packages to Install First
  • Understanding R’s Core Data Structures
  • Your First Real Analysis: Loading, Exploring, Cleaning
  • Data Manipulation with dplyr
  • Data Visualization with ggplot2
  • Connecting R to Python Workflows
  • Learning Resources

Should You Learn R or Python? (Honest Answer)

Learn both, but prioritize based on your immediate needs:

Task                                                         | Better choice
Deep learning / neural networks                              | Python (PyTorch, TensorFlow)
Quick statistical modeling (linear regression, mixed models) | R
Exploratory data visualization                               | R (ggplot2 is superior)
Clinical trial analysis                                      | R (gold standard in pharma)
Bioinformatics (RNA-seq, GWAS)                               | R (Bioconductor ecosystem)
Production ML systems                                        | Python
Rapid prototyping of ML models                               | Python

For researchers at the intersection of AI and medicine: learn Python for model building, and R for statistical analysis and clinical collaboration. I use both in my own research.


Installing R and RStudio

Step 1: Install R

Windows and Mac:
Download from https://cran.r-project.org/ — choose your OS, download the installer, run it.

Ubuntu/Linux:

sudo apt update
sudo apt install r-base r-base-dev

For a more current version on Ubuntu:

# Add CRAN repository for latest R
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" | sudo tee /etc/apt/sources.list.d/r-project.list
sudo apt update
sudo apt install r-base

Step 2: Install RStudio

RStudio is the IDE for R. Download from: https://posit.co/download/rstudio-desktop/

RStudio provides:

  • Script editor with syntax highlighting and autocomplete
  • Interactive console
  • Integrated plots panel
  • Environment viewer (see all variables and their values)
  • Help panel
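
Once both are installed, a quick check in the RStudio console confirms which R version and package library the session is using. This is a minimal sketch using only base R functions:

```r
# Print the version of R this session is running
R.version.string

# Show the directories where packages are installed and loaded from
.libPaths()

# Full session details (OS, locale, attached packages); useful in bug reports
sessionInfo()
```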

Essential Packages to Install First

Open RStudio and run this in the console:

# The tidyverse: a collection of packages that work together
install.packages("tidyverse")  # includes dplyr, ggplot2, tidyr, readr, and more

# Statistical modeling
install.packages("caret")       # ML training and evaluation
install.packages("glmnet")      # Regularized regression
install.packages("survival")    # Survival analysis (essential for clinical data)

# Handling imbalanced data
install.packages("ROSE")

# Visualization extras
install.packages("corrplot")    # Correlation heatmaps
install.packages("ggpubr")      # Publication-ready plots

# Working with Excel files (CSV files are read with readr, part of the tidyverse)
install.packages("readxl")
install.packages("openxlsx")
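
Re-running every install.packages() call reinstalls packages you already have. One way to avoid that is to check first and install only what is missing. The sketch below assumes the package list above; missing_packages() is a hypothetical helper written for this example, not part of any package:

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
# Note that requireNamespace() loads a package's namespace as a side effect.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("tidyverse", "caret", "glmnet", "survival", "ROSE",
            "corrplot", "ggpubr", "readxl", "openxlsx")

to_install <- missing_packages(wanted)
to_install   # character(0) once everything is in place

# Install only what is missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```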

Understanding R’s Core Data Structures

Unlike Python, R was designed for statistical computing, so its data structures reflect that:

# Vectors: the atomic unit (like Python lists, but every element shares one type)
nums <- c(1, 2, 3, 4, 5)
chars <- c("hello", "world")
flags <- c(TRUE, FALSE, TRUE)  # named flags to avoid reusing the name of base R's logical()

# Data frames — the equivalent of pandas DataFrames
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(28, 35, 22),
  score = c(0.87, 0.91, 0.79)
)

# Tibbles — modern, improved data frames (from tidyverse)
library(tibble)
tb <- tibble(name = c("Alice", "Bob"), age = c(28, 35))

# Factors — categorical variables with defined levels
treatment <- factor(c("control", "treatment", "control"),
                    levels = c("control", "treatment"))
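
To get a feel for these structures, it helps to inspect and index them. The following sketch rebuilds the same data frame and factor so it runs on its own, and uses only base R:

```r
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(28, 35, 22),
  score = c(0.87, 0.91, 0.79)
)

class(df$age)          # "numeric"
df$name[2]             # "Bob" (R indexes from 1, not 0)
df[df$age > 25, ]      # rows where age exceeds 25: Alice and Bob
nrow(df); ncol(df)     # 3 rows, 3 columns

treatment <- factor(c("control", "treatment", "control"),
                    levels = c("control", "treatment"))
levels(treatment)      # "control" "treatment"
table(treatment)       # control: 2, treatment: 1
as.integer(treatment)  # 1 2 1, the integer codes behind the labels
```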

The most important thing to understand: R is vectorized. Most operations apply to entire vectors without explicit loops:

nums * 2          # [1] 2 4 6 8 10 — no for loop needed
nums > 3          # [1] FALSE FALSE FALSE TRUE TRUE
sum(nums > 3)     # [1] 2 — count of values greater than 3
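
For comparison, here is the same computation written once with an explicit loop and once vectorized. This is a small illustrative sketch; the variable names are chosen for this example:

```r
nums <- c(1, 2, 3, 4, 5)

# Loop version: square each element one at a time
squares_loop <- numeric(length(nums))
for (i in seq_along(nums)) {
  squares_loop[i] <- nums[i]^2
}

# Vectorized version: one expression over the whole vector
squares_vec <- nums^2

identical(squares_loop, squares_vec)  # TRUE

# Vectorized conditionals and summaries work the same way
ifelse(nums > 3, "high", "low")  # "low" "low" "low" "high" "high"
mean(nums > 3)                   # 0.4, the proportion of values above 3
```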

Your First Real Analysis

Let’s work through a real example using the built-in mtcars dataset (car performance data):

# Load data
data(mtcars)

# Step 1: Inspect
dim(mtcars)          # 32 rows, 11 columns
str(mtcars)          # data types
summary(mtcars)      # min, max, mean, quartiles
head(mtcars, 6)      # first 6 rows

# Step 2: Check missing values
colSums(is.na(mtcars))  # none in this dataset

# Step 3: Distribution of key variable
hist(mtcars$mpg,
     main = "Distribution of Miles Per Gallon",
     xlab = "MPG", col = "steelblue", breaks = 10)

# Step 4: Correlation
cor(mtcars$mpg, mtcars$wt)  # negative: heavier cars get fewer mpg

# Step 5: Simple linear model
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
# Shows coefficients, p-values, R-squared
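
summary() prints everything at once, but you will often want individual pieces of the fitted model as values you can reuse. A minimal sketch using base R accessors follows; the new_car values are made up purely for illustration:

```r
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Named vector of estimates: (Intercept), wt, hp, cyl
coef(model)

# Proportion of variance in mpg explained by the model
r2 <- summary(model)$r.squared
r2

# 95% confidence intervals for each coefficient
confint(model)

# Predict mpg for a hypothetical car, with a confidence interval
new_car <- data.frame(wt = 3.0, hp = 110, cyl = 6)
pred <- predict(model, newdata = new_car, interval = "confidence")
pred   # one row with columns fit, lwr, upr
```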

Data Manipulation with dplyr

dplyr is the R equivalent of pandas for data manipulation. Its “verb” syntax is clean and composable:

library(dplyr)

# The pipe operator %>% chains operations
mtcars %>%
  filter(cyl == 6) %>%            # keep only 6-cylinder cars
  select(mpg, wt, hp) %>%         # keep only these columns
  mutate(wt_kg = wt * 453.6) %>%  # wt is in 1000 lbs, so this gives kg
  arrange(desc(mpg)) %>%          # sort by mpg descending
  head(5)                          # first 5 rows

# Group and summarize
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    count = n()
  )

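# Join two data frames (like SQL LEFT JOIN); df1 and df2 are toy examples
df1 <- data.frame(id = c(1, 2, 3), name = c("Alice", "Bob", "Carol"))
df2 <- data.frame(id = c(1, 2), score = c(0.87, 0.91))
left_join(df1, df2, by = "id")  # Carol is kept, with score = NA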
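
To see what the group-and-summarize pipeline computes, it can help to write the same summary without dplyr. The sketch below is a base R approximation of that step, shown for comparison rather than as a recommended replacement:

```r
# Mean mpg and hp for each cylinder count
avg_by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
names(avg_by_cyl) <- c("cyl", "avg_mpg", "avg_hp")

# Number of cars per cylinder count (the n() in the dplyr version)
counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "count")

avg_by_cyl
counts   # 11 four-cylinder, 7 six-cylinder, 14 eight-cylinder cars
```

The dplyr version gets all three columns in one readable call, which is the main argument for its verb syntax.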

Data Visualization with ggplot2

ggplot2 is built on the “grammar of graphics” — you build plots layer by layer:

library(ggplot2)

# Scatter plot
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Car Weight vs Fuel Efficiency",
       x = "Weight (1000 lbs)", y = "Miles per Gallon",
       color = "Cylinders") +
  theme_minimal()

# Box plot comparing groups
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(alpha = 0.7, outlier.color = "red") +
  labs(title = "MPG by Number of Cylinders",
       x = "Cylinders", y = "MPG") +
  theme_minimal() +
  theme(legend.position = "none")

# Histogram with density (after_stat() and linewidth need ggplot2 3.4.0 or later)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "steelblue", alpha = 0.7) +
  geom_density(color = "darkred", linewidth = 1.2) +
  labs(title = "MPG Distribution") +
  theme_minimal()
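
Because each + returns a new plot object, you can build a plot up step by step and save the result. The sketch below assumes ggplot2 is installed; the output file name and dimensions are arbitrary choices for this example:

```r
library(ggplot2)

# Start with data and aesthetic mappings only; no geometry is drawn yet
p <- ggplot(mtcars, aes(x = hp, y = mpg))

# Add layers one at a time; each + returns a new plot object
p <- p + geom_point(alpha = 0.7)
p <- p + facet_wrap(~ cyl)            # one panel per cylinder count
p <- p + labs(title = "Horsepower vs MPG by Cylinders",
              x = "Horsepower", y = "Miles per Gallon")
p <- p + theme_minimal()

# Save to disk as a PDF in the session's temporary directory
out_file <- file.path(tempdir(), "hp_vs_mpg.pdf")
ggsave(out_file, plot = p, width = 7, height = 4)
```

Printing p in the console draws it in the RStudio plots panel.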

Connecting R to Python Workflows

If your modeling happens in Python but your statistics or visualization happens in R, you have three good options:

Option 1: Export/import via CSV

# Export from R
write.csv(df_cleaned, "data_for_python.csv", row.names = FALSE)

# Import in Python (run this part in a Python session, not in R)
import pandas as pd
df = pd.read_csv("data_for_python.csv")
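
Before handing a CSV to Python, it is worth confirming that it reads back the way you expect. The sketch below does a round trip on the R side; df_cleaned here is a small made-up data frame standing in for your real data:

```r
# Small example data frame standing in for df_cleaned
df_cleaned <- data.frame(
  id = 1:3,
  group = c("control", "treatment", "control"),
  score = c(0.87, 0.91, 0.79)
)

# Write to a temporary file without the row-name column
csv_path <- file.path(tempdir(), "data_for_python.csv")
write.csv(df_cleaned, csv_path, row.names = FALSE)

# Read it back and confirm the contents survived the round trip
df_back <- read.csv(csv_path)
all.equal(df_cleaned, df_back)  # TRUE

# The header row, which pandas will use for column names
readLines(csv_path, n = 1)
```

Without row.names = FALSE, write.csv() adds an unnamed first column of row names, which pandas reads as an extra "Unnamed: 0" column.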

Option 2: Use reticulate (R package that calls Python)

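install.packages("reticulate")
library(reticulate)
# Requires a Python installation with pandas available to reticulate
py_run_string("import pandas as pd; df = pd.read_csv('data.csv')")
df <- py$df  # pull the pandas DataFrame back into R as a data frame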

Option 3: RMarkdown / Quarto
If you write research reports, Quarto documents can contain both R and Python code chunks in the same document — ideal for reproducible research.


Learning Resources

  • R for Data Science (2nd edition at r4ds.hadley.nz; 1st edition at r4ds.had.co.nz): free online book by Hadley Wickham and co-authors; the definitive tidyverse guide
  • MIT Analytics Edge on edX: the course that got me started; uses R for real business analytics problems
  • R-bloggers (r-bloggers.com): daily community tutorials
  • Swirl: learn R interactively inside R itself: install.packages("swirl"); library(swirl); swirl()
  • TidyTuesday: weekly social data project using R; great for building visualization skills

