Examples of data manipulation with dplyr: 3 practical examples
Before getting theoretical, let’s start with real examples of data manipulation with dplyr that mirror what analysts and data scientists actually do:
- Clean and filter messy records
- Group and summarize by categories and time
- Reshape data for modeling or dashboards
The three main sections below are the core, reusable examples. Around those, I’ll sprinkle extra variations so you end up with 6–8 concrete patterns you can lift straight into production code.
All examples assume you have the tidyverse installed:
install.packages("tidyverse")
library(dplyr)
library(lubridate) # for dates in time-series examples
Example 1: Cleaning and filtering a customer dataset with dplyr
This first example of data manipulation with dplyr is the one most analysts hit on day one: cleaning and subsetting a raw table.
Imagine a 2024 marketing dataset with customer signups and purchases:
customers <- tibble::tibble(
  customer_id = 1:10,
  signup_date = as.Date("2024-01-01") + sample(0:120, 10, replace = TRUE),
  country = c("US", "US", "CA", "UK", NA, "US", "DE", "US", "US", "CA"),
  age = c(25, 41, 39, 18, 52, NA, 33, 47, 29, 62),
  purchases = c(3, 0, 5, 1, 0, 2, 7, 1, 4, 0),
  revenue = c(150, 0, 230, 40, 0, 90, 410, 30, 220, 0)
)
Filtering and selecting columns
You want to:
- Keep only US customers
- Exclude people with missing age
- Keep just the variables you need for a quick report
us_active <- customers %>%
  filter(country == "US", !is.na(age)) %>%
  select(customer_id, signup_date, age, purchases, revenue)
This is one of the best examples of how dplyr turns multiple steps—filtering and column selection—into a single readable pipeline.
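A quick variation while we’re here: filter() also accepts %in% for set membership and between() for numeric ranges. This sketch reuses the same customers table; the cutoffs are arbitrary.

north_america <- customers %>%
  filter(country %in% c("US", "CA"), between(age, 18, 65)) %>%
  select(customer_id, country, age, revenue)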
Creating new variables with mutate
Now you want to:
- Flag high-value customers (revenue > 200)
- Convert age into a simple age band
us_active <- us_active %>%
  mutate(
    high_value = revenue > 200,
    age_band = case_when(
      age < 30 ~ "<30",
      age >= 30 & age < 45 ~ "30–44",
      age >= 45 & age < 60 ~ "45–59",
      age >= 60 ~ "60+",
      TRUE ~ NA_character_
    )
  )
This tiny pipeline is a very common example of data manipulation with dplyr in real dashboards: it prepares data for segmentation and retention analysis.
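Before handing that to a dashboard, a one-line sanity check with count() shows how the segments break down (using the us_active table from above):

count(us_active, age_band, high_value)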
Handling missing values and basic sanity checks
Suppose a stakeholder asks: “How many customers have missing country or age?” You can answer in one short pipeline:
missing_summary <- customers %>%
  summarize(
    n = n(),
    missing_country = sum(is.na(country)),
    missing_age = sum(is.na(age)),
    pct_missing_age = mean(is.na(age))
  )
This is the sort of thing you do constantly when working with public data sources like the U.S. Census Bureau, survey data, or hospital admissions data from sites like the CDC. These are all real examples where dplyr keeps exploratory work fast and readable.
Example 2: Grouped summaries and time trends with dplyr
The second of our three practical examples focuses on grouped summaries, arguably the heart of dplyr.
Imagine you’re analyzing daily web traffic and conversions for 2024, by marketing channel:
set.seed(123)

traffic <- tibble::tibble(
  date = seq(as.Date("2024-01-01"), as.Date("2024-03-31"), by = "day"),
  channel = sample(c("Email", "Paid Search", "Social", "Direct"),
                   size = 91, replace = TRUE),
  sessions = sample(100:2000, 91, replace = TRUE),
  conversions = rbinom(91, size = sessions, prob = 0.03)
)
Grouping by channel and summarizing performance
You want to know which channels actually convert:
channel_perf <- traffic %>%
  group_by(channel) %>%
  summarize(
    total_sessions = sum(sessions),
    total_conversions = sum(conversions),
    conv_rate = total_conversions / total_sessions,
    .groups = "drop"
  ) %>%
  arrange(desc(conv_rate))
This is an everyday example of data manipulation with dplyr in marketing analytics teams: group, summarize, rank.
Adding time windows: monthly performance
Executives rarely want raw daily data; they want trends by month or quarter. Here’s how to aggregate by month:
monthly_perf <- traffic %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  group_by(month, channel) %>%
  summarize(
    sessions = sum(sessions),
    conversions = sum(conversions),
    conv_rate = conversions / sessions,
    .groups = "drop"
  )
Now you have a tidy table ready for a time-series plot or a dashboard. This pattern shows up constantly in real-world cases: COVID-19 case counts by week from the CDC, hospital admissions by month from NIH, or enrollment statistics by term from universities like Harvard. All of these are examples of data manipulation with dplyr where grouped summaries are the backbone.
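If you want to eyeball that trend immediately, here’s a minimal plotting sketch with ggplot2, assuming the monthly_perf table built above:

library(ggplot2)

ggplot(monthly_perf, aes(x = month, y = conv_rate, color = channel)) +
  geom_line() +
  labs(x = "Month", y = "Conversion rate", color = "Channel")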
Multi-level grouping: channel and weekday
Let’s say your team suspects weekends behave differently. You can layer in another grouping variable:
weekday_perf <- traffic %>%
  mutate(weekday = wday(date, label = TRUE)) %>%
  group_by(channel, weekday) %>%
  summarize(
    avg_sessions = mean(sessions),
    avg_conv_rate = sum(conversions) / sum(sessions),
    .groups = "drop"
  )
This is one of the best examples of how group_by() scales: you can pivot from a high-level channel view to channel-by-weekday without rewriting your logic.
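To make that concrete, one option is to wrap the summary logic in a small helper so the grouping is the only thing that changes. This is a sketch, and summarize_perf() is a made-up name, not a dplyr function:

summarize_perf <- function(df, ...) {
  df %>%
    group_by(...) %>%
    summarize(
      avg_sessions = mean(sessions),
      avg_conv_rate = sum(conversions) / sum(sessions),
      .groups = "drop"
    )
}

# High-level view, then the weekday drill-down, same logic both times
traffic %>% summarize_perf(channel)
traffic %>%
  mutate(weekday = wday(date, label = TRUE)) %>%
  summarize_perf(channel, weekday)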
Example 3: Joins and reshaping data for modeling and reporting
The third of our three practical examples tackles joins, where most real pipelines either sing or fall apart.
Imagine you have two tables:
- patients: demographic and clinical info
- lab_results: repeated lab test results over time
patients <- tibble::tibble(
  patient_id = 1:5,
  gender = c("F", "M", "F", "F", "M"),
  age = c(45, 60, 37, 52, 71),
  smoker = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

lab_results <- tibble::tibble(
  patient_id = c(1, 1, 2, 3, 3, 3, 5),
  test_date = as.Date("2024-02-01") + c(0, 30, 15, 0, 7, 60, 21),
  test_type = c("A1C", "A1C", "LDL", "A1C", "LDL", "A1C", "LDL"),
  value = c(6.8, 7.1, 130, 7.5, 145, 7.0, 120)
)
This setup mirrors what you’d see in clinical research or electronic health record analysis, like case studies you’ll find at Mayo Clinic.
Left join: adding demographics to lab results
You want each lab record to carry the patient’s demographics:
labs_with_demo <- lab_results %>%
  left_join(patients, by = "patient_id")
Now you can immediately ask questions like: “What’s the average A1C by smoker status?”
a1c_by_smoker <- labs_with_demo %>%
  filter(test_type == "A1C") %>%
  group_by(smoker) %>%
  summarize(
    mean_a1c = mean(value),
    n_tests = n(),
    .groups = "drop"
  )
This pipeline is a textbook example of data manipulation with dplyr in health analytics: join, filter, group, summarize.
Wide vs long: reshaping for modeling
Suppose you need one row per patient with separate columns for A1C and LDL averages. You can combine dplyr with tidyr:
library(tidyr)

patient_summary <- labs_with_demo %>%
  group_by(patient_id, gender, age, smoker, test_type) %>%
  summarize(mean_value = mean(value), .groups = "drop") %>%
  pivot_wider(
    names_from = test_type,
    values_from = mean_value
  )
Now patient_summary has columns like A1C and LDL—a clean input for regression models or risk scoring.
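For instance, a minimal modeling sketch on top of patient_summary (with this tiny toy table the fit is degenerate and lm() silently drops patients missing A1C, so treat it as a pattern, not a result):

model <- lm(A1C ~ age + smoker, data = patient_summary)
summary(model)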
Again, this is one of the best examples of how dplyr plays nicely with the rest of the tidyverse. Your workflow stays consistent whether you’re doing finance, epidemiology, or education research.
More real examples of data manipulation with dplyr in 2024–2025
Those three core sections give you the promised practical examples, but modern R workflows tend to mix in a few extra patterns. Here are additional real examples you’ll see in 2024–2025 projects.
Window functions: ranking and percentiles
Say you’re working with a 2025 sales table and want to rank reps within each region:
sales <- tibble::tibble(
  rep_id = 1:8,
  region = c("East", "East", "West", "West", "South", "South", "Midwest", "Midwest"),
  revenue = c(200000, 150000, 300000, 280000, 180000, 220000, 160000, 190000)
)

ranked_sales <- sales %>%
  group_by(region) %>%
  mutate(
    region_rank = dense_rank(desc(revenue)),
    region_pct = percent_rank(revenue)
  ) %>%
  arrange(region, region_rank)
This is a real example of using mutate() with window functions to create rankings without losing row-level detail.
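A related variation: if you only want each region’s top rep rather than the full ranked table, slice_max() gets you there in one step.

top_reps <- sales %>%
  group_by(region) %>%
  slice_max(revenue, n = 1) %>%
  ungroup()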
Business rules with case_when
You might need to categorize revenue into performance tiers for a quarterly slide deck:
sales_tiered <- ranked_sales %>%
  mutate(perf_tier = case_when(
    revenue >= 250000 ~ "Top performer",
    revenue >= 180000 ~ "Solid performer",
    TRUE ~ "Needs support"
  ))
This pattern—mutate() plus case_when()—shows up constantly in examples of data manipulation with dplyr for credit scoring, risk ratings, and internal KPIs.
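Before the numbers hit a slide, a quick check on how reps distribute across tiers is cheap insurance (ungroup() first, since ranked_sales carried its grouping forward):

sales_tiered %>%
  ungroup() %>%
  count(perf_tier)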
Working with large data and databases
By 2024–2025, more R teams connect directly to data warehouses. The good news: the same dplyr verbs work on remote tables via dbplyr.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, traffic, "traffic", temporary = FALSE)

traffic_db <- tbl(con, "traffic")

# Aggregate inside the database. One caveat: lubridate helpers like
# floor_date() translate to SQL on warehouse backends such as Postgres
# or DuckDB, but SQLite has no native date type, so this demo sticks
# to a channel-level rollup.
channel_db <- traffic_db %>%
  group_by(channel) %>%
  summarize(
    sessions = sum(sessions, na.rm = TRUE),
    conversions = sum(conversions, na.rm = TRUE),
    .groups = "drop"
  )

# This is still lazy; collect() pulls the result into R
channel_local <- channel_db %>% collect()
This database-backed pipeline is one of the best examples showing that dplyr code scales from CSVs on your laptop to production-grade warehouses.
FAQ: common questions about dplyr data manipulation
What are some examples of data manipulation with dplyr for beginners?
Good starter examples of data manipulation with dplyr include:
- Filtering rows with filter() (e.g., keep only 2024 records)
- Selecting columns with select() (e.g., keep just id, date, value)
- Creating new variables with mutate() (e.g., revenue_per_user = revenue / users)
- Grouping and summarizing with group_by() + summarize() (e.g., average score by school)
Those building blocks already cover a huge share of day-to-day analytics work.
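Strung together, those four verbs already form a complete beginner pipeline. Here’s a hedged sketch where scores, school, score, and year are hypothetical names:

scores %>%
  filter(year == 2024) %>%
  select(school, score) %>%
  mutate(passed = score >= 60) %>%
  group_by(school) %>%
  summarize(
    avg_score = mean(score),
    pass_rate = mean(passed),
    .groups = "drop"
  )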
Can you give an example of joining datasets with dplyr?
Yes. A classic example of a join is combining a users table with a subscriptions table:
users_with_sub <- users %>%
  left_join(subscriptions, by = "user_id")
From there, you can group by subscription plan, compute churn, or analyze upgrades using the same verbs shown in the examples above.
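For example, a hedged follow-up where plan and churned are assumed columns on the joined table:

churn_by_plan <- users_with_sub %>%
  group_by(plan) %>%
  summarize(
    users = n(),
    churn_rate = mean(churned),
    .groups = "drop"
  )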
Is dplyr still a good choice in 2024–2025 with data.table and Arrow around?
Yes. dplyr remains a very common choice because:
- It’s readable for teams with mixed skill levels
- It integrates cleanly with ggplot2, tidyr, and dbplyr
- It now plays nicely with backends like Arrow and databases
If you need extreme performance, you might combine dplyr with arrow or duckdb, but the core examples of data manipulation with dplyr stay the same.
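As a taste, here’s a minimal sketch of the same verbs running against an Arrow dataset; the parquet path is hypothetical:

library(arrow)

open_dataset("data/traffic_parquet") %>%
  group_by(channel) %>%
  summarize(sessions = sum(sessions)) %>%
  collect()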
Where can I find real-world datasets to practice these examples?
You can grab:
- Public health data from the CDC
- Research datasets from the NIH
- Education and social science data from university repositories (for example, Harvard Dataverse)
All of these provide rich, messy, real examples where dplyr pipelines shine.
The bottom line: once you understand these three practical examples (cleaning and filtering, grouped summaries, and joins plus reshaping), you’ve covered the majority of what analysts actually do in R. Everything else is a variation on these patterns, scaled up to bigger data and more complex business questions.