Examples of data manipulation with dplyr: 3 practical examples

If you work with R and data, you’ve probably heard that `dplyr` makes data wrangling faster and more readable. That’s true, but it only really clicks when you see concrete examples of data manipulation with dplyr: 3 practical examples that mirror what you do at work. Instead of abstract toy problems, this guide walks through real examples you’d hit in analytics, reporting, and data science projects. We’ll start from a simple data frame and gradually layer in filtering, grouping, summarizing, and reshaping. Along the way, you’ll see multiple examples of how `dplyr` chains operations together so your code reads like a sentence instead of a puzzle. Whether you’re cleaning survey data, tracking marketing performance, or analyzing public datasets, these are the best examples to copy, adapt, and reuse in your own projects. Everything runs on modern `tidyverse` tools, so you can plug it directly into your 2024–2025 R workflow.
Written by Jamie

Before getting theoretical, let’s start with real examples of data manipulation with dplyr that mirror what analysts and data scientists actually do:

  • Clean and filter messy records
  • Group and summarize by categories and time
  • Reshape data for modeling or dashboards

The three main sections below are the core examples of data manipulation with dplyr: 3 practical examples you can reuse. Around those, I’ll sprinkle extra variations so you end up with 6–8 concrete patterns you can lift straight into production code.

All examples assume you have the tidyverse installed:

install.packages("tidyverse")
library(dplyr)
library(lubridate)  # for dates in time-series examples

Example 1: Cleaning and filtering a customer dataset with dplyr

This first example of data manipulation with dplyr is the one most analysts hit on day one: cleaning and subsetting a raw table.

Imagine a 2024 marketing dataset with customer signups and purchases:

set.seed(42)  # make the sampled signup dates reproducible
customers <- tibble::tibble(
  customer_id = 1:10,
  signup_date = as.Date("2024-01-01") + sample(0:120, 10, replace = TRUE),
  country     = c("US", "US", "CA", "UK", NA, "US", "DE", "US", "US", "CA"),
  age         = c(25, 41, 39, 18, 52, NA, 33, 47, 29, 62),
  purchases   = c(3, 0, 5, 1, 0, 2, 7, 1, 4, 0),
  revenue     = c(150, 0, 230, 40, 0, 90, 410, 30, 220, 0)
)

Filtering and selecting columns

You want to:

  • Keep only US customers
  • Exclude people with missing age
  • Keep just the variables you need for a quick report

us_active <- customers %>%
  filter(country == "US", !is.na(age)) %>%
  select(customer_id, signup_date, age, purchases, revenue)

This is one of the best examples of how dplyr turns multiple steps—filtering and column selection—into a single readable pipeline.
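
For contrast, here's a base R sketch of the same subset (note that %in% is used deliberately: == against the NA country would produce junk rows in base subsetting):

us_active_base <- customers[customers$country %in% "US" & !is.na(customers$age),
                            c("customer_id", "signup_date", "age", "purchases", "revenue")]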

Creating new variables with mutate

Now you want to:

  • Flag high-value customers (revenue > 200)
  • Convert age into a simple age band

us_active <- us_active %>%
  mutate(
    high_value = revenue > 200,
    age_band = case_when(
      age < 30              ~ "<30",
      age >= 30 & age < 45  ~ "30–44",
      age >= 45 & age < 60  ~ "45–59",
      age >= 60             ~ "60+",
      TRUE                  ~ NA_character_
    )
  )

This tiny pipeline is a very common example of data manipulation with dplyr in real dashboards: it prepares data for segmentation and retention analysis.

Handling missing values and basic sanity checks

Suppose a stakeholder asks: “How many customers have missing country or age?” You can answer in one short pipeline:

missing_summary <- customers %>%
  summarize(
    n = n(),
    missing_country = sum(is.na(country)),
    missing_age     = sum(is.na(age)),
    pct_missing_age = mean(is.na(age))
  )
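
With the toy table above, the result is deterministic: a single row with n = 10, missing_country = 1, missing_age = 1, and pct_missing_age = 0.1.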

This is the sort of thing you do constantly when working with public data sources like the U.S. Census Bureau, survey data, or hospital admissions data from sites like the CDC. These are all real examples where dplyr keeps exploratory work fast and readable.


Example 2: Grouping and summarizing web traffic by channel and time

The second of our examples of data manipulation with dplyr: 3 practical examples focuses on grouped summaries—arguably the heart of dplyr.

Imagine you’re analyzing daily web traffic and conversions for 2024, by marketing channel:

set.seed(123)
traffic <- tibble::tibble(
  date    = seq(as.Date("2024-01-01"), as.Date("2024-03-31"), by = "day"),
  channel = sample(c("Email", "Paid Search", "Social", "Direct"),
                   size = 91, replace = TRUE),
  sessions    = sample(100:2000, 91, replace = TRUE),
  conversions = rbinom(91, size = sessions, prob = 0.03)
)

Grouping by channel and summarizing performance

You want to know which channels actually convert:

channel_perf <- traffic %>%
  group_by(channel) %>%
  summarize(
    total_sessions    = sum(sessions),
    total_conversions = sum(conversions),
    conv_rate         = total_conversions / total_sessions,
    .groups = "drop"
  ) %>%
  arrange(desc(conv_rate))

This is an everyday example of data manipulation with dplyr in marketing analytics teams: group, summarize, rank.

Adding time windows: monthly performance

Executives rarely want raw daily data; they want trends by month or quarter. Here’s how to aggregate by month:

monthly_perf <- traffic %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  group_by(month, channel) %>%
  summarize(
    sessions    = sum(sessions),
    conversions = sum(conversions),
    conv_rate   = conversions / sessions,
    .groups = "drop"
  )

Now you have a tidy table ready for a time-series plot or a dashboard. This pattern shows up constantly in real-world cases: COVID-19 case counts by week from the CDC, hospital admissions by month from NIH, or enrollment statistics by term from universities like Harvard. All of these are examples of data manipulation with dplyr where grouped summaries are the backbone.
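
As a quick illustration, here's a minimal ggplot2 sketch of that table (ggplot2 installs with the tidyverse; the labels are placeholders you'd adapt):

library(ggplot2)

ggplot(monthly_perf, aes(x = month, y = conv_rate, color = channel)) +
  geom_line() +
  labs(title = "Monthly conversion rate by channel",
       x = NULL, y = "Conversion rate")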

Multi-level grouping: channel and weekday

Let’s say your team suspects weekends behave differently. You can layer in another grouping variable:

weekday_perf <- traffic %>%
  mutate(weekday = wday(date, label = TRUE)) %>%
  group_by(channel, weekday) %>%
  summarize(
    avg_sessions  = mean(sessions),
    avg_conv_rate = sum(conversions) / sum(sessions),
    .groups = "drop"
  )

This is one of the best examples of how group_by() scales: you can pivot from a high-level channel view to channel-by-weekday without rewriting your logic.


Example 3: Joins and reshaping data for modeling and reporting

The third of our examples of data manipulation with dplyr: 3 practical examples tackles joins—where most real pipelines either sing or fall apart.

Imagine you have two tables:

  • patients: demographic and clinical info
  • lab_results: repeated lab test results over time

patients <- tibble::tibble(
  patient_id = 1:5,
  gender     = c("F", "M", "F", "F", "M"),
  age        = c(45, 60, 37, 52, 71),
  smoker     = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

lab_results <- tibble::tibble(
  patient_id = c(1, 1, 2, 3, 3, 3, 5),
  test_date  = as.Date("2024-02-01") + c(0, 30, 15, 0, 7, 60, 21),
  test_type  = c("A1C", "A1C", "LDL", "A1C", "LDL", "A1C", "LDL"),
  value      = c(6.8, 7.1, 130, 7.5, 145, 7.0, 120)
)

This setup mirrors what you’d see in clinical research or electronic health record analysis, like case studies you’ll find at Mayo Clinic.

Left join: adding demographics to lab results

You want each lab record to carry the patient’s demographics:

labs_with_demo <- lab_results %>%
  left_join(patients, by = "patient_id")

Now you can immediately ask questions like: “What’s the average A1C by smoker status?” (Note the join direction: left_join() keeps every lab record, so patient 4, who has no lab results, simply doesn’t appear.)

a1c_by_smoker <- labs_with_demo %>%
  filter(test_type == "A1C") %>%
  group_by(smoker) %>%
  summarize(
    mean_a1c = mean(value),
    n_tests  = n(),
    .groups  = "drop"
  )

This pipeline is a textbook example of data manipulation with dplyr in health analytics: join, filter, group, summarize.

Wide vs long: reshaping for modeling

Suppose you need one row per patient with separate columns for A1C and LDL averages. You can combine dplyr with tidyr:

library(tidyr)

patient_summary <- labs_with_demo %>%
  group_by(patient_id, gender, age, smoker, test_type) %>%
  summarize(mean_value = mean(value), .groups = "drop") %>%
  pivot_wider(
    names_from  = test_type,
    values_from = mean_value
  )

Now patient_summary has columns like A1C and LDL—a clean input for regression models or risk scoring.
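
As a quick sketch of that last step (the toy table is far too small for a meaningful fit; this only shows that the wide format is model-ready):

# Hypothetical next step: relate A1C to demographics
fit <- lm(A1C ~ age + smoker, data = patient_summary)
summary(fit)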

Again, this is one of the best examples of how dplyr plays nicely with the rest of the tidyverse. Your workflow stays consistent whether you’re doing finance, epidemiology, or education research.


More real examples of data manipulation with dplyr in 2024–2025

Those three core sections give you examples of data manipulation with dplyr: 3 practical examples, but modern R workflows tend to mix in a few extra patterns. Here are additional real examples you’ll see in 2024–2025 projects.

Window functions: ranking and percentiles

Say you’re working with a 2025 sales table and want to rank reps within each region:

sales <- tibble::tibble(
  rep_id = 1:8,
  region = c("East", "East", "West", "West", "South", "South", "Midwest", "Midwest"),
  revenue = c(200000, 150000, 300000, 280000, 180000, 220000, 160000, 190000)
)

ranked_sales <- sales %>%
  group_by(region) %>%
  mutate(
    region_rank = dense_rank(desc(revenue)),
    region_pct  = percent_rank(revenue)
  ) %>%
  ungroup() %>%  # drop the grouping so later steps aren't silently grouped
  arrange(region, region_rank)

This is a real example of using mutate() with window functions to create rankings without losing row-level detail.

Business rules with case_when

You might need to categorize revenue into performance tiers for a quarterly slide deck:

sales_tiered <- ranked_sales %>%
  mutate(perf_tier = case_when(
    revenue >= 250000          ~ "Top performer",
    revenue >= 180000          ~ "Solid performer",
    TRUE                       ~ "Needs support"
  ))

This pattern—mutate() plus case_when()—shows up constantly in examples of data manipulation with dplyr for credit scoring, risk ratings, and internal KPIs.

Working with large data and databases

By 2024–2025, more R teams connect directly to data warehouses. The good news: the same dplyr verbs work on remote tables via dbplyr.

library(DBI)  # dbplyr must also be installed; dplyr calls it behind the scenes

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# SQLite has limited date support, so we compute the month column in R
# before copying. On warehouse backends with richer date handling, you
# can often push the truncation itself into the remote pipeline.
copy_to(con,
        traffic %>% mutate(month = floor_date(date, unit = "month")),
        "traffic", temporary = FALSE)

traffic_db <- tbl(con, "traffic")

monthly_db <- traffic_db %>%
  group_by(month, channel) %>%
  summarize(
    sessions    = sum(sessions, na.rm = TRUE),
    conversions = sum(conversions, na.rm = TRUE),
    .groups     = "drop"
  )

# This is still lazy; collect() pulls the result into R. (SQLite hands
# dates back as day counts; as.Date(month, origin = "1970-01-01") restores them.)
monthly_local <- monthly_db %>% collect()

This database-backed pipeline is one of the best examples showing that dplyr code scales from CSVs on your laptop to production-grade warehouses.


FAQ: common questions about dplyr data manipulation

What are some examples of data manipulation with dplyr for beginners?

Good starter examples of data manipulation with dplyr include:

  • Filtering rows with filter() (e.g., keep only 2024 records)
  • Selecting columns with select() (e.g., keep just id, date, value)
  • Creating new variables with mutate() (e.g., revenue_per_user = revenue / users)
  • Grouping and summarizing with group_by() + summarize() (e.g., average score by school)

Those building blocks already cover a huge share of day-to-day analytics work.
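
Here's a minimal sketch chaining all four, reusing the customers table from Example 1:

customers %>%
  filter(!is.na(age)) %>%                    # keep rows with a known age
  select(customer_id, country, revenue) %>%  # keep only the columns you need
  mutate(high_value = revenue > 200) %>%     # add a derived flag
  group_by(country) %>%
  summarize(avg_revenue = mean(revenue), .groups = "drop")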

Can you give an example of joining datasets with dplyr?

Yes. A classic example of a join is combining a users table with a subscriptions table:

users_with_sub <- users %>%
  left_join(subscriptions, by = "user_id")
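
To make that runnable end-to-end, you'd first need the two toy tables (names and columns invented for illustration):

users         <- tibble::tibble(user_id = 1:3, name = c("Ana", "Ben", "Chen"))
subscriptions <- tibble::tibble(user_id = c(1, 3), plan = c("Pro", "Free"))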

From there, you can group by subscription plan, compute churn, or analyze upgrades using the same verbs shown in the examples above.

Is dplyr still a good choice in 2024–2025 with data.table and Arrow around?

Yes. dplyr remains a very common choice because:

  • It’s readable for teams with mixed skill levels
  • It integrates cleanly with ggplot2, tidyr, and dbplyr
  • It now plays nicely with backends like Arrow and databases

If you need extreme performance, you might combine dplyr with arrow or duckdb, but the core examples of data manipulation with dplyr stay the same.
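
For instance, here's a minimal duckdb-backed sketch (assuming the duckdb package is installed; DBI was loaded in the database section above) that reuses the exact verbs from earlier:

library(duckdb)

con2 <- dbConnect(duckdb())
copy_to(con2, sales, "sales")

tbl(con2, "sales") %>%
  group_by(region) %>%
  summarize(total_revenue = sum(revenue, na.rm = TRUE)) %>%
  collect()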

Where can I find real-world datasets to practice these examples?

You can grab:

  • Public health data from the CDC
  • Research datasets from the NIH
  • Education and social science data from university repositories (for example, Harvard Dataverse)

All of these provide rich, messy, real examples where dplyr pipelines shine.


The bottom line: once you understand these examples of data manipulation with dplyr: 3 practical examples—cleaning and filtering, grouped summaries, and joins plus reshaping—you’ve covered the majority of what analysts actually do in R. Everything else is a variation on these patterns, scaled up to bigger data and more complex business questions.
