Data Manipulation with dplyr: 3 Practical Examples

Explore three practical examples of data manipulation using dplyr in R to enhance your data analysis skills.
By Jamie

Introduction to Data Manipulation with dplyr

Data manipulation, the work of transforming and organizing data so that it can be analyzed, is a core part of data analysis. The dplyr package in R provides a small set of functions (often called verbs) for manipulating data frames, so that common operations read as short, chained steps. The three examples below cover filtering rows with filter(), summarizing groups with group_by() and summarize(), and adding columns with mutate().

Example 1: Filtering Data Based on Conditions

Use Case: Extracting Specific Data

In this example, we will filter a dataset to isolate records that meet certain criteria. This is particularly useful when you want to analyze a subset of your data based on specific conditions.

library(dplyr)

## Sample data frame
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  grade = c(85, 90, 78, 88, 92),
  age = c(20, 21, 19, 22, 20)
)

## Filtering students with grade above 85
high_achievers <- students %>%
  filter(grade > 85)

print(high_achievers)

The output contains only the students whose grade is strictly greater than 85: Bob (90), David (88), and Eva (92). Alice, with a grade of exactly 85, is excluded because the comparison uses > rather than >=.

Notes

  • You can combine multiple conditions in filter() with & (AND) or | (OR). Conditions separated by commas are also combined with AND.
  • Example: filter(grade > 85 & age < 21) to filter by both grade and age criteria.
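
The note above can be tried out with a minimal sketch that reuses the students data frame from this example. The variable names young_high_achievers and top_or_youngest, and the thresholds in the OR condition, are illustrative assumptions rather than part of the original example.

```r
library(dplyr)

# Same sample data as in Example 1
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  grade = c(85, 90, 78, 88, 92),
  age = c(20, 21, 19, 22, 20)
)

# AND: grade above 85 and younger than 21
young_high_achievers <- students %>%
  filter(grade > 85 & age < 21)

# OR: grade above 90 or younger than 20 (illustrative thresholds)
top_or_youngest <- students %>%
  filter(grade > 90 | age < 20)

print(young_high_achievers)
print(top_or_youngest)
```

With this data, the AND filter keeps only Eva, while the OR filter keeps Charlie (age 19) and Eva (grade 92).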

Example 2: Summarizing Data with Grouping

Use Case: Analyzing Grouped Data

This example demonstrates how to summarize data by grouping it based on a specific variable. Grouping is useful for aggregating data to understand trends or patterns within subgroups.

library(dplyr)

## Sample data frame
sales <- data.frame(
  product = c("A", "B", "A", "B", "C"),
  quantity = c(10, 15, 5, 10, 20),
  revenue = c(100, 150, 50, 100, 200)
)

## Summarizing total revenue and quantity by product
summary_sales <- sales %>%
  group_by(product) %>%
  summarize(total_quantity = sum(quantity), total_revenue = sum(revenue))

print(summary_sales)

The output has one row per product, with the total quantity sold and the total revenue: A (15 units, 150 in revenue), B (25 units, 250), and C (20 units, 200). Totals like these make it easy to compare performance across products.

Notes

  • The summarize() function can be extended to include other metrics like mean, median, or standard deviation.
  • Example: summarize(mean_revenue = mean(revenue)) for average revenue calculations.
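
As a rough sketch of the note above, several metrics can be computed in a single summarize() call on the same sales data. The output column names (mean_revenue, median_revenue, sd_revenue, n_sales) are illustrative choices, and n() is dplyr's helper for counting the rows in each group.

```r
library(dplyr)

# Same sample data as in Example 2
sales <- data.frame(
  product = c("A", "B", "A", "B", "C"),
  quantity = c(10, 15, 5, 10, 20),
  revenue = c(100, 150, 50, 100, 200)
)

# Several summary metrics per product in one call
revenue_stats <- sales %>%
  group_by(product) %>%
  summarize(
    mean_revenue = mean(revenue),
    median_revenue = median(revenue),
    sd_revenue = sd(revenue),  # NA when a group has a single row
    n_sales = n()              # number of rows in each group
  )

print(revenue_stats)
```

Product C has only one sale in this data, so its standard deviation is NA. Including the row count alongside the other metrics makes such cases easy to spot.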

Example 3: Mutating Data to Create New Columns

Use Case: Adding Calculated Columns

In this example, we will create a new column in the dataset that is derived from existing data. This is useful for creating indicators or metrics based on calculations.

library(dplyr)

## Sample data frame
employees <- data.frame(
  name = c("John", "Doe", "Jane"),
  salary = c(50000, 60000, 55000),
  bonus = c(5000, 6000, 5500)
)

## Mutating to create a total compensation column
employees <- employees %>%
  mutate(total_compensation = salary + bonus)

print(employees)

The output includes a new total_compensation column holding the sum of salary and bonus for each employee: 55000, 66000, and 60500. The original name, salary, and bonus columns are kept unchanged.

Notes

  • You can perform various operations in mutate(), including conditional statements using ifelse().
  • Example: mutate(category = ifelse(salary > 55000, 'High', 'Low')) to categorize salaries.
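
The ifelse() note above can be sketched as follows, using the same employees data. The result name employees_categorized is an illustrative assumption; the 55000 threshold comes from the note itself.

```r
library(dplyr)

# Same sample data as in Example 3
employees <- data.frame(
  name = c("John", "Doe", "Jane"),
  salary = c(50000, 60000, 55000),
  bonus = c(5000, 6000, 5500)
)

# Add a calculated column and a conditional category in one mutate() call
employees_categorized <- employees %>%
  mutate(
    total_compensation = salary + bonus,
    category = ifelse(salary > 55000, "High", "Low")
  )

print(employees_categorized)
```

Because the comparison is strict, a salary of exactly 55000 falls into the "Low" category, so only the 60000 salary is labeled "High" here.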

Together, filter(), group_by() with summarize(), and mutate() cover many everyday data manipulation tasks in R, and they can be chained with %>% to build longer pipelines.