Web Scraping with rvest: 3 Practical Examples

Discover three practical examples of web scraping using rvest in R, perfect for beginners and data enthusiasts.
By Jamie

Introduction to Web Scraping with rvest

Web scraping is a technique used to extract information from websites. The R package rvest makes it easy to scrape web data and is particularly well-suited for beginners. With its user-friendly syntax, rvest allows you to gather information efficiently, which can be invaluable for data analysis or research. In this article, we will explore three diverse examples of web scraping using rvest that demonstrate its capabilities.
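
If you have not used rvest before, install it once from CRAN and load it at the start of each session:

## Install rvest once, then load it
install.packages('rvest')
library(rvest)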

Example 1: Scraping Product Information from an E-Commerce Site

In this example, we will scrape product titles and prices from a fictional e-commerce site. This use case is common for market research or competitive analysis.

## Load necessary libraries
library(rvest)

## Specify the URL of the e-commerce site
url <- 'https://example-ecommerce.com/products'

## Read the HTML content from the URL
web_page <- read_html(url)

## Extract product titles
## Extract product titles
product_titles <- web_page %>%
  html_elements('.product-title') %>%
  html_text()

## Extract product prices
product_prices <- web_page %>%
  html_elements('.product-price') %>%
  html_text()

## Combine into a data frame
products_df <- data.frame(Title = product_titles, Price = product_prices)

## Display the data frame
print(products_df)
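
The scraped prices arrive as character strings (for example, '$19.99'). If the target site formats prices that way, a quick base-R conversion makes them usable for analysis; the pattern below is an assumption, so adapt it to the site's actual formatting:

## Strip currency symbols and thousands separators, then convert to numeric
## (assumes prices look like '$1,299.99'; adjust the pattern as needed)
products_df$Price <- as.numeric(gsub('[$,]', '', products_df$Price))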

Notes:

  • Make sure to replace the URL and the CSS selectors with the actual values from the target website.
  • Always check the website’s robots.txt file to confirm that scraping is allowed; a programmatic check is sketched below.
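
The robotstxt package (a separate CRAN package) can perform this check programmatically. A minimal sketch, reusing the example domain from above:

## Check whether a path may be scraped according to robots.txt
library(robotstxt)
paths_allowed(paths = '/products', domain = 'example-ecommerce.com')
## Returns TRUE if crawling the path is allowed, FALSE otherwise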

Example 2: Extracting News Headlines from a News Website

This example demonstrates how to scrape the latest news headlines from a news website. This can be useful for tracking trends or gathering information on specific topics.

## Load necessary libraries
library(rvest)

## Specify the URL of the news website
url <- 'https://example-news.com'

## Read the HTML content from the URL
web_page <- read_html(url)

## Extract news headlines
## Extract news headlines
headlines <- web_page %>%
  html_elements('.headline') %>%
  html_text()

## Extract publication dates
dates <- web_page %>%
  html_elements('.date') %>%
  html_text()

## Combine into a data frame
news_df <- data.frame(Headline = headlines, Date = dates)

## Display the data frame
print(news_df)

Notes:

  • Adjust the CSS selectors based on the structure of the target news website.
  • Consider cleaning the extracted text with trimws(), or extract it with html_text2(), which collapses whitespace for you; a short cleaning sketch follows these notes.
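
A small cleaning pass often pays off before analysis. The date format below is an assumption; match it to whatever the site actually prints:

## Trim stray whitespace from the headlines
news_df$Headline <- trimws(news_df$Headline)

## Parse the date text into Date objects
## (assumes dates like '2024-05-17'; adjust the format string to the site)
news_df$Date <- as.Date(trimws(news_df$Date), format = '%Y-%m-%d')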

Example 3: Gathering Data from a Public GitHub Page

Sometimes the data you want is presented as HTML on the public pages of a service such as GitHub. This example illustrates how to extract contributor information from a public GitHub repository page. (GitHub also exposes the same data through its REST API; a brief sketch of that approach follows the scraping code.)

## Load necessary libraries
library(rvest)

## Specify the URL of the GitHub repository
url <- 'https://github.com/example-user/example-repo'

## Read the HTML content from the URL
web_page <- read_html(url)

## Extract user names from the contributors section
## Extract user names from the contributors section
contributors <- web_page %>%
  html_elements('.contributor') %>%
  html_text()

## Extract contribution counts
contributions <- web_page %>%
  html_elements('.contribution-count') %>%
  html_text()

## Combine into a data frame
contributors_df <- data.frame(User = contributors, Contributions = contributions)

## Display the data frame
print(contributors_df)
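
As mentioned above, GitHub also publishes contributor data through its public REST API, which is generally more stable than scraping HTML. A minimal sketch using the jsonlite package; the user and repository names are placeholders, as in the scraping code:

## Query the GitHub REST API for contributors (returns JSON)
library(jsonlite)
api_url <- 'https://api.github.com/repos/example-user/example-repo/contributors'
contributors_api <- fromJSON(api_url)

## The response includes, among other fields, login and contributions
contributors_df <- data.frame(User = contributors_api$login,
                              Contributions = contributors_api$contributions)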

Notes:

  • The CSS selectors will vary based on the repository structure, so inspect the page source to find the correct classes.
  • Keep in mind that scraping too frequently can get your IP address blocked, so scrape respectfully, for example by pausing between requests (sketched below).
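
One simple way to scrape respectfully when fetching several pages is to pause between requests. A minimal sketch, using hypothetical page URLs:

## Fetch several pages, waiting two seconds before each request
urls <- c('https://example.com/page/1',
          'https://example.com/page/2')

pages <- lapply(urls, function(u) {
  Sys.sleep(2)  ## be polite: pause before each request
  read_html(u)
})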