Web scraping is a technique used to extract information from websites. The R package rvest makes it easy to scrape web data and is particularly well-suited for beginners. With its user-friendly syntax, rvest allows you to gather information efficiently, which can be invaluable for data analysis or research. In this article, we will explore three diverse examples of web scraping with rvest that demonstrate its capabilities.
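rvest is available on CRAN, so if you do not already have it, installation is a single command:
## Install rvest from CRAN (only needed once)
install.packages('rvest')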
Example 1: Scraping Product Titles and Prices from an E-commerce Site

In this example, we will scrape product titles and prices from a fictional e-commerce site. This use case is common for market research or competitive analysis.
## Load necessary libraries
library(rvest)
## Specify the URL of the e-commerce site
url <- 'https://example-ecommerce.com/products'
## Read the HTML content from the URL
web_page <- read_html(url)
## Extract product titles
product_titles <- web_page %>%
  html_nodes('.product-title') %>%
  html_text()
## Extract product prices
product_prices <- web_page %>%
  html_nodes('.product-price') %>%
  html_text()
## Combine into a data frame
products_df <- data.frame(Title = product_titles, Price = product_prices)
## Display the data frame
print(products_df)
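Keep in mind that html_text() returns character vectors, so the Price column above is text rather than numbers. The following sketch converts it to numeric, assuming prices formatted like '$19.99' (the actual format depends on the site you scrape):
## Strip currency symbols and thousands separators, then convert to numeric
## (assumes prices such as "$1,299.99"; adjust the pattern for your site)
products_df$Price <- as.numeric(gsub('[^0-9.]', '', products_df$Price))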
Note: Before scraping any site, check its robots.txt file to ensure that web scraping is allowed.
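You can perform this check from R with the robotstxt package (a separate CRAN package, not part of rvest). A minimal sketch for the fictional site above:
## Install once if needed: install.packages('robotstxt')
library(robotstxt)
## Returns TRUE if the default user agent may crawl the given path
paths_allowed(paths = '/products', domain = 'example-ecommerce.com')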
Example 2: Scraping News Headlines from a News Website

This example demonstrates how to scrape the latest news headlines from a news website. This can be useful for tracking trends or gathering information on specific topics.
## Load necessary libraries
library(rvest)
## Specify the URL of the news website
url <- 'https://example-news.com'
## Read the HTML content from the URL
web_page <- read_html(url)
## Extract news headlines
headlines <- web_page %>%
  html_nodes('.headline') %>%
  html_text()
## Extract publication dates
dates <- web_page %>%
  html_nodes('.date') %>%
  html_text()
## Combine into a data frame
news_df <- data.frame(Headline = headlines, Date = dates)
## Display the data frame
print(news_df)
Tip: Scraped text often includes stray whitespace and line breaks; use trimws() to clean the extracted text.
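For example:
## Remove leading and trailing whitespace from both columns
news_df$Headline <- trimws(news_df$Headline)
news_df$Date <- trimws(news_df$Date)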
Example 3: Extracting Contributor Information from a GitHub Repository Page

In some cases, the data you want is rendered as HTML on a page even though the site also offers a public API. This example illustrates how to extract contributor information from a public GitHub repository page. (For production work, GitHub's REST API is usually the more reliable source for this kind of data.)
## Load necessary libraries
library(rvest)
## Specify the URL of the GitHub repository
url <- 'https://github.com/example-user/example-repo'
## Read the HTML content from the URL
web_page <- read_html(url)
## Extract user names from the contributors section
contributors <- web_page %>%
  html_nodes('.contributor') %>%
  html_text()
## Extract contribution counts
contributions <- web_page %>%
  html_nodes('.contribution-count') %>%
  html_text()
## Combine into a data frame
contributors_df <- data.frame(User = contributors, Contributions = contributions)
## Display the data frame
print(contributors_df)
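As with the prices in the first example, the contribution counts are scraped as text. Assuming they look like '1,024 commits' (the real format depends on the page), they can be converted to integers as follows:
## Keep only the digits, then convert to integer
## (assumes counts such as "1,024 commits"; adjust as needed)
contributors_df$Contributions <- as.integer(gsub('[^0-9]', '', contributors_df$Contributions))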