[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this tutorial, we’ll see how to scrape an HTML table from Wikipedia and process the data to find insights in it (or, more modestly, to build a data visualization).
Youtube – https://youtu.be/KCUj7JQKOJA
Why?
As a Data Scientist or Data Analyst, you will often find that the data you need is not readily available, so it’s handy to know skills like web scraping to collect your own data. While web scraping is a vast area, this tutorial focuses on one particular aspect of it: scraping (extracting) tables from web pages.
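Before touching a live page, it can help to see the idea on a tiny example. The sketch below is not part of the original tutorial: the HTML snippet and the film names in it are made up, and it assumes rvest 1.0 or later (for `minimal_html()`). It only illustrates that `html_table()` turns each `<table>` in a document into a data frame.

```r
library(rvest)

# A tiny, made-up HTML page standing in for a real web page (hypothetical data)
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Lifetime gross</th></tr>
    <tr><td>Film A</td><td>$100</td></tr>
    <tr><td>Film B</td><td>$250</td></tr>
  </table>
')

# html_table() on a whole document returns a list: one data frame per <table>
tables <- html_table(page)
length(tables)

# Pick a table by its position in the list, as you would on a real page
films <- tables[[1]]
films
```

On a real page the list usually holds several tables, so inspecting each element is a reasonable way to find the one you want.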
Code
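library(tidyverse)
# read_html() and html_table() come from rvest, which library(tidyverse) does not attach
library(rvest)
# Download and parse the Wikipedia page
content <- read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_States_and_Canada")
# Extract every <table> on the page as a list of data frames
tables <- content %>% html_table(fill = TRUE)
# Keep the first table and drop its first row
first_table <- tables[[1]]
first_table <- first_table[-1, ]
# clean_names() converts the column headers to snake_case
library(janitor)
first_table <- first_table %>% clean_names()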
# Top 20 films by lifetime gross, as a horizontal bar chart
first_table %>%
  mutate(lifetime_gross = parse_number(lifetime_gross)) %>%
  arrange(desc(lifetime_gross)) %>%
  head(20) %>%
  mutate(title = fct_reorder(title, lifetime_gross)) %>%
  ggplot() +
  geom_bar(aes(y = title, x = lifetime_gross), stat = "identity", fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")
# The same chart, built from the lifetime_gross_2 column
first_table %>%
  mutate(lifetime_gross_2 = parse_number(lifetime_gross_2)) %>%
  arrange(desc(lifetime_gross_2)) %>%
  head(20) %>%
  mutate(title = fct_reorder(title, lifetime_gross_2)) %>%
  ggplot() +
  geom_bar(aes(y = title, x = lifetime_gross_2), stat = "identity", fill = "blue") +
  labs(title = "Top 20 Grossing movies in US and Canada",
       caption = "Data Source: Wikipedia")
# Second table on the page, with cleaned column names
second_table <- tables[[2]] %>%
  clean_names()
# Total adjusted gross per year, drawn as a line chart
second_table %>%
  mutate(adjusted_gross = parse_number(adjusted_gross)) %>%
  group_by(year) %>%
  summarise(total_adjusted_gross = sum(adjusted_gross)) %>%
  arrange(desc(total_adjusted_gross)) %>%
  ggplot() +
  geom_line(aes(x = year, y = total_adjusted_gross, group = 1))
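The script relies on two helpers whose effect is easy to miss: `clean_names()` from janitor and `parse_number()` from readr. The sketch below uses made-up rows (not the Wikipedia data) to show, on the assumption that the scraped table has two columns sharing the same header, where a name like `lifetime_gross_2` would come from and how dollar strings become numbers.

```r
library(readr)
library(janitor)

# Hypothetical rows shaped like a scraped table (made-up values)
scraped <- data.frame(
  title = c("Film A", "Film B"),
  g1 = c("$1,250,000", "$980,500"),
  g2 = c("$2,000,000", "$1,100,000"),
  stringsAsFactors = FALSE
)
# Simulate two columns that share one header, as can happen after scraping
names(scraped) <- c("Title", "Lifetime gross", "Lifetime gross")

# clean_names() converts headers to snake_case and de-duplicates repeats
cleaned <- clean_names(scraped)
names(cleaned)

# parse_number() drops the currency symbol and the grouping commas
cleaned$lifetime_gross <- parse_number(cleaned$lifetime_gross)
cleaned$lifetime_gross
```

Doing the numeric conversion before sorting matters: sorting the raw strings would order them alphabetically rather than by value.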
To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.