In what is rapidly becoming a series (cool things you can do with R in a tweet), Julia Silge demonstrates scraping the list of members of the US House of Representatives from Wikipedia in just 5 R statements:
library(rvest)
library(tidyverse)
h <- read_html("https://t.co/gloY1eErBn")
reps <- h %>% html_node("#mw-content-text > div > table:nth-child(18)") %>% html_table()
reps <- reps[,c(1:2,4:9)] %>% as_tibble()
— Julia Silge (@juliasilge) January 12, 2018
Since Twitter munges the URL in the third line when you cut-and-paste, here's a plain-text version of Julia's code:
library(rvest)
library(tidyverse)
h <- read_html("https://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives")
reps <- h %>% html_node("#mw-content-text > div > table:nth-child(18)") %>% html_table()
reps <- reps[,c(1:2,4:9)] %>% as_tibble()
And sure enough, here's what the reps object looks like in the RStudio viewer:
As Julia notes, it's not perfect, but you're still 95% of the way to gathering data from a page intended for human rather than computer consumption. Impressive!
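If you'd like to see how the pieces of Julia's code fit together without depending on the live Wikipedia page, here is a minimal sketch that runs the same steps (parse the HTML, pick a table with a CSS selector, convert it with html_table(), keep a subset of columns) against a tiny inline page. The HTML, the #content id, the column names and the demo object are all made up for illustration; only the rvest and tibble functions come from the original code.

```r
library(rvest)   # read_html(), html_node(), html_table(); also re-exports %>%
library(tibble)  # as_tibble()

# Hypothetical miniature page: every value in this table is invented.
page <- read_html('
  <div id="content">
    <p>Intro text</p>
    <table>
      <tr><th>District</th><th>Member</th><th>Note</th><th>Party</th></tr>
      <tr><td>Example 1</td><td>A. Person</td><td>x</td><td>Party A</td></tr>
      <tr><td>Example 2</td><td>B. Person</td><td>y</td><td>Party B</td></tr>
    </table>
  </div>')

# Select the <table> that is the 2nd child element of #content,
# then convert it to a data frame (the <th> row becomes the header).
demo <- page %>%
  html_node("#content > table:nth-child(2)") %>%
  html_table()

# Keep columns 1, 2 and 4 by position, as the original does with c(1:2,4:9).
demo <- demo[, c(1:2, 4)] %>% as_tibble()

print(demo)
```

Note that a positional selector such as table:nth-child(18) depends on the page's layout at the time of writing, so if Wikipedia's editors rearrange the article, the number will probably need adjusting.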