(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
(For first-timers, R⁶ tagged posts are short & sweet with minimal exposition; R⁶ feed)
At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets without working across billions of rows. I also use Apache Drill for much of my exploratory work.
Here’s how to uniformly sample data from Apache Drill using the sergeant package:
library(sergeant)
db <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")
summarise(tbl, n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
##
## 1 19977415
mutate(tbl, r=rand()) %>%
filter(r <= 0.01) %>%
summarise(n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
##
## 1 199808
mutate(tbl, r=rand()) %>%
filter(r <= 0.50) %>%
summarise(n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
##
## 1 9988797
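The `rand()` plus `filter()` idiom gives every row the same independent chance `p` of being kept, so the sample size lands near `n * p` rather than exactly on it (199,808 of 19,977,415 rows for `p = 0.01` above). Here is a minimal local sketch of the same idea in base R, with no Drill connection involved. The data frame `df`, the seed, and the use of `runif()` as a stand-in for Drill's `rand()` are assumptions made purely for illustration.

```r
# Local illustration only: a plain data frame stands in for the Drill table,
# and runif() stands in for Drill's rand().
set.seed(2018)                     # only so the illustration is repeatable

n <- 1e6                           # rows in the stand-in table
p <- 0.01                          # sampling fraction

df <- data.frame(id = seq_len(n))
df$r <- runif(n)                   # one uniform draw in [0, 1] per row

# Keep each row with probability p, like filter(r <= 0.01) above
sampled <- df[df$r <= p, , drop = FALSE]

n_sampled <- nrow(sampled)
expected  <- n * p                 # expected sample size
std_dev   <- sqrt(n * p * (1 - p)) # binomial standard deviation of that size

cat("sampled rows:", n_sampled, "\n")
cat("expected    :", expected, "+/-", round(std_dev, 1), "\n")
```

Because the sample size is itself random, this approach suits "roughly 1% of the rows" exploration. If an exact row count matters, a different technique is needed.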
And, for groups (using a different/larger “database”):
fdns <- tbl(db, "dfs.fdns.`201708`")
summarise(fdns, n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
##
## 1 1895133100
filter(fdns, type %in% c("cname", "txt")) %>%
count(type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## type n
##
## 1 cname 15389064
## 2 txt 67576750
filter(fdns, type %in% c("cname", "txt")) %>%
group_by(type) %>%
mutate(r=rand()) %>%
ungroup() %>%
filter(r <= 0.15) %>%
count(type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## type n
##
## 1 cname 2307604
## 2 txt 10132672
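Both groups above shrink to roughly 15% of their original size (about 2.3M of 15.4M `cname` rows and 10.1M of 67.6M `txt` rows), so the sample keeps the relative mix of record types. The following base R sketch shows the same behaviour locally. The `records` data frame, its 20/80 type split, and the seed are hypothetical, and `runif()` again stands in for Drill's `rand()`.

```r
# Local illustration only: made-up records with an uneven type mix.
set.seed(2018)                     # only so the illustration is repeatable

records <- data.frame(
  type = sample(c("cname", "txt"), size = 200000, replace = TRUE,
                prob = c(0.2, 0.8)),
  stringsAsFactors = FALSE
)
records$r <- runif(nrow(records))  # one uniform draw per row

p <- 0.15
sampled <- records[records$r <= p, , drop = FALSE]

before <- table(records$type)      # rows per type before sampling
after  <- table(sampled$type)      # rows per type after sampling

# Fraction of each type that survived the filter; each should be near p
fraction_kept <- as.numeric(after[names(before)]) / as.numeric(before)
names(fraction_kept) <- names(before)

print(round(fraction_kept, 3))
```

In this sketch the draw is made once per row with no grouping at all, and each type still ends up near the same fraction, since every row has the same chance of being kept regardless of its type.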
I will (hopefully) be better at cranking out these bite-sized posts more frequently in 2018.