
2 Months in 2 Minutes – rOpenSci News, October 2019

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

rOpenSci HQ

  • What would you like to hear about in an rOpenSci Community Call? We are soliciting your “votes” and new ideas for Community Call topics and speakers. Find out how you can influence us by checking out our new Community Calls repository.

  • Videos, speaker’s slides, resources and collaborative notes from our Community Call on Reproducible Workflows at Scale with drake are posted.

  • Help wanted! We encourage rOpenSci package authors to help us help you get more contributors to your package. If you label an issue “help wanted” (no hyphen or emojis), those issues can be found in a search of the rOpenSci organization.

Software Peer Review

3 community-contributed packages passed software peer review.

c14bazAAR – Download and Prepare C14 Dates from Different Source Databases. Author: Clemens Schmid; Reviewers: Ben Marwick, Enrico Crema; Read the Review

rmangal – An interface to the Mangal database https://mangal.io/#/. Author: Steve Vissault; Reviewers: Thomas Pedersen, Anna Willoughby; Read the Review

rnassqs – Access the NASS Quick Stats API. Author: Nicholas Potter; Reviewers: Adam Sparks, Neal Richardson; Read the Review

Consider submitting your package or volunteering to review.

Software

5 new packages from the community are on CRAN.

  • cde – download data from the Catchment Data Explorer
  • chlorpromazineR – convert antipsychotic doses to chlorpromazine equivalents
  • citecorp – client for the Open Citations Corpus
  • PostcodesioR – API wrapper for Postcodes.io
  • rmangal – interface to the Mangal database of ecological networks

On the Blog

From the rOpenSci team

From the community

From Introducing Open Forensic Science in R. Were these bullets fired by the same gun? Top: Images of partial bullet scans. Bottom left: representative cross-sections from two bullets with 6 lands each. Bottom right: resulting smoothed bullet signatures and raw signatures.


Use Cases

  • 84 published works cited or used rOpenSci software (listed in individual newsletters)
  • 7 use cases for our packages or resources were posted in our discussion forum. Look for pdftools, tabulizer, writexl, rorcid, rnaturalearth, rdflib, drake, and tic.

Have you used an rOpenSci package? Share your use case and we’ll tweet about it.

Call For Contributors

Part of rOpenSci’s mission is to make sustainable software. When a package needs a new maintainer, we work to find a new one. The current maintainer of mregions, Scott Chamberlain, is looking for a new maintainer. Contact Scott if you’re interested.

Keep up with rOpenSci

We create a newsletter every two weeks. You can subscribe via rss feed in XML or JSON or via our one-way mailing list.

Follow @rOpenSci on Twitter. Find out how you can contribute to rOpenSci as a user or developer.



Map coloring: the color scale styles available in the tmap package

[This article was first published on the Geocomputation with R website, and kindly contributed to R-bloggers.]

This vignette builds on the making maps chapter of the Geocomputation with R book. Its goal is to demonstrate all possible map styles available in the tmap package.

Prerequisites

The examples below assume the following packages are attached:

library(spData) # example datasets
library(tmap)   # map creation
library(sf)     # spatial data reprojection

The world object contains world map data from Natural Earth, with information about countries’ names, the regions and subregions they belong to, areas, life expectancies, and populations. This object is in geographical coordinates using the WGS84 datum; however, for mapping purposes, the Mollweide projection is a better alternative (learn more in the modifying map projections section). The st_transform() function from the sf package allows for quick reprojection to the selected coordinate reference system (e.g., "+proj=moll" represents the Mollweide projection).

world_moll = st_transform(world, crs = "+proj=moll")

One color

Let’s start with the basics. To create a simple world map, we need to specify the data object (world_moll) inside the tm_shape() function, and the way we want to visualize it. The tmap package offers several visualisation possibilities for polygons, including tm_borders(), tm_fill(), and tm_polygons(). The last one draws the filled polygons with borders, where the fill color can be specified with the col argument:

tm_shape(world_moll) +  tm_polygons(col = "lightblue")

The output is a map of world countries, where each country is filled with a light blue color.

Coloring of adjacent polygons

The col argument is very flexible, and its behavior depends on the value provided. In the previous example, we provided a single color value, resulting in a map with one color. To create a map where adjacent polygons do not share the same color, we need to provide the keyword "MAP_COLORS".

tm_shape(world_moll) +  tm_polygons(col = "MAP_COLORS")

The default color can be changed using the palette argument – run the tmaptools::palette_explorer() function to see possible palettes’ names.

tm_shape(world_moll) +
  tm_polygons(col = "MAP_COLORS",
              palette = "Pastel1")

Additionally, in this case, it is possible to use the minimize argument, which triggers the internal algorithm to search for a minimal number of colors for visualization.

tm_shape(world_moll) +
  tm_polygons(col = "MAP_COLORS",
              minimize = TRUE)

The new map uses five colors. On a side note, in theory, no more than four colors are required to color the polygons of the map so that no two adjacent polygons have the same color (learn more about the four color map theorem on Wikipedia).

Categorical maps

The third use of the col argument is by providing the variable (column) name. In this case, the map will represent the given variable. By default, tmap behaves differently depending on the input variable type. For example, it will create a categorical map when the provided variable contains characters or factors. The tm_polygons(col = "subregion", style = "cat") code will be run automatically in this case.

tm_shape(world_moll) +
  tm_polygons(col = "subregion") +
  tm_layout(legend.outside = TRUE)

Discrete maps

Discrete maps represent continuous numerical variables using discrete class intervals. There are several ways of converting continuous variables into discrete ones implemented in tmap.

Pretty

When the variable provided as the col argument is numeric, tmap will use the "pretty" style as a default. In other words, it runs tm_polygons(col = "lifeExp", style = "pretty") invisibly to the user. This style rounds breaks into whole numbers where possible and spaces them evenly.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

A histogram is added using legend.hist = TRUE in this and several next examples to show how the selected map style relates to the distribution of values.

It is possible to indicate a preferred number of classes using the n argument. Importantly, not every n is possible depending on the range of the values in the data.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              legend.hist = TRUE,
              n = 4) +
  tm_layout(legend.outside = TRUE)

Fixed

The "jenks" style allows for a manual selection of the breaks in conjunction with the breaks argument.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "fixed",
              breaks = c(45, 60, 75, 90),
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Additionally, the default labels can be overwritten using the labels argument.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "fixed",
              breaks = c(45, 60, 75, 90),
              labels = c("low", "medium", "high"),
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Breaks based on the standard deviation value

The "sd" style calculates a standard deviation of a given variable, and next use this value as the break width.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "sd",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Fisher algorithm

The "fisher" style creates groups with maximalized homogeneity.1

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "fisher",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Jenks natural breaks

The "jenks" style identifies groups of similar values in the data and maximizes the differences between categories.2

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "jenks",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Hierarchical clustering

In the "hclust" style, breaks are created using hierarchical clustering.3

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "hclust",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Bagged clustering

The "bclust" style uses the bclust function to generate the breaks using bagged clustering.4

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "bclust",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

## Committee Member: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1) 7(1) 8(1) 9(1) 10(1)
## Computing Hierarchical Clustering

k-means clustering

The "kmeans" style uses the kmeans function to generate the breaks.5

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "kmeans",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Quantile breaks

The "quantile" style creates breaks with an equal number of features (polygons).

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "quantile",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Equal breaks

The "equal" style divides input values into bins of equal range and is appropriate for variables with a uniform distribution. It is not recommended for variables with a skewed distribution as the resulting map may end-up having little color diversity.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "equal",
              legend.hist = TRUE) +
  tm_layout(legend.outside = TRUE)

Learn more about the implementation of discrete scales in the classInt package’s documentation.
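If you want to inspect the break values behind these styles outside of tmap, the same styles can be computed directly with classInt, the package that tmap relies on for discrete scales. Below is a minimal sketch, assuming the classInt and spData packages are installed; the number of classes and the chosen style are arbitrary here.

library(classInt)
library(spData)

# Compute the breaks that a five-class "jenks" map of life expectancy would use
lifeExp <- na.omit(world$lifeExp)
breaks_jenks <- classIntervals(lifeExp, n = 5, style = "jenks")
breaks_jenks$brks  # the break values underlying the discrete map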

Continuous maps

The tmap package also allows for creating continuous maps.

Continuous

The "cont" style presents a large number of colors over the continuous color field.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "cont") +
  tm_layout(legend.outside = TRUE)

Order

The "order" style also presents a large number of colors over the continuous color field. However, this style is suited to visualize skewed distributions; notice that the values on the legend do not change linearly.

tm_shape(world_moll) +
  tm_polygons(col = "lifeExp",
              style = "order") +
  tm_layout(legend.outside = TRUE)

Logarithmic scales

The default numeric style, "pretty", is easy to understand, but it is not well suited to maps of variables with skewed distributions.

tm_shape(world_moll) +  tm_polygons(col = "pop") +  tm_layout(legend.outside = TRUE) 

Another possible style, "order", works better in this case; however, it is not easy to interpret.

tm_shape(world_moll) +
  tm_polygons(col = "pop",
              style = "order") +
  tm_layout(legend.outside = TRUE)

A better alternative, in this case, is to use a common logarithm (the logarithm to base 10) scale. The tmap package gives two possibilities in this case – "log10_pretty" and "log10". The "log10_pretty" style is a common logarithmic version of the regular pretty style.

tm_shape(world_moll) +
  tm_polygons(col = "pop",
              style = "log10_pretty") +
  tm_layout(legend.outside = TRUE)

On the other hand, the "log10" style is a version of a continuous scale.

tm_shape(world_moll) +
  tm_polygons(col = "pop",
              style = "log10") +
  tm_layout(legend.outside = TRUE)

Conclusions

Selecting a color scale style is not an easy task. It depends on the type of the input variable and its distribution, but also on the intended audience. Therefore, it is worth spending some time thinking about your readers (e.g., would they be able to understand the logarithmic scale, or should you use manual breaks instead?) and your data (e.g., how many breaks are needed to show the relevant subgroups?). Now that you know the different color scale styles implemented in tmap, try using them in your own projects!


  1. https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479↩

  2. https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization↩

  3. See the ?hclust documentation for more details.↩

  4. See the ?bclust documentation for more details.↩

  5. See the ?kmeans documentation for more details.↩


Repetitive Q: Reading Multiple Files in the Zip Folder

[This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers.]

Dear Readers, I frequently see a recurring question, both directed to me and across various forums, on how to read multiple files inside a zip folder, whether they use the same separator or different ones. Again, let's not compromise on speed here. The solution is to use the easycsv package in R, which in turn uses the data.table function fread(). Find below a quick example:

library(easycsv)
## Loading required package: data.table
easycsv::fread_zip("xxxx\\alldata.zip", extension="CSV", sep=",")
 

Additionally, if you want to read and load large files efficiently, you can refer to the following page: https://www.kaggle.com/pradeep13/for-r-users-read-load-efficiently-save-time. Happy R Programming!
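If you prefer to do the extraction by hand with data.table alone, a similar result can be achieved by unzipping to a temporary directory and reading each file with fread(), which detects the separator automatically. This is only a rough sketch under that assumption; the helper name and the zip path are placeholders.

library(data.table)

# Hypothetical helper: extract a zip archive and fread() every file inside it
read_zip_manual <- function(zipfile) {
  tmp <- tempfile()
  unzip(zipfile, exdir = tmp)  # extract all files to a temporary directory
  files <- list.files(tmp, full.names = TRUE, recursive = TRUE)
  # one data.table per file, named after the file
  stats::setNames(lapply(files, fread), basename(files))
}

# alldata <- read_zip_manual("xxxx\\alldata.zip")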

Views expressed here are from the author's industry experience. The author trains on and blogs about Machine (Deep) Learning applications; for further details, he can be reached at mavuluri.pradeep@gmail.com. Find more about the author at http://in.linkedin.com/in/pradeepmavuluri.


rBokeh – Don’t be stopped by missing arguments!

[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers.]

In my last article on the STATWORX blog, I guided you through the process of writing a complex JavaScript callback. However, most users might be slightly frustrated by the lack of arguments to fully customize a standard rBokeh plot. Actually, rBokeh is a little bit outdated (structure() warnings all the way!) and lacks some functionalities that are available in its Python equivalent. But don't toss in the towel right away! I created some workarounds for these, which you will hopefully find helpful.

In general

This approach is my go-to solution for changing an rBokeh plot when the argument I need is missing in rBokeh but available in Python's bokeh.

  • Create the plot.
  • Inspect the structure (str(plot)) of the rBokeh object.
  • Search for the Python argument name.
  • Overwrite the value with the desired option as derived from python’s bokeh.

So, first of all, I set up an initial rBokeh figure that we manipulate later on.

plot <- figure(data = iris) %>%
  ly_bar(x = Species, y = Sepal.Length, hover = TRUE)

Manipulate the hover functionality

The first set of tricks deals with the customization of hover effects. Hover effects are essentials of interactive plots, so it makes a lot of sense to invest some time in optimizing them.

Anchor

Unlike in python’s bokeh, there is no anchor argument to change the position of a hover tooltip. By default, it appears in the center of the hovered element. To change it, we need to dive deep into the rBokeh object. The object is a deeply nested, complex list in which all the information about the plot is stored. While some elements are always structured in the same way, different layers are named by a seemingly arbitrary string (e.g., 51dab389c6209bbf084a86b368f68724). I wrote the following code snippet to change the hover position from center to top_center.

# Get the position of the anchor argument within the object-list
xyz <- logical()
for (i in seq_along(plot$x$spec$model)) {
  xyz[i] <- !is.null(plot$x$spec$model[[i]]$attributes$point_policy)
}

# Solution using for loop
for (i in which(xyz)) {
  plot$x$spec$model[[i]]$attributes$anchor <- "top_center"
}

In case you are not very fond of simple for loops, here are also solutions with purrr or lapply:

# Solution using purrr
xyz <- purrr::map_lgl(plot$x$spec$model, .f = ~ !is.null(.x$attributes$anchor))
plot$x$spec$model[which(xyz)] <- purrr::map(plot$x$spec$model[which(xyz)], ~ {
  .$attributes$anchor <- "top_center"
  return(.)
})

# Solution using the apply family
xyz <- sapply(plot$x$spec$model, function(x) !is.null(x$attributes$anchor))
plot$x$spec$model[which(xyz)] <- lapply(plot$x$spec$model[xyz], function(abc) {
  abc$attributes$anchor <- "top_center"
  return(abc)
})

All options of the tooltip position can be found here.

Point policy

Another option that can be specified in the same way is whether the tooltip should appear at a specific place (snap_to_data) or should follow the cursor (follow_mouse). This point_policy option is also missing in rBokeh but can be added with the same logic. Here is a solution using the purrr approach, but all other described options work as well.

# Get the position of the point policy argument within the object-list
xyz <- purrr::map_lgl(plot$x$spec$model, .f = ~ !is.null(.x$attributes$point_policy))
plot$x$spec$model[which(xyz)] <- purrr::map(plot$x$spec$model[which(xyz)], ~ {
  .$attributes$point_policy <- "follow_mouse"
  return(.)
})

What you see is what you want

The last hover-related issue I want to address is the displayed values. rBokeh is rather inflexible in this context. Sometimes (e.g., in ly_points) it is possible to define specific hover information (either a variable from the data or another data frame/list of the same length as the plot data), but in other cases the hover argument is just logical (TRUE or FALSE, like in ly_bar). If you want to change its default tooltip, you need to do this by hand, again.

# Set up the figure
plot <- figure(data = iris) %>% ly_bar(x = Species, y = Sepal.Length, hover = T)

# get the list elements where tooltips are defined
hover_info <- purrr::map_lgl(plot$x$spec$model, .f = ~ !is.null(.x$attributes$tooltips))

# delete a specific tooltip
plot$x$spec$model[[which(hover_info)]]$attributes$tooltips[[2]] <- NULL

# add a tooltip
plot$x$spec$model[[which(hover_info)]]$attributes$tooltips[[2]] <-
  # list of printed name (test) and name for internal use (@hover_col_3)
  list("test", "@hover_col_3")

hover_data <- purrr::map_lgl(plot$x$spec$model, .f = ~ !is.null(.x$attributes$data$hover_col_1))

# manipulate a tooltip
plot$x$spec$model[which(hover_data)] <- purrr::map(plot$x$spec$model[which(hover_data)], ~ {
  .$attributes$data$hover_col_1 <- 1:3
  # must match assigned name above
  .$attributes$data$hover_col_3 <- letters[1:3]
  return(.)
})
plot

Keep plotting!

I hope you enjoyed my blog post, and it helps you in solving or avoiding some troubles with rBokeh. And who knows, maybe a more intense use of this package might even motivate the developers to update or further develop this excellent package. So, keep plotting!


About the author

Matthias Nistler


I am a data scientist at STATWORX and passionate for wrangling data and getting the most out of it. Outside of the office, I use every second for cycling until the sun goes down.

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.


The post rBokeh – Don't be stopped by missing arguments! first appeared on STATWORX.


Job: Junior Systems Administrator (with a focus on R/Python)

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers.]

Jumping Rivers is a data science consultancy company focused on R and Python. We work across industries and throughout the world. We offer a mixture of training, modelling, and infrastructure support. Jumping Rivers is an RStudio Full Service Certified Partner.

This role is suitable for anyone interested in deploying (Linux-based) data science services and contains two main elements:

  • Client facing: assess virtual servers & services. Identify potential issues or improvements.
  • Internal: Everyone(!) at Jumping Rivers uses Linux. Provide support on setting up systems.

Depending on the interests of the applicant, getting involved with training is also a possibility.

Location: Jumping Rivers is based in Newcastle upon Tyne. However, half of the team are remote (Leeds, Lancaster, Edinburgh). To make remote working a possibility, you need a) a good internet connection and b) to be within a few hours of (train) travel to London or Edinburgh.

Essential technical requirements

  • Linux server administration
  • Shell scripting
  • Version control
  • Relevant technical degree or equivalent experience (Sciences, server administration)

Bonus

  • Experience with R, Python, HTML/CSS/JS
  • Docker stack deployment (e.g., Docker Compose, Terraform, Packer)
  • Continuous Integration and Deployment (e.g., GitLab CI, Travis)
  • Authentication services (e.g., Active Directory, SAML, LDAP, OAuth)

Individual responsibilities

  • Time management
  • Communication (video chat and email)
  • Travel to client’s location as required
  • Work independently
  • Work as part of a team

Future role opportunities

  • Opportunity to develop new orchestration and deployment pipelines for use in Artificial Intelligence and Machine Learning workloads.
  • Maintaining remote Linux services both cloud-based and internal VPS
  • Designing bespoke infrastructure solutions for clients
  • Training: develop and deliver courses

To discuss this role, please email us at careers@jumpingrivers.com. To apply, please send a short covering letter and CV. Please use "Junior Systems Administrator" as the subject.

Closing date: 14th November

The post Job: Junior Systems Administrator (with a focus on R/Python) appeared first on Jumping Rivers.


Practical Data Science with R 2nd Edition update

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

We are in the last stages of proofing the galleys/typesetting of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019. So this edition will definitely be out soon!

If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!

One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author.

2/3rds of the book is her work.


three birthdays and a numeral

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]

The riddle of the week on The Riddler was to find the size n of an audience for at least a 50% chance of observing at least one triplet of people sharing a birthday, as is the case in the present U.S. Senate. The question is much harder to solve than for a pair of people, but the formula exists, as detailed in this blog entry, this X validated entry, or my friend Anirban Das Gupta's review of birthday problems. If W is the number of triplets among n people,

\mathbb P(W =0) = \sum_{i=0}^{\lfloor n/2 \rfloor} \frac{365! n!}{i! (n-2i)! (365-n+i)! 2^i 365^n}

which returns n=88 as the smallest population size for which P(W=0|n=88)=0.4889349, while P(W=0|n=87)=0.5005451. A simulation based on 10⁶ draws confirms this boundary value, P(W=0|n=88)≈0.4890849 and P(W=0|n=87)≈0.5006471.
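For readers who want to reproduce these numbers, here is a minimal sketch in R of both the exact formula and the Monte Carlo check; the function names are mine, and the log-factorial trick is only there to avoid overflow.

# Exact probability of observing no birthday triplet among n people
pNoTriplet <- function(n) {
  i <- 0:floor(n / 2)
  sum(exp(lfactorial(365) + lfactorial(n) -
          lfactorial(i) - lfactorial(n - 2 * i) - lfactorial(365 - n + i) -
          i * log(2) - n * log(365)))
}
pNoTriplet(87)  # ~0.5005
pNoTriplet(88)  # ~0.4889

# Monte Carlo check: share of simulated audiences with no birthday appearing 3+ times
simNoTriplet <- function(n, nsim = 1e5) {
  mean(replicate(nsim, max(tabulate(sample(365, n, replace = TRUE), nbins = 365)) < 3))
}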


Vignette: Google Trends with the gtrendsR package

[This article was first published on Musings on R, and kindly contributed to R-bloggers.]

Background

Google Trends is a well-known, free tool provided by Google that allows you to analyse the popularity of top search queries on its Google search engine. In market exploration work, we often use Google Trends to get a very quick view of what behaviours, language, and general things are trending in a market.

And of course, if you can do something in R, then why not do it in R?

Philippe Massicotte’s gtrendsR is pretty much the go-to package for running Google Trends queries in R. It’s simple, you don’t need to set up API keys or anything, and it’s fairly intuitive. Let’s have a go at this with a simple and recent example.

Example: A Controversial Song from Hong Kong

Glory to Hong Kong (Chinese: 願榮光歸香港) is a Cantonese march song which became highly controversial politically, due to its wide adoption as the “anthem” of the Hong Kong protests. Since it was written collaboratively by netizens in August 2019,1 the piece has become viral and was performed all over the world and translated into many different languages.2 It’s also available on Spotify– just to give you a bit of context of its popularity.

Analytically, it would be interesting to compare the Google search trends of the English search term (“Glory to Hong Kong”) and the Chinese search term (“願榮光歸香港”), and see what they yield respectively. When did it go viral, and which search term is more popular? Let’s find out.

Using gtrendsR

gtrendsR is available on CRAN, so just make sure it’s installed (install.packages("gtrendsR")) and load it. Let’s load tidyverse as well, which we’ll need for the basic data cleaning and plotting:

library(gtrendsR)
library(tidyverse)

The next step then is to assign our search terms to a character variable called search_terms, and then use the package’s main function gtrends().

Let’s set the geo argument to Hong Kong only, and limit the search period to 12 months prior to today. We’ll assign the output to a variable – and let’s call it output_results.

search_terms <- c("Glory to Hong Kong", "願榮光歸香港")

gtrends(keyword = search_terms,
        geo = "HK",
        time = "today 12-m") -> output_results

output_results is a gtrends/list object, which you can extract all kinds of data from:

##                     Length Class      Mode
## interest_over_time  7      data.frame list
## interest_by_country 0      -none-     NULL
## interest_by_region  0      -none-     NULL
## interest_by_dma     0      -none-     NULL
## interest_by_city    0      -none-     NULL
## related_topics      0      -none-     NULL
## related_queries     6      data.frame list

Let’s have a look at interest_over_time, which is primarily what we’re interested in. You can access the data frame with the $ operator, and check out the data structure:

output_results %>%
  .$interest_over_time %>%
  glimpse()

## Observations: 104
## Variables: 7
## $ date      2018-10-21, 2018-10-28, 2018-11-04, 2018-11-11, 2018...
## $ hits      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ geo       "HK", "HK", "HK", "HK", "HK", "HK", "HK", "HK", "HK",...
## $ time      "today 12-m", "today 12-m", "today 12-m", "today 12-m...
## $ keyword   "Glory to Hong Kong", "Glory to Hong Kong", "Glory to...
## $ gprop     "web", "web", "web", "web", "web", "web", "web", "web...
## $ category  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

This is what the hits variable represents, according to Google’s FAQ documentation:

Google Trends normalizes search data to make comparisons between terms easier. Search results are normalized to the time and location of a query by the following process:

Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest.

The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics.

Let us plot this in ggplot2, just to try and replicate what we normally see on the Google Trends site – i.e. visualising the search trends over time. I really like the Economist theme from ggthemes, so I’ll use that:

output_results %>%
  .$interest_over_time %>%
  ggplot(aes(x = date, y = hits)) +
  geom_line(colour = "darkblue", size = 1.5) +
  facet_wrap(~keyword) +
  ggthemes::theme_economist() -> plot

The finding above is surprising: you would expect Hong Kong people to be more likely to search for the Chinese term than the English term, as the original piece was written in Cantonese.

I’ll now re-run this piece of analysis, using the shorter term 榮光, as the hypothesis is that people are more likely to search for that instead of the full song name. It could also be a quirk of Google Trends that it doesn’t return long Chinese search queries properly.

I’ll try to do this in a single pipe-line. Note what’s done differently this time:

  • time is set to 3 months from today
  • The onlyInterest argument is set to TRUE, which only returns interest over time and therefore is faster.
  • Google Trends returns hits as <1 as a character value for a value lower than 1, so let’s replace that with an arbitrary value 0.5 so we can plot this properly (the hits variable needs to be numeric).
gtrends(keyword = c("Glory to Hong Kong", "榮光"),
        geo = "HK",
        time = "today 3-m",
        onlyInterest = TRUE) %>%
  .$interest_over_time %>%
  mutate_at("hits", ~ifelse(. == "<1", 0.5, .)) %>% # replace with 0.5
  mutate_at("hits", ~as.numeric(.)) %>% # convert to numeric
  # Begin ggplot
  ggplot(aes(x = date, y = hits)) +
  geom_line(colour = "darkblue", size = 1.5) +
  facet_wrap(~keyword) +
  ggthemes::theme_economist() -> plot2

There you go! This is a much more intuitive result, where you’ll find that the search term for “榮光” reaches its peak in mid-September of 2019, whereas search volume for the English term is relatively lower, but still peaks at the same time.

I should caveat that the term “榮光” simply means Glory in Chinese, which people could search for without necessarily searching for the song, but we can be pretty sure that in the context of what’s happening that this search term relates to the actual song itself.

Limitations

One major limitation of Google Trends is that you can only search a maximum of five terms at the same time, which means that there isn’t really a way to do this at scale. There are some attempts online of doing multiple searches and “connect” the searches together by calculating an index, but so far I’ve not come across any attempts which have yielded a very satisfactory result. However, this is more of a limitation of Google Trends than the package itself.
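One workaround that is sometimes attempted (as alluded to above) is to run several batches of terms that all include a common "anchor" keyword, and then rescale each batch by the anchor's average hits to build an approximate index. The sketch below is my own illustration of that idea, not part of gtrendsR; the anchor term, the helper name, and the simple ratio-of-means rescaling are all assumptions, and the resulting index is only approximate.

library(gtrendsR)
library(dplyr)

# Hypothetical helper: query up to four terms together with a shared anchor term
# (Google Trends allows five keywords per query), then express hits relative to
# the anchor so that different batches become roughly comparable.
fetch_batch <- function(terms, anchor = "Hong Kong", geo = "HK", time = "today 12-m") {
  res <- gtrends(keyword = c(anchor, terms), geo = geo, time = time)$interest_over_time
  res <- res %>% mutate(hits = as.numeric(ifelse(hits == "<1", 0.5, hits)))
  anchor_mean <- mean(res$hits[res$keyword == anchor])
  res %>% mutate(index = hits / anchor_mean)
}

# batch1 <- fetch_batch(c("Glory to Hong Kong", "榮光"))
# batch2 <- fetch_batch(c("some", "other", "terms"))
# combined <- bind_rows(batch1, batch2)  # compare the batches on the "index" column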

What you can ultimately do with the gtrendsR package is limited by what Google provides, but the benefit of using gtrendsR is that all your search inputs will be documented, and certainly helps towards a reproducible workflow.

End notes / side discussion

This might just be another one of those things you can do in R, but the benefit is relatively minimal given that you cannot scale it very much. This reminds me of a meme:

Still, I suppose you can do a lot worse.

Also, writing this particular post made me realise how much more faff is involved when your post features a non-English language and you have to make changes to the encoding – Jekyll (the engine used for generating the static HTML pages on GitHub) isn’t particularly friendly for this purpose. Might be a subject for a separate discussion!


  1. https://en.wikipedia.org/wiki/Glory_to_Hong_Kong. For more information about the Hong Kong protests, check out https://www.helphk.info/.↩

  2. ↩



How confident are you? Assessing the uncertainty in forecasting

[This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers.]

Introduction

Some people think that the main idea of forecasting is in predicting the future as accurately as possible. I have bad news for them. The main idea of forecasting is in decreasing the uncertainty.

Think about it: any event that we want to predict has some systematic components \(\mu_t\), which could potentially be captured by our model, and a random error \(\epsilon_t\). The latter might not be purely random in its nature, but this component absorbs all the factors that we cannot predict. For example, it is impossible to predict, whether a specific person will get ill in a specific day and go to a hospital. So, the observed demand (you can bravely substitute this word by whatever you work with) can be roughly represented in the following way: \begin{equation} \label{eq:demand} y_t = \mu_t + \epsilon_t, \end{equation} where \(y_t\) is the actual value of demand (there is another formula for the multiplicative model, but it does not change the discussion, so we skip it for now in order not to overcomplicate things). And here comes the revelation: what we usually do in forecasting, is capture \(\mu_t\) as correctly as possible, so that we could predict the structure, and infer somehow the uncertainty \(\epsilon_t\) around it. When it comes to \(\epsilon_t\), all that we can usually do is estimate its mean and variance.

So, when we have data like this:

we can say that the average level of demand is close to 1000, and that there is variability around it with some standard deviation \(\sigma \approx 100 \). The trick here is to capture \(\mu_t\) and \(\sigma\) correctly, so that we can produce appropriate predictions. If we do that correctly, then we can produce point forecasts (the blue line on the graph) and a prediction interval of width \(1-\alpha\) (let’s say 0.95, the grey area on the graph), which in the ideal situation should contain \((1-\alpha) \times 100\)% of observations in it, when we produce forecasts for the holdout sample.
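The original post shows a plot of such a series at this point. As a rough illustration (with arbitrary parameters chosen to match the description above, a level around 1000 and a standard deviation of about 100), comparable data could be simulated like this:

set.seed(41)
# A flat-level series: mu_t = 1000 plus random noise with sd = 100
y <- ts(1000 + rnorm(120, mean = 0, sd = 100), frequency = 12)
plot(y)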

In real life, we never know \(\mu_t\), so, when trying to capture it, we might underestimate it in some cases (e.g., not including seasonality when it is needed), which would lead to a higher variance of the error term, increased uncertainty and wider prediction intervals. Alternatively, we might overestimate \(\mu_t\) (e.g., by including a trend when it is not needed), which would lead to a lower variance, unrealistically small uncertainty and narrow intervals. So, when we select the most appropriate model for the data, we want to get as close as possible to the true \(\mu_t\) and \(\sigma\). Note that whenever we need to estimate \(\sigma\), there is a way to make it closer to the “true” one by introducing the correction of the bias:

\begin{equation} \label{eq:biasCorrection} s = \sqrt{\frac{\sum_{t=1}^T e_t^2}{T-k}}, \end{equation}

where \(e_t\) is the estimated error term, \(T\) is the number of observations in the sample and \(k\) is the number of estimated parameters.
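As a quick illustration of this bias correction (a sketch with made-up residuals; e stands for the vector of in-sample errors and k for the number of estimated parameters):

e <- rnorm(100, 0, 10)        # placeholder residuals
obsInsample <- length(e)      # T in the formula above
k <- 3                        # e.g., number of estimated parameters
s <- sqrt(sum(e^2) / (obsInsample - k))
s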

When it comes to the practicalities, we produce point forecasts, which usually correspond to the conditional mean of the model, and the prediction intervals (quantiles of the assumed future distribution), which somehow reflect the uncertainty we capture. There has already been a post on the construction of prediction intervals on this website, and we have discussed how to assess the accuracy of point forecasts. So, the next natural question that we should have, is how to assess the accuracy of prediction intervals. How can we tell whether the model captures the uncertainty well?

Interval measures

Consider the following example in R, using smooth v2.5.4. We generate the data from an ETS(A,N,A) model and then apply four different models:

library(smooth)

x <- sim.es("ANA", obs=120, frequency=12, persistence=c(0.3,0.1), initial=c(1000), mean=0, sd=100)

modelUnderfit <- es(x$data, "ANN", silent=F, interval=T, holdout=T, h=24)
modelOverfit <- es(x$data, "AAA", silent=F, interval=T, holdout=T, h=24)
modelCorrect <- es(x$data, "ANA", silent=F, interval=T, holdout=T, h=24)
modelTrue <- es(x, silent=F, interval=T, holdout=T, h=24)
Four figures for the respective models: the model underfitting the data, the model overfitting the data, the correct model applied to the data, and the true model applied to the data.

The data exhibits a change of level and some changes of the seasonal component over time, and there are four models applied to it:

  1. ETS(A,N,N), which underfits the data, because it does not have the seasonal component,
  2. ETS(A,A,A), which overfits the data, because it contains the redundant component (in this example the trend),
  3. ETS(A,N,A), which is correctly specified, but differs from the true model due to the estimation on a sample.
  4. ETS(A,N,A) with the true parameters, which is supposed to be ideal in this situation.

All these models produce different point forecasts, which we can assess using some error measures:

errorMeasures <- rbind(modelUnderfit$accuracy,
                       modelOverfit$accuracy,
                       modelCorrect$accuracy,
                       modelTrue$accuracy)[,c("sMAE","sMSE","sCE")]
rownames(errorMeasures) <- c("Model Underfit","Model Overfit","Model Correct","Model True")
errorMeasures*100

                    sMAE      sMSE       sCE
Model Underfit 45.134368 25.510527 -122.3740
Model Overfit  19.797382  5.026588 -449.8459
Model Correct   9.580048  1.327130 -149.7284
Model True      9.529042  1.318951 -139.8342

Note that in our example, the first model is the least accurate, because it does not contain the seasonal component, but it produces the least biased forecasts, probably just by chance. The second model is more accurate than the first one in terms of sMAE and sMSE, because it contains all the necessary components, but it is not as accurate as the correct model, because of the trend component – the forecasts in the holdout continue the declining trajectory, while in reality they should not. The difference in terms of accuracy and bias between the correct and the true models is small, but it seems that in our example, the correct one is a bit worse, which is probably due to the estimation of the smoothing parameters.

More importantly these models produce different prediction intervals, but it is difficult to tell the difference in some of the cases. So, what we could do in order to assess the accuracy of the prediction intervals is to calculate Mean Interval Score (MIS) metrics, proposed by Gneiting (2011) and popularised by the M4 Competition: \begin{equation} \label{MIS} \begin{matrix} \text{MIS} = & \frac{1}{h} \sum_{j=1}^h \left( (u_{t+j} -l_{t+j}) + \frac{2}{\alpha} (l_{t+j} -y_{t+j}) \mathbb{1}(y_{t+j} < l_{t+j}) \right. \\ & \left. + \frac{2}{\alpha} (y_{t+j} -u_{t+j}) \mathbb{1}(y_{t+j} > u_{t+j}) \right) , \end{matrix} \end{equation} where \(u_{t+j}\) is the upper bound, \(l_{t+j}\) is the lower bound of the prediction interval, \(\alpha\) is the significance level and \(\mathbb{1}(\cdot)\) is the indicator function, returning one, when the condition is true and zero otherwise. The idea of this measure is to assess the range of the prediction intervals together with the coverage. If the actual values lie outside of the interval they get penalised with a ratio of \(\frac{2}{\alpha}\), proportional to the distance from the interval. At the same time the width of the interval positively influences the value of the measure. The idealistic model with the MIS=0 should have all the values in the holdout lying on the bounds of the interval and \(u_{t+j}=l_{t+j}\), which means that there is no uncertainty about the future, we know what’s going to happen with the 100% precision (which is not possible in the real life).
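To make the formula more concrete, here is a small hand-rolled version of MIS (my own sketch; in practice the greybox implementation used below is the one to rely on):

# Mean Interval Score for a vector of holdout values and interval bounds
MISmanual <- function(actual, lower, upper, level = 0.95) {
  alpha <- 1 - level
  mean((upper - lower) +
       2 / alpha * (lower - actual) * (actual < lower) +
       2 / alpha * (actual - upper) * (actual > upper))
}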

This measure is available in greybox package for R:

c(MIS(modelUnderfit$holdout, modelUnderfit$lower, modelUnderfit$upper, level=0.95),
  MIS(modelOverfit$holdout, modelOverfit$lower, modelOverfit$upper, level=0.95),
  MIS(modelCorrect$holdout, modelCorrect$lower, modelCorrect$upper, level=0.95),
  MIS(modelTrue$holdout, modelTrue$lower, modelTrue$upper, level=0.95))

[1] 1541.6667 1427.7527  431.7717  504.8203

These numbers do not say anything on their own; they should be compared between the models. The comparison shows that the first two models do not perform well, while the correct model seems to be doing the best job, even better than the true one (this could have happened by chance).

Unfortunately, this is a one-number measure, so we cannot say what specifically happened in our case. We can infer based on the graphs, that the first model had the widest range, and the second one had too many values lying outside the interval, but we cannot say that by simply looking at MIS values. In order to investigate this, we can check the average range of these intervals, so that we get an idea, whether the uncertainty has been captured correctly by the models: \begin{equation} \label{range} \text{range} = \frac{1}{h} \sum_{j=1}^h (u_{t+j} -l_{t+j}) , \end{equation} which in human language means to average out the width of the interval from one step ahead to h steps ahead. This can be easily calculated in R:

c(mean(modelUnderfit$upper - modelUnderfit$lower),
  mean(modelOverfit$upper - modelOverfit$lower),
  mean(modelCorrect$upper - modelCorrect$lower),
  mean(modelTrue$upper - modelTrue$lower))

[1] 1541.6667  297.1488  431.7717  504.8203

Looking at these numbers, it appears that the second model (overfitting the data) has the narrowest prediction interval of the four models. It seems to underestimate the uncertainty substantially, which leads to the problem of not covering the necessary 95% of observations in the holdout. The first model, as we noted above, has an unreasonably wide prediction interval, and the last two models produce more or less balanced intervals.

Another thing that we can check with the intervals is the coverage (how many observations lie inside the prediction interval in the holdout): \begin{equation} \label{coverage} \text{coverage} = \frac{1}{h} \sum_{j=1}^h \left( \mathbb{1}(y_{t+j} > l_{t+j}) \times \mathbb{1}(y_{t+j} < u_{t+j}) \right) , \end{equation} which can be done in R using the following command:

c(sum(modelUnderfit$holdout > modelUnderfit$lower & modelUnderfit$holdout < modelUnderfit$upper) / length(modelUnderfit$holdout),
  sum(modelOverfit$holdout > modelOverfit$lower & modelOverfit$holdout < modelOverfit$upper) / length(modelOverfit$holdout),
  sum(modelCorrect$holdout > modelCorrect$lower & modelCorrect$holdout < modelCorrect$upper) / length(modelCorrect$holdout),
  sum(modelTrue$holdout > modelTrue$lower & modelTrue$holdout < modelTrue$upper) / length(modelTrue$holdout))

[1] 1.0000000 0.5416667 1.0000000 1.0000000

Unfortunately, this is not very helpful, when only one time series is under consideration. For example, in our case the first, the third and the fourth models cover all the 100% observations in the holdout, and the second one covers only 54.2%. None of this is good, but we cannot say much anyway, because this is just one time series with 24 observations in the holdout.

By looking at the range and coverage, we can now understand, why the MIS had those values that we have observed earlier: models 1, 3 and 4 cover everything, while the model 2 does not, but has the narrowest interval.

If we want to further investigate the performance of the models in terms of prediction intervals, we can calculate the pinball loss function for each of the bounds separately (which seems to originate from the work of Koenker & Bassett, 1978): \begin{equation} \label{pinball} \text{pinball} = (1 - \alpha) \sum_{y_{t+j} < b_{t+j}, j=1,\dots,h } |y_{t+j} -b_{t+j}| + \alpha \sum_{y_{t+j} \geq b_{t+j} , j=1,\dots,h } |y_{t+j} -b_{t+j}|, \end{equation} where \(b_{t+j}\) is the value of a bound (either the upper or the lower one). What pinball is supposed to show is how well we capture a specific quantile in the data. The lower the value of pinball is, the closer the bound is to the specific quantile of the holdout distribution. If the pinball is equal to zero, then we have done a perfect job in hitting that specific quantile. In our example, we used a 95% prediction interval, which means that we have produced the 2.5% and 97.5% quantiles, corresponding to the lower and the upper bounds. We can calculate the pinball loss using the respective function from the greybox package:

pinballValues <- cbind(c(pinball(modelUnderfit$holdout, modelUnderfit$lower, 0.025),
                         pinball(modelOverfit$holdout, modelOverfit$lower, 0.025),
                         pinball(modelCorrect$holdout, modelCorrect$lower, 0.025),
                         pinball(modelTrue$holdout, modelTrue$lower, 0.025)),
                       c(pinball(modelUnderfit$holdout, modelUnderfit$upper, 0.975),
                         pinball(modelOverfit$holdout, modelOverfit$upper, 0.975),
                         pinball(modelCorrect$holdout, modelCorrect$upper, 0.975),
                         pinball(modelTrue$holdout, modelTrue$upper, 0.975)))
rownames(pinballValues) <- c("Model Underfit","Model Overfit","Model Correct","Model True")
colnames(pinballValues) <- c("lower","upper")
pinballValues

                  lower    upper
Model Underfit 484.0630 440.9371
Model Overfit  168.4098 688.2418
Model Correct  155.9144 103.1486
Model True     176.0856 126.8066

Once again, the pinball values do not tell anything on their own, and should be compared with each other. By analysing this result, we can say that the correct model did the best job in terms of capturing the quantiles correctly. It even did better than the true model, which agrees with what we have observed in the analysis of MIS, coverage and range. Note that the true model outperformed the correct one in terms of the accuracy (sMAE, sMSE, sCE), but it did not do that in terms of capturing the uncertainty. Still, this is only one time series, so we cannot make any serious conclusions based on it...

Also, the first model has the highest pinball value for the lower bound and the second highest for the upper one. This is because the intervals are too wide.

Furthermore, the second model has an adequate lower pinball value, but an upper one that is too high. This is because the model contains the trend component and predicts a decline.

As a side note, it is worth saying that quantiles are very difficult to assess correctly using the pinball function on small samples. For example, in order to get a better idea of how the 97.5% bound performs, we would need to have at least 40 observations, so that 39 of them would be expected to lie below this bound (\(\frac{39}{40} = 0.975\)). In fact, the quantiles are not always uniquely defined, which makes the measurement difficult. Just as a reminder for those of you who are comfortable with mathematics, the \(\alpha\)-quantile is defined as: \begin{equation} \label{quantile} P \left(y_t < q_{\alpha} \right) = \alpha , \end{equation} which reads as "the probability that a value will be lower than the specified \(\alpha\)-quantile is equal to \(\alpha\)". So, we are saying, for example, that \(q_{97.5\%}\) is such a number that guarantees that 97.5% of observations in the holdout would lie below it. In order to assess the performance of an interval more precisely, we need to measure it on a bigger sample; it might not work well on an example with a fixed origin and just one time series.

Finally, MIS, range and pinball values are measured in the units of the original data (e.g., pints of beer). So, they cannot be summarised when we deal with different time series and want to aggregate across them. In order to do that correctly, we would need to get rid of the units somehow. We can use scaling (dividing the value by the mean of the series, as in Petropoulos & Kourentzes (2015), or by the mean differences of the data, as in Hyndman & Koehler (2006) and the M4 competition) or calculate relative values, using one of the models as a benchmark (similar to Davydenko & Fildes (2013)).
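For instance, a scaled version of MIS (dividing by the in-sample mean of the series, in the spirit of Petropoulos & Kourentzes, 2015) could look like the following sketch; here yInsample is an assumed vector of the in-sample actual values for the series under consideration:

# Scaled MIS: unit-free, so it can be averaged across many time series
sMIS <- MIS(modelCorrect$holdout, modelCorrect$lower, modelCorrect$upper, level = 0.95) /
  mean(yInsample)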

An experiment in R

In order to see the performance of the discussed measures on a bigger sample, we conduct a simple simulation experiment in R, using the same setting but for the dataset of 1000 time series. The whole script for the experiment is shown below:

A chunk of code in R
library(smooth)

# 4 models, 5 measures: MIS, Coverage, Range, Pinball L, Pinball U, 1000 iterations
errorMeasures <- array(NA, c(1000,4,5),
                       dimnames=list(NULL,
                                     c("Model Underfit","Model Overfit","Model Correct","Model True"),
                                     c("MIS","Range","Coverage","Lower","Upper")))

for(i in 1:1000){
    x <- sim.es("ANA", obs=120, frequency=12, persistence=c(0.3,0.1), initial=c(1000), mean=0, sd=100)

    modelUnderfit <- es(x$data, "ANN", silent=T, interval="p", holdout=T, h=24)
    modelOverfit <- es(x$data, "AAA", silent=T, interval="p", holdout=T, h=24)
    modelCorrect <- es(x$data, "ANA", silent=T, interval="p", holdout=T, h=24)
    modelTrue <- es(x, silent=T, interval=T, holdout=T, h=24)

    errorMeasures[i,,1] <- c(MIS(modelUnderfit$holdout, modelUnderfit$lower, modelUnderfit$upper, level=0.95),
                             MIS(modelOverfit$holdout, modelOverfit$lower, modelOverfit$upper, level=0.95),
                             MIS(modelCorrect$holdout, modelCorrect$lower, modelCorrect$upper, level=0.95),
                             MIS(modelTrue$holdout, modelTrue$lower, modelTrue$upper, level=0.95))

    errorMeasures[i,,2] <- c(mean(modelUnderfit$upper - modelUnderfit$lower),
                             mean(modelOverfit$upper - modelOverfit$lower),
                             mean(modelCorrect$upper - modelCorrect$lower),
                             mean(modelTrue$upper - modelTrue$lower))

    errorMeasures[i,,3] <- c(sum(modelUnderfit$holdout > modelUnderfit$lower & modelUnderfit$holdout < modelUnderfit$upper),
                             sum(modelOverfit$holdout > modelOverfit$lower & modelOverfit$holdout < modelOverfit$upper),
                             sum(modelCorrect$holdout > modelCorrect$lower & modelCorrect$holdout < modelCorrect$upper),
                             sum(modelTrue$holdout > modelTrue$lower & modelTrue$holdout < modelTrue$upper)) / length(modelUnderfit$holdout)

    errorMeasures[i,,4] <- c(pinball(modelUnderfit$holdout, modelUnderfit$lower, 0.025),
                             pinball(modelOverfit$holdout, modelOverfit$lower, 0.025),
                             pinball(modelCorrect$holdout, modelCorrect$lower, 0.025),
                             pinball(modelTrue$holdout, modelTrue$lower, 0.025))

    errorMeasures[i,,5] <- c(pinball(modelUnderfit$holdout, modelUnderfit$upper, 0.975),
                             pinball(modelOverfit$holdout, modelOverfit$upper, 0.975),
                             pinball(modelCorrect$holdout, modelCorrect$upper, 0.975),
                             pinball(modelTrue$holdout, modelTrue$upper, 0.975))
}

This code can be more efficient if the calculations are done in parallel rather than in serial, but this should suffice for an example.
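A possible parallel version, sketched with foreach and doParallel (the inner measure calculations are abbreviated; they would be the same MIS() and pinball() calls as in the serial loop above):

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

errorMeasuresPar <- foreach(i = 1:1000, .packages = "smooth") %dopar% {
    x <- sim.es("ANA", obs=120, frequency=12, persistence=c(0.3,0.1), initial=c(1000), mean=0, sd=100)
    modelCorrect <- es(x$data, "ANA", silent=T, interval="p", holdout=T, h=24)
    # ... fit the other three models and fill a 4 x 5 matrix of measures here,
    # exactly as inside the serial loop above
    MIS(modelCorrect$holdout, modelCorrect$lower, modelCorrect$upper, level=0.95)
}

stopCluster(cl)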

The problem that we have now is that MIS, range and pinball are measured in the units of the original data and cannot be aggregated as is. A simple solution would be to take one of the models as a benchmark and calculate the relative measures. Let’s use the correct model as a benchmark (note that we do not do anything with the coverage as it is already unitless):

errorMeasuresRelative <- errorMeasures

for(i in 1:4){
    errorMeasuresRelative[,i,c(1,2,4,5)] <- errorMeasures[,i,c(1,2,4,5)] / errorMeasures[,3,c(1,2,4,5)]
}

This way we obtain relative range, relative MIS and relative pinball values, which can now be analysed however we want, for example, using geometric means:

round(cbind(exp(apply(log(errorMeasuresRelative[,,-3]),c(2,3),mean)),
            apply(errorMeasuresRelative,c(2,3),mean)[,3,drop=FALSE]),3)

                 MIS Range Lower Upper Coverage
Model Underfit 2.091 2.251 2.122 2.133    0.958
Model Overfit  1.133 1.040 1.123 1.113    0.910
Model Correct  1.000 1.000 1.000 1.000    0.938
Model True     0.962 1.013 0.964 0.963    0.951

As we can see from this example, the model that underfits the data has a 125.1% wider range than the correct one, and it also has higher lower and upper pinball values (112.2% and 113.3% higher, respectively). So, it overestimates the uncertainty because it does not have the correct time series components. However, it has the coverage closest to the nominal level among the first three models, which is difficult to explain.

The second model, which overfits the data, also has a higher range than the correct one, yet it covers fewer observations in the holdout. So, due to the redundant components, it underestimates the uncertainty.

We also have the true model, which does not have an issue with the estimation and thus covers the nominal 95% of the observations in the holdout, producing slightly narrower intervals than the ones of the correct model.

Finally, the third model, which is supposed to be correct, does a better job than the first two in terms of range, MIS and pinball values, but its nominal 95% interval contains only 93.8% of the values. This is because of the in-sample estimation and because of the formulation of the ETS model – the conventional approach of Hyndman et al. (2008) does not take the uncertainty of the parameters into account. This is one of the issues with ETS in its current state, which has not yet been addressed in the academic literature.

There can be other reasons why prediction intervals do not perform as expected, some of which have been discussed in a post about the intervals in the smooth package. The main message from this post is that capturing the uncertainty is a difficult task, and there are still a lot of things that can be done in terms of model formulation and estimation. But at least, when applying models to real data, we can have an idea about their performance in terms of the uncertainty captured.

P.S. I would like to use this opportunity to promote courses run by the Centre for Marketing Analytics and Forecasting. We will have the “Forecasting with R” course on 30th and 31st January, 2020, which will be taught by Nikos Kourentzes and me in London. If you want to know more, here is the landing page of the course with some information.


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting.


SQL Server Schemas & R Tip


[This article was first published on R on Thomas Roh, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I ran into an issue the other day where I was trying to write a new table to a SQL Server database with a non-default schema. I did end up spending a bit of time debugging and researching, so I wanted to share for anyone else who runs into the issue. Using the DBI::Id function allows you to specify the schema when you are trying to write a table to a SQL Server database.

DBI::dbWriteTable(con,
                  DBI::Id(schema = "schema", table = "tablename"),
                  df)

But the code above will return a strange error.

After some investigation I found a workaround to be able to write the table. For non-default schemas, a trailing underscore ("_") needs to be added to the table name for it to work.

DBI::dbWriteTable(con,
                  DBI::Id(schema = "schema", table = "tablename_"),
                  df)

This really isn’t ideal for naming conventions, so the T-SQL command sp_rename will rename the table to what I originally wanted.

DBI::dbWriteTable(con,
                  DBI::Id(schema = "schema", table = "tablename_"),
                  df)

DBI::dbGetQuery(con, "USE database; EXEC sp_rename '[schema].[tablename_]', 'tablename';")

I ran into the same issue when overwriting tables as well, but a workflow for doing that is simply to use sp_rename a couple of times.

DBI::dbGetQuery(con, "USE database; EXEC sp_rename '[schema].[tablename]', 'tablename_';")

DBI::dbWriteTable(con,
                  DBI::Id(schema = "schema", table = "tablename_"),
                  df,
                  overwrite = TRUE)

DBI::dbGetQuery(con, "USE database; EXEC sp_rename '[schema].[tablename_]', 'tablename';")

To leave a comment for the author, please follow the link and comment on their blog: R on Thomas Roh.


Partial Dependence Plot (PDP) of GRNN


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The function grnn.margin() (https://github.com/statcompute/yager/blob/master/code/grnn.margin.R) was my first attempt to explore the relationship between each predictor and the response in a General Regression Neural Network, which is usually considered a black-box model. The idea is described below:

  1. First trained a GRNN with the original training dataset
  2. Created an artificial dataset from the training data by keeping distinct values of the variable that we are interested in but replacing all values of other variables with their means. For instance, given a dataset with three variables X1, X2, and X3, if we are interested in the marginal effect of X1 with 3 distinct values, e.g. [X11 X12 X13], then the constructed dataset should look like {[X11 mean(X2) mean(X3)], [X12 mean(X2) mean(X3)], [X13 mean(X2) mean(X3)]}
  3. Calculated predicted values, namely [Pred1 Pred2 Pred3], based on the constructed dataset by using the GRNN created in the first step
  4. At last, the relationship between [X11 X12 X13] and [Pred1 Pred2 Pred3] is what we are looking for

The above-mentioned approach is computationally efficient but might be somewhat “brutal” in a sense that it doesn’t consider the variation in other variables.

By the end of Friday, my boss pointed me to a paper describing the partial dependence plot (Yes! In 53, we also have an SVP who is technically savvy). The idea is very intriguing, albeit computationally expensive, and is delineated as below:

  1. First trained a GRNN with the original training dataset
  2. Based on the training dataset, get a list of distinct values from the variable of interest, e.g. [X11 X12 X13]. In this particular example, we created three separate datasets from the training data by keeping the other variables as they are but replacing all values of X1 with each of [X11 X12 X13] respectively
  3. With each of three constructed datasets above, calculated predicted values and then averaged them out such that we would have an average of predicted values for each of [X11 X12 X13], namely [Pavg1 Pavg2 Pavg3]
  4. The relationship between [X11 X12 X13] and [Pavg1 Pavg2 Pavg3] is the so-called Partial Dependence

The idea of PDP has been embedded in the YAGeR project (https://github.com/statcompute/yager/blob/master/code/grnn.partial.R). In the chart below, I compared outcomes of grnn.partial() and grnn.margin() side by side for two variables, e.g. the first not so predictive and the second very predictive. In this particular comparison, both appeared almost identical.
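To make the procedure concrete, here is a generic sketch of the partial dependence logic described above. It is not the actual grnn.partial() code – predict_fn(model, data) stands in for whatever prediction function the model uses:

pdp_sketch <- function(model, df, xname, predict_fn) {
  xvals <- sort(unique(df[[xname]]))
  pavg <- sapply(xvals, function(v) {
    df_v <- df
    df_v[[xname]] <- v                 # keep other variables as they are
    mean(predict_fn(model, df_v))      # average prediction over all training rows
  })
  data.frame(x = xvals, partial_dependence = pavg)
}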



To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Permutation Feature Importance (PFI) of GRNN


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In the post https://statcompute.wordpress.com/2019/10/13/assess-variable-importance-in-grnn, it was shown how to assess the variable importance of a GRNN by the decrease in GoF statistics, e.g. AUC, after averaging or dropping the variable of interest. The permutation feature importance evaluates the variable importance in a similar manner by permuting values of the variable, which attempts to break the relationship between the predictor and the response.

Today, I added two functions to calculate PFI in the YAGeR project, e.g. the grnn.x_pfi() function (https://github.com/statcompute/yager/blob/master/code/grnn.x_pfi.R) calculating PFI of an individual variable and the grnn.pfi() function (https://github.com/statcompute/yager/blob/master/code/grnn.pfi.R) calculating PFI for all variables in the GRNN.

Below is an example showing how to use PFI to evaluate the variable importance. It turns out that the outcome looks very similar to the one created by the grnn.imp() function previously discussed.
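The permutation idea itself fits in a few lines. This is a generic sketch, not the actual grnn.x_pfi() implementation; predict_fn() is a placeholder and the AUC here comes from the pROC package:

library(pROC)

pfi_sketch <- function(model, df, yname, xname, predict_fn, nperm = 10) {
  auc_full <- as.numeric(auc(df[[yname]], predict_fn(model, df)))
  auc_drop <- replicate(nperm, {
    df_perm <- df
    df_perm[[xname]] <- sample(df_perm[[xname]])  # permute to break the X-Y relationship
    auc_full - as.numeric(auc(df[[yname]], predict_fn(model, df_perm)))
  })
  mean(auc_drop)  # average decrease in AUC after permuting the variable
}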




To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Building a Corporate R Package for Pleasure and Profit


[This article was first published on R on technistema, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The “Great Restructuring” of our economy is underway. That’s the official name for what we know is happening: the best are rising to the top, and the mediocre are sinking to the bottom. It’s the Matthew Principle in motion.

In Brynjolfsson and McAfee’s 2011 book Race Against the Machine, they detail how this New Economy will favor those that have the skill set or the capital to interface and invest in new technologies such as deep learning and robotics, which are becoming more ubiquitous every day.

Cal Newport’s Deep Work outlines two core abilities for thriving in this new economy:

  1. Be able to quickly master hard things
  2. Be able to produce at an elite level, in terms of both quality and speed. This need for speed (sorry) will be our focus.

Don’t repeat yourself (DRY) is a well-known maxim in software development, and most R programmers follow this rule and build functions to avoid duplicating code. But how often do you:

  • Reference the same dataset in different analyses
  • Create the same ODBC connection to a database
  • Tinker with the same colors and themes in ggplot
  • Produce markdown docs from the same template

and so on? Notice a pattern? The word “same” is sprinkled in each bullet point. I smell an opportunity to apply DRY!

If you work in a corporate or academic setting like me, you probably do these things pretty often. I’m going to show you how to wrap all of these tasks into a minimalist R package to save you time, which, as we’ve learned, is one of the keys to your success in the New Economy.

Tools

First some groundwork. I’ll assume if you work in R that you are using RStudio, which will be necessary to follow along. I’m using R version 3.5.1 on a Windows 10 machine (ahh, corporate America…). Note that the package we are about to develop is minimalist, which is a way of saying that we’re gonna cut corners to make a minimum viable product. We won’t get deep into documentation and dependencies much, as the packages we’ll require in our new package are more than likely already on your local machine.

Create an empty package project

We’ll be creating a package for the consulting firm Ketchbrook Analytics, a boutique shop from Connecticut who know their way around a %>% better than anyone.

Open RStudio and create a project in a new directory:

Select R Package and give it a name. I’ll call mine ketchR.

RStudio will now start a new session with an example “hello” function. Looks like we’re ready to get down to business.

Custom functions

Let’s start by adding a function to our package. A common task at Ketchbrook is mapping customer data with an outline for market area or footprint. We can easily wrap that into a simple function.

Create a new R file and name it ketchR.R. We’ll put all of our functions in here.

# To generate the footprint_polys data
footprint_poly <- function() {
  #' Returns object of class SpatialPolygons of the AgC footprint.
  #' Utilizes the Tigris:: package.
  require(tidyverse)
  require(tigris)
  require(sf)

  # Get County Polygons
  states.raw <- tigris::states()

  states <- states.raw[states.raw@data$STUSPS %in% c("CA", "OR", "WA"),]

  states <- sf::st_as_sfc(states)
  states <- sf::st_union(states)
  states <- as(states, 'Spatial')

  return(states)
}

So what we’ve done is create a function that utilizes the tigris package to grab shapefiles for states in our footprint. The function then unions those states into one contiguous polygon so we can easily overlay this using leaflet, ggmap, etc.

Try your new function out:

library(leaflet)

leaflet() %>% 
  addTiles() %>% 
  addPolygons(data = footprint_poly())

There is no limit to what kinds of custom functions you can add in your package. Machine learning algs, customer segmentation, whatever you want you can throw in a function with easy access in your package.

Datasets

Let’s stay on our geospatial bent. Branch or store-level analysis is common in companies spread out over a large geographical region. In our example, Ketchbrook’s client has eight branches from Tijuana to Seattle. Instead of manually storing and importing a CSV or R data file each time we need to reference these locations, we can simply save the data set to our package.

In order to add a dataset to our package, we first need to pull it into our local environment either by reading a csv or grabbing it from somewhere else. I simply read in a csv from my local PC:

branches <- read.csv("O:\\exchange\\branches.csv", header = T)

This is what the data set looks like:

Now, we have to put this data in a very specific place, or our package won’t be able to find it. Like when my wife hides the dishwasher so I’m reluctantly forced to place dirty dishes on the counter.

First, create a folder in your current directory called “data.” Your directory should look like this now, btw:

Bonus points: use the terminal feature in RStudio to create the directory easily:

Now we need to save this branches data set into our new folder as an .RData file:

save(branches, file = "data/branches.RData")

Now, we build

Let’s test this package out while there’s still a good chance we didn’t mess anything up. When we build the package, we are compiling it into the actual package as we know it. In RStudio, this is super simple. Navigate to the “Build” tab, and click “Install and Restart.” If you’ve followed along, you shouldn’t see any errors, but if you do see errors, try updating your local packages.

Now, we should be able to call our package directly and use our branches dataset:
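A quick sanity check might look like this (a sketch, assuming the ketchR package built above installed cleanly):

library(ketchR)

data("branches", package = "ketchR")
head(branches)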

Cool, that works. Now let’s quickly plot our branches with leaflet to make sure footprint_poly() worked:

library(leaflet)

leaflet() %>% 
  addTiles() %>% 
  addPolygons(data = ketchR::footprint_poly()) %>% 
  addCircles(data = branches,
             lat = branches$lat,
             lng = branches$lon,
             radius = 40000,
             stroke = F,
             color = "red")

Niiiice.

Database connections

One of the most common tasks in data science is pulling data from databases. Let’s say that Ketchbrook stores data in a SQL Server. Instead of manually copying and pasting a connection script or relying on the RStudio session to cache the connection string, let’s just make a damn function.

get_db <- function(query = "SELECT TOP 10 * FROM datas.dbo.Customers") {
  #' Pull data from g23 database
  #' @param query: enter a SQL query; Microsoft SQL syntax please
  require(odbc)

  con <- dbConnect(odbc(),
                   Driver = "SQL Server",
                   Server = "datas",
                   Database = "dataserver",
                   UID = "user",
                   PWD = rstudioapi::askForSecret("password"),
                   Port = 6969)

  z <- odbc::dbGetQuery(con, query)

  # Disconnect before returning; anything placed after return() would never run
  odbc::dbDisconnect(con)

  return(z)
}

Here, we’re building a function that lets us enter any query we want to bang against this SQL Server. The function creates the connection, prompts us to enter the password each time (we don’t store passwords in code…) and closes the connection when it’s through.

Let’s take it a step further. Many times you may pull a generic SELECT * query in order to leverage dplyr to do your real data munging. In this case, it’s easier to just make a function that does just that.

Let’s make another function that pulls a SELECT * FROM Customers.

get_customers <- function() {
  #' Pull most recent customer data from G23 - datascience.agc_Customers
  require(odbc)

  con <- dbConnect(odbc(),
                   Driver = "SQL Server",
                   Server = "datas",
                   Database = "dataserver",
                   UID = "user",
                   PWD = rstudioapi::askForSecret("password"),
                   Port = 6969)

  query1 <- "SELECT * FROM datas.dbo.Customers"

  z <- odbc::dbGetQuery(con, query1)

  # Disconnect before returning, as above
  odbc::dbDisconnect(con)

  return(z)
}

Ahh, this alone saved me quarters-of-hours each week once I started using it in my own practice. Think hard about any piece of code that you may copy and paste on a regular basis – that’s a candidate for your package’s stable of functions.

Branded ggplot visualizations

Ok now we’re getting to the primo honey, the real time-savers, the analyst-impresser parts of our package. We’re going to make it easy to produce consistent data visualizations which reflect a company’s image with custom colors and themes.

Although I personally believe the viridis palette is the best color scheme of all time, it doesn’t necessarily line up with Ketchbrook’s corporate color palette. So let’s make our own set of functions to use Ketchbrook’s palette in a ‘lazy’ way. (Big thanks to Simon Jackson’s great article.)

Get the colors

Let’s pull the colors directly from their website. We can use the Chrome plugin Colorzilla to pull the colors we need.

Take those hex color codes and paste them into this chunk like so:

# Palette main colors
ketch.styles <- c(
  `salmon` = "#F16876",
  `light_blue`= "#00A7E6",
  `light_grey` = "#E8ECF8",
  `brown`  = "#796C68")

This will give us a nice palette that has colors different enough for categorical data, and similar enough for continuous data. We can even split this up into two separate sub-palettes for this very purpose:

# Create separate palettes
ketch.palettes <- list(
  `main`  = styles('salmon','light_blue', 'brown', 'light_grey'),
  `cool`  = styles('light_blue', 'light_grey'))

Create the functions

I’m not going to go through these functions line by line; if you have questions, reach out to me at bradley.lindblad[at]gmail[dot]com or create an issue on the GitHub repo. Here is the full code snippet:

# Palette main colors
ketch.styles <- c(
  `salmon` = "#F16876",
  `light_blue`= "#00A7E6",
  `light_grey` = "#E8ECF8",
  `brown`  = "#796C68")

# Fn to extract them by hex codes
styles <- function(...) {
  cols <- c(...)

  if (is.null(cols))
    return (ketch.styles)

  ketch.styles[cols]
}

# Create separate palettes
ketch.palettes <- list(
  `main`  = styles('salmon','light_blue', 'brown', 'light_grey'),
  `cool`  = styles('light_blue', 'light_grey'))

# Fn to access them
ketch_pal <- function(palette = "main", reverse = FALSE, ...) {
  pal <- ketch.palettes[[palette]]

  if (reverse) pal <- rev(pal)

  colorRampPalette(pal, ...)
}

# Fn for custom scale
scale_color_ketch <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) {
  pal <- ketch_pal(palette = palette, reverse = reverse)
  #' Scale color using AgC color palette.
  #' @param palette: main, greens or greys
  #' @param discrete: T or F
  #' @param reverse: reverse the direction of the color scheme

  if (discrete) {
    discrete_scale("colour", paste0("ketch_", palette), palette = pal, ...)
  } else {
    scale_color_gradientn(colours = pal(256), ...)
  }
}

scale_fill_ketch <- function(palette = "main", discrete = TRUE, reverse = FALSE, ...) {
  #' Scale fill using AgC color palette.
  #' @param palette: main, greens or greys
  #' @param discrete: T or F
  #' @param reverse: reverse the direction of the color scheme
  pal <- ketch_pal(palette = palette, reverse = reverse)

  if (discrete) {
    discrete_scale("fill", paste0("ketch_", palette), palette = pal, ...)
  } else {
    scale_fill_gradientn(colours = pal(256), ...)
  }
}

Let’s test it out:

ggplot(mtcars) +
  geom_point(aes(mpg, disp, color = qsec), alpha = 0.5, size = 6) +
  ketchR::scale_color_ketch(palette = "main", discrete = F) +
  theme_minimal()

produces:

Markdown templates

Now that we’ve fetched the data and plotted the data much more quickly, the final step is to communicate the results of our analysis. Again, we want to be able to do this quickly and consistently. A custom markdown template is in order.

I found this part to be the hardest to get right, as everything needs to be in the right place within the file structure, so follow closely. (Most of the credit here goes to this article by Chester Ismay.)

1. Create skeleton directory

dir.create("ketchbrookTemplate/inst/rmarkdown/templates/report/skeleton",    recursive = TRUE)

This creates a nested directory that will hold our template .Rmd and .yaml files. You should have a new folder in your directory called “ketchbrookTemplate”:

2. Create skeleton.Rmd

Next we create a new RMarkdown file:

This will give us a basic RMarkdown file like this:

At this point let’s modify the template to fit our needs. First I’ll replace the top matter with a theme that I’ve found to work well for me, feel free to rip it off:

---
title: "ketchbrookTemplate"
author: Brad Lindblad
output: 
  prettydoc::html_pretty:
    theme: cayman
    number_sections: yes
    toc: yes
  pdf_document: 
    number_sections: yes
    toc: yes
  rmarkdown::html_document:
    theme: cayman
  html_notebook: 
    number_sections: yes
    theme: journal
    toc: yes
header-includes:
  - \setlength{\parindent}{2em}
  - \setlength{\parskip}{0em}
date: February 05, 2018
always_allow_html: yes
#bibliography: bibliography.bib
abstract: "Your text block here"
---

I like to follow an analysis template, so this is the top matter combined with my basic EDA template:

---
title: "Customer Service Survey EDA"
author: Brad Lindblad, MBA
output: 
  pdf_document: 
    number_sections: yes
    toc: yes
  html_notebook: 
    number_sections: yes
    theme: journal
    toc: yes
  rmarkdown::html_document:
    theme: cayman
  prettydoc::html_pretty:
    theme: cayman
    number_sections: yes
    toc: yes
header-includes:
  - \setlength{\parindent}{2em}
  - \setlength{\parskip}{0em}
date: September 20, 2018
always_allow_html: yes
bibliography: bibliography.bib
abstract: "Your text block here"
---

Writing Your Report
Now that you've done the necessary preparation, you can begin writing your report. To start, keep in mind there is a simple structure you should follow. Below you'll see the sections in order along with descriptions of each part.

Introduction
Summarize the purpose of the report and summarize the data / subject.
Include important contextual information about the reason for the report.
Summarize your analysis questions, your conclusions, and briefly outline the report.

Body - Four Sections
Data Section - Include written descriptions of data and follow with relevant spreadsheets.
Methods Section - Explain how you gathered and analyzed data.
Analysis Section - Explain what you analyzed. Include any charts here.
Results - Describe the results of your analysis.

Conclusions
Restate the questions from your introduction.
Restate important results.
Include any recommendations for additional data as needed.

Appendix
Include the details of your data and process here. Include any secondary data, including references.

# Introduction
# Data
# Methods
# Analysis
# Results
# Conclusions
# Appendix
# References

Save this file in the skeleton folder and we’re done here.

3. Create the yaml file

Next we need to create a yaml file. Simply create a new text document called “template.yaml” in RStudio and save it like you see in this picture:
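If you prefer to create it from the console, something like the following works; the name and description values are placeholders, and name/description/create_dir are the standard fields rmarkdown looks for in template.yaml (the path below matches the template directory created earlier):

writeLines(
  c("name: Ketchbrook Report",
    "description: Ketchbrook Analytics analysis template",
    "create_dir: false"),
  "ketchbrookTemplate/inst/rmarkdown/templates/report/template.yaml"
)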

Rebuild the package and open a new RMarkdown document, select “From Template” and you should see your new template available:

Sweet. You can now knit to html pretty and have sweet output like this:

If you run into problems, make sure your file structure matches this:

├───inst
│   └───rmarkdown
│       └───templates
│           └───ketchbrookTemplate
│               │   template.yaml
│               │
│               └───skeleton
│                       skeleton.nb.html
│                       skeleton.Rmd

What’s next?

So we’ve essentially made a bomb package that will let you do everything just a little more quickly and a little better: pull data, reference common data, create data viz and communicate results.

From here, you can use the package locally, or push it to a remote Github repository to spread the code among your team.

The full code for this package is available at the Github repo set up for it. Feel free to fork it and make it your own. I’m not good at goodbye’s so I’m just gonna go.

I’m available for data science consulting on a limited basis. Reach me at bradley.lindblad[at]gmail[dot]com


To leave a comment for the author, please follow the link and comment on their blog: R on technistema.


RcppGSL 0.3.7: Fixes and updates


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new release 0.3.7 of RcppGSL is now on CRAN. The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

Stephen Wade noticed that we were not actually freeing memory from the GSL vectors and matrices as we set out to do. And he is quite right: a dormant bug, present since the 0.3.0 release, has now been squashed. I had one boolean wrong, and this has now been corrected. I also took the opportunity to switch the vignette to prebuilt mode: Now a pre-made pdf is just included in a Sweave document, which makes the build more robust to tooling changes around the vignette processing. Lastly, the package was converted to the excellent tinytest unit test framework. Detailed changes below.

Changes in version 0.3.7 (2019-10-20)

  • A logic error was corrected in the wrapper class, vector and matrix memory is now properly free()’ed (Dirk in #22 fixing #20).

  • The introductory vignette is now premade (Dirk in #23), and was updated lightly in its bibliography handling.

  • The unit tests are now run by tinytest (Dirk in #24).

Courtesy of CRANberries, a summary of changes to the most recent release is also available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


(Much) faster unnesting with data.table


[This article was first published on Johannes B. Gruber on Johannes B. Gruber, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Today I was struggling with a relatively simple operation: unnest() from the tidyr package. What it’s supposed to do is pretty simple. When you have a data.frame where one or multiple columns are lists, you can unlist these columns while duplicating the information in other columns if the length of an element is larger than 1.

library(tibble)

df <- tibble(
  a = LETTERS[1:5],
  b = LETTERS[6:10],
  list_column = list(c(LETTERS[1:5]), "F", "G", "H", "I")
)
df
## # A tibble: 5 x 3
##   a     b     list_column
##   <chr> <chr> <list>     
## 1 A     F     <chr [5]>  
## 2 B     G     <chr [1]>  
## 3 C     H     <chr [1]>  
## 4 D     I     <chr [1]>  
## 5 E     J     <chr [1]>  
library(tidyr)

unnest(df, list_column)
## # A tibble: 9 x 3
##   a     b     list_column
##   <chr> <chr> <chr>      
## 1 A     F     A          
## 2 A     F     B          
## 3 A     F     C          
## 4 A     F     D          
## 5 A     F     E          
## 6 B     G     F          
## 7 C     H     G          
## 8 D     I     H          
## 9 E     J     I

I came across this a lot while working on data from Twitter since individual tweets can contain multiple hashtags, mentions, URLs and so on, which is why they are stored in lists. unnest() is really helpful and very flexible in my experience since it makes creating, for example, a table of top 10 hashtags a piece of cake.
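For instance, with a tweets data frame that has a hashtags list-column (both names are assumed here for illustration), a top-10 table is just:

library(dplyr)
library(tidyr)

tweets %>% 
  unnest(hashtags) %>% 
  count(hashtags, sort = TRUE) %>% 
  head(10)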

However, on large datasets, unnest() has its limitations (as I found out today). On a set with 1.8 million tweets, I was barely able to unnest the URL column and it would take forever on my laptop or simply crash at some point. In a completely new environment, unnesting the data took half an hour.

So let’s cut this time down to 10 seconds with data.table. In data.table, you would unlist like this1:

library(data.table)

dt <- as.data.table(df)

dt[, list(list_column = as.character(unlist(list_column))), by = list(a, b)]
##    a b list_column
## 1: A F           A
## 2: A F           B
## 3: A F           C
## 4: A F           D
## 5: A F           E
## 6: B G           F
## 7: C H           G
## 8: D I           H
## 9: E J           I

This is quite a bit longer than the tidyr code. So I wrapped it in a short function (note, that most of the code deals with quasiquotation so we can use it the same way as the original unnest()):

library(rlang)

unnest_dt <- function(tbl, col) {

  tbl <- as.data.table(tbl)

  col <- ensyms(col)

  clnms <- syms(setdiff(colnames(tbl), as.character(col)))

  tbl <- as.data.table(tbl)

  tbl <- eval(
    expr(tbl[, as.character(unlist(!!!col)), by = list(!!!clnms)])
  )

  colnames(tbl) <- c(as.character(clnms), as.character(col))

  tbl
}

On the surface, it does the same as unnest:

unnest_dt(df, list_column)
##    a b list_column
## 1: A F           A
## 2: A F           B
## 3: A F           C
## 4: A F           D
## 5: A F           E
## 6: B G           F
## 7: C H           G
## 8: D I           H
## 9: E J           I

But the function is extremely fast and lean. To show this, I do some benchmarking on a larger object. I scale the example ‘data.frame’ up from 5 to 50,000 rows since the overhead of loading a function will influence runtime much stronger on small-n data.

library(bench)

df_large <- dplyr::sample_frac(df, 10000, replace = TRUE)

res <- mark(
  tidyr = unnest(df_large, list_column),
  dt = unnest_dt(df_large, list_column)
)

res
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 tidyr         52.4s    52.4s    0.0191   16.77GB     6.38
## 2 dt           14.3ms   18.5ms   50.0       9.56MB    10.00
summary(res, relative = TRUE)
## # A tibble: 2 x 6
##   expression   min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 tidyr      3666.  2832.        1      1796.     1   
## 2 dt            1      1      2617.        1      1.57

As you can see, data.table is 3666 times faster. That is pretty insane. But what is often even more important, the memory consumption is negligible with the data.table function compared to tidyr. When trying to unnest my Twitter dataset with 1.8 million tweets, my computer would choke on the memory issue and even throw an error if I had some other large objects loaded.

Admittedly, the function is not perfect. It is far less flexible than unnest, especially since it only runs on one variable at a time. However, this covers 95% of my usage of unnest, and I would only consider including it in a script if performance is key.


  1. Source: this answer from @akrun: https://stackoverflow.com/a/40420690/5028841, which I think should be added to data.table’s documentation somewhere.↩


To leave a comment for the author, please follow the link and comment on their blog: Johannes B. Gruber on Johannes B. Gruber.



IPO Exploration

Inspired by recent headlines like “Fear Overtakes Greed in IPO Market after WeWork Debacle” and “This Year’s IPO Class is Least Profitable since the Tech Bubble”, today we’ll explore historical IPO data, and next time we’ll look at the performance of IPO-driven portfolios constructed during the ten-year period from 2004 to 2014. I’ll admit, I’ve often wondered how a portfolio that allocated money to new IPOs each year might perform, since this has to be an ultimate example of a few headline-gobbling whales dominating the collective consciousness. We hear a lot about a few IPOs each year, but there are dozens about which we hear nothing.

Here are the packages we’ll be using today.

library(tidyverse)
library(tidyquant)
library(dplyr)
library(plotly)
library(riingo)
library(roll)
library(tictoc)

Let’s get all the companies listed on the NASDAQ, NYSE, and AMEX exchanges and their IPO dates. That’s not every company that IPO’d in those years, of course, but we’ll go with it as a convenience for today’s purposes. Fortunately, the tq_exchange() function from tidyquant makes it painless to grab this data.

nasdaq % filter(!is.na(ipo.year))

company_ipo_sector %>% head()

# A tibble: 6 x 4
  symbol company                                ipo.year sector
1 TXG    10x Genomics, Inc.                         2019 Capital Goods
2 YI     111, Inc.                                  2018 Health Care
3 PIH    1347 Property Insurance Holdings, Inc.     2014 Finance
4 FLWS   1-800 FLOWERS.COM, Inc.                    1999 Consumer Services
5 BCOW   1895 Bancorp of Wisconsin, Inc.            2019 Finance
6 VNET   21Vianet Group, Inc.                       2011 Technology

Before we start implementing and testing portfolio strategies in next week’s post, let’s spend today on some exploration of this data set. We have the sector and IPO year of each company, and a good place to start is visualizing the number of IPOs by year. The key here is to call count(ipo.year), which will do exactly what we hope: give us a count of the number of IPOs by year.

company_ipo_sector %>% 
  group_by(ipo.year) %>% 
  count(ipo.year) %>% 
  tail()

# A tibble: 6 x 2
# Groups:   ipo.year [6]
  ipo.year     n
1     2014   258
2     2015   210
3     2016   184
4     2017   274
5     2018   397
6     2019   310

Then we want to pipe straight to ggplot() and put the new n column on the y-axis.

company_ipo_sector %>% 
  group_by(ipo.year) %>% 
  count(ipo.year) %>% 
  ggplot(aes(x = ipo.year, y = n)) +
  geom_col(color = "cornflowerblue") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  theme(axis.text.x = element_text(angle = 90))

I like that chart, but it would be nice to be able to hover on the bars and get some more information. Let’s wrap the whole code flow inside of the ggplotly() function from plotly, which will convert this to an interactive chart. The names of the columns will be displayed in the tooltip, so let’s use rename(`num IPOs` = n, year = ipo.year) to create better labels.

ggplotly(
  company_ipo_sector %>% 
  group_by(ipo.year) %>% 
  count(ipo.year) %>% 
  rename(`num IPOs` = n, year = ipo.year) %>% 
  ggplot(aes(x = year, y = `num IPOs`)) +
  geom_col(color = "cornflowerblue") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  theme(axis.text.x = element_text(angle = 90))
)

We see a big decline in 2008 due to the financial crisis, and a steady rise until 2014 when things jump, but that might be due to the fact that since 2014, not as many companies have had a chance to be delisted. I’ll leave it to an IPO maven to explain things further. I did come across this treasure trove of data on the IPO market for the curious.

There’s a lot of interesting stuff in there, but one thing to note about this data source and others I stumbled upon is that IPO data tends to focus on companies with a certain market cap, generally greater than $50 million. We didn’t make any cutoff based on market cap, and thus will have more observations than you might find if you Google something like ‘number of IPOs in year XXXX’. For the curious, I’ll post how to create this market cap filter on LinkedIn, and more importantly, it does set off some neurons in my brain to think that researchers tend to focus on IPOs of a certain market cap. That usually means there’s weird data stuff going on in the ignored area, or it’s risky, or it’s not worth the time to institutional investors because of market structure issues - or any of a host of reasons to investigate the stuff that other people find unattractive.

Let’s get back on course and chart IPOs by sector by year. Instead of using count, we’ll use add_count(), which is a short-hand for group_by() + add_tally().

company_ipo_sector %>% 
  group_by(ipo.year, sector) %>% 
  select(ipo.year, sector) %>% 
  add_count(ipo.year, sector) %>% 
  slice(1) %>% 
  filter(ipo.year > 2003)

# A tibble: 193 x 3
# Groups:   ipo.year, sector [193]
   ipo.year sector     n

Bootstrapping time series for improving forecasting accuracy


[This article was first published on Peter Laurinec, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Bootstrapping time series? It is meant in the sense that we generate multiple new training datasets for statistical forecasting methods like ARIMA or triple exponential smoothing (the Holt-Winters method etc.) to improve forecasting accuracy. It is called bootstrapping, and after applying the forecasting method to each new time series, the forecasts are aggregated by average or median – then it is bagging – bootstrap aggregating. It has been proven by multiple methods, e.g. in regression, that bagging helps improve predictive accuracy – in methods like classical bagging, random forests, gradient boosting methods and so on. Bagging methods for time series forecasting were also used in the latest M4 forecasting competition. For residential electricity consumption (load) time series (as used in my previous blog posts), I proposed three new bootstrapping methods for time series forecasting. The first one is an enhancement of the method originally proposed by Bergmeir – link to article – and the other two are clustering-based methods. I also combined classical bagging for regression trees and time series bagging to create ensemble forecasts – I will cover it in some future post. These methods are all covered in the journal article entitled: Density-based Unsupervised Ensemble Learning Methods for Time Series Forecasting of Aggregated or Clustered Electricity Consumption.

In this blog post, I will cover:

  • an introduction to the bootstrapping of time series,
  • new bootstrapping methods will be introduced,
  • extensive experiments with 7 forecasting methods and 4 bootstrapping methods will be described and analysed on a part of M4 competition dataset.

Bootstrapping time series data

Firstly, read the M4 competition data, which comprise 100 thousand time series, and load all the needed packages.

library(M4comp2018) # M4 data,
# install package by devtools::install_github("carlanetto/M4comp2018")
library(data.table) # manipulating the data
library(TSrepr) # forecasting error measures
library(forecast) # forecasting and bootstrapping methods
library(ggplot2) # graphics
library(ggsci) # colours
library(clusterCrit) # int.validity indices

data(M4)

I will use only time series with hourly frequency (so the highest frequency in the M4 dataset; they can have daily and also weekly seasonality), because at work I also use time series with double seasonality (quarter-hourly or half-hourly data). The high (or double) period (frequency) also means higher complexity and challenge, therefore it is great for this use case 🙂

hourly_M4<-Filter(function(l)l$period=="Hourly",M4)

Let’s plot random time series from the dataset:

theme_ts <- theme(panel.border = element_rect(fill = NA, colour = "grey10"),
                  panel.background = element_blank(),
                  panel.grid.minor = element_line(colour = "grey85"),
                  panel.grid.major = element_line(colour = "grey85"),
                  panel.grid.major.x = element_line(colour = "grey85"),
                  axis.text = element_text(size = 13, face = "bold"),
                  axis.title = element_text(size = 15, face = "bold"),
                  plot.title = element_text(size = 16, face = "bold"),
                  strip.text = element_text(size = 16, face = "bold"),
                  strip.background = element_rect(colour = "black"),
                  legend.text = element_text(size = 15),
                  legend.title = element_text(size = 16, face = "bold"),
                  legend.background = element_rect(fill = "white"),
                  legend.key = element_rect(fill = "white"),
                  legend.position = "bottom")

ggplot(data.table(Time = 1:length(hourly_M4[[1]]$x),
                  Value = as.numeric(hourly_M4[[1]]$x))) +
  geom_line(aes(Time, Value)) +
  labs(title = hourly_M4[[1]]$st) +
  theme_ts

plot of chunk unnamed-chunk-4

Bootstrap aggregating (bagging) is an ensemble meta-algorithm (introduced by Breiman in 1996), which creates multiple versions of the learning set to produce multiple predictions. These predictions are then aggregated, for example by the arithmetic mean. For time-dependent data combined with statistical forecasting methods, classical bagging – that is, sampling (bootstrapping) with replacement – can’t be used. We have to sample the data in a more sophisticated way, based on seasonality or something similar. One of the used bootstrapping methods is the Moving Block Bootstrap (MBB), which uses a block (defined by seasonality, for example) for creating new series. However, we don’t bootstrap the whole time series as it is, but only its remainder part from STL decomposition (this bootstrapping method was proposed by Bergmeir et al. in 2016).

This method is implemented in the forecast package in bld.mbb.bootstrap function, let’s use it on one time series from M4 competition dataset:

period <- 24*7 # weekly period
data_ts <- as.numeric(hourly_M4[[1]]$x)

data_boot_mbb <- bld.mbb.bootstrap(ts(data_ts, freq = period), 100)

data_plot <- data.table(Value = unlist(data_boot_mbb),
                        ID = rep(1:length(data_boot_mbb), each = length(data_ts)),
                        Time = rep(1:length(data_ts), length(data_boot_mbb)))

ggplot(data_plot) +
  geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
  geom_line(data = data_plot[.(1), on = .(ID)], aes(Time, Value),
            color = "firebrick1", alpha = 0.9, size = 0.8) +
  theme_ts

plot of chunk unnamed-chunk-5

We can see that where the values of the time series are low, the bld.mbb method fails to bootstrap new values around the original ones (red line). Notice also that the variance of the bootstrapped time series is very high, with values deviating by more than 30% from the original values in places.

For this reason, I proposed a smoothed version of the bld.mbb method for reducing variance (Laurinec et al. 2019). I smoothed the remainder part from STL decomposition by exponential smoothing, so extreme noise was removed. Let’s use it (source code is available on my GitHub repo):

data_boot_smbb <- smo.bootstrap(ts(data_ts, freq = period), 100)

data_plot <- data.table(Value = unlist(data_boot_smbb),
                        ID = rep(1:length(data_boot_smbb), each = length(data_ts)),
                        Time = rep(1:length(data_ts), length(data_boot_smbb)))

ggplot(data_plot) +
  geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
  geom_line(data = data_plot[.(1), on = .(ID)], aes(Time, Value),
            color = "firebrick1", alpha = 0.9, size = 0.8) +
  theme_ts

plot of chunk unnamed-chunk-6

We can see that the variance was nicely lowered, but the final time series are sometimes really different from the original. It is not good when we want to use them for forecasting.

Therefore, I developed (designed) another two bootstrapping methods based on K-means clustering. The first step of the two methods is identical – automatic clustering of univariate time series – where automatic means that it estimates the number of clusters from a defined range of clusters by Davies-Bouldin index.
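As a rough illustration of that automatic step (this is a simplified sketch, not the exact code inside KMboot() or KMboot.norm()), the number of clusters can be picked by minimising the Davies-Bouldin index computed with clusterCrit:

est_k <- function(x, k_range = 8:12) {
  x_mat <- matrix(as.numeric(x), ncol = 1)
  db <- sapply(k_range, function(k) {
    km <- kmeans(x_mat, centers = k, nstart = 5)
    intCriteria(x_mat, as.integer(km$cluster), "Davies_Bouldin")$davies_bouldin
  })
  k_range[which.min(db)]  # lower Davies-Bouldin index means better clustering
}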

The first method, after clustering, samples new values from the members of each cluster, so it doesn’t create values that were not observed in the original series. Let’s plot the results:

data_boot_km <- KMboot(ts(data_ts, freq = period), 100, k_range = c(8, 10))

data_plot <- data.table(Value = unlist(data_boot_km),
                        ID = rep(1:length(data_boot_km), each = length(data_ts)),
                        Time = rep(1:length(data_ts), length(data_boot_km)))

ggplot(data_plot) +
  geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
  geom_line(data = data_plot[.(1), on = .(ID)], aes(Time, Value),
            color = "firebrick1", alpha = 0.9, size = 0.8) +
  theme_ts

plot of chunk unnamed-chunk-7

We can lower the range of the number of clusters to be selected in order to increase the variance of the bootstrapped time series. Let’s try increasing the number of clusters to decrease the variance:

data_boot_km <- KMboot(ts(data_ts, freq = period), 100, k_range = c(14, 20))

data_plot <- data.table(Value = unlist(data_boot_km),
                        ID = rep(1:length(data_boot_km), each = length(data_ts)),
                        Time = rep(1:length(data_ts), length(data_boot_km)))

ggplot(data_plot) +
  geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
  geom_line(data = data_plot[.(1), on = .(ID)], aes(Time, Value),
            color = "firebrick1", alpha = 0.9, size = 0.8) +
  theme_ts

plot of chunk unnamed-chunk-8

We can see that the variance is much lower than with the MBB-based methods, but we still have the ability to increase it by changing the range of the number of clusters.

The second proposed K-means based method is sampling new values randomly from Gaussian distribution based on parameters of created clusters (mean and variance). Let’s use it on our selected time series:

data_boot_km.norm <- KMboot.norm(ts(data_ts, freq = period), 100, k_range = c(12, 20))

data_plot <- data.table(Value = unlist(data_boot_km.norm),
                        ID = rep(1:length(data_boot_km.norm), each = length(data_ts)),
                        Time = rep(1:length(data_ts), length(data_boot_km.norm)))

ggplot(data_plot) +
  geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
  geom_line(data = data_plot[.(1), on = .(ID)], aes(Time, Value),
            color = "firebrick1", alpha = 0.9, size = 0.8) +
  theme_ts

plot of chunk unnamed-chunk-9

We can see nicely distributed values around the original time series, but will it be beneficial for forecasting?

We can also check all four bootstrapping methods in one plot using a wrapper around the above calls:

print_boot_series <- function(data, ntimes = 100, k_range = c(12, 20)) {

  data_boot_1 <- bld.mbb.bootstrap(data, ntimes)
  data_boot_2 <- smo.bootstrap(data, ntimes)
  data_boot_3 <- KMboot(data, ntimes, k_range = k_range)
  data_boot_4 <- KMboot.norm(data, ntimes, k_range = k_range)

  datas_all <- data.table(Value = c(unlist(data_boot_1), unlist(data_boot_2),
                                    unlist(data_boot_3), unlist(data_boot_4)),
                          ID = rep(rep(1:ntimes, each = length(data)), 4),
                          Time = rep(rep(1:length(data), ntimes), 4),
                          Method = factor(rep(c("MBB", "S.MBB", "KM", "KM.boot"),
                                              each = ntimes * length(data))))

  datas_all[, Method := factor(Method, levels(Method)[c(3, 4, 1, 2)])]

  print(ggplot(datas_all) +
          facet_wrap(~Method, ncol = 2) +
          geom_line(aes(Time, Value, group = ID), alpha = 0.5) +
          geom_line(data = datas_all[.(1), on = .(ID)], aes(Time, Value),
                    color = "firebrick1", alpha = 0.9, size = 0.8) +
          theme_ts)
}

print_boot_series(ts(hourly_M4[[200]]$x, freq = period))

plot of chunk unnamed-chunk-10

When the time series is non-stationary, the clustering-based bootstrapping methods can oscillate somewhat badly. They could be enhanced, for example by differencing, to be competitive (applicable) for non-stationary time series with a strong linear trend.

Forecasting with bootstrapping

To evaluate the four presented bootstrapping methods for time series and see which is the most competitive in general, experiments with 6 statistical forecasting methods were performed on all 414 hourly time series from the M4 competition dataset. Forecasts from bootstrapped time series were aggregated by the median. Simple base methods were also evaluated, alongside the seasonal naive forecast. The six chosen statistical base forecasting methods were:

  • STL+ARIMA,
  • STL+ETS (both forecast package),
  • triple exponential smoothing with damped trend (smooth package – named ES (AAdA)),
  • Holt-Winters exponential smoothing (stats package),
  • dynamic optimized theta model (forecTheta package – named DOTM),
  • and standard theta model (forecast package).

Forecasting results were evaluated by sMAPE and by ranks sorted according to the sMAPE forecasting accuracy measure.
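For clarity, sMAPE here refers to the usual symmetric MAPE; one common definition (TSrepr ships an smape() function for this purpose) is:

smape_def <- function(real, forecast) {
  100 * mean(2 * abs(real - forecast) / (abs(real) + abs(forecast)))
}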

The source code that generates forecasts from the all mentioned methods above and process results are available on my GitHub repository.

The first visualization of results shows mean sMAPE by base forecasting method:

ggplot(res_all[, .(sMAPE = mean(sMAPE, na.rm = TRUE)), by = .(Base_method, Boot_method)]) +
  facet_wrap(~Base_method, ncol = 3, scales = "fix") +
  geom_bar(aes(Boot_method, sMAPE, fill = Boot_method, color = Boot_method),
           alpha = 0.8, stat = "identity") +
  geom_text(aes(Boot_method, y = sMAPE + 1, label = paste(round(sMAPE, 2)))) +
  scale_fill_d3() +
  scale_color_d3() +
  theme_ts +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

plot of chunk unnamed-chunk-11

The best methods based on average sMAPE are STL+ETS with s.MBB and DOTM with s.MBB bootstrapping. Our proposed s.MBB bootstrapping method won 5 times out of 6 based on average sMAPE.

The next graph shows a boxplot of ranks:

ggplot(res_all) +
  facet_wrap(~Base_method, ncol = 3, scales = "fix") +
  geom_boxplot(aes(Boot_method, Rank, fill = Boot_method), alpha = 0.8) +
  scale_fill_d3() +
  theme_ts +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

plot of chunk unnamed-chunk-12

We can see very tight results amongst the base learners and the MBB and s.MBB methods.

So, let’s see average Rank instead:

ggplot(res_all[, .(Rank = mean(Rank, na.rm = TRUE)), by = .(Base_method, Boot_method)]) +
  facet_wrap(~Base_method, ncol = 3, scales = "fix") +
  geom_bar(aes(Boot_method, Rank, fill = Boot_method, color = Boot_method),
           alpha = 0.8, stat = "identity") +
  geom_text(aes(Boot_method, y = Rank + 0.22, label = paste(round(Rank, 2))), size = 5) +
  scale_fill_d3() +
  scale_color_d3() +
  theme_ts +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

plot of chunk unnamed-chunk-13

As seen in the previous plot, the picture changes and depends on the forecasting method: the original MBB method has very slightly better results than my s.MBB, but both beat the base learner 5 times out of 6. The KM-based bootstraps failed to beat the base learner 4 times out of 6.

Let’s be more curious about the distribution of ranks… I will show you a plot of rank counts:

ggplot(res_all[, .(N = .N), by = .(Base_method, Boot_method, Rank)][, Rank := factor(Rank)],
       aes(Boot_method, N)) +
  facet_wrap(~Base_method, ncol = 3, scales = "free") +
  geom_bar(aes(fill = Rank, color = Rank), alpha = 0.75,
           position = "dodge", stat = "identity") +
  scale_fill_d3() +
  scale_color_d3() +
  theme_ts +
  theme(axis.text.x = element_text(size = 10, angle = 60, hjust = 1))

plot of chunk unnamed-chunk-14

We can see that KM-based bootstrapping can be first or second too! These two bootstrapping methods came first around 25 times for each forecasting method, and second around twice as often… not that bad. Also, the seasonal naive forecast can be the best on some types of time series against more sophisticated statistical forecasting methods… However, the MBB and s.MBB methods are best and most of the time occupy the first two ranks.

Conclusion

In this blog post, I showed you how to simply use four various bootstrapping methods on time series data. Then, the results of experiments performed on the M4 competition dataset were shown. Results suggest that bootstrapping can be very useful for improving forecasting accuracy.

References

[1] Bergmeir C., Hyndman R.J., Benitez J.M.: Bagging exponential smoothing methods using stl decomposition and box-cox transformation. International Journal of Forecasting, 32(2):303-312, (2016)

[2] Laurinec P., Loderer M., Lucka M. et al.: Density-based unsupervised ensemble learning methods for time series forecasting of aggregated or clustered electricity consumption, Journal Intelligent Information Systems, 53(2):219-239, (2019). DOI link


To leave a comment for the author, please follow the link and comment on their blog: Peter Laurinec.


rmangal: making ecological networks easily accessible


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In early September, version 2.0.0 of rmangal was approved by rOpenSci; four weeks later it made it to CRAN. Following up on our experience, we detail below the reasons why we wrote rmangal, why we submitted our package to rOpenSci, and how the peer review improved our package.

Mangal, a database for ecological networks

Ecological networks are defined as a set of species populations (the nodes of the network) connected through ecological interactions (the edges). Interactions are ecological processes in which one species affects another. Although predation is probably the best-known and best-documented interaction, other less noticeable associations are just as essential to ecosystem functioning. For instance, a mammal that unintentionally disperses viable seeds attached to its fur might help plants to thrive. All of these interactions occur simultaneously, shaping ecosystem functioning and making ecosystems as complex as they are fascinating.

Recording and properly storing these interactions help ecologists to better understand ecosystems. That is why they are currently compiling datasets to explore how species associations vary over environmental gradients and how species loss might affect ecosystem functioning. This fundamental research question should help us understand how ecological networks will respond to global change. To this end, the Mangal project https://mangal.io/#/ standardizes ecological networks and eases access to them. Every dataset contains a collection of networks described in a specific reference (a scientific publication, a book, etc.). For every network included in the database, Mangal includes all the species names and several taxonomic identifiers (gbif, eol, tsn, etc.) as well as all interactions and their types. Currently, Mangal includes 172 datasets, which represent over 1300 ecological networks distributed worldwide.

An R client to make ecological networks easily accessible

In 2016, the first paper describing the project was published1. In 2018, a substantial effort was made to improve the data structure and gather new networks from existing publications. In 2019, the web API was rewritten, a new website was launched, and hundreds of new interactions were added.

Because of all these modifications, the first version of rmangal was obsolete and a new version was needed. It is worth explaining here why the R client is an important component of the Mangal project. Even though Mangal has a documented RESTful API, this web technology is not commonly used by ecologists. On the contrary, providing an R client ensures that the scientific community that documents these interactions in the field can access them as easily as possible. The same argument holds true for the Julia client that Timothée Poisot wrote, because Julia is increasingly popular among theoreticians, who can test ecological theory with such datasets.

We had two main objectives for rmangal 2.0.0. First, the rmangal package had to allow users to search for all entries in the database in a very flexible way. From a technical point of view, this means that we had to write functions to query all the endpoints of the new web API. The second goal was to make the package as user-friendly as possible. To do so, we used explicit and consistent names for functions and arguments. We then designed a simple workflow and documented how to use other field-related packages (such as igraph) to visualize and analyze networks (see below). You can find further details in the vignette “get started with rmangal”.

# Loading dependencies
library(rmangal)
library(magrittr)
library(ggraph)
library(tidygraph)

# Retrieving all ecological networks documented in Havens, 1992
havens_1992 <- search_references(doi = "10.1126/science.257.5073.1107") %>%
    get_collection()

# Coerce and visualize the first network object returned by mangal with ggraph
ggraph(as_tbl_graph(havens_1992[[1]])) +
    geom_edge_link(aes(colour = factor(type))) +
    geom_node_point() +
    theme_graph(background = "white")

A successful peer review process

After some hard work behind the screen, and once we deemed our two objectives achieved, we decided to submit the rmangal package to rOpenSci for peer review. We did so because we needed feedback; we needed qualified people to critically assess whether our two main objectives were achieved. Given the strong expertise of rOpenSci in software review, and given that our package was in scope, submitting rmangal to rOpenSci was an obvious choice.

We had very valuable feedback from Anna Willoughby and Thomas Lin Pedersen. They carefully assessed our work and pointed out areas where improvement was required. One good example of how their review made our package better concerns the dependencies. We originally listed sf in Imports, as we used it to filter networks based on geographic coordinates. But the reviewers pointed out that this was not an essential part of the package and that sf has several dependencies. This made us realize that for one extra feature, we were substantially increasing the number of indirect dependencies. Following the reviewers’ suggestions, we moved sf to Suggests and added a message to warn users that the spatial filtering feature requires sf to be installed. Similarly, based on another good comment, we added a function to convert Mangal networks into tidygraph objects and we documented how to plot Mangal networks with ggraph (and so we added those packages in Suggests). Such improvements were very helpful to properly connect rmangal to the existing R packages. The plethora of packages is one of R's major strengths, and connecting a package properly to others makes the entire ecosystem even stronger.
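To illustrate the kind of guard this implies, here is a generic sketch with a hypothetical function name and body; it is not the actual rmangal code, only the common requireNamespace() pattern for a Suggests dependency.

# Hypothetical helper: only use sf if it is available, otherwise inform the
# user and return the input unchanged.
filter_networks_by_area <- function(networks, area) {
  if (!requireNamespace("sf", quietly = TRUE)) {
    message("The 'sf' package is required for spatial filtering. ",
            "Please install it with install.packages('sf').")
    return(networks)
  }
  # ... spatial filtering based on sf geometries would happen here ...
  networks
}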

We are now looking for user experience feedback, not only for rmangal (vignette) but also for the web API (documentation) and the mangal.io website. We welcome suggestions and contributions, especially for the documentation, by opening new issues on GitHub (mangal-api, mangal-app, rmangal). In the future, we envision that rmangal will integrate functions to format ecological networks for ecologists willing to add their datasets to Mangal. This will likely be the next major release of rmangal.

Acknowledgments

We are thankful to all contributors to rmangal and to all ecologists that have spent countless hours in collecting data. We would like to thank Anna Willoughby and Thomas Lin Pedersen for thorough reviews as well as Noam Ross for handling the review process.


  1. Poisot, T. et al. mangal – making ecological network analysis simple. Ecography 39, 384–390 (2016). https://doi.org/10.1111/ecog.00976

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


Avoiding embarrassment by testing data assumptions with expectdata


[This article was first published on Dan Garmat's Blog -- R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Expectdata is an R package that makes it easy to test assumptions about a data frame before conducting analyses. Below is a concise tour of some of the data assumptions expectdata can test for you: for example, unexpected duplication, problematic rows, and missing values.

Note: assertr is an rOpenSci project that aims to provide similar functionality. Pros and cons haven’t been evaluated yet, but rOpenSci backing is a big pro for assertr.
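For a rough comparison, similar checks could be written with assertr along these lines; this is a sketch from memory of its verify()/assert() API and should be double-checked against the assertr documentation.

library(assertr)
library(dplyr)

# verify() checks a logical expression evaluated within the data frame;
# assert() applies a predicate (not_na, is_uniq, ...) to selected columns.
mtcars %>%
  verify(cyl %in% c(4, 6, 8)) %>%  # passes
  assert(not_na, cyl) %>%          # passes
  assert(is_uniq, wt)              # errors, since some wt values repeat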

Check for unexpected duplication

library(expectdata)
expect_no_duplicates(mtcars, "cyl")
#> [1] "top duplicates..."
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl     n
#> 1     8    14
#> 2     4    11
#> 3     6     7
#> Error: Duplicates detected in column: cyl

The default return_df = TRUE option allows these functions to be used as part of a dplyr piped expression that is stopped when data assumptions are not met.

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

mtcars %>%
  filter(cyl == 4) %>%
  expect_no_duplicates("wt", return_df = TRUE) %>%
  ggplot(aes(x = wt, y = hp, color = mpg, size = mpg)) +
  geom_point()
#> [1] "no wt duplicates...OK"

If there are no expectations violated, an “OK” message is printed.

After joining two data sets, you may want to verify that no unintended duplication occurred. Expectdata allows comparing the pre- and post-processing data frames to ensure they have the same number of rows before continuing.
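For instance, a hypothetical join check might look like the sketch below (my illustration, not from the post), before the simpler comparisons shown next; the lookup table and joined object are invented for the example.

# A left join against a unique lookup should not change the row count.
lookup <- tibble::tibble(cyl = c(4, 6, 8), cyl_label = c("four", "six", "eight"))
joined <- dplyr::left_join(mtcars, lookup, by = "cyl")

expect_same_number_of_rows(mtcars, joined, return_df = FALSE)
#> [1] "Same number of rows...OK"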

expect_same_number_of_rows(mtcars, mtcars, return_df = FALSE)
#> [1] "Same number of rows...OK"

expect_same_number_of_rows(mtcars, iris, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE)
#> Warning: Different number of rows: 32 vs: 150

# can also compare to no df2 to check there are zero rows
expect_same_number_of_rows(mtcars, show_fails = FALSE, stop_if_fail = FALSE, return_df = FALSE)
#> Warning: Different number of rows: 32 vs: 0

You can see how the stop_if_fail = FALSE option turns failed expectations into warnings instead of errors.

Check for existence of problematic rows

Comparing a data frame to an empty, zero-length data frame can also be done more explicitly. If an expectation fails, the offending cases can be shown as a starting point for exploring why they appeared.

expect_zero_rows(mtcars[mtcars$cyl == 0, ], return_df = TRUE)
#> [1] "No rows found as expected...OK"
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)

expect_zero_rows(mtcars$cyl[mtcars$cyl == 0])
#> [1] "No rows found as expected...OK"
#> numeric(0)

expect_zero_rows(mtcars, show_fails = TRUE)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#> Error: Different number of rows: 32 vs: 0

This works well at the end of a pipeline that starts with a data frame, runs some logic to filter to cases that should not exist, and then runs expect_zero_rows() to check that no such cases exist.

# verify no cars have zero cylinders
mtcars %>%
  filter(cyl == 0) %>%
  expect_zero_rows(return_df = FALSE)
#> [1] "No rows found as expected...OK"

You can also check for NAs in a vector, in specific columns of a data frame, or in a whole data frame.

expect_no_nas(mtcars, "cyl", return_df = FALSE)
#> [1] "Detected 0 NAs...OK"

expect_no_nas(mtcars, return_df = FALSE)
#> [1] "Detected 0 NAs...OK"

expect_no_nas(c(0, 3, 4, 5))
#> [1] "Detected 0 NAs...OK"
#> [1] 0 3 4 5

expect_no_nas(c(0, 3, NA, 5))
#> Error: Detected 1 NAs

Several checks can be combined in one dplyr pipe expression:

mtcars %>%
  expect_no_nas(return_df = TRUE) %>%
  expect_no_duplicates("wt", stop_if_fail = FALSE) %>%
  filter(cyl == 4) %>%
  expect_zero_rows(show_fails = TRUE)
#> [1] "Detected 0 NAs...OK"
#> [1] "top duplicates..."
#> # A tibble: 2 x 2
#> # Groups:   wt [2]
#>      wt     n
#> 1  3.44     3
#> 2  3.57     2
#> Warning: Duplicates detected in column: wt
#>    mpg cyl  disp hp drat    wt  qsec vs am gear carb
#> 1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
#> 2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
#> 3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
#> 4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
#> 5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
#> 6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
#> Error: Different number of rows: 11 vs: 0

To leave a comment for the author, please follow the link and comment on their blog: Dan Garmat's Blog -- R.


Widening Multiple Columns Redux


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last year I wrote about the slightly tedious business of spreading (or widening) multiple value columns in Tidyverse-flavored R. Recent updates to the tidyr package, particularly the introduction of the pivot_wider() and pivot_longer() functions, have made this rather more straightforward to do than before. Here I recapitulate the earlier example with the new tools.

The motivating case is something that happens all the time when working with social science data. We’ll load the tidyverse, and then quickly make up some sample data to work with.

library(tidyverse)

gen_cats <- function(x, N = 1000) {
  sample(x, N, replace = TRUE)
}

set.seed(101)
N <- 1000

income <- rnorm(N, 100, 50)

vars <- list(stratum = c(1:8),
             sex = c("M", "F"),
             race = c("B", "W"),
             educ = c("HS", "BA"))

df <- as_tibble(map_dfc(vars, gen_cats))
df <- add_column(df, income)

What we have are measures of sex, race, stratum (from a survey, say), education, and income. Of these, everything is categorical except income. Here’s what it looks like:


df

## # A tibble: 1,000 x 5
##    stratum sex   race  educ  income
##  1       6 F     W     HS      83.7
##  2       5 F     W     BA     128.
##  3       3 F     B     HS      66.3
##  4       3 F     W     HS     111.
##  5       6 M     W     BA     116.
##  6       7 M     B     HS     159.
##  7       8 M     W     BA     131.
##  8       3 M     W     BA      94.4
##  9       7 F     B     HS     146.
## 10       2 F     W     BA      88.8
## # … with 990 more rows

Let’s say we want to transform this to a wider format, specifically by widening the educ column, so we end up with columns for both the HS and BA categories, and as we do so we want to calculate both the mean of income and the total n within each category of educ.

For comparison, one could do this with data.table in the following way:



data.table::setDT(df)
df_wide_dt <- data.table::dcast(df, sex + race + stratum ~ educ,
                                fun = list(mean, length),
                                value.var = "income")

head(df_wide_dt)
##    sex race stratum income_mean_BA income_mean_HS income_length_BA income_length_HS
## 1:   F    B       1       93.78002       99.25489               19                 6
## 2:   F    B       2       89.66844       93.04118               11                16
## 3:   F    B       3      112.38483       94.99198               13                16
## 4:   F    B       4      107.57729       96.06824               14                15
## 5:   F    B       5       91.02870       92.56888               11                15
## 6:   F    B       6       92.99184      116.06218               15                15

Until recently, widening or spreading on multiple values like this was kind of a pain when working in the tidyverse. You can see how I approached it before in the earlier post. (The code there still works fine.) Previously, you had to put spread() and gather() through a slightly tedious series of steps, best wrapped in a function you’d have to write yourself. No more! Now that tidyr v1.0.0 has been released, the new pivot_wider() function (and its complement, pivot_longer()) makes this common operation more accessible.
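For the curious, the pre-1.0.0 approach looked roughly like this; it is a sketch of the gather/unite/spread idiom under tidyr's older semantics, not the exact helper function from the earlier post.

# Old-style widening on two value columns: summarize, gather the measures,
# unite them with the educ key, then spread. Superseded by pivot_wider().
tv_old <- df %>%
  group_by(sex, race, stratum, educ) %>%
  summarize(mean_inc = mean(income), n = n()) %>%
  gather(measure, value, mean_inc, n) %>%
  unite(temp, measure, educ) %>%
  spread(temp, value)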

Here’s how to do it now. Remember that in the tidyverse approach, we’ll first do the summary calculations, mean and length, respectively, though we’ll use dplyr’s n() for the latter. Then we widen the long result.



tv_pivot <- df %>%
    group_by(sex, race, stratum, educ) %>%
    summarize(mean_inc = mean(income),
              n = n()) %>%
    pivot_wider(names_from = educ,
                values_from = c(mean_inc, n))

This gives us an object that’s equivalent to the df_wide_dt object created by data.table.



tv_pivot

## # A tibble: 32 x 7
## # Groups:   sex, race, stratum [32]
##    sex   race  stratum mean_inc_BA mean_inc_HS  n_BA  n_HS
##  1 F     B           1        93.8        99.3    19     6
##  2 F     B           2        89.7        93.0    11    16
##  3 F     B           3       112.         95.0    13    16
##  4 F     B           4       108.         96.1    14    15
##  5 F     B           5        91.0        92.6    11    15
##  6 F     B           6        93.0       116.     15    15
##  7 F     B           7       102.        121.     13    13
##  8 F     B           8       105.         88.3    14     8
##  9 F     W           1        92.6       110.     19    13
## 10 F     W           2        98.5       101.     15    19
## # … with 22 more rows

And there you have it. Be sure to check out pivot_longer(), the complement of pivot_wider(), as well.
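As a quick round-trip illustration (my own sketch, not part of the original post), pivot_longer() can take tv_pivot back to a long layout by splitting the column names into a measure and an educ level.

# ".value" tells pivot_longer that the first captured group names the output
# measure columns (mean_inc, n); the second captured group becomes educ.
tv_long <- tv_pivot %>%
  pivot_longer(
    cols = -c(sex, race, stratum),
    names_to = c(".value", "educ"),
    names_pattern = "(mean_inc|n)_(.*)"
  )

tv_long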


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.



