
Tidy Troubles and SwimmeR does the Splits – v0.6.0

[This article was first published on Swimming + Data Science, and kindly contributed to R-bloggers.]

If you're only going to read one line of this post, let it be this one: v0.6.0 of SwimmeR is now available from CRAN and it's better now than ever before. Don't stop at just one line though; grab that update and come along on this journey, deep into the heart of SwimmeR.

library(SwimmeR)
library(dplyr)
library(stringr)
library(flextable)

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bold header
    bg(bg = "#D3D3D3", part = "header") %>% # gray background behind the header row
    align_nottext_col(align = "center", header = TRUE, footer = TRUE) %>% # center alignment
    autofit()
}

The rest of this post, which I promise will be super interesting and even have jokes, is about the tidy framework, inconsistencies in source material, subcategories and my failure to get all those things to play nicely together in the context of SwimmeR. I realize calling it a failure makes it sound like I was lying a few lines ago when I said that SwimmeR v0.6.0 is better than ever before, but I wasn’t. Don’t jump to conclusions.


Breaking Change

Before going any further, though, there's one important breaking change. The output of swim_parse has been modified such that the Grade column is renamed to Age and the School column is renamed to Team. These are frankly overdue changes, which reflect the broader applicability of SwimmeR outside the American high school swimming arena I first developed it for. My apologies, and please adjust your workflows.


Tidy Data

Tidy is a way of organizing data that’s been championed by Hadley Wickham. He’s, I think indisputably, the most famous proponent of R, and from his post as Chief Scientist at RStudio he’s in a prime position to drive his personal philosophies forward. One of those philosophies is tidy data. He also has a distinctive first name, so you’ll often see him referred to simply as Hadley, like he’s Kobe or Madonna or something, but that’s neither here nor there.

Tidy data has several benefits.

  1. It's easy to explain. Tidy data is simply data where each row is a distinct observation, or event (like a swim), and each column contains one and only one variable, like a name or a time (see the small example after this list).

  2. It’s easy to represent in a table, or even a spreadsheet.

  3. It’s super easy to work with in R. This is arguably because of tidy data’s relative simplicity of structure, but it’s also in some part because Hadley Wickham/RStudio release tons of free, excellent, packages that use (and enforce the norms of) tidy data.
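As promised above, here's a tiny, made-up illustration of what tidy data looks like for swimming results (the names are borrowed from results shown later in this post): one row per swim, one column per variable.

library(tibble)

# A made-up example of tidy swimming data: each row is one swim,
# each column holds one and only one variable
tidy_swims <- tribble(
  ~Name,             ~Team,            ~Event,                   ~Finals_Time,
  "Chung, Hudson H", "Horace Greeley", "Men 100 Yard Butterfly", "51.98",
  "Laidlaw, John F", "Horace Greeley", "Men 100 Yard Butterfly", "52.15"
)

tidy_swims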

In spite of what I see as tidy data's real utility, bemoaning the influence of Hadley Wickham/RStudio is something of a sport in the R blog-o-sphere. This is the Swimming + Data Science blog though, and the only sport we care about here is swimming (and also diving). I've got my hands way too full dealing with getting SwimmeR releases out, the International Swimming League changing their results formats for no reason, and breaststrokers trying to cheat all the time to be complaining about someone releasing free tools. I need all the help I can get.

Back to SwimmeR v0.6.0 and its many improvements…


Splits

SwimmeR can now read split times. Both swim_parse and swim_parse_ISL have a new argument called splits. Setting splits = TRUE will direct each function to attempt to read in splits. The default is FALSE, so splits won't be collected unless you, the user, decide they should be.

In swim_parse there's another argument, split_length. This is the pool length at which the splits are recorded. It's usually 50 (the default).

file_50 <- read_results("http://www.section1swim.com/Results/BoysHS/2020/Sec1/Single.htm")
df_50 <- swim_parse(file_50, splits = TRUE)

df_50 %>%
  filter(Event == "Men 100 Yard Butterfly") %>%
  select(Place, Name, Finals_Time, Split_50:Split_100) %>%
  head(3) %>%
  flextable_style()

Place  Name                Finals_Time  Split_50  Split_100
1      Chung, Hudson H     51.98        24.82     27.16
2      Laidlaw, John F     52.15        24.91     27.24
3      Sakharuk, Nikita V  52.71        24.73     27.98

Sometimes, though, split_length should be set to 25, because splits are taken every 25 meters or yards.

file_25 <- read_results("https://github.com/gpilgrim2670/Pilgrim_Data/raw/master/SwimmeR%20Demo%20Files/7th-snsc-scm-2017-full-results.pdf")
df_25 <- swim_parse(file_25, avoid = c("MR\\:", "NR\\:"), splits = TRUE, split_length = 25)

df_25 %>%
  filter(Event == "Women 100 SC Meter IM") %>%
  select(Place, Name, Finals_Time, Split_25:Split_100) %>%
  head(3) %>%
  flextable_style()

Place  Name                  Finals_Time  Split_25  Split_50  Split_75  Split_100
1      *Ling, Jessica        1:05.58      13.45     17.21     19.16     15.76
2      HO, Hui Ting Natalie  1:07.31      13.32     16.90     21.06     16.03
3      Cheong, Chloe         1:07.76      13.74     17.01     21.27     15.74

This is new, exciting, and undoubtedly the most requested feature, so there you go people – enjoy!


Relay Swimmers

SwimmeR can now also capture relay swimmers, by setting the argument relay_swimmers = TRUE inside either swim_parse or swim_parse_ISL. Like splits, relay_swimmers defaults to FALSE, so using it is optional.

df_relay <- swim_parse(file_50, relay_swimmers = TRUE)

df_relay %>%
  filter(Event == "Men 400 Yard Freestyle Relay") %>%
  select(Place, Team, Finals_Time, Relay_Swimmer_1:Relay_Swimmer_4) %>%
  head(3) %>%
  flextable_style()

Place  Team                         Finals_Time  Relay_Swimmer_1   Relay_Swimmer_2      Relay_Swimmer_3    Relay_Swimmer_4
1      Ardsley-Hast-Edge-Dobbs-Irv  3:12.34      Lee, Christian    Andrews, Samuel T    Vincent, Connor J  Pierce, Adrien T
2      Horace Greeley               3:16.06      Chung, Hudson H   Sakharuk, Nikita V   McHugh, Luke P     Laidlaw, John F
3      Wappingers                   3:17.31      McGregor, Kyle T  McGregor, Matthew J  Holan, Steve       Smith, Sebastian L


The Troubles

Splits

“All that sounds great”, I hear you saying, “and your examples worked flawlessly!” Wrong, wrong, wrong. I picked those examples as traps. Splits can be trouble, as Ricky Berens well knows.

[Image: Not where you want a split]

Here's the problem. SwimmeR reads in data from sources, but it also imposes a structure on that data, a tidy structure in fact. In versions prior to v0.6.0 there was no issue with imposing a tidy structure (it didn't chafe at all) because the source data SwimmeR was reading in was fundamentally tidy. Each row was a swim; each column the corresponding event, athlete name, or final time. Splits are different though.

There are two general types of splits, cumulative and non-cumulative. Cumulative splits work like this: the first 50 meters of your race takes you 29.00 seconds. The second 50 meters takes you 31.00 seconds. Your cumulative split for the 50 is 29.00, and for the 100 it's 60.00 seconds (1:00.00). You keep going: your third 50 takes you 31.50 seconds, so your cumulative split for the 150 is 29.00 plus 31.00 plus 31.50, for a total of 91.50 seconds (1:31.50). That's all fine.
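As a quick sketch of that arithmetic in plain base R (nothing SwimmeR-specific), cumulative splits are just a running sum of the lap splits:

# Lap (non-cumulative) 50 splits from the example above
lap_splits <- c(29.00, 31.00, 31.50)

# Cumulative splits are the running sum
cumsum(lap_splits)

## [1] 29.0 60.0 91.5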

Non-cumulative is the same, just without the adding. Your split for the 50 is still 29.00; for the next 50 (not the 100) it's 31.00; for the third 50 (not the 150) it's 31.50. This is fine too. SwimmeR prefers non-cumulative splits on tidy principles (each split is its own variable, rather than a sum, or accumulation, of other splits/variables), but if only cumulative splits are available SwimmeR will take them.
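If you do end up with cumulative splits and want lap splits, base R's diff() undoes the accumulation (a sketch, continuing the numbers above):

cumulative_splits <- c(29.00, 60.00, 91.50)

# Keep the first split, then take successive differences
c(cumulative_splits[1], diff(cumulative_splits))

## [1] 29.0 31.0 31.5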

Either cumulative or non-cumulative splits on their own wouldn't be a problem, for SwimmeR or otherwise, and indeed both are often used together in swimming results, with non-cumulative splits printed inside parentheses and cumulative splits outside. But what about when the non-cumulative splits aren't quite non-cumulative? Sounds daft, right? But that's the convention during relays. Take a look:

In the 1st place relay Christian Lee swims his first 50 in 23.27, then his next 50 in what? Not 47.33; that's his total 100 time, despite its being wrapped in parentheses. Then Samuel Andrews dives in and swims the third 50 of the race in 23.06, for a running total of 1:10.39. And his second 50? Again, not 48.65; that's his own cumulative 100. You see the problem. This problem is reproduced within the results from swim_parse.

df_50 %>%
  filter(Event == "Men 400 Yard Freestyle Relay") %>%
  select(Place, Team, Finals_Time, Split_50:Split_400) %>%
  head(3) %>%
  flextable_style()

Place  Team                         Finals_Time  Split_50  Split_100  Split_150  Split_200  Split_250  Split_300  Split_350  Split_400
1      Ardsley-Hast-Edge-Dobbs-Irv  3:12.34      23.27     47.33      23.06      48.65      22.55      47.90      22.88      48.46
2      Horace Greeley               3:16.06      23.85     49.11      23.27      49.36      23.09      48.72      23.12      48.87
3      Wappingers                   3:17.31      24.14     49.43      23.69      49.55      23.50      49.94      22.65      48.39

I could of course convert the cumulative splits to non-cumulative.

df_50 %>%
  rowwise() %>%
  mutate(
    Split_100 = case_when(
      str_detect(Event, "Men 400 Yard Freestyle Relay") ~ as.character(sec_format(Split_100) - sec_format(Split_50)),
      TRUE ~ Split_100
    ),
    Split_200 = case_when(
      str_detect(Event, "Men 400 Yard Freestyle Relay") ~ as.character(sec_format(Split_200) - sec_format(Split_150)),
      TRUE ~ Split_200
    ),
    Split_300 = case_when(
      str_detect(Event, "Men 400 Yard Freestyle Relay") ~ as.character(sec_format(Split_300) - sec_format(Split_250)),
      TRUE ~ Split_300
    ),
    Split_400 = case_when(
      str_detect(Event, "Men 400 Yard Freestyle Relay") ~ as.character(sec_format(Split_400) - sec_format(Split_350)),
      TRUE ~ Split_400
    )
  ) %>%
  filter(Event == "Men 400 Yard Freestyle Relay") %>%
  select(Place, Team, Finals_Time, Split_50:Split_400) %>%
  head(3) %>%
  flextable_style()

Place  Team                         Finals_Time  Split_50  Split_100  Split_150  Split_200  Split_250  Split_300  Split_350  Split_400
1      Ardsley-Hast-Edge-Dobbs-Irv  3:12.34      23.27     24.06      23.06      25.59      22.55      25.35      22.88      25.58
2      Horace Greeley               3:16.06      23.85     25.26      23.27      26.09      23.09      25.63      23.12      25.75
3      Wappingers                   3:17.31      24.14     25.29      23.69      25.86      23.50      26.44      22.65      25.74

I can do the conversion here, and I could have built the conversion into swim_parse and swim_parse_ISL, but there's a tension between the extremely useful tidy framework on one hand and fidelity to the source material on the other. SwimmeR thus far gives users the choice. You can choose to read in splits (and remember, the default is not to), and then you can choose how you treat splits if you decide to read them in. You can choose between tidiness and fidelity.

I may bundle the above code for converting split types into a future version of SwimmeR, but if I do, its use will also be at the discretion of the user.
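If that does happen, it might look something like the sketch below. To be clear, this is a hypothetical helper of my own, not a SwimmeR function: the name relay_splits_to_lap is made up, and it assumes the relay convention shown above, where every second Split_ column holds a swimmer's cumulative 100.

library(stringr)
library(SwimmeR)

# NA-safe wrapper around SwimmeR::sec_format()
to_sec <- function(x) {
  out <- rep(NA_real_, length(x))
  ok  <- !is.na(x)
  if (any(ok)) out[ok] <- vapply(x[ok], sec_format, numeric(1))
  out
}

# Hypothetical helper (not a SwimmeR function): for the named relay events,
# replace every second split column (Split_100, Split_200, ...) with its
# difference from the preceding column, so each swimmer's cumulative 100
# becomes their second lap
relay_splits_to_lap <- function(df, events) {
  split_cols <- str_sort(str_subset(names(df), "^Split_"), numeric = TRUE)
  relay <- df$Event %in% events

  for (i in seq(2, length(split_cols), by = 2)) {
    lap <- to_sec(df[[split_cols[i]]][relay]) - to_sec(df[[split_cols[i - 1]]][relay])
    df[[split_cols[i]]][relay] <- ifelse(is.na(lap), NA, sprintf("%.2f", lap))
  }
  df
}

# Usage, mirroring the mutate() chain above:
# df_50 %>% relay_splits_to_lap("Men 400 Yard Freestyle Relay")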

Even deeper is the fact that, for example, the third 50 of a 200 Freestyle and the third 50 of a 200 Freestyle Relay are both recorded in the same column (Split_150), but they're fundamentally different: the third 50 of a relay begins with a relay pickup (a start from the blocks) and involves a fresh swimmer, while the third 50 of the individual event starts from a turn and involves the same swimmer who swam the first two 50s. One column contains two variables. That's a no-no. I could put relay splits in their own columns, but I'd have to break them up by relay type too. The 2nd 50 of a 200 Freestyle Relay is almost always swum front crawl (freestyle), but the 2nd 50 of a 200 Medley Relay is required to be swum breaststroke; different animals altogether. Differentiating them means a lot lot lot of columns/variables.

And what if the length at which splits are taken varies within a meet, or even within a race? Have a look at this:

The splits are by 25 for the first 800m of this 1500m freestyle, but for the last 700m there's only one split, covering the whole 700m. swim_parse faithfully reproduces this, except that it doesn't know the split length has suddenly changed mid-race. The result is a column labeled Split_825 that isn't actually the 825m split.

df_25 %>%
  filter(Event == "Men 1500 SC Meter Freestyle") %>%
  select(Place, Name, Finals_Time, Split_700:Split_825) %>%
  head(3) %>%
  flextable_style()

Place  Name              Finals_Time  Split_700  Split_725  Split_750  Split_775  Split_800  Split_825
1      *Nehra, Aryan     16:22.19     16.80      16.53      16.53      16.44      16.58      7:35.27
2      Cheng, Jimmy      16:49.64     17.35      16.63      16.76      17.21      17.17      7:53.87
3      Abraham, Levente  17:06.60     17.64      17.51      17.63      17.48      17.65      8:03.36

There’s nothing I can do here – I can’t recreate splits that aren’t included in some form in the source data. You’ll just have to know what’s going on in your results.
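You can at least flag swims where this may be happening, though. Here's a rough sketch of my own (not a SwimmeR feature) using SwimmeR's sec_format() with dplyr's rowwise() and c_across(): it looks for swims where one recorded "split" is far longer than the rest, which usually means the split length changed mid-race. The 3x threshold is arbitrary.

library(dplyr)
library(SwimmeR)

df_25 %>%
  filter(if_any(starts_with("Split_"), ~ !is.na(.x))) %>% # keep only swims that have splits
  rowwise() %>%
  mutate(
    typical_split = median(vapply(na.omit(c_across(starts_with("Split_"))),
                                  sec_format, numeric(1))),
    longest_split = max(vapply(na.omit(c_across(starts_with("Split_"))),
                               sec_format, numeric(1)))
  ) %>%
  ungroup() %>%
  filter(longest_split > 3 * typical_split) %>%
  select(Name, Event, Finals_Time, typical_split, longest_split)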

For further discussion of potential issues with splits please see vignette("SwimmeR").

Relay Swimmers

Relay swimmers aren’t anywhere near as troublesome as splits, but they’re still in tension with tidy principles. Let’s say you were interested in what races a particular athlete had participated in. It’s easy to filter or in some other way subset the data based on Name. With relay_swimmers = TRUE though that athlete’s name might be in Name or it might be in one of the relay swimmer columns. Having one variable (athlete name) in multiple columns is not tidy.
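One workaround is to search the Name column and the relay swimmer columns at the same time. A small sketch with dplyr's if_any(); the athlete name is just one pulled from the results above:

library(dplyr)

# Every row involving a given athlete, whether they appear in Name
# (individual events) or in one of the Relay_Swimmer_* columns
df_relay %>%
  filter(if_any(c(Name, starts_with("Relay_Swimmer")),
                ~ .x %in% "Chung, Hudson H")) %>%
  select(Event, Place, Name, Team, Finals_Time)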

I could put relay swimmers in the names columns, but then what to do about their times? They don’t have a Finals_Time in the normal sense, only their splits, and as you’ve just read that’s a whole other mess.

Again, my approach here has been to maintain reasonable fidelity to the results and let users decide if they want to opt into tidy troubles (by affirmatively setting relay_swimmers = TRUE).


In Conclusion

To be clear, in neither the case of splits nor the case of relay swimmers is the issue with the principles of tidy data as such. They’re fine, sensible principles, although they do have their limitations. The issue is also not with the conventions of reporting for swimming data. They also make sense and are perfectly usable for the millions of people who participate in the sport every year. The issue is in the tension between the two – how to put swimming data into a computationally useful format, while still being friendly to the human analyst. This is the central tension at the heart of SwimmeR, all the more on display now in v0.6.0.

Thanks for reading, and be sure to come back to Swimming + Data Science where next time we’ll be doing a look back over the recently completed International Swimming League Season 2! Lilly King will be involved, so get excited.



RObservations #4 Using Base R to Clean Data


[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A friend of mine had some data which was mixed with character strings and was interested in looking at the numeric data only. Because the data set was quite large, cleaning it manually wasn’t viable.

Besides being too great a task to do manually, tampering with the raw data can be very dangerous if you don’t have a way to track the changes or know what was done to it. This is why data cleaning should always be handled with a script and/or an established pipeline, without tampering with the raw data itself.

In this blog we’re going to look at a quick trick that I found useful for cleaning data frames on a large scale using base R and some understanding of data structures in R.

The Problem

Assume the character data in this data frame is not meaningful and we are only interested in the numeric data. Here’s a “toy example” of how the data might look.

# Sample Data Set
df<-data.frame(a=c(1,"a",2.4),b=c("c",0.1,"b"))

df


##     a   b
## 1   1   c
## 2   a 0.1
## 3 2.4   b

Seeing this, we can construct a matrix of TRUE/FALSE values that tells us which values contain character strings and which do not.

Setting up a TRUE/FALSE matrix

Obviously, the regular expressions can get more complicated, but for this example we will keep it simple to convey the concept. (An alternative check that avoids regular expressions altogether is sketched a bit further below.)

tfvals<-as.matrix(as.data.frame(lapply(df, function(x) grepl("[A-z]",x))))

tfvals


##          a     b
## [1,] FALSE  TRUE
## [2,]  TRUE FALSE
## [3,] FALSE  TRUE

We use the grepl() function with a regular expression to give us logical values for each column of our data set, flagging which data is not meaningful to us. Because lapply() returns a list, we need to convert the result back to a data frame, which in turn needs to be converted into a matrix.

This gives us a nice and neat matrix of the TRUE/FALSE values that we will be able to use to clean our data set!

For R users who are more comfortable with the pipe operator (%>%) this solution can be rewritten as:

# Import the magrittr library

library(magrittr)

tfvals <- df %>%
  lapply(function(x)grepl("[A-z]", x)) %>%
  as.data.frame() %>%
  as.matrix()

tfvals


##          a     b
## [1,] FALSE  TRUE
## [2,]  TRUE FALSE
## [3,] FALSE  TRUE

This version is also much easier to look at and requires less deciphering by others to read. But feel free to write whatever is easier for your environment and tastes (that comment is probably going to get some backlash!).
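As a side note tied to the regex caveat above, a hedged alternative (not the approach used in this post) is to flag anything that fails numeric coercion instead of pattern-matching letters:

# Alternative flag: TRUE wherever a value cannot be read as a number.
# Note that genuine NA values would also be flagged, which may or may not be desired.
not_numeric <- function(x) is.na(suppressWarnings(as.numeric(as.character(x))))

tfvals_alt <- as.matrix(as.data.frame(lapply(df, not_numeric)))
tfvals_alt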

Cleaning and Formatting the Data

The actual cleaning of the data can now be done with one line.

df[tfvals]<-NA

# And there you have it!
df


##      a    b
## 1    1 <NA>
## 2 <NA>  0.1
## 3  2.4 <NA>

Now, to make sure we have our data in numeric format, we will coerce all of our data to numeric by first coercing it to character.

newdf<-data.frame(lapply(df,function(x) as.numeric(as.character(x))))

newdf


##     a   b
## 1 1.0  NA
## 2  NA 0.1
## 3 2.4  NA

The reason we did this is that our data frame values are read as character data. Additionally, as a safety precaution: if you are using an older version of R, your data frame may still have the stringsAsFactors argument set to TRUE (yes, I know the newer versions of R now set stringsAsFactors = FALSE as the default in data frames, but hear me out). Coercing a factor directly to numeric returns its underlying level codes rather than the values, so we first coerce all the data we have from the factor class to character, and from character to numeric data.
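A tiny demonstration (my own toy example, not from the original post) of why the detour through character matters when a column happens to be a factor:

# as.numeric() on a factor returns the underlying level codes, not the values
f <- factor(c("2.4", "10"))
as.numeric(f)                 # 2 1    -- the level codes
as.numeric(as.character(f))   # 2.4 10 -- the actual numbers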

This can alternatively be rewritten with pipes (although, in this case I find using base R easier to read).

newdf <- df %>%
  lapply(function(x)
    x %>%
      as.character %>%
      as.numeric) %>%
  data.frame

newdf


##     a   b
## 1 1.0  NA
## 2  NA 0.1
## 3 2.4  NA

And there you have it!

Conclusion

I find a problem like this all too commonplace; more often than not, an uncleaned data set is handed over to you, and knowing how to clean it with minimal effort is crucial to working efficiently. This is what inspired me to write about it.

Additionally, I’m really happy that my blog has been getting more reach via social media and R-Bloggers. With that, I am thrilled to have a lot of experienced R programmers checking out my work.

So my question for you (yes, YOU!)- how would you deal with this problem? Let me know in the comments!



A/B testing my resume


[This article was first published on R – David's blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems opinion-based or anecdotal, with very little scientific basis.

Well, let’s fix that.

Being currently open to work, I thought this would be the right time to test this scientifically. I have two versions of my resume: a one-page version and a two-page version.

The purpose of a resume is to land you an interview, so we’ll track for each resume how many applications yield a call for an interview. Non-responses after one week are treated as failures. We’ll model the effectiveness of a resume as a binomial distribution: all other things being equal, we’ll assume all applications using the same resume type have the same probability ($p_1$ or $p_2$) of landing an interview. We’d like to estimate these probabilities, and decide if one resume is more effective than the other.

In a traditional randomized trial, we would randomly assign each job offer to a resume and record the success rate. But let’s estimate the statistical power of such a test. From past experience, and also from many plots such as this one posted on Reddit, it seems reasonable to assign a baseline success rate of about 0.1 (i.e., about one application in 10 yields an interview). Suppose the one-page version is twice as effective and we apply to 100 positions with each. Then the statistical power, i.e. the probability of detecting a statistically significant effect, is given by:

library(Exact)
power.exact.test(p1 = 0.2, p2 = 0.1, n1 = 100, n2 = 100)

## 
##      Z-pooled Exact Test 
## 
##          n1, n2 = 100, 100
##          p1, p2 = 0.2, 0.1
##           alpha = 0.05
##           power = 0.501577
##     alternative = two.sided
##           delta = 0

That is, we have only about a 50% chance of detecting the effect at the 0.05 significance level. This is not going to work; at a rate of about 10 applications per month, this would require 20 months.

Instead I’m going to frame this as a multi-armed bandit problem: I have two resumes and I don’t know which one is more effective, so I’d like to test them both but give preference to the one that seems to have the higher rate of success, also known as trading off exploration vs exploitation.

We’ll begin by assuming again that we think each has about a 10% chance of success, but since this is based on limited experience it makes sense to treat this probability as the expected value of a beta distribution parameterized by, say, 1 success and 9 failures.
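As a quick sanity check of my own (not from the original post), the expected value of that Beta(1, 9) prior is indeed 10%, which is also how the simulation below initializes each resume with k1 = 1 and n1 = 10:

# The mean of a Beta(a, b) distribution is a / (a + b); with 1 success and 9 failures:
a <- 1; b <- 9
a / (a + b)
## [1] 0.1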

So whenever we apply for a new job, we:

  • draw a new $p_1$ and $p_2$ from each beta distribution
  • apply to the one with the highest drawn probability
  • update the selected resume’s beta distribution according to its success or failure.

Let’s simulate this, assuming that we know immediately if the application was successful or not. Let’s take the “true” probabilities to be 0.14 and 0.11 for the one-page and two-page resumes respectively. We’ll keep track of the simulation state in a simple list:

new_stepper <- function() {
  state <- list(k1 = 1, n1 = 10, p1 = 0.14, k2 = 1, n2 = 10, p2 = 0.11)
  step <- function() {
    old_state <- state
    state <<- next_state(state)
    old_state
  }
  step
}

new_stepper() returns a closure that keeps a reference to the simulation state. Each call to that closure updates the state using the next_state function:

next_state <- function(state) {
  p1 <- rbeta(1, state$k1, state$n1 - state$k1)
  p2 <- rbeta(1, state$k2, state$n2 - state$k2)
  pull1 <- p1 > p2
  result <- rbinom(1, 1, ifelse(pull1, state$p1, state$p2))
  if (pull1) {
    state$n1 <- state$n1 + 1
    state$k1 <- state$k1 + result
  } else {
    state$n2 <- state$n2 + 1
    state$k2 <- state$k2 + result
  }
  state
}

So let’s now simulate 1000 steps:

step <- new_stepper()
sim <- data.frame(t(replicate(1000, unlist(step()))))

The estimated effectiveness of each resume is given by the number of successes divided by the number of applications made with that resume:

sim$one_page <- sim$k1 / sim$n1
sim$two_page <- sim$k2 / sim$n2
sim$id <- 1:nrow(sim)

The following plot shows how that estimated probability evolves over time:

library(reshape2)
library(ggplot2)

sim_long <- melt(sim, measure.vars = c('one_page', 'two_page'))
ggplot(sim_long, aes(x = id, y = value, col = variable)) +
  geom_line() +
  xlab('Applications') +
  ylab('Estimated probability of success')
[Figure: estimated probability of success for each resume type as applications accumulate. Caption: “Wouldn’t that be nice”.]

As you can see, the algorithm decides pretty rapidly (after about 70 applications) that the one-page resume is more effective.

So here’s the protocol I’ve begun to follow since about mid-November:

  • Apply only to jobs that I would normally have applied to
  • Go through the entire application procedure, including writing the cover letter, etc., until uploading the resume becomes unavoidable (I do this mainly to avoid any personal bias when writing cover letters)
  • Draw $p_1$ and $p_2$ as described above; select resume type with highest $p$
  • Adjust the resume according to the job requirements, but keep the changes to a minimum and don’t change the overall format
  • Finish the application, and record a failure until a call for an interview comes in.

I’ll be sure to report on the results in a future blog post.

The post A/B testing my resume appeared first on David's blog.


A latent threshold model to dichotomize a continuous predictor


[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the context. In the convalescent plasma pooled individual patient level meta-analysis we are conducting as part of the COMPILE study, there is great interest in understanding the impact of antibody levels on outcomes. (I’ve described various aspects of the analysis in previous posts, most recently here). In other words, not all convalescent plasma is equal.

If we had a clear measure of antibodies, we could model the relationship of these levels with the outcome of interest, such as health status as captured by the WHO 11-point scale or mortality, and call it a day. Unfortunately, at the moment, there is no single measure across the RCTs included in the meta-analysis (though that may change). Until now, the RCTs have used a range of measurement “platforms” (or technologies), which may measure different components of the convalescent plasma using different scales. Given these inconsistencies, it is challenging to build a straightforward model that simply estimates the relationship between antibody levels and clinical outcomes.

The study team is coalescing around the idea of comparing the outcomes of patients who received low levels of antibodies with patients who received not low levels (as well as with patients who received no antibodies). One thought (well, really my thought) is to use a model that can jointly estimate the latent threshold and, given that threshold, estimate a treatment effect. Importantly, this model would need to accommodate multiple antibody measures and their respective thresholds.

To tackle this problem, I have turned to a class of models called change point or threshold models. My ultimate goal is to fit a Bayesian model that can estimate threshold and effect-size parameters for any number of RCTs using any number of antibody measures. At this point we are a few steps removed from that, so in this post I’ll start with a simple case of a single RCT and a single antibody measure, and use a maximum likelihood estimation method implemented in the R package chngpt to estimate parameters from a simulated data set. In a subsequent post, I’ll implement a Bayesian version of this simple model, and perhaps in a third post, I’ll get to the larger model that incorporates more complexity.

Visualizing simple scenarios

Change point models appear to be most commonly used in the context of time series data where the focus is on understanding if a trend or average has shifted at a certain point in a sequence of measurements over time. In the case of COMPILE, the target would be a threshold for a continuous antibody measure across multiple patients; we are interested in measuring the average outcome for patients on either side of the threshold.

The following plots show three scenarios. On the left, there is no threshold; the distribution of continuous outcomes is the same across all values of the antibody measure. In the middle, there is a threshold at \(-0.7\); patients with antibody levels below \(-0.7\) have a lower average outcome than patients with antibodies above \(-0.7\). On the right, the threshold is shifted to \(0.5\).

The key here is that the outcome is solely a function of the latent categorical status – not the actual value of the antibody level. This may be a little simplistic, because we might expect the antibody level itself to be related to the outcome based on some sort of linear or non-linear relationship rather than the dichotomous relationship we are positing here. However, if we set our sights on detecting a difference in average clinical outcomes for patients categorized as having been exposed to low and not low antibody levels rather than on understanding the full nature of their relationship, this simplification may be reasonable.
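To write that simplification down explicitly (my notation, not taken from the chngpt documentation): for antibody level \(x\) and threshold \(c\), the step model assumes \(E[y] = \beta_0 + \beta_1 I(x > c)\), so the antibody level enters the mean outcome only through the indicator \(I(x > c)\), and \(\beta_1\) is the difference in average outcomes between the not low and low groups.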

Data generation

I think if you see the data generation process, the model and assumptions might make more sense. We start with an antibody level that, for simplicity’s sake, has a standard normal distribution. In this simulation, the latent group status (i.e. low vs. not low) is not determined completely by the threshold (though it certainly could be); here, the probability that latent status is not low is about \(5\%\) for patients with antibody levels that fall below \(-0.7\), but is \(95\%\) for patients that exceed the threshold.

library(simstudy)
set.seed(87654)

d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "latent_status", formula = "-3 + 6 * (antibody > -0.7)",
              dist = "binary", link = "logit")
d1 <- defData(d1, varname = "y", formula = "0 + 3 * latent_status", 
              variance = 1, dist = "normal")

dd <- genData(500, d1)
dd

##       id antibody latent_status       y
##   1:   1  -1.7790             0  0.5184
##   2:   2   0.2423             1  3.2174
##   3:   3  -0.4412             1  1.8948
##   4:   4  -1.2505             0  0.9816
##   5:   5  -0.0552             1  2.9251
##  ---                                   
## 496: 496  -0.4634             1  2.7298
## 497: 497   0.6862             0 -0.0507
## 498: 498  -1.0899             0  0.9680
## 499: 499   2.3395             1  1.9540
## 500: 500  -0.4874             1  3.5238

Simple model estimation

The chngptm function in the chngpt package provides an estimate of the threshold as well as the treatment effect of the antibody level lying above this latent threshold. The parameters in this simple case are recovered quite well. The fairly narrow \(95\%\) confidence interval for the effect, (2.2, 2.8), just misses the true value. The very narrow \(95\%\) CI for the threshold, (-0.73, -0.69), does include the true value.

library(chngpt)

fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
  data = dd, type="step", family="gaussian")
summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                   est Std. Error* (lower upper) p.value*
## (Intercept)     0.296       0.130 0.0547  0.563 2.26e-02
## antibody>chngpt 2.520       0.139 2.2416  2.787 1.99e-73
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##   -0.70261    0.00924   -0.72712   -0.69092

Alternative scenarios

When there is more ambiguity in the relationship between the antibody threshold and the classification into the two latent classes of low and not low, there is more uncertainty in both the effect and threshold estimates. Furthermore, the effect size estimate is attenuated, since the prediction of the latent class is less successful.

In the next simulation, the true threshold remains at \(-0.7\), but the probability that a patient below the threshold actually does not have low levels of antibodies increases to about \(21\%\), while the probability that a patient above the threshold does not have low levels of antibodies decreases to \(79\%\). There is more uncertainty regarding the threshold, as the \(95\%\) CI is (-1.09, -0.62). And the estimated effect of \(1.5 \; (1.3, 2.0)\) is attenuated, with more uncertainty. Given the added uncertainty in the data generation process, these estimates are what we would expect.

d1 <- updateDef(d1, changevar = "latent_status", 
  newformula = "-1.3 + 2.6 * (antibody > -0.7)")

dd <- genData(500, d1)

fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
  data = dd, type="step", family="gaussian")
summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                   est Std. Error* (lower upper) p.value*
## (Intercept)     0.881       0.159   0.50   1.12 3.05e-08
## antibody>chngpt 1.439       0.173   1.17   1.85 1.09e-16
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##    -0.6298     0.0579    -0.8083    -0.5814

The effect size has an impact on the estimation of a threshold. In the extreme case where there is no effect, the concept of a threshold is not meaningful; we would expect there to be great uncertainty in the estimate of the threshold. As the true effect size grows, we would expect the precision of the threshold estimate to increase as well (subject to the latent class membership probabilities just described). The subsequent plot shows the point estimates and \(95\%\) CIs for thresholds at different effect sizes. The true threshold is \(0.5\) and effect sizes range from 0 to 2:

This last figure shows that the uncertainty around the effect size estimate is higher at lower levels of true effectiveness. This higher level of uncertainty in the estimated effect is driven by the higher level of uncertainty in the estimate of the threshold at lower effect sizes (as just pointed out above).

With a fundamentally different data generating process

What happens when the underlying data process is quite different from the one we have been imagining? Is the threshold model useful? I would say “maybe not” in the case of a single antibody measurement. I alluded to this a bit earlier in the post, justifying the idea by arguing it might make more sense with multiple types of antibody measurements. We will hopefully find that out if I get to that point. Here, I briefly investigate the estimates we get from a threshold model when the outcome is linearly related to the antibody measurement, and there is in fact no threshold, as in this data set:

d1 <- defData(varname = "antibody", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "y", formula = "antibody", variance = 1, dist = "normal")

dd <- genData(500, d1)

The estimated threshold is near the center of the antibody data (which in this case is close to \(0\)), with a fairly narrow \(95\%\) confidence interval. The effect size is essentially a comparison of the means for patients with measurements below \(0\) compared to patients above \(0\). If this were the actual data generation process, it might be preferable to model the relationship directly using simple linear regression without estimating a threshold.

fit <- chngptm(formula.1 = y ~ 1, formula.2 = ~ antibody, 
  data = dd, type="step", family="gaussian")
summary(fit)

## Change point model threshold.type:  step 
## 
## Coefficients:
##                    est Std. Error* (lower upper) p.value*
## (Intercept)     -0.972       0.162  -1.24 -0.607 2.19e-09
## antibody>chngpt  1.739       0.109   1.58  2.006 1.15e-57
## 
## Threshold:
##        est Std. Error     (lower     upper) 
##    -0.0713     0.2296    -0.3832     0.5170

Deploying an R Shiny app on Heroku free tier


[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Continuous Deployment made easy with GitHub Actions and Heroku

This article is a short guide on deploying a Shiny app on Heroku. Familiarity with Docker, Shiny and GitHub is presumed. For an introduction to Docker, see Deploying a Shiny Flexdashboard with Docker.

This article was also published on https://www.r-bloggers.com/.

This article will show you

  • What Heroku is and what you can get for free
  • How to containerize a Shiny app with Docker
  • How to set up GitHub to automatically deploy to Heroku
  • How to set up a custom domain name for your deployed app

Heroku free tier

Heroku (which I am not affiliated with) is a cloud platform service that offers an intuitive way to deploy and manage enterprise-grade web apps. It enables easy auto scaling of your apps, although not on the Free (and Hobby) tier we will use here. Apps can be managed via CLI or the Heroku dashboard.

Heroku has a plethora of add-ons, e.g. for logging or messaging. Additionally, it provides services you can co-deploy with your apps, like databases, with just a few clicks, a few lines in the CLI, or a deploy instruction file, the `heroku.yml`.

With the new heroku.yml file it is possible to deploy multi container apps. Here however we will not use a heroku.yml as the deployment will be managed by GitHub Actions.

Heroku apps run in “dynos”, which are containers running on AWS. With the free tier you can run apps in at most 5 dynos (100 if you verify) with 512 MB memory each, and you get 550 (1000 if you verify) dyno hours. On the free tier your apps go to “sleep mode” after 30 min and need to be woken up, which usually means your load time on the first request will be 10–30s longer. This behavior saves your free dyno hours. If you want to circumvent this, add a GET request to your app so that the app calls itself at intervals of less than 30 min. We’ll see how long this will remain possible.

What I like about the Heroku pricing model is that it is per app. That means you can upgrade a single app after testing it on free tier. On paid tiers you get various benefits, like auto scaling, unlimited dyno hours and SSL encryption with custom domains.

Shiny app Docker image

Heroku dynos support a few languages out of the box, like Python and Go, with provided buildpacks. For R we have to use a Docker container to run in the dyno. Furthermore, putting your Shiny app in a Docker container has various advantages, such as properly managed dependencies and good portability across systems.

In this example an R Shiny app that provides batch geo-coding of addresses to longitude and latitude is deployed. In the repo you can find and clone the Shiny app, Dockerfile and GitHub Actions YAML: https://github.com/timosch29/geocoding_shiny

For the deployment on Heroku you have to modify the Dockerfile instructions a bit.

# Base image https://hub.docker.com/u/rocker/
FROM rocker/shiny-verse:4.0.3

LABEL author="Tim M.Schendzielorz docker@timschendzielorz.com"

# system libraries of general use
# install debian packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
    libxml2-dev \
    libcairo2-dev \
    libpq-dev \
    libssh2-1-dev \
    libcurl4-openssl-dev \
    libssl-dev

# update system libraries
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get clean

# copy necessary files from app folder
# Shiny app
COPY /shiny_geocode ./app

# renv.lock file
COPY /renv.lock ./renv.lock

# install renv & restore packages
RUN Rscript -e 'install.packages("renv")'
RUN Rscript -e 'renv::restore()'

# remove install files
RUN rm -rf /var/lib/apt/lists/*

# make all app files readable, gives rwe permission (solves issue when dev in Windows, but building in Ubuntu)
RUN chmod -R 755 /app

# expose port (for local deployment only)
EXPOSE 3838

# set non-root
RUN useradd shiny_user
USER shiny_user

# run app on container start (use heroku PORT variable for deployment)
CMD ["R", "-e", "shiny::runApp('/app', host = '0.0.0.0', port = as.numeric(Sys.getenv('PORT')))"]

In this Dockerfile, renv is used to install the necessary R libraries from a renv.lock file via `RUN Rscript -e 'renv::restore()'`, so the container has the same library versions as the local dev environment.

To run the containerized app on Heroku, two things are necessary. First, the container must run as non-root for security reasons. Make a new user via RUN useradd shiny_user and switch to it via USER shiny_user.

Second, Heroku provides you with a random port via the PORT host variable for each dyno. To run the Shiny app at this port, use port = as.numeric(Sys.getenv('PORT')) in the runApp command.
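For local testing it can be handy to fall back to a fixed port when PORT is not set; here is a small sketch (the 3838 fallback is my addition, not part of the original setup):

# Use Heroku's PORT when it is set, otherwise fall back to 3838 for local runs
port <- Sys.getenv("PORT", unset = "3838")
shiny::runApp("/app", host = "0.0.0.0", port = as.numeric(port))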

Continuous Deployment with GitHub Actions

GitHub Actions provides a straightforward template and instructions from https://github.com/AkhileshNS/heroku-deploy to deploy to Heroku. In the top level of your GitHub repo, include a directory .github/workflows with the following main.yml file:

name: heroku_deploy

on:
  push:
    branches:
      - master

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: akhileshns/heroku-deploy@v3.6.8
        with:
          heroku_api_key: ${{secrets.HEROKU_API_KEY}}
          heroku_app_name: "geocode-shiny"
          heroku_email: ${{secrets.HEROKU_EMAIL}}
          healthcheck: "https://geocode-shiny.herokuapp.com/"
          usedocker: true
          delay: 60
          rollbackonhealthcheckfailed: true
        env:
          HD_API_KEY: ${{secrets.MAPS_API_KEY}} # Docker env var

Here, we specify the action “heroku_deploy” to happen on a push to the master branch. In the steps, the commit which triggered the action is checked out to be accessed by the workflow, and in the next step it is pushed to Heroku, built and deployed.

The parameters heroku_api_key, heroku_app_name and heroku_email are needed. To get them

  1. Make a Heroku account on the website.
  2. Go to your Heroku dashboard or download the CLI tool.
  3. Create a new app with a unique name in the dashboard or via heroku create your_app_name .
  4. Get an API key from your Heroku Account Settings or via heroku auth:token .
  5. Store the two variables for your Heroku account in your GitHub repo Settings->Secrets as HEROKU_API_KEY and HEROKU_EMAIL.

The parameter usedocker: true is needed for deployment of a Docker container. Additionally, we pass the app’s URL (which you will know after the first successful deploy) to the healthcheck parameter. The health check is delayed by 60s with delay: 60, and the deploy is rolled back to the previous commit when the health check of the app fails, via rollbackonhealthcheckfailed: true.

This Shiny app needs a secret API key for an external API to work. It is saved as a GitHub secret too and supplied as a Docker environment variable. To set env variables for your apps, prefix them with HD_. This prefix gets stripped off in deployment and is necessary to distinguish between build and deploy variables. DO NOT put any of your secrets directly in files in GitHub repos!

That’s it! Push to the master branch (or any other; you could also use tags to specify commits for deployment in the main.yml) and check the GitHub Actions tab to see if the deployment worked. Then get the URL at which your app can be reached from the dashboard or via heroku apps:info -a your_app_name. Add this URL to the healthcheck in the main.yml for future deployment versions.

Set up a custom URL for the app

To set up a custom domain for your Heroku app which you own/have access to, you need to:

  1. Verify your Heroku account with a credit card. You will not incur any charges for apps on the free tier and would additionally need to opt in for paid service. If you verify, you get more dynos and free dyno hours, too.
  2. Add your domain to the app via dashboard or via the CLI tool with heroku domains:add www.example.com -a your_app_name .
  3. Go to your Domain provider and add a new CNAME record for www.example.com to point to the Heroku DNS target you get via the dashboard or heroku domains -a your_app_name.
  4. Check that the DNS is correctly configured via host www.example.com .


Deploying an R Shiny app on Heroku free tier was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.


To peek or not to peek after 32 cases? Exploring that question in Biontech/Pfizer’s vaccine trial


[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is my 4th post about Biontech/Pfizer’s Covid-19 vaccine trial. If you are missing background, please first look at my 2nd post and 3rd post.

Biontech/Pfizer originally planned to analyze the vaccine efficacy at 5 stages: after 32, 62, 92, 120 and finally 164 Covid-19 cases have been observed among the 43538 study participants, who consisted of a vaccinated treatment group and a control group who got a placebo. Each interim analysis could lead to an early declaration of sufficient efficacy. The success thresholds at each stage were chosen such that the overall type I error (falsely declaring a vaccine efficacy above 30%) was bounded below 2.5%.

This press release notes a modification of the original plan:

After discussion with the FDA, the companies recently elected to drop the 32-case interim analysis and conduct the first interim analysis at a minimum of 62 cases. Upon the conclusion of those discussions, the evaluable case count reached 94 and the DMC [independent Data Monitoring Committee] performed its first analysis on all cases.

Some observers claimed the decision to skip the 32-case analysis was politically motivated (some comments on those claims are relegated to the last section of this post). Yet, Biontech/Pfizer refute any political motive and there are very good reasons to skip the first interim analysis that have nothing to do with the US elections. I found a good discussion in this Science article.

The reason to have originally planned an interim analysis at the low number of 32 cases is summarized in the Science article as follows:

Ugur Sahin, scientist, CEO, and co-founder of BioNTech, says the initial plan to look at 32 cases stemmed from a conservative assumption about the rate of spread of COVID-19 and the sense of urgency about the need for a vaccine. If the vaccine looked terrific at 32 cases and it was going to take months to get to 62 cases, then waiting seemed like a mistake, he says.

However, the FDA and outside observers seemed skeptical about the small case number in the first interim analysis. Other Covid vaccine trials seem to have made the first analysis at only at least 50 cases. The Science article continues:

In mid-October, the companies had yet to confirm 32 cases. But with the epidemic exploding at many of the trial’s locations—which were mainly in the United States—they had second thoughts about FDA’s request that their first interim analysis should have more to support an EUA request. FDA “had strongly recommended to us that we change that, and the pandemic just was spiraling out of control in the United States and elsewhere, and we realized that we probably could get cases much faster than what we had anticipated,” Jansen says.

We should also note that a successful 32-case analysis probably would not have sped up the vaccine deployment because submission for Emergency Use Authorization (EUA) required passing a safety milestone, which according to the press release was expected to occur in the third week of November. And, indeed the EUA request was submitted only on November 20th.

Based on those arguments, skipping the 32-case analysis, looks like a very sound strategy to me. But let’s look at some statistics and run some simulations to better understand some of the trade-offs.

The (statistical) power of patience

The first, obvious point I want to illustrate is that waiting for more cases generally gives higher statistical power for a fixed type I error. Below I simulate 10000 trials under the assumption that the true vaccine efficacy is VE=70%. The success criterion required by the FDA is that an efficacy above 30% can be proven with a type I error of at most 2.5%.

The simulation below computes the shares of trials that are successful for the three cases of an analysis after either 32, 62 or 164 cases. We assume for each case that just a single analysis takes place (no interim analyses).

# Helper functions to transform efficacy to theta
# and the other way round
VE.to.theta = function(VE) (1-VE)/(2-VE)
theta.to.VE = function(theta) (1-2*theta)/(1-theta)

# This function simulates a single trial
simulate.trial = function(runid=1, m.max=164, VE.true = 0.3, m.analyse = 1:m.max) {
  theta.true = VE.to.theta(VE.true)
  is.vaccinated = ifelse(runif(m.max) >= theta.true,0,1)
  mv = cumsum(is.vaccinated)[m.analyse]
  mc = m.analyse - mv
  # Returning results as matrix is faster than as data frame
  cbind(runid=runid, m=m.analyse, mv=mv,mc=mc)
}

# Simulate 10000 trials
set.seed(1)
dat = do.call(rbind,lapply(1:10000, simulate.trial, VE.true=0.7, m.analyse = c(32,62,164))) %>%
  as_tibble

# Parameters of used prior distribution
a0 = 0.700102; b0 = 1

# Compute posterior probabilities of VE > 30%
dat = dat %>% mutate(
  posterior.VE.above.30 = pbeta(VE.to.theta(0.3),
    shape1 = a0+mv, shape2=b0+mc, lower.tail=TRUE)
)

# Compute success share for each m
dat %>% 
  group_by(m) %>%
  summarize(
    share.success = mean(posterior.VE.above.30 >= 0.975)
  )

## # A tibble: 3 x 2
##       m share.success
##   <dbl>         <dbl>
## 1    32         0.525
## 2    62         0.896
## 3   164         0.999

We see how the statistical power (probability to reject VE <= 30%), i.e. the success probability of the trial, increases when we wait until more cases have accrued. Analyzing (only) after 32 cases would yield a success with just 52.5% probability, while analyzing (only) after 164 cases would be successful with 99.9% probability.

So there is a natural trade-off between speed and statistical power. Hence, if cases accrue faster than expected, it seems quite natural to prefer to postpone an analysis to a higher case count.

Of course, the skipped interim analysis at 32 cases was only one of a total of 5 planned analyses at 32, 62, 92, 120 and 164 cases. But if the FDA allows the success thresholds for the later analyses to be adapted correspondingly, skipping an early interim analysis will also increase total power. Yet, as is illustrated further below, skipping the 32-case interim analysis would increase overall power only by an amount that is not too large, and it is not obvious whether the FDA should indeed allow adaptation of the success thresholds of the later analyses.

I guess more relevant for the decision was that a failed interim test probably would have to be announced due to SEC guidelines. But why should Biontech/Pfizer take a substantial risk of having to announce bad news if they just had to wait a few days to perform an analysis with much stronger statistical power?

How convincing would be a success with only 32 cases?

I guess a stronger concern for the FDA would be that a declared early success at 32 cases might have been met with considerable skepticism by the interested public, arguing that 32 cases are just not enough. This could have poured oil on the fire of vaccine critics and may also have reflected negatively on Biontech/Pfizer.

So if both a success and a failure at the first interim analysis would likely have created problems from a public relations point of view and the required safety milestone anyway requires to wait at least until mid November for Emergency Use Authorization, it almost seems a no-brainer to wait a few days for a more powerful first efficacy analysis.

Personally, I share the view that a small case count may be very problematic from a PR point of view. Yet, I don’t fully understand in which cases a small case count is objectively statistically problematic given that we control the type I error rate.

Table 5 on p. 103 in Biontech/Pfizer’s study plan specifies the exact success thresholds for the 4 planned interim analyses and the final efficacy analysis. I will restrict attention to the planned analyses at 32, 62 and 164 Covid-19 cases. They would be declared a success if not more than 6, 15 or 53 confirmed Covid-19 cases, respectively, were from the vaccinated treatment group.
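As a quick side calculation of my own (reusing the theta.to.VE helper defined above), the sample efficacy implied by hitting each of those thresholds exactly is roughly 77%, 68% and 52%, so in terms of observed efficacy the bar is effectively highest at the earliest look:

# Sample vaccine efficacy at the success boundary of the 32, 62 and 164 case analyses
round(theta.to.VE(c(6/32, 15/62, 53/164)), 3)
## [1] 0.769 0.681 0.523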

The following code plots the posterior distributions of the vaccine efficacy VE for each of the 3 analyses given that those success thresholds are met exactly. I assume a conservative uniform prior for the unknown parameter $\theta$ that measures the probability that a trial subject with Covid-19 was vaccinated. At the expected prior value $E \theta =0.5$ the efficacy is exactly 0.

a0=1; b0=1grid = tibble(m=c(32,62,164), mv = c(6,15,53), mc=m-mv) %>%   tidyr::expand_grid(theta = seq(0,1,by=0.01)) %>%  mutate(    VE = theta.to.VE(theta),    density = dbeta(theta, shape1 = a0+mv, shape2=b0+mc),    cases = as.factor(m)   )ggplot(filter(grid, VE > 0), aes(x=VE, y=density, fill=cases)) +  geom_area(alpha=0.5,position = "identity")+  ggtitle("Posterior if success threshold is exactly reached at m cases")

We see that, given that in an analysis the success threshold is hit exactly, we should be most optimistic about the vaccine efficacy in the earliest interim analysis with just 32 cases. This finding thus does not support the skepticism against small case numbers. It rather corresponds to the observation that consistently finding significant effects with small sample sizes typically requires large effect sizes. Another reason for this finding is that Biontech/Pfizer required smaller error probabilities in the interim analyses than in the final analysis (see the previous post for details).

However, there could be problems with small sample sizes that are not accounted for in the statistical analysis. For example, in other settings a problem of small sample sizes is a larger scope for p-hacking as illustrated here. But I don’t see how p-hacking could be an issue in Biontech/Pfizer’s clean experimental design.

Yet, one could imagine other problems of early evaluation with small sample sizes. For example, what if the vaccination very commonly had side effects like fatigue or fever that make treated subjects more likely to stay at home for several days after the vaccination? They would then meet fewer people and therefore have a lower infection risk than the control group, for reasons that have nothing to do with vaccine efficacy. While such effects should probably wash out with a longer trial duration, they may perhaps bias the results in a non-negligible fashion if a first interim analysis takes place after very few cases.

Another problem is that the point estimate after a successful 32-case analysis may change substantially once more observations accrue. So drafting a press release would be more complicated. Shall a high point estimate be stated, or just a quite low conservative bound? A high estimate may sound nice, but having to reduce the efficacy as the trial proceeds may be undesirable. In theory, one could state the credible interval, but that seems to be seldom done in press releases, perhaps because it is complicated to explain all the assumptions that entered its calculation. While similar considerations also play a role for interim analyses with larger case counts, the magnitude of the uncertainties is larger the smaller the case count.

How much statistical power could be gained by removing the first interim trial?

We now want to move to the question, deferred from above, of how much power we could gain by dropping the first interim analysis. We first simulate 100000 trials assuming a vaccine efficacy of VE=30%, which is the minimum efficacy that should be exceeded with a type I error of at most 2.5%.

# Simulate 100000 trials assuming true efficacy of only 30%,
# which the study plan required to exceed with at most 2.5% type I error
set.seed(1)
sim.VE30 = do.call(rbind,lapply(1:100000, simulate.trial,
  VE.true=0.3, m.analyse = c(32,62,92,120,164))) %>% as_tibble

We now compute the share of simulated trials that given a true vaccine efficacy of only 30% were successful, i.e. the share of trials where we would wrongly reject the null hypothesis of a 30% efficacy or less. We use the thresholds mv_max of the maximum number of vaccinated subjects among the m cases from Biontech/Pfizer’s original study plan.

# Helper function to specify mv_max bounds.
# By default as specified in Biontech/Pfizer's original study plan.
set_mv_max = function(sim.dat,m32=6,m62=15,m92=25,m120=35,m164=53) {
  sim.dat %>% mutate(
    mv_max = case_when(
      m == 32 ~ m32,
      m == 62 ~ m62,
      m == 92 ~ m92,
      m == 120 ~ m120,
      m == 164 ~ m164
    )
  )
}

# Helper function to compute shares of trials that were successful
# in interim or final analysis given mv_max bounds in sim.dat
compute.success.shares = function(sim.dat, ignore.m = NULL) {
  if (!is.null(ignore.m)) {
    sim.dat = filter(sim.dat,!m %in% ignore.m)
  }
  sim.dat %>%
    group_by(runid) %>%
    summarize(
      success = any((mv <= mv_max))
    ) %>%
    pull(success) %>%
    mean()
}

# Compute type 1 error rate given Biontech/Pfizer's specification
compute.success.shares(sim.VE30 %>% set_mv_max)

## [1] 0.0221

We see that Biontech/Pfizer’s specification yields a total type I error rate of 2.21%.

If we relaxed the success thresholds by allowing at most 54 instead of only 53 Covid-19 cases in the final analysis to be from the treatment group, we would get the following type I error rate:

# Relax final analysis bound by 1 observation
compute.success.shares(sim.VE30 %>% set_mv_max(m164=54))

## [1] 0.02697

The 2.697% error rate would violate the required 2.5% bound.

Let us now compute the corresponding error rates assuming that we ignore the first interim analysis at 32-cases:

compute.success.shares(sim.VE30 %>% set_mv_max, ignore.m = 32)
## [1] 0.01745

compute.success.shares(sim.VE30 %>% set_mv_max(m164=54), ignore.m = 32)
## [1] 0.02242

compute.success.shares(sim.VE30 %>% set_mv_max(m164=55), ignore.m = 32)
## [1] 0.03015

We see that we could increase the threshold in the final analysis to 54 cases from the treatment group and still have an error rate of 2.242%. Increasing it to 55 cases would propel the type I error rate beyond 2.5%, however.

We now want to compare the power of the original study plan with the modified study plan that skips the first interim analysis but increases the threshold in the final analysis to at most 54 cases from the treatment group. We first compare the statistical power for a vaccine with 70% efficacy.

# Simulate 100000 trials given 70% efficacy
set.seed(1)
sim.VE70 = do.call(rbind,lapply(1:100000, simulate.trial,
  VE.true=0.7, m.analyse = c(32,62,92,120,164))) %>% as_tibble

# Original study plan
compute.success.shares(sim.VE70 %>% set_mv_max())
## [1] 0.99767

# Modified study plan without first interim analysis
compute.success.shares(sim.VE70 %>% set_mv_max(m164=54), ignore.m=32)
## [1] 0.99851

We see how dropping the first interim analysis and adjusting the final analysis threshold indeed increases the total statistical power, but just by a small amount from 99.77% in the original plan to 99.85%.

If the true efficacy were just 50% the effect would be a bit larger:

# Simulate 100000 trials given 50% efficacy
set.seed(1)
sim.VE50 = do.call(rbind,lapply(1:100000, simulate.trial,
  VE.true=0.5, m.analyse = c(32,62,92,120,164))) %>% as_tibble

# Original study plan
compute.success.shares(sim.VE50 %>% set_mv_max())
## [1] 0.45823

# Modified study plan without first interim analysis
compute.success.shares(sim.VE50 %>% set_mv_max(m164=54), ignore.m = 32)
## [1] 0.50847

Now the modified analysis plan would increase the statistical power by 5 percentage points, from 45.8% to 50.8%. Well, a 5 percentage point increase in the success chance of such a huge, important project is not negligible. Ex post it may be easy to say that assuming just 50% efficacy is unrealistically low. But how certain could one have been that the efficacy was high before analyzing the trial data?

Should the FDA allow to adapt the success thresholds if the first interim analysis is skipped?

I actually don’t know whether the FDA indeed allowed the success threshold for the final analysis to be adjusted from 53 to 54 cases after Biontech/Pfizer agreed to skip the first interim analysis.

Should the FDA have allowed it or not? While this is ex-post not relevant given the realized tremendous efficacy (from 170 cases, only 8 were in the treatment group), I consider it an interesting academic question.

Obviously, a vaccine maker who has already seen the unblinded data of an interim analysis, i.e. who knows how many cases were from the treatment group, should not be allowed to decide whether to skip that interim analysis. While Biontech/Pfizer did not see the unblinded data when deciding to skip the first interim analysis, they knew, if I understand correctly, the development of the total case count in their experiment. While on first thought one may believe that the total case count does not reveal information about the efficacy, that is not necessarily true. The thing is that one may in principle estimate the sample efficacy when knowing the total case count among the experimental subjects and the incidence rate in the total population.

For example, assume that in the analyzed period 2% of the total population got Covid-19 but only 1% of the experimental population. This may suggest a high efficacy because it is consistent with the event that only control group subjects got sick. In contrast, if 2% of the experimental population also got Covid-19, a low efficacy may be suggested.

Of course, such estimates are in practice not simple. The Covid test intensity is probably higher among the experimental subjects than in the total population and the experimental subjects may substantially differ from the total population in ways that are not easily controlled for in a statistical analysis. Nevertheless, the theoretical objection remains that vaccine makers who know the total case count, also have some noisy information about the vaccine efficacy.

So I guess the cleaner approach for the FDA would be to not allow adjustment of the success thresholds after a decision to skip the first interim analysis. On the other hand, given that the FDA seemed to have preferred such a skip, and that a modified analysis plan would also have a type I error rate of 2.242% and thus some slack until 2.5%, allowing for such an adjustment may seem defensible.

But even without an adjustment of the later success thresholds, and thus no gain in statistical power, the reasons discussed earlier already seem to make a clear case for Biontech/Pfizer to skip the 32-case interim analysis.

Politics

It is true that a 32-case interim case analysis possibly or even likely may have taken place already before the US elections. So, not surprisingly, Donald Trump suspected a political motivation for skipping the analysis. He tweeted on Nov 10th:

As I have long said, @Pfizer and the others would only announce a Vaccine after the Election, because they didn’t have the courage to do it before. Likewise, the @US_FDA should have announced it earlier, not for political purposes, but for saving lives!

Let me conclude this post with some remarks on that tweet:

First, even if sufficient efficacy would have been announced before the election, it would, to my understanding, not have sped up vaccine deployment since the submission for emergency use authorization required anyway to wait until required safety results are available in the third week of November. So faster vaccine deployment would only be possible by forcing the FDA to reduce safety standards. But would a reduction in safety standards in expectation save lives or rather risk more deaths (possibly indirectly due to reduced willingness to get vaccinated) and risk other harmful consequences?

Second, Biontech/Pfizer did not decide whether to announce a vaccine success or not, but whether to perform the first interim analysis, without knowing its result. If the vaccine had not been as tremendously effective as it turned out to be, the first interim analysis with 32 cases might well have been unsuccessful. For example, recent news reported an estimated average 70% efficacy for Astrazeneca’s Covid-19 vaccine. If Biontech/Pfizer’s vaccine had a true efficacy of 70%, a success would have been declared in the 32-case interim analysis with only 37.7% probability. (You can replicate this number by adapting the code of this post using the success threshold from the original analysis plan; a short sketch follows these remarks.) So from an ex-ante perspective, it is not clear in which direction an early interim analysis would have affected the election (if at all). If Trump himself believed his claims that he was the clear favorite in the election race, shouldn’t he rather have preferred the absence of a risk of bad news over the absence of a chance for good news?

Third, assume that it really would have been the case that a 32-cases interim analysis would have changed the presidential election result. Given that Trump was quite fond of criticizing imports from Germany and of imposing trade restrictions, it would be quite some irony of history that he lost the election because the success of a product developed in Germany was not known before the election. (As far as I understand, the vaccine was mainly developed by Biontech, while Pfizer mainly handled the trials).

Fourth, given that it will still take substantial time until a sufficient share of the population can be vaccinated, efficient management of the Covid crisis is still highly important for several months to come. If it turns out that Joe Biden will be a better crisis manager, knowing the trial result only after the election may well have saved a substantial number of lives.
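For completeness, here is the short replication sketch mentioned in the second remark (my code, reusing simulate.trial, VE.to.theta and sim.VE70 from above); both numbers below should land near the 37.7% figure, up to simulation noise:

# Share of simulated trials with true VE = 70% that meet the original
# 32-case interim threshold of at most 6 vaccinated cases
sim.VE70 %>%
  filter(m == 32) %>%
  summarize(share.success = mean(mv <= 6))

# The same quantity without simulation: each case comes from the vaccinated
# group with probability theta = VE.to.theta(0.7)
pbinom(6, size = 32, prob = VE.to.theta(0.7))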


xkcd Comics as a Minimal Example for Calling APIs, Downloading Files and Displaying PNG Images with R


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

xkcd webcomics is one of the institutions of the internet, especially for the nerd community. If you want to learn how to fetch JSON data from a REST API, download a file from the internet and display a PNG file in an ultra-simple example, read on!

Many services on the internet provide a web service so that it is easier for machines to get access to their data. The access point is called an Application Programming Interface (API) and there are different types of APIs. One especially widespread type is known under the name REST (for REpresentational State Transfer) and a data format that is often used is called JSON (for JavaScript Object Notation). So this is what we will do first here: Fetching JSON data from a REST API!

As a simple example, we will use the JSON interface of xkcd webcomics, the documentation is very concise:

If you want to fetch comics and metadata automatically, you can use the JSON interface. The URLs look like this:

http://xkcd.com/info.0.json (current comic)

or:

http://xkcd.com/614/info.0.json (comic #614)

Those files contain, in a plaintext and easily-parsed format: comic titles, URLs, post dates, transcripts (when available), and other metadata.

To access the data we will use the wonderful jsonlite package (on CRAN):

library(jsonlite)

# call API
xkcd <- fromJSON("http://xkcd.com/1838/info.0.json")
str(xkcd)
## List of 11
##  $ month     : chr "5"
##  $ num       : int 1838
##  $ link      : chr ""
##  $ year      : chr "2017"
##  $ news      : chr ""
##  $ safe_title: chr "Machine Learning"
##  $ transcript: chr ""
##  $ alt       : chr "The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent."
##  $ img       : chr "https://imgs.xkcd.com/comics/machine_learning.png"
##  $ title     : chr "Machine Learning"
##  $ day       : chr "17"

It couldn’t be any easier, right?

To download the PNG image as a raw file we use download.file from Base R:

# download file
download.file(xkcd$img, destfile = "images/xkcd.png", mode = 'wb')

Finally, we want to plot the image; we use the png package (on CRAN) for that:

library(png)

# plot png
plot(1:2, type = 'n', main = xkcd$title, xlab = "", ylab = "")
rasterImage(readPNG("images/xkcd.png"), 1, 1, 2, 2)

If this doesn’t work on your system please consult the documentation, there might be system-related differences.

We can, of course, display the other data directly:

xkcd$alt
## [1] "The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent."

This has hopefully given you some inspiration for your own experiments with more sophisticated APIs… if you have interesting examples and use cases please post them in the comments below!
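As one possible next step (this is my own sketch, not from the original post), you could loop over a few comic numbers and collect the metadata in a single data frame:

library(jsonlite)

get_xkcd <- function(num) {
  url <- paste0("https://xkcd.com/", num, "/info.0.json")
  as.data.frame(fromJSON(url), stringsAsFactors = FALSE)
}

comics <- do.call(rbind, lapply(1835:1838, get_xkcd))
comics[, c("num", "title", "img")]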


The post xkcd Comics as a Minimal Example for Calling APIs, Downloading Files and Displaying PNG Images with R first appeared on R-bloggers.

Upcoming workshop: Introduction to Agile


[This article was first published on Mirai Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last but not least, the workshop ‘Not Agile yet? Experience being part of a scrum team’ will bring to a close Mirai's first series of Data Science workshops!

How can we talk about software development best practices and modern DevOps without talking about agile development? In fact, we can't! Not agile yet? Don't wait any longer to discover the magic* formula to success!

Have you heard about the VUCA world? What better example than the crisis we are all going through with this Covid-19 pandemic? Most of us have had to adapt and reinvent ourselves – and quickly – to survive. Thinking and acting agile is no longer a choice but a necessity.

At Mirai, we were early adopters: we have been practicing Agile for several years now, accompanying teams and departments in their transformations. We would like to share our experience and know-how with you by offering an introduction to the agile approach. In this workshop you will learn the concepts on which agile techniques and methods are built, and we will try to make you live and feel them, since being agile is above all a mindset and an attitude. Join us on the 8th of December.


For an overview of the benefits of transitioning to an agile approach, check out our article, and for a deeper understanding of agile trends, dive into the 14th Annual State of Agile Report.

*It might not happen quite so magically. In reality, the agile transformation might become a long and painful journey. But at the end of the road, what you will have achieved will be a bit magical, and there will surely be no turning back.


The post Upcoming workshop: Introduction to Agile first appeared on R-bloggers.


Visualizing geospatial data in R—Part 1: Finding, loading, and cleaning data


[This article was first published on Articles - The Analyst Code, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

This is part 1 of a 4 part series on how to build maps using R.

  1. How to load geospatial data into your workspace and prepare it for visualization
  2. How to make static maps using ggplot2
  3. How to make interactive maps (pan, zoom, click) using leaflet
  4. How to add interactive maps to a Shiny dashboard

One of the benefits of R is its large ecosystem of packages developed and maintained by individuals and teams, and available to the general public via CRAN. The downside of this, however, is that it is sometimes difficult for even experienced coders to determine which package is the best place to start when learning a new skill.

Map data (aka geospatial data, Geographic Information System data, or GIS data) is no exception to this rule and is, in fact, particularly intimidating to the uninitiated.

This is for two reasons: developers have created (1) different ways of storing and representing GIS data and (2) multiple, similar packages for each type of GIS data representation.

In the tutorials in this series, we will introduce GIS data visualization in R using the simple features standard, which has become increasingly popular in recent years and which has the smallest learning curve for those who are already comfortable with data frames and the principles of the “tidyverse.” The simple features approach works well for many common map-making applications, including drawing regions (e.g., states) or points (e.g., cities) and coloring them to create an analytical insight (e.g., shading by population density).

In order to manipulate and visualize data, we will rely primarily on three geospatial packages: sf, ggplot2, and leaflet. You may find that other packages assist greatly with collecting data (e.g., tidycensus for U.S. Census data) or improving the aesthetics of your map (e.g., ggspatial to add a map scale).

Thinking about map data the way R does

Many GIS programs (Tableau, Qlik, etc.) make it extraordinarily easy for users to create maps by loading a file with a geographic identifier and data to plot. They are designed so that you can simply drag and drop a field called “zipcode” or “county” onto a map and automatically draw the appropriate shapes. Then, you can drag and drop a field called “population density” or “GDP per capita” onto the map, and the shapes automatically color appropriately. These “drag and drop” GIS programs do a lot of work behind the scenes to translate your geographic identifier into a geometric shape and your fields into colorful metrics.

In R, we have to do this work ourselves. R has no innate knowledge of what we want to graph; we have to provide every detail. This means we need to pass R the information it needs in order to, say, draw the shape of Pennsylvania or draw a line representing I-95. R needs to be told the 4 coordinates defining a rectangle; R needs to be told the hundreds of points defining a rectangle-ish shape like Pennsylvania. If we want to fill Pennsylvania with a color, we need to explicitly tell R how to do so.

The manual nature of GIS in R can cause some headaches, as we need to hunt down all of the information in order to provide it to R to graph. Once you have the desired information, however, you will find that the manual nature of R’s graphing allows significantly more flexibility than “drag and drop” programs do. We are not constrained by the type of information pre-loaded into the program, by the number of shapes we can draw at once, by the color palettes provided, or by any other factor. We have complete flexibility.

If you want to draw state borders (polygons), county borders (more polygons), major highways (lines), and highway rest stops (points), add each of them as an individual layer to the same plot, and color them as you please. There are no constraints when visualizing geospatial data in R.

This post will focus on how to find, import, and clean geospatial data. The actual graphing will come in Part 2 (static maps with ggplot2) and Part 3 (interactive maps with leaflet).

A brief introduction to simple features data in R

Out in the wild, map data most frequently comes as either geoJSON files (.geojson) or Shapefiles (.shp). These files will, at the very minimum, contain information about the geometry of each object to be drawn, such as instructions to draw a point in a certain location or to draw a polygon with certain dimensions. The raw file may, however, also contain any amount of additional information, such as a name for the object (“Pennsylvania”) or summary statistics (GDP per capita, total population, etc.). Regardless of whether the data is geoJSON or a Shapefile, and regardless of how much additional data the file has, you can use one convenient function from the sf package to import the raw data into R as a simple features object. Simply use either sf::read_sf(my_json_file) or sf::read_sf(my_shp_file).

library(sf)

# Data from OpenDataPhilly
# Source: https://www.opendataphilly.org/dataset/zip-codes
zip_geojson <- "http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson"
phl_zip_raw <- sf::read_sf(zip_geojson)

# If you want to save / load a local copy
# sf::write_sf(phl_zip_raw, "phl_zip_raw.shp")
# phl_zip_raw <- sf::read_sf("phl_zip_raw.shp")

Let's take a look at the simple features data we imported above.

head(phl_zip_raw)
#> Simple feature collection with 6 features and 5 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: -75.20435 ymin: 39.95577 xmax: -75.06099 ymax: 40.05317
#> geographic CRS: WGS 84
#> # A tibble: 6 x 6
#>   OBJECTID CODE    COD Shape__Area Shape__Length                        geometry
#> 1        1 19120    20   91779697.        49922. ((-75.11107 40.04682, -75.1094~
#> 2        2 19121    21   69598787.        39535. ((-75.19227 39.99463, -75.1920~
#> 3        3 19122    22   35916319.        24125. ((-75.15406 39.98601, -75.1532~
#> 4        4 19123    23   35851751.        26422. ((-75.1519 39.97056, -75.1515 ~
#> 5        5 19124    24  144808025.        63659. ((-75.0966 40.04249, -75.09281~
#> 6        6 19125    25   48226254.        30114. ((-75.10849 39.9703, -75.11051~
  1. We have 6 features. Each row is a feature that we could plot; since we called head(), we only see the first 6 even though the full dataset has more

  2. We have 5 fields. Each column is a field with (potentially) useful information about the feature. Note that the geometry column is not considered a field

  3. We are told this is a collection of polygons, as opposed to points, lines, etc.

  4. We are told the bounding box for our data (the most western/eastern longitudes and northern/southern latitudes)

  5. We are told the Coordinate Reference System (CRS), which in this case is "WGS 84." CRSs are cartographers' ways of telling each other what system they used for describing points on the earth. Cartographers need to pick an equation for an ellipsoid to approximate earth's shape, since it's slightly pear-shaped. Cartographers also need to determine a set of reference markers--known as a datum--to use to set coordinates, as earth's tectonic plates shift ever so slightly over time. Together, the ellipsoid and datum become a CRS.

    WGS 84 is one of the most common CRSs and is the standard used for GPS applications. In the US, you may see data provided using NAD 83. WGS 84 and NAD 83 were originally identical (back in the 1980s), but both have been modified over time as the earth changes and scientific knowledge progresses. WGS 84 seeks to keep the global average of points as similar as possible while NAD 83 tries to keep the North American plate as constant as possible. The net result is that the two different CRSs may vary by about a meter in different places. This is not a big difference for most purposes, but sometimes you may need to adjust.

    If we wanted to transform our data between CRSs, we would call sf::st_transform(map_raw, crs = 4326), where 4326 is the EPSG code of the CRS into which we would like to transform our geometry (see the short sketch after this list). EPSGs are a standard, shorthand way to refer to various CRSs. 4326 is the EPSG code for WGS 84 and 4269 is the EPSG code for NAD 83.

  6. Finally, we are provided a column called "geometry." This column contains everything that R will need to draw each of the ZIP Codes in Philadelphia, with one row per ZIP Code
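As a minimal sketch of the CRS transformation mentioned in the list above, applied to the Philadelphia ZIP Code data we just imported (4269 being the EPSG code for NAD 83):

library(sf)

phl_zip_nad83 <- sf::st_transform(phl_zip_raw, crs = 4269)
sf::st_crs(phl_zip_nad83)  # confirm the new CRS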

Finding data

Simple features data in R will always look similar to the example above. You will have some metadata describing the type of geometry, the CRS, and so on; a "geometry" column; and optionally some fields of additional data. The trouble comes in trying to find the data you need--both the geometry and the proper additional fields--and getting them together into the same object in R.

Finding geospatial data

One of the most common sources of geospatial files in R is the tigris package. This package allows users to directly download and use TIGER/Line shapefiles--the shapefiles describing the U.S. Census Bureau's census areas. The package includes, among other files, data for national boundaries, state boundaries, county boundaries, ZIP Code Tabulation Areas (very similar to ZIP Codes), census tracts, congressional districts, metro areas, roads, and many other useful US geographic features.

tigris allows you to import directly as a simple features object. Let's take a quick look at how to import county data.

library(tigris)
library(ggplot2)

pa_counties_raw <- tigris::counties(
  state = "PA",
  cb = TRUE,
  resolution = "500k",
  year = 2018,
  class = "sf"
)

head(pa_counties_raw)
#> Simple feature collection with 6 features and 9 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -80.51942 ymin: 39.72089 xmax: -75.01507 ymax: 41.47858
#> geographic CRS: NAD83
#>     STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID       NAME LSAD      ALAND
#> 239      42      005 01213658 0500000US42005 42005  Armstrong   06 1691724751
#> 240      42      029 01209174 0500000US42029 42029    Chester   06 1943848979
#> 241      42      035 01214721 0500000US42035 42035    Clinton   06 2299868396
#> 242      42      059 01214033 0500000US42059 42059     Greene   06 1491700989
#> 243      42      067 01209180 0500000US42067 42067    Juniata   06 1013592882
#> 244      42      091 01213680 0500000US42091 42091 Montgomery   06 1250855248
#>       AWATER                       geometry
#> 239 27619089 MULTIPOLYGON (((-79.69293 4...
#> 240 22559478 MULTIPOLYGON (((-75.59129 3...
#> 241 23178635 MULTIPOLYGON (((-78.09338 4...
#> 242  5253865 MULTIPOLYGON (((-80.51942 3...
#> 243  5606077 MULTIPOLYGON (((-77.74677 4...
#> 244 11016762 MULTIPOLYGON (((-75.69595 4...

ggplot2::ggplot(pa_counties_raw) + 
  ggplot2::geom_sf() + 
  ggplot2::theme_void()

Basic map of PA counties. Source: U.S. Census Bureau TIGER/Line Shapefiles.


For non-US applications, the package rnaturalearth, which is a well-supported part of the rOpenSci project, provides easy access to global data. Like tigris, we can import directly as a simple features object. Here's a quick look at how to import all the countries in Asia.

library(rnaturalearth)
library(ggplot2)

asia <- rnaturalearth::ne_countries(
  continent = "Asia",
  returnclass = "sf"
)

head(asia, 0)
#> Simple feature collection with 0 features and 63 fields
#> bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#>  [1] scalerank  featurecla labelrank  sovereignt sov_a3     adm0_dif  
#>  [7] level      type       admin      adm0_a3    geou_dif   geounit   
#> [13] gu_a3      su_dif     subunit    su_a3      brk_diff   name      
#> [19] name_long  brk_a3     brk_name   brk_group  abbrev     postal    
#> [25] formal_en  formal_fr  note_adm0  note_brk   name_sort  name_alt  
#> [31] mapcolor7  mapcolor8  mapcolor9  mapcolor13 pop_est    gdp_md_est
#> [37] pop_year   lastcensus gdp_year   economy    income_grp wikipedia 
#> [43] fips_10    iso_a2     iso_a3     iso_n3     un_a3      wb_a2     
#> [49] wb_a3      woe_id     adm0_a3_is adm0_a3_us adm0_a3_un adm0_a3_wb
#> [55] continent  region_un  subregion  region_wb  name_len   long_len  
#> [61] abbrev_len tiny       homepart   geometry  
#> <0 rows> (or 0-length row.names)

ggplot2::ggplot(asia) + 
  ggplot2::geom_sf() + 
  ggplot2::theme_void()

Basic map of countries in Asia. Source: rnaturalearth R package.


Finding non-geospatial data

Chances are that you are coming to a geospatial mapping project with a particular dataset in mind. Perhaps you want to explore the New York Times' Covid-19 data. Or perhaps you are interested in FiveThirtyEight's hate crimes by state data. Your data likely has the statistics you want but not the geometry you need for graphing. Hopefully, your data has an ID that you can use to identify each geospatial region. In the example hospital data below, the PA Department of Health provides a ZIP Code and a County name. We also have a longitude and latitude that could be coerced into a simple features geometry (they aren't one yet, though...just columns with numeric values).

library(readr)

# Hospitals by county
# Data from PASDA
# Source: https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=909
pa_hospitals_url <- "https://www.pasda.psu.edu/spreadsheet/DOH_Hospitals201912.csv"
pa_hospitals_raw <- readr::read_csv(url(pa_hospitals_url))

head(pa_hospitals_raw)
#> # A tibble: 6 x 19
#>   SURVEY_ID_ FACILITY_I LONGITUDE LATITUDE FACILITY_U GEOCODING_ FACILITY_N
#> 1 1357       135701         -80.3     40.7 http://ww~ 00         HERITAGE ~
#> 2 0040       13570101       -80.3     40.7 http://cu~ 00         CURAHEALT~
#> 3 1370       137001         -79.9     40.2 http://ww~ 00         MONONGAHE~
#> 4 0047       53010101       -79.7     40.6 https://w~ 00         NEW LIFEC~
#> 5 7901       790101         -79.7     40.6 http://ww~ 00         ALLEGHENY~
#> 6 0023       490601         -80.2     40.4 http://cu~ 00         CURAHEALT~
#> # ... with 12 more variables: STREET , CITY , ZIP_CODE ,
#> #   ZIP_CODE_E , CITY_BORO_ , COUNTY , AREA_CODE ,
#> #   TELEPHONE_ , CHIEF_EXEC , CHIEF_EX_1 , LAT , LNG 

Let’s think for a moment, though, about geospatial analysis. Having the number of hospitals in a county is useful, but what we really want to know is the number of hospitals per capita. Oftentimes with geospatial visualizations, we want to know penetration rates per capita. To do this, we will need to find census data.

U.S. Census data is available at census.gov. The data.census.gov website is not always the most intuitive to navigate, as the data live in many different tables from different government surveys. In addition to the 10-year census survey, there are over 130 intermediate-year surveys including the American Community Survey (ACS). You can browse surveys and available data to your heart's content. Compounding this difficulty is the Census Bureau's naming convention. If you want median household income, for example, you need to look for variable "B19013_001." Once you manage to struggle through all that, you can download a CSV with your desired census data.

library(readr)

county_pop_url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv"
county_pop_raw <- readr::read_csv(url(county_pop_url))

head(county_pop_raw)
#> # A tibble: 6 x 164
#>   SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP
#> 1 040         3        6 01    000    Alaba~ Alabama       4779736
#> 2 050         3        6 01    001    Alaba~ Autaug~         54571
#> 3 050         3        6 01    003    Alaba~ Baldwi~        182265
#> 4 050         3        6 01    005    Alaba~ Barbou~         27457
#> 5 050         3        6 01    007    Alaba~ Bibb C~         22915
#> 6 050         3        6 01    009    Alaba~ Blount~         57322
#> # ... with 156 more variables: ESTIMATESBASE2010 , POPESTIMATE2010 ,
#> #   POPESTIMATE2011 , POPESTIMATE2012 , POPESTIMATE2013 ,
#> #   POPESTIMATE2014 , POPESTIMATE2015 , POPESTIMATE2016 ,
#> #   POPESTIMATE2017 , POPESTIMATE2018 , POPESTIMATE2019 ,
#> #   ...

Combining spatial data with non-spatial data

Now we have our county geospatial data, our original hospital dataset, and another dataset with census data. How in the world do we plot this?

There are four simple steps to prepare your data for graphing:

  1. Import all data (already completed above)
  2. Clean your geospatial data frame
  3. Combine non-spatial data into a single, clean data frame
  4. Merge your two data frames together

Step 1: Import all data

This was completed above, but to refresh your memory, we have pa_counties_raw (spatial), pa_hospitals_raw (non-spatial), and county_pop_raw (non-spatial).

Step 2: Clean your geospatial data frame

Not much work to do here, but just to demonstrate how it's done, let's drop and rename some columns.

library(dplyr)

# Even though we don't select "geometry", the sf object will keep it
# To drop a geometry, use sf::st_drop_geometry()
pa_counties <- pa_counties_raw %>% 
  dplyr::transmute(
    GEOID,
    MAP_NAME = NAME, # Adams
    COUNTY = toupper(NAME) # ADAMS
  )

Step 3: Combine non-spatial data into a single, clean data frame

library(dplyr)
library(tidyr)

pa_hospitals <- pa_hospitals_raw %>%
  dplyr::group_by(COUNTY) %>% 
  dplyr::summarise(N_HOSP = n()) %>% 
  dplyr::ungroup()

head(pa_hospitals)
#> # A tibble: 6 x 2
#>   COUNTY    N_HOSP
#> 1 ADAMS          1
#> 2 ALLEGHENY     28
#> 3 ARMSTRONG      1
#> 4 BEAVER         2
#> 5 BEDFORD        1
#> 6 BERKS          6

pa_pop <- county_pop_raw %>% 
  dplyr::filter(SUMLEV == "050") %>% # County level
  dplyr::filter(STNAME == "Pennsylvania") %>% 
  dplyr::select(
    COUNTY = CTYNAME,
    POPESTIMATE2019
  ) %>% 
  dplyr::mutate(
    COUNTY = toupper(COUNTY),
    COUNTY = gsub("(.*)( COUNTY)", "\\1", COUNTY)
  ) # "Adams County" --> "ADAMS COUNTY" --> "ADAMS"

head(pa_pop)
#> # A tibble: 6 x 2
#>   COUNTY    POPESTIMATE2019
#> 1 ADAMS              103009
#> 2 ALLEGHENY         1216045
#> 3 ARMSTRONG           64735
#> 4 BEAVER             163929
#> 5 BEDFORD             47888
#> 6 BERKS              421164

combined_pa_data <- 
  dplyr::full_join(pa_hospitals, pa_pop, by = "COUNTY") %>% 
  tidyr::replace_na(list(N_HOSP = 0)) %>% 
  dplyr::mutate(HOSP_PER_1M = N_HOSP / (POPESTIMATE2019 / 1000000))

head(combined_pa_data)
#> # A tibble: 6 x 4
#>   COUNTY    N_HOSP POPESTIMATE2019 HOSP_PER_1M
#> 1 ADAMS          1          103009        9.71
#> 2 ALLEGHENY     28         1216045       23.0 
#> 3 ARMSTRONG      1           64735       15.4 
#> 4 BEAVER         2          163929       12.2 
#> 5 BEDFORD        1           47888       20.9 
#> 6 BERKS          6          421164       14.2 

Step 4: Merge your two data frames together

The tigris package mentioned above has a function for combining geospatial data with a standard data frame. We need to provide tigris::geo_join with our datasets and three instructions.

  • by_sp: Column name from my spatial data to identify unique fields (e.g., COUNTY, ZIP_CODE, FIPS)
  • by_df: Column name from my non-spatial data to identify unique fields
  • how: "inner" to keep rows that are present in both datasets or "left" to keep all rows from the spatial dataset and fill in NA for missing non-spatial rows
library(tigris)
library(ggplot2)

# Combine spatial and non-spatial data
pa_geospatial_data <- tigris::geo_join(
  spatial_data = pa_counties,
  data_frame = combined_pa_data,
  by_sp = "COUNTY",
  by_df = "COUNTY",
  how = "inner"
)

head(pa_geospatial_data)
#> Simple feature collection with 6 features and 6 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -80.51942 ymin: 39.72089 xmax: -75.01507 ymax: 41.47858
#> geographic CRS: NAD83
#>   GEOID   MAP_NAME     COUNTY N_HOSP POPESTIMATE2019 HOSP_PER_1M
#> 1 42005  Armstrong  ARMSTRONG      1           64735    15.44759
#> 2 42029    Chester    CHESTER     11          524989    20.95282
#> 3 42035    Clinton    CLINTON      2           38632    51.77055
#> 4 42059     Greene     GREENE      1           36233    27.59915
#> 5 42067    Juniata    JUNIATA      0           24763     0.00000
#> 6 42091 Montgomery MONTGOMERY     15          830915    18.05239
#>                         geometry
#> 1 MULTIPOLYGON (((-79.69293 4...
#> 2 MULTIPOLYGON (((-75.59129 3...
#> 3 MULTIPOLYGON (((-78.09338 4...
#> 4 MULTIPOLYGON (((-80.51942 3...
#> 5 MULTIPOLYGON (((-77.74677 4...
#> 6 MULTIPOLYGON (((-75.69595 4...

ggplot2::ggplot(pa_geospatial_data, aes(fill = HOSP_PER_1M)) + 
  ggplot2::geom_sf() + 
  ggplot2::scale_fill_viridis_c() +
  ggplot2::theme_void()

Hospitals per million residents. Montour County is apparently the place to be if you need a hospital!  Source: PASDA, U.S. Census Bureau


An aside on U.S. Census Bureau data

If you are using data from the U.S. Census Bureau, the easiest option is to use the tidycensus package, for it allows you to access census data that comes pre-joined to shapefile data. The package also has helper functions for easy navigation of the U.S. Census Bureau datasets. tidycensus uses the U.S. Census API, so you will need to obtain an API key first. The tidycensus website has great instructions on how to get an API key and add it to your .Renviron file. The website also has excellent example vignettes to demonstrate the package's robust functionality.
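As a hedged sketch of that workflow (assuming you have already installed a Census API key with tidycensus::census_api_key()), pulling median household income by Pennsylvania county with the geometry attached might look something like this:

library(tidycensus)
library(ggplot2)

pa_income <- tidycensus::get_acs(
  geography = "county",
  variables = "B19013_001",  # median household income
  state = "PA",
  geometry = TRUE
)

ggplot2::ggplot(pa_income, aes(fill = estimate)) + 
  ggplot2::geom_sf() + 
  ggplot2::scale_fill_viridis_c() +
  ggplot2::theme_void()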

Conclusion

To conclude, we have now seen how to find geospatial data and import it into R as a simple features object. We have seen how to find U.S. Census data or other non-spatial data and import it into R. We have seen how to tidy both geospatial and non-spatial data and join them as a single dataset in preparation for visualization. Finally, we have had a brief preview of how to plot using ggplot2. Visualization was not a focus of this post, but how could we write a whole post on maps without showing you a single map? Consider it a teaser for Part 2, when we discuss visualization in more detail.

Continue reading in Part 2 to learn how to plot geospatial data with ggplot2 (and make the plots prettier than the ones in this post). Or, skip ahead to Part 3 to learn how to create an interactive plot with leaflet.


The post Visualizing geospatial data in R—Part 1: Finding, loading, and cleaning data first appeared on R-bloggers.

Basic Multipage Routing Tutorial for Shiny Apps: shiny.router


[This article was first published on r – Appsilon | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Basic Routing for Shiny Web Applications

Web applications couldn’t exist without routing. Think about it in terms of the Appsilon website – you’ve visited our home page, navigated to the blog, and opened this particular article. That’s routing in a nutshell – matching UI components to a URL.

Appsilon released the open source shiny.router package back in late 2018, and just recently version 0.2.0 went live on CRAN. This version fixed some bugs, augmented existing functionality a bit, and made a few important changes. In this article, we’ll walk you through the version 0.2.0 update and show you how to create navbars using shiny.router. 

You can download the source code for this article here.

To learn more about Appsilon’s open source packages, see our new Open Source Landing Page: 

Appsilon's shiny.tools landing page for our open source packages.


Introducing shiny.router

At Appsilon, we routinely build Shiny applications for Global 2000 companies. The problem of routing is one of the first ones we encounter on every project. It made sense to develop an open-source package that handles it with ease. After all, routing is a common component of any web application and behaves identically in most projects.

Displaying the app’s content on multiple pages can be achieved via tabsets or subpages of a dashboard, but for a long time there was no easy way to direct users to a specific subpage via URL. Recently some alternative solutions have started to emerge, but shiny.router remains the easiest one.

We have used shiny.router in many commercial projects, so we can confirm it is a field-tested solution. Also, many large organizations have adopted it as a solution on their own.

The latest version of the package is available on CRAN, so the installation couldn’t be any easier:

install.packages("shiny.router")

The new version (0.2.0) was released on 30 October 2020, and it brought new documentation and functionality and fixed existing issues. You can read the full changelog here.

See some impressive Example Shiny Apps in our Shiny Demo Gallery

Creating Navigation Bars with shiny.router

To show how shiny.router works in practice, we’ll develop a simple dashboard with a couple of routes. Every route will have a dummy text, showing us which route we’re on.

To start, we’ll import both shiny and shiny.router:

library(shiny)
library(shiny.router)

Next, we will store content for three pages in three variables. Every page has a shiny.titlePanel and a paragraph:

home_page <- div(
  titlePanel("Dashboard"),
  p("This is a dashboard page")
)

settings_page <- div(
  titlePanel("Settings"),
  p("This is a settings page")
)

contact_page <- div(
  titlePanel("Contact"),
  p("This is a contact page")
)

We can then make a router and attach each of the pages with its corresponding route. The dashboard is located on the root page, so it will be the first one you see:

router <- make_router(
  route("/", home_page),
  route("settings", settings_page),
  route("contact", contact_page)
)

The rest of the Shiny app is more or less what you would expect. We have to declare the UI, which contains a list of routes that enable us to navigate between pages. The server function passes input, output, and session data to the router. Finally, the call to shinyApp brings these two components together.

ui <- fluidPage(
  tags$ul(
    tags$li(a(href = route_link("/"), "Dashboard")),
    tags$li(a(href = route_link("settings"), "Settings")),
    tags$li(a(href = route_link("contact"), "Contact"))
  ),
  router$ui
)

server <- function(input, output, session) {
  router$server(input, output, session)
}

shinyApp(ui, server)

As a result, we have the following web application:

Unstyled Shiny Router App


The application gets the job done but is quite basic with regard to styling. Let’s fix that next.

Styling Navigation Bars

You can add styles to your Shiny applications with CSS. To do so, create a www folder where your R script is, and create a CSS file inside it. We’ve named ours main.css, but you can call yours whatever you want.

To link the created CSS file with the Shiny app, we have to add a theme to shiny.fluidPage. Here’s how:

ui <- fluidPage(
  theme = "main.css",
  tags$ul(
    tags$li(a(href = route_link("/"), "Dashboard")),
    tags$li(a(href = route_link("settings"), "Settings")),
    tags$li(a(href = route_link("contact"), "Contact"))
  ),
  router$ui
)

The value for the theme parameter must be identical to the name of the CSS file. 

If you were to run the app now, everything would look the same as before. That’s because we haven’t added any stylings yet. Copy the following code snippet to your CSS file:

ul {
  background-color: #0099f9;
  display: flex;
  justify-content: flex-end;
  list-style-type: none;
}

ul li a {
  color: #ffffff;
  display: block;
  font-size: 1.6rem;
  padding: 1.5rem 1.6rem;
  text-decoration: none;
  transition: all, 0.1s;
}

a:link, a:visited, a:hover, a:active {
  color: #ffffff;
  text-decoration: none;
}

ul li a:hover {
  background-color: #1589d1;
  color: #ffffff;
}

Save and rerun the application. You will see the following:

Styled Shiny Router App


And that’s how you can style shiny.router and Shiny apps in general.

Conclusion

In this short hands-on guide, we’ve covered the intuition and logic behind the shiny.router package. We’ve seen how routing works, how to create navigation bars with routing, and how to style navigation bars.

You can learn more about shiny.router and Appsilon’s other open-source packages below:

Appsilon is hiring! We are primarily seeking a senior-level engineering manager who can mentor our junior developers. See our Careers page for all new openings, including openings for a Project Manager and Community Manager.

Article Basic Multipage Routing Tutorial for Shiny Apps: shiny.router comes from Appsilon | End­ to­ End Data Science Solutions.


The post Basic Multipage Routing Tutorial for Shiny Apps: shiny.router first appeared on R-bloggers.

What’s the most successful Dancing With the Stars “Profession”? Visualizing with {gt}


[This article was first published on R | JLaw's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


The post What's the most successful Dancing With the Stars "Profession"? Visualizing with {gt} first appeared on R-bloggers.

pointblank v0.6


[This article was first published on Posts | R & R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

With the release of pointblank v0.6, workflows for the validation of tabular data have been refined. On top of those improvements, we have a new workflow for describing our tabular data. There’s really so much that’s new in the release that we can only go over the big stuff in this post. For everything else, have a look at the Release Notes.

Pointblank Information

The new Information Management workflow is full of features that help you to describe tables and keep on top of changes to them. We added the create_informant() function to create an informant object that is meant to hold information (as much as you want, really) for a target table, with reporting features geared toward communication.

The informant works in conjunction with functions to facilitate the entry of info text: info_columns(), info_tabular(), and info_section(). These functions are focused on describing columns, describing the table proper, and reporting on any other aspects of the table. We can even glean little snippets of information from the target table and mix them into the info text to make the overall information more dynamic. The all-important incorporate() function concludes this workflow, reaching out to the target table to ensure that queries to it are made and that table properties are synchronized with the reporting.

informant <- 
  create_informant(
    read_fn = ~ small_table,
    tbl_name = "small_table",
    label = "Example No. 1"
  ) %>%
  info_tabular(
    description = "This table is included in the **pointblank** pkg."
  ) %>%
  info_columns(
    columns = "date_time",
    info = "This column is full of timestamps."
  ) %>%
  info_section(
    section_name = "further information", 
    `examples and documentation` = "Examples for how to use the `info_*()` functions
    (and many more) are available at the 
    [**pointblank** site](https://rich-iannone.github.io/pointblank/)."
  ) %>%
  incorporate()

This ultra-simple example report has some basic information on the small_table dataset available in pointblank. The TABLE and COLUMNS sections are in their prescribed order and the section FURTHER INFORMATION follows those (having one subsection called EXAMPLES AND DOCUMENTATION).

If all this new functionality for describing data tables wasn’t enough, this release also adds the info_snippet() function to round out the collection of info_*() functions for this workflow. The idea here is to have some methodology for acquiring important bits of data from the target table (that’s info_snippet()’s job) and then use incorporate() to grab those morsels of data and stitch them into the info text (via { }).
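To make that a bit more concrete, here is a small, hypothetical continuation of the informant example from above (the snippet name and formula are my own, not from the release notes):

informant <- 
  informant %>%
  info_snippet(
    snippet_name = "latest_time",
    fn = ~ . %>% dplyr::pull(date_time) %>% max(na.rm = TRUE)
  ) %>%
  info_columns(
    columns = "date_time",
    info = "The most recent timestamp is {latest_time}."
  ) %>%
  incorporate()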

The informant produces an information report that can be printed, included in R Markdown documents and Shiny apps, or emailed with the email_create() function. Here’s an information report I put together for the penguins dataset available in the palmerpenguins package (code is available in this GitHub Gist).

Because this workflow has a lot to it, two new articles were written to explain everything that can be done. Start with a gentle introduction and find out even more in an advanced article. It’s hoped that this method of creating a data report, data dictionary, metadata summary (or whatever you want to call it) is both enjoyable and brings great value to an organization that uses shared data.

Translations and Locales

One of the design goals of pointblank is to produce reporting in several spoken languages. Many improvements have been made in v0.6 to continue down this road. For starters, three new translations are available: Portuguese ("pt", Brazil), Chinese ("zh", China mainland), and Russian ("ru"). With these additions, your validation reports, information reports, and table scans (via scan_data()) can now be produced in any of eight different languages. Secondly, all numerical values are formatted to match the base locale of the language, which just makes sense (and it's possible to use a different locale ID; there are over 700 options there).

Email generation through email_create() will properly translate the agent report or the information report to any of the eight supported languages when generating the blastula email object. How? It’s the language setting ("lang") in the agent or the informant that is used to determine the language of email message content.
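As a rough sketch (my own example, not from the release notes), a Portuguese-language agent report with matching number formatting, and an email built from it, might be produced like this:

agent_pt <- 
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Exemplo",
    lang = "pt",     # report language
    locale = "pt"    # number formatting
  ) %>%
  col_vals_lt(vars(a), value = 7) %>%
  interrogate()

email_object <- email_create(agent_pt)  # email content follows the agent's lang setting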

More New Functions

Database tables work exceedingly well as table sources in pointblank. While it’s not too difficult to obtain a tbl_dbi object, this new release adds a function to make that process ridiculously easy: db_tbl(). It allows us to access a database table from a selection of popular database types. We only need to supply one of the following short names and the correct DB driver will be used:

  • "postgres" (PostgreSQL)
  • "mysql" (MySQL)
  • "maria" (MariaDB)
  • "duckdb" (DuckDB)
  • "sqlite" (SQLite)

If none of these cover your needs you can take a DIY approach and supply any driver function you want so that the vital connection is made.

Here’s an example where we might get the intendo::intendo_revenue table into an in-memory DuckDB database table. We are creating a pointblank agent for use in a data validation workflow, so we could pass the db_tbl() call to the read_fn argument of create_agent().

agent <- 
  create_agent(
    read_fn = 
      ~ db_tbl(
        db = "duckdb",
        dbname = ":memory:",
        table = intendo::intendo_revenue
      ),
    tbl_name = "revenue",
    label = "The **intendo** revenue table."
  ) %>% 
  interrogate()

Take a look at the Introduction to the Data Quality Reporting Workflow (VALID-I) article for more information on how this workflow can be used.

To make logging easier during data validation, the log4r_step() function has been added. This function is used as an action in an action_levels() function call. This allows for the production of logs based on failure conditions (i.e., warn, stop, and notify).

al <- 
  action_levels(
    warn_at = 0.1,
    stop_at = 0.2,
    fns = list(
      warn = ~ log4r_step(x),
      stop = ~ log4r_step(x)
    )
  )

Printing this al object will show us the failure threshold settings and the associated actions for the failure conditions (this print method is NEW for v0.6 🎊🎉).

── The `action_levels` settings ────────────────────────────────────────────
WARN failure threshold of 0.1 of all test units.
\fns\ ~ log4r_step(x)
STOP failure threshold of 0.2 of all test units.
\fns\ ~ log4r_step(x)
────────────────────────────────────────────────────────────────────────────

Using the al object with our validation workflow will result in failures at certain validation steps being logged. By default, logging is to a file named "pb_log_file" in the working directory, but the log4r_step() function is flexible enough to allow any log4r appender to be used. Running the following data validation code

agent <- 
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "`log4r_step()` Example",
    actions = al
  ) %>%
  col_is_posix(vars(date_time)) %>%
  col_vals_in_set(vars(f), set = c("low", "mid")) %>%
  col_vals_lt(vars(a), value = 7) %>%
  col_vals_regex(vars(b), regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>%
  col_vals_between(vars(d), left = 0, right = 4000) %>%
  interrogate()

agent

will print a validation report that looks like this

but it will also produce new log entries in the file "pb_log_file", which is created if it doesn’t exist. Upon inspection with readLines() we see four entries (one for each validation step with at least a WARN condition).

readLines("pb_log_file")
#> [1] "ERROR [2020-11-24 10:26:07] Step 2 exceeded the STOP failure threshold (f_failed = 0.46154) ['col_vals_in_set']" 
#> [2] "WARN  [2020-11-24 10:26:07] Step 3 exceeded the WARN failure threshold (f_failed = 0.15385) ['col_vals_lt']"     
#> [3] "ERROR [2020-11-24 10:26:07] Step 4 exceeded the STOP failure threshold (f_failed = 0.53846) ['col_vals_regex']"  
#> [4] "WARN  [2020-11-24 10:26:07] Step 5 exceeded the WARN failure threshold (f_failed = 0.07692) ['col_vals_between']"

Dozens of Other Small Changes Here and There

This release makes lots of small improvements to almost all aspects of the package. Documentation got some much-needed love here, with several new articles that explain the different Validation Workflows (there are six of ’em) and articles that go over the Information Management workflow. On top of that, there is improved documentation for almost every function in the package.

One thing that was very important to improve upon was the overall appearance of the agent report (aka the validation report). This reporting for data validation needs to be in tip-top shape, so, here’s a quick listing of ten things that changed for the better:

  1. more tooltips
  2. the tooltips are much improved (they animate, have larger text, and are snappier than the previous ones)
  3. SVGs are now used as symbols for the validation steps instead of blurry PNGs
  4. less confusing glyphs are now used in the TBL column
  5. the agent label can be expressed as Markdown and looks nicer in the report
  6. the table type (and name, if supplied as tbl_name) is shown in the header
  7. validation threshold levels also shown in the table header
  8. interrogation starting/ending timestamps are shown (along with duration) in the table footer
  9. the table font has been changed to be less default-y
  10. adjustments to table borders and cell shading were made for better readability

Whoa! That’s a lot of stuff. But, in the end, the table does look nice and it packs in a lot of information. There are live examples of validation reports for the intendo::intendo_revenue table for three different data sources: PostgreSQL, MySQL, and DuckDB. In future releases we can expect even more improvements (across all pointblank reporting outputs).

Closing Remarks

All in all, the changes made in v0.6 have really improved the package! And even though there have been a ton of changes for the better, we have not skimped on QC measures. For a package that does validation, it’s super important to ensure that everything is as correct as possible, so pointblank has a number of quality measures in place.

We have a lot planned for the v0.7 and v0.8 releases, so the future for pointblank is pretty exciting! You can take a look at the updating table at the bottom of the project README for some insight on where development is headed. As always, feel free to file an issue if you encounter a bug, have usage questions, or want to share ideas to make this package better.


The post pointblank v0.6 first appeared on R-bloggers.

BlueSky Statistics Intro and User Guides Now Available


[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

BlueSky Statistics is an easy-to-use menu system that uses the R language to do all its work. My detailed review of BlueSky is available here, and a brief comparison of the various menu systems for R is here. I’ve just released the BlueSky Statistics 7.1 User Guide in printed form on the world’s largest independent bookstore, Lulu.com. A description and detailed table of contents are available here.

Cover design by Kiran Rafiq.

I’ve also released the BlueSky Statistics 7.1 Intro Guide. It is a complete subset of the User Guide, and you can download it for free here (if you have trouble downloading it, your company may have security blocking Microsoft OneDrive; try it at home). Its description and table of contents are here, and soon you will also be able to purchase a printed copy of it from Lulu.com.

Cover design by Kiran Rafiq.

I’m enthusiastic about getting feedback on these books. If you have comments or suggestions, please send them to me at muenchen.bob at gmail dot com.

Other books that feature BlueSky Statistics include: Introduction to Biomedical Data Science; Applying the Rasch Model in Social Sciences Using R; and Data Preparation and Exploration, Applied to Healthcare Data.

Publishing with Lulu.com has been a very pleasant experience. They put the author in complete control, making one responsible for every detail of the contents, obtaining reviewers, and creating a cover file that includes the front, back, and spine of the book to match the dimensions of the book (e.g., more pages means a wider spine). Advertising is left up to the writer as well, hence this blog post! If you are thinking about writing a book, I highly recommend both Lulu.com and getting a cover design from 99designs.com. The latter let me run a contest in which a dozen artists submitted several ideas each. Their built-in survey system let me ask many colleagues for their opinions to help me decide. Altogether, it was a very interesting experience.

To follow the progress of these and other R related books, subscribe to my blog, or follow me on Twitter.

The post BlueSky Statistics Intro and User Guides Now Available first appeared on r4stats.com.


The post BlueSky Statistics Intro and User Guides Now Available first appeared on R-bloggers.

Forecasting Time Series ARIMA Models (10 Must-Know Tidyverse Functions #5)


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This article is part of a R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.

Making multiple ARIMA Time Series models in R used to be difficult. But, with the purrr nest() function and modeltime, forecasting has never been easier. Learn how to make many ARIMA models in this tutorial. Here are the links to get set up. 👇

(Click image to play tutorial)

What is Nest?

Nesting is a data frame reshaping tool that produces a “nested” structure.

The nested structure is super powerful for modeling groups of data. We’ll see how. Let’s check nest() out. With 3 lines of code, we turn an ordinary data frame into a nested data frame.

Before: Unnested time series data with many groups of time series.

tidyverse nest

After: Nested time series data that we can model!

tidyverse nest
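Here is a minimal sketch of that nesting step with made-up data (the video uses a different dataset, so the id/date/value column names below are only illustrative):

library(tidyverse)

# made-up monthly data for three time series
raw_tbl <- tibble(
  id    = rep(c("A", "B", "C"), each = 24),
  date  = rep(seq(as.Date("2019-01-01"), by = "month", length.out = 24), times = 3),
  value = rnorm(72, mean = 100, sd = 10)
)

nested_tbl <- raw_tbl %>%
  group_by(id) %>%
  nest()   # one row per id, with a list-column `data` holding each group's observations

nested_tbl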

ARIMA Modeling with Modeltime

So what can we do with a “Nested” Data Frame? How about making 7 ARIMA Forecasts!

Make ARIMA Models

ARIMA Model DataFrame

And with a little extra work (thanks to my Modeltime R Package), we can create this INTERACTIVE ARIMA FORECAST! 💥💥💥

Tidyverse Unnest ARIMA Models

Time Series ARIMA Models
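A hedged sketch of the modeling step, continuing from the made-up nested_tbl above rather than the tutorial's data: fit one auto-ARIMA per nested group with modeltime and purrr.

library(modeltime)
library(parsnip)
library(dplyr)
library(purrr)

model_tbl <- nested_tbl %>%
  mutate(
    arima_fit = map(
      data,
      ~ arima_reg() %>%
        set_engine("auto_arima") %>%
        fit(value ~ date, data = .x)   # one fitted ARIMA model per group
    )
  )

model_tbl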

The look on your coworker’s face speaks volumes. 👇

shocked gif

But you don’t have the force yet!

Here’s how to master R programming and become powered by R. 👇

Ive got the power

…Your executive management review after you’ve launched your first Shiny App. 👇

Crowd Applause

This is career acceleration.

SETUP R-TIPS WEEKLY PROJECT

  1. Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter

  2. Set Up the GitHub Repo: https://github.com/business-science/free_r_tips

  3. Check out the setup video (https://youtu.be/F7aYV0RPyD0). Or, hit Pull in the Git menu to get the R-Tips code

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)


The post Forecasting Time Series ARIMA Models (10 Must-Know Tidyverse Functions #5) first appeared on R-bloggers.

Reverse Engineering AstraZeneca’s Vaccine Trial Press Release


[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In their press release AstraZeneca provide the following information about an interim analysis of their vaccine trial:

  • One dosing regimen (first a half dose and at least a month later a full dose) with 2741 participants showed 90% efficacy

  • Another dosing regimen (two full doses at least one month apart) with 8896 participants showed 62% efficacy

  • Average efficacy is 70% and in total there were 131 Covid cases.

Most observers seem surprised that the regimen with only half an initial dosage showed a substantially larger efficacy. Some theories for this result are sketched in this Nature news article. An obvious question is: How statistically robust is the 90% efficacy reported for the smaller dosing regimen?

This post first performs several educated guesses to reverse-engineer the underlying case numbers from the press release. Then we follow Biontech/Pfizer’s Bayesian analysis approach to compare the posterior distributions of AstraZeneca’s two dosage regimens with those of the Biontech/Pfizer and Moderna trials.

Let s1 denote the share of the 131 Covid cases that accrued in the first dosage regimen. If we ignore rounding errors in the stated efficacy, we can compute it from the equation that determines the average efficacy of 70% of both dosage regimens by solving

0.9 * s1 + 0.62 * (1 - s1) = 0.7

which yields s1 = 2/7 = 28.6%.
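A quick check in R (my own snippet, not from the original post):

s1 <- (0.7 - 0.62) / (0.9 - 0.62)
s1        # 0.2857143, i.e. 2/7
s1 * 131  # about 37.4 of the 131 cases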

So while approximately 28.6% of the 131 Covid-19 cases come from the first, smaller dosing regimen, only 2741 / 11637 = 23.6% of the participants are from that regimen. Given that the smaller dosing regimen has higher efficacy, I would rather have expected its share of the Covid-19 cases to be smaller than its 23.6% share of the participants. The result means that the share of participants in the control group who got Covid-19 is larger in the first dosing regimen than in the second one.

Looking at AstraZeneca’s press release in more detail, we read

The pooled analysis included data from the COV002 Phase II/III trial in the UK and COV003 Phase III trial in Brazil.

A description of the UK and Brazil trials reveals that the UK trial had both dosing regimens, while the Brazil trial only had the second, larger dosing regimen. A quick internet search did not confirm that the Covid-19 risk was smaller in Brazil than in the UK. Yet, the UK trial may well have started earlier than the Brazil trial, which would give participants more time to catch Covid-19. That might explain the relatively high Covid-19 case proportion in the smaller dosing regimen.

For the moment, we will ignore integer constraints and thus compute that m1 = (2/7) * 131 ≈ 37.43 cases were from the smaller dosing regimen. As a next step, we want to compute the number of Covid-19 cases mv1 in the vaccinated treatment group of the smaller dosing regimen, using the reported efficacy of VE1 = 90%.

We first compute for the smaller dosing regimen the helper parameter theta1 that shall measure the share of the m1 Covid-19 cases that were from vaccinated subjects. To compute it, we need to make an assumption about the subject split between treatment and control group. Since no information is given in the press release, let us assume that AstraZeneca has a 1:1 assignment to treatment and control group, like Biontech/Pfizer. Then we have (see e.g. here)

theta1 = (1-VE1)/(2-VE1)
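One way to see this identity (my short derivation, not spelled out in the post): with a 1:1 split and equal follow-up in both arms, VE1 = 1 - mv1/mc1, so mv1 = (1-VE1)*mc1 and therefore

theta1 = mv1/(mv1 + mc1) = (1-VE1)*mc1 / ((1-VE1)*mc1 + mc1) = (1-VE1)/(2-VE1)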

From this we can compute mv1, as well as the cases from the control group mc1 in the smaller dosing regimen. The following R code computes in this fashion all relevant case numbers for both dosing regimens:

m = 131
m1 = m*(2/7)
VE1 = 0.9
theta1 = (1-VE1)/(2-VE1)
mv1 = theta1*m1
mc1 = m1-mv1
m2 = m*(5/7)
VE2 = 0.62
theta2 = (1-VE2)/(2-VE2)
mv2 = theta2*m2
mc2 = m2-mv2
rbind(
  smaller_dosing = c(m=m1, mv=mv1, mc=mc1, VE=VE1),
  larger_dosing = c(m=m2, mv=mv2, mc=mc2, VE=VE2))
##                       m        mv       mc   VE
## smaller_dosing 37.42857  3.402597 34.02597 0.90
## larger_dosing  93.57143 25.766046 67.80538 0.62

OK, the obvious problem remains that these case counts are no integer numbers. That is probably due to the fact that the reported efficacy percentages are rounded. So we have to guess integer numbers that yield efficacy values that are plausible to have been rounded to the reported numbers. As a guess, let us just round the numbers above to the nearest integers:

# Integer guess: just round case numbers above
m1 = 37; mv1 = 3;  mc1 = 34
m2 = 94; mv2 = 26; mc2 = 68
# Compute resulting efficacies
c(
  VE1 = 1-(mv1/mc1),
  VE2 = 1-(mv2/mc2),
  VE  = 1-(mv1+mv2)/(mc1+mc2))
##       VE1       VE2        VE 
## 0.9117647 0.6176471 0.7156863

This means these assumed case counts would yield 91.2% efficacy in the small dosing regimen, 61.8% efficacy in the large dosing regimen and a 71.6% average efficacy. Seems roughly consistent with the stated numbers. Of course, a lot of guesses went into this computation.

So if we assume 3 of 37 Covid-19 cases in the small dosage regimen were vaccinated compared to 26 of 94 in the large dosage regimen, is the difference in these proportions significant? While I am no expert on non-parametric tests (economists tend to mostly run regressions), the R function prop.test seems at first sight appropriate. So let's just run it:

prop.test(x = c(mv1, mv2), n = c(m1, m2))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(mv1, mv2) out of c(m1, m2)
## X-squared = 4.8083, df = 1, p-value = 0.02832
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.34049235 -0.05053698
## sample estimates:
##     prop 1     prop 2 
## 0.08108108 0.27659574

The p-value of 2.8% indeed gives some support for the hypothesis that the true efficacy is larger in AstraZeneca’s small dosing regimen. (Of course, just taking this p-value at face value might be slightly p-hackish.)

As a final step, let us compare AstraZeneca’s results with those reported by Biontech/Pfizer and Moderna using the Bayesian approach suggested by Biontech/Pfizer’s study plan (see my previous post for details). Biontech reported that out of 170 Covid-19 cases 8 subjects were vaccinated and Moderna’s press release states 5 out of 95 cases.

The code below shows for each trial / dosing regimen the posterior distribution for the vaccine efficacy using the prior Beta distribution specified by Biontech/Pfizer and our guessed numbers for AstraZeneca.

library(dplyr)    # assumed loaded earlier in the original post
library(ggplot2)

# Helper function
theta.to.VE = function(theta) (1-2*theta)/(1-theta)

# Parameters of Biontech/Pfizer's prior distribution
a0 = 0.700102; b0 = 1

grid = tibble(
  study = c("Biontech/Pfizer","Moderna","AstraZeneca-1","AstraZeneca-2"),
  m = c(170,95,37,94),
  mv = c(8,5,3,26),
  mc = m-mv) %>% 
  tidyr::expand_grid(theta = seq(0,0.5,by=0.002)) %>%
  mutate(
    VE = theta.to.VE(theta),
    density = dbeta(theta, shape1 = a0+mv, shape2 = b0+mc)
  )

# Show all 4 posterior distributions
ggplot(filter(grid, VE > 0.2), aes(x=VE, y=density, fill=study, color=study)) +
  geom_area(alpha=0.5, position = "identity") +
  ggtitle("Estimated posterior efficacy of different vaccines / dosing regimens")

Hmm, for me subjectively the posterior for AstraZeneca’s small dosing regimen looks actually better than I would have suspected after reading the press release. Of course, a lot of guesses went into those curves.

Overall, I personally would consider AstraZeneca’s preliminary results good news. In particular, if one also accounts for the substantially lower expected price and the following statements from the press release:

no hospitalisations or severe cases of the disease were reported in participants receiving the vaccine.

The vaccine can be stored, transported and handled at normal refrigerated conditions (2-8 degrees Celsius/ 36-46 degrees Fahrenheit) for at least six months.


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


The post Reverse Engineering AstraZeneca's Vaccine Trial Press Release first appeared on R-bloggers.


2020-05 Adding TikZ support to ‘dvir’


[This article was first published on R – Stat Tech, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This report describes an update to the R package ‘dvir’ to add support for the TikZ graphics package. This allows R users to make use of TikZ drawing capabilities within R graphics.

Paul Murrell

Download


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.


The post 2020-05 Adding TikZ support to ‘dvir’ first appeared on R-bloggers.

the riddle(r) of the certain winner losing in the end


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Considering a binary random walk, starting at zero, what is the probability of being almost sure of winning at some point only to lose at the end? This is the question set by the post-election Riddler, with almost sure meaning above 99% and the time horizon set to n=101 steps (it could have been 50 or 538!). As I could not see a simple way to compute the collection of states with a probability of being positive at the end of at least 0.99, even after checking William Feller's fabulous chapter on random walks, I wrote an R code to find them, and then ran a Monte Carlo evaluation of the probability of reaching this collection and still ending up with a negative value. Which came out as 0.00212 over repeated simulations. Obviously smaller than 0.01, but not considerably so. As seen on the above picture, the set to be visited is actually not inconsiderable. The bounding curves are the diagonal and the 2.33 √(n-t) bound derived from the limiting Brownian approximation to the random walk, which fits rather well. (I wonder if there is a closed-form expression for the probability of the Brownian motion hitting the boundary 2.33 √(n-t). Simulations with 1001 steps give an estimated probability of 0.505, leading to a final probability of 0.00505 of getting over the boundary and losing in the end, close to the 1/198 produced by The Riddler.)
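For readers who want to experiment, here is a rough Monte Carlo sketch (mine, not Xi'an's code) that substitutes the Brownian-approximation boundary 2.33 √(n-t) for the exact collection of ≥99% states:

set.seed(101)
n <- 101
N <- 1e5
hit_then_lose <- replicate(N, {
  x <- cumsum(sample(c(-1, 1), n, replace = TRUE))  # the binary random walk
  t <- seq_len(n)
  any(x >= 2.33 * sqrt(n - t)) && x[n] < 0          # crossed the boundary, lost anyway
})
mean(hit_then_lose)  # should be of the same order as the 0.00212 reported above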


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


The post the riddle(r) of the certain winner losing in the end first appeared on R-bloggers.

Why R Webinar – Mocking in R


[This article was first published on Why R? Foundation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Tomorrow at Why R? Webinars we will have the chance to host Max Kronborg, who will present Mocking in R. His bio and the link to YouTube are below. Go to the video and set a reminder!

Thursday November 26th. 7:00pm UTC

Mocking allows the programmer to temporarily substitute custom code for part of a given function (see the short sketch after this list). This can be useful in a variety of scenarios:

  • When needing stubs or fakes for unit tests
  • To replace a not yet completed function, particularly helpful in test driven development
  • When for whatever reason R is doing something you don’t want it to do, like accessing the internet, reading files from a strange location, or printing graphs incorrectly.
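A minimal flavour of the first use case, using the mockery package (my illustration, not Max's material; summarise_rates() and get_rates() are made-up functions):

library(testthat)
library(mockery)

get_rates <- function() stop("requires internet access")  # imagine an API call

summarise_rates <- function() {
  mean(get_rates())
}

test_that("summarise_rates() averages the downloaded rates", {
  # Temporarily replace get_rates() inside summarise_rates() with a fake
  stub(summarise_rates, "get_rates", function() c(1, 2, 3))
  expect_equal(summarise_rates(), 2)
})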

Max is a Danish Computer Science student, currently in his final year at the University of Bath after completing a placement year with Mango.

He is particularly interested in ML, and is currently working on a dissertation aimed at predicting capacity levels on public transportation.


To leave a comment for the author, please follow the link and comment on their blog: Why R? Foundation.


The post Why R Webinar - Mocking in R first appeared on R-bloggers.

COVID-19 Mobility Data


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We are in the middle of a mind-boggling natural experiment here in the United States. In spite of the advice from the CDC and dire warnings from our nation's health care experts, millions of Americans will travel over the long holiday weekend. Although the number of people flying is significantly down from last year, there are still large numbers of Americans on the move. The TSA reported that more than two million people went through airport checkpoints last weekend, and the AAA is forecasting that as many as fifty million people may travel.

No matter what the outcome, it is a pretty safe bet that the mobility data collected this weekend will be studied by epidemiologists and public health experts for years to come. In addition to the anecdotal reports linking travel to increased COVID-19 transmission, there are a number of studies, including this recent PNAS Report, which suggests a "positive relationship between mobility inflow and the number of infections", and this Lancet Correspondence, which concludes that "concomitant increases in mobility will be correlated with an increased number of cases". Still, the experts are just beginning to understand the dynamics of mobility and the spread of infection. (See, for example, this Nature paper that claims a "relatively simple SEIR model" informed by the hourly movements of 98 million people can "accurately fit the real case trajectory".)

Acquiring mobility data requires access to large scale infrastructure. Fortunately, several sites are providing access to large scale data sets. The COVIDcast site from the Delphi group provides both R and Python APIs to access the SafeGraph Mobility Data. Click here to see a time-lapse animation of “away from home” data that shows how the cycles of travel vary from before the pandemic up through the middle of this month.
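As a flavour of the R API, a minimal sketch might look like the following (the data source and signal names are assumptions on my part; check the covidcast documentation for the currently available SafeGraph signals):

library(covidcast)

# Pull an assumed SafeGraph mobility signal for two states over three weeks
mobility <- covidcast_signal(
  data_source = "safegraph",
  signal      = "completely_home_prop",
  start_day   = "2020-11-01",
  end_day     = "2020-11-21",
  geo_type    = "state",
  geo_values  = c("ny", "ca")
)

head(mobility)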

Click here for another classy dashboard from the University of Maryland and the Maryland Transportation Institute that shows how mobility data tracks with COVID cases.

To get your hands on some mobility data in addition to what is available with the Delphi API, try out the covid19mobility package, which scrapes mobility data from Google and Apple, and look here for the data and R code behind the PNAS report mentioned above.

For an in depth look at the issues relating to mobility data and the COVID-19 pandemic, please sign up for the next COVID-19 Data Forum event which will be held at 9 AM Pacific Time on Thursday, December 10th.

Chris Volinsky, Associate Vice-President of Big Data Research at AT&T Labs, will moderate presentations and a panel discussion with Caroline Buckee, Associate Professor of Epidemiology and Associate Director of the Center for Communicable Disease Dynamics at the Harvard T.H. Chan School of Public Health, Dr. Andrew Schoeder, Vice-president of Research & Analytics for Direct Relief, and Christophe Fraser, Professor of Pathogen Dynamics at the University of Oxford and Senior Group Leader at the Big Data Institute, Oxford University, UK.

Finally, wherever you are: please assess the risks of travel for yourself, for your family, and for anyone with whom you may share the air. Stay safe!



To leave a comment for the author, please follow the link and comment on their blog: R Views.


The post COVID-19 Mobility Data first appeared on R-bloggers.

Warspeed 5 — priors and models continued


[This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

My last 4 posts have all focused on the vaccines being produced to fight COVID-19. They have primarily focused on bayesian methods (or at least on comparing bayesian to frequentist methods). This one follows that pattern and provides expanded coverage of the concept of priors in bayesian thinking, how to operationalize them, and additional coverage of how to compare bayesian regression models using various tools in R.

There’s no actual additional analysis of “real data”. While the news has all been good lately I haven’t found anything publicly available that begs for investigation. Instead we’ll use “realistic” but not real data, and let the process unfold for us.

I will, however, try to show you some tools and tricks in r to make your analysis of any data easier and smoother.

What bayesians do

First off let me remind the reader that like most I was educated and trained in the frequentist tradition. I’m not against frequentist methods and even admit in the vast majority of cases they lead to similar conclusions. But, I do believe the bayesian methods are better and let you answer questions you really want to research. Now a quote from a great book, which also happens to be available at no cost.

From a Bayesian perspective, statistical inference is all about belief revision. I start out with a set of candidate hypotheses about the world. I don’t know which of these hypotheses is true, but do I have some beliefs about which hypotheses are plausible and which are not. When I observe the data, I have to revise those beliefs. If the data are consistent with a hypothesis, my belief in that hypothesis is strengthened. If the data is inconsistent with the hypothesis, my belief in that hypothesis is weakened. Navarro, page 555.

As a formula you’ll see Bayes Theorem expressed something like this:

\[P(hypothesis | data) = \frac{P(data | hypothesis) * P(hypothesis) }{P(data)}\]

We have data, we have hypotheses, and we have bayesian tools, but most "new" bayesians seem to stumble in the process of making good on our declaration of what our priors are. Up until today's post I've mainly avoided the topic, or briefly touched on it using terms like "vague", "uninformed", or "flat". See for example the very flat and uninformed plot from my second blog post in the series.

It’s okay in a bayesian framework to have priors that are vague if that’s what you really know and what you really mean. Just don’t use them when you know “more” or “better”. There is a real difference between “this value can be literally anything” and “this value we are studying has some prior knowledge behind it and some likely spread of values”.

And, no, you are absolutely NOT guaranteeing success or dooming yourself to failure if you are off on your priors. Obviously, as a scientist, you should pick values that make sense and not pluck them from thin air or make them up, but rest assured that given sufficient data even priors that are “off” are adjusted by the data. Isn’t that the point after all? Develop a hypothesis and test it with data? Notice that in bayesian inference we don’t need to force a “null hypothesis” that we “reject”.

Come let us reason together

At this point if you’re not comfortable with bayesian inference I probably haven’t helped you at all and it all probably feels a bit theoretical. Let’s take our current vaccine problem and work it through.

Step 1 (what do we think we know or believe) about the vaccine and COVID-19
  1. A vast amount of time, money, and science have gone into developing these vaccines, and in many cases they are products of years of previous research – if not about COVID-19 then at least other viruses. The candidate vaccines have already been tried in Phase I and Phase II in perhaps animals or in limited numbers of humans. They got to phase III not by being unknown or “anything is possible” but rather on a belief they are going to be at least 50% effective. So we know something about the direction and with less certainty something about the magnitude.

  2. We know, given public data, that older people tend to contract COVID-19 less. All the data points to much higher infection rates in younger populations. Good thing too, since the older population tends to have much more severe outcomes. Notice neither #1 nor #2 says anything about how effective the vaccine is in older populations; we’ll get to that in a minute. But we certainly have evidence that currently older folks are less likely to get it. Notice that reasonable people can disagree. It’s okay if you have scientific prior evidence that makes you think otherwise. Express it! Let the data inform us.

  3. There is evidence that vaccines can be less effective for older populations. It is certainly a worry for scientists going into the trial. Just as there are worries for other health risk factors and potentially race and ethnicity and even gender as well. Chuck’s hypothesis is that the vaccine will be slightly less effective for older folks. But I’m prepared to be wrong and believe this less firmly than I believe the vaccine is effective.

Step 2: How do we make use of our prior knowledge?

In the last post we used brms to build a bayesian equivalent of a glm model. That’s our plan again today. So how do we go from my theoretical hypotheses in 1-3 above to something we can feed into R and let it chew on and inform us with our data? This is often the hard part for newcomers to bayesian inference. But it really needn’t be if you focus on the basics. Our plan is to feed our hypotheses into a generalized linear model. As with the simpler case of modeling with lm, we’ll produce a model that gives us slope estimates (the venerable \(\beta\) or \(\hat{b}_n\) coefficients). In traditional frequentist output these are in turn tested against the hypothesis that they are zero, and a p-value is generated. Zero slope means that predictor has zero ability to help us predict the outcome: in \(y = \hat{b}_1 x + \hat{b}_0\), if \(\hat{b}_1 = 0\) then \(y = \hat{b}_0\), the intercept, and the predictor can be eliminated.

So our priors need to be converted to something that makes sense as a value for \(\hat{b}_n\). No effect equals zero, and bigger effects mean larger positive or negative values. Notice that our estimates have a standard error, which means they have a standard deviation that will serve as a way of expressing our “confidence” about our priors. A small value means we’re more sure the effect will be tightly centered; a larger SD means we’re less certain.

So going back to 1-3 above let’s proceed with our priors one by one.

  1. Vaccine effect – we know something about the direction and, with less certainty, something about the magnitude. We could pick from a large number of possible distributions for our \(\hat{b}_n\) here, but hopefully you’re thinking by now that this sounds like a normal or t distribution with an appropriate mean and sd. With small amounts of data t is possibly a better choice, but remember our data will inform us; we are NOT fixing the outcome. So let’s set a prior for \(\hat{b}_{conditionvaccinated}\) with the mean at -2.0 and the sd at .5 (see the short sketch after this list for what a prior like this implies on the odds scale). Why a minus sign? That’s because \(\hat{b}_{conditionvaccinated}\) compares to the placebo condition, and therefore the number of people who got COVID-19 in the vaccinated group will be lower. The .5 sd is because we think the vaccine will be at least moderately effective. No worries, later I’ll show you how to be more specific in your hypotheses. For brms and STAN we express this as prior(normal(-2.0, .5), class = b, coef = conditionvaccinated)

  2. Older folks are less likely to get COVID-19. Remember this is a general statement about older folks regardless of whether they received the placebo or the vaccine. This difference is less pronounced and less certain so we’ll set the mean closer to zero at -.5 and increase the standard deviation to 1 to indicate we believe it’s more variable and that is expressed as prior(normal(-0.5, 1), class = b, coef = ageOlder).

  3. Chuck’s hypothesis is that the vaccine will be slightly less effective for older folks, which means that this time the sign will be positive: the interaction ageOlder:conditionvaccinated will yield slightly higher infection rates, and therefore \(\hat{b}_{ageOlder:conditionvaccinated}\) will be a small positive number. But I’m less firm in this belief, so once again standard deviation = 1 and the code is prior(normal(.5, 1), class = b, coef = ageOlder:conditionvaccinated)
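To get a feel for what the normal(-2.0, .5) prior in item 1 implies, here is a quick sketch (my own illustration, not code from this post) that translates draws from that prior onto the odds-ratio scale:

library(ggplot2)

# Draws from the prior on the log-odds coefficient for vaccination
prior_draws <- data.frame(b = rnorm(10000, mean = -2.0, sd = 0.5))
prior_draws$odds_ratio <- exp(prior_draws$b)  # vaccinated vs placebo odds of infection

quantile(prior_draws$odds_ratio, c(.025, .5, .975))
# roughly 0.05 to 0.36, centered near exp(-2) ~ 0.14: we expect vaccination to cut
# the odds of infection substantially, but we are not fixing by how much

ggplot(prior_draws, aes(x = odds_ratio)) +
  geom_density() +
  labs(title = "Implied prior on the vaccinated vs placebo odds ratio")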

Now we have reasonable, informed priors! And hopefully you have a better idea of how to set them in a regression model of interest to you. Remember you can consult the STAN documentation for more suggestions on informative priors. We can run get_prior to get back information about the defaults. Notice we don’t really care about the overall intercept \(\hat{b}_0\).

Load libraries and on to the data

Load the necessary libraries as before

library(dplyr)
library(brms)
library(ggplot2)
library(kableExtra)
theme_set(theme_bw())

Load the same data as last time; this time I’m providing it as a structure() call for your ease. Again, you could have your data in full long format and accomplish the same modeling. I’m providing it wide and summarized for computational speed, and leaving you the dplyr commands to go long if you like.

agg_moderna_data <- 
   structure(list(
      condition = structure(c(1L, 1L, 2L, 2L), .Label = c("placebo", "vaccinated"), class = "factor"), 
      age = structure(c(1L, 2L, 1L, 2L), .Label = c("Less than 65", "Older"), class = "factor"),
      didnot = c(9900L, 993L, 9990L, 999L), 
      got_covid = c(100L, 7L, 10L, 1L), 
      subjects = c(10000L, 1000L, 10000L, 1000L)), 
      class = "data.frame", row.names = c(NA, -4L))

# if you need to convert to long format use this...
# agg_moderna_data %>% 
#    tidyr::pivot_longer(cols = c(didnot, got_covid), 
#                        names_to = "Outcome") %>%
#    select(-subjects) %>%
#    tidyr::uncount(weights = value) %>%
#    mutate(Outcome = factor(Outcome))

# Make a nice table for viewing
agg_moderna_data %>%
  kbl(digits = c(0,0,0,0,0),
      caption = "Notional Data") %>%
  kable_minimal(full_width = FALSE,
      position = "left") %>%
  add_header_above(c("Factors of interest" = 2, 
                     "COVID Infection" = 2, 
                     "Total subjects" = 1))
Table 1: Notional Data

  condition    age            didnot   got_covid   subjects
  placebo      Less than 65     9900         100      10000
  placebo      Older             993           7       1000
  vaccinated   Less than 65     9990          10      10000
  vaccinated   Older             999           1       1000

(Factors of interest: condition, age. COVID infection: didnot, got_covid. Total subjects: subjects.)

Once again I’m assuming there are fewer older subjects in the trials.

Repeating the frequentist GLM solution

So we’ll create a matrix of outcomes called outcomes, with successes got_covid and failures didnot: for binomial and quasibinomial families the response can be specified … as a two-column matrix with the columns giving the numbers of successes and failures.

outcomes <- cbind(as.matrix(agg_moderna_data$got_covid, ncol = 1),
             as.matrix(agg_moderna_data$didnot, ncol = 1))
moderna_frequentist <- 
  glm(formula = outcomes ~ age + condition + age:condition,
      data = agg_moderna_data,
      family = binomial(link = "logit"))
# moderna_frequentist glm results
broom::tidy(moderna_frequentist)
## # A tibble: 4 x 5
##   term                         estimate std.error statistic  p.value
##   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    -4.60      0.101   -45.7   0.      
## 2 ageOlder                       -0.360     0.392    -0.917 3.59e- 1
## 3 conditionvaccinated            -2.31      0.332    -6.96  3.32e-12
## 4 ageOlder:conditionvaccinated    0.360     1.12      0.321 7.48e- 1
### In case you want to plot the results
# sjPlot::plot_model(moderna_frequentist, 
#                    type = "int")

Repeating the bayesian solution with no priors given

moderna_bayes_full <- 
   brm(data = agg_moderna_data,
       family = binomial(link = logit),
       got_covid | trials(subjects) ~ age + condition + age:condition,
       iter = 12500, 
       warmup = 500, 
       chains = 4, 
       cores = 4,
       seed = 9,
       file = "moderna_bayes_full")
summary(moderna_bayes_full)
##  Family: binomial 
##   Links: mu = logit 
## Formula: got_covid | trials(subjects) ~ age + condition + age:condition 
##    Data: agg_moderna_data (Number of observations: 4) 
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
## 
## Population-Level Effects: 
##                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                       -4.60      0.10    -4.80    -4.41 1.00    46428    37933
## ageOlder                        -0.41      0.40    -1.27     0.33 1.00    26886    23100
## conditionvaccinated             -2.35      0.34    -3.06    -1.73 1.00    27428    24161
## ageOlder:conditionvaccinated     0.05      1.29    -2.90     2.17 1.00    18097    15000
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Priors and Bayes Factors

Earlier I showed you what priors we want. We can run get_prior to get back information about the defaults.

get_prior(data = agg_moderna_data,
       family = binomial(link = logit),
       got_covid | trials(subjects) ~ age + condition + age:condition)
##                 prior     class                         coef group resp dpar nlpar bound       source
##                (flat)         b                                                                default
##                (flat)         b                     ageOlder                              (vectorized)
##                (flat)         b ageOlder:conditionvaccinated                              (vectorized)
##                (flat)         b          conditionvaccinated                              (vectorized)
##  student_t(3, 0, 2.5) Intercept                                                                default

Right now they are all flat. Let’s create an object called my_priors that contains what we want them to be, by model parameter. Then rerun and put the results into full_model. Finally, let’s plot the priors and posteriors by parameter.

my_priors <- 
   c(prior(normal(-0.5, 1), class = b, coef = ageOlder),
     prior(normal(-2.0, .5), class = b, coef = conditionvaccinated),
     prior(normal(.5, 1), class = b, coef = ageOlder:conditionvaccinated))
my_priors
##            prior class                         coef group resp dpar nlpar bound source
##  normal(-0.5, 1)     b                     ageOlder                                user
##  normal(-2, 0.5)     b          conditionvaccinated                                user
##   normal(0.5, 1)     b ageOlder:conditionvaccinated                                user
full_model <- 
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age + condition + age:condition,
      prior = my_priors,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "full_model")
summary(full_model)
##  Family: binomial 
##   Links: mu = logit 
## Formula: got_covid | trials(subjects) ~ age + condition + age:condition 
##    Data: agg_moderna_data (Number of observations: 4) 
## Samples: 4 chains, each with iter = 12500; warmup = 500; thin = 1;
##          total post-warmup samples = 48000
## 
## Population-Level Effects: 
##                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                       -4.61      0.10    -4.80    -4.42 1.00    54540    37706
## ageOlder                        -0.42      0.36    -1.17     0.25 1.00    35067    27560
## conditionvaccinated             -2.24      0.26    -2.77    -1.75 1.00    36868    28006
## ageOlder:conditionvaccinated     0.35      0.72    -1.12     1.69 1.00    31410    28442
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(bayestestR::bayesfactor_parameters(full_model, null = c(-.5, .5)))

# bayestestR::sensitivity_to_prior(full_model, index = "Median", magnitude = 10)

With “good” priors we can compare models

Now that we have convinced ourselves, theoretically and by running the full model, that we have reasonable and informative priors, we can begin the process of comparing models. The first step is to build the other models by removing parameters and their associated priors. Hopefully my naming scheme makes it clear what we’re doing. The obvious and elegant thing to do would be to write a simple purrr::map statement that builds a list of models (a rough sketch of that idea follows the next code block), but for clarity I’ll simply repeat the same steps for each model (if you want the purrr statement, add a comment in disqus). I’ll spare you printing all the various model summaries.

# no priors here
null_model <- 
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ 1,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "null_model")

# no prior for the interaction term
my_priors2 <- c(prior(normal(-0.5, 1), class = b, coef = ageOlder),
           prior(normal(-2.0, .5), class = b, coef = conditionvaccinated))
no_interaction <- 
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age + condition,
      prior = my_priors2,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_interaction")
# summary(no_interaction)

# No prior for interaction or vaccine
my_priors3 <- c(prior(normal(-0.5, 1), class = b, coef = ageOlder))
no_vaccine <- 
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ age,
      prior = my_priors3,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_vaccine")
# summary(no_vaccine)

# no prior for interaction or for age
my_priors4 <- c(prior(normal(-2.0, .5), class = b, coef = conditionvaccinated))
no_age <- 
  brm(data = agg_moderna_data,
      family = binomial(link = logit),
      got_covid | trials(subjects) ~ condition,
      prior = my_priors4,
      iter = 12500,
      warmup = 500,
      chains = 4,
      cores = 4,
      seed = 9,
      save_pars = save_pars(all = TRUE),
      file = "no_age")
# summary(no_age)
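For the curious, the purrr alternative mentioned above could look roughly like this (my sketch, not code from the post; it reuses the my_priors objects defined above):

library(purrr)
library(tibble)

model_specs <- tibble(
  name    = c("full_model", "no_interaction", "no_vaccine", "no_age", "null_model"),
  formula = list(
    got_covid | trials(subjects) ~ age + condition + age:condition,
    got_covid | trials(subjects) ~ age + condition,
    got_covid | trials(subjects) ~ age,
    got_covid | trials(subjects) ~ condition,
    got_covid | trials(subjects) ~ 1
  ),
  priors  = list(my_priors, my_priors2, my_priors3, my_priors4, NULL)
)

# One brm() call per row; file = name caches each fit just like above
models <- pmap(model_specs, function(name, formula, priors) {
  brm(formula = formula,
      data = agg_moderna_data,
      family = binomial(link = logit),
      prior = priors,
      iter = 12500, warmup = 500, chains = 4, cores = 4,
      seed = 9, save_pars = save_pars(all = TRUE),
      file = name)
})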

Okay, we now have models named full_model, no_interaction, no_age, no_vaccine, and null_model. What we’d like to know is which is the “best” model, where best is a balance between accuracy and parsimony: highly predictive, but without any unnecessary predictors. There are a number of ways to accomplish this, including AIC, BIC, and even comparing R squared, but we’re going to focus on comparing Bayes Factors.

Using bayestestR::bayesfactor_models we’ll compare all our models to the null_model. Then, since as usual the null_model is a rather silly point of comparison, we’ll use update to change our comparison point to the model that does NOT include the interaction term (a common question in research). That will help us understand whether we need to worry about the vaccine being less effective in older subjects. To make your workflow easier I actually recommend skipping this manual update process and going straight to bayestestR::bayesfactor_inclusion(comparison, match_models = TRUE).

comparison <- 
   bayestestR::bayesfactor_models(full_model, no_interaction, no_age, no_vaccine,
                                  denominator = null_model)
comparison
## # Bayes Factors for Model Comparison
## 
##   Model                                      BF
##   [1] age + condition + age:condition 7.352e+18
##   [2] age + condition                 9.708e+18
##   [3] condition                       1.991e+19
##   [4] age                                 0.486
## 
## * Against Denominator: [5] (Intercept only)
## *   Bayes Factor Type: marginal likelihoods (bridgesampling)
update(comparison, reference = 2)
## # Bayes Factors for Model Comparison
## 
##   Model                                      BF
##   [1] age + condition + age:condition     0.757
##   [3] condition                           2.051
##   [4] age                             5.006e-20
##   [5] (Intercept only)                1.030e-19
## 
## * Against Denominator: [2] age + condition
## *   Bayes Factor Type: marginal likelihoods (bridgesampling)
bayestestR::bayesfactor_inclusion(comparison, match_models = TRUE)
## # Inclusion Bayes Factors (Model Averaged)
## 
##               Pr(prior) Pr(posterior) Inclusion BF
## age                 0.4         0.263        0.488
## condition           0.4         0.801    1.993e+19
## age:condition       0.2         0.199        0.757
## 
## * Compared among: matched models only
## *    Priors odds: uniform-equal
### This is possible but NOT recommended
# bayestestR::bayesfactor_inclusion(comparison)

There is very little evidence that we should include age or the age:condition interaction terms in our model. If anything we have modest evidence that we should exclude them! Weak evidence to be sure but evidence nonetheless (remember bayesian inference allows for “supporting” the null, not just rejecting it).

More elegant testing (order restriction)

Notice that our priors are unrestricted - that is, our \(\hat{b}_n\) parameters in the model have some non-zero credibility all the way out to infinity (no matter how small; this is true for both the prior and posterior distribution). Does it make sense to let our priors cover all of these possibilities? Our priors can be formulated as restricted priors (Morey, 2015; Morey & Rouder, 2011).

By testing these restrictions on prior and posterior samples, we can see how the probabilities of the restricted distributions change after observing the data. This can be achieved with bayesfactor_restricted(), which computes a Bayes factor for the restricted model vs the unrestricted model.

Think of it as more precise hypothesis testing. We need to specify these restrictions as logical conditions. An easy one is that I’m very confident that the vaccine is highly effective and \(\hat{b}_{conditionvaccinated}\) should be nowhere near 0. As a matter of research I want to know how the data support the notion that it is less than -2, a very large effect. How does that belief square with our data? More complex but testable is the hypothesis that \(\hat{b}_{ageOlder.conditionvaccinated}\) (the interaction), if it is non-zero, is quite modest. The vaccine may work a little better or a little worse on older people, but it is not a big difference.

chuck_hypotheses1 <- 
   c("b_conditionvaccinated < -2")
bayestestR::bayesfactor_restricted(full_model, hypothesis = chuck_hypotheses1)
## # Bayes Factor (Order-Restriction)
## 
##                  Hypothesis P(Prior) P(Posterior)    BF
##  b_conditionvaccinated < -2    0.497        0.822 1.654
## 
## * Bayes factors for the restricted model vs. the un-restricted model.
chuck_hypotheses2 <- 
   c("b_conditionvaccinated < -2",
     "(b_ageOlder.conditionvaccinated > -0.5) & (b_ageOlder.conditionvaccinated < 0.5)"
   )
bayestestR::bayesfactor_restricted(full_model, hypothesis = chuck_hypotheses2)
## # Bayes Factor (Order-Restriction)
## 
##                                                                        Hypothesis P(Prior) P(Posterior)    BF
##                                                        b_conditionvaccinated < -2    0.499        0.822 1.648
##  (b_ageOlder.conditionvaccinated > -0.5) & (b_ageOlder.conditionvaccinated < 0.5)    0.340        0.444 1.305
## 
## * Bayes factors for the restricted model vs. the un-restricted model.

Although the support is limited (remember we had very few older subjects who got COVID-19), we at least have some support, and as the numbers continue to expand over the coming months we can watch that support (hopefully) grow.

Done

Hope you found this continuation of the series helpful. As always, feel free to comment. Personally I find it helpful to work on real-world problems, and if you can follow these methods you’ll be in great shape over the next few months as more data becomes available to build even more complex models.

Keep counting the votes! Every last one of them!

Chuck

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.


The post Warspeed 5 -- priors and models continued first appeared on R-bloggers.
