
Learn R Free — Our Interactive R Courses Are ALL Free This Week!


[This article was first published on r-promote-feed – Dataquest, and kindly contributed to R-bloggers].

Exciting news: for the next week, all of our R programming courses are free. In fact, every single course in our Data Analyst in R career path is free from July 20-27.

This free week includes all of our R programming courses and all of the SQL and statistics courses that are included in the R path.

Why are we doing this? Last week, we launched four new R courses and totally revamped the beginning of our Data Analyst in R learning path. These courses teach modern, production-ready R (including tidyverse packages) for data science from scratch — no prerequisites required.

Now, it’s easier than ever to go from total beginner to job-qualified with R using Dataquest. The paywall is down!

(No credit card required.)

How our interactive R programming courses work

At Dataquest, we do things a little differently. You won’t find any lecture videos here. Nor will you find any multiple-choice quizzes or fill-in-the-blank assessments.

Instead, we challenge you to write real R code and work with real datasets. From the first mission of the first course, you’ll be writing and running your own R code using our interactive code-running platform. You’ll get real-time feedback when your code is correct, and hints, answers, and access to our amazing learner community for whenever you need a little help.

At the end of the course, you’ll often be challenged to put your new skills together by building a real data science project. Our guided projects will help you build skills and confidence with R even as you’re building out a cool project portfolio for your GitHub.

Why try Dataquest’s R courses? Here’s what students say:


The best parts of the courses were when they took a concept that seemed daunting and abstract and then made it relatable and easy to understand


— New R course beta tester


The courses introduced R to me in a simple way, requiring no prior knowledge of the language. But it wasn’t too simplistic, so it kept my level of interest high.


— New R course beta tester


[The new courses are] now very beginner-friendly. And the workflow on the tidyverse is very helpful.


— New R course beta tester

FAQ

What do I have to do to access the free courses?

Simply sign up for a free Dataquest account (or log in to your existing account) and you will have access to every course in the Data Analyst in R learning path.

You can start at the beginning, or dive in at any point in the path you’d like. If you have an existing account and have already made progress on a Python path, don’t worry — your Python progress has been saved, and you can switch paths back and forth at any time from your dashboard.

All courses in the R path will be free from July 20 at 0:00 UTC to July 28 at 0:00 UTC.

Do I need a credit card to sign up?

No! You can create a Dataquest account for free, and no credit card is required. Even after the free week has ended, you’ll be able to use your free account to access any of our dozens of free missions.

Are there any limits to how much I can learn during this week?

Nope! Our platform is self-serve and allows you to work at your own pace. You can complete as much of the path as you’d like during the free week, and you will be issued certificates for any courses you complete.

After the free week is over, users who do not have a paid subscription will no longer have access to paywalled parts of the R path, but your progress from the free week will be saved, and you will still have access to a free mission in each course.

How can I access the full R path once Free Week is over?

Once the free week is over, a Dataquest subscription will be required to access the full R path.

In a 2020 survey of more than 600 Dataquest learners, 91% of students rated a Dataquest subscription a good or great investment, and 97% said they recommended Dataquest for career advancement.

Charlie Custer

Charlie is a student of data science, and also a content marketer at Dataquest. In his free time, he’s learning to mountain bike and making videos about it.

The post Learn R Free — Our Interactive R Courses Are ALL Free This Week! appeared first on Dataquest.


To leave a comment for the author, please follow the link and comment on their blog: r-promote-feed – Dataquest.



Colorful tables in a terminal


[This article was first published on r – log Fold Change, and kindly contributed to R-bloggers].

It all started when I wanted significant p-values shown in red on the terminal. The R terminal is capable of showing colors, simple formatting (like italics or bold) and Unicode characters, thanks to the actual terminal that does the job of displaying R output – whether that is the RStudio console or a terminal window. You can see this when you use tibbles from the tidyverse: they apply some very limited formatting (like showing “NA” in red).

I ended up writing a new package, colorDF. The package defines a new class of data frames, but it really does not change their behavior – just the way they are shown (specifically, it modifies some attributes and introduces a print.colorDF function for printing). If you change a tibble to a colorDF, it will still behave exactly like a tibble, but it will be shown in color:
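As a minimal usage sketch (assuming colorDF is installed from GitHub; as.colorDF is the same conversion used in the summary example further down):

library(colorDF)
library(dplyr)               # provides the starwars tibble used below

sw <- as.colorDF(starwars)   # still behaves exactly like a tibble
sw                           # but is printed via print.colorDF, i.e. in color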

# Color data frame 6 x 87:
# (Showing rows 1 - 20 out of 87)
   │name                  │height│mass│birth_year│gender       │probability
  1│Luke Skywalker        │   172│  77│        19│male         │     0.0083
  2│C-3PO                 │   167│  75│       112│NA           │     0.0680
  3│R2-D2                 │    96│  32│        33│NA           │     0.0596
  4│Darth Vader           │   202│ 136│        42│male         │     0.0182
  5│Leia Organa           │   150│  49│        19│female       │     0.0138
  6│Owen Lars             │   178│ 120│        52│male         │     0.0115
  7│Beru Whitesun lars    │   165│  75│        47│female       │     0.0489
  8│R5-D4                 │    97│  32│        NA│NA           │     0.0040
  9│Biggs Darklighter     │   183│  84│        24│male         │     0.0954
 10│Obi-Wan Kenobi        │   182│  77│        57│male         │     0.0242
 11│Anakin Skywalker      │   188│  84│        42│male         │     0.0066
 12│Wilhuff Tarkin        │   180│  NA│        64│male         │     0.0605
 13│Chewbacca             │   228│ 112│       200│male         │     0.0587
 14│Han Solo              │   180│  80│        29│male         │     0.0519
 15│Greedo                │   173│  74│        44│male         │     0.0204
 16│Jabba Desilijic Tiure │   175│1358│       600│hermaphrodite│     0.0929
 17│Wedge Antilles        │   170│  77│        21│male         │     0.0457
 18│Jek Tono Porkins      │   180│ 110│        NA│male         │     0.0331
 19│Yoda                  │    66│  17│       896│male         │     0.0931
 20│Palpatine             │   170│  75│        82│male         │     0.0012

Yes, it looks like that in the terminal window!

You can read all about it in the package vignette (please use the package from GitHub; the CRAN version is lagging behind). Apart from the print function, I also implemented a summary function, which is more informative for data frames than the default summary.

starwars %>% as.colorDF %>% summary
# Color data frame 5 x 13:
   │Col        │Class│NAs│unique│Summary
  1│name       │     │  0│    87│All values unique
  2│height     │     │  6│    45│66 [167 <180> 191] 264
  3│mass       │     │ 28│    38│15.0 [ 55.6 < 79.0> 84.5] 1358.0
  4│hair_color │     │  5│    12│none: 37, brown: 18, black: 13, white: 4, blond: 3, auburn: 1, …
  5│skin_color │     │  0│    31│fair: 17, light: 11, dark: 6, green: 6, grey: 6, pale: 5, brown…
  6│eye_color  │     │  0│    15│brown: 21, blue: 19, yellow: 11, black: 10, orange: 8, red: 5, …
  7│birth_year │     │ 44│    36│8 [ 35 < 52> 72] 896
  8│gender     │     │  3│     4│male: 62, female: 19, none: 2, hermaphrodite: 1
  9│homeworld  │     │ 10│    48│Naboo: 11, Tatooine: 10, Alderaan: 3, Coruscant: 3, Kamino: 3, …
 10│species    │     │  5│    37│Human: 35, Droid: 5, Gungan: 3, Kaminoan: 2, Mirialan: 2, Twi'l…
 11│films      │     │  0│    24│Attack of the Clones: 40, Revenge of the Sith: 34, The Phantom …
 12│vehicles   │     │  0│    11│Imperial Speeder Bike: 2, Snowspeeder: 2, Tribubble bongo: 2, A…
 13│starships  │     │  0│    17│Millennium Falcon: 4, X-wing: 4, Imperial shuttle: 3, Naboo fig…

For numeric vectors, by default the function shows the minimum, quartiles and median, but it can also produce a boxplot-like graphical summary. Since the function works also on lists, implementing a text terminal based boxplot function was super easy:

term_boxplot(Sepal.Length ~ Species, data=iris, width=90)
# Color data frame 5 x 4:
  │Col        │Class│NAs│unique│Summary
 1│setosa     │     │  0│    15│╾──────┤ + ├────────╼
 2│versicolor │     │  0│    21│╾─────────┤ + ├──────────╼
 3│virginica  │     │  0│    21│╾──────────────────┤ + ├──────────────╼
 4│Range      │     │  0│     1│Only one value: Range: 4.3 - 7.9

Cool, isn’t it?


To leave a comment for the author, please follow the link and comment on their blog: r – log Fold Change.


RvsPython #2: Pivoting Data From Long to Wide Form


[This article was first published on r – bensstats, and kindly contributed to R-bloggers].

Note: This is an older post, originally written as a LinkedIn article in late May. I have since added information about shaping data with the data.table library, thanks to Casper Crause. You can see our original correspondence in the comments there (for now).

If you dabble in data, you know one of the challenges that everyone has when working with data is reshaping data to the form you want to use it; thankfully, there are ways to shape data in both Python and R to speed up the process by using some of the functions available in their extensive libraries.

In this post, we will be looking at how to pivot data from long to wide form using Python’s pandas library and R’s stats, tidyr and data.table libraries and how they match up.

I did write more annotations on the Python code, as I am still learning the language; while it's been pretty easy to pick up, I still need to work through the steps. I'm sure there's another way to wrangle and shape data in Python besides pandas; if you know of one, be sure to leave a comment below and let me know!

Let's go!

The Problem

The problem that we’ll be using will be a problem I saw on StackExchange’s Data Science site. (link to problem: here). Here are the screenshots of the question.

While the OP only asks how to do this in R, I thought it would be good to show how this works in Python as well! Let's dive right into it!

The Answer(s).

TL;DR: For the above transformation, Python's pandas library is not as efficient as R's stats, tidyr, or data.table libraries.

Disclaimer: for this problem, I will be focusing on getting the data to its proper form. I won’t rename columns as it is a cosmetic issue.

Python (Pandas library):

First lets input our data:

# The Raw Data
x = {"ID": [1234, 1234],
     "APPROVAL_STEP": ["STEP_A", "STEP_B"],
     "APPROVAL_STATUS": ["APPROVED", "APPROVED"],
     "APPROVAL_DATE": ["23-Jan-2019", "21-Jan-2019"],
     "APPROVER": ["John Smith", "Jane Doe"]}

print(x)
## {'ID': [1234, 1234], 'APPROVAL_STEP': ['STEP_A', 'STEP_B'], 'APPROVAL_STATUS': ['APPROVED', 'APPROVED'], 'APPROVAL_DATE': ['23-Jan-2019', '21-Jan-2019'], 'APPROVER': ['John Smith', 'Jane Doe']}

Now to convert this data into a data frame by using the DataFrame() function from the pandas library.

import pandas as pd

df = pd.DataFrame(x)
df
##      ID APPROVAL_STEP APPROVAL_STATUS APPROVAL_DATE    APPROVER
## 0  1234        STEP_A        APPROVED   23-Jan-2019  John Smith
## 1  1234        STEP_B        APPROVED   21-Jan-2019    Jane Doe

Now, to convert the data into wide form; this can be done by using the .pivot_table() method. We want to index the data based on ID and see each data point based on the step. This can be done with the code below:

df = df.pivot_table(index="ID",
                    columns="APPROVAL_STEP",
                    aggfunc="first")
df
##               APPROVAL_DATE              APPROVAL_STATUS            \
## APPROVAL_STEP        STEP_A       STEP_B          STEP_A    STEP_B
## ID
## 1234            23-Jan-2019  21-Jan-2019        APPROVED  APPROVED
##
##                  APPROVER
## APPROVAL_STEP      STEP_A    STEP_B
## ID
## 1234           John Smith  Jane Doe

We’re starting to have our data look like what we want it to be . Now, to categorize the columns.

df.columns = ['_'.join(col) for col in df.columns]
df
##      APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVAL_STATUS_STEP_A  \
## ID
## 1234          23-Jan-2019          21-Jan-2019               APPROVED
##
##      APPROVAL_STATUS_STEP_B APPROVER_STEP_A APPROVER_STEP_B
## ID
## 1234               APPROVED      John Smith        Jane Doe

Now, for the finishing touches, we use the .reset_index() method and reorder the columns.
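The code for this step is not shown in this feed, so here is a sketch of what it might look like in pandas (the column order is taken from the output below):

df = df.reset_index()
df = df[["ID",
         "APPROVAL_DATE_STEP_A", "APPROVAL_DATE_STEP_B",
         "APPROVAL_STATUS_STEP_A", "APPROVAL_STATUS_STEP_B",
         "APPROVER_STEP_A", "APPROVER_STEP_B"]]
df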

##      ID APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVAL_STATUS_STEP_A  \
## 0  1234          23-Jan-2019          21-Jan-2019               APPROVED
##
##   APPROVAL_STATUS_STEP_B APPROVER_STEP_A APPROVER_STEP_B
## 0               APPROVED      John Smith        Jane Doe

Phew! That was a lot of steps to get here! Let's see how R matches up!

R (tidyr package)

The tidyr library is a package made by Hadley Wickham and his team at RStudio. It is one of the many packages in the tidyverse made for managing data. We can solve this problem by using the pivot_wider() function.

# The Raw Data
x <- data.frame(ID = c(1234, 1234),
                APPROVAL_STEP = c("STEP_A", "STEP_B"),
                APPROVAL_STATUS = c("APPROVED", "APPROVED"),
                APPROVAL_DATE = c("23-Jan-2019", "21-Jan-2019"),
                APPROVER = c("John Smith", "Jane Doe"))

# Use pivot_wider()
library(tidyr)
t <- x %>% pivot_wider(id_cols = ID,
                       names_from = APPROVAL_STEP,
                       values_from = c(APPROVAL_STATUS, APPROVAL_DATE, APPROVER))
t
## # A tibble: 1 x 7
##      ID APPROVAL_STATUS_STEP_A APPROVAL_STATUS_STEP~ APPROVAL_DATE_STEP~ APPROVAL_DATE_STEP~ APPROVER_STEP_A APPROVER_STEP_B
## 1  1234 APPROVED               APPROVED              23-Jan-2019         21-Jan-2019         John Smith      Jane Doe

Now, we just need to reorder the columns.

# Reordered
t <- t[, c(1, 2, 4, 6, 3, 5, 7)]
t
## # A tibble: 1 x 7
##      ID APPROVAL_STATUS_STEP_A APPROVAL_DATE_STEP~ APPROVER_STEP_A APPROVAL_STATUS_STEP~ APPROVAL_DATE_STEP~ APPROVER_STEP_B
## 1  1234 APPROVED               23-Jan-2019         John Smith      APPROVED              21-Jan-2019         Jane Doe

R (stats package)

Using the reshape() function from R's stats package is a more "old school" way of doing this; it is most popular with people who learned R before the tidyverse era. Since I initially learned R from people who programmed pre-tidyverse, this is the approach I picked up. It can all be done with one function, without having to reorder columns!

(This can also be seen on my answer to this question on Data Science StackExchange page)

library(stats)

reshape(x,
        timevar = "APPROVAL_STEP",
        idvar = "ID",
        sep = "_",
        direction = "wide")
##     ID APPROVAL_STATUS_STEP_A APPROVAL_DATE_STEP_A APPROVER_STEP_A APPROVAL_STATUS_STEP_B APPROVAL_DATE_STEP_B
## 1 1234               APPROVED          23-Jan-2019      John Smith               APPROVED          21-Jan-2019
##   APPROVER_STEP_B
## 1        Jane Doe

There you have it! Everything with one function!

R (data.table package)

Casper Crause pointed out that this task can also be done with the data.table package.

The advantage of using this over the tidyr or stats packages is that data.table is written largely in C (see the language breakdown on its GitHub page). For larger datasets, using it in a script will save computation time.

The quirk here is that your data frame needs to be converted to a data.table first (which for this example was not hard at all). But throwing it into dcast() works like a charm and expresses the reshaping in "mathematical" terms: the ID variables (rows) are placed on the left-hand side of the formula and the measured variables on the right-hand side.

Thank you Casper for pointing this out!

library(data.table)

x <- as.data.table(x)
dcast(
  data      = x,
  formula   = ID ~ ...,
  value.var = c("APPROVAL_STATUS", "APPROVAL_DATE", "APPROVER")
)
##      ID APPROVAL_STATUS_STEP_A APPROVAL_STATUS_STEP_B APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVER_STEP_A
## 1: 1234               APPROVED               APPROVED          23-Jan-2019          21-Jan-2019      John Smith
##    APPROVER_STEP_B
## 1:        Jane Doe

Conclusion

While there are ways to pivot data from long to wide form in both Python and R, R makes shaping the data less labor-intensive and more intuitive than Python does. I am learning that both languages have their strengths, but for this data-wrangling challenge R saves time working through those sorts of details.

If you write in R or Python and have an alternative/better solution to answering this problem (or see a mistake) please feel free to reach out to me in a comment or message to share it with me!


To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.


A Bayesian model for a simulated meta-analysis


[This article was first published on ouR data generation, and kindly contributed to R-bloggers].

This is essentially an addendum to the previous post where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the nlme package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package rstan.

Create the data set

We’ll use the exact same data generating process as described in some detail in the previous post.

library(simstudy)
library(rstan)
library(data.table)

defS <- defData(varname = "a.k", formula = 3, variance = 2, id = "study")
defS <- defData(defS, varname = "d.0", formula = 3, dist = "nonrandom")
defS <- defData(defS, varname = "v.k", formula = 0, variance = 6, dist = "normal")
defS <- defData(defS, varname = "s2.k", formula = 16, variance = .2, dist = "gamma")
defS <- defData(defS, varname = "size.study", formula = ".3;.5;.2", dist = "categorical")
defS <- defData(defS, varname = "n.study",
                formula = "(size.study==1) * 20 + (size.study==2) * 40 + (size.study==3) * 60",
                dist = "poisson")

defI <- defDataAdd(varname = "y", formula = "a.k + x * (d.0 + v.k)", variance = "s2.k")

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed(12764)

ds <- genData(12, defS)
dc <- genCluster(ds, "study", "n.study", "id")
dc <- trtAssign(dc, strata = "study", grpName = "x")
dc <- addColumns(defI, dc)

d.obs <- dc[, .(study, id, x, y)]

Build the Stan model

There are multiple ways to estimate a Stan model in R, but I chose to write the Stan code directly rather than using the brms or rstanarm packages. In the Stan code, we need to define the data structure, specify the parameters, specify any transformed parameters (which are just a function of the parameters), and then build the model – which includes laying out the prior distributions as well as the likelihood.

In this case, the model is slightly different from what was presented in the context of a mixed effects model. This is the mixed effects model:

\[
y_{ik} = \alpha_k + \delta_k x_{ik} + e_{ik} \\
\delta_k = \delta_0 + v_k \\
e_{ik} \sim N(0, \sigma_k^2), \quad v_k \sim N(0, \tau^2)
\]

In this Bayesian model, things are pretty much the same:

\[
y_{ik} \sim N(\alpha_k + \delta_k x_{ik}, \sigma_k^2) \\
\delta_k \sim N(\Delta, \tau^2)
\]

The key difference is that there are prior distributions on \(\Delta\) and \(\tau\), introducing an additional level of uncertainty into the estimate. I would expect that the estimate of the overall treatment effect \(\Delta\) will have a wider 95% CI (credible interval in this context) than the 95% CI (confidence interval) for \(\delta_0\) in the mixed effects model. This added measure of uncertainty is a strength of the Bayesian approach.

data {
  int N;               // number of observations
  int K;               // number of studies
  real y[N];           // vector of continuous outcomes
  int kk[N];           // study for individual
  int x[N];            // treatment arm for individual
}

parameters {
  vector[K] beta;      // study-specific intercept
  vector[K] delta;     // study effects
  real sigma[K];       // sd of outcome dist - study specific
  real Delta;          // average treatment effect
  real tau;            // variation of treatment effects
}

transformed parameters {
  vector[N] yhat;
  for (i in 1:N)
    yhat[i] = beta[kk[i]] + x[i] * delta[kk[i]];
}

model {
  // priors
  sigma ~ normal(0, 2.5);
  beta ~ normal(0, 10);
  tau ~ normal(0, 2.5);
  Delta ~ normal(0, 10);
  delta ~ normal(Delta, tau);

  // outcome model
  for (i in 1:N)
    y[i] ~ normal(yhat[i], sigma[kk[i]]);
}

Generate the posterior distributions

With the model in place, we transform the data into a list so that Stan can make sense of it:

N <- nrow(d.obs)                  ## number of observations
K <- dc[, length(unique(study))]  ## number of studies
y <- d.obs$y                      ## vector of continuous outcomes
kk <- d.obs$study                 ## study for individual
x <- d.obs$x                      ## treatment arm for individual

ddata <- list(N = N, K = K, y = y, kk = kk, x = x)

And then we compile the Stan code:

rt <- stanc("model.stan")
sm <- stan_model(stanc_ret = rt, verbose = FALSE)

Finally, we can sample data from the posterior distribution:

fit <- sampling(sm, data = ddata, seed = 3327, iter = 10000, warmup = 2500,
                control = list(adapt_delta = 0.9))

Check the diagnostic plots

Before looking at any of the output, it is imperative to convince ourselves that the MCMC process was a stable one. The trace plot is the most basic way to assess this. Here, I am only showing these plots for \(\Delta\) and \(\tau\), but the plots for the other parameters looked similar, which is to say everything looks good:

pname <- c("Delta", "tau")stan_trace(object = fit, pars = pname)

Look at the results

It is possible to inspect the distribution of any or all parameters. In this case, I am particularly interested in the treatment effects at the study level, and overall. That is, the focus here is on \(\Delta\), \(\delta_k\), and \(\tau\).

pname <- c("delta", "Delta","tau")print(fit, pars=pname, probs = c(0.05, 0.5, 0.95))
## Inference for Stan model: model.
## 4 chains, each with iter=10000; warmup=2500; thin=1;
## post-warmup draws per chain=7500, total post-warmup draws=30000.
##
##            mean se_mean   sd    5%   50%  95% n_eff Rhat
## delta[1]   6.39    0.01 1.13  4.51  6.41 8.22 29562    1
## delta[2]  -0.78    0.01 1.62 -3.45 -0.78 1.85 28188    1
## delta[3]  -0.14    0.01 1.39 -2.37 -0.16 2.18 28909    1
## delta[4]   3.08    0.00 0.59  2.09  3.08 4.05 34277    1
## delta[5]  -0.16    0.01 1.01 -1.77 -0.18 1.52 27491    1
## delta[6]   3.87    0.00 0.86  2.47  3.87 5.27 35079    1
## delta[7]   4.04    0.01 1.11  2.21  4.03 5.87 32913    1
## delta[8]   5.23    0.01 1.29  3.12  5.23 7.36 33503    1
## delta[9]   1.79    0.01 1.25 -0.27  1.78 3.82 30709    1
## delta[10]  1.38    0.01 1.12 -0.46  1.38 3.21 30522    1
## delta[11]  4.47    0.01 1.25  2.43  4.47 6.54 34573    1
## delta[12]  0.79    0.01 1.45 -1.60  0.80 3.16 33422    1
## Delta      2.48    0.00 0.89  1.01  2.50 3.89 31970    1
## tau        2.72    0.00 0.71  1.72  2.64 4.01 24118    1
##
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 15:47:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).

The forest plot is quite similar to the one based on the mixed effects model, though as predicted, the 95% CI is considerably wider.
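The forest plot itself is not reproduced in this feed. As a rough sketch (not necessarily the plotting code used in the original post), a comparable interval plot can be drawn straight from the fitted object with rstan's default plot method:

# point estimates and credible intervals for the study-level effects and the overall effect
plot(fit, pars = c("delta", "Delta"))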

As a comparison, here is the plot from the mixed effects model estimated using the nlme package in the previous post. The bootstrapped estimates of uncertainty at the study level are quite close to the Bayesian measure of uncertainty; the difference really lies in the uncertainty around the global estimate.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Local Differential Privacy: Getting Honest Answers on Embarrassing Questions


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].

Do you cheat on your partner? Do you take drugs? Are you gay? Are you an atheist? Did you have an abortion? Will you vote for the right-wing candidate? Not all people feel comfortable answering those kinds of questions in every situation honestly.

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about randomized response (and how to create flowcharts in R along the way) read on!

The question is how you can get a truthful result overall without being able to attribute a particular answer to any single individual. As it turns out, there is a very elegant and ingenious method, called randomized response. The big idea is, as the name suggests, to add noise to every individual answer without compromising the overall proportion too much, i.e. to add noise in such a way that it cancels out overall!

Big tech companies like Google and Microsoft also use this method, e.g. to collect telemetry data in a privacy-preserving manner. The broader concept is called differential privacy, which comes in local and global flavors. Global methods are applied after the data has already been collected, while local methods concern the process of data collection itself. We will be focusing on the latter here.

One way to do randomized response is the following: each participant is asked to flip two coins. If the first coin comes up heads, she answers the question truthfully. Otherwise, she answers “yes” if the second coin came up heads and “no” if it came up tails. This way nobody can find out whether the participant answered truthfully or not, her answers could have been produced by randomness. Yet, the great thing is that when asking a whole group of people it is possible to calculate the true proportions because the added random noise cancels out… we will see how later in the post.

The process is depicted in the following flowchart, created with R and the diagram package (on CRAN):

library(diagram)
## Loading required package: shape

oldpar <- par(mar = c(1, 1, 1, 1))
openplotmat()
elpos <- coordinates(c(1, 1, 2, 4))
fromto <- matrix(ncol = 2, byrow = TRUE, data = c(1, 2, 2, 3, 2, 4, 4, 7, 4, 8))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr)
  arrpos[i, ] <- straightarrow(to = elpos[fromto[i, 2], ], from = elpos[fromto[i, 1], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
textellipse(elpos[1, ], 0.1, lab = "START", box.col = "green", shadow.col = "darkgreen", shadow.size = 0.005, cex = 1.5)
textrect   (elpos[2, ], 0.15, 0.05, lab = "1'st coin flip", box.col = "grey", shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textrect   (elpos[4, ], 0.15, 0.05, lab = "2'nd coin flip", box.col = "grey", shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[3, ], 0.1, 0.1, lab = c("True", "Answer"), box.col = "orange", shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[7, ], 0.1, 0.1, lab = c("Yes"), box.col = "orange", shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[8, ], 0.1, 0.1, lab = c("No"), box.col = "orange", shadow.col = "red", shadow.size = 0.005, cex = 1.5)
dd <- c(0.0, 0.025)
text(arrpos[2, 1] + 0.06, arrpos[2, 2], "50%")
text(arrpos[3, 1] - 0.06, arrpos[3, 2], "50%")
text(arrpos[4, 1] - 0.03, arrpos[4, 2] + 0.05, "25%")
text(arrpos[5, 1] + 0.03, arrpos[5, 2] + 0.05, "25%")
par(oldpar)

Of course, this is just one potential way of doing it and it doesn’t have to be real coin flips: the whole process could be done transparently in the background while collecting personal data automatically, yet to give you some feeling for the method let us actually do a toy example next.

We now simulate a survey of 200 people who get asked an embarrassing question, e.g. whether they are an alcoholic. In our sample, about 15% are alcoholics (which is about the real rate in Western countries, believe it or not!):

set.seed(123)
n <- 200
true_responses <- sample(c(TRUE, FALSE), size = n, replace = TRUE,
                         prob = c(0.15, 0.85)) # only responders know
table(true_responses) / n
## true_responses
## FALSE  TRUE 
##  0.85  0.15

We now use the above method to get an estimate of the proportion without embarrassing single individuals:

randomized_response <- function(true_response)
  ifelse(sample(c(TRUE, FALSE), 1), true_response, sample(c(TRUE, FALSE), 1))
randomized_response <- Vectorize(randomized_response)

randomized_responses <- randomized_response(true_responses) # what comes out of the survey

How can we actually get rid of the noise? Mathematician Dr. John D. Cook explains (source: Randomized response, privacy, and Bayes theorem):

How can [one] estimate p, the [true] proportion […]? Around half […] will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

Let us actually do this for our little example:

2 * sum(randomized_responses) / n - 0.5
## [1] 0.14

With an estimate of 14% we are obviously not so far off from the true proportion (15%). To find out how many people answer truthfully, we do another quick calculation:

sum(true_responses == randomized_responses) / n
## [1] 0.75

This makes sense since 50% give the true answer anyway and 25% by chance, which makes 75% altogether.

Who would have thought that a little bit of randomness in the right place could reveal hidden truths people are normally ashamed to admit?


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


4 Tips to Make Your Shiny Dashboard Faster


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

A slow-running Shiny application (left) and an optimized one (right)

This is a guest post from RStudio’s partner, Appsilon Data Science

When developing Shiny applications, we at Appsilon strive to implement functionality, enhance appearance, and optimize the user’s experience. However, we often forget about one of the most important elements of UX: the speed of the application. Nobody wants to use a slow application that takes seconds (or minutes) to load or navigate. In this article, I will share four tips and best practices that will help your Shiny applications run much faster. Those tips are:

  1. Figure out why your Shiny app is running slowly
  2. Use faster functions
  3. Pay attention to scoping rules for Shiny apps
  4. Use caching operations

The theme underlying these tips can be summed up by this quote:

“The reason for Shiny’s slow action [is] usually not Shiny.” – Winston Chang

1. Measure Where Your Shiny App Is Spending Its Time

With R, we can find some very useful solutions for verifying which parts of our code are less effective. One of my favorite tools is the profvis package, whose output is shown below:


A timing measurement created by the profvis package

Profvis allows you to measure the execution time and R memory consumption of R code. The package itself can generate a readable report that helps us identify inefficient parts of the code, and it can be used to test Shiny applications. You can see profvis in action here.

If we are only interested in measuring a code fragment rather than a complete application, we may want to consider simpler tools such as the tictoc package, which measures the time elapsed to run a particular code fragment.
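For example, a minimal tictoc sketch (this snippet is illustrative and not taken from the original post):

library(tictoc)

tic("data preparation")     # start a named timer
df <- do.call(rbind, lapply(1:200, function(i) data.frame(x = rnorm(1000))))
toc()                       # prints something like "data preparation: 0.4 sec elapsed"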

2. Use Faster Functions

Once you’ve profiled your application, take a hard look at the functions consuming the most time. You may achieve significant performance gains by replacing the functions you routinely use with faster alternatives.

For example, a Shiny app might search a large vector of strings for ones starting with the characters “an”. Most R programmers would use a function such as grepl as shown below:

  grepl("^an", cnames),

However, we don’t need the regular expression capabilities of grepl to find strings starting with a fixed pattern. We can tell grepl not to bother with regular expressions by adding the parameter fixed = TRUE. Even better, though, is to use the base R function startsWith. As you can see from the benchmarks below, both options are faster than the original grepl, but the simpler startsWith function performs the search more than 30 times faster.

microbenchmark::microbenchmark(
  grepl("an", cnames),
  grepl("an", cnames, fixed = TRUE),
  startsWith(cnames, "an")
)

Unit: microseconds
                              expr      min        lq       mean   median       uq      max neval
               grepl("an", cnames) 2046.846 2057.7725 2082.44583 2067.474 2089.499 2449.035   100
 grepl("an", cnames, fixed = TRUE) 1127.246 1130.7440 1146.35229 1132.597 1136.032 1474.634   100
          startsWith(cnames, "an")   62.982   63.2485   64.47847   63.548   64.155   79.528   100

Similarly, consider the following expressions:

sum_value <- 0
for (i in 1:100) {
  sum_value <- sum_value + i ^ 2
}

versus

sum_value <- sum((1:100) ^ 2)

Even a novice R programmer would likely use the second version because it takes advantage of the vectorized function sum.

When we create more complex functions for our Shiny apps, we should similarly look for vectorized operations to use instead of loops whenever possible. For example, the following code does a simple computation on two columns in a long data frame:

frame <- data.frame(col1 = runif(10000, 0, 2),
                    col2 = rnorm(10000, 0, 2))

output <- character(nrow(frame))  # pre-allocate the result vector (not shown in the original snippet)

for (i in 1:nrow(frame)) {
  if (frame[i, 'col1'] + frame[i, 'col2'] > 1) {
    output[i] <- "big"
  } else {
    output[i] <- "small"
  }
}

However, an equivalent output can be obtained much faster by using ifelse which is a vectorized function:

  output <- ifelse(frame$col1 + frame$col2 > 1, "big", "small")

This vectorized version is easier to read and computes the same result about 100 times faster.

3. Pay Attention to Object Scoping Rules in Shiny Apps

Where you create an object in a Shiny app determines who can see it and how long it lives. There are three levels of scope:

  1. Global: Objects in global.R are loaded into R’s global environment. They persist even after an app stops. This matters in a normal R session, but not when the app is deployed to Shiny Server or Connect. To learn more about how to scale Shiny applications to thousands of users on RStudio Connect, this recent article has some excellent tips.
  2. Application-level: Objects defined in app.R outside of the server function are similar to global objects, except that their lifetime is the same as the app; when the app stops, they go away. These objects can be shared across all Shiny sessions served by a single R process and may serve multiple users.
  3. Session-level: Objects defined within the server function are accessible only to one user session.

In general, the best practice is as follows (a minimal sketch illustrating both scopes appears after this list):

  • Create objects that you wish to be shared among all users of the Shiny application in the global or app-level scopes (e.g., loading data that users will share).
  • Create objects that you wish to be private to each user as session-level objects (e.g., generating a user avatar or displaying session settings).
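To make the distinction concrete, here is a minimal, hypothetical app.R sketch (not from the original post): shared_data lives at the application level and is shared by every session the R process serves, while session_start is created separately for each user session.

library(shiny)

# App-level scope: created once per R process, shared across all of its sessions
shared_data <- data.frame(x = rnorm(1000))

ui <- fluidPage(plotOutput("hist"))

server <- function(input, output, session) {
  # Session-level scope: private to a single user session
  session_start <- Sys.time()

  output$hist <- renderPlot({
    hist(shared_data$x,
         main = paste("Session started at", format(session_start, "%H:%M:%S")))
  })
}

shinyApp(ui, server)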

4. Use Caching Operations

If you’ve used all of the previous tips and your application still runs slowly, it’s worth considering implementing caching operations. In 2018, RStudio introduced the ability to cache charts in the Shiny package. However, if you want to speed up repeated operations other than generating graphs, it is worth using a custom caching solution.
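For plot caching specifically, Shiny's renderCachedPlot() re-renders a plot only when its cache key changes; here is a minimal sketch (illustrative, not from the original post):

library(shiny)
library(ggplot2)

ui <- fluidPage(
  sliderInput("n", "Points to plot", min = 100, max = 10000, value = 1000),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  # The plot is recomputed only when the cache key (input$n) changes;
  # later requests with the same key reuse the cached image.
  output$scatter <- renderCachedPlot({
    df <- data.frame(x = rnorm(input$n), y = rnorm(input$n))
    ggplot(df, aes(x, y)) + geom_point(alpha = 0.3)
  }, cacheKeyExpr = { input$n })
}

shinyApp(ui, server)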

One of my favorite packages for this purpose is memoise. Memoise stores the results of function calls so that repeated invocations with the same arguments return the cached answer instead of recomputing it.

The memoise package currently offers 3 methods for storing cached objects:

  1. cache_mem – stores the cache in RAM (the default)
  2. cache_filesystem(path) – stores the cache on the local disk
  3. cache_s3(s3_bucket) – stores the cache in an AWS S3 bucket

The selected caching type is defined by the cache parameter in the memoise function.

If our Shiny application is served by a single R process and its RAM consumption is low, the simplest method is to use the first option, cache_mem, where the target function is defined and its answers cached in the global environment in RAM. All users will then use shared cache results, and the actions of one user will speed up the calculations of others. You can see a simple example below:

library(memoise)

# Define an example expensive-to-calculate function
expensive_function <- function(x) {
  sum((1:x) ^ 2)
  Sys.sleep(5)    # make it seem to take even longer
}

system.time(expensive_function(1000)) # Takes at least 5 seconds
   user  system elapsed 
  0.013   0.016   5.002 
system.time(expensive_function(1000)) # Still takes at least 5 seconds
   user  system elapsed 
  0.016   0.015   5.005 

# Now, let's cache results using memoise and its default cache_memory
memoised_expensive_function <- memoise(expensive_function)
system.time(memoised_expensive_function(1000)) # Takes at least 5 seconds
   user  system elapsed 
  0.016   0.015   5.001 
system.time(memoised_expensive_function(1000)) # Returns much faster
   user  system elapsed 
  0.015   0.000   0.015 

The danger associated with using in-memory caching, however, is that if you don’t manage the cached results, it will grow without bound and your Shiny application will eventually run out of memory. You can manage the cached results using the timeout and forget functions.

If the application is served by many processes running on one server, the best option to ensure cache sharing among all users is to use cache_filesystem and store objects locally on the disk. Again, you will want to manage the cache, but you will be limited only by your available disk space.
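As a rough sketch (the cache directory name here is just an example), switching the memoised function from above to a disk-backed cache looks like this:

library(memoise)

# a cache directory shared by all R processes serving the app on this server
fs_cache <- cache_filesystem("./app-cache")

memoised_expensive_function <- memoise(expensive_function, cache = fs_cache)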

In the case of an extensive infrastructure using many servers, the easiest method will be to use cache_s3 which will store its cached values on a shared external file system – in this case, AWS S3.


About Appsilon Data Science:

One of the winners of the 2020 Shiny Contest and a Full Service RStudio Partner, Appsilon delivers enterprise Shiny apps, data science and machine learning consulting, and support with R and Python for customers all around the world.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Enabling Remote Data Science Teams: A Webinar with Appsilon and RStudio


[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers].


What is Happening?

RStudio and Appsilon Data Science will be hosting a joint webinar on Tuesday, July 28th at 11 AM EST. The theme is “Enabling Remote Data Science Teams.” You can register to attend the event here.

What Will Be Covered?

Whatever happens in the coming months, remote work is here to stay. The goal of the webinar is to provide data scientists and data science team leaders with the knowledge and tools to succeed as a distributed team. Some of the topics we will cover include:
  • Setting up a remote and collaborative R environment
  • Version control and Scrum
  • Using Shiny and RStudio Connect to share apps within and across teams
  • How to improve the UI and appearance of Shiny dashboards
  • How to scale Shiny dashboards to hundreds of users
  • How to build and grow a remote data science team

Who Will Be Presenting?

What is Appsilon?

Appsilon Data Science provides enterprise data analytics and machine learning solutions. We are global leaders in dashboard development and our clients are some of the world’s most influential Fortune 500 companies. We are constantly pushing the boundaries of what is possible within the R Shiny framework.

Appsilon is also a pioneer in open source, with thousands of package downloads and hundreds of GitHub stars to date. Above all, we are committed to effecting positive change in the world. We frequently contribute our skills to projects that support the preservation of human life and the conservation of animal populations all over the globe.

Learn More

Article Enabling Remote Data Science Teams: A Webinar with Appsilon and RStudio comes from Appsilon Data Science | End to End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End to End Data Science Solutions.


Tutorial: Better Blog Post Analysis with googleAnalyticsR


[This article was first published on R tutorial – Dataquest, and kindly contributed to R-bloggers].

In my previous role as a marketing data analyst for a blogging company, one of my most important tasks was to track how blog posts performed.

On the surface, it’s a fairly straightforward goal. With Google Analytics, you can quickly get just about any metric you need for your blog posts, for any date range. 

But when it comes to comparing blog post performance, things get a bit trickier. 

For example, let’s say we want to compare the performance of the blog posts we published on the Dataquest blog in June (using the month of June as our date range). 

But wait… two blog posts with more than 1,000 pageviews were published earlier in the month, and the two with fewer than 500 pageviews were published at the end of the month. That’s hardly a fair comparison!

My first solution to this problem was to look up each post individually, so that I could make an even comparison of how each post performed in their first day, first week, first month, etc. 

However, that required a lot of manual copy-and-paste work, which was extremely tedious if I wanted to compare more than a few posts, date ranges, or metrics at a time. 

But then, I learned R, and realized that there was a much better way.

In this post, we’ll walk through how it’s done, so you can run this better blog post analysis for yourself!

What we’ll need

To complete this tutorial, you’ll need basic knowledge of R syntax and the tidyverse, and access to a Google Analytics account.

Not yet familiar with the basics of R? We can help with that! Our interactive online courses teach you R from scratch, with no prior programming experience required. Sign up and start today!

You’ll also need the dplyr, lubridate, and stringr packages installed — which, as a reminder, you can do with the install.packages() command.

Finally, you will need a CSV of the blog posts you want to analyze. Here’s what’s in my dataset:

  • post_url: the page path of the blog post
  • post_date: the date the post was published (formatted m/d/yy)
  • category: the blog category the post was published in (optional)
  • title: the title of the blog post (optional)

Depending on your content management system, there may be a way for you to automate gathering this data — but that’s out of the scope of this tutorial!

For this tutorial, we’ll use a manually-gathered dataset of the past ten Dataquest blog posts.

Setting up the googleAnalyticsR package

To access data from the Google Analytics API, we’ll use the excellent googleAnalyticsR package by Mark Edmondson.

As described in the documentation, there are two “modes” to the googleAnalyticsR package. The first mode, which we’ll use here, is a “Try it out” mode, which uses a shared Google Project to authorize your Google Analytics account. 

If you want to make this report a recurring tool for your blog or client, be sure to create your own Google Project, which will help keep the traffic on the shared Project to a minimum. To find out how to set this up, head over to the package setup documentation.

For now, though, we’ll stick with “Try it out” mode. 

First, we’ll install the package using this code:

install.packages('googleAnalyticsR', dependencies = TRUE)

This installs the package, as well as the required dependencies.

Next, we’ll load the library, and authorize it with a Google Analytics account using the ga_auth() function.

library(googleAnalyticsR)
ga_auth()

When you run this code the first time, it will open a browser window and prompt you to log in to your Google account. Then, it will give you a code to paste into your R console. After that, it will save an authorization token so you only have to do this once!

Once you’ve completed the Google Analytics authorization, we’re ready to set up the rest of the libraries and load in our blog posts. We’ll also use dplyr::mutate() to change the post_date to a Date class while we’re at it!

library(dplyr)
library(lubridate)
library(stringr)
library(readr)

blog_posts <- read.csv("articles.csv") %>%
  mutate(
    post_date = as.Date(post_date, "%m/%d/%y")  # changes the post_date column to a Date
  )

Here’s what the blog post data frame looks like: 

Finally, to get data from your Google Analytics account, you will need the ID of the Google Analytics view you want to access. ga_account_list() will return a list of your available accounts.

accounts <- ga_account_list()

# select the view ID by view and property name, and store it for ease of use
view_id <- accounts$viewId[which(accounts$viewName == "All Web Site Data" &
                                 accounts$webPropertyName == "Dataquest")]
# be sure to change this out with your own view and/or property name!

Now, we’re ready to do our first Google Analytics API requests!

Accessing blog post data with googleAnalyticsR

In this tutorial, our goal is to gather data for the first week each post was active, and compile it in a dataframe for analysis. To do this, we’ll create a function that runs a for loop and requests this data for each post in our blog_posts dataframe.

So, let’s take a look at how to send a request to the Google Analytics API using googleAnalyticsR.

google_analytics(view_id,
                 date_range = c(as.Date("2020-06-01"), as.Date("2020-06-30")),
                 metrics = c("pageviews"),
                 dimensions = c("pagePath"))

This request has a few components. First, enter the view_id, which we already stored from our ga_accounts() dataframe.

Next, specify the date range, which needs to be passed in as a list of dates.

Then, we input the metrics (like pageviews, landing page sessions, or time on page) and dimensions (like page path, channel, or device). We can use any dimension or metric that’s available in the Google Analytics UI — here’s a useful reference for finding the API name of any UI metric or dimension.
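For instance, a request for sessions and average time on page, broken out by default channel grouping, might look like the sketch below (this snippet is illustrative and not part of the original tutorial; the metric and dimension names are the API names from that reference):

google_analytics(view_id,
                 date_range = c(as.Date("2020-06-01"), as.Date("2020-06-30")),
                 metrics = c("sessions", "avgTimeOnPage"),
                 dimensions = c("channelGrouping"))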

So, the request above will return a dataframe of all pageviews in June, by page path (by default googleAnalyticsR will only return the first 1,000 results).

But, in our case, we only want to retrieve pageviews for a specific page – so we need to filter on the pagePath dimension using a dimension filter, which looks like this:

page_filter <- dim_filter(dimension = "pagePath",
                          operator = "REGEXP",
                          expressions = "^www.dataquest.io/blog/r-markdown-guide-cheatsheet/$")

To use this filter in our request, googleAnalyticsR wants us to create a filter clause – which is how you would combine filters if you wanted to use multiple dimension filters. But in our case, we just need the one: 

page_filter_clause <- filter_clause_ga4(list(page_filter))

Now, let’s try sending a response with this filter:

google_analytics(view_id,
                 date_range = c(as.Date("2020-07-01"), Sys.Date()),
                 metrics = c("pageviews"),
                 dimensions = c("pagePath"),
                 dim_filters = page_filter_clause)

The result is a dataframe with the pageviews for the R Markdown post!

Creating the for loop

Now that we can gather data and filter it by dimension, we are ready to build out our function to run our for loop! The steps to the function are:

  • Set up a data frame to hold the results
  • Begin the loop based on the number of rows in the data frame
  • Access the post URL and post date for each post
  • Create a page filter based on the post URL
  • Send a request to Google Analytics using the post_date as the start date, and date the week later as the end date
  • Add the post URL and pageview data to the final data frame

I have also added a print() command to let us know how far along the loop is (because it can take a while) and a Sys.sleep() command to keep us from hitting the Google Analytics API rate limit.

Here’s what that looks like all put together!

get_pageviews <- function(posts) {
  # set up dataframe to be returned, using the same variable names as our original dataframe
  final <- tibble(pageviews = numeric(),
                  post_url = character())

  # begin the loop for each row in the posts dataframe
  for (i in seq(1:nrow(posts))) {
    # select the post URL and post date for this loop (also using the same variable names as our original dataframe)
    post_url <- posts$post_url[i]
    post_date <- posts$post_date[i]

    # set up the page filter and page filter clause with the current post URL
    page_filter <- dim_filter(dimension = "pagePath",
                              operator = "REGEXP",
                              expressions = post_url)
    page_filter_clause <- filter_clause_ga4(list(page_filter))

    # send the request, and set the date range to the week following the date the post was shared
    page_data <- google_analytics(view_id,
                                  date_range = c(post_date, post_date %m+% weeks(1)),
                                  metrics = c("pageviews"),
                                  dim_filters = page_filter_clause)

    # add the post url to the returned dataframe
    page_data$post_url <- post_url

    # add the returned data to the data frame we created outside the loop
    final <- rbind(final, page_data)

    # print loop status
    print(paste("Completed row", nrow(final), "of", nrow(posts)))

    # wait two seconds
    Sys.sleep(2)
  }
  return(final)
}

We could potentially speed this up with a “functional” in R, such as purrr::map(). The map() function takes a function as an input and returns a vector as output. Check out Dataquest’s interactive online lesson on the map function if you’d like to deepen your knowledge!

For this tutorial, though, we’ll use a for loop because it’s a bit less abstract. 
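If you are curious, though, here is a rough sketch of what a purrr-based version might look like (untested, and relying on the same view_id, filter helpers, and lubridate operators set up earlier in this tutorial):

library(purrr)

get_pageviews_purrr <- function(posts) {
  map2_dfr(posts$post_url, posts$post_date, function(url, date) {
    page_filter_clause <- filter_clause_ga4(list(
      dim_filter(dimension = "pagePath", operator = "REGEXP", expressions = url)
    ))
    page_data <- google_analytics(view_id,
                                  date_range = c(date, date %m+% weeks(1)),
                                  metrics = c("pageviews"),
                                  dim_filters = page_filter_clause)
    Sys.sleep(2)                 # stay under the API rate limit
    page_data$post_url <- url
    page_data
  })
}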

Now, we’ll run the loop on our blog_posts dataframe, and merge the results to our blog_posts data.

recent_posts_first_week <- get_pageviews(blog_posts)
recent_posts_first_week <- merge(blog_posts, recent_posts_first_week)
recent_posts_first_week

And that’s it! Now, we can get on to the good stuff — analyzing and visualizing the data.

Blog post data, visualized!

For demonstration, here’s a ggplot bar chart that shows how many pageviews each of our most recent 10 blog posts got in the first week after they were published: 

library(ggplot2)
library(scales)

recent_posts_first_week %>%
  arrange(
    post_date
  ) %>%
  mutate(
    pretty_title = str_c(str_extract(title, "^(\\S+\\s+\\n?){1,5}"), "..."),
    pretty_title = factor(pretty_title, levels = pretty_title[order(post_date)])
  ) %>%
  ggplot(aes(pretty_title, pageviews)) +
  geom_bar(stat = "identity", fill = "#39cf90") +
  coord_flip() +
  theme_minimal() +
  theme(axis.title = element_blank()) +
  labs(title = "Recent Dataquest blog posts by first week pageviews") +
  scale_y_continuous(labels = comma)

Now we can see how useful it is to be able to compare blog posts on “even footing”! 

For more information on the googleAnalyticsR package and what you can do with it, check out its very helpful resource page.


Loryn is a product marketer at Dataquest who enjoys learning to code with data. She rides motorcycles and sometimes writes about them.

The post Tutorial: Better Blog Post Analysis with googleAnalyticsR appeared first on Dataquest.


To leave a comment for the author, please follow the link and comment on their blog: R tutorial – Dataquest.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


le compte est bon


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Riddler asks how to derive 24 from (1,2,3,8), with each number appearing once and all operations (x,+,/,-,^) allowed. This reminded me of a very old TV show on French TV, called Le compte est bon!, where players were given 5 or 6 numbers and supposed to find a given total within 60 seconds. Unsurprisingly, there is an online solver for this game, as shown above, e.g., 24=(8+3+1)x2. But it proves unable to solve the puzzle when the input is 24 and (2,3,3,4) with all four numbers used: it only uses 2, 3 and 4, since 24=2x3x4. Introducing powers as well, since exponentiation is allowed, leads to solutions such as (4-2)³x3=(4/2)³x3=(3²-3)x4=3/(2/4)³=24… Not fun!

I however rewrote some R code to check whether 24 was indeed attainable with such combinations, but could not find an easy way to identify which combination was used, although a pedestrian version eventually worked! And exhibited the slightly less predictable 4^(3/2)x3=24!
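For readers who want to try something similar, a pedestrian brute-force search could look roughly like this. This is my own sketch, not the author's code; it relies on the {gtools} package for permutations and only evaluates expressions left to right, so it misses solutions that need other parenthesisations (such as 3/(2/4)³):

library(gtools)  # for permutations()

ops <- c(`+`, `-`, `*`, `/`, `^`)
op_names <- c("+", "-", "*", "/", "^")

solve24 <- function(numbers, target = 24) {
  # all orderings of the input numbers (set = FALSE keeps duplicates)
  perms <- gtools::permutations(length(numbers), length(numbers), numbers, set = FALSE)
  hits <- character(0)
  for (p in seq_len(nrow(perms))) {
    x <- perms[p, ]
    for (i in 1:5) for (j in 1:5) for (k in 1:5) {
      # evaluate strictly left to right: ((x1 op x2) op x3) op x4
      val <- ops[[k]](ops[[j]](ops[[i]](x[1], x[2]), x[3]), x[4])
      if (isTRUE(all.equal(val, target))) {
        hits <- c(hits, paste0("((", x[1], op_names[i], x[2], ")",
                               op_names[j], x[3], ")", op_names[k], x[4]))
      }
    }
  }
  unique(hits)
}

solve24(c(1, 2, 3, 8))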


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Top 5 Best Practices for Data Collection and Storage in Microsoft Excel


[This article was first published on r-bloggers – Telethon Kids Institute, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As a biostatistician, my preferred way to receive data from a client is a CSV extract from a well-designed database that includes validation at the data-entry stage. However, 95% of the data I see comes as an Excel file.

I have just sent an email to a client with some tips for using Excel to collect and store their research data. These 5 recommendations save me a lot of headaches when importing data from .xlsx files into R for analysis:

  1. (basic) Only enter a single piece of information into a cell. For example, if the column is “number”, then 365 is the only thing in the cell, do not do this: “365 (children only)”. If you need to leave a comment, then put it in another column.
  2. (basic) Keep column names simple: my preference is a maximum of 3 words, separated by underscores, all lower case. If you can, keep column names meaningful to a human reader. Also, only use Row 1 for column names/information. Extra information about the data contained within a specific column should be placed in the data dictionary, which is in a different worksheet.
  3. (basic) Include a data dictionary that should have at least 5 columns: 1| column name (point 2), 2| Description, 3| data type (e.g. dichotomous (yes/no, true/false), categorical (list the categories), numeric, text), 4| valid ranges of data (for example, valid entries for age could be between 0 and 110 years, if 200 was entered then I know it’s likely an error), and 5| if the value is numeric, provide the unit of measurement (e.g. years, days, cm, kg, …).
  4. (advanced) Use Excel’s data validation functions – especially to make drop-down lists for categorical responses.
  5. (intermediate) If you have any date columns, change the cell data type to “Text” (the same goes for phone numbers); see the short R sketch after this list.
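To show why point 5 matters on the R side, here is a small sketch of reading such a file with {readxl} and parsing text dates with {lubridate}. The file and column names are made up for illustration:

library(readxl)     # read .xlsx files
library(lubridate)  # parse dates stored as text

# Hypothetical file and column names, for illustration only
dat <- read_excel("study_data.xlsx", col_types = "text")

# Dates entered as text (e.g. "2020-07-12") parse unambiguously;
# ymd() returns NA for anything that is not a valid date, which
# makes data-entry errors easy to spot
dat$visit_date <- ymd(dat$visit_date)
summary(is.na(dat$visit_date))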

Note: Any information that can be used to identify an individual should always be removed from the file before it gets sent to a statistician.

If everyone followed these 5 steps, then 80% of my work (and time) would be freed up for analysis and communicating results. I know there must be other simple pieces of advice. What have I missed? What other simple advice do you give your colleagues/clients about data entry into Microsoft Excel?

Photo by Scott Graham on Unsplash


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Telethon Kids Institute.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Create a scatter plot with ggplot


[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Take your first steps with the ggplot2 package by creating a scatter plot. Use the grammar of graphics to map dataset attributes to your plot and connect different layers using the + operator.

  • Define a dataset for the plot using the ggplot() function
  • Specify a geometric layer using the geom_point() function
  • Map attributes from the dataset to plotting properties using the mapping parameter
  • Connect different ggplot objects using the + operator
library(ggplot2)

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

Introduction to scatter plots

Scatter plots use points to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x- and y-axis. Let’s see an example of a scatter plot to understand the relationship between the speed and the stopping distance of cars:

Each point represents a car. Each car starts to brake at the speed shown on the x-axis and travels the distance shown on the y-axis until it comes to a full stop. If we take a look at all points in the plot, we can clearly see that it takes faster cars a longer distance until they are completely stopped.

Quiz: Scatter Plot Facts

Which of the following statements about scatter plots are correct?

  • Scatter plots visualize the relation of two numeric variables
  • In a scatter plot we only interpret single points and never the relationship between the variables in general
  • Scatter plots use points to visualize observations
  • Scatter plots visualize the relation of categorical and numeric variables

Start Quiz

Specifying a dataset

library(ggplot2)

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

To create plots with ggplot2 you first need to load the package using library(ggplot2).

After the package has been loaded, specify the dataset to be used as an argument of the ggplot() function. For example, to specify a plot using the cars dataset you can use:

library(ggplot2)

ggplot(cars)

Note that this command does not plot anything but a grey canvas yet. It just defines the dataset for the plot and creates an empty base on top of which we can add additional layers.

Exercise: Specify the gapminder dataset

To start with a ggplot visualizing the gapminder dataset we need to:

  1. Load the ggplot2 package
  2. Load the gapminder package
  3. Define the gapminder dataset to be used in the plot with the ggplot() function

Start Exercise

Specifying a geometric layer

library(ggplot2)

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

We can use ggplot’s geometric layers (or geoms) to define how we want to visualize our dataset. Geoms use geometric objects to visualize the variables of a dataset. The objects can have multiple forms like points, lines and bars and are specified through the corresponding functions geom_point(), geom_line() and geom_col():
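For instance, the same mapping drawn with the three geoms listed above gives three different views of the same data. Here is a small illustration using the built-in pressure dataset, which is not part of this lesson:

library(ggplot2)

# Same data and mapping, three different geometric layers
ggplot(pressure) + geom_point(mapping = aes(x = temperature, y = pressure))
ggplot(pressure) + geom_line(mapping = aes(x = temperature, y = pressure))
ggplot(pressure) + geom_col(mapping = aes(x = temperature, y = pressure))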

Quiz: Scatter Plot Layers

Which geometric layer should be used to create scatter plots in ggplot2?

  • point_geom()
  • geom()
  • geom_scatter()
  • geom_point()

Start Quiz

Creating aesthetic mappings

library(ggplot2)

ggplot(___) + 
  geom_point(
    mapping = aes(x = ___, y = ___)
  )

ggplot2 uses the concept of aesthetics, which map dataset attributes to the visual features of the plot. Each geometric layer requires a different set of aesthetic mappings, e.g. the geom_point() function uses the aesthetics x and y to determine the x- and y-axis coordinates of the points to plot. The aesthetics are mapped within the aes() function to construct the final mappings.

To specify a layer of points which plots the variable speed on the x-axis and distance dist on the y-axis we can write:

geom_point(
  mapping = aes(x = speed, y = dist)
)

The expression above constructs a geometric layer. However, this layer is currently not linked to a dataset and does not produce a plot. To link the layer with a ggplot object specifying the cars dataset we need to connect the ggplot(cars) object with the geom_point() layer using the + operator:

ggplot(cars) + 
  geom_point(
    mapping = aes(x = speed, y = dist)
  )

Through this linking, ggplot() knows that the mapped speed and dist variables are taken from the cars dataset, while geom_point() instructs ggplot to plot the mapped variables as points.

The required steps to create a scatter plot with ggplot can be summarized as follows:

  1. Load the package ggplot2 using library(ggplot2).
  2. Specify the dataset to be plotted using ggplot().
  3. Use the + operator to add layers to the plot.
  4. Add a geometric layer to define the shapes to be plotted. In case of scatter plots, use geom_point().
  5. Map variables from the dataset to plotting properties through the mapping parameter in the geometric layer.

Exercise: Visualize the “cars” dataset

Create a scatter plot using ggplot() and visualize the cars dataset, with the speed of the car on the x-axis and the car’s stopping distance dist on the y-axis.

The ggplot2 package is already loaded. Follow these steps to create the plot:

  1. Specify the dataset through the ggplot() function
  2. Specify a geometric point layer with the geom_point() function
  3. Map the speed to the x-axis and the dist to the y-axis with aes()

Start Exercise

Exercise: Visualize the Gapminder dataset

Create a scatter plot using ggplot() and visualize the gapminder_2007 dataset with the GDP per capita gdpPercap on the x-axis and the life expectancy lifeExp of each country on the y-axis.

The ggplot2 package is already loaded. Follow these steps to create the plot:

  1. Specify the gapminder_2007 dataset through the ggplot() function
  2. Specify a geometric point layer with geom_point().
  3. Map the gdpPercap to the x-axis and the lifeExp to the y-axis with aes()

Start Exercise

Create a scatter plot with ggplot is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE


To leave a comment for the author, please follow the link and comment on their blog: Quantargo Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Quine with R


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A quine is a self-reproducing function or computer program that outputs its own source code. Such programs are also called self-replicating, self-reproducing, or self-copying programs (courtesy of Wikipedia).

Quines can be written in many programming languages; in Python, for example, a simple quine looks like this:

a='a=%r;print(a%%a)';print(a%a)

Similar to Python, quine can be written in R as well. Simple example would be:

a<-"a<-0;cat(sub(0,deparse(a),a))";cat(sub(0,deparse(a),a))

The example consists of two blocks. The first block is the string assigned to a, which stores the program’s code as data:

"a<-0;cat(sub(0,deparse(a),a))"

And the second block is the code that performs the replication and prints the output:

cat(sub(0,deparse(a),a))

When we run the command, the script will return itself, revealing the complete input command.

The concept of a quine draws on self-replication in biology, while the term itself is named after the logician Willard Van Orman Quine. A quine consists of two parts: the code that performs the replication, and the data that holds the full source needed to carry out the replication process.

The variable a can hold essentially any additional text alongside the code, since the R version relies only on the string manipulation functions sub() and deparse().

deparse() is used to preserve the quotation marks of the original input command, and sub() substitutes the placeholder 0 with that quoted string, putting the first and second blocks of code back together.
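To see what the two helpers do in isolation, here is a small illustration (not part of the original post):

a <- "a<-0;cat(sub(0,deparse(a),a))"

# deparse() turns the string back into R source text, quotes included
deparse(a)
# [1] "\"a<-0;cat(sub(0,deparse(a),a))\""

# sub() replaces the placeholder 0 with that quoted text,
# reconstructing the full original line
sub(0, deparse(a), a)
# [1] "a<-\"a<-0;cat(sub(0,deparse(a),a))\";cat(sub(0,deparse(a),a))"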

Happy R-coding!!!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Optimisation of a Poisson survival model using Optimx in R


[This article was first published on R | Joshua Entrop, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this blog post, we will fit a Poisson regression model by maximising its likelihood function using optimx() in R. As an example we will use the lung cancer data set included in the {survival} package. The data set includes information on 228 lung cancer patients from the North Central Cancer Treatment Group (NCCTG). Specifically, we will estimate the survival of lung cancer patients by sex and age using a simple Poisson regression model. You can download the code that I will use throughout this post here. The general survival function \(S(t)\) for our model can be specified as

\[ S(t) = \exp(-\lambda t) \]

where the hazard function \(h(t)\) is equal to

\[ h(t) = \lambda = \exp(\alpha + \beta_1 x_{female} + \beta_2 x_{age}). \]

To get started we first need to load all the packages that we will need for our estimations and set up the data set.

# 1. Prefix -------------------------------------------------------------------

# Remove all files from ls
rm(list = ls())

# Loading packages
require(survival)
require(flexsurv)
require(optimx)
require(numDeriv)
require(dplyr)
require(tibble)
require(car)

# 2. Loading dataset ----------------------------------------------------------

# Reading the example data set lung from the survival package
lung <- as.data.frame(survival::lung)

# Recode dichotomous variables
lung$female <- ifelse(lung$sex == 2, 1, 0)
lung$status_n <- ifelse(lung$status == 2, 1, 0)

At this point we would usually call survreg() or flexsurvreg() to fit our Poisson model. However, in this post we will use the likelihood function of our Poisson regression model together with optimx() from the {optimx} package instead. For this we first need to find the likelihood function for our model and then use optimx() to find the values for our parameters that maximise our likelihood function.

The log likelihood (\(\ln L_i\)) for a survival model can be specified as

\[ \ln L_i = d_i \ln h(t_i) + \ln S(t_i), \]

where \(d_i\) indicates whether the \(i^{th}\) subject experienced an event (1) or did not experience an event (0) during follow up, and \(t_i\) is the time from the beginning of follow up until the event or censoring.

To obtain the log likelihood function for our Poisson model, we can substitute \(h(t)\) and \(S(t)\) with our previously defined hazard and survival functions, respectively. Thus, we get the following equation for our log likelihood

\[ \ln L_i = d_i \ln \lambda - \lambda t_i \]

where \(\lambda\) is defined as mentioned above.

The next step is now to write our likelihood function as a function in R, which can be maximised by optimx(). Please keep in mind that optimx() by default minimises the function we pass to it. However, in our case we need to find the maximum of our likelihood function. To yield the estimates that maximise our function, we can simply ask optimx() to minimise the negative of our likelihood. For more information on setting up the likelihood function for optimx() or optim(), please take a look at this earlier blog post.

Let's set up our likelihood function in R.

# 3. Define log-likelihood function for Poisson regression model --------------
negll <- function(par) {

  # Extract guesses for alpha, beta1 and beta2
  alpha <- par[1]
  beta1 <- par[2]
  beta2 <- par[3]

  # Define dependent and independent variables
  t  <- lung$time
  d  <- lung$status_n
  x1 <- lung$female
  x2 <- lung$age

  # Calculate lambda
  lambda <- exp(alpha + beta1 * x1 + beta2 * x2)

  # Estimate negative log likelihood value
  val <- -sum(d * log(lambda) - lambda * t)

  return(val)
}

To improve the optimisation we can further pass the gradient function of our likelihood function to our optimx() call. After taking the partial derivatives of \(\ln L_i\) with respect to \(\alpha\) and the \(\beta_j\), we obtain the following two equations for the gradient of \(L_i\).

\[ \sum d_i - \lambda_i t_i = 0 \]

\[ \sum d_i x_{ij} - \lambda_i x_{ij} t_i = 0 \]

Given these gradient equations, we can now define our gradient function in R. For this we need to create a function that returns the gradient for each of our unknown parameters. Since we have three unknown parameters, our gradient function will return a vector gg with three values.

# 4. Define gradient function for Poisson regression model -------------------
negll.grad <- function(par) {

  # Extract guesses for alpha, beta1 and beta2
  alpha <- par[1]
  beta1 <- par[2]
  beta2 <- par[3]

  # Define dependent and independent variables
  t  <- lung$time
  d  <- lung$status_n
  x1 <- lung$female
  x2 <- lung$age

  # Create output vector (one element per parameter)
  n  <- length(par)
  gg <- as.vector(rep(0, n))

  # Calculate lambda
  lambda <- exp(alpha + beta1 * x1 + beta2 * x2)

  # Calculate gradients for alpha, beta1 and beta2
  gg[1] <- -sum(d - lambda * t)
  gg[2] <- -sum(d * x1 - lambda * x1 * t)
  gg[3] <- -sum(d * x2 - lambda * x2 * t)

  return(gg)
}

Before we begin with the optimisation, we can compare the results of our gradient function with those from the grad() function included in the {numDeriv} package. This is just a check to be sure our gradient function works properly.

# 4.1 Compare gradient function with numeric approximation of gradient ========
# compare gradient at 0, 0, 0
mygrad  <- negll.grad(c(0, 0, 0))
numgrad <- grad(x = c(0, 0, 0), func = negll)

all.equal(mygrad, numgrad)
## [1] TRUE

Looks like our gradient function does a good job. Now that we have all the functions and information we need for our optimisation, we can call optimx() and pass our functions to it.

The output of optimx() provides us with estimates for our coefficients, information on whether the optimisation algorithm converged (convcode == 0), and the value of the negative log likelihood reached by each algorithm. Hence, it is useful to sort the results by convcode and value.

# 5. Find maximum of log-likelihood function ----------------------------------
# Passing names to the values in the par vector improves readability of results
opt <- optimx(par = c(alpha = 0, beta_female = 0, beta_age = 0),
              fn = negll,
              gr = negll.grad,
              hessian = TRUE,
              control = list(trace = 0, all.methods = TRUE))

# Show results for optimisation algorithms that converged (convcode == 0)
summary(opt, order = "value") %>%
  rownames_to_column("algorithm") %>%
  filter(convcode == 0) %>%
  select(algorithm, alpha, beta_female, beta_age, value)
##     algorithm     alpha beta_female   beta_age    value
## 1      nlminb -6.840606  -0.4809343 0.01561870 1156.099
## 2        BFGS -6.840627  -0.4809436 0.01561907 1156.099
## 3    L-BFGS-B -6.840636  -0.4809316 0.01561902 1156.099
## 4 Nelder-Mead -6.832428  -0.4814582 0.01547911 1156.099

The summary of our optimx() call shows that the nlminb algorithm yielded the best result. Let's see if this result matches the results we get if we use flexsurvreg() from the {flexsurv} package to fit our desired model.

# 6. Estimate regression coefficients using flexsurvreg -----------------------
pois_model <- flexsurvreg(Surv(time, status_n == 1) ~ female + age,
                          data = lung,
                          dist = "exp")

# 7. Comparing results from optimx and flexsurvreg ----------------------------
pois_results <- unname(coef(pois_model))
coef_opt <- coef(opt)

lapply(1:nrow(coef_opt), function(i) {

  opt_name <- attributes(coef_opt)$dimnames[[1]][i]

  mle_pois1 <- (coef_opt[i, 1] - pois_results[1])
  mle_pois2 <- (coef_opt[i, 2] - pois_results[2])
  mle_pois3 <- (coef_opt[i, 3] - pois_results[3])

  mean_dif <- mean(mle_pois1, mle_pois2, mle_pois3, na.rm = TRUE)

  data.frame(opt_name, mean_dif)

}) %>%
  bind_rows() %>%
  filter(!is.na(mean_dif)) %>%
  mutate(mean_dif = abs(mean_dif)) %>%
  arrange(mean_dif)
##      opt_name     mean_dif
## 1      nlminb 2.678650e-07
## 2        BFGS 2.091779e-05
## 3    L-BFGS-B 2.911668e-05
## 4 Nelder-Mead 8.178948e-03
## 5          CG 6.816256e+00
## 6      Rvmmin 6.840606e+00

The mean differences between our estimates and the estimates obtained with flexsurvreg() are close to zero. Seems like our optimisation using the log likelihood did a good job.

However, the result obtained with flexsurvreg() provided us with estimates for the standard errors (SEs) of our hazard estimates, too. Since the measurement of uncertainty is at the heart of statistics, I think it is worthwhile to obtain the SEs for our estimates with the information provided by our optimx() call. For a more detailed discussion on how this is done please take a look at one of my previous blog posts here.

Let’s obtain the SEs for our model by using the results from our optimx() call and compare them with the SEs obtained by flexsurvreg().

# 8. Estimate the standard error ----------------------------------------------
# Extract hessian matrix for nlminb optimisation
hessian_m <- attributes(opt)$details["nlminb", "nhatend"][[1]]

# Estimate SE based on hessian matrix
fisher_info <- solve(hessian_m)
prop_se <- sqrt(diag(fisher_info))

# Compare the estimated SE from our model with the one from the flexsurv model
# Note: use res.t to get the estimates on the real scale without transformation
ses <- data.frame(se_nlminb = prop_se,
                  se_flexsurvreg = pois_model$res.t[, "se"]) %>%
  print()
##          se_nlminb se_flexsurvreg
## rate   0.587477415    0.587136471
## female 0.167094278    0.167094285
## age    0.009105681    0.009097022
all.equal(ses[, "se_nlminb"], ses[, "se_flexsurvreg"])
## [1] "Mean relative difference: 0.000457798"

Looks like we got nearly equal results. Let us now use this information to estimate the 95% confidence intervals (CIs) for our estimates.

# 9. Estimate 95% CIs using estimation of SE -----------------------------------
# Extract estimates from nlminb optimisation
coef_test <- coef(opt)["nlminb", ]

# Compute 95% CIs
upper <- coef_test + 1.96 * prop_se
lower <- coef_test - 1.96 * prop_se

# Print 95% CIs
data.frame(Estimate = coef_test,
           CI_lower = lower,
           CI_upper = upper,
           se = prop_se)
##               Estimate     CI_lower    CI_upper          se
## alpha       -6.8406062 -7.992061931 -5.68915046 0.587477415
## beta_female -0.4809343 -0.808439062 -0.15342949 0.167094278
## beta_age     0.0156187 -0.002228433  0.03346584 0.009105681

One usual way to present the results of our estimation is to plot the survival function \(S(t)\). Since uncertainty is important, I also want to plot the CI for our survival function. Obtaining estimates for the SE of the survival function \(S(t)\) is a little bit more complicated. However, the amazing deltaMethod() function included in the {car} package makes it fairly easy to obtain these estimates. We just need to provide deltaMethod() with a vector of our coefficients, our covariance matrix, and the computation for which we would like to obtain the SEs.

# 10. Plot survival curve with 95%-CI -----------------------------------------

# 10.1 Use Delta Method to compute CIs across time of follow-up ===============
# Get coefficients for nlminb optimisation
nlminb_coef <- coef(opt)["nlminb", ]

# Compute CIs for a 60 year old female across follow-up time
surv_optim_female <- lapply(as.list(seq(0.01, 1000.01, 10)), function(t) {

  g <- paste("exp(-exp(alpha + beta_female + 60 * beta_age) *", t, ")")

  fit <- deltaMethod(nlminb_coef, g, solve(hessian_m))

  data.frame(time = t,
             estimate = fit[, "Estimate"],
             ci_low = fit[, "2.5 %"],
             ci_up = fit[, "97.5 %"])

}) %>%
  bind_rows()

We can now use this information to plot our survival curve \(S(t)\) together with a grey shaded area that indicates the CIs for our survival function.

# 10.2 Plot survival curve with CIs ===========================================
plot(surv_optim_female$time,
     surv_optim_female$estimate,
     ylim = c(0, 1),
     type = "n",
     xlab = "Time in Days",
     ylab = "S(t)",
     main = "Survival after lung cancer \n for 60 year old females")

polygon(c(surv_optim_female$time, rev(surv_optim_female$time)),
        c(surv_optim_female$ci_low, rev(surv_optim_female$ci_up)),
        border = NA,
        col = "grey")

lines(surv_optim_female$time,
      surv_optim_female$estimate)

legend(0, 0.15,
       fill = "grey",
       "95% CI")

To sum it up, in this blog post we learned how to fit a Poisson regression model using the log likelihood function in R instead of going the usual way of calling survreg() or flexsurvreg(). I think doing this is a good way of gaining a deeper understanding of how estimates for regression models are obtained. In my next post I will take this a step further and show how we can fit a Weibull regression model in R using the log likelihood function in combination with optimx().


To leave a comment for the author, please follow the link and comment on their blog: R | Joshua Entrop.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

‘I’ve been waiting for a guide to come and take me by the hand’: Ridgeline plots with {ggridges}


[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I really like ridgeline plots but only recently I have learned how to do them myself. Of course, the most famous ridgeline plot ever is the one you find on the cover of Joy Division’s album “Unknown Pleasures”. I wonder how many ridgeline plots done with the {ggridges} package try to replicate the look of this famous (and great!) album. And – of course – I will try, too.
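As a teaser, a minimal sketch of such an attempt could start from simulated curves rather than the pulsar data of the original cover; everything below is my own illustration, not the post's code:

library(ggplot2)
library(ggridges)

# Simulate some noisy, peaky curves as stand-ins for the pulsar signals
set.seed(1)
df <- do.call(rbind, lapply(1:40, function(i) {
  x <- seq(-5, 5, length.out = 200)
  data.frame(x = x,
             group = factor(i),
             height = dnorm(x, sd = runif(1, 0.3, 1)) + runif(200, 0, 0.05))
}))

# Black background, white ridgelines: the "Unknown Pleasures" look
ggplot(df, aes(x = x, y = group, height = height, group = group)) +
  geom_ridgeline(scale = 3, fill = "black", colour = "white") +
  theme_void() +
  theme(plot.background = element_rect(fill = "black"),
        legend.position = "none")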


To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Archive Existing RDS Files


[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

When working on data science projects in R, exporting internal R objects as files on your hard drive is often necessary to facilitate collaboration. Here at STATWORX, we regularly export R objects (such as outputs of a machine learning model) as .RDS files and put them on our internal file server. Our co-workers can then pick them up for further usage down the line of the data science workflow (such as visualizing them in a dashboard together with inputs from other colleagues).

Over the last couple of months, I came to work a lot with RDS files and noticed a crucial shortcoming: The base R saveRDS function does not allow for any kind of archiving of existing same-named files on your hard drive. In this blog post, I will explain why this might be very useful by introducing the basics of serialization first and then showcasing my proposed solution: A wrapper function around the existing base R serialization framework.

Be wary of silent file replacements!

In base R, you can easily export any object from the environment to an RDS file with:

saveRDS(object = my_object, file = "path/to/dir/my_object.RDS")

However, including such a line somewhere in your script can carry unintended consequences: When calling saveRDS multiple times with identical file names, R silently overwrites existing, identically named .RDS files in the specified directory. If the object you are exporting is not what you expect it to be — for example due to some bug in newly edited code — your working copy of the RDS file is simply overwritten in-place. Needless to say, this can prove undesirable.

If you are familiar with this pitfall, you probably used to forestall such potentially troublesome side effects by commenting out the respective lines, then carefully checking each time whether the R object looked fine, then executing the line manually. But even when there is nothing wrong with the R object you seek to export, it can make sense to retain an archived copy of previous RDS files: Think of a dataset you run through a data prep script, and then you get an update of the raw data, or you decide to change something in the data prep (like removing a variable). You may wish to archive an existing copy in such cases, especially with complex data prep pipelines with long execution time.

Don’t get tangled up in manual renaming

You could manually move or rename the existing file each time you plan to create a new one, but that’s tedious, error-prone, and does not allow for unattended execution and scalability. For this reason, I set out to write a carefully designed wrapper function around the existing saveRDS call, which is pretty straightforward: As a first step, it checks if the file you attempt to save already exists in the specified location. If it does, the existing file is renamed/archived (with customizable options), and the „updated“ file will be saved under the originally specified name.

This approach has the crucial advantage that the existing code that depends on the file name remaining identical (such as readRDS calls in other scripts) will continue to work with the latest version without any needs for adjustment! No more saving your objects as „models_2020-07-12.RDS“, then combing through the other scripts to replace the file name, only to repeat this process the next day. At the same time, an archived copy of the — otherwise overwritten — file will be kept.

What are RDS files anyways?

Before I walk you through my proposed solution, let’s first examine the basics of serialization, the underlying process behind high-level functions like saveRDS.

Simply speaking, serialization is the „process of converting an object into a stream of bytes so that it can be transferred over a network or stored in a persistent storage.“

Stack Overflow: What is serialization?

There is also a low-level R interface, serialize, which you can use to explore (un-)serialization first-hand: Simply fire up R and run something like serialize(object = c(1, 2, 3), connection = NULL). This call serializes the specified vector and prints the output right to the console. The result is an odd-looking raw vector, with each byte separately represented as a pair of hex digits. Now let’s see what happens if we revert this process:

s <- serialize(object = c(1, 2, 3), connection = NULL)
print(s)
# >  [1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 0e 00
# > [29] 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00 00 40 08 00 00 00 00 00 00

unserialize(s)
# > 1 2 3

The length of this raw vector increases rapidly with the complexity of the stored information: For instance, serializing the famous, although not too large, iris dataset results in a raw vector consisting of 5959 pairs of hex digits!

Besides the already mentioned saveRDS function, there is also the more generic save function. The former saves a single R object to a file. It allows us to restore the object from that file (with the counterpart readRDS), possibly under a different variable name: That is, you can assign the contents of a call to readRDS to another variable. By contrast, save allows for saving multiple R objects, but when reading back in (with load), they are simply restored in the environment under the object names they were saved with. (That’s also what happens automatically when you answer „Yes“ to the notorious question of whether to „save the workspace image to ~/.RData“ when quitting RStudio.)
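The difference is easy to see in a small base R example (file names chosen arbitrarily for illustration):

x <- 1:3

# saveRDS(): one object, re-imported under any name you like
saveRDS(x, "x.RDS")
y <- readRDS("x.RDS")   # contents of x, now bound to y

# save()/load(): possibly many objects, restored under their original names
save(x, file = "x.RData")
rm(x)
load("x.RData")         # x reappears in the environment, named x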

Creating the archives

Obviously, it’s great to have the possibility to save internal R objects to a file and then be able to re-import them in a clean session or on a different machine. This is especially true for the results of long and computationally heavy operations such as fitting machine learning models. But as we learned earlier, one wrong keystroke can potentially erase that one precious 3-hour-fit fine-tuned XGBoost model you ran and carefully saved to an RDS file yesterday.

Digging into the wrapper

So, how did I go about fixing this? Let’s take a look at the code. First, I define the arguments and their defaults: The object and file arguments are taken directly from the wrapped function, the remaining arguments allow the user to customize the archiving process: Append the archive file name with either the date the original file was archived or last modified, add an additional timestamp (not just the calendar date), or save the file to a dedicated archive directory. For more details, please check the documentation here. I also include the ellipsis ... for additional arguments to be passed down to saveRDS. Additionally, I do some basic input handling (not included here).

save_rds_archive <- function(object,
                             file = "",
                             archive = TRUE,
                             last_modified = FALSE,
                             with_time = FALSE,
                             archive_dir_path = NULL,
                             ...) {

The main body of the function is basically a series of if/else statements. I first check if the archive argument (which controls whether the file should be archived in the first place) is set to TRUE, and then if the file we are trying to save already exists (note that „file“ here actually refers to the whole file path). If it does, I call the internal helper function create_archived_file, which eliminates redundancy and allows for concise code.

  if (archive) {

    # check if file exists
    if (file.exists(file)) {

      archived_file <- create_archived_file(file = file,
                                            last_modified = last_modified,
                                            with_time = with_time)

Composing the new file name

In this function, I create the new name for the file which is to be archived, depending on user input: If last_modified is set, then the mtime of the file is accessed. Otherwise, the current system date/time (= the date of archiving) is taken instead. Then the spaces and special characters are replaced with underscores, and, depending on the value of the with_time argument, the actual time information (not just the calendar date) is kept or not.

To make it easier to identify directly from the file name what exactly (date of archiving vs. date of modification) the indicated date/time refers to, I also add appropriate information to the file name. Then I save the file extension for easier replacement (note that „.RDS“, „.Rds“, and „.rds“ are all valid file extensions for RDS files). Lastly, I replace the current file extension with a concatenated string containing the type info, the new date/time suffix, and the original file extension. Note here that I add a „$“ sign to the regex which is to be matched by gsub to only match the end of the string: If I did not do that and the file name would be something like „my_RDS.RDS“, then both matches would be replaced.

# create_archived_file.R
create_archived_file <- function(file, last_modified, with_time) {

  # create main suffix depending on type
  suffix_main <- ifelse(last_modified,
                        as.character(file.info(file)$mtime),
                        as.character(Sys.time()))

  if (with_time) {

    # create clean date-time suffix
    suffix <- gsub(pattern = " ", replacement = "_", x = suffix_main)
    suffix <- gsub(pattern = ":", replacement = "-", x = suffix)

    # add "at" between date and time
    suffix <- paste0(substr(suffix, 1, 10), "_at_", substr(suffix, 12, 19))

  } else {

    # create date suffix
    suffix <- substr(suffix_main, 1, 10)

  }

  # create info to paste depending on type
  type_info <- ifelse(last_modified,
                      "_MODIFIED_on_",
                      "_ARCHIVED_on_")

  # get file extension (could be any of "RDS", "Rds", "rds", etc.)
  ext <- paste0(".", tools::file_ext(file))

  # replace extension with suffix
  archived_file <- gsub(pattern = paste0(ext, "$"),
                        replacement = paste0(type_info,
                                             suffix,
                                             ext),
                        x = file)

  return(archived_file)
}

Archiving the archives?

By way of example, with last_modified = FALSE and with_time = TRUE, this function would turn the character file name „models.RDS“ into „models_ARCHIVED_on_2020-07-12_at_11-31-43.RDS“. However, this is just a character vector for now — the file itself is not renamed yet. For this, we need to call the base R file.rename function, which provides a direct interface to your machine’s file system. I first check, however, whether a file with the same name as the newly created archived file string already exists: This could well be the case if one appends only the date (with_time = FALSE) and calls this function several times per day (or potentially on the same file if last_modified = TRUE).

Somehow, we are back to the old problem in this case. However, I decided that it was not a good idea to archive files that are themselves archived versions of another file since this would lead to too much confusion (and potentially too much disk space being occupied). Therefore, only the most recent archived version will be kept. (Note that if you still want to keep multiple archived versions of a single file, you can set with_time = TRUE. This will append a timestamp to the archived file name up to the second, virtually eliminating the possibility of duplicated file names.) A warning is issued, and then the already existing archived file will be overwritten with the current archived version.

The last puzzle piece: Renaming the original file

To do this, I call the file.rename function, renaming the „file“ originally passed by the user call to the string returned by the helper function. The file.rename function always returns a boolean indicating if the operation succeeded, which I save to a variable temp to inspect later. Under some circumstances, the renaming process may fail, for instance due to missing permissions or OS-specific restrictions. We did set up a CI pipeline with GitHub Actions and continuously test our code on Windows, Linux, and MacOS machines with different versions of R. So far, we didn’t run into any problems. Still, it’s better to provide in-built checks.

It’s an error! Or is it?

The problem here is that, when renaming the file on disk failed, file.rename raises merely a warning, not an error. Since any causes of these warnings most likely originate from the local file system, there is no sense in continuing the function if the renaming failed. That’s why I wrapped it into a tryCatch call that captures the warning message and passes it to the stop call, which then terminates the function with the appropriate message.

Just to be on the safe side, I check the value of the temp variable, which should be TRUE if the renaming succeeded, and also check if the archived version of the file (that is, the result of our renaming operation) exists. If both of these conditions hold, I simply call saveRDS with the original specifications (now that our existing copy has been renamed, nothing will be overwritten if we save the new file with the original name), passing along further arguments with ....

        if (file.exists(archived_file)) {
          warning("Archived copy already exists - will overwrite!")
        }

        # rename existing file with the new name
        # save return value of the file.rename function
        # (returns TRUE if successful) and wrap in tryCatch
        temp <- tryCatch({file.rename(from = file,
                                      to = archived_file)
        },
        warning = function(e) {
          stop(e)
        })
      }

      # check return value and if archived file exists
      if (temp & file.exists(archived_file)) {

        # then save new file under specified name
        saveRDS(object = object, file = file, ...)

      }
    }

These code snippets represent the cornerstones of my function. I also skipped some portions of the source code for reasons of brevity, chiefly the creation of the „archive directory“ (if one is specified) and the process of copying the archived file into it. Please refer to our GitHub for the complete source code of the main and the helper function.

Finally, to illustrate, let’s see what this looks like in action:

x <- 5
y <- 10
z <- 20

## save to RDS
saveRDS(x, "temp.RDS")
saveRDS(y, "temp.RDS")

## "temp.RDS" is silently overwritten with y
## previous version is lost
readRDS("temp.RDS")
#> [1] 10

save_rds_archive(z, "temp.RDS")

## current version is updated
readRDS("temp.RDS")
#> [1] 20

## previous version is archived
readRDS("temp_ARCHIVED_on_2020-07-12.RDS")
#> [1] 10

Great, how can I get this?

The function save_rds_archive is now included in the newly refactored helfRlein package (now available in version 1.0.0!) which you can install directly from GitHub:

# install.packages("devtools")
devtools::install_github("STATWORX/helfRlein")

Feel free to check out additional documentation and the source code there. If you have any inputs or feedback on how the function could be improved, please do not hesitate to contact me or raise an issue on our GitHub.

Conclusion

That’s it! No more manually renaming your precious RDS files — with this function in place, you can automate this tedious task and easily keep a comprehensive archive of previous versions. You will be able to take another look at that one model you ran last week (and then discarded again) in the blink of an eye. I hope you enjoyed reading my post — maybe the function will come in handy for you someday!

About the author

Lukas Feick

I am a data scientist at STATWORX. I have always enjoyed using data-driven approaches to tackle complex real-world problems, and to help people gain better insights.

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

Sign Up Now!

The post Archive Existing RDS Files first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Estimating the risks of partying during a pandemic


[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There is no doubt that, every now and then, one ought to celebrate life. This usually involves people coming together, talking, laughing, dancing, singing, shouting; simply put, it means throwing a party. With temperatures rising, summer offers all the more incentive to organize such a joyous event. Blinded by the light, it is easy to forget that we are, unfortunately, still in a pandemic. But should that really deter us?

Walking around central Amsterdam after sunset, it is easy to notice that not everybody holds back. Even if my Dutch was better, it would likely still be difficult to convince groups of twenty-somethings of their potential folly. Surely, they say, it is exceedingly unlikely that this little party of ours results in any virus transmission?

Government retorts by shifting perspective: while the chances of virus spreading at any one party may indeed be small, this does not licence throwing it. Otherwise many parties would mushroom, considerably increasing the chances of virus spread. Indeed, government stresses, this is why such parties remain illegal.

But while if-everybody-did-what-you-did type of arguments score high with parents, they usually do not score high with their children. So instead, in this blog post, we ask the question from an individual’s perspective: what are the chances of getting the virus after attending this or that party? And what factors make this more or less likely?

As a disclaimer, I should say that I am not an epidemiologist — who, by the way, are a more cautious bunch than I or the majority of my age group — and so my assessment of the evidence may not agree with expert opinion. With that out of the way, and without further ado, let’s dive in.

Risky business?

To get us started, let’s define the risk of a party as the probability that somebody who is infected with the novel coronavirus and can spread it attends the gathering. The two major factors influencing this probability are the size of the party, that is, the number of people attending; and the prevalence of infected people in the relevant population. As we will see, the latter quantity is difficult to estimate. The probability of actually getting infected by a person who has the coronavirus depends further on a number of factors; we will discuss those in a later section.

I will base the following calculations on data from the Netherlands, specifically from Amsterdam. You can exchange these numbers with numbers from your country and city of choice. RIVM — the Netherland’s institute for public health — reports new cases across regions. Between July $8^{\text{th}}$ and July $21^{\text{st}}$, there have been $20.4$ reported new cases per $100,000$ residents in the region of Amsterdam, yielding a total of $178$ cases. In the next section, we discuss the difficulties in estimating the prevalence of infections.

Estimating the prevalence of infections

While $178$ is the reported number of new cases, it is not the true number of new cases.1 How do we arrive at that?

First, note that the number of new cases is not the number of current cases. We care about the latter. More specifically, we care about the number of currently infectious cases. This is different from the number of currently infected cases.

Upon infection, it usually takes a while until one can infect others, with estimates ranging from $1$ – $3$ days before showing symptoms. The incubation period is the time it takes from getting infected to showing symptoms. It lasts about $5$ days on average, with the vast majority of people showing symptoms within $12$ days (Lauer et al., 2020; see also here). Yet about a third to a half of people can be infectious without showing any symptoms (Pollán et al. 2020; He et al. 2020). Estimates suggest that one is infectious for about $8$ – $10$ days, but it can be longer.

These are complications, but we need to keep it simple. Currently, visitors from outside Europe must show a negative COVID-19 test or need to self-isolate for $14$ days upon arrival in most European countries (see Austria, for an example). Let’s take these $14$ days for simplicity, and assume conservatively that this is the time one is infectious upon getting infected. Thus, we simply take the reported number of new infected cases in the last two weeks as the reported number of currently infectious cases.

With that out of the way, a key question remains: how do we get from the reported number of infections to the true number of infections? One can estimate the true number of infections using models, or by empirically estimating the seroprevalence in the population, that is, the proportion of people who have developed antibodies.

Using the first approach, Flaxman et al. (2020) estimate the total percentage of the population that has been infected — the attack rate — across $11$ European countries as of May $4^{\text{th}}$. The Netherlands was, unfortunately, not included in these estimates. To get some intuition for the calculation, and to see how countries that have fared better or worse compare to the Netherlands, we turn to Germany, Spain, and the UK. For these three countries the estimated true attack rates were $0.85\%$, $5.50\%$, and $5.10\%$, respectively. Given the population of these countries and the cumulative number of reported infections, we can compute the reported attack rate. Relating this to the estimate of the true attack rate gives us an indication of the extent to which the reports undercount the actual infections; the code below calculates this for the three countries.

library('dplyr')
library('COVID19')

get_undercount <- function(country, attack_rate) {

  # Flaxman et al. (2020) estimate the attack rate as of 4th of May
  dat <- covid19(country = country, end = '2020-05-04', verbose = FALSE)

  dat %>%
    group_by(id) %>%
    summarize(
      population = head(population, 1),
      total_cases = tail(confirmed, 1)
    ) %>%
    mutate(
      attack_rate = attack_rate,
      reported_attack_rate = 100 * total_cases / population,
      undercount_factor = attack_rate / reported_attack_rate
    )
}

get_undercount(c('Germany', 'Spain', 'United Kingdom'), c(0.85, 5.5, 5.10))
## # A tibble: 3 x 6
##   id    population total_cases attack_rate reported_attack_rate undercount_factor
##   <chr>      <int>       <int>       <dbl>                <dbl>             <dbl>
## 1 DEU     82905782      165120        0.85                0.199              4.27
## 2 ESP     46796540      218011        5.5                 0.466             11.8 
## 3 GBR     66460344      191843        5.1                 0.289             17.7

The table above shows that cases were undercounted by a factor of about $4$ in Germany, $12$ in Spain, and $18$ in the UK. The Netherlands undercounted cases by a factor of about $10$ in April (Luc Coffeng, personal communication). The attack rate estimate for Spain is confirmed by a recent seroprevalence study, which finds a similarly low overall proportion of people who have developed antibodies (around $5\%$, with substantial geographical variability) in the period between April $27^{\text{th}}$ and May $11^{\text{th}}$ (Pollán et al. 2020). In another seroprevalence study, Havers et al. 2020 find that between March $23^{\text{rd}}$ and May $12^{\text{nd}}$, reported cases from several areas in the United States undercounted true infections by a factor between $6$ and $24$.

Currently, the pandemic is not as severe in Europe as it was back when the above studies were conducted. Most importantly, the testing capacity has been ramped up. For example, while the proportion of positive tests in Germany, Spain, the UK, and the Netherlands were $2.40\%$, $2.60\%$, $7.10\%$, and $9.40\%$ on the $4^{\text{th}}$ of May (the end date used in the Flaxman et al. 2020 study), they are $0.50\%$, $1.40\%$, $0.60\%$, and $0.60\%$ respectively, using most recent data from here at the time of writing. Thus, these countries are tracking the epidemic much more closely, which in turn implies that the factor by which they undercount the true cases is likely lower than it was previously.

At the same time, cases are rising again, and RIVM estimates the effective reproductive number to be $R_t = 1.29$. Let’s assume therefore that the true number of infectious cases is $5$ times higher than the number of reported infected cases. This includes asymptomatic carriers or those that are pre-symptomatic but still can spread the virus. We assume the estimated true number of infectious cases to therefore be $5 \times 20.4 = 102$ per $100,000$ residents in Amsterdam. We will test how robust our results are against this particular choice later; in the next section, we estimate the risk of a party.

Estimating the risk of a party

What are the chances that a person who attends your party has the coronavirus and is infectious? To calculate this, we assume that party guests form an independent random sample from the population. We will discuss the implications of this crude assumption later; but for now, it allows us to estimate the desired probability in a straightforward manner.

We estimated that Amsterdam had $5 \times 20.4 = 102$ true infectious cases per $100,000$ inhabitants. Assuming that the probability of infection is the same for all citizens (more on this later), this results in $102 / 100,000 = 0.00102$, which gives a $0.102\%$ or $1$ in $980$ chance that a single party guest has the virus and can spread it.

A party with just one guest would be — intimate. So let’s invite a few others. What are the chances that at least one of them can spread the virus? We compute this by first computing the complement, that is, the probability that no party guest is infectious.

The chance that any one person from Amsterdam is not infectious is $1 – 0.00102 = 0.9990$, or $99.90\%$. With our assumption of guests being an independent random sample from the population, the probability that none of the $n$ guests can spread the virus is $0.9990^n$.

The functions below compute the probability that no party guests can spread the virus, as well as the probability that at least one of the guests is infectious; the latter probability is simply the complement of the former.

# Probability that no guest is infectious
prob_virus_free <- function(n, true_relative_cases = 20.4 * 5) {
  prob_virus <- true_relative_cases / 100000
  (1 - prob_virus)^n
}

# Probability that at least one guest is infectious
party_risk <- function(n, true_relative_cases = 20.4 * 5) {
  1 - prob_virus_free(n, true_relative_cases)
}

The figure below shows the party risk in Amsterdam as a function of the party size $n$.

[Figure: Party risk in Amsterdam as a function of party size; left panel shows the full range, right panel zooms in on smaller parties]

The left panel shows how the party risk — the probability that at least one infectious person is at the party — increases with $n$. In particular, to have near certainty that at least one infectious person shows up requires a very large party. The right panel zooms in on reasonable party sizes. Most parties that are thrown indoors probably do not exceed $100$ attendees, depending on how rich and reckless the host is. Some parties, for example this one, can attract $150$ people, but usually take place outdoors.2

The estimates of the party risk based on this simple calculation are somewhat sobering: there is a $2.02\%$, a $4.97\%$, and a $14.19\%$ chance that at least one guest can spread the coronavirus for parties of size $20$, $50$, and $150$, respectively. We have of course made a number of simplifying assumptions to arrive at these estimates, and we will critically discuss them in a later section. We will also talk about the factors that influence the chances of actually getting infected when an infectious guest shows up.
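If you want to check these numbers yourself, you can plug the party sizes directly into the party_risk() function defined above (a quick sketch; the rounding is mine):

round(party_risk(c(20, 50, 150)), 4)
## [1] 0.0202 0.0497 0.1419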

Let me note that RIVM actually has their own estimate of the number of currently infectious cases. At the time of writing, their dashboard shows an estimate of $37.2$ infectious cases per $100,000$ inhabitants (see here for a screenshot of the dashboard at the time of writing). This number is larger than $20.4$, the reported number of cases per $100,000$ between July $8^{\text{th}}$ and July $21^{\text{st}}$.

In terms of our calculations, their model applies a correction factor of $37.2 / 20.4 = 1.824$. RIVM is therefore slightly more optimistic than I am; for parties of size $20$, $50$, and $150$, their estimates of the probability that at least one guest is infectious — assuming guests are a random sample from the population — are $0.74\%$, $1.84\%$, and $5.43\%$, respectively.
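These figures can be reproduced by plugging RIVM’s estimate straight into the party_risk() function from above (a quick sketch of that calculation):

round(party_risk(c(20, 50, 150), true_relative_cases = 37.2), 4)
## [1] 0.0074 0.0184 0.0543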

How does RIVM arrive at their estimate of the number of infectious cases? We currently do not know. Their weekly report (Section 9.1) devotes only two small paragraphs to it, saying that the method is “still under development”.

In any event, what is important to note is that the estimates change with the correction factor. In the next section, we assess this relationship more systematically.

Sensitivity analysis

We have assumed that the reported number of infected cases undercounts the true number of infectious cases by a factor of $5$. In particular, we used the reported cases of $20.4$ per $100,000$ and applied a correction factor of $5$. But what if, in a week from now, the reported cases are $30$ per $100,000$? In the following, we visualize the party risk as a function of the estimated true infectious cases per $100,000$ inhabitants (see also Lachmann & Fox, 2020).

Between July $8^{\text{th}}$ and July $21^{\text{st}}$, $20.4$ cases per $100,000$ inhabitants were reported in Amsterdam. A correction factor of $5$ would bolster this to $102$ cases, a correction factor of $10$ to $204$ cases, and so on; thus, one can backcalculate the correction factor from the estimated true number of cases.
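As a small illustration of that back-calculation (the $20.4$ reported cases per $100,000$ are taken from above; the candidate true case counts are just examples):

reported_cases <- 20.4
correction_factor <- function(true_cases) true_cases / reported_cases
correction_factor(c(102, 204, 306))
## [1]  5 10 15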

The figure below visualizes the probability that at least one party guest has the coronavirus and can spread it as a function of the estimated true number of infectious cases per $100,000$ inhabitants and the size of the party.

[Figure: Probability that at least one guest is infectious, as a function of the true number of infectious cases per 100,000 inhabitants and the party size; coloured lines are contours of equal party risk]

Let’s take a moment to unpack this figure. Each coloured line represents a combination of true number of infectious cases and party size that yields the same party risk. For example, attending a party of size $20$ when the true number of infectious cases per $100,000$ inhabitants is $50$ yields a party risk of $1\%$, but so would, roughly, attending a party of size $10$ when the true relative number of infectious cases is $100$. Thus, there is a trade-off between the size of the party and the true number of infectious cases.
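As a quick check on those two readings of the figure, using the party_risk() function from earlier (a sketch):

round(party_risk(20, true_relative_cases = 50), 4)
## [1] 0.01
round(party_risk(10, true_relative_cases = 100), 4)
## [1] 0.01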

You can get a quick overview of the party risk for different party sizes by checking where the gray solid lines vertically cross the coloured lines. Similarly, you can get a rough understanding of the party risk for different relative numbers of true infectious cases by checking where the gray and coloured lines cross horizontally. The dotted vertical line in the figure gives our previous estimate of the true number of cases.

The figure allows you to estimate the party risk wherever you live; just look up the local number of new cases in the last two weeks and, making the assumptions we have made so far, the plot above gives you the probability that at least one party guest will turn up infectious with the coronavirus. The assumptions we have made are very simplistic, and indeed, if you have a more elaborate way of estimating the number of currently infectious cases, then you can use that number combined with the figure to estimate the party risk.

While we have computed the party risk for a single party, this risk naturally increases when you attend multiple ones. Suppose you have been invited to parties of size $20$, $35$, and $50$ which will take place in the next month. Let’s for simplicity assume that all guests are different each time. Let’s further assume that the number of infectious cases stays constant over the next month; with a current estimate of $R_t = 1.29$, this seems unlikely. Together, these assumptions allow us to calculate the total party risk as the party risk of attending a single party of size $105$, which gives $10.16\%$. It seems that, in this case, fortune does not favour the bold.
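Under these assumptions, this is the same as computing the risk of a single party of size $20 + 35 + 50 = 105$ with the function from above (a quick sketch):

round(party_risk(20 + 35 + 50), 4)
## [1] 0.1016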

Assumptions

The analysis above is a very rough back-of-the-envelope calculation. We have made a number of crucial assumptions to arrive at some numbers. That’s useful as a first approximation; now we have at least some intuition for the problem, and we can critically discuss the assumptions we made. Most importantly, do these assumptions lead to overestimates or underestimates of the party risk?

Independence

First, and most critically, we have assumed that party guests are randomly and independently drawn from the population. It is this assumption that allowed us to compute the joint probability that none of the party guests have the virus by multiplying the probabilities of any individual being virus-free. If you have ever been to a party, you know that this is not true: instead, a considerable number of party guests usually know each other, and it is safe to say that they are similar on a range of socio-demographic variables such as age and occupation.3

This means we are sampling not from the whole population, as our simple calculation assumes, but from some particular subpopulation that is well connected. Since the party guests likely share social circles or even households, the effective party size — in terms of being relevant for virus transmission — is smaller than the actual party size; this is because these individuals share the same risks. A party with $20$ married couples seems safer than a party with $40$ singles. This would suggest that we overestimate the risk of a party.

Uniform infection probability

At the same time, however, our calculations assume that the risk of getting the coronavirus is evenly spread across the population. We used this fact when estimating the probability that any one person has the coronavirus as the total number of cases divided by the population size.

The probability of infection is not, however, evenly distributed. For example, Pollán et al. (2020) report a seroprevalence of about $6\%$ for people aged $65$ or older, while people aged between $20$ and $34$ showed a seroprevalence of $4.4\%$ between April $27^{\text{th}}$ and May $11^{\text{th}}$. These days, however, there is a substantial rise in young people who become infected, in the United States and likely also in Europe. Because young people are less likely to develop symptoms, the virus can spread largely undetected.

Moreover, it seems to me that people who would join a party are in general more adventurous. This would increase the chances of an infectious person attending a party; thus, our calculation above may in fact underestimate the party risk.

At the same time, one would hope that people who show symptoms avoid parties. If all guests do this, then only pre-symptomatic or asymptomatic spread can occur, which would reduce the party risk by half to two thirds. On the flip side, people who show symptoms might get tested for COVID-19 and, upon receiving a negative test, consider it safe to attend a party. This might be foolish, however; recent estimates suggest that tests miss about $20\%$ of infections for people who show symptoms (Kucirka et al., 2020; see also here). For people without symptoms, the test performs even worse.

For parties taking place in summer, it is not unlikely that many guests engaged in holiday travel in the days or weeks before the date of the party. Since travel increases the chances of infection, this would further increase the chances that at least one party guest has contracted the coronavirus.

Estimating true infections

We have assumed that the number of new cases in the last two weeks equals the number of currently infectious cases. This is certainly an approximation. Ideally, we would have a geographically explicit model which, at any point in time and space, provides an estimate of the number of infectious cases. To my knowledge, we are currently lacking such a model.

Note that, if the $178$ people who tested positive all self-isolate or, worse, end up in hospital, this clearly curbs virus spread compared to if they were to continue roaming around. The former seems more likely. Moreover, these reported cases are likely not independent either, with outbreaks usually being localized. Similar to the fact that party guests know each other, the fact that reported cases cluster would lead us to overestimate the extent of virus spread.

At the same time, in the Netherlands only those who show symptoms can get tested. Since about a third to a half of infected people are asymptomatic or pre-symptomatic in the sense that they spread the virus considerably before symptom onset, the reported number of cases likely undercounts the number of infectious people.

All these complications can be summarized, roughly, in the correction factor, which gives the extent to which we believe that the reported number of cases deviates from the true number of infectious cases. We first focused on a factor of $5$, but then assessed the robustness of our results in a sensitivity analysis. For example, for a party size of $50$, the chance that at least one guest is infectious is $1.84\%$ for a correction factor of $1.824$ (corresponding to the official RIVM estimate), $4.97\%$ for a factor of $5$, and $18.49\%$ for a factor of $20$. You can play around with these numbers yourself. Observe how they make you feel. Personally, given what we said above about the infection probability for young and adventurous people, I am inclined to err on the side of caution.
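The following sketch reproduces these three numbers with the party_risk() function defined earlier (the correction factors simply scale the $20.4$ reported cases):

round(party_risk(50, true_relative_cases = 20.4 * c(1.824, 5, 20)), 4)
## [1] 0.0184 0.0497 0.1849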

Estimating the probability of infection

We have defined the party risk as the probability that at least one party guest has the coronavirus and is infectious. If this person does not spread the virus to other guests, no harm is done.

This is exceedingly unlikely, however. The probability of getting infected is a function of the time one is exposed to the virus, and the amount of virus one is exposed to. Estimates suggest that about $1,000$ SARS-CoV-2 infectious virus particles suffice for an infection. With breathing, about $20$ viral particles diffuse into the environment per minute; this increases to $200$ for speaking; coughing or sneezing can release $200,000,000$ (!) virus particles. These do not all fall to the ground, but instead can remain suspended in the air and fill the whole room; thus, physical distancing alone might not be enough indoors (the extent of airborne transmission remains debated, however; see for example Klompas, Baker, & Rhee, 2020). It seems reasonable to assume that, when party guests are crowded in a room for a number of hours, many of them stand a good chance of getting infected if any one of the guests is infectious. Masks would help, of course; but how would I sip my Negroni, wearing one?

It is different outdoors. A Japanese study found that virus transmission indoors was about $19$ times more likely than outdoors (Nishiura et al. 2020). Analyzing $318$ outbreaks in China between January $4^{\text{th}}$ and February $11^{\text{th}}$, Qian et al. (2020) found that only a single one occurred outdoors. This suggests that parties outdoors should be much safer than parties indoors. Yet outdoor parties feature elements unlike other outdoor events; for example, there are areas — such as bars or public toilets — which could become spots for virus transmission. Our simple calculations suggest, with a correction factor of $5$, that the probability that at least one person out of $150$ has the coronavirus is $14.19\%$. While, in contrast to an indoor setting, the infected person is unlikely to infect the majority of the other guests, it seems likely that at least some guests will get the virus.

Conclusion: To party or not to party?

If I do not care whether I get wet or not, I will never carry an umbrella, regardless of the chances of rain. Similarly, my decision to throw (or attend) a party requires not only an estimate of how likely it is that the virus spreads at the gathering; it also requires an assessment of how much I actually care.

As argued above, it is almost certain that the virus spreads to other guests if one guest arrives infected. Noting that all guests are young, one might be tempted to argue that the cost of virus spread is low. In fact, people who party might even be helping — heroically— to build herd immunity!

This reasoning is foolish on two grounds. First, while the proportion of infected people who die is very small for young people — Salje et al. (2020) estimate it to be $0.0045\%$ for people in their twenties and $0.015\%$ for people in their thirties — the picture about the non-lethal, long-term effects of the novel coronavirus is only slowly becoming clear. For some people, recovery can be lengthy — much longer than the two weeks we previously believed it would take. These “mild” cases might not be so mild after all. Moreover, the potential strange neurological effects of a coronavirus infection are becoming increasingly apparent. All told, party animals, even those guarded by their youth, might not shake it off so easily.

Suppose that, even after carefully considering the potential health dangers, one is still willing to take the chances. After all, it would be a really good party, and we young people usually eat our veggies — especially in Amsterdam. The trouble with infectious diseases, though, is that they travel: while you might be happy to take a chance, you and the majority of party guests will probably not self-quarantine after the event, right? If infections occur at the party, the virus is thus likely to subsequently spread to other, more vulnerable parts of the population.

So while you might remain unharmed after attending a party, others might not. Take the story of Bob from Chicago, summarizing an actual infection chain reported by Ghinai et al. (2020):

“Bob was infected but didn’t know. Bob shared a takeout meal, served from common serving dishes, with $2$ family members. The dinner lasted $3$ hours. The next day, Bob attended a funeral, hugging family members and others in attendance to express condolences. Within $4$ days, both family members who shared the meal are sick. A third family member, who hugged Bob at the funeral became sick. But Bob wasn’t done. Bob attended a birthday party with $9$ other people. They hugged and shared food at the $3$ hour party. Seven of those people became ill.

But Bob’s transmission chain wasn’t done. Three of the people Bob infected at the birthday went to church, where they sang, passed the tithing dish etc. Members of that church became sick. In all, Bob was directly responsible for infecting $16$ people between the ages of $5$ and $86$. Three of those $16$ died.”

These events took place before much of the current corona measures were put in place, but the punchline remains: parties are a matter of public, not only individual health. Don’t be like Bob.


I want to thank Denny Borsboom and Luc Coffeng for helpful discussions. I also want to thank Andrea Bacilieri, Denny Borsboom, Tom Dablander, and Charlotte Tanis for helpful comments on this blog post.


Post Scriptum

The code below reproduces the first figure in the main text.

plot_risk <- function(ns, true_relative_cases = 20.4 * 5, ...) {
  plot(
    ns, 1 - prob_virus_free(ns, true_relative_cases = true_relative_cases),
    type = 'l', axes = FALSE, xlab = 'Party size',
    ylab = 'Probability of at least one infection',
    xaxs = 'i', yaxs = 'i', ...
  )
  axis(1)
  axis(2, las = 2)
}

par(mfrow = c(1, 2))

n1 <- 3000
n2 <- 200
y2 <- 1 - prob_virus_free(n2)

plot_risk(
  seq(0, n1), lwd = 2, main = 'Party Risk in Amsterdam',
  xlim = c(0, n1), ylim = c(0, 1), font.main = 1.5
)
lines(c(n2, n2), c(0, y2), col = 'red', lwd = 2)
lines(c(0, n2), c(y2, y2), col = 'red', lwd = 2)

plot_risk(
  seq(0, n2), lwd = 2, main = 'Party Risk in Amsterdam (Zoomed in)',
  ylim = c(0, 0.20), xlim = c(0, n2), font.main = 1.5
)

The code below reproduces the second figure in the main text.

library('RColorBrewer')

# Calculates the party size that results in 'prob_virus_free'
# for a given 'true_relative_cases'
get_party_size <- function(prob_virus_free, true_relative_cases) {
  log(prob_virus_free) / log(1 - true_relative_cases / 100000)
}

plot_total_risk <- function(party_sizes, true_relative_cases, ...) {
  plot(
    true_relative_cases, ns, type = 'n', xaxs = 'i', yaxs = 'i', axes = FALSE,
    ylab = 'Party Size',
    xlab = 'True Number of Infectious Cases per 100,000 Inhabitants', ...
  )
  
  ticks <- seq(0, 300, 25)
  axis(1, at = ticks)
  axis(2, at = ticks, las = 2)
  abline(h = ticks, col = 'gray86')
  abline(v = ticks, col = 'gray86')
  
  probs_virus <- seq(0.01, 0.99, 0.01)
  party_sizes <- sapply(1 - probs_virus, get_party_size, true_relative_cases)
  
  cols <- rev(heat.colors(50))
  cols <- colorRampPalette(cols, bias = 2)(99)[-1]
  
  ix <- c(seq(1, 55, 1), seq(60, 95, 5))
  show_text <- c(seq(1, 10, 1), seq(15, 55, 5))
  
  for (i in ix) {
    y <- party_sizes[, i]
    lines(true_relative_cases, y, col = cols[i], lwd = 2.5)
    j <- which.min(true_relative_cases[-1] + y[-1])
    
    if (i %in% show_text) {
      text(
        true_relative_cases[j] + 1, y[j] + 1,
        paste0(probs_virus[i] * 100, '%'), cex = 0.80
      )
    }
  }
}

ns <- seq(0, 300)
true_relative_cases <- seq(0, 300, length.out = length(ns))

plot_total_risk(
  ns, true_relative_cases,
  main = 'Probability That at Least One Guest is Infectious',
  font.main = 1, cex.main = 1.5
)
lines(c(20.4 * 5, 20.4 * 5), c(0, 300), lty = 2)

References

  • Flaxman, Mishra, Gandy et al. (2020). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature, 3164.
  • Ghinai, I., Woods, S., Ritger, K. A., McPherson, T. D., Black, S. R., Sparrow, L., … & Arwady, M. A. (2020). Community Transmission of SARS-CoV-2 at Two Family Gatherings-Chicago, Illinois, February-March 2020. MMWR. Morbidity and mortality weekly report, 69(15), 446.
  • Havers, F. P., Reed, C., Lim, T. W., Montgomery, J. M., Klena, J. D., Hall, A. J., … & Krapiunaya, I. (2020). Seroprevalence of Antibodies to SARS-CoV-2 in Six Sites in the United States, March 23-May 3, 2020. JAMA Internal Medicine.
  • He, X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X., … & Mo, X. (2020). Temporal dynamics in viral shedding and transmissibility of COVID-19. Nature Medicine, 26(5), 672-675.
  • Klompas, M., Baker, M. A., & Rhee, C. (2020). Airborne Transmission of SARS-CoV-2: Theoretical Considerations and Available Evidence. JAMA.
  • Kucirka, L. M., Lauer, S. A., Laeyendecker, O., Boon, D., & Lessler, J. (2020). Variation in false-negative rate of reverse transcriptase polymerase chain reaction–based SARS-CoV-2 tests by time since exposure. Annals of Internal Medicine.
  • Lachmann, M., & Fox, S. (2020). When thinking about reopening schools, an important factor to consider is the rate of community transmission. Santa Fe Institute Transmission.
  • Lauer, S. A., Grantz, K. H., Bi, Q., Jones, F. K., Zheng, Q., Meredith, H. R., … & Lessler, J. (2020). The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of Internal Medicine, 172(9), 577-582.
  • Morawska, L., & Cao, J. (2020). Airborne transmission of SARS-CoV-2: The world should face the reality. Environment International, 105730.
  • Nishiura, H., Oshitani, H., Kobayashi, T., Saito, T., Sunagawa, T., Matsui, T., … & Suzuki, M. (2020). Closed environments facilitate secondary transmission of coronavirus disease 2019 (COVID-19). medRxiv.
  • Pollán, M., Pérez-Gómez, B., Pastor-Barriuso, R., Oteo, J., Hernán, M. A., Pérez-Olmeda, M., … & Molina, M. (2020). Prevalence of SARS-CoV-2 in Spain (ENE-COVID): a nationwide, population-based seroepidemiological study. The Lancet.
  • Salje, H., Kiem, C. T., Lefrancq, N., Courtejoie, N., Bosetti, P., Paireau, J., … & Le Strat, Y. (2020). Estimating the burden of SARS-CoV-2 in France. Science.
  • Qian, H., Miao, T., Li, L. I. U., Zheng, X., Luo, D., & Li, Y. (2020). Indoor transmission of SARS-CoV-2. medRxiv.

Footnotes

  1. Reported deaths are more reliable than reported cases because deaths must always be reported. This is why, for example, Flaxman et al. (2020) use deaths to estimate the actual proportion of infections. There are issues with reported deaths, too, however, and I discuss some of them here↩

  2. Curiously, this party allowed a total of $300$ guests when I first drafted this post a few days ago. That would have resulted in a party risk of $26.37\%$. They have since changed the total to $150$, maybe because the organizers actually sat down to do some calculations similar to the ones in this post? Still — and even though I am a big fan of Dominik Eulberg — $150$ guests strike me as too many. ↩

  3. Once the pandemic is over, inviting a random sample from the population should definitely become a thing. Bursting bubbles, one party at a time! ↩


To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.


High School Swimming Tournament: New York (4) vs. Pennsylvania (5)


[This article was first published on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

So it begins – the first match-up of the first round of the first ever Swimming + Data Science High School Swimming State-Off. If you missed the introductory post and are wondering what all this is about you can catch up here. Today’s match-up is between number 4 seeded New York and number 5 seeded Pennsylvania.

Setup

To begin let’s load some packages. SwimmeR will be our workhorse for getting hold of the results. Then we’ll use the various tidyverse packages dplyr, stringr and purrr to facilitate our analysis, and flextable for nice tables of the results.

library(SwimmeR)
library(dplyr)
library(stringr)
library(purrr)
library(flextable)

Raw Results

New York State has two seasons in a given school year: a girls’ season in the fall, followed by a boys’ season in the winter. They also have an excellent repository of results. We’re just interested in the most recent state meets, Girls 2019 and Boys 2020, so let’s put links to those results into a list.


NY_Boys <- "http://www.nyhsswim.com/Results/Boys/2020/NYS/Single.htm"
NY_Girls <- "http://nyhsswim.com/Results/Girls/2019/NYS/Single.htm"

NY_Links <- c(NY_Boys, NY_Girls)

The goal is to read in and clean the raw results, which we’ll do with SwimmeR::read_results and SwimmeR::swim_parse respectively. Before doing so, however, it’s useful to look over the raw results for potential issues.

We can see in the New York raw results that federation and state records for the two meets are recorded as “Federation”, “NYS Fed”, “NYSPHSAA” and “NYS Meet Rec”. Those are strings we’ll want to tell SwimmeR::swim_parse to avoid.

[Screenshot: header of the New York boys raw results]

[Screenshot: header of the New York girls raw results]


NY_Avoid <- c("Federation", "NYS Fed", "NYSPHSAA", "NYS Meet Rec")

Pennsylvania is a similar story, with a nice results repository. Unlike New York, however, Pennsylvania has two different divisions for their state championships, somewhat confusingly called 2A and 3A. The 3A championships were held in 2020 (boys and girls) but the 2A championships were canceled due to COVID-19. Also, diving wasn’t included in the Girls 3A 2020 results, so as State-Off meet director I’ll be substituting 2019 results for 3A diving and all of 2A. There will be five total links.


PA_Boys_3A <- "http://www.paswimming.com/19_20/results/state/PIAA_3_A_boys_states_Results.htm"
PA_Girls_3A <- "http://www.paswimming.com/19_20/results/state/PIAA_3_A_girls_states_Results.htm"
PA_Girls_3A_Diving <- "http://www.piaa.org/assets/web/documents/2019_3a_girls_f_dive_results.htm"
PA_Boys_2A <- "http://www.paswimming.com/18_19/results/states/Results/2_A_Boys_Results_2019.htm"
PA_Girls_2A <- "http://www.paswimming.com/18_19/results/states/Results/2_A_Girls_Results_2019.htm"

PA_Links <- c(PA_Boys_3A, PA_Girls_3A, PA_Girls_3A_Diving, PA_Boys_2A, PA_Girls_2A)

Inspecting the Pennsylvania raw results gives us a few more strings to avoid, namely "PIAA" (PA record), plus "NFHS" and "NF Hon. Roll".

[Screenshot: header of the Pennsylvania raw results]


PA_Avoid <- c("PIAA", "NFHS", "NF Hon. Rol")

Reading in Results with the SwimmeR Package

Getting our results is now a simple matter of mapping read_results and swim_parse over our list of links with our avoid lists passed to the avoid argument of swim_parse.

We’ll then add columns State and Gender since those are the parameters of our meet – each state is a team, with boys and girls meets, plus a combined total.

Results <- map(c(NY_Links, PA_Links), Read_Results, node = "pre") %>%
  map(Swim_Parse, avoid = c(NY_Avoid, PA_Avoid)) %>%
  set_names(c("NY_Boys", "NY_Girls", "PA_Boys", "PA_Girls", "PA_Girls", "PA_Boys", "PA_Girls")) %>%
  bind_rows(.id = "Source")

Results <- Results %>%
  mutate(
    State = str_split_fixed(Source, "_", n = 2)[, 1],
    Gender = str_split_fixed(Source, "_", n = 2)[, 2]
  ) %>%
  select(-Source) %>%
  filter(str_detect(Event, "Swim-off") == FALSE) # remove swim-offs

More Detail on Meet Parameters

We’ll use National Federation of State High School Associations (NFHS) scoring, as below. It’s important to specify that 17th place gets 0 points when it comes to dealing with ties.


Point_Values <- c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1, 0)
names(Point_Values) <- 1:17

In order to score the meet we need to reorder finishes from each state meet in the context of our larger meet. At first glance this is simple, because the fastest (i.e. lowest) time will win, followed by the second fastest/lowest in second etc. There are several complications though.

  1. Unique swims: In the New York results there are “Federation” results and “Association” results for each event. Federation is a subset of Association though, so athletes/relay teams in the Federation are listed twice, once in the Federation results and then again (with the same times) in the Association results. The Pennsylvania results include preliminary swims, so athletes/relay teams are also listed twice, once in the finals (which appear first) and again in the prelims, with different times in each instance. We’ll need a way to get only the first instance of an athlete or relay team in a given event.

  2. Relays: Relays are different from individual swims for two reasons

  • Naming: relays are named by the team/school (Central High), whereas athletes have both a team/school and an individual name (Sally Swimfast from Central High)
  • Scoring: point values are doubled for relays
  3. Ties: Ties happen, and the procedure (per NFHS rules) is for competitors to be awarded the average of their place and the voided place. For example, if two athletes tie for 9th place then there will be no 10th place finisher (both athletes get 9th, 10th is voided). The point value for 9th place is 9 points and the point value for 10th is 7, so each athlete receives (9 + 7) = 16, divided by two, equals 8 points. Our scoring needs to handle this; a small sketch of this rule, with made-up data, follows right after this list.

  4. Diving: Here at Swimming + Data Science we love diving even if it is a complication. We’re not going to just cut diving out, we’re going to deal with diving on its own terms. Diving results are different from swimming results for two reasons.

  • Format: Diving results are scores not times.
  • Ordering: The highest score in diving wins, compared to the fastest (i.e. lowest) time winning in swimming.
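To make the tie rule concrete before applying it to the real results, here is a minimal standalone sketch. The sixteen athletes and their places are made up; it reuses the Point_Values vector defined above and the dplyr functions we have already loaded.

# Hypothetical top 16 finishers: two athletes tie for 9th, so 10th place is voided
demo <- tibble(
  Name = paste0("Athlete_", 1:16),
  Place = c(1:8, 9, 9, 11:16)
)

demo %>%
  mutate(New_Place = rank(Place, ties.method = "first"), # breaks the tie: 1 through 16
         Points = Point_Values[New_Place]) %>%           # raw point values
  group_by(Place) %>%
  summarize(Points = mean(Points)) # the two 9th-place athletes each get (9 + 7) / 2 = 8 points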

General Workflow

  1. Break up Results into relays, diving, and individual swimming using filter.
  2. Take only the first instance of an athlete/team in an event using group_by and slice.
  3. For relays and individual swims, convert times in minutes:seconds.hundredths format to seconds with SwimmeR::sec_format (a one-line illustration follows after this list).
  4. Reorder and record finishes on basis of time (or score) across the new NY vs. PA meet using arrange and mutate.
  5. Award points, accounting for ties, using a nifty little combo of rank, summarize and inner_join.
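As a one-line illustration of the time conversion in step 3 (the example times are made up; SwimmeR is already loaded above):

sec_format(c("1:46.15", "48.76"))
## [1] 106.15  48.76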

Relays

Relay_Results <- Results %>%
  filter(str_detect(Event, "Relay") == TRUE) %>% # only want relays
  group_by(Event, School) %>%
  slice(1) %>% # select first occurrence of team in each event
  ungroup() %>%
  mutate(Finals_Time_sec = sec_format(Finals_Time)) %>% # convert time to seconds
  group_by(Event) %>%
  mutate(Place = rank(Finals_Time_sec, ties.method = "min")) %>% # places, low number wins
  filter(Place <= 16) %>% # only top 16 score
  select(-Points)

Relay_Results <- Relay_Results %>% # deal with ties
  mutate(New_Place = rank(Place, ties.method = "first"),
         Points = Point_Values[New_Place]) %>%
  group_by(Place, Event) %>%
  summarize(Points = mean(Points)) %>%
  inner_join(Relay_Results) %>%
  mutate(Points = Points * 2) # double point values for relays

Diving

Same basic structure as relays, but we need to handle scores differently than times.

Diving_Results <- Results %>%
  filter(str_detect(Event, "Diving") == TRUE) %>% # only want diving events
  mutate(Finals_Time = as.numeric(Finals_Time)) %>%
  group_by(Event, Name) %>%
  slice(1) %>% # first instance of every diver
  ungroup() %>%
  group_by(Event) %>%
  mutate(Place = rank(desc(Finals_Time), ties.method = "min"), # again, highest score gets rank 1
         Finals_Time = as.character(Finals_Time)) %>%
  filter(Place <= 16) %>% # only top 16 score
  select(-Points)

Diving_Results <- Diving_Results %>% # deal with ties
  mutate(New_Place = rank(Place, ties.method = "first"),
         Points = Point_Values[New_Place]) %>%
  group_by(Place, Event) %>%
  summarize(Points = mean(Points)) %>%
  inner_join(Diving_Results)

Individual Swimming

Again, very similar to diving and relays.

Ind_Swimming_Results <- Results %>%
  filter(str_detect(Event, "Diving") == FALSE,
         str_detect(Event, "Relay") == FALSE) %>%
  group_by(Event, Name) %>%
  slice(1) %>% # first instance of every swimmer
  ungroup() %>%
  group_by(Event) %>%
  mutate(Finals_Time_sec = sec_format(Finals_Time)) %>% # time as seconds
  mutate(Place = rank(Finals_Time_sec, ties.method = "min")) %>% # places, low number wins
  filter(Place <= 16) %>% # only top 16 score
  select(-Points)

Ind_Swimming_Results <- Ind_Swimming_Results %>% # deal with ties
  mutate(New_Place = rank(Place, ties.method = "first"),
         Points = Point_Values[New_Place]) %>%
  group_by(Place, Event) %>%
  summarize(Points = mean(Points)) %>%
  inner_join(Ind_Swimming_Results)

The Final Results

Let’s bind together the results from our three cases (relays, diving and individual swims) and do a bit of cleaning up. Pennsylvania, for example, has all their results in block capitals. That can be fixed with str_to_title.


Results_Final <-
  bind_rows(Relay_Results, Diving_Results, Ind_Swimming_Results) %>%
  mutate(Name = str_to_title(Name),
         School = str_to_title(School)) %>%
  mutate(School = str_remove_all(School, "[:punct:]"),
         School = str_remove_all(School, "[0-9]"))

Scores

Now we summarise and see who won!


Scores <- Results_Final %>%
  group_by(State, Gender) %>%
  summarise(Score = sum(Points))

Scores %>%
  arrange(Gender, desc(Score)) %>%
  flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

State   Gender   Score
PA      Boys    1711.5
NY      Boys     613.5
PA      Girls   1524.0
NY      Girls    801.0

Scores %>%
  group_by(State) %>%
  summarise(Score = sum(Score)) %>%
  arrange(desc(Score)) %>%
  flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

State    Score
PA      3235.5
NY      1414.5


Pennsylvania wins both meets and the combined in an upset, by quite a wide margin!

It’s interesting to think for a moment about why this might be. The State-Off is seeded by population. New York has about 19 million people, but about 8 million of them live in New York City. New York City doesn’t have very many swimmers. Swimmers from the New York City Public High School Athletic League have a -P designation after their school name in the raw results. The cleaning we did on Results_Final reduced this to a trailing P, which we can search for with str_detect.

Results_Final %>%
  ungroup() %>%
  filter(str_detect(School, "P$")) %>%
  summarise(Count = n())
## # A tibble: 1 x 1
##   Count
## 1     3
Results_Final %>%
  ungroup() %>%
  filter(State == "NY") %>%
  summarise(Count = n())
## # A tibble: 1 x 1
##   Count
## 1   134

Only three swims out of New York’s total of 134 swims are from New York City. Pools take up a lot of space, so they’re difficult to install in cities generally. New York City is also very dense, which makes building pools that much harder. Pennsylvania on the other hand has a total population of 12 million. Philadelphia (1.5 million) and Pittsburgh (300k) are much smaller than New York City, so it’s possible that much more of the Pennsylvania population lives in areas conducive to swimming. There’s also a racial component. New York City has a higher proportion of African American residents than New York State as a whole, and African Americans have been subjected to segregation and systematic discrimination, including specifically with respect to swimming pools, to the extent that even today black children drown at a rate 3x that of white children. New York’s larger than expected non-swimming population may be reflected in its lower than expected State-Off score.


Swimmers of the Meet

To determine the swimmers of the meet there will be two qualifications:

  1. An athlete must have competed in two events – sorry divers. Winner will be the athlete with the lowest average place (winning two events gives an average place of 1). This is an individual award so relays don’t count.
  2. As a tiebreaker from 1. above, the athlete whose times are fastest across their two events relative to the All-American cuts will be Swimmer of the Meet.

Now if only someone had the All-American cuts readily accessible. Oh wait someone does and that someone is me. Let’s grab those cuts and join them to Ind_Swimming_Results. Then we can do some math to calculate each athlete’s average difference from the All-American cut.


Cuts_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/State_Cuts.csv"
Cuts <- read.csv(url(Cuts_Link))

'%!in%' <- function(x, y) !('%in%'(x, y)) # "not in" function

Cuts <- Cuts %>% # clean up Cuts
  filter(Stroke %!in% c("MR", "FR", "11 Dives")) %>%
  rename(Gender = Sex) %>%
  mutate(
    Event = case_when((Distance == 200 & # match events
                         Stroke == 'Free') ~ "200 Yard Freestyle",
                      (Distance == 200 &
                         Stroke == 'IM') ~ "200 Yard IM",
                      (Distance == 50 &
                         Stroke == 'Free') ~ "50 Yard Freestyle",
                      (Distance == 100 &
                         Stroke == 'Fly') ~ "100 Yard Butterfly",
                      (Distance == 100 &
                         Stroke == 'Free') ~ "100 Yard Freestyle",
                      (Distance == 500 &
                         Stroke == 'Free') ~ "500 Yard Freestyle",
                      (Distance == 100 &
                         Stroke == 'Back') ~ "100 Yard Backstroke",
                      (Distance == 100 &
                         Stroke == 'Breast') ~ "100 Yard Breaststroke",
                      TRUE ~ paste(Distance, "Yard", Stroke, sep = " ")),
    Event = case_when(Gender == "M" ~ paste("Boys", Event, sep = " "),
                      Gender == "F" ~ paste("Girls", Event, sep = " ")))

Ind_Swimming_Results <- Ind_Swimming_Results %>%
  left_join(Cuts %>% filter((Gender == "M" &
                               Year == 2020) |
                              (Gender == "F" &
                                 Year == 2019)) %>%
              select(AAC_Cut, AA_Cut, Event),
            by = 'Event')

Swimmer_Of_Meet <- Ind_Swimming_Results %>%
  mutate(AA_Diff = (Finals_Time_sec - sec_format(AA_Cut)) / sec_format(AA_Cut),
         Name = str_to_title(Name)) %>%
  group_by(Name) %>%
  filter(n() == 2) %>% # get swimmers that competed in two events
  summarise(Avg_Place = sum(Place) / 2,
            AA_Diff_Avg = round(mean(AA_Diff, na.rm = TRUE), 2),
            Gender = unique(Gender),
            State = unique(State)) %>%
  arrange(Avg_Place, AA_Diff_Avg) %>%
  group_split(Gender) # split out a dataframe for boys (1) and girls (2)

Boys

Boys swimmer of the meet is Matt Brownstead from Pennsylvania, the only boy to win two events! He also broke the national high school record in the 50 free. Let’s see his results.


Swimmer_Of_Meet[[1]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  flextable::flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

Name               Avg_Place   AA_Diff_Avg   State
Brownstead, Matt         1.0         -0.05      PA
Jensen, Matthew          1.5         -0.04      PA
Faikish, Sean            1.5         -0.03      PA
Newmark, Jake            1.5         -0.02      NY
Guiliano, Chris          2.0         -0.02      PA


Results_Final %>%
  filter(Name == "Brownstead, Matt") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  flextable::flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

Place   Name               School          Finals_Time   Event
1       Brownstead, Matt   State College         19.24   Boys 50 Yard Freestyle
1       Brownstead, Matt   State College         43.29   Boys 100 Yard Freestyle

Girls

As for the girls the competition was a bit tighter, with two athletes, Chloe Stepanek and Megan Deuel, both winning two events. Going to our All-American standard tiebreaker gives the win to Chloe Stepanek! Winning here is hopefully some solace for Chloe after Megan won the award at the NYS girls meet.


Swimmer_Of_Meet[[2]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  flextable::flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

Name                 Avg_Place   AA_Diff_Avg   State
Chloe Stepanek             1.0         -0.03      NY
Megan Deuel                1.0         -0.02      NY
Catherine Stanford         1.5         -0.01      NY
Cavan Gormsen              2.0         -0.01      NY
Buerger, Torie             2.5         -0.01      PA


Results_Final %>%
  filter(Name == "Chloe Stepanek") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  flextable::flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

Place   Name             School      Finals_Time   Event
1       Chloe Stepanek   Northport       1:46.15   Girls 200 Yard Freestyle
1       Chloe Stepanek   Northport         48.76   Girls 100 Yard Freestyle


In Closing

That wraps up this match-up. Join us next time here at Swimming + Data Science for another Round 1 match-up – number 1 seed California vs. number 8 seed Georgia. We’ll see you then!


To leave a comment for the author, please follow the link and comment on their blog: Swimming + Data Science.


Test


[This article was first published on R Language – LearnR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Online Training Software


To leave a comment for the author, please follow the link and comment on their blog: R Language – LearnR.


[The Art of] Regression and other stories


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

CoI: Andrew sent me this new book [scheduled for 23 July on amazon] of his with Jennifer Hill and Aki Vehtari. Which I read in my garden over a few sunny morns. And as Andrew and Aki are good friends of mine, this review is definitely subjective and biased! Hence to take with a spoonful of salt.

The “other stories” in the title is a very nice touch. And a clever idea. As the construction of regression models comes as a story to tell, from gathering and checking the data, to choosing the model specifications, to analysing the output and setting the safety lines on its interpretation and usages. I added “The Art of” in my own title as the exercise sounds very much like an art and very little like a technical or even less mathematical practice. Even though the call to the resident stan_glm R function is ubiquitous.

The style itself is very story-like, very far from a mathematical statistics book as, e.g., C.R. Rao’s Linear Statistical Inference and Its Applications. Or his earlier Linear Models which I got while drafted in the Navy. While this makes the “Stories” part most relevant, I also wonder how I could teach from this book to my own undergrad students without acquiring first (myself) the massive expertise represented by the opinions and advice on what is correct and what is not in constructing and analysing linear and generalised linear models. In the sense that I would find justifying or explaining opinionated sentences an amathematical challenge. On the other hand, it would make for a great remote course material, leading the students through the many chapters and letting them experiment with the code provided therein, creating new datasets and checking modelling assumptions. The debate between Bayesian and likelihood solutions is quite muted, with a recommendation for weakly informative priors superseded by the call for exploring the impact of one’s assumption. (Although the horseshoe prior makes an appearance, p.209!) The chapter on math and probability is somewhat superfluous as I hardly fathom a reader entering this book without a certain amount of math and stats background. (While the book warns about over-trusting bootstrap outcomes, I find the description in the Simulation chapter a wee bit too vague.) The final chapters about causal inference are quite impressive in their coverage but clearly require a significant amount of investment from the reader to truly ingest these 110 pages.

“One thing that can be confusing in statistics is that similar analyses can be performed in different ways.” (p.121)

Unsurprisingly, the authors warn the reader about simplistic and unquestioning usages of linear models and software, with a particularly strong warning about significance. (Remember Abandon Statistical Significance?!) And keep (rightly) arguing about the importance of fake data comparisons (although this can be overly confident at times). Great Chapter 11 on assumptions, diagnostics and model evaluation. And terrific Appendix B on 10 pieces of advice for improving one’s regression model. Although there are two or three pages on the topic, at the very end, I would have also appreciated a more balanced and constructive coverage of machine learning as it remains a form of regression, which can be evaluated by simulation of fake data and assessed by X validation, hence quite within the range of the book.

The document reads quite well, even pleasantly once one is over the shock at the limited amount of math formulas!, my only grumble being a terrible handwritten graph for building copters (Figure 1.9) and the numerous and sometimes gigantic square root symbols throughout the book. At a more meaningful level, it may feel somewhat US-centric, at least given the large fraction of examples dedicated to US elections. (Even though restating the precise predictions made by decent models on the eve of the 2016 election is worthwhile.) The Oscar for the best section title goes to “Cockroaches and the zero-inflated negative binomial model” (p.248)! But overall this is a very modern, stats centred, engaging and careful book on the most common tool of statistical modelling! More stories to come maybe?!


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


tidyr::complete to show all possible combinations of variables


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This is an issue I often face, so I thought it best to write it down. When doing data analysis, we often want to know how many observations there are in each subgroup. These subgroups can be defined by multiple variables. In the code example below, I want to know how many vehicles there are for each (cyl, gear) combination:

library(tidyverse)
data(mtcars)
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n())

# # A tibble: 8 x 3
# # Groups:   cyl [3]
#     cyl  gear count
#     
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     5     2

If you look carefully, you will notice that there are no vehicles with cyl == 8 and gear == 4. In general it’s probably better to include this combination as a row in the tibble, with count as 0. This is especially important in data pipelines where future processes might expect there to be length(unique(cyl)) * length(unique(gear)) rows in the dataset.

We can achieve this by ungrouping the dataset and applying tidyr::complete(). This ensures that every possible (cyl, gear) combination gets a row.

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear)

# # A tibble: 9 x 3
#     cyl  gear count
#     
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4    NA
# 9     8     5     2

For rows that didn’t appear in the original summary table, complete() fills up the remaining columns with NA. We can specify the value complete() should use to fill in these cells with the fill option:

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))

# # A tibble: 9 x 3
#     cyl  gear count
#     
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4     0
# 9     8     5     2
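As a quick sanity check of the earlier point that downstream code may expect length(unique(cyl)) * length(unique(gear)) rows, the small sketch below confirms that complete() gives exactly that many:

completed <- mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))

# one row per (cyl, gear) combination
nrow(completed) == length(unique(mtcars$cyl)) * length(unique(mtcars$gear))
# [1] TRUE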

References:

  1. Reddit. Need help with dplyr: Show all possible group combinations.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

