
Biologically Plausible Fake Survival Data


In two recent posts, one on the Disease Progression Model and the other on Fake Data, I highlighted some of R's tools for simulating data that exhibit desired correlations and other statistical properties. In this post, I’ll focus on a small cluster of R packages that support generating biologically plausible survival data.

Background

In an impressive paper, Simulating biologically plausible complex survival data (Crowther & Lambert, 2013), that combines survival analysis theory and numerical methods, Michael Crowther and Paul Lambert address the problem of simulating survival data in which the event time, censoring, and covariate distributions are all plausible. They develop a methodology for conducting survival analysis studies, and also provide computational tools for moving beyond the usual exponential, Weibull and Gompertz models. Building on the work of Bender et al. (2005), who established a framework for simulating survival data for Cox proportional hazards models, Crowther and Lambert discuss how modelers can incorporate non-proportional hazards, time-varying effects, delayed entry and random effects, and they provide code examples based on the Stata survsim package.

The survsim package

Not long after the Stata package appeared, Moriña and Navarro released the R survsim package, which implements some of the features of the Stata package for simulating complex survival data. The R package does not have a vignette, but you can find several examples in the JSS paper by Moriña & Navarro (2014).

The following example from section 4.3 of the paper simulates adverse events for a clinical trial with 100 patients followed up for 30 days. The authors suggest that the three covariates x could represent body mass index, age at entry to the cohort, and whether or not the subject has hypertension. This is a somewhat unusual and sophisticated example of survival modeling.

library(survsim)

set.seed(12345)
dist.ev <- c("weibull", "llogistic", "weibull")
anc.ev <- c(0.8, 0.9, 0.82)
beta0.ev <- c(3.56, 5.94, 5.78)
beta <- list(c(-0.04, -0.02, -0.01), c(-0.001, -0.0008, -0.0005),
             c(-0.7, -0.2, -0.1))
x <- list(c("normal", 26, 4.5), c("unif", 50, 75), c("bern", 0.25))

clinical.data <- mult.ev.sim(n = 100,          # number of patients in cohort
                             foltime = 30,     # maximal follow-up time
                             dist.ev,          # time-to-event distributions (t.e.d.)
                             anc.ev,           # ancillary parameters for t.e.d.
                             beta0.ev,         # beta0 parameters for t.e.d.
                             dist.cens = "weibull", # censoring distribution
                             anc.cens = 1,     # ancillary parameter for censoring dist
                             beta0.cens = 5.2, # beta0 for censoring dist
                             z = list(c("unif", 0.6, 1.4)), # random effect dist
                             beta,             # effects of covariates
                             x,                # distributions of covariates
                             nsit = 3)         # max number of adverse events per individual
head(round(clinical.data, 2))
##   nid ev.num  time status start  stop    z     x   x.1 x.2
## 1   1      1  5.79      1     0  5.79 0.97 28.63 69.02   1
## 2   1      2 30.00      0     0 30.00 0.97 28.63 69.02   1
## 3   1      3 30.00      0     0 30.00 0.97 28.63 69.02   1
## 4   2      1  3.37      1     0  3.37 0.60 36.42 53.81   0
## 5   2      2 30.00      0     0 30.00 0.60 36.42 53.81   0
## 6   2      3 30.00      0     0 30.00 0.60 36.42 53.81   0

The simsurv package

In the vignette How to use the simsurv package, the package authors Sam Brilleman and Alessandro Gasparini state that they directly modeled their package on the Stata package survsim and cite the Crowther and Lambert paper. They show how simsurv builds out much of the functionality envisioned there in examples that demonstrate the interplay between model fitting and simulation. Example 2 of the vignette is concerned with constructing fake data modeled on the German breast cancer data of Schumacher et al. (1994).

data("brcancer")head(brcancer)##   id hormon rectime censrec## 1  1      0    1814       1## 2  2      1    2018       1## 3  3      1     712       1## 4  4      1    1807       1## 5  5      0     772       1## 6  6      0     448       1

The example begins by fitting alternative models to the data using functions from the flexsurv package of Jackson, Metcalfe and Amdahl. Two candidate models are proposed, and a spline model giving the best fit is used to simulate data. The example concludes with more model fitting to examine the fake data. All of the examples in the vignette showcase the interplay between simsurv and flexsurv functions and emphasize the flexible modeling tools in flexsurv for building custom survival models.

The following code replicates the portion of Example 2 that illustrates the use of the flexsurvspline() function, which allows the calculation of the log cumulative hazard function to depend on knot locations.

The code below produces the simulated data and uses the survminer package of Kassambara et al. to produce high quality Kaplan-Meier plots.

This line of code fits a three-knot spline model to the brcancer data. The flexsurvspline() function, like the other functions in the flexsurv package, builds on the basic functionality of Terry Therneau’s foundational survival package.

library(flexsurv)

true_mod <- flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon,
                                     data = brcancer, k = 3)

This helper function returns the log cumulative hazard at time t:

logcumhaz <- function(t, x, betas, knots) {
  # Obtain the basis terms for the spline-based log
  # cumulative hazard (evaluated at time t)
  basis <- flexsurv::basis(knots, log(t))
  # Evaluate the log cumulative hazard under the
  # Royston and Parmar specification
  res <-
    betas[["gamma0"]] * basis[[1]] +
    betas[["gamma1"]] * basis[[2]] +
    betas[["gamma2"]] * basis[[3]] +
    betas[["gamma3"]] * basis[[4]] +
    betas[["gamma4"]] * basis[[5]] +
    betas[["hormon"]] * x[["hormon"]]
  res
}

The simsurv() function generates the simulated survival data.

covariates <- data.frame(id = 1:686, hormon = rbinom(686, 1, 0.5))
sim_data <- simsurv(betas = true_mod$coefficients, # "true" parameter values
                    x = covariates,            # covariate data for 686 individuals
                    knots = true_mod$knots,    # knot locations for splines
                    logcumhazard = logcumhaz,  # definition of log cum hazard
                    maxt = NULL,               # no right-censoring
                    interval = c(1E-8, 100000)) # interval for root finding
sim_data <- merge(covariates, sim_data)
head(sim_data)
##   id hormon eventtime status
## 1  1      1     240.4      1
## 2  2      0     942.6      1
## 3  3      1     463.5      1
## 4  4      0    1762.0      1
## 5  5      0    3976.0      1
## 6  6      0    2288.0      1

We use the surv_fit() function from the survminer package to fit the Kaplan-Meier curves:

KM_data <- survminer::surv_fit(Surv(rectime, censrec) ~ 1, data = brcancer)
KM_data_sim <- survminer::surv_fit(Surv(eventtime, status) ~ 1, data = sim_data)

Finally, plotting the curves shows that the simulated data does appear to plausibly resemble the original data.

library(survminer)

p <- ggsurvplot_combine(list(KM_data, KM_data_sim),
                        risk.table = TRUE,
                        conf.int = TRUE,
                        censor = FALSE,
                        conf.int.style = "step",
                        tables.theme = theme_cleantable(),
                        palette = "jco")
plot.new()
print(p, newpage = FALSE)

I hope you find this small post helpful. The CRAN task view on Survival Analysis is a fantastic resource, but without a thread to pull on it can be daunting for non-experts to know where to begin unraveling the secrets there.



Upcoming Why R? and X-Europe Webinars in November


Why R? Webinars are back after the successful launch of the Why R? 2020 conference! All videos from the past series and from the recently finished conference can be watched on our youtube.com/WhyRFoundation channel. See this post to find out about upcoming webinars in November 2020. The aim of the webinars is to support local R groups during the pandemic.

This post also introduces X-Europe Webinars, an organization for joint online events of Vienna Data Science Group, Frankfurt Data Science, Budapest Data Science Meetup, BCN Analytics, Budapest.AI, Barcelona Data Science and Machine Learning Meetup, Budapest Deep Learning Reading Seminar and Warsaw R Users Group.

Details

  • donate: whyr.pl/donate/
  • channel: youtube.com/WhyRFoundation
  • date: every Thursday 8:00 pm UTC+1
  • format: 45 minutes long talk streamed on YouTube + 10 minutes for Q&A
  • comments: ask questions on YouTube live chat

Future talks

2020-11-04 Tools for Explainable Artificial Intelligence

2020-11-05 Why R? Webinar – R on AWS

2020-11-12 Preserving wildlife with computer vision + Scaling Shiny Dashboards on a Budget

2020-11-19 Satellite imagery analysis in R

2020-11-25 Get Started with ML in Azure Quantum Computing with Q#

Previous talks

Kyla McConnell & Julia Müller – How to start your own #rstats group: Building an inclusive and fun R community. Video

Faris Naji – Liberate the coder and empower the non coder. Video

Michał Maj – platypus: image segmentation & object detection made easy with R. Video

Suhem Parack – Introduction to Twitter data analysis in R. Video

Colin Gillespie – Me, Myself and my Rprofile. Video

Sydeaka Watson – Data Science for Social Justice. Video

John Blischak – Reproducible research with workflowr: a framework for organizing, versioning, and sharing your data analysis projects. Video

JD Long – Taking friction out of R: helping drive data science adoption in organizations. Video

Leon Eyrich Jessen – In Silico Immunology: Neural Networks for Modelling Molecular Interactions using Tensorflow via Keras in R. Video

Erin Hodgess – Using R with High Performance Tools on a Windows Laptop. Video

Julia Silge – Understanding Word Embeddings. Video

Bernd Bischl, Florian Pfisterer and Martin Binder – Pipelines and AutoML with mlr3. Video

Mateusz Zawisza + Armin Reinert – Uplift modeling for marketing campaigns. Video

Erin LeDell – Scalable Automatic Machine Learning in R with H2O AutoML. Video

Ahmadou Dicko – Humanitarian Data Analysis with R. Video

Dr. Nina Zumel and Dr. John Mount from win-vector – Advanced Data Preparation for Supervised Machine Learning. Video

Lorenzo Braschi – ZYPAD: Development pipeline for R production. Video

Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R) – Recent changes in R spatial and how to be ready for them. Video

Heidi Seibold, Department of Statistics, University of Munich (collaboration with LMU Open Science Center) – Teaching Machine Learning online. Video

Olgun Aydin (PwC Poland) – Introduction to shinyMobile. Video

Achim Zeileis from Universität Innsbruck – R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video


10 Must-Know Tidyverse Functions: #2 – across()


This article is part of R-Tips Weekly, a weekly video tutorial that shows you, step by step, how to do common R coding tasks.

The across() function was just released in dplyr 1.0.0. It’s a new tidyverse function that extends group_by() and summarize() to summaries over multiple columns and with multiple functions.

Learn how to use across() to summarize data like a data wizard:

(Click image to play tutorial)

Why Across?

Across doesn’t do anything you can’t do with normal group_by() and summarize(). So why across()? Simply put, across() allows you to scale up your summarization to multiple columns and multiple functions.

Across makes it easy to apply a mean and standard deviation to one or more columns. We just select the columns and the functions that we want to apply, as in the sketch below.

Tidyverse across() function 1

Tidyverse across() function 2
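
For instance, here is a minimal sketch of the pattern (using the built-in mtcars data rather than the dataset from the video):

library(dplyr)

# Summarize several columns with several functions in one across() call.
mtcars %>%
  group_by(cyl) %>%
  summarise(across(
    .cols  = c(mpg, hp, wt),             # columns to summarize
    .fns   = list(mean = mean, sd = sd), # functions to apply to each column
    .names = "{.col}_{.fn}"              # naming pattern for output columns
  ))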

That was ridiculously easy. Keep it up & you’ll become a tidyverse rockstar.


You Learned Something New!

Great! But, you need to learn a lot to become an R programming wizard.

What happens after you learn R for Business from Matt 👇

(Image: the tidyverse wizard)

…And the look on your boss’ face after seeing your first Shiny App. 👇

(Image: amazed boss)

This is career acceleration.

SETUP R-TIPS WEEKLY PROJECT

  1. Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter

  2. Set Up the GitHub Repo: https://github.com/business-science/free_r_tips

  3. Check out the setup video (https://youtu.be/F7aYV0RPyD0), or hit Pull in the Git menu to get the R-Tips code.

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)


Financial Engineering: Static Replication of any Payoff Function


In the area of option strategy trading, it has always been a dream of mine to have a universal tool that is able to replicate any payoff function statically by combining plain vanilla products like calls, puts, and zerobonds.

Many years ago there was such a tool online, but it is long gone and its domain is inactive. So, based on the old project paper from that website, I decided to program it in R and make it available for free here!

The project paper the algorithm is based on, and which is translated to R here, can be found at Financial Engineering Tool: Replication Strategy and Algorithm. I will not explain the algorithm itself and how it works because this is done brilliantly in the paper. I won’t get into any details concerning derivatives and structured products either; you can find tons of material on the web just by googling. So, without further ado, let’s get started!

First, we need a way to define the payoff function: for each kink we provide two values, one for the underlying (which goes from 0 to infinity) and one for the payoff we want to replicate. For clarity, we will use the variable names from the paper. Let us start by defining a plain vanilla call:

payoff <- data.frame(pi = c(0, 100, 110, Inf), f_pi = c(0, 0, 10, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2 100    0
## 3 110   10
## 4 Inf  Inf

The last value of the payoff must either be equal to the penultimate value (the payoff stays flat at the given level) or be (minus) infinity for a linear continuation in the given direction. Next, we want to plot this payoff:

plot_payoff <- function(payoff, xtrpol = 1.5) {
  k <- nrow(payoff) - 1
  payoff_prt <- payoff
  payoff_prt$pi[k+1] <- payoff$pi[k] * xtrpol
  # linear extrapolation of last kink
  slope <- diff(c(payoff$f_pi[k-1], payoff$f_pi[k])) / diff(c(payoff$pi[k-1], payoff$pi[k]))
  payoff_prt$f_pi[k+1] <- ifelse(payoff$f_pi[k] == payoff$f_pi[k+1], payoff$f_pi[k+1],
                                 payoff$f_pi[k] + payoff$pi[k] * (xtrpol - 1) * slope)
  plot(payoff_prt,
       ylim = c(-max(abs(payoff_prt$f_pi) * xtrpol), max(abs(payoff_prt$f_pi) * xtrpol)),
       main = "Payoff diagram", xlab = "S(T)", ylab = "f(S(T))", type = "l")
  abline(h = 0, col = "blue")
  grid()
  lines(payoff_prt, type = "l")
  invisible(payoff_prt)
}
plot_payoff(payoff)

Now comes the actual replication. We need two functions for that: a helper function to calculate some parameters…

calculate_params <- function(payoff) {
  params <- payoff
  k <- nrow(params) - 1

  # add additional columns s_f_pi, lambda and s_lambda
  params$s_f_pi <- ifelse(params$f_pi < 0, -1, 1)
  # linear extrapolation of last kink
  slope <- diff(c(params$f_pi[k-1], params$f_pi[k])) / diff(c(params$pi[k-1], params$pi[k]))
  f_pi_k <- ifelse(params$f_pi[k] == params$f_pi[k+1], params$f_pi[k+1], slope)
  params$lambda <- c(diff(params$f_pi) / diff(params$pi), f_pi_k)
  params$s_lambda <- ifelse(params$lambda < 0, -1, 1)

  # consolidate
  params[k, ] <- c(params[k, 1:3], params[(k+1), 4:5])
  params <- params[1:k, ]
  params
}

…and the main function with the replication algorithm:

replicate_payoff <- function(payoff) {
  params <- calculate_params(payoff)
  suppressMessages(attach(params))
  k <- nrow(params)

  portfolios <- as.data.frame(matrix("", nrow = k, ncol = 6))
  colnames(portfolios) <- c("zerobonds", "nominal", "calls", "call_strike", "puts", "put_strike")

  # step 0 (initialization)
  i <- 1
  i_r <- 1
  i_l <- 1

  while (i <= k) {

    # step 1 (leveling)
    if (f_pi[i] != 0) {
      portfolios[i, "zerobonds"] <- s_f_pi[i]
      portfolios[i, "nominal"] <- abs(f_pi[i])
    }

    # step 2 (replication to the right)
    while (i_r <= k) {
      if (i_r == i) {
        if (lambda[i] != 0) {
          portfolios[i, "calls"] <- paste(portfolios[i, "calls"], lambda[i])
          portfolios[i, "call_strike"] <- paste(portfolios[i, "call_strike"], pi[i])
        }
        i_r <- i_r + 1
        next
      }
      if ((lambda[i_r] - lambda[i_r-1]) != 0) {
        portfolios[i, "calls"] <- paste(portfolios[i, "calls"], (lambda[i_r] - lambda[i_r-1]))
        portfolios[i, "call_strike"] <- paste(portfolios[i, "call_strike"], pi[i_r])
      }
      i_r <- i_r + 1
    }

    # step 3 (replication to the left)
    while (i_l != 1) {
      if (i_l == i) {
        if (-lambda[i_l-1] != 0) {
          portfolios[i, "puts"] <- paste(portfolios[i, "puts"], -lambda[i_l-1])
          portfolios[i, "put_strike"] <- paste(portfolios[i, "put_strike"], pi[i_l])
        }
      } else {
        if ((lambda[i_l] - lambda[i_l-1]) != 0) {
          portfolios[i, "puts"] <- paste(portfolios[i, "puts"], (lambda[i_l] - lambda[i_l-1]))
          portfolios[i, "put_strike"] <- paste(portfolios[i, "put_strike"], pi[i_l])
        }
      }
      i_l <- i_l - 1
    }

    # step 4
    i <- i + 1
    i_r <- i
    i_l <- i
  }

  # remove duplicate portfolios
  portfolios <- unique(portfolios)
  # renumber rows after removal
  row.names(portfolios) <- 1:nrow(portfolios)
  portfolios
}

Let us test our function for the plain vanilla call:

replicate_payoff(payoff)
##   zerobonds nominal calls call_strike  puts put_strike
## 1                       1         100                 
## 2         1      10     1         110  -1 1    110 100

There are always several possibilities for replication. In this case, the first is just our call with a strike of 100. Another possibility is buying a zerobond with a nominal of 10, going long a call with strike 110, and simultaneously going short a put with strike 110 and long another put with strike 100.

Let us try a more complicated payoff, a classic bear spread (which is also the example given in the paper):

payoff <- data.frame(pi = c(0, 90, 110, Inf), f_pi = c(20, 20, 0, 0))
payoff
##    pi f_pi
## 1   0   20
## 2  90   20
## 3 110    0
## 4 Inf    0
plot_payoff(payoff)

replicate_payoff(payoff)
##   zerobonds nominal calls call_strike  puts put_strike
## 1         1      20  -1 1      90 110                 
## 2                                      1 -1     110 90

Or for a so-called airbag note:

payoff <- data.frame(pi = c(0, 80, 100, 200, Inf), f_pi = c(0, 100, 100, 200, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2  80  100
## 3 100  100
## 4 200  200
## 5 Inf  Inf
plot_payoff(payoff, xtrpol = 1)

replicate_payoff(payoff)
##   zerobonds nominal         calls call_strike        puts  put_strike
## 1                    1.25 -1.25 1    0 80 100                        
## 2         1     100             1         100       -1.25          80
## 3         1     200             1         200  -1 1 -1.25  200 100 80

As a final example: how to replicate the underlying itself? Let’s see:

payoff <- data.frame(pi = c(0, 100, Inf), f_pi = c(0, 100, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2 100  100
## 3 Inf  Inf
plot_payoff(payoff, 1)

replicate_payoff(payoff)
##   zerobonds nominal calls call_strike puts put_strike
## 1                       1           0                
## 2         1     100     1         100   -1        100

The first solution correctly gives us what is called a zero-strike call, i.e., a call with a strike of zero!

I hope you find this helpful! If you have any questions or comments, please leave them below.

I am even thinking that it might be worthwhile to turn this into a package and put it on CRAN, yet I don’t have the time to do that at the moment… if you are interested in cooperating on that please leave a note in the comments too. Thank you!


Frank Harrell – Controversies in Predictive Modeling and Machine Learning


A month ago we finished the Why R? 2020 conference. We had the pleasure of hosting Frank Harrell, a professor and the founding chair of the Department of Biostatistics at Vanderbilt University School of Medicine. This post contains a biography of the speaker and an abstract of his talk: Controversies in Predictive Modeling and Machine Learning.

Frank is a professor and a founding chair of the Department of Biostatistics at Vanderbilt University School of Medicine. Aside from more than 300 scientific publications, Frank has authored Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd Edition 2015, Springer-Verlag), which still serves as a primer in modern statistical modeling for generations of statisticians. His specialties are development of accurate prognostic and diagnostic models, model validation, clinical trials, observational clinical research, cardiovascular research, technology evaluation, pharmaceutical safety, Bayesian methods, quantifying predictive accuracy, missing data imputation, and statistical graphics and reporting.

The fact that some people murder doesn’t mean we should copy them. And murdering data, though not as serious, should also be avoided. —Frank E. Harrell (answering a question on categorization of continuous variables in survival modelling) R-help (July 2005), fortunes::fortune(32)



How is the F-statistic computed in anova() when there are multiple models?


Background

In the linear regression context, it is common to use the F-test to test whether a proposed regression model fits the data well. Say we have p predictors, and we are comparing the model fit for

  1. Linear regression where $\beta_1, \dots, \beta_k$ are allowed to vary freely but $\beta_{k+1} = \dots = \beta_p = 0$ are fixed at zero, vs.
  2. Linear regression where $\beta_1, \dots, \beta_p$ are allowed to vary freely.

($k$ is some fixed parameter.) We call the first model the “restricted model” and the second the “full model”. We say that these models are nested since the second model is a superset of the first. In the hypothesis testing framework, comparing the model fits amounts to testing

$$\begin{aligned} H_0 &: \beta_{k+1} = \dots = \beta_p = 0, \text{ vs.} \\ H_a &: \text{At least one of } \beta_{k+1}, \dots, \beta_p \neq 0. \end{aligned}$$

If we let $RSS_{res}$ and $RSS_{full}$ denote the residual sum of squares under the restricted and full models respectively, and $df_{res}$ and $df_{full}$ denote the residual degrees of freedom under the restricted and full models respectively, then under the null hypothesis, the F-statistic

$$F = \dfrac{(RSS_{res} - RSS_{full}) / (df_{res} - df_{full})}{RSS_{full} / df_{full}}$$

has the $F_{df_{res} - df_{full},\, df_{full}}$ distribution. If $F$ is large, the null hypothesis is rejected and we conclude that the full model fits the data better than the restricted model. (See Reference 1 for more details.)
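
To make the formula concrete, here is a small helper (a sketch of the formula above, not the internals of anova()) that computes this F-statistic directly from two nested lm fits:

# Sketch: F-statistic for comparing a restricted fit to a full fit.
f_stat <- function(fit_res, fit_full) {
  rss_res  <- sum(residuals(fit_res)^2)
  rss_full <- sum(residuals(fit_full)^2)
  df_res   <- df.residual(fit_res)    # residual df, restricted model
  df_full  <- df.residual(fit_full)   # residual df, full model
  f <- ((rss_res - rss_full) / (df_res - df_full)) / (rss_full / df_full)
  c(F = f, p.value = pf(f, df_res - df_full, df_full, lower.tail = FALSE))
}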

The problem

In R, we can use the anova() function to do these comparisons. In the following code, we compare the fits of mpg ~ wt (full model) vs. mpg ~ 1 (restricted model, intercept only):

data(mtcars)
mod1 <- lm(mpg ~ 1, data = mtcars)
mod2 <- lm(mpg ~ wt, data = mtcars)
mod3 <- lm(mpg ~ wt + hp, data = mtcars)
anova(mod1, mod2)
# Analysis of Variance Table
#
# Model 1: mpg ~ 1
# Model 2: mpg ~ wt
#   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
# 1     31 1126.05                                  
# 2     30  278.32  1    847.73 91.375 1.294e-10 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the table, we see that the F-statistic is equal to 91.375.

The anova() function is pretty powerful: if we have a series of nested models, we can test them all at once with one function call. For example, the code below computes the F-statistics for mod2 vs. mod1 and mod3 vs. mod2:

anova(mod1, mod2, mod3)
# Analysis of Variance Table
#
# Model 1: mpg ~ 1
# Model 2: mpg ~ wt
# Model 3: mpg ~ wt + hp
#   Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
# 1     31 1126.05                                   
# 2     30  278.32  1    847.73 126.041 4.488e-12 ***
# 3     29  195.05  1     83.27  12.381  0.001451 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

But wait: the F-statistic for mod2 vs. mod1 has changed! It was previously 91.375, and now it is 126.041. What happened?

Resolution (Part 1)

(Credit: Many thanks to Naras who pointed me in the right direction.) The answer lies in a paragraph within the help file for anova.lm() (emphasis mine):

Optionally the table can include test statistics. Normally the F statistic is most appropriate, which compares the mean square for a row to the residual sum of squares for the largest model considered. If scale is specified chi-squared tests can be used. Mallows’ Cp statistic is the residual sum of squares plus twice the estimate of sigma^2 times the residual degrees of freedom.

In other words, the denominator of the F-statistic is based on the largest model in the anova() call. We can verify this with the computations below. In anova(mod1, mod2), the denominator depends on the RSS and Res.Df values for model 2; in anova(mod1, mod2, mod3), it depends on the RSS and Res.Df values for model 3.

((1126.05 - 278.32) / (31 - 30)) / (278.32 / 30)
# [1] 91.37647
((1126.05 - 278.32) / (31 - 30)) / (195.05 / 29)
# [1] 126.0403

Resolution (Part 2)

Why would anova() determine the denominator in this way? I think the reason lies in what the F-statistic is trying to compare (see Reference 2 for details). The F-statistic is comparing two different estimates of the variance, and the estimate in the denominator is akin to the typical variance estimate we get from the residuals of a regression model. In our example above, one F-statistic used the residuals from mod2, while the other used the residuals from mod3.

Which F-statistic should you use in practice? I think this might depend on your data analysis pipeline, but my gut says that the F-statistic from the anova() call with just 2 models is probably the one you want to use. It’s a lot easier to interpret and understand.

I haven’t seen any discussion on this in my internet searches, so I would love to hear views on what one should do in practice!

References:

  1. James, G., et al. (2013). An introduction to statistical learning (Section 3.2.2).
  2. lumen. The F distribution and the F-ratio.

asymmetric information


The Riddler of 16 October had the following puzzle:

Take a real number θ uniformly distributed over (0,100). Among three players, the winner is whoever guessed the closest price without going over θ. In the event all guesses exceeded θ, the contestant with the lowest (and therefore closest) guess is declared the winner. The second player knows the first player’s guess and the third player knows both other guesses. What is the optimal guess for the first player, assuming all players maximise their probability of winning?

Looking at the optimal solution z for the third player leads to six possible cases, depending on the relation between the other guesses, x and y. This translates into the R code

topz = function(x, y){
  if ((2*y >= x) & (y >= 1-x))   z = y - .001
  if (max(4*y, 1+y) <= 2*x)      z = y + .001
  if ((2*x <= 1+y) & (x <= 1-y)) z = x + .001
  z
}
third = function(x, y) ifelse(y

From there, the optimal choice y for the second player follows, and it happens on a boundary of one of the six regions; this in turn implies that the optimal choice for the first player is x = 2/3, leading to equal chances of winning (although there is some uncertainty on the boundaries). It is thus feasible to beat the asymmetric information. The picture above was my attempt at representing the probabilities of winning for all three players, with some of the six regions clearly visible; the first axis is x and the second is y [and z is one of x⁻, x⁺, y⁻, y⁺]. The R code is too pedestrian to be reproduced!
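
For readers who want to experiment, here is a minimal Monte Carlo sketch of the game itself (my own illustration on the (0,1) scale, not the code behind the picture):

# Estimate each player's chance of winning for fixed guesses x, y, z
# when theta is uniform on (0, 1).
win_probs = function(x, y, z, n = 1e5){
  g = c(x, y, z)
  wins = replicate(n, {
    theta = runif(1)
    under = which(g <= theta)
    if (length(under) > 0) under[which.max(g[under])] # closest without going over
    else which.min(g)                                 # all guesses over: lowest wins
  })
  table(factor(wins, levels = 1:3)) / n
}
win_probs(2/3, 2/3 + .001, 2/3 - .001)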



Random Seed on BNSL package


Over the past few months, I have been working on testing Rcpp packages using RcppDeepState on the cluster. When we fuzz test each Rcpp function-specific test harness, the number of inputs passed to the target binary varies with the clock speed, number of cores, and cache size of the system. When I test a package using RcppDeepState, the number of inputs generated on the cluster is considerably higher than on a single-core system. The default timer is set to 2 minutes, and within that window RcppDeepState generates as many inputs as the system speed allows. The inputs generated are stored in .crash/.fail/.pass files.

The RcppDeepState package provides a function that takes a random seed and a time limit (in seconds). By setting the seed and timer arguments we can control the number of input files generated, avoiding the creation of a large number of files.

RcppDeepState::deepstate_fuzz_fun_seed(fun_path, seed, time.limit.seconds)

Making a call to deepstate_fuzz_fun_seed() limits how long the fuzzer runs. When I run the following code on the cluster:

> deepstate_fuzz_fun_seed("~/RcppDeepStateTest/BNSL/inst/testfiles/mi", 1604461988, 5)
      err.kind                message           file.line
1: InvalidRead Invalid read of size 8 src/mi_cmi.cpp : 57
                                                      address.msg address.trace
1: Address 0x9f36468 is 0 bytes after a block of size 296 alloc'd          
>

When run with random seed 1604461988 for 5 seconds, the code produces the valgrind error above: an invalid read of size 8 in mi_cmi.cpp at line 57. This 5-second run generates only one crash file. I get the same output when I run the test harness on a single-core system.


When is a number not a number?


Have you ever asked yourself whether your telephone number is really a number?  It’s got numbers in it but does it measure anything?

How about your credit card number?  PO Box?  Social Security Number?  Zip code? What would happen if you subtracted one of these from another?

As it turns out, many of the “numbers” we deal with every day are actually identifiers and not a measure of something.  Sadly, too many data managers do not distinguish between the two even though making this distinction is quite simple.

Identification vs. Measurement

We have a check stub on our desk that contains the following information:

  • CHECK NO. — 12345
  • OUR REF. NO. — 290227
  • YOUR INVOICE NO. — 090202
  • INVOICE DATE — 01/30/09
  • INVOICE AMOUNT — 100.00
  • AMOUNT PAID — 100.00

The first three items are used to uniquely identify the check according to three different accounting systems.  We can think of this information as metadata associated with the actual measurements that the check keeps track of.  Although these items all have ‘NO.’ (number) in their name, they should really be called numeric identifiers, names consisting of numbers.

The last three items, the date and amounts, are actual measurements which have units of days and dollars, respectively.

This check stub is not unlike many data records in that it contains identifiers (the ‘NO.’ numbers), spatio-temporal locators (the date) and measurements (the amounts).  The world of scientific data would be a much friendlier place if data managers understood the distinction between these categories and made this distinction obvious in their datasets.

We have already dealt with Dates and Latitudes and Longitudes in earlier posts describing these spatio-temporal locators.  This post will focus on the difference between identifiers and measurements.

For most non-statistical datasets, the easiest way to tell the difference between an identifier and a measurement is to ask yourself whether there are any units involved.  If there are no units involved then we are usually talking about an identifier, a name, a handle on an individual piece of information.  Unique identifiers are important as they allow us to be sure we are talking about a particular piece of information.  You don’t want all your invoices to go out with the same invoice number.  In fact, you should make every effort to ensure that no two invoices share this number.

Measurements are different.  They make no claim to being unique.  You may send out an invoice for $100 as many times as you like.  Measurements have units.  Measurements can be represented as a distance along an axis.  Measurements, unlike identifiers, can be used in mathematical equations.

Strings vs. Floats and Ints

To human readers, it appears that numeric identifiers and measurements are expressed the same way — as a series of “numbers”. However, when writing software to process data, it is important to differentiate between numeric strings (arrays of ASCII characters from the set [0…9]) and floating point or integer values. In typed languages like Fortran, C or Java, this distinction is enforced. In untyped languages like python and R, any lack of a clear distinction between numeric identifiers and measurements can lead to some interesting results.

Here is some python code that adds two integers:

>>> a = 7
>>> b = 7
>>> c = a + b
>>> c
14

Here is similar looking code that “adds” (i.e. concatenates) two “numbers”:

>>> a = '7'
>>> b = '7'
>>> c = a + b
>>> c
'77'

Some would say that this argues for the use of typed languages when working with scientific data.  We do not share this judgement.  Agile programming languages like python and R offer so many advantages with respect to programmer efficiency and concise readability that it would be folly to abandon them.  Instead, we advocate a more careful approach to data management that can solve the problem for both typed and untyped languages.

Machine Parsable Data

We will introduce the term machine parsable to refer to data files that adhere to certain basic rules that allow software to go beyond simply reading in the data. When good software encounters machine parsable data it can actually make initial steps toward “understanding” what the data represent.

Identifiers and Measurements in a CSV File

Comma Separated Value (CSV) files are the de facto standard for scientific data interchange involving limited amounts of data. CSV is the format of choice for many data warehouses because it can be read in and spat out by any software that purports to work with data. Unfortunately, there is no standard for exactly how to use the CSV format. The only aspect everyone appears to agree upon is that fields should be delimited by the Unicode character ‘COMMA’ (U+002C).

Along with the use of commas as delimiters, we recommend only two additional rules to make your CSV files machine parsable.

1. Use standard names for spatio-temporal locators

Spatio-temporal locators have well recognized names and abbreviations. Fields named ‘datetime’, ‘lat’, ‘lon’ and ‘depth’ should be recognized as special columns by any data management software. If CSV files always used standard names like ‘lat’ (or ‘latitude’, ‘Lat’ or ‘Latitude’), software could be written that would always be aware of which columns contained spatio-temporal variables.

2. Surround identifiers in quotes

Many CSV files use quotes sporadically, often only when they are needed because a character value has an embedded comma. It would be more consistent if quotes surrounded every identifier or native character variable (e.g. ‘name’) in each record. The only elements not enclosed in quotes would be actual numeric measurements. Software parsing CSV data like this would not need to infer data types from the column names: if a value is surrounded by quotes, it is obviously of type character; all other values should be interpretable as numeric.

There are distinct advantages to creating CSV files in this manner. When a file like this is imported into a spreadsheet, for instance, character codes with leading zeros will always retain them. You wouldn’t want Boston zip codes to end up with only 4 digits, would you? (James Bond is “007”, not 7.)
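
A quick R illustration of what goes wrong when an identifier is treated as a number:

# Parsing a zip-code identifier as numeric silently destroys the leading zeros.
as.numeric("00713")   # [1] 713  -- the leading zeros are gone
sprintf("%05d", 713)  # "00713" -- recoverable only if you already know the width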

That’s all.  Just two simple rules to make scientific data more useful to more people.  It doesn’t sound difficult because it isn’t.

Examples

Below, we examine two federal government examples of aggregated datasets that do an excellent job of differentiating between identifiers and measures. In the first example, explicit variable names allow for the simple correction of automated parsing errors. In the second example, automatic parsing works without intervention.

Example 1) Water Quality data

We applaud the USGS and US EPA for working together to provide a publicly available, single source of water quality data through the Water Quality Portal. It is a huge effort to bring together and harmonize datasets from different agencies. Let’s see how easy they have made it to ingest this data programmatically.

From their data page, I can easily look for data within 5 miles of my location and filter for “Site Type: Stream”. Clicking on the Download button delivers a station.csv file. We will use R’s readr package to automatically parse this data:

   library(readr)
   library(magrittr)  # for %>%
   df <- read_csv("~/Downloads/station.csv")
   ...
   lapply(df, class) %>% str()
   List of 36
    $ OrganizationIdentifier                         : chr "character"
    $ OrganizationFormalName                         : chr "character"
    $ MonitoringLocationIdentifier                   : chr "character"
    $ MonitoringLocationName                         : chr "character"
    $ MonitoringLocationTypeName                     : chr "character"
    $ MonitoringLocationDescriptionText              : chr "logical"
    $ HUCEightDigitCode                              : chr "numeric"
    $ DrainageAreaMeasure/MeasureValue               : chr "numeric"
    $ DrainageAreaMeasure/MeasureUnitCode            : chr "character"
    $ ContributingDrainageAreaMeasure/MeasureValue   : chr "logical"
    $ ContributingDrainageAreaMeasure/MeasureUnitCode: chr "logical"
    $ LatitudeMeasure                                : chr "numeric"
    $ LongitudeMeasure                               : chr "numeric"
    $ SourceMapScaleNumeric                          : chr "logical"
    $ HorizontalAccuracyMeasure/MeasureValue         : chr "numeric"
    $ HorizontalAccuracyMeasure/MeasureUnitCode      : chr "character"
    $ HorizontalCollectionMethodName                 : chr "character"
    $ HorizontalCoordinateReferenceSystemDatumName   : chr "character"
    $ VerticalMeasure/MeasureValue                   : chr "numeric"
    $ VerticalMeasure/MeasureUnitCode                : chr "character"
    $ VerticalAccuracyMeasure/MeasureValue           : chr "numeric"
    $ VerticalAccuracyMeasure/MeasureUnitCode        : chr "character"
    $ VerticalCollectionMethodName                   : chr "character"
    $ VerticalCoordinateReferenceSystemDatumName     : chr "character"
    $ CountryCode                                    : chr "character"
    $ StateCode                                      : chr "numeric"
    $ CountyCode                                     : chr "character"
    $ AquiferName                                    : chr "logical"
    $ FormationTypeText                              : chr "logical"
    $ AquiferTypeName                                : chr "logical"
    $ ConstructionDateText                           : chr "logical"
    $ WellDepthMeasure/MeasureValue                  : chr "logical"
    $ WellDepthMeasure/MeasureUnitCode               : chr "logical"
    $ WellHoleDepthMeasure/MeasureValue              : chr "logical"
    $ WellHoleDepthMeasure/MeasureUnitCode           : chr "logical"
    $ ProviderName                                   : chr "character" 

What fantastically named fields! They even self-describe as Identifier, Code, Name, Text, or Value.

Some of the fields did not automatically parse into the correct type but, given the excellent naming, we can use the col_types argument to read_csv() to enforce proper parsing. Every ~Value should be parsed as numeric and everything else should be parsed as character.
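
Here is a sketch of that idea (hypothetical code following the naming rule above; the spatio-temporal columns such as LatitudeMeasure would be handled separately under rule 1):

   # Build a compact col_types string from the header: any ".../MeasureValue"
   # column is read as (d)ouble, everything else as (c)haracter.
   col_names <- names(read_csv("~/Downloads/station.csv", n_max = 0))
   col_spec <- paste(ifelse(grepl("MeasureValue$", col_names), "d", "c"),
                     collapse = "")
   df <- read_csv("~/Downloads/station.csv", col_types = col_spec)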

Example 2) Earthquake data

Here, we will look at a USGS dataset of M1.0+ Earthquakes for the past hour.

    df <- readr::read_csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/1.0_hour.csv")
   lapply(df, class) %>% str()
   List of 22
    $ time           : chr [1:2] "POSIXct" "POSIXt"
    $ latitude       : chr "numeric"
    $ longitude      : chr "numeric"
    $ depth          : chr "numeric"
    $ mag            : chr "numeric"
    $ magType        : chr "character"
    $ nst            : chr "numeric"
    $ gap            : chr "numeric"
    $ dmin           : chr "numeric"
    $ rms            : chr "numeric"
    $ net            : chr "character"
    $ id             : chr "character"
    $ updated        : chr [1:2] "POSIXct" "POSIXt"
    $ place          : chr "character"
    $ type           : chr "character"
    $ horizontalError: chr "numeric"
    $ depthError     : chr "numeric"
    $ magError       : chr "numeric"
    $ magNst         : chr "numeric"
    $ status         : chr "character"
    $ locationSource : chr "character"
    $ magSource      : chr "character" 

Things look pretty good. By and large, the columns have human understandable names and, for those that are not obvious, USGS provides a page with the field descriptions.

We have well named spatio-temporal locators: time, latitude, longitude, depth.

Data columns containing non-numeric characters are automatically parsed as characters and describe identifiers of one sort or another: magType, net, place, type, status, locationSource, magSource

Best hopes for creating machine parsable data!


RStudio 1.4 Preview: Rainbow Parentheses


This post is part of a series on new features in RStudio 1.4, currently available as a preview release.

Beautiful code themes and rainbow parentheses, a tale as old as…well at least 2017. Being able to color your parentheses (and brackets and braces) based on the level of nesting has been a highly requested feature for years and we’re happy to announce that it’s available in the upcoming 1.4 release of RStudio.

Enabling Rainbow Parentheses

Rainbow parentheses are turned off by default. To enable them:

  1. Open Global Options from the Tools menu

  2. Select Code -> Display

  3. Enable the Rainbow Parentheses option at the bottom

Optional Use

If you would prefer to only use the Rainbow Parentheses option on a per-file basis (just for specific debugging, for example) you can toggle this option by using the Command Palette.

  1. Open the Command Palette by either using the keyboard shortcut (Default: Control/Command + Shift + P) or through the Tools -> Command Palette menu option.

  2. Type rainbow to quickly highlight the Toggle Rainbow Parentheses Mode option and select it to toggle the option.

This toggle applies to the file itself, so the rest of your environment will continue to respect the global setting.

Configuring

If you don’t like the default colors, or they don’t quite work for your theme, you can customize them to whatever you like. See this article on writing your own RStudio theme. The relevant classes to change are .ace_paren_color_0 to .ace_paren_color_6.

Try it out!

You can try out the new features from this blog series by installing the RStudio 1.4 Preview Release. If you do, we very much welcome your feedback on our community forum!


Warpspeed confidence what is credible?


Well, it’s election day here in America and the last thing in the world I want to talk about is politics. Hope you found a way to vote. Meanwhile, while we wait let me (hopefully) amuse and perhaps educate you with some information about another topic that we all care deeply about. A vaccine for COVID-19.

The other day I was in my car listening to the radio when I heard Olivier Knox interviewing Dr. Moncef Slaoui about progress towards developing a vaccine for COVID-19. Since I was driving it was difficult to pay attention to exact details but I remember being very surprised at some numbers I thought I heard quoted. Had to do with how few cases they were going to use for making their first decision. I’d heard all about the large number of participants in the trials…

Moderna, the biotechnology firm partnering with the National Institutes of Health to develop a coronavirus vaccine, announced Thursday that it has fully enrolled its trial, with 30,000 participants — more than a third of whom are minorities. Washington Post

And I had been sort of paying attention to the methodology knowing that participants were injected and then simply went about living their normal lives…

Much of the debate over the timeline for a vaccine has been fueled by the unpredictability of the clock in each vaccine trial, which ticks forward according to the accumulation of covid-19 cases among study participants… Moderna is likely to have 53 covid-19 cases among participants by November — enough for a first look at the data — with sufficient safety data available just before Thanksgiving… Before they authorize a vaccine, regulators will require that the vaccine be at least 50 percent effective, that there be at least five severe cases of covid-19 among people who receive a placebo and that there be at least two months of follow-up on half the study participants. Washington Post

53 out of 30,000 just seems like an awfully small number, so it started me wondering about confidence and credibility in a statistical sense and was the germ seed for this post. I’m no epidemiologist or bio-statistician but I like to think I have a decent grasp on the basics so off we go, with no claim that these are the exact methods they’ll follow but at least an informed look at the basics.

The Very Basics

If you have some background in statistics the next section may be a little boring for you, but bear with me: I promise an opportunity to delve more deeply later and revisit more complex issues like the differences between frequentist and bayesian methods. I’ve explored this topic before but there’s plenty of new stuff below. Let’s just get the basics under our belts.

If you’ve ever taken an introductory course in probability and statistics you’ve likely been exposed to the statistician’s love of coin flips. This is another instance where, at its simplest, comparing our vaccine trials to coin flips can be instructive. Although overall the study is a randomized controlled trial (actually several of them) on a very large scale, in many ways our understanding of these first 53 cases can be modeled as a set of coin flips. Each of the 53 people who are confirmed to have contracted covid-19 either got the vaccine or a placebo (heads or tails).

R lets us quickly get a sense of the probability that, of the 53 people in the trial who contracted covid-19, only the minimum of 5 had been given the actual vaccine and not the placebo. For those of you who don’t like scientific notation, that’s odds of roughly one in 3.1 billion against that happening with randomly flipped coins.

got_covid <- 53
vaccinated_got_covid <- 5
placebo_got_covid <- 53 - vaccinated_got_covid
dbinom(vaccinated_got_covid, got_covid, .5)
## [1] 3.18599e-10

vaccinated_got_covid <- 19
placebo_got_covid <- 53 - vaccinated_got_covid
dbinom(vaccinated_got_covid, got_covid, .5)
## [1] 0.01321525

Even if 19 of the people who received the true vaccine got covid, the probability is still less than 2% that vaccine versus placebo doesn’t matter, i.e. that it’s really a 50/50 chance.
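Strictly speaking, the probability of exactly 19 vaccinated cases isn’t the quantity a formal test would use; the tail probability, the chance of 19 or fewer vaccinated cases under pure 50/50 coin flipping, is. A quick sketch of that calculation (my aside, using the same numbers as above):

pbinom(19, size = 53, prob = 0.5)  # P(19 or fewer of the 53 cases were vaccinated) under 50/50 odds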

Let’s talk about effectiveness. We want our vaccine to be at least 50% effective. We can operationalize that most simply by \[\frac{placebo~cases - vaccinated~cases}{placebo~cases}\] So for our current example, where 19 of the 53 cases had received the vaccine, our effectiveness is \[\frac{placebo~cases - vaccinated~cases}{placebo~cases} = (34-19)/34 = 0.4411765\] which is just short of what we need. Let’s display it graphically for the range between 5 and 26 vaccinated cases and dispense with one other pesky issue. So far we have been ignoring the fact that there aren’t 53 people involved in the trial, there are 30,000. We don’t really want to lose sight of this even though it makes very little difference to our math in most cases. Let’s calculate effectiveness both with and without the 15,000 participants per arm in the denominator, and even track the rounding error, in our little tibble.

library(dplyr)
library(ggplot2)
library(kableExtra)

theme_set(theme_bw())

effectiveness <- 
  tibble::tibble(vaccinated = 5:26) %>% 
  mutate(placebo = 53 - vaccinated,
         effectiveness = (placebo - vaccinated) / placebo * 100,
         placeborate = placebo / (15000 - placebo),
         vaccinatedrate = vaccinated / (15000 - vaccinated),
         pctratedifference = (placeborate - vaccinatedrate) / placeborate * 100,
         rounding = effectiveness - pctratedifference)

effectiveness %>%
  kbl(digits = c(0,0,2,4,4,2,2),
      caption = "Effectiveness as a function of positive COVID cases") %>%
  kable_minimal(full_width = FALSE,
                position = "left") %>%
  add_header_above(c("Cases (53)" = 2, " " = 1, "Infection Rate" = 2, "% difference in rate" = 1, " " = 1))
Table 1: Effectiveness as a function of positive COVID cases
(“Cases (53)” spans the vaccinated and placebo counts; “Infection Rate” spans placeborate and vaccinatedrate; pctratedifference is the % difference in rate.)

vaccinated  placebo  effectiveness  placeborate  vaccinatedrate  pctratedifference  rounding
         5       48          89.58       0.0032          0.0003              89.61     -0.03
         6       47          87.23       0.0031          0.0004              87.27     -0.03
         7       46          84.78       0.0031          0.0005              84.82     -0.04
         8       45          82.22       0.0030          0.0005              82.27     -0.04
         9       44          79.55       0.0029          0.0006              79.59     -0.05
        10       43          76.74       0.0029          0.0007              76.80     -0.05
        11       42          73.81       0.0028          0.0007              73.86     -0.05
        12       41          70.73       0.0027          0.0008              70.79     -0.06
        13       40          67.50       0.0027          0.0009              67.56     -0.06
        14       39          64.10       0.0026          0.0009              64.16     -0.06
        15       38          60.53       0.0025          0.0010              60.59     -0.06
        16       37          56.76       0.0025          0.0011              56.82     -0.06
        17       36          52.78       0.0024          0.0011              52.84     -0.06
        18       35          48.57       0.0023          0.0012              48.63     -0.06
        19       34          44.12       0.0023          0.0013              44.17     -0.06
        20       33          39.39       0.0022          0.0013              39.45     -0.05
        21       32          34.38       0.0021          0.0014              34.42     -0.05
        22       31          29.03       0.0021          0.0015              29.07     -0.04
        23       30          23.33       0.0020          0.0015              23.37     -0.04
        24       29          17.24       0.0019          0.0016              17.27     -0.03
        25       28          10.71       0.0019          0.0017              10.73     -0.02
        26       27           3.70       0.0018          0.0017               3.71     -0.01
effectiveness %>% 
  ggplot(aes(x = vaccinated)) + 
  geom_line(aes(y = effectiveness)) + 
  geom_line(aes(y = pctratedifference), color = "green") + 
  geom_hline(aes(yintercept = 50)) + 
  scale_y_continuous(labels = scales::label_percent(scale = 1),
                     breaks = seq.int(0, 100, 10)) + 
  scale_x_continuous(breaks = seq.int(2, 26, 2)) + 
  xlab("Number of positive COVID-19 outcomes among those who received the vaccine") + 
  ylab("Effectiveness as a percentage") + 
  ggtitle("Vaccine effectiveness as a function of outcomes (53 total cases)")

mean(effectiveness$rounding) # that's .04% not 4.0%
## [1] -0.04559179

So in the best possible scenario (remember we require that at least 5 people who got the vaccine contract COVID before we can run the numbers) our effectiveness is ~90%. For at least 50% effectiveness, we need the number of people who received the vaccine and still contracted COVID to be 17 or fewer.

Testing our confidence as a frequentist

Now that we know what we have to do to convince ourselves about effectiveness, let’s address what we can do to be confident that our results are not a fluke. We know that low probability events occur. I don’t want to repeat myself, so if you want a little background on frequentist methodology please see this earlier post. If you want one of many cautionary tales about the limits of NHST then please see this post.

Having said all that, frequentist methods are certainly the most prevalent and the most likely to be applied to the warpspeed data, so let’s see what we come up with. Let’s start with the simplest possible test we could use. Let’s build a simple matrix (table) of our results and call it dat. We’ll assume the 30,000 participants are equally divided between receiving the vaccine and the placebo. We’ll pretend 19 folks who got the real vaccine later developed COVID, versus 34 who got the placebo. We’ll grab just the first column dat[,1] and put that in an object called covid_gof. Our hypothesis is that the vaccine matters, that it helps prevent infection. In NHST terms our null hypothesis is that there is no difference in the number of people who will get infected. In essence it might as well be a coin toss: 50/50, equal numbers of people will get infected whether they received the vaccine or the placebo. We can use the \(\chi^2\) goodness of fit test. For clarity we’ll remove the continuity correction and overtly specify 50/50 odds. Since the default chisq.test results are very terse, we’ll use lsr::goodnessOfFitTest on the same data (expressed as a vector that is a factor) to make it clear what we’re doing.

dat <- matrix(c(34, 15000 - 34, 19, 15000 - 19),
              nrow = 2,
              byrow = TRUE)
rownames(dat) <- c("Placebo", "Vaccine")
colnames(dat) <- c("COVID", "No COVID")
dat
##         COVID No COVID
## Placebo    34    14966
## Vaccine    19    14981

covid_gof <- dat[,1]
chisq.test(covid_gof,
           correct = FALSE,
           p = c(.5, .5))
## 
##  Chi-squared test for given probabilities
## 
## data:  covid_gof
## X-squared = 4.2453, df = 1, p-value = 0.03936

outcomes <- factor(c(rep("Vaccine", 19), rep("Placebo", 34)))
lsr::goodnessOfFitTest(outcomes)
## 
##      Chi-square test against specified probabilities
## 
## Data variable:   outcomes 
## 
## Hypotheses: 
##    null:        true probabilities are as specified
##    alternative: true probabilities differ from those specified
## 
## Descriptives: 
##         observed freq. expected freq. specified prob.
## Placebo             34           26.5             0.5
## Vaccine             19           26.5             0.5
## 
## Test results: 
##    X-squared statistic:  4.245 
##    degrees of freedom:  1 
##    p-value:  0.039

Either way \(\chi^2\) = 4.245283 and p = 0.0393595. Therefore we reject the null and can have some “confidence” that our results aren’t simply a fluke.

That seems a little too simple, and it ignores the fact that we actually have data for not 53 people but 30,000 people. So let’s make things more complex. There are actually many variants of a \(\chi^2\) test. Instead of goodness of fit, let’s use the \(\chi^2\) test of independence or association, using all four cells of our little matrix. Here our null hypothesis is that our variables are independent of one another: that whether you get COVID has no association with whether you received the vaccine or the placebo. A subtle distinction perhaps, but worth it if for no other reason than to exploit the additional data. To make it even more fun, there are numerous functions in R to run the test, from base R, where there are at least two variants, chisq.test and prop.test, to specialized packages like epiR that focus on epidemiology.

Notice that they all express the same basic notions, \(\chi^2\) = 4.2527963 and p = 0.0391858, even though they present an array of additional information.

chisq.test(dat, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  dat
## X-squared = 4.2528, df = 1, p-value = 0.03919

prop.test(dat, correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  dat
## X-squared = 4.2528, df = 1, p-value = 0.03919
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.0000496578 0.0019503422
## sample estimates:
##      prop 1      prop 2 
## 0.002266667 0.001266667

epiR::epi.2by2(dat = as.table(dat), method = "cross.sectional",
               conf.level = 0.95, units = 1, outcome = "as.columns")
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           34        14966      15000             0.00227     0.00227
## Exposed -           19        14981      15000             0.00127     0.00127
## Total               53        29947      30000             0.00177     0.00177
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Prevalence ratio                             1.79 (1.02, 3.14)
## Odds ratio                                   1.79 (1.02, 3.14)
## Attrib prevalence *                          0.00 (0.00, 0.00)
## Attrib prevalence in population *            0.00 (-0.00, 0.00)
## Attrib fraction in exposed (%)              44.12 (2.08, 68.11)
## Attrib fraction in population (%)           28.30 (-2.75, 49.97)
## -------------------------------------------------------------------
##  Test that OR = 1: chi2(1) = 4.253 Pr>chi2 = 0.04
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per population unit

Notice that epiR::epi.2by2 provides confirmation of several important pieces of data we generated earlier in our very first table of results. If you consult that table’s row for 19 & 34, the prevalences of 0.00227 & 0.00127 match the columns labeled “Infection Rate”, and “Attrib fraction in exposed (%) = 44.12” matches our “effectiveness” column.
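Those headline numbers are easy to verify by hand; a minimal sketch using the same 19/34 split:

# Reproducing epiR's prevalence ratio and attributable fraction by hand
p_placebo <- 34 / 15000    # infection rate among placebo recipients
p_vaccine <- 19 / 15000    # infection rate among vaccine recipients
p_placebo / p_vaccine      # prevalence ratio, ~1.79
1 - p_vaccine / p_placebo  # attributable fraction in exposed, ~0.4412 = 44.12%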

Credible? Incredible? A bayesian approach to how confident we are

The “problem” with a frequentist’s approach is that the “testing” framework is rather contorted and you really can’t make the statements that you want to make. We want to use the language of probability theory to say things like “there is a 90% chance that the vaccine is effective, which of course means there’s a 10% chance that it isn’t”. As I have written before, and many others have written elsewhere, a “p value” and a “confidence interval” don’t give you that capability. Don’t get me wrong: the approach can be useful, and it is still the most frequent and dominant approach, but I don’t find it the best approach.

So let’s approach our warpspeed data using several different tools and allow ourselves the joy (okay, I know that sounds geeky) of making probabilistic statements about what the data tell us. The rest of this post will be all about using bayesian tools.

Since I’m always a fan of making use of existing packages and code that do what I want them to do, I did some exploration to see what was available. One of the first things I stumbled upon was a very nice function, bayes.prop.test, contained in a package called bayesian_first_aid; sadly, but not surprisingly, there was no CRAN version and development appears to have stopped circa 2015. Not surprising because there has been a lot more code published for bayesian methods since then, and I know how much work it takes to keep up a package. There are others I’ll demonstrate a little later, but the appeal of this function is its simplicity. It’s a great place to start; the code was easy to follow, so I just grabbed the framework and updated it for my own needs. The Github code is here.

It’s fiendishly easy to set up and run. We’ll go back to our original dat object and divide it up by column: column 1, those who got COVID by condition “Placebo” and “Vaccine”, and column 2, those who didn’t. To keep this from being tl;dr I’m not going to go into too much detail on jags and runjags. If you need a good tutorial consult a reference like DBDA. I’ll only highlight the key points. For purposes of this post I’m going to use a flat, uninformed prior in all cases (this time the specific line is theta[i] ~ dbeta(1, 1)). One of the nice aspects of bayesian inference is that you can express your prior thinking/knowledge mathematically, then let the data inform your thinking. Given the vaccine has already been through significant phase I and II trials, it wouldn’t be unusual to have a prior that expressed at least a little confidence that it had some effect. But we’ll pretend we know nothing and that any outcome is equally likely. We’re back to flipping coins again.

We’re going to explore the probability that the percentage of positive cases among those who received the vaccine is the same as the percentage of positive cases among those who received the placebo. The model is constructed so that we could conceivably have more than two possibilities (e.g. it is easy to extend to a case where we had two different vaccines plus a placebo, or two different doses of the same vaccine plus a placebo). That is very likely to be modeled later, when more data are in. We’ll “monitor” results for the infection rates for placebo and vaccine (theta[1] and theta[2] respectively), as well as the raw predictions of how many people (x_pred[1] and x_pred[2]); not strictly necessary, but fun to watch.

This is a simple model with just four data elements we need to enter, but we’ll still run it in 4 chains spread across 4 cores. If you’re following along, I’m suppressing some messages about not choosing different random seeds per chain (run.jags will pick some), and about using the same initial value for all chains (also not a worry). You should also run plot(my_results) to check the diagnostics for convergence and auto-correlation. I did; they’re fine, but to save screen real estate I won’t include them.

# Setting up the data
dat
##         COVID No COVID
## Placebo    34    14966
## Vaccine    19    14981
got_covid <- dat[,1]
not_covid <- dat[,2]
got_covid
## Placebo Vaccine 
##      34      19
not_covid
## Placebo Vaccine 
##   14966   14981

# The model string written in the JAGS language
model_string <- "model {
  for(i in 1:length(got_covid)) {
    got_covid[i] ~ dbinom(theta[i], not_covid[i])
    theta[i] ~ dbeta(1, 1)
    x_pred[i] ~ dbinom(theta[i], not_covid[i])
  }
}"

my_results <- runjags::run.jags(model_string,
                                sample = 10000,
                                n.chains = 4,
                                method = "parallel",
                                monitor = c("theta", "x_pred"),
                                data = list(got_covid = got_covid,
                                            not_covid = not_covid))
## Calling 4 simulations using the parallel method...
## Following the progress of chain 1 (the program will wait for all chains
## to finish before continuing):
## Welcome to JAGS 4.3.0 on Wed Nov  4 15:34:07 2020
## JAGS is free software and comes with ABSOLUTELY NO WARRANTY
## Loading module: basemod: ok
## Loading module: bugs: ok
## . . Reading data file data.txt
## . Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2
##    Unobserved stochastic nodes: 4
##    Total graph size: 9
## . Reading parameter file inits1.txt
## . Initializing model
## . Adapting 1000
## -------------------------------------------------| 1000
## ++++++++++++++++++++++++++++++++++++++++++++++++++ 100%
## Adaptation successful
## . Updating 4000
## -------------------------------------------------| 4000
## ************************************************** 100%
## . . . Updating 10000
## -------------------------------------------------| 10000
## ************************************************** 100%
## . . . . Updating 0
## . Deleting model
## . 
## All chains have finished
## Simulation complete.  Reading coda files...
## Coda files loaded successfully
## Calculating summary statistics...
## Calculating the Gelman-Rubin statistic for 4 variables....
## Finished running the simulation

my_results
## 
## JAGS model summary statistics from 40000 samples (chains = 4; adapt+burnin = 5000):
## 
##              Lower95    Median   Upper95      Mean         SD Mode      MCerr
## theta[1]   0.0015847 0.0023125 0.0031218 0.0023357 0.00039503   -- 2.5904e-06
## theta[2]  0.00077049 0.0013122  0.001919 0.0013345 0.00029786   -- 1.9645e-06
## x_pred[1]         18        34        50    34.949     8.3756   33   0.048986
## x_pred[2]          8        20        32    20.036     6.3042   19   0.037185
## 
##           MC%ofSD SSeff      AC.10   psrf
## theta[1]      0.7 23255 -0.0023991 1.0001
## theta[2]      0.7 22988  0.0022903 1.0001
## x_pred[1]     0.6 29234 -0.0046137 1.0001
## x_pred[2]     0.6 28743  0.0047744 1.0001
## 
## Total time taken: 1.3 seconds

# plot(my_results) ## diagnostics were checked.

Okay, that all looks quite complex; what do we do now? Let’s first investigate the things we care most about. Remember theta[1] & theta[2] represent our infection rates: theta[1] = 0.0023357 is our rate among those who got the placebo and theta[2] = 0.0013345 is our rate among those who got the vaccine. Those are the mean values; the medians are similar. The x_pred values are estimates of the case counts.

We can use tidybayes::tidy_draws to extract the results of our 40,000 draws (10,000 per chain across 4 chains) and pipe them through select to get the columns we want with the names we’d like. As much as I like Greek, \(\theta\) gets old after a while. At the same time we can compute, via a mutate statement, what we really want to know, which is the % difference in infection rates, which we’ll put in a column called diff_rate.

Now when we pass this cleaned up data to bayestestR::describe_posterior(results1) we get back a table that is a little easier to read. Focus on just the line for diff_rate.

results1 <- 
  tidybayes::tidy_draws(my_results) %>% 
  select(placebo_rate = `theta[1]`,
         vaccine_rate = `theta[2]`,
         placebo_cases = `x_pred[1]`,
         vaccine_cases = `x_pred[2]`) %>% 
  mutate(diff_rate = (placebo_rate - vaccine_rate) / placebo_rate * 100)

glimpse(results1)
## Rows: 40,000
## Columns: 5
## $ placebo_rate   0.00319585, 0.00300531, 0.00293815, 0.00228591, 0.00238…
## $ vaccine_rate   0.00108651, 0.00131015, 0.00134464, 0.00137949, 0.00112…
## $ placebo_cases  52, 47, 41, 36, 28, 21, 37, 34, 31, 34, 48, 37, 40, 33,…
## $ vaccine_cases  9, 20, 23, 22, 11, 28, 21, 13, 20, 22, 27, 24, 15, 30, …
## $ diff_rate      66.002472, 56.405496, 54.235148, 39.652480, 52.911123, …

bayestestR::describe_posterior(results1, ci = 0.95)
## # Description of Posterior Distributions
## 
## Parameter     | Median |           95% CI |      pd |        89% ROPE | % in ROPE
## ---------------------------------------------------------------------------------
## placebo_rate  |  0.002 | [ 0.002,  0.003] | 100.00% | [-0.100, 0.100] |       100
## vaccine_rate  |  0.001 | [ 0.001,  0.002] | 100.00% | [-0.100, 0.100] |       100
## placebo_cases | 34.000 | [20.000, 52.000] | 100.00% | [-0.100, 0.100] |         0
## vaccine_cases | 20.000 | [10.000, 34.000] | 100.00% | [-0.100, 0.100] |         0
## diff_rate     | 43.316 | [ 7.715, 71.077] |  97.97% | [-0.100, 0.100] |         0

Given our data and 40,000 samples, and assuming we had zero prior knowledge or estimate of effectiveness, we have a median estimate of the % difference in infection rates = 43.316. That’s about what we would expect given our earlier investigation. The columns we really want to use are the “95% CI” and “pd” columns for diff_rate. CI in a bayesian framework is a credible interval, not a confidence interval. Since 95% of our draws wind up in that interval, we can say that, given our data, there’s a 95% probability that diff_rate lies in its range. No, that is not the same thing as a confidence interval. The Probability of Direction (pd) is an index of effect existence, ranging from 50% to 100%, representing the certainty with which an effect goes in a particular direction (i.e., is positive or negative). We can be very confident that the vaccine does have a positive effect.
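As a sanity check, the Beta(1, 1) prior is conjugate to the binomial likelihood in the model above, so each theta has a closed-form posterior, Beta(1 + cases, 1 + trials - cases), and we can sample those directly without MCMC; a minimal sketch:

# Closed-form posteriors implied by the model (flat prior + binomial likelihood)
set.seed(2020)
placebo_draws <- rbeta(40000, 1 + 34, 1 + 14966 - 34)
vaccine_draws <- rbeta(40000, 1 + 19, 1 + 14981 - 19)
diff_draws <- (placebo_draws - vaccine_draws) / placebo_draws * 100
quantile(diff_draws, c(0.025, 0.5, 0.975))  # should land very close to the MCMC summary above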

Plotting the distribution or range of diff_rate may also help the reader “see” the results. With the data we have (and remember I have been using a 19/34 split), you can see that while there’s a chance that the vaccine has no effect, the evidence (the data) supports the notion that it does.

plot(bayestestR::hdi(results1$diff_rate,
                     ci = c(.89, .95, .99)))

Done

I’ve decided to make this a two-parter. Please hang in there; more bayesian “magic” follows soon, probably tomorrow.

Hope you enjoyed the post. Comments always welcomed. Especially please let me know if you actually use the tools and find them useful.

Extra credit for me for not expressing a political view at any point. Let the data speak.

Chuck

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


Creating missing values in factors


[This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Background

I was looking at some breast cancer data recently, and was analyzing the ER (estrogen receptor) status variable. It turned out that there were three possible outcomes in the data: Positive, Negative and Indeterminate. I had imported this data as a factor, and wanted to convert the Indeterminate level to a missing value, i.e. NA.

My usual method for numeric variables created a rather singular result:

x <- as.factor(c('Positive','Negative','Indeterminate'))
x1 <- ifelse(x=='Indeterminate', NA, x)
str(x1)
##  int [1:3] 3 2 NA

This process converted it to an integer!! Not the end of the world, but not ideal by any means.

Further investigation revealed two other tidyverse strategies.

dplyr::na_if

This method changes the values to NA, but keeps the original level in the factor’s levels

x2 <- dplyr::na_if(x, 'Indeterminate')
str(x2)
##  Factor w/ 3 levels "Indeterminate",..: 3 2 NA
x2
## [1] Positive Negative <NA>    
## Levels: Indeterminate Negative Positive

dplyr::recode

This method drops the level that I’m deeming to be missing from the factor

x3 <- dplyr::recode(x, Indeterminate = NA_character_)
str(x3)
##  Factor w/ 2 levels "Negative","Positive": 2 1 NA
x3
## [1] Positive Negative <NA>    
## Levels: Negative Positive

This method can also work more generally to change all values not listed to missing values.

x4 <- dplyr::recode(x, Positive='Positive', Negative='Negative', 
                    .default=NA_character_)
x4
## [1] Positive Negative <NA>    
## Levels: Negative Positive
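Base R can also do this in one step via the exclude argument of factor(), which both converts the matching values to NA and drops the level; a small sketch of my own along the same lines:

x5 <- factor(x, exclude = 'Indeterminate')
x5
## [1] Positive Negative <NA>    
## Levels: Negative Positive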

Other strategies are welcome in the comments.


Choose the Winner of Appsilon’s shiny.semantic PoC Contest


[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Vote for the Winner of Appsilon’s Internal Shiny Contest

Election season has been stressful for everyone – this is your chance to vote on something a bit more light-hearted. At Appsilon, we wanted to prove that with the shiny.semantic open-source package, it’s possible to create a great-looking, high-quality Proof of Concept Shiny app in one day. So, we recently ran an internal contest to see who could create the best shiny.semantic PoC in under 24 hours. You get to vote for the winner!

Which shiny.semantic PoC is your favorite? You can view the submissions and place your vote here.

Here are the contest rules:

  • Participants must use the shiny.semantic R package
  • PoC’s must be built by a single person
  • Development time must not exceed 24 hours

We’ve asked our friends at RStudio to choose the “Most Technically Impressive PoC” and the “Most Creatively Impressive PoC.” Those results will be announced soon – so stay tuned! 

On top of these awards, we would like the R Community to vote for a “People’s Choice Award.” The PoC with the most votes will receive a special prize. Six fast-turnaround PoC’s created by Appsilon engineers have been submitted to the contest and are eligible to win:

  1. Shiny Mosaic
  2. Squaremantic
  3. FIFA ’19
  4. Polluter Alert
  5. Semantic Pixelator
  6. Semantic Memory

To participate, simply vote for your favorite shiny.semantic PoC with this link. The link includes access to demos of each PoC, along with a GitHub repository.

Learn more about the new CSS Grid UI functionality in shiny.semantic here.

Check Out the Submissions

Shiny Mosaic

The purpose of the application is to enable the user to create a mosaic of a photo that they upload. The application, built in the form of a wizard, allows users to configure the target mosaic form easily. You can test the application here.

Squaremantic


With this app, you can quickly generate a nicely formatted square layout of letters based on the text input. It uses shiny.semantic layouts and input elements, by which you can control visual output like in a simple graphic program. You can test the application here.

FIFA ’19

This app was created to demonstrate shiny.semantic features for creating interactive data visualization. This dashboard used SoFifa data and was inspired by a fantastic Fifa Shiny Dashboard. You can test the application here.

Polluter Alert


Polluter Alert is a dashboard that allows the user to report sources of air pollution in the user’s area. The goal is to build a reliable dataset used for actionable insights – sometimes, the primary pollution source is a single chimney. Sometimes it is a district problem (lack of modern heating infrastructure), etc. You can test the application here.

Semantic Pixelator


Semantic Pixelator is a fun way to explore semantic elements by creating different image compositions using loaders, icons, and other UI elements from semantic/fomantic UI! Upload your picture or use the random button to start. 

You can then use the sidebar to refine different parameters such as the generated grid’s size, the base element type, and other color options. After you find a composition that you like, you can use the palette generator to create a color palette based on the result and download both the current palette details and the developed composition. You can test the application here.

Semantic Memory


Semantic Memory is a memory game similar to the one that won the Shiny contest last year, but it is created from scratch using shiny.semantic (with some adjustments). Two players try to find as many pairs of R package hexes (coming from both Appsilon and RStudio) as they can. 

The app will count their scores and show who won the game. Semantic Memory is based on various shiny.semantic components and uses features that come with the FomanticUI, such as the mechanism responsible for revealing and hiding cards. You can test the application here.

Want to see more high-quality Shiny Dashboard examples? Visit Appsilon’s Shiny Demo Gallery.

Help Us Choose a Winner

We’re proud of what our team has accomplished in such a short period, and it’s hard to pick a single winner. Which PoC dashboard is your favorite? Keep in mind that all of these PoC’s were developed in under 24 hours, so naturally there will be some space for improvement. We welcome any and all feedback on these PoC’s and on our shiny.semantic open source package.

Do you like shiny.semantic? Please give us a star on GitHub. You can explore other Appsilon open source packages on our new Shiny Tools landing page. 

Please tell us what you think! You can place your vote here.

Appsilon Hiring

Appsilon is hiring! See our Careers page for new openings.


Małgorzata Bogdan – Recent developments on Sorted L-One Penalized Estimation


[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A month ago we finished the Why R? 2020 conference. We had the pleasure of hosting Małgorzata Bogdan, an associate professor of statistics at the University of Wrocław. This post contains a biography of the speaker and an abstract of her talk: Recent developments on Sorted L-One Penalized Estimation.

The Sorted L-One Penalized Estimator (SLOPE) is an extension of LASSO, which allows for a reduction of dimension by eliminating some of the model parameters as well as by making some of them equal to each other. In this talk we will present some of the recent developments on SLOPE, with specific emphasis on the Adaptive Bayesian version of SLOPE and on the strong screening rule, which allows a substantial speeding up of the SLOPE algorithm.
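For readers who want to experiment, there is a SLOPE package on CRAN; a minimal sketch (my illustration, not material from the talk), assuming the package’s default Gaussian interface:

# Fit SLOPE to a small simulated regression problem
library(SLOPE)
set.seed(42)
x <- matrix(rnorm(100 * 10), nrow = 100)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(100)
fit <- SLOPE(x, y, family = "gaussian")
coef(fit)  # inspect the estimated coefficients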

Małgorzata Bogdan is an associate professor of statistics at the University of Wrocław. She focuses on statistical methods for filtering and modeling high-dimensional data. She has conducted her research at the University of Washington, Purdue University, the University of Vienna, Lund University and Stanford University. Małgorzata has published over fifty scientific publications, and her achievements earned her a “Women for Math Science Award” from the Department of Mathematics of Munich University of Technology and a Fulbright Scholarship.



BARUG ROC day invitation


[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve recorded a video invitation to help encourage you to consider attending BARUG’s online ROC day (Tuesday, November 10, 2020 4:30 PM US Pacific time).

Please check it out and share.

(link)


Trust the Future


[This article was first published on JottR on R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A fortune cookie that reads 'You do not have to worry about your future'

Each time we use R to analyze data, we rely on the assumption that the functions we use produce correct results. If we can’t make this assumption, we have to spend a lot of time validating every nitty-gritty detail. Luckily, we don’t have to do this. There are many reasons why we can comfortably use R for our analyses, and some of them are unique to R. Here are some I could think of while writing this blog post – I’m sure I forgot something:

  • R is a functional language with few side effects (“just like mathematical functions”)

  • R, and its predecessor S, has undergone lots of real-world validation over the last two-three decades

  • Millions of users and developers use and vet R regularly, which increases the chances for detecting mistakes and bugs

  • R has one established, agreed-upon framework for validating an R package: R CMD check

  • The majority of R packages are distributed through a single repository (CRAN)

  • CRAN requires that all R packages pass checks on past, current, and upcoming R versions, across operating systems (MS Windows, Linux, macOS, and Solaris), and on different compilers

  • New checks are continuously added to R CMD check causing the quality of new and existing R packages to improve over time

  • CRAN asserts that package updates do not break reverse package dependencies

  • R developers spend a substantial amount of time validating their packages

  • R has users and developers with various backgrounds and areas of expertise

  • R has a community that actively engages in discussions on best practices, troubleshooting, bug fixes, testing, and language development

  • There are many third-party contributed tools for developing and testing R packages

I think Jan Vitek summarized it well in the ‘Why R?’ panel discussion on ‘Performance in R’ on 2020-09-26:

R is an ecosystem. It is not a language. The language is the little bit on top. You come for the ecosystem – the books, all of the questions and answers, the snippets of code, the quality of CRAN. … The quality assurance that CRAN brings … we don’t have that in any other language that I know of.

Without the above technical and social ecosystem, I believe the quality of my own R packages would have been substantially lower. Regardless of how many unit tests I would write, I could never achieve the same amount of validation that the full R ecosystem brings to the table.

When you use the future framework for parallel and distributed processing, it is essential that it delivers a level of correctness and reproducibility corresponding to what you get when implementing the same task sequentially. Because of this, validation is a top priority and part of the design and implementation throughout the future ecosystem. Below, I summarize how it is validated:

  • All the essential core packages part of the future framework, future, globals, listenv, and parallelly, implement a rich set of package tests. These are validated regularly across the wide-range of operating systems (Linux, Solaris, macOS, and MS Windows) and R versions available on CRAN, on continuous integration (CI) services (GitHub Actions, Travis CI, and AppVeyor CI), an on R-hub.

  • For each new release, these packages undergo full reverse-package dependency checks using revdepcheck. As of October 2020, the future package is tested against more than 140 direct reverse-package dependencies available on CRAN and Bioconductor, including packages future.apply, furrr, doFuture, drake, googleComputeEngineR, mlr3, plumber, promises (used by shiny), and Seurat. These checks are performed on Linux with both the default settings and when forcing tests to use multisession workers (SOCK clusters), which further validates that globals and packages are identified correctly.

  • A suite of Future API conformance tests available in the future.tests package validates the correctness of all future backends. Any new future backend developed must pass these tests to comply with the Future API. By conforming to this API, the end-user can trust that the backend will produce the same correct and reproducible results as any other backend, including the ones that the developer have tested on. Also, by making it the responsibility of the developer to assert that their new future backend conforms to the Future API, we relieve other developers from having to test that their future-based software works on all backends. It would be a daunting task for a developer to validate the correctness of their software with all existing backends. Even if they would achieve that, there may be additional third-party future backends that they are not aware of, that they do not have the possibility to test with, or that are yet to be developed. The future.tests framework was sponsored by an R Consortium ISC grant.

  • Since foreach is used by a large number of essential CRAN packages, it provides an excellent opportunity for supplementary validation. Specifically, I dynamically tweak the examples of foreach and popular CRAN packages caret, glmnet, NMF, plyr, and TSP to use the doFuture adaptor. This allows me to run these examples with a variety of future backends to validate that the examples produce no run-time errors, which indirectly validates the backends and the Future API. In the past, these types of tests helped to identify and resolve corner cases where automatic identification of global variables would fail. As a side note, several of these foreach-based examples fail when using a parallel foreach adaptor because they do not properly export globals or declare package dependencies. The exception is when using the sequential doSEQ adaptor (default), fork-based ones such as doMC, or the generic doFuture, which supports any future backend and relies on the future framework for handling globals and packages.

  • Analogously to above reverse-dependency checks of each new release, CRAN and Bioconductor continuously run checks on all these direct, but also indirect, reverse dependencies, which further increases the validation of the Future API and the future ecosystem at large.
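The practical payoff of all this validation is that the same future-based code should produce identical results regardless of which backend it runs on. A minimal sketch of that contract (mine; assuming the future and future.apply packages are installed):

library(future)

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

plan(sequential)
res_seq <- future.apply::future_lapply(1:4, slow_square)

plan(multisession)  # switch backend; the mapping code is unchanged
res_par <- future.apply::future_lapply(1:4, slow_square)

identical(res_seq, res_par)
## [1] TRUE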

May the future be with you!



RvsPython #5: Using Monte Carlo To Simulate π


[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The number \pi , while irrational and transcendental, is a central number on which much of mathematics and science relies for many calculations.

A whole book could be written on this matter alone, but today we are going to focus on approximating the true value of \pi using Monte Carlo simulations in R and Python!

Disclaimer

This problem has been covered extensively across the internet and serves as a benchmark example of what Monte Carlo can do. What we are going to do is highlight how this method performs in both R and Python.

The General Algorithm

The formula for the unit circle is:

The Unit Circle; Graphed with Desmos Graphing Calculator

x^2+y^2=1

\iff y = \sqrt{1-x^2}

For x\in[0,1] , the area under the quarter of the unit circle is {\pi\over{4}} . Thus, to approximate \pi , the function that we will be using is:

Y_i = 4\sqrt{1-X_i^2}

To approximate \pi , we use the following Algorithm:

  1. Generate X_1, X_2, \dots, X_n \sim U(0,1)
  2. Calculate Y_i = 4\sqrt{1-X_i^2}
  3. Take the mean of Y_1, \dots, Y_n : \bar{Y} = \frac{\sum_{i=1}^{n}Y_i}{n} – this is our approximation

The code for this is relatively straightforward. But the question is: which code will run faster?

Let’s go!

The Test

For our challenge we are going to write code that is as intuitive as possible in each language. We are going to approximate the value of \pi using the above algorithm. For R we will use lapply to implement the Monte Carlo algorithm, and for Python we will use for loops.

The Solution with R

#' Define Number of points we want to estimate
n <- c(10, 100, 1000, 10000, 100000, 1000000)

#' Generate our random uniform variables
x <- sapply(n, runif)

#' Our Transformation function
y <- function(u) {
  4 * sqrt(1 - u^2)
}

startTime <- Sys.time()
yvals <- lapply(x, y)
endTime <- Sys.time() - startTime
avgs <- lapply(yvals, mean)
endTime
## Time difference of 0.01399588585 secs

data.frame(n, "MC Estimate" = unlist(avgs), "Difference from True Pi" = abs(unlist(avgs) - pi))
##         n MC.Estimate Difference.from.True.Pi
## 1      10 3.281637132         0.1400444782036
## 2     100 3.391190973         0.2495983193740
## 3    1000 3.090265904         0.0513267494211
## 4   10000 3.143465663         0.0018730098616
## 5  100000 3.141027069         0.0005655842822
## 6 1000000 3.141768899         0.0001762457079

The Solution with Python

import numpy as np
import pandas as pd
import time

# Define Number of points we want to estimate
n = [10, 100, 1000, 10000, 100000, 1000000]

# Generate our random uniform variables
x = [np.random.uniform(size=n) for n in n]

# Our Transformation function
def y(x):
    return 4 * np.sqrt(1 - x ** 2)

startTime = time.time()
yvals = []
for array in x:
    yval = []
    for i in array:
        yval.append(y(i))
    yvals.append(yval)

avgs = []
for array in yvals:
    avgs.append(np.mean(array))

endTime = time.time() - startTime

# How long it took to run our code
print("Time difference of " + str(endTime) + " secs\n")
## Time difference of 3.146182060241699 secs

# Estimated Values of Pi
pd.DataFrame({"n": n,
              "MC Estimate": avgs,
              "Difference from True Pi": [np.abs(avg - np.pi) for avg in avgs]})
##          n  MC Estimate  Difference from True Pi
## 0       10     3.320525                 0.178933
## 1      100     3.172290                 0.030698
## 2     1000     3.156044                 0.014451
## 3    10000     3.141675                 0.000083
## 4   100000     3.147255                 0.005662
## 5  1000000     3.141400                 0.000193

Comparing R with Python

From the following ratio we can see how much faster R is than Python:

library(reticulate)
reticulate::py$endTime / as.numeric(endTime)
## [1] 224.7933496

Woah! Using my machine, R is over 220 times faster than Python!

I think it’s pretty clear to see who the winner is as far as speed is concerned.

Concluding Remarks

While R most of the time sits on the sidelines in the Python-dominant world of Data Science, we need to keep in mind where Python’s weaknesses lie and when to pivot away and use R.

Doing simulation? R, please.

Did you like this content? Be sure to never miss an update and Subscribe!



{shinyscreenshot}: Finally, an easy way to take screenshots in Shiny apps!


[This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

{shinyscreenshot} has finally been released, two entire years (sorry!) after I began working on it. It allows you to capture screenshots of entire pages or parts of pages in Shiny apps, and have the image downloaded as a PNG automatically. It can be used to capture the current state of a Shiny app, including interactive widgets (such as plotly, timevis, maps, etc).

You can check out the demo to try it out yourself, or you can watch a short tutorial!


How to use

Using {shinyscreenshot} is as easy as it gets. When you want to take a screenshot, simply call screenshot() and a full-page screenshot will be taken and downloaded as a PNG image. Try it for yourself!

It’s so simple that an example isn’t needed, but here’s one anyway:

library(shiny)
library(shinyscreenshot)

ui <- fluidPage(
  textInput("text", "Enter some text", "test"),
  actionButton("go", "Take a screenshot")
)

server <- function(input, output) {
  observeEvent(input$go, {
    screenshot()
  })
}

shinyApp(ui, server)

Screenshot button

The screenshot() function can be called any time inside the server portion of a Shiny app. A very common case is to take a screenshot after clicking a button. That case is so common that there’s a function for it: screenshotButton(). It accepts all the same parameters as screenshot(), but instead of calling it in the server, you call it in the UI.

screenshotButton() creates a button that, when clicked, will take a screenshot.

Features

  • Region: By default, the entire page is captured. If you’d like to capture a specific part of the screen, you can use the selector parameter to specify a CSS selector. For example, if you have a plot with ID myplot then you can use screenshot(selector="#myplot").

  • Scale: The image file will have the same height and width as what is visible in the browser. Using screenshot(scale=2) will result in an image that’s twice the height and width (and also a larger file size).

  • Timer: Usually you want the screenshot to be taken immediately, but sometimes you may want to tell Shiny to take a screenshot in, for example, 3 seconds from now. That can be done using screenshot(timer=3).

  • File name: You can choose the name of the downloaded file using the filename parameter.

  • Module support: As an alternative to the selector argument, you can also use the id argument. For example, instead of using screenshot(selector="#myplot"), you could use screenshot(id="myplot"). The advantage with using an ID directly is that the id parameter is module-aware, so even if you’re taking a screenshot inside a Shiny module, you don’t need to worry about namespacing.
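Putting a few of these options together, here is a sketch of a button that captures just one plot, at double resolution, one second after the click (the parameter names are the ones documented above; the exact combination is illustrative):

library(shiny)
library(shinyscreenshot)

ui <- fluidPage(
  plotOutput("myplot"),
  screenshotButton(selector = "#myplot", scale = 2, timer = 1,
                   filename = "myplot_screenshot")
)

server <- function(input, output) {
  output$myplot <- renderPlot(plot(cars))
}

shinyApp(ui, server)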

Installation

To install the stable CRAN version:

install.packages("shinyscreenshot")

To install the latest development version from GitHub:

install.packages("remotes")
remotes::install_github("daattali/shinyscreenshot")

Motivation

For years, I saw people asking online how can they take screenshots of the current state of a Shiny app. This question comes up especially with interactive outputs (plotly, timevis, maps, DT, etc). Some of these don’t allow any way to save the current state as an image, and a few do have a “Save as image” option, but they only save the base/initial state of the output, rather than the current state after receiving user interaction.

After seeing many people asking about this, one day my R-friend Eric Nantz asked about it as well, which gave me the motivation to come up with a solution.

Browser support and limitations

The screenshots are powered by the ‘html2canvas’ JavaScript library. They do not always produce perfect screenshots, please refer to ‘html2canvas’ for more information about the limitations.

The JavaScript libraries used in this package may not be supported by all browsers. {shinyscreenshot} should work on Chrome, Firefox, Edge, Chrome on Android, Safari on iPhone (and probably more that I haven’t tested). It does not work in Internet Explorer.

Similar packages

As mentioned above, the libraries used by {shinyscreenshot} do have limitations and may not always work. There are two other packages that came out recently that also provide screenshot functionality which you may try and compare: {snapper} by Jonathan Sidi and {capture} by dreamRs.

RStudio’s {webshot} package is also similar, but serves a very different purpose. {webshot} is used to take screenshots of any website (including Shiny apps), but you cannot interact with the page in order to take a screenshot at a specific time.

TWO YEARS?!

Yes, I know. If you look at the commit history on GitHub, you’ll see this package started in 2018. I wrote almost the entire functionality of the package in just a few days, and then all I had left to do was the dreadful documentation. I didn’t want to release the package without great documentation, so I just waited until I’d have a free weekend to do that. But instead of coming back to do the documentation, I completely forgot about this package.

Many months later, COVID-19 hit us. One of my lockdown goals was to go through all my GitHub R packages and fix all the issues that I can. While doing that sweep, I noticed that I have this cool package that I forgot to release… If you really needed screenshots last year, then I apologize for this mistake. But, better late than never!

Lastly, if you enjoy my content, you should check out my new YouTube channel where I’ll be posting educational Shiny content!


RvsPython #5.1: Making the Game even with Python’s Best Practices


[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Well, it turns out that my last blog post claiming R was over 220 times faster than Python drew a lot of (constructive) criticism: I wasn't using "best practices" with Python, which is why my Python code was so slow. That's a totally fair critique, so I've decided to write a follow-up and rewrite the code to make a more even playing field for both R and Python.

In this blog I’m going to do the following:

  1. Perform the Monte Carlo simulation in R and Python with for loops.
  2. Apply some of Python's best practices and see how the result compares to R's lapply.

(Note: there is already a popular article comparing for loops in R and Python against R's lapply here.)

Disclaimer

This is a follow-up to my last post comparing R and Python with Monte Carlo simulations. For context, check out that blog here.

Using for loops in R

In the last post, the data generation wasn't included in the timing. I thought it would be a good idea to count it in the total processing time, so I'll also change how the data is generated, using R for loops, and start timing the code from there.
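
As a one-line refresher on why this estimates pi (my recap, for readers joining at this post): if U is uniform on (0, 1), then E[y(U)] = ∫₀¹ 4√(1 − u²) du = 4 · (π/4) = π, since the integral is the area of a quarter of the unit disc. So the sample mean of y over many uniform draws converges to pi.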

I did my best to make my R code as similar as possible to the Python code in the last blog; if you see an issue, please comment!

#' Define Number of points we want to estimate
n <- c(10, 100, 1000, 10000, 100000, 1000000)

#' Our Transformation function
y <- function(u) {
  4 * sqrt(1 - u^2)
}

#' Start the timer
startTime <- Sys.time()

#' Generate our random uniform variables
x <- list()
for (i in 1:length(n)) {
  x[[i]] <- runif(n[i])
}

#' Transform our uniform variables.
yvals <- list()
for (i in 1:length(x)) {
  #' Need to define this so that the list element will be populated
  #' See: https://stackoverflow.com/questions/14333525/error-in-tmpk-subscript-out-of-bounds-in-r
  yvals[[i]] <- 1
  for (j in 1:length(x[[i]])) {
    yvals[[i]][j] <- y(x[[i]][j])
  }
}

#' Calculate our approximations of pi
avgs <- c()
for (i in 1:length(yvals)) {
  avgs[i] <- mean(yvals[[i]])
}

endTime <- Sys.time() - startTime
endTime
## Time difference of 1.009413958 secs

data.frame(n,
           "MC Estimate" = unlist(avgs),
           "Difference from True Pi" = abs(unlist(avgs) - pi))
##         n MC.Estimate Difference.from.True.Pi
## 1      10 3.281637132         0.1400444782036
## 2     100 3.391190973         0.2495983193740
## 3    1000 3.090265904         0.0513267494211
## 4   10000 3.143465663         0.0018730098616
## 5  100000 3.141027069         0.0005655842822
## 6 1000000 3.141768899         0.0001762457079

Using for loops in Python (From previous blog)

Here is the code from the previous blog that runs the Monte Carlo algorithm with for loops. I've heard there are more accurate ways to time Python code, but since I want the timing to mirror my R code, I'm doing it this way.

import numpy as np
import pandas as pd
import time

# Define Number of points we want to estimate
n = [10, 100, 1000, 10000, 100000, 1000000]

# Our Transformation function
def y(x):
    return 4 * np.sqrt(1 - x ** 2)

# Start the timer
startTime = time.time()

# Generate our random uniform variables
x = [np.random.uniform(size=n) for n in n]

# Note: this second assignment restarts the timer, so the data
# generation above is not actually counted in the reported time
startTime = time.time()

yvals = []
for array in x:
    yval = []
    for i in array:
        yval.append(y(i))
    yvals.append(yval)

avgs = []
for array in yvals:
    avgs.append(np.mean(array))

endTime = time.time() - startTime

# How long it took to run our code
print("Time difference of " + str(endTime) + " secs\n")
## Time difference of 2.790393352508545 secs

print("Estimated Values of Pi\n")
## Estimated Values of Pi

pd.DataFrame({"n": n,
              "MC Estimate": avgs,
              "Difference from True Pi": [np.abs(avg - np.pi) for avg in avgs]})
##          n  MC Estimate  Difference from True Pi
## 0       10     3.259405                 0.117812
## 1      100     3.351556                 0.209963
## 2     1000     3.130583                 0.011009
## 3    10000     3.126542                 0.015050
## 4   100000     3.144484                 0.002891
## 5  1000000     3.140740                 0.000853

Comparing the two timings from R via {reticulate}:

library(reticulate)
py$endTime / as.numeric(endTime)
## [1] 2.764369693

OK, so using for loops, R isn't as fast as I initially stated. However, on my machine R is still over twice as fast as Python with for loops.

Hey, it ain't 220, but it's something 😏

Using R’s “Best Practices” (Using the apply family)

Instead of using for loops, a faster alternative is to use the apply family of functions, namely sapply and lapply.

#' Start the timer
startTime <- Sys.time()

#' Generate our random uniform variables
x <- sapply(n, runif)
yvals <- lapply(x, y)
avgs <- lapply(yvals, mean)

newendTime <- Sys.time() - startTime
newendTime
## Time difference of 0.1879060268 secs

#' Speed - for loop vs apply
as.numeric(endTime) / as.numeric(newendTime)
## [1] 5.371908366
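
Going one step further (my own addition here, not part of the benchmark above, so take the idea rather than any numbers): y() is already vectorized over numeric vectors, so the element-wise mapping can disappear entirely:

startTime <- Sys.time()

#' Generate the uniform draws as a list of vectors
x <- lapply(n, runif)

#' y() transforms a whole vector in one call; vapply then
#' collapses each transformed vector to its mean
avgs <- vapply(x, function(u) mean(y(u)), numeric(1))

Sys.time() - startTime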

Using Python’s best practices

After getting several comments of (constructive) criticism about how the comparison was not fair, here's some new code implementing some of the best practices for writing faster Python.

This is some code that I saw posted in a comment on my LinkedIn post (thank you Thomas Halvorson), which is pretty similar in structure to the R code I have listed above.

I'm sure there are better ways out there (the comments on the last blog include a lot of very good solutions), but I found this to be the most readable, and it follows a structure similar to the R code.

(Let me know if you have something better!)

startTime = time.time()

x = [np.random.uniform(size=n) for n in n]
yvals = list(map(y, x))
avgs = list(map(np.mean, yvals))

endTime = time.time() - startTime

# How long it took to run our code
print("Time difference of " + str(endTime) + " secs\n")
## Time difference of 0.0629582405090332 secs

Comparing R with Python now we have:

as.numeric(newendTime) / py$endTime
## [1] 2.984613695

Python is nearly 3 times faster on my machine using the updated code.

Conclusion

Well, you live and learn. Best practices can make or break your code's performance, and this updated analysis gives a much fairer picture of how R and Python compare.

Thank you everyone for reading my last blog post and pointing out some obvious issues that I didn’t notice! I definitely will be using the map() function more often in my Python code!

Did you like this content? Be sure to never miss an update and Subscribe!



To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.


The post RvsPython #5.1: Making the Game even with Python’s Best Practices first appeared on R-bloggers.
