
A look at Biontech/Pfizer’s Bayesian analysis of their Covid-19 vaccine trial

[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers.]

Let’s take another look at Biontech/Pfizer’s vaccine candidate, for which a press release stated more than 90% efficacy. As noted in my previous post, Biontech/Pfizer actually use a Bayesian approach to assess the efficacy of their vaccine candidate.

In their study plan we find the following relatively short descriptions:

“The criteria for success at an interim analysis are based on the posterior probability (ie, P[VE > 30% given data]) at the current number of cases. Overwhelming efficacy will be declared if the posterior probability is higher than the success threshold. The success threshold for each interim analysis will be calibrated to protect overall type I error at 2.5%.”

“Bayesian approaches require specification of a prior distribution for the possible values of the unknown vaccine effect, thereby accounting for uncertainty in its value. A minimally informative beta prior, beta(0.700102, 1), is proposed for θ = (1-VE)/(2-VE). The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic. The prior allows considerable uncertainty; the 95% interval for θ is (0.005, 0.964) and the corresponding 95% interval for VE is (-26.2, 0.995).”

The approach is described in more detail in a statistical analysis plan, which I could not find on the internet. So let’s try to make an educated guess about their analysis from the information above.

Let $\pi_v$ and $\pi_c$ be the population probabilities that a vaccinated subject or a subject in the control group, respectively, falls ill with Covid-19. Then the population vaccine efficacy is given by

\[VE = 1-\frac {\pi_v} {\pi_c}\]

In their study plan Biontech/Pfizer assume a prior distribution for a parameter

\[\theta = \frac {1-VE} {2-VE}\]

Plugging in the definition of the population vaccine efficacy VE, we can rewrite $\theta$ as

\[\theta = \frac {\pi_v} {\pi_v+\pi_c}\]
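For completeness, the substitution works out as follows:

\[\theta = \frac{1-VE}{2-VE} = \frac{1-\left(1-\frac{\pi_v}{\pi_c}\right)}{2-\left(1-\frac{\pi_v}{\pi_c}\right)} = \frac{\pi_v/\pi_c}{1+\pi_v/\pi_c} = \frac {\pi_v} {\pi_v+\pi_c}\]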

Given that Biontech/Pfizer assigned the same number of subjects to the treatment and control group, this $\theta$ should denote the probability that a subject who fell ill with Covid-19 is from the treatment group, while $1-\theta$ is the probability that the subject is from the control group.

As we can see from the study plan, the assumed prior distribution of $\theta$ is a Beta distribution with shape parameters $a=a_0=0.700102$ and $b=b_0=1$. We see from the description of the beta distribution that the prior mean of $\theta$ is thus given by

\[E(\theta) = \frac {a_0} {a_0+b_0} = 0.4118\]

Solving the definition of $\theta$ for VE, we get

\[VE=\frac{1-2\theta}{1-\theta}\]

so the efficacy at the prior mean $\theta=0.4118$ is indeed 30%, as stated by Biontech/Pfizer.
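As a quick sanity check, we can reproduce both numbers in R (a small snippet of my own, reusing the shape parameters above):

a0 = 0.700102; b0 = 1
theta0 = a0/(a0 + b0)            # prior mean of theta, roughly 0.4118
(1 - 2*theta0) / (1 - theta0)    # implied VE, roughly 0.30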

The following plot shows the prior distribution for $\theta$

library(ggplot2)  # the plotting code below assumes ggplot2 is loaded

a0 = 0.700102; b0 = 1
ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
  stat_function(geom = "area", fun = dbeta, n = 101,
    args = list(shape1 = a0, shape2 = b0),
    col = "blue", fill = "blue", alpha = 0.5
  ) +
  ylab("Prior Density") + xlab("theta") + geom_vline(xintercept = 0.4118) +
  ggtitle("Prior on probability that a subject with Covid-19 was vaccinated")

Now the crucial part of Bayesian statistics is how we update our believed distribution of $\theta$ once we get new data. Assume $m$ subjects fell ill with Covid-19, of which $m_v$ were vaccinated and $m_c$ were in the control group. Then one can show (see e.g. here) that the posterior distribution of $\theta$ is again a Beta distribution, with shape parameters $a=a_0+m_v$ and $b=b_0+m_c$. Nice and simple!
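As a small illustration of this update rule (my own helper function, not from the study plan):

# Beta-binomial update described above: m_v vaccinated cases and m_c control
# cases turn a Beta(a0, b0) prior into a Beta(a0 + m_v, b0 + m_c) posterior.
update_beta = function(a0, b0, m_v, m_c) {
  c(a = a0 + m_v, b = b0 + m_c)
}
update_beta(0.700102, 1, m_v = 8, m_c = 86)  # yields a = 8.700102, b = 87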

For example, here is the posterior if we had a single observation of an ill subject and that subject was in the control group:

a = a0; b = b0 + 1
ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
  stat_function(geom = "area", fun = dbeta, n = 101,
    args = list(shape1 = a, shape2 = b),
    col = "blue", fill = "blue", alpha = 0.5
  ) +
  xlab("theta") + ylab("density") + geom_vline(xintercept = a/(a + b)) +
  ggtitle("Posterior if single ill subject was in control group")

Here is the posterior if the single ill subject was vaccinated:

a = a0 + 1; b = b0
ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
  stat_function(geom = "area", fun = dbeta, n = 101,
    args = list(shape1 = a, shape2 = b),
    col = "blue", fill = "blue", alpha = 0.5
  ) +
  xlab("theta") + ylab("density") + geom_vline(xintercept = a/(a + b)) +
  ggtitle("Posterior if a single ill subject was vaccinated")

Biontech/Pfizer’s press release stated that 94 subjects fell ill with Covid-19, and it can be deduced from the stated sample efficacy above 90% that at most 8 of those 94 subjects were vaccinated.
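To see where the 8 comes from, note that with equally sized groups the sample efficacy is roughly $1-m_v/m_c$. A quick back-of-the-envelope check (my own, not from the press release):

m_v = 0:10
m_c = 94 - m_v
round(100*(1 - m_v/m_c), 1)  # stays above 90% only for m_v <= 8
# m_v = 8 gives about 90.7%, while m_v = 9 gives about 89.4%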

We can easily compute and draw the posterior distribution for $\theta$ for this data:

a = a0 + 8; b = b0 + 94 - 8
ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
  stat_function(geom = "area", fun = dbeta, n = 101,
    args = list(shape1 = a, shape2 = b),
    col = "blue", fill = "blue", alpha = 0.5
  ) +
  xlab("theta") + ylab("density") + geom_vline(xintercept = a/(a + b)) +
  ggtitle("Posterior if 8 out of the 94 ill subjects were vaccinated")

Biontech/Pfizer stated as an interim success criterion that the posterior probability of an efficacy below 30% (corresponding to $\theta > 0.4118$) must be smaller than 2.5%. We can easily compute this probability from our posterior distribution of $\theta$:

pbeta(0.4118, a, b, lower.tail = FALSE)*100
## [1] 6.399889e-11

Well, as the graph has already suggested, our posterior probability of an efficacy below 30% is almost 0.

We can compute the Bayesian analogue of a 95% confidence interval (called a credible interval) for $\theta$ by looking at the 2.5% and 97.5% quantiles of our posterior distribution:

theta.ci = qbeta(c(0.025, 0.975), a, b)
round(theta.ci*100, 1)
## [1]  4.2 15.6

This means that, given our prior belief and the data, we believe with 95% probability that the probability that an ill subject was vaccinated lies between 4.2% and 15.6%.

We can transform these boundaries into a corresponding credible interval for the vaccine efficacy:

VE.ci = rev((1 - 2*theta.ci)/(1 - theta.ci))
round(VE.ci*100, 1)
## [1] 81.6 95.6

This Bayesian credible interval for the vaccine efficacy, from 81.6% to 95.6%, is actually quite close to the frequentist confidence interval we computed in the first post. That’s reassuring.

Of course, the credible interval depends on the assumed prior for $\theta$. A more conservative prior would have been to simply assume a uniform distribution of $\theta$ which is equal to a Beta distribution with shape parameters $a=b=1$. Let us compute the 95% credible interval for this uniform prior:

a0 = b0 = 1
a = a0 + 8; b = b0 + 94 - 8
theta.ci = qbeta(c(0.025, 0.975), a, b)
VE.ci = rev((1 - 2*theta.ci)/(1 - theta.ci))
round(VE.ci*100, 1)
## [1] 81.1 95.4

It is nice to see that this different prior has only a small impact on the credible interval for the vaccine efficacy, which now runs from 81.1% to 95.4%.

While I have typically been sceptical about Bayesian analysis because of the need to specify a prior distribution, I must say that in this example the Bayesian approach looks quite intuitive and nice. You should still take my analysis with a grain of salt, however: it is essentially an educated guess at how Biontech/Pfizer actually perform the analysis, and there may well be statistical complications that I have overlooked.

PS: I just learned about another nice blog series here about Bayesian methods for analysing the Covid-19 vaccine trials. It also illustrates some more advanced R packages, like runjags, that can be used for more complex Bayesian analyses.



another electoral map

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]


RStudio Announces Winners of Appsilon’s Internal Shiny Contest

[This article was first published on r – Appsilon | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

The Results Are In!

We recently ran an internal Shiny contest to see which Appsilon team member could make the best shiny.semantic PoC in under 24 hours, and today we have some winners to announce!

Which shiny.semantic PoC app is your favorite? There is still time to vote for the winner of the People’s Choice Award, which will receive a special prize. Voting is open until Friday, November 13th.

Before we announce the winners, we would like to thank the Shiny team at RStudio. They very generously agreed to judge our shiny.semantic contest. For a previous collaboration between Appsilon and RStudio, see our joint webinar on Enabling Remote Data Science Teams. For RStudio’s blog post announcing the winners, go here.

And now, the winners: 

We’ve asked each of the winners to leave their thoughts and reasons why they decided to build their dashboards. We’ll go through their comments on the development of each PoC app below.

Want to see all submissions? Check out our previous article.

Most Technically Impressive: Polluter Alert

Polluter Alert is a dashboard that allows the user to report sources of air pollution in the user’s geographical area.

Polluter Alert

We’ve asked Paweł, Appsilon Co-Founder and VP of Engineering, why he decided to make the Polluter Alert application. Here are his thoughts:

Air quality has been an important topic to me for many years. Every year during colder days in Poland (October through March), many cities are covered by dense smog because of heating stoves. Even worse, people burn anything and everything as fuel, even plastic bottles. This results in terrible air quality that causes lung diseases and allergies. I really wish we could do something about it. The government couldn’t improve this situation for many years. I got inspired when my colleague told me that the primary source of pollution in his district was a single factory chimney. We need more data to act smart. Maybe collaborative effort and crowdsourcing can help us build a map of pollution sources, which can help decision-makers?

You can test the application here.

Most Creatively Impressive: Semantic Pixelator

Semantic Pixelator is a fun way to explore semantic elements by creating different image compositions using loaders, icons, and other UI elements from semantic/fomantic UI.

Semantic Pixelator

Here’s what Appsilon team member and Shiny frontend magician Pedro Silva had to say about his PoC application:

I believe fun and gamification are great ways of getting familiar with new things. Knowing that semantic offers a huge library of different components, I wanted to showcase as many of those components as possible in unconventional ways. I feel the end result shows both how easy it is to get started with semantic, but at the same time how flexible and customizable it can be.

Pedro was previously a grand prize winner of RStudio’s 2020 Shiny Contest after building a functional video game in Shiny, so it’s no surprise that he fared well in our internal shiny.semantic competition. You can test his shiny.semantic application here.

Most Technically Impressive (Runner-Up): FIFA ’19

This app was created to demonstrate shiny.semantic features for creating interactive data visualization. 

FIFA 19 Dashboard

Appsilon team member and Open Source leader Dominik shared his thoughts on the application:

I decided to create the FIFA dashboard to showcase that shiny.semantic can be a perfect choice for building data visualization apps. And also, I used to play FIFA way too much during high school.

Who doesn’t love a little football mixed with a lot of data? You can test the application here.

Vote for the People’s Choice Award

As mentioned earlier, we still have to determine the People’s Choice Award winner, and that award will be decided by the R Community. You can vote for any of the six submissions with this link. Keep in mind that all of these PoCs were developed in under 24 hours, so naturally there will be some room for improvement. We welcome any feedback on these PoCs and on our shiny.semantic open-source package.

Please tell us what you think! You can place your vote here. 

Do you like shiny.semantic? Please consider giving us a star on GitHub. You can explore other Appsilon open source packages on our new Shiny Tools landing page. 

Appsilon Hiring

Appsilon is hiring! See our Careers page for new openings, including openings for a Project Manager and Community Manager.

Article RStudio Announces Winners of Appsilon’s Internal Shiny Contest comes from Appsilon | End to End Data Science Solutions.



Web Frameworks for R – A Brief Overview

[This article was first published on R – Blog – Mazama Science, and kindly contributed to R-bloggers.]

Having recently announced the beakr web framework for R, we have received several questions about context and about why we chose beakr over other options for some of our web services. This post will attempt to answer some of those questions by providing a few opinions on beakr and other web frameworks for R.

The comparison will by no means be exhaustive but will attempt to briefly summarize some of the key distinctions each web framework has to offer. While there are some differences in the approach each package takes to developing web services, they all share similar basic functionality. In the end, the choice of a particular framework will come down largely to personal preference.

beakr

beakr is a small package and minimal web framework for quickly developing simple and stable web-services.

Features
  • Simple & Stable to get up and running
  • Programmatic tidy syntax

beakr (CRAN, GitHub) is a web framework for developing and deploying web services with R. beakr is inspired by the minimalist and massively expandable frameworks offered by Express.js and Flask, and is a reincarnation of the discontinued jug package by Bart Smeets. It allows users to assign specific blocks of R code to individual URL routes accessed by different HTTP methods (POST, GET, PUT, etc.) using a familiar tidy syntax. We use beakr because it offers a simple, no-fuss back end with a few key features we know we can rely on at Mazama Science.

View the beakr “Hello, World!” example.
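For readers who just want a flavor of the syntax, a minimal beakr app looks roughly like the following. This is a sketch modeled on the package’s documented Express-style verbs (newBeakr(), httpGET(), listen()); see the beakr README for the authoritative version.

library(beakr)
library(magrittr)  # for the pipe, in case beakr does not re-export it

newBeakr() %>%
  # Respond to GET requests at the root route
  httpGET(path = "/", function(req, res, err) {
    "Hello, World!"
  }) %>%
  # Start the server (host and port are arbitrary choices here)
  listen(host = "127.0.0.1", port = 25118)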

plumber

plumber is a package that allows you to create a web API by decorating your existing R code with special decorator comments.

Features
  • Only requires a decorator comment to expose endpoints
  • Large community
  • Supported by RStudio

plumber (CRAN, GitHub) is an RStudio-maintained package, which gives it a tremendous amount of resources and support.

The plumber package is targeted at those who have R code with a well-defined API and want to make that API available as a web service. This makes plumber an excellent package for exposing R code that already exists. Not to mention the package is supported and maintained by RStudio — so you can be confident of its wide adoption and documentation.

However, we created beakr so that we can define REST services first and then write whatever code we want to handle each REST URL. This allows us to add all sorts of logging, caching, error handling, and other code required in our operational systems.

View the plumber “Hello, World!” example.
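For comparison, a minimal plumber endpoint looks something like this (a sketch of the decorator-comment style; the file name and port are arbitrary choices):

# plumber.R
#* Return a friendly greeting
#* @get /hello
function() {
  list(msg = "Hello, World!")
}

# Then, in a separate script or session:
# plumber::plumb("plumber.R")$run(port = 8000)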

ambiorix

Ambiorix is an unopinionated web framework package for R and is designed to be flexible and extendable.

Features
  • Extensions (template generators, CLI support)
  • Websocket support

ambiorix (GitHub) is a new web framework package that we are excited to see! Though the package is still in development, John Coene (the creator) has already added some neat extensions such as a command-line interface and web template generators.

ambiorix and beakr offer similar approaches to developing web services and both draw inspiration from the ever popular express.js framework. ambiorix offers many functional features and is certainly a package to keep an eye on while it matures for an official release!

View the ambiorix “Hello, World!” example.

fiery

Fiery is a flexible and lightweight framework for building web servers in R with explicit control.

Features
  • Complete control of server life-cycle events
  • Modular design

fiery (CRAN, GitHub) is designed to give explicit control of server life-cycle events and to handle them flexibly. It’s a framework that defines itself as lightweight, with modularity and control at the core of its design. We think fiery is best for developers who demand complete control of how their web services handle requests, responses, and everything in between. fiery is no doubt one of the most control-oriented frameworks that exist for R and, as such, carries some technical overhead. That being said, we opt for beakr because our services do not demand such a degree of control.

View the fiery “Hello, World!” example.

Shiny

Shiny is a full suite package that makes it easy to build interactive web apps straight from R.

Features
  • Included bootstrap GUI API
  • A popular choice supported by RStudio

Shiny (CRAN, GitHub) is more than a web framework — it is an ecosystem that includes the Shiny package and Shiny Server. They are tied together to make Shiny an awesome tool for creating GUI web applications and dashboards with R. Not to mention it is loaded with pre-built features that work really well with each other.

Here at Mazama Science, we like to use Shiny to develop web applications, documents, and dashboards. However, sometimes we demand simple and reliable web-services without the need for so many niceties and beakr fills that gap.

View the Shiny “Hello, World!” example.
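And for completeness, the canonical minimal Shiny app (a ui object plus a server function) is only a few lines:

library(shiny)

ui <- fluidPage("Hello, World!")
server <- function(input, output, session) {}

shinyApp(ui, server)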

Conclusion

These are only a few of the web frameworks that exist for R, and this survey is not meant to be an exhaustive review of any one of them — these are just a few of our quick findings. Each of these frameworks is pretty full-featured, and they can likely be used interchangeably for most projects. However, the small differences that define each framework are important, and developers should do their own review of functionality before starting any project.

For our typical use case at Mazama Science — creating R-based, modular web services — beakr is an excellent fit as it allows us to easily create a web service that responds to requests from other parts of a larger web-based data visualization system. Web services based on beakr have proven to be quite robust and are used operationally by the US Forest Service in their wildfire smoke air quality site which can see thousands of hits per hour during wildfire season.

— Mazama Science



The Notion of “Double Descent”

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers.]

I tend to be blasé about breathless claims of “new” methods and concepts in statistics and machine learning. Most are “variations on a theme.” However, the notion of double descent, which has come into prominence in the last few years, is something I regard as genuinely new and very relevant, shedding important light on the central issue of model overfitting.

In this post, I’ll explain the idea, and illustrate it using code from my regtools R package.

The idea itself is not new, but is taking special importance these days, in its possibly shedding light on deep learning. It seems to suggest an answer to the question, “Why do DL networks often work well, in spite of being drastically overparameterized?”

Classical statistical thinking, including in my own books, is that the graph of loss L (e.g. misclassification rate) on new data against model complexity C should be U-shaped. As C first moves away from 0, bias is greatly reduced while variance increases only slightly. The curve is in descent. But once C passes a minimum point, the curve will go back up, as variance eventually overwhelms bias.

(Technical note: Bias here is the expected value of the difference between the predicted values of the richer and coarser models, under the richer one, and taken over the distribution of the predictors.)

As C increases, we will eventually reach the point of saturation (nowadays termed interpolation), in which we have 0 training error. A common example is linear regression with a polynomial model in a single predictor variable. Here C is the degree of the polynomial. If we have n = 100 data points, a 99th-degree polynomial will fit the training data exactly — but will be terrible in predicting new data.
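A quick toy illustration of that point (my own simulated example, not the Million Song data used later): with n = 100 points, a 99th-degree polynomial drives the training residuals to essentially zero, yet its predictions at new x values swing wildly.

set.seed(1)
x <- sort(runif(100)); y <- sin(2*pi*x) + rnorm(100, sd = 0.1)
fit <- lm(y ~ poly(x, 99))   # 100 coefficients for 100 observations
max(abs(resid(fit)))         # essentially 0: the training data are interpolated
xnew <- data.frame(x = sort(runif(100)))
range(predict(fit, xnew))    # predictions far outside the range of y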

But what if we dare to fit a polynomial of higher degree than 99 anyway, i.e. what if we deliberately overfit? The problem now becomes indeterminate — there is now no unique solution to the problem of minimizing the error sum of squares. There are in fact infinitely many solutions. But actually, that can be a virtue; among those infinitely many solutions, we may be able to choose one that is really good.

“Good” would of course mean that it is able to predict new cases well. How might we find such a solution?

I’ll continue with the linear regression example, though not assume a polynomial model. First, some notation. Let β denote our population coefficient vector, and b its estimate. Let ||b|| denote the vector norm of b. Let p denote the number of predictor variables. Again, if we have p predictor variables, we will get a perfect fit in the training set if p = n-1.

If you are familiar with the LASSO, you know that we may be able to do better than computing b to be the OLS estimate; a shrunken version of OLS may do better. Well, which shrunken version? How about the minimum-norm solution?

Before the interpolation point, our unique OLS solution is the famous

\[b = (X'X)^{-1} X'Y\]

This can also be obtained as \(b = X^{-}Y\), where \(X^{-}\) is a generalized inverse of X. It’s the same as the classic formula before interpolation, but it can be used after the interpolation point as well. And the key point is then that one implementation of this, the Moore-Penrose inverse, gives the minimum-norm solution.
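Here is a tiny sketch of that idea (my own simulated example): once p exceeds n, solve(t(X) %*% X) breaks down because X'X is singular, but MASS::ginv() still returns a solution (the minimum-norm one) and it fits the training data exactly.

library(MASS)
set.seed(1)
n <- 20; p <- 30                # more predictors than observations
X <- matrix(rnorm(n*p), n, p)
y <- rnorm(n)
b_mp <- ginv(X) %*% y           # minimum-norm least-squares solution
max(abs(X %*% b_mp - y))        # ~0: zero training error past interpolation
sqrt(sum(b_mp^2))               # the norm that is being minimized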

This minimum-norm property reduces variance. So, not only will MP allow us to go past the interpolation point, it will also reduce variance, possibly causing the L-C curve to descend a second time! We now have a double U-shape (there could be more).

And if we’re lucky, in this second U-shape, we may reach a lower minimum than we had in the original one. If so, it will have paid to overfit!

There are some subtleties here. The minimum norm means minimum among all solutions for fixed p. As p increases, the space over which we are minimizing grows richer, thus offering more opportunities for minimizing the norm. On the other hand, the variance of that minimum-norm solution is increasing with p. Meanwhile the bias is presumably staying constant, since all solutions have a training set error of 0. So the dynamics underlying this second U would seem to be different from those of the first.

Empirical illustration:

Here I’ll work with the Million Song dataset. It consists of 90 audio measurements made on about 500,000 songs (not a million) from 1922 to 2011. The goal is to predict the year of release from the audio.

I took the first p predictors, varying p from 2 to 30, and fit a quadratic model, with O(p^2) predictors resulting.

One of the newer features of regtools is its qe*-series (“Quick and Easy”) of functions. They are extremely simple to use, all having the call format qeX(d,'yname'), to predict the specified Y in the data frame d. Again, the emphasis is on simplicity; the above call is all one needs, so for example there is no preliminary code for defining a model. They are all wrappers for standard R package functions, and are paired with predict() and plot() wrappers.

Currently there are 9 different machine learning algorithms available, e.g. qeSVM() and qeRF() for SVM and random forests. Here we will use qePoly(), which wraps our polyreg package. Note by the way that the latter correctly handles dummy variables (e.g. no powers of a dummy are formed). Note too that qePoly() computes the Moore-Penrose inverse if we are past interpolation, using the function ginv() from the MASS package.

To run this experiment on a meaningful scale, I generated random training sets of size n = 250 and took the rest of the data as the test set. I used from 2 to 30 predictor variables, and used Mean Absolute Prediction Error as my accuracy criterion. Here are the results:

Well, sure enough, there it is, the second U. The interpolation point is between 22 and 23 predictors (there is no “in-between” configuration), where there are 265 parameters, overfitting our n = 250.

Alas, the minimum in the second U is not lower than that of the first, so overfitting hasn’t paid off here. But it does illustrate the concept of double-descent.

Code:

 

overfit <- function(nreps, n, maxP) {
   load('YearData.save')   # loads the data frame 'yr' used below
   nas <- rep(NA, nreps*(maxP-1))
   outdf <- data.frame(p=nas, mape=nas)
   rownum <- 0
   for (i in 1:nreps) {
      idxs <- sample(1:nrow(yr), n)
      trn <- yr[idxs,]
      tst <- yr[-idxs,]
      for (p in 2:maxP) {
         rownum <- rownum + 1
         out <- qePoly(trn[, 1:(p+1)], 'V1', 2, holdout=NULL)
         preds <- predict(out, tst[,-1])
         mape <- mean(abs(preds - tst[,1]))
         outdf[rownum, 1] <- p
         outdf[rownum, 2] <- mape
         print(outdf[rownum,])
      }
   }
   outdf  # run through tapply() for the graph
}


mlr3spatiotempcv: Initial CRAN release

[This article was first published on Machine Learning in R, and kindly contributed to R-bloggers.]

We are happy to announce that a new extension package has joined the CRAN family of mlr3 packages. mlr3spatiotempcv was in the works for more than a year and adds spatiotemporal resampling methods to the mlr3 ecosystem.

Such dedicated resampling methods make it possible to retrieve bias-reduced performance estimates in cross-validation scenarios when working with spatial, temporal or spatiotemporal datasets. mlr3spatiotempcv does not implement new methods but rather attempts to collect existing methods.
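A minimal sketch of what that looks like in practice (using the ecuador task bundled with the package; the specific learner and measure here are arbitrary choices):

library(mlr3)
library(mlr3spatiotempcv)

task = tsk("ecuador")                        # landslide task shipped with the package
learner = lrn("classif.rpart", predict_type = "prob")
resampling = rsmp("spcv_coords", folds = 5)  # coordinate-based spatial CV

rr = resample(task, learner, resampling)
rr$aggregate(msr("classif.auc"))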

So far, applying such methods in both R and the mlr ecosystem was not particularly easy, since they were spread across various R packages. Usually every R package uses a slightly different syntax for the required objects and the returned results. This not only makes standalone use inconvenient but is also impractical when working in an overarching ecosystem such as mlr3.

We hope that with the release of this package users are now able to seamlessly work with spatiotemporal data in mlr3. Please file issues and suggestions in the package’s issue tracker.

For a quick and rather technical introduction, please see the “Get Started” vignette. For more detailed information and a detailed walk-through, see the “Spatiotemporal Analysis” section in the mlr3book.

To finish with something visual, here is a simple example which showcases the visualization capabilities of mlr3spatiotempcv for different partitioning methods: random (non-spatial) partitioning (Fig. 1) vs. k-means based (spatial) partitioning (Fig. 2).

library("mlr3")library("mlr3spatiotempcv")set.seed(42)# be less verboselgr::get_logger("bbotk")$set_threshold("warn")lgr::get_logger("mlr3")$set_threshold("warn")task = tsk("ecuador")learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)autoplot(resampling_nsp, task, fold_id = c(1:4), crs = 4326) *  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))

autoplot(resampling_sp, task, fold_id = c(1:4), crs = 4326) *
  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))



Round about the kernel

[This article was first published on R on OSM, and kindly contributed to R-bloggers.]

In our last post, we took our analysis of rolling average pairwise correlations on the constituents of the XLI ETF one step further by applying kernel regressions to the data and comparing those results with linear regressions. Using a cross-validation approach to analyze prediction error and overfitting potential, we found that kernel regressions saw average error increase between training and validation sets, while the linear models saw it decrease. We reasoned that the decrease was due to the idiosyncrasies of the time series data: models trained on volatile markets, validating on less choppy ones. Indeed, we felt we should trust the kernel regression results more than the linear ones precisely because those results followed the common expectation that error increases when exposing models to cross-validation. But such trust could be misplaced! However, we weren’t trying to pit kernel regressions against linear ones. Rather we were introducing the concept of kernel regressions prior to our examination of generalized correlations.

In this post, we’ll look at such correlations using the generalCorr package written by Prof. H. Vinod of Fordham University, NY.1 Our plan is to tease out any potential causality we can find between the constituents and the index. From there we’ll be able to test that causality on out-of-sample results using a kernel regression. Strap in, we’re in for a bumpy ride!

What is generalized correlation? While we can’t do justice to the nuances of the package, we’ll try to give you the 10,000-foot view. Any errors in interpretation are ours of course. Standard correlation measures assume a linear relationship between two variables. Yet many data series do not exhibit such a relationship, and so using this measure fails to capture a whole host of interesting dependencies. Add in time series data, and one finds that assessing causality based on linear dependence becomes even more thorny. Economists sometimes use a workaround that relies on time lags2. Employing a less common “instantaneous” version of this workaround, by including non-lagged data, may yield a better estimate.

Using the instantaneous version with a generalized measure of correlation, one can estimate whether one variable causes another. That is, does X cause Y or Y cause X? What’s this generalized measure? Using a more sophisticated version of kernel regression than we discussed in our last post, one regresses Y on X and then X on Y. One then takes the difference of the two \(R^{2}\)s (that of X on Y minus that of Y on X); if that difference, call it \(\delta\), is less than zero, then X predicts Y rather than the other way around.

Obviously, it’s way more complicated than this, but the intuition is kind of neat if you remember that the squared correlation of Y and X is equivalent to the \(R^{2}\) of the linear regression of Y on X. Recall too that the \(R^{2}\) of Y on X should be the same as X on Y for linear functions. Hence, the \(\delta\) mentioned above should be (roughly) zero if there is a linear relationship or limited causality. If not, there’s probably some non-linear dependence and causality too, depending on the sign of \(\delta\).
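To make the asymmetry concrete, here is a small simulated sketch (ours, not from the analysis below), using what we understand to be the package’s generalized correlation matrix function, gmcmtx0(). With a noisy quadratic relationship, the two off-diagonal entries differ noticeably, unlike an ordinary (symmetric) correlation matrix.

library(generalCorr)
set.seed(123)
x <- rnorm(200)
y <- x^2 + rnorm(200, sd = 0.3)
cor(x, y)             # near zero: linear correlation misses the dependence
gmcmtx0(cbind(x, y))  # asymmetric: y given x is far more predictable than x given y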

Okay, so how the heck does this relate to pairwise correlations and predictive ability? If we can establish the return of the index is causally determined by a subset of the constituents, then using only the pairwise correlations of that subset, we might be able to achieve better predictions of the returns. How would we set this up?

We could run the generalized correlations across our entire training series, then calculate the pairwise correlations on only the “high” causality constituents and use that metric to predict returns on the index. The difficulty with that method is that it’s forecasting the past. By the time we’ve calculated the causality we already know that it’s driven index returns! Instead, we could use the causality subset to calculate pairwise correlations in the next period, which would then be used to forecast forward returns. In other words, take the causality subset from \(t_{0-l}\) to \(t_{0}\) to calculate the average pairwise correlations on that subset from \(t_{0-w}\) to \(t_{f}\) and then regress the \(w\)-length returns on the index from \(t_{0+w}\) to \(t_{f+w}\) on those pairwise correlations. Here \(l\) is the lookback period, \(w\) is the window length, and \(f\) is the look forward period. Seems convoluted but hopefully will make sense once we go through an example.

We’ll take 250 trading days, calculate the subset causality, and then compute the rolling 60-day pairwise correlations starting on day 191 (to have the 60-day correlation available prior to the start of the return period) and continuing until day 500. We’ll calculate the rolling past 60-day return starting on day 310 (for the 60-day forward return starting on day 250) until day 560. Then we’ll regress the returns against the correlations for the 250-day period. This effectively means we’re regressing 60-day forward returns on the prior 60-day average pairwise correlations. Whew!
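Since the indexing is easy to get lost in, here is the layout of the first 250-day block as we read it (illustrative day numbers only, not code from the appendix below):

cause_days <- 1:250    # estimate which constituents "cause" the index
corr_days  <- 191:500  # rolling 60-day correlations on that causal subset;
                       # the first value lands on day 250
ret_days   <- 310:560  # trailing 60-day returns, i.e. the forward returns
                       # for days 250 through 500
# Each regression then pairs the correlation ending on day t with the
# index return over days t+1 through t+60, for t = 250, ..., 500.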

What does this actually look like? Let’s first recall the original pairwise correlation vs. forward return scatter plot and then we’ll show an example from one period.

Here the red line is not a kernel regression, but another non-parametric method we use for ease of presentation.3 Now we’ll run through all the data wrangling and calculations to create multiple windows on separate 250-day trading periods in our training set, which runs from about 2005 to mid-2015. We’ll select one of those periods to show what’s going on graphically and to compare the different results and predictive models. First, we show a scatter plot of the causal subset of pairwise correlations and XLI returns for the 2007-2008 period.

The linear regression is the dashed line and the non-parametric is the wavy line. Clearly, the relationship is on a downward trend as one might expect for that period. Now let’s look at the non-causal subset.

The non-parametric regression line implies a pretty weird function, while the linear regression suggests almost no relationship. Already, we can see some evidence that the causal subset may do a better job at explaining the returns on the index than the non-causal. We’ll run regressions on the returns vs. the correlation subsets for a kernel and linear model. We present the \(R^{2}\) results in the table below.

Table 1: Regression R-squareds (%)

Models    Causal    Non-causal
Kernel    41.3      15.4
Linear    7.7       0.9

The causal subset with a kernel regression explains roughly five times as much of the variability as the corresponding linear model, more than double the kernel regression using the non-causal subset, and about 45 times the linear model using the non-causal subset.

Now we’ll run the kernel and linear regressions on all the periods. We present the \(R^{2}\) for the different regression models in a chart below.

In every case, the kernel regression model does a better job of explaining the variability in the index’s returns than the linear model. For completeness, we’ll include the linear model with the non-causal constituents.

Based on the chart above, we see that the kernel regression outperforms linear regression based on the non-causal subset too. Interestingly, the non-causal subset sometimes outperforms the causal on the linear model. We’re not entirely sure why this is the case. Of course, that mainly happens in the 2005-2007 time frame. This could be the result of correlations changing in the forecast period on these few occasions. Or it might be due to periods where correlations among the non-causal constituents rise, suggesting a stronger linear relationship with forward returns (especially during an upward trending market) even if the causal link from the prior period is weak.

The final metric we’d like to check is the root mean-squared error (RMSE) on the kernel vs. linear regressions.

As evident, the kernel regression using the causal subset has a lower error than the linear regressions on either the causal or non-causal subsets. Notice too that the error is not stable across periods and increases (predictably) during market turbulence (2007-2009). Interestingly, the differences in error rates also rise and fall over time, more easily seen in the graph below.

The average difference in error between the kernel and linear regression using the causal subset is about 1.7 percentage points, while it is about 1.8 percentage points for the non-causal subset. Those averages are similar, but as the graph shows, there is some variability over time. Whether fund managers would find a difference of close to 2 percentage points meaningful will likely depend on the degree to which the lower error rate contributes to performance improvement. A better-performing predictive model is one thing; whether it can be used to generate better risk-adjusted returns is another! Nonetheless, two points of achievable outperformance is nothing to sneeze at. Of concern is the narrowing in performance in the 2012-2014 time frame. We suspect that is because the market enjoyed a relatively smooth, upwardly trending environment in that period, as shown in the chart below.

Whether or not the narrowing of error between the kernel and linear regressions is likely to persist would need to be analyzed on the test set. But we’ll save that for another post. What have we learned thus far?

  • Generalized correlations appear to do a good job at identifying causal, non-linear relationships.

  • Using the output from a generalized correlation and employing a kernel regression algorithm, we can produce models that explain the variability in returns better than linear regressions using both causal and non-causal subsets.

  • The model error is lower for the causal subset kernel regressions vs. the other models too. However, that performance does appear to moderate in calm, upwardly trending markets.

Where could we go from here with these results? We might want to test this on other sector ETFs. And if the results bore out similar performance, then we could look to build some type of factor model. For example, we might go long the high causal and short the low causal stocks. Or we could go long the high causal and short the index to see if there is an “invisible” factor. Whatever the case, we’re interested to know what readers would like to see. More on generalized correlations or time to move on? Drop us a response at nbw dot osm at gmail. Until next time, the R and Python code are below.

One note on the Python code: we could not find an equivalent Python package to generalCorr. We did attempt to recreate it, but with the complexity of the functions and the dependencies (e.g., R package np), it was simply taking too long. We also considered running R from within Python (using the rpy package), but realized that was quite an undertaking too! Hence, we imported the causal list created in R into our Python environment and then ran the rest of the analysis. We apologize to the Pythonistas. We try to reproduce everything in both languages, but in this case it was a task better suited for a long-term project. We may, nevertheless, put a Python version on our TODO stack. If completed, we’ll post it here and on GitHub. Thanks for understanding!

R code:

# Built using R 3.6.2## Load packagessuppressPackageStartupMessages({  library(tidyverse)  library(tidyquant)  library(reticulate)  library(generalCorr)})## Load dataprices_xts <- readRDS("corr_2_prices_xts.rds")# Create function for rolling correlationmean_cor <- function(returns){  # calculate the correlation matrix  cor_matrix <- cor(returns, use = "pairwise.complete")    # set the diagonal to NA (may not be necessary)  diag(cor_matrix) <- NA    # calculate the mean correlation, removing the NA  mean(cor_matrix, na.rm = TRUE)}# Create return frames for manipulationcomp_returns <- ROC(prices_xts[,-1], type = "discrete") # kernel regressiontot_returns <- ROC(prices_xts, type = "discrete") # for generalCorr# Create data frame for regressioncorr_comp <- rollapply(comp_returns, 60,                        mean_cor, by.column = FALSE, align = "right")xli_rets <- ROC(prices_xts[,1], n=60, type = "discrete")# Merge series and create train-test splittotal_60 <- merge(corr_comp, lag.xts(xli_rets, -60))[60:(nrow(corr_comp)-60)]colnames(total_60) <- c("corr", "xli")split <- round(nrow(total_60)*.70)train_60 <- total_60[1:split,]test_60 <- total_60[(split+1):nrow(total_60),]# Create train set for generalCorrtot_split <- nrow(train_60)+60train <- tot_returns[1:tot_split,]test <- tot_returns[(tot_split+1):nrow(tot_returns),]# Graph originaal scatter plottrain_60 %>%   ggplot(aes(corr*100, xli*100)) +  geom_point(color = "darkblue", alpha = 0.4) +  labs(x = "Correlation (%)",       y = "Return (%)",       title = "Return (XLI) vs. correlation (constituents)") +  geom_smooth(method = "loess", formula = y ~ x, se=FALSE, size = 1.25, color = "red")# Create helper functioncause_mat <- function(df){  mat_1 <- df[,!apply(is.na(df),2, all)]  mat_1 <- as.matrix(coredata(mat_1))  out <- causeSummary(mat_1)  out <- as.data.frame(out)  out}# Create column and row indicescol_idx <- list(c(1:22), c(1,23:44), c(1,45:64))row_idx <- list(c(1:250), c(251:500), c(501:750), c(751:1000),                c(1001:1250), c(1251:1500), c(1501:1750), c(1751:2000),                c(2001:2250), c(2251:2500))# Create cause list for each period: which stocks cause the indexcause <- list()for(i in 1:length(row_idx)){  out <- list()  for(j in 1:length(col_idx)){    out[[j]] <- cause_mat(train[row_idx[[i]], col_idx[[j]]])  }  cause[[i]] <- out}# Bind cause into one listcause_lists <- list()for(i in 1:length(cause)){  out <- do.call("rbind", cause[[i]]) %>%    filter(cause != "xli") %>%    select(cause) %>%    unlist() %>%    as.character()  cause_lists[[i]] <- out}# Save cause_lists for use in Pythonmax_l <- 0for(i in 1:length(cause_lists)){  if(length(cause_lists[[i]]) > max_l){    max_l <- length(cause_lists[[i]])  }}write_l <- matrix(nrow = length(cause_lists), ncol = max_l)for(i in 1:length(cause_lists)){  write_l[i, 1:length(cause_lists[[i]])] <- cause_lists[[i]]           }write.csv(write_l, "cause_lists.csv")## Use cause list to run rolling correlations and aggregate forward returns for regressioncor_idx <- list(c(191:500), c(441:750), c(691:1000), c(941:1250),                c(1191:1500), c(1441:1750), c(1691:2000), c(1941:2250),                c(2191:2500))# Add 1 since xli is price while train is ret so begin date is off by 1 biz dayret_idx <- list(c(251:561), c(501:811), c(751:1061), c(1001:1311),                c(1251:1561), c(1501:1811), c(1751:2061), c(2001:2311),                c(2251:2561))merge_list <- list()for(i in 1:length(cor_idx)){  corr <- rollapply(train[cor_idx[[i]], cause_lists[[i]]], 60,                    
mean_cor, by.column = FALSE, align = "right")  ret <- ROC(prices_xts[ret_idx[[i]],1], n=60, type = "discrete")  merge_list[[i]] <- merge(corr = corr[60:310], xli = coredata(ret[61:311]))}# Run correlations on non cause listnon_cause_list <- list()for(i in 1:length(cor_idx)){  corr <- rollapply(train[cor_idx[[i]], !colnames(train)[-1] %in% cause_lists[[i]]], 60,                    mean_cor, by.column = FALSE, align = "right")  ret <- ROC(prices_xts[ret_idx[[i]],1], n=60, type = "discrete")  non_cause_list[[i]] <- merge(corr = corr[60:310], xli = coredata(ret[61:311]))}## Load datamerge_list <- readRDS("corr3_genCorr_list.rds")non_cause_list <- readRDS("corr3_genCorr_non_cause_list.rds")# Graphical example of one periodcause_ex <- merge_list[[3]]cause_ex$corr_non <- rollapply(train[cor_idx[[3]], !colnames(train)[-1] %in% cause_lists[[3]]],                                     60, mean_cor, by.column = FALSE, align = "right")[60:310]# Graph causal subset against returnscause_ex %>%  ggplot(aes(corr*100, xli*100)) +  geom_point(color = "blue") +  geom_smooth(method="lm", formula = y ~ x, se=FALSE, color = "darkgrey", linetype = "dashed")+  geom_smooth(method="loess", formula = y ~ x, se=FALSE, color = "darkblue") +  labs(x = "Correlation (%)",       y = "Return (%)",       title = "Return (XLI) vs. correlation (causal subset)")# Graph non causalcause_ex %>%  ggplot(aes(corr_non*100, xli*100)) +  geom_point(color = "blue") +  geom_smooth(method="lm", formula = y ~ x, se=FALSE, color = "darkgrey", linetype = "dashed")+  geom_smooth(method="loess", formula = y ~ x, se=FALSE, color = "darkblue") +  labs(x = "Correlation (%)",       y = "Return (%)",       title = "Return (XLI) vs. correlation (non-causal subset)")# Run modelscausal_kern <- kern(cause_ex$xli, cause_ex$corr)$R2causal_lin <- summary(lm(cause_ex$xli ~ cause_ex$corr))$r.squarednon_causal_kern <- kern(cause_ex$xli, cause_ex$corr_non)$R2non_causal_lin <- summary(lm(cause_ex$xli ~ cause_ex$corr_non))$r.squared# Show tabledata.frame(Models = c("Kernel", "Linear"),            Causal = c(causal_kern, causal_lin),           `Non-causal` = c(non_causal_kern, non_causal_lin),           check.names = FALSE) %>%   mutate_at(vars('Causal', `Non-causal`), function(x) round(x,3)*100) %>%   knitr::kable(caption = "Regression R-squareds (%)")## Linear regressionmodels <- list()for(i in 1:length(merge_list)){  models[[i]] <- lm(xli~corr, merge_list[[i]])}model_df <- data.frame(model = seq(1,length(models)),                       rsq = rep(0,length(models)),                       t_int = rep(0,length(models)),                       t_coef = rep(0,length(models)),                       P_int = rep(0,length(models)),                       p_coef = rep(0,length(models)))for(i in 1:length(models)){  model_df[i,2] <- broom::glance(models[[i]])[1]  model_df[i,3] <- broom::tidy(models[[i]])[1,4]  model_df[i,4] <- broom::tidy(models[[i]])[2,4]  model_df[i,5] <- broom::tidy(models[[i]])[1,5]  model_df[i,6] <- broom::tidy(models[[i]])[2,5]}start <- index(train)[seq(250,2250,250)] %>% year()end <- index(train)[seq(500,2500,250)] %>% year()model_dates <- paste(start, end, sep = "-")model_df <- model_df %>%  mutate(model_dates = model_dates) %>%  select(model_dates, everything())## Kernel regresssionkernel_models <- list()for(i in 1:length(merge_list)){  kernel_models[[i]] <- kern(merge_list[[i]]$xli, merge_list[[i]]$corr)}kern_model_df <- data.frame(model_dates = model_dates,                       rsq = rep(0,length(kernel_models)),                       rmse = 
rep(0,length(kernel_models)),                       rmse_scaled = rep(0,length(kernel_models)))for(i in 1:length(kernel_models)){  kern_model_df[i,2] <- kernel_models[[i]]$R2  kern_model_df[i,3] <- sqrt(kernel_models[[i]]$MSE)  kern_model_df[i,4] <- sqrt(kernel_models[[i]]$MSE)/sd(merge_list[[i]]$xli)}## Load datamodel_df <- readRDS("corr3_lin_model_df.rds")kern_model_df <- readRDS("corr3_kern_model_df.rds")## R-squared graphdata.frame(Dates = model_dates,            Linear = model_df$rsq,           Kernel = kern_model_df$rsq) %>%   gather(key, value, -Dates) %>%   ggplot(aes(Dates, value*100, fill = key)) +  geom_bar(stat = "identity", position = "dodge") +  scale_fill_manual("", values = c("blue", "darkgrey")) +  labs(x = "",       y = "R-squared (%)",       title = "R-squared output for regression results by period and model") +  theme(legend.position = c(0.06,0.9),        legend.background = element_rect(fill = NA)) # NOn_causal linear modelnon_models <- list()for(i in 1:length(reg_list)){  non_models[[i]] <- lm(xli~corr, non_cause_list[[i]])}non_model_df <- data.frame(model = seq(1,length(models)),                           rsq = rep(0,length(models)),                           t_int = rep(0,length(models)),                           t_coef = rep(0,length(models)),                           P_int = rep(0,length(models)),                           p_coef = rep(0,length(models)))for(i in 1:length(non_models)){  non_model_df[i,2] <- broom::glance(non_models[[i]])[1]  non_model_df[i,3] <- broom::tidy(non_models[[i]])[1,4]  non_model_df[i,4] <- broom::tidy(non_models[[i]])[2,4]  non_model_df[i,5] <- broom::tidy(non_models[[i]])[1,5]  non_model_df[i,6] <- broom::tidy(non_models[[i]])[2,5]}non_model_df <- non_model_df %>%  mutate(model_dates = model_dates) %>%  select(model_dates, everything())# Bar chart of causal and non-causaldata.frame(Dates = model_dates,            `Linear--causal` = model_df$rsq,           `Linear--non-causal` = non_model_df$rsq,           Kernel = kern_model_df$rsq,           check.names = FALSE) %>%   gather(key, value, -Dates) %>%   ggplot(aes(Dates, value*100, fill = key)) +  geom_bar(stat = "identity", position = "dodge") +  scale_fill_manual("", values = c("blue", "darkgrey", "darkblue")) +  labs(x = "",       y = "R-squared (%)",       title = "R-squared output for regression results by period and model") +  theme(legend.position = c(0.3,0.9),        legend.background = element_rect(fill = NA)) ## RMSE comparisonlin_rmse <- c()lin_non_rmse <- c()kern_rmse <- c()for(i in 1:length(models)){  lin_rmse[i] <- sqrt(mean(models[[i]]$residuals^2))  lin_non_rmse[i] <- sqrt(mean(non_models[[i]]$residuals^2))  kern_rmse[i] <- sqrt(kernel_models[[i]]$MSE)}data.frame(Dates = model_dates,            `Linear--causal` = lin_rmse,           `Linear--non-causal` = lin_non_rmse,           Kernel = kern_rmse,           check.names = FALSE) %>%   gather(key, value, -Dates) %>%   ggplot(aes(Dates, value*100, fill = key)) +  geom_bar(stat = "identity", position = "dodge") +  scale_fill_manual("", values = c("blue", "darkgrey", "darkblue")) +  labs(x = "",       y = "RMSE (%)",       title = "RMSE results by period and model") +  theme(legend.position = c(0.08,0.9),        legend.background = element_rect(fill = NA)) ## RMSE graphdata.frame(Dates = model_dates,            `Kernel - Linear-causal` = lin_rmse - kern_rmse,           `Kernel - Linear--non-causal` = lin_non_rmse - kern_rmse ,           check.names = FALSE) %>%   gather(key, value, -Dates) %>%   ggplot(aes(Dates, value*100, 
fill = key)) +  geom_bar(stat = "identity", position = "dodge") +  scale_fill_manual("", values = c("darkgrey", "darkblue")) +  labs(x = "",       y = "RMSE (%)",       title = "RMSE differences by period and model") +  theme(legend.position = c(0.1,0.9),        legend.background = element_rect(fill = NA))avg_lin <- round(mean(lin_rmse - kern_rmse),3)*100avg_lin_non <- round(mean(lin_non_rmse - kern_rmse),3)*100## Price graphprices_xts["2010/2014","xli"] %>%    ggplot(aes(index(prices_xts["2010/2014"]), xli)) +  geom_line(color="blue", size = 1.25) +  labs(x = "",       y = "Price (US$)",       title = "XLI price log-scale") +  scale_y_log10()

Python code:

# Built using Python 3.7.4

## Import packages
import numpy as np
import pandas as pd
import pandas_datareader as dr
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,6)
plt.style.use('ggplot')

## Load data
prices = pd.read_pickle('xli_prices.pkl')
xli = pd.read_pickle('xli_etf.pkl')
returns = prices.drop(columns = ['OTIS', 'CARR']).pct_change()
returns.head()
xli_rets = xli.pct_change(60).shift(-60)

## Import cause_lists created using R
# See R code above to create
cause_lists = pd.read_csv("cause_lists.csv", header=None)
cause_lists = cause_lists.iloc[1:,1:]

## Define correlation function
def mean_cor(df):
    corr_df = df.corr()
    np.fill_diagonal(corr_df.values, np.nan)
    return np.nanmean(corr_df.values)

## Create data frames and train-test splits
corr_comp = pd.DataFrame(index=returns.index[59:])
corr_comp['corr'] = [mean_cor(returns.iloc[i-59:i+1,:]) for i in range(59,len(returns))]
xli_rets = xli.pct_change(60).shift(-60)
total_60 = pd.merge(corr_comp, xli_rets, how="left", on="Date").dropna()
total_60.columns = ['corr', 'xli']
split = round(len(total_60)*.7)
train_60 = total_60.iloc[:split,:]
test_60 = total_60.iloc[split:, :]
tot_returns = pd.merge(xli, prices.drop(columns = ["CARR", "OTIS"]), "left", "Date")
tot_returns = tot_returns.rename(columns = {'Adj Close': 'xli'})
tot_returns = tot_returns.pct_change()
tot_split = len(train_60)+60
train = tot_returns.iloc[:tot_split,:]
test = tot_returns.iloc[tot_split:len(tot_returns),:]
train.head()

## Create period indices to run pairwise correlations and forward returns for regressions
cor_idx = np.array((np.arange(190,500), np.arange(440,750), np.arange(690,1000), np.arange(940,1250),
                np.arange(1190,1500), np.arange(1440,1750), np.arange(1690,2000), np.arange(1940,2250),
                np.arange(2190,2500)))

# Add 1 since xli is price while train is ret so begin date is off by 1 biz day
ret_idx = np.array((np.arange(250,561), np.arange(500,811), np.arange(750,1061), np.arange(1000,1311),
                np.arange(1250,1561), np.arange(1500,1811), np.arange(1750,2061), np.arange(2000,2311),
                np.arange(2250,2561)))

# Create separate data arrays using cause_lists and indices
# Causal subset
merge_list = [0]*9
for i in range(len(cor_idx)):
    dat = train.reset_index().loc[cor_idx[i], cause_lists.iloc[i,:].dropna()]
    corr = [mean_cor(dat.iloc[i-59:i+1,:]) for i in range(59,len(dat))]
    ret1 = xli.reset_index().iloc[ret_idx[i],1]
    ret1 = ret1.pct_change(60).shift(-60).values
    ret1 = ret1[~np.isnan(ret1)]
    merge_list[i] = np.c_[corr, ret1]

# Non-causal subset
non_cause_list = [0] * 9
for i in range(len(cor_idx)):
    non_c = [x for x in list(train.columns[1:]) if x not in cause_lists.iloc[3,:].dropna().to_list()]
    dat = train.reset_index().loc[cor_idx[i], non_c]
    corr = [mean_cor(dat.iloc[i-59:i+1,:]) for i in range(59,len(dat))]
    ret1 = xli.reset_index().iloc[ret_idx[i],1]
    ret1 = ret1.pct_change(60).shift(-60).values
    ret1 = ret1[~np.isnan(ret1)]
    non_cause_list[i] = np.c_[corr, ret1]

# Create single data set for example
cause_ex = np.c_[merge_list[2], non_cause_list[2][:,0]]

# Run linear regression
from sklearn.linear_model import LinearRegression
X = cause_ex[:,0].reshape(-1,1)
y = cause_ex[:,1]
lin_reg = LinearRegression().fit(X,y)
y_pred = lin_reg.predict(X)

# Graph scatterplot with lowess and linear regression
import seaborn as sns
sns.regplot(cause_ex[:,0]*100, cause_ex[:,1]*100, color = 'blue', lowess=True,
            line_kws={'color':'darkblue'}, scatter_kws={'alpha':0.4})
plt.plot(X*100, y_pred*100, color = 'darkgrey', linestyle='dashed')
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (causal subset)")
plt.show()

# Run linear regression on non-causal component of cause_ex data frame
from sklearn.linear_model import LinearRegression
X_non = cause_ex[:,2].reshape(-1,1)
y = cause_ex[:,1]
lin_reg_non = LinearRegression().fit(X_non,y)
y_pred_non = lin_reg_non.predict(X_non)

# Graph scatter plot
sns.regplot(cause_ex[:,2]*100, cause_ex[:,1]*100, color = 'blue', lowess=True,
            line_kws={'color':'darkblue'}, scatter_kws={'alpha':0.4})
plt.plot(X_non*100, y_pred_non*100, color = 'darkgrey', linestyle='dashed')
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (non-causal subset)")
plt.show()

## Run regressions on cause_ex
from sklearn_extensions.kernel_regression import KernelRegression
import statsmodels.api as sm
x = cause_ex[:,0]
X = sm.add_constant(x)
x_non = cause_ex[:,2]
X_non = sm.add_constant(x_non)
y = cause_ex[:,1]
lin_c = sm.OLS(y,X).fit().rsquared*100
lin_nc = sm.OLS(y,X_non).fit().rsquared*100

# Note KernelRegressions() returns different results than kern() from generalCorr
kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5,5,10))
kr.fit(X,y)
kr_c = kr.score(X,y)*100
kr.fit(X_non, y)
kr_nc = kr.score(X_non, y)*100
print(f"R-squared for kernel regression causal subset: {kr_c:0.01f}")
print(f"R-squared for kernel regression non-causal subset: {kr_nc:0.01f}")
print(f"R-squared for linear regression causal subset: {lin_c:0.01f}")
print(f"R-squared for linear regression non-causal subset: {lin_nc:0.01f}")

## Run regressions on data lists
import statsmodels.api as sm

# Causal subset linear model
lin_mod = []
for i in range(len(merge_list)):
    x = merge_list[i][:,0]
    X = sm.add_constant(x)
    y = merge_list[i][:,1]
    mod_reg = sm.OLS(y,X).fit()
    lin_mod.append(mod_reg.rsquared)

start = train.index[np.arange(249,2251,250)].year
end = train.index[np.arange(499,2500,250)].year
model_dates = [str(x)+"-"+str(y) for x,y in zip(start,end)]

# Non-causal subset linear model
non_lin_mod = []
for i in range(len(non_cause_list)):
    x = non_cause_list[i][:,0]
    X = sm.add_constant(x)
    y = non_cause_list[i][:,1]
    mod_reg = sm.OLS(y,X).fit()
    non_lin_mod.append(mod_reg.rsquared)

# Causal subset kernel regression
from sklearn_extensions.kernel_regression import KernelRegression
kern = []
for i in range(len(merge_list)):
    X = merge_list[i][:,0].reshape(-1,1)
    y = merge_list[i][:,1]
    kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5,5,10))
    kr.fit(X,y)
    kern.append(kr.score(X,y))

## Plot R-squared comparisons
# Causal kernel vs. linear
df = pd.DataFrame(np.c_[np.array(kern)*100, np.array(lin_mod)*100], columns = ['Kernel', 'Linear'])
df.plot(kind='bar', color = ['blue','darkgrey'])
plt.xticks(ticks = df.index, labels=model_dates, rotation=0)
plt.legend(loc = 'upper left')
plt.show()

# Causal kernel vs causal & non-causal linear
df = pd.DataFrame(np.c_[np.array(kern)*100, np.array(lin_mod)*100, np.array(non_lin_mod)*100],
                  columns = ['Kernel', 'Linear-causal', 'Linear--non-causal'])
df.plot(kind='bar', color = ['blue','darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks = df.index, labels=model_dates, rotation=0)
plt.legend(bbox_to_anchor=(0.3, 0.9), loc = 'center')
plt.ylabel("R-squared (%)")
plt.title("R-squared output for regression results by period and model")
plt.show()

## Create RMSE lists
lin_rmse = []
for i in range(len(merge_list)):
    x = merge_list[i][:,0]
    X = sm.add_constant(x)
    y = merge_list[i][:,1]
    mod_reg = sm.OLS(y,X).fit()
    lin_rmse.append(np.sqrt(mod_reg.mse_resid))

lin_non_rmse = []
for i in range(len(non_cause_list)):
    x = non_cause_list[i][:,0]
    X = sm.add_constant(x)
    y = non_cause_list[i][:,1]
    mod_reg = sm.OLS(y,X).fit()
    lin_non_rmse.append(np.sqrt(mod_reg.mse_resid))

kern_rmse = []
for i in range(len(merge_list)):
    X = merge_list[i][:,0].reshape(-1,1)
    y = merge_list[i][:,1]
    kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5,5,10))
    kr.fit(X,y)
    rmse = np.sqrt(np.mean((kr.predict(X)-y)**2))
    kern_rmse.append(rmse)

## Graph RMSE comparisons
df = pd.DataFrame(np.c_[np.array(kern_rmse)*100, np.array(lin_rmse)*100, np.array(lin_non_rmse)*100],
                  columns = ['Kernel', 'Linear-causal', 'Linear--non-causal'])
df.plot(kind='bar', color = ['blue','darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks = df.index, labels=model_dates, rotation=0)
plt.legend(loc = 'upper left')
plt.ylabel("RMSE (%)")
plt.title("RMSE results by period and model")
plt.show()

## Graph RMSE differences
kern_lin = [x-y for x,y in zip(lin_rmse, kern_rmse)]
kern_non = [x-y for x,y in zip(lin_non_rmse, kern_rmse)]
df = pd.DataFrame(np.c_[np.array(kern_lin)*100, np.array(kern_non)*100],
                  columns = ['Kernel - Linear-causal', 'Kernel - Linear--non-causal'])
df.plot(kind='bar', color = ['darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks = df.index, labels=model_dates, rotation=0)
plt.legend(loc = 'upper left')
plt.ylabel("RMSE (%)")
plt.title("RMSE differences by period and model")
plt.show()

## Graph XLI
from matplotlib.ticker import ScalarFormatter  # import added; needed for the axis formatters below
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(xli["2010":"2014"], color='blue')
ax.set_label("")
ax.set_ylabel("Price (US$)")
ax.set_yscale("log")
ax.yaxis.set_major_formatter(ScalarFormatter())
ax.yaxis.set_minor_formatter(ScalarFormatter())
ax.set_title("XLI price log-scale")
plt.show()

  1. We’d like to thank Prof. Vinod for providing us an overview of his package. The implementation of the package is our own: we take all the credit for any errors.↩

  2. Granger causality from C.W. Granger’s 1969 paper “Investigating Causal Relations by Econometric Methods and Cross Spectral Methods”↩

  3. We used ggplot’s loess method for a non-parametric model simply to make the coding easier. A bit lazy, but we wanted to focus on the other stuff.↩


To leave a comment for the author, please follow the link and comment on their blog: R on OSM.


The post Round about the kernel first appeared on R-bloggers.

ROC Day at BARUG


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This week, the Bay Area useR Group (BARUG) held a mini-conference focused on ROC curves. Talks covered the history of the ROC, extending ROC analysis to multiclass problems, various ways to think about and interpret ROC curves, and how to translate concrete business goals into the ROC framework and pick the optimal threshold for a given problem.

Some History

I introduced the session with a very brief eclectic “history” of the ROC anchored on a few key papers that seem to me to represent inflection points in its development and adoption.

Anecdotal accounts of the early ROC, such as this brief mention in Deranged Physiology, make it clear that Receiver Operating Characteristic referred to the ability of a radar technician, sitting at a receiver, to look at a blip on the screen and distinguish an aircraft from background noise. The DoD report written by Peterson and Birdsall in 1953 shows that the underlying mathematical theory, and many of the statistical characteristics of the ROC, had already been worked out by that time. Thereafter (see the references below), the ROC became a popular tool in Psychology, Medicine and many other disciplines seeking to make optimal decisions based on the ability to detect signals.

Jumping to “modern times”, in his 1996 paper Bradley argues for the ROC to replace overall accuracy as the single best measure of classifier performance. Given the prevalent use of ROC curves, it is interesting to contemplate a time when that was not so. Finally, the landmark 2009 paper by David Hand indicates that soon after the adoption of the ROC, researchers were already noticing problems using the area under the curve (AUC) to compare the performance of classifiers whose ROC curves cross. Additionally, Hand observes that:

(The AUC) is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. …

Hand goes on to propose H Measure as an alternative to AUC.

Multiclass Classification

In his talk, ROC Curves extended to multiclass classification, and how they do or do not map to the binary case (slides here), Mario Inchiosa discusses extensions of the ROC curve to multiclass classification and why these extensions don’t all apply to the binary case. He distinguishes between multiclass and multilabel classification and discusses the pros and cons of different averaging techniques in the multiclass One vs. Rest scenario. He also points (see references below) to both R and scikit-learn packages useful in this kind of analysis.

Interpreting the ROC

In his highly original talk, Six Ways to Think About ROC Curves (slides here), Robert Horton challenges you to see the ROC curve from multiple perspectives. Even if you have been working with ROC curves for some time, you are likely to learn something new here. The “Turtle’s Eye” view is eye-opening for many.

  1. The discrete “Turtle’s Eye” view, where labeled cases are sorted by score, and the path of the curve is determined by the order of positive and negative cases.
  2. The categorical view, where we have to handle tied scores, or when scores put cases in sortable buckets.
  3. The continuous view, where the cumulative distribution function (CDF) for the positive cases is plotted against the CDF for the negative cases.
  4. The ROC curve can be thought of as the limit of the cumulative gain curve (or “Total Operating Characteristic” curve) as the prevalence of positive cases goes to zero.
  5. The probabilistic view, where AUC is the probability that a randomly chosen positive case will have a higher score than a randomly chosen negative case.
  6. The ROC curve emerges from a graphical interpretation of the Mann-Whitney-Wilcoxon U test statistic, which illustrates how AUC relates to this commonly used non-parametric hypothesis test (a small R sketch of this relationship follows the list).
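To make the probabilistic and rank-based views in points 5 and 6 concrete, here is a minimal R sketch on simulated scores; the simulated data and variable names are our own illustration rather than material from the talks:

# Simulate classifier scores for 100 positive and 100 negative cases
set.seed(42)
scores_pos <- rnorm(100, mean = 1)
scores_neg <- rnorm(100, mean = 0)

# Probabilistic view: proportion of (positive, negative) pairs ranked correctly
auc_prob <- mean(outer(scores_pos, scores_neg, ">"))

# Wilcoxon/Mann-Whitney view: the U statistic divided by the number of pairs
w <- wilcox.test(scores_pos, scores_neg)$statistic
auc_wilcox <- as.numeric(w) / (length(scores_pos) * length(scores_neg))

c(probabilistic = auc_prob, wilcoxon = auc_wilcox)

With continuous scores (no ties), the two numbers agree exactly, which is the equivalence the last two views rest on.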

Picking the Optimal Utility Threshold

John Mount closed out the evening with a talk on How to Pick an Optimal Utility Threshold Using the ROC Plot (slides here), presenting some original work on how to translate concrete business goals into the ROC framework and then use the ROC plot to pick the optimal classification threshold for a given problem. John emphasizes the advantages of working with parametric representations of ROC curves and the importance of discovering utility requirements through iterated negotiation. All of this flows from John’s original and insightful definition of an ROC plot.

Finally, the Zoom video covering the talks by Inchiosa, Horton and Mount is well worth watching.

Horton Talk References

Inchiosa Talk References

Rickert Talk References

  • Bradley (1996) The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms – recommends ROC replace overall accuracy as a single measure of classifier performance
  • Deranged Physiology ROC characteristic of radar operator
  • Hajian-Tilaki (2013) Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation
  • Hand (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve
  • McClelland (2011) Use of Signal Detection Theory as a Tool for Enhancing Performance and Evaluating Tradecraft in Intelligence Analysis
  • Lusted (1984) Editorial on medical uses of ROC
  • Pelli and Farell (1995) Psychophysical Methods
  • Peterson and Birdsall (1953) DoD Report on The Theory of Signal Detectability – Early paper referencing ROC
  • Woodward (1953) Probability and Information Theory, with Applications to Radar – early book mentioning ROC
  • hmeasure The H-Measure and Other Scalar Classification Performance Metrics



To leave a comment for the author, please follow the link and comment on their blog: R Views.


The post ROC Day at BARUG first appeared on R-bloggers.


Where Does RStudio Fit into Your Cloud Strategy?


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Hero image

Photo by Mantas Hesthaven on Unsplash



Over the last few years, more companies have begun migrating their data science work to the cloud. As they do, they naturally want to bring along their favorite data science tools, including RStudio, R, and Python. In this blog post, we discuss the various ways RStudio products can help you along that journey.

Why Do Organizations Want to Move to the Cloud?

There are many reasons why organizations are looking to use cloud services more widely for data science. They include:

  • Long delays and high startup costs for new data science teams: When you bring a new team of data scientists onboard, it can be costly and time consuming to spin up the necessary hardware for the team. New hardware might be needed for developing data science analyses or for sharing interactive Shiny applications for stakeholders. These burdens tend to fall either on the individual data scientists or on DevOps and IT administrators who are responsible for configuring servers.
  • Obstacles to collaboration between organizations or groups: If a team is restricted to operating within their organization’s firewall, it can be very difficult to support collaboration or instruction between groups that don’t normally interact with each other. For example, running a data science workshop or statistics class can be unwieldy if everyone is working within their own separate environments.
  • High costs of computing infrastructure: Another key challenge is the potentially high costs of setting up and maintaining an organization’s computing infrastructure, including both hardware and software. These costs include the initial investments, maintenance and upgrade fees, and the related manpower costs.
  • Difficulty scaling to meet variable demand: Scaling server resources to satisfy highly variable data science demands can be very difficult because organizations rarely maintain excess capacity. For example, an organization may want to publish a news article or a COVID dashboard for which they expect high demand, only to discover that it needs the IT organization to spin up a back-end Kubernetes cluster to handle the load.
  • Excessive time and costs moving the data to the analysis: If an organization’s data is already stored on one of the major cloud providers or in a remote data center, moving that data to your laptop for analysis can be slow and expensive. Ideally, you should perform the data access, transformation and analysis as close to where the data lives as possible. Not doing so could subject you to excessive data transfer charges to move the data.

Let Your Data Science Goals Drive Your Cloud Strategy

Depending on the circumstances of your organization and what specific challenges you are trying to address, you should consider four possible options for your data science cloud strategy:

  • Hosted and Software as a Service (SaaS) offerings: A fully hosted service can minimize the cost and time required to start up a new project. However, functionality may be limited compared to on premise offerings and integration with your internal data and infrastructure can be challenging.
  • Deployment to a Virtual Private Cloud (VPC) provider: Deploying software on a major cloud platform such as Amazon Web Services (AWS) or Azure can provide the full flexibility and customization of on premise software. However, setting up a virtual private cloud application often requires more management overhead to integrate with your internal systems as well as careful administration of usage to avoid unexpected usage charges.
  • Cloud marketplace Offerings: Pre-built applications offered on services such as the AWS and Azure Marketplaces make it easy to get started at a pay-as-you-go hourly cost, but require careful management to ensure the software is available and running only when needed.
  • Data science in your data lake: By embedding your data science tools into your existing data platform, your computations can be run close to the data, minimize overhead, and easily tie into your data pipeline. However, this adds additional complexity and potential limitations.

We’ve provided the table below to help you assess the various RStudio cloud offerings. It matches up problems and potential solutions with specific RStudio options and resources to consider. The options are arranged in order of increasing complexity of configuration and administration.

Table 1: Summary of Cloud Options for RStudio Software
Problem | Potential Solution | Pros and Cons | Options to consider
Problem: Simplify and reduce startup costs. Potential solution: SaaS/Hosted offering.
Pros:
  • Simplest and lowest cost to deploy
  • Hardware and software managed by the provider
  • Costs may be fixed, variable or a mix of the two
Cons:
  • Limited integration with your organization’s internal data and security protocols.
  • May not be cost efficient for large groups
  • May have limited options for custom configuration
Create data science analyses with RStudio Cloud
Share Shiny applications with shinyapps.io
Manage packages with RStudio Public Package Manager, a free service to provide easy installation of package binaries, and access to previous package versions
Problem: Promote collaboration or instruction between organizations or groups. Potential solution: SaaS/Hosted offering.
Pros:
  • Same pros as above, plus the ability to easily share projects
Cons:
  • Same cons as above
Share projects or teach classes/workshops with RStudio Cloud
Problem: Mitigate high costs of computing infrastructure. Potential solution: Marketplace offerings.
Pros:
  • Easy to get started at minimal, pay-as-you-go (hourly) cost.
  • Access to specialized hardware (e.g GPUs)
Cons:
  • To manage hourly costs, careful management is required to ensure software is running only when needed
RStudio products on AWS Marketplace, Azure Marketplace, and Google Cloud Platform.
Potential solution (for the same problem): Deployment to a VPC on a major cloud provider.
Pros:
  • Outsources hardware costs
  • Integrates with existing analytic assets on cloud platforms
  • Allows easy customization and configuration
  • Provides access to specialized hardware (e.g GPUs)
  • Ensures data sovereignty by running your processes in a local cloud region
Cons:
  • Complexity of managing software configuration and integration with your organization’s on-premise data and security protocols.
  • Costs may be highly variable, based on usage
Deploy RStudio products in a VPC, using cloud formation templates for AWS and Azure ARM template (See RStudio Cloud Tools)
Deploy RStudio products via Docker e.g. use EKS (Elastic Kubernetes Service) on AWS. (See Docker images for RStudio Professional Products)
Connect to cloud based data storage, such as Redshift or S3.
Problem: Scale to meet variable demand. Potential solution: Clustering approaches, including Kubernetes.
Pros:
  • Cloud-deployed applications can be easily scaled to meet demand, since cloud providers provide container resources on demand.
Cons:
  • Careful management required to avoid unnecessary compute costs, while still matching job requirements to computational needs.
In addition to the points above, RStudio Server Pro’s Launcher integrates with Kubernetes, an industry-standard clustering solution that allows efficient scaling.
RStudio Connect provides many options to scale and tune performance, including being part of an autoscaling group. These options allow Connect to deliver dashboards, Shiny applications, and other types of content to large numbers of users.
Problem: Minimize data movement. Potential solution: Data lakes.
Pros:
  • Run your computations close to the data, minimizing overhead
  • Tie your data science directly into your data pipeline
Cons:
  • Adds additional complexity and potential limitations
Connect to cloud based data storage, such as Redshift or S3.
Managed RStudio Server Pro on Spark and Hadoop on Azure and AWS (Cazena)

Ready to Take RStudio to the Cloud?

If you’d like to take RStudio along on your journey to the cloud, you can start by exploring the resources linked in the table above. We also invite you to join us on December 2 for a webinar, “What does it mean to do data science in the cloud?”, conducted with our partner ProCogia. You can register for the webinar here.

Our product team is also happy to provide advice and guidance along this journey. If you’d like to set up a time to talk with us, you can book a time here. We look forward to being your guide.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


The post Where Does RStudio Fit into Your Cloud Strategy? first appeared on R-bloggers.

Installing V8 is now even easier


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Google’s amazing V8 JavaScript/WASM engine is probably one of the most sophisticated open-source software libraries available today. It is used to power the computation in Google Chrome, NodeJS, and also CloudFlare Workers, which make it possible to run code for your website inside the CDN edges.

The R package V8 exposes this same engine in R, and has been on CRAN since 2014. It is used by many R packages to wrap JavaScript libraries, such as geojsonio, jsonld, DiagrammeR, or almanac. Recently we have seen an increase in usage because the latest version of rstan now uses V8 for their parser.

However some rstan users complained that they found V8 difficult to install on Linux servers. This release tries to make that even easier.

Installing V8 on Windows / MacOS

Installing V8 from CRAN on Windows and MacOS works out of the box:

install.packages("V8")

The V8 engine is statically linked with the R package, so there are no external dependencies. Everything just works.

Installing V8 on Linux, the usual way

Because R packages on Linux are always installed from source, you need to install the V8 C++ engine separately. This is easy to do, for example on Ubuntu/Debian you use:

# Debian / Ubuntu
sudo apt-get install libv8-dev

And on Fedora/CentOS you would need:

# Fedora / CentOS
sudo yum install v8-devel

Once the V8 engine is installed, you can install the R package using the regular install.packages("V8") and everything will work as usual.

Installing V8 on Linux, the alternative way

For most users, the instructions above are all you need. However some Linux users complained that they had difficulty getting V8, for example because they do not have sudo permissions, or because they are on a Linux distribution that does not provide the V8 engine (e.g. Gentoo Linux or OpenSuse).

Therefore we added an alternative installation method for Linux to automatically download a static build of libv8 during package installation. Simply set an env variable DOWNLOAD_STATIC_LIBV8 when installing the package, for example:

# For Linux: download libv8 during installation
Sys.setenv(DOWNLOAD_STATIC_LIBV8=1)
install.packages("V8")

This way, you can install the V8 package on any x64 Linux system, without separate system requirements.

Another benefit over the other method is that this gives you a more recent version of the V8 engine than what ships with some Linux distributions. We found that it works so well that we decided to enable this by default on Travis and GitHub Actions. For local installations, however, you need to opt in via the environment variable above.
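If you want the opt-in to persist across sessions on a server (for example for unattended package installs), one option is to put the variable in a user-level .Renviron file. This is a generic R sketch rather than anything from the V8 documentation:

# Append the opt-in to ~/.Renviron so future source installs of V8 pick it up
renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
cat("DOWNLOAD_STATIC_LIBV8=1\n", file = renviron, append = TRUE)
readRenviron(renviron)   # reload so the current session sees the variable
install.packages("V8")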

I hope that this takes away the last bit of friction in taking advantage of the amazing features of V8 in R, and that it is now safe to depend on this package. At least it has made some rstan users very happy.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


The post Installing V8 is now even easier first appeared on R-bloggers.

Python and R – Part 2: Visualizing Data with Plotnine


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Article Update

Interested in more Python and R tutorials?

👉Register for our blog to get new articles as we release them.


Introduction

In this post, we pick up where we left off in Python and R – Part 1: Exploring Data with Datatable. In the chunk below, we load our cleaned-up big MT Cars data set so that we can refer to variables directly, without a short code or the f function from datatable. We also load plotnine under the short code p9. We found this cumbersome relative to the R behavior, but given how many different ggplot functions we use when exploring a data set, it is hard to know in advance which functions to load into the namespace. Our own experience, and the mixed reviews we have read from others, suggest that matplotlib and seaborn are not very intuitive and probably not better than ggplot. If we can port over a familiar library and avoid a learning curve, it is a win. As we mentioned in our previous post, plotnine feels very similar to ggplot, with a few exceptions. We will put the library through its paces below.

# R Libraries
library("reticulate")

knitr::opts_chunk$set(
  fig.width = 15,
  fig.height = 8,
  out.width = '100%')
# Choose Python 3.7 miniconda
reticulate::use_condaenv(
  condaenv = "r-reticulate",
  required = TRUE
  )
# Install Python packages
lapply(c("plotnine"), function(package) {
       conda_install("r-reticulate", package, pip = TRUE)
})
# Python libraries
from datatable import *
import numpy as np
import plotnine as p9
import re
# Load cleaned vehicles
big_mt = fread("~/Desktop/David/Projects/general_working/mt_cars/vehicles_cleaned.csv")

# Export names to list to add to dictionary
expr = [exp for exp in big_mt.export_names()]
names = big_mt.names

# Assign all exported name expressions to variable names
names_dict = { names[i]: expr[i] for i in range(len(names)) }
locals().update(names_dict)

Consolidate make Into Parent manufacturer

In the previous post, we collapsed VClass from 35 overlapping categories down to 7. Here, we similarly consolidate many brands in make within their parent producers. Automotive brands often change hands, and there have been some large mergers over the years, such as Fiat and Chrysler in 2014 and the upcoming combination with Peugeot, making this somewhat of a crude exercise. We used the standard that the brand is owned by the parent currently, but this may not have been the case over most of the period shown in the charts below. This can also affect the parent’s efficiency compared to peers. For example, Volkswagen bought a portfolio of luxury European gas guzzlers over the recent period, so its position is being pulled down from what would otherwise be one of the most efficient brands.

# Control flow statement used to collapse Make levels
def collapse_make(make):
  manufacturer = str()
  if make in ['Pontiac', 'Oldmobile', 'Cadillac', 'Chevrolet', 'Buick', 'General Motors', 'Saturn', 'GMC']:
      manufacturer = 'GM'
  elif make in ['Ford', 'Mercury', 'Lincoln']:
      manufacturer = 'Ford'
  elif make in ['Toyota', 'Lexus', 'Scion']:
      manufacturer = 'Toyota'
  elif make in ['Nissan', 'Infiniti', 'Renault', 'Mitsubishi']:
      manufacturer = 'Nissan'
  elif make in ['Volkswagen', 'Audi', 'Porshe', 'Bentley', 'Bentley', 'Bugatti', 'Lamborghini']:
      manufacturer = 'Volkswagen'
  elif make in ['Chrysler', 'Plymouth', 'Dodge', 'Jeep', 'Fiat', 'Alfa Romeo', 'Ram']:
      manufacturer = 'Chrysler'
  elif make in ['Honda', 'Acura']:
      manufacturer = 'Honda'
  elif make in ['BMW', 'Rolls Royce', 'MINI']:
      manufacturer = 'BMW'
  elif make in ['Isuzu', 'Suburu', 'Kia', 'Hyundai', 'Mazda', 'Tata', 'Genesis']:
      manufacturer = 'Other Asian'
  elif make in ['Volvo', 'Saab', 'Peugeot', 'Land Rover', 'Jaguar', 'Ferrari']:
      manufacturer = 'Other Euro'
  else:
    manufacturer = 'Other'
  return manufacturer

# Set up vclass of categories list for iteration
vclass = big_mt[:, VClass].to_list()[0]
big_mt[:, 'vehicle_type'] = Frame(['Cars' if re.findall('Car', item) else 'Trucks' for item in vclass]).to_numpy()

# Consolidate make under parents
#manufacturers = [tup[0] for tup in big_mt[:, 'make'].to_tuples()]
big_mt[:,'manufacturer'] = Frame([collapse_make(line[0]) for line in big_mt[:, 'make'].to_tuples()])

# Assign expressions to new variables
vehicle_type, manufacturer = big_mt[:, ('vehicle_type', 'manufacturer')].export_names()

Imports Started Ahead and Improved Efficiency More

Here, we selected the largest-volume brands in two steps, first creating a numpy vector of makes which sold more than 1500 separate models over the full period, and then creating an expression to filter for the most popular. Then, we iterated over our vector and classified vehicles as ‘Cars’ or ‘Trucks’ based on regex matches to build a new vehicle_type variable. We would love to know a more streamlined way to accomplish these operations, because they would surely be easier for us using data.table. Excluding EVs, we found the combined mean mpg by year and make for both cars and trucks. It could be that we are missing something, but this also feels more verbose than it would have been in data.table, where we probably could have nested the filtering expressions within the frames; then again, this could be our weakness in Python.

# Filter for brands with most models over full period
most_popular_vector = big_mt[:, count(), by(manufacturer)][(f.count > 1500), 'manufacturer'].to_numpy()
most_popular = np.isin(big_mt[:, manufacturer], most_popular_vector)

# Create data set for charts
data = big_mt[ most_popular, :] \
             [ (is_ev == 0), :] \
             [:, { 'mean_combined' : mean(comb08),
                   'num_models' : count() },
                       by(year,
                          manufacturer,
                          vehicle_type)]

Our plotnine code and graph below looks very similar to one generated from ggplot, but we struggled with sizing the plot on the page and avoiding cutting off axis and legend labels. We tried to put the legend on the right, but the labels were partially cut off unless we squeezed the charts too much. When we put it at the bottom with horizontal labels, the x-axis for the ‘Cars’ facet was still partially blocked by the legend title. We couldn’t find much written on how to make the charts bigger or to change the aspect ratio or figure size parameters, so the size looks a bit smaller than we would like. We remember these struggles while learning ggplot, but it felt like we could figure it out more quickly.

It is also important to mention that confidence intervals are not implemented yet for lowess smoothing with geom_smooth() in plotnine. This probably isn’t such a big deal for our purposes in this graph, where there are a large number of models in each year. However, it detracts from the figure below, where the uncertainty about the true mean efficiency of cars with batteries in the early years is high because there were so few models.

# Smoothed line chart of efficiency by manufacturer
(p9.ggplot(data.to_pandas(),
          p9.aes(x = 'year',
                 y = 'mean_combined',
                 group = 'manufacturer',
                 color = 'manufacturer')) +
          p9.geom_smooth() +
          p9.theme_bw() +
          p9.labs(title = 'Imported Brands Start Strong, Make More Progress on Efficiency',
                  x = 'Year',
                  y = 'MPG',
                  caption = 'EPA',
                  color = 'Manufacturer') +
          p9.facet_wrap('~vehicle_type',
                        ncol = 2) +
          p9.theme(
            subplots_adjust={'bottom': 0.25},
            figure_size=(8, 6), # inches
            aspect_ratio=1/0.7,    # height:width
            dpi = 200,
            legend_position='bottom',
            legend_direction='horizontal')
)
## 
## 
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

One thing to note is that it is difficult to tell which line maps to which manufacturer just by the colors. The original plan was to pipe this into plotly as we would do in R, but this functionality is not available. While the plotnine functionality is pretty close to ggplot, the lack of plotly support is a serious shortcoming.
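For readers unfamiliar with the R workflow being referred to, this is roughly what piping a ggplot object into plotly looks like on the R side; the sketch uses the built-in mtcars data purely as a stand-in example and is not code from this post:

library(ggplot2)
library(plotly)

# Any ggplot object can be handed to ggplotly() to get an interactive version
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE) +
  theme_bw()

ggplotly(p)   # hover tooltips identify each series; plotnine has no equivalent yet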

From what we can see in the chart, we can see that “Other Asian” started out well in the beginning of the period, and made remarkable progress leaving Toyota behind as the leader in cars and trucks. Our family has driven Highlanders over the last 20 years, and seen the size of that model go from moderate to large, so it is not surprising to see Toyota trucks going from 2nd most to 2nd least efficient. BMW made the most progress of all producers in cars, and also made gains since introducing trucks in 2000. As a general comment, relative efficiency seems more dispersed and stable for cars than for trucks.

# Stacked line of number of models per manufacturer
(p9.ggplot(data[year < 2020, :].to_pandas(),
          p9.aes(x = 'year',
                 y = 'num_models',
                 fill = 'manufacturer')) +
          p9.geom_area(position = 'stack') +
          p9.theme_bw() +
          p9.labs(title = 'BMW Making a Lot of Car Models, While GM Streamlines',
                  x = 'Year',
                  y = 'Number of Models',
                  caption = 'EPA',
                  color = 'Manufacturer') +
          p9.facet_wrap('~vehicle_type',
                        ncol = 2,
                        scales = 'free') +
          p9.theme(
            subplots_adjust={'bottom': 0.25},
            figure_size=(8, 6), # inches
            aspect_ratio=1/0.7,    # height:width
            dpi = 200,
            legend_position='bottom',
            legend_direction='horizontal')
)
## 

When we look at the number of models by manufacturer, we can see that the number of models declined steadily from 1984 through the late 1990s, but has been rising since. Although the number of truck models appears to be competitive with cars, note that the graphs have different scales, so there are about two-thirds as many in most years. In addition to becoming much more fuel efficient, BMW has increased the number of models to an astonishing degree over the period, even while most other European imports have started to tail off (except Mercedes). We would be interested to know the story behind such a big move by a still-niche US player. GM had a very large number of car and truck models at the beginning of the period, but now has a much more streamlined range. It is important to remember that these numbers are not vehicles sold or market share, just models tested for fuel efficiency in a given year.

Electric Vehicles Unsurprisingly Get Drastically Better Mileage

After looking at the efficiency by manufacturer in the figure above, we did a double-take when we saw the chart below. While progress for gas-powered vehicles looked respectable above, in the context of cars with batteries, gas-only vehicles are about half as efficient on average. Though the mean improved, the mileage of the most efficient gas-powered vehicle in any given year steadily lost ground over the period.

Meanwhile, vehicles with batteries are not really comparable because plug-in vehicles don’t use any gas. The EPA imputes energy equivalence for those vehicles. The EPA website explains in Electric Vehicles: Learn More About the Label that it calculates the electricity equivalent required to travel 100 miles for plug-in vehicles. This seems like a crude comparison, as electricity prices vary around the country. Still, the most efficient battery-powered car (recently a Tesla) improved to an incredible degree.
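For context on what the EPA’s energy-equivalence calculation does, the label figures for plug-ins are expressed as MPGe, treating 33.7 kWh of electricity as the energy in one gallon of gasoline. The consumption number below is made up purely to illustrate the arithmetic:

# MPGe illustration (hypothetical label value of 30 kWh per 100 miles)
kwh_per_gallon    <- 33.7
kwh_per_100_miles <- 30
mpge <- 100 / (kwh_per_100_miles / kwh_per_gallon)
round(mpge)   # about 112 MPGe for this hypothetical car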

Around 2000, there were only a handful of battery-powered cars, so the error bars would be wide if included, and we are counting all cars with any battery as one category when there are both hybrids and plug-ins. In any case, caution should be used in interpreting the trend, but there was a period where the average actually declined, and the most efficient really hasn’t improved over 20 years.

# Prepare data for charting by gas and battery-powered
data = big_mt[ (vehicle_type == "Cars"), :][:,
                { "maximum": dt.max(comb08),
                  "mean" : dt.mean(comb08),
                  "minimum": dt.min(comb08),
                  "num_models" : dt.count() },
                    by(year, is_ev)]

# Reshape
data = data.to_pandas().melt(
                  id_vars=["year",
                           "is_ev",
                           "num_models"],
                  value_vars=["maximum",
                              "mean",
                              "minimum"],
                  var_name = "Description",
                  value_name = "MPG")

# Facet plot smoothed line for gas and battery-powered
(p9.ggplot(
    data,
    p9.aes('year',
           'MPG',
           group = 'Description',
           color = 'Description')) +
    p9.geom_smooth() +
    p9.facet_wrap('~ is_ev') +
    p9.labs(
      title = 'Gas Powered Cars Make Little Progress, While EV Driven by Most Efficient',
      x = 'Year'
    ) +
    p9.theme_bw() +
    p9.theme(
      subplots_adjust={'right': 0.85},
      figure_size=(10, 8), # inches
      aspect_ratio=1/1,    # height:width
      legend_position='right',
      legend_direction='vertical'))
## 
## 
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

Efficiency of Most Vehicle Types Started Improving in 2005

We were surprised to see the fuel efficiency of mid-sized cars overtake even small cars as the most efficient around 2012. Small pickups and SUVs also made a lot of progress, as did standard pickup trucks. Sport utility vehicles were left behind by the improvement most categories saw since 2005, while vans steadily lost efficiency over the whole period. As mentioned earlier, we noticed that the same model SUV that we owned got about 20% larger over the period. It seems like most families in our area have at least one SUV, but they didn’t really exist before 2000.

# Prepare data for plotting smoothed line by VClass
data = big_mt[(is_ev == False), :][:,
                {'mean' : dt.mean(comb08),
                 'num_models' : count() },
                    by(year, VClass, is_ev)].to_pandas()

# Plot smoothed line of efficiency by VClass
(p9.ggplot(
    data,
    p9.aes('year',
           'mean',
           group = 'VClass',
           color = 'VClass')) +
    p9.geom_smooth() +
    p9.labs(
        title = "Midsize Cars Pull Ahead in Efficiency",
        y = 'MPG',
        x = 'Year') +
    p9.theme_bw() +
    p9.theme(
      subplots_adjust={'right': 0.75},
      figure_size=(10, 4), # inches
      aspect_ratio=1/1.5,    # height:width
      legend_position='right',
      legend_direction='vertical'))
## 
## 
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

Efficiency by Fuel Type

We can see that the fuel efficiency of electric vehicles almost doubled over the period, while the average efficiency of vehicles with batteries did not make the same improvement. We generated our is_ev flag if the car had a battery, but didn’t specify whether it was a plug-in or a hybrid, so this discrepancy may have something to do with that. We can also see the efficiency of diesel vehicles coming down sharply during the 2000s. We know that Dieselgate broke in 2015 for vehicles sold from 2009, so it is interesting to see that the decline in listed efficiency started prior to that period. Natural gas vehicles seem to have been eliminated five years ago, which is surprising given the natural gas boom.

# Prepare data for plotting by fuelType1
data = big_mt[: ,
              { 'maximum': dt.max(comb08),
                'minimum': dt.min(comb08),
                'num_models' : count(),
                'mpg' : dt.mean(comb08) },
                  by(year, fuelType1)].to_pandas()

# Plot smoothed line of efficiency by fuelType1 by VClass
(p9.ggplot(data,
            p9.aes('year',
                   'mpg',
                   color='fuelType1')) +
            p9.geom_smooth() +
            p9.theme_bw() +
            p9.labs(
                title = "Efficiency of Electric Vehicles Takes Off",
                y = 'MPG',
                x = 'Year',
                color='Fuel Type') +
            #p9.geom_hline(aes(color="Overall mean")) +
            p9.theme(
              subplots_adjust={'right': 0.75},
              figure_size=(10, 4), # inches
              aspect_ratio=1/1.5,    # height:width
              legend_position='right',
              legend_direction='vertical'))
## 
## 
## /Users/davidlucey/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implemented for lowess smoothings.
##   "for lowess smoothings.", PlotnineWarning)

We don’t know if fuelType1 refers to the recommended or required fuel, but we didn’t realize that there had been such a sharp increase in premium over the period. Our understanding was that premium gasoline had more to do with engine performance than gas efficiency. It is notable that despite all the talk about alternative fuels, they can still be used in only a small minority of new models.

# Plot stacked line of share of fuelType1 by VClass
(p9.ggplot(data[data['year'] < 2020],
            p9.aes('year',
                   'num_models',
                   fill = 'fuelType1')) +
            p9.geom_area(position = 'stack') +
            p9.theme_bw() +
            p9.labs(
                title = "Number of Cars and Trucks Requiring Premium Overtakes Regular",
                y = 'Number of Models',
                x = 'Year',
                fill = 'Fuel Type') +
            p9.theme(
              subplots_adjust={'right': 0.75},
              figure_size=(10, 4), # inches
              aspect_ratio=1/1.5,    # height:width
              legend_position='right',
              legend_direction='vertical'))
## 

Comments About Plotnine and Python Chunks in RStudio

In addition to the charts rendering smaller than we would have liked, we would have liked to have figure captions (as we generally do for our R chunks). In addition, our cross-referencing links are currently not working for the Python chunks as they would with R. There is a bug mentioned on the knitr news page which may be fixed when the 1.29 update becomes available.

Conclusion

There is a lot of complexity in this system and more going on than we are likely to comprehend in a short exploration. We know there is a regulatory response to the CAFE standards, which tightened in 2005, and that at least one significant producer may not have had accurate efficiency numbers during the period. The oil price fluctuated widely during the period, but not enough to cause a real change in behavior in the way it did during the 1970s. We also don’t know how many vehicles of each brand were sold, so we don’t know how producers might jockey to sell more profitable models within the framework of overall fleet efficiency constraints. There can be a fine line between a light truck and a car, and the taxation differentials on importing cars vs. light trucks are significant. Also, the weight cutoffs for trucks changed in 2008, so most truck categories are not a consistent weight over the whole period. That is all for now, but a future post might involve scraping CAFE standards, where there is also long-term data available, to see if some of the blanks about volumes and weights could be filled in to support more than just exploratory analysis.

Author: David Lucey, Founder of Redwall Analytics. David spent 25 years working in institutional global equity research with several top investment banking firms.


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


The post Python and R - Part 2: Visualizing Data with Plotnine first appeared on R-bloggers.

NHS-R Community – Computer Vision Classification – How it can aid clinicians – Malaria cell case study with R


[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The NHS-R conference is coming up and I have been invited to speak for the third year running (I know, a hat trick), as well as becoming an official fellow of the organisation, as I am now back with the NHS.

This year I wanted to bring Deep Learning to the table, and my focus will be on how to apply Convolutional Neural Networks to a Kaggle Malaria cell problem.

Understanding the data

To understand the data I have first created two functions and prepared all my imports from the relevant R libraries:

library(tensorflow)
library(keras)
library(plyr)
library(dplyr)
library(ggplot2)
library(magrittr)
library(tidyr)
library(sys)
library(caret)
library(magick)
library(fs)
library(abind)
library(imager)
library(purrr)

# FUNCTIONS
show_image 

Explaining the functions. The first function uses imager to load the image and prepare a plot. I have wrapped this in a function for convenience and so other people can make use of it. The second function get_image_info() extracts the dimensions from an image and stores each dimension in an R list.
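The function bodies were cut off in the listing above; a minimal reconstruction of what they likely did, based on the description (the exact arguments and return values are assumptions), would be:

# Load an image from a file path and plot it
show_image <- function(path) {
  img <- imager::load.image(path)
  plot(img)
}

# Extract the dimensions of an image and store them in a list
get_image_info <- function(path) {
  img <- imager::load.image(path)
  list(width    = imager::width(img),
       height   = imager::height(img),
       depth    = imager::depth(img),
       channels = imager::spectrum(img))
}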

Creating a link to where the images are stored

The next step was to setup the relevant directories to understand where the images are stored:

dataset_dir 

This takes some explaining (a hedged reconstruction of the truncated code follows the list):

  • The dataset_dir variable links to a string path of where the cell images are stored
  • The train_dir uses this path and then appends the next level down, which is the train directory
  • The test_dir does the same thing, but for the test directory
  • The train_parasite and train_uninfected variables use the dir_ls function to search the relevant parasite and uninfected folders for all the matching .png files in the directory. If these were other formats, then the postfix would need to be changed to suit
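A sketch of the truncated directory setup described in the bullets above follows; the actual folder names and paths are assumptions:

dataset_dir      <- "Data/cell_images"
train_dir        <- file.path(dataset_dir, "train")
test_dir         <- file.path(dataset_dir, "test")
train_parasite   <- fs::dir_ls(file.path(train_dir, "parasite"),   glob = "*.png")
train_uninfected <- fs::dir_ls(file.path(train_dir, "uninfected"), glob = "*.png")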

Getting the file path

The next step is to index one of the file paths to use as a test image to view. I select the second index from both the train_parasite and train_uninfected images, which have been accessed using the relevant dir_ls path:

parasite_image 
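The assignment above was truncated in extraction; given the helpers and path vectors defined earlier, it likely looked something like this sketch, loading with imager and taking the second file from each vector:

parasite_image   <- imager::load.image(train_parasite[2])
uninfected_image <- imager::load.image(train_uninfected[2])

# View the two test images
show_image(train_parasite[2])
show_image(train_uninfected[2])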

Running the code will display a cell with a malaria parasite infection, the second an uninfected cell, and the dim commands will print out the image sizes and dimensions:

> dim(parasite_image)
[1] 148 208   1   3
> dim(uninfected_image)
[1] 145 136   1   3

Animating the images

The following code uses magick to animate the images for the parasitised cells and those that are uninfected:

# Build infected cell animation
system.time(
  train_parasite[1:100] %>%
    map(image_read) %>%
    image_join() %>%
    image_scale("300x300") %>%
    image_animate(fps = .5) %>%
    image_write("Data/Parasite_Cells.gif"))

# Build Uninfected Cells animation
system.time(
  train_uninfected[1:100] %>%
    map(image_read) %>%
    image_join() %>%
    image_scale("300x300") %>%
    image_animate(fps = .5) %>%
    image_write("Data/Uninfected_Cells.gif"))

Running these two lines you get the resultant gifs of the first 100 images:

Malaria Parasite Infected Cells

The below shows the cells that are not infected:

Uninfected cells

Building the CNN model skeleton

A bit about CNNs

The general process for a CNN is shown graphically here:

I will design these layers in our model baseline skeleton hereunder.

The model skeleton

The first thing I will save is the target image shape and then I will start to build the model skeleton:

# Target image shape (the assignment was truncated in extraction; c(130, 130, 3)
# is reconstructed from the printed model summary below)
image_shape <- c(130, 130, 3)

# Build Keras Baseline Model
model <- keras_model_sequential() %>%
  layer_conv_2d(filters=32, kernel_size=c(3,3), activation = "relu",
                input_shape = image_shape) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters=64, kernel_size = c(3,3),
                input_shape = image_shape, activation="relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters=64, kernel_size = c(3,3)) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters=32, kernel_size=c(3,3), activation = "relu",
                input_shape = image_shape) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(1, activation = "sigmoid") %>%
  layer_dropout(0.5)

To explain this I will explain how a CNN is structured:

  • The model is a sequential model so this needs to be the first statement, as each of the hidden layers connect together to be a fully connected network
  • I create a 2-dimensional convolution over the image, setting the filters to 32 to begin with. In the context of a CNN, a filter is a set of learnable weights which are learned using the backpropagation algorithm. You can think of each filter as storing a single template/pattern, and the result of applying it is called a feature map. These filters allow different maps to be created for each image – a little analogous to when you go to an optician for new lenses and they test different filters out. The activation function is set to relu (rectified linear units), which is now the de facto activation function for CNNs – along with Leaky ReLU.
  • This data then gets reduced or pooled by a process called max, min or average pooling:

The graphic shows the process of what this type of pooling reduction achieves:

  • This process is repeated three times, one layer adding more feature maps, the next holding 64 constant and the final one reducing back down to 32.
  • The next part is to flatten the feature maps down to a single vector, followed by the layer_dense function which will then produce the binary outcome I need. This time the activation is changed to sigmoid – as this is a binary classification task, the same as that used in logistic regression
  • The final step is to use a method called dropout, invented by Geoffrey Hinton, the godfather of Neural Networks. Hear what he has to say.

Compiling the model

The next step is to compile the model. Here I use a loss of binary crossentropy (for multiclass classification the best choice is categorical crossentropy, and you would need to change the activation function in the final dense layer to softmax).

The optimizer I used here is rmsprop (root mean square propagation, an adaptive learning-rate method), and I set the metrics equal to accuracy:

model %>%
  compile(
    loss='binary_crossentropy',
    optimizer=optimizer_rmsprop(),
    metrics = c("acc")
  )

print(model)

Printing the compiled model shows its structure:

print(model)
Model
Model: "sequential"
_________________________________________________________________________________________
Layer (type)                            Output Shape                       Param #
=========================================================================================
conv2d (Conv2D)                         (None, 128, 128, 32)               896
_________________________________________________________________________________________
max_pooling2d (MaxPooling2D)            (None, 64, 64, 32)                 0
_________________________________________________________________________________________
conv2d_1 (Conv2D)                       (None, 62, 62, 64)                 18496
_________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)          (None, 31, 31, 64)                 0
_________________________________________________________________________________________
conv2d_2 (Conv2D)                       (None, 29, 29, 64)                 36928
_________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)          (None, 14, 14, 64)                 0
_________________________________________________________________________________________
conv2d_3 (Conv2D)                       (None, 12, 12, 32)                 18464
_________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)          (None, 6, 6, 32)                   0
_________________________________________________________________________________________
flatten (Flatten)                       (None, 1152)                       0
_________________________________________________________________________________________
dense (Dense)                           (None, 1)                          1153
_________________________________________________________________________________________
dropout (Dropout)                       (None, 1)                          0
=========================================================================================
Total params: 75,937
Trainable params: 75,937
Non-trainable params: 0
_________________________________________________________________________________________

Rescaling and the awesome Keras flow_from_directory function

The first part is to rescale the images. I set up my train_datagen and test/validation datagen variables using the image_data_generator object from Keras:

train_datagen 

Here I also set the batch size – the number of images to process in each batch (the total training set size divided by the batch size gives the number of batches per epoch).
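A minimal sketch of what these generator objects typically look like (the 1/255 rescale and exact arguments are assumptions; the batch size of 16 comes from the note later in the post that it was increased from 16 to 32):

# Rescale pixel values from 0-255 down to 0-1
train_datagen <- image_data_generator(rescale = 1/255)
test_datagen  <- image_data_generator(rescale = 1/255)

batch_size <- 16   # assumed baseline batch size (increased to 32 later on)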

Next, I use the flow_images_from_directory Keras method to look at my folder structure, pointing it at the training folder; the subfolders inside it become the classification labels:

Structure of cell images storage

The flow images from directory function only works when the images are organised in this structure, and it is worth it, as it saves tons of manual image labelling (i.e. image 1 is a horse; image 2 is a cat; image 3 is a parasite-infected malaria cell – you get the point).

Once this setup is in place you can run the functions and you will get this output:

> train_generator 

The same message will be printed for the test_generator:

> test_generator 

The target size here only expects the height and width dimensions – so this is why I have index sliced the image_shape vector.
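A sketch of the two generator calls described above – the directory paths are placeholders of mine, not taken from the post, and batch_size/image_shape are the variables assumed earlier:

train_generator <- flow_images_from_directory(
  directory   = "Data/cell_images/train",   # placeholder path
  generator   = train_datagen,
  target_size = image_shape[1:2],           # height and width only
  class_mode  = "binary",
  batch_size  = batch_size
)

test_generator <- flow_images_from_directory(
  directory   = "Data/cell_images/test",    # placeholder path
  generator   = test_datagen,
  target_size = image_shape[1:2],
  class_mode  = "binary",
  batch_size  = batch_size
)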

Training the baseline model

The last step is to use the fit_generator function to train the model:

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 150,
  epochs = 50,
  validation_data = test_generator,
  validation_steps = 75
)

model %>% save_model_hdf5("Data/parasite_cells_classification.h5")

Here I create a variable called history – an old Python convention – and use the fit generator on the train generator object. I ask for 150 steps per epoch across 50 epochs, and evaluate the validation (test) data against the training run using roughly half the number of steps.

Once this has completed, we then save the model in .h5 format, as the model is accessed via reticulate (the Python interface) and can only be used in that format.

The output of the baseline model is shown here:

This has not done very well at all – it is failing to pick up on anything useful and has plateaued at about 50% accuracy. I am sure we can improve this…

Improving the baseline model

To improve the baseline model I will use this approach.

Fun with images and image augmentation

I will use image augmentation functions in Keras to enhance the training set – adding copies of the images with rotations, shifts and rescaling applied:

image_gen 

From looking at the images in the directory, I think we need to focus in more on the parasite images. I cranked the zoom in Windows Picture Viewer to 80% enlargement and the parasite cells became much clearer, so I will apply this to every image and the augmented copies using a zoom_range of 0.8 (80%).
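A sketch of the augmented generator – the 0.8 zoom_range comes from the post, but the other augmentation values are my assumptions:

image_gen <- image_data_generator(
  rescale            = 1/255,
  rotation_range     = 20,     # assumed
  width_shift_range  = 0.1,    # assumed
  height_shift_range = 0.1,    # assumed
  zoom_range         = 0.8,    # 80% zoom, as described above
  horizontal_flip    = TRUE,   # assumed
  fill_mode          = "nearest"
)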

Adding more flesh to our skeleton

This time I will increase the filters to 128 in the deeper layers and collapse to a dense layer of 512 units, before dropping down to a single unit with a sigmoid activation and applying dropout:

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = "relu",
                input_shape = image_shape) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3,3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3,3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(512, activation = "relu") %>%
  layer_dense(1, activation = "sigmoid") %>%
  layer_dropout(0.5)

# Compile the model and add a learning rate
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-4),
  metrics = c("acc")
)

This gives the following model summary:

summary(model)
## Model output
Model: "sequential_1"
________________________________________________________________________
Layer (type)                         Output Shape                Param #
========================================================================
conv2d_4 (Conv2D)                    (None, 128, 128, 32)        896
max_pooling2d_4 (MaxPooling2D)       (None, 64, 64, 32)          0
conv2d_5 (Conv2D)                    (None, 62, 62, 64)          18496
max_pooling2d_5 (MaxPooling2D)       (None, 31, 31, 64)          0
conv2d_6 (Conv2D)                    (None, 29, 29, 128)         73856
max_pooling2d_6 (MaxPooling2D)       (None, 14, 14, 128)         0
conv2d_7 (Conv2D)                    (None, 12, 12, 128)         147584
max_pooling2d_7 (MaxPooling2D)       (None, 6, 6, 128)           0
flatten_1 (Flatten)                  (None, 4608)                0
dense_1 (Dense)                      (None, 512)                 2359808
dense_2 (Dense)                      (None, 1)                   513
dropout_1 (Dropout)                  (None, 1)                   0
========================================================================
Total params: 2,601,153
Trainable params: 2,601,153
Non-trainable params: 0
________________________________________________________________________

Flowing images from directories again

This time we flow the images through the augmented generator, with the adjustments described above (e.g. the increased zoom and the other shifting adjustments). I have also increased the batch size from 16 to 32.

The code below implements this:

train_gen_augment 

The same image counts will be printed, but the generator will now layer augmented images on top of these to bolster the model's effective sample size.
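A sketch of the augmented generators, assuming the same (placeholder) folder paths as before and the increased batch size of 32:

train_gen_augment <- flow_images_from_directory(
  directory   = "Data/cell_images/train",   # placeholder path
  generator   = image_gen,
  target_size = image_shape[1:2],
  class_mode  = "binary",
  batch_size  = 32
)

test_gen_augment <- flow_images_from_directory(
  directory   = "Data/cell_images/test",    # placeholder path
  generator   = image_gen,                  # assumption: same augmented generator
  target_size = image_shape[1:2],
  class_mode  = "binary",
  batch_size  = 32
)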

Fitting the model

The same code is used, but because this is a deeper network and takes about 50 minutes to train, I will reduce the steps per epoch and the number of epochs across the training set.

One other important function I have added here is something called early stopping. This monitors the validation loss and, if the model starts to overfit, halts training at that point and outputs the model accuracy.

history_augment <- model %>%
  fit_generator(
    train_gen_augment,
    steps_per_epoch = 100,
    epochs = 50,
    validation_data = test_gen_augment,
    validation_steps = as.integer(100 / 2),
    callbacks = callback_early_stopping(monitor = "val_loss",
                                        patience = 5)
  )

The patience parameter sets how many epochs to wait, once the monitored metric has stopped improving, before training is terminated. Here, if the validation loss stops improving at epoch 5, training carries on to epoch 10 and then terminates.

Evaluating the model

This model reported these results:

Checking the training history, the summary statistics below indicate the accuracy of the model across the 23 epochs it completed before early stopping kicked in:

summary(history_augment$metrics$acc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6135  0.6612  0.6741  0.6680  0.6809  0.6932

This is a good improvement and it might be useful. You could spend a couple of weeks finalising and improving this, but for the sake of example, this accuracy is good enough. I will go with the median.

Saving the model

To save the model – I will use the same script in R:

model %>% save_model_hdf5("Data/parasite_cells_classification_augmented.h5")

As this took 50+ minutes to train – and that would only increase with larger sample sizes – I will only retrain every now and then. Instead, I will load the model from my directory and use this pretrained model to make predictions.

Making predictions

This section will show how to make predictions with our trained model.

Loading pretrained model

To load a pretrained model you use the below syntax:

model <- load_model_hdf5("Data/parasite_cells_classification_augmented.h5")

Predicting with new cell scan

The following steps will show what we need to do to convert the image into a format that the model can work with.

pred_img 

There is a bit to this, so I will bullet it out for you; a sketch of the code follows the list:

  • pred_img uses index number 100 of the train_parasite list of files. This gives the full path, so we can work with the image
  • img_new – this uses Keras’s image_load function to take the file path of the pred_img and then adjusts the image to the image shape defined by the variable set at the start of this, but only selects the height and width, not the colour channel
  • We then convert the image to an array using Keras's image_to_array function – essentially changing it into an array structure
  • img_tensor then uses the array_reshape command to take the image we have just converted to an array and reshapes it to the full dimensions of the tensor – so 1 image, 130 height, 130 width and 3 colour channels i.e. RGB.
  • The last part is to standardise the image between 0 and 1 by dividing the tensor values by the maximum pixel value of 255.
  • This is then plotted as a raster image using tensor slicing i.e. img_tensor[1,,,].
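A sketch of those steps in code (train_parasite is assumed here to be the vector of training file paths):

pred_img <- train_parasite[100]                             # full path to one image

img_new <- image_load(pred_img, target_size = image_shape[1:2])

img_tensor <- image_to_array(img_new)                       # convert to an array
img_tensor <- array_reshape(img_tensor, c(1, image_shape))  # 1 x 130 x 130 x 3
img_tensor <- img_tensor / 255                              # scale pixels to [0, 1]

plot(as.raster(img_tensor[1,,,]))                           # plot as a raster image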

The image is now prepared and a plot of the image appears:

Predicting class label and class membership

The last step of the process is to predict the class membership.

The steps to achieve this are highlighted below:

predval <- model %>% predict_proba(img_tensor)

pred <- model %>%
  predict_classes(img_tensor) %>%
  as.data.frame() %>%
  dplyr::mutate(Class = case_when(
    V1 == 0 ~ "Parasite Class",
    TRUE ~ "Uninfected"
  )) %>%
  dplyr::rename(ClassPred = V1) %>%
  cbind(predval)

print(pred)

The steps are:

  • The predval variable holds the probabilistic prediction of belonging to that class (via keras::predict_proba – see https://keras.rstudio.com/reference/predict_proba.html).
  • The pred variable uses the keras::predict_classes function to pass in the trained model and make class predictions. Please note that the class predictions from the flow_images_from_directory function are assigned labels 0 to n-1 (here 0 and 1, as it is a binary classification task).
  • We then use piping to convert the object to a data.frame. Next, we use mutate to add a new column and a case_when statement to set the class label. As indicated in the previous bullet, 0 is assigned to the parasite class, because in flow_images_from_directory the parasite folder comes before the uninfected folder.
  • The final step is to rename the class prediction and bind the predval with the pred data frame.

The resultant output shows:

ClassPred          Class      predval
        0 Parasite Class 1.030897e-05

Conclusion

There are a number of different ways this network could be improved, some of which have already been highlighted. For a good introductory text to CNNs and Deep Learning I would recommend Deep Learning with R.

As part of my previous role at Draper and Dash I ran three-day courses on deep learning to tackle vision, supervised and unsupervised ML, and time series forecasting with LSTM networks. I would be happy to be contacted about training any NHS trust or other agency that might need this sort of training.

For the resources for this post please go to my GitHub by clicking on the icon below. I will provide my Twitter icon, if anyone wants to enquire about training courses.


To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks.


The post NHS-R Community – Computer Vision Classification – How it can aid clinicians – Malaria cell case study with R first appeared on R-bloggers.

Appsilon is Hiring Globally: Remote R Shiny, Front-End, and Business Roles Open


[This article was first published on r – Appsilon | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Join a World-Class Team of Explorers

Do you want to work with global leaders in R and R Shiny? Are you located on planet Earth? Then this blog post is for you!

Appsilon was founded in 2013 by software engineering veterans from Google, Microsoft, Bank of America, and Domino Data Lab. Over the years, our team has excelled in providing innovative data analytics and machine learning solutions for enterprise companies, NGOs, and non-profit organizations.

We’ve been growing rapidly in the past few years, and we’re on the hunt for kind, intelligent people to join our team. We currently have nine positions open and are looking forward to filling them with exceptional people. We are a global remote-first company with multiple Fortune 500 clients and team members in Europe, the UK, South America, and North Africa.

Why Should You Work at Appsilon? 

Appsilon is a remote-first company. We are very proud of our remote-first policy, and we pivoted to a remote setup at the very beginning of the COVID-19 crisis. This past year has proved that not only can we maintain our current pace of work and high standards in a remote environment, but that we can also expand and improve! 

Want to learn more about how our remote office works? Read Remote Data Science Team Best Practices: Scrum, GitHub, Docker, and More

As an Appsilonian, you’ll have a meaningful impact on the world. We routinely contribute our expertise in data science, machine learning, and computer vision to projects that aim to have a positive impact. Through our AI For Good initiative, we tackle some of the world’s most pressing challenges – from fighting climate change to preserving wildlife populations in Africa with computer vision.

At Appsilon, we value work-life harmony. With us, it’s possible to have a challenging and stimulating career without sacrificing your personal life or family time. We value mental and physical health, and we encourage each other to enjoy hobbies outside of work. We are a community of kind, friendly, and motivated professionals committed to excellence and innovation.

Read more about AI4Good:

At Appsilon you will also have an opportunity to work with some of the world’s most influential Fortune 500 companies on exciting data science and visualization projects. Here are the kinds of projects that we work on at Appsilon:

Application/Dashboard Development with R Shiny

We create, maintain, and develop Shiny applications for enterprise customers all over the world. We provide scalability, security, and modern UI/UX with custom R packages that native Shiny apps do not provide. At Appsilon, we are among the world’s foremost experts in R Shiny and have made a variety of Shiny innovations over the years. We cover everything from business analysis and data science consulting to code refactoring.

See our work: Appsilon’s Shiny Demos

Wildlife Image Classification with Machine Learning

Since 2018 we’ve been working with Gabon’s National Park Agency to lead a large-scale biodiversity monitoring program that uses 200 camera traps to survey the mammal population in Lopé and Waka National Parks (comprising 7000 km2 of a continuous forest). The aim of this project has been to build capacity for long-term, scientifically rigorous biodiversity monitoring in both national parks and also within the National Parks Agency as a whole. On this project, we are building a fully open-source multi-platform solution for wildlife image classification. The project is directly tied to our AI For Good initiative.

Read more: ML Wildlife Image Classification to Analyze Camera Trap Datasets

Open Positions

We are primarily looking for talented R Shiny developers, Frontend specialists, and Infrastructure engineers, but we also have other less technical (and non-technical) positions. Here’s a list of some of our open roles: 

  • Fullstack Software Engineer / Tech Lead (Javascript, React.js, Python) – work on some of the most advanced R Shiny apps for global leaders. Learn more
  • R Shiny Developer (R Shiny, Javascript) – build applications and dashboards and allow users to interact with their data and analysis. Learn more
  • Infrastructure Engineer (DevOps, Linux, Cloud) – build infrastructure in data science projects, experiment with technologies of your choice, and help build dashboards for large companies. Learn more
  • Frontend Engineer (Javascript, CSS/SASS, HTML, React) – build the world’s most advanced dashboards and transform them into innovative tools for Fortune 500 company managers. Learn more
  • Community Manager – manage our events, webinars, social media, and newsletters. You will also work on content for our blog and help generate case studies with our tech team. Learn more

There are more job listings on our Careers page, so make sure to check it out. There’s also the possibility to create your own position, so feel free to apply at any time.

See the full list of nine available positions

Our Recruiting Process

So you see a role that fits your skills and you want to apply? Simply submit your application through the submission form on our Careers page or send a direct email to paulina@appsilon.com. Here’s an outline of our hiring process: 

Recruiting process

These steps aren’t set in stone, as our recruitment process is flexible to each candidate and to every position.

We can’t wait to meet you and to start working together. Apply now!

Article Appsilon is Hiring Globally: Remote R Shiny, Front-End, and Business Roles Open comes from Appsilon | End­ to­ End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon | End­ to­ End Data Science Solutions.


The post Appsilon is Hiring Globally: Remote R Shiny, Front-End, and Business Roles Open first appeared on R-bloggers.

on arithmetic derivations of square roots


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

An intriguing question made a short-lived appearance on the CodeGolf section of Stack Exchange, before being removed, namely the (most concise possible) coding of an arithmetic derivation of the square root of an integer, S, with a 30 digit precision and using only arithmetic operators. I was not aware of the myriad of solutions available, as demonstrated on the dedicated Wikipedia page, and I ended up playing with three of them during a sleepless pre-election night!

The first solution for finding √S is based on a continued fraction representation of the root,

\sqrt{S}=a+\cfrac{r}{2a+\cfrac{r}{2a+\ddots}}

with a²≤S and r=S-a². It is straightforward to code-golf:

while((r<-S-T*T)^2>1e-9)T=(F<-2*T+r/(2*T+F))-T;F

but I found it impossible to reach the 30 digit precision (even when decreasing the error bound from 10⁻⁹). Given the strict rules of the game, this would have been considered to be a failure.
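For readers who find the one-liner hard to parse, here is a readable sketch of the same fixed-point idea (the function name and stopping rule are my own choices; like the golfed version it only converges linearly, in double precision):

# Iterate sqrt(S) = a + r/(a + sqrt(S)), with a the integer part and r = S - a^2
cf_sqrt <- function(S, tol = 1e-9) {
  a <- sum((1:S)^2 <= S)          # largest integer with a^2 <= S
  r <- S - a^2
  est <- a                        # current estimate of sqrt(S)
  repeat {
    est_new <- a + r / (a + est)  # unroll one more level of the fraction
    if ((est_new - est)^2 < tol) return(est_new)
    est <- est_new
  }
}

cf_sqrt(2)   # approx 1.41421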

The second solution is Goldschmidt’s algorithm

b=S
T=1/sum((1:S)^2<S)
while((b*T[1]^2-1)^2>1e-9){
  b=b*T[1]^2
  T=c((3-b)/2,T)}
S*prod(T)

which is longer for code-golfing but produces both √S and 1/√S (and is faster than the Babylonian method and Newton-Raphson). Again no luck with high precision and almost surely unacceptable for the game.
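A non-golfed sketch of the same Goldschmidt iteration, for illustration (the function name, tolerance and the slightly safer initial guess are my own choices, and this runs in double precision only, nowhere near the 30-digit target):

goldschmidt_sqrt <- function(S, tol = 1e-12) {
  k <- sum((1:S)^2 < S) + 1   # smallest integer with k^2 >= S
  y <- 1 / k                  # rough guess for 1/sqrt(S)
  x <- S * y                  # running estimate of sqrt(S)
  b <- S * y^2                # converges to 1 as the iteration proceeds
  while ((b - 1)^2 > tol) {
    Y <- (3 - b) / 2          # Newton-style correction factor
    x <- x * Y
    y <- y * Y
    b <- b * Y^2
  }
  c(sqrt_S = x, inv_sqrt_S = y)
}

goldschmidt_sqrt(47)   # approx 6.8556546 and 0.1458650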

The third solution is the most interesting [imho] as it mimics long division, working two digits at a time (with a connection to Napier's bones)

`~`=length
D=~S
S=c(S,0*(1:30))
p=d=0
a=1:9
while(~S){
  F=c(F,x<-sum(a*(20*p+a)<=(g<-100*d+10*S[1]+S[2])))
  d=g-x*(20*p+x)
  p=x+10*p
  S=S[-1:-2]}
sum(10^{1+D/2-1:~F}*F)

plus providing an arbitrary number of digits with no error. This code requires S to be entered as a sequence of digits (with a possible extra top digit 0 to make the integer D even). Returning one digit at a time, it would further have satisfied the constraints of the question (if in a poorly condensed manner).


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


The post on arithmetic derivations of square roots first appeared on R-bloggers.

10 Must-Know Tidyverse Functions: #3 – Pivot Wider and Longer


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article is part of a R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.

Learn how to use pivot_wider() and pivot_longer() to format data like a data wizard:

(Click image to play tutorial)

Why Pivot Wider?

Pivoting wider is essential for making summary tables that go into reports & help humans (like you and me) understand key information.

Let’s say we have some automobile manufacturer data that we want to format into a table that people can read.

We can summarize and pivot the data by manufacturer and class to understand the number of vehicle classes that each manufacturer produces.

The result is a table that I can glean insights from.
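The full walk-through is in the video, but a rough sketch of the kind of pivot_wider() call being described might look like this, using the built-in mpg dataset as a stand-in for the automobile manufacturer data:

library(tidyverse)

# Count vehicle classes per manufacturer, then pivot wider so each
# class becomes its own column (missing combinations filled with 0).
mpg_wide <- mpg %>%
  count(manufacturer, class) %>%
  pivot_wider(
    names_from  = class,
    values_from = n,
    values_fill = 0
  )

mpg_wide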

Why Pivot Longer?

Pivot longer lengthens data, increasing the number of rows and decreasing the number of columns.

We can convert from wide to long with Pivot Longer, which gets it into the correct format to visualize with GGPLOT HEATMAP. 💥💥💥
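Continuing the mpg sketch above, pivoting back to long format and feeding it into a simple ggplot heatmap might look like this:

# Lengthen the summary table again and plot it as a heatmap.
mpg_long <- mpg_wide %>%
  pivot_longer(
    cols      = -manufacturer,
    names_to  = "class",
    values_to = "n"
  )

ggplot(mpg_long, aes(x = class, y = manufacturer, fill = n)) +
  geom_tile() +
  labs(title = "Vehicle classes by manufacturer", fill = "count")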

That was ridiculously easy. Keep it up & you’ll become a tidyverse rockstar.

rockstar

You Learned Something New!

Great! But, you need to learn a lot to become an R programming wizard.

What happens after you learn R for Business from Matt 👇

Tidyverse wizard

…And the look on your boss’ face after seeing your first Shiny App. 👇

Amazed

This is career acceleration.

SETUP R-TIPS WEEKLY PROJECT

  1. Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter

  2. Set Up the GitHub Repo: https://github.com/business-science/free_r_tips

  3. Check out the setup video (https://youtu.be/F7aYV0RPyD0). Or, Hit Pull in the Git Menu to get the R-Tips Code

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


The post 10 Must-Know Tidyverse Functions: #3 - Pivot Wider and Longer first appeared on R-bloggers.


NYC R Meetup: Slides on Future


[This article was first published on JottR on R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The official poster for this New York Open Statistical Programming Meetup

I presented Future: Simple, Friendly Parallel Processing for R (65 minutes; 59 slides + Q&A slides) at New York Open Statistical Programming Meetup, on November 9, 2020:

  • HTML (incremental Google Slides; requires online access)
  • PDF (flat slides)
  • Video (presentation starts at 0h10m30s, Q&A starts at 1h17m40m)

I would like to thank everyone who attended and everyone who asked lots of brilliant questions during the Q&A. I would also like to express my gratitude to Amada, Jared, and Noam for the invitation and for making this event possible. It was great fun.

– Henrik

Links


To leave a comment for the author, please follow the link and comment on their blog: JottR on R.


The post NYC R Meetup: Slides on Future first appeared on R-bloggers.

Little useless-useful R functions – Play rock-paper-scissors with your R engine


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Naaah! 🙂 not a joke. But quarantine restrictions are getting tighter and you might want to spend some time playing a useless game with your R engine whilst programming, doing machine learning or other learning.

Source: Creative Commons

Without complications, the simplest version of this game is to play against R Engine. Here is how to:

##### Input bet as a function
play_RPS <- function(bet) {
  bets <- c("R", "P", "S")
  if (bet %in% bets) {
    solution_df <- data.frame(
      combo = c("RP", "PR", "PS", "SP", "RS", "SR", "PP", "RR", "SS"),
      win   = c("01", "10", "01", "10", "10", "01", "00", "00", "00")
    )
    REngine <- sample(bets, 1)
    combo <- paste0(REngine, bet, collapse = "")
    res <- solution_df[which(solution_df$combo == combo), 2]
    if (res == "10") {
      print(paste0("You lost. Computer draw: ", REngine), collapse = "")
    } else if (res == "00") {
      print(paste0("It's a tie! Computer draw: ", REngine), collapse = "")
    } else {
      print(paste0("You win! Computer draw: ", REngine), collapse = "")
    }
  } else {
    print("Please input valid bet!")
  }
}

And by running the function with your bet:

play_RPS("R")

Getting the results (in my case / run).

Game-play could use some refinement. I have decided to use the x11 function, which I have already mentioned in one of my previous posts. In this case, I will be building a loop around x11 for continuous selection of Rock-Paper-Scissors. With x11 you can get a little bit better interface.

As always, the code is available at Github.
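If you just want to keep playing in the console in the meantime, a minimal loop along these lines works (my own sketch, not the x11-based code from the author's GitHub):

# Repeatedly prompt for a bet and play against the R engine,
# reusing the play_RPS() function defined above.
play_loop <- function() {
  repeat {
    bet <- toupper(readline("Your bet - R, P or S (Q to quit): "))
    if (bet == "Q") break
    play_RPS(bet)
  }
}

play_loop()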

Happy R-coding! And stay healthy!


To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


The post Little useless-useful R functions – Play rock-paper-scissors with your R engine first appeared on R-bloggers.

The Bachelorette Eps. 4 & 5 – Influencers in the Garden – Data and Drama in R


[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The level of drama in The Bachelorette was so high that I had to wait 2 episodes to gather my thoughts. We have now seen a shift from Clare to Tayshia and Instagram followers have jumped on board the Tayshia train. However, they have not come close to abandoning Clare. It seems hard to believe that people could follow both of them simultaneously because one person can only handle so much drama.

If we look at the follower gap between the two Bachelorettes, it has decreased substantially but is likely to expand again after Clare fades into the background and Tayshia becomes the star of the show.

blog1.png

I asked myself, what were the top hashtags used with #TheBachelorette on Twitter each day the show aired? The answer: SPAM.

blog3.png

Or so I thought…

It seemed odd that people would be tweeting about the US presidential election instead of The Bachelorette. So, I did some digging. I recalled seeing vote count charts that looked very similar to the patterns we saw between the number of Instagram followers of Tayshia and Clare. If we look at vote share in Nevada between Trump and Biden, we notice a striking similarity between follower count of Tayshia and Clare.

Here’s the chart from the NY Times:

Screen Shot 2020-11-12 at 8.11.05 PM.png

It has an uncanny resemblance to the plot of the bachelorette Instagram followers:

plot_combo.png

We can see that Tayshia started with over 57% of the total followers but that gap narrowed down: 53% (T) to 47% (C). If this chart seems ridiculous, it’s because it is.

Looking forward to seeing the addition of the new guys to the predictive model.

Side note: Chris Harrison is way cooler than Dale Moss.

blog2.png

As always, please feel free to play with the data yourself at https://stoltzmaniac.shinyapps.io/TheBacheloretteApp/ where you can take advantage of some fancy algorithms to determine the emotions of the faces in each post the contestant made public on Instagram.

We’ll be doing some analysis after next week’s show, hope to see you then. The code for the plots is below and the data is available upon request by using our contact page.

library(dplyr)
library(tidyr)      # needed for pivot_wider(), pivot_longer() and fill()
library(lubridate)
library(ggplot2)
library(ggridges)

GLOBAL_DATA = get_database_data()

# Follower counts over time: Tayshia vs. Clare
GLOBAL_DATA$insta_followers %>%
  filter(suitor %in% c('clarecrawley', 'tayshiaaa'),
         datetime <= '2020-11-12') %>%
  group_by(name, datetime) %>%
  summarize(follower_count = mean(follower_count), .groups = 'drop') %>%
  ggplot(aes(x = datetime, y = follower_count, col = name)) +
  geom_line() +
  geom_line(col = '#ff6699', fill = '#ff6699', size = 1.5) +
  gghighlight::gghighlight(name == 'Tayshia Adams', label_key = name, use_group_by = TRUE) +
  theme_minimal() +
  labs(x = '', y = '', title = "Bachelorette Instagram Followers Over Time", subtitle = "Tayshia vs. Clare") +
  scale_y_continuous(label = scales::comma) +
  theme(legend.position = 'top', legend.direction = "horizontal",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1),
        plot.subtitle = element_text(hjust = 0.5, vjust = -1))

# Margin and percentage share of the combined follower count
dat = GLOBAL_DATA$insta_followers %>%
  filter(suitor %in% c('clarecrawley', 'tayshiaaa'),
         datetime <= '2020-11-12') %>%
  group_by(name, datetime) %>%
  summarize(follower_count = mean(follower_count), .groups = 'drop') %>%
  pivot_wider(id_cols = datetime, names_from = name, values_from = follower_count) %>%
  fill(`Tayshia Adams`) %>%
  mutate(margin = `Tayshia Adams` - `Clare Crawley`,
         total = `Tayshia Adams` + `Clare Crawley`,
         pct_clare = `Clare Crawley` / total,
         pct_tayshia = `Tayshia Adams` / total)

p1 = dat %>%
  select(datetime, `Clare Crawley` = pct_clare, `Tayshia Adams` = pct_tayshia) %>%
  pivot_longer(cols = c(`Clare Crawley`, `Tayshia Adams`), "Bachelorette", values_to = 'pct_of_total_followers') %>%
  ggplot(aes(x = datetime, y = pct_of_total_followers, col = Bachelorette)) +
  geom_line(size = 1.5) +
  theme_minimal() +
  labs(x = '', y = '', title = 'Percentage of Total Follower Count') +
  scale_y_continuous(label = scales::percent) +
  scale_color_manual(values = c('red', 'blue')) +
  theme(legend.position = 'top', legend.direction = "horizontal",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1),
        plot.subtitle = element_text(hjust = 0.5, vjust = -1))

p2 = dat %>%
  ggplot(aes(x = datetime, y = margin)) +
  geom_line(size = 1.5, color = '#ff6699') +
  theme_minimal() +
  labs(x = '', y = '', title = 'Follower Count Difference', subtitle = 'Difference = Tayshia - Clare') +
  scale_y_continuous(label = scales::comma) +
  theme(legend.position = 'top', legend.direction = "horizontal",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1),
        plot.subtitle = element_text(hjust = 0.5, vjust = -1))

gridExtra::grid.arrange(p1, p2)

# Follower counts over time: Chris Harrison vs. Dale Moss
GLOBAL_DATA$insta_followers_w_losers %>%
  filter(suitor %in% c('chrisbharrison', 'DaleMoss13'),
         datetime <= '2020-11-12') %>%
  group_by(name, datetime) %>%
  summarize(follower_count = mean(follower_count), .groups = 'drop') %>%
  ggplot(aes(x = datetime, y = follower_count, col = name)) +
  geom_line() +
  geom_line(col = '#ff6699', fill = '#ff6699', size = 1.5) +
  gghighlight::gghighlight(name == 'Chris Harrison', label_key = name, use_group_by = TRUE) +
  theme_minimal() +
  labs(x = '', y = '', title = "Bachelorette Instagram Followers Over Time", subtitle = "Chris Harrison vs. Dale Moss") +
  scale_y_continuous(label = scales::comma) +
  theme(legend.position = 'top', legend.direction = "horizontal",
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = -1),
        plot.subtitle = element_text(hjust = 0.5, vjust = -1))

To leave a comment for the author, please follow the link and comment on their blog: Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting.


The post The Bachelorette Eps. 4 & 5 - Influencers in the Garden - Data and Drama in R first appeared on R-bloggers.

#FunDataFriday – #BlackInDataWeek


[This article was first published on #FunDataFriday - Little Miss Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

WHAT IS IT?

#BlackInDataWeek is a free, online data conference taking place November 16-21, 2020. Straight from their website, #BlackInDataWeek is:

> A week-long celebration to (1) highlight the valuable work and experiences of Black people in the field of Data, (2) provide community, and (3) educational and professional resources.

WHY IS IT AWESOME?

I can’t possibly list all of the reasons why this event is awesome. Just look at the description above, it’s full of goodness! The organizers have created a very exciting event structured to provide community, support and growth for Black people working in data.

Also, the sessions are on fire! I’ve added nearly all of them to my calendar. They offer a wide range of content from career to technical sessions and geared towards all levels of experience.

I’m particularly excited to attend these sessions:

  • Data Careers After Age 40 – November 17 1:30 – 2:30 PM EST

  • Survival Strategies in Data Careers – November 17 5:00 – 6:00 PM EST

  • Data Journeys Fireside Chat – November 17 6:00 – 8:00 PM EST

  • COVID-19 Health Disparities – November 19 12:00 – 12:30 PM EST

  • Avoid a Blank Stare: How to Tell a Great Story with Data – November 19 1:00 – 1:30 PM EST

  • Visualize Your Data Journey – November 19 5:00 – 7:00 PM EST

  • Bias in AI Algorithms – November 20 6:00 – 7:00 PM EST

  • AMA Algorithmic Fairness r/blackpeopletwitter – November 20 12:00 PM EST

For beginner data professionals, you will not want to miss these sessions:

  • Introduction to R – November 18 2:00 – 3:00pm EST

  • Machine Learning Tutorial – November 18 12:00 – 1:00 PM EST

  • Career Development and Mentorship Panel – November 21 2:00 – 5:00 PM EST

HOW TO GET STARTED?

Visit their Event Brite sign up page and register for the conference today. I will see you there!


To leave a comment for the author, please follow the link and comment on their blog: #FunDataFriday - Little Miss Data.


The post #FunDataFriday - #BlackInDataWeek first appeared on R-bloggers.

New Light-board Lecture: wrapr::unpack


[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC.


The post New Light-board Lecture: wrapr::unpack first appeared on R-bloggers.
