
Nonlinear combinations of model parameters in regression


[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers.]

Nonlinear regression plays an important role in my research and teaching activities. While I often use the ‘drm()’ function in the ‘drc’ package for my research work, I tend to prefer the ‘nls()’ function for teaching purposes, mainly because, in my opinion, the transition from linear to nonlinear models is smoother for beginners. One problem with ‘nls()’ is that, in contrast to ‘drm()’, it is not specifically tailored to the needs of biologists or students in biology. Therefore, now and then, I have to build some helper functions to perform specific tasks; I usually share these functions within the ‘aomisc’ package, which is available on GitHub (see this link).

In this post, I would like to describe one of these helper functions, i.e. ‘gnlht()’, which is aimed at helping students (and practitioners; why not?) with one of their tasks, i.e. making some simple manipulations of model parameters, to obtain relevant biological information. Let’s see a typical example.

Motivating example

This is a real-life example, taken from research published by Vischetti et al. in 1996. That research considered three herbicides for weed control in sugar beet, i.e. metamitron (M), phenmedipham (P) and chloridazon (C). Four soil samples were contaminated, respectively, with: (i) M alone, (ii) M + P, (iii) M + C and (iv) M + P + C. The aim was to assess whether the degradation speed of metamitron in soil depended on the presence of co-applied herbicides. To reach this aim, the soil samples were incubated at 20°C and sub-samples were taken at different times after the beginning of the experiment. The concentration of metamitron in those sub-samples was measured by HPLC analyses, performed in triplicate. The resulting dataset is available within the ‘aomisc’ package; we can load it and use the ‘lattice’ package to visualise the observed means over time.

# library(devtools)
# install_github("OnofriAndreaPG/aomisc")
library(aomisc)
library(lattice)
data(metamitron)

xyplot(Conc ~ Time|Herbicide, data = metamitron,
       xlab = "Time (d)", ylab = "Concentration",
       scales = list(alternating = F),
       panel = function(x, y, ...) {
         panel.grid(h = -1, v = -1)
         fmy <- tapply(y, list(factor(x)), mean)
         fmx <- tapply(x, list(factor(x)), mean)
         panel.xyplot(fmx, fmy, col="red", type="b", cex = 1)
       })

Considering this exemplary dataset, let’s see how we can answer the following research questions.

  1. What is the degradation rate for metamitron, in the four combinations?
  2. Is there a significant difference between the degradation rate of metamitron alone and with co-applied herbicides?
  3. What is the half-life for metamitron, in the four combinations?
  4. What are the times to reach 70 and 30% of the initial concentration, for metamitron in the four combinations?
  5. Is there a significant difference between the half-life of metamitron alone and with co-applied herbicides?

Fitting a degradation model

The figure above shows a visible difference in the degradation pattern of metamitron, which could be attributed to the presence of co-applied herbicides. The degradation kinetics can be described by the following (first-order) model:

\[ C(t, h) = A_h \, \exp \left(-k_h \, t \right) \]

where \(C(t, h)\) is the concentration of metamitron at time \(t\) in each of the four combinations \(h\), \(A_h\) and \(k_h\) are, respectively, the initial concentration and degradation rate for metamitron in each combination.

The model is nonlinear and, therefore, we can use the ‘nls()’ function for nonlinear least squares regression. The code is given below: please, note that the two parameters are followed by the name of the factor variable in square brackets (i.e.: A[Herbicide] and k[Herbicide]). This is necessary to fit a different parameter value for each level of the ‘Herbicide’ factor.

#Fit nls grouped model
modNlin <- nls(Conc ~ A[Herbicide] * exp(-k[Herbicide] * Time),
               start=list(A=rep(100, 4), k=rep(0.06, 4)),
               data=metamitron)
summary(modNlin)
## 
## Formula: Conc ~ A[Herbicide] * exp(-k[Herbicide] * Time)
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## A1 9.483e+01  4.796e+00   19.77   <2e-16 ***
## A2 1.021e+02  4.316e+00   23.65   <2e-16 ***
## A3 9.959e+01  4.463e+00   22.31   <2e-16 ***
## A4 1.116e+02  4.184e+00   26.68   <2e-16 ***
## k1 4.260e-02  4.128e-03   10.32   <2e-16 ***
## k2 2.574e-02  2.285e-03   11.26   <2e-16 ***
## k3 3.034e-02  2.733e-03   11.10   <2e-16 ***
## k4 2.186e-02  1.822e-03   12.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.701 on 88 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 7.136e-06

For the sake of simplicity, I will skip an accurate model check here, although I must point out that skipping it in real work is bad practice. I’ll come back to this issue in another post.

Working with model parameters

Considering the research questions, it is clear that the above output answers the first one, as it gives the four degradation rates, \(k1\), \(k2\), \(k3\) and \(k4\). All the other questions can be translated into sets of linear/nonlinear functions (combinations) of model parameters. If we use the naming of parameter estimates in the nonlinear regression object, for the second question we can write the following functions: \(k1 - k2\), \(k1 - k3\) and \(k1 - k4\). The third question requires some slightly more complex math: if we invert the equation above for one herbicide, we obtain the following inverse equation:

\[ t = \frac{- \log \left[\frac{C(t)}{A} \right] }{k} \]

I do not think this is complex enough to scare the biologists, is it? The half-life is the time required for C(t) to drop to half of the initial value, so that \(C(t)/A\) is equal to \(0.5\). Thus:

\[ t_{1/2} = \frac{- \log \left[0.5 \right] }{k} \]
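As a quick sanity check (a manual calculation, not part of the original post), we can plug in the estimate of \(k1\) from the ‘summary()’ output above:

-log(0.5) / 0.0426   # about 16.3 days: the half-life for metamitron applied alone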

Analogously, we can answer question 4 by replacing \(0.5\) with \(0.7\) and \(0.3\), respectively. The difference between the half-lives of metamitron alone and combined with the second herbicide can be calculated by:

\[ \frac{- \log \left[0.5 \right] }{k_1} - \frac{- \log \left[0.5 \right] }{k_2} = \frac{k_2 - k_1}{k_1 \, k_2} \, \log(0.5)\]

The other differences are obtained analogously.
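As a quick numerical check (again a manual calculation), plugging the estimates of \(k1\) and \(k2\) from the output above into this expression gives:

(0.02574 - 0.0426) / (0.0426 * 0.02574) * log(0.5)   # about 10.7 days

which is in line with the result returned by ‘gnlht()’ further below.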

Inferences and hypotheses testing

All parameter estimates are characterised by some uncertainty, which is summarised by way of the standard errors (see the code output above). Clearly, such uncertainty propagates to their combinations. As for the second question, the combinations are linear, as only subtraction is involved. Therefore, the standard error for each difference can be easily calculated by the usual law of propagation of errors, which I have dealt with in this post.
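To make this concrete, here is a minimal sketch of that propagation for \(k1 - k2\), based on the variance-covariance matrix of the fitted model:

V <- vcov(modNlin)
sqrt(V["k1", "k1"] + V["k2", "k2"] - 2 * V["k1", "k2"])   # close to 0.0047, as reported by gnlht() below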

In R, linear combinations of model parameters can be built and tested by using the ‘glht()’ function in the ‘multcomp’ package. However, I wanted a general solution that could handle both linear and nonlinear combinations of model parameters. Such a solution should be based on the ‘delta method’, which I have dealt with in this post. Unfortunately, the function ‘deltaMethod()’ in the ‘car’ package is not flexible enough for my students’ purposes (or mine).
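To be fair, for a single combination ‘deltaMethod()’ does the job; a sketch (it should accept our ‘nls’ object, as ‘coef()’ and ‘vcov()’ methods are available):

library(car)
deltaMethod(modNlin, "-log(0.5)/k1")   # half-life for metamitron alone

However, it handles one expression at a time, with no simple way to tabulate several expressions over a set of constants.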

Therefore, I wrote a wrapper for the ‘deltaMethod()’ function, which I named ‘gnlht()’, as it might play the same role for nonlinear combinations as ‘glht()’ plays for linear combinations. To use this function, apart from loading the ‘aomisc’ package, we need to prepare a list of formulas. Care needs to be taken to make sure that the elements in the formulas correspond to the names of the estimated parameters in the model object, as returned by the ‘coef()’ method. In the box below, I show how we can calculate the differences between the degradation rates.

funList <- list(~k1 - k2, ~k1 - k3, ~k1 - k4)
gnlht(modNlin, funList)
##      form   Estimate          SE  t-value      p-value
## 1 k1 - k2 0.01686533 0.004718465 3.574325 5.727311e-04
## 2 k1 - k3 0.01226241 0.004951372 2.476568 1.517801e-02
## 3 k1 - k4 0.02074109 0.004512710 4.596150 1.430392e-05

The very same code can be used for nonlinear combinations of model parameters. In order to calculate the half-lives, we can use the following code:

funList <- list(~ -log(0.5)/k1, ~ -log(0.5)/k2,
                ~ -log(0.5)/k3, ~ -log(0.5)/k4)
gnlht(modNlin, funList)
##           form Estimate       SE  t-value      p-value
## 1 -log(0.5)/k1 16.27089 1.576827 10.31876 7.987827e-17
## 2 -log(0.5)/k2 26.93390 2.391121 11.26413 9.552915e-19
## 3 -log(0.5)/k3 22.84747 2.058588 11.09861 2.064093e-18
## 4 -log(0.5)/k4 31.70942 2.643329 11.99601 3.257067e-20

Instead of writing ‘0.5’, we can introduce a new model term, e.g. ‘prop’, as a ‘constant’, in the sense that it is not an estimated parameter. We can pass a value for this constant in a data frame, by using the ‘const’ argument:

funList <- list(~ -log(prop)/k1, ~ -log(prop)/k2,
                ~ -log(prop)/k3, ~ -log(prop)/k4)
gnlht(modNlin, funList, const = data.frame(prop = 0.5))
##            form prop Estimate       SE  t-value      p-value
## 1 -log(prop)/k1  0.5 16.27089 1.576827 10.31876 7.987827e-17
## 2 -log(prop)/k2  0.5 26.93390 2.391121 11.26413 9.552915e-19
## 3 -log(prop)/k3  0.5 22.84747 2.058588 11.09861 2.064093e-18
## 4 -log(prop)/k4  0.5 31.70942 2.643329 11.99601 3.257067e-20

This is very flexible, because it lets us calculate, in one go, the half-life and the times required for the concentration to drop to 70% and 30% of the initial value:

funList <- list(~ -log(prop)/k1, ~ -log(prop)/k2,
                ~ -log(prop)/k3, ~ -log(prop)/k4)
gnlht(modNlin, funList, const = data.frame(prop = c(0.7, 0.5, 0.3)))
##             form prop  Estimate        SE  t-value      p-value
## 1  -log(prop)/k1  0.7  8.372564 0.8113927 10.31876 7.987827e-17
## 2  -log(prop)/k1  0.5 16.270892 1.5768267 10.31876 7.987827e-17
## 3  -log(prop)/k1  0.3 28.261979 2.7388937 10.31876 7.987827e-17
## 4  -log(prop)/k2  0.7 13.859465 1.2304069 11.26413 9.552915e-19
## 5  -log(prop)/k2  0.5 26.933905 2.3911214 11.26413 9.552915e-19
## 6  -log(prop)/k2  0.3 46.783265 4.1532956 11.26413 9.552915e-19
## 7  -log(prop)/k3  0.7 11.756694 1.0592942 11.09861 2.064093e-18
## 8  -log(prop)/k3  0.5 22.847468 2.0585881 11.09861 2.064093e-18
## 9  -log(prop)/k3  0.3 39.685266 3.5756966 11.09861 2.064093e-18
## 10 -log(prop)/k4  0.7 16.316814 1.3601864 11.99601 3.257067e-20
## 11 -log(prop)/k4  0.5 31.709415 2.6433295 11.99601 3.257067e-20
## 12 -log(prop)/k4  0.3 55.078163 4.5913724 11.99601 3.257067e-20

The differences between the half-lives (and other degradation times) can be calculated as well:

funList <- list(~ (k2 - k1)/(k1 * k2) * log(prop),
                ~ (k3 - k1)/(k1 * k3) * log(prop), 
                ~ (k4 - k1)/(k1 * k4) * log(prop))
gnlht(modNlin, funList, const = data.frame(prop = c(0.7, 0.5, 0.3)))
##                              form prop  Estimate       SE  t-value
## 1 (k2 - k1)/(k1 * k2) * log(prop)  0.7  5.486900 1.473859 3.722813
## 2 (k2 - k1)/(k1 * k2) * log(prop)  0.5 10.663013 2.864235 3.722813
## 3 (k2 - k1)/(k1 * k2) * log(prop)  0.3 18.521287 4.975078 3.722813
## 4 (k3 - k1)/(k1 * k3) * log(prop)  0.7  3.384130 1.334340 2.536183
## 5 (k3 - k1)/(k1 * k3) * log(prop)  0.5  6.576577 2.593100 2.536183
## 6 (k3 - k1)/(k1 * k3) * log(prop)  0.3 11.423287 4.504125 2.536183
## 7 (k4 - k1)/(k1 * k4) * log(prop)  0.7  7.944250 1.583814 5.015900
## 8 (k4 - k1)/(k1 * k4) * log(prop)  0.5 15.438524 3.077917 5.015900
## 9 (k4 - k1)/(k1 * k4) * log(prop)  0.3 26.816185 5.346236 5.015900
##        p-value
## 1 3.468973e-04
## 2 3.468973e-04
## 3 3.468973e-04
## 4 1.297111e-02
## 5 1.297111e-02
## 6 1.297111e-02
## 7 2.718445e-06
## 8 2.718445e-06
## 9 2.718445e-06

The possibility of passing constants in a data.frame adds flexibility with respect to the ‘deltaMethod()’ function in the ‘car’ package. For example, we can use this method to make predictions:

funList <- list(~ A1 * exp(-k1 * Time), ~ A2 * exp(-k2 * Time), 
                ~ A3 * exp(-k3 * Time), ~ A4 * exp(-k4 * Time))
pred <- gnlht(modNlin, funList, const = data.frame(Time = seq(0, 67, 1)))
head(pred)
##                   form Time Estimate       SE  t-value      p-value
## 1 A1 * exp(-k1 * Time)    0 94.83198 4.795948 19.77336 3.931107e-34
## 2 A1 * exp(-k1 * Time)    1 90.87694 4.381511 20.74101 1.223613e-35
## 3 A1 * exp(-k1 * Time)    2 87.08684 4.015039 21.69016 4.511113e-37
## 4 A1 * exp(-k1 * Time)    3 83.45482 3.695325 22.58389 2.205772e-38
## 5 A1 * exp(-k1 * Time)    4 79.97427 3.421034 23.37722 1.623774e-39
## 6 A1 * exp(-k1 * Time)    5 76.63888 3.190531 24.02072 2.050113e-40
tail(pred)
##                     form Time Estimate       SE  t-value      p-value
## 267 A4 * exp(-k4 * Time)   62 28.78518 2.657182 10.83297 7.138133e-18
## 268 A4 * exp(-k4 * Time)   63 28.16278 2.648687 10.63273 1.824651e-17
## 269 A4 * exp(-k4 * Time)   64 27.55384 2.639403 10.43942 4.525865e-17
## 270 A4 * exp(-k4 * Time)   65 26.95807 2.629361 10.25270 1.090502e-16
## 271 A4 * exp(-k4 * Time)   66 26.37517 2.618594 10.07227 2.555132e-16
## 272 A4 * exp(-k4 * Time)   67 25.80489 2.607131  9.89781 5.827812e-16

Although this is not very fast, it has the advantage, in contrast to the ‘predict()’ method for ‘nls’ objects, of also reporting standard errors.
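For comparison, here is a sketch of the corresponding point predictions from ‘predict()’ (no standard errors are returned; the hypothetical new data frame recycles one level of the ‘Herbicide’ factor so that the indexed parameters can be resolved):

newd <- data.frame(Time = seq(0, 67, 1), Herbicide = metamitron$Herbicide[1])
head(predict(modNlin, newdata = newd))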

Hope this is useful. Happy coding!


Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Borgo XX Giugno 74, I-06121 Perugia

References

  1. Fox, J., Weisberg, S. (2019). An R Companion to Applied Regression, Third Edition. Thousand Oaks, CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
  2. Hothorn, T., Bretz, F., Westfall, P. (2008). Simultaneous Inference in General Parametric Models. Biometrical Journal 50(3), 346–363.
  3. Ritz, C., Baty, F., Streibig, J. C., Gerhard, D. (2015). Dose-Response Analysis Using R. PLOS ONE 10(12), e0146021.
  4. Vischetti, C., Marini, M., Businelli, M., Onofri, A. (1996). The effect of temperature and co-applied herbicides on the degradation rate of phenmedipham, chloridazon and metamitron in a clay loam soil in the laboratory. In: Re, A.D., Capri, E., Evans, S.P., Trevisan, M. (Eds.), “The Environmental Fate of Xenobiotics”, Proceedings X Symposium on Pesticide Chemistry, Piacenza. La Goliardica Pavese, Piacenza, pp. 287–294.



`R` you ready for python (gentle introduction to the reticulate package)


[This article was first published on R on notast, and kindly contributed to R-bloggers.]

Just like Thanos claimed to be inevitable in The Avengers, the direct or indirect use of python has become inevitable for R users in recent years. Fret not, R users: you don’t have to abandon your favourite IDE, RStudio, when using python. With the reticulate package you can use python in RStudio and even have a mixture of R and python code running in the same session. If you blog with blogdown, you don’t have to migrate to another platform to write about your python projects; with the help of reticulate, you can continue publishing content on your blogdown site. Think of reticulate as a translator between R and python.

library(tidyverse)
library(reticulate)

Setup

reticulate has a default approach to instruct R where to find python and which environment and version to use. There are three ways to configure this manually:

  1. use_virtualenv, where you specify the directory of a python virtual environment.

  2. use_python, where you specify the path where your python binary resides.

  3. use_condaenv, where you specify the name of the specific Conda environment to use. You can access the name(s) of the available environments via conda_list()[[1]].

conda_list()[[1]] %>% use_condaenv()
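The first two approaches look similar; here is a minimal sketch (the path and directory below are hypothetical examples, adjust them to your own setup):

use_python("/usr/bin/python3")   # hypothetical path to a python binary
use_virtualenv("~/my_venv")      # hypothetical virtual environment directory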

Let’s check which python version, environment and configuration have been bound to this R session.

py_config() #not run to maintain privacy

Running python

Change {r} in your code chunk header to {python} to run python code. For this post, I will add #{python} or #{r} in my code chunks to make it explicit whether I ran the code as a python or an R code chunk.

#{python}

Plist= [11,22,33,44,55,66]

print(Plist)
## [11, 22, 33, 44, 55, 66]
#{python}

def Psq_fun (x):
  value= x*x 
  return(value)
  
print(Psq_fun(9))
## 81

Alternatively, you can execute python scripts in your r chunks using the function py_run_string. You will need to wrap your python scripts in quotation marks ("").

#{r}

py_run_string("def Pten (x):
          value= x*10
          return(value)")

Let’s run the above in a python code chunk.

#{python}

Pten(2)
## 20

Do note that python objects/functions are not explicitly displayed in your global environment (you are after all working in an R global environment by default).

#{r}

ls()
## character(0)

Nevertheless, be assured that you can still access them in future python and r chunks.

#{python}

print(Psq_fun(9))
## 81
#{python}

print(Plist)
## [11, 22, 33, 44, 55, 66]

Accessing python

You can access previously run python functions/objects in your r chunks using the prefix py combined with R’s dollar-sign syntax, $.

#{r}

py$Psq_fun(4)
## [1] 16
#{r}

py$Plist
## [1] 11 22 33 44 55 66

Alternatively, you can directly evaluate previous python objects/functions in r chunks using the py_eval function.

#{r}

py_eval("Plist")
## [1] 11 22 33 44 55 66
#{r}

py_eval("Psq_fun(2)")
## [1] 4

Accessing R

Likewise, you can access R objects/functions in your python chunks using the prefix r followed by a dot (.).

#{r}

(Rvec<-c(11,22,33,44,55,66))
## [1] 11 22 33 44 55 66
#{python}

r.Rvec
## [11.0, 22.0, 33.0, 44.0, 55.0, 66.0]
#{r}

Rroot_fun<-function(x){
  value= x^.5
  print(value)
  }
#{python}

r.Rroot_fun(81)
## 9.0

Converting objects between languages

Besides accessing an R object in python via r., you can convert an R object into a python object while still running R, with r_to_py.

Rvec %>% r_to_py() %>% class()
## [1] "python.builtin.list"   "python.builtin.object"

Previously, I mentioned that python objects do not exist in your global R environment when you run the python script directly inside {python} code chunks or with py_run_string. However, when you create python objects in {r} code chunks, the python object is saved in the R environment.

#{r}

# convert R object into python 
Python_in_Renv<- Rvec %>% r_to_py()

# check if the converted object lives in the `R` environment. https://stackoverflow.com/questions/1169248/test-if-a-vector-contains-a-given-element
"Python_in_Renv" %in% ls() 
## [1] TRUE

For python objects living in the R global environment, you can convert them back to R objects with py_to_r.

Python_in_Renv %>% py_to_r() %>% class()
## [1] "numeric"

Layout

At times you may decide to display R and python code side by side to compare which language is superior.

Html document

Static

The first layout is static: the page is divided into two halves and content appears on either half. You sandwich the text and code chunks meant for the left and right columns between the respective html/css code.

#Start of column partition 
#Left column
Text: `R` code will be on the left Chunk of R code ``{r}``
#Right column
Text: `python` code will be on the right Chunk of Python code ``{python}``
# End of column partition

Dynamic

An alternative is a tabbed layout, where the different languages are compartmentalized into their respective tabs and the tabs are displayed next to each other. Users decide which language they wish to view and therefore which tab to highlight. You use the {.tabset} attribute on a header to create the tabs and end the tab section with ##.

## Level 2 heading {.tabset}

### Level 3 heading (first tab i.e. left tab)
Text: R code will be on the left tab
Chunk of R code ``{r}``

### Level 3 heading (next tab i.e. right tab)
Text: python code will be on the right tab
Chunk of python code ``{python}``

## 

Blog

Unfortunately, the above does not work when building a site with blogdown, despite my trying several Hugo themes. I’ve adapted code from here, which successfully created two columns on this site.


.col2 {
  columns: 2;
}

Text: R code will be on the left. Code: ``{r}``
Text: python code will be on the right. Code: ``{python}``

R code on the left

#{r}

Rroot_fun
## function(x){
##   value= x^.5
##   print(value)
##   }

python code on the right

#{python}

print(Pten)
## 


postdoc at Warwick on robust SMC [call]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]

Here is a call for a research fellow at the University of Warwick to work with Adam Johansen and Théo Damoulas on the EPSRC- and Lloyd's Register Foundation-funded project “Robust Scalable Sequential Monte Carlo with application to Urban Air Quality”. To quote:

The position will be based primarily at the Department of Statistics of the University of Warwick. The post holder will work closely in collaboration with the rest of the project team and another postdoctoral researcher to be recruited shortly to work within the Data Centric Engineering programme at the Alan Turing Institute in London. The post holder will be expected to visit the Alan Turing Institute regularly.

Candidates with strong backgrounds in the mathematical analysis of stochastic algorithms or sequential Monte Carlo methods are particularly encouraged to apply. Closing date is 19 Jan 2020.



New vtreat Feature: Nested Model Bias Warning


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows one to use all the training data both to learn the variable re-encodings and to correctly train a subsequent model (for an example, please see our recent PyData LA talk).

The next version of vtreat will warn the user if they have improperly used the same data for both vtreat impact-code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks for and warns against this situation. vtreat has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.

Set up the Example

This example is excerpted from some of our classification documentation.

Demonstrate the Warning

One way to design variable treatments for binomial classification problems in vtreat is to design a cross-frame experiment.


# For this example we want vtreat version 1.5.1 or newer
# remotes::install_github("WinVector/vtreat")
library(vtreat)

packageVersion("vtreat")
## [1] '1.5.1'

...

transform_design = vtreat::mkCrossFrameCExperiment(
    # data to learn transform from
    dframe = training_data, 
    # columns to transform
    varlist = setdiff(colnames(training_data), c('y', 'yc')), 
    # outcome variable
    outcomename = 'yc', 
    # outcome of interest
    outcometarget = TRUE 
)

Once we have that we can pull the data transform and correct cross-validated training frame off the returned object as follows.

transform <- transform_design$treatments
train_prepared <- transform_design$crossFrame

train_prepared is prepared in the correct way to use the same training data for inferring the impact-coded variables, using the returned $crossFrame from mkCrossFrameCExperiment().
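From here, a downstream model can be fit directly on the cross-frame. A minimal sketch (not part of the original example; it assumes ‘yc’ is a logical outcome and uses a hypothetical significance cut-off on the score frame to pick variables):

score_frame <- transform$scoreFrame
model_vars <- score_frame$varName[score_frame$sig < 0.05]   # hypothetical cut-off
model <- glm(reformulate(model_vars, response = "yc"),
             data = train_prepared, family = binomial())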

We prepare new test or application data as follows.

test_prepared <- prepare(transform, test_data)

The issue is: for training data we should not call prepare(), but instead use the cross-frame that is produced during transform design.

The point is we should not do the following:

train_prepared_wrong <- prepare(transform, training_data)
## Warning in prepare.treatmentplan(transform, training_data):
## possibly called prepare() on same data frame as designTreatments*()/
## mkCrossFrame*Experiment(), this can lead to over-fit. To avoid this, please use
## mkCrossFrame*Experiment$crossFrame.

Notice we now get a warning that we should not have done this, and in doing so we may have a nested model bias data leak.

And that is the new nested model bias warning feature.

The full R example can be found here, and a full Python example can be found here.



Game of Life, the DTerminal edition


[This article was first published on r-bloggers on mpjdem.xyz, and kindly contributed to R-bloggers.]

Yet another offshoot from my Advent of Code 2019 adventures (first half, second half, all solutions): on day 24, the challenge was to program a variant of Conway’s game of life, and I figured I might as well try my approach on the real thing!

A quick Google search for existing implementations yields three main approaches: nested for loops, shifting of matrix rows and columns, and repeated filtering of a coordinate data.frame. Visualisations tend to rely on R’s plotting capabilities, and more recently {gganimate}.

I used data.table for the computations, because it’s fast and succinct. Here’s the setup of the Game of Life universe, randomly seeding half of the cells as alive, and defining the relevant relative ‘neighbourhood’ of each cell through a small auxiliary table. For those unfamiliar with data.table, CJ() performs a cross-join to obtain all combinations of the vector arguments.

library(data.table)

dims <- c(49, 49)
universe <- CJ(x = seq(dims[1]), y = seq(dims[2]), k = 1, cell = FALSE)
universe[, cell := sample(c(FALSE, TRUE), prod(dims), TRUE)]

neighbours <- CJ(xd = -1:1, yd = -1:1, k = 1)[xd != 0 | yd != 0]

Next, we want to define a function to perform one step (or tick) of the game. The basic approach is to do a full Cartesian join of the neighbourhood and the universe, to determine the neighbouring coordinates of each cell. We clip off at the edges (unlike a proper GoL universe, which is infinite), and aggregate grouped by the original cell coordinate to count the number of neighbours. data.table allows us to express all of this in a really compact manner:

gol_tick <- function(xy, sz, nb) {
  nb[xy, on = .(k), allow.cartesian = TRUE
    ][, nbx := x + xd][, nby := y + yd
    ][nbx >= 1 & nbx <= sz[1] & nby >= 1 & nby <= sz[2]
    ][xy, on = .(nbx = x, nby = y)
    ][, .(nnb = sum(i.cell)), by = .(x, y, cell, k)
    ][!cell & nnb == 3, cell := TRUE
    ][cell & (nnb < 2 | nnb > 3), cell := FALSE
    ][, nnb := NULL]
}

So how about some visuals – and perhaps a bit of interaction? I chose to do this in the terminal, just to make the point that you can easily create these old-school games fully in R! You will need an ANSI-capable terminal emulator though, such as the default Ubuntu one. Do make it large enough (or the font small enough).

First, the interaction part. To collect keypresses without pausing the universe to prompt the user, we need the {keypress} package. Usage is as simple as calling keypress(FALSE) to get the currently pressed key. Second, the visuals. Geometric unicode characters can provide a nice grid layout, but how do we ensure that we update the visuals with each tick, instead of spitting out an endless sequence of universe states into the terminal? The answer is ANSI escape codes, which allow you to colour the output, clear terminal lines, and crucially move the cursor back to a previous position. All of this is achieved simply by outputting strings starting with \033[ (or \u001B[), followed by the ANSI instruction. For a more user-friendly interface to many of these functionalities have a look at the {cli} package – but here is the fully manual approach:

library(keypress)

cat("\033[?25l")
repeat ({
  kp <- keypress(FALSE)
  universe[order(y, x)
    ][, cat(fifelse(.SD$cell, "\033[32;40m◼", "\033[90;40m◌"),
            "\033[K\n"),
      by = y]
  cat("\033[2K\033[33;40m", sum(universe$cell), "\n")
  if (kp == "q") break
  if (kp == "x") {
    new_cells <- sample(c(FALSE, TRUE), prod(dims), TRUE, c(9, 1))
    universe[, cell := cell | new_cells]
  }
  Sys.sleep(0.2)
  cat(paste0("\033[", dims[2] + 1, "A"))
  universe <- gol_tick(universe, dims, neighbours)
})
cat("\033[?25h")

The game speed is throttled using Sys.sleep(), and the number of cells currently alive is displayed at the bottom. Two keys are interpreted: q exits the game, and x inserts new cells at random locations, bringing some new life to the eventually oscillatory universe!

The full code can be found in this gist. Run Rscript game_of_life.R and you should see something like this:

Now go forth and multiply!



Data Science Essentials


[This article was first published on jessecambon-R, and kindly contributed to R-bloggers.]

One of the greatest strengths of R for data science work is the vast number and variety of packages and capabilities that are available. However, it can be intimidating to navigate this large and dynamic open source ecosystem, especially for a newcomer. All the information you need is out there, but it is often fragmented across numerous Stack Overflow threads and websites.

In an attempt to consolidate some of this information, this blog post demonstrates fundamental methods that I have used repeatedly as a data scientist. This code should get you started in performing some essential and broadly useful data science tasks with R – data manipulation, summarization, and visualization.

I will mainly rely on the dplyr, tidyr, and ggplot2 packages which all have excellent documentation that you can refer to for further details. Datasets that are built into these packages will be used so that there is no need to download external data. Also note that the input and output datasets will be displayed for each example, but at times only the first several rows will be shown for display purposes.

If you’d like to follow along while running the code, you can find the RMarkdown file that generated this blog post here. Also, if you haven’t installed the tidyverse packages already, you’ll need to do that first with this command: install.packages('tidyverse').

Basic Data Manipulation

To begin, we need to load the tidyverse packages:

library(tidyverse)

Now, let’s take a look at the mpg dataset from the ggplot2 package:

manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact

We’ll perform a few of the most commonly used data manipulation operations on this dataset using the dplyr package. If you’re new to R, the <- operator that you see below is used to assign the value of what follows it to the dataset that precedes it. In this example, we are manipulating the ‘mpg’ dataset and saving it as the ‘mpg_subset’ dataset.

If you’re not familiar with dplyr, note that the “pipe” operator %>% is used to pass the output of a function to the following function. This allows us to perform data manipulations in sequence in a clear and readable way.

In this example, we keep only the two rows that contain 4-cylinder Nissan vehicles from model year 2005 onward, create two new columns, order the columns, remove four columns, and rename a column. In the order in which they are used below, here are the main functions that accomplish this:

  • filter controls which rows we want to keep from the input dataset. In this example, three conditions are applied using the “&” (AND) operator.
  • mutate is used to create new columns.
  • str_c combines multiple strings. In this case we are combining the manufacturer and model string fields into a single field with a single space in between.
  • select is used to pick which columns from the input dataset we want to keep. This select statement sets the order of the first four columns, includes the remaining columns, but then removes four columns.
    • A ‘-’ before a column name indicates that we want to remove that column.
    • The everything() function is shorthand for selecting all remaining columns and is an example of a select helper.
  • rename is used to rename the ‘fl’ column to ‘fuel_type’

mpg_subset <- mpg %>%
  filter(cyl == 4 & year >= 2005 & manufacturer == "nissan") %>%
  mutate(mpg_ratio = hwy/cty,
         make_model = str_c(manufacturer, ' ', model)) %>%
  select(make_model, year, hwy, cty, everything(),
         -manufacturer, -model, -drv, -trans) %>%
  rename(fuel_type = fl)

make_model | year | hwy | cty | displ | cyl | fuel_type | class | mpg_ratio
nissan altima | 2008 | 31 | 23 | 2.5 | 4 | r | midsize | 1.347826
nissan altima | 2008 | 32 | 23 | 2.5 | 4 | r | midsize | 1.391304

Summary Statistics

Calculating summary statistics like counts, means, and medians is a good initial step to understand a dataset. To count observations (rows) by a categorical variable, we can use the count function. Here we find the number of rows for each value of the ‘cyl’ (cylinders) column:

count_cyl <- mpg %>% count(cyl)

cyl | n
4 | 81
5 | 4
6 | 79
8 | 70

A broader variety of statistics can be calculated using the group_by and summarize functions. In this example, we create a new categorical column ‘class_c’ (which combines 2 seaters and subcompact vehicles into a single category) using the case_when function and then calculate a variety of basic summary statistics by this column.

The arrange function is used to order the rows in the dataset in descending order of the created ‘count’ variable. Note that the ‘ungroup’ function is not strictly necessary in this case, but it is a good practice if we plan to manipulate our dataset in the future without using groups.

mpg_stats <- mpg %>%
  select(class, hwy) %>%
  mutate(class_c = case_when(class %in% c("2seater", "subcompact") ~ "subcompact",
                             TRUE ~ class)) %>%
  group_by(class_c) %>%
  summarize(count = n(),
            min_hwy = min(hwy),
            max_hwy = max(hwy),
            median_hwy = median(hwy),
            mean_hwy = mean(hwy)) %>%
  ungroup() %>%
  arrange(desc(count)) # sort dataset

class_c | count | min_hwy | max_hwy | median_hwy | mean_hwy
suv | 62 | 12 | 27 | 17.5 | 18.12903
compact | 47 | 23 | 44 | 27.0 | 28.29787
midsize | 41 | 23 | 32 | 27.0 | 27.29268
subcompact | 40 | 20 | 44 | 26.0 | 27.72500
pickup | 33 | 12 | 22 | 17.0 | 16.87879
minivan | 11 | 17 | 24 | 23.0 | 22.36364

Stacking Data

If you have datasets whose columns or rows align, you can combine them by stacking the datasets vertically or horizontally. To demonstrate this, we will first use the slice function to subset the ‘mpg’ dataset by row numbers to create the ‘mpg1’ and ‘mpg2’ datasets.

mpg1 <- mpg %>%
  slice(1) %>%
  select(manufacturer, model, hwy, cty) %>%
  mutate(dataset = 1)

manufacturer | model | hwy | cty | dataset
audi | a4 | 29 | 18 | 1
mpg2 <- mpg %>%
  slice(44:45) %>%
  select(manufacturer, model, hwy, cty) %>%
  mutate(dataset = 2)

manufacturer | model | hwy | cty | dataset
dodge | caravan 2wd | 17 | 11 | 2
dodge | caravan 2wd | 22 | 15 | 2

Since these two datasets we just created have the same columns we can stack them vertically using bind_rows:

mpg_stack_vert <- mpg1 %>% bind_rows(mpg2)

manufacturer | model | hwy | cty | dataset
audi | a4 | 29 | 18 | 1
dodge | caravan 2wd | 17 | 11 | 2
dodge | caravan 2wd | 22 | 15 | 2

Now let’s create a third subsection of the ‘mpg’ dataset using the same rows that generated ‘mpg1’ and ‘mpg2’ above, but with different columns. We’ll call it ‘mpg3’:

mpg3 <- mpg %>%
  slice(1, 44:45) %>%
  select(displ, year)

displ | year
1.8 | 1999
3.3 | 2008
3.8 | 1999

We can stack the ‘mpg_stack_vert’ and ‘mpg3’ datasets horizontally since their rows align (we used the ‘slice’ function to subset the ‘mpg’ dataset on the same row numbers). We use the bind_cols function to do this.

mpg_stack_horz <- mpg_stack_vert %>% bind_cols(mpg3)

manufacturer | model | hwy | cty | dataset | displ | year
audi | a4 | 29 | 18 | 1 | 1.8 | 1999
dodge | caravan 2wd | 17 | 11 | 2 | 3.3 | 2008
dodge | caravan 2wd | 22 | 15 | 2 | 3.8 | 1999

Joining Data

If you have datasets that contain a common “key” column (or a set of key columns) then you can use one of the join functions from dplyr to combine these datasets. First let’s create a dataset named ‘car_type’ using the distinct function:

car_type <- mpg %>%
  select(manufacturer, model, class) %>%
  distinct()

manufacturer | model | class
audi | a4 | compact
audi | a4 quattro | compact
audi | a6 quattro | midsize
chevrolet | c1500 suburban 2wd | suv

Now we will join this newly created ‘car_type’ dataset to the ‘mpg_stack_horz’ dataset (created above) using the ‘left_join’ function. The ‘manufacturer’ and ‘model’ columns are used as joining keys. The resulting dataset, ‘joined’, now contains all the columns from ‘mpg_stack_horz’ as well as the ‘class’ column from the ‘car_type’ dataset.

joined <- mpg_stack_horz %>%
  left_join(car_type, by = c('manufacturer', 'model')) %>%
  select(-dataset, everything()) # make the 'dataset' column last

manufacturer | model | hwy | cty | displ | year | class | dataset
audi | a4 | 29 | 18 | 1.8 | 1999 | compact | 1
dodge | caravan 2wd | 17 | 11 | 3.3 | 2008 | minivan | 2
dodge | caravan 2wd | 22 | 15 | 3.8 | 1999 | minivan | 2

Converting Long to Wide Format

Let’s take a look at the us_rent_income dataset from the tidyr package:

GEOID | NAME | variable | estimate | moe
01 | Alabama | income | 24476 | 136
01 | Alabama | rent | 747 | 3
02 | Alaska | income | 32940 | 508
02 | Alaska | rent | 1200 | 13

Each row of this dataset pertains to either income or rent as we can see by looking at the value of the ‘variable’ column. This is an example of a “long” data format. The long format is versatile and desirable for data manipulation, but we may want to convert to the “wide” data format in some cases, particularly for presenting data.

To perform this conversion, we can use the pivot_wider function from tidyr. The end result is that the rent and income variables are put into separate columns. This function has two arguments you will need to set:

  • names_from: name of the column which contains values that will become our new column names.
  • values_from: name of the column which contains the values that will populate our new columns.

Additionally we use the select function to drop two columns, drop_na to remove rows with missing values, and mutate to create an income to rent ratio.

col_ratio <- us_rent_income %>%
  select(-GEOID, -moe) %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  drop_na() %>%
  mutate(income_rent_ratio = income / (12 * rent))

NAME | income | rent | income_rent_ratio
Alabama | 24476 | 747 | 2.730478
Alaska | 32940 | 1200 | 2.287500
Arizona | 27517 | 972 | 2.359139

Converting Wide to Long Format

Now let’s look at the world_bank_pop dataset from tidyr (only the first 6 columns are shown for display purposes):

country | indicator | 2000 | 2001 | 2002 | 2003
ABW | SP.URB.TOTL | 42444.000000 | 43048.000000 | 43670.000000 | 44246.00000
ABW | SP.URB.GROW | 1.182632 | 1.413021 | 1.434559 | 1.31036
ABW | SP.POP.TOTL | 90853.000000 | 92898.000000 | 94992.000000 | 97017.00000

This dataset is in “wide” format since a categorical variable, in this case the year, is stored in the column names. To convert this dataset to the “long” format, which can be more convenient for data manipulation, we use the pivot_longer function from tidyr. This function takes three inputs:

  • cols (1st argument): a list of the columns we want to pivot. In this example we create this list by subtracting the columns we don’t want to pivot.
  • names_to : the name of new column which will have the current column names as values.
  • values_to : name of the new column which will contain values.

We also use the ‘mutate’ and ‘as.numeric’ functions to convert our new ‘year’ variable to numeric and then filter so that our output dataset only includes certain years using the ‘seq’ function. The format for the ‘seq’ function is seq(start, stop, increment).

wb_pop <- world_bank_pop %>%
  pivot_longer(c(-country, -indicator), names_to = "year", values_to = "value") %>%
  mutate(year = as.numeric(year)) %>% # convert to numeric
  filter(year %in% seq(2000, 2016, 2))

country | indicator | year | value
ABW | SP.URB.TOTL | 2000 | 42444
ABW | SP.URB.TOTL | 2002 | 43670
ABW | SP.URB.TOTL | 2004 | 44669

Visualizations

Now that we have manipulated and summarized some datasets, we’ll make a few visualizations with ggplot2. Ggplot graphs are constructed by adding together a series of ggplot functions with the “+” operator. This gives us a large amount of customization options since ggplot functions can be combined in many different ways.

Below you will find code for several commonly used charts, but you can refer to ggplot’s documentation for more information. Here is a brief overview of the package:

  • The ggplot function initializes a graph and typically specifies the dataset that is being used.
  • At least one geom (geometric object) function such as geom_histogram, geom_point, or geom_line is included, which controls how data will be displayed.
  • The aes (aesthetic mappings) function controls which variables are used in the plot. This function can be included as part of the ggplot function or in a geom function depending on whether you want the effect to be global or specific to a geom function.
  • The formatting of the chart (such as margins, legend position, and grid lines) can be modified using preset themes such as theme_bw and theme_classic or the theme function which gives more manual control.
  • The ‘color’ parameter is used for setting the color of plot lines and points while the ‘fill’ parameter controls the color of areas (such as bars on bar charts). These parameters can be set to a value such as ‘navy’ or to a categorical variable. You can read more about this on ggplot’s site here.
  • To save a plot to a file use the ggsave function.

Scatter Plots

Scatter plots are used to visually examine the relationship between two continuous variables and can be created using geom_point. In this example, we plot engine displacement against highway MPG for the ‘mpg’ dataset. A ‘Transmission’ column is created to combine the various transmission types in the ‘trans’ variable into the ‘auto’ (automatic) and ‘manual’ categories using the str_detect function.

The ‘color’ argument in the ‘aes’ function is used to color our points according to the newly created ‘Transmission’ variable. A legend is automatically created and we’ve positioned it at the top of our graph.

ggplot(data = mpg %>%
         mutate(Transmission = case_when(str_detect(trans, 'auto') ~ 'auto',
                                         TRUE ~ 'manual')),
       aes(x = displ, y = hwy, color = Transmission)) +
  geom_point() +
  theme_light() +
  theme(legend.position = 'top',
        legend.text = element_text(size = 11)) +
  xlab('Displacement (L)') +
  ylab('Highway MPG')

Line Charts

Here we create a line graph with the SP.POP.GROW indicator from the ‘wb_pop’ dataset we created earlier based on world bank data. SP.POP.GROW is the percent population growth of a country and we divide its value (which is in the ‘value’ column) by 100 to convert it to a decimal percentage value.

In this example, both lines and points are displayed for our data because we have used both the geom_point and geom_line functions. The expand_scale function is used to control the margins in the x axis. We’ve also formatted the y axis as a percentage using the ‘labels’ argument in scale_y_continuous.

ggplot(wb_pop %>%
         filter(country %in% c("USA", "CAN", "MEX") & indicator == "SP.POP.GROW"),
       aes(x = year, y = value/100, color = country)) +
  theme_minimal() +
  geom_line() +
  geom_point() + # lines and points
  scale_x_continuous(expand = expand_scale(mult = c(.05, .05))) +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.title = element_blank(), # suppress legend title
        panel.grid.minor.x = element_blank(),
        legend.text = element_text(size = 11),
        legend.position = 'right') +
  xlab('Year') +
  ylab('Population Growth')

Histograms

Histograms display distributions of variables. We use a histogram to look at the distribution of highway MPG below. You may want to experiment with the ‘binwidth’ argument in the geom_histogram function to see how it affects what your histogram looks like. The expand_scale function is used to control the margins in the y axis.

ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 1) +
  theme_bw() +
  scale_y_continuous(expand = expand_scale(mult = c(0, .05))) +
  xlab('Highway MPG') +
  ylab('Vehicles')

Bar Charts

Bar charts are commonly used to show relative size and can be created with geom_bar. I find it helpful to order the bars by their size which I’ve done with the reorder function below. The geom_text function is used to add the labels to the top of the bars.

ggplot(data = mpg_stats, aes(x = reorder(class_c, -mean_hwy), y = mean_hwy)) +
  geom_bar(stat = 'identity', color = 'black') +
  scale_y_continuous(expand = expand_scale(mult = c(0, .1))) + # expand top margin
  geom_text(aes(label = round(mean_hwy)), vjust = -0.5) + # label bars
  theme_bw() +
  xlab('Vehicle Class') +
  ylab('Mean Highway MPG') + # no axis labels
  theme(panel.grid = element_blank()) # turn off grid

Lollipop Charts

Lollipop charts can be an attractive alternative to bar charts. We construct one here using geom_segment and geom_point. The coord_flip function is used to orient the chart horizontally instead of vertically. We use the theme function to hide all grid lines except for the major vertical lines. The reorder function is again used to order the axis (in this case by rent in descending order).

ggplot(data = col_ratio %>% arrange(desc(rent)) %>% head(15),
       aes(x = NAME, y = rent)) +
  geom_segment(aes(x = reorder(NAME, rent), xend = NAME, y = 0, yend = rent),
               color = "grey") +
  geom_point(size = 3) +
  theme_minimal() +
  theme(plot.subtitle = element_text(face = "bold", hjust = 0.5),
        plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.x = element_blank()) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar, expand = expand_scale(mult = c(0, .1))) +
  labs(title = 'US States with the Highest Rent',
       caption = 'Source: 2017 American Community Survey (Census)') +
  xlab('') +
  ylab('Median Monthly Rent')

Additional References

Here are some additional resources that you may find useful:

  • For importing data from files, refer to the readr (for CSV and text files) or readxl (for excel spreadsheets) packages.
  • To coerce a column in a dataset into a different format, you can use the as.numeric, as.character, as.Date, and as.factor functions (from base R). For more functions to work with date and datetime data see the lubridate package, for strings reference the stringr package, and for manipulating factors you can use the forcats package.
  • For quickly summarizing datasets with basic summary statistics, you can use the summary function (base R) or the skimr package.
  • The purrr package allows you to apply functions across the values of a list using the map function. One example of where this is useful is reading and combining data from multiple sheets in an Excel spreadsheet by applying a function that reads a single sheet to a list of sheet names (a sketch of this pattern follows this list).
  • I keep reference data science code (both R and Python) in a GitHub repository. You’ll find some more advanced techniques like modeling demonstrated there.
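Here is a minimal sketch of the purrr pattern mentioned above for reading every sheet of an Excel file into one data frame (the file name is hypothetical, and it assumes the sheets share the same columns):

library(readxl)
library(purrr)

path <- "my_workbook.xlsx"   # hypothetical file
all_sheets <- excel_sheets(path) %>%
  set_names() %>%                                    # keep the sheet names for the .id column
  map_dfr(~ read_excel(path, sheet = .x), .id = "sheet")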


How to reverse engineer a heat map into its underlying values


[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]

Astrolabe Diagnostics is a fully bootstrapped five-person biotech startup. We offer the Antibody Staining Data Set (ASDS), a free service that helps immunologists find out the expression of different molecules (markers) across subsets in the immune system. Essentially, the ASDS is a big table of numbers, where every row is a subset and every column a marker. Recently, the Sean Bendall lab at Stanford released the preprint of a similar study, where they measured markers for four of the subsets that the ASDS covered. Since the two studies used different techniques for their measurements I was curious to examine the correlation between the results. However, the preprint did not include any of the actual data. The closest was Figure 1D, a heat map for 98 of the markers measured in the study:

I decided to take the heat map image and “reverse engineer” it into the underlying values. Specifically, what I needed was the “Median scaled expression” referred to in the legend in the bottom right. Since I could not find any existing packages or use cases for easily doing this I decided to hack a solution (check out the code and PNG and CSV files at the github repository).

First, I manually entered the marker names from the X-axis into a spreadsheet. Then, I cropped the above image, removing the legends, axes, and the top heat map row which includes an aggregate statistic not relevant to this exercise.

I loaded the image into R using the readPNG function from the png package. This results in a three-dimensional matrix where the first two dimensions are the X- and Y-values and the third is the RGB values. The X axis maps to the markers and the Y axis maps to the four subsets (“Transitional”, “Naive”, “Non-switched”, and “Switched”), and I wanted to get a single pixel value for each (Subset, Marker) combination. Deciding on the row for each subset was easy enough: I loaded the image in GIMP and picked rows 50, 160, 270, and 380. In order to find the column for each marker I initially planned to iterate over the tile width. Unfortunately, tile widths are not consistent, which is further complicated by the vertical white lines. I ended up choosing them manually in GIMP as well:

Marker,Pixel
CD1d,14
CD31,40
HLA-DQ,70
CD352,100
CD21,128
CD196,156
CD79b,185
CD1c,219
...
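The code further down assumes that this setup is available as R objects; here is a minimal sketch of it (the cropped PNG file name is hypothetical, while the pixel rows and the marker CSV are those described above):

library(dplyr)

heat_map_png <- png::readPNG("heat_map_cropped.png")   # hypothetical name of the cropped figure
cell_subset_rows <- c(Transitional = 50, Naive = 160, "Non-switched" = 270, Switched = 380)
marker_key <- read.csv("marker_cols.csv", stringsAsFactors = FALSE)
marker_cols <- setNames(marker_key$Pixel, marker_key$Marker)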

I could now get the RGB value for a (Subset, Marker) from the PNG. For example, if I wanted the CD31 value for the “Non-switched” subset, I would go to heat_map_png[270, 40, ]. This will give me the vector [0.6823529, 0.0000000, 0.3882353]. In order to map these values into the “Median scaled expression” values, I used the legend in the bottom left. First, I cropped it into its own PNG file:

I imported it into R using readPNG, arbitrarily took the pixels from row 10, and mapped them into values using seq:

# Import legend PNG, keep only one row, and convert to values. The values "0"
# and "0.86" are taken from the image.
legend_png <- png::readPNG("legend.png")
legend_mtx <- legend_png[10, , ]
legend_vals <- seq(0, 0.86, length.out = nrow(legend_mtx))

At this point I planned to reshape the heat map PNG matrix into a data frame and join the RGB values into the legend values. However, this led to two issues.

One, reshaping a three-dimensional matrix into two dimensions is a headache since I want to make sure I end up with the row and column order I need. Sticking to the spirit of the hack, I iterated over all (Subset, Marker) values instead. This is inelegant (iterating in R is frowned upon) but is a reasonable compromise given the small image size.

Two, I can’t actually join on the legend RGB values. The heat map uses a gradient and therefore some of its values might be missing from the legend itself (the reader can visually infer them). Instead, I calculated the distance between each heat map pixel and the legend pixels and picked the nearest legend pixel for its “Median scaled expression”.

heat_map_df <- lapply(names(marker_cols), function(marker) {
  lapply(names(cell_subset_rows), function(cell_subset) {
    v <- t(heat_map_png[cell_subset_rows[cell_subset], marker_cols[marker], ])
    dists <- apply(legend_mtx, 1, function(x) sqrt(sum((x - v) ^ 2)))
    data.frame(
      Marker = marker,
      CellSubset = cell_subset,
      Median = legend_vals[which.min(dists)],
      stringsAsFactors = FALSE
    )
  }) %>% dplyr::bind_rows()
}) %>% dplyr::bind_rows()

I now have the heat_map_df values I need to compare to the ASDS! As a sanity check, I reproduced the original heat map using ggplot:

heat_map_df$Marker <-
  factor(heat_map_df$Marker, levels = names(marker_cols))
heat_map_df$CellSubset <-
  factor(heat_map_df$CellSubset, levels = rev(names(cell_subset_rows)))

ggplot(heat_map_df, aes(x = Marker, y = CellSubset)) +
  geom_tile(aes(fill = Median), color = "white") +
  scale_fill_gradient2(
    name = "Median Scaled Expression",
    low = "black", mid = "red", high = "yellow",
    midpoint = 0.4) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.4),
        axis.title = element_blank(),
        legend.position = "bottom",
        panel.background = element_blank())

The resulting code gets the job done and can be easily repurposed for other heat maps. There will be some manual work involved, namely, setting cell_subset_rows to the rows in the new heat map, updating marker_cols.csv accordingly, and setting the boundary values in the seq call when calculating legend_vals. Furthermore, we should be able to adapt the above into a more autonomous solution by calculating the boundaries between tiles using diff, running it separately on the rows and the columns (getting the row and column labels will not be trivial and will require OCR). For a one-time exercise, though, the above hack works remarkably well — sometimes that is all the data science you need to get the job done. Check out this YouTube video for the actual comparison between the data sets!
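For instance, the ‘diff’ idea mentioned above could start from something like this rough sketch for the columns (it assumes the separator lines are pure white and ignores edge cases):

# flag columns whose pixels are all (nearly) white: these are the vertical separators
is_white_col <- apply(heat_map_png, 2, function(px) all(px > 0.95))
# diff() marks the transitions between separators and tiles
transitions <- diff(is_white_col)
tile_starts <- which(transitions == -1) + 1   # first column after a white gap
tile_ends   <- which(transitions == 1)        # last column before the next gap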



Start 2020 with mad new skills you learned at rstudio::conf. Final Call


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There will be no better time or place this year to accelerate your knowledge of all things R and RStudio than at rstudio::conf 2020 in San Francisco. While we’re approaching capacity, there’s still room for you! Whether you’re a dedicated R user or one of the many people who use R and Python, come join us and more than 2,000 fellow data scientists and data science team leaders in San Francisco January 27 – 30.

Learn more and register for the RStudio 2020 conference here

You can still register for a workshop (January 27 – 28), the conference (January 29 – 30), or both! With a little more than 2 weeks to go, we expect to reach conference capacity soon.

Here are the career-building workshops with seats still available for you as of January 9, 2020. We apologize in advance if a workshop listed here is sold out before you have the chance to register.

  • A Practical Introduction to Data Visualization with ggplot2
  • Modern Geospatial Data Analysis with R
  • Designing the Data Science Classroom
  • Text Mining with Tidy Data Principles
  • Big Data with R
  • R Markdown and Interactive Dashboards
  • R for Excel Users
  • What They Forgot to Teach You about R Workshop
  • My Organization’s First R Package Workshop
  • Shiny from Start to Finish

Note: Childcare for registered attendees is also still available for children 6 months to 8 years of age from 8am – 6pm daily. The cost is $20/day.

Hear what other data scientists have to say about rstudio::conf!




5 Reasons to Learn H2O for High-Performance Machine Learning


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

H2O is the scalable, open-source Machine Learning library that features AutoML. Here are 5 Reasons why it’s an essential library for creating production data science code.

Full-Stack Data Science Series

This is part of a series of articles on essential Data Science and Web Application skills for 2020 and beyond:

  1. Part 1 – 5 Full-Stack Data Science Technologies for 2020 (and Beyond)
  2. Part 2 – AWS Cloud
  3. Part 3 – Docker
  4. Part 4 – Git Version Control
  5. Part 5 – H2O Automated Machine Learning (AutoML)
  6. Part 6 – Shiny Web Applications (Coming Soon)
  7. [NEW BOOK] – Shiny Production with AWS, Docker, Git Book

Machine Learning Up 440% vs 5-Years Ago

Before I jump into H2O, let’s first understand the demand for ML. The 5-year trends in Technology Job Postings show a 440% increase in “Machine Learning” skills being requested, capturing a 7% share in all technology-related job postings.

Not just “Data Scientist” Jobs… ALL Technology Jobs.

Today's Top Tech Skills

Top 20 Tech Skills 2014-2019 Source: Indeed Hiring Lab.

My point: Learning ML is essential

We can safely say that if you are in a technology job (or seeking one) then you need to learn how to apply AI and Machine Learning to solve business problems.

The problem: There are a dozen machine learning and deep learning frameworks – TensorFlow, Scikit-Learn, H2O, MLR3, PyTorch, … These all take time and effort to learn. So, which framework should you learn for business?

Why I use and recommend H2O: H2O has single-handedly produced results in hours that would otherwise have taken days or weeks. I recommend learning H2O for applying Machine Learning to business data. I’ve been using H2O for several years now, both on consulting projects and teaching it to clients. Below are 5 reasons that explain how I have gotten this productivity boost using H2O on my business projects.

5-Reasons why I use and teach H2O

My Top 5-Reasons why I use and recommend learning H2O.

1. AutoML: Massive Productivity Booster

H2O AutoML automates the machine learning workflow, which includes automatic training and tuning of many models. This allows you to spend your time on more important tasks like feature engineering and understanding the problem.
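To make this concrete, here is a minimal sketch of an AutoML run in R (not taken from the post; the dataset, response column, and model cap are illustrative choices):

# Minimal AutoML sketch (illustrative, not the post's code): train a handful
# of models on a small built-in dataset and inspect the leaderboard.
library(h2o)
h2o.init()                        # start a local H2O cluster
train <- as.h2o(iris)             # push an R data frame into H2O
aml <- h2o.automl(
  y = "Species",                  # response column
  training_frame = train,
  max_models = 10,                # cap the number of models trained
  seed = 1
)
print(aml@leaderboard)            # models ranked by cross-validated metric
preds <- h2o.predict(aml@leader, train)  # score data with the best model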

H2O AutoML Hex Sticker

Me holding my H2O AutoML Hex Sticker. H2O is my go-to for production ML.

2. Scalable on Local Compute: Distributed, In-Memory Processing speeds up computations

In-memory processing with fast serialization between nodes and clusters to support massive datasets enables problems that traditionally need bigger tools to be solved in-memory on your local computer.

3. Spark Integration & GPU Support: Big Data

Combined with Spark integration (via Sparkling Water and rsparkling) and GPU support, the result is up to 100x faster training than traditional ML.

Sparkling Water

rsparkling – The Spark + H2O Big Data Solution

4. Best Algorithms, Optimized and Ensembled: Superior Performance

H2O’s algorithms are developed from the ground up for distributed computing. The most popular algorithms are incorporated including:

  • XGBoost
  • GBM
  • GLM
  • Random Forest
  • and more.

AutoML ensembles (combines) these models to provide superior performance.

5. Production Ready: Docker Containers

I love using Docker (learn why) + H2O to integrate AutoML models into Shiny Web Applications. H2O is built on (and depends on) Java, which traditionally creates deployment overhead. But H2O Docker Images make deploying H2O models super easy, with all the necessary software inside the pre-built Docker Image.

H2O in Production

H2O can be integrated into Shiny Applications like this one – an Employee Attrition Prediction & Prevention App.

Employee Attrition App

Employee Attrition Prevention App (course coming to BSU soon)

H2O is the underlying prediction technology

You need to learn H2O AutoML to build the Employee Attrition Shiny App. H2O AutoML generates the “Employee Attrition Machine Learning Model” that scores the employees based on features like tenure, over time, stock option level, etc.

Employee Attrition Machine Learning Model

H2O AutoML – Employee Attrition Machine Learning Model, built in the DS4B 201-R Course

The H2O Course

If you are ready to learn H2O AutoML along with critical supporting technologies and data science workflow processes that follow an enterprise-grade system, then look no further: DS4B 201-R (Advanced Machine Learning & Business Consulting Course).

You follow a 10-week program for solving Business Problems with Data Science that teaches each of the tools needed to solve a $15M/year employee attrition problem using Machine Learning (H2O), Explainable ML (LIME), and Optimization (purrr).

DS4B 201-R - 10-Week Program

10-Week System for Solving Business Problems with Machine Learning (DS4B 201-R Course)

In weeks 5 & 6, you learn H2O AutoML in-depth as part of your learning journey.

Learn H2O AutoML

Learn H2O AutoML – Weeks 5 and 6 of the DS4B 201-R Course

No Machine Learning Experience? Don’t worry. You’re covered.

You are probably thinking, “How do I learn H2O if I have no Machine Learning background or coding experience?”

That’s why I created the 4-Course R-Track Program.

Go from beginner to expert in 6-months or less with no prior experience required.

You learn:

  • Data Science Foundations
  • Advanced Machine Learning & Business Consulting (H2O AutoML)
  • Shiny Dashboards
  • Shiny Developer with AWS (NEW)

I look forward to providing you the best data science for business education.

Matt Dancho

Founder, Business Science

Lead Data Science Instructor, Business Science University



Sudoku game with R


[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Sudoku is a classical logic game based on a combinatorial number-placement puzzle. The objective is to fill a 9×9 matrix with digits so that each column, each row, and each box (3×3 sub-grid) contains all of the digits from 1 to 9.

Solving a puzzle by hand can sometimes be a nagging process, so here are some R helper functions to do it for you.

grid

Let’s get the Sudoku grid we want to solve into R:

sudoku <- matrix(data=c(6,0,0,2,1,0,0,3,0,5,0,9,0,0,0,6,0,0,2,0,0,9,7,0,0,0,4,0,0,2,3,0,4,0,0,0,0,6,0,0,5,0,0,9,0,0,0,0,1,0,9,7,0,0,9,0,0,0,3,8,0,0,6,0,0,7,0,0,0,2,0,5,0,8,0,0,4,2,0,0,9), nrow=9, ncol=9, byrow=FALSE)

 

Now, we will need a function that will find all the 0 values – these are the values we need to work on.

get_zeros <- function(board_su){
  # empty df
  df <- data.frame(i = NULL, j = NULL)
  for (i in 1:nrow(board_su)){
    for (j in 1:ncol(board_su)){
      if (board_su[i, j] == 0) {
        a <- data.frame(i, j)
        #names(a) <- c("i", "j")
        #df <- rbind(df, a)
        df <- a
        return(df)
      }
    }
  }
}

In addition, we will need a function to solve the puzzle and validate candidate values.

The validater function checks, for the Sudoku board, whether a particular value is valid at a particular position:

validater(sudoku, 1, c(1,4))

In the matrix, at position x=1, y=4 (where there is a 0), it tests whether the number 1 is valid or not. If the number is valid, it returns TRUE to the outer function that searches for the complete solution.

The solver iterates through all the possible 0-positions and through the candidate values that are still available based on the rules:

  • each row can contain each number in the range 1..9 only once
  • each column can contain each number in the range 1..9 only once
  • each 3×3 sub-grid can contain each number in the range 1..9 only once

And the nuts and bolts of the validater function:

validater <- function(board_su, num, pos = c(NULL, NULL)){
  status <- FALSE
  a <- as.integer(pos[1])
  b <- as.integer(pos[2])
  num <- as.integer(num)
  while (status == FALSE) {
    for (i in 1:9) {
      if ((board_su[a, i] == num & b != i) == TRUE) {
        status <- FALSE
        return(status)
      }
    }
    for (i in 1:9) {
      if ((board_su[i, b] == num & a != i) == TRUE) {
        status <- FALSE
        return(status)
      }
    }
    # which box are we in
    boxNx <- as.integer(ifelse(as.integer(b/3) == 0, 1, as.integer(b/3)))
    boxNy <- as.integer(ifelse(as.integer(a/3) == 0, 1, as.integer(a/3)))
    # looping through the box
    for (i in boxNy*3:(boxNy*3 + 3)) {
      for (j in boxNx*3:(boxNx*3 + 3)) {
        if ((board_su[i, j] == num & i != a & j != b) == TRUE){
          status <- FALSE
        }
      }
    }
    status <- TRUE
    return(status)
  }
}

With the following solution:

solution
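The recursive solver that ties get_zeros() and validater() together is not reproduced here; a minimal backtracking sketch (my own, with its assumptions noted in the comments) could look like this:

# Minimal backtracking sketch (not from the original post). It assumes
# get_zeros() returns the first empty cell as a one-row data frame (or NULL
# when the board is full) and validater() returns TRUE when `num` can be
# placed at that position.
solve_sudoku <- function(board_su) {
  pos <- get_zeros(board_su)
  if (is.null(pos)) return(board_su)        # no zeros left: solved
  for (num in 1:9) {
    if (validater(board_su, num, c(pos$i, pos$j))) {
      board_su[pos$i, pos$j] <- num
      res <- solve_sudoku(board_su)         # recurse on the updated board
      if (!is.null(res)) return(res)
      board_su[pos$i, pos$j] <- 0           # backtrack
    }
  }
  NULL                                      # dead end: forces backtracking
}

# solved <- solve_sudoku(sudoku)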

Of course, this is not to be taken too seriously: there are mobile apps that let you photograph a grid and solve it for you, using libraries such as OpenCV. The code was created just for fun (and because the Advent of Code for 2019 is over).

Happy R coding 🙂

As always, the code is available at Github.



Convert Apple Card PDF Statements to Tidy Data (i.e. for CSV/Excel/database export)


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I saw this CNBC article on an in-theory browser client-side-only conversion utility for taking Apple Card PDF statements and turning them into CSV files.

Since I (a) never trust any browser or site and (b) the article indicated that there is a $5 fee to avoid the “single random transaction removal”, I felt compelled to throw together an R script to do this for at least folks who are capable of setting up R so that all processing is guaranteed to be local.

FWIW the site does appear to do what it says on the tin (all processing is, indeed, local). That doesn’t mean one of your extensions isn’t spying on you, nor does it mean that the site could not turn evil someday (on its own or via an attacker compromise).

read_apple_card_statement <- function(path) {

  require(stringi)
  require(pdftools)
  require(tidyverse)

  # make sure the file exists
  path <- path.expand(path[1])
  if (!file.exists(path)) stop("File '", path, "' not found.", call.=FALSE)

  pdf_text(path) %>% # read it in
    stri_split_lines() %>% # turn \n to a separate character vector element
    unlist() %>% # flatten it
    stri_trim_both() %>% # get rid of leading/trailing spaces
    keep(stri_detect_regex, "^([[:digit:]]{2}/[[:digit:]]{2}/[[:digit:]]{4})") %>% # find lines that start with a date
    map_df(~{
      rec <- as.list(unlist(stri_split_regex(.x, "[[:space:]]{3,}"))) # find the columns
      if (stri_detect_fixed(.x, "%")) { # lines with a `%` in them distinguish charges from payments
        rec <- set_names(rec, c("date", "description", "daily_cash_pct", "daily_cash_amt", "amt")) # ones with charges have cash back columns
      } else {
        rec <- set_names(rec, c("date", "description", "amt")) # ones w/o % do not
      }
    }) %>%
    mutate(
      date = lubridate::mdy(date), # make dates dates
      amt = stri_replace_first_fixed(amt, "$", "") %>% parse_number(), # dollars to numbers
      daily_cash_pct = parse_number(daily_cash_pct)/100, # % to numbers
      daily_cash_amt = parse_number(daily_cash_amt) # dollars to numbers
    )

}

list.files("~/Downloads", pattern = "Apple Card Statement", full.names = TRUE) %>% 
  map_df(read_apple_card_statement)

You can send the PDF statements from the Apple Card app to your Mac via AirDrop and it will put them into ~/Downloads. I recommend putting them somewhere else since you’ve likely given all sorts of applications access to ~/Downloads when prompted to on Catalina (yay security theatre). Wherever you put them, you can read them individually with read_apple_card_statement() or you can then just list.files() and bind all the individual statements together:

list.files("~/WhereYouPutAppleCardStatements", pattern = "Apple Card Statement", full.names = TRUE) %>% 
  map_df(read_apple_card_statement)
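Since the post title mentions CSV/Excel/database export, here is a small follow-on sketch (not from the original post; file paths and the table name are illustrative) showing how the combined data frame could be written out:

# Bind all statements, then export to CSV and a SQLite table.
# Assumes the tidyverse is loaded (the reader function require()s it).
statements <- list.files("~/WhereYouPutAppleCardStatements",
                         pattern = "Apple Card Statement",
                         full.names = TRUE) %>%
  map_df(read_apple_card_statement)

readr::write_csv(statements, "apple_card_statements.csv")

con <- DBI::dbConnect(RSQLite::SQLite(), "apple_card.sqlite")
DBI::dbWriteTable(con, "statements", statements, overwrite = TRUE)
DBI::dbDisconnect(con)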

FIN

Be very wary of what you put your trust into online. Just because a site is benign one day does not mean it won’t be malicious (deliberately or otherwise) the next. Also, lobby Apple to provide data in more useful formats, especially since it provides applications like Numbers for free with their operating system.



Record Dec temperature and snow in Scotland


[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

December 2019 broke a couple of temperature records for the UK. One for the warmest night time temperature and one for the warmest temperature overall. Both of these UK records were broken in northern Scotland, a place with little daylight during the winter months.

The night-time temperature record was observed at a place called Cassley Power Station, on Loch Shin, and the overall temperature record at Achfary. These places are only about 25 km apart, but the river next to Achfary drains to the west of Scotland and the water at Cassley to the east. Cassley and Achfary were both Snow Survey of Great Britain (SSGB) sites. The SSGB recorded the snowline visible from stations across GB at 09:00 each morning. I transcribed the Scottish data for my PhD and you can download the SSGB from EIDC.

The SSGB was recorded at Cassley from April 1961 until March 2007 and at Achfary from November 1946 until October 1968. Unfortunately the Met Office closed the snow survey in 2007, as many of their stations were automated, making SSGB data collection difficult/impossible, and little use was being made of the collected data. Data collection at Achfary was patchy, and there are many missing records (possibly due to weather conditions – if it’s cloudy you can’t see the hills to record the snowline!).

For this blog post I’m presenting SSGB data from Cassley Power Station for December, to match the new records. Below you can see the number of days with no snow at any elevation (code to create at the bottom of the post).

CPS_dec_nosnow

So there was a big increase in the number of days without snow between 1980 and 2000 – it roughly doubled! This is in line with anecdotal data about the Scottish ski resorts, which struggled to operate during the 90s and 00s.

What hills can the Cassley site see? Which hills are we reporting no snow cover on? The SSGB returns listed the visible hills as “Ben More (Assynt): 3276′. Moavally 1673′. Ben Hee 2864′”. I completed a line-of-sight analysis using Ordnance Survey data in GRASS and mapped the output in QGIS. This is similar to work I’ve presented on which hills you can see from Arthur’s Seat. Below is a map of the area I mapped as visible from Cassley Power Station, which doesn’t include Ben More. However, there are no dwellings at Cassley, so the observer would need to travel along the loch. This would have given them a different view to that from the power station, and presumably included Ben More.
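If you want to try a similar line-of-sight calculation from R, here is a rough sketch using the rgrass7 interface (my own illustration, not the author's code; it assumes a running GRASS session with an Ordnance Survey DEM already imported, and the coordinates are placeholders):

# Rough viewshed sketch (illustrative only): assumes a GRASS location is set
# up and a DEM raster named "dem" has been imported. Coordinates below are
# placeholder easting/northing values, not the real station location.
library(rgrass7)
execGRASS("r.viewshed",
          input = "dem",
          output = "cassley_viewshed",
          coordinates = c(241000, 902000),
          observer_elevation = 1.75,
          flags = "overwrite")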

CPS_location (click image to enlarge!)

It’s a huge shame the SSGB stopped being run as the period since 2007 has been really interesting. Some winters have been exceptionally snowy, with others having almost no snow cover. You can read more about snow cover variability in this work I did for the Cairngorms with the James Hutton Institute.

Finally, here’s the code I used to extract the data from my SSGB database and make the plot.

# Packages
library(tidyverse)
library(lubridate)
library(janitor)
library(RSQLite)

# db extract
db = dbConnect(SQLite(), "~/Cloud/Mike/Uni_temp/SSGB/SSGB.sqlite")
df = dbGetQuery(db, "SELECT * FROM SSGB_obs WHERE Station = 'CassleyPS'") %>%
  janitor::clean_names() %>%
  as_tibble() %>%
  mutate(hog = str_sub(date, 6, 10),
         date = ymd(date),
         snowline_elev = replace(snowline_elev, snowline_elev == "m", NA),
         snowline_elev = replace(snowline_elev, snowline_elev == "99", NA),
         snowline_elev = replace(snowline_elev, snowline_elev == "n", "2000"),
         snowline_elev = as.numeric(snowline_elev)) %>%
  select(-snowline, snowline = snowline_elev)
dbDisconnect(db)

# December only
df_dec = df %>%
  filter(month(date) == 12)

# Missing values and no snow
y = df_dec %>%
  filter(is.na(snowline)) %>%
  count(hydro_year, name = "value") %>%
  mutate(var = "missing") %>%
  bind_rows(df_dec %>%
              filter(snowline == 2000) %>%
              count(hydro_year, name = "value") %>%
              mutate(var = "no_snow")) %>%
  spread(var, value) %>%
  mutate(missing = replace_na(missing, 0),
         no_snow = replace(no_snow, is.na(no_snow) & missing < 5, 0))

# Plot
ggplot(y, aes(hydro_year, no_snow)) +
  geom_point() +
  stat_smooth(size = 1.5) +
  coord_cartesian(ylim = c(0, 31)) +
  labs(title = "How often are the hills around\nCassley Power Station snow free in December?",
       subtitle = "Created by Mike Spencer with SSGB data (https://doi.org/10.5285/caf989a5-82d7-4db7-b6ff-c0475fdae07e)",
       x = "Year",
       y = "Days without snow") +
  theme_bw() +
  theme(text = element_text(size = 20),
        plot.subtitle = element_text(size = 12))


OMG O2G!


[This article was first published on R on OSM, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The oil-to-gas ratio was recently at its highest level since October 2013, as Middle East saber-rattling and a recovering global economy supported oil, while natural gas remained oversupplied despite entering the major draw season. Even though the ratio has eased in the last week, it remains over one standard deviation above its long-term average. Is now the time to buy chemical stocks leveraged to the ratio? Or is this just another head fake foisted upon unsuspecting generalists unaccustomed to the vagaries of energy volatility?

If you’re reading this thinking “what the…”, not to worry. This post will go a little bit off our normal beaten path. But it can give you a glimpse into the world of equity research. You see before we discovered data science, the power of R programming, and created this blog, we toiled away on Medieval spreadsheets, trying to make sense of ethylene and polyethylene, global cost curves, ethane’s cheapness relative to naphtha, and whether any of this mattered to earnings or share prices of the various publicly-traded chemical companies we followed. In short, we were equity research analysts, making recommendations on a slew of chemical stocks to the benefit or chagrin of companies and investors.

Even though we haven’t analyzed chemical stocks in a while, when we recently noticed that the oil-to-gas ratio (once, one of our favorite metrics to discuss) was nearing territory not seen since the inception of the “shale-gas revolution”, we began to grow nostalgic. Why not dust off the old playbook? But this time we’d be armed with R and could chug through data and statistical models faster than it takes to format charts and tables for regurgitated earnings reports.

Before we start, please note this post is not an investment recommendation! With that out of the way, we will be looking at the oil-to-gas ratio and its predictive power for chemical stock price returns. What’s the punchline? We find that the ratio’s impact on returns is significant, but its overall explanatory power is limited. We also find that if the ratio is above 30, there is encouraging evidence that returns over the next 30 days will be nicely positive too. But we need to test that particular model further. If you want more detail around our analyses, read on!

A useful rule-of-thumb?

Many analysts watch the oil-to-gas ratio, which is the price of a barrel of oil divided by the price of million BTUs of natural gas. The reason: it is thought to capture the profitability of the US chemical producer. In short, US producers consume natural gas (and its derivatives) to make a whole bunch of plastics, while most of the world consumes oil. That implies four things:

  1. Since most of the global supply of chemicals is produced from oil, the marginal cost to supply and hence the price of most chemicals is set by oil.

  2. Since most of the US chemical suppliers consume natural gas, how expensive natural gas is compared to oil is a significant determinant of profitability.

  3. When the oil-to-gas ratio is widening, US producers should enjoy improving profitability, all else equal.

  4. As profitability goes up, so should stock prices, as that means more cash-flow to equity holders.

All logical on first glance. We can make some arguments against each of these statements, but that is beyond the scope of the current post. If statements one and two are correct, three should be as well, at least provisionally. All of which leads us to ask whether statement four is correct. R programming to the rescue!

Typically, the companies most levered to the oil-to-gas ratio are those that are direct consumers of natural gas or its close derivatives. Historically, that has been Dow Chemical (DOW), Eastman Chemical (EMN), LyondellBasell (LYB), and Westlake Chemical (WLK). Now the actual exposure varies due to the range of products these companies sell. And while it would be too complicated to explain that range here, suffice it to say that the least exposed has probably been Eastman, while the most is probably Lyondell or Westlake.

Here’s our road map for analyzing the ratio’s predictive power. We’ll start off with some normal price charts, then drill down into some exploratory graphical analysis, and end with some regressions.

First, a chart of indexed stock prices for each of the companies along with the indexed oil-to-gas ratio. We do this to make comparisons a little easier. Note that this isn’t the cleanest of data series. Dow has gone through a bunch of corporate actions, as has Lyondell, resulting in missing data for the period of reference—2010-2019. We did our best to create a complete series. But it is imperfect. See our footnote for more detail.1

We indexed the stock price and oil-to-gas values to the beginning of 2010 to compare the changes across time on a normalized basis. But having everything on the same scale doesn’t always help one see the time series correlation with oil-to-gas. Below, we present the same charts with each y-axis scaled to the individual stock index.

That gives one a slightly different picture, but it’s hard to see a strong relationship. Let’s run some scatter plots to see if there’s a more recognizable relationship. In the following graphs, we plot the daily percentage change in the oil-to-gas ratio (on the x-axis) against the daily return in the respective stock (on the y-axis). We also include a 45° line to help identify a pure one-to-one relationship.

As we can see, the linear relationship isn’t that strong. But the scatter plots don’t show any odd clustering or massive outliers other than what we’d expect with share price data. What’s the correlation between the oil-to-gas ratio and stock returns? We show that in the table below.

Table 1: Oil-to-gas correlation with chemical stocks

Stock   Correlation (%)
DOW     20.0
EMN     20.3
LYB     21.0
WLK     24.7

While correlations of 20% may not be that high, they do show a positive linear relationship. Importantly, many variables, systematic and idiosyncratic, drive stock returns, so it would be surprising to see such a relatively esoteric ratio having an impact above 40-50%. Hence, on first glance, this appears enough to warrant a deeper investigation.

Regression time!

We’ll now regress the changes in the oil-to-gas ratio against the returns of the various stocks. We’ll first look at the size effect (the slope of the regression equation) on stock returns and then the explanatory power.
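Mechanically, each of these is just a univariate lm(); a minimal sketch follows (the full code is in the appendix at the end of the post; returns_df and its columns are illustrative stand-ins):

# Per-stock regression sketch: daily stock return on the daily % change in
# the oil-to-gas ratio. `returns_df` is an illustrative data frame of returns.
library(broom)
fit <- lm(lyb ~ oil_2_gas, data = returns_df)
tidy(fit)    # the oil_2_gas coefficient is the "size effect"
glance(fit)  # r.squared measures explanatory power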

So what does this mean? Since we’re regressing stock returns against changes in the oil-to-gas ratio, for every 1% change in the ratio, the chemical stocks move 6-13 basis points.2 That seems pretty modest. What’s more interesting is that the size effects relative to one another are close to what we would expect based on exposure to natural gas and product slate. Also, while we don’t show it, the size effects are all significant below the 5% level, implying a solid relationship between the ratio and returns.

How much does the variability in the oil-to-gas ratio explain the variability in stock returns? Not very much. As one can see from the chart below, even the highest R-squared is less than 5%.

One might wonder why you should pay attention to the oil-to-gas ratio at all given what appears to be a limited impact on stock returns. But, recall, we’ve been using daily data. There’s a lot of noise in daily returns. If we switch to monthly data, we might be able to tease out the signal. With R, a few tweaks to the code and we can re-run all the analysis. If we were trying to do this in a spreadsheet, we’d have started thinking about getting our dinner order in, because it was sure to be a long night!

Here’s the size effect based on monthly data.

That appears to be a significant improvement. For every one percent change in the oil-to-gas ratio, monthly returns change by 14-to-28 bps. What about the explanatory power? Check out the graph below.

Again, a noticeable improvement in explanatory power. The variability in the oil-to-gas ratio explains about 5-10% of the variability in monthly returns.

Where might we go from here? One avenue would be to build a machine learning model to see how well the oil-to-gas ratio might predict stock returns on out-of-sample data. We can split the data from 2010 to 2015, which includes just about a round trip in the oil-to-gas ratio, as we can see from the graph below. The dashed lines are the 2000-2019 average and standard deviation lines.

We’ll then test the model trained on the 2010-2015 data on the out-of-sample 2016-2019 data and compare the predicted returns to the actual returns. Here is the size effect graph based on the training data. Notice the greater effect on LYB and WLK for the training period vs. the previous total period.

And here is a graph of the R-squareds. Note how the stocks form pairs, which roughly match the higher correlations between the two—i.e, LYB and WLK are more highly correlated with each other, than with the other stocks.

Now, to get a visual sense of how the predicted values stack up to the actual, we present scatter plots of the two series with a regression line to show accuracy.

Not exactly the one-for-one correspondence one might hope for. But there appears to be a nice linear relationship, suggesting that the out-of-sample results aren’t atrocious. If we want a single numerical comparison, we can compute the root mean-squared error (RMSE), which tells us how much the predicted values deviate from the actual values.
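For reference, the computation itself is a one-liner (predicted and actual are placeholder vectors):

# Root mean-squared error: square the errors, average them, take the root.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))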

Interestingly, the out-of-sample RMSE’s are modestly better than the in-sample. This is unusual, though not unheard of. The main reason for the difference is that the period from 2016 to present had less volatility in the oil-to-gas ratio (and generally less in equities, excluding some late bursts in 2018), so there would be less error. Since the differences in RMSE are small, it suggests this is a good model in the sense that the training model has not over fit the data. But there may be a problem here since, as we mentioned above, the training period was “harder” than the testing period.

Our goal is to see how accurate the model is at predicting returns. To do that we can compare the RMSE to size effect, since they’re on the same scale. Recall, that a percent change in the oil-to-gas ratio resulted in about a 25-50bps change in monthly stock returns. So if the prediction is off by seven-to-eleven percentage points, then we’d have to conclude that this model isn’t the best in terms of prediction. Of course, we knew going in that the oil-to-gas ratio is only one component of stock returns.

Getting back to the original headline of the oil-to-gas ratio at multi-year highs, we need to ask whether that has any implications for returns. The fact that in the last two months the oil-to-gas ratio increased by 17% per month on average, while the stocks only moved 2-4% suggests the stocks aren’t performing the way our model would predict. Of course, the problem with the model is that there’s still a fair amount of unexplained variation that needs to be addressed. We could do that by adding additional risk factors like excess equity returns, valuation, size, etc.

Another alternative might be to look at various levels of the oil-to-gas ratio, rather than changes, and to see what impact that has on future returns. We are, after all, concerned with future, not concurrent, returns. A quick regression where we grouped the ratio by every ten points and regressed those categories against returns a month in advance, suggests that when the ratio is between 30 and 40, the stocks have typically seen a 1-7% return in the next month on average. We provide the size effect graph below. Still, we’d need to perform more testing on this model as well as on additional risk factors mentioned above. But both of those would require another post.
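The full version of that quick regression is in the appendix below; as a rough sketch of the idea (column names mirror that code and are otherwise illustrative), the ratio level is bucketed and next-month returns are regressed on the bucket:

# Rough sketch: bucket the oil-to-gas level, then regress returns ~22 trading
# days ahead (about one month) on the bucket.
library(dplyr)
library(broom)
df_cat <- df %>%
  mutate(o2g_bucket = cut(oil_2_gas, breaks = c(10, 20, 30, 40, 50)),
         fwd_return = lead(lyb, 22) / lyb - 1)
tidy(lm(fwd_return ~ o2g_bucket, data = df_cat))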

What’s the conclusion? Changes in the oil-to-gas ratio exhibit a significant relationship with chemical stock returns, but the impact is modest on a univariate basis. The impact increases when examining a monthly rather than daily time series. But we haven’t looked at longer periods. The ratio does not, however, have strong explanatory power, though it does improve with monthly data. Given the rise in the oil-to-gas ratio over the last two months, a simple linear regression model trained on data from 2010-2015 suggests that the magnitude of the stocks’ reactions was not as great as would have been anticipated. A rough cut model in which the oil-to-gas ratio was transformed into categorical variables also suggests that returns should be nicely positive if the ratio surpasses 30. But there is a fair amount of unexplained variance in the models, so including other risk factors may yield more robust results. That’s an avenue we might pursue in future posts if interest warrants it. Until then, all the code used to produce the previous analyses and charts is below. Let us know if you have any questions.

# Load packagelibrary(tidyquant)library(broom)library(knitr)library(kableExtra)library(Quandl)Quandl.api_key("Your key!")## Load data# Energyoil <- Quandl("CHRIS/CME_CL1", type = "xts", start_date = "2000-01-01")nat_gas <- Quandl("CHRIS/CME_NG1", type = "xts", start_date = "2000-01-01")energy <- merge(oil[,"Last"], nat_gas[,"Last"])names(energy) <- c("oil", "nat_gas")energy$oil_2_gas <- energy$oil/energy$nat_gas# Equitysymbols <- c("LYB", "DOW", "WLK", "EMN", "^GSPC")prices <- getSymbols(symbols,                      from = "2000-01-01",                     to = "2019-12-31",                     warning = FALSE) %>%   map(~Ad(get(.))) %>%   reduce(merge) %>%   `colnames<-`(symbols)# Dow specificdow <- Quandl("WIKI/DOW", type = "xts", start_date = "2000-01-01")dwdp <- Quandl("WIKI/DWDP", type = "xts", start_date = "2000-01-01")dd <- Quandl("WIKI/DD", type = "xts", start_date = "2000-01-01")dd_y <- getSymbols("DD", from = "2000-01-01", auto.assign = FALSE)dow_con <- rbind(dow$`Adj. Close`,                  dwdp$`Adj. Close`["2017-09-01/2018-03-27"],                  Ad(DD["2018-03-28/2019-03-19"]),                 Ad(DOW))dd_delt <- Ad(DD["2018-03-26/2019-03-20"])dd_delt <- dd_delt/lag(dd_delt)dow_int <- as.numeric(dwdp$`Adj. Close`["2018-03-27"])*  cumprod(as.numeric(dd_delt["2018-03-28/2019-03-19"]))dow_con["2018-03-28/2019-03-19"] <- dow_intprices <- merge(prices, dow_con)prices$DOW <- NULL## Create data framesxts_df <- merge(energy, prices)colnames(xts_df)[4:8] <- c(tolower(colnames(xts_df)[4:6]), "sp", "dow")xts_mon <- to.monthly(xts_df, indexAt = "lastof", OHLC = FALSE)df <- data.frame(date = index(xts_df), coredata(xts_df))df_mon <- data.frame(date = index(xts_mon), coredata(xts_mon))# Graph ratiodf %>%   ggplot(aes(date, oil_2_gas)) +  geom_line(color = "blue") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE) + sd(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE) - sd(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  labs(x = "",       y = "Ratio (x)",       title = "Oil-to-gas ratio")# Facet graphdf %>%   filter(date > "2010-05-01") %>%   select(-oil, -nat_gas, -sp) %>%  gather(key, value, -c(date, oil_2_gas)) %>%   group_by(key) %>%   mutate(value = value/first(value)*100,         oil_2_gas = oil_2_gas/first(oil_2_gas)*100) %>%  ggplot(aes(date)) +  geom_line(aes(y = value, color = key)) +  geom_line(aes(y = oil_2_gas, color = "Oil-to-Gas")) +  scale_color_manual("",labels = c("DOW", "EMN", "LYB", "Oil-to-gas", "WLK"),                     values = c("red", "orange", "green", "blue", "purple")) +  facet_wrap(~key, labeller = labeller(key = c("dow" = "DOW",                                               "emn" = "EMN",                                                "lyb" = "LYB",                                                "wlk" = "WLK"))) +  labs(x = "",       y = "Index",       title = "Oil-to-gas ratio vs. 
chemical stocks") +  theme(legend.position = "top")# Facet graphdf %>%   filter(date > "2010-05-01") %>%   select(-oil, -nat_gas, -sp) %>%  gather(key, value, -c(date, oil_2_gas)) %>%   group_by(key) %>%   mutate(value = value/first(value)*100,         oil_2_gas = oil_2_gas/first(oil_2_gas)*100) %>%  ggplot(aes(date)) +  geom_line(aes(y = value, color = key)) +  geom_line(aes(y = oil_2_gas, color = "Oil-to-Gas")) +  scale_color_manual("",labels = c("DOW", "EMN", "LYB", "Oil-to-gas", "WLK"),                     values = c("red", "orange", "green", "blue", "purple")) +  facet_wrap(~key,             scales = "free",             labeller = labeller(key = c("dow" = "DOW",                                               "emn" = "EMN",                                                "lyb" = "LYB",                                                "wlk" = "WLK"))) +  labs(x = "",       y = "Index",       title = "Oil-to-gas ratio vs. chemical stocks") +  theme(legend.position = "top")df %>%   select(-c(oil, nat_gas, sp, date)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x)-1) %>%   gather(key, value, -oil_2_gas) %>%  group_by(key) %>%   ggplot(aes(oil_2_gas*100, value*100, color = key)) +  geom_point() +  geom_abline(color = "blue") +  facet_wrap(~key,             labeller = labeller(key = c("dow" = "DOW",                                         "emn" = "EMN",                                          "lyb" = "LYB",                                          "wlk" = "WLK"))) +   labs(x = "Oil-to-gas (%)",       y = "Return (%)",       title = "Scatter plot: oil-to-gas vs returns") +  scale_color_manual("",labels = c("DOW", "EMN", "LYB", "WLK"),                     values = c("red", "orange", "green", "purple")) +    theme(legend.position = "top")# Correlation tabledf %>%   filter(date > "2010-01-01") %>%   select(-c(date, oil, nat_gas, sp)) %>%  mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x) - 1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%   gather(Stock, value, -oil_2_gas) %>%   group_by(Stock) %>%  summarise(`Correlation (%)` = round(cor(value, oil_2_gas, use = "pairwise.complete.obs"),3)*100) %>%   knitr::kable(caption = "Oil-to-gas correlation with chemical stocks")# Graph of change in oil-to-gas ratio size effectdf %>%   select(-c(oil, nat_gas, sp, date)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x)-1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%   gather(key, value, -oil_2_gas) %>%  group_by(key) %>%   do(tidy(lm(value ~ oil_2_gas,.))) %>%   filter(term != "(Intercept)") %>%   ggplot(aes(reorder(key, estimate), estimate*100)) +  geom_bar(stat = 'identity', fill = "blue") +  labs(x = "Stocks",       y = "Size effect (bps)",       title = "Oil-to-gas ratio size effect on chemical stock returns") +  geom_text(aes(label = round(estimate,3)*100), nudge_y = 0.5)# Graph of r-squaredsdf %>%   # filter(date <= "2015-01-01") %>%   select(-c(oil, nat_gas, sp, date)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x)-1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%  gather(key, value, -oil_2_gas) %>%  group_by(key) %>%   do(glance(lm(value ~ oil_2_gas,.))) %>%   ggplot(aes(reorder(key, r.squared), r.squared*100)) +  geom_bar(stat = 'identity', fill = "blue") +  geom_text(aes(label = round(r.squared,3)*100), nudge_y = 0.25 ) +  labs(x = "Stocks",       y = "R-squared (%)",       title = "Oil-to-gas ratio explanatory power on chemical stock 
returns")# Graph of change in oil-to-gas ratio size effectdf_mon %>%   select(-c(oil, nat_gas, sp, date)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x)-1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%   gather(key, value, -oil_2_gas) %>%  group_by(key) %>%   do(tidy(lm(value ~ oil_2_gas,.))) %>%   filter(term != "(Intercept)") %>%   ggplot(aes(reorder(key, estimate), estimate*100)) +  geom_bar(stat = 'identity', fill = "blue") +  labs(x = "Stocks",       y = "Size effect (bps)",       title = "Oil-to-gas ratio size effect on monthly chemical stock returns") +  geom_text(aes(label = round(estimate,3)*100), nudge_y = 1)# Graph of r-squaredsdf_mon %>%   # filter(date <= "2015-01-01") %>%   select(-c(oil, nat_gas, sp, date)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) x/lag(x)-1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%  gather(key, value, -oil_2_gas) %>%  group_by(key) %>%   do(glance(lm(value ~ oil_2_gas,.))) %>%   ggplot(aes(reorder(key, r.squared), r.squared*100)) +  geom_bar(stat = 'identity', fill = "blue") +  geom_text(aes(label = round(r.squared,3)*100), nudge_y = 0.5 ) +  labs(x = "Stocks",       y = "R-squared (%)",       title = "Oil-to-gas ratio explanatory power on monthly chemical stock returns")df %>%   filter(date >= "2010-01-01", date < "2016-01-01") %>%   ggplot(aes(date, oil_2_gas)) +  geom_line(color = "blue") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE) + sd(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  geom_hline(yintercept = mean(df$oil_2_gas, na.rm = TRUE) - sd(df$oil_2_gas, na.rm = TRUE),             linetype = "dashed") +  labs(x = "",       y = "Ratio (x)",       title = "Oil-to-gas ratio: 2010-2015")## Train & test splitdf_mon_train <- df_mon %>%   select(-c(oil, nat_gas, sp)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) (x/lag(x)-1)) %>%   filter(date >= "2010-05-01", date < "2016-01-01")df_mon_test <- df_mon %>%   select(-c(oil, nat_gas, sp)) %>%   mutate_at(vars(oil_2_gas:dow), function(x) (x/lag(x)-1)) %>%   filter(date >= "2016-01-01")# Graph size effecsdf_mon_train %>%   rename("DOW" = dow,       "EMN" = emn,       "LYB" = lyb,       "WLK" = wlk) %>%   gather(key, value, -c(oil_2_gas, date)) %>%  group_by(key) %>%   do(tidy(lm(value ~ oil_2_gas,.))) %>%   filter(term != "(Intercept)") %>%   ggplot(aes(reorder(key, estimate), estimate*100)) +  geom_bar(stat = 'identity', fill = "blue") +  labs(x = "Stocks",       y = "Size effect (bps)",       title = "Training model: Oil-to-gas ratio size effect on chemical stock monthly returns") +  geom_text(aes(label = round(estimate,3)*100), nudge_y = 1.5)# Graph R-squaredsdf_mon_train %>%   rename("DOW" = dow,          "EMN" = emn,          "LYB" = lyb,          "WLK" = wlk) %>%  gather(key, value, -c(oil_2_gas,date)) %>%  group_by(key) %>%   do(glance(lm(value ~ oil_2_gas,.))) %>%   ggplot(aes(reorder(key, r.squared), r.squared*100)) +  geom_bar(stat = 'identity', fill = "blue") +  geom_text(aes(label = round(r.squared,3)*100), nudge_y = 1 ) +  labs(x = "Stocks",       y = "R-squared (%)",       title = "Training model: oil-to-gas ratio explanatory power on chemical stock monthly returns")models <- list()for(i in 1:4){  formula <- as.formula(paste(colnames(df_mon_train)[i+2], "oil_2_gas", sep = "~"))  models[[i]] <- lm(formula, data = df_mon_train)}preds <- data.frame(lyb_pred = 
rep(0, nrow(df_mon_test)),                    wlk_pred = rep(0, nrow(df_mon_test)),                    emn_pred = rep(0, nrow(df_mon_test)),                    dow_pred = rep(0, nrow(df_mon_test)))for(i in 1:4){  preds[,i] <- predict(models[[i]], df_mon_test)}# scatter plot of predicted vs. actualdf_mon_test %>%   select(-date, -oil_2_gas) %>%   mutate(output = "actual",         obs = row_number()) %>%  bind_rows(preds %>%               mutate(output = "predicted",                     obs = row_number()) %>%              rename("lyb" = lyb_pred,                     "wlk" = wlk_pred,                     "emn" = emn_pred,                     "dow" = dow_pred)) %>%   gather(lyb:dow, key = series, value = value) %>%   spread(key = output, value = value) %>%   ggplot(aes(predicted, actual, color = series)) +  geom_point() +  geom_smooth(method = "lm", se = FALSE) +  facet_wrap(~ series,             scales = "free_y",             labeller = labeller(series = c("dow" = "DOW",                                         "emn" = "EMN",                                          "lyb" = "LYB",                                          "wlk" = "WLK"))) +  labs(x = "Predicted",       y = "Actual",       title = "Out of sample scatter plots predicted vs. actual") +  theme(legend.position = "")## Root mean squared error# Create predicted data frame on in-sample datapreds_mod <- data.frame(lyb_pred = rep(0, nrow(df_mon_train)),                    wlk_pred = rep(0, nrow(df_mon_train)),                    emn_pred = rep(0, nrow(df_mon_train)),                    dow_pred = rep(0, nrow(df_mon_train)))# For loop predictionfor(i in 1:4){  preds_mod[,i] <- predict(models[[i]], df_mon_train)}# Compute in-sample RMSErmse_train <- c()for(i in 1:4){  rmse_train[i] <- sqrt(mean((preds_mod[,i] - df_mon_train[,i+2])^2))}# Compute out-of-sample RMSErmse_test <- c()for(i in 1:4){  rmse_test[i] <- sqrt(mean((preds[,i] - df_mon_test[,i+2])^2))}# Create RMSE data framermse <- data.frame(stock = toupper(colnames(df_mon_test)[3:6]), rmse_train, rmse_test)# Graph RMSErmse %>%   gather(key, value, -stock) %>%   ggplot(aes(stock, value*100, fill = key)) +  geom_bar(stat = "identity", position = "dodge") +  scale_fill_manual("",                    labels = c("Test", "Train"),                    values = c("blue", "slateblue")) +  geom_text(aes(label = round(value,3)*100), position = position_dodge(width = 1), vjust = -0.25) +  theme(legend.position = "top") +  labs(x = "",        y = "RMSE (% pts)",        title = "Root mean-squared error train and test sets")# Graph of categorial modeldf %>%   select(-c(date, oil, nat_gas, sp)) %>%   mutate(oil_2_gas = cut(oil_2_gas, c(10, 20,30, 40, 50))) %>%   mutate_at(vars(lyb:dow), function(x) lead(x,22)/x-1) %>%  rename("DOW" = dow,         "EMN" = emn,         "LYB" = lyb,         "WLK" = wlk) %>%   gather(key, value, -oil_2_gas) %>%   group_by(key) %>%   do(tidy(lm(value ~ oil_2_gas,.))) %>%   filter(term == "oil_2_gas(30,40]") %>%   ggplot(aes(key, estimate*100)) +  geom_bar(stat = "identity", fill = "blue") +  labs(x = "Stocks",       y = "Size effect (%)",       title = "When the oil-to-gas ratio is between 30 & 40 next monthly return is ...") +  geom_text(aes(label = round(estimate,3)*100), nudge_y = 0.5)

  1. Data providers will have different numbers. Since this blog is meant to be reproducible, we used only publicly available sources. Our code will show what we did to create a uniform series for Dow. Not the prettiest code, however. LYB emerged from bankruptcy in 2010. Finding publicly available data on the original Lyondell (LYO) is tough, so we just use the post-bankruptcy period.↩

  2. A basis point is 1/100th of a percent.↩



No Framework, No Problem! Structuring your project folder and creating custom Shiny components


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Pedro Coutinho Silva is a software engineer at Appsilon Data Science.

It is not always possible to create a dashboard that fully meets your expectations or requirements using only existing libraries. Maybe you want a specific function that needs to be custom built, or maybe you want to add your own style or company branding. Whatever the case, a moment might come when you need to expand and organize your code base, and dive into creating a custom solution for your project; but where to start? In this post, I will explain the relevant parts of my workflow for Shiny projects using our hiring funnel application as an example.

Hiring Funnel Dashboard

I will cover:

  • Structuring the project folder: what goes where?
    • Managers and extracting values into settings files.
    • Using modules to organize your code.
  • What does it actually take to create a custom component?

Hopefully, these topics will be as valuable to you as they have been to me!

Structuring your project folder

Your typical dashboard project doesn’t have a very complex structure, but this can change a lot as your project grows, so I typically try to keep things as separate as possible, and to provide some guidance for future collaborators.

I won't go over styles since these are basically an implementation of Sass. Sass lets you keep your sanity while avoiding inline styling; you can read a bit more in my previous post about it.

So what does our project folder actually look like? Here is our hiring funnel project folder for example:

│ app.R
└─── app
    │ global.R
    │ server.R
    │ ui.R
    └─── managers
    │ constants_manager.R
    │ data_manager.R
    └─── settings
    │ app.json
    │ texts.json
    └─── modules
    │ header.R
    │ sidebar.R
    │ ui_components.R
    └─── styles
    │   │ main.scss
    │  ...
    └─── www
        │ sass.min.css
        └─── assets
        └─── scripts

Quite a lot to unpack, but let's go over the important bits:

Managers and Settings

Managers are scripts that make use of R6 classes. The constants manager has already been covered before by one of my colleagues in the super solutions series. The data manager contains all of the abstraction when dealing with data loading and processing.

The settings folder holds all of the values that should not be hard-coded. This includes constants, texts, and other values that can be extracted from the code.
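As an illustration of the manager idea (my own sketch, not the post's actual code), an R6-based data manager can hide where the data comes from behind a small interface:

# Illustrative R6 data manager: loads a CSV once and exposes simple accessors.
# The class name, file path, and methods are made up for this sketch.
library(R6)

DataManager <- R6Class(
  "DataManager",
  public = list(
    data = NULL,
    initialize = function(path) {
      self$data <- read.csv(path, stringsAsFactors = FALSE)
    },
    filter_by = function(column, value) {
      self$data[self$data[[column]] == value, , drop = FALSE]
    }
  )
)

# dm <- DataManager$new("data/candidates.csv")
# dm$filter_by("stage", "interview")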

Modules

Modules let you easily create files for managing parts of your code. This means you can have modules for specific elements or even layout sections without bloating your main files too much.

They are great when it comes to code structure. Let’s take our header, for example: instead of growing our ui.R with all of the header code, we can extract it to a separate file:

# Header.R
import("shiny")
import("modules")

export("ui")

ui_components <- use("modules/ui_components.R")

ui <- function(id) {
  tags$header(
    ...
  )
}

All it takes is importing the libraries you plan to use, and exporting the functions you would like to make available. You can even call other modules from inside a module!

After this we just instantiate the module in ui.R:

# ui.R
header <- use("modules/header.R")

And we can now use it by simply calling the function we want:

fluidPage(
  header$ui()
  ...

Custom components

When you cannot find a component that does what you want, sometimes the only option is to create it yourself. Since we are talking about HTML components, we can expect the average component to have three main parts:

  • Layout
  • Style
  • Behavior

We have already covered modules, but how do we deal with styling and behavior? Let's take our navigation as an example. What we are looking for behaves as a tab system, but with the navigation split from the content, so we need two different UI functions:

tabs_navigation <- function(id = "main-tabs-navigation", options) {
  tagList(
    tags$head(
      tags$script(src = "scripts/tab-navigation.js")
    ),
    tags$div(
      id = id,
      class = "tabs-navigation",
      `data-tab-type` = "navigation",
      lapply(options, tabs_single_navigation)
    )
  )
}

tabs_panel <- function(
  id = "main_tab_set",
  class = "tabs-container",
  tabs_content) {
  div(
    id = "main-tabs-container",
    class = "tabs-container",
    `data-tab-type` = "tabs-container",
    lapply(tabs_content, single_tab),
    tags$script("init_tab_navigation()")
  )
}

By giving the different elements IDs and classes, we can use Sass to easily style these components. And by including a JavaScript file in the element, we can load and initialize browser behavior. In this case our tab-navigation.js just initializes the first tab and binds a click event to cycle through the different tabs.

init_tab_navigation = function() {
  $( document ).ready(function() {
    $("[data-tab-type='navigation']")
      .find("[data-tab-type='controller']")
      .first().addClass("active")
    $(`[data-tab-type="tabs-container"]`)
      .find(`[data-tab-type="tab"]`)
      .first().addClass("active")
    $("[data-tab-type='controller']").on("click", function(e){
      $(this)
        .addClass("active")
        .siblings(`[data-tab-type="controller"]`)
        .removeClass("active")
      let target = $(this).data("target")
      $(`[data-tab-id="${target}"]`).addClass("active")
        .siblings(`[data-tab-type="tab"]`)
        .removeClass("active")
    })
  });
}

It takes a bit of effort, but the result is something truly custom.

Hiring Funnel Dashboard

We barely scratched the surface of what can be done when it comes to custom solutions, but I hope it already gives you an idea of how to start or improve your next project!

Craving more, or have any questions? Feel free to reach out and ask!


To leave a comment for the author, please follow the link and comment on their blog: R Views.


How to import Python classes into R


[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)



Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes.

To install reticulate, we can run:

install.packages("reticulate")

Creating a Python class

Let’s create a simple class in Python.

import pandas as pd

# define a python class
class explore:
  def __init__(self, df):
    self.df = df

  def metrics(self):
    desc = self.df.describe()
    return desc

  def dummify_vars(self):
    for field in self.df.columns:
        if isinstance(self.df[field][0], str):
          temp = pd.get_dummies(self.df[field])
          self.df = pd.concat([self.df, temp], axis = 1)
          self.df.drop(columns = [field], inplace = True)

Porting the Python class to R

There are a couple of ways we can port our Python class to R. One way is by sourcing a Python script that defines the class. Let's suppose our class is defined in a script called "sample_class.py". We can then use the reticulate function source_python.

# load reticulate package
library(reticulate)

# inside R, source the Python script
source_python("sample_class.py")

Running the command above will not only make the class we defined available to us in our R session, but will also make any other variables or functions defined in the script available as well (if those exist). Thus, if we define 10 classes in the script, all 10 classes will be available in the R session. We can refer to any specific method defined in a class using R's "$" notation, rather than the "dot" notation of Python.


result <- explore(iris)

# get summary stats of data
result$metrics()

# create dummy variables from factor / character fields
# (just one in this case)
result$dummify_vars()

One other note is that when you import a class from Python, the class becomes a closure in R. You can see this by running R’s typeof function:

typeof(explore) # closure

Using R markdown to switch between R and Python

Another way of using a Python class in R is by using R Markdown. This feature is available in RStudio v. 1.2+ and it allows us to write chunks of R code followed by chunks of Python code, and vice versa. It also lets us pass variables, or even classes, from Python to R. Below, we write the same code as above, but this time using chunks in an R Markdown file. When we switch between R and Python chunks in R Markdown, we can reference Python objects (including classes) by typing py$name_of_object, where you just need to replace name_of_object with whatever you're trying to reference from the Python code. In the below case, we reference the explore class we created by typing py$explore.

```{r}
library(reticulate)
```

```{python}
import pandas as pd

# define a python class
class explore:
  def __init__(self, df):
    self.df = df

  def metrics(self):
    desc = self.df.describe()
    return desc

  def dummify_vars(self):
    for field in self.df.columns:
        if isinstance(self.df[field][0], str):
          temp = pd.get_dummies(self.df[field])
          self.df = pd.concat([self.df, temp], axis = 1)
          self.df.drop(columns = [field], inplace = True)
```

```{r}
py$explore
```

Example with class and instance variables

Now, let’s look at another example. Below, we create a Python class in a file called “sample_class2.py” that has an instance variable (value) and a class variable (num).

class test:
  def __init__(self, value):
    self.value = value

  def class_var(self):
    test.num = 10
source_python("sample_class2.py")check = test(5)check$valuecheck$num # error because class_var hasn't been called yetcheck$class_var()check$num # works nowtest$num


That’s it for now! If you enjoyed this post, please follow my blog on Twitter.

To learn more about reticulate, check out its official documentation here.

The post How to import Python classes into R appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.



Heterogeneous treatment effects and homogeneous outcome variances


[This article was first published on R Psychologist - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Recently there have been a couple of meta-analyses investigating heterogeneous treatment effects by analyzing the ratio of the outcome variances in the treatment and control groups. The argument made in these articles is that if individuals differ in their response, then the observed variances in the treatment and control groups in RCTs should differ. For instance, Winkelbeiner et al. (2019) write:

The SDs of the pretreatment and posttreatment outcome difference scores in the treatment and control groups consist of the same variance components, including the within-patient variation. The treatment group, however, may also include the additional treatment-by-patient interaction, which could indicate the presence of individual response differences. Thus, in the case of a variable treatment effect, an increase of the variance in the treatment group, compared with the control group, should be observable.

Although I agree with much of what's written in these articles, I'm still not completely sold on the core argument. If we filter the argument through my (perhaps misinformed) understanding of the ideas, we can write the following model for the individual-level treatment effect (ITE):

$$ \begin{align} \text{ITE}_i &= \Delta_i = Y_i(1) - Y_i(0),\\ Y_i(0) &= \beta_0 + e_i, \\ Y_i(1) &= Y_i(0) + \Delta_i, \end{align} $$

where \(Y_i(1)\) is the outcome after treatment and \(Y_i(0)\) after placebo. For a single patient we can only observe one of these potential outcomes. We can see that the only difference between an individual’s treatment-free outcome and their outcome after treatment is the causal effect of the treatment. We have two variance components: 1) the treatment effects (\(\sigma_{\Delta}^2\)), and 2) all other sources of variance (\(\sigma_{e}^2\)). Let’s assume they are bivariate normal,

$$ \begin{pmatrix} e_i \\ \Delta_i \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} , \begin{pmatrix} \sigma^2_{e} & \sigma_{e}\sigma_{\Delta}\rho \\ \sigma_{e}\sigma_{\Delta}\rho & \sigma^2_{\Delta} \end{pmatrix} \right). $$

Hence, the treatment-free outcome \(Y_i(0)\) is potentially correlated with the individual-level treatment effect. Thus, in an RCT the observed variance in each arm would be:

$$ \begin{align} \text{Var}(Y \mid \text{Tx = 0}) &= \sigma^2_e, \\ \text{Var}(Y \mid \text{Tx = 1}) &= \sigma^2_e + \sigma^2_{\Delta} + 2\sigma_e \sigma_{\Delta}\rho \end{align} $$

It should be clear that all variance in the treatment effect is represented by \(\sigma^2_{\Delta}\),

$$ \begin{align} \text{Var}(\Delta_i) &= \text{Var}[Y_i(1)- Y_i(0)] \\ &= \sigma^2_{\Delta}. \end{align} $$

Now let’s investigate under what conditions we can have equal variances and treatment effect heterogeneity, i.e., when is

$$ \begin{align} \frac{ \text{Var}(Y \mid \text{Tx = 0}) } { \text{Var}(Y \mid \text{Tx = 1}) } = 1 \quad \text{and} \quad \sigma^2_{\Delta} > 0. \end{align} $$

We can rewrite the expression above as,

$$ \begin{align} \frac{ \sigma^2_e } { \sigma^2_e + \sigma^2_{\Delta} + 2\sigma_e \sigma_{\Delta}\rho } = 1 \quad \text{and} \quad \sigma^2_{\Delta} > 0. \end{align} $$

We can see that this ratio will only equal 1 when the covariance exactly cancels the treatment effect variance, i.e., when

$$ \begin{align} \sigma^2_{\Delta} + 2\sigma_e \sigma_{\Delta}\rho &= 0 \quad \text{and} \quad \sigma^2_{\Delta} > 0 \\ -\frac{1}{2}\frac{\sigma_{\Delta}}{\sigma_e} &= \rho. \end{align} $$

Under this model it seems quite unlikely that heterogeneous effects are present when the outcome variances are equal in magnitude, as the two could only coexist given one highly specific correlation; for all other values of \(\rho\) the outcome variances will differ. This result agrees with what Cortés et al. (2019) conclude in their supplement. However, why must we assume that the treatment effect variance is an entirely separate variance component? Let us see what happens if we change that assumption.
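As a quick numerical check of this special case (my own sketch, not part of the original post), we can simulate the model above with \(\sigma_e = 1\), \(\sigma_{\Delta} = 0.5\) and \(\rho = -\frac{1}{2}\sigma_{\Delta}/\sigma_e = -0.25\); the two arms then show (essentially) equal variances even though \(\text{Var}(\Delta_i) > 0\):

library(MASS)
set.seed(1)

sigma_e <- 1; sigma_d <- 0.5
rho <- -0.5 * sigma_d / sigma_e   # the "knife-edge" correlation derived above

S <- matrix(c(sigma_e^2,               sigma_e * sigma_d * rho,
              sigma_e * sigma_d * rho, sigma_d^2), nrow = 2)
x <- mvrnorm(1e5, mu = c(0, 0), Sigma = S)

y0 <- 10 + x[, 1]   # potential outcome under control
y1 <- y0 + x[, 2]   # potential outcome under treatment

c(var_control = var(y0), var_treatment = var(y1), var_ITE = var(x[, 2]))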

What if the treatment changes one of the variance components?

Let's assume that there's a causal mechanism that causes some or all of the variance in symptoms. This variable, M, is now a source of variance in both the treatment and the control group. However, if the treatment impacts the outcome by fixing a dysfunction in this causal mechanism (i.e., the treatment effect is mediated), then this can be a source of treatment effect heterogeneity without having to introduce a new variance component in the treatment group. Let's write down this model. First we have an outcome,

$$ Y_i = Y_i[Tx_i, M(Tx_i)] $$

where \(Tx_i\) is an individual's assigned treatment, and \(M_i(Tx_i)\) the value of the mediator realized under the assignment \(Tx_i\). The potential outcomes under treatment or control are then,

$$ \begin{align} Y_i(0) &= \beta_0 + \beta_M M_i(0) + e_i \\ Y_i(1) &= \beta_0 + \beta_M M_i(1) + \beta_{\Delta} + e_i. \end{align} $$

and we assume that the potential outcome values of the mediator are correlated,

$$ \begin{pmatrix} M_i(0) \\ M_i(1) \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} M_0 \\ M_1 \end{pmatrix} , \begin{pmatrix} \sigma^2_{M(0)} & \sigma_{M(0)}\sigma_{M(1)}\rho \\ \sigma_{M(0)}\sigma_{M(1)}\rho & \sigma^2_{M(1)} \end{pmatrix} \right). $$

Now the outcome variances in each group can be written as,

$$ \begin{align} \text{Var}(Y \mid \text{Tx = 0}) &= (\sigma_{M(0)}\beta_M)^2 + \sigma^2_e \\ \text{Var}(Y \mid \text{Tx = 1}) &= (\sigma_{M(1)}\beta_M)^2 + \sigma^2_e \end{align} $$

Hence, the variances are equal in the treatment and control group if the variance of the mediator is the same in each group (and the effect of the mediator on the outcome is the same). Let's assume that's the case, so that we have \(\sigma_{M(0)} = \sigma_{M(1)} = \sigma_{M}\). Then the variance of the individual-level effects is,

$$ \begin{align} \text{Var}(\Delta_i) &= \text{Var}[Y_i(1) - Y_i(0)] \\ &= \beta_M^2 \sigma_{M}^2 + \beta_M^2 \sigma_{M}^2 - 2\beta_M^2 \text{Cov}[M_i(0), M_i(1)] \\ &= 2\beta_M^2 \sigma_M^2 - 2\beta_M^2 \sigma_{M}^2\rho. \end{align} $$

We can see that \(\text{Var}(\Delta_i)\) can only be 0 if \(\rho = 1\). Thus in this example, the results are reversed, and homogeneous effects are now only possible if individuals’ potential outcomes on the mediator are perfectly (positively) correlated. The point here is not to claim that heterogeneous effects are likely and that “precision psychiatry” is the way forward. My point is simply that I’m not sure how much we can learn from looking at the ratio of outcome variances.

A Numerical Example

Here is a numerical example of equal outcome variances with varying degrees of heterogeneous individual-level treatment effects. Although a simulation is not needed here, I know some people prefer it over equations.

library(dplyr)
library(ggplot2)
library(MASS)

theme_set(theme_minimal() + theme(legend.position = "bottom"))
scale_colour_discrete <- function(..., values) scale_color_manual(..., values = c("black", "#0984e3"))

#' Simulate potential outcomes with mediator
#'
#' @param n number of subjects
#' @param b_M0 Effect of mediator on outcome in control
#' @param b_M1 Effect of mediator on outcome in treatment
#' @param b_TX Direct effect of treatment
#' @param M0_M Mediator mean in control
#' @param M1_M Mediator mean in treatment
#' @param M0_SD Mediator SD in control
#' @param M1_SD Mediator SD in treatment
#' @param M01_cor Correlation between counterfactual mediator values
#' @param sigma "Error" variance
#' @param ...
#'
#' @return a data.frame
#' @export
sim_data <- function(n, b_M0, b_M1, b_TX, M0_M = 1, M1_M = 2.5,
                     M0_SD = 1, M1_SD = 1, M01_cor = 0, sigma = 1, ...) {
  Sigma <- matrix(c(M0_SD^2, M0_SD * M1_SD * M01_cor,
                    M0_SD * M1_SD * M01_cor, M1_SD^2),
                  nrow = 2)
  Mm <- mvrnorm(n = n, mu = c(M0_M, M1_M), Sigma = Sigma, empirical = TRUE)
  tibble::tibble(
    # treatment assignment
    TX = rbinom(n, 1, 0.5),
    # Mediator in control
    M0 = Mm[, 1],
    # Mediator under treatment
    M1 = Mm[, 2],
    # Y(0, M(0)), outcome in control when mediator at control levels
    Y0_M0 = 6 + b_M0 * M0,
    # Y(0, M(1)), outcome in control when mediator at TX levels
    Y0_M1 = 6 + b_M1 * M1,
    # Y(1, M(0)), outcome in TX when mediator at control levels
    Y1_M0 = Y0_M0 + b_TX,
    # Y(1, M(1)), outcome in TX when mediator at TX levels
    Y1_M1 = Y0_M1 + b_TX,
    # Obs. Mediator
    M = (TX == 0) * M0 + (TX == 1) * M1,
    # Obs. Outcome
    y = (TX == 0) * Y0_M0 + (TX == 1) * Y1_M1 + rnorm(n, sd = sigma)
  )
}

I base these values on Plöderl and Hengartner (2019). The SD in each group is 8, with an average treatment effect of -2 points on the Hamilton Depression Rating Scale (HDRS). I arbitrarily assume that 25% of the outcome variance is caused by The Causal Mechanism and that 50% of the total treatment effect is mediated.

# Sim data
r <- 0.5
b_M <- 0.5
M0_SD <- sqrt(8^2 * 1/4) / b_M
M1_SD <- M0_SD
sigma <- sqrt(8^2 * 3/4)

set.seed(5050505)
d <- sim_data(
  n = 5e5,
  b_TX = -1,
  b_M0 = b_M,
  b_M1 = b_M,
  M0_M = 10,
  M1_M = 8,
  M0_SD = M0_SD,
  M1_SD = M1_SD,
  sigma = sigma,
  M01_cor = r,
  data_gen = sim_data
)

d %>%
  group_by(TX) %>%
  summarize(
    mean(y),
    sd(y),
    cor(Y0_M0, Y1_M1),
    cor(M0, M1),
    mean(Y1_M1 - Y0_M0),
    sd(Y1_M1 - Y0_M0)
  ) %>%
  kable(digits = 2)
TX | mean(y) | sd(y) | cor(Y0_M0, Y1_M1) | cor(M0, M1) | mean(Y1_M1 - Y0_M0) | sd(Y1_M1 - Y0_M0)
 0 |   10.99 |  7.99 |               0.5 |         0.5 |                  -2 |              3.99
 1 |    9.01 |  8.00 |               0.5 |         0.5 |                  -2 |              4.01

Let us plot \(SD(\Delta_i)\) as a function of the correlation of potential mediator outcomes.

get_SD_ITE_df <- function(r, b_M, M0_SD, M1_SD) {
  sds <- sqrt(b_M^2 * M0_SD^2 + b_M^2 * M1_SD^2 - 2 * b_M^2 * M0_SD^2 * r)
  data.frame(cor = r, sd_ITE = sds)
}

tmp <- get_SD_ITE_df(r = seq(-1, 1, length.out = 100), b_M = b_M, M0_SD = M0_SD, M1_SD)

ggplot(tmp, aes(cor, sd_ITE)) +
  geom_line(color = "#0984e3", size = 1) +
  labs(y = "SD", x = "Cor[M(0), M(1)]", title = "Variation in Individual-Level Effects")


We can also plot the proportion of “responders” (improve by more than 5 points on HDRS) as a function of the correlation.

get_responder_df <- function(r, cutoff, ATE, b_M, M0_SD, M1_SD) {
  sds <- sqrt(b_M^2 * M0_SD^2 + b_M^2 * M1_SD^2 - 2 * b_M^2 * M0_SD^2 * r)
  data.frame(cor = r, prop_responder = pnorm(cutoff, ATE, sds))
}

tmp <- get_responder_df(r = seq(-1, 1, length.out = 100), cutoff = -5, ATE = -2,
                        b_M = b_M, M0_SD = M0_SD, M1_SD)

ggplot(tmp, aes(cor, prop_responder)) +
  geom_line(color = "#0984e3", size = 1) +
  labs(y = "Proportion", x = "Cor[M(0), M(1)]",
       title = "Proportion of Participants with Treatment Effects Greater Than 5")


We can also take a sample of participants and plot their potential outcomes.

library(tidyr)

d_sub <- sample_n(d, 100)

tmp <- dplyr::select(d_sub, Y0_M0, Y1_M1, TX) %>%
  gather(PO, val, -TX) %>%
  mutate(
    obs = case_when(
      PO == "Y0_M0" & TX == "0" ~ "Yes",
      PO == "Y0_M0" & TX == "1" ~ "No",
      PO == "Y1_M1" & TX == "0" ~ "No",
      PO == "Y1_M1" & TX == "1" ~ "Yes",
    ),
    obs = factor(obs),
    TX = factor(TX)
  )

tmp_tx <- filter(d_sub, TX == "1") %>% mutate(TX = factor(TX))

ggplot(tmp, aes(PO, val)) +
  geom_point(aes(color = TX, group = TX, shape = obs),
             position = position_dodge(width = 0.1)) +
  geom_segment(data = tmp_tx,
               aes(color = "1", y = Y1_M1, yend = Y0_M0,
                   x = 2 + 0.025, xend = 1 + 0.025, alpha = NULL),
               alpha = 0.25, position = position_dodge(width = 0.1)) +
  labs(y = "Y", x = "Potential Outcome", shape = "Observed",
       color = "Group", title = "Potential Outcomes")


References

  • Cortés, J., González, J. A., Medina, M. N., Vogler, M., Vilaró, M., Elmore, M., … Cobo, E. (2019). Does evidence support the high expectations placed in precision medicine? A bibliographic review. F1000Research, 7, 30. https://doi.org/10.12688/f1000research.13490.5
  • Munkholm, K. (n.d.). Individual response to antidepressants for depression in adults – a simulation study and meta-analysis. 10.
  • Plöderl, M., & Hengartner, M. P. (2019). What are the chances for personalised treatment with antidepressants? Detection of patient-by-treatment interaction with a variance ratio meta-analysis. BMJ Open, 9(12). https://doi.org/10.1136/bmjopen-2019-034816
  • Winkelbeiner, S., Leucht, S., Kane, J. M., & Homan, P. (2019). Evaluation of Differences in Individual Treatment Response in Schizophrenia Spectrum Disorders: A Meta-analysis. JAMA Psychiatry, 76(10), 1063–1073. https://doi.org/10.1001/jamapsychiatry.2019.1530



To leave a comment for the author, please follow the link and comment on their blog: R Psychologist - R.


Le Monde puzzle [#1120]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A board game serves as the weekly Le Monde current mathematical puzzle:

11 players in a circle and 365 tokens first owned by a single player. Players with at least two tokens can either remove one token and give another one left or move two right and one left. How quickly does the game stall, how many tokens are left, and where are they?

A run of an R simulation like

od = function(i) (i - 1) %% 11 + 1

muv <- function(bob){
  if (max(bob) > 1){
    i = sample(rep((1:11)[bob > 1], 2), 1)
    dud = c(0, -2, 1)
    if ((runif(1) < .5) & (bob[i] > 2)) dud = c(2, -3, 1)
    bob[c(od(i + 10), i, od(i + 1))] = bob[c(od(i + 10), i, od(i + 1))] + dud
  }
  bob
}

always provides a solution

> bob
 [1] 1 0 1 1 0 1 1 0 1 0 0

with six ones at these locations. However the time it takes to reach this frozen configuration varies, depending on the sequence of random choices.
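For completeness, here is a minimal driver sketch (my own addition, not part of the original post) that reuses the muv() function above, starts with all 365 tokens held by the first of the 11 players, and iterates until no player holds two or more tokens, returning the frozen configuration together with the number of moves it took:

run_game <- function(tokens = 365) {
  bob <- c(tokens, rep(0, 10))   # 11 players, all tokens start with player 1
  moves <- 0
  while (max(bob) > 1) {         # the game stalls once every player holds 0 or 1 token
    bob <- muv(bob)
    moves <- moves + 1
  }
  list(final = bob, tokens.left = sum(bob), moves = moves)
}
run_game()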

 


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


Biomedical Data Science Textbook Available


[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

By Bob Hoyt & Bob Muenchen

Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.

For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.

The book is available in print and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be possible through traditional publishers.

Below you will find a detailed table of contents and a list of the textbook authors.

Table of Contents​

​OVERVIEW OF BIOMEDICAL DATA SCIENCE

  1. Introduction
  2. Background and history
  3. Conflicting perspectives
    1. the statistician’s perspective
    2. the machine learner’s perspective
    3. the database administrator’s perspective
    4. the data visualizer’s perspective
  4. Data analytical processes
    1. raw data
    2. data pre-processing
    3. exploratory data analysis (EDA)
    4. predictive modeling approaches
    5. types of models
    6. types of software
  5. Major types of analytics
    1. descriptive analytics
    2. diagnostic analytics
    3. predictive analytics (modeling)
    4. prescriptive analytics
    5. putting it all together
  6. Biomedical data science tools
  7. Biomedical data science education
  8. Biomedical data science careers
  9. Importance of soft skills in data science
  10. Biomedical data science resources
  11. Biomedical data science challenges
  12. Future trends
  13. Conclusion
  14. References

​​SPREADSHEET TOOLS AND TIPS

  1. Introduction
    1. basic spreadsheet functions
    2. download the sample spreadsheet
  2. Navigating the worksheet
  3. Clinical application of spreadsheets
    1. formulas and functions
    2. filter
    3. sorting data
    4. freezing panes
    5. conditional formatting
    6. pivot tables
    7. visualization
    8. data analysis
  4. Tips and tricks
    1. Microsoft Excel shortcuts – windows users
    2. Google sheets tips and tricks
  5. Conclusions
  6. Exercises
  7. References

​​BIOSTATISTICS PRIMER

  1. Introduction
  2. Measures of central tendency & dispersion
    1. the normal and log-normal distributions
  3. Descriptive and inferential statistics
  4. Categorical data analysis
  5. Diagnostic tests
  6. Bayes’ theorem
  7. Types of research studies
    1. observational studies
    2. interventional studies
    3. meta-analysis
    4. correlation
  8. Linear regression
  9. Comparing two groups
    1. the independent-samples t-test
    2. the wilcoxon-mann-whitney test
  10. Comparing more than two groups
  11. Other types of tests
    1. generalized tests
    2. exact or permutation tests
    3. bootstrap or resampling tests
  12. Stats packages and online calculators
    1. commercial packages
    2. non-commercial or open source packages
    3. online calculators
  13. Challenges
  14. Future trends
  15. Conclusion
  16. Exercises
  17. References

​​DATA VISUALIZATION

  1. Introduction
    1. historical data visualizations
    2. visualization frameworks
  2. Visualization basics
  3. Data visualization software
    1. Microsoft Excel
    2. Google sheets
    3. Tableau
    4. R programming language
    5. other visualization programs
  4. Visualization options
    1. visualizing categorical data
    2. visualizing continuous data
  5. Dashboards
  6. Geographic maps
  7. Challenges
  8. Conclusion
  9. Exercises
  10. References

​​INTRODUCTION TO DATABASES

  1. Introduction
  2. Definitions
  3. A brief history of database models
    1. hierarchical model
    2. network model
    3. relational model
  4. Relational database structure
  5. Clinical data warehouses (CDWs)
  6. Structured query language (SQL)
  7. Learning SQL
  8. Conclusion
  9. Exercises
  10. References

BIG DATA

  1. Introduction
  2. The seven v’s of big data related to health care data
  3. Technical background
  4. Application
  5. Challenges
    1. technical
    2. organizational
    3. legal
    4. translational
  6. Future trends
  7. Conclusion
  8. References

​​BIOINFORMATICS and PRECISION MEDICINE

  1. Introduction
  2. History
  3. Definitions
  4. Biological data analysis – from data to discovery
  5. Biological data types
    1. genomics
    2. transcriptomics
    3. proteomics
    4. bioinformatics data in public repositories
    5. biomedical cancer data portals
  6. Tools for analyzing bioinformatics data
    1. command line tools
    2. web-based tools
  7. Genomic data analysis
  8. Genomic data analysis workflow
    1. variant calling pipeline for whole exome sequencing data
    2. quality check
    3. alignment
    4. variant calling
    5. variant filtering and annotation
    6. downstream analysis
    7. reporting and visualization
  9. Precision medicine – from big data to patient care
  10. Examples of precision medicine
  11. Challenges
  12. Future trends
  13. Useful resources
  14. Conclusion
  15. Exercises
  16. References

​​PROGRAMMING LANGUAGES FOR DATA ANALYSIS

  1. Introduction
  2. History
  3. R language
    1. installing R & rstudio
    2. an example R program
    3. getting help in R
    4. user interfaces for R
    5. R’s default user interface: rgui
    6. Rstudio
    7. menu & dialog guis
    8. some popular R guis
    9. R graphical user interface comparison
    10. R resources
  4. Python language
    1. installing Python
    2. an example Python program
    3. getting help in Python
    4. user interfaces for Python
  5. reproducibility
  6. R vs. Python
  7. Future trends
  8. Conclusion
  9. Exercises
  10. References

​​MACHINE LEARNING

  1. Brief history
  2. Introduction
    1. data refresher
    2. training vs test data
    3. bias and variance
    4. supervised and unsupervised learning
  3. Common machine learning algorithms
  4. Supervised learning
  5. Unsupervised learning
    1. dimensionality reduction
    2. reinforcement learning
    3. semi-supervised learning
  6. Evaluation of predictive analytical performance
    1. classification model evaluation
    2. regression model evaluation
  7. Machine learning software
    1. Weka
    2. Orange
    3. Rapidminer studio
    4. KNIME
    5. Google TensorFlow
    6. honorable mention
    7. summary
  8. Programming languages and machine learning
  9. Machine learning challenges
  10. Machine learning examples
    1. example 1 classification
    2. example 2 regression
    3. example 3 clustering
    4. example 4 association rules
  11. Conclusion
  12. Exercises
  13. References

​​ARTIFICIAL INTELLIGENCE

  1. Introduction
    1. definitions
  2. History
  3. Ai architectures
  4. Deep learning
  5. Image analysis (computer vision)
    1. Radiology
    2. Ophthalmology
    3. Dermatology
    4. Pathology
    5. Cardiology
    6. Neurology
    7. Wearable devices
    8. Image libraries and packages
  6. Natural language processing
    1. NLP libraries and packages
    2. Text mining and medicine
    3. Speech recognition
  7. Electronic health record data and AI
  8. Genomic analysis
  9. AI platforms
    1. deep learning platforms and programs
  10. Artificial intelligence challenges
    1. General
    2. Data issues
    3. Technical
    4. Socio economic and legal
    5. Regulatory
    6. Adverse unintended consequences
    7. Need for more ML and AI education
  11. Future trends
  12. Conclusion
  13. Exercises
  14. References

Authors

Brenda Griffith Technical Writer Data.World Austin, TX

Robert Hoyt MD, FACP, ABPM-CI, FAMIA Associate Clinical Professor Department of Internal Medicine Virginia Commonwealth University Richmond, VA

David Hurwitz MD, FACP, ABPM-CI Associate CMIO Allscripts Healthcare Solutions Chicago, IL

Madhurima Kaushal MS Bioinformatics Washington University at St. Louis, School of Medicine St. Louis, MO

Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA Assistant Professor New York Medical College Department of Emergency Medicine Valhalla, NY

Karen A. Monsen PhD, RN, FAMIA, FAAN Professor School of Nursing University of Minnesota Minneapolis, MN

Robert Muenchen MS, PSTAT Manager, Research Computing Support University of Tennessee Knoxville, TN

Dallas Snider PhD Chair, Department of Information Technology University of West Florida Pensacola, FL

​A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.


To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.


eRum2020: call for submissions open!


[This article was first published on MilanoR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The new year has started and we at MilanoR are deep into the frantic organization of the third European R User Meeting (eRum). eRum will take place next May (27-30) in Milan; it is a nonprofit event and completely community-driven. An impressive array of experts, enthusiasts and R professionals will animate the conference. We expect more than 500 R users and a rich showcase of talks in six different tracks. MilanoR is the organizing partner, in collaboration with Università di Milano Bicocca and Politecnico di Milano.

Submit your work to eRum!

The call for contributed sessions is now officially open. You can now submit abstracts for the following types of contributed sessions (the deadline for submissions is January 29, 2020):

  • Workshops
  • Regular Talks
  • Lightning Talks
  • Shiny Demos
  • Posters

If you are an R developer or enthusiast, this is a great way to showcase your work and find peers and opportunities. The whole MilanoR team and its collaborators are dedicating plenty of time to building a great event. Don't miss this chance and submit your work! You can easily do it by clicking this link.

Become a Sponsor

eRum2020 is also a great sponsorship opportunity. Besides our current sponsors, there is still room for other companies and associations working with R in the data world that would like to sponsor the conference. By becoming a sponsor, you will gain high visibility and access to an audience of R professionals from all over Europe. You could also be part of the conference with a talk or workshop, and you will certainly be able to find and recruit skilled and motivated talent. Download the sponsorship packages brochure or write to us (sponsors@erum.io).

Submit your work at eRum, 27-30 May 2020 in Milan

Book your agenda, submit your R work and see you at eRum2020! (Do not miss our social activity on Twitter, Facebook or LinkedIn.)

The post eRum2020: call for submissions open! appeared first on MilanoR.


To leave a comment for the author, please follow the link and comment on their blog: MilanoR.


Business Case Analysis with R (Guest Post)


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Learning Machines proudly presents a fascinating guest post by decision and risk analyst Robert D. Brown III with a great application of R in the business and especially startup-arena! I encourage you to visit his blog too: Thales’ Press. Have fun!


Introduction

While Excel remains the tool of choice among analysts for business case analysis, presumably because of its low barrier to entry and ease of use, it nonetheless continues to be the source of many vexing and disastrous errors. As the research from Panko and others has revealed, these errors arise in part for the following reasons:

  • Analysts persistently lack QA standards and practices to ensure that spreadsheets are as error-free as possible.
  • Cell referencing and replicating formulas across relevant dimensions propagate errors at exponential rates.
  • Effective auditing gets obscured because the conceptual meaning of formulas is difficult to interpret when they are constructed from grid coordinates.

To suggest a means to remedy that situation, in the spring of 2018 I published Business Case Analysis with R: Simulation Tutorials to Support Complex Business Decisions (BCAwR). I also wanted to showcase how the popular programming language R could be employed to evaluate business case opportunities that are fraught with uncertainty and not supported by an abundance of well-structured, pedigreed, and relevant data, a situation that plagues just about all business decisions faced in the real world. BCAwR takes the reader through the process of analyzing a rather complex commercial investment problem of bringing a chemical processing plant on line. For my post here, I will present a simplified model to give a preview of the contents of the book. I will also provide a little surprise at the end that I hope will contribute to bridging the efforts of the decision and data science communities.

But before we delve into any coding or quantitative activity related to the analysis of any business case, we ought to thoroughly frame our problem by narrowing the scope of our inquiry to the appropriate level for the context of the problem at hand, identifying the goals and objectives we want to achieve when we eventually commit to action, partitioning the set of actionable decisions we can make to achieve our goals, and creating an inventory of the influences and relationships that connect decisions and uncertainties to the measures of success. Although we will look at a much more simplified business problem here, a little framing will still serve us well.

Imagine that we are considering investing in a project that solves a particular market problem with the ultimate desire to generate a financial return to our organization. We identify three possible strategic pathways to compare for the investment, each with a different level of capital commitment required to address certain nuances of the market problem. Each (1) investment, in turn, will generate (2) an initial post investment net cash flow, (3) a peak cash flow, and varying durations of (4) time to reach the peak from the initial cash flow.

To complicate the matter, for each strategy, we do not know the precise quantitative value of the four parameters that characterize the potential cash flow. For this reason, when we evaluate the three strategies, we will employ a Monte Carlo simulation approach that samples a large number of potential realizations for each parameter from probability distributions. With these ideas in mind, we can now set up the coding to handle the simulation problem.

library(leonRdo)   # https://www.incitedecisiontech.com/packages/packages.html
library(tidyverse) # (on CRAN)

# Set the simulation seed.
set.seed(42)

# The number of simulation runs per uncertain variable.
trials <- 1000

# The time frame across which our cash flow simulation will run.
# The initial investment will occur in year 0.
year <- 0:15

# The weighted average cost of capital without any additional risk premium.
discount.rate <- 0.1

# The distinctive capital allocation strategic decisions.
strategy <- c("Strategy 1", "Strategy 2", "Strategy 3")

R employs simple Monte Carlo techniques, which don't typically yield stable means for result distributions until many thousands of trials are used. However, when developing a model like this, I suggest setting trials to 50-100 to make sure that the logic is running correctly in short, responsive time frames. Then, when you're ready to produce real results, change the value to greater than 10,000. Given that Monte Carlo in R is noisy even for reasonably large samples (~1000) and for different seed settings, I recently developed the leonRdo package (R Packages: leonRdo & inteRest) to provide a median Latin hypercube approach to sampling that produces much more stable means with approximately 1/10th the number of trials, regardless of the seed value used.
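To see what that stability claim means in practice, here is a small illustrative check (my own sketch, not from the book) that reuses the rnorm_mlhs() sampler from the leonRdo package loaded above: the sample mean wanders far less across repeated 100-trial draws than with plain rnorm().

# Spread of the sample mean across 200 repeated draws of 100 trials each.
mc.means   <- replicate(200, mean(rnorm(100, mean = 0, sd = 1)))
mlhs.means <- replicate(200, mean(rnorm_mlhs(n = 100, mean = 0, sd = 1)))

sd(mc.means)    # roughly 0.1 for plain Monte Carlo
sd(mlhs.means)  # noticeably smaller with median Latin hypercube sampling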

In this particular showcase discussion, I have included the tidyverse package (on CRAN) to help with data manipulation and graphing. In the original BCAwR, I based all the code on base R so that readers new to R and simulation could focus more on learning the key principles over the idiosyncratic syntax of secondary packages.

The set.seed() function ensures that we will produce the same sample trial set on each run of our model. This will help us debug problems as we develop the code. Eventually, as others interact with our code, they will be able to observe the same set of results we work with. Later, we can reset the seed to other values to make sure that our inferences on a given set of trials remain consistent on a different trial set.

Model Functions

Next, we need to declare some important functions that we will call in our model. The first is a function that provides an analytic solution to the differential equation that relates the proportional absorption of an entity into a fixed environment. This ramp up function takes as parameters the amount of time Tp required to go from, say, Y = 1% absorption to Y = 99% across Time. We will use this function to model the growth of our cash flow over the year index.

# Ramp up function.
calcRampUp <- function(Y0, Time, Tp) {
  # Y0 = initial absorption
  # Time = time index
  # Tp = time to Y = 1 - Y0, the time to peak.
  Y <- 1 / (1 + (Y0 / (1 - Y0)) ^ (2 * (Time - 1) / (Tp - 1) - 1))
  return(Y)
}
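A quick usage check (my own illustration, not from the book) shows the S-shaped ramp-up the function produces: absorption rises from about 1% toward 99% as Time approaches Tp.

# Absorption over 15 periods with a time to peak of 8.
round(calcRampUp(Y0 = 0.01, Time = 1:15, Tp = 8), 3)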

The second function we need is the actual business model we want to represent. The business model logic will remain fixed across strategies, and the results will vary only on the basis of the strategy conditional parameters we supply. By “functionalizing” our model, we can iterate the entire model over variable ranges to understand the sensitivity of the output function to those variables, as we shall later see. Please note, this is a toy model we are using for illustration purposes only. BCAwR demonstrates a much more complex consideration for the implementation of a chemical processing plant.

# Business model cash flow.
calcCashFlow <- function(I, SCF, PCF, YP, Y) {
  # I = initial investment
  # SCF = starting cash flow
  # PCF = peak cash flow
  # YP = year of peak cash flow
  # Y = year index
  cf <- (Y == 0) * (-I) +
    (Y > 0) * (SCF + (PCF - SCF) * calcRampUp(0.01, Y, YP))
  return(cf)
}

The last function we will employ is the net present value (NPV) function that we will apply to the individual trial results of the cash flow simulation across strategies.

# The net present value of the cash flow function.
calcNPV <- function(CF, Y, DR) {
  # CF = cash flow vector
  # Y = year index
  # DR = discount rate
  npv <- sum(CF / (1 + DR) ^ Y)
  return(npv)
}
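A small sanity check of this function (my own illustration, not from the book): an investment of 100 in year 0 followed by ten years of 20 per year, discounted at 10%, should give an NPV of roughly 22.9.

calcNPV(CF = c(-100, rep(20, 10)), Y = 0:10, DR = 0.1)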

Initialize Simulation Trial Samples

Our next task is to create trial samples for the business model cash flow function parameters for the three project strategy investments. For this particular example, I have chosen to use canonical distributions as if their parameters were based on empirical or historical data and to utilize a simple method for generating trials. However, typically I would use a combination of both empirical data and subject matter expert guidance reflected as cumulative probabilities across the range of the assessed variables’ potential values. I explain how to take this latter approach in greater detail in Section III of BCAwR.

investment1 <- rlnorm_mlhs(n = trials,
                           meanlog = log(800),
                           sdlog = log(1 + 200 / 800))
investment2 <- rlnorm_mlhs(n = trials,
                           meanlog = log(700),
                           sdlog = log(1 + 50 / 700))
investment3 <- rlnorm_mlhs(n = trials,
                           meanlog = log(1000),
                           sdlog = log(1 + 150 / 1000))

start.cash.flow1 <- rnorm_mlhs(n = trials, mean = -100, sd = 20)
start.cash.flow2 <- rnorm_mlhs(n = trials, mean = -90, sd = 5)
start.cash.flow3 <- rnorm_mlhs(n = trials, mean = -120, sd = 15)

peak.cash.flow1 <- rnorm_mlhs(n = trials, mean = 300, sd = 20)
peak.cash.flow2 <- rnorm_mlhs(n = trials, mean = 280, sd = 5)
peak.cash.flow3 <- rnorm_mlhs(n = trials, mean = 375, sd = 15)

yr.peak1 <- rnorm_mlhs(n = trials, mean = 8.5, sd = 0.75)
yr.peak2 <- rnorm_mlhs(n = trials, mean = 10, sd = 0.85)
yr.peak3 <- rnorm_mlhs(n = trials, mean = 11, sd = 1)

# Store the business model parameter samples in a list of data frames.
proj.data <- list(
  investment = data.frame(investment1, investment2, investment3),
  start.cash.flow = data.frame(start.cash.flow1, start.cash.flow2, start.cash.flow3),
  peak.cash.flow = data.frame(peak.cash.flow1, peak.cash.flow2, peak.cash.flow3),
  yr.peak = data.frame(yr.peak1, yr.peak2, yr.peak3)
)

Model Results

For each strategy, we apply the samples from each parameter to the business model cash flow function. This will result in a list of cash flows for the three project strategies. For each strategy, there will be as many cash flows as defined by trials, and each cash flow trial will be as long as the year index. We can use the lapply() and sapply() functions to avoid using for loops [see also Learning R: A gentle introduction to higher-order functions].

proj.cf <- lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcCashFlow(
      I = proj.data$investment[[s]][t],
      SCF = proj.data$start.cash.flow[[s]][t],
      PCF = proj.data$peak.cash.flow[[s]][t],
      YP = proj.data$yr.peak[[s]][t],
      Y = year
    )
  })
})
names(proj.cf) <- strategy

By running the first five trials of Strategy 1, we can see what the cash flows look like.

head(round(proj.cf[[1]][, 1:5], digits = 1), n = length(year))

##         [,1]   [,2]   [,3]   [,4]   [,5]
##  [1,] -852.7 -852.2 -766.7 -974.3 -912.6
##  [2,]  -85.8 -103.1 -117.2 -116.1  -88.1
##  [3,]  -77.8  -93.6 -107.3 -106.9  -79.9
##  [4,]  -55.3  -64.3  -76.7  -79.9  -57.1
##  [5,]    0.0    9.4    0.2  -13.1   -1.6
##  [6,]   98.0  128.7  123.7  102.2   97.9
##  [7,]  201.5  230.2  227.2  215.6  206.8
##  [8,]  265.3  279.2  276.2  279.8  276.9
##  [9,]  292.5  296.5  293.3  305.3  308.0
## [10,]  302.3  301.8  298.5  314.1  319.5
## [11,]  305.6  303.4  300.1  316.9  323.5
## [12,]  306.7  303.9  300.5  317.8  324.8
## [13,]  307.1  304.1  300.7  318.1  325.2
## [14,]  307.2  304.1  300.7  318.2  325.4
## [15,]  307.2  304.1  300.7  318.2  325.4
## [16,]  307.2  304.1  300.7  318.2  325.5

We can calculate some summary results for the cash flows for each strategy and plot them. First, we might like to plot the mean cash flow over time.

proj.cf.mean <- as.data.frame(lapply(strategy, function(s) {
  rowMeans(proj.cf[[s]])
}))
names(proj.cf.mean) <- strategy
proj.cf.mean <- cbind(year, proj.cf.mean)

# Plot the mean strategy cash flows.
gg.mean.cf <- ggplot(gather(proj.cf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Mean Strategy Cash Flow",
       y = "[$M]")
plot(gg.mean.cf)
Figure 1: The mean cash flows over time show tradeoffs in initial outlays, times to peak, level of peak, and the timing of breakeven.

Then we can calculate the cumulative mean cash flow and plot that for each strategy, too.

proj.ccf.mean <- proj.cf.mean %>%
  mutate_at(vars(strategy), list(~ cumsum(.)))

# Plot the cumulative mean strategy cash flows.
gg.mean.ccf <- ggplot(gather(proj.ccf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Cumulative Mean Strategy Cash Flow",
       y = "[$M]")
plot(gg.mean.ccf)
Figure 2: The cumulative mean cash flows over time show tradeoffs in drawdown, nominal payback timing, and terminal cash recovery level.

Now we can observe the risk profile of the cash flows by calculating the trial NPVs of the cash flows.

proj.npv <- as.data.frame(lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcNPV(CF = proj.cf[[s]][, t],
            Y = year,
            DR = discount.rate)
  })
}))
names(proj.npv) <- strategy

# Plot the CDF of the strategies' sample NPVs.
gg.ecdf <- ggplot(gather(proj.npv, "strategy", "NPV"), aes(x = NPV, color = strategy)) +
  stat_ecdf(geom = "point") +
  labs(title = "Strategy NPV Risk Profile",
       x = "Strategy NPV [$M]",
       y = "Cumulative Probability")
plot(gg.ecdf)
Figure 3: The cumulative probability chart shows tradeoffs in stategy dominance, range of value, and relative value volatility.

And we can calculate the mean NPV of the cash flows.

proj.npv.mean <- round(colMeans(proj.npv), digits = 1)
print(proj.npv.mean)

## Strategy 1 Strategy 2 Strategy 3 
##      157.4       59.3     -123.6

The first observation we make about the risk profiles of the strategies is that we can dismiss Strategy 3 immediately: it presents a negative mean NPV, the probability that it produces a negative economic value is ~75% (i.e., P(NPV <= 0) = 0.75), and practically none of its trials present any opportunity for dominance over any other strategy.

When we observe the relative volatility and dominance of the remaining strategies, we realize that we face a bit of ambiguity about how to choose the best pathway forward. Strategy 1 exhibits the best overall mean NPV, but it does so with the greatest relative volatility. While Strategy 2 exhibits approximately the same risk of failure (~25%) as Strategy 1 (~27%), it also exhibits the least maximum exposure and relative volatility. To reduce the ambiguity of choosing, we might like to know which uncertainty, due to the overall quality of the information we possess about it, contributes most to switching dominance from Strategy 1 over Strategy 2. Knowing which uncertainty our strategy values are most sensitive to gives us the ability to stop worrying about the other uncertainties for the purpose of choosing clearly and focus only on improving our understanding of those that matter most.

To accomplish that latter feat, we run our functionalized model to test the sensitivity of the strategy means to the 80th percentile range in each of the variables. We start by initializing the lists to hold the sensitivity responses.

data.temp <- proj.data
proj.npv.sens <- list(p10 = proj.data,
                      p90 = proj.data)

d <- data.frame(0, 0, 0)
dd <- list(
  investment = d,
  start.cash.flow = d,
  peak.cash.flow = d,
  yr.peak = d
)
proj.npv.sens.mean <- list(p10 = dd,
                           p90 = dd)

We calculate the sensitivity of the strategies’ mean NPVs by fixing each variable to its p10 and p90 quantile values sequentially while all the other variables run according to their defined variation. The result is a chart that shows how sensitive we might be to changing our decision to pursue the best strategy over the next best strategy on the outcome of a given uncertainty.

p1090 <- c(0.1, 0.9)

for (q in 1:length(p1090)) {
  for (v in 1:length(proj.data)) {
    for (s in 1:length(strategy)) {
      data.temp[[v]][s] <-
        rep(quantile(unlist(proj.data[[v]][s]), probs = p1090[q]), trials)
      for (t in 1:trials) {
        proj.npv.sens[[q]][[v]][t, s] <- calcNPV(
          CF = calcCashFlow(
            I = data.temp$investment[[s]][t],
            SCF = data.temp$start.cash.flow[[s]][t],
            PCF = data.temp$peak.cash.flow[[s]][t],
            YP = data.temp$yr.peak[[s]][t],
            Y = year
          ),
          Y = year,
          DR = discount.rate
        )
      }
      proj.npv.sens.mean[[q]][[v]][s] <-
        mean(proj.npv.sens[[q]][[v]][, s])
      data.temp <- proj.data
    }
  }
}

# Recast the sensitivity values to a form that can be plotted.
variable.names <-
  c("investment", "start.cash.flow", "peak.cash.flow", "yr.peak")

proj.npv.sens.mean2 <- as.data.frame(sapply(1:length(p1090), function(q) {
  sapply(1:length(proj.data), function(v) {
    sapply(1:length(strategy), function(s) {
      proj.npv.sens.mean[[q]][[v]][[s]]
    })
  })
})) %>%
  rename(p10 = V1,
         p90 = V2) %>%
  mutate(
    variable = rep(variable.names, each = length(strategy)),
    strategy = rep(strategy, times = length(proj.data)),
    strategy.mean = rep(t(proj.npv.mean), times = length(proj.data))
  ) %>%
  select(variable, strategy, strategy.mean, p10, p90)

gg.sens <- ggplot(proj.npv.sens.mean2,
                  aes(x = variable,
                      y = p10,
                      yend = p90,
                      color = strategy)) +
  geom_segment(aes(x = variable,
                   xend = variable,
                   y = p10,
                   yend = p90),
               color = "gray50",
               size = 0.75) +
  geom_point(aes(x = variable, y = strategy.mean, color = strategy),
             size = 1,
             color = "black",
             fill = "black",
             shape = 23) +
  geom_point(aes(x = variable, y = p90), size = 3) +
  geom_point(aes(x = variable, y = p10), size = 3) +
  labs(title = "Sensitivity of Strategy Mean NPV to Uncertain Variable Ranges",
       y = "Strategy NPV [$M]") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(~ strategy)
plot(gg.sens)
Figure 4: The sensitivity chart shows the range of variation in a strategy mean due to an 80th percentile variation in a given uncertainty. The mean value of each strategy is marked by the black diamond within the range of each variable. According to our results here, we should focus additional research attention on the investment range first and then the year to peak.

This sensitivity chart is valuable not only to the decision analyst, but to the data analyst as well, for this one simple reason: now that we know which uncertainties potentially expose us to the greatest risk or reward, we can calculate the value of information on those uncertainties. This value of information represents the rational maximum budget we should allocate to acquire better data and insight on those uncertainties. While a typical business case analysis of sufficient complexity may contain 10-100 uncertainties, we should focus our research budget only on those that are critical to making a decision. This simple "stopping function" helps us ensure that our data science resources are allocated properly and most efficiently. Section IV of BCAwR provides a means to calculate the value of information on continuous variables like the ones we have utilized in our example here.
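As a rough, back-of-the-envelope illustration (my own sketch, not the book's method for continuous variables), an upper bound on any such research budget is the expected value of perfect information (EVPI) computed directly from the simulated NPV trials: the average gain from picking the best strategy trial by trial versus committing up front to the strategy with the best mean.

# EVPI upper bound from the simulated trials in proj.npv (one column per strategy).
evpi <- mean(apply(proj.npv, 1, max)) - max(colMeans(proj.npv))
evpi  # the most we should rationally spend on perfect information about all uncertainties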

Conclusion

Business decision analysts can definitely use R to conduct many of their business case analyses in a more transparent and less error-prone environment than their current tools typically allow, especially through the use of clearly named variables (rather than grid coordinates) and functionalized structural models. However, the real power that R provides is the means to simulate information when high-quality, large-scale data may be scarce, infeasible, or impossible to obtain. Then, by using such concepts as value of information, data scientists can be called upon to refine the quality of information on those critical uncertainties that may frustrate success when commitment to action turns into real decisions. I hope the simple example presented here inspires you to learn more about it.


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

