
Y is for Ys, Y-hats, and Residuals


(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


When working with a prediction model, like a linear regression, there are a few Ys you need to concern yourself with: the ys (the observed values of the outcome variable), the y-hats (the values predicted by the regression equation), and the residuals (y minus y-hat). Today, I'll dig into the different flavors of y and how you might work with them when conducting linear regression.

As my data example, I’ll use my dissertation dataset and the linear model I introduced in the GLM post.

dissertation <- read.delim("dissertation_data.txt", header = TRUE)
guilt_lm_full <- lm(guilt ~ illev + circumst + deftest + policepower +
                      suspicious + overzealous + upstanding,
                    data = dissertation)
options(scipen = 999)
summary(guilt_lm_full)
## 
## Call:
## lm(formula = guilt ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, data = dissertation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0357 -0.7452  0.1828  0.9706  2.5013 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  4.16081    0.38966  10.678 < 0.0000000000000002 ***
## illev        0.11111    0.05816   1.911              0.05689 .  
## circumst    -0.08779    0.06708  -1.309              0.19147    
## deftest     -0.02020    0.05834  -0.346              0.72942    
## policepower  0.02828    0.06058   0.467              0.64090    
## suspicious   0.17286    0.06072   2.847              0.00468 ** 
## overzealous -0.03298    0.04792  -0.688              0.49176    
## upstanding   0.08941    0.05374   1.664              0.09706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.169 on 347 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07647, Adjusted R-squared:  0.05784 
## F-statistic: 4.105 on 7 and 347 DF,  p-value: 0.0002387

In this model, the outcome variable is a guilt rating, ranging from 1 to 7. This is our y, which our regression model is trying to recreate through the linear relationship between our x’s and our y. Using the coefficients in the output, we could compute the predicted y (y-hat) – what a person’s score would be if the linear model fit the data perfectly. Fortunately, R has a built-in function that will compute y-hat for a dataset: predict. This function requires two arguments: a regression model and the dataset to use to predict values. Let’s have R predict values for the dissertation dataset, and add it on as a new variable.

dissertation$predicted<-predict(guilt_lm_full, dissertation)

In this application, we don’t care as much about the predicted values – we will later on in this post – but we probably do care about the residuals: the difference between the observed value and the predicted value. This gives us an idea of how our model is doing and whether it fits reasonably well. It can also tell us if the model falls apart at certain values or ranges of values.

In the residuals post, I showed that you can easily request residuals from the model. As we did with predicted, let’s create a new variable in the dataset that contains our residuals.

dissertation$residual<-resid(guilt_lm_full)
## Error in `$<-.data.frame`(`*tmp*`, residual, value = structure(c(0.0326393185592984, : replacement has 355 rows, data has 356 

Ruh-roh, we got an error. Our dataset contains 356 observations, but we only have 355 residuals. This is because someone has a missing value on one of the variables in the regression model and was dropped from the analysis. There are a variety of ways we could find out which case is missing a value, but since I’m only working with a handful of variables, I’ll just run descriptives and look for the variable with only 355 values.

library(psych)
## Warning: package 'psych' was built under R version 3.4.4 
describe(dissertation[c(13,15,18,21,27,29,31,44)])
##             vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## illev          1 356 2.98 1.13      3    3.02 1.48   1   5     4 -0.17    -1.04 0.06
## circumst       2 356 2.99 0.95      3    2.97 1.48   1   5     4  0.13    -0.51 0.05
## deftest        3 356 3.39 1.46      4    3.57 0.00  -9   5    14 -5.25    40.74 0.08
## policepower    4 355 3.86 1.41      4    4.02 0.00  -9   5    14 -6.40    55.05 0.08
## suspicious     5 356 2.09 1.14      2    2.01 1.48  -9   5    14 -1.97    23.52 0.06
## overzealous    6 356 3.34 1.34      4    3.41 1.48  -9   5    14 -4.49    38.44 0.07
## upstanding     7 356 3.09 1.29      3    3.11 1.48  -9   5    14 -2.31    19.66 0.07
## guilt          8 356 4.80 1.21      5    4.90 1.48   2   7     5 -0.59    -0.54 0.06
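A more direct check, if all you want is the number of missing values per variable, would be something like this (using the same variables that went into the model):

model_vars <- c("illev", "circumst", "deftest", "policepower",
                "suspicious", "overzealous", "upstanding", "guilt")
colSums(is.na(dissertation[model_vars]))          # count of NAs per variable
which(!complete.cases(dissertation[model_vars]))  # row number of the incomplete case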

The variable policepower is the culprit. I can drop that missing value then rerun the residual code.

dissertation <- subset(dissertation, !is.na(policepower))
dissertation$residual <- resid(guilt_lm_full)

Now I can plot my observed values and residuals.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
qplot(guilt,residual,data=dissertation)

We want our residuals to fall around 0, which is only happening for guilt ratings near the midpoint of the scale. This suggests that ordinary least squares regression may not be the best fit for the data, as the model shows a tendency to overpredict (negative residual) guilt rating for people with lower observed ratings and underpredict (positive residual) for people with higher observed ratings.

But, as I often do on this blog for the sake of demonstration, let’s pretend the model is doing well. One way we could use a regression model is to predict scores in a new sample. For instance, there are multiple rumors that different graduate schools have prediction equations they use to predict a candidate’s anticipated graduate school GPA, based on a combination of factors asked about in the application packet, to determine if a person is grad school-ready (and ultimately, to decide if they should be admitted). Schools generally won’t confirm they do this, nor would they ever share the prediction equation, should such an equation exist. But this is one way regression results from a training sample could be used to make decisions on a testing sample. So let’s do that.

Unfortunately, I don't have a second dissertation dataset lying around that I could apply this equation to, but I could take a note from the data science playbook and randomly divide my sample into training and testing datasets. I use the training dataset to generate my equation, and I use the testing dataset to apply my equation and predict values. Since I have outcome variable data in the testing dataset too, I can see how well my model did. Once I have a well-performing model, I could then apply it to new data, maybe to predict how highly you'll rate a book or movie, or to generate recommendations, or even to determine if I should let you into the super-elite Monstersori school I want to create.

First, I’ll split my dataset in half.

smp_size <- floor(0.50 * nrow(dissertation))
set.seed(42)
train_ind <- sample(seq_len(nrow(dissertation)), size = smp_size)
train <- dissertation[train_ind, ]
test <- dissertation[-train_ind, ]

Now I have a train dataset, with 177 observations, and a test dataset with 178. I can rerun the same linear model as before, this time only using the training data.

guilt_lm_train <- lm(guilt ~ illev + circumst + deftest + policepower +
                       suspicious + overzealous + upstanding,
                     data = train)
summary(guilt_lm_train)
## 
## Call:
## lm(formula = guilt ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9420 -0.8359  0.1641  0.9371  2.3151 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  5.28874    0.77150   6.855 0.000000000128 ***
## illev        0.08866    0.08485   1.045        0.29759    
## circumst    -0.13018    0.09917  -1.313        0.19109    
## deftest     -0.25726    0.10699  -2.405        0.01727 *  
## policepower  0.01758    0.12316   0.143        0.88665    
## suspicious   0.25716    0.08857   2.903        0.00419 ** 
## overzealous -0.11683    0.08240  -1.418        0.15807    
## upstanding   0.10371    0.07574   1.369        0.17273    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.194 on 169 degrees of freedom
## Multiple R-squared:  0.1265, Adjusted R-squared:  0.09027 
## F-statistic: 3.495 on 7 and 169 DF,  p-value: 0.001586

I can use my predict function to predict scores for the testing dataset. Remember, all this function needs is the linear model name and a dataset to use for the prediction function – and it can be any dataset, as long as it contains the same variables from the model.

test$predicted2<-predict(guilt_lm_train, test)

The original predicted value (from when I was working with the full dataset) is still in this set. I could have replaced values by using the same variable name, but just for fun, decided to keep those values and create a second prediction variable.

Because we have both the observed guilt ratings and predicted2 for our testing dataset, let's see how well our model did by creating a new residual variable, residual2. We can't use the resid function, because we didn't have R perform a linear regression on the testing dataset, but we can easily generate this variable by subtracting the predicted score from the observed score. Then we can once again plot our observed and residual values.

test$residual2 <- test$guilt - test$predicted2
qplot(guilt, residual2, data = test)

We're still seeing similar issues with the residuals as we did for the full dataset. If we wanted to actually apply our linear model, we'd want to do more research and pilot work to get the best equation we can. As with many things in statistics, the process is heavily determined by what you plan to do with the results. If we want to report variables that have a strong linear relationship with the outcome, we might be fine with these regression results. If we want to build an equation that predicts an outcome with a minimum of error, we'd want to do more exploratory work, pulling in new variables and dropping low-performing ones. Many of the predictors in the model were not significant, so we could start model-building from those results. We may need to build multiple training datasets, to ensure we aren't only picking up chance relationships. And for much larger applications, such as the recommendation systems on services like Amazon and Netflix, machine learning may be a better, more powerful method.
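If minimizing prediction error were the goal, a quick way to summarize how far off the model is on the testing data would be something like the sketch below, using the residual2 variable we just created.

# Summarize prediction error in the testing dataset
rmse <- sqrt(mean(test$residual2^2, na.rm = TRUE))  # root mean square error
mae <- mean(abs(test$residual2), na.rm = TRUE)      # mean absolute error
rmse
mae

Comparing these against the residual standard error from the training model gives a rough sense of how much performance drops on data the model has never seen.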

One more A to Z post left!




Deep Learning from first principles in Python, R and Octave – Part 7


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Artificial Intelligence is the new electricity. – Prof Andrew Ng

Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI.  – Yann LeCun, March 14, 2016 (Facebook)

Introduction

In this post, 'Deep Learning from first principles with Python, R and Octave – Part 7', I implement the optimization methods used in Stochastic Gradient Descent (SGD) to speed up convergence. Specifically, I discuss and implement the following gradient descent optimization techniques:

a. Vanilla Stochastic Gradient Descent
b. Learning rate decay
c. Momentum method
d. RMSProp
e. Adaptive Moment Estimation (Adam)

This post further enhances my generic L-Layer Deep Learning Network implementations in vectorized Python, R and Octave to also include these Stochastic Gradient Descent optimization techniques. You can clone/download the code from GitHub at DeepLearning-Part7.

Incidentally, a good discussion of the various optimization methods used with Stochastic Gradient Descent can be found at Sebastian Ruder's blog.

Note: In the vectorized Python, R and Octave implementations below, only 1024 random training samples were used, to reduce the computation time. You are free to use the entire data set (60,000 training samples) for the computation.

This post is largely based on Prof Andrew Ng's Deep Learning Specialization. All the above optimization techniques for Stochastic Gradient Descent build on the exponentially weighted average method. For example, if we had some time series data \theta_{1}, \theta_{2}, \theta_{3}, ..., \theta_{t}, then we can represent the exponentially weighted average at time 't' in terms of the previous value v_{t-1} and \theta_{t} as shown below:

v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}

Here v_{t} represents an average over roughly the last \frac{1}{1-\beta} data points; by choosing different values of \beta, we can average over a larger or smaller number of points. Writing out the recursion:

v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}
v_{t-1} = \beta v_{t-2} + (1-\beta)\theta_{t-1}
v_{t-2} = \beta v_{t-3} + (1-\beta)\theta_{t-2}
...
v_{t-k} = \beta v_{t-(k+1)} + (1-\beta)\theta_{t-k}

By substitution we have

v_{t} = (1-\beta)\theta_{t} + \beta v_{t-1}
v_{t} = (1-\beta)\theta_{t} + \beta(1-\beta)\theta_{t-1} + \beta^{2} v_{t-2}
v_{t} = (1-\beta)\theta_{t} + \beta(1-\beta)\theta_{t-1} + \beta^{2}(1-\beta)\theta_{t-2} + \beta^{3} v_{t-3}

Hence v_{t} is a weighted sum of the previous values \theta_{k}, with weights that decay exponentially.
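To see what this average does, here is a small stand-alone R sketch (a toy illustration, separate from the Python/R/Octave implementations below) that computes the exponentially weighted average of a noisy series for a given \beta:

ewa <- function(theta, beta = 0.9) {
  # exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * theta_t
  v <- numeric(length(theta))
  v_prev <- 0
  for (t in seq_along(theta)) {
    v_prev <- beta * v_prev + (1 - beta) * theta[t]
    v[t] <- v_prev
  }
  v
}

set.seed(1)
theta <- cumsum(rnorm(100))   # a noisy series
v <- ewa(theta, beta = 0.9)   # averages over roughly 1/(1 - 0.9) = 10 points
head(round(v, 3))

With \beta = 0.9 the average effectively looks back over about 10 points; with \beta = 0.99, about 100.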

By the way, also take a look at my compact, minimal book “Practical Machine Learning with R and Python- Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

1.1a. Stochastic Gradient Descent (Vanilla) – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())

# Read the training data
training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))
lbls = []
pxls = []
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T

# Create a list of 1024 random indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1, 2**10))

# Set the layer dimensions
layersDimensions = [784, 15, 9, 10]
# Perform SGD with regular gradient descent
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                                   outputActivationFunc="softmax", learningRate=0.01,
                                   optimizer="gd",
                                   mini_batch_size=512, num_epochs=1000, print_cost=True,
                                   figure="fig1.png")

1.1b. Stochastic Gradient Descent (Vanilla) – R

source("mnist.R")source("DLfunctions7.R")#Load and read MNIST dataload_mnist() x <- t(train$x)X <- x[,1:60000]y <-train$yy1 <- y[1:60000]y2 <- as.matrix(y1)Y=t(y2)# Subset 1024 random samples from MNIST permutation = c(sample(2^10))# Randomly shuffle the training dataX1 = X[, permutation]y1 = Y[1, permutation]y2 <- as.matrix(y1)Y1=t(y2)# Set layer dimensionslayersDimensions=c(784, 15,9, 10) # Perform SGD with regular gradient descentretvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,                            hiddenActivationFunc='tanh',                            outputActivationFunc="softmax",                            learningRate = 0.05,                            optimizer="gd",                            mini_batch_size = 512,                             num_epochs = 5000,                             print_cost = True)
#Plot the cost vs iterationsiterations <- seq(0,5000,1000)costs=retvalsSGD$costsdf=data.frame(iterations,costs)ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs no of epochs") + xlab("No of epochss") + ylab("Cost")

1.1c. Stochastic Gradient Descent (Vanilla) – Octave

source("DL7functions.m")#Load and read MNISTload('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024permutation = randperm(1024);disp(length(permutation));# Use this 1024 as the batchX=trainX(permutation,:);Y=trainY(permutation,:);# Set layer dimensionslayersDimensions=[784, 15, 9, 10];# Perform SGD with regular gradient descent[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu',  outputActivationFunc="softmax", learningRate = 0.005, lrDecay=true,  decayRate=1, lambd=0, keep_prob=1, optimizer="gd", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512,  num_epochs = 5000);plotCostVsEpochs(5000,costs);

2.1. Stochastic Gradient Descent with Learning rate decay

Since in Stochastic Gradient Descent,with  each epoch, we use slight different samples, the gradient descent algorithm, oscillates across the ravines and wanders around the minima, when a fixed learning rate is used. In this technique of ‘learning rate decay’ the learning rate is slowly decreased with the number of epochs and becomes smaller and smaller, so that gradient descent can take smaller steps towards the minima.

There are several techniques employed in learning rate decay

a) Exponential decay: \alpha = decayRate^{epochNum} \cdot \alpha_{0}
b) 1/t decay: \alpha = \frac{\alpha_{0}}{1 + decayRate \cdot epochNum}
c) \alpha = \frac{decayRate}{\sqrt{epochNum}} \cdot \alpha_{0}

In my implementation I have used the ‘exponential decay’. The code snippet for Python is shown below

if lrDecay == True:
    learningRate = np.power(decayRate, (num_epochs / 1000)) * learningRate
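To get a feel for how the three schedules differ, here is a small illustrative R sketch (the values of \alpha_{0} and decayRate are made up purely for the plot):

alpha0 <- 0.01
decayRate <- 0.999
epochs <- 1:5000

alpha_exp <- decayRate^epochs * alpha0              # a) exponential decay
alpha_1t <- alpha0 / (1 + decayRate * epochs)       # b) 1/t decay
alpha_sqrt <- decayRate / sqrt(epochs) * alpha0     # c) decayRate/sqrt(epochNum) decay

matplot(epochs, cbind(alpha_exp, alpha_1t, alpha_sqrt), type = "l", lty = 1,
        xlab = "Epoch", ylab = "Learning rate", main = "Learning rate decay schedules")
legend("topright", legend = c("exponential", "1/t", "1/sqrt"), col = 1:3, lty = 1)

All three shrink the learning rate as training proceeds; they differ only in how aggressively they do so early versus late.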

2.1a. Stochastic Gradient Descent with Learning rate decay – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())

# Read the MNIST data
training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))
lbls = []
pxls = []
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T

# Create a list of 1024 random indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1, 2**10))

# Set layer dimensions
layersDimensions = [784, 15, 9, 10]
# Perform SGD with learning rate decay
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                                   outputActivationFunc="softmax",
                                   learningRate=0.01, lrDecay=True, decayRate=0.9999,
                                   optimizer="gd",
                                   mini_batch_size=512, num_epochs=1000, print_cost=True,
                                   figure="fig2.png")

2.1b. Stochastic Gradient Descent with Learning rate decay – R

source("mnist.R")source("DLfunctions7.R")# Read and load MNISTload_mnist()x <- t(train$x)X <- x[,1:60000]y <-train$yy1 <- y[1:60000]y2 <- as.matrix(y1)Y=t(y2)# Subset 1024 random samples from MNIST permutation = c(sample(2^10))# Randomly shuffle the training dataX1 = X[, permutation]y1 = Y[1, permutation]y2 <- as.matrix(y1)Y1=t(y2)# Set layer dimensionslayersDimensions=c(784, 15,9, 10) # Perform SGD with Learning rate decayretvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,                                  hiddenActivationFunc='tanh',                                  outputActivationFunc="softmax",                                  learningRate = 0.05,                                  lrDecay=TRUE,                                  decayRate=0.9999,                                  optimizer="gd",                                  mini_batch_size = 512,                                   num_epochs = 5000,                                   print_cost = True)
#Plot the cost vs iterationsiterations <- seq(0,5000,1000)costs=retvalsSGD$costsdf=data.frame(iterations,costs)ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

2.1c. Stochastic Gradient Descent with Learning rate decay – Octave

source("DL7functions.m")#Load and read MNISTload('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024permutation = randperm(1024);disp(length(permutation));# Use this 1024 as the batchX=trainX(permutation,:);Y=trainY(permutation,:);# Set layer dimensionslayersDimensions=[784, 15, 9, 10];# Perform SGD with regular Learning rate decay[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu',  outputActivationFunc="softmax", learningRate = 0.01, lrDecay=true,  decayRate=0.999, lambd=0, keep_prob=1, optimizer="gd", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512,  num_epochs = 5000);plotCostVsEpochs(5000,costs)

3.1. Stochastic Gradient Descent with Momentum

Stochastic Gradient Descent with Momentum uses the exponentially weighted average method discussed above; loosely speaking, it moves faster along the ravine than across it. The equations are

v_{dW}^l = \beta v_{dW}^l + (1-\beta)dW^{l}
v_{db}^l = \beta v_{db}^l + (1-\beta)db^{l}
W^{l} = W^{l} - \alpha v_{dW}^l
b^{l} = b^{l} - \alpha v_{db}^l

where v_{dW} and v_{db} are the momentum terms, which are exponentially weighted averages of the corresponding gradients dW and db at layer 'l'. The code snippet for Stochastic Gradient Descent with momentum in R is shown below.

# Perform Gradient Descent with momentum
# Input : parameters (weights and biases)
#       : gradients
#       : v - velocities
#       : beta
#       : learningRate
#       : outputActivationFunc - activation function at the output layer (sigmoid/softmax)
# Output: updated parameters after 1 iteration
gradientDescentWithMomentum <- function(parameters, gradients, v, beta, learningRate,
                                        outputActivationFunc="sigmoid"){

    L = length(parameters)/2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for(l in 1:(L-1)){
        # Compute velocities
        # v['dWk'] = beta * v['dWk'] + (1-beta) * dWk
        v[[paste("dW",l, sep="")]] = beta*v[[paste("dW",l, sep="")]] +
            (1-beta) * gradients[[paste('dW',l,sep="")]]
        v[[paste("db",l, sep="")]] = beta*v[[paste("db",l, sep="")]] +
            (1-beta) * gradients[[paste('db',l,sep="")]]

        parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
            learningRate * v[[paste("dW",l, sep="")]]
        parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
            learningRate * v[[paste("db",l, sep="")]]
    }

    # Compute for the Lth layer
    if(outputActivationFunc=="sigmoid"){
        v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
            (1-beta) * gradients[[paste('dW',L,sep="")]]
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * gradients[[paste('db',L,sep="")]]

        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate * v[[paste("db",L, sep="")]]
    }else if (outputActivationFunc=="softmax"){
        # The gradients at the output layer are transposed for the softmax case
        v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
            (1-beta) * t(gradients[[paste('dW',L,sep="")]])
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * t(gradients[[paste('db',L,sep="")]])

        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate * v[[paste("db",L, sep="")]]
    }
    return(parameters)
}

3.1a. Stochastic Gradient Descent with Momentum- Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

# Read and load data
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))
lbls = []
pxls = []
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T

# Create a list of 1024 random indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1, 2**10))

layersDimensions = [784, 15, 9, 10]
# Perform SGD with momentum
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                                   outputActivationFunc="softmax", learningRate=0.01,
                                   optimizer="momentum", beta=0.9,
                                   mini_batch_size=512, num_epochs=1000, print_cost=True,
                                   figure="fig3.png")

3.1b. Stochastic Gradient Descent with Momentum- R

source("mnist.R")source("DLfunctions7.R")load_mnist()x <- t(train$x)X <- x[,1:60000]y <-train$yy1 <- y[1:60000]y2 <- as.matrix(y1)Y=t(y2)# Subset 1024 random samples from MNIST permutation = c(sample(2^10))# Randomly shuffle the training dataX1 = X[, permutation]y1 = Y[1, permutation]y2 <- as.matrix(y1)Y1=t(y2)layersDimensions=c(784, 15,9, 10) # Perform SGD with momentumretvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,                                  hiddenActivationFunc='tanh',                                  outputActivationFunc="softmax",                                  learningRate = 0.05,                                  optimizer="momentum",                                  beta=0.9,                                  mini_batch_size = 512,                                   num_epochs = 5000,                                   print_cost = True)
#Plot the cost vs iterationsiterations <- seq(0,5000,1000)costs=retvalsSGD$costsdf=data.frame(iterations,costs)ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

3.1c. Stochastic Gradient Descent with Momentum- Octave

source("DL7functions.m")#Load and read MNISTload('./mnist/mnist.txt.gz'); #Create a random permutatation from 60Kpermutation = randperm(1024);disp(length(permutation));# Use this 1024 as the batchX=trainX(permutation,:);Y=trainY(permutation,:);# Set layer dimensionslayersDimensions=[784, 15, 9, 10];# Perform SGD with Momentum[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu',  outputActivationFunc="softmax", learningRate = 0.01, lrDecay=false,  decayRate=1, lambd=0, keep_prob=1, optimizer="momentum", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512,  num_epochs = 5000);plotCostVsEpochs(5000,costs)

4.1. Stochastic Gradient Descent with RMSProp

Stochastic Gradient Descent with RMSProp tries to move faster towards the minima while dampening the oscillations across the ravine. The equations are

s_{dW}^l = \beta_{1} s_{dW}^l + (1-\beta_{1})(dW^{l})^{2}
s_{db}^l = \beta_{1} s_{db}^l + (1-\beta_{1})(db^{l})^{2}
W^{l} = W^{l} - \frac{\alpha \, dW^{l}}{\sqrt{s_{dW}^l + \epsilon}}
b^{l} = b^{l} - \frac{\alpha \, db^{l}}{\sqrt{s_{db}^l + \epsilon}}

where s_{dW} and s_{db} are the RMSProp terms, exponentially weighted averages of the squared gradients (dW)^{2} and (db)^{2} at the corresponding layer 'l'.
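Before looking at the full implementation, a minimal single-parameter R sketch (purely illustrative, on a toy quadratic) shows the mechanics of the RMSProp update:

# RMSProp on a toy 1-D quadratic, f(w) = (w - 3)^2
grad <- function(w) 2 * (w - 3)
w <- 0
s <- 0
beta1 <- 0.9
alpha <- 0.1
epsilon <- 1e-8

for (i in 1:200) {
  g <- grad(w)
  s <- beta1 * s + (1 - beta1) * g^2       # exponentially weighted average of the squared gradient
  w <- w - alpha * g / sqrt(s + epsilon)   # step scaled by the running RMS of the gradient
}
w   # ends up close to the minimum at 3

Because each step is scaled by the running average of the squared gradient, the effective step size stays roughly \alpha regardless of the raw gradient magnitude, which is what damps the oscillations.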

The code snippet in Octave is shown below

# Update parameters with RMSProp
# Input : weights, biases
#       : gradsDW, gradsDB - gradients
#       : sdW, sdB - exponentially weighted averages of the squared gradients
#       : beta1
#       : epsilon
#       : learningRate
# Output: updated weights and biases
function [weights biases] = gradientDescentWithRMSProp(weights, biases, gradsDW, gradsDB, ...
                                sdW, sdB, beta1, epsilon, learningRate, ...
                                outputActivationFunc="sigmoid")
    L = size(weights)(2); # number of layers in the neural network
    # Update rule for each parameter.
    for l=1:(L-1)
        sdW{l} = beta1*sdW{l} + (1 -beta1) * gradsDW{l} .* gradsDW{l};
        sdB{l} = beta1*sdB{l} + (1 -beta1) * gradsDB{l} .* gradsDB{l};
        weights{l} = weights{l} - learningRate* gradsDW{l} ./ sqrt(sdW{l} + epsilon);
        biases{l} = biases{l} - learningRate* gradsDB{l} ./ sqrt(sdB{l} + epsilon);
    endfor

    if (strcmp(outputActivationFunc,"sigmoid"))
        sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L} .* gradsDW{L};
        sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L} .* gradsDB{L};
        weights{L} = weights{L} - learningRate* gradsDW{L} ./ sqrt(sdW{L} + epsilon);
        biases{L} = biases{L} - learningRate* gradsDB{L} ./ sqrt(sdB{L} + epsilon);
    elseif (strcmp(outputActivationFunc,"softmax"))
        # The gradients at the output layer are transposed for the softmax case
        sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L}' .* gradsDW{L}';
        sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L}' .* gradsDB{L}';
        weights{L} = weights{L} - learningRate* gradsDW{L}' ./ sqrt(sdW{L} + epsilon);
        biases{L} = biases{L} - learningRate* gradsDB{L}' ./ sqrt(sdB{L} + epsilon);
    endif
end

4.1a. Stochastic Gradient Descent with RMSProp – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())

# Read and load MNIST
training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))
lbls = []
pxls = []
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T
print("X1=", X1.shape)
print("y1=", Y1.shape)

# Create a list of 1024 random indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1, 2**10))

layersDimensions = [784, 15, 9, 10]
# Use SGD with RMSProp
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                                   outputActivationFunc="softmax", learningRate=0.01,
                                   optimizer="rmsprop", beta1=0.7, epsilon=1e-8,
                                   mini_batch_size=512, num_epochs=1000, print_cost=True,
                                   figure="fig4.png")

4.1b. Stochastic Gradient Descent with RMSProp – R

source("mnist.R")source("DLfunctions7.R")load_mnist()x <- t(train$x)X <- x[,1:60000]y <-train$yy1 <- y[1:60000]y2 <- as.matrix(y1)Y=t(y2)# Subset 1024 random samples from MNIST permutation = c(sample(2^10))# Randomly shuffle the training dataX1 = X[, permutation]y1 = Y[1, permutation]y2 <- as.matrix(y1)Y1=t(y2)layersDimensions=c(784, 15,9, 10) #Perform SGD with RMSPropretvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,                                  hiddenActivationFunc='tanh',                                  outputActivationFunc="softmax",                                  learningRate = 0.001,                                  optimizer="rmsprop",                                  beta1=0.9,                                  epsilon=10^-8,                                  mini_batch_size = 512,                                   num_epochs = 5000 ,                                   print_cost = True)
#Plot the cost vs iterationsiterations <- seq(0,5000,1000)costs=retvalsSGD$costsdf=data.frame(iterations,costs)ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

4.1c. Stochastic Gradient Descent with RMSProp – Octave

source("DL7functions.m")load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024permutation = randperm(1024);# Use this 1024 as the batchX=trainX(permutation,:);Y=trainY(permutation,:);# Set layer dimensionslayersDimensions=[784, 15, 9, 10];#Perform SGD with RMSProp[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu',  outputActivationFunc="softmax", learningRate = 0.005, lrDecay=false,  decayRate=1, lambd=0, keep_prob=1, optimizer="rmsprop", beta=0.9, beta1=0.9, beta2=0.999, epsilon=1, mini_batch_size = 512,  num_epochs = 5000);plotCostVsEpochs(5000,costs)

5.1. Stochastic Gradient Descent with Adam

Adaptive Moment Estimation (Adam) combines momentum (the 1st moment) and RMSProp (the 2nd moment). The equations for Adam are below:

v_{dW}^l = \beta_{1} v_{dW}^l + (1-\beta_{1})dW^{l}
v_{db}^l = \beta_{1} v_{db}^l + (1-\beta_{1})db^{l}

The bias corrections for the 1st moment are

vCorrected_{dW}^l = \frac{v_{dW}^l}{1 - \beta_{1}^{t}}
vCorrected_{db}^l = \frac{v_{db}^l}{1 - \beta_{1}^{t}}

Similarly, the moving averages for the 2nd moment (the RMSProp term) are

s_{dW}^l = \beta_{2} s_{dW}^l + (1-\beta_{2})(dW^{l})^{2}
s_{db}^l = \beta_{2} s_{db}^l + (1-\beta_{2})(db^{l})^{2}

with the bias corrections

sCorrected_{dW}^l = \frac{s_{dW}^l}{1 - \beta_{2}^{t}}
sCorrected_{db}^l = \frac{s_{db}^l}{1 - \beta_{2}^{t}}

The Adam update is then

W^{l} = W^{l} - \frac{\alpha \, vCorrected_{dW}^l}{\sqrt{sCorrected_{dW}^l + \epsilon}}
b^{l} = b^{l} - \frac{\alpha \, vCorrected_{db}^l}{\sqrt{sCorrected_{db}^l + \epsilon}}

The code snippet of Adam in R is included below.

# Perform Gradient Descent with Adam
# Input : parameters (weights and biases)
#       : gradients
#       : v, s - 1st and 2nd moment estimates
#       : t - iteration count
#       : beta1, beta2
#       : epsilon
#       : learningRate
#       : outputActivationFunc - activation function at the output layer (sigmoid/softmax)
# Output: updated parameters after 1 iteration
gradientDescentWithAdam <- function(parameters, gradients, v, s, t,
                                    beta1=0.9, beta2=0.999, epsilon=10^-8,
                                    learningRate=0.1, outputActivationFunc="sigmoid"){

    L = length(parameters)/2 # number of layers in the neural network
    v_corrected <- list()
    s_corrected <- list()

    # Update rule for each parameter. Use a for loop.
    for(l in 1:(L-1)){
        # v['dWk'] = beta * v['dWk'] + (1-beta) * dWk
        v[[paste("dW",l, sep="")]] = beta1*v[[paste("dW",l, sep="")]] +
            (1-beta1) * gradients[[paste('dW',l,sep="")]]
        v[[paste("db",l, sep="")]] = beta1*v[[paste("db",l, sep="")]] +
            (1-beta1) * gradients[[paste('db',l,sep="")]]

        # Compute bias-corrected first moment estimate
        v_corrected[[paste("dW",l, sep="")]] = v[[paste("dW",l, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",l, sep="")]] = v[[paste("db",l, sep="")]]/(1-beta1^t)

        # Element-wise multiply of gradients
        s[[paste("dW",l, sep="")]] = beta2*s[[paste("dW",l, sep="")]] +
            (1-beta2) * gradients[[paste('dW',l,sep="")]] * gradients[[paste('dW',l,sep="")]]
        s[[paste("db",l, sep="")]] = beta2*s[[paste("db",l, sep="")]] +
            (1-beta2) * gradients[[paste('db',l,sep="")]] * gradients[[paste('db',l,sep="")]]

        # Compute bias-corrected second moment estimate
        s_corrected[[paste("dW",l, sep="")]] = s[[paste("dW",l, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",l, sep="")]] = s[[paste("db",l, sep="")]]/(1-beta2^t)

        # Update parameters
        d1 = sqrt(s_corrected[[paste("dW",l, sep="")]]+epsilon)
        d2 = sqrt(s_corrected[[paste("db",l, sep="")]]+epsilon)

        parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
            learningRate * v_corrected[[paste("dW",l, sep="")]]/d1
        parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
            learningRate * v_corrected[[paste("db",l, sep="")]]/d2
    }

    # Compute for the Lth layer
    if(outputActivationFunc=="sigmoid"){
        v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
            (1-beta1) * gradients[[paste('dW',L,sep="")]]
        v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +
            (1-beta1) * gradients[[paste('db',L,sep="")]]

        # Compute bias-corrected first moment estimate
        v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)

        # Element-wise multiply of gradients
        s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
            (1-beta2) * gradients[[paste('dW',L,sep="")]] * gradients[[paste('dW',L,sep="")]]
        s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +
            (1-beta2) * gradients[[paste('db',L,sep="")]] * gradients[[paste('db',L,sep="")]]

        # Compute bias-corrected second moment estimate
        s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)

        # Update parameters
        d1 = sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
        d2 = sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)

        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate * v_corrected[[paste("db",L, sep="")]]/d2
    }else if (outputActivationFunc=="softmax"){
        # The gradients at the output layer are transposed for the softmax case
        v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
            (1-beta1) * t(gradients[[paste('dW',L,sep="")]])
        v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +
            (1-beta1) * t(gradients[[paste('db',L,sep="")]])

        # Compute bias-corrected first moment estimate
        v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)

        # Element-wise multiply of gradients
        s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
            (1-beta2) * t(gradients[[paste('dW',L,sep="")]]) * t(gradients[[paste('dW',L,sep="")]])
        s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +
            (1-beta2) * t(gradients[[paste('db',L,sep="")]]) * t(gradients[[paste('db',L,sep="")]])

        # Compute bias-corrected second moment estimate
        s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)

        # Update parameters
        d1 = sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
        d2 = sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)

        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate * v_corrected[[paste("db",L, sep="")]]/d2
    }
    return(parameters)
}

5.1a. Stochastic Gradient Descent with Adam – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())

training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))
lbls = []
pxls = []
print(len(training))
# for i in range(len(training)):
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T

# Create a list of 1024 random indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1, 2**10))

layersDimensions = [784, 15, 9, 10]
# Perform SGD with Adam optimization
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                                   outputActivationFunc="softmax", learningRate=0.01,
                                   optimizer="adam", beta1=0.9, beta2=0.9, epsilon=1e-8,
                                   mini_batch_size=512, num_epochs=1000, print_cost=True,
                                   figure="fig5.png")

5.1b. Stochastic Gradient Descent with Adam – R

source("mnist.R")source("DLfunctions7.R")load_mnist()x <- t(train$x)X <- x[,1:60000]y <-train$yy1 <- y[1:60000]y2 <- as.matrix(y1)Y=t(y2)# Subset 1024 random samples from MNIST permutation = c(sample(2^10))# Randomly shuffle the training dataX1 = X[, permutation]y1 = Y[1, permutation]y2 <- as.matrix(y1)Y1=t(y2)layersDimensions=c(784, 15,9, 10) #Perform SGD with AdamretvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,                                  hiddenActivationFunc='tanh',                                  outputActivationFunc="softmax",                                  learningRate = 0.005,                                  optimizer="adam",                                  beta1=0.7,                                  beta2=0.9,                                  epsilon=10^-8,                                  mini_batch_size = 512,                                   num_epochs = 5000 ,                                   print_cost = True)
#Plot the cost vs iterationsiterations <- seq(0,5000,1000)costs=retvalsSGD$costsdf=data.frame(iterations,costs)ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

5.1c. Stochastic Gradient Descent with Adam – Octave

source("DL7functions.m")load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024permutation = randperm(1024);disp(length(permutation));# Use this 1024 as the batchX=trainX(permutation,:);Y=trainY(permutation,:);# Set layer dimensionslayersDimensions=[784, 15, 9, 10];# Note the high value for epsilon. #Otherwise GD with Adam does not seem to converge   # Perform SGD with Adam         [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,                       hiddenActivationFunc='relu',                        outputActivationFunc="softmax",                       learningRate = 0.1,                       lrDecay=false,                        decayRate=1,                       lambd=0,                       keep_prob=1,                       optimizer="adam",                       beta=0.9,                       beta1=0.9,                       beta2=0.9,                       epsilon=100,                       mini_batch_size = 512,                        num_epochs = 5000);plotCostVsEpochs(5000,costs)

Conclusion: In this post I discuss and implement several Stochastic Gradient Descent optimization methods. The implementation of these methods enhances my already existing generic L-Layer Deep Learning Network implementation in vectorized Python, R and Octave, which I had discussed in the previous post in this series on Deep Learning from first principles in Python, R and Octave. Check it out, if you haven't already. As already mentioned, the code for this post can be cloned/forked from GitHub at DeepLearning-Part7.

Watch this space! I’ll be back!

Also see
1. My book 'Practical Machine Learning with R and Python' on Amazon
2. Deep Learning from first principles in Python, R and Octave – Part 3
3. Experiments with deblurring using OpenCV
4. Design Principles of Scalable, Distributed Systems
5. Natural language processing: What would Shakespeare say?
6. yorkr crashes the IPL party! – Part 3!
7. cricketr flexes new muscles: The final analysis

To see all post click Index of posts



Statistics Sunday: Conducting Meta-Analysis in R


(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Here it is, everyone! The promised 4th post on meta-analysis, and my second video for Deeply Trivial! In this video, I walk through conducting a basic meta-analysis, both fixed and random effects, in the metafor package:

See these previous posts and links for more information:

You can access the code I used in the video here as well as code to do similar analysis with the or_meta dataset here.



Using Shiny Dashboards for Financial Analysis


(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

For some time now, I have been trading traditional assets—mostly U.S. equities. About a year ago, I jumped into the cryptocurrency markets to try my hand there as well. In my time in investor Telegram chats and subreddits, I often saw people arguing over which investments had performed better over time, but the reality was that most such statements were anecdotal, and thus unfalsifiable.

Given the paucity of cryptocurrency data available in an easily accessible format, it was quite difficult to say for certain that a particular investment was a good one relative to some alternative, unless you were very familiar with a handful of APIs. Even then, assuming you knew how to get daily OHLC data for a crypto-asset like Bitcoin, in order to compare it to some other asset—say Amazon stock—you would have to eyeball trends from a website like Yahoo finance or scrape that data separately and build your own visualizations and metrics. In short, historical asset performance comparisons in the crypto space were difficult to conduct for all but the most technically savvy individuals, so I set out to build a tool that would remedy this, and the Financial Asset Comparison Tool was born.

In this post, I aim to describe a few key components of the dashboard, and also call out lessons learned from the process of iterating on the tool along the way. Prior to proceeding, I highly recommend that you read the app’s README and take a look at the UI and code base itself, as this will provide the context necessary to understanding the rest of the commentary below.

I'll start by delving into a few principles that I find to be key when designing analytic dashboards, drawing on the asset comparison dashboard as my exemplar, and will end with some discussion of the relative utility of a few packages integral to the app. Overall, my goal is not to focus on the tool that I built alone, but to highlight a few main best practices when it comes to building dashboards for any analysis.

Build the app around the story, not the other way around.

Before ever writing a single line of code for an analytic app, I find that it is absolutely imperative to have a clear vision of the story that the tool must tell. I do not mean by this that you should already have conclusions about your data that you will then force the app into telling, but rather that you must know how you want your user to interact with the app in order to glean useful information.

In the case of my asset comparison tool, I wanted to serve multiple audiences—everyone from a casual trader who just wanted to see which investment produced the greatest net profit over a period of time, to a more experienced trader with more nuanced questions about risk-adjusted return on investment given varying discount rates. The trick is thus building the app in such a way that it serves all possible audiences without hindering any one type of user in particular.

The way I designed my app to meet this need was to build the UI such that as you descend the various sections vertically, the metrics displayed scale in complexity. My reasoning for this becomes apparent when you consider the two extremes in terms of users—the most basic vs. the most advanced trader.

The most basic user will care only about the assets of interest, the time period they want to examine, and how their initial investment performed over time. As such, they will start with the sidebar, input their assets and time frame of choice, and then use the top right-most input box to modulate their initial investment amount (although some may choose to stick with the default value here). They will then see the first chart change to reflect their choices, and they will see, both visually, and via the summary table below, which asset performed better.

The experienced trader, on the other hand, will start off exactly as the novice did, by choosing assets of interest, a time frame of reference, and an initial investment amount. They may then choose to modulate the LOESS parameters as they see fit, descending the page, looking over the simple returns section, perhaps stopping to make changes to the corresponding inputs there, and finally ending at the bottom of the page—at the Sharpe Ratio visualizations. Here they will likely spend more time—playing around with the time period over which to measure returns and changing the risk-free rate to align with their own personal macroeconomic assumptions.

The point of these two examples is to illustrate that the app by dint of its structure alone guides the user through the analytic story in a waterfall-like manner—building from simple portfolio performance, to relative performance, to the most complicated metrics for risk-adjusted returns. This keeps the novice trader from being overwhelmed or confused, and also allows the most experienced user to follow the same line of thought that they would anyway when comparing assets, while following a logical progression of complexity, as shown via the screenshot below.

Once you think you have a structure that guides all users through the story you want them to experience, test it by asking yourself if the app flows in such a way that you could pose and answer a logical series of questions as you navigate the app without any gaps in cohesion. In the case of this app, the questions that the UI answers as you descend are as follows:

  • How do these assets compare in terms of absolute profit?
  • How do these assets compare in terms of simple return on investment?
  • How do these assets compare in terms of variance-adjusted and/or risk-adjusted return on investment?

Thus, when you string these questions together, you can make statements of the type: "Asset X seemed to outperform Asset Y in terms of absolute profit, and this trend held true as well when it comes to simple return on investment, over varying time frames. That said, when you take into account the variance inherent to Asset X, it seems that Asset Y may have been the best choice, as the excess downside risk associated with Asset X outweighs its excess net profitability."

Too many cooks in the kitchen—the case for a functional approach to app-building.

While the design of the UI of any analytic app is of great importance, it’s important to not forget that the code base itself should also be well-designed; a fully-functional app from the user’s perspective can still be a terrible app to work with if the code is a jumbled, incomprehensible mess. A poorly designed code base makes QC a tiresome, aggravating process, and knowledge sharing all but impossible.

For this reason, I find that sourcing a separate R script file containing all analytic functions necessitated by the app is the best way to go, as done below (you can see Functions.R at my repo here).

# source the Functions.R file, where all analytic functions for the app are stored
source("Functions.R")

Not only does this allow for a more comprehensible and less-cluttered App.R, but it also drastically improves testability and reusability of the code. Consider the example function below, used to create the portfolio performance chart in the app (first box displayed in the UI, center middle).

build_portfolio_perf_chart <- function(data, port_loess_param = 0.33){

  port_tbl <- data[,c(1,4:5)]

  # grabbing the 2 asset names
  asset_name1 <- sub('_.*', '', names(port_tbl)[2])
  asset_name2 <- sub('_.*', '', names(port_tbl)[3])

  # transforms dates into correct type so smoothing can be done
  port_tbl[,1] <- as.Date(port_tbl[,1])
  date_in_numeric_form <- as.numeric((port_tbl[,1]))
  # assigning loess smoothing parameter
  loess_span_parameter <- port_loess_param

  # now building the plotly itself
  port_perf_plot <- plot_ly(data = port_tbl, x = ~port_tbl[,1]) %>%
    # asset 1 data plotted
    add_markers(y = ~port_tbl[,2],
                marker = list(color = '#FC9C01'),
                name = asset_name1,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,2] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#FC9C01'),
              name = asset_name1,
              showlegend = TRUE) %>%
    # asset 2 data plotted
    add_markers(y = ~port_tbl[,3],
                marker = list(color = '#3498DB'),
                name = asset_name2,
                showlegend = FALSE) %>%
    add_lines(y = ~fitted(loess(port_tbl[,3] ~ date_in_numeric_form, span = loess_span_parameter)),
              line = list(color = '#3498DB'),
              name = asset_name2,
              showlegend = TRUE) %>%
    layout(
      title = FALSE,
      xaxis = list(type = "date",
                   title = "Date"),
      yaxis = list(title = "Portfolio Value ($)"),
      legend = list(orientation = 'h',
                    x = 0,
                    y = 1.15)) %>%
    add_annotations(
      x = 1,
      y = 1.133,
      xref = "paper",
      yref = "paper",
      text = "",
      showarrow = F
    )

  return(port_perf_plot)
}

Writing this function in the sourced Functions.R file instead of directly within the App.R allows the developer to first test the function itself with fake data—i.e., data not gleaned from the reactive inputs. Once it has been tested in this way, it can be integrated into the App.R on the server side as shown below, with very little code.

output$portfolio_perf_chart <- 
  debounce(
    renderPlotly({
      data <- react_base_data()
      build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
    }), 
    millis = 2000) # sets wait time for debounce

This process allows for better error-identification and troubleshooting. If, for example, you want to change the work accomplished by the analytic function in some way, you can make the changes necessary to the code, and if the app fails to produce the desired outcome, you simply restart the chain: first you test the function in a vacuum outside of the app, and if it runs fine there, then you know that you have a problem with the way the reactive inputs are integrating with the function itself. This is a huge time saver when debugging.
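As an example of such a vacuum test, the charting function above could be exercised with a stand-in data frame like the one below; it simply mimics the column layout the function expects (dates in column 1, the two portfolio-value columns in positions 4 and 5), and the column names are made up:

# stand-in data for testing build_portfolio_perf_chart() outside of Shiny;
# the asset columns are named so that sub('_.*', '', ...) yields readable legend labels
library(plotly)
source("Functions.R")

fake_data <- data.frame(
  date      = seq.Date(as.Date("2018-01-01"), by = "day", length.out = 100),
  filler_a  = NA,
  filler_b  = NA,
  BTC_value = 1000 + cumsum(rnorm(100, 1, 5)),
  ETH_value = 1000 + cumsum(rnorm(100, 0.5, 4))
)

# if this renders a sensible chart, any remaining problems live in the reactive plumbing
build_portfolio_perf_chart(fake_data, port_loess_param = 0.33)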

Lastly, this allows for ease of reproducibility and hand-offs. If, say, one of your functions simply takes in a dataset and produces a chart of some sort, then it can be easily copied from the Functions.R and reused elsewhere. I have done this too many times to count, ripping code from one project and, with a few alterations, instantly applying it in other contexts. This is easy to do if the functions are written in a manner not dependent on a particular Shiny reactive structure. For all these reasons, it makes sense in most cases to keep the code for the app UI and inputs cleanly separated from the analytic functions via a sourced R script.

Dashboard documentation—both a story and a manual, not one or the other.

When building an app for a customer at work, I never simply write an email with a link in it and write “here you go!” That will result in, at best, a steep learning curve, and at worst, an app used in an unintended way, resulting in user frustration or incorrect results. I always meet with the customer, explain the purpose and functionalities of the tool, walk through the app live, take feedback, and integrate any key takeaways into further iterations.

Even if you are just planning on writing some code to put up on GitHub, you should still consider all of these steps when working on the documentation for your app. In most cases, the README is the epicenter of your documentation—the README is your meeting with the customer.  As you saw when reading the README for the Asset Comparison Tool, I always start my READMEs with a high-level introduction to the purpose of the app—hopefully written or supplemented with visuals (as seen below) that are easy to understand and will capture the attention of browsing passers-by. 

After this introduction, the rest of the potential sections to include can vary greatly from app-to-app. In some cases apps are meant to answer one particular question, and might have a variety of filters or supplemental functionalities—one such example can be found here. As can be seen, in that README, I spend a great deal of time on the methodology after making the overall purpose clear, calling out additional options along the way. In the case of the README for the Asset Comparison Tool, however, the story is a bit different. Given that there are many questions that the app seeks to answer, it makes sense to answer each in turn, writing the README in such a way that its progression mirrors the logical flow of the progression intended for the app’s user.

One should of course not neglect to cover necessary technical information in the README as well. Anything that is not immediately clear from using the app should be clarified in the README—from calculation details to the source of your data, etc. Finally, don’t neglect the iterative component! Mention how you want to interact with prospective users and collaborators in your documentation. For example, I normally call out how I would like people to use the Issues tab on GitHub to propose any changes or additions to the documentation, or the app in general. In short, your documentation must include both the story you want to tell, and a manual for your audience to follow. 

Why Shiny Dashboard?

One of the first things you will notice about the app.R code is that the entire thing is built using Shiny Dashboard as its skeleton. There are two main reasons for this, which I will touch on in turn.

Shiny Dashboard provides the biggest bang for your buck in terms of how much UI complexity and customizability you get out of just a small amount of code.

I can think of few cases where any analyst or developer would prefer longer, more verbose code to a shorter, succinct solution. That said, Shiny Dashboard’s simplicity when it comes to UI manipulation and customization is not just helpful because it saves you time as a coder, but because it is intuitive from the perspective of your audience.

Most of the folks that use the tools I have built to shed insight into economic questions don’t know how to code in R or Python, but they can, with a little help from extensive commenting and detailed documentation, understand the broad structure of an app coded in Shiny Dashboard format. This is, I believe, largely a function of two features of Shiny Dashboard: the colloquial-English-like syntax of the code for UI elements, and the lack of the necessity for in-line or external CSS.

As you can see from the example below, Shiny Dashboard’s system of “boxes” for UI building is easy to follow. Users can see a box in the app and easily tie that back to a particular box in the UI code.

Here is the box as visible to the user:

And here is the code that produces the box:

box(
  title = "Portfolio Performance Inputs",
  status = "primary",
  solidHeader = TRUE,
  h5("This box focuses on portfolio value, i.e., how much an initial investment of the amount specified below (in USD) would be worth over time, given price fluctuations."),
  
  textInput(
    inputId = "initial_investment",
    label = "Enter your initial investment amount ($):",
    value = "1000"),
  
  hr(),
  
  h5("The slider below modifies the", a(href = "https://stats.stackexchange.com/questions/2002/how-do-i-decide-what-span-to-use-in-loess-regression-in-r", "smoothing parameter"), "used in the", a(href = "https://en.wikipedia.org/wiki/Local_regression", "LOESS function"), "that produces the lines on the scatterplot."),
  
  sliderInput(
    inputId = "port_loess_param",
    label = "Smoothing parameter for portfolio chart:",
    min = 0.1,
    max = 2,
    value = .33,
    step = 0.01,
    animate = FALSE
  ),
  
  hr(),
  h5("The table below provides metrics by which we can compare the portfolios. For each column, the asset that performed best by that metric is colored green."),
  
  height = 500, 
  width = 4
)

Secondly, and somewhat related to the first point, with Shiny Dashboard, much of the coloring and overall UI design comes pre-made via dashboard-wide “skins”, and box-specific “statuses.”

This is great if you are okay with sacrificing a bit of control for a significant reduction in code complexity. In my experience dealing with non-coding-proficient audiences, I find that in-line CSS or complicated external CSS makes folks far more uncomfortable with the code in general. Anything you can do to reduce this anxiety and make those using your tools feel as though they understand them better is a good thing, and Shiny Dashboard makes that easier.
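To illustrate (a minimal sketch, not code from the app), the skin argument of dashboardPage() and the status argument of box() are all it takes to get a coherent color scheme without a line of CSS:

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  skin = "blue",                      # dashboard-wide color theme
  dashboardHeader(title = "Demo"),
  dashboardSidebar(),
  dashboardBody(
    box(title = "Example box",
        status = "primary",           # box-specific status color
        solidHeader = TRUE,
        width = 4,
        "Box content goes here")
  )
)

server <- function(input, output) {}

# shinyApp(ui, server)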

Shiny Dashboard’s combination of sidebar and boxes makes for easy and efficient data processing when your app has a waterfall-like analytic structure. 

Having written versions of this app both in base Shiny and using Shiny Dashboard, the number one reason I chose Shiny Dashboard was the fact that the analytic questions I sought to solve followed a waterfall-like structure, as explained in the previous section. This works perfectly well with Shiny Dashboard’s combination of sidebar input controls and inputs within UI boxes themselves.  

The inputs of primordial importance to all users are included in the sidebar UI: the two assets to analyze, and the date range over which to compare their performance. These are the only inputs that all users, regardless of experience or intent, must absolutely use, and when they are changed, all views in the dashboard will be affected. All other inputs are stored in the UI Boxes adjacent to the views that they modulate. This makes for a much more intuitive and fluid user experience, as once the initial sidebar inputs have been modulated, the sidebar can be hidden, as all other non-hidden inputs affect only the visualizations to which they are adjacent.

This waterfall-like structure also makes for more efficient reactive processes on the Shiny back-end. The inputs on the sidebar are parameters that, when changed, force the main reactive function that creates the primary dataset to fire, thus recreating the base dataset (as can be seen in the code for that base dataset's creation below).

# utility functions to be used within the server; this enables us to use a textinput for our portfolio values
exists_as_number <- function(item) {
  !is.null(item) && !is.na(item) && is.numeric(item)
}

# data-creation reactives (i.e. everything that doesn't directly feed an output)

# first is the main data pull which will fire whenever the primary inputs (asset_1a, asset_2a, initial_investment, or port_dates1a change)
react_base_data <- reactive({
  if (exists_as_number(as.numeric(input$initial_investment)) == TRUE) {
    # creates the dataset to feed the viz
    return(
      get_pair_data(
        asset_1 = input$asset_1a,
        asset_2 = input$asset_2a, 
        port_start_date = input$port_dates1a[1],
        port_end_date = input$port_dates1a[2],
        initial_investment = (as.numeric(input$initial_investment))
      )
    )
  } else {
    return(
      get_pair_data(
        asset_1 = input$asset_1a,
        asset_2 = input$asset_2a, 
        port_start_date = input$port_dates1a[1],
        port_end_date = input$port_dates1a[2],
        initial_investment = (0)
      )
    )
  }
})

Each of the visualizations is then produced via its own separate reactive function, each of which takes the main reactive as an input (as shown below). This means that whenever the sidebar inputs are changed, all reactives fire and all visualizations are updated; however, if all that changes is a single loess smoothing parameter input, only the reactive used in the creation of that particular parameter-dependent visualization fires, which makes for great computational efficiency.

# Now the reactives for the actual visualizations
output$portfolio_perf_chart <- 
  debounce(
    renderPlotly({
      data <- react_base_data()
      build_portfolio_perf_chart(data, port_loess_param = input$port_loess_param)
    }), 
    millis = 2000) # sets wait time for debounce

Why Plotly?

Plotly vs. ggplot is always a fun subject for discussion among folks who build visualizations in R. Sometimes I feel like such discussions just devolve into the same type of argument as R vs. Python for data science (my answer to this question being just pick one and learn it well), but over time I have found that there are actually some circumstances where the plotly vs. ggplot debate can yield cleaner answers.

In particular, I have found in working on this particular type of analytic app that there are two areas where plotly has a bit of an advantage: clickable interactivity, and wide data.

Those familiar with ggplot will know that every good ggplot begins with long data. It is possible, via some functions, to transform wide data into a long format, but that transformation can sometimes be problematic. While there are essentially no circumstances in which it is impossible to transform wide data into long format, there are a handful of cases where it is excessively cumbersome: namely, when dealing with indexed xts objects (as shown below) or time series / OHLC-styled data.

In these cases—either due to the sometimes-awkward way in which you have to handle rowname indexes in xts, or the time and code complexity saved by not having to transform every dataset into long format—plotly offers efficiency gains relative to ggplot.
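As a rough sketch of what that looks like (the xts object below is simulated rather than pulled from the app), plotly can consume the wide columns of an xts object directly, whereas the ggplot route would typically begin by reshaping the same data into long format:

library(xts)
library(plotly)

dates <- seq.Date(as.Date("2018-01-01"), by = "day", length.out = 30)
wide_xts <- xts(
  cbind(asset_1 = 100 + cumsum(rnorm(30)),
        asset_2 = 100 + cumsum(rnorm(30))),
  order.by = dates
)

# plotly: use the wide columns as-is, with the xts index supplying the x-axis
plot_ly(x = index(wide_xts)) %>%
  add_lines(y = as.numeric(wide_xts$asset_1), name = "asset_1") %>%
  add_lines(y = as.numeric(wide_xts$asset_2), name = "asset_2")

# ggplot would first need a gather()/melt()-style step to produce long data,
# e.g. a data frame with columns: date, asset, value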

The aforementioned efficiency gains are a reason to choose plotly in some cases because it makes the life of the coder easier, but there are also reasons why it sometimes makes the life of the user easier as well.

If one of the primary utilities of a visualization is to allow the user the ability to seamlessly and intuitively zoom in on, select, or filter the data displayed, particularly in the context of a Shiny App, then plotly should be strongly considered. Sure, ggplotly wrappers can be used to make a ggplot interactive, but with an added layer of abstraction comes an added layer of possible errors. While in most cases a ggplotly wrapper should work seamlessly, I have found that, particularly in cases where auto-sizing and margin size specification is key, ggplotly can require a great deal of added code in order to work correctly in a Shiny context.
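For reference, the wrapper route itself is only a couple of lines, roughly as sketched below; the extra layout() call hints at the kind of sizing and margin adjustments mentioned above:

library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# wrap the ggplot in an interactive plotly widget; margins often need manual tuning in Shiny
ggplotly(p) %>%
  layout(autosize = TRUE,
         margin = list(l = 50, r = 20, b = 40, t = 30))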

In summary, when considering when to start with plotly vs. when to start with ggplot, I find one question to be particularly helpful: what do I value most—visual complexity and/or customization, or interactive versatility and/or preserving wide data?

If I choose the former, then ggplot is what I need; otherwise, I go with plotly. More often than not I find that ggplot emerges victorious, but even if you disagree with me in my decision-making calculus, I think it is helpful to at least think through what your personal calculus is. This will save you time when coding, as instead of playing around with various types of viz, you can simply pose the question(s) behind your calculus and know quickly what solution best fits your problem.

Why Formattable?

The case for formattable is, in my opinion, a much easier case to make than arguing for plotly vs. ggplot. The only question worth asking when deciding whether or not to use formattable in your app is: do I want my table to tell a quick story via added visual complexity within the same cell that contains my data, or is a reference table all I am looking for? If you choose the former, formattable is probably a good way to go. You’ll notice as well that the case for formattable is very specific–in most cases there is likely a simpler solution via the DT or kableExtra packages.

The one downside that I have encountered in dealing with formattable code is the amount of code necessary to generate even moderately complicated tables. That said, this problem is easily remedied via a quick function that we can use to kill most of the duplicative coding, as seen in the example below.

First, here is the long form version:

react_formattable <- reactive({
  return(
    formattable(react_port_summary_table(), 
                list(
                  "Asset Portfolio Max Worth" = formatter("span",
                                                          style = x ~ style(
                                                            display = "inline-block",
                                                            direction = "rtl",
                                                            "border-radius" = "4px",
                                                            "padding-right" = "2px",
                                                            "background-color" = csscolor("darkslategray"),
                                                            width = percent(proportion(x)),
                                                            color = csscolor(gradient(x, "red", "green"))
                                                          )),
                  "Asset Portfolio Latest Worth" = formatter("span",
                                                             style = x ~ style(
                                                               display = "inline-block",
                                                               direction = "rtl",
                                                               "border-radius" = "4px",
                                                               "padding-right" = "2px",
                                                               "background-color" = csscolor("darkslategray"),
                                                               width = percent(proportion(x)),
                                                               color = csscolor(gradient(x, "red", "green"))
                                                             )),
                  "Asset Portfolio Absolute Profit" = formatter("span",
                                                                style = x ~ style(
                                                                  display = "inline-block",
                                                                  direction = "rtl",
                                                                  "border-radius" = "4px",
                                                                  "padding-right" = "2px",
                                                                  "background-color" = csscolor("darkslategray"),
                                                                  width = percent(proportion(x)),
                                                                  color = csscolor(gradient(x, "red", "green"))
                                                                )),
                  "Asset Portfolio Rate of Return" = formatter("span",
                                                               style = x ~ style(
                                                                 display = "inline-block",
                                                                 direction = "rtl",
                                                                 "border-radius" = "4px",
                                                                 "padding-right" = "2px",
                                                                 "background-color" = csscolor("darkslategray"),
                                                                 width = percent(proportion(x)),
                                                                 color = csscolor(gradient(x, "red", "green"))
                                                               ))
                )
    )
  )
})

This code can easily be shortened via the integration of a custom function, as shown below.

simple_formatter <- function(){
  formatter("span",
            style = x ~ style(
              display = "inline-block",
              direction = "rtl",
              "border-radius" = "4px",
              "padding-right" = "2px",
              "background-color" = csscolor("darkslategray"),
              width = percent(proportion(x)),
              color = csscolor(gradient(x, "red", "green"))
            ))
}

react_formattable <- reactive({
  return(
    formattable(react_port_summary_table(), 
                list(
                  "Asset Portfolio Max Worth" = simple_formatter(),
                  "Asset Portfolio Latest Worth" = simple_formatter(),
                  "Asset Portfolio Absolute Profit" = simple_formatter(),
                  "Asset Portfolio Rate of Return" = simple_formatter()
                )
    )
  )
})

As can be seen, formattable allows for a great deal of added complexity in crafting your table—complexity that may not be suited for all apps. That said, if you do want to quickly draw a user’s attention to something in a table, formattable is a great solution, and most of the details of the code can be greatly simplified via a function, as shown.

Conclusions:

That was a lot—I know—but I hope that this commentary, and my exemplar of the Asset Comparison Tool more generally, has helped to inform your understanding of how dashboards can serve as a helpful analytic tool. Furthermore, I hope to have prompted some thoughts as to the best practices to be followed when building such a tool. I’ll end with a quick tl;dr:

  • Shave complexity wherever possible, and make code as simple as possible by keeping the code for the app’s UI and inner mechanism (inputs, reactives, etc.) separate from the code for the analytic functions and visualizations.
  • Build with the most extreme cases in mind (think of how your most edge-case user might use the app, and ensure that behavior won’t break the app)
  • Document, document, and then document some more. Make your README both a story and a manual.
  • Give Shiny Dashboard a shot if you want an easy-to-construct UI over which you don’t need complete control when it comes to visual design.
  • Pick your visualization packages based on what you want to prioritize for your user, not the other way around (this applies to ggplot, plotly, formattable, etc.).

Thanks for reading!


2018-04 Extreme Makeover: R Graphics Edition


(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report describes a complex R graphics customisation example using functions from the ‘grid’ and ‘gridGraphics’ packages and introduces two new functions in ‘grid’: deviceLoc and deviceDim.

Paul Murrell

Download


Z is for Z-Scores and Standardizing


(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Z is for Z-Scores and Standardizing Last April, I wrapped up the A to Z of Statistics with a post about Z-scores. It seems only fitting that I’m wrapping this April A to Z with the same topic. Z-scores are frequently used, sometimes when you don’t even realize it. When you take your child to the doctor and they say he’s at the x percentile on height, or when you take a standardized test and are told you scored in the y percentile, those values are being derived from Z-scores. You create a Z-score when you subtract the population mean from a value and divide that result by the population standard deviation.
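As a quick illustration (toy numbers, not data from this post), a value of 75 from a population with mean 70 and standard deviation 5 has a Z-score of 1:

# computing a Z-score by hand: (value - population mean) / population standard deviation
x     <- 75
mu    <- 70
sigma <- 5

z <- (x - mu) / sigma
z
## [1] 1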

Of course, we often will standardize variables in statistics, and the results are similar to Z-scores (though technically not the same if the mean and standard deviation aren’t population values). In fact, when I demonstrated the GLM function earlier this month, I skipped a very important step when conducting an analysis with interactions. I should have standardized my continuous predictors first, which means subtracting the variable mean and dividing by the variable standard deviation, creating a new variable with a mean of 0 and a standard deviation of 1 (just like the normal distribution).

Let’s revisit that GLM analysis. I was predicting verdict (guilty, not guilty) with binomial regression. I did one analysis where I used a handful of attitude items and the participant’s guilt rating, and a second analysis where I created interactions between each attitude item and the guilt rating. The purpose was to see if an attitude impacts the threshold – how high the guilt rating needed to be before a participant selected “guilty”. When working with interactions, the individual variables are highly correlated with the interaction variables based on them, so we solve that problem, and make our analysis and output a bit cleaner, by centering our variables and using those centered values to create interactions.

I’ll go ahead and load my data. Also, since I know I have some missing values, which caused an error when I tried to work with predicted values and residuals yesterday, I’ll also go ahead and identify that case/those cases.

dissertation<-read.delim("dissertation_data.txt",header=TRUE)
predictors<-c("obguilt","reasdoubt","bettertolet","libertyvorder","jurevidence","guilt")
library(psych)
## Warning: package 'psych' was built under R version 3.4.4 
describe(dissertation[predictors])
##               vars   n mean   sd median trimmed  mad min max range  skew ## obguilt          1 356 3.50 0.89      4    3.52 0.00   1   5     4 -0.50 ## reasdoubt        2 356 2.59 1.51      2    2.68 1.48  -9   5    14 -3.63 ## bettertolet      3 356 2.36 1.50      2    2.38 1.48  -9   5    14 -3.41 ## libertyvorder    4 355 2.74 1.31      3    2.77 1.48  -9   5    14 -3.89 ## jurevidence      5 356 2.54 1.63      2    2.66 1.48  -9   5    14 -3.76 ## guilt            6 356 4.80 1.21      5    4.90 1.48   2   7     5 -0.59 ##               kurtosis   se ## obguilt          -0.55 0.05 ## reasdoubt        26.92 0.08 ## bettertolet      25.47 0.08 ## libertyvorder    34.58 0.07 ## jurevidence      25.39 0.09 ## guilt            -0.54 0.06 
dissertation<-subset(dissertation,!is.na(libertyvorder))

R has a built-in function that will do this for you: scale. The scale function has 3 main arguments – the variable or variables to be scaled, and whether you want those variables centered (subtract mean) and/or scaled (divided by standard deviation). For regression with interactions, we want to both center and scale. For instance, to center and scale the guilt rating:

dissertation$guilt_c<-scale(dissertation$guilt,center=TRUE,scale=TRUE)

I have a set of predictors I want to do this to, so I want to apply a function across those specific columns:

dissertation[46:51]<-lapply(dissertation[predictors],function(x) {y<-scale(x,center=TRUE,scale=TRUE)})

Now, let’s rerun that binomial regression, this time using the centered variables in the model.

pred_int<-'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 + 
                  jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 + 
                  bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'
model2<-glm(pred_int,family="binomial",data=dissertation)
summary(model2)
##  ## Call: ## glm(formula = pred_int, family = "binomial", data = dissertation) ##  ## Deviance Residuals:  ##     Min       1Q   Median       3Q      Max   ## -2.6101  -0.5432  -0.1289   0.6422   2.2805   ##  ## Coefficients: ##                         Estimate Std. Error z value Pr(>|z|)     ## (Intercept)             -0.47994    0.16264  -2.951  0.00317 **  ## obguilt.1                0.25161    0.16158   1.557  0.11942     ## reasdoubt.1             -0.09230    0.20037  -0.461  0.64507     ## bettertolet.1           -0.22484    0.20340  -1.105  0.26899     ## libertyvorder.1          0.05825    0.21517   0.271  0.78660     ## jurevidence.1            0.07252    0.19376   0.374  0.70819     ## guilt.1                  2.31003    0.26867   8.598  < 2e-16 *** ## obguilt.1:guilt.1        0.14058    0.23411   0.600  0.54818     ## reasdoubt.1:guilt.1     -0.61724    0.29693  -2.079  0.03764 *   ## bettertolet.1:guilt.1    0.02579    0.30123   0.086  0.93178     ## libertyvorder.1:guilt.1 -0.27492    0.29355  -0.937  0.34899     ## jurevidence.1:guilt.1    0.27601    0.36181   0.763  0.44555     ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ##  ## (Dispersion parameter for binomial family taken to be 1) ##  ##     Null deviance: 490.08  on 354  degrees of freedom ## Residual deviance: 300.66  on 343  degrees of freedom ## AIC: 324.66 ##  ## Number of Fisher Scoring iterations: 6 

The results are essentially the same; the constant and slopes of the predictor variables are different, but the variables that were significant before still are. So standardizing doesn’t change the results, but it is generally recommended because it makes results easier to interpret, since the variables are centered around the mean: negative numbers are below the mean, and positive numbers are above the mean.

Hard to believe A to Z is over! Don’t worry, I’m going to keep blogging about statistics, R, and whatever strikes my fancy. I almost kept this post going by applying the prediction work from yesterday to the binomial model, but decided that would make for a fun future post. And I’ll probably sprinkle in posts in the near future on topics I didn’t have room for this month or that I promised to write a future post on. Thanks for reading and hope you keep stopping by, even though April A to Z is officially over!


Data Science For Business: Course Now Open!


(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We are pleased to announce that our Data Science For Business (#DS4B) Course (HR 201) is OFFICIALLY OPEN! This course is for intermediate to advanced data scientists looking to apply H2O and LIME to a real-world binary classification problem in an organization: Employee Attrition. If you are interested in applying data science for business in a real-world setting with advanced tools, using a client-proven system that delivers ROI to the organization, then this is the course for you. For a limited time we are offering 15% off enrollment.

What You Learn

Watch the 2-Minute DS4B Course Overview.

The course is large in scope, which is necessary to go into the level of detail needed to cover:

  • How to understand a business problem

  • How to tie data science to financial impact

  • How to code effectively for data science

  • How to use the tidyverse and program with Tidy Eval

  • How to implement data science project management frameworks

  • How to manage an analysis using an R Project

  • How to use H2O, LIME, and various other R packages that are amazingly useful

“It’s well worth the investment you or your organization are making. I designed this course after what I wished I had when I was first seeking to apply data science in business.”

Matt Dancho, Founder of Business Science

Course Launch Coupon: 15% OFF

DS4B Course Launch 15% OFF

Coupon Details:

  • To get 15% off, click the button above or visit Business Science University using promotional coupon code: DS4B_15.

  • The coupon expires next Monday (5/7/2018). Hurry, don’t miss out!

Your Instructor

I’m Matt. I’m the Founder of Business Science, and I’ll be your instructor for HR 201. Check out a 1 minute video to learn more about me and how I can help you apply DS4B.

Tools And Techniques Used

We’ll apply the following tools and techniques using a real-world problem:

  • Business Science Problem Framework for data science project management

  • Costing attrition and developing data science code workflows in R

  • H2O for automated machine learning (AutoML)

  • LIME for black-box model feature explanation

  • Correlation Analysis for pre-modeling feature importance

  • Recommendation Logic for assisting executives and managers with data-driven decision-making

  • recipes for preprocessing data for machine learning algorithms

  • GGally and skimr for exploratory data analysis

  • Tidy Eval for programming in the tidyverse

  • And more!

There’s a lot that HR 201 offers. Learn more on the Course Homepage. Be sure to take a peek at the Course Curriculum to get a glimpse of what’s inside.

H2O Highlights

We’ll use H2O to generate a high performance classifier that predicts employee attrition.

H2O Leaderboard Visualization

We’ll measure the classifier performance using visualizations that data scientists and executives need.

H2O Model Metrics

What You Need To Get Started

A basic (novice) knowledge of R, dplyr, and ggplot2 is our expectation. We’ll take care of the rest. If you are unsure, there is a proficiency quiz to check your baseline. Also, there’s a 30-day money-back guarantee if the course is too difficult or if you are not completely satisfied.

Education Assistance

Many employers offer assistance to cover the cost of continuous education. Begin discussions with your employer immediately if this is available to you and you are interested in this course. They will benefit from you taking this course.

Start Learning DS4B Now

Enrollment in BSU is open. Enroll Now to take advantage of 15% OFF. The coupon expires next Monday! Time is limited!

DS4B Course Launch 15% OFF

About Business Science

Business Science specializes in “ROI-driven data science”. We offer training, education, coding expertise, and data science consulting related to business and finance. Our latest creation is Business Science University, a Virtual Workshop that is self-paced and teaches you our data science process! In addition, we deliver about 80% of our effort into the open source data science community in the form of software and our Business Science blog. Visit Business Science on the web or contact us to learn more!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:


Interpretable Machine Learning with iml and mlr


(This article was first published on mlr-org, and kindly contributed to R-bloggers)

Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.

Fortunately, there are many methods that can make machine learning models interpretable. The R package iml provides tools for analysing any black box machine learning model:

  • Feature importance: Which were the most important features?

  • Feature effects: How does a feature influence the prediction? (Partial dependence plots and individual conditional expectation curves)

  • Explanations for single predictions: How did the feature values of a single data point affect its prediction? (LIME and Shapley value)

  • Surrogate trees: Can we approximate the underlying black box model with a short decision tree?

  • The iml package works for any classification and regression machine learning model: random forests, linear models, neural networks, xgboost, etc.

This blog post shows you how to use the iml package to analyse machine learning models. While the mlr package makes it super easy to train machine learning models, the iml package makes it easy to extract insights about the learned black box machine learning models.

If you want to learn more about the technical details of all the methods, read the Interpretable Machine Learning book.

Time for Interpretable Machine Learning

Let’s explore the iml-toolbox for interpreting an mlr machine learning model with concrete examples!

Data: Boston Housing

We’ll use the MASS::Boston dataset to demonstrate the abilities of the iml package. This dataset contains median house values from Boston neighbourhoods.

data("Boston",package="MASS")head(Boston)
#>      crim zn indus chas   nox    rm  age    dis rad tax ptratio#> 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3#> 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8#> 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8#> 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7#> 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7#> 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7#>    black lstat medv#> 1 396.90  4.98 24.0#> 2 396.90  9.14 21.6#> 3 392.83  4.03 34.7#> 4 394.63  2.94 33.4#> 5 396.90  5.33 36.2#> 6 394.12  5.21 28.7

Fitting the machine learning model

First we train a randomForest to predict the Boston median housing value:

library("mlr")data("Boston",package="MASS")# create an mlr task and modeltsk=makeRegrTask(data=Boston,target="medv")lrn=makeLearner("regr.randomForest",ntree=100)mod=train(lrn,tsk)

Using the iml Predictor container

We create a Predictor object that holds the model and the data. The iml package uses R6 classes: New objects can be created by calling Predictor$new(). Predictor works best with mlr models (WrappedModel-class), but it is also possible to use models from other packages.

library("iml")X=Boston[which(names(Boston)!="medv")]predictor=Predictor$new(mod,data=X,y=Boston$medv)

Feature importance

We can measure how important each feature was for the predictions with FeatureImp. The feature importance measure works by shuffling each feature and measuring how much the performance drops. For this regression task we choose to measure the loss in performance with the mean absolute error (‘mae’); another choice would be the mean squared error (‘mse’).

Once we created a new object of FeatureImp, the importance is automatically computed. We can call the plot() function of the object or look at the results in a data.frame.

imp=FeatureImp$new(predictor,loss="mae")
plot(imp)

(Plot: permutation feature importance produced by plot(imp))

imp$results
#>    feature original.error permutation.error importance#> 1    lstat       0.929379         4.3533565   4.684156#> 2       rm       0.929379         3.0678264   3.300942#> 3      nox       0.929379         1.6636358   1.790051#> 4      dis       0.929379         1.6288497   1.752622#> 5     crim       0.929379         1.6115494   1.734007#> 6  ptratio       0.929379         1.5988103   1.720300#> 7    indus       0.929379         1.4023210   1.508880#> 8      tax       0.929379         1.3150335   1.414959#> 9      age       0.929379         1.2712218   1.367819#> 10   black       0.929379         1.1936640   1.284367#> 11     rad       0.929379         1.0826712   1.164941#> 12    chas       0.929379         0.9753240   1.049436#> 13      zn       0.929379         0.9585688   1.031408

Partial dependence

Besides learning which features were important, we are interested in how the features influence the predicted outcome. The Partial class implements partial dependence plots and individual conditional expectation curves. Each individual line represents the predictions (y-axis) for one data point when we change one of the features (e.g. ‘lstat’ on the x-axis). The highlighted line is the point-wise average of the individual lines and equals the partial dependence plot. The marks on the x-axis indicate the distribution of the ‘lstat’ feature, showing how relevant a region is for interpretation (little or no points mean that we should not over-interpret this region).

pdp.obj=Partial$new(predictor,feature="lstat")
plot(pdp.obj)

(Plot: partial dependence and individual conditional expectation curves for 'lstat')

If we want to compute the partial dependence curves for another feature, we can simply reset the feature. Also, we can center the curves at a feature value of our choice, which makes it easier to see the trend of the curves:

pdp.obj$set.feature("rm")
pdp.obj$center(min(Boston$rm))
plot(pdp.obj)

(Plot: centered partial dependence curves for 'rm')

Surrogate model

Another way to make the models more interpretable is to replace the black box with a simpler model – a decision tree. We take the predictions of the black box model (in our case the random forest) and train a decision tree on the original features and the predicted outcome. The plot shows the terminal nodes of the fitted tree. The maxdepth parameter controls how deep the tree can grow and therefore how interpretable it is.

tree=TreeSurrogate$new(predictor,maxdepth=2)
plot(tree)

(Plot: terminal nodes of the fitted surrogate decision tree)

We can use the tree to make predictions:

head(tree$predict(Boston))
#>     .y.hat#> 1 28.43299#> 2 21.74169#> 3 28.43299#> 4 28.43299#> 5 28.43299#> 6 28.43299

Explain single predictions with a local model

A global surrogate model can improve the understanding of the global model behaviour. We can also fit a model locally to understand an individual prediction better. The local model fitted by LocalModel is a linear regression model, and the data points are weighted by how close they are to the data point for which we want to explain the prediction.

lime.explain=LocalModel$new(predictor,x.interest=X[1,])
lime.explain$results
#>               beta x.recoded    effect x.original feature#> rm       4.3190928     6.575 28.398035      6.575      rm#> ptratio -0.5285876    15.300 -8.087391       15.3 ptratio#> lstat   -0.4273493     4.980 -2.128199       4.98   lstat#>         feature.value#> rm           rm=6.575#> ptratio  ptratio=15.3#> lstat      lstat=4.98
plot(lime.explain)

(Plot: LocalModel explanation for the first data point)

Explain single predictions with game theory

An alternative for explaining individual predictions is a method from coalitional game theory named Shapley value. Assume that for one data point, the feature values play a game together, in which they get the prediction as a payout. The Shapley value tells us how to fairly distribute the payout among the feature values.

shapley=Shapley$new(predictor,x.interest=X[1,])
plot(shapley)

(Plot: Shapley values for the first data point)

We can reuse the object to explain other data points:

shapley$explain(x.interest=X[2,])
plot(shapley)

(Plot: Shapley values for the second data point)

The results in data.frame form can be extracted like this:

results=shapley$results
head(results)
#>   feature         phi      phi.var feature.value#> 1    crim -0.02168342  1.071941296  crim=0.02731#> 2      zn -0.00016250  0.006865947          zn=0#> 3   indus -0.27755494  0.492201863    indus=7.07#> 4    chas -0.01886100  0.016614559        chas=0#> 5     nox  0.33932047  0.925398396     nox=0.469#> 6      rm -1.19031582 13.544574195      rm=6.421

The iml package is available on CRAN and on Github.



March 2018: “Top 40” New Package Picks


(This article was first published on R Views, and kindly contributed to R-bloggers)

By my count, just over 200 new packages made it to CRAN and stuck during March. The trend for specialized, and sometimes downright esoteric science packages continues. I counted 40 new packages in this class. Most, but not all of these, are focused on bio-science applications. For example, the foreSIGHT package profiled below focuses on climate science. I was also pleased to see two new packages (not from RStudio) in the Data Science category, h2o4gpu and onnx, built on the reticulate package for interfacing with Python. I hope this also becomes a trend.

The following are my “Top 40” picks for March in nine categories: Computational Methods, Data, Data Science, Political Science, Science, Statistics, Time Series, Utilities and Visualizations.

Computational Methods

dynprog v0.1.0: Implements a domain-specific language for specifying and translating recursions into dynamic-programming algorithms.

fmlogcondens v1.0.2: Implements a fast solver for the maximum likelihood estimator of the family of multivariate log-concave probability functions, a family that includes well-known parametric densities such as the normal, uniform, and exponential distributions, and many more. For details, see Rathke et al. (2015). The vignette shows how to use the package.

knor v0.0-5: Provides access to knor, a NUMA-optimized, in-memory, distributed library for computing k-means.

Data

daymetr v1.3.1: Provides programmatic interface to the Daymet climate data. The vignette shows how to use it.

NOAAWeather v0.1.0: Provides functions to retrieve real-time weather data from all NOAA stations, and plot time series, boxplot, calendar heatmap, and geospatial maps to analyze trends. The vignette shows how to use the package.

ppitables v0.1.2: Contains country-specific lookup data tables used as reference to determine the poverty likelihood of a household based on their PPI score (Poverty Probability Index), with documentation from Innovations for Poverty Action.

usfertilizer v0.1.5: Provides county-level estimates of fertilizer, nitrogen and phosphorus, from 1945 to 2012 in the United States of America. There is an Introduction and a vignette on Data Sources and Processes.

Data Science

greybox v0.2.0: Implements tools for model selection and combinations via information criteria based on the values of partial correlations. The vignette provides details.

h2o4gpu v0.2.0: Implements an interface to H2O4GPU, a collection of GPU solvers for machine learning algorithms. There is a vignette.

iml v0.3.0: Provides interpretability methods to analyze the behavior and predictions of any machine learning model, including feature importance, partial dependence plots, individual conditional expectation (ICE) plots, local models, the Shapley Value, and tree surrogate models.

iTOP v1.0.1: Provides functions to infer a topology of relationships between different datasets, such as multi-omics and phenotypic data recorded on the same samples. The methodology is based on the extension of the RV coefficient, a measure of matrix correlation, to partial matrix correlations and binary data. See Aben et al. (2018) for details and the vignette for an introduction to the package.

onnx v0.0.1: Implements an interface to ONNX, the Open Neural Network Exchange, which provides an open-source format for machine-learning models.

rcqp v0.5: Implements Corpus Query Protocol functions based on the CWB software, a collection of open-source tools for managing and querying large text corpora. The vignette provides a roadmap.

Political Science

coalitions v0.6.2: Implements an MCMC method to calculate probabilities for a coalition majority based on survey results. See Bender and Bauer (2018). There are vignettes on Workflows, Pooling, and Diagnostics.

Science

diagmeta v0.2-0: Implements methods by Steinhauser et al. (2016) for meta-analysis of diagnostic accuracy studies with several cutpoints.

NetworkExtinction v0.1.0: Provides functions to simulate the extinction of species in the food web, and analyze the cascading effects as described in Dunne et al. (2002). There is a vignette.

foreSIGHT v0.9.2: Provides a tool to create hydroclimate scenarios, stress test systems, and visualize system performance in scenario-neutral climate-change impact assessments. Functions generate perturbed time series using a range of approaches, including simple scaling of observed time series (Culley et al. (2016)) and stochastic simulation of perturbed time series (Guo et al. (2018)). The vignette offers a tutorial.

PINSPlus v1.0.0: Implements PINS: Perturbation clustering for data INtegration and disease Subtyping (Nguyen et al. (2017)), a novel approach for integration of data and classification of diseases into various subtypes. There is a vignette.

Statistics

chandwich v1.0.0: Provides functions to adjust user-supplied independence loglikelihood functions using a robust sandwich estimator of the parameter covariance matrix, based on the methodology in Chandler and Bate (2007). The vignette shows how it works.

ciuupi v1.0.0: Provides functions to compute a confidence interval for a specified linear combination of regression parameters in a linear regression model with iid normal errors and known variance, when there is uncertain prior information that a distinct specified linear combination of the regression parameters takes a given value. See Kabaila and Mainzer (2017) and the vignette for details.

CoxPhLb v1.0.0: Provides functions to analyze right-censored, length-biased data using the Cox model, including model fitting and checking, and the stationarity assumption test. The model fitting and checking methods are described in Qin and Shen (2010) and Lee, Ning, and Shen (2018).

cutpointr v0.7.3: Provides functions to estimate cutpoints that optimize a specified metric in binary classification tasks and validate performance using bootstrapping. The vignette shows how to use the functions.

fcr v1.0: Provides a function for dynamic prediction in functional concurrent regression that extends the pffr() function from the refund package to handle the scenario where the functional response and concurrently measured functional predictor are irregularly measured. See Leroux et al. (2017) and the vignette.

ggdag v0.1.0: Builds on the DAGitty web tool to provide functions to tidy, analyze, and plot directed acyclic graphs (DAGs). There is an Introduction to DAGS, an Introduction to ggdag, and a vignette on Common Structures of Bias.

hdme v0.1.1: Provides a function for penalized regression for generalized linear models for measurement error problems, including the lasso (L1-penalization), which corrects for measurement error (Sorensen et al. (2015)), and an implementation of the Generalized Matrix Uncertainty Selector (Sorensen et al. (2018)). The vignette gives the details.

joineRmeta v0.1.1: Extends the joint models proposed by Henderson et al. (2000) to include multi-study, meta-analytic cases. See the vignette for details.

rare v0.1.0: Implements the alternating direction method of multipliers algorithm of Yan and Bien (2018) for fitting linear models with tree-based lasso regularization. The vignette shows how to use the package.

Time Series

rMEA v1.0.0: Provides tools to read, visualize, and export bivariate motion energy time-series. Lagged synchrony between subjects can be analyzed through windowed cross-correlation. See Ramseyer & Tschacher (2011) for an application, and the README for how to use the package.

tsfknn v0.1.0: Provides a function to forecast time series using nearest neighbors regression. See Martinez et al. (2017) and the vignette for details.

spGARCH v0.1.4: Provides functions to analyze spatial and spatiotemporal autoregressive conditional heteroscedasticity (Otto, Schmid, Garthoff (2017)), simulation of spatial ARCH-type processes, quasi-maximum-likelihood estimation of the parameters of spARCH models, spatial autoregressive models with spARCH disturbances, diagnostic checks, and visualizations.

Utilities

base2grob v0.0.2: Provides a function to convert a base plot function call (using expression or formula) to grob objects that are compatible with the grid ecosystem, so that cowplot can be used to align base plots with ggplot objects. The vignette shows how things work.

cranly v0.1: Provides functions to clean, organize, summarize, and visualize CRAN package database information, and also for building package directives networks (depends, imports, suggests, enhances) and collaboration networks. The vignette shows how to use the package.

osrmr v0.1.28: Implements a wrapper around the Open Source Routing Machine (OSRM) API. See the vignette for details.

fasterize v1.0.0: Provides a fast, drop-in replacement for rasterize() from the raster package that takes sf-type objects and uses the scan line algorithm attributed to Wylie et al. (1967) (doi:10.1145/1465611.1465619). There is a vignette.

jsr223 v0.3.1: Provides a high-level integration that makes Java objects easy to use from within R, and a unified interface for integrating R with several programming languages, including Groovy, JavaScript, JRuby (Ruby), Jython (Python), and Kotlin. See the manual for details.

Visualization

clustree v0.1.2: Provides functions to produce clustering tree visualizations for interrogating clusterings as resolution increases. See the vignette for details.

datamaps v0.0.2: Enables users to create interactive choropleth maps with bubbles and arcs by coordinates or region name that can be used directly from the console, from RStudio, in Shiny apps, and in R Markdown documents. The vignette will help you get started.

funnelR v0.1.0: Provides functions for creating funnel plots for proportion data, and supports user-defined benchmarks, confidence limits, and estimation methods (e.g., exact or approximate) based on Spiegelhalter (2005). See the Introduction to get started.

nVennR v0.2.0: Provides an interface for the nVenn algorithm of Perez-Silva et al. (2018). See the vignette for an introduction to the package, and the R package UpSetR for help interpreting the results.

smovie v1.0.1: Uses the rpanel package to create interactive movies to help students understand statistical concepts. There are movies to: visualize probability distributions (including user-supplied ones); illustrate sampling distributions of the sample mean (central limit theorem); the sample maximum (extremal types theorem); and more. See the vignette for an overview.



Comparing dependencies of popular machine learning packages with `pkgnet`


(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

When looking through the CRAN list of packages, I stumbled upon this little gem:

pkgnet is an R library designed for the analysis of R libraries! The goal of the package is to build a graph representation of a package and its dependencies.

And I thought it would be fun to play around with it. The little analysis I ended up doing was to compare dependencies of popular machine learning packages.


  • I first loaded the packages:
library(pkgnet)
library(tidygraph)
## ## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':## ##     filter
library(ggraph)
## Loading required package: ggplot2
  • I then created a function that will
  1. create the package report with pkgnet::CreatePackageReport
  2. convert the edge (report$DependencyReporter$edges) and node (report$DependencyReporter$nodes) data into a graph object with tidygraph::as_tbl_graph
create_pkg_graph <- function(package_name, DependencyReporter = TRUE) {
  report <- CreatePackageReport(pkg_name = package_name)
  if (DependencyReporter) {
    graph <- as_tbl_graph(report$DependencyReporter$edges,
                          directed = TRUE,
                          nodes = as.data.frame(report$DependencyReporter$nodes))
  } else {
    graph <- as_tbl_graph(report$FunctionReporter$edges,
                          directed = TRUE,
                          nodes = as.data.frame(report$FunctionReporter$nodes))
  }
  return(graph)
}
pkg_list <- c("caret", "h2o", "e1071", "mlr")

Note: I wanted to include other packages, like tensorflow, randomForest, gbm, etc., but for those, pkgnet threw an error:

Error in data.table::data.table(node = names(igraph::V(self$pkg_graph)), : column or argument 1 is NULL

  • Next, I ran them through my function from before and assigned them each a unique name.
for (pkg in pkg_list) {
  graph <- create_pkg_graph(pkg)
  assign(paste0("graph_", pkg), graph)
}
  • These individual objects I combined with tidygraph and calculated node centrality as the number of outgoing edges.
graph <- graph_caret %>%   graph_join(graph_h2o, by = "name") %>%  graph_join(graph_e1071, by = "name") %>%  graph_join(graph_mlr, by = "name") %>%  mutate(color = ifelse(name %in% pkg_list, "a", "b"),         centrality = centrality_degree(mode = "out"))
  • Finally, I plotted the dependency network with ggraph:

The bigger the node labels (package names), the higher their centrality. Seems like the more basic utilitarian packages have the highest centrality (not really a surprise…).

graph %>%  ggraph(layout = 'nicely') +     geom_edge_link(arrow = arrow()) +     geom_node_point() +    geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) +    theme_graph() +    scale_fill_brewer(palette = "Set1")

  • Because the complete network is a bit hard to make sense of, I plotted it again with only the packages I wanted to analyze plus dependencies that had at least 1 outgoing edge; now it is easier to see shared dependencies.

For example, methods and stats are dependencies of caret, mlr and e1071 but not h2o, while utils is a dependency of all four.

graph %>%  filter(centrality > 1 | color == "a") %>%  ggraph(layout = 'nicely') +     geom_edge_link(arrow = arrow()) +     geom_node_point() +    geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) +    theme_graph() +    scale_fill_brewer(palette = "Set1")

It would of course be interesting to analyse a bigger network with more packages. Maybe someone knows how to get these other packages to work with pkgnet?

sessionInfo()
## R version 3.5.0 (2018-04-23)## Platform: x86_64-apple-darwin15.6.0 (64-bit)## Running under: macOS High Sierra 10.13.4## ## Matrix products: default## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib## ## locale:## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8## ## attached base packages:## [1] stats     graphics  grDevices utils     datasets  methods   base     ## ## other attached packages:## [1] bindrcpp_0.2.2  ggraph_1.0.1    ggplot2_2.2.1   tidygraph_1.1.0## [5] pkgnet_0.2.0   ## ## loaded via a namespace (and not attached):##  [1] Rcpp_0.12.16         RColorBrewer_1.1-2   plyr_1.8.4          ##  [4] compiler_3.5.0       pillar_1.2.2         formatR_1.5         ##  [7] futile.logger_1.4.3  bindr_0.1.1          viridis_0.5.1       ## [10] futile.options_1.0.1 tools_3.5.0          digest_0.6.15       ## [13] viridisLite_0.3.0    gtable_0.2.0         jsonlite_1.5        ## [16] evaluate_0.10.1      tibble_1.4.2         pkgconfig_2.0.1     ## [19] rlang_0.2.0          igraph_1.2.1         ggrepel_0.7.0       ## [22] yaml_2.1.18          blogdown_0.6         xfun_0.1            ## [25] gridExtra_2.3        stringr_1.3.0        dplyr_0.7.4         ## [28] knitr_1.20           htmlwidgets_1.2      grid_3.5.0          ## [31] rprojroot_1.3-2      glue_1.2.0           data.table_1.10.4-3 ## [34] R6_2.2.2             rmarkdown_1.9        bookdown_0.7        ## [37] udunits2_0.13        tweenr_0.1.5         tidyr_0.8.0         ## [40] purrr_0.2.4          lambda.r_1.2.2       magrittr_1.5        ## [43] units_0.5-1          MASS_7.3-49          scales_0.5.0        ## [46] backports_1.1.2      mvbutils_2.7.4.1     htmltools_0.3.6     ## [49] assertthat_0.2.0     ggforce_0.1.1        colorspace_1.3-2    ## [52] labeling_0.3         stringi_1.1.7        visNetwork_2.0.3    ## [55] lazyeval_0.2.1       munsell_0.4.3

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.


Make a sculpture in LEGO from a photo, with R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The entrance to our office in Redmond is adorned with this sculpture of our department logo, rendered in LEGO:

We had fun with LEGO bricks at work this week. APEX is our internal team name, this was fun. Oh and we're hiring for all roles in Azure! pic.twitter.com/VlNNaTexA5

— Jeff Sandquist (@jeffsand) March 30, 2017

Our team, the Cloud Developer Advocates, has a logo as well, created by the multitalented Ashley Macnamara. (The mascot's name is Bit: he's a raccoon because, like developers, he's into everything.) It would be nice to have a LEGO rendition of Bit for the wall as well, but converting an image into LEGO bricks isn't easy … until now.

This R script by Ryan Timpe provides everything you need to render an image in LEGO. It will downscale the image to a size that meets your brick budget, convert the colors to those available as LEGO bricks, and divide the image up into LEGO-sized pieces, ready to lay out on a flat tray. The script is super easy to use: just source a file of utility functions and then:
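The workflow is essentially a short pipeline. The helper names below are from my recollection of Ryan's script, so treat them as placeholders rather than the definitive API:

library(png)
library(tidyverse)
source("0_Functions.R")           # Ryan's utility functions; the file name may differ

mosaic <- readPNG("bit.png") %>%  # or jpeg::readJPEG() for JPG images
  scale_image(48) %>%             # downscale to a 48x48 brick budget
  legoize() %>%                   # snap colors to available LEGO brick colors
  collect_bricks()                # split the image into LEGO-sized pieces

mosaic %>% display_set()          # draw the final mosaic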

(You can also use readJPEG to read in JPG images; I just loaded in the png package and used readPNG which works just as well.) Here's what the output looks like. (Click to see the original, for comparison.)

Bit-azure-64

The script also provides a shopping list of the bricks you need by color and size: this particular project will require 1842 LEGO bricks in 19 different colors to create the 48×48 image. It will even provide a series of step-by-step instructions showing how the project will look in various stages of completion:

Bit-instr-64

The R script is available on GitHub, here, and works with any recent version of R and an up-to-date tidyverse installation. (I used R 3.5.0.) You can find a complete walkthrough of using the scripts in the blog post at the link below.

Ryan Timpe: How To: LEGO mosaics from photos using R & the tidyverse


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Microsoft R Open 3.4.4 now available


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

An update to Microsoft R Open (MRO) is now available for download on Windows, Mac and Linux. This release upgrades the R language engine to version 3.4.4, which addresses minor issues with timezone detection and a few edge cases in statistics functions. As a maintenance release, it's backwards-compatible with scripts and packages from the prior release of MRO.

MRO 3.4.4 points to a fixed CRAN snapshot taken on April 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
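For example, here is a minimal sketch of pinning a session to this release's snapshot date (or any other date) with checkpoint:

library(checkpoint)
checkpoint("2018-04-01")   # the CRAN snapshot date that MRO 3.4.4 points to
# use an earlier date for reproducibility, or a later one to pick up newer packages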

Looking ahead, the next update based on R 3.5.0 has started the build and test process. Microsoft R Open 3.5.0 is scheduled for release on May 31.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


17 Jobs for R users from around the world (2018-04-30)


To post your R job on the next post

Just visit  this link and post a new R job to the R community.

You can post a job for  free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers:  please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Freelance
    Freelance Project: R Libraries Install on Amazon AWS EC2WS
    Anywhere
    27 Apr2018
  2. Full-Time
    Associate, Data Analytics – National Network for Safe Communities (NNSC) – Posted by nnsc
    New York New York, United States
    18 Apr2018
  3. Full-Time
    Senior Research AssociateKellogg Research Support – Posted by lorbos
    Evanston Illinois, United States
    17 Apr2018
  4. Full-Time
    Empirical Research FellowKellogg School of Management (Northwestern University)– Posted by lorbos
    Evanston Illinois, United States
    17 Apr2018
  5. Part-Time
    Data Scientist Alchemy – Posted by Andreas Voniatis
    Anywhere
    7 Apr2018
  6. Full-Time
    Solutions EngineerRStudio – Posted by nwstephens
    Anywhere
    3 Apr2018

All New R Jobs

  1. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S.AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr2018
  2. Full-Time
    Data Scientist R @ Medellín, Antioquia, ColombiaIDATA S.A.S – Posted by vmhoyos
    Medellín Antioquia, Colombia
    27 Apr2018
  3. Full-Time
    Data Scientist @ Annapolis, Maryland, United StatesThe National Socio-Environmental Synthesis Center – Posted by sesync
    Annapolis Maryland, United States
    27 Apr2018
  4. Freelance
    Freelance Project: R Libraries Install on Amazon AWS EC2WS
    Anywhere
    27 Apr2018
  5. Full-Time
    Associate, Data Analytics – National Network for Safe Communities (NNSC) – Posted by nnsc
    New York New York, United States
    18 Apr2018
  6. Full-Time
    Senior Research AssociateKellogg Research Support – Posted by lorbos
    Evanston Illinois, United States
    17 Apr2018
  7. Full-Time
    Empirical Research FellowKellogg School of Management (Northwestern University)– Posted by lorbos
    Evanston Illinois, United States
    17 Apr2018
  8. Full-Time
    Discrete Choice Modeler @ Chicago, Illinois, U.S.Resource Systems Group – Posted by patricia.holland@rsginc.com
    Chicago Illinois, United States
    13 Apr2018
  9. Full-Time
    R&D Database Developer @ Toronto, CanadaCrescendo Technology Ltd – Posted by Crescendo
    Toronto Ontario, Canada
    12 Apr2018
  10. Full-Time
    Applied Statistics Position @ Florida U.S.New College of Florida – Posted by Jobs@New College
    Sarasota Florida, United States
    9 Apr2018
  11. Full-Time
    PhD Fellowship @ Wallace University, US Ubydul Haque – Posted by Ubydul
    Khon Kaen Chang Wat Khon Kaen, Thailand
    9 Apr2018
  12. Part-Time
    Data Scientist Alchemy – Posted by Andreas Voniatis
    Anywhere
    7 Apr2018
  13. Full-Time
    Product Owner (Data Science) (m/f)Civey GmbH – Posted by Civey
    Berlin Berlin, Germany
    6 Apr2018
  14. Full-Time
    Solutions EngineerRStudio – Posted by nwstephens
    Anywhere
    3 Apr2018
  15. Full-Time
    Post-Doctoral Fellow – Data Scientist @ CIMMYT (eastern Africa)International Maize and Wheat Improvement Center – Posted by cimmyt-jobs
    Marsabit County Kenya
    27 Mar2018
  16. Freelance
    Statistician / R Developer – for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    27 Mar2018
  17. Full-Time
    Graduate Data ScientistJamieAi – Posted by JamieAi
    Anywhere
    27 Mar2018

In  R-users.com you can see  all the R jobs that are currently available.

R-users Resumes

R-users also has a  resume section which features CVs from over 300 R users. You can  submit your resume (as a “job seeker”) or  browse the resumes for free.

(you may also look at  previous R jobs posts ).

r_jobs


RcppArmadillo 0.8.500.0


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

armadillo image

RcppArmadillo release 0.8.500.0, originally prepared and uploaded on April 21, has hit CRAN today (after having already been available via the RcppCore drat repo). A corresponding Debian release will be prepared as well. This RcppArmadillo release contains Armadillo release 8.500.0 with a number of rather nice changes (see below for details), and continues our normal bi-monthly CRAN release cycle.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming for a good balance between speed and ease of use, with a syntax deliberately close to that of Matlab. RcppArmadillo integrates this library with the R environment and language, and is currently used by 472 other packages on CRAN.
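As a quick illustration of that integration (my own toy example, not from the release notes), the roots() function added in this Armadillo release can be called from R with a few lines of inline C++ via Rcpp:

library(Rcpp)

cppFunction(depends = "RcppArmadillo", '
arma::cx_vec poly_roots(const arma::vec& coefs) {
  // roots() is new in Armadillo 8.500: roots of the polynomial
  // coefs[0]*x^n + coefs[1]*x^(n-1) + ... + coefs[n]
  return arma::roots(coefs);
}')

poly_roots(c(1, 0, -1))   # roots of x^2 - 1, i.e. 1 and -1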

A high-level summary of changes follows.

Changes in RcppArmadillo version 0.8.500.0 (2018-04-21)

  • Upgraded to Armadillo release 8.500 (Caffeine Raider)

    • faster handling of sparse matrices by kron() and repmat()

    • faster transpose of sparse matrices

    • faster element access in sparse matrices

    • faster row iterators for sparse matrices

    • faster handling of compound expressions by trace()

    • more efficient handling of aliasing in submatrix views

    • expanded normalise() to handle sparse matrices

    • expanded .transform() and .for_each() to handle sparse matrices

    • added reverse() for reversing order of elements

    • added repelem() for replicating elements

    • added roots() for finding the roots of a polynomial

  • Fewer LAPACK compile-time guards are used, new unit tests for underlying features have been added (Keith O’Hara in #211 addressing #207).

  • The configure check for LAPACK features has been updated accordingly (Keith O’Hara in #214 addressing #213).

  • The compile-time check for g++ is now more robust to minimal shell versions (#217 addressing #216).

  • Compiler tests were added for macOS (Keith O’Hara in #219).

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


How to do Repeated Measures ANOVAs in R


(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

Don’t do it

Ha! Got ya! Trying to run some old school ANOVAs hum? I’ll show you even better!

There is now a tremendous amount of evidence showing the inadequacy of ANOVAs as a statistical procedure (Camilli, 1987; Levy, 1978; Vasey, 1987; Chang, 2009). Instead, many papers suggest moving toward the mixed-modelling framework (Kristensen, 2004; Jaeger, 2008), which has been shown to be more flexible, accurate, powerful, and better suited to psychological data.

Using this framework, we will see how we can very simply answer our questions with R and the psycho package.

The Emotion Dataset

Let’s take the example dataset included in the psycho package.

library(psycho)
library(tidyverse)

df <- psycho::emotion %>%
  select(Participant_ID, Participant_Sex, Emotion_Condition, Subjective_Valence, Recall)

summary(df)
 Participant_ID Participant_Sex Emotion_Condition Subjective_Valence 10S    : 48    Female:720      Negative:456      Min.   :-100.000   11S    : 48    Male  :192      Neutral :456      1st Qu.: -65.104   12S    : 48                                      Median :  -2.604   13S    : 48                                      Mean   : -18.900   14S    : 48                                      3rd Qu.:   7.000   15S    : 48                                      Max.   : 100.000   (Other):624                                                           Recall        Mode :logical   FALSE:600       TRUE :312      

Our dataframe (called df) contains data from several participants, exposed to neutral and negative pictures (the Emotion_Condition column). Each row corresponds to a single trial. As there were 48 trials per participant, there are 48 rows per participant. During each trial, the participant rated the emotional valence (Subjective_Valence: positive – negative) experienced during the picture presentation. Moreover, 20 min after this emotional rating task, the participant was asked to freely recall all the pictures they remembered.

Our dataframe contains, for each trial, 5 variables: the name of the participant (Participant_ID), their sex (Participant_Sex), the emotion condition (Emotion_Condition), the valence rating (Subjective_Valence), and whether the participant recalled the picture (Recall).

The effect of Emotion

Does the emotion condition modulate the subjective valence? How to answer?

With a repeated measures ANOVA, of course!

Let’s run it:

summary(aov(Subjective_Valence~Emotion_Condition+Error(Participant_ID/Emotion_Condition),data=df))
Error: Participant_ID          Df Sum Sq Mean Sq F value Pr(>F)Residuals 18 115474    6415               Error: Participant_ID:Emotion_Condition                  Df  Sum Sq Mean Sq F value   Pr(>F)    Emotion_Condition  1 1278417 1278417   245.9 6.11e-12 ***Residuals         18   93573    5198                     ---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Error: Within           Df Sum Sq Mean Sq F value Pr(>F)Residuals 874 935646    1070               

Wow, we found that there is a significant effect of the emotional condition on valence ratings. We might have Science material here.

As you know, an ANOVA is pretty much a condensed linear model where the predictors are factors. Therefore, we can run an ANOVA on a linear mixed model (which includes the “error” term, or random effect).

library(lmerTest)

fit <- lmer(Subjective_Valence ~ Emotion_Condition + (1|Participant_ID), data = df)
anova(fit)
Type III Analysis of Variance Table with Satterthwaite's method                   Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    Emotion_Condition 1278417 1278417     1   892    1108 < 2.2e-16 ***---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the results are, for the important bits (the sum of squares, mean square and p value), very close to those of the traditional approach.

Note that the psycho package, through the analyze function, also lets you display an interpretation of the underlying model itself with the following:

results <- analyze(fit)
print(results)
The overall model predicting Subjective_Valence (formula = Subjective_Valence ~ Emotion_Condition + (1 | Participant_ID)) successfully converged and explained 56.73% of the variance of the endogen (the conditional R2). The variance explained by the fixed effects was of 52.62% (the marginal R2) and the one explained by the random effects of 4.11%. The model's intercept is at -56.34 (SE = 2.88, 95% CI [-62.07, -50.61]). Within this model:   - The effect of Emotion_ConditionNeutral is  significant (beta = 74.88, SE = 2.25, 95% CI [70.47, 79.29], t(892.00) = 33.29, p < .001***) and can be considered as medium (std. beta = 0.73, std. SE = 0.022).

Post-hoc / Contrast Analysis

Then, we would like to see how the levels differ. To do this, we have to run a “contrast” analysis, comparing the estimated means of each level.

# We have to provide the model (here called fit) and the factors we want to contrast
results <- get_contrasts(fit, "Emotion_Condition")
print(results$contrasts)
Contrast             Difference     SE    df   t.ratio   p.value
Negative – Neutral       -74.88   2.25   892    -33.29         0

It appears that the negative condition yields a significantly lower valence (i.e., more negative) than the neutral condition (a difference of -74.88 points). At this point, we usually also want to know the means of each condition. However, we often do this by directly computing the means and SDs of our observed data. That is not the cleanest way, as our data might be unbalanced or biased.

The best way to do it is to estimate means based on the fitted model (marginal means). Those were automatically computed when running the get_contrasts function. We just have to extract them.
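Assuming the estimated means are stored alongside the contrasts, as they are above, that is simply:

print(results$means)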

Emotion_Condition     Mean     SE      df   CI_lower   CI_higher
Negative            -56.34   2.88   25.04     -62.27      -50.41
Neutral              18.54   2.88   25.04      12.61       24.47

Finally, we can plot these means:

library(ggplot2)

ggplot(results$means, aes(x = Emotion_Condition, y = Mean, group = 1)) +
  geom_line() +
  geom_pointrange(aes(ymin = CI_lower, ymax = CI_higher)) +
  ylab("Subjective Valence") +
  xlab("Emotion Condition") +
  theme_bw()

Interaction

Let’s repeat the previous steps, this time adding the participant’s sex as a predictor.

fit <- lmer(Subjective_Valence ~ Emotion_Condition * Participant_Sex + (1|Participant_ID),
            data = emotion)
anova(fit)
Type III Analysis of Variance Table with Satterthwaite's method                                  Sum Sq Mean Sq NumDF DenDF  F valueEmotion_Condition                 703963  703963     1   891 621.8068Participant_Sex                      520     520     1    17   0.4593Emotion_Condition:Participant_Sex  20496   20496     1   891  18.1041                                     Pr(>F)    Emotion_Condition                 < 2.2e-16 ***Participant_Sex                      0.5071    Emotion_Condition:Participant_Sex 2.313e-05 ***---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that there is a significant main effect of the emotion condition, as well as an interaction with the participants’ sex. Let’s plot the estimated means.

results <- get_contrasts(fit, "Emotion_Condition * Participant_Sex")
print(results$means)
Emotion_Condition   Participant_Sex     Mean     SE     df   CI_lower   CI_higher
Negative            Female            -59.73   3.28   23.3     -66.51      -52.95
Neutral             Female             20.05   3.28   23.3      13.27       26.83
Negative            Male              -43.63   6.35   23.3     -56.76      -30.50
Neutral             Male               12.89   6.35   23.3      -0.24       26.02
ggplot(results$means, aes(x = Emotion_Condition, y = Mean,
                          color = Participant_Sex, group = Participant_Sex)) +
  geom_line(position = position_dodge(.3)) +
  geom_pointrange(aes(ymin = CI_lower, ymax = CI_higher), position = position_dodge(.3)) +
  ylab("Subjective Valence") +
  xlab("Emotion Condition") +
  theme_bw()

Let’s investigate the contrasts:

print(results$contrasts)
Contrast                            Difference     SE      df   t.ratio   p.value
Negative,Female – Neutral,Female        -79.78   2.51   891.0    -31.81      0.00
Negative,Female – Negative,Male         -16.10   7.15    23.3     -2.25      0.14
Negative,Female – Neutral,Male          -72.62   7.15    23.3    -10.16      0.00
Neutral,Female – Negative,Male           63.67   7.15    23.3      8.91      0.00
Neutral,Female – Neutral,Male             7.15   7.15    23.3      1.00      0.75
Negative,Male – Neutral,Male            -56.52   4.86   891.0    -11.64      0.00

It appears that the differences between men and women are not significant. However, by default, get_contrasts uses the Tukey method for p value adjustment. We can, with an exploratory mindset, turn off the p value correction (or choose other methods such as bonferroni, fdr, and so on).

results <- get_contrasts(fit, "Emotion_Condition * Participant_Sex", adjust = "none")
print(results$contrasts)
Contrast                            Difference     SE      df   t.ratio   p.value
Negative,Female – Neutral,Female        -79.78   2.51   891.0    -31.81      0.00
Negative,Female – Negative,Male         -16.10   7.15    23.3     -2.25      0.03
Negative,Female – Neutral,Male          -72.62   7.15    23.3    -10.16      0.00
Neutral,Female – Negative,Male           63.67   7.15    23.3      8.91      0.00
Neutral,Female – Neutral,Male             7.15   7.15    23.3      1.00      0.33
Negative,Male – Neutral,Male            -56.52   4.86   891.0    -11.64      0.00

Without correcting for multiple comparisons, we observe that men rate the negative pictures as significantly less negative than women.

Note

This analysis is even simpler in the Bayesian framework. See this tutorial.

Credits

This package helped you? Don’t forget to cite the various packages you used 🙂

You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.



Get your tracks from the Strava API and plot them on Leaflet maps


(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Here is some updated R code from my previous post. It doesn’t throw any warnings when importing tracks with and without heart rate information. Also, it is easier to distinguish types of tracks now (e.g., when you want to plot runs and rides separately). Another thing I changed: You get very basic information on the track when you click on it (currently the name of the track and the total length).

Have fun and leave a comment if you have any questions.

options(stringsAsFactors = F)

rm(list=ls())

library(httr)
library(rjson)
library(leaflet)
library(dplyr)

token <- ""

# Functions —————————————————————

get.coord.df.from.stream <- function (stream.obj) {   data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),              lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]])) }

get.stream.from.activity <- function (act.id, token) {
  stream <- GET("https://www.strava.com/",
                path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                query = list(access_token = token))
  content(stream)
}

get.activities2 <- function (token) {
  activities <- GET("https://www.strava.com/", path = "api/v3/activities",
                    query = list(access_token = token, per_page = 200))
  activities <- content(activities, "text")
  activities <- fromJSON(activities)
  res.df <- data.frame()
  for (a in activities) {
    values <- sapply(c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                       "type", "id", "start_date_local",
                       "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                       "elev_low", "average_heartrate", "max_heartrate"), FUN = function (x) {
                         if (is.null(a[[x]])) { NA } else { a[[x]] }
                       })
    res.df <- rbind(res.df, values)
  }
  names(res.df) <- c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                     "type", "id", "start_date_local",
                     "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                     "elev_low", "average_heartrate", "max_heartrate")
  res.df
}

get.multiple.streams <- function (act.ids, token) {
  res.list <- list()
  for (act.id.i in 1:length(act.ids)) {
    if (act.id.i %% 5 == 0) cat("Activity no.", act.id.i, "of", length(act.ids), "\n")
    stream <- get.stream.from.activity(act.ids[act.id.i], token)
    coord.df <- get.coord.df.from.stream(stream)
    res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                             coords = coord.df)
  }
  res.list
}

activities <- get.activities2(token)

stream.list <- get.multiple.streams(activities$id, token)

# Leaflet -------------------------------------------------------------

lons.range <- c(9.156572, 9.237580)
lats.range <- c(48.74085, 48.82079)

map <- leaflet() %>%
  addProviderTiles("OpenMapSurfer.Grayscale", # nice: CartoDB.Positron, OpenMapSurfer.Grayscale, CartoDB.DarkMatterNoLabels
                   options = providerTileOptions(noWrap = T)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range), lng2 = max(lons.range), lat2 = min(lats.range))

add.run <- function (act.id, color, act.name, act.dist, strlist = stream.list) {
  act.ind <- sapply(stream.list, USE.NAMES = F, FUN = function (x) {
    x$act.id == act.id
  })
  act.from.list <- strlist[act.ind][[1]]
  map <<- addPolylines(map, lng = act.from.list$coords$lon,
                       lat = act.from.list$coords$lat,
                       color = color, opacity = 1/3, weight = 2,
                       popup = paste0(act.name, ", ", round(as.numeric(act.dist) / 1000, 2), " km"))
}

# plot all
for (i in 1:nrow(activities)) {
  add.run(activities[i, "id"],
          ifelse(activities[i, "type"] == "Run", "red",
                 ifelse(activities[i, "type"] == "Ride", "blue", "black")),
          activities[i, "name"], activities[i, "distance"])
}

map


To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.


Simulating animal movements and habitat use


(This article was first published on long time ago..., and kindly contributed to R-bloggers)

 Hi there!

 I have been trying out some ways to simulate animal (or other organism) movements that take habitat suitability into account. To do this, I used my previous eWalk model as the underlying process to simulate random or directional walks. This model is based on the Brownian / Ornstein–Uhlenbeck process. You can find more about the eWalk model here!
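At its core, each eWalk step is a simple update: the new coordinate is the old one plus random diffusion plus a pull toward an optimum. A minimal sketch (my paraphrase of the step used inside the go() function further below):

# One Ornstein-Uhlenbeck-style step along one axis:
# diffusion (sigma) plus attraction (alpha) towards the optimum theta
ou_step <- function(x, sigma, alpha, theta) {
  x + sigma * rnorm(1) + alpha * (theta - x)
}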

 Today, I will add one more element to these movement simulations. In this case, we will take into account the habitat or environmental preferences of the simulated species, to perform a simulation like this:

 First, we will create a raster layer as a random environmental variable, for example tree cover.

library(raster)
library(dismo)

tc <- raster(nrows = 100, ncols = 100, xmn = 0, xmx = 100, ymn = 0, ymx = 100)
tc[] <- runif(10000, -80, 180)
tc <- focal(tc, w = matrix(1, 5, 5), mean)
tc <- focal(tc, w = matrix(1, 5, 5), mean)
plot(tc)

 Second, we will define the species class. A species will be defined by its position (coordinates) and its optimum for the environmental variable.

     species <- setClass("species", slots=c(x="numeric", y="numeric", opt="numeric"))    

 Here we will define the red deer as a specimen at the coordinates (50,50) with an optimum of 90 for the environmental variable (tree cover). In the same way, we will define the Egyptian mongoose as a specimen at the coordinates (50,50) with an optimum of 30 for the tree cover variable.

Red_deer <- species(x = 50, y = 50, opt = 90)
Egyptian_mongoose <- species(x = 50, y = 50, opt = 30)

 Now, we will load the “go” function (I do not have a name for it yet). It requires a species (sp), a raster layer with an environmental variable (env), the number of iterations (n), a Brownian motion parameter sigma (that is, how random the movement of your species is), a geographical optimum (the desired destination of your species, theta_x and theta_y), and the attraction strength or “interest” of the species in reaching that position (alpha_x and alpha_y). The syntax should be something like this:

     path <- go (sp, env, n, sigma, theta_x, alpha_x, theta_y, alpha_y)       

 Here is the function to load (I will comment the function in a future post):

go <- function (sp, env, n, sigma, theta_x, alpha_x, theta_y, alpha_y) {
  track <- data.frame()
  track[1,1] <- sp@x
  track[1,2] <- sp@y
  for (step in 2:n) {
    # cells adjacent to the current position
    neig <- adjacent(env,
                     cellFromXY(env, matrix(c(track[step-1,1],
                                              track[step-1,2]), 1, 2)),
                     directions = 8, pairs = FALSE)
    # difference between the species' optimum and each neighbouring cell's value
    options <- data.frame()
    for (i in 1:length(neig)) {
      options[i,1] <- neig[i]
      options[i,2] <- sp@opt - env[neig[i]]
    }
    # keep the neighbour(s) closest to the optimum and pick one at random
    option <- c(options[abs(na.omit(options$V2)) == min(abs(na.omit(options$V2))), 1],
                options[abs(na.omit(options$V2)) == min(abs(na.omit(options$V2))), 1])
    new_cell <- sample(option, 1)
    new_coords <- xyFromCell(env, new_cell)
    lon_candidate <- -9999
    lat_candidate <- -9999
    # Ornstein-Uhlenbeck step: diffusion plus attraction towards (theta_x, theta_y),
    # repeated until the candidate falls on a non-NA cell of the raster
    while (is.na(extract(env, matrix(c(lon_candidate, lat_candidate), 1, 2)))) {
      lon_candidate <- new_coords[1] + (sigma * rnorm(1)) + (alpha_x * (theta_x - new_coords[1]))
      lat_candidate <- new_coords[2] + (sigma * rnorm(1)) + (alpha_y * (theta_y - new_coords[2]))
    }
    track[step,1] <- lon_candidate
    track[step,2] <- lat_candidate
  }
  return(track)
}

 Well, now we can perform a simple experiment with our two specimens. We will simulate the movement of these two species taking into account their environmental optima. The “go” function returns the track, that is, the path followed by each specimen (its coordinates at each step).

deer_simul <- go(Red_deer, tc, 100, 2, 90, 0, 90, 0)
mongoose_simul <- go(Egyptian_mongoose, tc, 100, 2, 90, 0, 90, 0)

 We can plot the paths…

plot(tc)
lines(deer_simul, lwd = 1.5, col = "red")
points(deer_simul, cex = 0.3, col = "red")
lines(mongoose_simul, lwd = 1.5, col = "blue")
points(mongoose_simul, cex = 0.3, col = "blue")
legend("topleft", legend = c("deer", "mongoose"), col = c("red", "blue"),
       lty = c(1, 1), lwd = c(2, 2))

 To test whether each species is actually “searching for” its environmental optimum, we can extract the environmental values at each step for each species and plot their density distributions.

plot(density(extract(tc, deer_simul)), lwd = 3, col = "red", xlim = c(20, 80),
     ylim = c(0, max(c(density(extract(tc, deer_simul))$y,
                       density(extract(tc, mongoose_simul))$y))),
     main = "locations density distribution", xlab = "tree cover")
lines(density(extract(tc, mongoose_simul)), lwd = 3, col = "blue")
legend("topleft", legend = c("deer", "mongoose"), col = c("red", "blue"),
       lty = c(1, 1), lwd = c(3, 3))

 So, we can see that the deer is actually using patches with higher tree cover than the mongoose… our simulation worked! You can use the code in this post to produce a GIF like the one above.

 That’s all I have to say about this for now… In the next posts we will simulate more animal movements and migrations!

PS: Let me show you a nice song about the mongoose


To leave a comment for the author, please follow the link and comment on their blog: long time ago....


Moving to blogdown


(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages. The new blog, or rather, the continuation of this blog, will be at webbedfeet.github.io/posts, and it went live today.

I’ll be cross-posting here for a while, at least until Tal gets my new blog address included in R-Bloggers. I’m enjoying the RMarkdown blogging experience now, which is quite nice, and any code or analyses I want to include aren’t “lost in translation” as they were on WordPress. Since I live in R most of my days, it is also allowing a rather free flow of ideas onto the virtual page.

Hope you’ll come visit 🙂

 


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.


How efficient are multifactorial experiments?


(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I recently described why we might want to conduct a multi-factorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test the impact of multiple interventions simultaneously using roughly the overall sample size that would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome.

In all of the examples that follow, I am assuming we are in an exploratory phase of research, so our alpha levels are relaxed a bit to \(\alpha = 0.10\). In addition, we make no adjustments for multiple testing. This might be justifiable, since we are not as concerned about making a Type 1 error (concluding an effect is real when there isn’t actually one). Because this is a screening exercise, the selected interventions will be re-evaluated. At the same time, we are setting desired power to be 90%. This way, if an effect really exists, we are more likely to select it for further review.

Two scenarios with a continuous outcome

To start, I have created two sets of underlying assumptions. In the first, the effects of the four interventions (labeled fac1, fac2, fac3, and fac4) are additive. (The factor variables are parameterized using effect-style notation, where the value -1 represents no intervention and 1 represents the intervention.) So, with no interventions the outcome is 0, and each successive intervention adds 0.8 to the observed outcome (on average), so that individuals exposed to all four factors will have an average outcome \(4 \times 0.8 = 3.2\).

cNoX <- defReadCond("DataMF/FacSumContNoX.csv")
cNoX
##                            condition formula variance   dist     link## 1: (fac1 + fac2 + fac3 + fac4) == -4     0.0      9.3 normal identity## 2: (fac1 + fac2 + fac3 + fac4) == -2     0.8      9.3 normal identity## 3: (fac1 + fac2 + fac3 + fac4) ==  0     1.6      9.3 normal identity## 4: (fac1 + fac2 + fac3 + fac4) ==  2     2.4      9.3 normal identity## 5: (fac1 + fac2 + fac3 + fac4) ==  4     3.2      9.3 normal identity

In the second scenario, each successive exposure continues to add to the effect, but each additional intervention adds a little less. The first intervention adds 0.8, the second adds 0.6, the third adds 0.4, and the fourth adds 0.2. This is a form of interaction.

cX <- defReadCond("DataMF/FacSumContX.csv")
cX
##                            condition formula variance   dist     link## 1: (fac1 + fac2 + fac3 + fac4) == -4     0.0      9.3 normal identity## 2: (fac1 + fac2 + fac3 + fac4) == -2     0.8      9.3 normal identity## 3: (fac1 + fac2 + fac3 + fac4) ==  0     1.4      9.3 normal identity## 4: (fac1 + fac2 + fac3 + fac4) ==  2     1.8      9.3 normal identity## 5: (fac1 + fac2 + fac3 + fac4) ==  4     2.0      9.3 normal identity

This is what a plot of the means might look like for each of the scenarios. The straight line represents the additive (non-interactive) scenario, and the bent line is the interaction scenario:
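Since the figure itself is not reproduced here, a minimal sketch (mine, not the original plotting code) that recreates it from the two condition tables above:

library(ggplot2)

means <- data.frame(
  n_interventions = rep(0:4, 2),
  mean_outcome    = c(0, 0.8, 1.6, 2.4, 3.2,    # additive scenario
                      0, 0.8, 1.4, 1.8, 2.0),   # interaction scenario
  scenario        = rep(c("additive", "interaction"), each = 5)
)

ggplot(means, aes(n_interventions, mean_outcome, linetype = scenario)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of interventions", y = "Average outcome")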

Sample size requirement for a single intervention compared to control

If we were to conduct a more traditional randomized experiment with two groups – treatment and control – we would need about 500 total subjects under the assumptions that we are using:

power.t.test(power = 0.90, delta = .8, sd = 3.05, sig.level = 0.10)
## ##      Two-sample t test power calculation ## ##               n = 249.633##           delta = 0.8##              sd = 3.05##       sig.level = 0.1##           power = 0.9##     alternative = two.sided## ## NOTE: n is number in *each* group

To take a look at the sample size requirements for a multi-factorial study, I’ve written this function that repeatedly samples data based on the definitions and fits the appropriate model, storing the results after each model estimation.

library(simstudy)

iterFunc <- function(dc, dt, seed = 464653, iter = 1000, binary = FALSE) {

  set.seed(seed)
  res <- list()

  for (i in 1:iter) {

    dx <- addCondition(dc, dt, "Y")

    if (binary == FALSE) {
      fit <- lm(Y ~ fac1*fac2*fac3*fac4, data = dx)
    } else {
      fit <- glm(Y ~ fac1*fac2*fac3*fac4, data = dx, family = binomial)
    }

    # A simple function to pull data from the fit
    res <- appendRes(res, fit)
  }

  return(res)
}
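The helper appendRes() is not shown in the post; here is a minimal sketch of what it presumably does (an assumption on my part): it stacks each fitted model's coefficient p-values into res$p, one row per iteration, so they can be summarized column-wise afterwards.

library(data.table)

appendRes <- function(res, fit) {
  coefs <- summary(fit)$coefficients
  pvals <- as.data.table(t(coefs[-1, ncol(coefs)]))  # p-values, intercept dropped
  res$p <- rbind(res$p, pvals)                       # one row per iteration
  res
}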

And finally, here are the results for the sample size requirements based on no interaction across interventions. (I am using function genMultiFac to generate replications of all the combinations of four factors. This function is now part of simstudy, which is available on github, and will hopefully soon be up on CRAN.)

dt <- genMultiFac(32, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(cNoX, dt)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.894 0.895 0.905 0.902

A sample size of \(32 \times 16 = 512\) gives us the 90% power we are seeking. In case you don’t believe my simulation, we can compare the estimate provided by the MOST package, created by the Methodology Center at Penn State:

library(MOST)

FactorialPowerPlan(alpha = 0.10, model_order = 1, nfactors = 4,
                   ntotal = 500, sigma_y = 3.05, raw_main = 0.8)$power
## [1] "------------------------------------------------------------"## [1] "FactorialPowerPlan Macro"## [1] "The Methodology Center"## [1] "(c) 2012 Pennsylvania State University"## [1] "------------------------------------------------------------"## [1] "Assumptions:"## [1] "There are 4 dichotomous factors."## [1] "There is independent random assignment."## [1] "Analysis will be based on main effects only."## [1] "Two-sided alpha:     0.10"## [1] "Total number of participants: 500"## [1] "Effect size as unstandardized difference in means:     0.80"## [1] "Assumed standard deviation for the response variable is      3.05"## [1] "Attempting to calculate the estimated power."## [1] "------------------------------------------------------------"## [1] "Results:"## [1] "The calculated power is 0.9004"
## [1] 0.9004

Interaction

A major advantage of the multi-factorial experiment over the traditional RCT, of course, is that it allows us to investigate if the interventions interact in any interesting ways. However, in practice it may be difficult to generate sample sizes large enough to measure these interactions with much precision.

In the next pair of simulations, we see that even if we are only interested in exploring the main effects, underlying interaction reduces power. If there is actually interaction (as in the second scenario defined above), the original sample size of 500 may be inadequate to estimate the main effects:

dt <- genMultiFac(31, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.567 0.556 0.588 0.541

Here, a total sample of about 1300 does the trick:

dt <- genMultiFac(81, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.898 0.893 0.908 0.899

But this sample size is not adequate to estimate the actual second degree interaction terms:

apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`,                 `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean)
## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 ##     0.144     0.148     0.163     0.175     0.138     0.165

You would actually need a sample size of about 32,000 to be adequately powered to estimate the interaction! Of course, this requirement is driven by the size of the interaction effects and the variation, so maybe this is a bit extreme:

dt <- genMultiFac(2000, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)
apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`,
                `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean)
## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 ##     0.918     0.902     0.888     0.911     0.894     0.886

A binary outcome

The situation with the binary outcome is really no different than the continuous outcome, except for the fact that sample size requirements might be much more sensitive to the strength of underlying interaction.

Again, we have two scenarios – one with interaction and one without. When I talk about an additive (non-interaction) model in this context, the additivity is on the log-odds scale. This becomes apparent when looking at a plot.

I want to reiterate here that we have interaction when there are limits to how much marginal effect an additional intervention can have conditional on the presence of other interventions. In a recent project (one that motivated this pair of blog entries), we started with the assumption that a single intervention would have a 5 percentage point effect on the outcome (which was smoking cessation), but a combination of all four interventions might only get a 10 percentage point reduction. This cap generates severe interaction which dramatically affected sample size requirements, as we see below (using even less restrictive interaction assumptions).

No interaction:

##                            condition formula variance   dist     link## 1: (fac1 + fac2 + fac3 + fac4) == -4    0.10       NA binary identity## 2: (fac1 + fac2 + fac3 + fac4) == -2    0.18       NA binary identity## 3: (fac1 + fac2 + fac3 + fac4) ==  0    0.30       NA binary identity## 4: (fac1 + fac2 + fac3 + fac4) ==  2    0.46       NA binary identity## 5: (fac1 + fac2 + fac3 + fac4) ==  4    0.63       NA binary identity

Interaction:

##                            condition formula variance   dist     link## 1: (fac1 + fac2 + fac3 + fac4) == -4    0.10       NA binary identity## 2: (fac1 + fac2 + fac3 + fac4) == -2    0.18       NA binary identity## 3: (fac1 + fac2 + fac3 + fac4) ==  0    0.24       NA binary identity## 4: (fac1 + fac2 + fac3 + fac4) ==  2    0.28       NA binary identity## 5: (fac1 + fac2 + fac3 + fac4) ==  4    0.30       NA binary identity

The plot highlights that additivity is on the log-odds scale only:
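Since the figure is not reproduced here, a quick check of that claim (my own, using the probabilities in the no-interaction table above) is to look at the logits directly; the increments are roughly constant, at about 0.68:

p_no_interaction <- c(0.10, 0.18, 0.30, 0.46, 0.63)
round(qlogis(p_no_interaction), 2)         # -2.20 -1.52 -0.85 -0.16  0.53
round(diff(qlogis(p_no_interaction)), 2)   #  0.68  0.67  0.69  0.69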

The sample size requirement for a treatment effect of 8 percentage points for a single intervention compared to control is about 640 total participants:

power.prop.test(power = 0.90, p1 = .10, p2 = .18, sig.level = 0.10)
## ##      Two-sample comparison of proportions power calculation ## ##               n = 320.3361##              p1 = 0.1##              p2 = 0.18##       sig.level = 0.1##           power = 0.9##     alternative = two.sided## ## NOTE: n is number in *each* group

Simulation shows that the multi-factorial experiment requires only 500 participants, a pretty surprising reduction:

dt <- genMultiFac(31, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(bNoX, dt, binary = TRUE)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.889 0.910 0.916 0.901

But, if there is a cap on how much we can affect the outcome (i.e. there is underlying interaction), estimated power is considerably reduced:

dt <- genMultiFac(31, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(bX, dt, binary = TRUE)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.398 0.409 0.405 0.392

We need to increase the sample size to about \(125 \times 16 = 2000\) just to explore the main effects:

dt <- genMultiFac(125, nFactors = 4, coding = "effect",
                  colNames = paste0("fac", c(1:4)))
res <- iterFunc(bX, dt, binary = TRUE)
apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)
##  fac1  fac2  fac3  fac4 ## 0.910 0.890 0.895 0.887

I think the biggest takeaway from all of this is that multi-factorial experiments are a super interesting option when exploring possible interventions or combinations of interventions, particularly when the outcome is continuous. However, this approach may not be as feasible when the outcome is binary, as sample size requirements can quickly become prohibitive, given the number of factors, the sample sizes involved, and the extent of interaction.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Statistical Sins: Is Your Classification Model Any Good?


(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Prediction with Binomial Regression April A to Z is complete! We now return to your regularly scheduled statistics blog posts! Today, I want to talk about an issue I touched on during A to Z: using regression to predict values and see how well your model is doing.

Specifically, I talked a couple of times about binomial regression (here and here), which is used to predict (read: recreate with a set of variables significantly related to) a binary outcome. The data example I used involved my dissertation data and the binary outcome was verdict: guilty or not guilty. A regression model returns the linear correction applied to the predictor variables to reproduce the outcome, and will highlight whether a predictor was significantly related to the outcome or not. But a big question you may be asking of your binomial model is: how well does it predict the outcome? Specifically, how can you examine whether your regression model is correctly classifying cases?

We’ll start by loading/setting up the data and rerunning the binomial regression with interactions.

dissertation <- read.delim("dissertation_data.txt", header = TRUE)
dissertation <- dissertation[, 1:44]
predictors <- c("obguilt", "reasdoubt", "bettertolet", "libertyvorder", "jurevidence", "guilt")
dissertation <- subset(dissertation, !is.na(libertyvorder))

dissertation[45:50] <- lapply(dissertation[predictors], function(x) {
  y <- scale(x, center = TRUE, scale = TRUE)
})

pred_int <- 'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 +
             jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 +
             bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'

model <- glm(pred_int, family = "binomial", data = dissertation)
summary(model)
##  ## Call: ## glm(formula = pred_int, family = "binomial", data = dissertation) ##  ## Deviance Residuals:  ##     Min       1Q   Median       3Q      Max   ## -2.6101  -0.5432  -0.1289   0.6422   2.2805   ##  ## Coefficients: ##                         Estimate Std. Error z value Pr(>|z|)     ## (Intercept)             -0.47994    0.16264  -2.951  0.00317 **  ## obguilt.1                0.25161    0.16158   1.557  0.11942     ## reasdoubt.1             -0.09230    0.20037  -0.461  0.64507     ## bettertolet.1           -0.22484    0.20340  -1.105  0.26899     ## libertyvorder.1          0.05825    0.21517   0.271  0.78660     ## jurevidence.1            0.07252    0.19376   0.374  0.70819     ## guilt.1                  2.31003    0.26867   8.598  < 2e-16 *** ## obguilt.1:guilt.1        0.14058    0.23411   0.600  0.54818     ## reasdoubt.1:guilt.1     -0.61724    0.29693  -2.079  0.03764 *   ## bettertolet.1:guilt.1    0.02579    0.30123   0.086  0.93178     ## libertyvorder.1:guilt.1 -0.27492    0.29355  -0.937  0.34899     ## jurevidence.1:guilt.1    0.27601    0.36181   0.763  0.44555     ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ##  ## (Dispersion parameter for binomial family taken to be 1) ##  ##     Null deviance: 490.08  on 354  degrees of freedom ## Residual deviance: 300.66  on 343  degrees of freedom ## AIC: 324.66 ##  ## Number of Fisher Scoring iterations: 6 

The predict function, which I introduced here, can also be used for the binomial model. Let’s have R generate predicted scores for everyone in the dissertation sample:

# Predicted values; for a glm, predict() defaults to the link (log odds) scale
dissertation$predver <- predict(model)
dissertation$predver
##   [1]  0.3907097456 -4.1351129605  2.1820478279 -2.8768390246  2.5804618523 ##   [6]  0.4244692909  2.3065468369 -2.7853434926  0.3504760502 -0.2747339639 ##  [11] -1.8506160725 -0.6956240161 -4.7860574839 -0.3875950731 -2.4955679446 ##  [16] -0.3941516951 -4.5831011509  1.6185480937  0.4971923298  4.1581842900 ##  [21] -0.6320531052 -4.8447046319 -2.3974890696  1.8566258698  0.0360685822 ##  [26]  2.2151040131  2.3477149003 -2.4493726369 -0.2253481404 -4.8899805287 ##  [31]  1.7789459288 -0.0978703861 -3.5541042186 -3.6009218603  0.1568318789 ##  [36]  3.7866003489 -0.6371816898 -0.7047761441 -0.7529742376 -0.0302759317 ##  [41] -0.1108055330  1.9751810033  0.2373614802  0.0424471071 -0.4018757856 ##  [46]  0.0530272726 -1.0763759980  0.0099577637  0.3128581222  1.4806679691 ##  [51] -1.7468626219  0.2998282372 -3.6359162016 -2.2200774510  0.3192366472 ##  [56]  3.0103216033 -2.0625775984 -6.0179845235  2.0300503627  2.3676828409 ##  [61] -2.8971753746 -3.2131490026  2.1349358889  3.0215336139  1.2436192890 ##  [66]  0.2885535375  0.2141821004  1.9480686936  0.0438751446 -1.9368013875 ##  [71]  0.2931258287  0.5319938265  0.0177643261  3.3724920900  0.0332949791 ##  [76]  2.5935500970  0.7571810150  0.7131757400  2.5411073339  2.8499853550 ##  [81]  2.8063291084 -0.4500738791  1.4700679077 -0.8659309719  0.0870492258 ##  [86]  0.5728074322  0.1476797509  2.4697257261  2.5935500970 -2.2200774510 ##  [91] -0.0941827753  1.3708676633  1.4345235392 -0.2407209578  2.4662700339 ##  [96] -1.9687731888 -6.7412580522 -0.0006224018 -4.4132951092 -2.8543032695 ## [101]  1.2295635352  2.8194173530  0.1215689324 -3.8258079371  1.8959803882 ## [106] -4.5578801595  2.3754402614  0.0826808026  1.5112359711 -3.5402060466 ## [111]  0.2556657363  0.7054183194  1.4675797244 -2.3974890696  2.6955929822 ## [116] -0.3123518919 -4.8431862346 -2.0132721372  0.4673405434 -2.3053405270 ## [121]  1.9498822386 -0.5164183930 -1.8277820872 -0.0134750769 -2.3013547136 ## [126] -0.2498730859 -4.4281010683 -0.0134750769 -0.2604532514  0.1476797509 ## [131] -2.3392939519 -2.0625775984 -3.5541042186  1.5087477879 -4.6453051124 ## [136]  2.0616474606 -3.2691362859 -7.3752231145 -1.6666447439  1.0532964013 ## [141] -2.0625775984 -0.3355312717  2.2481601983 -2.2200774510 -4.3276959075 ## [146]  0.8685972087 -0.7727065311  1.7511589809 -0.4774548995  0.0008056357 ## [151]  1.7022334970 -0.4202625135 -0.2902646169  2.4409712692  0.0008056357 ## [156]  0.0008056357 -3.6009218603 -0.8567788439 -0.4528474822  0.3517462520 ## [161]  0.1307210605 -3.7843118182 -2.8419024763 -3.5191098774 -0.1460684795 ## [166]  1.8809888141  2.8194173530 -2.4656469123  1.0589888029  0.1659840070 ## [171]  1.4345235392  2.3676828409  1.5749534339 -0.1681557545  2.6406620359 ## [176]  0.1476797509 -2.2135177411  1.9168260534 -3.4993205379  0.4557086940 ## [181] -3.8136089417 -0.1121510987 -3.9772095600  1.3849234171  0.3504760502 ## [186]  2.3807710856 -3.0667307601  2.3040586537  1.7599138086 -0.2083894500 ## [191]  0.6844579761 -0.3552635652 -1.9459392035 -0.6075281598 -2.1663310490 ## [196]  2.3676828409 -1.9205271122 -2.2334295071 -4.4265826710 -1.0117771483 ## [201] -0.0161530548 -0.3072233074 -0.0161530548 -0.7451676752 -7.0351269313 ## [206]  2.6406620359 -3.7523234832 -0.2498730859  2.0222929422  3.2886316225 ## [211] -1.6221457956  2.4749949634  1.7570711677  0.0904873650 -4.7332807307 ## [216]  0.1568318789 -0.0302759317  0.5127229828  1.3097316594 -6.9309218514 ## [221]  0.0515992352 -0.4514194447 -0.2253481404 -4.7652690656 -0.4279866041 
## [226] -4.4136563866 -3.7618312672  0.0156676181 -0.2590252139  2.6076058507 ## [231]  1.6420333133 -3.9985172969 -6.2076483227  0.1632104039  0.1829426974 ## [236] -4.7652690656 -4.4212844958  1.6001906117  0.8579971472 -3.8699110198 ## [241]  0.3022779567 -0.1679979189  1.9421248181  0.6592738895  1.6132788564 ## [246] -0.0366544567 -3.4818233673 -3.9422152187 -0.3473613776  0.4321933815 ## [251]  0.7480288869 -0.2498730859 -1.9861068488 -2.2297920164 -0.7621263656 ## [256]  1.2966434147  0.1632104039  0.2048721368  1.7789459288  0.4926393080 ## [261]  0.4096285430 -1.7794744955 -2.5822853071  2.0413250624 -6.6574350219 ## [266] -0.1277642235 -2.1972434657 -2.5075677545 -0.4482774141 -0.6943740757 ## [271] -0.7821891015  6.3289445390  0.1568318789  0.1165981835  1.4781797859 ## [276] -4.2287015488 -3.6157278195 -0.1511970641 -0.7047761441  2.0935344484 ## [281] -3.8258079371 -4.4231102471  1.3097316594  3.4081542651 -0.4996175382 ## [286] -2.0534397824  0.9783975145 -2.2562634924  3.7196170683  1.1110084017 ## [291]  2.1661785291 -4.2138955896  1.9421248181  2.3065468369 -0.7139282722 ## [296] -4.1431023472 -2.0854115837  2.9389399956  1.7711269214 -0.0302759317 ## [301] -2.6458711124  0.5856241187 -0.1199576611  1.8566258698 -2.2383553905 ## [306]  2.3807710856 -0.2838860920  3.1176953128  2.8499853550  2.8063291084 ## [311]  0.0034011417 -0.4683781352 -3.0377484314 -1.3833686805  1.7764577456 ## [316]  1.7842151661  3.4081542651  0.1165981835 -4.6988069009 -2.6013721641 ## [321]  2.0616474606 -0.2498730859 -4.2207121622  4.1705330009  5.2103776377 ## [326] -4.5406977837 -1.5080855068 -2.5232652805 -5.7259789038  2.5211393933 ## [331] -0.3487069432 -2.5035573312 -2.2764097339 -5.8364854607 -1.8694684539 ## [336]  1.3402996614  0.5728074322  0.3663267540 -0.1603491921 -2.1690805453 ## [341] -1.4105339689  3.0768201201 -5.1065624241 -4.5966850670 -4.5498907729 ## [346] -1.3078399029 -1.0882592824  0.3128581222 -0.3644156933  0.3100845191 ## [351]  2.4774831467 -1.0763759980  2.2151040131 -0.0952748801 -4.6864864366 

Now, remember that the outcome variable is not guilty (0) and guilty (1), so you might be wondering – what’s with these predicted values? Why aren’t they 0 or 1?

Binomial regression is used for nonlinear outcomes: since the outcome is 0/1, its relationship with the predictors can't be a straight line on the original scale. But binomial regression is built on the generalized linear model. So how can we apply a linear model to a nonlinear outcome? Answer: by transforming scores. Specifically, the outcome is modeled on the log odds (logit) scale; that transformation makes the outcome behave linearly and symmetrically. The predicted values, then, are also log odds.
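
To make the two scales concrete, here is a minimal sketch (not from the original post; the helper names logit and inv_logit are just illustrative) of the transformation and its inverse:

# Illustrative helpers linking probabilities and log odds
logit <- function(p) log(p / (1 - p))           # probability -> log odds
inv_logit <- function(x) exp(x) / (1 + exp(x))  # log odds -> probability

logit(0.5)       # 0: a 50/50 probability corresponds to a log odds of zero
inv_logit(2.31)  # ~0.91: a large positive log odds maps to a high probability of guilty

Converting the model's predicted log odds back into probabilities, which we'll do in a moment, is just this inverse transformation applied to each predicted value.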

# Sort the predicted log odds and plot them to see how they spread across the sample
ordvalues <- dissertation[order(dissertation$predver), ]
ordvalues <- ordvalues[, 51]
ordvalues <- data.frame(1:355, ordvalues)
colnames(ordvalues) <- c("number", "predver")
library(ggplot2)
ggplot(data = ordvalues, aes(number, predver)) + geom_smooth()
## `geom_smooth()` using method = 'loess' 

Log odds are great for analysis, but when trying to understand how well your model is predicting values, we want to convert them into a metric that's easier to understand on its own and when compared to the observed values. We can convert them back into probabilities with the inverse logit, probability = exp(log odds) / (1 + exp(log odds)):

# Inverse logit: convert predicted log odds into predicted probabilities
dissertation$verdict_predicted <- exp(predict(model)) / (1 + exp(predict(model)))

This gives us a value ranging from 0 to 1, which is the probability that a particular person will select guilty. We can use this value in different ways to see how well our model is doing. Typically, we’ll divide at the 50% mark, so anyone with a probability of 0.5 or greater is predicted to select guilty, and anyone with a probability less than 0.5 would be predicted to select not guilty. We then compare this new variable with the observed results to see how well the model did.
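
As an aside, predict() for a glm object can also return values on the probability scale directly with type = "response"; here is a quick check (verdict_predicted2 is just an illustrative name) that the two approaches agree:

# Same probabilities, computed by predict() itself rather than by hand
dissertation$verdict_predicted2 <- predict(model, type = "response")
all.equal(dissertation$verdict_predicted, dissertation$verdict_predicted2)  # should be TRUE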

# Rounding applies (approximately) a 0.5 cutpoint: probabilities above .5 become 1 (guilty)
dissertation$vpred_rounded <- round(dissertation$verdict_predicted, digits = 0)
library(expss)
## Warning: package 'expss' was built under R version 3.4.4 
dissertation <- apply_labels(dissertation,
                             verdict = "Actual Verdict",
                             verdict = c("Not Guilty" = 0, "Guilty" = 1),
                             vpred_rounded = "Predicted Verdict",
                             vpred_rounded = c("Not Guilty" = 0, "Guilty" = 1))
cro(dissertation$verdict, list(dissertation$vpred_rounded, total()))
                  Predicted Verdict           #Total
                  Not Guilty    Guilty
 Actual Verdict
   Not Guilty            152        39            191
   Guilty                 35       129            164
   #Total cases          187       168            355

This table, which crosses actual categories against predicted categories, is known as a confusion matrix. One thing we can compute from it is how well the model did at correctly categorizing cases, which we get by adding together the number of people with both observed and predicted not guilty and the number with both observed and predicted guilty, then dividing that sum by the total.

accuracy <- (152 + 129) / 355
accuracy
## [1] 0.7915493 

Our model correctly classified 79% of the cases. However, this is not the only way we can determine how well our model did. There are a variety of derivations you can make from the confusion matrix. But two you should definitely include when doing this kind of analysis are sensitivity and specificity. Sensitivity refers to the true positive rate, and specificity refers to the true negative rate.

When you’re working with confusion matrices, you’re often trying to diagnose or identify some condition, one that may be deemed positive or present, and the other that may be deemed negative or absent. These derivations are important because they look at how well your model identifies these different states. For instance, if most of my cases selected not guilty, I could get a high accuracy rate by simply predicting that everyone will select not guilty. But then my model lacks sensitivity – it only identifies negative cases (not guilty) and fails to identify any positive cases (guilty). If I were dealing with something even higher stakes, like whether a test result indicates the presence of a condition, I want to make certain my classification is sensitive to those positive cases. And vice versa, I could keep from missing any positive cases by just classifying everyone as positive, but then my model lacks specificity and I may subject people to treatment they don’t need (and that could be harmful).
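
To see that concretely, here is a quick sketch (not in the original post) of a degenerate classifier that labels every case not guilty:

# A classifier that ignores the data and predicts not guilty (0) for everyone
baseline_pred <- rep(0, nrow(dissertation))
mean(baseline_pred == dissertation$verdict)  # accuracy equals the not-guilty base rate, 191/355 (about 0.54)
# Sensitivity is 0: not a single observed guilty case is classified as guilty

In this sample the base rate isn't especially high, so the trick doesn't even buy much accuracy, but the point stands: accuracy alone can hide a complete failure on one class.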

Just like accuracy, sensitivity and specificity are easy to calculate. As I said above, I’ll consider not guilty to be negative and guilty to be positive. Sensitivity is simply the number of true positives (observed and predicted guilty) divided by the sum of true positives and false negatives (people who selected guilty but were classified as not guilty).

sensitivity <- 129 / 164
sensitivity
## [1] 0.7865854 

And specificity is the number of true negatives (observed and predicted not guilty) divided by the sum of true negatives and false positives (people who selected not guilty but were classified as guilty).

specificity <- 152 / 191
specificity
## [1] 0.7958115 

So the model correctly classifies 79% of the positive cases and 80% of the negative cases. The model could be improved, but it’s functioning equally well across positive and negative cases, which is good.
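
Rather than hard-coding the cell counts, these same numbers can be pulled from a confusion matrix built with table(); here is a sketch, assuming the verdict and vpred_rounded columns created above (as.numeric() just strips the value labels so the rows and columns are named "0" and "1"):

# Rows are actual verdicts, columns are predicted verdicts (0 = not guilty, 1 = guilty)
cm <- table(actual = as.numeric(dissertation$verdict),
            predicted = as.numeric(dissertation$vpred_rounded))

sum(diag(cm)) / sum(cm)        # accuracy: (true negatives + true positives) / total
cm["1", "1"] / sum(cm["1", ])  # sensitivity: true positives / all observed guilty
cm["0", "0"] / sum(cm["0", ])  # specificity: true negatives / all observed not guilty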

It should be pointed out that you can select any cutpoint you want for your probability variable. That is, if I want to be very conservative in identifying positive cases, I might want there to be a higher probability that it is a positive case before I classify it as such – perhaps I want to use a cutpoint like 75%. I can easily do that.

dissertation$vpred2[dissertation$verdict_predicted < 0.75] <- 0
dissertation$vpred2[dissertation$verdict_predicted >= 0.75] <- 1
dissertation <- apply_labels(dissertation,
                             vpred2 = "Predicted Verdict (0.75 cut)",
                             vpred2 = c("Not Guilty" = 0, "Guilty" = 1))
cro(dissertation$verdict, list(dissertation$vpred2, total()))
                  Predicted Verdict (0.75 cut)   #Total
                  Not Guilty    Guilty
 Actual Verdict
   Not Guilty            177        14               191
   Guilty                 80        84               164
   #Total cases          257        98               355
accuracy2 <- (177 + 84) / 355
sensitivity2 <- 84 / 164
specificity2 <- 177 / 191
accuracy2
## [1] 0.7352113 
sensitivity2
## [1] 0.5121951 
specificity2
## [1] 0.9267016 

Changing the cut score improves specificity but at the cost of sensitivity, which makes sense, because our model was predicting equally well (or poorly, depending on how you look at it) across positives and negatives. In this case, a different cut score won’t improve our model. We would need to go back and see if there are better variables to use for prediction. And to keep us from fishing around in our data, we’d probably want to use a training and testing set for such exploratory analysis.
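
As a rough illustration of that last point, here is a minimal train/test sketch (not part of the original analysis; the 70/30 split, the seed, and the object names are arbitrary). A more careful version would also recompute the standardized predictors using the training data only.

# Hold out a random 30% of cases for testing
set.seed(42)  # arbitrary seed, for reproducibility
train_rows <- sample(nrow(dissertation), size = round(0.7 * nrow(dissertation)))
train <- dissertation[train_rows, ]
test  <- dissertation[-train_rows, ]

# Refit the same interaction model on the training set only
train_model <- glm(as.formula(pred_int), family = "binomial", data = train)

# Classify the held-out cases at the 0.5 cutpoint and check out-of-sample accuracy
test_prob <- predict(train_model, newdata = test, type = "response")
test_pred <- ifelse(test_prob >= 0.5, 1, 0)
mean(test_pred == test$verdict, na.rm = TRUE)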


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.



