
OddsPlotty – the first official package I have ‘officially’ launched


[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers.]

Motivation for this

The background to this package is linked to a project I undertook about a year ago.

The video relates to the project and how R really sped up the process.

The exam question was to use a regression model to predict admissions and, as a consequence, we had to evaluate 60 different variations. In Excel, this would have been a nightmare. Again, R to the rescue. You can see the presentation below, if interested:

Getting off my arse and creating the documentation / package

Reflecting and cogitating on the NHS-R conference this year, I thought why not use the remainder of Wednesday evening to hop on to RStudio and start to make this into a package. I found the MIT PDF very helpful: http://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf.

So that is just what I did.

What does this package do?

It simply analyses the results of a logistic regression model and creates an odds plot. This, I found, was an easier way for the strategic audience to engage with the outputs of the tool.

Reflecting on this – I thought this is a really useful little chart and so OddsPlotty was born.
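For context, the numbers behind such a chart are just the exponentiated coefficients and confidence intervals of a fitted logistic regression. Here is a minimal sketch of that idea using a built-in dataset; this is my own illustration of what an odds plot shows, not OddsPlotty's API:

library(ggplot2)

# Illustration only (not OddsPlotty itself): odds ratios and 95% confidence
# intervals from a logistic regression, which is what an odds plot displays
model <- glm(case ~ spontaneous + induced, data = infert, family = binomial())

odds_table <- data.frame(
  term  = names(coef(model)),
  or    = exp(coef(model)),          # odds ratios
  lower = exp(confint(model)[, 1]),  # lower bound of the 95% CI
  upper = exp(confint(model)[, 2])   # upper bound of the 95% CI
)

# A bare-bones odds plot: points for the odds ratios, error bars for the intervals
ggplot(odds_table, aes(x = or, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2)

OddsPlotty wraps this kind of summary into a ready-made ggplot2 chart, which is what the examples below show.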

Installing the package

The installation of the package can be achieved by loading up RStudio and then running the code below:

devtools::install_github("https://github.com/StatsGary/OddsPlotty", dependencies = TRUE)

Once this has been run in the console or script editor, R will automatically look for the package on GitHub and install it, along with its dependencies. You can then load it with library(OddsPlotty).

Using the package

The vignette I have created on the RPubs website shows how to use the tool: https://rpubs.com/StatsGary/547542.

Outputs from the tool

As this is now a ggplot2 object, you can also use the plot with different themes. Below are three examples of different themes from the tool:

Classic view

Classic view – defaulting to blue points and black error bars

This shows, on the test dataset, that the odds of a tumour being malignant rather than benign are greater when the measured thickness and bare nuclei values are larger. The data come from a sample machine learning dataset in the mlbench package.

Odds plot dressed in economist view

ggthemes Economist view

As you can see, the plot is flexible and allows for plenty of customisation.

Edward Tufte would be proud

If you have never heard of Edward Tufte, he is one of the pioneers of data visualisation. Read some of his work; the visuals may look dated, but the principles still hold true.

On a tangent, I will return to the final output used as an example:

For the full implementation of how to use this chart with logistic regression models, refer to the aforementioned vignette.

Evolution and iteration

This is just the first cut of the package; however, I would love to hear feedback and suggestions for future developments of the tool. If you find any issues or problems, then leave a post on my Twitter page (using #oddsplottyR).

Please stay posted, as I aim to publish more packages in the future. The quote below resonates with me when it comes to some of the functionality in the devtools package:

“It’s still magic even if you know how it’s done.”

― Terry Pratchett, A Hat Full of Sky




A small simple random sample will often be better than a huge not-so-random one by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

An interesting big data thought experiment

The other day on Twitter I saw someone referencing a paper or a seminar or something that was reported to examine the following situation: if you have an urn with a million balls in it of two colours (say red and white) and you want to estimate the proportion of balls that are red, are you better off taking the top 800,000 balls – or stirring the urn and taking a sample of just 10,000? The answer given was the second of these options. The idea is to illustrate the limitations of “big data” methods, which can often be taken to mean samples that are very large but of uncertain quality with regard to representativeness and randomness.

The nature of the Twitter user experience is such that I’ve since lost track of the original post and can’t find it again.

My first thought was “wow, what a great illustration!” My second was “actually, that sounds a bit extreme.” After all, worsening your estimate by 80:1 is a pretty severe design effect penalty to pay for non-random sampling. A bit of thinking shows that whether the small, random sample outperforms the bigger one is going to depend very much on how the urn was filled in the first place.

Consider as an extreme example that the urn was filled by pouring in all the balls of one colour, then all of another. In this situation you will certainly be better off with the stirring method (in all that follows, I am going to assume that the stirring is highly effective, and that stirring and sampling equates to taking a simple random sample).

But at the other extreme, if the urn was filled in a completely random order, then either sampling method equates to simple random sampling, and the larger sample will greatly outperform the smaller. In fact, in that case the standard errors from the stirring method will be 8.9 times those from the large data method (the square root of 80). So the choice between the methods depends on how spatially correlated (in three-dimensional space) the colour of the balls is within the urn. This is similar to how the need for special time series methods depends on whether there is autocorrelation over time, how the need for spatial methods depends on whether there is spatial autocorrelation, and how adjusting for survey design effects depends on intra-cluster correlation.
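As a quick check of that 8.9 figure, here is a small sketch of my own (not from the original post); like the 8.9, it ignores the finite population correction:

# Under simple random sampling the standard error of a proportion is
# sqrt(p * (1 - p) / n), so the ratio of standard errors depends only on n
p <- 0.3
se_stirred <- sqrt(p * (1 - p) / 10000)    # stirred urn, sample of 10,000
se_bigdata <- sqrt(p * (1 - p) / 800000)   # top 800,000 balls, if they were truly random
se_stirred / se_bigdata                    # sqrt(80), roughly 8.9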

Efficiently simulating filling an urn with balls

To explore just how correlated the balls need to be for the simple random sampling method to be preferred, I ran a simple simulation. In this simulation, the balls are added to the urn one at a time, with no natural stirring. There is an overall probability p of a ball being red (and this is the exact probability of the first ball that goes in being red), but for the second and subsequent balls there is a second probability to be aware of, q, which is the chance that this ball will be forced to be the same colour as the previous ball, rather than pulled from the hyper population with parameter p.

Here’s a simple R function that will generate an urn of size n with those parameters:

Post continues after R code.

library(tidyverse)
library(scales)
library(Rcpp)
library(microbenchmark)

#' Simulate filling of an urn with red and white balls, with serial correlation
#'
#' @param n Number of red and white balls in the urn
#' @param p overall hyper-population proportion of red balls
#' @param q probability that a ball, when originally placed in the urn in order, will be the same
#' colour as the preceding ball rather than randomly choosing between red and white with probability p
fill_urn_r <- function(n, p, q){
  x <- numeric(n)
  x[1] <- rbinom(1, 1, prob = p)
  for(i in 2:n){
    x[i] <- ifelse(rbinom(1, 1, prob = q) == 1,
                   x[i - 1],
                   rbinom(1, 1, prob = p))
  }
  return(x)
}

However, this R program is too slow for my purposes. I’m going to want to generate many thousands of these urns, with a million balls in each, so time really matters. It was worth re-writing this in C++ via the Rcpp R package.

Post continues after R code.

# C++ version, hopefully much faster:
cppFunction('NumericVector fill_urn_c(int n, double p, double q) {
  NumericVector x(n);
  x[0] = as<int>(rbinom(1, 1, p));
  for (int i = 1; i < n; i++){
    if (as<int>(rbinom(1, 1, q)) == 1){
      x[i] = x[i-1];
    } else {
      x[i] = as<int>(rbinom(1, 1, p));
    }
  }
  return x;
}')

# Check the R and C++ versions give same results:
all.equal({set.seed(123); fill_urn_c(30, 0.3, 0.9)},
          {set.seed(123); fill_urn_r(30, 0.3, 0.9)})

# Compare speeds
microbenchmark("C++" = fill_urn_c(10000, 0.3, 0.9),
               "R"   = fill_urn_r(10000, 0.3, 0.9))

The C++ function was as easy to write as the R version (helped by the Rcpp function rbinom, which makes C++ seem even more familiar to R users) and delivers roughly a 40- or 50-fold speed up. Here are the results of the benchmarking at the end of the last chunk of code:

Unit: microseconds
 expr     min       lq      mean   median      uq      max neval
  C++   705.9   755.85   991.928   851.15   960.3   6108.1   100
    R 31868.7 36086.60 69251.103 41174.25 48143.8 427038.0   100

By the way, the as<int>() in my Rcpp code is needed because rbinom generates a vector (of size one in this case) and I want to treat it as a scalar. There may be a more elegant way of handling this, but this was good enough for my purposes.

Comparing the two sampling methods

To help with comparing the two sampling methods, I write two more functions:

  • compare_methods() to run the “big data” and random sampling methods on a single urn and compare their results to the true proportion in that particular urn (using the actual proportion in the urn, not the hypothetical hyper-parameter p)
  • overall_results() to generate many urns with the same set of parameters

Post continues after R code.

#------------------Comparing two sampling methods-----------

#' Compare two methods of sampling
#'
#' @param x a vector of 1 and 0 for which we want to estimate the proportion of 1s
#' @return a data frame with one row and two columns, showing the difference between two
#' methods of sampling and estimating a proportion
#' @details The "big data" method takes the most recent 80% as the sample. The "sample" method
#' takes a simple random sample.
compare_methods <- function(x){
  n <- length(x)
  correct <- mean(x)
  
  # Method 1 - the most recent 80% of observations:
  bigdata_method <- mean(tail(x, round(n * 0.8))) - correct
  
  # Method 2 - simple random sample of 1% of observations:
  sample_method <- mean(sample(x, round(n * 0.01))) - correct
  
  return(data.frame(bigdata_method = bigdata_method,
                    sample_method = sample_method))
}

#' Repeated comparison of two methods of sampling
#'
#' @param m Number of times to run the experiment
#' @param n Number of red and white balls in the urn for which we are trying to estimate a proportion
#' @param p overall hyper-population proportion of red balls
#' @param q probability that a ball, when originally placed in the urn in order, will be the same
#' colour as the preceding ball rather than randomly choosing between red and white with probability p
#' @details This works by repeated calls to fill_urn_c() and compare_methods()
overall_results <- function(m, n, p, q){
  results <- lapply(1:m, function(j){
    x <- fill_urn_c(n, p, q)
    y <- compare_methods(x)
    return(y)
  })
  
  tmp <- do.call(rbind, results) %>%
    tidyr::gather(variable, value) %>%
    dplyr::mutate(m = m, n = n, p = p, q = q)
  
  return(tmp)
}

Here’s what I get when I compare the two methods for a thousand runs with urns of 10,000 balls each, with p = 0.3 and various values of q:

… and here are the results for the original use case, a thousand runs (at each value of q) of an urn with one million balls:

So we can see that in both cases we need a lot of serial correlation between balls (based on the order they go into the urn) for the method of randomly sampling 1% of the balls to out-perform the brute-force selection of the top 80% of balls. Somewhere between a q of 0.99 and 0.999 is where the stirring method becomes clearly better. Remember, q is the probability that any ball going into the urn is forced to be the same colour as the previous ball; otherwise its colour is drawn at random, with probability p of being red (0.3 in our case).

Here’s the code for the actual simulations.

Post continues after R code.

#-------------------------------Small urns--------------
tmp1 <- overall_results(1000, 1e4, 0.3, 0.9)
tmp2 <- overall_results(1000, 1e4, 0.3, 0.99)
tmp3 <- overall_results(1000, 1e4, 0.3, 0.999)
tmp4 <- overall_results(1000, 1e4, 0.3, 0.9999)

rbind(tmp1, tmp2, tmp3, tmp4) %>%
  mutate(variable = case_when(
    variable == "bigdata_method" ~ "Sample 80% of balls at top of urn",
    variable == "sample_method"  ~ "Stir urn and take random sample of 1%")) %>%
  mutate(q = paste(q, "serial relationship factor")) %>%
  ggplot(aes(x = value, fill = variable, colour = variable)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~q, scales = "free") +
  labs(fill = "Sampling method:",
       colour = "Sampling method:",
       x = "Discrepancy between sample proportion and true proportion from entire urn",
       title = "Comparison of 'big data' and random sampling methods",
       subtitle = "Estimating a proportion from an urn of 10,000 red and white balls.
Which method works best depends on the degree of serial correlation between balls.",
       caption = "http://freerangestats.info")

#----------------------Large urns------------------
tmp5 <- overall_results(1000, 1e6, 0.3, 0.9)
tmp6 <- overall_results(1000, 1e6, 0.3, 0.99)
tmp7 <- overall_results(1000, 1e6, 0.3, 0.999)
tmp8 <- overall_results(1000, 1e6, 0.3, 0.9999)

rbind(tmp5, tmp6, tmp7, tmp8) %>%
  mutate(variable = case_when(
    variable == "bigdata_method" ~ "Sample 80% of balls at top of urn",
    variable == "sample_method"  ~ "Stir urn and take random sample of 1%")) %>%
  mutate(q = paste(q, "serial relationship factor")) %>%
  ggplot(aes(x = value, fill = variable, colour = variable)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~q, scales = "free") +
  labs(fill = "Sampling method:",
       colour = "Sampling method:",
       x = "Discrepancy between sample proportion and true proportion from entire urn",
       title = "Comparison of 'big data' and random sampling methods",
       subtitle = "Estimating a proportion from an urn of 1,000,000 red and white balls.
Which method works best depends on the degree of serial correlation between balls.",
       caption = "http://freerangestats.info")

One final point – what if we look at the absolute value of the discrepancy between each method’s estimate of the proportion and its true value? We see basically the same picture.

rbind(tmp5, tmp6, tmp7, tmp8) %>%
  mutate(variable = case_when(
    variable == "bigdata_method" ~ "Sample 80% of balls at top of urn",
    variable == "sample_method"  ~ "Stir urn and take random sample of 1%")) %>%
  mutate(q = paste(q, "serial relationship factor")) %>%
  mutate(value = abs(value)) %>%
  ggplot(aes(x = value, fill = variable, colour = variable)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~q, scales = "free") +
  labs(fill = "Sampling method:",
       colour = "Sampling method:",
       x = "Absolute value of discrepancy between sample proportion and true proportion from entire urn",
       title = "Comparison of 'big data' and random sampling methods, urn with 1,000,000 balls",
       subtitle = "Estimating a proportion from an urn of 1,000,000 red and white balls.
Which method works best depends on the degree of serial correlation between balls.",
       caption = "http://freerangestats.info")

Reflections

If we thought the data generating process (ie how the urn was filled with balls) resembled my simulation, we would be better off choosing the large sample method (assuming equal cost) unless we had reason to believe the serial relationship factor q was very high. But this doesn’t invalidate the basic usefulness of random sampling, for several reasons:

  • The costs are unlikely to be the same. Even with today’s computers, it is easier and cheaper to collect, process and analyse smaller datasets than larger ones.
  • The “big data” method is risky, in the sense that it makes the analyst vulnerable to the true data generating process in a way that simple random sampling doesn’t. With random sampling we can calculate the properties of the statistic exactly and with confidence, so long as our stirring is good. We can’t say the same for the “top 80%” method.
  • Related to the point above, the risk is particularly strong if the data generating process is quirkier than my simulation. For example, given my simulated data generating process, both sampling methods produce unbiased results. However, this isn’t always going to be the case. Consider if the urn had been filled on the basis of “put all the red balls in first; then fill up the rest of the urn with white balls”. In this case the “top 80%” method will be very badly biased to underestimate the proportion of red balls (in fact, with p = 0.3, the method will estimate it to be 0.125 – a ghastly result; see the quick check after this list).
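A quick numerical check of that last example (my own sketch, not from the original post):

# All 300,000 red balls (1) go in first, then 700,000 white balls (0)
x <- c(rep(1, 3e5), rep(0, 7e5))
mean(tail(x, 8e5))    # "top 80%" estimate: 0.125, versus a true proportion of 0.3
mean(sample(x, 1e4))  # a stirred 1% sample is unbiased, landing around 0.3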

That final dot point might sound perverse, but actually it isn’t hard to imagine a real life situation in which data is generated in a way that is comparable to that method. For example, I might have one million observations of the level of sunlight, taken between 1am and 9am on a November morning in a temperate zone.

So my overall conclusion – a small random sample will give better results than a much larger non-random sample, under certain conditions; but more importantly, it is reliable and controls for risk.



Instrumental variable regression and machine learning


[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]

Intro

Just like the question “what’s the difference between machine learning and statistics?” has caused a lot of ink to be spilled (since at least Breiman (2001)), the same question with statistics replaced by econometrics has led to a lot of discussion as well. I like this presentation by Hal Varian from almost 6 years ago. There’s a slide called “What econometrics can learn from machine learning”, which summarises Varian (2014) in a few bullet points, and the rest of the presentation discusses what machine learning can learn from econometrics. Varian argues that the difference between machine learning and econometrics is that machine learning focuses on prediction, while econometrics focuses on inference and causality (and, to a lesser extent, on prediction as well). Varian cites some methods that have been in the econometricians’ toolbox for decades (at least for some of them), such as regression discontinuity, difference in differences and instrumental variables regression. Another interesting paper is Mullainathan and Spiess, especially the section called What Do We (Not) Learn from Machine Learning Output?. The authors discuss the tempting idea of using LASSO to perform variable (feature) selection. Econometricians might be tempted to use LASSO for variable selection and draw conclusions such as “the variable (feature) ‘Number of rooms’ has not been selected by LASSO, thus it plays no role in the prediction of house prices”. However, when variables (features) are highly correlated, LASSO selects among them essentially randomly, without any meaningful impact on predictive model performance. I found this paragraph quite interesting (emphasis mine):

This problem is ubiquitous in machine learning. The very appeal of these algorithms is that they can fit many different functions. But this creates an Achilles’ heel: more functions mean a greater chance that two functions with very different coefficients can produce similar prediction quality. As a result, how an algorithm chooses between two very different functions can effectively come down to the flip of a coin. In econometric terms, while the lack of standard errors illustrates the limitations to making inference after model selection, the challenge here is (uniform) model selection consistency itself.
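To make the point about correlated features concrete, here is a small sketch of my own (not from the paper or this post), assuming the {glmnet} package: with two nearly identical predictors, which one LASSO keeps is essentially arbitrary, even though predictive performance is unaffected.

library(glmnet)

set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost a copy of x1
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

X <- cbind(x1, x2, x3)

# LASSO with a cross-validated penalty; how the weight is split between x1 and
# x2 is essentially arbitrary, so "x2 was not selected" is not evidence that it
# plays no role in the outcome
coef(cv.glmnet(X, y), s = "lambda.1se")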

Assuming that we successfully dealt with model selection, we still have to contend with the significance of coefficients. There is recent research into this topic, such as Horel and Giesecke, but I wonder to what extent explainability could help with this. I have been looking around for papers that discuss explainability in the context of the social sciences but have not found any. If any of the readers of this blog are aware of such papers, please let me know.

Just to wrap up Mullainathan and Spiess: the authors suggest using machine learning mainly for prediction tasks, such as using satellite images to predict future harvest size (the authors cite Lobell (2013)), or for tasks that have an implicit prediction component. For instance, in the case of instrumental variables regression, two-stage least squares is often used, and the first stage is a prediction task. Propensity score matching is another prediction task where machine learning could be used. Other examples are presented as well. In this blog post, I’ll explore two-stage least squares and see what happens when a random forest is used for the first stage.

Instrumental variables regression using two-stage least squares

Let’s work out a textbook (literally) example of instrumental variable regression. The example below is taken from Wooldridge’s Econometric Analysis of Cross Section and Panel Data, and is an exercise using data from Mroz (1987), The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.

Let’s first load the needed packages and the data "mroz" included in the {wooldridge} package:

library(tidyverse)
library(randomForest)
library(wooldridge)
library(AER)
library(Metrics)

data("mroz")

Let’s only select the women that are in the labour force (inlf == 1), and let’s run a simple linear regression. The dependent variable, or target, is lwage, the logarithm of the wage, and the explanatory variables, or features are exper, expersq and educ. For a full description of the data, click below:

Description of data

mroz {wooldridge}                                                R Documentation

mroz

Description

Wooldridge Source: T.A. Mroz (1987), “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions,” Econometrica 55, 765-799. Professor Ernst R. Berndt, of MIT, kindly provided the data, which he obtained from Professor Mroz. Data loads lazily.

Usage

data('mroz')

Format

A data.frame with 753 observations on 22 variables:

  inlf: =1 if in lab frce, 1975
  hours: hours worked, 1975
  kidslt6: # kids < 6 years
  kidsge6: # kids 6-18
  age: woman's age in yrs
  educ: years of schooling
  wage: est. wage from earn, hrs
  repwage: rep. wage at interview in 1976
  hushrs: hours worked by husband, 1975
  husage: husband's age
  huseduc: husband's years of schooling
  huswage: husband's hourly wage, 1975
  faminc: family income, 1975
  mtr: fed. marg. tax rte facing woman
  motheduc: mother's years of schooling
  fatheduc: father's years of schooling
  unem: unem. rate in county of resid.
  city: =1 if live in SMSA
  exper: actual labor mkt exper
  nwifeinc: (faminc - wage*hours)/1000
  lwage: log(wage)
  expersq: exper^2

Used in Text: pages 249-251, 260, 294, 519-520, 530, 535, 535-536, 565-566, 578-579, 593-595, 601-603, 619-620, 625

Source: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041
working_w <- mroz %>%
    filter(inlf == 1)

wage_lm <- lm(lwage ~ exper + expersq + educ,
              data = working_w)

summary(wage_lm)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + educ, data = working_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08404 -0.30627  0.04952  0.37498  2.37115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5220406  0.1986321  -2.628  0.00890 ** 
## exper        0.0415665  0.0131752   3.155  0.00172 ** 
## expersq     -0.0008112  0.0003932  -2.063  0.03974 *  
## educ         0.1074896  0.0141465   7.598 1.94e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6664 on 424 degrees of freedom
## Multiple R-squared:  0.1568, Adjusted R-squared:  0.1509 
## F-statistic: 26.29 on 3 and 424 DF,  p-value: 1.302e-15

We see that education is statistically significant and the effect is quite high: the return to education is about 11%. Now, let’s add some more explanatory variables:

wage_lm2 <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ,
               data = working_w)

summary(wage_lm2)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + 
##     huswage + city + educ, data = working_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07431 -0.30500  0.05477  0.37871  2.31157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.3853695  0.3163043  -1.218  0.22378    
## exper        0.0398817  0.0133651   2.984  0.00301 ** 
## expersq     -0.0007400  0.0003985  -1.857  0.06402 .  
## kidslt6     -0.0564071  0.0890759  -0.633  0.52692    
## kidsge6     -0.0143165  0.0276579  -0.518  0.60499    
## husage      -0.0028828  0.0049338  -0.584  0.55934    
## huswage      0.0177470  0.0102733   1.727  0.08482 .  
## city         0.0119960  0.0725595   0.165  0.86877    
## educ         0.0986810  0.0151589   6.510 2.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6669 on 419 degrees of freedom
## Multiple R-squared:  0.1654, Adjusted R-squared:  0.1495 
## F-statistic: 10.38 on 8 and 419 DF,  p-value: 2.691e-13

The return to education lowers a bit, but is still significant. Now, the issue is that education is not exogenous (randomly assigned), and is thus correlated with the error term of the regression, due to an omitted variable for instance contained in the error term, that is correlated with education (for example work ethic).

To deal with this, econometricians use instrumental variables (IV) regression. I won’t go into detail here; just know that this method can deal with these types of issues. The Wikipedia page gives a good intro on what this is all about. This short paper is also quite interesting in introducing instrumental variables.

In practice, IV is done in two steps. First, regress the endogenous variable, in our case education, on all the explanatory variables from before, plus so-called instruments. Instruments are variables that are correlated with the endogenous variable, here education, but uncorrelated with the error term; they only affect the target variable through their correlation with the endogenous variable. We will be using the education level of the women’s parents, as well as the education level of their husbands, as instruments. The assumption is that the parents’ and the husband’s education are exogenous with respect to the woman’s log wage. This assumption can of course be challenged, but let’s say that it holds.

To conclude stage 1, we obtain the predictions of education:

first_stage <- lm(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
                  + city + motheduc + fatheduc + huseduc, data = working_w)

working_w$predictions_first_stage <- predict(first_stage)

We are now ready for the second stage. In the regression from before:

wage_lm2 <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ,
               data = working_w)

we now replace educ with the predictions of stage 1:

second_stage <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
                   + city + predictions_first_stage,
                   data = working_w)

summary(second_stage)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + 
##     huswage + city + predictions_first_stage, data = working_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.13493 -0.30004  0.03046  0.37142  2.27199 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              0.1763588  0.4206911   0.419   0.6753   
## exper                    0.0419047  0.0139885   2.996   0.0029 **
## expersq                 -0.0007881  0.0004167  -1.891   0.0593 . 
## kidslt6                 -0.0255934  0.0941128  -0.272   0.7858   
## kidsge6                 -0.0234422  0.0291914  -0.803   0.4224   
## husage                  -0.0042628  0.0051919  -0.821   0.4121   
## huswage                  0.0263802  0.0114511   2.304   0.0217 * 
## city                     0.0215685  0.0759034   0.284   0.7764   
## predictions_first_stage  0.0531993  0.0263735   2.017   0.0443 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6965 on 419 degrees of freedom
## Multiple R-squared:  0.08988,    Adjusted R-squared:  0.0725 
## F-statistic: 5.172 on 8 and 419 DF,  p-value: 3.581e-06

We see that education, now instrumented by the parents’ and the husband’s education, is still significant, but the effect is much lower: the return to education is now about 5%. However, should our assumption hold, this effect is now causal. There are some caveats, though. The IV estimate is a local average treatment effect, meaning that we only recover the effect for those individuals whose behaviour was actually changed by the instrument. In this case, it would mean that the effect we recovered only applies to women who were not planning on, say, studying, but did so under the influence of their parents (or vice-versa).

IV regression can also be achieved using the ivreg() function from the {AER} package:

inst_reg <- ivreg(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ
                  | . -educ + motheduc + fatheduc + huseduc,
                  data = working_w)

summary(inst_reg)

## 
## Call:
## ivreg(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + 
##     husage + huswage + city + educ | . - educ + motheduc + fatheduc + 
##     huseduc, data = working_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.10175 -0.30407  0.03379  0.35255  2.25107 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.1763588  0.4071522   0.433   0.6651   
## exper        0.0419047  0.0135384   3.095   0.0021 **
## expersq     -0.0007881  0.0004033  -1.954   0.0514 . 
## kidslt6     -0.0255934  0.0910840  -0.281   0.7789   
## kidsge6     -0.0234422  0.0282519  -0.830   0.4071   
## husage      -0.0042628  0.0050249  -0.848   0.3967   
## huswage      0.0263802  0.0110826   2.380   0.0177 * 
## city         0.0215685  0.0734606   0.294   0.7692   
## educ         0.0531993  0.0255247   2.084   0.0377 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6741 on 419 degrees of freedom
## Multiple R-Squared: 0.1475,  Adjusted R-squared: 0.1312 
## Wald test: 5.522 on 8 and 419 DF,  p-value: 1.191e-06

Ok, great, now let’s see how a machine learning practitioner who took an econometrics MOOC might tackle the issue. The first step will be to split the data into training and testing sets:

set.seed(42)

sample <- sample.int(n = nrow(working_w), size = floor(.90 * nrow(working_w)), replace = F)
train  <- working_w[sample, ]
test   <- working_w[-sample, ]

Let’s now run the same analysis as above, but let’s compute the RMSE of the first stage regression on the testing data as well:

first_stage <- lm(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
                  + city + motheduc + fatheduc + huseduc, data = train)

test$predictions_first_stage <- predict(first_stage, newdata = test)

lm_rmse <- rmse(predicted = test$predictions_first_stage, actual = test$educ)

train$predictions_first_stage <- predict(first_stage)

The first stage is done, let’s go with the second stage:

second_stage <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 +
                     husage + huswage + city + predictions_first_stage,
                   data = train)

summary(second_stage)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + 
##     huswage + city + predictions_first_stage, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09828 -0.28606  0.05248  0.37258  2.29947 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -0.0037711  0.4489252  -0.008  0.99330   
## exper                    0.0449370  0.0145632   3.086  0.00218 **
## expersq                 -0.0008394  0.0004344  -1.933  0.05404 . 
## kidslt6                 -0.0630522  0.0963953  -0.654  0.51345   
## kidsge6                 -0.0197164  0.0306834  -0.643  0.52089   
## husage                  -0.0034744  0.0054358  -0.639  0.52310   
## huswage                  0.0219622  0.0118602   1.852  0.06484 . 
## city                     0.0679668  0.0804317   0.845  0.39863   
## predictions_first_stage  0.0618777  0.0283253   2.185  0.02954 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6952 on 376 degrees of freedom
## Multiple R-squared:  0.1035, Adjusted R-squared:  0.08438 
## F-statistic: 5.424 on 8 and 376 DF,  p-value: 1.764e-06

The coefficients here are a bit different due to the splitting, but that’s not an issue. Ok, great, but our machine learning engineer is in love with random forests, so he wants to use a random forest for the prediction task of the first stage:

library(randomForest)

first_stage_rf <- randomForest(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
                               + city + motheduc + fatheduc + huseduc,
                               data = train)

test$predictions_first_stage_rf <- predict(first_stage_rf, newdata = test)

rf_rmse <- rmse(predicted = test$predictions_first_stage_rf, actual = test$educ)

train$predictions_first_stage_rf <- predict(first_stage_rf)

Let’s compare the RMSEs of the two first stages. The RMSE of the first stage using linear regression was 2.0558723, and for the random forest it was 2.0000417. Our machine learning engineer is happy, because the random forest has the better performance. Let’s now use the predictions for the second stage:

second_stage_rf_lm <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 +
                           husage + huswage + city + predictions_first_stage_rf,
                         data = train)

summary(second_stage_rf_lm)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + 
##     huswage + city + predictions_first_stage_rf, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0655 -0.3198  0.0376  0.3710  2.3277 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                -0.0416945  0.4824998  -0.086  0.93118   
## exper                       0.0460311  0.0145543   3.163  0.00169 **
## expersq                    -0.0008594  0.0004344  -1.978  0.04863 * 
## kidslt6                    -0.0420827  0.0952030  -0.442  0.65872   
## kidsge6                    -0.0211208  0.0306490  -0.689  0.49117   
## husage                     -0.0033102  0.0054660  -0.606  0.54514   
## huswage                     0.0229111  0.0118142   1.939  0.05322 . 
## city                        0.0688384  0.0805209   0.855  0.39314   
## predictions_first_stage_rf  0.0629275  0.0306877   2.051  0.04100 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6957 on 376 degrees of freedom
## Multiple R-squared:  0.1021, Adjusted R-squared:  0.08302 
## F-statistic: 5.346 on 8 and 376 DF,  p-value: 2.251e-06

The results are pretty similar. Now, why not go a bit further and use a random forest for the second stage as well?

Two-stage random forests

I have tried to find literature on this, but did not find anything that really fits what I’ll be doing here. Honestly, I don’t know if this is sound theoretically, but it does have intuitive appeal. Using random forests instead of linear regressions for each stage poses at least the following question: how can we interpret the results of the second stage? As you have seen above, interpretation of the coefficients and standard errors is important, and random forests do not provide this. My idea is to use explainability techniques for black-box models, such as partial dependence plots. In this setting, the whole first stage could be interpreted as a feature engineering step. Let’s do it and see what happens.

We already have the first stage from before, so let’s go straight to the second stage:

second_stage_rf_rf <- randomForest(lwage ~ exper + expersq + kidslt6 + kidsge6 +
                                     husage + huswage + city + predictions_first_stage_rf,
                                   data = train)

Let’s now use the {iml} package for explainability. Let’s start first by loading the package, defining a predictor object, and then get model-agnostic feature importance:

library("iml")predictor <- Predictor$new(  model = second_stage_rf_rf,   data = select(test, exper, expersq,                kidslt6, kidsge6,                husage, huswage, city,                predictions_first_stage_rf),   y = test$lwage,   predict.fun = predict,  class = "regression"  )

The plot below shows the ratio of the original model error and model error after permutation. A higher value indicates that this feature is important:

imp_rf <- FeatureImp$new(predictor, loss = "rmse")

plot(imp_rf)

According to this measure of feature importance, there does not seem to be any feature that is important in predicting log wage. This is similar to the result we had with linear regression: most coefficients were not statistically significant, but some were. Does that mean that we should not trust the results of linear regression? After all, how likely is it that log wages can be modeled as a linear combination of features?

Let’s see if the random forest was able to uncover strong interaction effects:

interactions <- Interaction$new(predictor, feature = "predictions_first_stage_rf")

plot(interactions)

This might seem a surprising result: education interacts strongly with the number of kids older than 6 in the household. But is it? Depending on a woman’s age, in order to have 2 or 3 kids aged between 6 and 18, she would have needed to start having them young, and thus could not have pursued a master’s degree or a PhD. The interaction strength is measured as the share of variance that is explained by the interaction.

Let’s now take a look at the partial dependence plots and individual conditional expectation curves. Let me quote the advantages of pdps from Christoph Molnar’s book on interpretable machine learning:

The calculation for the partial dependence plots has a causal interpretation. We intervene on a feature and measure the changes in the predictions. In doing so, we analyze the causal relationship between the feature and the prediction. The relationship is causal for the model – because we explicitly model the outcome as a function of the features – but not necessarily for the real world!

That sounds good. If we can defend the assumption that our instruments are valid, then the relationship between the feature and the prediction should be causal, and not only for the model. However, PDPs have a shortcoming. Again, quoting Christoph Molnar:

The assumption of independence is the biggest issue with PD plots. It is assumed that the feature(s) for which the partial dependence is computed are not correlated with other features.

Let’s take a look at the correlation of features

corr_vars <- cor(select(test, exper, expersq,
                        kidslt6, kidsge6,
                        husage, huswage, city,
                        predictions_first_stage_rf)) %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    pivot_longer(-rowname, names_to = "vars2") %>%
    rename(vars1 = rowname)

head(corr_vars)

## # A tibble: 6 x 3
##   vars1 vars2    value
##   <chr> <chr>    <dbl>
## 1 exper exper    1    
## 2 exper expersq  0.959
## 3 exper kidslt6 -0.272
## 4 exper kidsge6 -0.360
## 5 exper husage   0.487
## 6 exper huswage -0.181

corr_vars %>%
    mutate(value = abs(value)) %>%
    filter(value != 1, value > 0.2) %>%
    filter(vars1 == "predictions_first_stage_rf")

## # A tibble: 5 x 3
##   vars1                      vars2   value
##   <chr>                      <chr>   <dbl>
## 1 predictions_first_stage_rf expersq 0.243
## 2 predictions_first_stage_rf kidslt6 0.292
## 3 predictions_first_stage_rf husage  0.217
## 4 predictions_first_stage_rf huswage 0.494
## 5 predictions_first_stage_rf city    0.369

Only 5 variables have an absolute correlation greater than 0.2 with instrumented education, and the one with the highest correlation is the husband’s wage. It would seem that this situation is ideal for using PDPs and ICE curves. Before computing them, however, let’s read about ICE curves:

Individual conditional expectation curves are even more intuitive to understand than partial dependence plots. One line represents the predictions for one instance if we vary the feature of interest.

however:

ICE curves can only display one feature meaningfully, because two features would require the drawing of several overlaying surfaces and you would not see anything in the plot. ICE curves suffer from the same problem as PDPs: If the feature of interest is correlated with the other features, then some points in the lines might be invalid data points according to the joint feature distribution. If many ICE curves are drawn, the plot can become overcrowded and you will not see anything. The solution: Either add some transparency to the lines or draw only a sample of the lines. In ICE plots it might not be easy to see the average. This has a simple solution: Combine individual conditional expectation curves with the partial dependence plot. Unlike partial dependence plots, ICE curves can uncover heterogeneous relationships.

So great, let’s go:

inst_effect <- FeatureEffect$new(predictor, "predictions_first_stage_rf", method = "pdp+ice")

plot(inst_effect)

Interesting: we see that the curves are fairly similar, but there seem to be two groups, one where adding years of education increases wages, and another where the effect seems to remain constant.

Let’s try to dig a bit deeper, and get explanations for individual predictions. For this, I create two new observations that have exactly the same features, but one without children older than 6 and another with two children older than 6:

(new_obs <- data.frame(
    exper = rep(10, 2),
    expersq = rep(100, 2),
    kidslt6 = rep(1, 2),
    kidsge6 = c(0, 2),
    husage = rep(35, 2),
    huswage = rep(6, 2),
    city = rep(1, 2),
    predictions_first_stage_rf = rep(10, 2)))

##   exper expersq kidslt6 kidsge6 husage huswage city
## 1    10     100       1       0     35       6    1
## 2    10     100       1       2     35       6    1
##   predictions_first_stage_rf
## 1                         10
## 2                         10

Let’s see what the model predicts:

predict(second_stage_rf_rf, newdata = new_obs)
##        1        2 
## 1.139720 1.216423

Let’s try to understand the difference between these two predictions. For this, we will be using Shapley values as described here. Shapley values use game theory to compute the contribution of each feature towards the prediction of one particular observation. Interpretation of the Shapley values is as follows (quoting Christoph Molnar’s book): Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value.

Let’s compute the Shapley values of all the features:

shapley_1 <- Shapley$new(predictor, x.interest = new_obs[1, ], sample.size = 100)
shapley_2 <- Shapley$new(predictor, x.interest = new_obs[2, ], sample.size = 100)
plot(shapley_1)

plot(shapley_2)

The average prediction is 1.21 and the prediction for the first new observation is 1.14, which is 0.07 below the average prediction. This difference of 0.07 is the sum of the Shapley values. For the second observation, the prediction is 1.22, so 0.01 above the average prediction. The order and magnitude of the contributions are not the same either; and surprisingly, the contribution of the instrumented education to the prediction is negative.
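If you want to verify that decomposition numerically, the contributions can be pulled out of the iml objects; a quick sketch, assuming the results data frame with a phi column that iml’s Shapley objects expose:

# The Shapley contributions should sum (approximately, since they are estimated
# by sampling) to the gap between this observation's prediction and the average
# prediction over the data given to the Predictor (here, the test set)
sum(shapley_1$results$phi)

predict(second_stage_rf_rf, newdata = new_obs[1, ]) -
  mean(predict(second_stage_rf_rf, newdata = test))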

Ok, let’s end this here. I’m quite certain that explainability methods will help econometricians adopt more machine learning methods in the future, and I am also excited to see research on causality in machine learning and AI continue.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub.




Learning Linux – the wrong way – day 2


[This article was first published on HighlandR, and kindly contributed to R-bloggers.]

Unborking the borked laptop –

Recap

I’m trying to learn some Linux, ostensibly to do some data science at the command line, because it feels like something I might need to know at some point. I have reclaimed an old Windows laptop (poorly specced, with missing keys and a penchant for sending the cursor up several lines at once with no prior warning), and I installed an Ubuntu distribution on it.

I then decided to update the distribution, and accidentally lost power during that install. So now, I have a laptop with no Windows, no Linux, and it won’t boot

Day 2

Today the clocks go back 1 hour. That’s supposed to mean an extra hour in bed. That doesn’t happen when you have kids though, so, having staggered to bed at a stupid time in the morning, I am being woken a few hours later. Miraculously, there is no fighting or squabbling (amongst the kids that is), so once breakfast is out the way, I pick up the laptop and retrace my steps from the previous night:

  • install 2015 version of Ubuntu from cover disk
  • fix the broken APT, which I later discover means Advanced Packaging Tool. Without it, I can’t update.
  • Perform a proper installation of Ubuntu 2016 LTS.

At last, I have a decent installation. I take a look at the software centre, which is where you find new programs and tools, and see some new options available. A python interpreter is already installed, Julia is available, and I eventually find an older version of R, so of course, I install that.

It really is old though, so I find myself browsing to CRAN and downloading the latest version. Thankfully, I don’t need to do any command line kung fu to get it installed, which is kind of cheating, but I just want it on there.

Having fired it up, and being glad it worked, I decide to go for RStudio as well.

This is where things start to get interesting – it doesn’t install because of missing / out of date files.

So it’s back to firing up the terminal, and apt-get install to try and find the stuff I need. Ubuntu is pretty good at telling you what is going on, and what the problems are, so I didn’t feel too lost. There was a bit of googling going on, but nothing out of the ordinary.

I’ve written an R package, which builds fine in Windows. I tried a random Linux build using RHub’s online tool, and it failed. So, I wondered if I could download my package from github and build it in Ubuntu.

Once I’d got RStudio installed, I began installing packages. I figured the biggest bang for my buck was installing tidyverse. Blimey, that took ages. A few things failed (dependencies of dependencies), so there were some false starts getting everything needed, and then it began to build the package.

I had no idea it would take so long. Seriously, it feels like Windows has the edge here.

My first plot in Ubuntu

It’s worth reading that thread as Chris Beeley, Paul Drake and others all chipped in with really good advice.

Various other packages were installed. Data.table was a breeze. Eventually I had what I needed ( data.table, zoo, ggplot2 and magrittr) so I downloaded my package files.

I opened them up in RStudio, and verified that it passed all checks, then built it.

runcharter

Success, package built, and working in Linux.

I tweeted out some pics, and spawned a bit of discussion amongst several other NHS colleagues – you know who you are. Later on that night, I tried to burn Linux Mint to disk. I used the built in ‘brasero’ disk burner, but it failed. Possibly due to a combination of poor quality disk, and a rickety CD/DVD drive.

Apparently, I should be using a usb-key for this, but it’s been years since I had need for a usb-drive, so did not have one to hand.

I finished off by browsing through an ebook I picked up for free on Amazon (Linux for Beginners by Jason Cannon happened to be on offer a day or so earlier)

It starts off by explaining file directories, and how to navigate them, which is definitely going to take a bit of practice, for someone so used to ‘point and click’ navigation.

So, to recap, some slight progress:

  • 2016 LTS installed
  • R, RStudio and VS Code installed
  • Checked and built my R package on Ubuntu – woohoo
  • Informed that Mint was the way to go, as was usb installation
  • Lots of practice with apt-get upgrade, apt-get install and apt autoremove


Using Spark from R for performance with arbitrary code – Part 4 – Using the lower-level invoke API to manipulate Spark’s Java objects from R


[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers.]

Introduction

In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed.

In this fourth part, we will look at how to write R functions that interface with Spark via a lower-level invocation API that lets us use all the functionality that is exposed by the Scala Spark APIs. We will also show how such R calls relate to Scala code.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post; please check the first part of the series for setup instructions and a ready-made Docker image.

If you have Docker available, running

docker run -d -p 8787:8787 -e PASSWORD=pass --name rstudio jozefhajnala/sparkly:add-rstudio

should make RStudio available by navigating to http://localhost:8787 in your browser. You can then use the user name rstudio and password pass to log in and continue experimenting with the code in this post.

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>%
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc,
  df = weather,
  name = "weather",
  overwrite = TRUE
)

# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc,
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)

The invoke() API of sparklyr

So far when interfacing with Spark from R, we have used the sparklyr package in three ways:

  • Writing combinations of dplyr verbs that would be translated to Spark SQL via the dbplyr package and the SQL executed by Spark when requested
  • Generating Spark SQL code directly and sending it for execution in multiple ways
  • Combinations of the above two methods

What these methods have in common is that they translate operations written in R to Spark SQL and that SQL code is then sent for execution by our Spark instance.

There is, however, another approach that we can use with sparklyr, which will be more familiar to users or developers who have worked with packages like rJava or rscala before. Even though arguably less convenient than the APIs provided by the two aforementioned packages, sparklyr provides an invocation API that exposes 3 functions, each illustrated with a minimal example below the list:

  1. invoke(jobj, method, ...) to execute a method on a Java object reference
  2. invoke_static(sc, class, method, ...) to execute a static method associated with a Java class
  3. invoke_new(sc, class, ...) to invoke a constructor associated with a Java class
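To see all three in action before we turn to DataFrames, here is a minimal, self-contained illustration. It follows the patterns shown in sparklyr’s own documentation on extending sparklyr; the Java classes used are just convenient examples:

# Construct a Java object via its constructor, call an instance method on it,
# and call a static method of a Java class
billion <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(billion, "longValue")                           # the value back in R: 1e+09
invoke_static(sc, "java.lang.Math", "hypot", 10, 20)   # sqrt(10^2 + 20^2) = 22.36068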
Apache Spark and R logos

Let us have a look at how we can use those functions in practice to efficiently work with Spark from R.

Getting started with the invoke API

We can start with a few very simple examples of invoke() usage, for instance getting the number of rows of the tbl_flights:

# Get the count of rows
tbl_flights %>% spark_dataframe() %>%
  invoke("count")
## [1] 336776

We see one extra operation before invoking the count: spark_dataframe(). This is because the invoke() interface works with Java object references and not tbl objects in remote sources such as tbl_flights. We, therefore, need to convert tbl_flights to a Java object reference, for which we use the spark_dataframe() function.

Now, for something more exciting, let us compute a summary of the variables in tbl_flights using the describe method:

tbl_flights_summary <- tbl_flights %>%
  spark_dataframe() %>%
  invoke("describe", as.list(colnames(tbl_flights))) %>%
  sdf_register()

tbl_flights_summary
## # Source: spark [?? x 19]##   summary year  month day   dep_time sched_dep_time dep_delay arr_time##                               ## 1 count   3367… 3367… 3367… 328521   336776         328521    328063  ## 2 mean    2013… 6.54… 15.7… 1349.10… 1344.25484001… 12.63907… 1502.05…## 3 stddev  0.0   3.41… 8.76… 488.281… 467.335755734… 40.21006… 533.264…## 4 min     2013  1     1     1        106            -43.0     1       ## 5 max     2013  12    31    2400     2359           1301.0    2400    ## # … with 11 more variables: sched_arr_time , arr_delay ,## #   carrier , flight , tailnum , origin , dest ,## #   air_time , distance , hour , minute 

We also see one extra operation after invoking the describe method: sdf_register(). This is because the invoke() interface also returns Java object references, while we may prefer a more user-friendly tbl object instead. This is where sdf_register() comes in: it registers a Spark DataFrame and returns a tbl_spark object back to us.

And indeed, we can see that the wrapper sdf_describe() provided by the sparklyr package itself works in a very similar fashion:

sparklyr::sdf_describe
## function(x, cols = colnames(x)) {
##   in_df <- cols %in% colnames(x)
##   if (any(!in_df)) {
##     msg <- paste0("The following columns are not in the data frame: ",
##                   paste0(cols[which(!in_df)], collapse = ", "))
##     stop(msg)
##   }
##   cols <- cast_character_list(cols)
##
##   x %>%
##     spark_dataframe() %>%
##     invoke("describe", cols) %>%
##     sdf_register()
## }

If we so wish, for DataFrame related object references, we can also call collect() to retrieve the results directly, without using sdf_register() first, for instance retrieving the full content of the origin column:

tbl_flights %>%
  spark_dataframe() %>%
  invoke("select", "origin", list()) %>%
  collect()
## # A tibble: 336,776 x 1##    origin##     ##  1 EWR   ##  2 LGA   ##  3 JFK   ##  4 JFK   ##  5 LGA   ##  6 EWR   ##  7 EWR   ##  8 LGA   ##  9 JFK   ## 10 LGA   ## # … with 336,766 more rows

It can also be helpful to investigate the schema of our DataFrame:

tbl_flights %>%
  spark_dataframe() %>%
  invoke("schema")
## ##   org.apache.spark.sql.types.StructType##   StructType(StructField(year,IntegerType,true), StructField(month,IntegerType,true), StructField(day,IntegerType,true), StructField(dep_time,IntegerType,true), StructField(sched_dep_time,IntegerType,true), StructField(dep_delay,DoubleType,true), StructField(arr_time,IntegerType,true), StructField(sched_arr_time,IntegerType,true), StructField(arr_delay,DoubleType,true), StructField(carrier,StringType,true), StructField(flight,IntegerType,true), StructField(tailnum,StringType,true), StructField(origin,StringType,true), StructField(dest,StringType,true), StructField(air_time,DoubleType,true), StructField(distance,DoubleType,true), StructField(hour,DoubleType,true), StructField(minute,DoubleType,true), StructField(time_hour,TimestampType,true))

We can also use the invoke interface on other objects, for instance the SparkContext. Let’s for instance retrieve the uiWebUrl of our context:

sc %>%
  spark_context() %>%
  invoke("uiWebUrl") %>%
  invoke("toString")
## [1] "Some(http://localhost:4040)"

Grouping and aggregation with invoke chains

Imagine we would like to do simple aggregations of a Spark DataFrame, such as an average of a column grouped by another column. For reference, we can do this very simply using the dplyr approach. Let’s compute the average departure delay by origin of the flight:

tbl_flights %>%
  group_by(origin) %>%
  summarise(avg(dep_delay))
## # Source: spark [?? x 2]##   origin `avg(dep_delay)`##                ## 1 EWR                15.1## 2 JFK                12.1## 3 LGA                10.3

Now we will show how to do the same aggregation via the lower level API. Using the Spark shell we would simply do:

flights.
  groupBy("origin").
  agg(avg("dep_delay"))

Translating that into the lower level invoke() API provided by sparklyr looks something like this:

tbl_flights %>%
  spark_dataframe() %>%
  invoke("groupBy", "origin", list()) %>%
  invoke(
    "agg",
    invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)"),
    list()
  ) %>%
  sdf_register()

What is all that extra code?

Now, compared to the two very simple operations in the Scala version, we have some gotchas to examine:

  • one of the invoke() calls is quite long. Instead of just avg("dep_delay") as in the Scala example, we use invoke_static(sc, "org.apache.spark.sql.functions", "expr", "avg(dep_delay)"). This is because avg("dep_delay") is a bit of syntactic sugar provided by Scala; when calling from R, we need to provide the object reference hidden behind that sugar.

  • the empty list() at the end of the "groupBy" and "agg" invokes. This is needed as a workaround: some Scala methods take String, String* as arguments and sparklyr currently does not support variable-length arguments, so we pass list() to represent the empty String[] that is needed as the second argument.

Wrapping the invocations into R functions

Seeing the above example, we can quickly write a useful wrapper to ease the pain a little. First, we can create a small function that will generate the aggregation expression we can use with invoke("agg", ...):

agg_expr <- function(tbl, exprs) {
  sparklyr::invoke_static(
    tbl[["src"]][["con"]],
    "org.apache.spark.sql.functions",
    "expr",
    exprs
  )
}

Next, we can wrap around the entire process to make a more generic aggregation function, using the fact that a remote tibble has the details on sc within its tbl[["src"]][["con"]] element:

grpagg_invoke <- function(tbl, colName, groupColName, aggOperation) {
  avgColumn <- tbl %>% agg_expr(paste0(aggOperation, "(", colName, ")"))
  tbl %>%
    spark_dataframe() %>%
    invoke("groupBy", groupColName, list()) %>%
    invoke("agg", avgColumn, list()) %>%
    sdf_register()
}

And finally use our wrapper to get the same results in a more user-friendly way:

tbl_flights %>%   grpagg_invoke("arr_delay", groupColName = "origin", aggOperation = "avg")
## # Source: spark [?? x 2]##   origin `avg(arr_delay)`##                ## 1 EWR                9.11## 2 JFK                5.55## 3 LGA                5.78

Reconstructing variable normalization

Now we will attempt to construct the variable normalization that we have shown in the previous parts with dplyr verbs and SQL generation – we will normalize the values of a column by first subtracting the mean value and then dividing the values by the standard deviation:

normalize_invoke <- function(tbl, colName) {
  sdf <- tbl %>% spark_dataframe()
  stdCol <- agg_expr(tbl, paste0("stddev_samp(", colName, ")"))
  avgCol <- agg_expr(tbl, paste0("avg(", colName, ")"))
  avgTemp <- sdf %>% invoke("agg", avgCol, list()) %>% invoke("first")
  stdTemp <- sdf %>% invoke("agg", stdCol, list()) %>% invoke("first")
  newCol <- sdf %>%
    invoke("col", colName) %>%
    invoke("minus", as.numeric(avgTemp)) %>%
    invoke("divide", as.numeric(stdTemp))
  sdf %>%
    invoke("withColumn", colName, newCol) %>%
    sdf_register()
}

tbl_weather %>% normalize_invoke("temp")
## # Source: spark [?? x 16]##       id origin  year month   day  hour   temp  dewp humid wind_dir##                  ##  1     1 EWR     2013     1     1     1 -0.913  26.1  59.4      270##  2     2 EWR     2013     1     1     2 -0.913  27.0  61.6      250##  3     3 EWR     2013     1     1     3 -0.913  28.0  64.4      240##  4     4 EWR     2013     1     1     4 -0.862  28.0  62.2      250##  5     5 EWR     2013     1     1     5 -0.913  28.0  64.4      260##  6     6 EWR     2013     1     1     6 -0.974  28.0  67.2      240##  7     7 EWR     2013     1     1     7 -0.913  28.0  64.4      240##  8     8 EWR     2013     1     1     8 -0.862  28.0  62.2      250##  9     9 EWR     2013     1     1     9 -0.862  28.0  62.2      260## 10    10 EWR     2013     1     1    10 -0.802  28.0  59.6      260## # … with more rows, and 6 more variables: wind_speed ,## #   wind_gust , precip , pressure , visib ,## #   time_hour 

The above implementation is just an example and far from optimal, but it also has a few interesting points about it:

  • Using invoke("first") will actually compute and collect the value into the R session
  • Those collected values are then sent back to Spark during the invoke("minus", as.numeric(avgTemp)) and invoke("divide", as.numeric(stdTemp)) calls

This means that there is unnecessary overhead when sending those values from the Spark instance into R and back, which will have slight performance penalties.
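For comparison, a dplyr formulation along the lines shown in the earlier parts keeps the whole computation in Spark: aggregate functions used inside mutate() are translated to window functions, so nothing is collected into the R session. A minimal sketch, not a drop-in replacement for the invoke version:

tbl_weather %>%
  mutate(temp = (temp - mean(temp, na.rm = TRUE)) / sd(temp, na.rm = TRUE))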

Where invoke can be better than dplyr translation or SQL

As we have seen in the above examples, working with the invoke() API can prove more difficult than using the intuitive syntax of dplyr or SQL queries. In some use cases, the trade-off may still be worth it. In our practice, these are some examples of such situations:

  • When Scala’s Spark API is more flexible, powerful or suitable for a particular task and the translation is not as good
  • When performance is crucial and we can produce more optimal solutions using the invocations
  • When we know the Scala API well and do not want to invest time in learning the dplyr syntax, because it is easier to translate the Scala calls into a series of invoke() calls
  • When we need to interact and manipulate other Java objects apart from the standard Spark DataFrames

Conclusion

In this part of the series, we have looked at how to use the lower-level invoke interface provided by sparklyr to manipulate Spark objects and other Java object references. In the following part, we will dig a bit deeper and look into using Java’s reflection API to make the invoke interface more accessible from R, getting detail invocation logs and more.


To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.


Rcpp 1.0.3: More Spit and Polish


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

rcpp logo

The third maintenance release 1.0.3 of Rcpp, following up on the 10th anniversary and the 1.0.0. release both pretty much exactly one year ago, arrived on CRAN yesterday. This deserves a special shoutout to Uwe Ligges who was even more proactive and helpful than usual. Rcpp is a somewhat complex package with many reverse dependencies, and both the initial check tickles one (grandfathered) NOTE, and the reverse dependencies typically invoke a few false positives too. And in both cases did he move the process along before I even got around to replying to the auto-generated emails. So just a few hours passed between my upload, and the Thanks, on its way to CRAN email—truly excellent work of the CRAN team. Windows and macOS binaries are presumably being built now. The corresponding Debian package was also uploaded as a source package, and binaries have since been built.

Just like for Rcpp 1.0.1 and Rcpp 1.0.2, we have a four-month gap between releases, which seems appropriate given both the changes still being made (see below) and the relative stability of Rcpp. It still takes work to release this, as we run multiple extensive sets of reverse-dependency checks, so maybe one day we will switch to a six-month cycle. For now, four months seems like a good pace.

Rcpp has become the most popular way of enhancing R with C or C++ code. As of today, 1832 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with 190 in BioConductor. And per the (partial) logs of CRAN downloads, we are currently running at 1.1 million downloads per month.

This release features a number of different pull requests by five different contributors as detailed below.

Changes in Rcpp version 1.0.3 (2019-11-08)

  • Changes in Rcpp API:

    • Compilation can be sped up by skipping Modules headers via a toggle RCPP_NO_MODULES (Kevin in #995 for #993).

    • Compilation can be sped up by toggling RCPP_NO_RTTI which implies RCPP_NO_MODULES (Dirk in #998 fixing #997).

    • XPtr tags are now preserved in as<> (Stephen Wade in #1003 fixing #986, plus Dirk in #1012).

    • A few more temporary allocations are now protected from garbage collection (Romain Francois in #1010, and Dirk in #1011).

  • Changes in Rcpp Modules:

    • Improved initialization via explicit Rcpp:: prefix (Riccardo Porreca in #980).
  • Changes in Rcpp Deployment:

    • A unit test for Rcpp Class exposure was updated to not fail under r-devel (Dirk in #1008 fixing #1006).
  • Changes in Rcpp Documentation:

    • The Rcpp-modules vignette received a major review and edit (Riccardo Porreca in #982).

    • Minor whitespace alignments and edits were made in three vignettes following the new pinp release (Dirk).

    • New badges for DOI and CRAN and BioConductor reverse dependencies have been added to README.md (Dirk).

    • Vignettes are now included pre-made (Dirk in #1005 addressing #1004).

    • The Rcpp FAQ has two new entries on ‘no modules / no rtti’ and exceptions across shared libraries (Dirk in #1009).

Thanks to CRANberries, you can also look at a diff to the previous release. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page. Bugs reports are welcome at the GitHub issue tracker as well (where one can also search among open or closed issues); questions are also welcome under rcpp tag at StackOverflow which also allows searching among the (currently) 2255 previous questions.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Reading in Data


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Here’s a common situation: you have a folder full of similarly-formatted CSV or otherwise structured text files that you want to get into R quickly and easily. Reading data into R is one of those tasks that can be a real source of frustration for beginners, so I like collecting real-life examples of the many ways it’s become much easier.

This week in class I was working with country-level historical mortality rate estimates. These are available from mortality.org, a fabulous resource. They have a variety of data available but I was interested in the 1×1 year estimates of mortality for all available countries. By “1×1” I mean that the tables show (for men, women, and in total) age-specific mortality rate estimates for yearly ages from 0 to 110 and above, for every available historical year (e.g. from 1850 to 2016 or what have you). So you can have an estimate of the mortality rate for, say, 28 year olds in France in 1935.

Downloading this data gives me a folder of text files, one for each country. (Or rather, country-like units: there are separate series for, e.g., East Germany, West Germany, and Germany as a whole, along with some countries where sub-populations are broken out historically.) The names of the files are consistently formatted, as is the data inside them, and they have a .txt extension. What I wanted to do was get each one of these files into R, ideally putting them all into a single big table that could be the jumping-off point for subsetting and further analysis.

I know from the documentation provided by mortality.org that the files all have the same basic format, which of course makes things much easier. The data is already clean. It’s just a matter of loading it all in efficiently, or “ingesting” it, to use the charming image that seems to be preferred at present.

Here we go. First, some libraries.

library(tidyverse)
library(janitor)
library(here)
## here() starts at /Users/kjhealy/Source/demog

We get a list of the filenames in our raw data folder, along with their full paths. Then we take a look at them.


filenames <-dir(path = here("rawdata"),
                 pattern ="*.txt",
                 full.names =TRUE)

filenames

##  [1] "/Users/kjhealy/Source/demog/rawdata/AUS.Mx_1x1.txt"    ##  [2] "/Users/kjhealy/Source/demog/rawdata/AUT.Mx_1x1.txt"    ##  [3] "/Users/kjhealy/Source/demog/rawdata/BEL.Mx_1x1.txt"    ##  [4] "/Users/kjhealy/Source/demog/rawdata/BGR.Mx_1x1.txt"    ##  [5] "/Users/kjhealy/Source/demog/rawdata/BLR.Mx_1x1.txt"    ##  [6] "/Users/kjhealy/Source/demog/rawdata/CAN.Mx_1x1.txt"    ##  [7] "/Users/kjhealy/Source/demog/rawdata/CHE.Mx_1x1.txt"    ##  [8] "/Users/kjhealy/Source/demog/rawdata/CHL.Mx_1x1.txt"    ##  [9] "/Users/kjhealy/Source/demog/rawdata/CZE.Mx_1x1.txt"    ## [10] "/Users/kjhealy/Source/demog/rawdata/DEUTE.Mx_1x1.txt"  ## [11] "/Users/kjhealy/Source/demog/rawdata/DEUTNP.Mx_1x1.txt" ## [12] "/Users/kjhealy/Source/demog/rawdata/DEUTW.Mx_1x1.txt"  ## [13] "/Users/kjhealy/Source/demog/rawdata/DNK.Mx_1x1.txt"    ## [14] "/Users/kjhealy/Source/demog/rawdata/ESP.Mx_1x1.txt"    ## [15] "/Users/kjhealy/Source/demog/rawdata/EST.Mx_1x1.txt"    ## [16] "/Users/kjhealy/Source/demog/rawdata/FIN.Mx_1x1.txt"    ## [17] "/Users/kjhealy/Source/demog/rawdata/FRACNP.Mx_1x1.txt" ## [18] "/Users/kjhealy/Source/demog/rawdata/FRATNP.Mx_1x1.txt" ## [19] "/Users/kjhealy/Source/demog/rawdata/GBR_NIR.Mx_1x1.txt"## [20] "/Users/kjhealy/Source/demog/rawdata/GBR_NP.Mx_1x1.txt" ## [21] "/Users/kjhealy/Source/demog/rawdata/GBR_SCO.Mx_1x1.txt"## [22] "/Users/kjhealy/Source/demog/rawdata/GBRCENW.Mx_1x1.txt"## [23] "/Users/kjhealy/Source/demog/rawdata/GBRTENW.Mx_1x1.txt"## [24] "/Users/kjhealy/Source/demog/rawdata/GRC.Mx_1x1.txt"    ## [25] "/Users/kjhealy/Source/demog/rawdata/HRV.Mx_1x1.txt"    ## [26] "/Users/kjhealy/Source/demog/rawdata/HUN.Mx_1x1.txt"    ## [27] "/Users/kjhealy/Source/demog/rawdata/IRL.Mx_1x1.txt"    ## [28] "/Users/kjhealy/Source/demog/rawdata/ISL.Mx_1x1.txt"    ## [29] "/Users/kjhealy/Source/demog/rawdata/ISR.Mx_1x1.txt"    ## [30] "/Users/kjhealy/Source/demog/rawdata/ITA.Mx_1x1.txt"    ## [31] "/Users/kjhealy/Source/demog/rawdata/JPN.Mx_1x1.txt"    ## [32] "/Users/kjhealy/Source/demog/rawdata/KOR.Mx_1x1.txt"    ## [33] "/Users/kjhealy/Source/demog/rawdata/LTU.Mx_1x1.txt"    ## [34] "/Users/kjhealy/Source/demog/rawdata/LUX.Mx_1x1.txt"    ## [35] "/Users/kjhealy/Source/demog/rawdata/LVA.Mx_1x1.txt"    ## [36] "/Users/kjhealy/Source/demog/rawdata/NLD.Mx_1x1.txt"    ## [37] "/Users/kjhealy/Source/demog/rawdata/NOR.Mx_1x1.txt"    ## [38] "/Users/kjhealy/Source/demog/rawdata/NZL_MA.Mx_1x1.txt" ## [39] "/Users/kjhealy/Source/demog/rawdata/NZL_NM.Mx_1x1.txt" ## [40] "/Users/kjhealy/Source/demog/rawdata/NZL_NP.Mx_1x1.txt" ## [41] "/Users/kjhealy/Source/demog/rawdata/POL.Mx_1x1.txt"    ## [42] "/Users/kjhealy/Source/demog/rawdata/PRT.Mx_1x1.txt"    ## [43] "/Users/kjhealy/Source/demog/rawdata/RUS.Mx_1x1.txt"    ## [44] "/Users/kjhealy/Source/demog/rawdata/SVK.Mx_1x1.txt"    ## [45] "/Users/kjhealy/Source/demog/rawdata/SVN.Mx_1x1.txt"    ## [46] "/Users/kjhealy/Source/demog/rawdata/SWE.Mx_1x1.txt"    ## [47] "/Users/kjhealy/Source/demog/rawdata/TWN.Mx_1x1.txt"    ## [48] "/Users/kjhealy/Source/demog/rawdata/UKR.Mx_1x1.txt"    ## [49] "/Users/kjhealy/Source/demog/rawdata/USA.Mx_1x1.txt"

What does each of these files look like? Let’s take a look at the first one, using read_lines() to show us the top of the file.


read_lines(filenames[1], n_max = 5)

## [1] "Australia, Death rates (period 1x1), \tLast modified: 26 Sep 2017;  Methods Protocol: v6 (2017)"
## [2] ""
## [3] "  Year          Age             Female            Male           Total"
## [4] "  1921           0             0.059987        0.076533        0.068444"
## [5] "  1921           1             0.012064        0.014339        0.013225"

All the files have a header section like this. When we read the data in we’ll want to ignore that and go straight to the data. But seeing as it’s there, we can make use of it to grab the name of the country. It saves us typing it ourselves. Let’s say we’d also like to have a code-friendly version of those names (i.e., in lower-case with underscores instead of spaces). And finally—while we’re at it—let’s grab those all-caps country codes used in the file names, too. We write three functions:

  • get_country_name() grabs the first word or words on the first line of each file, up to the first comma. That’s our country name.
  • shorten_name() makes the names lower-case and replaces spaces with underscores, and also shortens “The United States of America” to “USA”.
  • make_ccode() wraps a regular expression that finds and extracts the capitalized country codes in the file names.


get_country_name <-function(x){
    read_lines(x, n_max =1)%>%
        str_extract(".+?,")%>%
        str_remove(",")}

shorten_name <-function(x){
    str_replace_all(x," -- "," ")%>%
        str_replace("The United States of America","USA")%>%
        snakecase::to_any_case()}

make_ccode <-function(x){
    str_extract(x,"[:upper:]+((?=\\.))")}

Now we create a tibble of summary information by mapping the functions to the filenames.



countries <- tibble(country = map_chr(filenames, get_country_name),
                        cname = map_chr(country, shorten_name),
                        ccode = map_chr(filenames, make_ccode),
                        path = filenames)

countries

## # A tibble: 49 x 4##    country     cname       ccode path                                      ##                                                        ##  1 Australia   australia   AUS   /Users/kjhealy/Source/demog/rawdata/AUS.M…##  2 Austria     austria     AUT   /Users/kjhealy/Source/demog/rawdata/AUT.M…##  3 Belgium     belgium     BEL   /Users/kjhealy/Source/demog/rawdata/BEL.M…##  4 Bulgaria    bulgaria    BGR   /Users/kjhealy/Source/demog/rawdata/BGR.M…##  5 Belarus     belarus     BLR   /Users/kjhealy/Source/demog/rawdata/BLR.M…##  6 Canada      canada      CAN   /Users/kjhealy/Source/demog/rawdata/CAN.M…##  7 Switzerland switzerland CHE   /Users/kjhealy/Source/demog/rawdata/CHE.M…##  8 Chile       chile       CHL   /Users/kjhealy/Source/demog/rawdata/CHL.M…##  9 Czechia     czechia     CZE   /Users/kjhealy/Source/demog/rawdata/CZE.M…## 10 East Germa… east_germa… DEUTE /Users/kjhealy/Source/demog/rawdata/DEUTE…## # … with 39 more rows

Nice. We could have written each of those operations as anonymous functions directly inside of map_chr(). This would have been more compact. But often it can be useful to break out the steps as shown here, for clarity—especially if map() operations have a tendency to break your brain, as they do mine.
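For illustration, the inlined version mentioned above could look something like the following; the name countries_inline is made up, and the logic is just the three helpers collapsed into formulas:

countries_inline <- tibble(
  country = map_chr(filenames,
                    ~ str_remove(str_extract(read_lines(.x, n_max = 1), ".+?,"), ",")),
  cname   = map_chr(country,
                    ~ snakecase::to_any_case(
                        str_replace(str_replace_all(.x, " -- ", " "),
                                    "The United States of America", "USA"))),
  ccode   = map_chr(filenames, ~ str_extract(.x, "[:upper:]+((?=\\.))")),
  path    = filenames)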

We still haven’t touched the actual data files, of course. But now we can just use this countries table as the basis for reading in, I mean ingesting, everything in the files. We’re going to just add a list column named data to the end of the table and put the data for each country in it. We’ll temporarily unnest it to clean the column names and recode the age variable, then drop the file paths column and nest the data again.

The hard work is done by the map() call. This time we will use ~ formula notation inside map() to write what we want to do. We’re going to feed every filename in path to read_table(), one at a time. We tell read_table() to skip the first two lines of every file it reads, and also tell it that in these files missing data are represented by a . character. Everything read in ends up in a new list column named data.



mortality <- countries %>%
    mutate(data = map(path,~ read_table(., skip =2, na =".")))%>%
    unnest(cols =c(data))%>%
    clean_names()%>%
    mutate(age =as.integer(recode(age,"110+"="110")))%>%
    select(-path)%>%
    nest(data =c(year:total))

mortality


## # A tibble: 49 x 4##    country      cname        ccode           data##                     >##  1 Australia    australia    AUS     [10,434 × 5]##  2 Austria      austria      AUT      [7,881 × 5]##  3 Belgium      belgium      BEL     [19,425 × 5]##  4 Bulgaria     bulgaria     BGR      [7,104 × 5]##  5 Belarus      belarus      BLR      [6,438 × 5]##  6 Canada       canada       CAN     [10,101 × 5]##  7 Switzerland  switzerland  CHE     [15,651 × 5]##  8 Chile        chile        CHL      [1,887 × 5]##  9 Czechia      czechia      CZE      [7,437 × 5]## 10 East Germany east_germany DEUTE    [6,660 × 5]## # … with 39 more rows

And we’re done. Forty nine tables of data smoothly imported and bundled together. Each of the country-level data tables is a row in data that we can take a look at as we like:



mortality %>% 
  filter(country == "Austria") %>%
  unnest(cols = c(data))

## # A tibble: 7,881 x 8
##    country cname   ccode  year   age   female     male    total
##
##  1 Austria austria AUT    1947     0 0.0798   0.0994   0.0899
##  2 Austria austria AUT    1947     1 0.00657  0.00845  0.00753
##  3 Austria austria AUT    1947     2 0.00425  0.00469  0.00447
##  4 Austria austria AUT    1947     3 0.00337  0.00340  0.00339
##  5 Austria austria AUT    1947     4 0.00235  0.00270  0.00253
##  6 Austria austria AUT    1947     5 0.00174  0.00195  0.00184
##  7 Austria austria AUT    1947     6 0.00131  0.00152  0.00142
##  8 Austria austria AUT    1947     7 0.00132  0.00169  0.00151
##  9 Austria austria AUT    1947     8 0.00115  0.00149  0.00132
## 10 Austria austria AUT    1947     9 0.000836 0.000997 0.000918
## # … with 7,871 more rows

Now you can get on with the actual analysis.

There isn’t anything especially unusual in the steps shown here. It’s just a pretty common operation that’s worth knowing how to do cleanly. One nice thing about this approach is that it’s immediately applicable to, say, a folder containing the 5-year mortality estimates rather than the 1 year estimates. You don’t have to do anything new, and there’s no mucking around with manually naming files and so on.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


future 1.15.0 – Lazy Futures are Now Launched if Queried


[This article was first published on JottR on R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

No dogs were harmed while making this release

future 1.15.0 is now on CRAN, accompanied by a recent, related update of future.callr 0.5.0. The main update is a change to the Future API:

resolved() will now also launch lazy futures

Although this change does not look like much to the world, I’d like to think of this as part of a young person slowly finding themselves. This change in behavior helps us in cases where we create lazy futures upfront;

fs <- lapply(X, future, lazy = TRUE)

Such futures remain dormant until we call value() on them, or, as of this release, when we call resolved() on them. Contrary to value(), resolved() is a non-blocking function that allows us to check in on one or more futures to see if they are resolved or not. So, we can now do:

while (!all(resolved(fs))) {
  do_something_else()
}

to run that loop until all futures are resolved. Any lazy future that is still dormant will be launched when queried the first time. Previously, we would have had to write specialized code for the lazy=TRUE case to trigger lazy futures to launch. If not, the above loop would have run forever. This change means that the above design pattern works the same regardless of whether we use lazy=TRUE or lazy=FALSE (default). There is now one less thing to worry about when working with futures. Less mental friction should be good.
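Putting the two snippets together, a self-contained toy example of the pattern might look like this (the Sys.sleep() calls just stand in for real work):

library(future)
plan(multisession)

X <- 1:4
## Create the futures upfront without launching them
fs <- lapply(X, function(x) future({ Sys.sleep(1); sqrt(x) }, lazy = TRUE))

## As of future 1.15.0, the first resolved() call launches the lazy futures,
## so this loop terminates also for lazy = TRUE
while (!all(resolved(fs))) {
  Sys.sleep(0.1)  # do_something_else()
}
vs <- lapply(fs, value)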

What else?

The Future API now guarantees that value() relays the “visibility” of a future’s value. For example,

> f <- future(invisible(42))
> value(f)
> v <- value(f)
> v
[1] 42

Other than that, I have fixed several non-critical bugs and improved some documentation. See news(package="future") or NEWS for all updates.

What’s next?

  • I’ll be talking about futures at rstudio::conf 2020 (San Francisco, CA, USA) at the end of January 2020. Please come and say hi – I am keen to hear your R story.

  • I will wrap up the deliverables for the project Future Minimal API: Specification with Backend Conformance Test Suite sponsored by the R Consortium. This project helps to robustify the future ecosystem and validate that all backends fulfill the Future API specification. It also serves to refine the Future API specifications. For example, the above change to resolved() resulted from this project.

  • The maintainers of foreach plan to harmonize how foreach() identifies global variables with how the future framework identifies them. The idea is to migrate foreach to use the same approach as future, which relies on the globals package. If you’re curious, you can find out more about this over at the foreach issue tracker. Yeah, the foreach issue tracker is a fairly recent thing – it’s a great addition.

  • The progressr package (GitHub only) is a proof-of-concept and a working prototype showing how to signal progress updates when doing parallel processing. It works out of the box with the core Future API and higher-level Future APIs such as future.apply, foreach with doFuture, furrr, and plyr – regardless of what parallel backend is being used. It should also work with all known non-parallel map-reduce frameworks, including base lapply() and purrr. For parallel processing, the “granularity” of progress updates varies with the type of parallel worker used. Right now, you will get live updates for sequential processing, whereas for parallel processing the updates will come in chunks along with the value whenever it is collected for a particular future. I’m working on adding support for “live” progress updates also for some parallel backends including when running on local and remote workers.

Happy futuring!


To leave a comment for the author, please follow the link and comment on their blog: JottR on R.



Dangerous streets of Bratislava! Animated maps using open data in R


[This article was first published on Peter Laurinec, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

At work recently, I wanted to make an interesting, start-up-pitch-ready animated visualization and got my first experience with spatial data (e.g. polygons). I enjoyed working with this type of data and wanted to get better at it, so I decided to try to visualize something interesting with Bratislava (Slovakia) open data and OpenStreetMap. I ended up with animated maps of violations on Bratislava streets over a period of two and a half years.

Since spatial time series are analyzed in this post, it still sticks to the blog's domain of time series data mining 🙂 You can read more about time series forecasting, representations and clustering in my previous blog posts here.

Aaand a teaser of what I will create in this blog post:

In this blog post you will learn how to:

  • get free city district polygons data from OpenStreetMaps API,
  • get free street lines (polygons) coordinates from OpenStreetMaps API,
  • visualize polygons and street lines with ggplot2 and ggmap,
  • merge spatial data with violation data represented as time series,
  • animate spatial data combined with violation time series with gganimate.

Bratislava Open-Data

The ultimate goal is to show where and when the most dangerous places in the capital of Slovakia – Bratislava – are. For this task, I will use open data that cover violations recorded by the city police, with locations (city district and street name) and timestamps.

Firstly, load all the needed packages.

library(data.table)  # handling data
library(lubridate)   # handling timestamps
library(stringi)     # text processing
library(osmdata)     # OSM data API
# library(nominatim) # alternative OSM data API
library(sf)          # handling spatial data
library(ggplot2)     # visualisations
library(ggsci)       # color pallets
library(ggmap)       # map vis
library(gganimate)   # animations
library(magick)      # image handling

Then, let’s download the violations data from opendata.bratislava.sk webpage and translate Slovak column names to English.

data_violation_17 <- fread("https://opendata.bratislava.sk/dataset/download/852", encoding = "Latin-1")
data_violation_18 <- fread("https://opendata.bratislava.sk/dataset/download/407", encoding = "Latin-1")
data_violation_19 <- fread("https://opendata.bratislava.sk/dataset/download/911", encoding = "Latin-1")

# colnames to English
setnames(data_violation_17, colnames(data_violation_17),
         c("Date_Time", "Group_vio", "Type_vio", "Street", "Place"))
setnames(data_violation_18, colnames(data_violation_18),
         c("Date_Time", "Group_vio", "Type_vio", "Street", "Place"))
setnames(data_violation_19, colnames(data_violation_19),
         c("Date_Time", "Group_vio", "Type_vio", "Street", "Place"))

Let’s bind all the data together.

data_violation <- rbindlist(
  list(data_violation_17, data_violation_18, data_violation_19),
  use.names = T
)

Next, I will transform Date_Time to POSIXct format and generate time-aggregation features – Year, Month and Year_M.

data_violation[, Date_Time := dmy_hm(Date_Time)]
data_violation[, Date := date(Date_Time)]

# Aggregation features
data_violation[, Month := month(Date_Time)]
data_violation[, Year := year(Date_Time)]
data_violation[, Year_M := Year + Month / 100]

Let’s see how many violations each Place (city district) has for the whole available period (2017-2019.09):

data_violation[,.N,by=.(Place)]
##                 Place     N##  1:       Stare Mesto 68714##  2:       Karlova Ves 11462##  3:         Petrzalka 17062##  4:          Dubravka  6307##  5:           Ruzinov 21352##  6:           Vajnory  1194##  7:    Pod. Biskupice  1590##  8:           Vrakuna  1262##  9:        Nove Mesto 17964## 10:             Devin   269## 11:              Raca  2390## 12: Devinska Nova Ves  1406## 13:             Lamac  1455## 14:           Jarovce    66## 15:     Zah. Bystrica   928## 16:           Rusovce   753## 17:            Cunovo   569

We can see that the Old Town (Stare Mesto) dominates this statistic – obviously, e.g. because of the many tourists. There are also some garbled Slavic letters. We should get rid of them – in this blog post there will be a lot of handling of these special Slavic (Slovak) characters.

data_violation[, Place := stri_trans_general(Place, "Latin-ASCII")]
data_violation[.("Eunovo"), on = .(Place), Place := "Cunovo"]
data_violation[.("Raea"), on = .(Place), Place := "Raca"]
data_violation[.("Vrakuoa"), on = .(Place), Place := "Vrakuna"]
data_violation[.("Lamae"), on = .(Place), Place := "Lamac"]

I will extract only the types of violations that relate to bad behavior of people, like harassment, drinking alcoholic beverages in public spaces and related things. So, I will not extract traffic violations like bad car parking, etc. For this task, I need to extract the violation codes:

data_violation[, Group_vio_ID := as.numeric(substr(Group_vio, 1, 4))]
# our codes are: 4836, 5000, 4834, 4700, 4838, 4828, 4839,
# 4813, 9803, 4840, 4841, 9809, 3000, 9806, 4900

Polygons of city districts

I want to show the number of violations per city district and per street (so both kinds of place information) on a map, so I need to get coordinates of the city districts and streets. Let's get the polygons of the city districts first, using the OpenStreetMap API.

poly_sm <- getbb(place_name = "Stare Mesto Bratislava", format_out = "polygon")
poly_nm <- getbb(place_name = "Nove Mesto Bratislava", format_out = "polygon")
poly_ruz <- getbb(place_name = "Ruzinov Bratislava", format_out = "polygon")
poly_raca <- getbb(place_name = "Raca Bratislava", format_out = "polygon")
poly_petr <- getbb(place_name = "Petrzalka Bratislava", format_out = "polygon")
poly_kv <- getbb(place_name = "Karlova Ves Bratislava", format_out = "polygon")
poly_dubr <- getbb(place_name = "Dubravka Bratislava", format_out = "polygon")
poly_vajn <- getbb(place_name = "Vajnory Bratislava", format_out = "polygon")
poly_lamac <- getbb(place_name = "Lamac Bratislava", format_out = "polygon")
poly_vrak <- getbb(place_name = "Vrakuna Bratislava", format_out = "polygon")
poly_pd <- getbb(place_name = "Podunajske Biskupice Bratislava", format_out = "polygon")
poly_jar <- getbb(place_name = "Jarovce Bratislava", format_out = "polygon")
poly_dnv <- getbb(place_name = "Devinska Nova Ves Bratislava", format_out = "polygon")
poly_rus <- getbb(place_name = "Rusovce Bratislava", format_out = "polygon")
poly_zb <- getbb(place_name = "Zahorska Bystrica Bratislava", format_out = "polygon")
poly_devin <- getbb(place_name = "Devin Bratislava", format_out = "polygon")
poly_cun <- getbb(place_name = "Cunovo Bratislava", format_out = "polygon")

Let’s bind all the districts data of Bratislava.

data_poly_ba <- rbindlist(list(
  as.data.table(poly_sm)[, Place := "Stare Mesto"],
  as.data.table(poly_nm[[1]])[, Place := "Nove Mesto"],
  as.data.table(poly_ruz[[1]])[, Place := "Ruzinov"],
  as.data.table(poly_raca)[, Place := "Raca"],
  as.data.table(poly_petr)[, Place := "Petrzalka"],
  as.data.table(poly_kv[[1]])[, Place := "Karlova Ves"],
  as.data.table(poly_dubr[[1]])[, Place := "Dubravka"],
  as.data.table(poly_vajn[[1]])[, Place := "Vajnory"],
  as.data.table(poly_lamac[[1]])[, Place := "Lamac"],
  as.data.table(poly_vrak[[1]])[, Place := "Vrakuna"],
  as.data.table(poly_pd[[1]])[, Place := "Pod. Biskupice"],
  as.data.table(poly_jar[[1]])[, Place := "Jarovce"],
  as.data.table(poly_dnv[[1]])[, Place := "Devinska Nova Ves"],
  as.data.table(poly_rus[[1]])[, Place := "Rusovce"],
  as.data.table(poly_zb[[1]])[, Place := "Zah. Bystrica"],
  as.data.table(poly_devin[[1]])[, Place := "Devin"],
  as.data.table(poly_cun[[1]])[, Place := "Cunovo"]
))

setnames(data_poly_ba, c("V1", "V2"), c("lon", "lat"))
data_poly_ba[, Place := factor(Place)]

Let’s simply visualize the districts.

ggplot(data_poly_ba) +
  geom_polygon(aes(x = lon, y = lat, fill = Place, color = Place)) +
  theme_void()

plot of chunk unnamed-chunk-10

For animation and visualization purposes, I need to aggregate violations data by districts (Place) and Year+Month columns (Year_M).

data_violation_agg_place <- copy(data_violation[
  .(c(4836, 5000, 4834, 4700, 4838, 4828, 4839, 4813, 9803,
      4840, 4841, 9809, 3000, 9806, 4900)),
  on = .(Group_vio_ID),
  .(N = .N),
  by = .(Place, Year_M)
])

The next step is to merge polygon data with aggregated violation data:

data_places <- merge(data_poly_ba[, .(Place, lon, lat)],
                     data_violation_agg_place[, .(Place, N_violation = N, Year_M)],
                     by = "Place",
                     all.y = T,
                     allow.cartesian = T)

Let’s also compute the mean coordinates of every district, for showing their names on the graph.

data_poly_ba_mean_place<-copy(data_poly_ba[,.(lon=mean(lon),lat=mean(lat)),by=.(Place)])

Let’s test the visualization of violations in June 2019:

ggplot() +
  geom_polygon(data = data_places[.(2019.06), on = .(Year_M)],
               aes(x = lon, y = lat, fill = N_violation, group = Place),
               color = "grey40") +
  scale_fill_material("grey") +
  geom_label(data = data_poly_ba_mean_place,
             aes(x = lon, y = lat, label = Place),
             color = "black", fill = "dodgerblue1", alpha = 0.65) +
  theme_void()

plot of chunk unnamed-chunk-14

Street lines coordinates

The next step is to download street coordinates from OpenStreetMaps.

Firstly, we have to extract unique street names from violation data, and handle Slovak letters and other punctuation for easier matching by street name.

dt_uni_streets <- copy(data_violation[, .(Place = data.table::first(Place)), by = .(Street)])

dt_uni_streets[, Street_edit := gsub(pattern = "È", replacement = "C", x = Street)]
dt_uni_streets[, Street_edit := gsub(pattern = "ò", replacement = "n", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "è", replacement = "c", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "\u009d", replacement = "t", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "¾", replacement = "l", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "¼", replacement = "L", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "ò", replacement = "n", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "_", replacement = "", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "ï", replacement = "d", x = Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "à", replacement = "r", x = Street_edit)]
dt_uni_streets[, Street_edit := stri_trans_general(Street_edit, "Latin-ASCII")]
dt_uni_streets[, Street_edit := sub(" - MC.*", "", Street_edit)]
dt_uni_streets[, Street_edit := gsub(pattern = "Nam.", replacement = "Namestie ", x = Street_edit)]

dt_uni_streets[, Street_query := gsub(pattern = " ", replacement = "+", x = Street_edit)]
dt_uni_streets[, Street_query := paste0(Street_query, "+Bratislava")]

Let’s get all street coordinates of Bratislava with just one (powerful) command, again using the OSM API:

ba_bb <- getbb(place_name = "Bratislava")

bratislava <- opq(bbox = ba_bb) %>%
  add_osm_feature(key = 'highway') %>%
  osmdata_sf() %>%
  osm_poly2line()

Let’s plot it:

ggplot(data = bratislava$osm_lines) +
  geom_sf() +
  theme_bw()

plot of chunk unnamed-chunk-17

Pretty nice web.

Now, we again need to clean up the street names downloaded from OSM. I will keep only the streets that appear in the violation data:

street_names <- data.table(Street = unique(bratislava$osm_lines$name))

street_names[, Street_edit := gsub(pattern = "Ã", replacement = "i", x = Street)]
street_names[, Street_edit := gsub(pattern = "i©", replacement = "e", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "i¡", replacement = "a", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "iº", replacement = "u", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "ž", replacement = "z", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "ľ", replacement = "l", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ä\u008d", replacement = "c", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Å¡", replacement = "s", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ľ", replacement = "L", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "ň", replacement = "n", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "i½", replacement = "y", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Å•", replacement = "r", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "i³", replacement = "o", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "i¤", replacement = "a", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ž", replacement = "Z", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Å ", replacement = "S", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Å¥", replacement = "t", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "ÄŒ", replacement = "C", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ä\u008f", replacement = "d", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "i¶", replacement = "o", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Å‘", replacement = "o", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ä›", replacement = "e", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "Ä›", replacement = "e", x = Street_edit)]
street_names[, Street_edit := stri_trans_general(Street_edit, "Latin-ASCII")]
street_names[, Street_edit := gsub(pattern = "i-", replacement = "i", x = Street_edit)]
street_names[, Street_edit := gsub(pattern = "A ", replacement = "S", x = Street_edit)]

street_names[dt_uni_streets, on = .(Street_edit)]

street_names_vio <- copy(street_names[dt_uni_streets, on = .(Street_edit), Street_orig := i.Street])[!is.na(Street_orig)]

We lost some streets from the dataset where the names could not be matched exactly.
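If you want to see which ones were dropped, a quick data.table anti-join does the trick (just a diagnostic sketch):

# Violation streets with no exact match among the cleaned OSM street names
dt_uni_streets[!street_names, on = .(Street_edit), .(Street, Street_edit)]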

Let’s subset the street data.

ba_streets_vio <- bratislava$osm_lines
ba_streets_vio <- ba_streets_vio[ba_streets_vio$name %in% street_names_vio$Street, ]
ba_streets_vio <- ba_streets_vio[!is.na(ba_streets_vio$name), ]

Now, I will transform the data to a standard lon/lat matrix (as a data.table) instead of an sf object (I highly recommend this for the subsequent ggplot usage).

data_streets_st<-data.table::data.table(sf::st_coordinates(ba_streets_vio$geometry))

Next, I will bound the streets by the existing polygons of Bratislava and add the merging column – Street_edit:

ba_streets_vio$L1 <- 1:nrow(ba_streets_vio)

data_streets_st <- merge(data_streets_st,
                         as.data.table(ba_streets_vio)[, .(L1, name)],
                         by = "L1")
data_streets_st[street_names_vio[, .(name, Street_edit)], on = .(name), Street_edit := i.Street_edit]

setnames(data_streets_st, c("X", "Y"), c("lon", "lat"))

data_streets_st <- data_streets_st[!lon > data_poly_ba[, max(lon)]]
data_streets_st <- data_streets_st[!lon < data_poly_ba[, min(lon)]]
data_streets_st <- data_streets_st[!lat > data_poly_ba[, max(lat)]]

Let’s also aggregate violation data by street names and Year + Month (Year_M):

data_violation_agg_street <- copy(data_violation[
  .(c(4836, 5000, 4834, 4700, 4838, 4828, 4839, 4813, 9803,
      4840, 4841, 9809, 3000, 9806, 4900)),
  on = .(Group_vio_ID),
  .(N = .N),
  by = .(Street, Year_M)
])

setorder(data_violation_agg_street, Year_M, -N)

Let’s add the look-up column Street_edit to the aggregated data and merge the spatial street data with the violation data.

data_violation_agg_street[dt_uni_streets, on = .(Street), Street_edit := i.Street_edit]

data_streets_st <- merge(data_streets_st,
                         data_violation_agg_street[, .(Street_edit, N_violations = N, Year_M)],
                         by = "Street_edit",
                         all.y = T,
                         allow.cartesian = TRUE)

Now, I will transform the integer violation counts into reasonable factor segments:

data_streets_st[N_violations >= 11, N_violation_type := "> 11"]
data_streets_st[N_violations >= 6 & N_violations < 11, N_violation_type := "(6, 10)"]
data_streets_st[N_violations >= 3 & N_violations < 6, N_violation_type := "(3, 5)"]
data_streets_st[N_violations <= 2, N_violation_type := "< 2"]

data_streets_st[, N_violations_per_street := factor(N_violation_type,
                                                    levels = c("< 2", "(3, 5)", "(6, 10)", "> 11"))]
levels(data_streets_st$N_violations_per_street)
## [1] "< 2"     "(3, 5)"  "(6, 10)" "> 11"
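The same binning could also be written in one step with cut(); a small equivalent sketch:

data_streets_st[, N_violations_per_street := cut(N_violations,
                                                 breaks = c(-Inf, 2, 5, 10, Inf),
                                                 labels = c("< 2", "(3, 5)", "(6, 10)", "> 11"))]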

Let’s see what we extracted so far for streets in one example year-month…

ggplot(data_streets_st[!is.na(L1)][.(2019.06), on = .(Year_M)]) +
  geom_line(aes(lon, lat, group = L1, color = N_violations_per_street), size = 1.4) +
  scale_color_brewer(palette = "Reds") +
  theme_bw()

plot of chunk unnamed-chunk-26

ggmap

Since we will use ggmap for the visualizations, we need to get a map image of Bratislava:

bbox <- make_bbox(lon, lat, data = data_poly_ba, f = .01)  # used our polygons

map_ba <- get_map(location = bbox, source = 'stamen', maptype = "watercolor")

Now, let’s put it all together in one visualization – violations by city district and also by street, for one time-stamp (Year_M).

ggmap(map_ba, extent = "device") +
  geom_polygon(data = data_places[.(2019.06), on = .(Year_M)],
               aes(x = lon, y = lat, fill = N_violation, group = Place),
               color = "grey40") +
  scale_fill_material("grey", alpha = 0.75) +
  geom_line(data = data_streets_st[!is.na(L1)][.(2019.06), on = .(Year_M)],
            aes(lon, lat, group = L1, color = N_violations_per_street),
            size = 1.4) +
  scale_color_brewer(palette = "Reds") +
  geom_label(data = data_poly_ba_mean_place,
             aes(x = lon, y = lat, label = Place, fill = NA_real_),
             color = "black", fill = "dodgerblue1", alpha = 0.75) +
  labs(fill = "N_vio_district", color = "N_vio_street") +
  guides(color = guide_legend(override.aes = list(size = 2))) +
  theme_void()

plot of chunk unnamed-chunk-28

I will also extract the most violated streets by Year+Month for labeling purposes.

data_streets_st_max <- copy(data_streets_st[!is.na(L1),
                                            .SD[N_violations == max(N_violations, na.rm = T)],
                                            by = .(Year_M)])
data_streets_st_max[, unique(Street_edit)]  # the unique month "winners"

data_streets_st_max <- copy(data_streets_st_max[, .(lon = mean(lon),
                                                    lat = mean(lat),
                                                    Street_edit = first(Street_edit),
                                                    N_violations = first(N_violations),
                                                    N_violations_per_street = first(N_violations_per_street)),
                                                by = .(Year_M)])

As we have noticed so far, the most violated city district is the Old Town, so I will also create a zoomed visualization for this city district, like this (using the coord_map function):

ggmap(map_ba) +
  geom_polygon(data = data_places[.(2019.06), on = .(Year_M)],
               aes(x = lon, y = lat, fill = N_violation, group = Place),
               color = "grey40") +
  scale_fill_material("grey", alpha = 0.75) +
  geom_line(data = data_streets_st[!is.na(L1)][.(2019.06), on = .(Year_M)],
            aes(lon, lat, group = L1, color = N_violations_per_street),
            size = 1.4) +
  geom_label(data = data_streets_st_max[.(2019.06), on = .(Year_M)],
             aes(x = lon, y = lat, label = paste0(Street_edit, " - ", N_violations)),
             color = "black", fill = "firebrick2", size = 4.5, alpha = 0.75) +
  scale_color_brewer(palette = "Reds") +
  geom_label(data = data_poly_ba_mean_place,
             aes(x = lon, y = lat, label = Place, fill = NA_real_),
             color = "black", fill = "dodgerblue1", alpha = 0.75, size = 4.5) +
  coord_map(xlim = c(data_poly_ba[.("Stare Mesto"), on = .(Place), min(lon)],
                     data_poly_ba[.("Stare Mesto"), on = .(Place), max(lon)]),
            ylim = c(data_poly_ba[.("Stare Mesto"), on = .(Place), min(lat)],
                     data_poly_ba[.("Stare Mesto"), on = .(Place), max(lat)])) +
  labs(fill = "N_vio_district", color = "N_vio_street") +
  guides(color = guide_legend(ncol = 1, override.aes = list(size = 2), title.position = "top"),
         fill = guide_legend(nrow = 2, title.position = "top")) +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.line = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.text = element_text(size = 13),
        legend.title = element_text(size = 14, face = "bold"),
        legend.background = element_rect(fill = "white"),
        legend.key = element_rect(fill = "white"),
        legend.position = "bottom")

plot of chunk unnamed-chunk-30

Using gganimate

We now have everything prepared to create an interesting animated map. For this purpose, I will use the gganimate package, whose transition_states function simply combines our previously prepared ggplot2/ggmap plot with the time feature Year_M. Let's first animate the whole-city picture.

gg_ba_anim_lines <- ggmap(map_ba, extent = "device") +
  geom_polygon(data = data_places,
               aes(x = lon, y = lat, fill = N_violation, group = Place),
               color = "grey40") +
  scale_fill_material("grey", alpha = 0.7) +
  geom_line(data = data_streets_st[!is.na(L1)],
            aes(lon, lat, group = L1, color = N_violations_per_street),
            size = 1.4) +
  scale_color_brewer(palette = "Reds") +
  labs(fill = "N_vio_district", color = "N_vio_street") +
  geom_label(data = data_poly_ba_mean_place,
             aes(x = lon, y = lat, label = Place, fill = NA_real_),
             color = "black", fill = "dodgerblue1", alpha = 0.75, size = 4.5) +
  transition_states(reorder(Year_M, Year_M), transition_length = 2, state_length = 1) +
  labs(title = "Year.Month - {closest_state}") +
  guides(color = FALSE, fill = FALSE) +
  theme_void() +
  theme(plot.title = element_text(size = 16, face = "bold"))

# save animation to object
anim_whole <- animate(gg_ba_anim_lines, duration = 46, width = 700, height = 750)

Then, let’s animate the zoomed map of the Old Town district.

gg_ba_anim_lines_zoom <- ggmap(map_ba, extent = "device") +
  geom_polygon(data = data_places,
               aes(x = lon, y = lat, fill = N_violation, group = Place),
               color = "grey40") +
  scale_fill_material("grey", alpha = 0.7) +
  geom_line(data = data_streets_st[!is.na(L1)],
            aes(lon, lat, group = L1, color = N_violations_per_street),
            size = 1.4) +
  scale_color_brewer(palette = "Reds") +
  geom_label(data = data_streets_st_max,
             aes(x = lon, y = lat, label = paste0(Street_edit, " - ", N_violations)),
             color = "black", fill = "firebrick2", alpha = 0.75, size = 4.5) +
  geom_label(data = data_poly_ba_mean_place,
             aes(x = lon, y = lat, label = Place, fill = NA_real_),
             color = "black", fill = "dodgerblue1", alpha = 0.75, size = 4.5) +
  coord_map(xlim = c(data_poly_ba[.("Stare Mesto"), on = .(Place), min(lon)],
                     data_poly_ba[.("Stare Mesto"), on = .(Place), max(lon)]),
            ylim = c(data_poly_ba[.("Stare Mesto"), on = .(Place), min(lat)],
                     data_poly_ba[.("Stare Mesto"), on = .(Place), max(lat)])) +
  transition_states(reorder(Year_M, Year_M), transition_length = 2, state_length = 1) +
  labs(fill = "N_vio_district", color = "N_vio_street") +
  guides(color = guide_legend(ncol = 1, override.aes = list(size = 2), title.position = "top"),
         fill = guide_legend(nrow = 2, title.position = "top")) +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.line = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.text = element_text(size = 13),
        legend.title = element_text(size = 14, face = "bold"),
        legend.background = element_rect(fill = "white"),
        legend.key = element_rect(fill = "white"),
        legend.position = "bottom")

# save animation to object
anim_zoom <- animate(gg_ba_anim_lines_zoom, duration = 46, width = 700, height = 750)

GIFS binding with magick package

Now, we have to bind these two animations into one GIF. I will do it with functions from the magick package.

a_mgif <- image_read(anim_whole)
b_mgif <- image_read(anim_zoom)

new_gif <- image_append(c(a_mgif[1], b_mgif[1]))
for (i in 2:100) {
  combined <- image_append(c(a_mgif[i], b_mgif[i]))
  new_gif <- c(new_gif, combined)
}

# final!
new_gif

Voilaaa:

For full resolution – right click on the gif and click on the view image option.

We can see that at the end of the observed period, i.e. the summer of 2019, the Old Town district has even more violations than usual. That can be caused by multiple factors…

Also notice that peripheral areas of districts such as Ruzinov and Vrakuna have streets with repeated violations.
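As a quick check of that impression, we can list the streets with the most of the selected violations outside the Old Town over the whole period (a small sketch reusing the violation codes from above):

data_violation[.(c(4836, 5000, 4834, 4700, 4838, 4828, 4839, 4813, 9803,
                   4840, 4841, 9809, 3000, 9806, 4900)), on = .(Group_vio_ID)
               ][Place != "Stare Mesto", .N, by = .(Place, Street)
                 ][order(-N)][1:10]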

Streets with the most violations

Let's look at some of the most violated streets as time series, as we have seen them in the animation.

times <- data.table(Year_M = data_violation_agg_street[, sort(unique(Year_M))])
times[, Date := as.Date("2017-01-01")]
times[, row_id := 1:.N]
times[, Date := Date + row_id * 30.4]

data_vio_most_street <- copy(data_violation_agg_street[
  .(c("Michalska", "Postova", "Rybarska brana", "Namestie slobody")),
  on = .(Street_edit)
])
data_vio_most_street[times, on = .(Year_M), Date := i.Date]
setnames(data_vio_most_street, "N", "N_violations")

ggplot(data_vio_most_street) +
  geom_line(aes(Date, N_violations, color = Street_edit), size = 0.8) +
  theme_bw()

plot of chunk unnamed-chunk-34

Most of the time, Michalska street in the historical center of the Old Town was the most violated street, but in the last three months Postova is the “winner of the most violated street”… maybe because of the new city-police station nearby.

Summary

In this blog post, I showed you how to:

  • work with different spatial data as polygons, street lines or map images,
  • combine these spatial objects with external data as city violation time series,
  • create animated maps using packages as ggplot2, ggmap, gganimate, and magick.

I hope you will use this information with your own spatial-time series data combo to create some interesting visualizations 🙂


To leave a comment for the author, please follow the link and comment on their blog: Peter Laurinec.


#TidyTuesday: horror films, squirrels and commuters


[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Tidy Tuesday is a fun weekly activity where a lot of R enthusiasts make different visualisations, and possibly modelling, of the same dataset. You can read more about it at their Github page. I participated for three weeks, and here is a recap. I will show excerpts of the code, but you can read the whole thing by clicking through to Github.

2019-10-22 Horror films

Data: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-10-22

My code: https://github.com/mrtnj/rstuff/blob/master/tidytuesday/horror_movies.R

In time for Halloween, we got a dataset with horror film data from IMDB. (Yes, I will be mixing the terms ”film” and ”movie” wildly.)

The first week, I started with making a pretty boring plot, the way I’d normally plot things (white background, small multiples, you know the drill). I wanted to look at distribution over the year, so I plotted what month films are released and the distribution of review scores and budgets each month. After thinking about it for a while, I thought a logarithmic scale would make sense for budgets, that span a huge range. Also, after realising that the budget column actually didn’t contain dollars, but a mix of currencies, I decided not to try to convert, but use only the US dollar budgets.

I don’t often run into dates, so using the date functions from readr and lubridate was new to me, as was the built-in vector month.abb:

library(dplyr)
library(egg)
library(ggplot2)
library(ggimage)
library(lubridate)
library(readr)
library(stringr)

movies <- read_csv("horror_movies.csv")

## Parse dates
movies$release_parsed <- parse_date(movies$release_date,
                                    format = "%d-%b-%y",
                                    locale = locale("en"))

movies$release_year <- ifelse(is.na(movies$release_parsed),
                              movies$release_date,
                              year(movies$release_parsed))

movies$release_month <- month.abb[month(movies$release_parsed)]

Here, we parse the release date, and extract the release year, treating films that only have a release year separately.

I also put in means with confidence intervals, like so, and a line for the mean review rating:

model <- lm(review_rating ~ release_month, movies)

fit <- data.frame(release_month = month.abb,
                  predict(model,
                          newdata = data.frame(release_month = month.abb),
                          interval = "confidence"),
                  stringsAsFactors = FALSE)

grand_mean_rating <- mean(movies$review_rating,
                          na.rm = TRUE)

As an example of the plotting code, here is the middle panel for ratings. As usual with ggplot2, we layer geometries on top of each other (here: violin plots, points with range bars, and a horizontal line), followed by a lot of formatting.

plot_rating <- ggplot() +
    geom_violin(aes(x = release_month,
                    y = review_rating),
                fill = "grey",
                colour = NA,
                data = movies) +
    scale_x_discrete(limits = month.abb) +
    geom_pointrange(aes(x = release_month,
                        y = fit,
                        ymax = upr,
                        ymin = lwr),
                    data = fit) +
    geom_hline(yintercept = grand_mean_rating,
               linetype = 2,
               colour = "red") +
    ylim(0, 10) +
    theme_bw(base_size = 12) +
    theme(panel.grid = element_blank()) +
    xlab("") +
    ylab("Review rating")

There is similar code for the other two panels. Finally, I used ggarrange from the egg package to put everything together. In summary, most horror films are released in October, probably around Halloween. The review ratings of films released in this horror season are also a tiny bit higher than during the rest of the year, but there is not much of a difference in the budgets.
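The assembly step itself is not shown in the excerpt above; a minimal sketch of what it might look like follows, where plot_count and plot_budget are assumed names for the other two panels (placeholder plots are created here only so the snippet runs on its own):

library(egg)
library(ggplot2)

# Placeholders standing in for the other two panels (assumed names)
plot_count  <- ggplot() + theme_void()
plot_budget <- ggplot() + theme_void()

# Stack the three panels vertically
ggarrange(plot_count, plot_rating, plot_budget, ncol = 1)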

After that, and after seeing some of the fun horror-themed graphs other people made, I decided to make something more colourful. Here is a plot on the same theme, showing each day and year separately, an appropriately horrendous colour scheme, and a pumpkin icon to indicate the date of Halloween. I like this plot better because it shows more of the data. It shows the increase at Halloween. We also see some spikes at other dates, like 1 January of some years. It also shows how the dataset ends at Halloween 2017.

The code for this plot is mostly a lot of theme formatting. The ggplot2 theme() function takes a lot of arguments I’ve never used before.

movies$yday <- yday(movies$release_parsed)

daycount <- summarise(group_by(movies, yday, release_year), n = n())

First, we turn dates into days of the year, and count the number of film releases.

halloween <- yday("2019-10-31")

pumpkin_data <- data.frame(x = halloween,
                           y = -1,
                           image = "pumpkin.png",
                           stringsAsFactors = FALSE)

Then, we set up the date of Halloween and a data frame for the pumpkin icon. We’re going to use geom_image from the ggimage package to add this icon to each subplot.

breaks <- yday(paste("2019-", 1:12, "-01", sep = ""))

plot_year <- ggplot() +
    geom_point(aes(x = yday,
                   y = n),
               colour = "green",
               data = na.exclude(daycount)) +
    geom_image(aes(x = x,
                   y = y,
                   image = image),
               data = pumpkin_data) +
    facet_wrap(~ release_year,
               ncol = 2) +
    scale_x_continuous(breaks = breaks,
                       labels = month.abb) +
    ylim(-3, NA) +
    labs(caption = "Pumpkin icon by Good Ware from www.flaticon.com.") +
    theme(panel.grid = element_blank(),
          strip.background = element_blank(),
          text = element_text(family = "mono",
                              colour = "grey",
                              size = 16),
          axis.text = element_text(family = "mono",
                                   colour = "green",
                                   size = 14),
          axis.ticks = element_line(colour = "green"),
          strip.text = element_text(family = "mono",
                                    colour = "grey",
                                    size = 16),
          plot.background = element_rect(fill = "black"),
          panel.background = element_rect(fill = "black")) +
    xlab("") +
    ylab("Horror films released on this day") +
    ggtitle("When horror films are released")

A lot of other people made graphs that highlight the increase in horror film releases around Halloween in different ways. Here are some that I like:

It's #October– that means pumpkins, falling leaves, and #horror movies! This #TidyTuesday: horror movie releases per country/month. It's been a while… Happy to be back!#rstats#dataviz#R4DS#ggplot#tidyverse

Code: https://t.co/g71Fc8cn62 Source: https://t.co/YYRtdYoUmE pic.twitter.com/lfsyyCFNog

— Veerle van Son (@veerlevanson) October 25, 2019

#TidyTuesday this week inspired by a 'calendar plot' I saw from @BBCWorld (https://t.co/v0DIBDw74C). Made my own using geom_tile. Horror movie releases are clearly targeted at Halloween. #tidyverse#rstats

Code👨‍💻: https://t.co/cxJhXcfQCm Plot📈: https://t.co/lhL594QJTp pic.twitter.com/sH8ppgbyx6

— Liam Bailey (@ldbailey255) October 23, 2019

My first #tidytuesday contribution! Inspired by the @github contribution graph. 🙃 Horror movies are most frequently released on Fridays, esp leading up to Halloween. (Data from 2017.) Code: https://t.co/hKd2ML3RI2 pic.twitter.com/QV49nzrhoQ

— Alyson La (@alysonlaaa) October 30, 2019

And, looking deeper, there is a pattern within months too:

They love 13. It is the most preferred day in the month for #Horror movie releases (excl. the 1st). Be careful #TidyTuesday is addictive. #r4ds #rstats #tidyverse #animate code: https://t.co/fYEZUw7aT9 pic.twitter.com/InmTpk1NWk

— Serdar Korur (@Dataatomic) October 25, 2019

Finally, I also like this plot, that makes a case for a U-shaped relationship between budget and rating:

Horror film ratings by budget. Obsessed with annotations lately…. Code: https://t.co/JAMMvGdUMT #TidyTuesday #dataviz #rstats pic.twitter.com/CEYgmqQfyL

— Jenna DeVries (@jennaldevries) October 28, 2019

And for contrast, another that makes a different case with the same data:

#tidytuesday Apparently big budgets don't help with horror movie ratings https://t.co/NswrlRpR8q pic.twitter.com/Qom1OtFpn7

— Larry D'Agostino (@larrydag) October 24, 2019

This seems to be a recurrent theme when it comes to interpretation and quantitative analysis in the Tidy Tuesday datasets. People make different modeling choices, or visualisation choices (which are modeling choices) about what to lump together, what to separate into bins, how to transform the data, and how to show uncertainty. In some cases, as with the pattern of film releases around Halloween, they all find similar results. In some other cases, they don’t.

2019-10-28 NYC Squirrel Census

Data: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-10-29

My code: https://github.com/mrtnj/rstuff/blob/master/tidytuesday/nyc_squirrels.R

This week, the data was about the location and activities of squirrels in New York’s Central Park at certain times. I had this vision of an animated map of squirrel locations. I ended up with an animation, but no map. The colour of the squirrel icon shows the main fur colour of the squirrels (grey, black, cinnamon), and the size shows adults and juveniles.

I had never used gganimate before (only animation, as in this post about the Game of Life), but I had seen Thomas Lin Pedersen tweet about it, and I wanted to try.

library(dplyr)
library(gganimate)
library(ggimage)
library(ggplot2)
library(readr)

squirrels <- read_csv("nyc_squirrels.csv")

## Parse the date
squirrels$date_parsed <- parse_date(as.character(squirrels$date), format = "%m%d%Y")

## Give each observation a unique ID (to use as group in the
## animation, so as to not have points turn into one another but fade
## instead).
squirrels$key <- 1:nrow(squirrels)

## Associate the different squirrel colours with the filenames of
## icons in different colours (manually filled with GIMP).
squirrels$image <- "squirrel.png"
squirrels$image[squirrels$primary_fur_color == "Cinnamon"] <- "squirrel_cinnamon.png"
squirrels$image[squirrels$primary_fur_color == "Gray"] <- "squirrel_grey.png"
squirrels$image[is.na(squirrels$primary_fur_color)] <- NA

Again, we need to parse the date. We already have latitude and longitude. We need a unique identifier for each observation, to tell gganimate that we want each squirrel to be in its own group. Then, we associate squirrel colours with three different files with a squirrel icon in different colours.

First, we make two image scatterplot layers, setting the sizes of adults and juveniles manually. The colour is dealt with by mapping the image column containing the file names to the image aesthetic. We add some formatting, and then, the transition_states layer, which is where the graph turns from still and boring to magical moving pictures. This will animate a series of discrete ”states”, which here consist of the date pasted together with the shift (AM or PM squirrel observation shift). The special ”{closest_state}” variable in the title string puts this state name as plot title.

plot_colour <- ggplot() +
    geom_image(aes(y = long, x = lat, image = image, group = key),
               size = 0.04,
               data = filter(squirrels, age == "Adult")) +
    geom_image(aes(y = long, x = lat, image = image, group = key),
               size = 0.03,
               data = filter(squirrels, age == "Juvenile")) +
    theme_bw(base_size = 16) +
    theme(panel.grid = element_blank()) +
    xlab("Latitude") +
    ylab("Longitude") +
    labs(title = "{closest_state}",
         caption = "Data from NYC Squirrel Census. Squirrel icon made by Freepik from www.flaticon.com.") +
    transition_states(paste(date_parsed, shift),
                      state_length = 2,
                      transition_length = 1)

## Render it and write to file
animate(plot_colour,
        fps = 10,
        nframes = 400,
        end_pause = 20,
        rewind = FALSE,
        width = 1000,
        height = 1000)

I was faffing around with different map packages to try to find something of Central Park. It seems ggmap is the way to go. Other participants made nice maps, though:

This is my first #tidytuesday contribution and my first attempt working with maps🙃 Code: https://t.co/xLVWkXJZWG pic.twitter.com/XthNrnxhDc

— Anna L (@The_Anna_L) November 4, 2019

Used the squirrel #TidyTuesday dataset (2019-10-29) to make some simple hexmaps of activities (eating, moving or making noise). Also trying #brickr for the first time to make some fun plots. Code at: https://t.co/diXAzzmkEm #rstats pic.twitter.com/0WNhPtvHZA

— Jesus M. Castagnetto (@jmcastagnetto) November 4, 2019

Couldn't come up with something better for this week's #TidyTuesday, so here is a little tribute to the two squirrels that were found dead during the 2018 Central Park @SquirrelCensus. I'm a little sad…

Code (🙏@CedScherer): https://t.co/91Kciy2NSN #ggplot #ThisIsNotDataViz pic.twitter.com/bbaYcruXx4

— Georgios Karamanis (@geokaramanis) November 1, 2019

However, I think this was my favourite:

Possibly the worst #tidytuesday submission ever! Recreating this classic XKCD, but with squirrels. https://t.co/ZaG5qGU6Le #rstats #r4ds pic.twitter.com/WrKhDBQG1A

— Ryan Timpe 🧱📊 (@ryantimpe) October 31, 2019

https://github.com/ryantimpe/TidyTuesday/blob/master/2019w44/2019w44.R

The original Squirrel Census Report seems to be an amazing object, too, with a beautiful map.

2019-11-05 Biking and walking to work in the US (and Sweden)

Data: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-11-05

My code: https://github.com/mrtnj/rstuff/blob/master/tidytuesday/commute.R

This week I felt I had to make a map. The end result doesn’t look like much, but it took a while. Here are the average percentages of commuters who walk and bike to work in different US states 2008-2012 with data from the American Community Survey:

library(dplyr)
library(ggplot2)
library(readr)
library(usmap)

commute <- read_csv("commute.csv")

## Map data from the usmap package
state_map <- us_map(regions = "state")

## There are some incompletely labelled states; fix them
missing <- setdiff(commute$state, state_map$full)

commute$state_modified <- commute$state
commute$state_modified[commute$state == "Ca"] <- "California"
commute$state_modified[commute$state == "Massachusett"] <- "Massachusetts"

We get map coordinates for the US states from the usmap package (because the one in maps doesn’t have Alaska and Hawaii).

Then we fix some mislabelling in the data.

## Get the average per state
state_average <- summarise(group_by(commute, state_modified, mode),
                           average = sum(percent * n)/sum(n))

## Combine averages and coordinates
combined <- inner_join(state_average,
                       state_map,
                       by = c("state_modified" = "full"))

We take a weighted average of the percentages per state and join the state averages with the state map coordinates. The map I posted on Twitter didn’t weight the average, but I think the weighted version is a bit better. There is still the issue that states have different populations and different distributions of large and small cities, but that’s the nature of things. In summary, there is not much biking going on, but some more walking to work.

plot_map <- ggplot() +
    geom_polygon(aes(x = x, y = y, fill = average, group = group),
                 colour = "black",
                 data = combined) +
    facet_wrap(~ mode) +
    scale_fill_continuous(low = "white",
                          high = "blue",
                          name = "Percent commuters") +
    theme_bw(base_size = 16) +
    theme(panel.grid = element_blank(),
          strip.background = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          legend.position = "bottom") +
    xlab("") +
    ylab("") +
    labs(caption = "Cycling and walking to work 2008-2012 in the American Community Survey.")

The US seems to live up to its reputation as a motorised country. But I have no feeling for the scale of the data. For comparison, here is a map of Sweden with some not too recent data (2005-2006, from this VTI report). The map is from the swemap package.


To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.


Cleaning the Table


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

While I’m talking about getting data into R this weekend, here’s another quick example that came up in class this week. The mortality data in the previous example were nice and clean coming in the door. That’s usually not the case. Data can be and usually is messy in all kinds of ways. One of the most common, particularly in the case of summary tables obtained from some source or other, is that the values aren’t directly usable. The following summary table was copied and pasted into Excel from an external source, saved as a CSV file, and arrived looking like this:

library(tidyverse)

rfm_tbl <- read_csv("data/rfm_table.csv")

## Parsed with column specification:
## cols(
##   SEGMENT = col_character(),
##   DESCRIPTION = col_character(),
##   R = col_character(),
##   F = col_character(),
##   M = col_character()
## )


rfm_tbl 


## # A tibble: 23 x 5
##    SEGMENT        DESCRIPTION                             R     F     M    
##
##  1
##  2 Champions      Bought recently, buy often and spend t… 4– 5  4– 5  4– 5 
##  3
##  4 Loyal Custome… Spend good money. Responsive to promot… 2– 5  3– 5  3– 5 
##  5
##  6 Potential Loy… Recent customers, spent good amount, b… 3– 5  1– 3  1– 3 
##  7
##  8 New Customers  Bought more recently, but not often     4– 5  <= 1  <= 1 
##  9
## 10 Promising      Recent shoppers, but haven’t spent much 3– 4  <= 1  <= 1 
## # … with 13 more rows

This is messy and we can’t do anything with the values in R, F, and M. Ultimately we want a table with separate columns containing the low and high values for these variables. If no lower bound is shown, the lower bound is zero. We’re going to use a few tools, notably separate() to get where we want to be. I’ll step through this pipeline one piece at a time, so you can see how the table is being changed from start to finish.

First let’s clean the variable names and remove the entirely blank lines.


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))

## # A tibble: 11 x 5
##    segment        description                             r     f     m    
##
##  1 Champions      Bought recently, buy often and spend t… 4– 5  4– 5  4– 5 
##  2 Loyal Custome… Spend good money. Responsive to promot… 2– 5  3– 5  3– 5 
##  3 Potential Loy… Recent customers, spent good amount, b… 3– 5  1– 3  1– 3 
##  4 New Customers  Bought more recently, but not often     4– 5  <= 1  <= 1 
##  5 Promising      Recent shoppers, but haven’t spent much 3– 4  <= 1  <= 1 
##  6 Need Attention Above average recency, frequency & mon… 2– 3  2– 3  2– 3 
##  7 About To Sleep Below average recency, frequency & mon… 2– 3  <= 2  <= 2 
##  8 At Risk        Spent big money, purchased often but l… <= 2  2– 5  2– 5 
##  9 Can’t Lose Th… Made big purchases and often, but long… <= 1  4– 5  4– 5 
## 10 Hibernating    Low spenders, low frequency, purchased… 1– 2  1– 2  1– 2 
## 11 Lost           Lowest recency, frequency & monetary s… <= 2  <= 2  <= 2

Next we start work on the values. I thought about different ways of doing this, notably working out a way to apply or map separate() to each of the columns I want to change. I got slightly bogged down doing this, and instead decided to lengthen the r, f, and m variables into a single key-value pair, do the recoding there, and then widen the result again. First, lengthen the data:


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))%>%
  pivot_longer(cols = r:m)

## # A tibble: 33 x 4
##    segment         description                                  name  value
##
##  1 Champions       Bought recently, buy often and spend the mo… r     4– 5 
##  2 Champions       Bought recently, buy often and spend the mo… f     4– 5 
##  3 Champions       Bought recently, buy often and spend the mo… m     4– 5 
##  4 Loyal Customers Spend good money. Responsive to promotions   r     2– 5 
##  5 Loyal Customers Spend good money. Responsive to promotions   f     3– 5 
##  6 Loyal Customers Spend good money. Responsive to promotions   m     3– 5 
##  7 Potential Loya… Recent customers, spent good amount, bought… r     3– 5 
##  8 Potential Loya… Recent customers, spent good amount, bought… f     1– 3 
##  9 Potential Loya… Recent customers, spent good amount, bought… m     1– 3 
## 10 New Customers   Bought more recently, but not often          r     4– 5 
## # … with 23 more rows

I’m quite sure that there’s an elegant way to use one of the map() functions to process the r, f, and m columns in sequence. But seeing as I couldn’t quickly figure it out, this alternative strategy works just fine. In fact, as a general approach I think it’s always worth remembering that the tidyverse really “wants” your data to be in long form, and lots of things that are awkward or conceptually tricky can suddenly become much easier if you get the data into the shape that the function toolbox wants it to be in. Lengthening the data you’re working with is very often the right approach, and you know you can widen it later on once you’re done cleaning or otherwise manipulating it.

With our table in long format we can now use separate() on the value column. The separate() function is very handy for pulling apart variables that should be in different columns. Its defaults are good, too. In this case I didn’t have to write a regular expression to specify the characters that are dividing up the values. In the function call we use convert = TRUE to turn the results into integers, and fill = "left" because there’s an implicit zero on the left of each entry that looks like e.g. <= 2.


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))%>%
  pivot_longer(cols = r:m)%>% 
  separate(col = value, into =c("lo","hi"), 
           remove =FALSE, convert =TRUE, 
           fill ="left")## # A tibble: 33 x 6##    segment       description                        name  value    lo    hi##                                              ##  1 Champions     Bought recently, buy often and sp… r     4– 5      4     5##  2 Champions     Bought recently, buy often and sp… f     4– 5      4     5##  3 Champions     Bought recently, buy often and sp… m     4– 5      4     5##  4 Loyal Custom… Spend good money. Responsive to p… r     2– 5      2     5##  5 Loyal Custom… Spend good money. Responsive to p… f     3– 5      3     5##  6 Loyal Custom… Spend good money. Responsive to p… m     3– 5      3     5##  7 Potential Lo… Recent customers, spent good amou… r     3– 5      3     5##  8 Potential Lo… Recent customers, spent good amou… f     1– 3      1     3##  9 Potential Lo… Recent customers, spent good amou… m     1– 3      1     3## 10 New Customers Bought more recently, but not oft… r     4– 5      4     5## # … with 23 more rows

Before widening the data again we drop the value column. We don’t need it anymore. (It will mess up the widening if we keep it, too: try it and see what happens.)


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))%>%
  pivot_longer(cols = r:m)%>% 
  separate(col = value, into =c("lo","hi"), 
           remove =FALSE, convert =TRUE, 
           fill ="left")%>%
  select(-value)

## # A tibble: 33 x 5
##    segment        description                             name     lo    hi
##
##  1 Champions      Bought recently, buy often and spend t… r         4     5
##  2 Champions      Bought recently, buy often and spend t… f         4     5
##  3 Champions      Bought recently, buy often and spend t… m         4     5
##  4 Loyal Custome… Spend good money. Responsive to promot… r         2     5
##  5 Loyal Custome… Spend good money. Responsive to promot… f         3     5
##  6 Loyal Custome… Spend good money. Responsive to promot… m         3     5
##  7 Potential Loy… Recent customers, spent good amount, b… r         3     5
##  8 Potential Loy… Recent customers, spent good amount, b… f         1     3
##  9 Potential Loy… Recent customers, spent good amount, b… m         1     3
## 10 New Customers  Bought more recently, but not often     r         4     5
## # … with 23 more rows

Now we can widen the data, with pivot_wider().


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))%>%
  pivot_longer(cols = r:m)%>% 
  separate(col = value, into =c("lo","hi"), 
           remove =FALSE, convert =TRUE, 
           fill ="left")%>%
  select(-value)%>%
  pivot_wider(names_from = name, 
              values_from = lo:hi)

## # A tibble: 11 x 8
##    segment     description               lo_r  lo_f  lo_m  hi_r  hi_f  hi_m
##
##  1 Champions   Bought recently, buy of…     4     4     4     5     5     5
##  2 Loyal Cust… Spend good money. Respo…     2     3     3     5     5     5
##  3 Potential … Recent customers, spent…     3     1     1     5     3     3
##  4 New Custom… Bought more recently, b…     4    NA    NA     5     1     1
##  5 Promising   Recent shoppers, but ha…     3    NA    NA     4     1     1
##  6 Need Atten… Above average recency, …     2     2     2     3     3     3
##  7 About To S… Below average recency, …     2    NA    NA     3     2     2
##  8 At Risk     Spent big money, purcha…    NA     2     2     2     5     5
##  9 Can’t Lose… Made big purchases and …    NA     4     4     1     5     5
## 10 Hibernating Low spenders, low frequ…     1     1     1     2     2     2
## 11 Lost        Lowest recency, frequen…    NA    NA    NA     2     2     2

Finally we put back those implicit zeros using replace_na() and reorder the columns to our liking. Using replace_na() is fine here because we know that every missing value should in fact be a zero.


rfm_tbl %>% 
  janitor::clean_names()%>%
  filter_all(any_vars(!is.na(.)))%>%
  pivot_longer(cols = r:m)%>% 
  separate(col = value, into =c("lo","hi"), 
           remove =FALSE, convert =TRUE, 
           fill ="left")%>%
  select(-value)%>%
  pivot_wider(names_from = name, 
              values_from = lo:hi)%>%
  mutate_if(is.integer, replace_na,0)%>%
  select(segment, 
         lo_r, hi_r, 
         lo_f, hi_f, 
         lo_m, hi_m, 
         description)

## # A tibble: 11 x 8
##    segment      lo_r  hi_r  lo_f  hi_f  lo_m  hi_m description             
##
##  1 Champions       4     5     4     5     4     5 Bought recently, buy of…
##  2 Loyal Cust…     2     5     3     5     3     5 Spend good money. Respo…
##  3 Potential …     3     5     1     3     1     3 Recent customers, spent…
##  4 New Custom…     4     5     0     1     0     1 Bought more recently, b…
##  5 Promising       3     4     0     1     0     1 Recent shoppers, but ha…
##  6 Need Atten…     2     3     2     3     2     3 Above average recency, …
##  7 About To S…     2     3     0     2     0     2 Below average recency, …
##  8 At Risk         0     2     2     5     2     5 Spent big money, purcha…
##  9 Can’t Lose…     0     1     4     5     4     5 Made big purchases and …
## 10 Hibernating     1     2     1     2     1     2 Low spenders, low frequ…
## 11 Lost            0     2     0     2     0     2 Lowest recency, frequen…

Much nicer.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


Scraping Machinery Parts


[This article was first published on R | datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve been exploring the feasibility of aggregating data on prices of replacement parts for heavy machinery. There are a number of websites which list this sort of data. I’m focusing on the static sites for the moment.

I’m using R with {rvest} (and a few other Tidyverse packages thrown in for good measure).

library(glue)
library(dplyr)
library(purrr)
library(stringr)
library(rvest)

The data are paginated. Fortunately the URL includes the page number as a GET parameter, so stepping through the pages is simple. I defined a global variable, URL, with a {glue} placeholder for the page number.
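The actual site and CSS selector are not named in the post, so purely as an illustration, those two globals might be defined along these lines (both values below are made-up assumptions):

# Hypothetical values -- the real URL and CSS selector are not given in the post.
URL      <- "https://www.example-parts-site.com/hydraulic-motors?page={page}"
SELECTOR <- ".product-item"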

This is how I’m looping over the pages. The page number, page, is set to one initially and incremented after each page of results is scraped. When it gets to a page without any results, the loop is stopped.

page = 1

while (TRUE) {
  items <- read_html(glue(URL)) %>%
    html_nodes(SELECTOR)

  if (length(items) == 0) break                  # Check if gone past last page.

  # Extract data for each item here...

  page <- page + 1                               # Advance to next page.
}

That’s the mechanics. Within the loop I then used map_dfr() from {purrr} to iterate over items, delving into each item to extract its name and price.

map_dfr(items, function(item) {
  tibble(
    part = item %>% html_node("p") %>% html_text(),
    price = item %>% html_node(".product-item--price small") %>% html_text()
  )
})

The results from each page are appended to a list and finally concatenated using bind_rows().
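A minimal sketch of that accumulate-and-combine step (the pages list and page_results name are assumptions used only for illustration):

library(dplyr)

pages <- list()
# inside the while loop, after scraping each page:
#   pages[[page]] <- page_results    # the tibble returned by map_dfr() above
parts <- bind_rows(pages)            # concatenate the per-page tibbles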

> dim(parts)
[1] 987   2

Scraping a single category yields 987 parts. Let’s take a look at the first few.

> head(parts)
  part                                                                                    price    
  <chr>                                                                                   <chr>    
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732  $1,534.00
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268  $1,854.00
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034  $1,374.00
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728  $1,754.00
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869  $1,494.00
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805  $1,534.00

That’s looking pretty good already. There’s one final niggle: the data in the price column are strings. Ideally we’d want those to be numeric. But to do that we have to strip off some punctuation. Not a problem thanks to functions from {stringr}.

parts <- parts %>% mutate(
    price = str_replace(price, "^\\$", ""),      # Strip off leading "$"
    price = str_replace_all(price, ",", ""),     # Strip out comma separators
    price = as.numeric(price)
  )

Success!

> head(parts)
  part                                                                                    price
  <chr>                                                                                   <dbl>
1 R986110000 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 240-7732   1534
2 R986110001 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 213-5268   1854
3 R986110002 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 266-8034   1374
4 R986110003 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 296-6728   1754
5 R986110004 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 136-8869   1494
6 R986110005 Bosch Rexroth New Replacement Hydraulic Axial Piston Motor For CAT 255-6805   1534

The great thing about scripting a scraper is that, provided that the website has not been dramatically restructured, the scraper can be run at any time to gather an updated set of results.


To leave a comment for the author, please follow the link and comment on their blog: R | datawookie.


A comparison of methods for predicting clothing classes using the Fashion MNIST dataset in RStudio and Python (Part 1)


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Florianne Verkroost is a PhD candidate at Nuffield College at the University of Oxford. With a passion for data science and a background in mathematics and econometrics, she applies her interdisciplinary knowledge to computationally address societal problems of inequality.

In this series of blog posts, I will compare different machine and deep learning methods to predict clothing categories from images using the Fashion MNIST data. In this first blog of the series, we will explore and prepare the data for analysis. I will also show you how to predict the clothing categories of the Fashion MNIST data using my go-to model: an artificial neural network. To show you how to use one of RStudio’s incredible features to run Python from RStudio, I build my neural network in Python using the code in this Python script or this Jupyter notebook on my Github. In the second blog post, we will experiment with tree-based methods (single tree, random forests and boosting) and support vector machines to see whether we can beat the neural network in terms of performance. As Python cannot be run in this blog post, I will walk you through the results from this script produced earlier, but if you would also like to see how to embed Python code and results in R Markdown files, check out this Markdown file on my Github! The R code used for this blog is also included on my Github.

To start, we first set our seed to make sure the results are reproducible.

set.seed(1234)

Importing and exploring the data

The keras package contains the Fashion MNIST data, so we can easily import the data into RStudio from this package directly after installing it from Github and loading it.

library(devtools)
install.packages("keras")
# devtools::install_github("rstudio/keras")
library(keras)
install_keras()

fashion_mnist <- keras::dataset_fashion_mnist()

The resulting object named fashion_mnist is a nested list, consisting of lists train and test. Each of these lists in turn consists of arrays x and y. To look at the dimensions of these elements, we recursively apply the dim() function to the fashion_mnist list.

rapply(fashion_mnist, dim)
train.x1 train.x2 train.x3  train.y  test.x1  test.x2  test.x3   test.y    60000       28       28    60000    10000       28       28    10000 

From the result, we observe that the x array in the training data is a 60000 by 28 by 28 array: in other words, 60000 images each of 28 by 28 pixels. The y array in the training data contains 60000 labels for each of the images in the x array of the training data. The test data has a similar structure but only contains 10000 images rather than 60000. For simplicity, we rename these list elements to something more intuitive (where x now represents images and y represents labels):

c(train.images, train.labels) %<-% fashion_mnist$train
c(test.images, test.labels) %<-% fashion_mnist$test

Every image is captured by a 28 by 28 matrix, where entry [i, j] represents the opacity of that pixel on an integer scale from 0 (white) to 255 (black). The labels consist of integers between zero and nine, each representing a unique clothing category. As the category names are not contained in the data itself, we have to store and add them manually. Note that the categories are evenly distributed in the data.
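A quick way to check that even distribution from the R console (a small added sanity check, not part of the original analysis):

table(fashion_mnist$train$y)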

cloth_cats = data.frame(category = c('Top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                                     'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boot'),
                        label = seq(0, 9))

To get an idea of what the data entail and look like, we plot the first ten images of the test data. To do so, we first need to reshape the data slightly such that it becomes compatible with ggplot2. We select the first ten test images, convert them to data frames, rename the columns into digits 1 to 28, create a variable named y with digits 1 to 28 and then we melt by variable y. We need package reshape2 to access the melt() function. This results in a 28 times 28 equals 784 by 3 (y pixels (= y), x pixels (= variable) and the opacity (= value)) data frame. We bind these all together by rows using the rbind.fill() function from the plyr package and add a variable Image, which is a unique string repeated 784 times for each of the ten images containing the image number and corresponding test set label.

library(reshape2)
library(plyr)

subarray <- apply(test.images[1:10, , ], 1, as.data.frame)
subarray <- lapply(subarray, function(df){
  colnames(df) <- seq_len(ncol(df))
  df['y'] <- seq_len(nrow(df))
  df <- melt(df, id = 'y')
  return(df)
})

plotdf <- rbind.fill(subarray)

first_ten_labels <- cloth_cats$category[match(test.labels[1:10], cloth_cats$label)]
first_ten_categories <- paste0('Image ', 1:10, ': ', first_ten_labels)
plotdf['Image'] <- factor(rep(first_ten_categories, unlist(lapply(subarray, nrow))),
                          levels = unique(first_ten_categories))

We then plot these first ten test images using package ggplot2. Note that we reverse the scale of the y-axis because the original dataset contains the images upside-down. We further remove the legend and axis labels and change the tick labels.

library(ggplot2)

ggplot() +
  geom_raster(data = plotdf, aes(x = variable, y = y, fill = value)) +
  facet_wrap(~ Image, nrow = 2, ncol = 5) +
  scale_fill_gradient(low = "white", high = "black", na.value = NA) +
  theme(aspect.ratio = 1, legend.position = "none") +
  labs(x = NULL, y = NULL) +
  scale_x_discrete(breaks = seq(0, 28, 7), expand = c(0, 0)) +
  scale_y_reverse(breaks = seq(0, 28, 7), expand = c(0, 0))

Data Preparation

Next, it’s time to start the more technical work of predicting the labels from the image data. First, we need to reshape our data to convert it from a multidimensional array into a two-dimensional matrix. To do so, we vectorize each 28 by 28 matrix into a column of length 784, and then we bind the columns for all images on top of each other, finally taking the transpose of the resulting matrix. This way, we can convert a 60000 by 28 by 28 array into a 60000 by 784 matrix. We also normalize the data by dividing by the maximum opacity of 255.

train.images <- data.frame(t(apply(train.images, 1, c))) / max(fashion_mnist$train$x)
test.images <- data.frame(t(apply(test.images, 1, c))) / max(fashion_mnist$train$x)

We also create two data frames that include all training and test data (images and labels), respectively.

pixs <- 1:ncol(fashion_mnist$train$x)^2
names(train.images) <- names(test.images) <- paste0('pixel', pixs)

train.labels <- data.frame(label = factor(train.labels))
test.labels <- data.frame(label = factor(test.labels))

train.data <- cbind(train.labels, train.images)
test.data <- cbind(test.labels, test.images)

Artificial Neural Network

Now, let’s continue by building a simple neural network model to predict our clothing categories. Neural networks are artificial computing systems that were built with human neural networks in mind. Neural networks contain nodes, which transmit signals amongst one another. Usually the input at each node is a number, which is transformed according to a non-linear function of the input and weights, the latter being the parameters that are tuned while training the model. Sets of neurons are collected in different layers; neural networks are referred to as ‘deep’ when they contain at least two hidden layers. If you’re not familiar with artificial neural networks, then this free online book is a good source to start learning about them.

In this post, I will show you how artificial neural networks with different numbers of hidden layers compare, and I will also compare these networks to a convolutional network, which often performs better in the case of visual imagery. I will show you some basic models and how to code these, but will not spend too much time on tuning neural networks, for example when it comes to choosing the right amount of hidden layers or the number of nodes in each hidden layer. In essence, what it comes down to is that these parameters largely depend on your data structure, magnitude and complexity. The more hidden layers one adds, the more complex non-linear relationships can be modelled. Often, in my experience, adding hidden layers to a neural network increases their performance up to a certain number of layers, after which the increase becomes non-significant while the computational requirements and interpretation become more infeasible. It is up to you to play around a bit with your specific data and test how this trade-off works.

Although neural networks can easily be built in RStudio using TensorFlow and Keras, I really want to show you one of the incredible features of RStudio where you can run Python in RStudio. This can be done in two ways: either we choose “Terminal” on the top of the output console in RStudio and run Python via Terminal, or we use the base system2() function to run Python in RStudio.

For the second option, to use the system2() command, it’s important to first check what version of Python should be used. You can check which versions of Python are installed on your machine by running python --version in Terminal. Note that with RStudio 1.1 (1.1.383 or higher), you can run in Terminal directly from RStudio on the “Terminal” tab. You can also run python3 --version to check if you have Python version 3 installed. On my machine, python --version and python3 --version return Python 2.7.16 and Python 3.7.0, respectively. You can then run which python (or which python3 if you have Python version 3 installed) in Terminal, which will return the path where Python is installed. In my case, these respective commands return /usr/bin/python and /Library/Frameworks/Python.framework/Versions/3.7/bin/python3. As I will make use of Python version 3, I specify the latter as the path to Python in the use_python() function from the reticulate package. We can check whether the desired version of Python is used by using the sys package from Python. Just make sure to change the path in the code below to the version of Python you want to use and where that version is installed.

library(reticulate)
use_python(python = '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3')

sys <- import("sys")
sys$version

Now that we’ve specified the correct version of Python to be used, we can run our Python script from RStudio using the system2() function. This function also takes an argument for the version of Python used, which in my case is Python version 3. If you are using an older version of Python, make sure to change "python3" in the command below to "python2".

python_file <- "simple_neural_network_fashion_mnist.py"
system2("python3", args = c(python_file), stdout = NULL, stderr = "")

The source code used to build and fit the neural networks from the above script can be found in this Python script or this Jupyter notebook on my Github. In this post, I will walk you through the results from this script produced earlier, but if you would also like to see how to embed Python code and results in R Markdown files, check out this file on my Github!

I will now guide you step by step through the script called in the command above. First, we load the required packages in Python and set the session seed for replicability.

We then load the fashion MNIST data from keras and we normalize the data by dividing by maximum opacity of 255.

We start by building a simple neural network containing one hidden layer. Note that as here we use the untransformed but normalized data, we need to flatten the 28 by 28 pixels input first. We add one hidden densely-connected layer which performs output = relu(dot(input, kernel) + bias), where the rectified linear unit (relu) activation function has been proven to work well. We set the number of nodes equal to 128, because this seems to work well in our case. The number of nodes could essentially be any of the numbers 32, 64, 128, 256 and 512, as these are in a sequence of multiples between the number of nodes in the output (= 10) and input (= 784) layers. The softmax layer then assigns predicted probabilities to each of the ten clothing categories, which is also why there are ten nodes in this layer.

After building the neural network, we compile it. We specify sparse_categorical_crossentropy as the loss function, which is suitable for categorical multi-class responses. The optimizer controls the learning rate; adam (adaptive moment estimation) is similar to classical stochastic gradient descent and usually a safe choice for the optimizer. We set our metric of interest to be the accuracy, or the percentage of correctly classified images. Hereafter, we fit the model onto our training data set using ten iterations through the training data (“epochs”). Here, 70% is used for training and 30% is used for validation.
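The model itself lives in the Python script linked above. For readers who would rather stay in R, a rough keras-for-R sketch of the same architecture described here (one hidden layer of 128 relu units, a 10-unit softmax output, adam, sparse categorical cross-entropy, ten epochs with a 30% validation split) might look like this; it is an illustrative equivalent, not the author’s code:

library(keras)

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%           # flatten the 28 x 28 pixel input
  layer_dense(units = 128, activation = "relu") %>%    # one hidden layer with 128 nodes
  layer_dense(units = 10, activation = "softmax")      # predicted probabilities for 10 classes

model %>% compile(
  loss = "sparse_categorical_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)

history <- model %>% fit(
  fashion_mnist$train$x / 255, fashion_mnist$train$y,  # normalized images, integer labels
  epochs = 10,
  validation_split = 0.3
)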

Next, we print the results of the model in terms of training and testing loss and accuracy.

We can see that the neural network with one hidden layer already performs relatively well with a test accuracy of 87.09%. However, it seems like we are slightly overfitting (i.e. the model is fitted too well to a particular data set and therefore does not well extend to other data sets), as the training set accuracy (88.15%) is slightly higher than the test set accuracy. There are several ways to avoid overfitting in neural networks, such as simplifying our model by reducing the number of hidden layers and neurons, adding dropout layers that randomly remove some of the connections between layers, and early stopping when validation loss starts to increase. Later on in this post, I will demonstrate some of these methods to you. For further reading, I personally like this and this post showing how to avoid overfitting when building neural networks using keras. Instead, to see whether a deep neural network performs better at predicting clothing categories, we build a neural network with three hidden layers in a similar way as before.

It seems like the model with two additional layers does not perform better than the previous one with only one hidden layer, given that both the training (87.42%) and test set (86.03%) accuracies are lower and the loss (38.49) is higher. Let’s try whether adding another five hidden layers improves model performance, or whether we can conclude that increasing model complexity does not improve performance.

The model with eight hidden layers performs best in terms of training (88.21%) and test (87.58%) accuracy as well as loss (36.12). Nevertheless, the difference in performance between the first model with one hidden layer and the current model with eight hidden layers is only quite small. Although it seems that with so many hidden layers, we can model additional complexity that improves the accuracy of the model, we must ask ourselves whether increasing model complexity at the cost of interpretability and computational feasibility is worth this slight improvement in accuracy and loss.

Now that we have seen how the number of hidden layers affects model performance, let’s try and see whether increasing the number of epochs (i.e. the number of times the model iterates through the training data) from ten to fifty improves the performance of our first neural network with one hidden layer.

The three-layer model trained with fifty epochs has the highest train (89.32%) and test (88.68%) accuracies we have seen so far. However, the loss (54.73) is also about a third larger than we have seen before. Additionally, the model is also less time-efficient, given that the increase in accuracy is not substantial but the model takes significantly longer to fit. To better understand the trade-off between minimizing loss and maximizing accuracy, we plot model loss and accuracy over the number of epochs for the training and cross-validation data.

We observe that for the training data, loss decreases to zero while accuracy increases to one, as a result of overfitting. This is why we also check how the model performs on the cross-validation data, for which we observe that loss increases with the number of epochs while accuracy remains relatively stable. Using this figure, we can select an “optimal” number of epochs such that accuracy is maximized while loss is minimized. Looking at the cross-validation data accuracy, we see that the accuracy peak lies at around 20 epochs, for which loss is approximately 0.4. However, similar accuracies but much lower losses and modelling time are achieved with around 6 and 12 epochs, and so we might rather choose to train our model with around 6 or 20 epochs.

Regarding the model output, the predictions returned are probabilities per class or clothing category. We can calculate the majority vote by taking the class that has the maximum predicted probability of all classes. We can print the first ten elements of the majority_vote dictionary, which we can obtain as follows:
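(The dictionary itself is produced in the linked Python script; purely as an illustrative R equivalent, given a matrix of predicted class probabilities the majority vote is just the column index of each row maximum. pred_probs is an assumed name, filled with random numbers here only so the snippet runs on its own.)

# pred_probs stands in for the network's predicted probabilities (assumed name)
pred_probs <- matrix(runif(100), ncol = 10)
majority_vote <- apply(pred_probs, 1, which.max) - 1   # subtract 1 because labels start at 0
head(majority_vote, 10)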

All except the fifth (number 4) prediction are correct. In the fifth prediction, a shirt (category 6) is being misclassified as a top (category 0).

Convolutional Neural Network

I also wanted to show you how to build a convolutional neural network and compare its performance to the neural networks presented earlier, mostly because convolutional neural networks have generally been shown to perform better on visual image data. Essentially, what happens in a convolutional neural network is that a smaller matrix (the “filter matrix” or “kernel”) slides over the full image matrix, moving pixel by pixel, multiplies the filter matrix with the part of the full image matrix covered by the filter matrix at that moment, sums up these values and then repeats this until the full image matrix has been covered. For a more extensive explanation of how convolutional neural networks work, I refer you to this page or this page.
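To make the sliding-filter idea concrete, here is a tiny standalone R illustration (added here for explanation, not from the original post): a 3 by 3 filter moves over a 5 by 5 image, and each output entry is the sum of the element-wise products.

# Tiny convolution demo: a 3 x 3 kernel sliding over a 5 x 5 image
image  <- matrix(sample(0:9, 25, replace = TRUE), nrow = 5)
kernel <- matrix(c(1, 0, -1,
                   1, 0, -1,
                   1, 0, -1), nrow = 3, byrow = TRUE)   # simple vertical-edge filter

out <- matrix(0, 3, 3)
for (i in 1:3) {
  for (j in 1:3) {
    out[i, j] <- sum(image[i:(i + 2), j:(j + 2)] * kernel)   # multiply, then sum
  }
}
out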

As we need to prepare our data slightly differently for a convolutional neural network, we reload the data and reshape the images to “flatten” them. The last “1” in the reshape dimensions stands for greyscale, as we have images on a black-to-white scale. If we would have RGB images, we would change the “1” into a “3”.

We make sure that the values of the pixels, ranging from zero to 255, are of the float type and then we normalize the values as before.

The convolutional neural network cannot deal with categorical labels. Therefore, we transform the labels to binary vectors, where all vectors have length ten (as there are ten categories), a “1” at the index of the category and zeros elsewhere. For example, category 3 and 8 would be coded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], respectively. This transformation is referred to as “one hot encoding”.
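With keras loaded, the same encoding can be illustrated in R with to_categorical() (shown only to illustrate the encoding; the actual transformation happens in the Python script):

library(keras)
to_categorical(c(3, 8), num_classes = 10)   # one row per label, with a 1 at the label's index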

Now, we can start building our convolutional neural network. The first layer Conv2D is a convolutional layer that takes a 2-dimensional matrix of 28 by 28 pixels in greyscale (1) as input. As before, we use 128 nodes in this layer, as the size of the data is not extremely large and we want to avoid making our model unnecessarily complex. The filter matrix is of size 3 by 3, which is quite standard. As before, we use the rectified linear (“relu”) activation function. The MaxPooling2D layer reduces the dimensionality (and thus required computational power) by outputting the maximum of the part of the input image that is captured by the filter matrix. The Flatten layer simply flattens the result from the previous layer into a vector. As we saw before, the softmax layer then assigns predicted probabilities to each of the ten clothing categories. Note that we use the same optimizer and metric as before, but that we now use “categorical_crossentropy” as the loss function instead of “sparse_categorical_crossentropy”. The reason for this is that the former works for one-hot encoded labels, whereas the other works for categorical labels.

We fit our model to the training data, where we set the batch_size argument equal to the number of neurons in the convolutional layers (= 128).
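Again, the real model is defined in the Python script; a hedged keras-for-R sketch following the description above (128 filters of size 3 by 3, relu, max pooling, flatten, a 10-unit softmax, categorical cross-entropy, batch size 128; train_x and train_y are names I introduce here for the reshaped images and one-hot labels) could look like:

library(keras)

train_x <- array_reshape(fashion_mnist$train$x / 255, c(60000, 28, 28, 1))
train_y <- to_categorical(fashion_mnist$train$y, num_classes = 10)

model_cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")

model_cnn %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)

model_cnn %>% fit(train_x, train_y,
                  epochs = 10, batch_size = 128, validation_split = 0.3)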

Although we are still overfitting, we observe that the convolutional neural network performs better than the neural networks we saw earlier, achieving a training set accuracy of 95.16% and a test set accuracy of 90.39%, and a lower loss of 28.70. This was to be expected, because convolutional neural networks have previously been shown to perform well on visual imagery data. Let’s see if we can reduce overfitting by reducing the number of neurons from 128 to 64, adding dropout layers and enabling early stopping. Note that the rate in the Dropout layer is the percentage of connections between layers that are being removed. The SpatialDropout2D layer is a special kind of dropout layer for convolutional neural networks, which drops certain multiplications of the filter matrix with parts of the original image before pooling across all movements over the original image.

When fitting our model, we also enable early stopping to reduce overfitting. Instead of going through all epochs specified, early stopping automatically stops the iterations through the epoch once it’s being noticed that the validation loss increases.

From the results, we observe that although the training and test accuracies have decreased, they are now much more similar than before. The test accuracy has not decreased substantially, but the training accuracy has, which means that overfitting is much less of a problem than before. Next, we can print the first ten predictions from the model and the first ten actual labels and compare them.

Comparing these predictions to the first ten labels in the data set, we observe that the first ten predictions are correct!

Next up…

Next up in this series of blog posts, I will experiment with tree-based methods and support vector machines to see if they perform as well as the neural network in predicting clothing categories in the Fashion MNIST data. Let’s go!



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Statistical uncertainty with R and pdqr


[This article was first published on QuestionFlow , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

CRAN has accepted my ‘pdqr’ package. Here are important examples of how it can be used to describe and evaluate statistical uncertainty.

Prologue

I am glad to announce that my latest, long written R package ‘pdqr’ is accepted to CRAN. It provides tools for creating, transforming and summarizing custom random variables with distribution functions (as base R ‘p*()’, ‘d*()’, ‘q*()’, and ‘r*()’ functions). You can read a brief overview in one of my previous posts.

We will need the following setup:

library(pdqr)
library(magrittr)

# For the sake of reproducibility
set.seed(20191111)

Statistical uncertainty

General description

Statistical estimation usually has the following setup. There is a sample (observed, usually randomly chosen, set of values of measurable quantities) from some general population (whole set of values of the same measurable quantities). We need to make conclusions about the general population based on a sample. This is done by computing summary values (called statistics) of a sample, and making reasonable assumptions (with process usually called inference) about how these values are close to values that potentially can be computed based on whole general population. Thus, summary value based on a sample (sample statistic) is an estimation of potential summary value based on a general population (true value).

How can we make inference about quality of this estimation? This question itself describes statistical uncertainty and can be unfolded into a deep philosophical question about probability, nature, and life in general. Basically, the answer depends on assumptions about the relation between sample, general population, and statistic.

For me, the most beautiful inferential approach is bootstrap. It has the following key assumption: process of producing samples from general population can be simulated by doing random sampling with replacement from present sample. In other words, we agree (and practice often agrees with us) that random sampling with replacement from current sample (sometimes called bootstrap sampling) has a “close enough” behavior to the “true nature” of how initial sample was created. Numerical estimation of “how close” is also an interesting problem, but it is a more complicated topic.

Computation with pdqr

Natural way of computing bootstrap quantities is straightforward: produce \(B\) random bootstrap samples, for each one compute value of statistic under question, and summarize sample of statistic values with numerical quantity (usually with some center and spread values).
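Before turning to ‘pdqr’, here is a minimal base-R sketch of that loop, using the sample mean of mtcars$mpg as the statistic (the same statistic as in Example 1 below):

# Plain bootstrap loop: B resamples, compute the statistic for each, summarize
B <- 10000
boot_means <- replicate(B, mean(sample(mtcars$mpg, replace = TRUE)))

mean(boot_means)  # center of the bootstrap distribution
sd(boot_means)    # spread (bootstrap standard error of the sample mean)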

There are many ways of performing bootstrap in R, like boot::boot(), rsample::bootstraps(), and others. In turn, ‘pdqr’ offers its own way of describing and doing bootstrap inference for one-dimensional numeric sample(s):

  • Create a random variable (in the form of pdqr-function with new_*() family) based on initial sample. This random variable already describes a general population with “bootstrap assumption”: it will produce values based on initial sample. Type of this variable determines the type of bootstrap:
    • Type "discrete" describes ordinary bootstrap. Only values from initial sample can be produced.
    • Type "continuous" describes smooth bootstrap. Initial sample is smoothed by doing kernel density estimation with density() function and random variable produces values from distribution with that density.
  • Transform created random variable into one that produces statistic values obtained with bootstrap. Sometimes this can be done with basic mathematical operations like +, min, etc. But usually this is done with form_estimate() function: it creates many (10000 by default) bootstrap samples, for each computes statistic value, and creates its own random variable in the form of pdqr-function (class and type are preserved from supplied random variable, but this can be adjusted). It needs at least three arguments:
    • f: pdqr-function representing random variable. In described setup it is created as a result of “Create” step.
    • stat: statistic function that accepts numeric vector of size sample_size and returns single numeric or logical output.
    • sample_size: Size of a sample that each bootstrap draw should produce. In described setup it should be equal to number of elements in initial sample.
  • Summarize distribution of statistic. Usually this is point measure of center or spread, or interval.

Example 1: single numerical estimate

Mean value of ‘mpg’ variable in mtcars dataset is 20.090625. However, having in mind statistical uncertainty, we can ask how precise is this estimation? This can, and should, be reformulated in the following question: if we repeat sampling sets of 32 cars from general population of all cars, how close their ‘mpg’ sample means will be to each other? This can be answered by computing bootstrap distribution of sample means (pipe %>% function from ‘magrittr’ package is used to simplify notation):

# Using ordinary bootstrap
d_mpg_dis_mean <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread of this bootstrap distribution describes the precision of estimation:
# bigger values indicate lower precision
summ_sd(d_mpg_dis_mean)
## [1] 1.04067

# This discrete distribution has the following d-function
plot(
  d_mpg_dis_mean,
  main = "Ordinary bootstrap distribution of 'mpg' sample mean"
)

If modeling assumption about continuous nature of ‘mpg’ variable is reasonable (which it seems so), you can use “smooth bootstrap” by changing type of initial pdqr-function:

# Using smooth bootstrap with `type = "continuous"`
d_mpg_con_mean <- mtcars$mpg %>%
  new_d(type = "continuous") %>%
  form_estimate(stat = mean, sample_size = nrow(mtcars))

# Spread is higher in this case because kernel density estimation with
# `density()` function extends support during creation of pdqr-function on the
# bootstrap step
summ_sd(d_mpg_con_mean)
## [1] 1.153957

plot(
  d_mpg_con_mean,
  main = "Smooth bootstrap distribution of 'mpg' sample mean"
)

One can also do ordinary bootstrap but represent bootstrap distribution of sample mean with continuous random variable:

# Using ordinary bootstrap, but treating sample mean as continuous
d_mpg_con_mean_2 <- mtcars$mpg %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = mean, sample_size = nrow(mtcars),
    # Create continuous pdqr-function from bootstrap sample means
    args_new = list(type = "continuous")
  )

summ_sd(d_mpg_con_mean_2)
## [1] 1.063524

plot(
  d_mpg_con_mean_2,
  main = "Ordinary bootstrap distribution of 'mpg' continuous sample mean"
)

In this case, the sample mean has a standard deviation between 1.04067 and 1.153957 (depending on the assumptions about the data generating process).

Example 2: single logical estimate

The share of 4-cylinder cars in mtcars is equal to 0.34375. However, it might happen that we don’t care about the actual value, but only about whether it is bigger than 0.3 or not. In the present data it is bigger, but how sure can we be about that? In other words: if we repeat sampling sets of 32 cars from the general population of all cars, what part of those sets will have a share of 4-cylinder cars bigger than 0.3? Here is the way of computing that with ‘pdqr’:

# If statistic returns logical value (indicating presence of some feature in
# sample), output estimate pdqr-function is "boolean": "discrete" type function
# with elements being exactly 0 (indicating `FALSE`) and 1 (indicating `TRUE`).
d_cyl_lgl <- mtcars$cyl %>%
  new_d(type = "discrete") %>%
  form_estimate(
    stat = function(x) {mean(x == 4) > 0.3},
    sample_size = nrow(mtcars)
  )

d_cyl_lgl
## Probability mass function of discrete type
## Support: [0, 1] (2 elements, probability of 1: 0.7113)

# To extract certain probability from boolean pdqr-function, use
# `summ_prob_*()` functions
summ_prob_true(d_cyl_lgl)
## [1] 0.7113

summ_prob_false(d_cyl_lgl)
## [1] 0.2887

In this case, estimated probability that share of 4-cylinder cars in general population is more than 0.3 is 0.7113.

Example 3: comparison of estimates

In mtcars there are 19 cars with automatic transmission (‘am’ variable is 0) and 13 with manual (‘am’ variable is 1). We might be concerned with the following question: are cars with automatic transmission heavier than cars with manual transmission? This is an example of question where reformulating is very crucial, because it leads to completely different methodologies. Basically, it is all about dealing with statistical uncertainty and how to measure that one numerical set is bigger than the other.

First, rather verbose, way of expanding this question is this one: if we randomly choose a car with automatic transmission (uniformly on set of all cars with automatic transmission) and a car with manual (uniformly on set of all cars with manual transmission), what is the probability that weight of the first one is bigger than the second one?. With ‘pdqr’ this can be computed straightforwardly by comparing two random variables (which is implemented exactly like the question above; read more here):

# Seems reasonable to treat weight as continuous random variable. Note that this
# means use of kernel density estimation, which can lead to random variable that
# returns negative values. As weight can be only positive, it is a good idea to
# ensure that. Package 'pdqr' has `form_resupport()` function for that.
d_wt_am0 <- mtcars$wt[mtcars$am == 0] %>%
  new_d(type = "continuous") %>%
  # Ensure that returned values are only positive
  form_resupport(c(0, NA))

d_wt_am1 <- mtcars$wt[mtcars$am == 1] %>%
  new_d(type = "continuous") %>%
  form_resupport(c(0, NA))

# Comparing two pdqr-functions with `>=` results into boolean pdqr-function
summ_prob_true(d_wt_am0 >= d_wt_am1)
## [1] 0.9209063

So in this case the answer is that probability of “automatic” cars being heavier than “manual” ones is around 0.921.

The second way of understanding the comparison question is the following: is the average weight of “automatic” cars bigger than that of “manual” cars? This type of question is more widespread in statistical practice. Having to deal with statistical uncertainty, it should be reformulated: if we repeat sampling (in parallel pairs) sets of 19 “automatic” cars and 13 “manual” cars, what part of the set pairs will have a bigger mean weight for the “automatic” cars? This question implies creating bootstrap distributions of sample means for “automatic” and “manual” cars and then comparing them:

d_wt_am0_mean <- d_wt_am0 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 0)) %>%
  # Ensure "positiveness" of random variable
  form_resupport(c(0, NA))

d_wt_am1_mean <- d_wt_am1 %>%
  form_estimate(stat = mean, sample_size = sum(mtcars$am == 1)) %>%
  form_resupport(c(0, NA))

# Comparing two random variables representing sample means
summ_prob_true(d_wt_am0_mean >= d_wt_am1_mean)
## [1] 1

So in this case the answer is that probability of “automatic” cars being heavier than “manual” ones is 1.

The two computations can lead to decidedly different outcomes. If the researcher sets a standard 0.95 rule, the first variant would imply that the conclusion ‘“automatic” cars are heavier than “manual” ones’ isn’t significant, while the second would imply otherwise.

Epilogue

  • Basic knowledge about statistical uncertainty is crucial to understand the process of statistical inference.
  • One of the most popular methodologies for doing statistical inference is bootstrap. There are at least two kinds of it: ordinary and smooth.
  • Package ‘pdqr’ offers extensive functionality for describing and estimating statistical uncertainty. Core functions here are new_*() family, form_estimate(), and comparison operators.
sessionInfo()
sessionInfo()## R version 3.6.1 (2019-07-05)## Platform: x86_64-pc-linux-gnu (64-bit)## Running under: Ubuntu 18.04.3 LTS## ## Matrix products: default## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so## ## locale:##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              ##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    ##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   ##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            ## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C       ## ## attached base packages:## [1] stats     graphics  grDevices utils     datasets  methods   base     ## ## other attached packages:## [1] magrittr_1.5 pdqr_0.2.0  ## ## loaded via a namespace (and not attached):##  [1] Rcpp_1.0.2      bookdown_0.13   crayon_1.3.4    digest_0.6.21  ##  [5] evaluate_0.14   blogdown_0.15   pillar_1.4.2    rlang_0.4.0    ##  [9] stringi_1.4.3   rmarkdown_1.15  tools_3.6.1     stringr_1.4.0  ## [13] xfun_0.9        yaml_2.2.0      compiler_3.6.1  htmltools_0.3.6## [17] knitr_1.25


To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow .


Free Training: Mastering Data Structures in R


[This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Next week I will be delivering a free online R training. This is a new course I’ve created called Mastering Data Structures in R. This course is for you if:

  • You are new to R, and want a rigorous introduction to R as a programming language
  • You know how to analyze data in R, but want to take the next step and learn to program in it too
  • You already know R, but find yourself frequently confused by its “idiosyncrasies”

R as a Tool for Data Analysis

When clients reach out to me for training, they normally want help learning R as a tool for Data Analysis. Most of my students are already experts at Excel. They want to learn R because they have heard that it is more powerful than Excel. For these clients I normally propose a two-day course that focuses on the Tidyverse and R Markdown:

  • Day 1: learn to use ggplot2 and dplyr on datasets that have already been cleaned.
  • Day 2: in the morning, learn to import and clean data using the Tidyverse. In the afternoon, learn to use R Markdown to do reproducible research.

In general, this sort of course helps people learn to use R to do much of what they were already doing in Excel.

R as a Programming Language

The biggest problem with my 2-day workshop is this: while I intend it to be a starting point for learning R, many of my students think that it’s all that they need to know! In fact, though, I only consider it to be a starting point.

For example, when I was working at an online Real Estate company, I needed to analyze our website’s sales lead data. I started by using R as a data analysis tool. I used packages like ggplot2 to explore the ebb and flow of our sales leads over time for each metropolitan area. But I eventually hit a wall. What I really wanted to do was map our data at the ZIP code level, and mash it up with data from the Census Bureau. Of course, no package existed to do this: it’s too specific. But since I knew R as a programming language, I was able to create functions (and eventually a package) to answer the exact question I had.

And this is the level that I want all my students to get to. Every analyst has a unique problem. Learning to use R as a programming language allows you to answer the exact question you have.

Why Data Structures?

Most of my students struggle to learn the “programming language” aspect of R  because they never formally studied Computer Science. I decided to address this by creating a course that resembles an introductory Computer Science course but uses R as the language of instruction.

My undergraduate Computer Science curriculum focused on Data Structures and Algorithms. This is why Mastering Data Structures in R provides a rigorous introduction to the basic data structures in R. The course contains dozens of exercises, and will increase your fluency at the console. 

I recently gave a pilot version of this course that was well received. To celebrate, I will be running the course online, for free, next week. Here is the syllabus:

  • Monday 11/18: Data Types
  • Tuesday 11/19: Vectors
  • Wednesday 11/20: Factors and Lists
  • Friday 11/21: Data Frames

All sessions will start at 10am PT and last approximately 90 minutes.

If you are interested in the course but cannot attend live, then you should still register: I will be sending recordings to everyone who registers.

The post Free Training: Mastering Data Structures in R appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.



Using R and H2O Isolation Forest For Data Quality


[This article was first published on R-Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction:

We will identify anomalous patterns in data. This process is useful not only for finding inconsistencies and errors, but also for detecting abnormal data behavior; it can even help uncover cyber attacks on organizations.
More background information is available in the references linked in this article.
Before starting, we need the following software installed and working:

About the data used in this article.
# I am using https://www.kaggle.com/bradklassen/pga-tour-20102018-data
# The version I have is not the most recent one, but a newer version may be used.
#
# The file I am using is a 950 MB csv file with 9,720,530 records, including the header.
#
# One very important point: instead of getting lost in more than 9 million records,
# we will end up looking at just 158 records with anomalies for the analysed variable,
# which makes the data much easier to inspect.
Let’s start coding:
# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )

# For interactive plotting
suppressWarnings( suppressMessages( library( dygraphs ) ) )
suppressWarnings( suppressMessages( library( dplyr ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Start a single-node instance of H2O using all available processor cores and reserve 5GB of memory
h2oServer = h2o.init( ip = "localhost", port = 54321, max_mem_size = "5g", nthreads = -1 )
##  ## H2O is not running yet, starting it now... ##  ## Note:  In case of errors look at the following log files: ##     /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.out ##     /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.err ##  ##  ## Starting H2O JVM and connecting: . Connection successful! ##  ## R is connected to the H2O cluster:  ##     H2O cluster uptime:         2 seconds 395 milliseconds  ##     H2O cluster timezone:       America/Mexico_City  ##     H2O data parsing timezone:  UTC  ##     H2O cluster version:        3.26.0.6  ##     H2O cluster version age:    1 month and 8 days   ##     H2O cluster name:           H2O_started_from_R_ckassab_aat507  ##     H2O cluster total nodes:    1  ##     H2O cluster total memory:   4.44 GB  ##     H2O cluster total cores:    4  ##     H2O cluster allowed cores:  4  ##     H2O cluster healthy:        TRUE  ##     H2O Connection ip:          localhost  ##     H2O Connection port:        54321  ##     H2O Connection proxy:       NA  ##     H2O Internal Security:      FALSE  ##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4  ##     R Version:                  R version 3.6.1 (2019-07-05)
h2o.removeAll()    # Removes all data from h2o cluster, ensuring it is clean.
h2o.no_progress()  # Turn off progress bars for notebook readability

# Setting H2O timezone for proper date data type handling
# h2o.getTimezone() ===>>> UTC
# h2o.listTimezones() # We can see all H2O timezones
h2o.setTimezone("US/Central")
## [1] "US/Central"
# Note: I am using Ubuntu 19.10 and the /tmp directory.
# Every time I boot my computer, I need to copy the data file again to /tmp.

# Importing data file and setting data types accordingly.
allData = read.csv( "/tmp/PGA_Tour_Golf_Data_2019_Kaggle.csv", sep = ",", header = T )

# When using as.POSIXct H2O is not importing data, so we are using as.Date.
allData$Date = as.Date( allData$Date )
allData$Value = as.numeric(allData$Value)

# Convert dataset to H2O format.
allData_hex = as.h2o( allData )

# Build an Isolation Forest model
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:10:30 CST"
trainingModel = h2o.isolationForest( training_frame = allData_hex
                                     , sample_rate = 0.1
                                     , max_depth = 32
                                     , ntrees = 100
                                     )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping tolerance is ignored for _stopping_rounds=0..
Sys.time()
## [1] "2019-11-10 20:20:15 CST"
Sys.time() - startTime
## Time difference of 9.756691 mins
# According to H2O doc:
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on
# the basis of decision trees.
# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly. The assumption is that:
#
#   IF ONE UNIT'S MEASUREMENTS ARE SIMILAR TO OTHERS,
#   IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
#   The fewer splits needed, the more likely the unit is to be anomalous.
#
# The average number of splits is then used as a score.

# Calculate score for all data.
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:20:15 CST"
score = h2o.predict( trainingModel, allData_hex ) result_pred = as.vector( score$predict ) Sys.time()
## [1] "2019-11-10 20:23:18 CST"
Sys.time() - startTime
## Time difference of 3.056829 mins
################################################################################
# Setting threshold value for anomaly detection.
################################################################################

# Setting desired threshold percentage.
threshold = .999
# Let's say we want the 0.1% of the data that is most different from the rest.

# Using this threshold to get score limit to filter data anomalies.
scoreLimit = round( quantile( result_pred, threshold ), 4 )

# Add row score at the beginning of dataset
allData = cbind( RowScore = round( result_pred, 4 ), allData )

# Get data anomalies by filtering all data.
anomalies = allData[ allData$RowScore > scoreLimit, ]

# As we can see in the summary:
summary(anomalies)
##     RowScore              Player.Name        Date            ##  Min.   :0.9540   Jonas Blixt   : 231   Min.   :2019-07-07   ##  1st Qu.:0.9565   Jordan Spieth : 231   1st Qu.:2019-08-25   ##  Median :0.9614   Julian Etulain: 221   Median :2019-08-25   ##  Mean   :0.9640   Johnson Wagner: 213   Mean   :2019-08-24   ##  3rd Qu.:0.9701   John Chin     : 209   3rd Qu.:2019-08-25   ##  Max.   :1.0000   Keegan Bradley: 209   Max.   :2019-08-25   ##                   (Other)       :8325                        ##                            Statistic    ##  Club Head Speed                : 234   ##  Driving Pct. 300-320 (Measured): 193   ##  Carry Efficiency               : 163   ##  First Tee Early Lowest Round   : 161   ##  First Tee Late Lowest Round    : 160   ##  GIR Percentage - 100+ yards    : 158   ##  (Other)                        :8570   ##                                                      Variable    ##  First Tee Early Lowest Round - (LOW RND)                : 103   ##  First Tee Late Lowest Round - (LOW RND)                 :  96   ##  First Tee Late Lowest Round - (ROUNDS)                  :  64   ##  Driving Pct. 300-320 (Measured) - (TOTAL DRVS - OVERALL):  61   ##  GIR Percentage - 175-200 yards - (%)                    :  61   ##  First Tee Early Lowest Round - (ROUNDS)                 :  58   ##  (Other)                                                 :9196   ##      Value        ##  Min.   :  1268   ##  1st Qu.: 53058   ##  Median : 87088   ##  Mean   :111716   ##  3rd Qu.:184278   ##  Max.   :220583   ## 
# The Statistic "GIR Percentage - 100+ yards" is one of the most important values
# Filtering all anomalies within this Statistic value
statisticFilter = "GIR Percentage - 100+ yards"

specificVar = anomalies %>%
  filter(Statistic==statisticFilter)

cat( statisticFilter,": ", dim(specificVar)[1] )
## GIR Percentage - 100+ yards :  158
if( dim(specificVar)[1] > 0 ) {

  # We want to know the relation between Players and "Approaches from 200-225 yards"
  # So, in order to get a chart, we assign a code to each player
  # Since factors in R are really integer values, we do this to get the codes:
  specificVar$PlayerCode = as.integer(specificVar$Player.Name)

  # To sort our dataset we convert the date to numeric
  specificVar$DateAsNum = as.numeric( paste0( substr(specificVar$Date,1,4)
                                              , substr(specificVar$Date,6,7)
                                              , substr(specificVar$Date,9,10) ) )

  # And sort the data frame.
  specificVar = specificVar[order(specificVar$DateAsNum),]

  # Set records num using a sequence.
  rownames(specificVar) = seq(1:dim(specificVar)[1])

  colNamesFinalTable = c( "PlayerCode", "Player.Name", "Date", "Variable", "Value" )
  specificVar = specificVar[, colNamesFinalTable]
  specificVar$PlayerCode = as.factor(specificVar$PlayerCode)

  # Creating our final dataframe for our chart.
  specificVarChartData = data.frame( SeqNum = as.integer( rownames(specificVar) )
                                     , PlayerCode = specificVar$PlayerCode
                                     , Value = specificVar$Value
                                     )

  AnomaliesGraph = dygraph( specificVarChartData, main = ''
                            , xlab = paste(statisticFilter,"Anomaly Number."), ylab = "Player Code." ) %>%
    dyAxis("y", label = "Player Code.") %>%
    dyAxis("y2", label = "Value.", independentTicks = TRUE) %>%
    dySeries( name = "PlayerCode", label = "Player Code.", drawPoints = TRUE, pointShape = "dot"
              , color = "blue", pointSize = 2 ) %>%
    dySeries( name = "Value", label = "Value.", drawPoints = TRUE, pointShape = "dot"
              , color = "green", pointSize = 2, axis = 'y2' ) %>%
    dyRangeSelector()

  dyOptions( AnomaliesGraph, digitsAfterDecimal = 0 )
}
## Registered S3 method overwritten by 'xts': ##   method     from ##   as.zoo.xts zoo

Sample chart with the anomalies found:

Sample data table with the anomalies found:

     Player Code   Player Name        Date         Variable                                  Value
 1   686           Josh Teater        2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    198471
 2   655           Johnson Wagner     2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    186658
 3   618           Jim Furyk          2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    198471
 4   723           Keegan Bradley     2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    211362
 5   213           Cameron Tringale   2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    198471
 6   712           Justin Thomas      2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    199671
 7   520           Hunter Mahan       2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    178096
 8   587           Jason Day          2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    189755
 9   539           J.J. Henry         2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    177431
10   657           Jon Rahm           2019-08-25   GIR Percentage – 100+ yards – (ROUNDS)    199671

Showing 1 to 10 of 158 entries
Here is the code on GitHub, including the full working HTML demo: https://github.com/LaranIkal/DataQuality
Enjoy it!!!…

Carlos Kassab
More information about R:

To leave a comment for the author, please follow the link and comment on their blog: R-Analytics.


Community Call – Last Night, Testing Saved my Life


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

To the uninitiated, software testing may seem variously boring, daunting or bogged down in obscure terminology. However, it has the potential to be enormously useful for people developing software at any level of expertise, and can often be put into practice with relatively little effort.
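As a small illustration of how little code a first automated test can take (this example is an illustration added here, not from the speakers), a test written with the testthat package looks like this:

library(testthat)

# A toy function and a test that pins down its expected behaviour
fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9

test_that("fahrenheit_to_celsius converts known values", {
  expect_equal(fahrenheit_to_celsius(32), 0)
  expect_equal(fahrenheit_to_celsius(212), 100)
})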

Our 1-hour Call will include two speakers and at least 20 minutes for Q & A.

As someone with a background in science, not software engineering, Steffi LaZerte will share her experiences using automated testing in R to ensure that packages do what they’re supposed to do, on all the operating systems they’re supposed to do it on, and that they handle weird stuff gracefully.

Rich FitzJohn will talk about how to make testing much more useful than a chore that one does just because it is “best practice”, focussing on how testing has been useful in his work – refactoring code bases, preventing regressions, and improving how code is written in the first place. He will discuss how testing can help at the heart of processes, particularly in collaborative work, and how automating your testing leaves you with more time and energy to focus on your software, and describe strategies for testing as systems grow more complex, such as using mocking to simulate components.

🎤 See below for speaker bios, resources, and how to join the call.

Share your thoughts in the comments!

  • What motivated you to start testing?
  • What do you wish your past-self knew?
  • What are your favorite resources on the topic?
  • What questions do you have?

Join the Call

🕘Thursday, December 5, 10-11 AM PDT / 6-7 PM GMT (find your timezone)

☎ Everyone is welcome. No RSVP needed. You can use https://zoom.us/test to check that you are set up for audio and video.

Join Zoom Meeting on Dec 5 https://zoom.us/j/297967525. To join by phone, find your local number. Meeting ID: 297 967 525.

🎥 After the Call, we’ll post the video and collaborative notes on the archive page.

Resources

Speakers

Steffi LaZerte
Steffi LaZerte is a consulting R programmer and teacher, and Adjunct Professor at Brandon University in Manitoba. She is the author of the rOpenSci-reviewed weathercan package. She has a PhD in Behavioral Ecology and recently received the Society of Canadian Ornithologists Early Career Research Award. Steffi’s research career has involved exploring seasonal changes in chipmunk activity patterns, effects of anthropogenic noise on communication in birds, and the development of R packages for behavioral ecology.
Steffi on GitHub, Twitter, Website, rOpenSci

Rich FitzJohn
Rich FitzJohn is a research software engineer in the RESIDE group at the Department of Infectious Disease Epidemiology and MRC Centre for Global Infectious Disease Analysis in London UK. His focuses are infrastructure and tools that generalize problems common to research groups across the department. He is interested in reproducible research and in helping researchers get more science done per line of code that they write. Rich has a PhD in zoology. His research career involved modeling coexistence in tropical forests, diversification over macro-evolutionary timescales and the potential for gene flow from genetically-modified crops. Rich is part of rOpenSci’s Leadership team.
Rich on GitHub, Twitter, Website, rOpenSci


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


What can we really expect to learn from a pilot study?


[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I am involved with a very interesting project – the NIA IMPACT Collaboratory – where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.

There is a rich literature on this topic. For example, these papers by Lancaster et al and Leon et al provide nice discussions about how pilot studies should fit into the context of larger randomized trials. The key point made by both groups of authors is that pilot studies are important sources of information about the feasibility of conducting a larger, more informative study: Can the intervention actually be implemented well enough to study it? Will it be possible to recruit and retain patients? How difficult will it be to measure the primary outcome? Indeed, what is the most appropriate outcome to be measuring?

Another thing the authors agree on is that the pilot study is not generally well-equipped to provide an estimate of the treatment effect. Because pilot studies are limited in resources (both time and money), sample sizes tend to be quite small. As a result, any estimate of the treatment effect is going to be quite noisy. If we accept the notion that there is some true underlying treatment effect for a particular intervention and population of interest, the pilot study estimate may very well fall relatively far from that true value. As a result, if we use that effect size estimate (rather than the true value) to estimate sample size requirements for the larger randomized trial, we run a substantial risk of designing an RCT that is too small, which may lead us to miss identifying a true effect. (Likewise, we may end up with a study that is too large, using up precious resources.)
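To get a sense of just how noisy such an estimate can be, here is a quick illustrative simulation (an addition of mine, not part of the original post), assuming 30 subjects per arm and a true standardized effect size of 0.5:

# Distribution of estimated standardized effect sizes from many hypothetical pilots
set.seed(123)
est_d <- replicate(1000, {
  y0 <- rnorm(30, mean = 0, sd = 100)   # control arm
  y1 <- rnorm(30, mean = 50, sd = 100)  # intervention arm
  (mean(y1) - mean(y0)) / sqrt((var(y0) + var(y1)) / 2)
})
quantile(est_d, probs = c(0.1, 0.5, 0.9))  # wide spread around the true d = 0.5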

My goal here is to use simulations to see how a small pilot study could potentially lead to poor design decisions with respect to sample size.

A small, two-arm pilot study

In these simulations, I will assume a two-arm study (intervention and control) with a true intervention effect \(\Delta = 50\). The outcome is a continuous measure with a within-arm standard deviation \(\sigma = 100\). In some fields of research, the effect size would be standardized as \(d = \Delta / \sigma\). (This is also known as Cohen’s \(d\).) So, in this case the true standardized effect size \(d=0.5\).

If we knew the true effect size and variance, we could skip the pilot study and proceed directly to estimate the sample size required for 80% power and Type I error rate \(\alpha = 0.05\). Using the pwr.t.test function in the pwr library, we specify the treatment effect (as \(d\)), significance level \(\alpha\), and power to get the number of subjects needed for each study arm. In this case, it would be 64 (for a total of 128):

library(pwr)

pwr.t.test(n = NULL, d = 50/100, sig.level = 0.05,
  power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 64
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

If we do not have an estimate of \(d\) or even of the individual components \(\Delta\) and \(\sigma\), we may decide to do a small pilot study. I simulate a single study with 30 subjects in each arm (for a total study sample size of 60). First, I generate the data set (representing this one version of the hypothetical study) with a treatment indicator \(rx\) and an outcome \(y\):

library(simstudy)

defd <- defDataAdd(varname = "y", formula = "rx * 50", variance = 100^2)
ss <- 30

set.seed(22821)

dd <- genData(n = ss*2)
dd <- trtAssign(dd, grpName = "rx")
dd <- addColumns(defd, dd)
head(dd)

##    id rx    y
## 1:  1  0 -150
## 2:  2  1   48
## 3:  3  0 -230
## 4:  4  1  116
## 5:  5  1   91
## 6:  6  1  105

Once we have collected the data from the pilot study, we probably would try to get sample size requirements for the larger RCT. The question is, what information can we use to inform \(d\)? We have a couple of options. In the first case, we can estimate both \(\Delta\) and \(\sigma\) from the data and use those results directly in power calculations:

lmfit <- lm(y ~ rx, data = dd)

Delta <- coef(lmfit)["rx"]
Delta

## rx 
## 78

sd.rx <- dd[rx==1, sd(y)]
sd.ctl <- dd[rx==0, sd(y)]

pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2) / 2 )
pool.sd

## [1] 94

The estimated standard deviation (94) is less than the true value, and the effect size is inflated (78), so that the estimated \(\hat{d}\) is also too large, close to 0.83. This is going to lead us to recruit fewer participants (24 in each group) than the number we actually require (64 in each group):

pwr.t.test(n = NULL, d = Delta/pool.sd, sig.level = 0.05,
  power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 24
##               d = 0.83
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Alternatively, if we had external information that provided some insight into the true effect size, or, absent that, we use a minimally clinically significant effect size, we might get a better result. In this case, we are quite fortunate to use an effect size of 50. However, we will continue to use the variance estimate from the pilot study. Using this approach, the resulting sample size (56) happens to be much closer to the required value (64):

pwr.t.test(n = NULL, d = 50/pool.sd, sig.level = 0.05,
  power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 56
##               d = 0.53
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Speak truth to power

Now the question becomes: what is the true expected power of the RCT based on the sample size estimated in the pilot study? To estimate this true power, we use the true effect size and the true variance (i.e. the true \(d\)).

In the first case, where we actually used the true \(d\) to get the sample size estimate, we just recover the 80% power estimate. No surprise there:

pwr.t.test(n = 64, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.8

In the second case, where we used \(\hat{d} = \hat{\Delta} / \hat{\sigma}\) to get the sample size \(n=24\), the true power of the larger RCT would be 40%:

pwr.t.test(n = 24, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.4

And if we had used \(\hat{d} = 50 / \hat{\sigma}\) to get the sample size estimate \(n=56\), the true power would have been 75%:

pwr.t.test(n = 56, d = 0.50, sig.level = 0.05, type = "two.sample")$power
## [1] 0.75

Conservative estimate of standard deviation

While the two papers I cited earlier suggest that it is not appropriate to use effect sizes estimated from a pilot study (and more on that in the next and last section), this 1995 paper by R.H. Browne presents the idea that we can use the estimated standard deviation from the pilot study. Or rather, to be conservative, we can use the upper limit of a one-sided confidence interval for the standard deviation estimated from the pilot study.

The confidence interval for the standard deviation is not routinely provided in R. Another paper analyzes one-sided confidence intervals quite generally under different conditions, and provides a formula in the most straightforward case under assumptions of normality to estimate the \(\gamma*100\%\) one-sided confidence interval for \(\sigma^2\):

\[ \left( 0,\frac{(N-2)s_{pooled}^2}{\chi^2_{N-2;\gamma}} \right) \]

where \(\chi^2_{N-2;\gamma}\) is determined by \(P(\chi^2_{N-2} > \chi^2_{N-2;\gamma}) = \gamma\). So, if \(\gamma = 0.95\) then we can get a one-sided 95% confidence interval for the standard deviation using that formulation:

gamma <- 0.95

qchi <- qchisq(gamma, df = 2*ss - 2, lower.tail = FALSE)
ucl <- sqrt( ( (2*ss - 2) * pool.sd^2 ) / qchi )
ucl

## [1] 111

The point estimate \(\hat{\sigma}\) is 94, and the one-sided 95% confidence interval is \((0, 111)\). (I’m happy to provide a simulation to demonstrate that this is in fact the case, but won’t do it here in the interest of space.)

If we use \(\hat{\sigma}_{ucl} = 111\) to estimate the sample size, we get a more conservative sample size requirement (78) than if we used the point estimate \(\hat{\sigma} = 94\) (where the sample size requirement was 56):

pwr.t.test(n = NULL, d = 50/ucl, sig.level = 0.05,
  power = 0.80, type = "two.sample")

## 
##      Two-sample t test power calculation 
## 
##               n = 78
##               d = 0.45
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Ultimately, using \(\gamma = 0.95\) might be too conservative in that it might lead to an excessively large sample size requirement. Browne’s paper uses simulation to evaluate a range of \(\gamma\)’s, from 0.5 to 0.9, which I also do in the next section.

Simulation of different approaches

At this point, we need to generate multiple iterations to see how the various approaches perform over repeated pilot studies based on the same data generating process, rather than looking at a single instance as I did in the simulations above.

As Browne does in his paper, I would like to evaluate the distribution of power estimates that arise from the various approaches. I compare using an external source or minimally clinically meaningful effect size to estimate \(\Delta\) (in the figures below, this would be the columns labeled ‘truth’) with using the effect size point estimate from the pilot (labeled pilot). I also compare using a point estimate of \(\sigma\) from the pilot (where \(\gamma=0\)), with using the upper limit of a one-sided confidence interval defined by \(\gamma\). In these simulations I compare three levels of \(\gamma\): \(\gamma \in (0.5, 0.7, 0.9)\).

In each of the simulations, I assume 30 subjects per arm, and evaluate true effect sizes of 30 and 75. In all cases, the true standard deviation is \(\sigma = 100\), so that the true \(d\) is 0.30 or 0.75.

The box plots in the figure represent the distribution of power estimates for the larger RCT under different scenarios. Each scenario was simulated 5000 times each. Ideally, the power estimates should cluster close to 80%, the targeted level of power. In the figure, the percentage next to each box plot reports the percent of simulations with power estimates at or above the target of 80%.

Two things jump out at me. First, using the true effect size in the power calculation gives us a much better chance of designing an RCT with close to 80% power, even when a point estimate is used for \(\hat{\sigma}\). In Browne’s paper, the focus is on the fact that even when using the true effect size, there is a high probability of power falling below 80%. This may be the case, but it may be more important to note that when power is lower than the target, it is actually likely to fall relatively close to the 80% target. If the researcher is very concerned about falling below that threshold, perhaps using \(\gamma\) higher than 0.6 or 0.7 might provide an adequate cushion.

Second, it appears using the effect size estimate from the pilot as the basis for an RCT power analysis is risky. The box plots labeled as pilot exhibit much more variation than the ‘true’ box plots. As a result, there is a high probability that the true power will fall considerably below 80%. And in many other cases, the true power will be unnecessarily large, due to the fact that they have been designed to be larger than they need to be.

The situation improves somewhat with larger pilot studies, as shown below with 60 patients per arm, where variation seems to be reduced. Still, an argument can be made that using effect sizes from pilot studies is too risky, leading to an under-powered or overpowered study, neither of which is ideal.

A question remains about how best to determine what effect size to use for the power calculation if using the estimate from the pilot is risky. I think a principled approach, such as drawing effect size estimates from the existing literature or using clinically meaningful effect sizes, is a much better way to go. And the pilot study should focus on other important feasibility issues that can help improve the design of the RCT.

References:

Lancaster, G.A., Dodd, S. and Williamson, P.R., 2004. Design and analysis of pilot studies: recommendations for good practice. Journal of evaluation in clinical practice, 10(2), pp.307-312.

Leon, A.C., Davis, L.L. and Kraemer, H.C., 2011. The role and interpretation of pilot studies in clinical research. Journal of psychiatric research, 45(5), pp.626-629.

Browne, R.H., 1995. On the use of a pilot sample for sample size determination. Statistics in medicine, 14(17), pp.1933-1940.

Cojbasic, V. and Loncar, D., 2011. One-sided confidence intervals for population variances of skewed distributions. Journal of Statistical Planning and Inference, 141(5), pp.1667-1672.

 

Support:

This research is supported by the National Institutes of Health National Institute on Aging U54AG063546. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.

 

Addendum

Below is the code I used to run the simulations and generate the plots

getPower <- function(ssize, esize, gamma = 0, use.est = FALSE) {    estring <- paste0("rx * ", esize)  defd <- defDataAdd(varname = "y", formula = estring, variance = 100^2)    N <- ssize * 2    dd <- genData(n = N)  dd <- trtAssign(dd, grpName = "rx")  dd <- addColumns(defd, dd)    lmfit <- lm(y~rx, data = dd)  sd.rx <- dd[rx==1, sd(y)]  sd.ctl <- dd[rx==0, sd(y)]  pool.sd <- sqrt( (sd.rx^2 + sd.ctl^2)  / 2 )    qchi <- qchisq(gamma, df = N - 2, lower.tail = FALSE)  ucl <- sqrt( ( (N-2) * pool.sd^2 ) / qchi  )  p.sd <- estsd * (gamma == 0) + ucl * (gamma > 0)  p.eff <- esize * (use.est == FALSE) +               coef(lmfit)["rx"] * (use.est == TRUE)    if (abs(p.eff/p.sd) < 0.0002) p.eff <- sign(p.eff) * .0002 * p.sd    nstar <- round(pwr.t.test(n = NULL, d =  p.eff/p.sd, sig.level = 0.05,                             power = 0.80, type = "two.sample")$n,0)      power <- pwr.t.test(n=nstar, d = esize/100, sig.level = 0.05,                       type = "two.sample")    return(data.table(ssize, esize, gamma, use.est,    estsd = estsd, ucl = ucl, nstar, power = power$power,    est = coef(lmfit)["rx"],     lcl.est = confint(lmfit)["rx",1] ,     ucl.est = confint(lmfit)["rx",2])  )  }
dres <- data.table()  for (i in c(30, 60)) { for (j in c(30, 75)) {  for (k in c(0, .5, .7)) {   for (l in c(FALSE, TRUE)) {    dd <- rbindlist(lapply(1:5000,       function(x) getPower(ssize = i, esize = j, gamma = k, use.est = l))    )    dres <- rbind(dres, dd)}}}}
above80 <- dres[, .(x80 = mean(power >= 0.80)),                   keyby = .(ssize, esize, gamma, use.est)]above80[, l80 := scales::percent(x80, accuracy = 1)]  g_labeller <- function(value) {    paste("\U03B3", "=", value) # unicode for gamma}e_labeller <- function(value) {  paste("\U0394", "=", value) # unicdoe for Delta}  ggplot(data = dres[ssize == 30],        aes(x=factor(use.est, labels=c("'truth'", "pilot")), y=power)) +  geom_hline(yintercept = 0.8, color = "white") +  geom_boxplot(outlier.shape = NA, fill = "#9ba1cf", width = .4) +  theme(panel.grid = element_blank(),        panel.background = element_rect(fill = "grey92"),        axis.ticks = element_blank(),        plot.title = element_text(size = 9, face = "bold")) +  facet_grid(esize ~ gamma,     labeller = labeller(gamma = g_labeller, esize = e_labeller)) +  scale_x_discrete(    name = "\n source of effect size used for power calculation") +  scale_y_continuous(limits = c(0,1), breaks = c(0, .8),                      name = "distribution of power estimates \n") +  ggtitle("Distribution of power estimates (n = 30 per treatment arm)") +  geom_text(data = above80[ssize == 30],             aes(label = l80), x=rep(c(0.63, 1.59), 6), y = 0.95,            size = 2.5)

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


An API for @racently


[This article was first published on R | datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

@racently is a side project that I have been nursing along for a couple of years. It addresses a problem that I have as a runner: my race results are distributed across a variety of web sites. This makes it difficult to create a single view on my running performance (or lack thereof) over time. I suspect that I am not alone in this. Anyway, @racently was built to scratch my personal itch: my running results are now all aggregated in one place.

A few months ago @DanielCunnama suggested that I add the ability to create running groups in @racently. This sounded like a good idea. It also sounded like a bit of work and TBH I just did not have the time. So I made a counter-suggestion: how about an API so that he could effectively aggregate the data in any way he wanted? He seemed happy with the idea, so it immediately went onto my backlog. And there it stayed. But @DanielCunnama is a persistent guy (perhaps this is why he’s a class runner!) and he pinged me relentlessly about this… until Sunday when I relented and created the API.

And now I’m happy that I did, because it gives me an opportunity to write up a quick post about how these data can be accessed from R.

Profiles on @racently

I’m going to use Gerda Steyn as an example. I hope she doesn’t mind. This is what Gerda’s profile looks like on @racently.

Now there are a couple of things I should point out:

  1. This profile is far from complete. Gerda has run a lot more races than that. These are just the ones that we currently have in our database. We’re adding more races all the time, but it’s a long and arduous process.
  2. The result for the 2019 Comrades Marathon was when she won the race!

A view like this can be created for any runner on the system. Most runners in South Africa should have a profile (unless they have explicitly requested that we remove it!).

Pulling Data with the API

Suppose that you wanted to do some analytics on the data. You’d want to pull the data into R or Python. You could scrape the site, but the API makes it a lot easier to access the data.

Load up some helpful packages.

library(glue)
library(dplyr)
library(purrr)
library(httr)

Set up the URL for the API endpoint and the key for Gerda’s profile.

URL = "https://www.racently.com/api/athlete/{key}/"
key = "7ef6fbc8-4169-4a98-934e-ff5fa79ba103"

Send a GET request and extract the results from the response object, parsing the JSON into an R list.

response <- glue(URL) %>%
  GET() %>%
  content()

Extract some basic information from the response.

response$url
## [1] "http://www.racently.com/api/athlete/7ef6fbc8-4169-4a98-934e-ff5fa79ba103/"

response$name
## [1] "Gerda Steyn"

response$gender
## [1] "F"

Now get the race results. This requires a little more work because of the way that the JSON is structured: an array of licenses, each of which has a nested array of race result objects.

response$license %>%
  map_dfr(function(license) {
    license$result %>%
      map_dfr(as_tibble)} %>%
    mutate(
      club = license$club,
      number = license$number,
      date = as.Date(date)
    )
  ) %>%
  arrange(desc(date))

##   date       race          distance time     club    number
## 1 2019-06-09 Comrades      86.8 km  05:58:53 Nedbank     NA
## 2 2018-06-10 Comrades      90.2 km  06:15:34 Nedbank   8300
## 3 2018-05-20 RAC           10.0 km  00:35:38 Nedbank   8300
## 4 2018-05-01 Wally Hayward 10.0 km  00:35:35 Nedbank   8300
## 5 2017-06-04 Comrades      86.7 km  06:45:45 Nedbank     NA
## 6 2016-05-29 Comrades      89.2 km  07:08:23 Nedbank     NA

For good measure, let’s throw in the results for @DanielCunnama.

##    date       race               distance time     club              number
##  1 2019-09-29 Grape Run          21.1 km  01:27:49 Harfield Harriers   4900
##  2 2019-06-09 Comrades           86.8 km  07:16:21 Harfield Harriers   4900
##  3 2019-02-17 Cape Peninsula     42.2 km  03:08:47 Harfield Harriers   4900
##  4 2019-01-26 Red Hill Marathon  36.0 km  02:52:55 Harfield Harriers   4900
##  5 2019-01-13 Bay to Bay         30.0 km  02:15:55 Harfield Harriers   7935
##  6 2018-11-10 Winelands          42.2 km  02:58:56 Harfield Harriers   7935
##  7 2018-10-14 The Gun Run        21.1 km  01:22:30 Harfield Harriers   7935
##  8 2018-10-07 Grape Run          21.1 km  01:36:46 Harfield Harriers   8358
##  9 2018-09-23 Cape Town Marathon 42.2 km  03:11:52 Harfield Harriers   7935
## 10 2018-09-09 Ommiedraai         10.0 km  00:37:46 Harfield Harriers  11167
## 11 2018-06-10 Comrades           90.2 km  07:19:25 Harfield Harriers   7935
## 12 2018-02-18 Cape Peninsula     42.2 km  03:08:27 Harfield Harriers   7935
## 13 2018-01-14 Bay to Bay         30.0 km  02:11:50 Harfield Harriers   7935
## 14 2017-10-01 Grape Run          21.1 km  01:27:18 Harfield Harriers   7088
## 15 2017-09-17 Cape Town Marathon 42.2 km  02:57:55 Harfield Harriers   7088
## 16 2017-06-04 Comrades           86.7 km  07:46:18 Harfield Harriers   7088
## 17 2016-10-16 The Gun Run        21.1 km  01:19:09 Harfield Harriers     NA
## 18 2016-09-10 Mont-Aux-Sources   50.0 km  05:42:23 Harfield Harriers     NA
## 19 2016-05-29 Comrades           89.2 km  07:22:53 Harfield Harriers     NA
## 20 2016-02-21 Cape Peninsula     42.2 km  03:17:12 Harfield Harriers     NA

Wrapping Up

Let’s digress for a moment to look at a bubble plot showing the number of races on @racently broken down by runner. There are some really prolific runners.

We’ve currently got just under one million individual race results across over a thousand races. If you have the time and inclination then there’s definitely some interesting science to be done using these results. I’d be very interested in collaborating, so just shout if you are interested.

Feel free to grab some data via the API. At the moment you’ll need to search for an athlete on the main website in order to find their API key. I’ll implement some search functionality in the API when I get a chance.

Finally, here’s a talk I gave about @racently at the Bulgaria Web Summit (2017) in Sofia, Bulgaria. A great conference, incidentally. Well worth making the trip to Bulgaria.


To leave a comment for the author, please follow the link and comment on their blog: R | datawookie.


Teach R to see by Borrowing a Brain


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It has been an old dream to teach a computer to see, i.e. to hold something in front of a camera and let the computer tell you what it sees. For decades it remained exactly that: a dream – because although we as human beings are able to see, we don't know how we do it, let alone precisely enough to put it into algorithmic form.

Enter machine learning!

As we have seen in Understanding the Magic of Neural Networks, we can use neural networks for that. We have to show the network thousands of already tagged pics (= supervised learning) and, after many training cycles, the network will have internalized the important features of all the pictures shown to it. The problem is that it often takes a lot of computing power and time to train a neural network from scratch.

The solution: a pre-trained neural network which you can just use out of the box! In the following we will build a system where you can point your webcam in any direction, or hold items in front of it, and R will tell you what it sees: a banana, some toilet paper, a sliding door, a bottle of water and so on. Sounds impressive, right?

For the code below to work you first have to go through the following steps:

  1. Install Python through the Anaconda distribution: https://www.anaconda.com
  2. Install the R interface to Keras (a high-level neural networks API): https://keras.rstudio.com
  3. Load the keras package and the pre-trained ResNet-50 neural network (based on https://keras.rstudio.com/reference/application_resnet50.html):

     library(keras)

     # instantiate the model
     resnet50 <- application_resnet50(weights = 'imagenet')

  4. Build a function which takes a picture as input and makes a prediction on what can be seen in it (a quick single-image check follows this list):

     predict_resnet50 <- function(img_path) {
       # load the image
       img <- image_load(img_path, target_size = c(224, 224))
       x <- image_to_array(img)

       # ensure we have a 4d tensor with a single element in the batch dimension,
       # then preprocess the input for prediction using resnet50
       x <- array_reshape(x, c(1, dim(x)))
       x <- imagenet_preprocess_input(x)

       # make predictions, then decode and print them
       preds <- predict(resnet50, x)
       imagenet_decode_predictions(preds, top = 3)[[1]]
     }

  5. Start the webcam and set it to take a picture every 2 seconds (how exactly you do that depends on your webcam software), then start taking pics.
  6. Let the following code run and put different items in front of the camera… Have fun!

     img_path <- "C:/Users/.../Pictures/Camera Roll" # change path appropriately

     while (TRUE) {
       files <- list.files(path = img_path, full.names = TRUE)
       img <- files[which.max(file.mtime(files))] # grab latest pic
       cat("\014") # clear console
       print(predict_resnet50(img))
       Sys.sleep(1)
     }

  7. When done, click the Stop button in RStudio and stop taking pics.
  8. Optional: delete the saved pics – you can also do this with the following command:

     unlink(paste0(img_path, "/*")) # delete all pics in folder
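
Before pointing the loop at a webcam folder, it's worth checking the function on a single saved image. A minimal sketch (the file name below is a placeholder, not a picture that ships with this post):

# Quick sanity check on one image before starting the webcam loop.
test_img <- "C:/Users/.../Pictures/Camera Roll/banana.jpg"  # placeholder path
predict_resnet50(test_img)
# Expected: a three-row data frame with class_name, class_description and score,
# much like the examples below.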

Here are a few examples of my experiments with my own crappy webcam:

  class_name class_description        score
1  n07753592            banana 9.999869e-01
2  n01945685              slug 5.599981e-06
3  n01924916          flatworm 3.798145e-06

  class_name class_description        score
1  n07749582             lemon 0.9924537539
2  n07860988             dough 0.0062746629
3  n07747607            orange 0.0003545524

  class_name class_description        score
1  n07753275         pineapple 0.9992571473
2  n07760859     custard_apple 0.0002387811
3  n04423845           thimble 0.0001032234

  class_name class_description      score
1  n04548362            wallet 0.51329690
2  n04026417             purse 0.33063501
3  n02840245            binder 0.02906101

  class_name class_description        score
1  n04355933          sunglass 5.837566e-01
2  n04356056        sunglasses 4.157162e-01
3  n02883205           bow_tie 9.142305e-05

So far all of the pics have been on a white background. What happens in a more chaotic setting?

  class_name class_description      score
1  n03691459       loudspeaker 0.62559783
2  n03180011  desktop_computer 0.17671309
3  n03782006           monitor 0.04467739

  class_name class_description      score
1  n03899768             patio 0.65015656
2  n03930313      picket_fence 0.04702349
3  n03495258              harp 0.04476695

  class_name class_description     score
1  n02870880          bookcase 0.5205195
2  n03661043           library 0.3582534
3  n02871525          bookshop 0.1167464

Quite impressive for such a small amount of work, isn't it?

Another way to make use of pre-trained models is to take them as a basis for building new nets that can e.g. recognize things the original net was not able to. You don't have to start from scratch: you can reuse e.g. only the lower layers, which hold the basic building blocks, while retraining the higher layers (another possibility is to add additional layers on top of the pre-trained model).

This method is called Transfer Learning; an example would be to reuse a net that is able to differentiate between male and female persons for recognizing their age or their mood. The main advantage, obviously, is that you get results much faster this way; one disadvantage may be that a net trained from scratch might yield better results. As so often in machine learning, there is a trade-off…
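
To make the idea concrete, here is a minimal sketch of what transfer learning could look like with the same keras interface: load ResNet-50 without its classification head, freeze it, and stack a small new classifier on top. The binary-classification setup, the layer sizes and the training data (train_images, train_labels) are illustrative assumptions, not something from the original post.

library(keras)

# Load ResNet-50 without its top (classification) layers and freeze its weights,
# so only the new layers added below get trained.
base <- application_resnet50(weights = "imagenet", include_top = FALSE,
                             input_shape = c(224, 224, 3))
freeze_weights(base)

# Stack a small, new classifier on top of the frozen base (binary example).
output <- base$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(inputs = base$input, outputs = output)

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# train_images: array of shape (n, 224, 224, 3), already preprocessed;
# train_labels: vector of 0/1 labels -- both hypothetical placeholders.
# model %>% fit(train_images, train_labels, epochs = 5, batch_size = 32)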

Hope this post gave you an even deeper insight into the fascinating world of neural networks, which is still one of the hottest areas of machine learning research.


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

