
Workaround for tidyr::spread with duplicate row identifiers


(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

The problem

The tidyverse API is one of the easiest APIs in R to follow, thanks to its concept of grammar-based code. But every now and again you come across an error that boggles your mind; my personal favourite is the error spread delivers when I have no index for my data: Error: Duplicate identifiers for rows (2, 3), (1, 4, 5).... I am sure we have all seen it at some time or other.

Well, let's reproduce this error:

library(dplyr)
library(tidyr)

# df reconstructed from the printed output below (assumed to hold character columns)
df <- tibble(age    = c("21", "17", "32", "29", "15"),
             gender = c("Male", "Female", "Female", "Male", "Male"))
df
## # A tibble: 5 x 2
##   age   gender
##   <chr> <chr> 
## 1 21    Male  
## 2 17    Female
## 3 32    Female
## 4 29    Male  
## 5 15    Male
# Spread
df %>% 
  spread(gender, age)
## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

To anyone doing the analysis, the expected output is quite trivial; it should be:

## # A tibble: 3 x 2
##   Female Male 
## * <chr>  <chr>
## 1 17     21   
## 2 32     29   
## 3 <NA>   15

But given that there is no index for spread to use, and nothing stopping the 2nd and 3rd values for Male from collapsing into a single row, spread decides it's better to throw an error. There are multiple questions on Stack Overflow from people asking how to spread a data frame without getting this error. Here are some links:

And the list goes on…

The solution

Luckily I stumbled across a really nice issue on the tidyr GitHub repo. The user, markdly, opened the issue and in the end added a nice workaround for these kinds of problems. The idea is to add an index using group_by and row_number, and simply remove that column again after widening the data.

df <- df %>% 
  group_by(gender) %>% 
  mutate(grouped_id = row_number())
df
## # A tibble: 5 x 3
## # Groups:   gender [2]
##   age   gender grouped_id
##   <chr> <chr>       <int>
## 1 21    Male            1
## 2 17    Female          1
## 3 32    Female          2
## 4 29    Male            2
## 5 15    Male            3
# Now Spread
df %>% 
  spread(gender, age) %>% 
  select(-grouped_id)
## # A tibble: 3 x 2
##   Female Male 
## * <chr>  <chr>
## 1 17     21   
## 2 32     29   
## 3 <NA>   15

Giving spread an index in this way ensures that no duplicate row identifiers are introduced and the dataset stays rectangular.
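If you need this often, the workaround can be wrapped into a small helper. The sketch below is not part of tidyr and the function name spread_with_index is made up; it simply bundles the group_by/row_number/spread/select steps shown above and assumes tidyr >= 0.7 for the tidy-eval support.

# Hypothetical helper wrapping the workaround above (not part of tidyr)
spread_with_index <- function(data, key, value) {
  key   <- rlang::enquo(key)
  value <- rlang::enquo(value)
  data %>%
    group_by(!!key) %>%
    mutate(..row_id = row_number()) %>%  # temporary index within each key group
    ungroup() %>%
    spread(!!key, !!value) %>%
    select(-..row_id)                    # drop the index after widening
}
# Usage on the original two-column df:
# df %>% spread_with_index(gender, age)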

I hope that this little gem of an insight into working with spread has saved you some time!



DALEX @ eRum 2018


(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

The DALEX invasion has started with a workshop and talk @ eRum 2018.

Find the workshop materials at DALEX: Descriptive mAchine Learning EXplanations. Tools for exploration, validation and explanation of complex machine learning models (thanks to Mateusz Staniak for running the second part of the workshop).

And my presentation Show me your model 2.0! (thanks go to the whole MI2DataLab for contributions and to Malgorzata Pawlak for the great stickers).


Create your Machine Learning library from scratch with R ! (2/5) – PCA


(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

This is the second post of the “Create your Machine Learning library from scratch with R!” series. Today, we will see how you can implement Principal Component Analysis (PCA) using only the linear algebra available in R. Previously, we implemented linear regression and logistic regression from scratch, and next time we will deal with K nearest neighbors (KNN).

Principal components analysis

PCA is a dimensionality reduction method which seeks the directions that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates to represent the points in a more appropriate basis. Picking a few of these coordinates is enough to explain an important part of the variance in the dataset.

The mathematics of PCA

Let \mathbf{x}_1, ..., \mathbf{x}_n be the observations of our dataset; the points are in \mathbb{R}^p. We assume that they are centered and of unit variance. We denote \mathbf{X}=(\mathbf{x}_1, ..., \mathbf{x}_n)^T the matrix of observations. Then \mathbf{X}^T \mathbf{X} can be diagonalized and has real and non-negative eigenvalues (it is a symmetric positive semi-definite matrix). We denote \lambda_1 \geq ... \geq \lambda_p its ordered eigenvalues and u_1, ..., u_p the associated eigenvectors. It can be shown that \frac{\sum_{1\leq i \leq k} \lambda_i}{\sum_{1\leq i \leq p} \lambda_i} is the cumulative proportion of variance explained by u_1, ..., u_k. It can also be shown that u_1, ..., u_k is the orthonormal basis of size k which explains the most variance.

This is exactly what we wanted! We have a smaller basis which explains as much variance as possible!
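As a quick numerical illustration of the statement above (not from the original post), you can compute the eigenvalues of \mathbf{X}^T \mathbf{X} for the scaled Iris data and check the cumulative proportion of variance explained:

## Toy check of the cumulative-variance formula on the (centred, scaled) Iris data
X  <- scale(as.matrix(iris[, 1:4]))   # centred and of unit variance, as assumed above
ev <- eigen(crossprod(X))$values      # eigenvalues of X^T X, in decreasing order
cumsum(ev) / sum(ev)                  # cumulative proportion of variance explained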

PCA in R

The implementation in R has three steps:

  1. We center the data and divide each variable by its standard deviation. Our data now comply with the PCA hypothesis.
  2. We diagonalise \mathbf{X}^T \mathbf{X} and store the eigenvectors and eigenvalues.
  3. The cumulative variance is computed and the required number of eigenvectors k to reach the variance threshold is stored. We only keep the first k eigenvectors.
### PCA
my_pca <- function(x, variance_explained = 0.9, center = T, scale = T){
  my_pca = list()
  ## Compute the mean of each variable
  if (center)
  {
    my_pca[['center']] = colMeans(x)
  }
  ## Otherwise, we set the mean to 0
  else
    my_pca[['center']] = rep(0, dim(x)[2])
  #### Compute the standard dev of each variable
  if (scale)
  {
    my_pca[['std']] = apply(x, 2, sd)
  }
  ## Otherwise, we set the sd to 1
  else
    my_pca[['std']] = rep(1, dim(x)[2])
  ## Normalization
  ## Centering
  x_std = sweep(x, 2, my_pca[['center']])
  ## Standardization
  x_std = x_std %*% diag(1/my_pca[['std']])
  ## Cov matrix
  eigen_cov = eigen(crossprod(x_std, x_std))
  ## Computing the cumulative variance
  my_pca[['cumulative_variance']] = cumsum(eigen_cov[['values']])
  ## Number of required components
  my_pca[['n_components']] = sum((my_pca[['cumulative_variance']] / sum(eigen_cov[['values']])) < variance_explained) + 1
  ## Selection of the principal components
  my_pca[['transform']] = eigen_cov[['vectors']][, 1:my_pca[['n_components']]]
  attr(my_pca, "class") <- "my_pca"
  return(my_pca)
}

Now that we have the transformation matrix, we can perform the projection on the new basis.

predict.my_pca <- function(pca, x, ...){
  ## Centering
  x_std = sweep(x, 2, pca[['center']])
  ## Standardization
  x_std = x_std %*% diag(1/pca[['std']])
  return(x_std %*% pca[['transform']])
}

The function applies the change-of-basis formula and projects the data on the k principal components.

Plot the PCA projection

Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.

library(ggplot2)
pca1 = my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected = predict(pca1, as.matrix(iris[, 1:4]))
ggplot() + geom_point(aes(x = projected[, 1], y = projected[, 2], color = iris[, 5]))
Projection of the Iris dataset on the two main principal components

Comparison with the FactoMineR implementation

We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.

library(FactoMineR)
pca_stats = PCA(as.matrix(iris[, 1:4]))
projected_stats = predict(pca_stats, as.matrix(iris[, 1:4]))$coord[, 1:2]
ggplot(data = iris) +
  geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) +
  xlab('PC1') + ylab('PC2') +
  ggtitle('Iris dataset projected on the two mains PC (FactomineR)')

When running this, you should get a plot very similar to the previous one, which is a good sanity check of our implementation.

Projection of the Iris dataset using the FactoMineR implementation
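As an extra sanity check (not in the original post), the scores from my_pca can also be compared with base R's prcomp; up to the sign of each component, they should be essentially identical.

## Optional check against base R's prcomp (signs of components may differ)
pca_base <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
abs(cor(projected[, 1], pca_base$x[, 1]))  # should be close to 1
abs(cor(projected[, 2], pca_base$x[, 2]))  # should be close to 1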

Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.

The post Create your Machine Learning library from scratch with R ! (2/5) – PCA appeared first on Enhance Data Science.


grconvertX and grconvertY


(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

These two functions are unbelievably useful for positioning graphical elements (text, axes, labels, …) in R. They allow one to convert coordinates between different coordinate systems. For instance, you can convert a user coordinate (say 5, where x ranges from 0 to 200) to normalized device coordinates (the proportional distance across the device) and vice versa. Very useful for positioning panel labels…

grconvertX(.1, "npc", "user")

returns the x coordinate that is 10% across the plot region. So with that and the equivalent y function, you can place your labels in exactly the same position on every panel and plot… e.g.

plot(rnorm(20), rnorm(20))
text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "a)")

To show the effect of the from argument, run the following…

text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "npc")
text(grconvertX(.1, "ndc", "user"), grconvertY(.9, "ndc", "user"), "ndc", xpd = NA) # the text will be near the outer margin

It’s also very useful for positioning axis labels from boxplot() when the labels are too long – see the sketch below…
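Here is a minimal sketch of that boxplot use case (the data and labels are invented for illustration): the long labels are drawn below the axis with text(), rotated, and positioned via an npc-to-user conversion with grconvertY().

## Sketch: long, rotated labels under a boxplot
long_labels <- c("Treatment group A", "Treatment group B", "Control group")
boxplot(list(rnorm(20), rnorm(20, 1), rnorm(20, 2)), xaxt = "n")
text(x = 1:3, y = grconvertY(-0.05, "npc", "user"),
     labels = long_labels, srt = 30, adj = 1, xpd = NA)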

Hope that’s helpful to someone… 🙂


Stouffer’s meta-analysis with weight and direction effect in R


(This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers)

I often have to perform meta-analyses of past experiments for which the only info I have is the fold change and the p-value (of any measure you may imagine: species richness, gene expression, depth of coverage, plates of carbonara eaten in 5 minutes, everything). The hardest thing for me to figure out was how to take the direction of the changes into account. E.g. if on Monday I eat 10 carbonara more than my brother (p=0.01), on Tuesday 10 more (p=0.01), on Wednesday 5 more (p=0.1), on Thursday 10 less (p=0.01), on Friday 10 less (p=0.01) and on Saturday 5 less (p=0.1), a standard Stouffer meta-analysis would return a highly significant p-value, completely disregarding the fact that the significant changes were half in my favor and half in favor of my brother. How can I take the information on direction into account? I had no idea how to proceed, and I wasn't able to find any reference (I am not an expert in meta-analysis), but then I was lucky enough to stumble upon the manual page of the software package METAINTER for performing meta-analysis.

There, the authors describe a method to perform a Stouffer meta-analysis that accounts for the direction of the effects. Their explanation was so clear that it was easy for me to write a simple function in R; I paste it below.
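In formula form, the function below converts each two-sided p-value to a z-score, gives it the sign of its effect, and combines the weighted z-scores (the notation here is mine, not METAINTER's):

Z_i = \mathrm{sign}(d_i)\,\Phi^{-1}(1 - p_i/2), \qquad Z = \frac{\sum_i w_i Z_i}{\sqrt{\sum_i w_i^2}}, \qquad p = 2\,\left(1 - \Phi(|Z|)\right)

where d_i is the direction (e.g. the sign of the fold change) of the i-th result.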

signed.Stouffer.meta <- function(p, w, sign) { # p is a vector of p-values
  if (missing(w)) {
    w <- rep(1, length(p))/length(p)
  } else {
    if (length(w) != length(p))
      stop("Length of p and w must equal!")
  }
  # Convert two-sided p-values to z-scores and give each one the sign of its
  # effect (this block was garbled in the extracted post and is reconstructed)
  Zi <- qnorm(p/2, lower.tail = FALSE)
  Zi[sign < 0] <- (-Zi[sign < 0])
  Z  <- sum(w*Zi)/sqrt(sum(w^2))
  p.val <- 2*(1-pnorm(abs(Z)))
  return(c(Z = Z, p.value = p.val))
}
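As a quick check, here is the carbonara example from above run through the function (weights omitted, so equal weights are used):

p <- c(0.01, 0.01, 0.1, 0.01, 0.01, 0.1)  # Monday to Saturday
effect_sign <- c(1, 1, 1, -1, -1, -1)     # more vs. fewer carbonara
signed.Stouffer.meta(p, sign = effect_sign)
## The signed z-scores cancel exactly, so Z = 0 and the combined p-value is 1,
## whereas an unsigned Stouffer combination would be highly significant.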

Elegant regression results tables and plots in R: the finalfit package


(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

The finalfit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly manually copying results from R analyses and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with RMarkdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tool manifesto.

Installation and Documentation

It lives on GitHub.

You can install finalfit from github with:

# install.packages("devtools")
devtools::install_github("ewenharrison/finalfit")

It is recommended that this package is used together with dplyr, which is installed as a dependency.

Some of the functions require rstan and boot. These have been left as Suggests rather than Depends to avoid unnecessary installation. If needed, they can be installed in the normal way:

install.packages("rstan")install.packages("boot")

To install off-line (or in a Safe Haven), download the zip file and use devtools::install_local().

Main Features

1. Summarise variables/factors by a categorical variable

summary_factorlist() is a wrapper used to aggregate any number of explanatory variables by a single variable of interest. This is often “Table 1” of a published study. When categorical, the variable of interest can have a maximum of five levels. It uses Hmisc::summary.formula().

library(finalfit)
library(dplyr)

# Load example dataset, modified version of survival::colon
data(colon_s)

# Table 1 - Patient demographics by variable of interest ----
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor" # Bowel perforation
colon_s %>%
  summary_factorlist(dependent, explanatory, p=TRUE, add_dependent_label=TRUE)

See other options relating to inclusion of missing data, mean vs. median for continuous variables, column vs. row proportions, inclusion of a total column, etc.

summary_factorlist() is also commonly used to summarise any number of variables by an outcome variable (say dead yes/no).

# Table 2 - 5 yr mortality ----
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = 'mort_5yr'
colon_s %>%
  summary_factorlist(dependent, explanatory, p=TRUE, add_dependent_label=TRUE)

Tables can be knitted to PDF, Word or html documents. We do this in RStudio from a .Rmd document. Example chunk:

```{r, echo = FALSE, results='asis'}
knitr::kable(example_table, row.names=FALSE,
    align=c("l", "l", "r", "r", "r", "r"))
```

2. Summarise regression model results in final table format

The second main feature is the ability to create final tables for linear (lm()), logistic (glm()), hierarchical logistic (lme4::glmer()) and Cox proportional hazards (survival::coxph()) regression models.

The finalfit() “all-in-one” function takes a single dependent variable with a vector of explanatory variable names (continuous or categorical variables) to produce a final table for publication, including summary statistics and univariable and multivariable regression analyses. The first columns are those produced by summary_factorlist(). The appropriate regression model is chosen on the basis of the dependent variable type and other arguments passed.

Logistic regression: glm()

Of the form: glm(dependent ~ explanatory, family="binomial")

explanatory = c("age.factor", "sex.factor",                 "obstruct.factor", "perfor.factor")dependent = 'mort_5yr'colon_s %>%    finalfit(dependent, explanatory)

Logistic regression with reduced model: glm()

Where a multivariable model contains a subset of the variables specified in the full univariable set, this can be specified.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")explanatory_multi = c("age.factor", "obstruct.factor")dependent = 'mort_5yr'colon_s %>%    finalfit(dependent, explanatory, explanatory_multi)

Mixed effects logistic regression: lme4::glmer()

Of the form: lme4::glmer(dependent ~ explanatory + (1 | random_effect), family="binomial")

Hierarchical/mixed effects/multilevel logistic regression models can be specified using the argument random_effect. At the moment it is just set up for random intercepts (i.e. (1 | random_effect)), but in the future I’ll adjust this to accommodate random gradients if needed (i.e. (variable1 | variable2)).

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")explanatory_multi = c("age.factor", "obstruct.factor")random_effect = "hospital"dependent = 'mort_5yr'colon_s %>%    finalfit(dependent, explanatory, explanatory_multi, random_effect)

Cox proportional hazards: survival::coxph()

Of the form: survival::coxph(dependent ~ explanatory)

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")dependent = "Surv(time, status)"colon_s %>%    finalfit(dependent, explanatory)

Add common model metrics to output

metrics=TRUE provides common model metrics. The output is a list of two dataframes. Note chunk specification for output below.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")dependent = 'mort_5yr'colon_s %>%  finalfit(dependent, explanatory, metrics=TRUE)

```{r, echo=FALSE, results="asis"}
knitr::kable(table7[[1]], row.names=FALSE, align=c("l", "l", "r", "r", "r"))
knitr::kable(table7[[2]], row.names=FALSE)
```

Rather than going all-in-one, any number of subset models can be manually added on to a summary_factorlist() table using finalfit_merge(). This is particularly useful when models take a long time to run or are complicated.

Note the requirement for fit_id=TRUE in summary_factorlist(). fit2df extracts, condenses, and adds metrics to supported models.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")explanatory_multi = c("age.factor", "obstruct.factor")random_effect = "hospital"dependent = 'mort_5yr'# Separate tablescolon_s %>%summary_factorlist(dependent, explanatory, fit_id=TRUE) -> example.summarycolon_s %>%glmuni(dependent, explanatory) %>%fit2df(estimate_suffix=" (univariable)") -> example.univariablecolon_s %>%glmmulti(dependent, explanatory) %>%fit2df(estimate_suffix=" (multivariable)") -> example.multivariablecolon_s %>%glmmixed(dependent, explanatory, random_effect) %>%fit2df(estimate_suffix=" (multilevel)") -> example.multilevel# Pipe togetherexample.summary %>%finalfit_merge(example.univariable) %>%finalfit_merge(example.multivariable) %>%finalfit_merge(example.multilevel) %>%select(-c(fit_id, index)) %>% # remove unnecessary columnsdependent_label(colon_s, dependent, prefix="") # place dependent variable label

Bayesian logistic regression: with stan

Our own particular rstan models are supported and will be documented in the future. Broadly, if you are running (hierarchical) logistic regression models in Stan (http://mc-stan.org/users/interfaces/rstan) with coefficients specified as a vector labelled beta, then fit2df() will work directly on the stanfit object in a similar manner to a glm or glmerMod object.

3. Summarise regression model results in plot

Models can be summarized with odds ratio/hazard ratio plots using or_plot, hr_plot and surv_plot.

OR plot

# OR plot
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  or_plot(dependent, explanatory)
# Previously fitted models (`glmmulti()` or `glmmixed()`) can be provided directly to `glmfit`

HR plot

# HR plot
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  hr_plot(dependent, explanatory, dependent_label = "Survival")
# Previously fitted models (`coxphmulti`) can be provided directly using `coxfit`

Kaplan-Meier survival plots

KM plots can be produced using the survminer package:

# KM plot
explanatory = c("perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  surv_plot(dependent, explanatory, xlab="Time (days)", pval=TRUE, legend="none")

Notes

Use Hmisc::label() to assign labels to variables for tables and plots.

label(colon_s$age.factor) = "Age (years)"

Export data frame tables directly, or to R Markdown using knitr::kable().

Note that the wrapper summary_missing() is also useful. It wraps mice::md.pattern.

colon_s %>%  summary_missing(dependent, explanatory)

Development will be ongoing, and any input is appreciated.


Enterprise Dashboards with R Markdown


(This article was first published on R Views, and kindly contributed to R-bloggers)

This is the second post in a series on enterprise dashboards. See our previous post, Enterprise-ready dashboards with Shiny and Databases.

We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like Microsoft Excel make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. In addition to fostering low productivity, organizations risk profits and reputations in an age where insightful analyses and process control translate to a competitive advantage. Most organizations want better control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.

This Excel dashboard attempts to function as a real application by allowing its users to filter and visualize key metrics about customers. It took dozens of hours to build. The intent was to hand off maintenance to someone else, but the dashboard was so complex that the author was forced to maintain it. Every week, the author copied data from an ETL tool and pasted it into the workbook, spot checked a few cells, and then emailed the entire workbook to a distribution list. Everyone on the distribution list got a new copy in their inbox every week. There were no security controls around data management or data access. Anyone with the report could modify its contents. The update process often broke the brittle cell dependencies; or worse, discrepancies between weeks passed unnoticed. It was almost impossible to guarantee the integrity of each weekly report.

Why coding is important

Excel workbooks are hard to maintain, collaborate on, and debug because they are not reproducible. The content of every cell and the design of every chart is set without ever recording the author’s actions. There is no simple way to recreate an Excel workbook because there is no recipe (i.e., set of instructions) that describes how it was made. Because Excel workbooks lack a recipe, they tend to be hard to maintain and prone to errors. It takes care, vigilance, and subject-matter knowledge to maintain a complex Excel workbook. Even then, human errors abound and changes require a lot of effort.

A better approach is to write code. There are many reasons to start programming. When you create a recipe with code, anyone can reproduce your work (including your future self). The act of coding implicitly invites others to collaborate with you. You can systematically validate and debug your code. All of these things lead to better code over time. Coding in R has particular advantages given its vast ecosystem of packages, its vibrant community, and its powerful tool chain.

Using R Markdown

There are many tools for replacing complex Excel dashboards with R code. One of these tools is R Markdown, an open-source R package that turns your analyses into high quality documents, reports, presentations and dashboards. R Markdown documents are fully reproducible and support dozens of output formats including HTML, PDF, and Microsoft Word documents.

Here is the same Excel dashboard translated to an R Markdown report. Because this report is written in code, it is vastly simpler and easier to maintain. Like the Excel dashboard above, this R Markdown report is designed to take user inputs so that it could render custom report versions.

Many people are already aware that R Markdown reports combine narrative, code, and output in a single document. What is less commonly known is that you can generalize any R Markdown report by declaring parameters in the document header. R Markdown documents with parameters are known as parameterized reports. In the Excel dashboard users can select segment, group, and period. In a parameterized R Markdown document, you would specify these inputs with the following YAML header:

---
title: Customer Tracker Report
output: html_notebook
params:
  seg:
    label: "Segment:"
    value: Total
    input: select
    choices: [Total, Heavy, Mainstream, Focus1, Focus2,
              Specialty, Diverse1, Diverse2, Other, New]
  grp:
    label: "Group:"
    value: Total
    input: select
    choices: [Total, Core, Extra]
  per:
    label: "Period:"
    value: Week
    input: radio
    choices: [Week, YTD]
---

You can then call the parameters you declare in the YAML header from your R code chunks.

```{r}
params$seg
params$grp
params$per
```

You can render the document with different inputs by selecting knit with parameters in RStudio. This option will open a user interface that allows you to select the parameters you want.

If you want to automate the process of creating custom report versions, you can render these documents programmatically with the rmarkdown::render() function.

rmarkdown::render(
  input = "tracker-report.Rmd",
  params = list(seg = "Focus1", grp = "Core", per = "Week")
)

Publishing to RStudio Connect

Managing access and permissions for an ocean of Excel files is painful. Data in Excel spreads through an organization without controls like a virus spreads through a body without disease prevention. There are better ways to secure the operation, access, and distribution of information.

RStudio Connect is a server product from RStudio that is designed for secure sharing of R content. It is on-premises software you run behind your firewall. You keep control of your data and of who has access. With RStudio Connect, you can see all your content, decide who should be able to view and collaborate on it, tune performance, schedule updates, and view logs. You can schedule your R Markdown reports to run automatically or even distribute the latest version by email.

When you publish a parameterized R Markdown report to RStudio Connect, an interface appears for selecting inputs. Viewers can create new report versions, then email themselves a copy. Collaborators can save and schedule new report versions, then email others a copy. You can even attach output files to these versions. Using parameterized R Markdown documents in RStudio Connect is a powerful way to communicate information.

You can publish content from the RStudio IDE by clicking the Publish button that looks like a blue Eye of Horus. Pressing this button will begin the publishing process. First, it creates a set of instructions for recreating your content. Second, it deploys your content bundle to the server. Third, it recreates your content on RStudio Connect. Push-button publishing has a long history of being used with RStudio. In 2012, RStudio enabled push-button publishing of R Markdown documents to RPubs. In 2014, RStudio enabled push-button publishing of Shiny apps to shinyapps.io. In 2016, RStudio enabled push-button publishing to RStudio Connect.

Adding Shiny

R Markdown documents are rendered with batch processing. That makes them ideal for automation, long running workflows, and custom report versions. However, if you want your documents to be immediately reactive to user input, then you can add a Shiny runtime. These interactive documents behave like a Shiny application in that they must be hosted. You can host interactive documents and Shiny applications with RStudio Connect. Deciding when to choose between R Markdown, interactive documents, and Shiny applications is a subject for a later post.

Summary

Reproducible code in R leads to better analysis and collaboration. You can use parameterized R Markdown reports to create complex, interactive dashboards. Hosting these dashboards securely in RStudio Connect gives you control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.

Click here for source code.


Cleaning up tables


(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

This post is re-published from my blog


Context

One of the things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) when the name of each subgroup is repeated down one column of the table.

One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being an R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.

Here’s what I want to do

Take a table like this:

Location   Gender   values
Rural      Male     32.74
Rural      Female   25.18
Urban      Male     40.48
Urban      Female   25.28

to something like this:

Location   Gender   values
Rural      Male     32.74
           Female   25.18
Urban      Male     40.48
           Female   25.28

The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.

This is purely for presentation purposes. I would never do this for data frames I’ll still analyze, since the blank cells screw things up. Of course, this could be fixed easily using last-value-carried-forward imputation on the column.

A solution

I created this simple function to do this for a single column within a magrittr pipeline:

clean_col = function(x, colvar){
require(dplyr)
colv = enquo(colvar)
x %>% group_by(!!colv) %>%
      mutate(rown = row_number()) %>%
      ungroup() %>%
      mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>%
      select (-rown)
}

The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.

The second thing was the use of the dummy rown column to identify the first row of each cluster of rows defined by the variable colvar. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the colvar column.
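For illustration, here is the function applied to the first example table (the tibble below is just that table typed in by hand):

library(dplyr)
tab <- tibble(
  Location = c("Rural", "Rural", "Urban", "Urban"),
  Gender   = c("Male", "Female", "Male", "Female"),
  values   = c(32.74, 25.18, 40.48, 25.28)
)
tab %>% clean_col(Location)
# Location is now blank after the first row of each Rural/Urban cluster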

Desirable updates

There are two directions I want to take this, but I don’t understand tidyeval or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking

Location   Gender   AgeGrp   DeathRate
Rural      Male     50-54    11.7
Rural      Male     55-59    18.1
Rural      Male     60-64    26.9
Rural      Male     65-69    41.0
Rural      Male     70-74    66.0
Rural      Female   50-54    8.7
Rural      Female   55-59    11.7
Rural      Female   60-64    20.3
Rural      Female   65-69    30.9
Rural      Female   70-74    54.3
Urban      Male     50-54    15.4
Urban      Male     55-59    24.3
Urban      Male     60-64    37.0
Urban      Male     65-69    54.6
Urban      Male     70-74    71.1
Urban      Female   50-54    8.4
Urban      Female   55-59    13.6
Urban      Female   60-64    19.3
Urban      Female   65-69    35.1
Urban      Female   70-74    50.0

to

Location   Gender   AgeGrp   DeathRate
Rural      Male     50-54    11.7
                    55-59    18.1
                    60-64    26.9
                    65-69    41.0
                    70-74    66.0
Rural      Female   50-54    8.7
                    55-59    11.7
                    60-64    20.3
                    65-69    30.9
                    70-74    54.3
Urban      Male     50-54    15.4
                    55-59    24.3
                    60-64    37.0
                    65-69    54.6
                    70-74    71.1
Urban      Female   50-54    8.4
                    55-59    13.6
                    60-64    19.3
                    65-69    35.1
                    70-74    50.0

The second extension would be to create truly nested row labels, like this:

Location   Gender   AgeGrp   DeathRate
Rural      Male     50-54    11.7
                    55-59    18.1
                    60-64    26.9
                    65-69    41.0
                    70-74    66.0
           Female   50-54    8.7
                    55-59    11.7
                    60-64    20.3
                    65-69    30.9
                    70-74    54.3
Urban      Male     50-54    15.4
                    55-59    24.3
                    60-64    37.0
                    65-69    54.6
                    70-74    71.1
           Female   50-54    8.4
                    55-59    13.6
                    60-64    19.3
                    65-69    35.1
                    70-74    50.0

I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.

Update

I can do the first example with the following code (based on Edwin Thoen’s blog, again):

clean_col = function(x, ...){
  require(dplyr)
  colvs = quos(...)
  x %>% group_by(!!!colvs) %>%
    mutate(rown = row_number()) %>%
    ungroup() %>%
    mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>%
    select (-rown)
}

Update 2

The second example can be solved by extracting elements of quosures, which are essentially a list:

clean_cols = function(x, ...){
  colvs = quos(...)
  for(i in 1:length(colvs)){
    rowvar =  rlang::sym(paste0('rown',i)) # Create dummy
    x = x %>% group_by(!!!colvs[1:i]) %>% 
      mutate(!!rowvar := row_number()) %>%
      ungroup()
  }
  for(i in 1:length(colvs)){
    rowvar = rlang::sym(paste0('rown',i))
    x = x %>% mutate_at(vars(!!colvs[[i]]), 
      funs(ifelse(!!rowvar > 1, '', .)))
  }
  x = x %>% select(-starts_with('rown')) # remove the dummies
  return(x)
}
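As a sketch of how this final function would be used, here is the mortality table from above (typed in from the table) passed through clean_cols() to get the nested row labels:

library(dplyr)
mort <- tibble(
  Location  = rep(c("Rural", "Urban"), each = 10),
  Gender    = rep(rep(c("Male", "Female"), each = 5), 2),
  AgeGrp    = rep(c("50-54", "55-59", "60-64", "65-69", "70-74"), 4),
  DeathRate = c(11.7, 18.1, 26.9, 41.0, 66.0,  8.7, 11.7, 20.3, 30.9, 54.3,
                15.4, 24.3, 37.0, 54.6, 71.1,  8.4, 13.6, 19.3, 35.1, 50.0)
)
mort %>% clean_cols(Location, Gender)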


50 years of Delaware Estuary Dissolved Oxygen


(This article was first published on AdventuresInData, and kindly contributed to R-bloggers)

The Delaware River Basin Commission’s Delaware Estuary water quality monitoring program, which was initiated in 1967, is one of the longest running monitoring programs in the world.  One advantage of such a long data set is that we can see the changes to water quality over time.

Since 1967, summer dissolved oxygen levels have steadily improved due to active water quality management and the addition of secondary treatment to waste water treatment plants.  More work remains, however.  A dissolved oxygen sag persists in the most urbanized portion of the estuary, and daytime surface spot measurements may not reveal pockets of lower dissolved oxygen water that could continue to impact aquatic life.


Ready Made Plots make Work Easier


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

A while back Simon Jackson and Kara Woo shared some great ideas and graphs on grouped bar charts and density plots (link). Win-Vector LLC‘s Nina Zumel just added a graph of this type to the development version of WVPlots.


Nina has, as usual, some great documentation here.

More and more I am finding that, when you are in the middle of doing something, having a ready-made plotting tool is less distracting than designing a ggplot2 graph on the fly. In addition to WVPlots, other great “ready to go” plot packages include ggpubr and ggstatsplot. I definitely recommend checking out all three packages and the packages/tools they use.


maximal spacing around order statistics


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The riddle from the Riddler for the coming weeks is extremely simple to express in mathematical terms, as it boils down to characterising the distribution of

\Delta_n=\max_i\,\min_{j\neq i}\,|X_{i}-X_{j}|

when the n-sample is made of iid Normal variates. I however had a hard time finding a result connected with this quantity since most available characterisations are for either Uniform or Exponential variates. I eventually found a 2017 arXival by Nagaraya et al. covering the issue. Since the Normal distribution belongs to the Gumbel domain of attraction, the extreme spacings, that is the spacings between the most extreme order statistics [rescaled by nφ(Φ⁻¹{1-n⁻¹})], are asymptotically independent and asymptotically distributed as (Theorem 5, p.15, after correcting a typo):

(\xi_1,\xi_2/2,...)

where the ξ’s are Exp(1) variates. A crude approximation is thus to consider that the above Δ is distributed as the maximum of two standard and independent exponential distributions, modulo the rescaling by  nφ(Φ⁻¹{1-n⁻¹})… But a more adequate result was pointed out to me by Gérard Biau, namely a 1986 Annals of Probability paper by Paul Deheuvels, my former head at ISUP, Université Pierre and Marie Curie. In this paper, Paul Deheuvels establishes that the largest spacing in a normal sample, M¹, satisfies

\mathbb{P}(\sqrt{2\log\,n}\,M^1\le x) \to \prod_{i=1}^{\infty} (1-e^{-ix})^2

from which a conservative upper bound on the value of n required for a given bound x⁰ can be derived. The simulation below compares the limiting cdf (in red) with the empirical cdf of the above Δ based on 10⁴ samples of size n=10³. The limiting cdf is the cdf of the maximum of an infinite sequence of independent exponentials with scales 1,½,…. Which connects with the above result, in fine. For a practical application, the 99% quantile of this distribution is 4.71. To achieve a maximum spacing of, say 0.1, with probability 0.99, one would need 2 log(n) > 5.29²/0.1², i.e., log(n)>1402, which is a pretty large number…

 


treeio: Phylogenetic data integration


(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Phylogenetic trees are commonly used to present evolutionary relationships of species. Newick is the de facto format in phylogenetics for representing trees. The Nexus format incorporates Newick tree text with related information organized into separate units known as blocks. For the R community, we have the ape and phylobase packages to import trees from Newick and Nexus formats. However, analysis results (tree + analysis findings) from widely used software packages in this field are not well supported. Some of these formats are extensions of Newick and Nexus (e.g. RevBayes and BEAST outputs), while others are just log files (e.g. r8s and PAML outputs). Parsing these output files is important for interpreting analysis findings.

Information associated with taxon species/strains can be meta-data (e.g. isolation host, time and location, etc.), phenotypic or experimental data. These data, as well as analysis findings, may be further analyzed in the evolutionary context for downstream integrative and comparative studies. Unfortunately, taxon-associated data are mostly stored in separate files and often in inconsistent formats. To fill this gap, I developed the treeio package for phylogenetic tree data integration.

Currently, treeio is able to read the following file formats: Newick, Nexus, New Hampshire eXtended format (NHX), jplace and Phylip as well as the data outputs from the following analysis programs: BEAST, EPA, HyPhy, MrBayes, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes.

Installation

The treeio package is released within the Bioconductor project.

To get the released version from Bioconductor:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("treeio")

Or the development version from github:

## install.packages("devtools")
devtools::install_github("GuangchuangYu/treeio")

Importing trees with data

Importing tree data from software outputs

The treeio package provides several parser functions as illustrated in the following table:

Parser function    Description
read.beast         parsing output of BEAST
read.codeml        parsing output of CodeML (rst and mlc files)
read.codeml_mlc    parsing mlc file (output of CodeML)
read.hyphy         parsing output of HYPHY
read.jplace        parsing jplace file including output of EPA and pplacer
read.mrbayes       parsing output of MrBayes
read.newick        parsing newick string, with ability to parse node label as support values
read.nhx           parsing NHX file including output of PHYLDOG and RevBayes
read.paml_rst      parsing rst file (output of BaseML or CodeML)
read.phylip        parsing phylip file (phylip alignment + newick string)
read.r8s           parsing output of r8s
read.raxml         parsing output of RAxML

For example, users can parse BEAST output using the read.beast function. All the information inferred by BEAST will be stored in the output object.

file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
## 'treedata' S4 object that stored information of
##  '/Library/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
##  'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range'.

Linking external data to phylogeny

In addition to analysis findings associated with the tree, as shown above, there is a wide range of heterogeneous data, including phenotypic, experimental and clinical data, that needs to be integrated and linked to the phylogeny. For example, in the study of viral evolution, tree nodes may be associated with epidemiological information such as location, age and subtype. To facilitate data integration, treeio provides a full_join method to link external data to the phylogeny, as demonstrated below:

x <- data_frame(label = as.phylo(beast)$tip.label, trait = rnorm(Ntip(beast)))
full_join(beast, x, by="label")
## 'treedata' S4 object that stored information of
##  '/Library/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
##  'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range', 'trait'.

N <- Nnode2(beast)
y <- data_frame(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
full_join(beast, y, by="node")
## 'treedata' S4 object that stored information of
##  '/Library/R/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ phylo:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
##  A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
##  'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range', 'fake_trait',
##  'another_trait'.

Combining tree data

To allow comparative study, treeio supports combining multiple phylogenetic trees into one with their node/branch-specific attribute data.

The following example merges a tree analyzed by BEAST and CODEML.

beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)
merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree
## 'treedata' S4 object that stored information of
##  '/Library/R/library/ggtree/examples/MCC_FluA_H3.tree',
##  '/Library/R/library/ggtree/examples/rst',
##  '/Library/R/library/ggtree/examples/mlc'.
##
## ...@ phylo:
## Phylogenetic tree with 76 tips and 75 internal nodes.
##
## Tip labels:
##  A/Hokkaido/30-1-a/2013, A/New_York/334/2004, A/New_York/463/2005, A/New_York/452/1999, A/New_York/238/2005, A/New_York/523/1998, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
##  'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
##  'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
##  'rate_0.95_HPD', 'rate_median', 'rate_range', 'subs', 'AA_subs', 't', 'N',
##  'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.

After merging, the merged_tree object contains the whole set of node/branch attributes from both beast_tree and codeml_tree. The tree object can be converted to a tidy data frame using the tidytree package and visualized as a hexbin scatterplot of dN/dS, dN and dS inferred by CODEML versus the rate (substitution rate in units of substitutions/site/year) inferred by BEAST on the same branches, as an example of comparing node attributes inferred by different software.

library(tidytree)
library(ggplot2)

as_data_frame(merged_tree) %>%
    dplyr::select(dN_vs_dS, dN, dS, rate) %>%
    subset(dN_vs_dS >= 0 & dN_vs_dS <= 1.5) %>%
    tidyr::gather(type, value, dN_vs_dS:dS) %>%
    ggplot(aes(rate, value)) + geom_hex() +
    facet_wrap(~factor(type, levels = c('dN_vs_dS', 'dN', 'dS')),
               scale='free_y') +
    ylab(NULL)

hexbin scatter plot: correlation of dN/dS with substitution rate

With this feature, users are able to compare/integrate evolutionary inferences obtained from different software packages/models.

Further readings

For more details of importing trees with data, please refer to the vignette. In addition, treeio also supports exporting tree data to BEAST Nexus format, which allows software output format conversion and also enables combining external data with the tree into a single file.

The following screenshot shows an example of exporting CODEML output to a BEAST Nexus file. The tree was visualized using FigTree and colored by dN. For more details, please refer to the vignette, exporting trees with data.

All the data parsed/merged by treeio can be converted to tidy data frame using tidytree and can be used to visualize and annotate the tree using ggtree.

For more details, please refer to the online vignettes.


Laser Beams and Elliptical Billiards: Euler Problem 144


(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler problem 144 has been one of the most fun to solve. The underlying problem is the pathway of the reflection of a laser inside an ellipse-shaped mirror. Before I delve into this problem, I would like to share this delightful video from Numberphile in which Alex Bellos demonstrates an elliptical billiards table. The billiards problem is mathematically equivalent to the laser problem: the reflection rule in optics is the same as the bouncing rule in mechanics, but instead of light, we use a billiard ball.

This article outlines my solution to Euler problem 144 and simulates the elliptical pool table in the R language. You can find the code on the GitHub repository for this website.

Euler Problem 144 Definition

In laser physics, a “white cell” is a mirror system that acts as a delay line for the laser beam. The beam enters the cell, bounces around on the mirrors, and eventually works its way back out. The specific white cell we will be considering is an ellipse with the equation 4x^2 + y^2 = 100. The section corresponding to -0.01 \leq x \leq +0.01 at the top is missing, allowing the light to enter and exit through the hole.

Euler Problem 144

The light beam in this problem starts at the point (0.0,10.1) just outside the white cell, and the beam first impacts the mirror at (1.4,-9.6). Each time the laser beam hits the surface of the ellipse, it follows the usual law of reflection “angle of incidence equals the angle of reflection.” That is, both the incident and reflected beams make the same angle with the normal line at the point of incidence. In the figure on the left, the red line shows the first two points of contact between the laser beam and the wall of the white cell; the blue line shows the line tangent to the ellipse at the point of incidence of the first bounce. The slope m of the tangent line at any point (x,y) of the given ellipse is m = -4x/y. The normal line is perpendicular to this tangent line at the point of incidence.

How many times does the beam hit the internal surface of the white cell before exiting?

Proposed Solution to Euler Problem 144

The first step was to rewrite the equation to use functions to generalise the problem. The general Cartesian equation for an ellipse is:

\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, a < b

The semi-axes for this problem are a = 5 and b = 10. While the Project Euler description gives the formula for the slope of the tangent to the ellipse, I have generalised the code to reuse it for the elliptical billiards table. The slope of the tangent to an ellipse at point (x,y) is:

m=-\frac{b^2x}{a^2y}
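This slope follows from differentiating the ellipse equation implicitly with respect to x:

\frac{2x}{a^2} + \frac{2y}{b^2}\frac{dy}{dx} = 0 \quad \Rightarrow \quad \frac{dy}{dx} = -\frac{b^2x}{a^2y}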

This first code snippet defines functions to draw an ellipse and calculate the bouncing angle. The last part of the code bounces the laser inside the cell until it exits through the top.

I sourced the formula to find the intersection between a line and an ellipse from Ambrsoft. The equation has two possible solutions, one of which is the same as the original point.
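For reference, substituting the reflected line y = mx + c into the ellipse equation and multiplying through by a^2b^2 gives a quadratic whose two roots are the intersection points (one of them being the current point of incidence):

(b^2 + a^2m^2)x^2 + 2a^2mc\,x + a^2(c^2 - b^2) = 0 \quad \Rightarrow \quad x = \frac{-a^2mc \pm ab\sqrt{a^2m^2 + b^2 - c^2}}{a^2m^2 + b^2}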

plot_ellipse <- function(a, b, colour = NA, line = "black") {
    plot.new()
    plot.window(xlim = c(-a, a), ylim = c(-b, b), asp = 1)
    par(mar = rep(0, 4))
    x <- seq(-a, a, length = 200)
    y <- sqrt(b^2 - (b^2 / a^2) * x^2)
    lines(x, y, col = line)
    lines(x, -y, col = line)
    polygon(x, y, col = colour, border = NA)
    polygon(x, -y, col = colour, border = NA)
}

bounce <- function(coords) {
    x <- coords$x
    y <- coords$y
    ## Tangent to ellipse
    t <- -(b^2 / a^2) * (x[2] / y[2])
    ## Deflection on sloping mirror y = mx + c
    dydx <- diff(y) / diff(x)
    m <- tan(pi - atan(dydx) + 2 * atan(t))
    c <- y[2] - m * x[2]
    ## Determine intersection point
    ## Source: http://www.ambrsoft.com/TrigoCalc/Circles2/Ellipse/EllipseLine.htm
    x[1] <- x[2]
    y[1] <- y[2]
    x2 <- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) /
          (a^2 * m^2 + b^2)
    x[2] <- ifelse(round(x[1] / x2[1], 6) == 1, x2[2], x2[1])
    y[2] <- m * x[2] + c
    return(data.frame(x, y))
}

# Initial conditions
a <- 5
b <- 10
x1 <- 0
y1 <- 10.1
x2 <- 1.4
y2 <- -9.6
answer <- 0

plot_ellipse(a, b)
## Mark the foci of the ellipse (the original line used an undefined c here)
f <- sqrt(b^2 - a^2)
points(c(0, 0), c(-f, f), pch = 19)

## Bounce laser beams
laser <- data.frame(x = c(x1, x2), y = c(y1, y2))
while ((laser$x[2] < -0.01 | laser$x[2] > 0.01) | laser$y[2] < 0) { ## Escape?
    lines(laser$x, laser$y, col = "red", lwd = .5)
    laser <- bounce(laser)
    answer <- answer + 1
}
print(answer)

The result of this code is a pretty image of all the laser beams that have bounced around the mirror, which looks like the evil Eye of Sauron in Lord of the Rings.

Solution to Euler Problem 144

Elliptical Pool Table

We can use the solution to Euler problem 144 to play billiards on an elliptical billiards table. To close the article, we return to the elliptical pool table demonstrated by Alex Bellos. This code draws the pool table to the dimensions mentioned in the video. We know that the table has an eccentricity of e = 0.43 and a long axis of a = 130 cm. The code defines the short axis (b) and the distance of the focal points from the centre.
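In case the relations are unfamiliar, the semi-minor axis and focal distance used in the code below follow directly from the eccentricity:

b = a\sqrt{(1+e)(1-e)} = a\sqrt{1-e^2} \approx 117.4 \text{ cm}, \qquad f = \sqrt{a^2 - b^2} = ae \approx 55.9 \text{ cm}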

The code selects a random starting point and angle of the shot. The code first determines whether the line passes through the pocket. If this is not the case, the algorithm then finds the place where the ball hits and keeps bouncing it until it falls into the pocket or the ball bounces 100 times.

Elliptical billiard tables have four possible outcomes. Any ball that passes through a focal point will fall into the pocket, ending the simulation. Any ball that passes outside the focal points will bounce around, and the combined trajectories form an ellipse. When the ball moves between the foci, the result is a hyperbola. Lastly, there are some unique circumstances which result in a regular polygon.

Elliptical billiards: Euler problem 144

If simulations are not enough for you, then head over to the Instructables website to find out how you can construct an elliptical billiards table. There is even a patent for an elliptical pocket billiard table, with the pockets at the edge.

## Elliptical pool table
## https://www.youtube.com/watch?time_continue=54&v=4KHCuXN2F3I
e <- 0.43
a <- 130
b <- a * sqrt((1 + e) * (1 - e)) # a > b
f <- sqrt(a^2 - b^2)

plot_ellipse(a, b, "darkgreen", NA)
points(-f, 0, pch = 19, cex = 2)
points(f, 0, pch = 19, col = "grey")

## Simulate random shot
angle <- runif(1, 0, 2 * pi)
x1 <- runif(1, -a, a)
ymax <- sqrt(b^2 - (b^2 / a^2) * x1^2)
y1 <- runif(1, -ymax, ymax)

## First shot
m <- tan(angle)
c <- y1 - m * x1
x2 <- (-a^2 * m * c + c(-1, 1) * (a * b * sqrt(a^2 * m^2 + b^2 - c^2))) / (a^2 * m^2 + b^2)
y2 <- m * x2 + c
x2 <- x2[which(((x2 - x1) < 0) == (cos(angle) < 0))]
y2 <- y2[which(((y2 - y1) < 0) == (sin(angle) < 0))]
shot <- data.frame(x = c(x1, x2), y = c(y1, y2))

## Bounce ball
for (i in 1:100) {
    dydx <- diff(shot$y) / diff(shot$x)
    if (all.equal(dydx, (shot$y[1] - 0) / (shot$x[1] - -f)) == TRUE) {
        shot[2, ] <- c(-f, 0)
    }
    lines(shot$x, shot$y, col = "yellow", lwd = 1)
    if (shot[2, 2] == 0) break
    shot <- bounce(shot)
}
points(x1, y1, pch = 19, col = "blue", cex = 1.8)

The post Laser Beams and Elliptical Billiards: Euler Problem 144 appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.


Tuning xgboost in R: Part I


(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos
Before we begin, I would like to thank Anuj for kindly including our blog in his list of the top40 R blogs! Check out the full list at his page, FeedSpot!

Introduction

Tuning a Boosting algorithm for the first time may be a very confusing task. There are so many parameters to choose, and they all affect the results differently. Also, the best choice may depend on the data. Every time I get a new dataset I learn something new. A good understanding of classification and regression trees (CART) is also helpful because we will be boosting trees; you can start here if you have no idea of what a CART is.

My favourite Boosting package is xgboost, which will be used in all examples below. Before going to the data, let’s talk about some of the parameters I believe to be the most important. These parameters are mostly used to control how closely the model may fit the data. We would like a fit that captures the structure of the data, but only the real structure. In other words, we do not want the model to fit noise, because this will translate into poor out-of-sample performance.

  • eta: Learning (or shrinkage) parameter. It controls how much information from a new tree will be used in the Boosting. This parameter must be bigger than 0 and limited to 1. If it is close to zero we will use only a small piece of information from each new tree. If we set eta to 1 we will use all information from the new tree. Big values of eta result in faster convergence and more over-fitting problems. Small values may need too many trees to converge.
  • colsample_bylevel: Just like Random Forests, sometimes it is good to look only at a few variables to grow each new node in a tree. If we look at all variables the algorithm needs fewer trees to converge, but looking at, for example, 2/3 of the variables may result in models more robust to over-fitting. There is a similar parameter called colsample_bytree that re-samples the variables for each new tree instead of each new node.
  • max_depth: Controls the maximum depth of the trees. Deeper trees have more terminal nodes and fit more data. Convergence also requires fewer trees if we grow them deep. However, if the trees are too deep we will be using a lot of information from the first trees, and the final trees of the algorithm will have less importance on the loss function. The Boosting benefits from using information from many trees, so it is intuitive that huge trees are not desirable. Smaller trees also grow faster, and because Boosting grows each new tree on pseudo-residuals, we do not require an outstanding fit from any individual tree.
  • sub_sample: This parameter determines if we are estimating a Boosting or a Stochastic Boosting. If we use 1 we obtain the regular Boosting. Values between 0 and 1 are for the stochastic case. The stochastic Boosting uses only a fraction of the data to grow each tree. For example, if we use 0.5 each tree will sample 50% of the data to grow. Stochastic Boosting is very useful if we have outliers because it limits their influence on the final model: they are dropped from several of the sub-samples. Moreover, it is possible to see a significant improvement on smaller datasets, which are more susceptible to over-fitting.
  • gamma: Controls the minimum reduction in the loss function required to grow a new node in a tree. This parameter is sensitive to the scale of the loss function, which will be linked to the scale of your response variable. The main consequence of using a gamma different from 0 is to stop the algorithm from growing useless trees that barely reduce the in-sample error and are likely to result in over-fitting. I will leave this parameter to a future post to save some space here.
  • min_child_weight: Controls the minimum number of observations (instances) in a terminal node. The minimum value for this parameter is 1, which allows the tree to have terminal nodes with only one observation. If we use bigger values we limit a possible perfect fit on some observations. This parameter will also be left to part II. A short sketch just after this list shows how these names map onto an xgboost parameter list.
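To make the names concrete, here is a minimal, hypothetical parameter list passed to xgboost(). Note that in the xgboost interface the sub-sampling parameter is spelled subsample and the node-size parameter min_child_weight; the toy data and parameter values below are purely illustrative.

library(xgboost)

# Toy data purely for illustration
set.seed(42)
x_toy <- matrix(rnorm(500), nrow = 100, ncol = 5)
y_toy <- rnorm(100)

params <- list(
  eta               = 0.1,  # learning (shrinkage) rate
  colsample_bylevel = 2/3,  # fraction of variables tested at each new node
  max_depth         = 6,    # maximum tree depth
  subsample         = 0.5,  # fraction of rows sampled per tree (stochastic Boosting)
  gamma             = 0,    # minimum loss reduction required to split
  min_child_weight  = 1     # minimum number of observations in a terminal node
)

fit <- xgboost(data = x_toy, label = y_toy, nrounds = 50, params = params, verbose = 0)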

Example

The data will be generate from the following equation:

\displaystyle y_i = (x_i'\beta)^2 + \epsilon_i

where x_i \sim N(0,1), \beta_k = (-0.5)^k,~~ k=1,\dots,K, \epsilon_i \sim N(0,1). The number of variables, K, will be set to 10 and the number of instances to 1000.

The experiment will be to change each Boosting parameter keeping all the others constant to try to isolate their effects. The standard model will have the following parameters:

  • eta: 0.1
  • colsample_bylevel: 2/3
  • max_depth: 6
  • sub_sample: 0.5

I will change each of these parameters to the values in the code. We will analyse the convergence and the root mean squared error (RMSE) in the test sample. The code below will prepare everything to run the models.

library(xgboost)
library(ggplot2)
library(reshape2)
library(Ecdat)

set.seed(1)
N = 1000
k = 10
x = matrix(rnorm(N*k), N, k)
b = (-1)^(1:k)
yaux = (x%*%b)^2
e = rnorm(N)
y = yaux + e

# = select train and test indexes = #
train = sample(1:N, 800)
test = setdiff(1:N, train)

# = parameters = #
# = eta candidates = #
eta = c(0.05, 0.1, 0.2, 0.5, 1)
# = colsample_bylevel candidates = #
cs = c(1/3, 2/3, 1)
# = max_depth candidates = #
md = c(2, 4, 6, 10)
# = sub_sample candidates = #
ss = c(0.25, 0.5, 0.75, 1)
# = standard model is the second value of each vector above = #
standard = c(2, 2, 3, 2)

# = train and test data = #
xtrain = x[train, ]
ytrain = y[train]
xtest = x[test, ]
ytest = y[test]

eta

We will start by analysing the eta parameter. The code below estimates Boosting models for each candidate eta. First we have the convergence plot, which shows that bigger values of eta converge faster. However, the test RMSE just below the plot shows that faster convergence does not translate into good out-of-sample performance. Smaller values of eta like 0.05 and 0.1 are the ones that produce smaller errors. My opinion is that in this case 0.1 gives us a good out-of-sample performance with acceptable convergence speed. Note that the results for eta = 0.5 and 1 are really bad compared to the others.

set.seed(1)
conv_eta = matrix(NA, 500, length(eta))
pred_eta = matrix(NA, length(test), length(eta))
colnames(conv_eta) = colnames(pred_eta) = eta
for (i in 1:length(eta)) {
  params = list(eta = eta[i], colsample_bylevel = cs[standard[2]],
                subsample = ss[standard[4]], max_depth = md[standard[3]],
                min_child_weight = 1)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_eta[, i] = xgb$evaluation_log$train_rmse
  pred_eta[, i] = predict(xgb, xtest)
}

conv_eta = data.frame(iter = 1:500, conv_eta)
conv_eta = melt(conv_eta, id.vars = "iter")
ggplot(data = conv_eta) + geom_line(aes(x = iter, y = value, color = variable))

(Figure: convergence of the training RMSE by iteration for each eta value)

(RMSE_eta = sqrt(colMeans((ytest-pred_eta)^2)))
##      0.05       0.1       0.2       0.5         1
##  9.964462 10.052367 10.223738 13.691344 20.929690

colsample_bylevel

The next parameter controls the fraction of variables (characteristics) tested in each new node. Recall that if we use 1 all variables are tested. Values smaller than 1 test only the corresponding fraction of variables. The convergence shows that the model is much less sensitive to colsample_bylevel. The curves with smaller values are just slightly above the curves with big values. The most accurate model seems to be the one that tests 2/3 of the variables at each new node. This result may change with different types of data. For example, the data we generated used the same distribution for all the explanatory variables and the betas gave the variables a similar weight, which makes the sampling less relevant.

set.seed(1)
conv_cs = matrix(NA, 500, length(cs))
pred_cs = matrix(NA, length(test), length(cs))
colnames(conv_cs) = colnames(pred_cs) = cs
for (i in 1:length(cs)) {
  params = list(eta = eta[standard[1]], colsample_bylevel = cs[i],
                subsample = ss[standard[4]], max_depth = md[standard[3]],
                min_child_weight = 1)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_cs[, i] = xgb$evaluation_log$train_rmse
  pred_cs[, i] = predict(xgb, xtest)
}

conv_cs = data.frame(iter = 1:500, conv_cs)
conv_cs = melt(conv_cs, id.vars = "iter")
ggplot(data = conv_cs) + geom_line(aes(x = iter, y = value, color = variable))

(Figure: convergence of the training RMSE by iteration for each colsample_bylevel value)

(RMSE_cs = sqrt(colMeans((ytest-pred_cs)^2)))
## 0.333333333333333 0.666666666666667                 1
##          10.29836          10.05237          10.20938

max_depth

The third parameter is the maximum depth allowed in each tree. Naturally, the model converges faster if we grow bigger trees (see the figure below). However, the best out-of-sample performance is for max_depth = 4. Remember that bigger trees are more likely to result in over-fitting. In my experience there is no need to use values bigger than 4 to 6, but there may be exceptions.

set.seed(1)
conv_md = matrix(NA, 500, length(md))
pred_md = matrix(NA, length(test), length(md))
colnames(conv_md) = colnames(pred_md) = md
for (i in 1:length(md)) {
  params = list(eta = eta[standard[1]], colsample_bylevel = cs[standard[2]],
                subsample = ss[standard[4]], max_depth = md[i],
                min_child_weight = 1)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_md[, i] = xgb$evaluation_log$train_rmse
  pred_md[, i] = predict(xgb, xtest)
}

conv_md = data.frame(iter = 1:500, conv_md)
conv_md = melt(conv_md, id.vars = "iter")
ggplot(data = conv_md) + geom_line(aes(x = iter, y = value, color = variable))

(Figure: convergence of the training RMSE by iteration for each max_depth value)

(RMSE_md=sqrt(colMeans((ytest-pred_md)^2)))
##         2         4         6        10
## 12.502733  9.374257 10.134965 10.100691

sub_sample

The next parameter determines if we are estimating a Boosting or a Stochastic Boosting. Smaller values result in bigger errors in-sample, but they may generate more robust out-of-sample estimates. However, as mentioned before, you will obtain a bigger improvement on samples that have too many characteristics compared to the number of observations, or when the data is very noisy. That is not the case here. Nevertheless, there is some improvement compared to the deterministic case, which I believe is mostly caused by outliers in the response variable (boxplot the variable y to see).

set.seed(1)
conv_ss = matrix(NA, 500, length(ss))
pred_ss = matrix(NA, length(test), length(ss))
colnames(conv_ss) = colnames(pred_ss) = ss
for (i in 1:length(ss)) {
  params = list(eta = eta[standard[1]], colsample_bylevel = cs[standard[2]],
                subsample = ss[i], max_depth = md[standard[3]],
                min_child_weight = 1)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_ss[, i] = xgb$evaluation_log$train_rmse
  pred_ss[, i] = predict(xgb, xtest)
}

conv_ss = data.frame(iter = 1:500, conv_ss)
conv_ss = melt(conv_ss, id.vars = "iter")
ggplot(data = conv_ss) + geom_line(aes(x = iter, y = value, color = variable))

(Figure: convergence of the training RMSE by iteration for each sub_sample value)

(RMSE_ss=sqrt(colMeans((ytest-pred_ss)^2)))
##      0.25       0.5      0.75         1
##  9.731273 10.052367 11.119535 11.233855

The main conclusion here is that there is no unique rule to tune Boosting models. The best way is to test several configurations.
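One way to organise such a search is a small grid evaluated with xgboost's built-in cross-validation, xgb.cv(). The sketch below reuses xtrain and ytrain from the setup code; the grid values are arbitrary illustrations, not recommendations.

# Small illustrative grid of candidate configurations
grid <- expand.grid(eta = c(0.05, 0.1), max_depth = c(2, 4, 6))
cv_rmse <- numeric(nrow(grid))

for (g in seq_len(nrow(grid))) {
  params <- list(eta = grid$eta[g], max_depth = grid$max_depth[g],
                 colsample_bylevel = 2/3, subsample = 0.5)
  cv <- xgb.cv(params = params, data = xtrain, label = ytrain,
               nrounds = 200, nfold = 5, verbose = 0)
  # the last row of the evaluation log holds the final cross-validated RMSE
  cv_rmse[g] <- tail(cv$evaluation_log$test_rmse_mean, 1)
}

# Configuration with the lowest cross-validated error
grid[which.min(cv_rmse), ]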

Stay tuned if you liked this article, we will be talking more about Boosting soon.

References

(Boosting) Friedman, Jerome H. “Greedy function approximation: a gradient boosting machine.” Annals of statistics (2001): 1189-1232.

(Stochastic Boosting) Friedman, Jerome H. “Stochastic gradient boosting.” Computational Statistics & Data Analysis 38.4 (2002): 367-378.

(CART) Breiman, Leo. Classification and regression trees. Routledge, 2017.


To leave a comment for the author, please follow the link and comment on their blog: R – insightR.


General gems of comperes


(This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

Examples of exported functions from comperes package that could be useful for general tasks.

Prologue

I am very glad to announce that my new package comperes is on CRAN now. It provides tools for managing competition results in a tidy manner as much as possible. For more information go to:

Besides tools for competition results, comperes offers some functions that can be useful in more general tasks. This post presents examples of their most common usage.

Overview

This post covers the following themes:

For examples we will use a shortened version of the everlasting mtcars data set. We will need the following setup:

library(comperes)
library(rlang)
# For example analysis
library(dplyr)
library(tibble)

mtcars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  select(car, cyl, vs, carb) %>%
  as_tibble()

Compute vector levels

We will start with the simplest function. During comperes development, the idea behind it really helped me reason more clearly about the package's functional API. I am talking about levels2(), which computes “levels” of any non-list vector.

It has the following logic: if x has levels attribute then return levels(x); otherwise return character representation of vector’s sorted unique values. Notes about design and implementation of this function:

  • I hesitated a lot about whether it should return a character vector or the same type as the input vector in case x has no levels. In many practical cases there is a need for the latter behavior. However, in the end I decided that type-stable output (levels(x) always returns a character vector or NULL) is better.
  • Conversion to character is done after sorting, which is really important when dealing with numeric vectors.
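Based on that description, a minimal sketch of the underlying logic might look like the following (an illustration only, not the package's actual implementation):

levels2_sketch <- function(x) {
  if (!is.null(levels(x))) {
    levels(x)
  } else {
    # sort first, then convert, so numeric values keep their natural order
    as.character(sort(unique(x)))
  }
}

levels2_sketch(c(10, 1, 2, 12))
## [1] "1"  "2"  "10" "12"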

This function is helpful when one needs to produce unique values in standardized manner (for example, during pairwise distance computation). Some examples:

levels2(mtcars_tbl$cyl)
## [1] "4" "6" "8"

# Importance of conversion to character after sorting
tricky_vec <- c(10, 1, 2, 12)
sort(as.character(tricky_vec))
## [1] "1"  "10" "12" "2"
levels2(tricky_vec)
## [1] "1"  "2"  "10" "12"

Manage item summaries

Arguably, the most common task in data analysis is the computation of group summaries. This task is conveniently done by consecutive application of dplyr’s group_by(), summarise() and ungroup() (to return a regular data frame and not a grouped one). comperes offers a wrapper summarise_item() for this task (which always returns a tibble instead of a data frame), with the additional feature of modifying column names by adding a prefix (which will be handy soon):

cyl_vs_summary <- mtcars_tbl %>%
  summarise_item(
    item = c("cyl", "vs"),
    n = n(), mean_carb = mean(carb),
    .prefix = "cyl_vs__"
  )
cyl_vs_summary
## # A tibble: 5 x 4
##     cyl    vs cyl_vs__n cyl_vs__mean_carb
##                      
## 1    4.    0.         1              2.00
## 2    4.    1.        10              1.50
## 3    6.    0.         3              4.67
## 4    6.    1.         4              2.50
## 5    8.    0.        14              3.50

Sometimes, there is also a need to compare actual values with their summaries across different grouping. For example, determine whether car’s number of carburetors (carb) is bigger than average value per different groupings: by number of cylinders cyl and V/S vs.

To simplify this task, comperes offers a join_item_summary() function for that: it computes item summary with summarise_item() and joins it (with dplyr::left_join()) to input data frame:

# Save (with rlang magic) expression for reused summary
carb_summary <- list(mean_carb = expr(mean(carb)))

# Create new columns with joined grouped summaries
mtcats_gear_summary <- mtcars_tbl %>%
  join_item_summary("cyl", !!! carb_summary, .prefix = "cyl__") %>%
  join_item_summary("vs",  !!! carb_summary, .prefix = "vs__")

print(mtcats_gear_summary, width = Inf)
## # A tibble: 32 x 6
##   car                 cyl    vs  carb cyl__mean_carb vs__mean_carb
##                                     
## 1 Mazda RX4            6.    0.    4.           3.43          3.61
## 2 Mazda RX4 Wag        6.    0.    4.           3.43          3.61
## 3 Datsun 710           4.    1.    1.           1.55          1.79
## 4 Hornet 4 Drive       6.    1.    1.           3.43          1.79
## 5 Hornet Sportabout    8.    0.    2.           3.50          3.61
## # ... with 27 more rows

# Compute comparisons
mtcats_gear_summary %>%
  mutate_at(vars(ends_with("mean_carb")), funs(carb > .)) %>%
  select(car, ends_with("mean_carb")) %>%
  rename_at(vars(-car), funs(gsub("__mean_carb$", "", .)))
## # A tibble: 32 x 3
##   car               cyl   vs   
##                 
## 1 Mazda RX4         TRUE  TRUE 
## 2 Mazda RX4 Wag     TRUE  TRUE 
## 3 Datsun 710        FALSE FALSE
## 4 Hornet 4 Drive    FALSE FALSE
## 5 Hornet Sportabout FALSE FALSE
## # ... with 27 more rows

Adding different prefixes helps navigating through columns with different summaries.

Convert pairwise data

One of the main features of comperes is the ability to compute Head-to-Head values of players in competition. There are functions h2h_long() and h2h_mat() which produce output in “long” (tibble with row describing one ordered pair) and “matrix” (matrix with cell value describing pair in corresponding row and column) formats respectively.

These formats of pairwise data are quite common: “long” is better for tidy computing and “matrix” is better for result presentation. Converting a distance matrix to a data frame with pair data is also the theme of several Stack Overflow questions (for example, this one and that one).

Package comperes has functions as_h2h_long() and as_h2h_mat() for converting between those formats. They are powered by the “general usage” functions long_to_mat() and mat_to_long(). Here is an example of how they can be used to convert between different formats of pairwise distances:

# Compute matrix of pairwise distances based on all numeric columns
dist_mat <- mtcars_tbl %>%
  select_if(is.numeric) %>%
  dist() %>%
  as.matrix()
dist_mat[1:4, 1:4]
##          1        2        3        4
## 1 0.000000 0.000000 3.741657 3.162278
## 2 0.000000 0.000000 3.741657 3.162278
## 3 3.741657 3.741657 0.000000 2.000000
## 4 3.162278 3.162278 2.000000 0.000000

# Convert to data frame (tibble in this case)
dist_tbl <- dist_mat %>%
  mat_to_long(row_key = "id_1", col_key = "id_2", value = "dist")
dist_tbl
## # A tibble: 1,024 x 3
##   id_1  id_2   dist
##     
## 1 1     1      0.  
## 2 1     2      0.  
## 3 1     3      3.74
## 4 1     4      3.16
## 5 1     5      2.83
## # ... with 1,019 more rows

# Convert tibble back to matrix
dist_mat_new <- dist_tbl %>%
  # To make natural row and column sortings
  mutate_at(vars("id_1", "id_2"), as.numeric) %>%
  long_to_mat(row_key = "id_1", col_key = "id_2", value = "dist")
identical(dist_mat, dist_mat_new)
## [1] TRUE

Conclusion

  • Package comperes provides not only tools for managing competition results but also functions with general purpose:
    • Compute vector levels with levels2(). Usually used to produce unique values in standardized manner.
    • Manage item summaries with summarise_item() and join_item_summary(). May be used to concisely compute comparisons of values with summaries from different groupings.
    • Convert pairwise data with long_to_mat() and mat_to_long(). Very helpful in converting pairwise distances between “long” and “matrix” formats.

sessionInfo()

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2   tibble_1.4.2     dplyr_0.7.5.9000 rlang_0.2.0     
## [5] comperes_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16     knitr_1.20       bindr_0.1.1      magrittr_1.5    
##  [5] tidyselect_0.2.4 R6_2.2.2         stringr_1.3.0    tools_3.4.4     
##  [9] xfun_0.1         utf8_1.1.3       cli_1.0.0        htmltools_0.3.6 
## [13] yaml_2.1.17      rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0
## [17] crayon_1.3.4     bookdown_0.7     purrr_0.2.4      glue_1.2.0      
## [21] evaluate_0.10.1  rmarkdown_1.9    blogdown_0.5     stringi_1.1.6   
## [25] compiler_3.4.4   pillar_1.2.1     backports_1.1.2  pkgconfig_2.0.1

To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow .



My book “Deep Learning from first principles” now on Amazon


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

My 4th book (self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($16.99) and kindle ($6.65/Rs449). The book starts with the most primitive 2-layer Neural Network and works its way to a generic L-layer Deep Learning Network, with all the bells and whistles. The book includes detailed derivations and vectorized implementations in Python, R and Octave. The code has been extensively commented and has been included in the Appendix section.

Pick up your copy today!!!

My other books:
1. Practical Machine Learning with R and Python
2. Beaten by sheer pace – Cricket analytics with yorkr
3. Cricket analytics with cricketr


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….


To purrr or not to purrr


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Nicolas Attalides, Data Scientist
library(purrr)
library(magrittr)

I first started using R long before the RStudio and tidyverse days… I remember writing chunks of code in a text editor and copy/pasting it into the R console! Yes I know, shocking. Nonetheless, most of us will have written code over the years that works just fine in base R; however, in my case, the ever-growing adoption of the tidyverse packages (and my personal aspiration to improve my coding skills) has created a sense of necessity to re-write parts of it to fit within the tidyverse setting.

In this blog post I explore the purrr package (part of tidyverse collection) and its use within a data scientist’s toolset. I aim to present the case for using the purrr functions and through the use of examples compare them with base R functionality. To do this, we will concentrate on two typical coding scenarios in base R: 1) loops and 2) the suite of apply functions and then compare them with their relevant counterpart map functions in the purrr package.

However, before I start, I wanted to make it clear that I do sympathise with those of you whose first reaction to purrr is “but I can do all this stuff in base R”. Putting that aside, the obvious first obstacle for us to overcome is to lose the notion of “if it’s not broken why change it” and open our ‘coding’ minds to change. At least, I hope you agree with me that the silver lining of this kind of exercise is to satisfy one’s curiosity about the purrr package and maybe learn something new!

Let us first briefly describe the concept of functional programming (FP) in case you are not familiar with it.

Functional programming (FP)

R is a functional programming language which means that a user of R has the necessary tools to create and manipulate functions. There is no need to go into too much depth here but it suffices to know that FP is the process of writing code in a structured way and through functions remove code duplications and redundancies. In effect, computations or evaluations are treated as mathematical functions and the output of a function only depends on the values of its inputs – known as arguments. FP ensures that any side-effects such as changes in state do not affect the expected output such that if you call the same function twice with the same arguments the function returns the same output.
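As a toy illustration of that last point (my own example, not taken from the reference below), a pure function always returns the same output for the same input, whereas a function that relies on external state does not:

# Pure: the output depends only on the arguments
add_two <- function(x) x + 2
add_two(3)  # 5
add_two(3)  # still 5

# Not pure: the result depends on (and changes) state outside the function
counter <- 0
add_and_count <- function(x) {
  counter <<- counter + 1  # side effect on the global environment
  x + counter
}
add_and_count(3)  # 4
add_and_count(3)  # now 5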

For those that are interested to find out more, I suggest reading Hadley Wickham’s Functional Programming chapter in the “Advanced R” book. The companion website for this can be found here.

The purrr package, which forms part of the tidyverse ecosystem of packages, further enhances the functional programming aspect of R. It allows the user to write functional code with less friction in a complete and consistent manner. The purrr functions can be used, among other things, to replace loops and the suite of apply functions.

Let’s talk about loops

The motivation behind the examples we are going to look at involve iterating in R for various scenarios. For example, iterate over elements of a vector or list, iterate over rows or columns of a matrix … the list (pun intended) can go on and on!

One of the first things that one gets very excited to ‘play’ with when learning to use R – at least that was the case for me – is loops! Lots of loops, elaborate, complex… dare I say never ending infinite loops (cue hysterical laughter emoji). Joking aside, it is usually the default answer to a problem that involves iteration of some sort, as I demonstrate below.

# Create a vector of the mean values of all the columns of the mtcars dataset
# The long repetitive way
mean_vec <- c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp), mean(mtcars$hp),
              mean(mtcars$drat), mean(mtcars$wt), mean(mtcars$qsec), mean(mtcars$vs),
              mean(mtcars$am), mean(mtcars$gear), mean(mtcars$carb))
mean_vec

# The loop way
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop

The resulting vectors are the same and the difference in speed (milliseconds) is negligible. I hope that we can all agree that the long way is definitely not advised and actually is bad coding practice, let alone the frustration (and error-prone task) of copy/pasting. Having said that, I am sure there are other ways to do this – I demonstrate this later using lapply– but my aim was to show the benefit of using a for loop in base R for an iteration problem.

Now imagine if in the above example I wanted to calculate the variance of each column as well…

# Create two vectors of the mean and variance of all the columns of the mtcars dataset
# For mean
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop

# For variance
var_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
var_vec_loop

# Or combine both calculations in one loop
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
mean_vec_loop
var_vec_loop

Now let us assume that we know that we want to create these vectors not just for the mtcars dataset but for other datasets as well. We could in theory copy/paste the for loops and just change the dataset we supply in the loop but one should agree that this action is repetitive and could result to mistakes. Instead we can generalise this into functions. This is where FP comes into play.

# Create two functions that return the mean and variance of the columns of a dataset
# For mean
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}
col_mean(mtcars)

# For variance
col_variance <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- var(df[[i]])
  }
  output
}
col_variance(mtcars)

Why not take this one step further and take full advantage of R’s functional programming tools by creating a function that takes as an argument a function! Yes, you read it correctly… a function within a function!

Why do we want to do that? Well, the code for the two functions above, as clean as it might look, is still repetitive, and the only real difference between col_mean and col_variance is the mathematical function that we are calling. So why not generalise this further?

# Create a function that returns a computational value (such as mean or variance)
# for a given dataset
col_calculation <- function(df, fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}
col_calculation(mtcars, mean)
col_calculation(mtcars, var)

Did someone say apply?

I mentioned earlier that an alternative way to solve the problem is to use the apply function (or suite of apply functions such as lapply, sapply, vapply, etc). In fact, these functions are what we call Higher Order Functions. Similar to what we did earlier, these are functions that can take other functions as an argument.

The benefit of using higher order functions instead of a for loop is that they allow us to think about what code we are executing at a higher level. Think of it as: “apply this to that” rather than “take the first item, do this, take the next item, do this…”

I must admit that at first it might take a little while to get used to, but there is definitely a sense of pride when you can improve your code by eliminating for loops and replacing them with apply-type functions.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
lapply(mtcars, mean) %>% head # Returns a list
sapply(mtcars, mean) %>% head # Returns a vector

Once again, speed of execution is not the issue and neither is the common misconception about loops being slow compared to apply functions. As a matter of fact the main argument in favour of using lapply or any of the purrr functions as we will see later is the pure simplicity and readability of the code. Full stop.

Enter the purrr

The best place to start when exploring the purrr package is the map function. The reader will notice that these functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the output – as opposed to some cases when using for example sapply as I demonstrate later on.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
map(mtcars, mean) %>% head # Returns a list
map_dbl(mtcars, mean) %>% head # Returns a vector - of class double

Let us introduce the iris dataset with a slight modification in order to demonstrate the inconsistency that sometimes can occur when using the sapply function. This can often cause issues with the code and introduce mystery bugs that are hard to spot.

# Modify iris dataset
iris_mod <- iris
iris_mod$Species <- ordered(iris_mod$Species) # Ordered factor levels
class(iris_mod$Species) # Note: The ordered function changes the class

# Extract class of every column in iris_mod
sapply(iris_mod, class) %>% str # Returns a list of the results
sapply(iris_mod[1:3], class) %>% str # Returns a character vector!?!? - Note: inconsistent object type

Since by default map returns a list, one can ensure that an object of the same class is returned without any unexpected (and unwanted) surprises. This is in line with FP consistency.

# Extract class of every column in iris_mod
map(iris_mod, class) %>% str # Returns a list of the results
map(iris_mod[1:3], class) %>% str # Returns a list of the results

To further demonstrate the consistency of the purrr package in this type of setting, the map_*() functions (see below) can be used to return a vector of the expected type, otherwise you get an informative error.

  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.
# Extract class of every column in iris_mod
map_chr(iris_mod[1:4], class) %>% str # Returns a character vector
map_chr(iris_mod, class) %>% str # Returns a meaningful error

# As opposed to the equivalent base R function vapply
vapply(iris_mod[1:4], class, character(1)) %>% str  # Returns a character vector
vapply(iris_mod, class, character(1)) %>% str  # Returns a possibly harder to understand error

It is worth noting that if the user does not wish to rely on tidyverse dependencies they can always use base R functions but need to be extra careful of the potential inconsistencies that might arise.

Multiple arguments and neat tricks

In case we wanted to apply a function to multiple vector arguments we have the option of mapply from base R or the map2 from purrr.

# Create random normal values from a list of means and a list of standard deviations
mu <- list(10, 100, -100)
sigma <- list(0.01, 1, 10)
mapply(rnorm, n = 5, mu, sigma, SIMPLIFY = FALSE) # I need SIMPLIFY = FALSE because otherwise I get a matrix
map2(mu, sigma, rnorm, n = 5)

The map2 function can easily extend to further arguments – not just two as in the example above – and that is where the pmap function comes in.
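As a quick sketch (reusing mu and sigma from the snippet above; the varying sample sizes in n_draws are made up for illustration), pmap() takes a list of argument vectors and iterates over all of them in parallel:

# Vary the number of draws as a third argument
n_draws <- list(1, 3, 5)
pmap(list(n = n_draws, mean = mu, sd = sigma), rnorm)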

I also thought of sharing a couple of neat tricks that one can use with the map function.

1) Say you want to fit a linear model for every cylinder type in the mtcars dataset. You can avoid code duplication and do it as follows:

# Split mtcars dataset by cylinder values and then fit a simple lm
models <- mtcars %>%
  split(.$cyl) %>% # Split by cylinder into 3 lists
  map(function(df) lm(mpg ~ wt, data = df)) # Fit linear model for each list

2) Say we are using a function, such as sqrt (calculate square root), on a list that contains a non-numeric element. The base R function lapply throws an error and execution stops, without telling us which element caused the error. The safely function of purrr completes execution and the user can identify what caused the error.

x <- list(1, 2, 3, "e", 5)

# Base R
lapply(x, sqrt)

# purrr package
safe_sqrt <- safely(sqrt)
safe_result_list <- map(x, safe_sqrt) %>% transpose
safe_result_list$result

Conclusion

Overall, I think it is fair to say that using higher order functions in R is a great way to improve ones code. With that in mind, my closing remark for this blog post is to simply re-iterate the benefits of using the purrr package. That is:

  • The output is consistent.
  • The code is easier to read and write.

If you enjoyed learning about purrr, then you can join us at our purrr workshop at this years EARL London – early bird tickets are available now!


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Le Monde puzzle [#1051]


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A combinatoric Le Monde mathematical puzzle of limited size: When the only allowed move is to switch two balls from adjacent boxes, what is the minimal number of moves to return all balls in the above picture to their respective boxes? Same question with six boxes and 12 balls.

The question is rather interesting to code as I decided to use recursion (as usual!) but wanted to gain time by storing the number of steps needed by any configuration to reach its ordered recombination. Meaning I had to update an external vector within the recursive function for each new configuration I met. With help from Julien Stoehr, who presented me with the following code, a simplification of a common R function

v.assign <- function (i, value, ...) {
  temp <- get(i, pos = 1)
  temp[...] <- value
  assign(i, temp, pos = 1)
}

which assigns one or several entries to the external vector i. I thus used this trick in the following R code, where cosz is a vector of size 5¹⁰, much larger than the fewer than 10! values I need, but easier to code, at least while n≤5.

n=5; tn=2*n
baz=n^(0:(tn-1))
cosz=rep(-1,n^tn)

swee <- function(balz){
  indz <- sum((balz-1)*baz)
  if (cosz[indz]==-1){
   if (min(diff(balz))==0){ #ordered
     v.assign("cosz",indz,value=1)
   }else{
       val <- n^tn
       for (i in 2:n)
       for (j in (2*i-1):(2*i))
       for (k in (2*i-3):(2*i-2)){
         calz <- balz
         calz[k] <- balz[j]; calz[j] <- balz[k]
         if (max(balz[k]-calz[k],calz[j]-balz[j])>0)
            val <- min(val,1+swee(calz))
       }
     v.assign("cosz",indz,value=val)
  }}
 return(cosz[indz])
}

which returns 2 for n=2, 6 for n=3, 11 for n=4, 15 for n=5. In the case n=6, I need a much better coding of the permutations of interest. Which is akin to ranking all words within a dictionary with letters (1,1,…,6,6). After some thinking (!) and searching, I came up with a new version, defining

parclass=rep(2,n)

rankum=function(confg){
    n=length(confg); permdex=1
    for (i in 1:(n-1)){
      x=confg[i]
      if (x>1){
        for (j in 1:(x-1)){
            if(parclass[j]>0){
                parclass[j]=parclass[j]-1
                permdex=permdex+ritpermz(n-i,parclass)
                parclass[j]=parclass[j]+1
            }
        }
      }
      parclass[x]=parclass[x]-1
    }
    return(permdex)
}

ritpermz=function(n,parclass){
    return(factorial(n)/prod(factorial(parclass)))
}

for finding the index of a given permutation, between 1 and (2n)!/2!..2!, and then calling the initial swee(p) with this modified allocation. The R code is still running…
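As a quick sanity check of the counting helper ritpermz() (my own addition, not part of the original post), the total number of distinct arrangements of the twelve balls for n=6 is:

ritpermz(12, rep(2, 6))
## [1] 7484400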


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


Analyzing Customer Data from Square


(This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)

The Square Data Model

Whether you own your own business or consult for a business using Square to capture payment data, Square can offer some amazing opportunities to gain insights by leveraging their Connect v1 & v2 APIs. The Square data backend operates much like a CRM system that holds information about transactions that customers make when purchasing items offered at a location. Naturally, there are API endpoints for each of the object types italicized above (Locations, Customers, Transactions, and Items) and many more endpoints for administrative tasks (employees, roles, timecards, refunds, etc.).

Authenticating

While working with Square data I decided to develop an R package squareupr that makes it easier to retrieve Square data from R so that you can focus on the analysis. After installing the squareupr package you must authenticate by supplying a personal access token (PAT) or using an OAuth 2.0 flow. You can find your PAT by logging into the Square Dashboard -> Apps -> My Apps. Create an app or click “Manage App” if you’ve already created one and there you should see your personal access token:

# The squareupr package is not yet available on CRAN so you must install from GitHub
# install.packages("devtools")
devtools::install_github("StevenMMortimer/squareupr")
library(tidyverse)
library(squareupr)
# authenticate using your Personal Access Token (PAT)
sq_auth(personal_access_token = "sq-Th1s1sMyPers0nalAcessT0ken")

The package also offers OAuth 2.0 authentication. More information is available here.

Pulling Transaction Data

As mentioned above there are endpoints for every major type of data stored by Square. The API documentation does a very good job at laying out how requests should be made to each of those endpoints. One thing to note is that it is important to first pull down the location details for your business because the location is often required when searching for things like transactions and items.

our_locations <- sq_list_locations()
our_locations$name <- "{HIDDEN}"
our_locations %>% select(id, name, address, timezone,
                         capabilities, status, created_at)
#> # A tibble: 5 x 7
#>   id            name     address  timezone  capabilities status created_at
#> 1 46FYN9N9RQS54 {HIDDEN}    ACTIVE 2017-04-2…
#> 2 DRDCJ2X8E2PMV {HIDDEN}    ACTIVE 2016-09-2…
#> 3 8T1TYXE840S00 {HIDDEN}    ACTIVE 2016-09-2…
#> 4 1AWPRVVVFWGQF {HIDDEN}    ACTIVE 2017-04-1…
#> 5 50X1GNAWEC8V0 {HIDDEN}    ACTIVE 2017-03-0…

Now that you have the location ids if you would like to pull all of the transactions during a given timeframe, you would use the function sq_list_transactions().

# list all transactions for our 2nd location on May 11, 2018
# by default, if a date is provided with no time, then the time component is set to midnight
our_transactions <- sq_list_transactions(location = our_locations$id[2],
                                         begin_time = as.Date('2018-05-11'),
                                         end_time = as.Date('2018-05-12'))
our_transactions
#> # A tibble: 245 x 6
#>    id          location_id  created_at    tenders product client_id
#>  1 bUjFGVjBvN… DRDCJ2X8E2P… 2018-05-12T0…
#>  2 5PZP31N5Zs… DRDCJ2X8E2P… 2018-05-11T2…
#>  3 BTrGydD6he… DRDCJ2X8E2P… 2018-05-11T2…
#>  4 XsqOAHl68z… DRDCJ2X8E2P… 2018-05-11T2…
#>  5 vmLRzrwByS… DRDCJ2X8E2P… 2018-05-11T2…
#>  6 pTbzQApZW7… DRDCJ2X8E2P… 2018-05-11T2…
#>  7 lnE20zklpP… DRDCJ2X8E2P… 2018-05-11T2…
#>  8 DSumrqQW0L… DRDCJ2X8E2P… 2018-05-11T2…
#>  9 tPwFXetIwe… DRDCJ2X8E2P… 2018-05-11T2…
#> 10 bqUuFrzH71… DRDCJ2X8E2P… 2018-05-11T2…
#> # ... with 235 more rows

At first glance there does not appear to be very much detail on the transaction record. However, the tender field represents a method of payment used in a Square transaction, so it contains information regarding the amount of money paid in total, in Square fees, and in tip. The tender field even contains the customer_id and credit card information. In the following I will loop through the transactions in April 2018 and determine the total spend for each customer in the dataset. Note: The call to sq_list_transactions may take a couple minutes to complete if you are pulling thousands of transactions.

april_transactions <- sq_list_transactions(location = our_locations$id[2],
                                           begin_time = as.Date('2018-04-01'),
                                           end_time = as.Date('2018-05-01'))

In order to extract the customer ID and money spent I create a function that checks for the tender object on the transaction and if it exists tries to extract the data into a tbl_df. When I supply this function as map_df(extract_cust_info_func) %>% I get the data from each transaction stacked into a single tbl_df that’s ready to analyze.

# create a function that will extract out just the customer id and money spent
extract_cust_info_func <- function(x){
  if(!is.null(x$tender)){
    tibble(customer_id = sq_null_to_na(x$tender[[1]]$customer_id),
           money_spent = sq_null_to_na(x$tender[[1]]$amount_money$amount))
  } else {
    tibble(customer_id = NA_character_,
           money_spent = NA_integer_)
  }
}

april_customer_spend <- april_transactions %>%
  transpose() %>%
  # pull out just the information we want from each transaction
  map_df(extract_cust_info_func) %>%
  group_by(customer_id) %>%
  summarize(total_spend = sum(money_spent, na.rm=TRUE))
april_customer_spend
#> # A tibble: 208 x 2
#>    customer_id                total_spend
#>  1 064HFDQG0N52AHDBSG00C1BAC8        1700
#>  2 07VNWH1V4S6W4W2EJ4AN7SEJNR           0
#>  3 08M453QNJ97BCT97SM09TN7QK4           0
#>  4 08XE43X8FS0MPX8P2W4N0DEQY0         350
#>  5 0CZ78CVRW12V7AZET8S3S82GGW         675
#>  6 0G1J81148H42GGMTMQKRWSJHGC           0
#>  7 0V1Y1BX23WYRK889ERVBE2T0KM         900
#>  8 13HHFBFZTD33RX0RSJNAZQKV5M           0
#>  9 1BFCHB9MK91GQ39HTD7QK6R7ZR           0
#> 10 1DR0AK5GKX57H9ER9SG1JF01P0        2900
#> # ... with 198 more rows

The amounts in april_customer_spend may seem large, but the Square APIs return money as integers that represent whole cents. If you divide by 100, then you will have the money amounts in dollars.
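For example, converting the totals above to dollars is a one-line mutate (a small sketch; the column name total_spend_dollars is my own):

april_customer_spend %>%
  mutate(total_spend_dollars = total_spend / 100)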

Spend by Customer Group

Square has this concept of “groups” that customers belong to. These groups can be fashioned to do marketing campaigns complete with email blasts. In our analysis let’s further determine which groups these customers belong to. The Square API has an endpoint to retrieve one customer at a time; however, with large lists you may get subjected to rate limiting. Rate limiting is errors on your requests because too many are coming from the same application or access token. I would recommend pulling down the entire list of customers with sq_list_customers() and then matching them up in R.

cust_groups <- sq_list_customers() %>%
  select(id, groups) %>%
  sq_extract_cust_groups() %>%
  # filter to only the groups designated automatically by Square
  filter(grepl("^CQ689YH4KCJMY", groups.id))
cust_groups
#> # A tibble: 13,444 x 3
#>    id                         groups.id                 groups.name
#>  1 M1RBDFRK7S1Q1EP6EZFJFV3CBW CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa…
#>  2 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.CHURN_RISK  Lapsed
#>  3 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.REACHABLE   Reachable
#>  4 MBSJA4QV4WX6N2XV8WV9VJJTG8 CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa…
#>  5 MBSJA4QV4WX6N2XV8WV9VJJTG8 CQ689YH4KCJMY.REACHABLE   Reachable
#>  6 ZCBZJ234217KTV812WX4DP2404 CQ689YH4KCJMY.REACHABLE   Reachable
#>  7 FKEMR8KZCN3BH98RV78PKHKQ1R CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa…
#>  8 FKEMR8KZCN3BH98RV78PKHKQ1R CQ689YH4KCJMY.LOYAL       Regulars
#>  9 78VMJPJNK959AHH0ZQPXDXEG3C CQ689YH4KCJMY.LOYALTY_ALL Loyalty Participa…
#> 10 QASM1G54VX0QN2S15YS6KHEFCC CQ689YH4KCJMY.LOYAL       Regulars
#> # ... with 13,434 more rows

Now that we know the customer memberships, let’s join back with the transaction data to determine the average total spend across the different membership groups.

cust_groups %>%
  # bring in the spend data
  inner_join(., april_customer_spend, by=c('id'='customer_id')) %>%
  # group by the customer groups to find average spend per group
  group_by(groups.name) %>%
  summarize(avg_spend = mean(total_spend / 100, na.rm=TRUE))
#> # A tibble: 6 x 2
#>   groups.name              avg_spend
#> 1 Cards on File                569. 
#> 2 Lapsed                        20.3
#> 3 Loyalty Participants          13.5
#> 4 Nearing a Loyalty Reward      15.3
#> 5 Reachable                     51.0
#> 6 Regulars                      27.4

Issues with the APIs

What is great about this analysis is that we can use the Square APIs to quickly and reliably pull down transaction data, match it to customer information and see how certain customer groupings or campaigns are performing. However, I did find some quirks and drawbacks that I wish the Square team would consider:

Cannot Request Specific Fields

First, it would be nice to implement, as part of the query parameters, the ability to only return certain fields (e.g. the id and groups fields from the Customer object). This would help for two reasons: 1) It would improve the speed since only the required data would be passed across and 2) For analysts who do not want to deal with personally identifiable information (even in RAM) you would not be forced to pull information like phone number, email, names, and even credit card information.

Cannot Update Customer Groups Programmatically

Second, the API will not allow you to update the customer groups programmatically. This means that you are stuck using the dashboard to create and assign customers to the groups you want to analyze. It is a drag when you want to create a reproducible research workflow to assign customer groups but the API hinders your ability to do so.

Customer ID Not On Transactions

Finally, it appears that the customer ID associated to a transaction is not reliably captured on every transaction’s tender object. This means that you cannot get a complete picture of all the transactions. I believe part of this is an issue with the merchant capturing the customer information at the point of sale, but also partly Square’s fault since they seem to be able to piece everything together in their dashboard. However, this is not the case with the APIs. Overall, the Square APIs are a rich data resource for helping run a business and they should only get better with time as development progresses.


To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on stevenmortimer.com.


Beautiful and Powerful Correlation Tables in R


(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

Another correlation function?!

Yes, the correlation function from the psycho package.

devtools::install_github("neuropsychology/psycho.R")  # Install the newest version
library(psycho)
library(tidyverse)

cor <- psycho::affective %>% 
  correlation()

This function automatically selects numeric variables and runs a correlation analysis. It returns a psychobject.

A table

We can then extract a formatted table that can be saved and pasted into reports and manuscripts by using the summary function.

summary(cor)
# write.csv(summary(cor), "myformattedcortable.csv")
                    Age     Life_Satisfaction  Concealing  Adjusting
Age
Life_Satisfaction    0.03
Concealing          -0.05   -0.06
Adjusting            0.03    0.36***            0.22***
Tolerating           0.03    0.15***            0.07        0.29***

A Plot

It integrates a plot created with ggcorrplot.

plot(cor)

A print

It also includes a pairwise correlation printing method.

print(cor)
Pearson Full correlation (p value correction: holm):

   - Age / Life_Satisfaction:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Life_Satisfaction (r(1249) = 0.030, p > .1).
   - Age / Concealing:   Results of the Pearson correlation showed a non significant and weak positive association between Age and Concealing (r(1249) = -0.050, p > .1).
   - Life_Satisfaction / Concealing:   Results of the Pearson correlation showed a non significant and weak positive association between Life_Satisfaction and Concealing (r(1249) = -0.063, p > .1).
   - Age / Adjusting:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Adjusting (r(1249) = 0.027, p > .1).
   - Life_Satisfaction / Adjusting:   Results of the Pearson correlation showed a significant and moderate negative association between Life_Satisfaction and Adjusting (r(1249) = 0.36, p < .001***).
   - Concealing / Adjusting:   Results of the Pearson correlation showed a significant and weak negative association between Concealing and Adjusting (r(1249) = 0.22, p < .001***).
   - Age / Tolerating:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Tolerating (r(1249) = 0.031, p > .1).
   - Life_Satisfaction / Tolerating:   Results of the Pearson correlation showed a significant and weak negative association between Life_Satisfaction and Tolerating (r(1249) = 0.15, p < .001***).
   - Concealing / Tolerating:   Results of the Pearson correlation showed a non significant and weak negative association between Concealing and Tolerating (r(1249) = 0.074, p = 0.05°).
   - Adjusting / Tolerating:   Results of the Pearson correlation showed a significant and weak negative association between Adjusting and Tolerating (r(1249) = 0.29, p < .001***).

Options

You can also customize the type (pearson, spearman or kendall), the p-value correction method (holm (default), bonferroni, fdr, none…) and run partial, semi-partial or glasso correlations.

psycho::affective %>% 
  correlation(method = "pearson", adjust = "bonferroni", type = "partial") %>% 
  summary()
                    Age     Life_Satisfaction  Concealing  Adjusting
Age
Life_Satisfaction    0.01
Concealing          -0.06   -0.16***
Adjusting            0.02    0.36***            0.25***
Tolerating           0.02    0.06               0.02        0.24***
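These arguments can be combined freely; for example, here is the same call with a Spearman correlation and FDR-adjusted p-values (both options are listed above, so this is just a variant of the previous snippet):

psycho::affective %>% 
  correlation(method = "spearman", adjust = "fdr") %>% 
  summary()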

Fun with p-hacking

In order to prevent people from running many uncorrected correlation tests (promoting p-hacking and result-fishing), we included the i_am_cheating parameter. If FALSE (the default), the function will help you find interesting results!

df_with_11_vars <- data.frame(replicate(11, rnorm(1000)))
cor <- correlation(df_with_11_vars, adjust = "none")
## Warning in correlation(df_with_11_vars, adjust = "none"): We've detected that you are running a lot (> 10) of correlation tests without adjusting the p values. To help you in your p-fishing, we've added some interesting variables: You never know, you might find something significant!
## To deactivate this, change the 'i_am_cheating' argument to TRUE.
summary(cor)
                           X1        X2        X3        X4        X5        X6        X7        X8        X9        X10       X11
X1
X2                         -0.04
X3                         -0.04     -0.02
X4                          0.02      0.05     -0.02
X5                         -0.01     -0.02      0.05     -0.03
X6                         -0.03      0.03      0.08*     0.02      0.02
X7                          0.03     -0.01     -0.02     -0.04     -0.03     -0.04
X8                          0.01     -0.07*     0.04      0.02     -0.01     -0.01      0.00
X9                         -0.02      0.03     -0.03     -0.02      0.00     -0.04      0.03     -0.02
X10                        -0.03      0.00      0.00      0.01      0.01     -0.01      0.01     -0.02      0.02
X11                         0.01      0.01     -0.03     -0.05      0.00      0.05      0.01      0.00     -0.01      0.07*
Local_Air_Density           0.26***  -0.02     -0.44***  -0.15***  -0.25***  -0.50***   0.57***  -0.11***   0.47***   0.06      0.01
Reincarnation_Cycle        -0.03     -0.02      0.02      0.04      0.01      0.00      0.05     -0.04     -0.05     -0.01      0.03
Communism_Level             0.58***  -0.44***   0.04      0.06     -0.10**   -0.18***   0.10**    0.46***  -0.50***  -0.21***  -0.14***
Alien_Mothership_Distance   0.00     -0.03      0.01      0.00     -0.01     -0.03     -0.04      0.01      0.01     -0.02      0.00
Schopenhauers_Optimism      0.11***   0.31***  -0.25***   0.64***  -0.29***  -0.15***  -0.35***  -0.09**    0.08*    -0.22***  -0.47***
Hulks_Power                 0.03      0.00      0.02      0.03     -0.02     -0.01     -0.05     -0.01      0.00      0.01      0.03

As we can see, Schopenhauer’s Optimism is strongly related to many variables!!!

Credits

Found this package useful? You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.

