
Reproduce analysis of a political attitudes experiment by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

Terence Wood, Chris Hoy and Jonathan Pryke recently published this paper on The Effect of Geostrategic Competition on Public Attitudes to Aid. I was interested a) because of my past professional engagement with public attitudes to aid (my very first real, full-time permanent job was as a community campaigns coordinator for Community Aid Abroad, which later rebranded as Oxfam Australia); b) because Terence and I used to share a cubicle in NZAID / DFAT; and c) because I am particularly interested in experiments at the moment, and this was a nice example of an experiment in a non-medical area with reproducible data and code.

The experiment compared the attitudes of members of the Australian and New Zealand public to overseas aid, with and without the “treatment” of being exposed to written vignettes of varying degrees of forcefulness about China’s rise as an aid donor in the Pacific. Data collection was by questions included in Ipsos MORI online omnibus surveys.

From the Australian findings:

“As expected, treating participants reduced hostility to aid and increased support for more aid focused on the Pacific. Counter to expectations, however, treatment reduced support for using aid to advance Australian interests.”

The result in the first sentence, but not the second, was replicated with the New Zealand subjects (with questions appropriately reworded for New Zealand, rather than Australia, of course).

The finding about treatment reducing support for using aid to advance Australian interests was a surprise. China’s aid activities in the Pacific are fairly transparently undertaken with a heavy dose of geo-strategic competitive motivation. The researchers had expected, based on other literature to date, that being exposed to this phenomenon would make Australians more inclined to use our own aid program the same way.

Reproducing results

Terence and friends have made the data and Stata code for the analysis available on the Harvard Dataverse. My first move was to check that I could reproduce their results. Here’s the original table of results from the Australian analysis:

[Screenshot from the original paper: table of regression results]

That table combines the regression coefficients in the odd-numbered columns with contrasts between the ‘Measured’ and ‘Forceful’ versions of the vignette in the even-numbered columns. I found this mixed presentation slightly untidy, so I will reproduce it in two separate steps. Here’s my reproduction of columns (1), (3) and (5) from that first table, i.e. the regression results:


Table 1a – Regression results, experiment with three response variables, Australian public

Regression results for Australians

                       Dependent variable:
                       Too much aid (1)         More to Pacific (2)      Help Australia (3)
Measured vignette      -0.079*** (0.029)        0.052* (0.028)           -0.063** (0.028)
Forceful vignette      -0.093*** (0.029)        0.089*** (0.028)         -0.097*** (0.028)
Control group mean     0.518*** (0.020)         0.257*** (0.020)         0.598*** (0.020)
Observations           1,816                    1,647                    1,844
R2                     0.007                    0.006                    0.007
Adjusted R2            0.006                    0.005                    0.005
Residual Std. Error    0.497 (df = 1813)        0.459 (df = 1644)        0.497 (df = 1841)
F Statistic            6.077*** (df = 2; 1813)  5.150*** (df = 2; 1644)  6.028*** (df = 2; 1841)

Note: *p<0.1; **p<0.05; ***p<0.01

Good, I seem to have exactly the same results. Here’s my R code that does the analysis so far for both the Australian and New Zealand results.

Post continues below R code

# Wood, Terence, 2020, "Replication Data for The effect of geostrategic
# competition on public attitudes to aid", https://doi.org/10.7910/DVN/3VVWPL,
# Harvard Dataverse, V1, UNF:6:a3zSXQF/lkQQhkGYNqchGg== [fileUNF]
library(tidyverse)
library(haven)
library(glue)
library(stargazer)
library(kableExtra)
library(emmeans)
library(patchwork)
library(boot)
library(mice)
library(ggdag)
library(clipr)

the_caption <- "Reproduction data from Terence Wood et al, 'The Effect of Geostrategic Competition on Public Attitudes to Aid'"

# Data needed downloaded manually from Harvard Dataverse
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3VVWPL
aust <- read_stata("1 Australia data JEPS FINAL.dta") %>%
  mutate(treatment_group = as_factor(treatment_group))

nz <- read_stata("2 NZ data JEPS FINAL.dta") %>%
  mutate(treatment_group = as_factor(treatment)) %>%
  rename(too_much_aid = toomuchaid,
         more_to_pac = morepac,
         favour_nz = favournz)

# nz gets this error, but it is only a problem for the print method
# Error in gsub(finish, start, ..., fixed = TRUE) :
#   input string 3 is invalid UTF-8

#-------------Australian models--------------------
aust_mods <- list()

all_response_vars <- c("too_much_aid", "more_to_pac", "favour_aus")
all_response_labs <- c("Too much aid", "More to Pacific", "Help Australia")

for(i in 1:length(all_response_vars)){
  form <- as.formula(glue("{all_response_vars[[i]]} ~ treatment_group"))
  aust_mods[[i]] <- lm(form, data = aust)
}

# Regression results:
stargazer(aust_mods[[1]], aust_mods[[2]], aust_mods[[3]],
          type = "html",
          dep.var.labels = all_response_labs,
          title = "Regression results for Australians") %>%
  write_clip()

#---------------------New Zealand models------------------
nz_mods <- list()

all_response_vars_nz <- c("too_much_aid", "more_to_pac", "favour_nz")
all_response_labs_nz <- c("Too much aid", "More to Pacific", "Help New Zealand")

for(i in 1:length(all_response_vars_nz)){
  form <- as.formula(glue("{all_response_vars_nz[[i]]} ~ treatment_group"))
  nz_mods[[i]] <- lm(form, data = nz)
}

# Regression results:
stargazer(nz_mods[[1]], nz_mods[[2]], nz_mods[[3]],
          type = "html",
          dep.var.labels = all_response_labs_nz,
          title = "Regression results for New Zealanders")

… and here are the results for the New Zealand side of the experiment. Again, these match the original article. Note that the New Zealand experiment had only one version of the vignette on the Chinese aid program (unlike the Australian experiment, which had ‘forceful’ and ‘measured’ versions).


Table 1b - Regression results, experiment with three response variables, New Zealand public

Regression results for New Zealanders

                       Dependent variable:
                       Too much aid (1)         More to Pacific (2)      Help New Zealand (3)
Vignette               -0.077*** (0.028)        0.070** (0.030)          -0.027 (0.031)
Control group mean     0.350*** (0.020)         0.310*** (0.021)         0.531*** (0.021)
Observations           1,070                    998                      1,070
R2                     0.007                    0.005                    0.001
Adjusted R2            0.006                    0.004                    -0.0002
Residual Std. Error    0.463 (df = 1068)        0.474 (df = 996)         0.500 (df = 1068)
F Statistic            7.341*** (df = 1; 1068)  5.422** (df = 1; 996)    0.798 (df = 1; 1068)

Note: *p<0.1; **p<0.05; ***p<0.01

Contrasts

Columns (2), (4) and (6) of the original Australian table present the contrasts between the ‘measured’ and ‘forceful’ versions of the vignette on Chinese aid activity. There’s no significant evidence that the two versions of the vignette have different effects, but it’s important to check that I can get the same results. Here’s my estimate of those contrasts, calculated with the help of the emmeans R package by Russell Lenth et al.


Table 2 - Contrast between effect of two vignettes - ‘measured’ and ‘forceful’ - on three different response variables

response           'Measured' minus 'Forceful'   SE     df     t.ratio   p.value
Too much aid       -0.01                         0.03   1813    0.49     0.62
More to Pacific     0.04                         0.03   1644   -1.33     0.18
Help Australia     -0.03                         0.03   1841    1.19     0.24

emmeans is a nice package for comparing different levels of a factor in a linear model. It helps create plots of the comparisons too, looking after the tedious calculations of confidence intervals for you. This next chart, for just one of the response variables, is similar to Figure 1 from the original article. It shows the average proportion of the sample favouring the national interest as a motivation for aid under the different treatments. The red arrows inside the bars are an interpretive aid for making comparisons between two different levels of the effect, whereas the blue bars are the confidence intervals for a single level.

This finding - exposure to information about China leads to more support for a poverty-focused Australian aid program - is definitely interesting, and as mentioned earlier not what would be expected. It’s worth noting that it didn’t replicate with the New Zealand data. My main question about the interpretation of this finding is how much it depends on the vignette about Chinese aid activity, as opposed to a general priming on aid.

The three treatments in the Australian case were a strident piece on Chinese aid, a measured one, and no information at all. The comparison that would be interesting for future research would be a vignette about aid but perhaps without mentioning specific country donors at all. Because overseas aid and economic development in the Pacific are very low profile topics in Australia, it could be that almost any information about the issues leads to changes in how aid objectives are perceived. But other than noting the lack of obvious difference in impact between the “measured” and “forceful” treatment groups, I am speculating here.

Here is the R code for the table and chart showing the contrasts between different treatment levels:

#----------Australian Contrasts--------------
rbind(
  as_tibble(pairs(emmeans(aust_mods[[1]], "treatment_group"))[3, ]),
  as_tibble(pairs(emmeans(aust_mods[[2]], "treatment_group"))[3, ]),
  as_tibble(pairs(emmeans(aust_mods[[3]], "treatment_group"))[3, ])
) %>%
  # pairs actually gives us the Measured estimate relative to Forceful; we want
  # the reverse:
  mutate(estimate = -estimate) %>%
  # shuffle some stuff for presentation:
  select(-contrast) %>%
  rename(`'Measured' minus 'Forceful'` = estimate) %>%
  mutate(response = all_response_labs) %>%
  select(response, everything()) %>%
  kable(digits = 2) %>%
  kable_styling()

#-------------margins plot, Figure 1:---------------
p3a <- plot(emmeans(lm(favour_aus ~ treatment_group, data = aust), specs = "treatment_group"),
            comparisons = TRUE) +
  labs(x = "Aid should help Australia",
       y = "")

p3b <- plot(emmeans(lm(favour_poor ~ treatment_group, data = aust), specs = "treatment_group"),
            comparisons = TRUE) +
  labs(x = "Aid should help the poor",
       y = "")

p3a +
  p3b +
  plot_annotation(title = "Telling Australians about Chinese aid might make them more focused on helping the poor",
                  subtitle = "Subjects were given a description of Chinese aid that was either forceful, measured or none (control)",
                  caption = the_caption)

More complex models

The code for Wood et al’s online appendix is available with the rest of the replication data. It involves some more complex models that control for variables other than the treatment, like gender, age and income. And they use logistic regression rather than ordinary least squares, with a possible gain in robustness of findings at a considerable cost in ease of interpretability.

Even with data from a randomised controlled trial (which this study can be considered an example of), it is good practice to fit a statistical model that adjusts for explanatory variables that aren’t of direct interest. This generally gives more precise estimates of the treatment effect. Maybe the main article should have done this rather than relegating the analysis to the online appendix, but I can see an argument for focusing on the simplest comparison in the main text.
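To make that concrete, here is a minimal simulated sketch (my own toy example, not the paper’s data or code) of how adjusting for a pre-treatment covariate leaves the randomised treatment effect unbiased while tightening its standard error:

# Minimal simulated sketch: covariate adjustment in a randomised experiment.
set.seed(123)
n <- 2000
treatment <- rbinom(n, 1, 0.5)        # randomised, so independent of covariates
age_group <- rbinom(n, 1, 0.4)        # a pre-treatment covariate
y <- 0.5 - 0.08 * treatment + 0.2 * age_group + rnorm(n, 0, 0.45)

unadjusted <- lm(y ~ treatment)
adjusted   <- lm(y ~ treatment + age_group)

# Same (unbiased) estimate in expectation, smaller standard error when adjusted:
summary(unadjusted)$coefficients["treatment", c("Estimate", "Std. Error")]
summary(adjusted)$coefficients["treatment", c("Estimate", "Std. Error")]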

Having fit the more complete model, an interpretive problem comes from the presentation of the results of that regression. Philosophically, there is a profound difference between the coefficients in front of the treatment variable (in this case, “did you get a vignette about China’s activities in the Pacific”) and those for the control variables. This difference is important enough to have its own name, the Table 2 Fallacy, so called because in some fields Table 1 of a journal article typically gives basic summary statistics about the sample and Table 2 presents the results of a regression with the coefficients for the treatment and confounder variables. The term Table 2 Fallacy was coined in 2013 by Westreich and Greenland.

The coefficient for the primary treatment can be legitimately interpreted as “the total impact of this variable on the outcome”. This is fine, and this inference is usually the whole justification of the experiment. However, the coefficients in front of the other variables can only be interpreted as “the impact of this variable, after controlling for the particular confounding variables in this model”. The critical differences come from the choice of variables to include in the model, some of which are on the causal path between other variables and the outcome of interest.

“In sum, presenting estimates of effect measures for secondary risk factors (confounders and modifiers of the exposure effect measure) obtained from the same model as that used to estimate the primary exposure effects can lead readers astray in a number of ways. Extra thought and description will be needed when interpreting such secondary estimates.”

Westreich and Greenland in “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients”

Basically, while we have the power of a controlled experiment for drawing inferences about the main treatment, we are back in the world of observational inference with the other variables.

In the case of this attitudes to aid study, here is my crude attempt at a causal diagram of the variables we have data on plus one latent unobserved variable (“political views”) that is of obvious importance:



Because the treatment (which_vignette) does not impact on any of the secondary variables (academic status, background political views, gender, etc) and is not impacted by them (because it was allocated at random), we can safely conclude that the estimated effect for the treatment is the total effect. This applies whether we control for the secondary variables (as we are about to) or we don’t (as was the case in the main paper, and the tables of results shown above).

In the case of the other variables, there are big complications. For example, all four of the variables gender, income, age and academic status are expected to impact on attitudes to aid, but probably mediated by the unobserved general “political views” variable. The variables all have complex relations with each other and with political views; for example, gender impacts on income (in complex ways that might relate to the individual and certainly to the environment they are in), and both gender and income impact on political views. If income is excluded from the model, some of the reported gender effect will in fact be a proxy for an income effect. In other words, if there are important variables we have missed from the model (and there might be some we haven’t thought of here), we get omitted-variable bias. We can never be sure that (for example) the gender effect in relation to attitudes to aid isn’t standing in as a proxy for something else related to that “political views” circle. A small simulation of that proxy effect is sketched below.
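Here is a minimal simulated sketch of the proxy effect (my own hypothetical numbers, nothing to do with the survey data): when a variable on the causal path is omitted, another coefficient soaks up its effect.

# Toy simulation of omitted-variable bias
set.seed(42)
n <- 5000
gender <- rbinom(n, 1, 0.5)                    # say 1 = male
income <- 50 + 10 * gender + rnorm(n, 0, 10)   # gender affects income
attitude <- 0.2 * income + rnorm(n, 0, 5)      # only income affects the attitude directly

# Full model: the gender coefficient is close to zero, as it should be
coef(lm(attitude ~ gender + income))["gender"]

# Income omitted: gender now proxies for income (coefficient near 0.2 * 10 = 2)
coef(lm(attitude ~ gender))["gender"]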

Here’s how I drew that causal graph with R. Everyone should draw more causal graphs, I think.

#===============extra analysis with more variables================
#--------------Direction of causality-------------
# Directed graph (not 'acyclic' because the connections go in circles!)
dagified <- dagify(aid_attitude ~ which_vignette + political_views,
                   political_views ~ income + gender + age + academic,
                   income ~ gender + age + academic + political_views,
                   academic ~ gender + age + political_views,
                   latent = "political_views",
                   outcome = "aid_attitude"
                   )

# Draw causal graph
set.seed(123)
tidy_dagitty(dagified) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges_arc(edge_colour = "grey50") +
  geom_dag_node(colour = "grey80") +
  # main_font: font family defined elsewhere in the author's usual blog setup
  geom_dag_text(colour = "steelblue", family = main_font) +
  theme_void(base_family = main_font) +
  labs(title = "Simplified causal diagram of factors in this experiment and views on aid",
       caption = the_caption)

So having fit the models, how do we present the results without leading readers into the Table 2 fallacy? The obvious thing is to just highlight that these are two different types of effect. Here’s how I did that with the three regressions in question, one for each of the outcome variables included in Table 1:





I’m frankly not sure of the best terminology here. Westreich and Greenland say “extra thought and description” is needed for secondary estimates, but what’s the one word to put in my legend of these charts? I’ve chosen “mediated” because it so happens they are all mediated by the “political views” latent variable, but in other situations this wouldn’t be quite right.

Anyway, I’m pretty happy with those charts. I like the way the confidence interval is shown directly (rather than in a mess of numbers as per the table of regression coefficients that is still, shockingly, standard in academic journals). I like the way the total effects are distinguished from the more complex, mediated ones. And I like the way we can see that:

  • people from an academic background are much less likely to say Australia gives too much aid or that aid should help Australia, and this seems to be a stronger effect than whether the respondent was given a China vignette or not;
  • older people and men are more likely to support giving more aid to the Pacific; and
  • there is no evidence that income impacts on these attitudes.

That last point is an interesting one in the light of the Table 2 fallacy. I expected income to relate to aid attitudes, but more because of how it stands as a proxy for industry, occupation, age, education and indeed general attitudes to life. So the more of these other variables we can include in the model, the less omitted-variable bias we get and the smaller the income effect.

Here’s the R code for fitting these logistic regression models with multiple explanatory variables. Note that we have a material amount of missing income information. To get around this I impute income, and include the imputation algorithm within a bootstrap, so the randomness of the imputation is reflected in the eventual estimates.

The imputation and regression for a single sample are done in the my_reg() function. I use the handy mice package to do the imputation, even though I’m not using all the multiple-imputation functionality in that package, just imputing a single set of individual values. The imputation and regression function is then called multiple times with different bootstrap resamples by boot() from the invaluable boot package.

This is a method I use all the time: a good, powerful, robust, general-purpose approach to many regression problems.

#-----------Bootstrap and imputation---------------------

#' Imputation and regression
#'
#' @param d data frame with the necessary columns in it
#' @param w 'weights' used to indicate which rows of d to use
my_reg <- function(d, w = 1:nrow(d),
                   resp_var = c("too_much_aid", "more_to_pac", "favour_aus")){

  resp_var <- match.arg(resp_var)

  d_select <- d %>%
    # the response variable chosen via resp_var becomes a column called y:
    rename(y = !!sym(resp_var)) %>%
    mutate(y = as.numeric(y), # eg 1 = 'Favour Aus', 0 = 'Help overseas'
           male = as.numeric(male),
           over_fifty = as.numeric(over_fifty),
           academic = as.numeric(academic),
           log_inc_pp = log(income_per_person)) %>%
    select(y,
           treatment_group,
           male,
           over_fifty,
           academic,
           log_inc_pp)

  d_imputed <- complete(
    mice(d_select[w, ],
         m = 1,
         printFlag = FALSE,
         method = c("cart", "cart", "cart", "cart", "cart", "norm"))
  )

  tmp <- dim(with(d_imputed, table(y, treatment_group)))
  if(length(tmp) == 2 && tmp[1] == 2 && tmp[2] == 3){

    full_mod <- glm(y ~ treatment_group + male + over_fifty + academic + log_inc_pp,
                    data = d_imputed, family = "quasibinomial")

    return(coef(full_mod))
  } else {
    return(NULL)
  }
}

# Demo use:
my_reg(aust, resp_var = "more_to_pac")

# Apply to all three different response variables, bootstrapped 999 times each
boot_reg <- lapply(all_response_vars, function(v){
  set.seed(123)
  boot(aust, my_reg, R = 999, resp_var = v)
})

boot_plots <- list()

for(j in 1:length(boot_reg)){
  x <- lapply(1:7, function(i){boot.ci(boot_reg[[j]], type = "perc", index = i)$percent[4:5]})

  set.seed(322)
  boot_plots[[j]] <- do.call(rbind, x) %>%
    as.data.frame() %>%
    mutate(variable = c("Intercept", "Measured vignette re China", "Forceful vignette re China",
                        "Male", "Over fifty", "Academic", "Log Income Per Person"),
           var_type = rep(c("doesn't matter", "Total", "Mediated"), c(1, 2, 4))) %>%
    cbind(point = my_reg(aust, resp_var = all_response_vars[[j]])) %>%
    filter(variable != "Intercept") %>%
    rename(lower = V1,
           upper = V2) %>%
    mutate(variable = fct_reorder(variable, point)) %>%
    ggplot(aes(x = point, y = variable)) +
    geom_vline(xintercept = 0, size = 2, colour = "brown", alpha = 1/4) +
    geom_segment(aes(xend = lower, x = upper, yend = variable, colour = var_type),
                 size = 4, alpha = 0.25) +
    geom_point() +
    guides(color = guide_legend(override.aes = list(alpha = 1))) +
    labs(caption = the_caption,
         x = glue("Impact (on log-odds scale) on probability of saying '{all_response_labs[j]}'"),
         y = "",
         colour = "How to interpret the effect of the variable:",
         title = "The effect of geostrategic competition on public attitudes to aid",
         subtitle = str_wrap(glue("Impact of a measured or forceful vignette about
         Chinese aid, and other secondary variables, on likelihood
         of supporting '{all_response_labs[j]}'"), 80))
}

boot_plots[[1]]
boot_plots[[2]]
boot_plots[[3]]

That’s all folks. Today’s thoughts:

  • Interesting research problem and data set.
  • Great to have the code and data.
  • It’s worth reflecting on how to interpret coefficients from regression models fit to experimental data; and be careful how you present secondary effects next to the primary variable of interest.
  • Put your imputation inside the bootstrap to properly account for the uncertainty it introduces.


Gold-Mining Week 10 (2020)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers.]

Week 10 Gold Mining and Fantasy Football Projection Roundup now available.


Submitting a PR to {shiny} LIVE


[This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers.]

Debugging with Dean series: instead of waiting for RStudio to fix a bug I found, I made a pull request to fix the issue, and recorded the entire process.

The “Debugging with Dean” educational video series is in full force, and in this video I do something a little risky: I submit a pull request to {shiny} in real time.

As always, I would warmly welcome any comments, including constructive criticism so I can improve. If you have anything to share, please do let me know below the video. If you like my videos, make sure to subscribe so that I’ll know to keep making more!


Fast Class-Agnostic Data Manipulation in R


[This article was first published on R, Econometrics, High Performance, and kindly contributed to R-bloggers.]

In previous posts I introduced collapse, a powerful (C/C++ based) new framework for data transformation and statistical computing in R, providing advanced grouped, weighted, time series, panel data and recursive computations at superior execution speeds, with greater flexibility and programmability.

collapse 1.4, released this week, additionally introduces an enhanced attribute handling system which enables non-destructive manipulation of vector, matrix or data frame based objects in R. With this post I aim to briefly introduce this attribute handling system and demonstrate that:

  1. collapse non-destructively handles all major matrix (time series) and data frame based classes in R.

  2. Using collapse functions on these objects yields uniform handling at higher computation speeds.

Data Frame Based Objects

The three major data frame based classes in R are the base R data.frame, the data.table and the tibble, for which there also exist grouped (dplyr) and time-based (tsibble, tibbletime) versions. Additional notable classes are the panel data frame (plm) and the spatial features data frame (sf).

For the first three, collapse offers extremely fast and versatile converters qDF, qDT and qTBL that can be used to turn many R objects into data.frame’s, data.table’s or tibble’s, respectively:

library(collapse); library(data.table); library(tibble)

options(datatable.print.nrows = 10,
        datatable.print.topn = 2)

identical(qDF(mtcars), mtcars)
## [1] TRUE

mtcarsDT <- qDT(mtcars, row.names.col = "car")
mtcarsDT
##               car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
##  1:     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
##  2: Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## ---
## 31: Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
## 32:    Volvo 142E 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2

mtcarsTBL <- qTBL(mtcars, row.names.col = "car")
print(mtcarsTBL, n = 3)
## # A tibble: 32 x 12
##   car             mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## 1 Mazda RX4      21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 Wag  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710     22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## # ... with 29 more rows

These objects can then be manipulated using an advanced and attribute preserving set of (S3 generic) statistical and data manipulation functions. The following infographic summarizes the core collapse namespace:

More details are provided in the freshly released cheat sheet, and further in the documentation and vignettes.

The statistical functions internally handle grouped and / or weighted computations on vectors, matrices and data frames, and seek to keep the attributes of the object.

# Simple data frame: Grouped mean by cyl -> groups = row.names
fmean(fselect(mtcars, mpg, disp, drat), g = mtcars$cyl)
##        mpg     disp     drat
## 4 26.66364 105.1364 4.070909
## 6 19.74286 183.3143 3.585714
## 8 15.10000 353.1000 3.229286

With fgroup_by, collapse also introduces a fast grouping mechanism that works together with grouped_df versions of all statistical and transformation functions:

# Using Pipe operators and grouped data frames
library(magrittr)

mtcars %>% fgroup_by(cyl) %>%
  fselect(mpg, disp, drat, wt) %>% fmean
##   cyl      mpg     disp     drat       wt
## 1   4 26.66364 105.1364 4.070909 2.285727
## 2   6 19.74286 183.3143 3.585714 3.117143
## 3   8 15.10000 353.1000 3.229286 3.999214

# This is still a data.table
mtcarsDT %>% fgroup_by(cyl) %>%
  fselect(mpg, disp, drat, wt) %>% fmean
##    cyl      mpg     disp     drat       wt
## 1:   4 26.66364 105.1364 4.070909 2.285727
## 2:   6 19.74286 183.3143 3.585714 3.117143
## 3:   8 15.10000 353.1000 3.229286 3.999214

# Same with tibble: here computing weighted group means -> also saves sum of weights in each group
mtcarsTBL %>% fgroup_by(cyl) %>%
  fselect(mpg, disp, drat, wt) %>% fmean(wt)
## # A tibble: 3 x 5
##     cyl sum.wt   mpg  disp  drat
## 1     4   25.1  25.9  110.  4.03
## 2     6   21.8  19.6  185.  3.57
## 3     8   56.0  14.8  362.  3.21

A specialty of the grouping mechanism is that it fully preserves the structure / attributes of the object, and thus permits the creation of a grouped version of any data frame like object.

# This created a grouped data.tablegmtcarsDT <- mtcarsDT %>% fgroup_by(cyl)gmtcarsDT##               car  mpg cyl disp  hp drat    wt  qsec vs am gear carb##  1:     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4##  2: Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4## ---                                                                 ## 31: Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8## 32:    Volvo 142E 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2## ## Grouped by:  cyl  [3 | 11 (3.5)]# The print shows: [N. groups | Avg. group size (SD around avg. group size)]# Subsetting drops groups gmtcarsDT[1:2]##              car mpg cyl disp  hp drat    wt  qsec vs am gear carb## 1:     Mazda RX4  21   6  160 110  3.9 2.620 16.46  0  1    4    4## 2: Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4# Any class-specific methods are independent of the attached groupsgmtcarsDT[, new := mean(mpg)]gmtcarsDT[, lapply(.SD, mean), by = vs, .SDcols = -1L] # Again groups are dropped##    vs      mpg      cyl     disp        hp     drat       wt     qsec        am     gear     carb## 1:  0 16.61667 7.444444 307.1500 189.72222 3.392222 3.688556 16.69389 0.3333333 3.555556 3.611111## 2:  1 24.55714 4.571429 132.4571  91.35714 3.859286 2.611286 19.33357 0.5000000 3.857143 1.785714##         new## 1: 20.09062## 2: 20.09062# Groups are always preserved in column-subsetting operationsgmtcarsDT[, 9:13] ##     vs am gear carb      new##  1:  0  1    4    4 20.09062##  2:  0  1    4    4 20.09062## ---                         ## 31:  0  1    5    8 20.09062## 32:  1  1    4    2 20.09062## ## Grouped by:  cyl  [3 | 11 (3.5)]

The grouping is also dropped in aggregations, but preserved in transformations keeping data dimensions:

# Grouped medians
fmedian(gmtcarsDT[, 9:13])
##    cyl vs am gear carb      new
## 1:   4  1  1    4  2.0 20.09062
## 2:   6  1  0    4  4.0 20.09062
## 3:   8  0  0    3  3.5 20.09062

# Note: unique grouping columns are stored in the attached grouping object
# and added if keep.group_vars = TRUE (the default)

# Replacing data by grouped median (grouping columns are not selected and thus not present)
fmedian(gmtcarsDT[, 4:5], TRA = "replace")
##      disp    hp
##  1: 167.6 110.0
##  2: 167.6 110.0
## ---
## 31: 350.5 192.5
## 32: 108.0  91.0
##
## Grouped by:  cyl  [3 | 11 (3.5)]

# Weighted scaling and centering data (here also selecting grouping column)
mtcarsDT %>% fgroup_by(cyl) %>%
  fselect(cyl, mpg, disp, drat, wt) %>% fscale(wt)
##     cyl    wt         mpg       disp      drat
##  1:   6 2.620  0.96916875 -0.6376553 0.7123846
##  2:   6 2.875  0.96916875 -0.6376553 0.7123846
## ---
## 31:   8 3.570  0.07335466 -0.8685527 0.9844833
## 32:   4 2.780 -1.06076989  0.3997723 0.2400387
##
## Grouped by:  cyl  [3 | 11 (3.5)]

As mentioned, this works for any data frame like object, even a suitable list:

# Here computing a weighted grouped standard deviation
as.list(mtcars) %>% fgroup_by(cyl, vs, am) %>%
  fsd(wt) %>% str
## List of 11
##  $ cyl   : num [1:7] 4 4 4 6 6 8 8
##  $ vs    : num [1:7] 0 1 1 0 1 0 0
##  $ am    : num [1:7] 1 0 1 1 0 0 1
##  $ sum.wt: num [1:7] 2.14 8.8 14.2 8.27 13.55 ...
##  $ mpg   : num [1:7] 0 1.236 4.833 0.655 1.448 ...
##  $ disp  : num [1:7] 0 11.6 19.25 7.55 39.93 ...
##  $ hp    : num [1:7] 0 17.3 22.7 32.7 8.3 ...
##  $ drat  : num [1:7] 0 0.115 0.33 0.141 0.535 ...
##  $ qsec  : num [1:7] 0 1.474 0.825 0.676 0.74 ...
##  $ gear  : num [1:7] 0 0.477 0.32 0.503 0.519 ...
##  $ carb  : num [1:7] 0 0.477 0.511 1.007 1.558 ...
##  - attr(*, "row.names")= int [1:7] 1 2 3 4 5 6 7

The function fungroup can be used to undo any grouping operation.

identical(mtcarsDT,
          mtcarsDT %>% fgroup_by(cyl, vs, am) %>% fungroup)
## [1] TRUE

Apart from the grouping mechanism with fgroup_by, which is very fast and versatile, collapse also supports regular grouped tibbles created with dplyr:

library(dplyr)

# Same as summarize_all(sum) and considerably faster
mtcars %>% group_by(cyl) %>% fsum
## # A tibble: 3 x 11
##     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
## 1     4  293. 1157.   909  44.8  25.1  211.    10     8    45    17
## 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 3     8  211. 4943.  2929  45.2  56.0  235.     0     2    46    49

# Same as mutate_all(sum)
mtcars %>% group_by(cyl) %>% fsum(TRA = "replace_fill") %>% head(3)
## # A tibble: 3 x 11
## # Groups:   cyl [2]
##     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
## 1     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 3     4  293. 1157.   909  44.8  25.1  211.    10     8    45    17

One major goal of the package is to make R suitable for (large) panel data; thus collapse also supports panel data frames (pdata.frame) created with the plm package:

library(plm)

pwlddev <- pdata.frame(wlddev, index = c("iso3c", "year"))

# Centering (within-transforming) columns 9-12 using the within operator W()
head(W(pwlddev, cols = 9:12), 3)
##          iso3c year W.PCGDP  W.LIFEEX W.GINI W.ODA
## ABW-1960   ABW 1960      NA -6.547351     NA    NA
## ABW-1961   ABW 1961      NA -6.135351     NA    NA
## ABW-1962   ABW 1962      NA -5.765351     NA    NA

# Computing growth rates of columns 9-12 using the growth operator G()
head(G(pwlddev, cols = 9:12), 3)
##          iso3c year G1.PCGDP G1.LIFEEX G1.GINI G1.ODA
## ABW-1960   ABW 1960       NA        NA      NA     NA
## ABW-1961   ABW 1961       NA 0.6274558      NA     NA
## ABW-1962   ABW 1962       NA 0.5599782      NA     NA

Perhaps a note about operators is necessary here before proceeding: collapse offers a set of transformation operators for its vector-valued fast functions:

# Operators
.OPERATOR_FUN
##  [1] "STD"  "B"    "W"    "HDB"  "HDW"  "L"    "F"    "D"    "Dlog" "G"

# Corresponding (programmers) functions
setdiff(.FAST_FUN, .FAST_STAT_FUN)
## [1] "fscale"     "fbetween"   "fwithin"    "fHDbetween" "fHDwithin"  "flag"       "fdiff"
## [8] "fgrowth"

These operators are principally just function shortcuts that exist for parsimony and in-formula use (e.g. to specify dynamic or fixed effects models using lm(), see the documentation). They however also have some useful extra features in the data.frame method, such as internal column-subsetting using the cols argument or stub-renaming transformed columns (adding a ‘W.’ or ‘Gn.’ prefix as shown above). They also permit grouping variables to be passed using formulas, including options to keep (default) or drop those variables in the output. We will see this feature when using time series below.
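As a quick illustration of that in-formula use, here is a minimal sketch of my own (not taken from the collapse documentation; it assumes the operators' default methods accept a plain grouping vector as the second argument):

# Hypothetical sketch: collapse operators called directly inside model formulas.
library(collapse)

# A 'within' (fixed effects style) term: hp demeaned within cyl groups
summary(lm(mpg ~ W(hp, cyl), data = mtcars))

# A simple dynamic model on a univariate series, using the lag operator L()
y <- as.numeric(EuStockMarkets[, "DAX"])
summary(lm(y ~ L(y)))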

To round things off for data frames, I demonstrate the use of collapse with classes it was not directly built to support but can also handle very well. Through its built-in capabilities for handling panel data, tsibble’s can seamlessly be utilized:

library(tsibble)

tsib <- as_tsibble(EuStockMarkets)

# Computing daily and annual growth rates on tsibble
head(G(tsib, c(1, 260), by = ~ key, t = ~ index), 3)
## # A tsibble: 3 x 4 [1s]
## # Key:       key [1]
##   key   index               G1.value L260G1.value
## 1 DAX   1991-07-01 02:18:33   NA               NA
## 2 DAX   1991-07-02 12:00:00   -0.928           NA
## 3 DAX   1991-07-03 21:41:27   -0.441           NA

# Computing a compounded annual growth rate
head(G(tsib, 260, by = ~ key, t = ~ index, power = 1/260), 3)
## # A tsibble: 3 x 3 [1s]
## # Key:       key [1]
##   key   index               L260G1.value
## 1 DAX   1991-07-01 02:18:33           NA
## 2 DAX   1991-07-02 12:00:00           NA
## 3 DAX   1991-07-03 21:41:27           NA

Similarly for tibbletime:

library(tibbletime); library(tsbox)

# Using the tsbox converter
tibtm <- ts_tibbletime(EuStockMarkets)

# Computing daily and annual growth rates on tibbletime
head(G(tibtm, c(1, 260), t = ~ time), 3)
## # A time tibble: 3 x 9
## # Index: time
##   time                G1.DAX L260G1.DAX G1.SMI L260G1.SMI G1.CAC L260G1.CAC G1.FTSE L260G1.FTSE
## 1 1991-07-01 02:18:27 NA             NA NA             NA  NA            NA  NA              NA
## 2 1991-07-02 12:01:32 -0.928         NA  0.620         NA  -1.26         NA   0.679          NA
## 3 1991-07-03 21:44:38 -0.441         NA -0.586         NA  -1.86         NA  -0.488          NA
# ...

Finally let’s consider the simple features data frame:

library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
## Reading layer `nc' from data source `C:\Users\Sebastian Krantz\Documents\R\win-library\4.0\sf\shape\nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27

# Fast selecting columns (need to add 'geometry' column to not break the class)
plot(fselect(nc, AREA, geometry))

# Subsetting fsubset(nc, AREA > 0.23, NAME, AREA, geometry)## Simple feature collection with 3 features and 2 fields## geometry type:  MULTIPOLYGON## dimension:      XY## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965## geographic CRS: NAD27##       NAME  AREA                       geometry## 1  Sampson 0.241 MULTIPOLYGON (((-78.11377 3...## 2  Robeson 0.240 MULTIPOLYGON (((-78.86451 3...## 3 Columbus 0.240 MULTIPOLYGON (((-78.65572 3...# Standardizing numeric columns (by reference)settransformv(nc, is.numeric, STD, apply = FALSE)# Note: Here using using operator STD() instead of fscale() to stub-rename standardized columns.# apply = FALSE uses STD.data.frame on all numeric columns instead of lapply(data, STD)head(nc, 2)## Simple feature collection with 2 features and 26 fields## geometry type:  MULTIPOLYGON## dimension:      XY## bbox:           xmin: -81.74107 ymin: 36.23436 xmax: -80.90344 ymax: 36.58965## geographic CRS: NAD27##    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79## 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0## 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3##   NWBIR79                       geometry  STD.AREA STD.PERIMETER STD.CNTY_ STD.CNTY_ID STD.FIPSNO## 1      19 MULTIPOLYGON (((-81.47276 3... -0.249186    -0.4788595 -1.511125   -1.511125  -1.568344## 2      12 MULTIPOLYGON (((-81.23989 3... -1.326418    -0.9163351 -1.492349   -1.492349  -1.637282##   STD.CRESS_ID  STD.BIR74  STD.SID74 STD.NWBIR74  STD.BIR79  STD.SID79 STD.NWBIR79## 1    -1.568344 -0.5739411 -0.7286824  -0.7263602 -0.5521659 -0.8863574  -0.6750055## 2    -1.637282 -0.7308990 -0.8571979  -0.7263602 -0.7108697 -0.5682866  -0.6785480

Matrix Based Objects

collapse also offers a converter qM to efficiently convert various objects to matrix:

m <- qM(mtcars)

Grouped and / or weighted computations and transformations work just as they do with data frames:

# Grouped means
fmean(m, g = mtcars$cyl)
##        mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
## 4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
## 6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
## 8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000

# Grouped and weighted standardizing
head(fscale(m, g = mtcars$cyl, w = mtcars$wt), 3)
##                      mpg cyl        disp         hp       drat         wt       qsec         vs
## Mazda RX4      0.9691687 NaN -0.63765527 -0.5263758  0.7123846 -1.6085211 -1.0438559 -1.2509539
## Mazda RX4 Wag  0.9691687 NaN -0.63765527 -0.5263758  0.7123846 -0.8376064 -0.6921302 -1.2509539
## Datsun 710    -0.7333024 NaN -0.08822497  0.4896429 -0.5526066 -0.1688057 -0.4488514  0.2988833
##                     am        gear      carb
## Mazda RX4     1.250954  0.27612029  0.386125
## Mazda RX4 Wag 1.250954  0.27612029  0.386125
## Datsun 710    0.719370 -0.09429567 -1.133397

Various matrix-based time series classes such as xts / zoo and timeSeries are also easily handled:

# ts / mts
# Note: G() by default renames the columns, fgrowth() does not
plot(G(EuStockMarkets))

# xts
library(xts)
ESM_xts <- ts_xts(EuStockMarkets) # using tsbox::ts_xts
head(G(ESM_xts), 3)
##                         G1.DAX     G1.SMI    G1.CAC    G1.FTSE
## 1991-07-01 02:18:27         NA         NA        NA         NA
## 1991-07-02 12:01:32 -0.9283193  0.6197485 -1.257897  0.6793256
## 1991-07-03 21:44:38 -0.4412412 -0.5863192 -1.856612 -0.4877652

plot(G(ESM_xts), legend.loc = "bottomleft")

# timeSeries
library(timeSeries) # using tsbox::ts_timeSeries
ESM_timeSeries <- ts_timeSeries(EuStockMarkets)
# Note: G() here also renames the columns but the names of the series are also stored in an attribute
head(G(ESM_timeSeries), 3)
## GMT
##                            DAX        SMI       CAC       FTSE
## 1991-06-30 23:18:27         NA         NA        NA         NA
## 1991-07-02 09:01:32 -0.9283193  0.6197485 -1.257897  0.6793256
## 1991-07-03 18:44:38 -0.4412412 -0.5863192 -1.856612 -0.4877652

plot(G(ESM_timeSeries), plot.type = "single", at = "pretty")
legend("bottomleft", colnames(G(qM(ESM_timeSeries))), lty = 1, col = 1:4)

Aggregating these objects yields a plain matrix with groups in the row-names:

# Aggregating by year: creates plain matrix with row-names (g is second argument)
EuStockMarkets %>% fmedian(round(time(.)))
##           DAX     SMI    CAC    FTSE
## 1991 1628.750 1678.10 1772.8 2443.60
## 1992 1649.550 1733.30 1863.5 2558.50
## 1993 1606.640 2061.70 1837.5 2773.40
## 1994 2089.770 2727.10 2148.0 3111.40
## 1995 2072.680 2591.60 1918.5 3091.70
## 1996 2291.820 3251.60 1946.2 3661.65
## 1997 2861.240 3906.55 2297.1 4075.35
## 1998 4278.725 6077.40 3002.7 5222.20
## 1999 5905.150 8102.70 4205.4 5884.50

# Same thing with the other objects
all_obj_equal(ESM_xts %>% fmedian(substr(time(.), 1L, 4L)),
              ESM_timeSeries %>% fmedian(substr(time(.), 1L, 4L)))
## [1] TRUE

Benchmarks

Extensive benchmarks and examples against native dplyr / tibble and plm are provided here and here, making it evident that collapse provides both greater versatility and massive performance improvements over the methods defined for these objects. Benchmarks against data.table were provided in a previous post, where collapse compared favorably on a 2-core machine (particularly for weighted and := type operations). In general collapse functions are extremely well optimized, with basic execution speeds below 30 microseconds, and they scale efficiently to larger operations. Most importantly, they preserve the data structure and attributes (including column attributes) of the objects passed to them. They also efficiently skip missing values and avoid some of the undesirable behavior endemic to base R (see footnote 1).

Here I will add to the above resources just a small benchmark to show that computations with collapse are also faster than the native methods and suggested programming idioms for the various time series classes:

library(dplyr) # needed for tibbletime / tsibble comparison    library(microbenchmark)# Computing the first differencemicrobenchmark(ts = diff(EuStockMarkets),               collapse_ts = fdiff(EuStockMarkets),               xts = diff(ESM_xts),               collapse_xts = fdiff(ESM_xts),               timeSeries = diff(ESM_timeSeries),               collapse_timeSeries = fdiff(ESM_timeSeries),               # taking difference function from tsibble               dplyr_tibbletime = mutate_at(tibtm, 2:5, difference, order_by = tibtm$time),               collapse_tibbletime_D = D(tibtm, t = ~ time),               # collapse equivalent to the dplyr method (tfmv() abbreviates ftransformv())               collapse_tibbletime_tfmv = tfmv(tibtm, 2:5, fdiff, t = time, apply = FALSE),               # dplyr helpers provided by tsibble package               dplyr_tsibble = mutate(group_by_key(tsib), value = difference(value, order_by = index)),               collapse_tsibble_D = D(tsib, 1, 1, ~ key, ~ index),               # Again we can do the same using collapse (tfm() abbreviates ftransform())               collapse_tsibble_tfm = tfm(tsib, value = fdiff(value, 1, 1, key, index)))## Unit: microseconds##                      expr      min        lq        mean    median         uq        max neval cld##                        ts 1344.993 1458.7855  1843.66603 1591.7675  1790.3480   9325.697   100 a  ##               collapse_ts   20.974   37.4850    50.27008   49.5340    58.4585    135.213   100 a  ##                       xts   84.788  131.4205   319.51851  147.7085   161.9885  15576.297   100 a  ##              collapse_xts   38.824   60.2440    77.00934   73.1845    85.9030    214.199   100 a  ##                timeSeries 1364.628 1630.3680  1907.73838 1775.3990  2051.8495   2887.227   100 a  ##       collapse_timeSeries   42.840   62.9220    86.59470   77.8705    91.0350    671.157   100 a  ##          dplyr_tibbletime 5835.143 6267.7805  7371.78980 6681.0065  7534.0105  37462.544   100  b ##     collapse_tibbletime_D  430.630  479.9400   565.78952  536.1675   601.0960    923.288   100 a  ##  collapse_tibbletime_tfmv  412.780  464.9910   557.34657  511.6240   612.4760   1460.570   100 a  ##             dplyr_tsibble 7539.811 8328.7780 11490.09014 8791.9835 10098.8220 223112.537   100   c##        collapse_tsibble_D  757.730  821.9900  1015.04996  909.2310   996.0265   6766.910   100 a  ##      collapse_tsibble_tfm  729.616  783.8350  1035.57745  862.5980   907.0000  13540.958   100 a# Sequence of lagged/leaded and iterated differences (not supported by either of these methods)head(fdiff(ESM_xts, -1:1, diff = 1:2)[, 1:6], 3) ##                     FD1.DAX FD2.DAX     DAX D1.DAX D2.DAX FD1.SMI## 1991-07-01 02:18:27   15.12    8.00 1628.75     NA     NA   -10.4## 1991-07-02 12:01:32    7.12   21.65 1613.63 -15.12     NA     9.9## 1991-07-03 21:44:38  -14.53  -17.41 1606.51  -7.12      8    -5.5head(D(tibtm, -1:1, diff = 1:2, t = ~ time), 3)## # A time tibble: 3 x 21## # Index: time##   time                FD1.DAX FD2.DAX   DAX D1.DAX D2.DAX FD1.SMI FD2.SMI   SMI D1.SMI D2.SMI FD1.CAC##                                         ## 1 1991-07-01 02:18:27   15.1     8.00 1629.  NA     NA      -10.4   -20.3 1678.   NA     NA      22.3## 2 1991-07-02 12:01:32    7.12   21.7  1614. -15.1   NA        9.9    15.4 1688.   10.4   NA      32.5## 3 1991-07-03 21:44:38  -14.5   -17.4  1607.  -7.12   8.00    -5.5    -3   1679.   -9.9  -20.3     9.9## # ... 
with 9 more variables: FD2.CAC , CAC , D1.CAC , D2.CAC , FD1.FTSE ,## #   FD2.FTSE , FTSE , D1.FTSE , D2.FTSE head(D(tsib, -1:1, diff = 1:2, ~ key, ~ index), 3)## # A tsibble: 3 x 7 [1s] ## # Key:       key [1]##   key   index               FD1.value FD2.value value D1.value D2.value##                                    ## 1 DAX   1991-07-01 02:18:33     15.1       8.00 1629.    NA       NA   ## 2 DAX   1991-07-02 12:00:00      7.12     21.7  1614.   -15.1     NA   ## 3 DAX   1991-07-03 21:41:27    -14.5     -17.4  1607.    -7.12     8.00microbenchmark(collapse_xts = fdiff(ESM_xts, -1:1, diff = 1:2),               collapse_tibbletime = D(tibtm, -1:1, diff = 1:2, t = ~ time),               collapse_tsibble = D(tsib, -1:1, diff = 1:2, ~ key, ~ index))## Unit: microseconds##                 expr     min      lq      mean    median        uq        max neval cld##         collapse_xts  99.067 127.404 4328.5683  146.8155  177.1610 222804.179   100   a##  collapse_tibbletime 504.707 561.827  613.9219  585.7020  614.7075   1100.003   100   a##     collapse_tsibble 849.657 945.600 1060.1478 1011.1990 1083.4915   1729.659   100   a

Conclusion

This concludes this short demonstration. collapse is an advanced, fast and versatile data manipulation package. If you have followed this far, I am convinced you will find it very useful, particularly if you are working in advanced statistics, econometrics, surveys, time series, panel data and the like, or if you care a lot about performance and non-destructive working in R. For more information about the package see the website, study the cheat sheet, or call help("collapse-documentation") after installing to bring up the built-in documentation.

Appendix: So how does this all actually work?

Statistical functions like fmean are S3 generic with user visible ‘default’, ‘matrix’ and ‘data.frame’ methods, and hidden ‘list’ and ‘grouped_df’ methods. Transformation functions like fwithin additionally have ‘pseries’ and ‘pdata.frame’ methods to support plm objects.

The ‘default’, ‘matrix’ and ‘data.frame’ methods handle object attributes intelligently. In the case of ‘data.frame’s’ only the ‘row.names’ attribute is modified accordingly; other attributes (including column attributes) are preserved. This also holds for data manipulation functions like fselect, fsubset, ftransform etc. ‘default’ and ‘matrix’ methods preserve attributes as long as the data dimensions are kept.

In addition, the ‘default’ method checks whether its argument is actually a matrix, and calls the matrix method if is.matrix(x) && !inherits(x, "matrix") is TRUE. This prevents classed matrix-based objects (such as xts time series) that do not inherit from ‘matrix’ from being handled by the default method.
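To make that dispatch pattern concrete, here is a stripped-down sketch of my own (an illustration of the idea only, not collapse's actual source code):

# Illustrative S3 dispatch sketch -- not the real collapse implementation.
my_stat <- function(x, ...) UseMethod("my_stat")

my_stat.matrix <- function(x, ...) {
  # column-wise computation, keeping dimnames
  colMeans(x, ...)
}

my_stat.default <- function(x, ...) {
  # Redirect objects that are matrices but whose class vector does not
  # include "matrix" (e.g. some classed time series) to the matrix method:
  if (is.matrix(x) && !inherits(x, "matrix")) return(my_stat.matrix(x, ...))
  mean(x, ...)
}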


  1. For example, mean(NA, na.rm = TRUE) gives NaN, sum(NA, na.rm = TRUE) gives 0 and max(NA, na.rm = TRUE) gives -Inf, whereas all_identical(NA_real_, fmean(NA), fsum(NA), fmax(NA)) is TRUE. na.rm = TRUE is the default setting for all collapse functions. Setting na.rm = FALSE also checks for missing values and returns NA if any are found, instead of just running through the entire computation and then returning an NA or NaN value, which is unreliable and inefficient.
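A quick sketch of the behaviour described in that footnote (the base R results are standard; the collapse results are as the footnote describes):

library(collapse)

# Base R with na.rm = TRUE on an all-NA input:
mean(NA, na.rm = TRUE)   # NaN
sum(NA, na.rm = TRUE)    # 0
max(NA, na.rm = TRUE)    # -Inf (with a warning)

# collapse equivalents (na.rm = TRUE is the default) all return NA:
fmean(NA)
fsum(NA)
fmax(NA)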


Colorize black-and-white photos


[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers.]

DeepAI is a research company that develops a wide variety of deep neural network (DNN) models using the bleeding edge of AI research. For example, they have built models for sentiment analysis of text, nudity detection, artistic style transfer, text summarization, etc. One model that I was particularly interested in using was the Image Colorization model, which adds realistic coloration to old black-and-white photos. In this post, I show how easy it is to use DeepAI’s API for this model to color your own images automatically using Python.

Overview

Using the API is very simple. The first step is to get an API key from DeepAI. Then, we just need to use the ‘requests’ package to send black-and-white photos and download the colorized results. Each step is explained in more detail below and a full working Python script is available at the bottom.

Examples

For inspiration, here are some examples of the Image Colorization DNN at work.

[Example image pairs from the original post: original (left) and colorized (right)]

Using the Image Colorization API with Python

1. Obtain an API key

The first step to using the API is to get a free API key from DeepAI. All you need to do is create an account (there is a “login with GitHub” option that I often like to use for these sorts of applications) and you’ll find your API key on your profile page. If you want to experiment with the API first, you can use the demo API key 100e7990-a2b3-4da0-a2ab-281ffd41395c for a few images.

[Screenshot from the original post: DeepAI profile page showing the API key]

I put this key into a file called secrets.py and immediately added it to the .gitignore file. My secrets.py file looks like the following:

DEEPAI_API_KEY="put your key here"

This file will be imported into Python as a module, making the DEEPAI_API_KEY available as a variable.

2. Prepare Python

The only third-party (i.e. not built-in) package required for this is ‘requests’, so create a virtual environment (see Python Virtual Environments: A Primer) and install it before continuing.

python3 -m venv colorizer-env
pip install requests

or

conda create -n colorizer-env python=3.8
conda install requests

3. Post an image to be colorized

We are finally ready to send an image to the colorizer API. All of the code below will post an image located at “path/to/some/image.jpeg” to the API.

import requests
from secrets import DEEPAI_API_KEY
from pathlib import Path

# Path to the local black-and-white image to be colorized.
image_path = Path("path/to/some/image.jpeg")

# Post the image to the DeepAI colorizer endpoint.
deepai_res = requests.post(
    "https://api.deepai.org/api/colorizer",
    files={"image": open(image_path, "rb")},
    headers={"api-key": DEEPAI_API_KEY},
)

The response of the request is contained in deepai_res. It should look something like the following.

>>> deepai_res
>>> deepai_res.json()
{'id': '7b37e471-2f58-4a14-88f7-855bd5cfb6e5', 'output_url': 'https://api.deepai.org/job-view-file/7b37e471-2f58-4a14-88f7-855bd5cfb6e5/outputs/output.jpg'}

The colorized image should be visible if you follow the 'output_url' link.

4. Download the colorized image

There are probably plenty of ways to download the JPEG image at the URL in the response, but I used the following method.

First, the requests.get() function is used to stream the object. If the status code is 200, then the request was successful and the image can be downloaded and saved to disk. If the status code is not 200, then something went wrong and the code is printed to standard out.

import shutil

# Where to save the image.
save_path = Path("path/to/output_image.jpeg")

# Use requests to get the image.
colorized_image_url = requests.get(deepai_res.json()["output_url"], stream=True)

# Check the status code of the request and save the image to disk.
if colorized_image_url.status_code == 200:
    with open(save_path, "wb") as save_file:
        colorized_image_url.raw.decode_content = True
        shutil.copyfileobj(colorized_image_url.raw, save_file)
else:
    # Print the status code if it is not 200 (something didn't work).
    print(f"image result status code: {colorized_image_url.status_code}")

Wrap up

That’s pretty much it – it is incredible how easy it is to use this complex DNN! These few lines of code can be wrapped into a function to streamline running multiple files through the API. My implementation of that is below; it runs all of the images in a specific directory. I doubt it would be much harder to add a simple GUI to make it even simpler to use, and I think I may try my hand at making a simple macOS app for this in the future.


Full script

Here is my full script for running all of the images in “images/original-images/” through the colorizer API. Please feel free to take the code as a whole or just the specific bits that you need.

import requests
import shutil
from pathlib import Path
from secrets import DEEPAI_API_KEY
from os.path import basename, splitext


def colorize(image_path, save_path, API_KEY):
    deepai_res = requests.post(
        "https://api.deepai.org/api/colorizer",
        files={"image": open(image_path, "rb")},
        headers={"api-key": API_KEY},
    )
    colorized_image_url = requests.get(deepai_res.json()["output_url"], stream=True)
    if colorized_image_url.status_code == 200:
        with open(save_path, "wb") as save_file:
            colorized_image_url.raw.decode_content = True
            shutil.copyfileobj(colorized_image_url.raw, save_file)
    else:
        print(f"image result status code: {colorized_image_url.status_code}")


images_dir = Path("images")
original_dir = images_dir / "original-images"
output_dir = images_dir / "colorized-images"

all_input_images = original_dir.glob("*jpg")

for input_image in all_input_images:
    output_name = splitext(basename(input_image))[0] + "_color.jpg"
    output_path = output_dir / output_name
    if not output_path.exists():
        print(f"Colorizing '{basename(input_image)}'...")
        colorize(input_image, output_path, DEEPAI_API_KEY)

To leave a comment for the author, please follow the link and comment on their blog: Posts | Joshua Cook.


The post Colorize black-and-white photos first appeared on R-bloggers.

Fitting a spline with PyMC3


[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Often, the model we want to fit is not a perfect line between some $x$ and $y$. Instead, the parameters of the model are expected to vary over $x$. There are multiple ways to handle this situation, one of which is to fit a spline. The spline is effectively multiple individual lines, each fit to a different section of $x$, that are tied together at their boundaries, often called knots. Below is an example of how to fit a spline using the Bayesian framework PyMC3.

Fitting a spline with PyMC3

Below is a full working example of how to fit a spline using the probabilistic programming language PyMC3. The data and model are taken from Statistical Rethinking 2e by Richard McElreath. As the book uses Stan (another advanced probabilistic programming language), the modeling code is primarily taken from the GitHub repository of the PyMC3 implementation of Statistical Rethinking. My contributions are primarily the explanations and additional analyses of the data and results.

Set-up

Below is the code to import packages and set some variables used in the analysis. Most of the libraries and modules are likely familiar; those that may be less well known are ‘ArviZ’, ‘patsy’, and ‘plotnine’. ‘ArviZ’ is a library for managing the components of a Bayesian model; I will use it to manage the results of fitting the model and for some standard data visualizations. The ‘patsy’ library is an interface to statistical modeling using a formula language similar to that used in the R language. Finally, ‘plotnine’ is a plotting library that implements the “Grammar of Graphics” system based on the ‘ggplot2’ R package. As I have a lot of experience with R, I found ‘plotnine’ far more natural than the “standard” in Python data science, ‘matplotlib’.

from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as gg
import pymc3 as pm
import seaborn as sns
from patsy import dmatrix

# Set default theme for 'plotnine'.
gg.theme_set(gg.theme_minimal())

# For reproducibility.
RANDOM_SEED = 847
np.random.seed(RANDOM_SEED)

# Path to the data used in Statistical Rethinking.
rethinking_data_path = Path("../data/rethinking_data")

Data

The data for this example is the number of days of the year (doy) that some cherry trees were in bloom in each year (year). We will ignore the other columns for now.

d = pd.read_csv(rethinking_data_path / "cherry_blossoms.csv")
d2 = d.dropna(subset=["doy"]).reset_index(drop=True)
d2.head(n=10)
   year  doy  temp  temp_upper  temp_lower
0   812   92   nan         nan         nan
1   815  105   nan         nan         nan
2   831   96   nan         nan         nan
3   851  108  7.38        12.1        2.66
4   853  104   nan         nan         nan
5   864  100  6.42        8.69        4.14
6   866  106  6.44        8.11        4.77
7   869   95   nan         nan         nan
8   889  104  6.83        8.48        5.19
9   891  109  6.98        8.96        5

There are 827 years with doy data.

>>> d2.shape
(827, 5)

Below is the doy values plotted over year.

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data")
)

blossom-data

Model

We will fit the following model.

$D \sim \mathcal{N}(\mu, \sigma)$
$\quad \mu = a + Bw$
$\qquad a \sim \mathcal{N}(100, 10)$
$\qquad w \sim \mathcal{N}(0, 10)$
$\quad \sigma \sim \text{Exp}(1)$

The number of days of bloom will be modeled as a normal distribution with mean $\mu$ and standard deviation $\sigma$. The mean will be a linear model composed of a y-intercept $a$ and spline with basis $w$. Both have relatively weak normal priors.

Prepare the spline

We can now prepare the spline matrix. First, we must determine the boundaries of the spline, often referred to as “knots” because the different lines will be tied together at these boundaries to make a continuous and smooth curve. For this example, we will create 15 knots at evenly spaced quantiles of the year data (the x-axis).

num_knots = 15
knot_list = np.quantile(d2.year, np.linspace(0, 1, num_knots))

>>> knot_list
array([ 812., 1036., 1174., 1269., 1377., 1454., 1518., 1583., 1650.,
       1714., 1774., 1833., 1893., 1956., 2015.])

Below is a plot of the data we are modeling, with the knot locations indicated by the vertical gray lines.

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots")
)

blossom-knots

We can get an idea of what the spline will look like by fitting a LOESS curve (a local polynomial regression).

(
    gg.ggplot(d2, gg.aes(x="year", y="doy"))
    + gg.geom_point(color="black", alpha=0.4, size=1.3)
    + gg.geom_smooth(method="loess", span=0.3, size=1.5, color="blue", linetype="-")
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots")
)

blossoms-data

Another way of visualizing what the spline should look like is to plot individual linear models over the data between each knot. The spline will effectively be a compromise between these individual models and a continuous curve.

d2["knot_group"] = [np.where(a <= knot_list)[0][0] for a in d2.year]d2["knot_group"] = pd.Categorical(d2["knot_group"], ordered=True)(gg.ggplot(d2, gg.aes(x="year", y="doy"))+ gg.geom_point(color="black", alpha=0.4, size=1.3)+ gg.geom_smooth(gg.aes(group = "knot_group"), method="lm", size=1.5, color="red", linetype="-")+ gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)+ gg.theme(figure_size=(10, 5))+ gg.labs(x="year", y="days of year", title="Cherry blossom data with spline knots"))

blossoms-data

Finally we can use ‘patsy’ to create the matrix $B$ that will be the b-spline basis for the regression. The degree is set to 3 to create a cubic b-spline.

B = dmatrix("bs(year, knots=knots, degree=3, include_intercept=True) - 1",{"year": d2.year.values, "knots": knot_list[1:-1]},)>>> BDesignMatrix with shape (827, 17)Columns:['bs(year, knots=knots, degree=3, include_intercept=True)[0]','bs(year, knots=knots, degree=3, include_intercept=True)[1]','bs(year, knots=knots, degree=3, include_intercept=True)[2]','bs(year, knots=knots, degree=3, include_intercept=True)[3]','bs(year, knots=knots, degree=3, include_intercept=True)[4]','bs(year, knots=knots, degree=3, include_intercept=True)[5]','bs(year, knots=knots, degree=3, include_intercept=True)[6]','bs(year, knots=knots, degree=3, include_intercept=True)[7]','bs(year, knots=knots, degree=3, include_intercept=True)[8]','bs(year, knots=knots, degree=3, include_intercept=True)[9]','bs(year, knots=knots, degree=3, include_intercept=True)[10]','bs(year, knots=knots, degree=3, include_intercept=True)[11]','bs(year, knots=knots, degree=3, include_intercept=True)[12]','bs(year, knots=knots, degree=3, include_intercept=True)[13]','bs(year, knots=knots, degree=3, include_intercept=True)[14]','bs(year, knots=knots, degree=3, include_intercept=True)[15]','bs(year, knots=knots, degree=3, include_intercept=True)[16]']Terms:'bs(year, knots=knots, degree=3, include_intercept=True)' (columns 0:17)(to view full data, use np.asarray(this_obj))

The b-spline basis is plotted below.

spline_df = (
    pd.DataFrame(B)
    .assign(year=d2.year.values)
    .melt("year", var_name="spline_i", value_name="value")
)

(
    gg.ggplot(spline_df, gg.aes(x="year", y="value"))
    + gg.geom_line(gg.aes(group="spline_i", color="spline_i"))
    + gg.scale_color_discrete(guide=gg.guide_legend(ncol=2))
    + gg.labs(x="year", y="basis", color="spline idx")
)

spline-basis

Fitting

Finally, the model can be built using PyMC3. A graphical diagram shows the organization of the model parameters.

with pm.Model() as m4_7:
    a = pm.Normal("a", 100, 10)
    w = pm.Normal("w", mu=0, sd=10, shape=B.shape[1])
    mu = pm.Deterministic("mu", a + pm.math.dot(np.asarray(B, order="F"), w.T))
    sigma = pm.Exponential("sigma", 1)
    D = pm.Normal("D", mu, sigma, observed=d2.doy)

pm.model_to_graphviz(m4_7)

model-graphviz

2000 samples of the posterior distribution are taken along with samples for prior and posterior predictive checks.

with m4_7:
    prior_pc = pm.sample_prior_predictive(random_seed=RANDOM_SEED)
    trace_m4_7 = pm.sample(2000, tune=2000, random_seed=RANDOM_SEED)
    post_pc = pm.sample_posterior_predictive(trace_m4_7, random_seed=RANDOM_SEED)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, w, a]
Sampling 2 chains, 0 divergences: 100%|██████████| 8000/8000 [00:30<00:00, 259.07draws/s]
The number of effective samples is smaller than 25% for some parameters.
100%|██████████| 4000/4000 [00:06<00:00, 591.29it/s]

As mentioned above, the model and sampling results are collated into an ArviZ object for ease of use.

az_m4_7 = az.from_pymc3(model=m4_7, trace=trace_m4_7, posterior_predictive=post_pc, prior=prior_pc)

Fit parameters

Below is a table summarizing the posterior distributions of the model parameters. The posteriors of $a$ and $\sigma$ are quite narrow while those for $w$ are wider. This is likely because all of the data points are used to estimate $a$ and $\sigma$ whereas only a subset are used for each value of $w$. The number of effective samples for $a$ is quite low, though, likely due to autocorrelation of the MCMC chains (this is visible in the following plots of the trace).

az.summary(az_m4_7, var_names=["a", "w", "sigma"])
         mean     sd   hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_mean  ess_sd  ess_bulk  ess_tail  r_hat
a     103.303  2.424   98.879  107.724      0.098    0.069       615     614       618      1027      1
w[0]   -2.876  3.862  -10.391    4.208      0.11     0.078      1225    1225      1223      2105      1
w[1]   -0.92   3.944   -7.944    6.87       0.109    0.077      1303    1303      1306      1794      1
w[2]   -0.95   3.64    -7.69     5.972      0.115    0.082       994     994       995      1799      1
w[3]    4.896  2.917   -1.005   10.029      0.099    0.07        871     871       872      1236      1
w[4]   -0.827  2.937   -6.642    4.437      0.105    0.075       776     776       781      1199      1
w[5]    4.384  2.969   -0.994   10.089      0.098    0.069       921     921       922      1600      1
w[6]   -5.305  2.848  -10.728   -0.249      0.103    0.073       771     771       774      1226      1
w[7]    7.899  2.845    2.319   12.968      0.098    0.07        848     818       849      1546      1
w[8]   -0.974  2.921   -6.47     4.431      0.1      0.07        861     861       863      1402      1
w[9]    3.132  3.007   -2.191    9.091      0.1      0.071       910     906       913      1399      1
w[10]   4.676  2.909   -0.455   10.563      0.104    0.074       780     780       781      1377      1
w[11]  -0.085  2.952   -5.434    5.604      0.098    0.069       909     909       911      1468      1
w[12]   5.6    2.947    0.167   11.29       0.104    0.073       809     809       813      1279      1
w[13]   0.784  3.116   -5.015    6.579      0.103    0.073       924     924       927      1382      1
w[14]  -0.782  3.333   -7.152    5.164      0.104    0.073      1030    1030      1030      1404      1
w[15]  -6.933  3.501  -13.454   -0.133      0.106    0.075      1091    1091      1084      1684      1
w[16]  -7.61   3.292  -14.056   -1.642      0.104    0.075      1003     965      1005      1368      1
sigma   5.946  0.147    5.66     6.199      0.002    0.002      3684    3671      3709      2558      1

We can visualize the trace (MCMC samples) of $a$ and $\sigma$, again showing they were confidently estimated.

az.plot_trace(az_m4_7, var_names=["a", "sigma"])
plt.show()

a-and-sigma_trace

A forest plot shows that the distributions of the values for $w$ are wider, though some fall primarily away from 0, indicating a non-null effect/association.

az.plot_forest(az_m4_7, var_names=["w"], combined=True)
plt.show()

w-forest

Another visualization of the fit spline values is to plot them multiplied against the basis matrix. The knot boundaries are shown in gray again, but now the spline bases are multiplied against the values of $w$ (represented as the rainbow-colored curves). The dot product of $B$ and $w$ - the actual computation in the linear model - is shown in blue.

wp = trace_m4_7["w"].mean(0)spline_df = (pd.DataFrame(B * wp.T).assign(year=d2.year.values).melt("year", var_name="spline_i", value_name="value"))spline_df_merged = (pd.DataFrame(np.dot(B, wp.T)).assign(year=d2.year.values).melt("year", var_name="spline_i", value_name="value"))(gg.ggplot(spline_df, gg.aes(x="year", y="value"))+ gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.5)+ gg.geom_line(data=spline_df_merged, linetype="-", color="blue", size=2, alpha=0.7)+ gg.geom_line(gg.aes(group="spline_i", color="spline_i"), alpha=0.7, size=1)+ gg.scale_color_discrete(guide=gg.guide_legend(ncol=2), color_space="husl")+ gg.theme(figure_size=(10, 5))+ gg.labs(x="year", y="basis", title="Fit spline", color="spline idx"))

fit-spline-basis

Model predictions

Lastly, we can visualize the predictions of the model using the posterior predictive check.

post_pred = az.summary(az_m4_7, var_names=["mu"]).reset_index(drop=True)

d2_post = d2.copy().reset_index(drop=True)
d2_post["pred_mean"] = post_pred["mean"]
d2_post["pred_hdi_lower"] = post_pred["hdi_3%"]
d2_post["pred_hdi_upper"] = post_pred["hdi_97%"]

(
    gg.ggplot(d2_post, gg.aes(x="year"))
    + gg.geom_ribbon(gg.aes(ymin="pred_hdi_lower", ymax="pred_hdi_upper"), alpha=0.3, fill="tomato")
    + gg.geom_line(gg.aes(y="pred_mean"), color="firebrick", alpha=1, size=2)
    + gg.geom_point(gg.aes(y="doy"), color="black", alpha=0.4, size=1.3)
    + gg.geom_vline(xintercept=knot_list, color="gray", alpha=0.8)
    + gg.theme(figure_size=(10, 5))
    + gg.labs(
        x="year",
        y="days of year",
        title="Cherry blossom data with posterior predictions",
    )
)

posterior-predictions


To leave a comment for the author, please follow the link and comment on their blog: Posts | Joshua Cook.


The post Fitting a spline with PyMC3 first appeared on R-bloggers.

UMAP clustering in Python


[This article was first published on poissonisfish, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Cell by cell, we unravel the mysteries of the brain. Drawing by Pracha Jaruprateepkul [source]

Embracing Python in this tutorial series has long been only a matter of time. For the last five years I have been championing R, mostly because of its wide applicability and, quite frankly, my own convenience. However, there is little any programming language can do to singlehandedly solve every statistical and computational challenge, and R too – take my word – is not without handicaps. Meanwhile, scientific computing in programming powerhouses like Python and Julia has quickly caught up, and there is now great potential in bringing them all together. Therefore, as much as I hold R dear, I decided to embrace different programming languages in this and future work. I hope this change will help cover more ground and appeal to a wider readership.

The aim of this short Python tutorial is to introduce the uniform manifold approximation and projection (UMAP) algorithm, using 76,533 single-cell expression profiles from the human primary motor cortex. The data are available from the Cell Types database, which is part of the Allen Brain Map platform.

UMAP has quickly established itself as a go-to clustering tool, well poised to expand our knowledge of many things, including the human brain. I hope that by the end of this tutorial you will have a broad understanding of the UMAP algorithm and how to implement it.

Introduction

The UMAP algorithm

Uniform manifold approximation and projection (UMAP)1 is a scalable and efficient dimension reduction algorithm that performs competitively among state-of-the-art methods such as t-SNE2, and is widely applied for unsupervised clustering.

To effectively approximate a uniformly distributed manifold in the data, this neighbour-graph technique first defines fuzzy simplicial sets using local metric spaces and then patches them together into a single topological structure. Of note, these local metric spaces are determined from the distance of each individual example to its k-th nearest neighbour, where the user-provided k balances the local-global structure representation. Next, a low-dimensional layout – the so-called embedding – is constructed from the fuzzy set cross-entropy that matches the largest edge weights from the topological structure with the shortest pairwise distances in the layout itself, and vice-versa.

In practice, to implement the UMAP users must provide the minimum distance to separate close observations in the embedding, as well as the numbers of training epochs, embedding dimensions and neighbours. Tweaking these hyperparameters may drastically change the resulting embedding, so these should be carefully tuned.
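As a concrete illustration, here is a minimal sketch of how those hyperparameters map onto the umap-learn constructor (the values shown are placeholders, not recommendations):

import umap

# n_neighbors balances local vs. global structure, min_dist controls how tightly
# points may be packed in the embedding, n_components sets the embedding dimension
# and n_epochs the number of training epochs (None lets the library decide).
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, n_epochs=None)
# embedding = reducer.fit_transform(X)  # X is an (n_samples, n_features) array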

In addition to the main manuscript1 and this excellent walkthrough, you may find the presentation below from the first author himself just as enlightening.

Since its release, the UMAP has been widely applied to large, high-dimensional molecular data and particularly single-cell expression profiling345, a technology that will be introduced next.

Single-cell expression profiling

In the course of the last two decades, developments in molecular and cellular biology unlocked ever-increasing detail over developmental and metabolic processes in living organisms. Unlike DNA sequencing, RNA sequencing and quantification (RNA-Seq) captures the activity of tens of thousands of genes simultaneously and therefore more closely reflects the interplay among the various different cellular processes.

However, the earliest RNA-Seq workflows presented technical challenges that prevented their application to single cells and consequently access to cell-to-cell variability. Perhaps most critically, these workflows required large amounts of total RNA and lacked the ability to isolate and label single cells. Only recently, the advent of single-cell RNA-Seq (scRNA-Seq) solved these two fundamental problems and helped uncovering cell identity67. Among other things, we now have access to a much richer understanding of health and disease, which holds the key to discovering therapeutic targets for a range of diseases.

This widespread technology is largely based on microfluidic chips where RNA molecules of individual cells are tagged with unique barcode sequences. For an end-to-end overview of scRNA-Seq, below is a great explanation from an instrument and chip manufacturer.

For the present UMAP tutorial we will use scRNA-Seq expression profiles from the human primary motor cortex (M1). The dataset and some additional experimental information are available here– no need to download it manually! 😎

Let’s get started with Python

From an IPython console such as that in Spyder, we start off by importing a handful of modules. Most can be installed from the conda-forge channel, while umap and datatable can be installed with the commands !pip install umap-learn and !pip install datatable, respectively. Besides the usual numpy and pandas dependencies for handling structured data, datatable and seaborn are imported to load the scRNA-Seq data more quickly and to easily produce a two-dimensional embedding scatterplot, respectively.

Next, we set up a working directory of our choice and download both the large scRNA-Seq dataset and the associated metadata, the latter to derive the cell subgroup labels and distinguish the embedded expression profiles. To this end I opted for a couple of wget Bash commands, but any equivalent method will also work.

# Imports
#!pip install datatable
import os, umap
import numpy as np
import pandas as pd
import datatable as dt
import seaborn as sns

os.chdir('PATH/TO/WDIR')

# Download expression matrix (matrix.csv) and metadata (metadata.csv)
!wget https://idketlproddownloadbucket.s3.amazonaws.com/aibs_human_m1_10x/matrix.csv
!wget https://idketlproddownloadbucket.s3.amazonaws.com/aibs_human_m1_10x/metadata.csv


Once the downloads are complete, you will certainly notice matrix.csv is approximately 7GB in size. We could try lifting this behemoth using pd.read_csv() with an appropriate choice of chunksize, but we will instead use the nimble dt.fread(). A file preview using the Bash command cut -d, -f-5 matrix.csv | head suggests that, sample_name aside, all columns are encoded as integer type. Therefore, to facilitate the data ingestion I passed a Bash command to exclude sample_name prior to importing and applied a dt.int32 encoding to the filtered data. Please make sure to leverage parallelisation if possible, by setting an appropriate value of nthreads when executing dt.fread(). ⏰

# Check first five columns in matrix.csv
#!cut -d, -f-5 matrix.csv | head

# Import data with Bash command discarding first column
matrix = dt.fread(cmd='cut -d, -f2- matrix.csv',
                  header=True, sep=',', columns=dt.int32)  # ~7 GB (76533, 50281)

# Import metadata
metadata = pd.read_csv('metadata.csv')


The resulting object matrix comprises a total of 76,533 expression profiles across 50,281 genes or expression features. If your RAM allows, the to_numpy() and to_pandas() methods will directly convert the datatable to the familiar NumPy or Pandas formats, respectively. To learn more about how to manipulate datatable objects check out the official documentation.

Next, for the sake of the quality and runtime of the UMAP training, we use np.apply_along_axis() to identify and discard expression features that equal or exceed 50% of null expression levels. I leave any further data cleansing up to you.

# Remove expression features with > 50% zero-valued expression levels
is_expressed = np.apply_along_axis(lambda x: np.mean(x == 0) < .5, arr=matrix, axis=0)
matrix = matrix[:, is_expressed.tolist()]

# Log2-transform
matrix = np.log2(matrix.to_numpy() + 1)


Now that 4,078 expression features were selected and log-transformed, we can proceed with fitting the UMAP and examining the resulting two-dimensional embedding. For this purpose I employed a min_dist of 0.25, n_neighbors of 30 and the default Euclidean metric. One advantage of the Euclidean metric is that it implicitly factors in the absolute differences between every pair of expression profiles, under the expectation that similar cells match their profiles to the magnitude of expression. Finally, a fixed random_state will ensure we arrive at the same embedding.

Before moving on with the UMAP embedding visualisation, it would help to extract some experimental information from the metadata table in order to make sense of the embedding. One of the many interesting labelling options is cell subclass, which will be used below.

# Define UMAP
brain_umap = umap.UMAP(random_state=999, n_neighbors=30, min_dist=.25)

# Fit UMAP and extract latent vars 1-2
embedding = pd.DataFrame(brain_umap.fit_transform(matrix), columns=['UMAP1', 'UMAP2'])

# Produce sns.scatterplot and pass metadata.subclasses as color
sns_plot = sns.scatterplot(x='UMAP1', y='UMAP2', data=embedding,
                           hue=metadata.subclass_label.to_list(),
                           alpha=.1, linewidth=0, s=1)

# Adjust legend
sns_plot.legend(loc='center left', bbox_to_anchor=(1, .5))

# Save PNG
sns_plot.figure.savefig('umap_scatter.png', bbox_inches='tight', dpi=500)


umap_scatter

A grand total of 76,533 human brain cells are represented in the figure above, how cool is that? Under just a few minutes, the UMAP captured a topological representation of the single-cell expression dataset that – given no label information – neatly sorted the different cell subtypes from the human primary motor cortex. Any expert out there daring to comment whether the proximity of the clusters makes anatomical sense?

With that I hope you have a better understanding of the UMAP and how to apply it using Python. See you next time! 🧠

References

1. McInnes, L., Healy, J. & Melville, J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv stat.ML 1802.0342

2. van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605

3. Becht, E., McInnes, L., Healy, J., Dutertre, C. A., Kwok, I., Ng, L. G., Ginhoux, F., & Newell, E. W. (2018). Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology, 10.1038/nbt.4314. Advance online publication. https://doi.org/10.1038/nbt.4314

4. Cao, J., Spielmann, M., Qiu, X., Huang, X., Ibrahim, D. M., Hill, A. J., Zhang, F., Mundlos, S., Christiansen, L., Steemers, F. J., Trapnell, C., & Shendure, J. (2019). The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745), 496–502. https://doi.org/10.1038/s41586-019-0969-x

5. Habermann, A. C., Gutierrez, A. J., Bui, L. T., Yahn, S. L., Winters, N. I., Calvi, C. L., Peter, L., Chung, M. I., Taylor, C. J., Jetter, C., Raju, L., Roberson, J., Ding, G., Wood, L., Sucre, J., Richmond, B. W., Serezani, A. P., McDonnell, W. J., Mallal, S. B., Bacchetta, M. J., … Kropski, J. A. (2020). Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Science advances, 6(28), eaba1972. https://doi.org/10.1126/sciadv.aba1972

6. Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A., Lao, K., & Surani, M. A. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature methods, 6(5), 377–382. https://doi.org/10.1038/nmeth.1315

7. Marcus, J. S., Anderson, W. F., & Quake, S. R. (2006). Microfluidic single-cell mRNA isolation and analysis. Analytical chemistry, 78(9), 3084–3089. https://doi.org/10.1021/ac0519460

Citation

de Abreu e Lima, F (2020). poissonisfish: UMAP clustering in Python. From https://poissonisfish.com/2020/11/14/umap-clustering-in-python/

To leave a comment for the author, please follow the link and comment on their blog: poissonisfish.


The post UMAP clustering in Python first appeared on R-bloggers.

How to Build a Predictive Soccer Model


[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We will provide an example of how you can start building your own predictive sports model, specifically for soccer, but you can extend the logic to other sports as well. These are the steps that we need to follow:

Get the Historical Data Regularly

The first thing that we need to do is to get the historical data of the past games, including the most recent ones. The data should be updated regularly. This is a relatively challenging part. You can either try to get the data on your own by scraping, or you can get the data through an API, where usually there is a fee. Let’s assume that we have arranged how we will get the historical data on a regular basis. Below, we provide an example of what the data usually look like:

How to Build a Predictive Soccer Model 2

As we can see, the data are tabular; let’s return the column names:

How to Build a Predictive Soccer Model 3

As we can see, these data refer to the outcome of each game. We will need to create other features for our predictive model.

Feature Engineering

The logic is that before each game (for example, Southampton vs Chelsea) we would like to know the average features of each team up to that point. This implies that we need to work on the data in order to transform them into the proper form. So before each game, we want to know how many goals Southampton scores on average when it plays Home, how many it receives when it plays Away, and the same for the opponent, which is Chelsea in our example. Clearly, this should be extended to all features.

We will need to group the data per Season and Team. We will need to create features for when the teams play Away and when they play Home. So each team, no matter if the next game is Home or Away, will have values for both Home and Away features. Finally, we will need to join the “Home” team Data Frame with the “Away” team Data Frame based on the match that we want to predict or train the model on.

Let’s see how we can do this in R. Notice that we use the “lag” function because we want the data up until this game and NOT including the game, since in theory we do not know its outcome, and we also use the “cummean” function, which returns the cumulative mean.

Home Features

library(tidyverse)

# Read the csv file of the data obtained from the API
df <- read.csv("premierleague20162020.csv", sep = ";")

# Create the "Home" Data Frame
H_df <- df %>%
  select(-Start.Time, -Away.Team.Name, -Result) %>%
  group_by(Season, Home.Team.Name) %>%
  arrange(Round) %>%
  mutate_at(vars(Home.Team.Goals:Away.Passes.Pct), funs(lag)) %>%
  na.omit() %>%
  mutate_at(vars(Home.Team.Goals:Away.Passes.Pct), funs(cummean)) %>%
  select(-Away.Team.Points.Before.This.Match) %>%
  ungroup()

# Add a prefix of "H_" for all the home features:
colnames(H_df) <- paste("H", colnames(H_df), sep = "_")

Away Features

# Create the "Avay" Data FrameA_df<-df%>%select(-Start.Time, - Home.Team.Name, -Result)%>%      group_by(Season, Away.Team.Name)%>%arrange(Round)%>%      mutate_at(vars(Home.Team.Goals:Away.Passes.Pct), funs(lag))%>%      na.omit()%>%      mutate_at(vars(Home.Team.Goals:Away.Passes.Pct), funs(cummean))%>%      select(-Home.Team.Points.Before.This.Match)%>%ungroup()# Add a prefix of "A_" for all the away features:colnames(A_df)<-paste("A", colnames(A_df), sep = "_")

Results Data Frame

We keep also the results data frame which consists of the Round, Season, Home.Team.Name, Away.Team.Name and the Result of the game.

# Keep the table with the actual results
results_df <- df %>%
  select(Round, Season, Home.Team.Name, Away.Team.Name, Result)

Final Data Frame

The final data frame consists of the three data frames above. So, we will need to join them

# Join the three data frames
final_df <- results_df %>%
  inner_join(H_df, by = c("Home.Team.Name" = "H_Home.Team.Name", "Round" = "H_Round", "Season" = "H_Season")) %>%
  inner_join(A_df, by = c("Away.Team.Name" = "A_Away.Team.Name", "Round" = "A_Round", "Season" = "A_Season"))

The final features for this dataset will be the Season plus the following:

How to Build a Predictive Soccer Model 4
How to Build a Predictive Soccer Model 5

Build the Machine Learning Model

Now we are ready to build the machine learning model. We can adjust the dependent variable that we want to predict based on our needs. It can be the “Under/Over”, the “Total Number of Goals”, the “Win-Loss-Draw”, etc. In our case, the “y” variable is the result, which takes 3 values: “Win”, “Loss” and “Draw”; i.e. for R this is a factor with 3 levels. Let’s see how we can build a classification algorithm working with R and H2O.

library(h2o)
h2o.init()

Train <- final_df %>% filter(Round <= 29, Round >= 6)
Test <- final_df %>% filter(Round >= 30)

Train_h2o <- as.h2o(Train)
Test_h2o <- as.h2o(Test)

# auto machine learning model. Will pick the best one
aml <- h2o.automl(y = 5, x = 6:62, training_frame = Train_h2o,
                  leaderboard_frame = Test_h2o, max_runtime_secs = 60)

# pick the best model
lb <- aml@leaderboard

# The leader model is stored here
# aml@leader

# pred <- h2o.predict(aml, test)
# or
# pred <- h2o.predict(aml@leader, Test_h2o)

h2o.performance(model = aml@leader, newdata = Train_h2o)

Clearly, for betting purposes, we do not care so much about the predicted class itself but mostly about the probability of each outcome, so that we can take advantage of bookies’ mispricing.
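As a minimal sketch of extracting those per-class probabilities from the AutoML leader (reusing the objects from the chunk above; the probability column names are assumed to follow the factor levels “Win”, “Loss” and “Draw”):

# Predictions from the AutoML leader: the predicted class plus one
# probability column per outcome level
pred <- h2o.predict(aml@leader, Test_h2o)
head(pred)  # e.g. columns: predict, Draw, Loss, Win (class probabilities)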


Discussion

The model that we described above is a reliable starting model. We can improve it by enriching it with other features like players’ injuries, team budget, other games within the week (e.g. Champions League games), etc. However, the logic always remains the same: we have the “X” features, which are built from data up until the most recent game, and our “y”, which is what we want to predict. There are also other techniques where we can give more weight to the most recent observations, as sketched below. Generally speaking, there is much research on predicting soccer games.
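For example, a hedged sketch of the recency-weighting idea (this assumes your h2o version exposes the weights_column argument and that Round is numeric; the linear weighting scheme itself is just an illustration):

# Give more weight to the most recent rounds with simple linear weights
Train$obs_weight <- Train$Round / max(Train$Round)
Train_h2o <- as.h2o(Train)

aml_weighted <- h2o.automl(y = 5, x = 6:62,
                           training_frame = Train_h2o,
                           leaderboard_frame = Test_h2o,
                           weights_column = "obs_weight",
                           max_runtime_secs = 60)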


To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.


The post How to Build a Predictive Soccer Model first appeared on R-bloggers.


How to Scrape Data from Euroleague


[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We will provide an example of how you can get the results of the Euroleague games in a structured form. The example is from the 2016-2017 season, but you can adapt it for any season. What you need is the corresponding URL for each team in the Euroleague, and you also need to define the period.

Let’s start coding:

library(tidyverse)
library(rvest)

IST <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=IST&seasoncode=E2016#!games") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
BAS <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=BAS&seasoncode=E2016#!games") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
BAM <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=BAM&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
RED <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=RED&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
CSK <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=CSK&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
#NEW
DAR <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=DAR&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
MIL <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=MIL&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
BAR <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=BAR&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
ULK <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=ULK&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
#NEW
GAL <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=GAL&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
TEL <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=TEL&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
OLY <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=OLY&seasoncode=E2016#!games") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
PAN <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=PAN&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
MAD <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=MAD&seasoncode=E2016#!games") %>% html_nodes("table") %>% .[[1]] %>% html_table() #p
#NEW
UNK <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=UNK&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()
ZAL <- read_html("http://www.euroleague.net/competition/teams/showteam?clubcode=ZAL&seasoncode=E2016") %>% html_nodes("table") %>% .[[1]] %>% html_table()

IST$Team <- c("Anadolu Efes Istanbul")
MIL$Team <- c("EA7 Emporio Armani Milan")
BAS$Team <- c("Baskonia Vitoria Gasteiz")
BAM$Team <- c("Brose Bamberg")
RED$Team <- c("Crvena Zvezda mts Belgrade")
CSK$Team <- c("CSKA Moscow")
BAR$Team <- c("FC Barcelona Lassa")
ULK$Team <- c("Fenerbahce Istanbul")
DAR$Team <- c("Darussafaka Dogus Istanbul")
TEL$Team <- c("Maccabi FOX Tel Aviv")
OLY$Team <- c("Olympiacos Piraeus")
PAN$Team <- c("Panathinaikos Superfoods Athens")
MAD$Team <- c("Real Madrid")
UNK$Team <- c("Unics Kazan")
GAL$Team <- c("Galatasaray Odeabank Istanbul")
ZAL$Team <- c("Zalgiris Kaunas")

df <- rbind(IST, MIL, BAS, BAM, RED, CSK, BAR, ULK, GAL, TEL, OLY, PAN, MAD, UNK, DAR, ZAL) %>%
  filter(!grepl("^[A-z]", X4)) %>%
  mutate(Opponent = substr(X3, 4, nchar(X3)),
         HomeVisitor = ifelse(substr(X3, 1, 2) == "vs", "Home", "Visitor"),
         Score = X4) %>%
  separate(Score, into = c('HScore', 'VScore'), sep = "-") %>%
  mutate(HScore = as.numeric(trimws(HScore)),
         VScore = as.numeric(trimws(VScore)),
         TeamScore = ifelse(HomeVisitor == 'Home', HScore, VScore),
         OpponetScore = ifelse(HomeVisitor != 'Home', HScore, VScore)) %>%
  select(-X3) %>%
  rename(Game = X1, WL = X2) %>%
  select(Game, Team, Opponent, TeamScore, OpponetScore, HomeVisitor, WL)

Let’s see what df looks like:

How to Scrape Data from Euroleague 1

This is a good starting point in case you want to build a predictive model.


To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.


The post How to Scrape Data from Euroleague first appeared on R-bloggers.

poorman: Versions 0.2.2 and 0.2.3 Releases


[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Welcome to my series of blog posts about my data manipulation package, {poorman}. For those of you that don’t know, {poorman} is aiming to be a replication of {dplyr} but using only {base} R, and therefore be completely dependency free. What’s nice about this series is that if you would rather just use {dplyr}, then that’s absolutely OK! By highlighting {poorman} functionality, this series of blog posts simultaneously highlights {dplyr} functionality too! However I sometimes also describe how I developed the internals of {poorman}, often highlighting useful {base} R tips and tricks.

Since my last blog post about {poorman}, versions 0.2.2 and 0.2.3 have been released, bringing with them a whole host of new functions and features. In today’s blog post we will be taking a look at some of these new features. Given the sheer amount of features this release brings, we won’t be focusing on the internals of any of these functions. Instead, we will simply be taking a look at what some of them can do.

across()

One of the newer features in {dplyr}, across() is intended to eventually replace the scoped variants (_if, _at, _all) of the “single table” verb functions which have now been superseded. These functions will supposedly remain within {dplyr} for “several years” still, giving developers plenty of time to update their code.

across() makes it easy to apply the same transformation to multiple columns, allowing you to use poor-select (or tidy-select) semantics inside of summarise() and mutate(). Let’s take a look at the function in action.

library(poorman, warn.conflicts = FALSE)
iris %>%
  group_by(Species) %>%
  summarise(across(.cols = starts_with("Sepal"), .fn = mean))
#      Species Sepal.Length Sepal.Width
# 1     setosa        5.006       3.428
# 2 versicolor        5.936       2.770
# 3  virginica        6.588       2.974

In the above code chunk, we take the iris dataset and group it by the Species column; then we look to summarise across all columns which start with the string "Sepal" (Sepal.Length and Sepal.Width) by taking the mean of those columns within each Species group. Let’s take a look at a more complex example.

iris %>%
  group_by(Species) %>%
  summarise(across(.cols = contains("Width"), .fn = list(mean, sd)))
#      Species Sepal.Width_1 Sepal.Width_2 Petal.Width_1 Petal.Width_2
# 1     setosa         3.428     0.3790644         0.246     0.1053856
# 2 versicolor         2.770     0.3137983         1.326     0.1977527
# 3  virginica         2.974     0.3224966         2.026     0.2746501

So here, we are saying: give me the mean and standard deviation across all columns containing the string "Width" for each Species of iris flower. Notice how the output is named: the function numbers the columns to represent each function's output, i.e. here _1 represents the mean and _2 represents the standard deviation. You can control the names yourself by providing them to the .names argument.

iris %>%
  group_by(Species) %>%
  summarise(across(
    .cols = contains("Width"),
    .fn = list(mean, sd),
    .names = c(
      "sepal_width_mean", "sepal_width_sd", "petal_width_mean", "petal_width_sd"
    )
  ))
#      Species sepal_width_mean sepal_width_sd petal_width_mean petal_width_sd
# 1     setosa            3.428      0.3790644            0.246      0.1053856
# 2 versicolor            2.770      0.3137983            1.326      0.1977527
# 3  virginica            2.974      0.3224966            2.026      0.2746501

This is slightly different from how {dplyr} works, since {dplyr} imports {glue}; remember, {poorman} aims to be dependency free. This functionality will be expanded upon in future releases of {poorman}.
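For comparison, here is a sketch of the {dplyr} spelling of the same idea, which uses a {glue} template for .names (this is {dplyr}'s interface, not {poorman}'s):

# dplyr only: name the outputs <column>_<function> via a glue template
iris %>%
  group_by(Species) %>%
  summarise(across(contains("Width"), list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))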

case_when()

This function allows you to vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned. The syntax for the function is a sequence of two-sided formulas. The left hand side determines which values match the particular case whereas the right hand side provides the replacement value.

x <- 1:50
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
#  [1] "1"         "2"         "3"         "4"         "fizz"      "6"         "buzz"     
#  [8] "8"         "9"         "fizz"      "11"        "12"        "13"        "buzz"     
# [15] "fizz"      "16"        "17"        "18"        "19"        "fizz"      "buzz"     
# [22] "22"        "23"        "24"        "fizz"      "26"        "27"        "buzz"     
# [29] "29"        "fizz"      "31"        "32"        "33"        "34"        "fizz buzz"
# [36] "36"        "37"        "38"        "39"        "fizz"      "41"        "buzz"     
# [43] "43"        "44"        "fizz"      "46"        "47"        "48"        "buzz"     
# [50] "fizz"

Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general. case_when() is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables.

mtcars %>%
  mutate(efficient = case_when(mpg > 25 ~ TRUE, TRUE ~ FALSE))
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb efficient
# Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     FALSE
# Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     FALSE
# Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     FALSE
# Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     FALSE
# Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2     FALSE
# Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     FALSE
# Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     FALSE
# Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     FALSE
# Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     FALSE
# Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     FALSE
# Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     FALSE
# Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3     FALSE
# Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3     FALSE
# Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3     FALSE
# Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     FALSE
# Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     FALSE
# Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     FALSE
# Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1      TRUE
# Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2      TRUE
# Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1      TRUE
# Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     FALSE
# Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2     FALSE
# AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2     FALSE
# Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     FALSE
# Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2     FALSE
# Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1      TRUE
# Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2      TRUE
# Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2      TRUE
# Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4     FALSE
# Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     FALSE
# Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8     FALSE
# Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     FALSE

rename_with()

rename_with() acts like rename(), only it allows you to rename columns with a function. In the below example, we rename the columns of iris to be upper case.

rename_with(iris, toupper) %>% head()
#   SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

However we can have more control over which columns we rename by making use of the .cols parameter and poor-select selection semantics.

rename_with(iris, toupper, contains("Petal")) %>% head()
#   Sepal.Length Sepal.Width PETAL.LENGTH PETAL.WIDTH Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

Conclusion

This post has demonstrated some of the capabilities of the {poorman} (and therefore {dplyr}) package. The v0.2.2 and v0.2.3 releases actually include plenty more features and functions, so be sure to check out the release page.

If you are interested in taking a closer look at how I have coded these functions, you can see the code on the relevant {poorman} GitHub page.

If you’d like to show your support for {poorman}, please consider giving the package a Star on Github as it gives me that boost of dopamine needed to continue development.


To leave a comment for the author, please follow the link and comment on their blog: Random R Ramblings.


The post poorman: Versions 0.2.2 and 0.2.3 Releases first appeared on R-bloggers.

Some notes when using dot-dot-dot (…) in R


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When writing functions in R, the ... argument is a special argument useful for passing an unknown number of arguments to another function. This is widely used in R, especially in generic functions such as plot(), print(), and apply().

Hadley Wickham’s Advanced R has a nice short section on the uses of ... and some potential pitfalls when using .... In this short post, I talk about some other things to be aware of when using ....

Example 1

The first thing to note is that the function receiving the ... does not itself need ... in its function signature. In the example below, f1 has ... in its function signature and passes ... to f2. f2 does not have ... in its function signature but is able to interpret the call from f1 correctly.

f1 <- function(x, ...) {
  f2(...)
}

f2 <- function(y) {
  print(y)
}

f1(x = 1, y = 2)
# [1] 2

As one might expect, if f1 passes anything other than y, we get an error:

f1(x = 1, y = 2, z = 3)
# Error in f2(...) : unused argument (z = 3)

It’s interesting that if we do not specify y = in the f1 function call, R is smart enough to decipher what is going on and give the answer we expect. I don’t recommend writing such code though as it can be ambiguous to the reader.

f1(1, 2)
# [1] 2

Example 2

This example is almost the same as the previous one except f2 now has ... in its function signature as well. The f1 function call with just x and y works as expected.

f1 <- function(x, ...) {
  f2(...)
}

f2 <- function(y, ...) {
  print(y)
}

f1(x = 1, y = 2)
# [1] 2

Note, however, that the call with x, y and z does not fail:

f1(x = 1, y = 2, z = 3)
# [1] 2

When f1 calls f2, the z argument goes into f2‘s .... It’s not used by the function, but it does not throw an error because this is not an illegal input to f2. This may or may not be what you want! I recently got burned by this because I was expecting f2 to throw an error but it didn’t, which made me think that my code was working when it wasn’t.
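If you want the stricter behaviour, one option is to make f2 fail loudly when it receives arguments it does not use. A minimal sketch using base R's ...length() (available since R 3.5.0):

f2 <- function(y, ...) {
  # Fail if anything was passed through ... that f2 does not use
  if (...length() > 0) {
    stop("unused arguments: ", paste(names(list(...)), collapse = ", "))
  }
  print(y)
}

f1(x = 1, y = 2, z = 3)
# Error in f2(...) : unused arguments: z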

Example 3

You can use list(...) to interpret the arguments passed through ... as a list. This can be useful if you want to amend the arguments before passing them on. Save the output of list(...) as a variable, amend this variable, then call the next function with the amended variable using do.call().

For example, in the code below, f1 checks to see if the argument y is passed. If so, it is doubled before being passed on to f2.

f1 <- function(x, ...) {
  args <- list(...)
  if ("y" %in% names(args)) {
    args$y <- 2 * args$y
  }
  do.call(f2, args)
}

f2 <- function(y) {
  print(y)
}

f1(x = 1, y = 2)
# [1] 4

References:

  1. Wickham, H. Advanced R (Section 6.6).

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Some notes when using dot-dot-dot (…) in R first appeared on R-bloggers.

COVID-19 Posts: A Public Dataset Containing 400+ COVID-19 Blog Posts


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Over the last few months, we’ve been collecting hundreds of COVID-19 blog posts from the R community. Today, we are excited to share this dataset publicly, to help bloggers who want to analyze COVID-19 data with R by making such posts easy to find and research.

So far, we have found and recorded 423 COVID posts in English. In an effort to encourage others to explore such posts, we’ve published a Shiny web app which allows users to find the names of the 231 bloggers who wrote those posts, their roles, and their country of focus. The app also lets users interactively search the collection of posts by primary topic, post title, date, and whether the post uses a particular mathematical technique or data source. To learn more about the evolution of this dataset, one of the authors (Rees) has published nine articles on Medium, which you can find here.

We encourage users to submit their own posts (or others’ posts) for inclusion, which can be done on this Google Form. Our dataset, as well as the code for the Shiny app, is available on GitHub. If anyone has corrections to the dataset, please write Rees (at) ReesMorrison (dot) com.

The remainder of this post highlights some of the findings from the dataset of COVID-19 posts. As will be made evident by the plots that follow, this is by no means a comprehensive review of every COVID-19 R blog post, but rather an overview of the data that we have found.

Posts Over Time

As the pandemic has progressed, fewer bloggers have engaged with COVID-related data, as we notice that blog posts peaked in March of 2020.

Some bloggers have been prolific; many more have been one and done. The plot below shows the names and posts of the 23 bloggers who have so far published at least four posts. For an example of how to read the plot, Tim Churches, at the bottom of the y-axis, has published a total of nine posts, but none after early April.

The color of the points corresponds to the work role of the blogger as explained in the legend at the bottom. It is immediately apparent that professors and academic researchers predominate in this group of bloggers. If you include the postgraduate students, universities writ large account for nearly all of the prolific bloggers.

Roles of Authors

The bloggers in our dataset describe their work-day roles in a variety of ways. One of the authors (Rees) standardized these job roles by categorizing the multitude of terms and descriptions, but it is quite possible that this effort misrepresented what some of these bloggers do for a living. We welcome corrections.

We’ve further categorized roles into a broad typology where professions fall into one of five categories: university, corporate, professional, government, and nonprofit. Those broader categories are represented as columns in the following chart.

Data Sources

A greater number of data sources related to COVID-19 will yield richer insights. Combining different datasets can shed new light on an issue, yield improvements, and allow authors to construct better indices and measures. For that reason, one of the authors (Rees) extracted dataset information from our collection of blog posts.

For the most part, bloggers identified the data source they drew on for their analysis. On occasion, we had to apply some effort to standardize the 140 data sources.

By far the most prevalent data source is Johns Hopkins University, which early on, comprehensively, and consistently set the standard for COVID-19 data collection and dissemination to the public.

Blog Post Topics

It may also be the case that readers want a summary of blogs, or to only look at posts that pertain to a certain topic. Assigning each blog post a primary topic introduces a fair amount of subjectivity, to be sure, but the hope is that these broad topics will help researchers find content and colleagues who share similar interests.

Here, a balloon plot shows various categories that the 423 posts address as their primary topic. Topics fall on the y axis and the blogger’s category of employment is on the x axis. The size (and opacity) of each bubble represents the count of posts that match that combination. Epidemiology leads the way, as might be expected, but quite a few posts seem to use COVID data to showcase something else, or apply R in novel ways.

Concluding Thoughts

We encourage you to use our Shiny application to explore the data for yourself. If you’d like to submit your post to be included, fill out this Google Form.

As we note in the footer of the application, the R community is intelligent and produces interesting content, but not all of us are experts when it comes to COVID-19. Engaging with these posts will allow you to better understand the application of R to our current moment, and perhaps provide feedback to post authors. We do not endorse the findings of any particular author and encourage you to find accurate, relevant, and recent information from reputable sources such as the CDC and the WHO.


Originally published on my blog. Follow the authors Rees and Connor on Twitter.




To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post COVID-19 Posts: A Public Dataset Containing 400+ COVID-19 Blog Posts first appeared on R-bloggers.

Most popular on Netflix, Disney+, Hulu and HBOmax. Weekly Tops for last 60 days


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A couple of months ago I published Most popular on Netflix. Daily Tops for last 60 days – a small piece of research based on daily scraping data, answering the following questions: How many movies (titles) made the Netflix Daily Tops? What movie was the longest #1 on Netflix? For how many days do movies / TV shows stay in the Tops and as #1? etc. This time I am sharing an analysis of the most popular movies / TV shows across Netflix, Disney+, Hulu and HBOmax on a weekly basis, instead of daily, with the anticipation of catching trends better.

So, let's count how many movies made the top 5; I assume it is less than 5 * 60…

library(tidyverse)
library(gt)
platforms <- c('Disney+', 'HBOmax', 'Hulu', 'Netflix')
# additionally, load the CSV data using readr

Wrangle raw data – reverse (fresh date first), take top 5, take last 60 days

# assignments reconstructed; the tables were loaded from CSV above
fjune_dt  <- fjune_dt  %>% rev() %>% slice(1:5) %>% select(1:60)  # Netflix
fdjune_dt <- fdjune_dt %>% rev() %>% slice(1:5) %>% select(1:60)  # Disney+
hdjune_dt <- hdjune_dt %>% rev() %>% slice(1:5) %>% select(1:60)  # HBOmax
hulu_dt   <- hulu_dt   %>% rev() %>% slice(1:5) %>% select(1:60)  # Hulu

Gather it together and count the number of unique titles in Top5 for 60 days

fjune_dt_gathered  <- gather(fjune_dt)
fdjune_dt_gathered <- gather(fdjune_dt)
hdjune_dt_gathered <- gather(hdjune_dt)
hulu_dt_gathered   <- gather(hulu_dt)
# count unique titles per platform (assignments reconstructed around the unique() calls)
unique_fjune_gathered  <- unique(fjune_dt_gathered$value)  %>% length()
unique_fdjune_gathered <- unique(fdjune_dt_gathered$value) %>% length()
unique_hdjune_gathered <- unique(hdjune_dt_gathered$value) %>% length()
unique_hulu_gathered   <- unique(hulu_dt_gathered$value)   %>% length()
unique_gathered <- c(unique_fdjune_gathered, unique_hdjune_gathered, unique_hulu_gathered, unique_fjune_gathered)
unique_gathered <- as.data.frame(t(unique_gathered), stringsAsFactors = F)
colnames(unique_gathered) <- platforms

Let's make a nice table for the results

unique_gathered_gt <- gt(unique_gathered) %>%   # assignment reconstructed
  tab_header(title = "Number of unique movies (titles) in Top5") %>%
  tab_style(style = list(cell_text(color = "purple")),
            locations = cells_column_labels(columns = vars(HBOmax))) %>%
  tab_style(style = list(cell_text(color = "green")),
            locations = cells_column_labels(columns = vars(Hulu))) %>%
  tab_style(style = list(cell_text(color = "red")),
            locations = cells_column_labels(columns = vars(Netflix)))
unique_gathered_gt

Using similar code we can count the number of unique titles which were #1 one or more days
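
As a hedged sketch of that count (assuming, as in the wrangling step above, that the first row of each table holds the daily #1 title):

n1_titles <- function(dt) length(unique(unlist(dt[1, ])))  # row 1 assumed to be the daily #1

tibble(
  platform = platforms,
  unique_number_ones = c(n1_titles(fdjune_dt), n1_titles(hdjune_dt),
                         n1_titles(hulu_dt), n1_titles(fjune_dt))
)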

What movie was the longest in Tops / #1?

table_fjune_top5  <- sort(table(fjune_dt_gathered$value),  decreasing = T) # Top5
table_fdjune_top5 <- sort(table(fdjune_dt_gathered$value), decreasing = T)
table_hdjune_top5 <- sort(table(hdjune_dt_gathered$value), decreasing = T)
table_hulu_top5   <- sort(table(hulu_dt_gathered$value),   decreasing = T)

Plotting the results

bb5fdjune <- barplot(table_fdjune_top5[1:5], ylim = c(0, 62), main = "Days in Top5, Disney+", las = 1, col = 'blue')
text(bb5fdjune, table_fdjune_top5[1:5] + 2, labels = as.character(table_fdjune_top5[1:5]))
bb5hdjune <- barplot(table_hdjune_top5[1:5], ylim = c(0, 60), main = "Days in Top5, HBO Max", las = 1, col = 'grey', cex.names = 0.7)
text(bb5hdjune, table_hdjune_top5[1:5] + 2, labels = as.character(table_hdjune_top5[1:5]))
bb5hulu <- barplot(table_hulu_top5[1:5], ylim = c(0, 60), main = "Days in Top5, Hulu", las = 1, col = 'green')
text(bb5hulu, table_hulu_top5[1:5] + 2, labels = as.character(table_hulu_top5[1:5]))
bb5fjune <- barplot(table_fjune_top5[1:5], ylim = c(0, 60), main = "Days in Top5, Netflix", las = 1, col = 'red')
text(bb5fjune, table_fjune_top5[1:5] + 2, labels = as.character(table_fjune_top5[1:5]))

The same was done for the movies / TV shows that reached first place in the weekly count.

Average days-in-top distribution

# top 5
ad5_fjune  <- as.data.frame(table_fjune_top5,  stringsAsFactors = FALSE)
ad5_fdjune <- as.data.frame(table_fdjune_top5, stringsAsFactors = FALSE)
ad5_hdjune <- as.data.frame(table_hdjune_top5, stringsAsFactors = FALSE)
ad5_hulu   <- as.data.frame(table_hulu_top5,   stringsAsFactors = FALSE)
par(mfcol = c(1, 4))
boxplot(ad5_fdjune$Freq, ylim = c(0, 20), main = "Days in Top5, Disney+")
boxplot(ad5_hdjune$Freq, ylim = c(0, 20), main = "Days in Top5, HBO Max")
boxplot(ad5_hulu$Freq,   ylim = c(0, 20), main = "Days in Top5, Hulu")
boxplot(ad5_fjune$Freq,  ylim = c(0, 20), main = "Days in Top5, Netflix")

The same for the movies / TV shows that reached first place (#1) in the weekly count.




To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Most popular on Netflix, Disney+, Hulu and HBOmax. Weekly Tops for last 60 days first appeared on R-bloggers.

BASIC XAI with DALEX — Part 3: Partial Dependence Profile


[This article was first published on R in ResponsibleML on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

BASIC XAI

BASIC XAI with DALEX — Part 3: Partial Dependence Profile

Introduction to model exploration with code examples for R and Python.

By Anna Kozak

Welcome to the “BASIC XAI with DALEX” series.

In this post, we present the Partial Dependence Profile (PDP), a model-agnostic method which we can use for any type of model.

You can find the first part of this series here: BASIC XAI with DALEX — Part 1: Introduction.

You can find the second part of this series here: BASIC XAI with DALEX — Part 2: Permutation-based variable importance.

So, shall we start?

First — what does PDP deliver to us?

We continue to explore the model, this time looking at global explanations. Earlier we talked about permutation-based variable importance; now we look at the Partial Dependence Profile. The general idea behind PD profiles is to show how the expected value of the model prediction behaves as a function of a selected explanatory variable. For a single model we can construct an overall PD profile using all observations from the data set, or several profiles for subgroups of observations. A comparison of subgroup-specific profiles can provide important insight, for example, into the stability of the model prediction.

  1. Profiles can be created for all the observations in the set, as well as for the division against other variables. For example, we can see how a specific variable behaves when differentiated by gender, race, or other factors.
  2. We can detect some complicated variable relationships. For example, we have PD profiles for two models and we can see that one of the simple models (linear regression) does not detect any dependence, while the profile for a black-box model (random forest) notices a difference.

Second — Idea of Partial Dependence Profile

The value of a Partial Dependence profile for model f() and explanatory variable Xj at z is defined as follows:
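
Written out (consistent with the description below), the standard definition is:

\[ g_{PD}^{j}(z) = E_{X^{-j}}\left[ f\left( X^{j|=z} \right) \right] \]

where X^{j|=z} denotes the vector of explanatory variables with the j-th coordinate replaced by z.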

Thus, it is the expected value of the model predictions when variable Xʲ is fixed at z over the (marginal) distribution of X-ʲ, i.e., over the joint distribution of all explanatory variables other than the j-th.

Generally, we do not know the true distribution of X-ʲ. We can estimate it, however, by the empirical distribution of n observations available in a training dataset.

Third — let’s get a model in R and Python

Let’s write some code. We are still working on the DALEX apartments data. To calculate and visualize PD profiles we use the model_profile() function with partial type. The generic plot function draws lines for numeric variables. To get profiles for the categorical variables in the model_profile() function we should use the variable_type = “categorical”.

https://medium.com/media/fadcb60ae8aff3c562cc0d844dc9c38f/href
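
The embedded code is not reproduced here, so the following is a minimal sketch of the kind of call described in the text (assuming the DALEX apartments data and a ranger random forest; the object names rf_mprofile and rf_mprofile_group are chosen to match the plots below):

library(DALEX)
library(ranger)

rf_model <- ranger(m2.price ~ ., data = apartments)

rf_explainer <- explain(rf_model,
                        data = apartments[, -1],      # explanatory variables only
                        y = apartments$m2.price,
                        label = "ranger")

# PD profiles for numeric variables (type = "partial" is the PDP)
rf_mprofile <- model_profile(rf_explainer,
                             variables = c("surface", "construction.year"),
                             type = "partial")

# grouped PD profiles, split by the number of rooms
# (no.rooms may need to be converted to a factor first)
rf_mprofile_group <- model_profile(rf_explainer,
                                   variables = c("surface", "construction.year"),
                                   groups = "no.rooms",
                                   type = "partial")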

Now let’s see what PD profiles look like for the surface and construction.year. We can see that the larger the area of the apartment, the lower the price per square meter is. For the variable year of construction, the most expensive apartments are built before World War II and after 1990.

plot(rf_mprofile)

As we mentioned earlier, we can construct profiles against other variables (grouped partial dependence profiles).

For example, let’s see how the surface and construction.year profiles depend on the number of rooms in the apartment. Each color corresponds to the number of rooms in the apartment: ranger_1 to one room, ranger_2 to two, and so on. What we can see is that the more rooms there are, the lower the price per square meter. The biggest difference is for a property with one room.

plot(rf_mprofile_group)

We also talked about comparing PDP for different models (contrastive partial dependence profiles). Let’s build a linear regression model on the same data. The linear regression model does not capture the U-shaped relationship between the construction.year and the price. On the other hand, the effect of the surface on the apartment price seems to be underestimated by the random forest model. Hence, one could conclude that, by addressing the issues, one could improve either of the models, possibly with an improvement in predictive performance.

plot(rf_mprofile, lm_mprofile)

In the next part, we will talk about the Break Down method.

Many thanks to Przemyslaw Biecek and Jakub Wiśniewski for their support on this blog.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.

In order to see more R related content visit https://www.r-bloggers.com


BASIC XAI with DALEX — Part 3: Partial Dependence Profile was originally published in ResponsibleML on Medium, where people are continuing the conversation by highlighting and responding to this story.


To leave a comment for the author, please follow the link and comment on their blog: R in ResponsibleML on Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post BASIC XAI with DALEX — Part 3: Partial Dependence Profile first appeared on R-bloggers.

constants: Update to 2018 CODATA values


[This article was first published on R – Enchufa2, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The constants package contains CODATA internationally recommended values of the fundamental physical constants, provided as symbols for direct use within the R language. Optionally, the values with uncertainties and/or units are also provided if the errors, units, and/or quantities packages are installed. The Committee on Data for Science and Technology (CODATA) is an interdisciplinary committee of the International Council for Science which periodically provides the internationally accepted set of values of the fundamental physical constants. This release contains the “2018 CODATA” version, published in May 2019 [E. Tiesinga, P. J. Mohr, D. B. Newell, and B. N. Taylor (2020) http://physics.nist.gov/constants].

This version contains some breaking changes that are necessary to streamline future updates and provide a stable symbol table:

  • The codata table includes the absolute uncertainty instead of the relative one. Thus, the rel_uncertainty column has been dropped in favour of the new uncertainty. Also, columns have been slightly reordered.
  • Symbol names for constants have changed. The old ones were hand-crafted and thus unmanageable. This release adopts the ASCII symbols defined by NIST on their webpage, except for those that collide with some base R function. In particular, there are two cases: c, the speed of light, has been renamed as c0; sigma, the Stefan-Boltzmann constant, has been renamed as sigma0.
  • Constant types, or categories, (column codata$type) adopts the names defined by NIST in the webpage too. Some constants belong to more than one category (separated by comma); some others belong to no category (missing type).

There are some new features too:

  • In addition to the codata data frame, this release includes codata.cor, a correlation matrix for all the constants.
  • In addition to syms_with_errors and syms_with_units, there is a new list of symbols called syms_with_quantities (available if the optional quantities package is installed), which provides constant values with uncertainty and units.
  • Experimental support for correlated values in syms_with_errors and syms_with_quantities is provided (disabled by default; see details in help(syms) for activation instructions).

See the README for some usage examples. For questions, suggestions or issues, please use the issue tracker.
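
As a minimal usage sketch (assuming the symbol names described above; not taken from the README):

library(constants)

syms$c0      # speed of light in vacuum, plain numeric
syms$sigma0  # Stefan-Boltzmann constant

head(codata) # the full CODATA table, now with absolute uncertainties

# with uncertainty and units, if the optional packages are installed:
# syms_with_quantities$c0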

Article originally published in Enchufa2.es: constants: Update to 2018 CODATA values.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post constants: Update to 2018 CODATA values first appeared on R-bloggers.


The Birth of a Galaxy


[This article was first published on Wenyao, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Universe by tidyverse.

Image by NASA/JPL-Caltech/R. Hurt

In celebration of my first appearance on R-Bloggers

There is no denying that some of the most awe-inspiring photos ever taken can be found in astrophotography. In my low-cost attempt at capturing that magic a few years ago, I posted an image of a procedurally generated Milky Way. Since then, the same methodology has been applied to harmonograph. Today, sit tight because our mission is to document my journey to infinity and beyond.

The idea is simple enough. Structurally, the Milky Way consists of a couple of spiral arms spinning around its center, all of which are in turn made up of numerous teeny-tiny little stars. In the absence of a space telescope, how can we create something that looks reasonably close to the galaxy? Fortunately, we have a few pretty powerful tools at our disposal (specifically, R and tidyverse/ggplot). Note that the focus here is aesthetic pleasingness rather than scientific accuracy.

Spiral Arms

Named after the Greek mathematician, the Archimedean spiral bears a remarkable resemblance to the spiral arms (from a top-down perspective at least) of the Milky Way. In polar coordinates, the curve is given by:

\[r = \theta ^ k\]

Equivalently, the Cartesian coordinates would be:

\[\begin{cases} x = \theta ^ k \cos{\theta} \\ y = \theta ^ k \sin{\theta} \\ \end{cases}\]

In the language of tidyverse, once we have the desired range of $\theta$, the full set of points can be generated by:

spiral_arm <- tibble(
  theta = seq(from = theta_from, to = theta_to, length.out = theta_length)
) %>% 
  mutate(
    r = theta ^ k,
    x = r * cos(theta),
    y = r * sin(theta)
  )

This should get the job done nicely. Now, what if we want more than one spiral arm? One solution would be to repeat the process several times and add a constant to theta for rotation purposes:

spiral_arms <- lapply(
  seq_len(num_of_arms),
  function(id) {
    tibble(
      id = id,
      theta = seq(from = theta_from, to = theta_to, length.out = theta_length)
    ) %>% 
      mutate(
        r = theta ^ k,
        x = r * cos(theta + 2 * pi * id / num_of_arms),
        y = r * sin(theta + 2 * pi * id / num_of_arms)
      )
  }) %>% 
  bind_rows()

Lo and behold – Archimedes has given us the skeleton upon which a galaxy will be born.

Fleshing out the Skeleton

Stars rarely align on a line. Instead, they exhibit some degree of duality between individual randomness and collective predictability. We can jitter the points vertically and horizontally with some white noise to achieve a similar effect. If there aren’t enough points, reuse the existing ones!

stars <- spiral_arms %>% 
  slice(rep(row_number(), star_intensity)) %>% 
  mutate(
    x = x + rnorm(n(), sd = width),
    y = y + rnorm(n(), sd = width)
  )

There are two moving pieces in the equation. First, the intensity variable controls overall how many stars will be created. On the other hand, the standard deviation of the noise governs the dispersion of how far a star tends to diverge from its spiral arm. Also, in R’s plotting convention, shape number 8 will give us that star-shaped point we want.

There seems to be one problem though – why don’t the stars shine?

Twinkle Twinkle Little Star

As it turns out, black isn’t the greatest choice of color when it comes to stars. It’s a subjective call but personally, I would rather that they are colored this way:

Even then, no color alone can bring the kind of vitality and liveliness that we’ve come to expect from a photo. Ideally, each star should be assigned its own color by randomly sampling from the color space (with replacement):

stars <- stars %>%
  mutate(
    color = star_colors %>% sample(size = n(), replace = TRUE)
  )

In fact, we can further randomize other attributes of the stars as well, i.e., either via sampling values from a predefined set (e.g., random sizes for the stars) or letting it correlate with some feature (e.g., opacity being inversely proportional to the radius from the center). Adding multiple layers of halo effect also helps in terms of introducing the illusion of a vibrant galaxy.

ggplot(spiral_arms, aes(x = x, y = y)) +
  geom_point(data = stars, size = star_halo_size1, color = "white", shape = 8) +
  geom_point(data = stars, size = star_halo_size2, color = "white", shape = 8) +
  geom_point(data = stars, size = stars$size, alpha = stars$alpha, color = stars$color, shape = 8) +
  theme(panel.background = element_rect(fill = background_color))

The halos are effectively nothing more than a few extra points in white at the exact same places, albeit bigger in size and lower in opacity. It’s a simple technique but sometimes it works wonders. When all is being said and drawn, we’ve got ourselves a fairly decent galaxy.

Before we move on, let’s take a moment to appreciate how far we’ve come since sketching the skeleton. However, something is still missing.

Galactic Center

At the heart of the Milky Way lies the brightest region of our galaxy, the Galactic Center, the jewel in the crown. From a purely visual standpoint, it looks like a tilted oval spanning from bottom left to top right, shining and fiery.

But guess what, by no means is it an obscure pattern in geometry. Recall bivariate normal distribution?

\[\begin{cases} x = a \\ y = \rho a + \sqrt{1 - \rho ^ 2} b \\ \end{cases}\]

where $a$ and $b$ are independent normally distributed random variables. This should give us a way to generate points similar in shape to the Galactic Center:

gc <- tibble(
    x = rnorm(gc_intensity, sd = gc_sd_x)
  ) %>% 
  mutate(
    y = gc_rho * x + sqrt(1 - gc_rho ^ 2) * rnorm(n(), sd = gc_sd_y)
  )

Again, let’s pick the color palette that best matches that of a burning furnace:

And as usual, we pull the trick of randomized assignment of color, size, transparency in addition to the halo effect:

ggplot(spiral_arms, aes(x = x, y = y)) +
  geom_point(data = gc, size = gc_halo_size1, alpha = gc_halo_alpha1, color = "gold", shape = 8) +
  geom_point(data = gc, size = gc_halo_size2, alpha = gc_halo_alpha2, color = "gold", shape = 8) +
  geom_point(data = gc, size = gc$size, alpha = gc$alpha, color = gc$color, shape = 8) +
  theme(panel.background = element_rect(fill = background_color))

Between you and me, who would have thought that there’s something Gaussian to be found all over the galaxy?

Putting It All Together

From a self-proclaimed data artist who lacks training in cosmology in any meaningful way, the end result works stunningly well. Serene and peaceful, dazzling yet profound, it breathes, chants and whispers, into the void for an eternity. In all fairness, most of the credit goes to our friend randomness who manages to create a sense of guided unpredictability, although not without careful choices of color palette, transparency, shape and size. To do it justice, I highly recommend viewing the image in its native resolution.

Last but not least, the appeal obviously goes beyond the Milky Way. As long as we’ve made up our mind about an object’s functional form (and maybe a new color palette that goes with it), everything else should still hold. This bodes well for other systems, constellations, or galaxies that we might want to give a try, so please feel free to let me know if you would like to see Andromeda next.

You can also find animation, video, source, or merchandise.


To leave a comment for the author, please follow the link and comment on their blog: Wenyao.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post The Birth of a Galaxy first appeared on R-bloggers.

How to Test for Randomness


[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have been contacted by many people asking me to predict the outcome of events that in theory are random. For example, they want me to predict lottery games like Keno, Lotto, casino roulette numbers and so on and so forth. My answer is that you cannot predict something which is supposed to be random. No model can give you a better estimate than what you already know; for example, in roulette the probability of getting the number 0 is 1/37 no matter what the previous numbers were. This implies that before you start building advanced Machine Learning and Artificial Intelligence models to predict the outcome of the next draw, you should check whether these numbers are actually random. If they are actually random, then there is no pattern and you should not waste your time with ML and AI.

For demonstration purposes, we will assume that we are dealing with numbers obtained from an unbiased casino roulette with numbers from 0 to 36. Let’s create our sample in R.

set.seed(5)
# Generate 100K random numbers from 0 to 36
casino <- sample(c(0:36), 100000, replace = TRUE)

Chi-Square Test for the Frequency of the Numbers

In the beginning, we can test if the frequency of the drawn numbers is random. A barplot of the frequency of each number will help us to get a better idea.

barplot(table(casino), main="Frequency of each number") 
How to Test for Randomness 1

Let’s now run the Chi-Square test:

chisq.test(table(casino)) 
How to Test for Randomness 2

As we can see, the p-value is greater than 5%, which means that we do not reject the null hypothesis that the numbers are uniformly distributed. Of course, this test does not check if there is a pattern in the way that the numbers are served.

Autocorrelation and Partial Autocorrelation

A quick way to see if there is a pattern in the way that the numbers are served is to plot the acf and pacf.

acf(casino)
pacf(casino)
How to Test for Randomness 3How to Test for Randomness 4

If you get plots similar to the ones above, it means that there is no correlation between the currently drawn number and the previous (lagged) ones.

Wald-Wolfowitz Runs Test

Regarding the sequence of the numbers, we can apply the Wald-Wolfowitz Runs Test, a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent. Notice that this was suggested for binary cases, but the runs test can also be used to test the randomness of a distribution, by taking the data in the given order and marking with + the data greater than the median, and with - the data less than the median (numbers equalling the median are omitted).
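
To make the dichotomisation concrete, here is a hand-rolled sketch of the run count that the test is based on (the randtests function below does all of this, plus the normal approximation, for us):

med <- median(casino)
signs <- casino[casino != med] > med                 # TRUE = above the median, FALSE = below
n1 <- sum(signs); n2 <- sum(!signs)
runs <- sum(signs[-1] != signs[-length(signs)]) + 1  # number of runs of consecutive signs
expected <- 2 * n1 * n2 / (n1 + n2) + 1              # expected number of runs under randomness
c(observed = runs, expected = expected)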

We can run this test using the randtests package in R. Let’s do it for our sample data:

library(randtests)
runs.test(casino)
How to Test for Randomness 5

Again, we accept the null hypothesis that the sequence of the numbers is random.

Bartels Test for Randomness

Another test that you can apply is the Bartels Test for Randomness, which is the rank version of von Neumann’s Ratio Test for Randomness. Let’s run it in R using the randtests package.

bartels.rank.test(casino) 
How to Test for Randomness 6

Again, we can claim that the numbers are random.

Cox Stuart Test

The proposed method is based on the binomial distribution. We can easily run this test in R, again using the randtests package.

cox.stuart.test(casino) 
How to Test for Randomness 7

Again, we accept the null hypothesis at 5% level of significance.

Difference Sign Test

Another test that we can run, is the non-parametric “Difference Sign Test”:

difference.sign.test(casino) 
How to Test for Randomness 8

Again, we accept the null hypothesis.

Conclusion

We provided several different approaches for testing whether a sequence of numbers is actually random or not. You can apply these tests if you want to ensure the randomness of a sequence of numbers.


To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post How to Test for Randomness first appeared on R-bloggers.

Le Monde puzzle [#1164]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The weekly puzzle from Le Monde is quite similar to older Diophantine episodes (which I find myself unable to point to):

Give the maximum integer that cannot be written as 105x+30y+14z. Same question for 105x+70y+42z+30w.

These are indeed Diophantine equations and the existence of a solution is linked with Bézout’s Lemma. Take the first equation. Since 105 and 30 have a greatest common divisor equal to 3×5=15, there exists a pair (x⁰,y⁰) such that

105 x⁰ + 30 y⁰ = 15

hence a solution to every equation of the form

105 x + 30 y = 15 a

for any relative integer a. Similarly, since 14 and 15 are co-prime,

there exists a pair (a⁰,b⁰) such that

15 a⁰ + 14 b⁰ = 1

hence a solution to every equation of the form

15 a + 14 b = c

for every relative integer c. This means 105x+30y+14z=c can be solved in all cases, and the same result applies to the second equation. Since algorithms for Bézout’s decomposition are readily available, there is little point in writing R code for it..! However, the original question must impose the coefficients to be positive, which of course kills the Bézout’s identity argument. Stack Exchange provides the answer as the linear Diophantine problem of Frobenius! While there is no universal solution for three and more base integers, Mathematica enjoys a FrobeniusNumber solver, producing 271 and 383 as the largest non-representable integers. These are also found by my R code

o=function(i,e,x){
  if((a<-sum(!!i))==sum(!!e))sol=(sum(i*e)==x) else{sol=0
    for(j in 0:(x/e[a+1]))sol=max(sol,o(c(i,j),e,x))}
  sol}
a=(min(e)-1)*(max(e)-1)                      #upper bound
M=b=((l<-length(e)-1)*prod(e))^(1/l)-sum(e)  #lower bound
for(x in a:b){sol=0
  for(i in 0:(x/e[1]))sol=max(sol,o(i,e,x))
  M=max(M,x*!sol)}

(And this led me to recover the earlier ‘Og entry on the coin problem! As of last November.) The published solution does not bring any useful light as to why 383 is the solution, except for demonstrating that 383 is non-representable and any larger integer is representable.
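
For readers without Mathematica, a brute-force dynamic-programming sketch (separate from the recursive code above) reproduces both values:

# largest n below 'upper' that is NOT a non-negative integer combination of e
frobenius_bruteforce <- function(e, upper = 1000) {
  reachable <- logical(upper + 1)   # reachable[n + 1] <=> n is representable
  reachable[1] <- TRUE              # 0 is the empty combination
  for (n in 1:upper) {
    reachable[n + 1] <- any(n - e >= 0 & reachable[pmax(n - e, 0) + 1])
  }
  max(which(!reachable)) - 1
}

frobenius_bruteforce(c(105, 30, 14))      # 271, as reported above
frobenius_bruteforce(c(105, 70, 42, 30))  # 383, as reported above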


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post Le Monde puzzle [#1164] first appeared on R-bloggers.

An Attempt at Tweaking the Electoral College


[This article was first published on R | JLaw's R Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

With the 2020 Election wrapping up and a renewed discussion about the merits of the Electoral College, I’ve been thinking more about the system and why it might be the way it is. While I understand the rationale of why doing a complete popular vote would have unintended consequences, I personally feel like the current system has overly valued small states by virtue of having a minimum of 3 electoral votes. My personal hypothesis is that we have too many states. Therefore, my solution would be to start combining the small states so that they meet a minimum threshold of the US population. I fully recognize that this would be completely infeasible in practice… but this is just a humble blog. So this analysis will attempt to accomplish three things:

  1. When comparing the population from 1792 vs. 2020, do states generally represent smaller percentages of the US Population? (Do we have too many states from an Electoral College perspective?)

  2. How could a new system be devised by combining states to reach a minimum population threshold?

  3. Would this new system have impacted the results of the 2016 election? (At the time of writing, votes for the 2020 election are still being counted).

Gathering Data

Throughout this post, a number of difference libraries will be used as outputs will include plots, maps, and tables:

Loading Libraries

library(rvest)      # Web-Scraping
library(tidyverse)  # Data Cleaning and Plotting
library(janitor)    # Data Cleaning
library(sf)         # Manipulate Geographic Objects
library(httr)       # Used to Download Excel File from Web
library(readxl)     # Read in Excel Files
library(kableExtra) # Create HTML Tables

Getting the US Population by State in 1790

Data from the 1790 US Census will be gathered from Wikipedia and scraped using the rvest package. In the following code block, all table tags will be extracted from the webpage and then I guessed and checked until I found the table I was looking for (in this case what I wanted was the 3rd table). The html_table() function converts the HTML table into a data frame and clean_names() from the janitor package will change the column headers into an R friendly format.

Finally, stringr::str_remove_all() will use regular expressions to remove the footnote notation “[X]” from the totals and readr::parse_number() will convert the character variable with commas into a numeric.

us_pop_1790 <- read_html('https://en.wikipedia.org/wiki/1790_United_States_Census') %>%
  html_nodes("table") %>% 
  .[[3]] %>% 
  html_table() %>% 
  clean_names() %>% 
  filter(state_or_territory != 'Total') %>% 
  transmute(
    state = state_or_territory,
    population_1790 = str_remove_all(total, '\\[.+\\]') %>% 
      parse_number(),
    population_percent_1790 = population_1790/sum(population_1790)
  )

Getting US Population by State in 2019

A similar process will be used to get the population estimates for 2019 from Wikipedia. In this case there is only 1 table on the page so html_node('table') can be used rather than html_nodes('table') like in the above code block for 1790.

us_pop_2019 <- read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population') %>% 
  html_node('table') %>% 
  html_table() %>% 
  clean_names() %>% 
  filter(!is.na(estimated_population_per_electoral_vote_2019_note_2),
         !estimated_population_per_electoral_vote_2019_note_2 %in% c('', '—'),
         rank_in_states_territories_2010 != '—') %>%
  transmute(
    state,
    population_2019 = parse_number(population_estimate_july_1_2019_2),
    population_percent_2019 = population_2019 / sum(population_2019)
  )

Getting # of Electoral Votes for Each State by Year

Finally, the table containing number of electoral votes by state by year will be extracted from Wikipedia. New code pieces for this code block are the use of selecting columns by number in the dplyr::select() and dplyr::rename() calls. Also, the use of dplyr::across() which in this context is a replacement for mutate_if, mutate_at, and mutate_all. Here I tell the mutate() to take all variables that start with “electoral votes” and apply the readr::parse_number() function to them keeping the names the same. We’ll use this data set later on.

electoral_votes <- read_html('https://en.wikipedia.org/wiki/United_States_Electoral_College') %>% 
  html_nodes("table") %>% 
  .[[5]] %>% 
  html_table(fill = T) %>% 
  select(2, 4, 36) %>% 
  filter(!Electionyear %in% c('Total', 'Electionyear', "State")) %>% 
  rename(state = 1, electoral_votes_1792 = 2, electoral_votes_2020 = 3) %>% 
  mutate(across(starts_with('electoral_votes'), parse_number))

Q1: Do states today represent smaller proportions of the population than they did when the Electoral College was formed?

My hypothesis is that the electoral college has become less effective because we’ve added too many small states that reflect minor amounts of the US population and that when the Electoral College was established the population distributions of states were more similar.

To check this I’ll be comparing the distributions of State populations as a % of the Total US Population for 1790 and 2019. One note before getting into the code is that in the article for the 1790 state population, Maine is given its own row. However, Maine was a part of Massachusetts until 1820, so since we’re more focused on “electing blocks” rather than states I will merge Maine into Massachusetts.

For this next code block, I join the two population data sets together and then all numeric variables summarized. Then, I melt the population percentages by year into a long-form data frame. Finally, I extract the numeric year from the variable names and compare the box plots of the % of Total Population for each State from 1790 and 2019.

us_pop_2019 %>% 
  left_join(
    us_pop_1790 %>% 
      mutate(state = if_else(state == 'Maine', 'Massachusetts', state)) %>% 
      group_by(state) %>% 
      summarize(across(where(is.numeric), sum)),
    by = "state"
  ) %>% 
  pivot_longer(
    cols = c(contains("percent")),
    names_to = "year",
    values_to = "population_dist"
  ) %>% 
  mutate(year = str_extract(year, '\\d+') %>% as.integer) %>% 
  ggplot(aes(x = fct_rev(factor(year)), y = population_dist, 
             fill = factor(year))) + 
    geom_boxplot() + 
    labs(x = "Year", y = "Population Distribution", 
         title = "State Population Distribution by % of US Population") +
    annotate('linerange', y = 1/nrow(us_pop_2019), 
             xmin = .6, xmax = 1.45, lty = 2) + 
    annotate('linerange', y = 1/(nrow(us_pop_1790)-1), 
             xmin = 1.6, xmax = 2.45, lty = 2) + 
    scale_y_continuous(label = scales::percent_format(accuracy = 1)) + 
    scale_fill_discrete(guide = F) +
    coord_flip() +
    theme_bw()

In the chart above we’re looking at the distribution of states by the % of the total US population they make up. The dashed lines represent the expected values if all states had the same amount. For example, there are 51 “voting bodies” that make up 100% of the US population, so the “expected” amount would be 1/51 or 2.0%. In 1790, the largest state made up 19.2% and the smallest state made up 1.5% of the total population. In 2019, the largest state makes up 12% of the total population and the smallest makes up 0.2% of the total population.

While some of this is due to having more states, which means the same 100% is being cut into more pieces, another way to see whether states make up smaller pieces of the population today than back then is to compare the data to those expected values from before. In the case of 1790, there are 15 voting bodies, so on average we’d expect each state to make up 6.7%. And when looking at the distribution of the states in 1790, 60% are below the expected amount of 6.7%. This is compared to the distribution in 2019, where 67% are below the expected amount of 2.0%.

When asking whether or not there are more small states in 2019 vs. 1790, I find that 28 of the 51 states (with DC) [55%] have a % of the US Population smaller than the minimum state from 1790 [1.5%]. These 28 states make up 141 or 26% of the 538 electoral votes.

So while there’s not a large difference between actual and expected, it does seem that we have a greater concentration of smaller-population states now than when the Electoral College was first established, based on the share that make up less than 1.5% of the US population.

Q2. How could states be combined to ensure each “voting group” meets a minimum population threshold?

The fact that 55% of states have a % of 2019 US Population smaller than the smallest percentage in 1790 gives promise to the idea that combining states could be feasible. So for this exercise, I’ll combine states together in order to ensure that each group has at least a minimum of 1.5% of the US Population.

Originally I had wanted to come up with a cool algorithm to find the optimal solution to ensure that each state group hit the 1.5% while taking into account the location of the states being combined and the political culture of the states… but alas I couldn’t figure out how to do it. So I combined the states manually taking into account geography but completely ignoring how states usually vote. In my new construction the following states get combined:

  • Alaska & Oregon
  • Arkansas & Mississippi
  • Connecticut & Rhode Island
  • Washington DC, Delaware, and West Virginia
  • Hawaii & Nevada
  • Iowa & Nebraska
  • Idaho, Montana, North Dakota, South Dakota, and Wyoming
  • Kansas & Oklahoma
  • New Hampshire, Maine, and Vermont
  • New Mexico & Utah
new_groupings <- us_pop_2019 %>% 
  mutate(
    state = if_else(state == 'D.C.', 'District of Columbia', state),
    new_grouping = case_when(
      state %in% c('New Hampshire', 'Maine', 'Vermont') ~ 'NH/ME/VT',
      state %in% c('Rhode Island', 'Connecticut') ~ 'CT/RI',
      state %in% c('West Virginia', 'Delaware', 'District of Columbia') ~ 
        'DC/DE/WV',
      state %in% c('Alaska', 'Oregon') ~ 'AK/OR',
      state %in% c('Utah', 'New Mexico') ~ 'NM/UT',
      state %in% c('Hawaii', 'Nevada') ~ 'HI/NV',
      state %in% c('Idaho', 'Montana', 'North Dakota', 
                   'South Dakota', 'Wyoming') ~ 'ID/MT/ND/SD/WY',
      state %in% c('Iowa', 'Nebraska') ~ 'IA/NE',
      state %in% c('Arkansas', 'Mississippi') ~ 'AR/MS',
      state %in% c('Oklahoma', 'Kansas') ~ 'KS/OK',
      TRUE ~ state
    )
  )

To display this brave new world, I will construct a map that shows my new compressed electoral map and the resulting changes in the number of electoral votes. The first step is adding the electoral votes into the data frame constructed in the last code block:

new_groupings <- new_groupings %>% 
  left_join(
    electoral_votes %>% 
      transmute(state = if_else(state == 'D.C.', 'District of Columbia', state),
                electoral_votes_2020),
    by = "state"
  ) 

Next, I need a mechanism to assign a number of electoral votes to my compressed map. Normally, there are 538 electoral votes representing the 435 voting members of Congress, the 100 Senators, and 3 additional electoral votes for Washington DC. Since I’m not trying to rock the boat too much, my new system will maintain the 2 votes per group represented by the Senate allocation and the population-based allocation from the Congressional side. In order to understand and apply this relationship, I’m building a quick and dirty linear regression model to predict the population component of the new number of electoral votes.

electorial_vote_model <- lm(electoral_votes_2020 - 2 ~ population_2019, 
                            data = new_groupings)
electorial_vote_model
## 
## Call:
## lm(formula = electoral_votes_2020 - 2 ~ population_2019, data = new_groupings)
## 
## Coefficients:
##     (Intercept)  population_2019  
##     0.094428506      0.000001313

This model shows that there are roughly 1.31 electoral votes per 1 million people.
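
As a quick illustrative sketch of how this regression is applied below (the population figure here is made up purely for illustration), the population-based component is predicted and the 2 senate-style votes are added back:

predict(electorial_vote_model, newdata = data.frame(population_2019 = 3.3e6)) + 2
# roughly 0.09 + 1.31 * 3.3 + 2, i.e. about 6.4 votes before rounding up with ceiling()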

To visualize what this new electoral map will look map, I will use the sf package. While I’m not very familiar with this package (maybe a subject of a future post), I’ve tinkered around with the format before and have found it very compatible with tidy principles.

The first step is getting a shape file. For the United States, I will leverage the usa_sf function from the albersusa package which will return a map as a simple feature. The “laea” represents the projection.

usa <- albersusa::usa_sf("laea") %>% select(name, geometry)
knitr::kable(head(usa))
name                 | geometry
---------------------|-----------------------------
Arizona              | MULTIPOLYGON (((-1111066 -8…
Arkansas             | MULTIPOLYGON (((557903.1 -1…
California           | MULTIPOLYGON (((-1853480 -9…
Colorado             | MULTIPOLYGON (((-613452.9 -…
Connecticut          | MULTIPOLYGON (((2226838 519…
District of Columbia | MULTIPOLYGON (((1960720 -41…

What makes the magic of the sf class is that the shape information is contained in the geometry column, but everything else can be operated on like a normal data frame. So for the next step, I’ll join the “state groupings” information to this shape file data using the “name” column from the shape data and the state column from the groupings data.

Next, I summarize the data to the “combined state grouping” level, where I get the sums of the population and the number of original electoral votes. The noteworthy parts of this summarize statement (and the mutate that follows) are:

  • st_union which will combine geographic areas from the shape file into new shapes. If you wanted to combine the groups but maintain all original boundaries then st_combine would be used instead.
  • Creating a better label for the combined state names by using paste in the summarize with the collapse option which concatenates the states in the aggregation.
  • The final mutate step uses the predict function to apply the regression model to compute the new electoral vote values for the combined states. Any state that wasn’t combined retained its original number of votes.

Afterwards, the new data set looks like:

new_usa <- usa %>% 
  left_join(new_groupings %>% 
              transmute(state, 
                        new_grouping, 
                        population_2019, 
                        electoral_votes_2020),
            by = c("name" = "state")
  ) %>% 
  group_by(new_grouping) %>% 
  summarize(
    geom = st_union(geometry),
    population_2019 = sum(population_2019),
    electoral_votes = sum(electoral_votes_2020),
    states = paste(name, collapse = '/')
  ) %>% 
  mutate(
    new_ev = if_else(
      states == new_grouping,
      electoral_votes,
      ceiling(predict(electorial_vote_model, newdata = .) + 2)
    ),
    lbl = if_else(new_grouping == states, NA_character_, 
                  paste0(new_grouping, ": ", new_ev - electoral_votes)))

knitr::kable(head(new_usa))
new_grouping | geom                         | population_2019 | electoral_votes | states               | new_ev | lbl
-------------|------------------------------|-----------------|-----------------|----------------------|--------|----------
AK/OR        | MULTIPOLYGON (((-1899337 -2… | 4949282         | 10              | Oregon/Alaska        | 9      | AK/OR: -1
Alabama      | MULTIPOLYGON (((1145349 -15… | 4903185         | 9               | Alabama              | 9      | NA
AR/MS        | MULTIPOLYGON (((1052956 -15… | 5993974         | 12              | Arkansas/Mississippi | 10     | AR/MS: -2
Arizona      | MULTIPOLYGON (((-1111066 -8… | 7278717         | 11              | Arizona              | 11     | NA
California   | MULTIPOLYGON (((-1853480 -9… | 39512223        | 55              | California           | 55     | NA
Colorado     | MULTIPOLYGON (((-613452.9 -… | 5758736         | 9               | Colorado             | 9      | NA

Now we’re ready to plot the map. Plotting sf geometries work within the ggplot paradigm where geom_sf will draw the geometries and geom_sf_text will handle the overlays for the given groups. coord_sf changes the coordinate system of the plot. And everything else should be familiar from vanilla ggplot.

new_usa %>% 
  ggplot() +
  geom_sf(color = "#2b2b2b", size = 0.125, aes(fill = lbl)) +
  geom_sf_text(aes(label = lbl), check_overlap = T, size = 3) + 
  coord_sf(crs = st_crs("+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"), datum = NA) +
  scale_fill_discrete(guide = F, na.value = "grey90") + 
  labs(title = "Proposed Electoral Map",
       subtitle = "Combining States so each 'Group' makes up at least ~1.5% of US Population",
       caption = "Number represents the change in Electoral Votes due to combining") + 
  ggthemes::theme_map() + 
  theme(
    plot.title = element_text(size = 14)
  )

The states in gray remained unchanged and the filled in states represent our new groupings. The states that directly border each other have been combined into an “electoral grouping” with a newly assigned number of electoral votes. Since the electoral vote model was based on population, the change in the number of electoral votes comes primarily from the loss of the two senate votes for each combined state.

For example, NH/ME/VT originally would have had 11 electoral votes and under the new system will have 7, a net change of -4 from losing the 2 senate-style votes for two of the three combined states.

Under the normal electoral college there were 538 votes and under this new system that number is reduced to 512.

Now that we have our new electoral college, would it have made a difference in 2016?

Q3: Would this new system have impacted the results of the 2016 election?

The 2016 election results between Donald Trump and Hillary Clinton are provided in great detail by the Federal Election Commission. Surprisingly, it was difficult to find the number of votes by state in an easily consumable way where I wouldn’t have to recode all the state names. So the FEC data will have to do, even if it took some complicated data manipulation.

Since the FEC data comes as an Excel file, I first need to download it from the FEC website. I'll use the GET function from httr to download the Excel file to a temporary file and then read_excel from readxl to read it in.

After filtering to just Trump and Clinton, but before any other data manipulation, the data looks like this:

GET("https://www.fec.gov/documents/1890/federalelections2016.xlsx",     write_disk(tf <- tempfile(fileext = ".xlsx")))results2016 <- read_excel(tf, sheet = '2016 Pres General Results') %>%   clean_names() %>%   filter(last_name %in% c('Trump', 'Clinton')) %>%   select(state, state_abbreviation, last_name, general_results)knitr::kable(head(results2016, 5))
|state   |state_abbreviation |last_name | general_results|
|:-------|:------------------|:---------|---------------:|
|Alabama |AL                 |Trump     |         1318255|
|Alabama |AL                 |Clinton   |          729547|
|Alaska  |AK                 |Trump     |          163387|
|Alaska  |AK                 |Clinton   |          116454|
|Arizona |AZ                 |Trump     |         1252401|

There was a small data quirk with New York state: because the same candidate can appear on multiple party lines, a single candidate shows up in multiple rows (Clinton appears 4 times and Trump 3). Therefore a first group-by is done to reduce the data to two rows per state. Then the data is cast to a wider format, and the electoral votes are added back and allocated to the winning candidate (technically this is wrong since Nebraska and Maine do not use all-or-nothing allocation, but it's close enough for this exercise).
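To make that quirk concrete, here is a tiny illustrative sketch of the collapsing step; the vote counts are made-up round numbers rather than the real New York results, and it assumes the tidyverse is loaded as in the rest of the post:

# illustrative only: made-up round numbers, not the real New York figures
ny_example <- tibble::tribble(
  ~state,     ~state_abbreviation, ~last_name, ~general_results,
  "New York", "NY",                "Clinton",  1000,  # e.g. Democratic line
  "New York", "NY",                "Clinton",  200,   # a second party line
  "New York", "NY",                "Trump",    800,   # e.g. Republican line
  "New York", "NY",                "Trump",    150    # a second party line
)

ny_example %>% 
  group_by(state, state_abbreviation, last_name) %>% 
  summarize(general_results = sum(general_results), .groups = 'drop')
# result: one row per candidate, with the party-line totals summed together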

Then the data is aggregated to the new electoral groupings from the prior section and our "new" electoral votes are allocated in an all-or-nothing fashion to the winning candidate.

results2016 <- results2016 %>% 
  group_by(state, state_abbreviation, last_name) %>% 
  summarize(general_results = sum(general_results, na.rm = T),
            .groups = 'drop') %>% 
  pivot_wider(
    names_from = "last_name",
    values_from = "general_results"
  ) %>% 
  left_join(
    new_groupings %>% 
      select(state, new_grouping, electoral_votes_2020, population_2019),
    by = "state"
  ) %>% 
  mutate(trump_ev = (Trump > Clinton)*electoral_votes_2020,
         clinton_ev = (Clinton > Trump)*electoral_votes_2020
  ) %>% 
  group_by(new_grouping) %>% 
  summarize(across(where(is.numeric), sum, na.rm = T),
            states = paste(state, collapse = '/')) %>% 
  mutate(new_ev = if_else(
    states == new_grouping,
    electoral_votes_2020,
    ceiling(predict(electorial_vote_model, newdata = .) + 2)
  )) %>% 
  mutate(
    new_trump_ev = if_else(Trump > Clinton, new_ev, 0),
    new_clinton_ev = if_else(Trump < Clinton, new_ev, 0)
  )

knitr::kable(head(results2016, 5))
|new_grouping | Clinton|   Trump| electoral_votes_2020| population_2019| trump_ev| clinton_ev|states               | new_ev| new_trump_ev| new_clinton_ev|
|:------------|-------:|-------:|--------------------:|---------------:|--------:|----------:|:--------------------|------:|------------:|--------------:|
|AK/OR        | 1118560|  945790|                   10|         4949282|        3|          7|Alaska/Oregon        |      9|            0|              9|
|Alabama      |  729547| 1318255|                    9|         4903185|        9|          0|Alabama              |      9|            9|              0|
|AR/MS        |  865625| 1385586|                   12|         5993974|       12|          0|Arkansas/Mississippi |     10|           10|              0|
|Arizona      | 1161167| 1252401|                   11|         7278717|       11|          0|Arizona              |     11|           11|              0|
|California   | 8753792| 4483814|                   55|        39512223|        0|         55|California           |     55|            0|             55|

Finally, to visualize the difference in electoral votes between the actual 2016 results and our new 2016 results, the prior data set is summarized and reshaped back into a tidy format with the proper labeling. The plot is a simple stacked barplot.

results2016 %>% 
  summarize(across(contains(c("trump_ev", "clinton_ev")), sum)) %>% 
  pivot_longer(cols = everything(),
               names_to = 'variable',
               values_to = 'electoral_votes') %>% 
  group_by(str_detect(variable, 'new')) %>% 
  mutate(
    percents = electoral_votes/sum(electoral_votes),
    old_v_new = if_else(str_detect(variable, 'new'), 'New EC', 'Original EC'),
    candidate = case_when(
      str_detect(variable, 'trump') ~ "trump",
      str_detect(variable, 'clinton') ~ 'clinton',
      TRUE ~ 'total'
    ),
    lbl = paste0(electoral_votes, 
                 '\n(',
                 scales::percent(percents, accuracy = .1), ')')
  ) %>% 
  ggplot(aes(y = old_v_new, x = percents, fill = candidate)) +
    geom_col(width = .5) +
    geom_text(aes(label = lbl), position = position_stack(vjust = .5)) + 
    geom_vline(xintercept = .5, lty = 2) + 
    scale_x_continuous(label = scales::percent, expand = c(0,0)) + 
    scale_fill_manual(values = c('clinton' = 'blue', 'trump' = 'red')) + 
    guides(fill = guide_legend(reverse = T)) + 
    labs(x = "% of Electoral Vote",
         y = "",
         title = "Comparing 2016 Election Results in the Original vs. New System",
         fill = "") + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = 'plot',
      axis.line = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.x = element_blank()
    )

With the new electoral grouping system the net change in percentage of electoral votes was only 0.3%, so the overall result wouldn’t have changed.

What Actually Changed in the New System?

The final question is how the electoral votes changed between the old system and the new system. The tbl_dt data frame restructures the data into the table format, keeping only rows for groupings where the number of electoral votes differs, and I create labels that include the "+" and "-" symbols.

tbl_dt <- results2016 %>% 
  filter(trump_ev != new_trump_ev | clinton_ev != new_clinton_ev) %>% 
  transmute(
    new_grouping,
    clinton_delta = (new_clinton_ev - clinton_ev),
    trump_delta = (new_trump_ev - trump_ev),
    clinton_lbl = paste0(
      if_else(clinton_delta > 0, "+", ""),
      clinton_delta
    ),
    trump_lbl = paste0(
      if_else(trump_delta > 0, "+", ""),
      trump_delta
    )
  ) %>% 
  select(new_grouping, clinton_lbl, trump_lbl)

To complete the table visualization I'm using the kableExtra package. The kable_paper call applies a style theme, and the two uses of column_spec set the cell background to red or green when the label constructed above is non-zero, and white otherwise (which appears blank). This was my first experience with kableExtra, and while I'm happy that I got the table looking the way I wanted, I found certain parts of the syntax a little frustrating.

tbl_dt %>% 
  kbl(align = c('l', 'c', 'c'),
      col.names = c('', 'Clinton', 'Trump'),
      caption = "Election 2016: Candidate's Change in Electoral Votes") %>% 
  kable_paper(full_width = F) %>% 
  column_spec(2, color = 'white', background = case_when(
    str_detect(tbl_dt$clinton_lbl, "\\+") ~ 'green',
    str_detect(tbl_dt$clinton_lbl, "\\-") ~ 'red',
    TRUE ~ 'white'
  )) %>% 
  column_spec(3, color = 'white', background = case_when(
    str_detect(tbl_dt$trump_lbl, "\\+") ~ 'green',
    str_detect(tbl_dt$trump_lbl, "\\-") ~ 'red',
    TRUE ~ 'white'
  ))
Table 1: Election 2016: Candidate's Change in Electoral Votes

|               | Clinton| Trump|
|:--------------|-------:|-----:|
|AK/OR          |      +2|    -3|
|AR/MS          |       0|    -2|
|CT/RI          |      -2|     0|
|DC/DE/WV       |      +1|    -5|
|HI/NV          |      -2|     0|
|IA/NE          |       0|    -2|
|ID/MT/ND/SD/WY |       0|    -7|
|KS/OK          |       0|    -1|
|NH/ME/VT       |      -4|     0|
|NM/UT          |      -5|    +4|

In most cases, votes were lost because smaller states were combined into these groupings, but in a few instances the combination of multiple states changed who won the popular vote. For example, the Alaska/Oregon grouping originally had 10 electoral votes (3 from Alaska, which went to Trump, and 7 from Oregon, which went to Clinton). The grouping lost a vote in the combining, and the combined Oregon/Alaska went to Clinton overall. Therefore Clinton gets all 9 new electoral votes (+2 over her original 7) and Trump loses the 3 he had from Alaska.
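A minimal sketch of that bookkeeping, using only the numbers already shown in the tables above:

# Alaska/Oregon bookkeeping, taken from the tables above
ak_trump_ev     <- 3   # Alaska's votes, won by Trump under the current system
or_clinton_ev   <- 7   # Oregon's votes, won by Clinton under the current system
combined_new_ev <- 9   # the AK/OR grouping's electoral votes under the new system

# Clinton wins the combined grouping, so she takes all 9 of its votes
clinton_delta <- combined_new_ev - or_clinton_ev   # +2
trump_delta   <- 0 - ak_trump_ev                   # -3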

Wrapping Up

Back at the beginning of this analysis I hypothesized that the Electoral College has become more over-weighted towards smaller states than it was in the 1790s, in the early days of the Electoral College. By comparing each state's share of the US population in 1790 vs. 2019, I showed that this is true, although not dramatically so.

I proposed an idea to revise the Electoral College by combining states so that each grouping makes up at least 1.5% of the US population, the smallest share any state held in 1790. This reduced the overall number of electoral votes because the combined states no longer each keep their automatic 2 Senate-based votes.

Finally, I applied my new Electoral College to the 2016 election… it made almost no difference.

So overall, this thought exercise was fun to work through, but it ends up making an incredibly small change to the results relative to the current system.


To leave a comment for the author, please follow the link and comment on their blog: R | JLaw's R Blog.


The post An Attempt at Tweaking the Electoral College first appeared on R-bloggers.

RStudio 1.4 Preview: New Features in RStudio Server Pro


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This blog post is part of a series on new features in RStudio 1.4, currently available as a preview release.

Today, we’re going to talk about what’s new in RStudio Server Pro (RSP) 1.4. The 1.4 release includes integration with a frequently requested editor (VS Code), several quality of life improvements for working with Launcher environments, new user administration commands, and long-awaited SAML support! Let’s get started!

RStudio Server Pro

Single Sign-On Authentication with SAML 2.0 & OpenID Connect

RSP 1.4 comes with native support for SAML and OpenID authentication for Single Sign-On. This allows RSP to leverage any authentication capabilities provided by your organization's Identity Management system, such as multi-factor authentication.

Even when using SSO authentication with SAML or OpenID, RSP continues to require local system accounts. Similar to the authentication mechanisms supported previously by RSP, automatic account creation (provisioning) can be done via sssd integration with your LDAP or Active Directory and with RSP configured to use PAM sessions. You can find more information in the admin guide here and here.

If you already have LDAP or Active Directory integration working with RSP with PAM or proxied authentication, getting SAML or OpenID working is just a matter of configuring both RSP and your organization’s Identity Management to trust each other. We have some migration recommendations described here.

When configuring your Identity Management, the only information RSP needs to know about each user is their local account username, so this information is required in assertions or claims sent during authentication. By default, RSP expects an attribute called “Username” (case-sensitive) for SAML and a claim called “preferred_username” for OpenID, but those can be customized if necessary.

Note that RSP will not be able to use email addresses or any other user identifier for authentication purposes. If sssd integration is used, the username received by RSP must exactly match the one provided by sssd for the same user.

The admin guide contains more information on how to configure SAML and OpenID.

Note: SAML and OpenID cannot yet be configured with Google because it does not provide usernames, only emails. If Google is your preferred authentication, you can keep using it, but be aware it will be deprecated in a future release. We will provide a migration path from Google accounts to OpenID at that time.

VS Code Sessions (Preview)

Many data science teams use VS Code side by side with RStudio as a tool for reproducible research. In this RSP update, we’re making it easier to use these tools together; you can now run VS Code sessions in addition to RStudio and Jupyter sessions inside RSP, providing your data scientists with all of the editing tools they need to do their data science more effectively!

Just like RStudio sessions, RSP manages all of the authentication and supervision of VS Code sessions, while providing you a convenient dashboard of running sessions. Starting a new VS Code session is as easy as choosing VS Code when you start a new session.

VS Code Session

Note that RStudio does not bundle VS Code (it must be installed separately) and that VS Code is only available when RSP is configured with the Job Launcher. The VS Code editing experience is provided by the open source code-server, which must be installed and configured in order to be used. This setup can be done by simply running the command sudo rstudio-server install-vs-code, which will download all the necessary binaries and automatically configure the /etc/rstudio/vscode.conf file that enables VS Code integration. See the admin guide for more details.

Currently, VS Code Sessions are a Preview feature. The feature itself is stable and usable, but you may find some bugs, and we are still working to complete some aspects of the VS Code development workflow. We highly encourage you to submit your feedback to let us know how we can improve!

Job Launcher Project Sharing

In previous versions of RSP, use of the Job Launcher automatically prevented you from using the Project Sharing and Realtime Collaboration features within RStudio sessions. We’re excited to announce that this limitation has now been removed, and you can share projects within Launcher sessions just the same as with regular sessions.

By default, when selecting the users to share a project with from within a session, only users who have signed in and used RSP will be shown, whereas previously all of the system's users were displayed. That previous behavior could be slow and unwieldy in some cases, and it also makes no sense in containerized environments (e.g., Kubernetes). The old behavior can be restored by setting project-sharing-enumerate-server-users=1 in the /etc/rstudio/rsession.conf configuration file.

Project Sharing

Local Launcher Load Balancing

In previous versions of RSP, if you wanted to load balance your sessions between multiple nodes running the Local Job Launcher plugin, you had to use an external load balancer to balance traffic between Job Launcher nodes. In RSP 1.4, load balancing has been improved when used with the Local Launcher to ensure that sessions are automatically load balanced across Launcher nodes that are running RSP and configured in the load balancer configuration file /etc/rstudio/load-balancer. Simply ensure that each RSP instance is configured to connect to its node-local Launcher instance. For more details, see the admin guide.

User Administration

RSP 1.3 introduced the ability to track named user licenses visually in the admin dashboard, as well as the ability to lock users that are no longer using RSP to free up license slots. In 1.4, we have added new admin commands to perform these operations from the command line instead of having to use the GUI. These commands allow you to:

  • List all RSP users
  • Add new users before they have signed in, indicating whether or not they should have administrator privileges
  • Change the admin status of a user
  • Lock and unlock users

Documentation for these commands can be found in the admin guide.


If you’re interested in giving the new RStudio Server Pro features a try, please download the RStudio 1.4 preview. Note that RStudio Server Pro 1.4 requires database connectivity; see the admin guide for full documentation on prerequisites.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


The post RStudio 1.4 Preview: New Features in RStudio Server Pro first appeared on R-bloggers.


