
Evaluation of the new palette() for R

[This article was first published on rstats on Jakub Nowosad's website, and kindly contributed to R-bloggers.]

R version 4.0 is just around the corner. One of the changes in the new version is an improved default color palette, returned by the palette() function.

Proposed colors

The new proposed palette() default is less saturated and more balanced, while at the same time, it follows the same basic pattern of colors (hues). You can read more about it at https://developer.r-project.org/Blog/public/2019/11/21/a-new-palette-for-r/.

col_ver3 = c("#000000", "#FF0000", "#00CD00", "#0000FF",
             "#00FFFF", "#FF00FF", "#FFFF00", "#BEBEBE")
col_ver4 = c("#000000", "#DF536B", "#61D04F", "#2297E6",
             "#28E2E5", "#CD0BBC", "#EEC21F", "#9E9E9E")

library(colorspace)
swatchplot("Version 3" = col_ver3,
           "Version 4" = col_ver4)

This proposal is still being discussed and modified, as mentioned by Achim Zeileis on Twitter:

Call for feedback: We modified the #rstats palette() proposal from last week (https://t.co/g4gvHWQ0oy) so that magenta (#6) can be better distinguished from blue (#4) for colorblind viewers. Is this a reasonable improvement, esp. for deuteranopes? @apcamargo_ @bmwiernik pic.twitter.com/gjxNqm5F5k

— Achim Zeileis (@AchimZeileis) November 26, 2019

Therefore, I decided it is a good time to test the properties of the proposed palette() default for color vision deficiencies – deuteranopia, protanopia, and tritanopia.

Comparison between palettes

I used the colorblindcheck package for this purpose.

# remotes::install_github("nowosad/colorblindcheck")
library(colorblindcheck)

This tiny R package provides tools to help decide whether a selected color palette is colorblind-friendly. You can see examples of its use at https://nowosad.github.io/colorblindcheck.

The primary function in this package is palette_check(), which creates summary statistics comparing the original input palette and simulations of color vision deficiencies.

palette_check(col_ver3, plot = TRUE)

##           name n tolerance ncp ndcp  min_dist mean_dist  max_dist
## 1       normal 8  26.64945  28   28 26.649454  58.57976 105.67883
## 2 deuteranopia 8  26.64945  28   26 12.790929  51.65788  99.81401
## 3   protanopia 8  26.64945  28   24  4.337187  54.06193  95.06426
## 4   tritanopia 8  26.64945  28   23 13.934054  51.49383  90.45153

Visual inspection of the old palette() default shows that it is not suitable for people with color vision deficiencies. For example, people with protanopia could have problems distinguishing the first from the second color and the fourth from the sixth color. This problem is also confirmed in the summary statistics, where the minimal distance between colors of the original palette is about 26, while it is only about 4 for protanopia.

palette_check(col_ver4, plot = TRUE)

##           name n tolerance ncp ndcp  min_dist mean_dist max_dist
## 1       normal 8  23.51878  28   28 23.518780  50.21307 95.04017
## 2 deuteranopia 8  23.51878  28   22 12.094062  41.11547 78.45654
## 3   protanopia 8  23.51878  28   24  5.402646  42.28841 81.02547
## 4   tritanopia 8  23.51878  28   22 11.032589  44.47677 83.19068

The proposed palette() looks considerably better, as it is easier to distinguish between colors for each color vision deficiency. However, the minimal distance between colors for protanopia is just marginally better with a value of about 5.

Protanomaly

Let’s use the palette_dist() function to compare each pair of colors in the old and proposed palette() using the protanopia color vision deficiency.

palette_dist(col_ver3, cvd = "pro")
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,]   NA 13.89 63.29 40.08 85.59 41.23 90.98 66.30
## [2,]   NA    NA 54.81 52.02 82.32 52.17 82.02 62.61
## [3,]   NA    NA    NA 78.71 43.28 75.54 18.13 28.28
## [4,]   NA    NA    NA    NA 52.63  4.34 95.06 52.40
## [5,]   NA    NA    NA    NA    NA 48.08 44.74 14.58
## [6,]   NA    NA    NA    NA    NA    NA 91.20 48.37
## [7,]   NA    NA    NA    NA    NA    NA    NA 31.09
## [8,]   NA    NA    NA    NA    NA    NA    NA    NA

The shortest distance between colors in the old palette() default was between the fourth and sixth color (4.33).

palette_dist(col_ver4, cvd = "pro")
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,]   NA 30.25 68.88 53.60 74.53 35.78 71.67 51.95
## [2,]   NA    NA 41.26 27.23 34.35 26.52 43.84 21.75
## [3,]   NA    NA    NA 55.72 37.58 78.62  5.40 26.47
## [4,]   NA    NA    NA    NA 20.76 31.85 58.06 23.89
## [5,]   NA    NA    NA    NA    NA 51.88 39.84 16.37
## [6,]   NA    NA    NA    NA    NA    NA 81.03 45.51
## [7,]   NA    NA    NA    NA    NA    NA    NA 29.48
## [8,]   NA    NA    NA    NA    NA    NA    NA    NA

This pair of colors is substantially more distinguishable in the new proposed palette() default with a distance of about 32. However, the shortest distance in this palette was 5.40 between the third and seventh color.

Summary

The new proposed palette() default is a step in the right direction with more balanced luminance while keeping similar hues to the old version. This constraint, however, results in having a pair of very similar colors for people with protanopia.

What can be done then to ensure that the color palette we use is colorblind friendly? Fortunately, there are many additional color palettes available in R that can be used, including the palettes introduced in R 3.6 via the hcl.colors() function. Read more about them at https://developer.r-project.org/Blog/public/2019/04/01/hcl-based-color-palettes-in-grdevices/ or see them for yourself using example("hcl.colors"). Additionally, a new palette.colors() function will be added to R 4.0 with several sensible predefined palettes for representing qualitative data.
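
For a quick look at a couple of these alternatives (hcl.colors() requires R >= 3.6; the commented palette.colors() call only works from R 4.0 onwards):

hcl.colors(8, palette = "Dark 3")
# palette.colors(8, palette = "Okabe-Ito")  # from R 4.0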



Advent of Code 2019-01 with R & JavaScript

[This article was first published on Colin Fay, and kindly contributed to R-bloggers.]

Solving Advent of Code 2019-01 with R and JavaScript.

[Disclaimer] Obviously, this post contains a big spoiler about Advent of Code, as it gives solutions for solving day 1.

[Disclaimer bis] I’m no JavaScript expert so this might not be the perfect solution. TBH, that’s also the case for the R solution.

About the JavaScript code

The JavaScript code has been written in the same RMarkdown as the R code. It runs thanks to the {bubble} package: https://github.com/ColinFay/bubble

Instructions

Find the instructions at: https://adventofcode.com/2019/day/1

R solution

Part one

# Read
ipt <- read.delim("input1.txt", header = FALSE)
# Get the sum of each element, divided by 3, rounded down, and subtracted 2
sum(floor(ipt$V1 / 3) - 2)
## [1] 3361299

Part two

Using a recursive function: https://en.wikipedia.org/wiki/Recursion_(computer_science)

floorish <- function(x, start = 0){
  loc <- floor(x / 3) - 2
  if (loc > 0){
    start <- start + loc
    floorish(loc, start)
  } else {
    return(start)
  }
}
sum(purrr::map_dbl(ipt$V1, floorish))
## [1] 5039071

JS solution

Part one & Two

var fs = require('fs');
// Reading the file
var res = fs.readFileSync("input1.txt", 'utf8').split("\n").filter(x => x.length != 0);
// Turning to integer
res = res.map(x => parseInt(x));
// Doing the floor of division less 2
var val = res.map(x => Math.floor(x / 3) - 2);
// Summing
var add = (x, y) => x + y;
// Solution
console.log(val.reduce(add));
// Creating the recursive function
function floorish(val, start = 0){
  loc = Math.floor(val / 3) - 2;
  if (loc > 0){
    start += loc;
    return floorish(loc, start);
  } else {
    return start;
  }
}
// Doing the computation
console.log(res.map(x => floorish(x)).reduce(add));
## 3361299
## 5039071

Patch it up and send it out

[This article was first published on Data Imaginist, and kindly contributed to R-bloggers.]

I am super, super thrilled to finally be able to announce that patchwork has been released on CRAN. Patchwork has, without a doubt, been my most popular unreleased package and it is great to finally make it available to everyone.

Patchwork is a package for composing plots, i.e. placing multiple plots together in the same figure. It is not the only package that tries to solve this. grid.arrange() from gridExtra, and plot_grid() from cowplot are two popular choices while some will claim that all you need is base graphics and layout() (they would be wrong, though). Do we really need another package for this? I personally feel that patchwork brings enough innovation to the table to justify its existence, but if you are a happy user of cowplot::plot_grid() I’m not here to force you away from that joy.
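
For reference, this is roughly what the same kind of composition looks like with the two alternatives mentioned above (a minimal sketch with throwaway plots, not code from the original post):

library(ggplot2)
pa <- ggplot(mpg) + geom_point(aes(displ, hwy))
pb <- ggplot(mpg) + geom_bar(aes(class))

gridExtra::grid.arrange(pa, pb, ncol = 2)  # gridExtra
cowplot::plot_grid(pa, pb)                 # cowplot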

The claim to fame of patchwork is mainly two things: A very intuitive API, and a layout engine that promises to keep your plots aligned no matter how complex a layout you concoct.

library(ggplot2)
library(patchwork)

p1 <- ggplot(mpg) + 
  geom_point(aes(hwy, displ))
p2 <- ggplot(mpg) + 
  geom_bar(aes(manufacturer, fill = stat(count))) + 
  coord_flip()

# patchwork allows you to add plots together
p1 + p2

If you find this intriguing, you should at least give patchwork a passing glance. I’ve already written at length about all of its features at its webpage, so if you don’t want to entertain my ramblings more than necessary, make haste to the Getting Started guide, or one of the in-depth guides covering:

The Patch that Worked

If you are still here, I’ll tell you a bit more about the package, and round up with some examples of my favorite features in patchwork. As I described in my look back at 2017 patchwork helped me out of burn-out fueled by increasing maintenance burdens of old packages. At that time I don’t think I expected two years to pass before it got its proper release, but here we are… What I don’t really go into is why I started on the package. The truth is that I was beginning to think about the new gganimate API, but was unsure whether it was possible to add completely foreign objects to ggplots, alter how it behaves, while still allowing normal ggplot2 objects to be added afterwards. I was not prepared to create a POC of gganimate to test it out at this point, so I came up with the idea of trying to allow plots to be added together. The new behavior was that the two plots would be placed beside each other, and the last plot would still be able to receive new ggplot objects. It worked, obviously, and I began to explore this idea a bit more, adding more capabilities. I consciously didn’t advertise this package at all. I was still burned out and didn’t want to do anything for anyone but myself, but someone picked it up from my github and made a moderately viral tweet about it, so it quickly became popular despite my intentions. I often joke that patchwork is my most elaborate tech-demo to date.

All that being said, I was in search of a better way to compose plots (I think most R users have cursed about misaligned axes and butchered facet_wrap() into a layout engine) and I now had a blurry vision of a solution, so I had to take it out of tech-demo land and begin to treat it as a real package. But along came gganimate and swallowed up all my development time. Further, I had hit a snag in how nested layouts worked that meant backgrounds and other elements were lost. This snag was due to a fundamental part of why patchwork otherwise worked so well, so I was honestly in no rush to get back to fixing it.

So patchwork lingered, unreleased…

At the start of 2019 I decided that the year should be dedicated to finishing off updates and unreleased packages, and by November only patchwork remained. I was still not feeling super excited about getting back to the aforementioned snag, but I saw no way out so I dived in. After having explored uncharted areas of grid in search of something that could align the layout engine implementation with not removing backgrounds etc., I was ready to throw it all out, but I decided to see how hard it would be to simply rewrite a subset of the layout engine. One day later I had a solution… There is a moral in there somewhere, I’m sure — feel free to use it.

The Golden Patches

I don’t want to repeat what I’ve written about at length in the guides I linked to in the beginning of the post, so instead I’ll end with simply a few of my favorite parts of patchwork. There will be little explanation about the code (again, check out the guides), so consider this a blindfolded tasting menu.

# A few more plots to play with
p3 <- ggplot(mpg) + 
  geom_smooth(aes(hwy, cty)) + 
  facet_wrap(~year)
p4 <- ggplot(mpg) + 
  geom_tile(aes(factor(cyl), drv, fill = stat(count)), stat = 'bin2d')

Human-Centered API

Patchwork implements a few API innovations to make plot composition both quick and readable. Consider this code:

(p1 | p2) /
  p3

It is not too difficult to envision what kind of composition comes out of this and, lo and behold, it does exactly what is expected:

As layout complexity increases, the use of operators gets less and less readable. Patchwork allows you to provide a textual representation of the layout instead, which scales much better:

layout <- 'ABBCCD'
p1 + p2 + p3 + p4 + plot_layout(design = layout)

Capable auto-tagging

When plot compositions are used in scientific literature, the subplots are often enumerated so they can be referred to in the figure caption and text. While you could do that manually, it is much easier to let patchwork do it for you.

patchwork <- (p4 | p2) /
               p1
patchwork + plot_annotation(tag_levels = 'A')

If you have a nested layout, as in the above, you can even tell patchwork to create a new tagging level for it:

patchwork <- ((p4 | p2) + plot_layout(tag_level = 'new')) /
                p1
patchwork + plot_annotation(tag_levels = c('A', '1'))

It allows you to modify subplots all at once

What if you want to play around with the theme? Do you begin to change the theme of all of your subplots? No, you use the & operator that allows you to add ggplot elements to all your subplots:

patchwork & theme_minimal()

It shepherds the guides

Look at the plot above. The guides are annoying, right? Let’s put them together:

patchwork + plot_layout(guides = 'collect')

That is visually better, but really we only want a single guide for the fill. patchwork will remove duplicates, but only if they are alike. If we give them the same range, we get what we want:

patchwork <- patchwork & scale_fill_continuous(limits = c(0, 60))
patchwork + plot_layout(guides = 'collect')

Pretty nice, right?

This is not a grammar

I’ll finish this post off with something that has been rummaging inside my head for a while, and this is as good a place as any to put it. It seems obvious to call patchwork a grammar of plot composition, after all it expands on ggplot2 which has a grammar of graphics. I think that would be wrong. A grammar is not an API, but a theoretical construct that describes the structure of something in a consistent way. An API can be based on a grammar (as is the case for ggplot2 and dplyr) which will guide its design, or a grammar can be developed in close concert with an API as I tried to do with gganimate. Not everything lends itself well to being described by a grammar, and an API is not necessarily bad if it is not based on one (conversely, it may be bad even if it is). Using operators to combine plots is hardly a reflection of an underlying coherent theory of plot composition, much less a reflection of a grammar. It is still a nice API though.

Why do I need to say this? It seems like the programming world has been taken over by grammars and you may feel bad about just solving a problem with a nice API. Don’t feel bad — “grammar” has just been conflated with “cohesive API” lately.

Towards some new packages

As mentioned in the beginning, I set out to mainly finish off stuff in 2019. tidygraph, ggforce, and ggraph have seen some huge updates, and with patchwork finally released I’ve reached my year goal with time to spare. I’ll be looking forward to creating something new again, but hopefully find a good rhythm where I don’t need to take a year off to update forgotten projects.


riddle by attrition

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]

The weekend riddle from The Riddler is rather straightforward [my wording and simplification]:

Construct a decimal number X between 0 and 1 by drawing the first digit a¹ uniformly over {0,1,…,9}, the second digit a² uniformly over {0,1,…,a¹}, &tc., until 0 is attained. What is the expectation of this random variable X?

Since each new digit has expectation half of the previous digit, the expectation is an infinite geometric series with common ratio 20⁻¹ and factor 9, leading to an expectation of 9/19. As verified with the following R code:
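
Spelling the series out: each digit contributes its expectation scaled by the corresponding power of ten, so

\[ \mathbb{E}[X] = \sum_{k \ge 1} \frac{\mathbb{E}[a^k]}{10^k} = \sum_{k \ge 1} \frac{9/2^k}{10^k} = 9 \sum_{k \ge 1} 20^{-k} = \frac{9}{19}. \]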

skr <- function(){
  a = 9; b = 0
  while ((a <- sample(rep(0:a, 2), 1)) > 0) b = 10 * b + a
  while (b >= 1) b = b / 10
  return(b)
}
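
As a quick sanity check (my addition, not in the original post), the average of many simulated draws should land close to 9/19 ≈ 0.4737:

mean(replicate(1e5, skr()))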

Advent Calendar of Football Trivia Analyses

[This article was first published on rstats on Robert Hickman, and kindly contributed to R-bloggers.]

One of the most consistent fonts of posts on this blog is The Guardian’s football trivia page The Knowledge. A particular reason for this is that the small contained questions lend themselves to small blogposts that I can turn around in an hour or two, as opposed to being endlessly redrafted until I lose interest.

However, I still sometimes don’t quite get round to finishing some of these posts, or have trouble justifying a blog post on a very small and ‘trivial’ answer to a question. Therefore, as a sort of end-of-year round-up, and a Christmas present to myself, I wanted to push out answers to questions I found particularly interesting over the last year and hadn’t quite got round to [1]. I’ll probably add them all to this post as I finish them up.

2nd December – Everything in its right place

I wonder if any of any sporting leagues have ever ended in alphabetical order? pic.twitter.com/you6u8Uzwz

— P A Hunt (@TeachFMaths) June 15, 2019

Answer – yes, kind of. But also no.

This question has actually been answered (as many of these will have been). For a league of 20 teams (like the English Premier League), we might imagine it would have happened over the last ~150 years, but it’s almost certain from some basic maths that it won’t have, and moreover, will never happen.

Let’s load some data and see why.

#as per usual, going to heavily rely on tidyverse 
#and engsoccerdata throughout these posts
library(tidyverse)
library(engsoccerdata)
#load English league data
league_data <- engsoccerdata::england %>%
  #select and gather match results
  select(season = Season, division, home, visitor, hgoal, vgoal) %>%
  gather("location", "team", -season, -division, -hgoal, -vgoal) %>%
  mutate(
    g_for = case_when(
      location == "home" ~ hgoal,
      location == "visitor" ~ vgoal
    ),
    g_ag = case_when(
      location == "home" ~ vgoal,
      location == "visitor" ~ hgoal
    )) %>%
  #get correct point for a win/loss
  mutate(
    points = case_when(
      g_for > g_ag & season < 1981 ~ 2,
      g_for > g_ag & season > 1980 ~ 3,
      g_for == g_ag ~ 1,
      g_for < g_ag ~ 0
    ),
    gd = g_for - g_ag
  ) %>%
  #group by season and league and get final tables
  group_by(season, division, team) %>%
  summarise(points = sum(points),
            gd = sum(gd),
            g_for = sum(g_for)) %>%
  arrange(-points, -gd, -g_for) %>%
  #rank league order and alphabetical order
  mutate(league_pos = rank(-points, ties.method = "first"),
         alph_order = rank(team, ties.method = "first")) %>%
  select(season, division, team, league_pos, alph_order) %>%
  #split by league and season
  split(., f = list(.$season, .$division)) %>%
  keep(function(x) nrow(x) > 0)

#print the top of the first league table
head(league_data[[1]])
## # A tibble: 6 x 5
## # Groups:   season, division [1]
##   season division team                    league_pos alph_order
## 1   1888        1 Preston North End                1          9
## 2   1888        1 Aston Villa                      2          2
## 3   1888        1 Wolverhampton Wanderers          3         12
## 4   1888        1 Blackburn Rovers                 4          3
## 5   1888        1 Bolton Wanderers                 5          4
## 6   1888        1 West Bromwich Albion             6         11

We can then run a load of Spearman’s rank correlation tests on the data to see which ones are perfectly correlated or anti-correlated in both league and alphabetical ranking. We’ll use the very handy broom package to tidy the results of our many tests into one data.frame (remove the filter at the end of the pipe chain to see what gets output).

#use broom to tidily do stats
library(broom)

#correlate league and alphabetical order by year
exact_correlations <- league_data %>%
  map_df(., function(data) {
    cor.test(
      data$league_pos,
      data$alph_order,
      method = "spearman"
    ) %>%
      tidy() %>%
      mutate(season = unique(data$season),
             division = unique(data$division))
  }) %>%
  #take only perfectly (anti-)correlated results
  filter(abs(statistic) == 1)

print(exact_correlations)
## # A tibble: 0 x 7
## # ... with 7 variables: estimate, statistic, p.value,
## #   method, alternative, season, division

And so we find no exact correlations. There are no instances in 363 separate seasons of English league football where teams line up in either alphabetical, or anti-alphabetical order.

Let’s see why this is. To make things simpler, I’m going to imagine a cutdown league of only 6 teams, using teams starting with each of the first 6 letters of the alphabet:

first_letter_names <- league_data %>%
  bind_rows() %>%
  ungroup() %>%
  #get first letter of a team name
  mutate(first_letter = gsub("(^.)(.*)", "\\1", team)) %>%
  filter(season > 1992 &
           division == 1 &
           first_letter %in% toupper(letters[1:6])
         ) %>%
  #get one team beginning with A, B, C...
  filter(!duplicated(first_letter)) %>%
  select(team) %>%
  arrange(team) %>%
  print()
## # A tibble: 6 x 1
##   team            
## 1 Arsenal         
## 2 Blackburn Rovers
## 3 Coventry City   
## 4 Derby County    
## 5 Everton         
## 6 Fulham

For the league to finish in alphabetical order, we first need the team that is first alphabetically (Arsenal) to finish in first position. Assuming all teams have an equal chance of winning the league, the chance of this is obviously

\[ p(Arsenal = 1) = \frac{1}{n}\]

Then we need the second team (Blackburn Rovers) to finish second. This is predicated on Arsenal already finishing in first position, so the chance becomes

\[ p(Blackburn = 2 | Arsenal = 1) = \frac{1}{n-1} \]

and so on until the last team (Fulham) just has to slot into the only position left (position n, 6th in our example).

Thus the total chance becomes

\[ \frac{1}{n} \cdot \frac{1}{n-1} \cdots \frac{1}{1} \]

which can also be written

\[ p(ordered) = \prod_{n = 1}^{N} \frac{1}{n}\]

which multiplies out to

\[ p(ordered) = \frac{1}{n!} \]

So, for our very small league of n (assumed equally strong) teams, the denominator is

factorial(nrow(first_letter_names))
## [1] 720

so we have a 1/720 chance that this league ends perfectly in alphabetical order. For bigger leagues (for reference, most large European leagues contain 18-24 teams) this denominator grows super-exponentially and the probability becomes tiny.

For the English Premier League (20 teams) for instance the chance becomes

league_data %>%
  bind_rows() %>%
  ungroup() %>%
  filter(season == max(season) & division == 1) %>% 
  nrow() %>%
  factorial()
## [1] 2.432902e+18

or 1 in 2.4 quintillion. In short, if it’s assumed that there’s no relation between order of names and team strength, we might expect the universe to end before all 20 teams finish in perfect order.

We can test if our predictions bear out by looking at tiny leagues with small numbers of teams, e.g. the group stages of the Champions/Europa Leagues.

First we need to scrape the final tables for the last 8 years of data from both competitions:

library(rvest)

#website to scrape group stage data from
fb_data <- "https://footballdatabase.com"
ucl_links <- sprintf(
  "/league-scores-tables/uefa-champions-league-20%s-%s",
  10:18, 11:19
)
europa_links <- sprintf(
  "/league-scores-tables/uefa-europa-league-20%s-%s",
  10:18, 11:19
)
#function to scrape the data from these links
get_competition_data <- function(competition, links) {
  data <- links %>%
    paste0(fb_data, .) %>%
    map_df(., function(year) {
      page_read <- read_html(year)
      
      groups <- letters[1:8] %>%
        map_df(., function(group) {
          page_read %>% 
            html_nodes(sprintf("#total-group-%s > div > table", group)) %>% 
            html_table(fill = TRUE) %>% 
            as.data.frame() %>%
            mutate(group)
        }) %>%
        mutate(year = gsub("(.*-)([0-9]{4}-[0-9]{2})", "\\2", year))
    }) %>%
    mutate(competition)
}
#scrape and bind the data
uefa_data <- bind_rows(
  get_competition_data("champions", ucl_links),
  get_competition_data("europa", europa_links)
)
#print a cutdown version of the scraped data
head(uefa_data %>% select(club = Club, points = P, year, competition))
##                club points    year competition
## 1 Tottenham Hotspur     11 2010-11   champions
## 2       Inter Milan     10 2010-11   champions
## 3         FC Twente      6 2010-11   champions
## 4     Werder Bremen      5 2010-11   champions
## 5        Schalke 04     13 2010-11   champions
## 6              Lyon     10 2010-11   champions

So now we have 128 (8 groups x 8 years x 2 competitions) ‘mini-leagues’ each of 4 teams.

We can then munge this data to find all the groups where the teams finish in alphabetical order. We’d expect 128/4! leagues to finish in alphabetical order (or 5.33 to be exact).
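
That expected count is just the number of groups divided by the number of possible orderings of four teams:

128 / factorial(4)
## [1] 5.333333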

ordered_groups <- uefa_data %>%
  #select relevant information
  select(team = Club, league_pos = X., group, year, competition) %>%
  #by group find where teams finish in alphabetical order
  group_by(year, group, competition) %>%
  mutate(alph_order = rank(team, ties.method = "first")) %>%
  filter(league_pos == alph_order) %>%
  #keep only group where all (4) teams finish in order
  summarise(n = n()) %>%
  filter(n == 4) %>%
  #join and filter back data
  left_join(uefa_data, ., by = c("group", "year", "competition")) %>%
  filter(!is.na(n)) %>%
  #select useful information
  select(team = Club, points = P, gd = X..., league_pos = X.,
         group, year, competition) %>%
  #split groups up
  split(., list(.$year, .$group, .$competition)) %>%
  keep(function(x) nrow(x) > 0)

which leaves us with 5 leagues that have finished in order! Almost exactly what we’d predict by chance if the first letter of a team’s name had no effect on the outcome.

ordered_groups
## $`2011-12.c.champions`
##                team points gd league_pos group    year competition
## 5           Benfica     12  4          1     c 2011-12   champions
## 6          FC Basel     11  1          2     c 2011-12   champions
## 7 Manchester United      9  3          3     c 2011-12   champions
## 8     Otelul Galati      0 -8          4     c 2011-12   champions
## 
## $`2015-16.c.champions`
##                team points gd league_pos group    year competition
## 9   Atlético Madrid     13  8          1     c 2015-16   champions
## 10          Benfica     10  2          2     c 2015-16   champions
## 11      Galatasaray      5 -4          3     c 2015-16   champions
## 12 Lokomotiv Astana      4 -6          4     c 2015-16   champions
## 
## $`2010-11.f.champions`
##             team points  gd league_pos group    year competition
## 1     Chelsea FC     15  10          1     f 2010-11   champions
## 2      Marseille     12   9          2     f 2010-11   champions
## 3 Spartak Moskva      9  -3          3     f 2010-11   champions
## 4         Žilina      0 -16          4     f 2010-11   champions
## 
## $`2015-16.g.champions`
##                   team points  gd league_pos group    year competition
## 13          Chelsea FC     13  10          1     g 2015-16   champions
## 14         Dynamo Kyiv     11   4          2     g 2015-16   champions
## 15            FC Porto     10   1          3     g 2015-16   champions
## 16 Maccabi Tel Aviv FC      0 -15          4     g 2015-16   champions
## 
## $`2018-19.h.champions`
##                 team points gd league_pos group    year competition
## 17          Juventus     12  5          1     h 2018-19   champions
## 18 Manchester United     10  3          2     h 2018-19   champions
## 19          Valencia      8  0          3     h 2018-19   champions
## 20        Young Boys      4 -8          4     h 2018-19   champions

We can also do a larger test by randomly selecting teams out of the English league data we looked at earlier. To do this I need two quick functions: one to sample randomly from the data, and another to carry out the correlation test.

The first takes a number of samples (how many tests to run) and then selects a number of teams from each league sample. For instance, if I chose 3 teams, it might select Liverpool, Manchester United, and Watford, from the last season of the Premier League. These teams finished 2nd, 6th, and 11th respectively, so this ‘sampled league’ would fulfill the criteria of finishing in alphabetical order.

set.seed(3459)

#take a random sample of leagues and teams within those leagues
sample_cutdown_leagues <- function(nteams, nsamples, data) {
  samples <- sample(length(data), nsamples, replace = TRUE)
  
  sampled_league_data <- data[samples]
  
  league_team_serials <- sampled_league_data %>%
    lapply(., nrow) %>%
    lapply(., sample, size = nteams)
  
  #carry out the correlation test
  league_cor_test <- map2_df(
    .x = sampled_league_data,
    .y = league_team_serials,
    .f = cor_test_data
  )
}
  
#function for correlation test
cor_test_data <- function(full_league_data, sampled_teams) {
  sampled_league <- full_league_data[sampled_teams,] %>%
    arrange(league_pos)
  cor_test <- cor.test(
    sampled_league$league_pos,
    sampled_league$alph_order,
    method = "spearman"
  ) %>%
    tidy() %>%
    #mutate on information about that season and teams chosen
    mutate(teams = paste(sampled_league$team, collapse = ", "),
           season = unique(sampled_league$season),
           division = unique(sampled_league$division))
}

So for instance if I just run it once, randomly selecting 4 teams:

test <- sample_cutdown_leagues(4, 1, league_data)
#print the teams selected
test$teams
## [1] "Brentford, Bristol Rovers, Brighton & Hove Albion, Chester"
test
## # A tibble: 1 x 8
##   estimate statistic p.value method   alternative teams     season division
## 1      0.8      2.00   0.333 Spearma~ two.sided   Brentfor~   1994        3

It gives me 4 teams from the 1994 division 3 who didn’t finish in alphabetical order (though, amusingly, all have a very similar starting letter).

We can then carry this out with 10000 samples for n_team numbers of 2:6 to see if we get roughly the expected numbers of exactly correlated league finish positions (this will take 1-2mins) by finding out how many tests give an estimate of 1 (finished exactly correlated with alphabetical order) or -1 (finished exactly anti-correlated with alphabetical order).

Both these numbers should be roughly equal to the number of samples (10000) divided by the factorial of the number of teams selected.

test_n_numbers <- function(nteams) {
  #run sampling function n times
  #10k should do
  sampling <- sample_cutdown_leagues(nteams, 10000, league_data)
  
  #find exactly correlated and anti-correlated examples
  #where teams are in exact alphabetical order ascending or descending
  correlated <- length(which(sampling$estimate == max(sampling$estimate)))
  anti_correlated <- length(which(sampling$estimate == min(sampling$estimate)))
  expected <- nrow(sampling) / factorial(nteams)
  
  df <- data.frame(n = nteams,
                   sample_cor = correlated,
                   sample_anticor = anti_correlated,
                   sample_expected = expected)
}
#run the function
testing <- map_df(2:6, test_n_numbers)
#print results
print(testing)
##   n sample_cor sample_anticor sample_expected
## 1 2       5010           4990      5000.00000
## 2 3       1676           1665      1666.66667
## 3 4        367            398       416.66667
## 4 5        101             81        83.33333
## 5 6         14             15        13.88889

And the numbers line up, as we would expect if there is no effect of the first letter of a team’s name upon final league position.

Finally, we can do a Kendall’s correlation test to really see if there is any relationship between alphabetical team name order and final league finish for all of our English league data. We use Kendall instead of a Spearman test here because, grouping all the data together, we’re going to have a lot of ties (one team has to finish 1st in every league each year).

all_data <- league_data %>%
  bind_rows()

#do a big correlation test
kendall_test <- cor.test(all_data$alph_order,
                         all_data$league_pos,
                         alternative = "two.sided",
                         method = "kendall") %>%
  tidy() %>%
  print()
## # A tibble: 1 x 5
##   estimate statistic p.value method                         alternative
## 1   0.0135      1.74  0.0826 Kendall's rank correlation tau two.sided

And we can see that, even though our p-value is ‘approaching significance’, it’s not significant at our fairly liberal threshold of 0.05. Even then, the effect size (0.013) is tiny, so there’s no need for Watford to start worrying just yet.

  [1] SMALL DIGRESSION: I love blogging on this site and it has also been a great help to me in numerous ways (practice coding/writing, feeling like a “programmer”, for job interviews), but quite a lot of the time I feel posts are not quite where I want them (I’m sure this feeling isn’t restricted to me) and so won’t put them up, and so the time (sometimes quite a few hours!) I put into them in my spare time feels wasted and makes me feel worse about myself. I’m hoping that pushing out fairly rushed/half-formed ideas like this will help with this.

learnr 0.10.0

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

learnr 0.10.0 has been released! In this version of learnr, quiz questions have been expanded to allow for more question types. Text box quiz questions have been implemented natively within learnr and ranking questions have been implemented using the sortable package.

The learnr R package makes it easy to turn any R Markdown document into an interactive tutorial. Tutorials consist of content along with interactive components for checking and reinforcing understanding. Tutorials can include any or all of the following:

  1. Narrative, figures, illustrations, and equations.

  2. Code exercises (R code chunks that users can edit and execute directly).

  3. Quiz questions.

  4. Videos (supported services include YouTube and Vimeo).

  5. Interactive Shiny components.

Tutorials automatically preserve work done within them, so if a user works on a few exercises or questions and returns to the tutorial later they can pick up right where they left off.

Example

Test out the latest interactive demo of sortable’s ranking quiz question.

learnr::run_tutorial("question_rank", "sortable")

Highlights

New quiz questions

I am excited to announce that quiz questions are now mini shiny applications. This opens the door to new and extendable question types, such as text box and ranking questions. The sortable R package (an htmlwidgets wrapper around the drag-and-drop Sortable.js) has already implemented ranking questions using the new learnr quiz question API. Thank you Kenton Russell for originally pursuing sortable and Andrie de Vries for connecting the two packages.

Please see learnr::run_tutorial("quiz_question", "learnr") for more information.
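
As a taste of the new question API, here is a minimal sketch (my own, not from the release notes) of a tutorial chunk that combines the native text-box question with sortable’s ranking question:

library(learnr)
library(sortable)

quiz(
  question_text(
    "Which learnr function runs a tutorial from an installed package?",
    answer("run_tutorial", correct = TRUE),
    allow_retry = TRUE
  ),
  question_rank(
    "Sort these numbers from smallest to largest.",
    answer(c("1", "2", "3"), correct = TRUE)
  )
)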

Available tutorials

A new function, available_tutorials(), has been added. When called, this function will find all available tutorials in every installed R package. If a package name is provided, only that package will be searched. This functionality has been integrated into run_tutorial if a user provides a wrong tutorial name or forgets the package name.

Please see ?learnr::available_tutorials for more information.
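
For example, to list the tutorials shipped with learnr itself:

learnr::available_tutorials("learnr")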

Better pre-rendering

Using the latest rmarkdown, learnr tutorials are now aggressively pre-rendered. For package developers, please do not include the pre-rendered HTML files in your package, as users will most likely need to recompile the tutorial. See learnr’s .Rbuildignore for an example.
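
For instance, a package author could keep a rendered tutorial out of the built package with something along these lines (the tutorial path here is hypothetical, and usethis is assumed to be available):

usethis::use_build_ignore("inst/tutorials/my-tutorial/my-tutorial.html")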

Deploying dependencies not found

If your learnr tutorial contains broken code within exercises for users to fix, the CRAN version of packrat will not find all of your dependencies to install when the tutorial is deployed. To deploy tutorials containing exercise code with syntax errors, install the development version of packrat. This version of packrat is able to find dependencies per R chunk, allowing for broken R chunks within the tutorial file.

remotes::install_github("rstudio/packrat")

Breaking changes

learnr 0.10.0 includes some non-backward-compatible bug fixes involving the browser’s local storage. It is possible that the browser’s local storage will have a “cache miss” and existing users will be treated like new users.

learnr change log

New features

  • Quiz questions are implemented using shiny modules (instead of htmlwidgets). (#194)

  • Aggressively rerender prerendered tutorials in favor of a cohesive exercise environment (#169, #179, and rstudio/rmarkdown#1420)

  • Added a new function, safe, which evaluates code in a new, safe R environment. (#174)

Minor new features and improvements

  • Added the last evaluated exercise submission value, last_value, as an exercise checker function argument. (#228)

  • Added tabset support. (#219, #212)

  • Question width will expand to the container width. (#222)

  • Available tutorial names will be displayed when no name parameter or an incorrect name is provided to run_tutorial(). (#234)

  • The options parameter was added to question to allow custom questions to pass along custom information. See sortable::sortable_question for an example. (#243)

  • Missing package dependencies will ask to be installed at tutorial run time. (@isteves, #253)

  • When questions are tried again, the existing answer will remain, not forcing the user to restart from scratch. (#270)

  • A version number has been added to question_submission events. This will help when using custom storage methods. (#291)

  • Tutorial storage on the browser is now executed directly on indexedDB using idb-keyval (dropping localforage). This change prevents browser tabs from blocking each other when trying to access indexedDB data. (#305)

Bug fixes

  • Fixed a spurious console warning when running exercises using Pandoc 2.0. (#154)

  • Added a fail-safe to try-catch bad student code that would crash the tutorial. (@adamblake, #229)

  • Replaced references to checkthat and grader in docs with gradethis (#269)

  • Removed a warning created by pandoc when evaluating exercises where pandoc wanted a title or pagetitle. (#303)


Static and Dynamic Book Exercises with R

[This article was first published on R on msperlin, and kindly contributed to R-bloggers.]

In the new edition of my R book, to be released in early 2020 (see current TOC, new packages and notification form), I’m giving special attention to its use in the classroom. For that, I’ve created class slides and R exercises in static and dynamic form. All the extra content will be freely available on the internet and distributed with package afedR. Anyone can use it, without needing to purchase the book (but of course it would help).

To access the files, first install the package from Github with devtools (ignore warning messages about long paths):

devtools::install_github('msperlin/afedR')

You can copy all book content to a local folder using the following function:

my_tempdir <- tempdir()
afedR::afedR_get_book_files(path_to_copy = my_tempdir)

The static exercises for all chapters are available in the afedR files/eoc-exercises folder:

list.files(file.path(my_tempdir, 'afedR files/eoc-exercises/'))
##  [1] "afedR-Solutions-Chapter-01-Introduction.Rmd"                   
##  [2] "afedR-Solutions-Chapter-02-Basic Operations.Rmd"               
##  [3] "afedR-Solutions-Chapter-03-Writing-Research-Scripts.Rmd"       
##  [4] "afedR-Solutions-Chapter-04-Importing-Exporting-Local-Files.Rmd"
##  [5] "afedR-Solutions-Chapter-05-Importing-Data-Internet.Rmd"        
##  [6] "afedR-Solutions-Chapter-06-Dataframes-and-others.Rmd"          
##  [7] "afedR-Solutions-Chapter-07-Basic-Classes.Rmd"                  
##  [8] "afedR-Solutions-Chapter-08-Programming-with-R.Rmd"             
##  [9] "afedR-Solutions-Chapter-09-Cleaning-Structuring-Data.Rmd"      
## [10] "afedR-Solutions-Chapter-10-Figures.Rmd"                        
## [11] "afedR-Solutions-Chapter-11-Financial-Econometrics.Rmd"         
## [12] "afedR-Solutions-Chapter-12-Reporting-Results.Rmd"              
## [13] "afedR-Solutions-Chapter-13-Optimizing-Code.Rmd"

Every .Rmd file is self-contained and should compile without any problems on your computer.

Now, the real benefit of the package is in the dynamic R exercises you can create with package exams. Back in 2017 I already talked about my admiration and use of exams in all of my university classes. In short, you can use exams to create a unique version of an exercise for each student by randomizing numbers and text. All questions are written in .Rmd/.Rnw files and, since it’s all RMarkdown code, you can make it as dynamic as you like. The number of hours it has saved me so far in creating and grading exams is unbelievable! I even changed the structure of all my classes to a more activity-oriented coursework based on single-choice exercises. The feedback I get from the students has been very positive.
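
To give a flavour of what a dynamic question looks like, here is a minimal single-choice exercise in the exams R/Markdown format (a hand-written sketch for illustration, not one of the questions shipped with afedR):

```{r data generation, echo = FALSE, results = "hide"}
a <- sample(1:10, 1)
b <- sample(1:10, 1)
```

Question
========
What is the value of `r a` + `r b` in R?

Answerlist
----------
* `r a + b`
* `r a + b + 1`
* `r a + b - 1`
* `r a + b + 2`

Meta-information
================
exname: simple-sum
extype: schoice
exsolution: 1000
exshuffle: TRUE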

As a result of using exams for many years, I wrote a significant database of R single-choice questions that I use in my university courses. It amounts to 91 questions, covering R basics, functions, class objects, programming, econometrics, and much more. All of these exam questions are included in the package and I’ll add more with time. You can find all of them in a compiled html file in this link.

Creating a Dynamic Exam

Creating a dynamic R exam is simple. All you need are the names of the students, among other options. Function afedR_build_exam will grab all exercise templates, compile each exam, and output a different .html file for each student. Have a look at a reproducible example:

library(afedR)
library(tidyverse)

set.seed(1)

student_names <- c('Roger Federer', 'John Wick', 'Robert Engle', 
                   'Getulio Vargas', 'Mario Quintana', 'Elis Regina') 
my_ids <- c(sample(seq(length(student_names)))) # ids (usually a numeric and unique symbol given by the university)
class_name <- 'R Workshop'
exercise_name <- 'Introduction to R'
temp_dir <- file.path(tempdir(), 'html exams') # where to create exam files

l_out <- afedR_build_exam(students_names = student_names, 
                          students_ids = my_ids, 
                          class_name = class_name, 
                          exercise_name = 'Introduction to R',
                          chapters_to_include = 2, # single chapter for simplicity (it goes from 1-13)
                          dir_out = temp_dir)
## Exams generation initialized.
## 
## Output directory: /tmp/Rtmprsh4CX/exams files file29381321d46f
## Exercise directory: /mnt/HDD/Dropbox/11-My Website/www.msperlin.com-blog/content/post
## Supplement directory: /tmp/Rtmprsh4CX/file2938539ecdc8
## Temporary directory: /tmp/Rtmprsh4CX/file29385ddf98d
## Exercises: /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-01-AboutPrint, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-04-AboutTypeFiles, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-05-AboutTypeFiles2, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-06-Sequence, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-08-AboutError2, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-09-About_ls, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-10-SelectingValues, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-11-AboutDim, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-12-listfiles, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-13-About_rm, /home/msperlin/R/x86_64-pc-linux-gnu-library/3.6/afedR/extdata/exam_files/exercise_files/Chapter 02 - Basic Operations/Chapter_02-Intro-14-setwd
## 
## Generation of individual exams.
## Exam 1: _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-01-AboutPrint (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-04-AboutTypeFiles (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-05-AboutTypeFiles2 (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-06-Sequence (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-08-AboutError2 (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-09-About_ls (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-10-SelectingValues (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-11-AboutDim (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-12-listfiles (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-13-About_rm (srt) _home_msperlin_R_x86_64-pc-linux-gnu-library_3.6_afedR_extdata_exam_files_exercise_files_Chapter 02 - Basic Operations_Chapter_02-Intro-14-setwd (srt) ... w ... done.
## Exam 2: ... done.
## Exam 3: ... done.
## Exam 4: ... done.
## Exam 5: ... done.
## Exam 6: ... done.

Done. All exams files are available in folder /tmp/Rtmprsh4CX/html exams:

list.files(temp_dir)
## [1] "Introduction to R_Elis Regina_Ver 06.html"   
## [2] "Introduction to R_Getulio Vargas_Ver 04.html"
## [3] "Introduction to R_John Wick_Ver 02.html"     
## [4] "Introduction to R_Mario Quintana_Ver 05.html"
## [5] "Introduction to R_Robert Engle_Ver 03.html"  
## [6] "Introduction to R_Roger Federer_Ver 01.html"

As an example of the HTML output, the file Introduction to R_Roger Federer_Ver 01.html is available at this link.

Going further, the output of afedR_build_exam is a list that includes the correct answers for each student/question:

print(l_out$answer_key)
## # A tibble: 6 x 13
##   i_name i_ver `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`   `10`  `11` 
##                
## 1 Elis …     6 e     e     e     e     e     d     d     d     d     d     d    
## 2 Getul…     4 e     e     e     e     c     a     a     a     a     a     a    
## 3 John …     2 a     a     a     a     a     a     a     a     a     a     a    
## 4 Mario…     5 a     a     a     a     e     e     e     e     e     e     e    
## 5 Rober…     3 a     a     a     a     e     e     e     e     e     e     e    
## 6 Roger…     1 d     d     d     d     e     a     a     a     a     a     a

You can use this table for grading all exams. Currently I use Google Forms to register students’ answers with an online questionnaire. This helps because I can gather all answers in a single Google Spreadsheet, import it into R with package googlesheets4, and effortlessly grade all exams in an R script. Soon, in another post, I’ll write about my detailed workflow for using exams with Google Forms and Google Classroom.
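
As a minimal sketch of that grading step (the responses object and its question/answer columns are my assumptions about a long-format Google Forms export, not anything produced by afedR), the matching against the answer key could look like this:

library(dplyr)
library(tidyr)

# reshape the answer key: one row per student and question
key_long <- l_out$answer_key %>%
  pivot_longer(cols = -c(i_name, i_ver),
               names_to = "question", values_to = "correct")

# responses: hypothetical data frame with columns i_name, question, answer
grades <- responses %>%
  left_join(key_long, by = c("i_name", "question")) %>%
  group_by(i_name) %>%
  summarise(score = mean(answer == correct, na.rm = TRUE))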

I hope this content helps R instructors around the world. But it is a work in progress. If you find any issue, please let me know by email or by posting an issue at the GitHub repository.

The book is finally reaching its final version and I’m very excited about it. It’s been a long journey. You can find more details about the new book here. If you want to be notified about the publication, just sign this form and I’ll email you as soon as the book becomes available.


Improve General Regression Neural Network by Monotonic Binning

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers].

A major criticism of the binning algorithm, as well as of the WoE transformation, is that the use of binned predictors will decrease the model’s predictive power due to the loss of data granularity after the WoE transformation. While talk is cheap, I will use the example below to show that using the monotonic binning algorithm to pre-process predictors in a GRNN is actually able to alleviate the over-fitting and to improve the prediction accuracy for the hold-out sample.

First of all, the whole dataset was split in half, i.e. one half as the training sample and the other as the hold-out sample. The smoothing parameter, i.e. sigma, was chosen through a random search and happened to be 2.198381 for both GRNNs.

  1. For the first GRNN with untransformed raw predictors, the AUC for the training sample is 0.69 and the AUC for the hold-out sample is 0.66.
  2. For the second GRNN with WoE-transformed predictors, the AUC for the training sample is 0.72 and the AUC for the hold-out sample is 0.69.

In this particular example, it is clearly shown that there is roughly a 4% – 5% improvement in the AUC statistic for both training and hold-out samples through the use of monotonic binning and WoE transformations.
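
The full code lives in the gists embedded in the original post; the outline below is only a rough sketch of the experimental setup. The helpers grnn_fit(), grnn_predict(), monotonic_bin() and woe_transform() are placeholders for whatever GRNN and binning implementation you use, not a specific package’s API:

set.seed(2019)
idx     <- sample(nrow(df), nrow(df) / 2)   # df: your modelling data (placeholder)
train   <- df[idx, ]
holdout <- df[-idx, ]

# 1) GRNN on untransformed raw predictors
fit_raw <- grnn_fit(x = train[, predictors], y = train$y, sigma = 2.198381)
auc_raw <- pROC::auc(holdout$y, grnn_predict(fit_raw, holdout[, predictors]))

# 2) GRNN on WoE-transformed predictors (binning learned on the training half only)
bins      <- monotonic_bin(train, y = "y", x = predictors)
train_woe <- woe_transform(train, bins)
hold_woe  <- woe_transform(holdout, bins)
fit_woe   <- grnn_fit(x = train_woe[, predictors], y = train_woe$y, sigma = 2.198381)
auc_woe   <- pROC::auc(hold_woe$y, grnn_predict(fit_woe, hold_woe[, predictors]))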



Create realistic-looking Islands with R

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].

Modern movies use a lot of mathematics for their computer animations and CGI effects; one of those techniques is known under the name of fractals.

In this post, we will use this technique to create islands with coastlines that look extraordinarily realistic. If you want to do that yourself read on!

I was lucky enough to meet Professor Benoit Mandelbrot, one of the brightest minds of the past century and one of the fathers of fractal geometry, in person. One of his papers bears the title “How Long is the Coast of Britain?”, where the so-called coastline paradox is addressed: it basically states that the finer your ruler is, the longer your overall result will be, so the coastline of a landmass does not have a well-defined length… an obviously counterintuitive result:

Source: wikimedia

Here we will turn this principle on its head and use it to actually create realistic-looking landmasses with R. The inspiration for this came from chapter 4 “Infinite Detail” of the book “Math Bytes” by my colleague Professor T. Chartier from Davidson College in North Carolina.

The idea is to start with some very simple form, like a square, and add more detail step-by-step. Concretely, we go through every midpoint of our ever more complex polygon and shift it by a random amount. Because the polygon will be getting more and more intricate we have to adjust the absolute amount by which we shift the respective midpoints. That’s about all… have a look at the following code:

# define square
x <- c(0, 16, 16, 0, 0)
y <- c(0, 0, 16, 16, 0)

# function for generating random offsets of midpoints, change e.g. limit for your own experiments
realistic <- function(n, k) {
  limit <- 0.5^k * 10
  runif(n, -limit, limit)
}

# function for calculating all midpoints of a polygon
midpoints <- function(x) {
  na.omit(filter(x, rep(1/2, 2)))
}

island <- function(x, y, iter = 10, random = realistic) {
  # suppress "number of columns of result is not a multiple of vector length"-warnings because recycling is wanted here
  oldw <- getOption("warn")
  options(warn = -1)

  for (i in 1:iter) {
    # calculate midpoints of each line segment
    x_m <- as.numeric(midpoints(x))
    y_m <- as.numeric(midpoints(y))

    # shift midpoint by random amount
    x_m <- x_m + random(length(x_m), i)
    y_m <- y_m + random(length(y_m), i)

    # insert new midpoints into existing coordinates of polygon
    x_n <- c(rbind(x, x_m))
    x_n <- x_n[-length(x_n)]
    y_n <- c(rbind(y, y_m))
    y_n <- y_n[-length(y_n)]

    x <- x_n
    y <- y_n
  }
  options(warn = oldw)
  list(x = x, y = y)
}

plot_island <- function(coord, island_col = "darkgreen", water_col = "lightblue") {
  oldp <- par(bg = water_col)
  plot(coord$x, coord$y, type = "n", axes = FALSE, frame.plot = FALSE, ann = FALSE)
  polygon(coord$x, coord$y, col = island_col)
  par(oldp)
}

Perhaps one thing is worth mentioning concerning the interaction of the functions: island receives the realistic function as an argument, which is a very elegant way of modularizing the code in case you want to try different randomizing functions. This extraordinary feature is possible because R is a functional programming language where functions are first-class citizens (for an introduction on this see here: Learning R: A gentle introduction to higher-order functions).
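
For example, since island() takes the randomizer as an argument, you can pass any function with the same (n, k) signature. The Gaussian variant below is my own small experiment, not part of the original code:

# alternative randomizer: normally distributed offsets that also shrink per iteration
gaussian_offsets <- function(n, k) {
  rnorm(n, mean = 0, sd = 0.5^k * 5)
}

set.seed(42)
coord <- island(x, y, random = gaussian_offsets)
plot_island(coord)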

But now for some serious terra-forming:

set.seed(1) # for reproducibility
coord <- island(x, y)
plot_island(coord)

set.seed(3)
coord <- island(x, y)
plot_island(coord)

We can also observe the process step-by-step:

set.seed(2)
coord <- island(x, y)
plot_island(coord)

set.seed(2)
coord <- island(x, y, iter = 1)
plot_island(coord)

set.seed(2)
coord <- island(x, y, iter = 2)
plot_island(coord)

set.seed(2)
coord <- island(x, y, iter = 4)
plot_island(coord)

set.seed(2)
coord <- island(x, y, iter = 6)
plot_island(coord)

set.seed(2)
coord <- island(x, y, iter = 8)
plot_island(coord)

If you do your own experiments with the randomizer function and/or the seed and find an especially appealing example, please share it with us in the comments!

As a bonus, I would like to draw your attention to this little masterpiece of an educational film on fractals, created in the style of the great “Sin City” flick (~ 5 min.):

So, let me end this post with one last example of a volcanic island in a sea of lava… 😉

set.seed(5214)
coord <- island(x, y)
plot_island(coord, island_col = "black", water_col = "darkred")


Advent of Code 2019-02 with R & JavaScript

[This article was first published on Colin Fay, and kindly contributed to R-bloggers].

Solving Advent of Code 2019-02 with R and JavaScript.

[Disclaimer] Obviously, this post contains a big spoiler about Advent of Code, as it gives solutions for solving day 2.

[Disclaimer bis] I’m no JavaScript expert so this might not be the perfect solution. And TBH, that’s also the case for the R solution.

About the JavaScript code

The JavaScript code has been written in the same RMarkdown as the R code. It runs thanks to the {bubble} package: https://github.com/ColinFay/bubble

Instructions

Find the instructions at: https://adventofcode.com/2019/day/2

R solution

When in doubt, use brute force.

Ken Thompson

Part one

extract <- function(vec, idx) vec[as.character(idx)]

day_2 <- function(vec, one = 12, two = 2){
  vec[2] <- one
  vec[3] <- two
  names(vec) <- 0:(length(vec) - 1)
  start <- 0
  repeat {
    req <- extract(vec, start)
    if (req == 99) break
    if (req == 1) fun <- `+`
    if (req == 2) fun <- `*`
    vec[as.character(extract(vec, start + 3))] <- fun(
      extract(vec, extract(vec, start + 1)),
      extract(vec, extract(vec, start + 2))
    )
    start <- start + 4
  }
  vec[1]
}

ipt <- scan("input2.txt", what = numeric(), sep = ",")

day_2(ipt)
##       0 
## 3409710

Part two

x <- purrr::cross2(0:99, 0:99)

i <- 1
repeat {
  res <- day_2(ipt, x[[i]][[1]], x[[i]][[2]])
  if (res == 19690720) break
  i <- i + 1
}

# Answer
100 * x[[i]][[1]] + x[[i]][[2]]
## [1] 7912

JS solution

Part one & Two

// Reading the file

var res = fs.readFileSync("input2.txt", 'utf8').split(",").filter(x => x.length != 0);
var res = res.map(x => parseInt(x));
function day_2(vec, one = 12, two = 2){
  var loc = vec.slice();
  loc[1] = one;
  loc[2] = two;
  start = 0;
  do {
    var req = loc[start];
    if (req === 99) { break; }
    pos1 = loc[start + 1];
    pos2 = loc[start + 2];
    pos3 = loc[start + 3];
    if (req === 1) {
      loc[pos3] = loc[pos1] + loc[pos2];
    } else if (req === 2) {
      loc[pos3] = loc[pos1] * loc[pos2];
    }
    start = start + 4;
  } while (start < vec.length)
  return loc[0]
}
day_2(res)
## 3409710
function make_array(l){
  return Array.from({length: l}, (el, index) => index);
}

var x = make_array(100);
var y = make_array(100);
var cross = [];
for (var i = 0; i < x.length; i++){
  for (var j = 0; j < y.length; j++){
    cross.push([x[i], y[j]])
  }
}

i = 0
do {
  ans = day_2(res, cross[i][0], cross[i][1]);
  if (ans == 19690720) break
  i++
} while (i < cross.length)
100*cross[i][0]+cross[i][1]
## 7912

Day 2 takeaway

  • Array.from({length: n}, (el, index) => index); is more or less the equivalent of R 1:n

  • When doing [] = in JS, we’re modifying the original object. Compare

# R
x <- 1:3
x
## [1] 1 2 3
plop <- function(y){
  y[1] <- 2
}
plop(x)
x
## [1] 1 2 3

to

// JS
var x = [1, 2, 3];
function yeay(ipt){
  ipt[1] = 12
}
yeay(x)
x
## [ 1, 12, 3 ]
  • JavaScript copies by reference. Compare:

# R
x <- 1:3
y <- x
x[1] <- 999
x
## [1] 999   2   3
y
## [1] 1 2 3

And

var x = make_array(3);
var y = x
x[1] = 999

x
y
## [ 0, 999, 2 ]
## [ 0, 999, 2 ]
  • This can be prevented with obj.slice()

var x = make_array(10);
var y = x.slice();
x[1] = 999

x
y
## [ 0, 999, 2, 3, 4, 5, 6, 7, 8, 9 ]
## [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

NHSRDatasets meets runcharter

[This article was first published on HighlandR, and kindly contributed to R-bloggers].

Big up the NHS –

Background

The NHSRDatasets package made it to CRAN recently, and as it is designed for use by NHS data analysts, and I am an NHS data analyst, let’s take a look at it. Thanks to Chris Mainey and Tom Jemmett for getting this together.

Load packages and data

As above let’s load what we need for this session. The runcharter package is built using data.table, but I’m using dplyr in this main section to show that you don’t need to know data.table to use it.

library(NHSRdatasets)
library(runcharter) # remotes::install_github("johnmackintosh/runcharter")
library(dplyr)
library(skimr)

However, seriously, do take a look at data.table. It’s not as hard to understand as some might have you believe. A little bit of effort pays off. You can load the runcharter package from GitHub using the remotes package. (I’ve managed to install it on Windows and Ubuntu. Mac user? No idea, I’m a Band 6 UK public sector data analyst, so can only dream of such luxuries).

ae<-data("ae_attendances")# a promiseae<-ae_attendances#  a stringrm(ae_attendances)# just typing 'ae' brings it to life in the environment

That felt a bit glitchy. There has to be a sleeker way to load and assign a built in dataset but I couldn’t find one. Cursory google here.
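
One slightly sleeker option, assuming the package lazy-loads its datasets (as most data packages do), is to skip data() and reference the object directly:

# no call to data() needed if the dataset is lazy-loaded
ae <- NHSRdatasets::ae_attendances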

Let’s have a look at the data:

glimpse(ae)
## Observations: 12,765
## Variables: 6
## $ period       2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2...
## $ org_code     RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM...
## $ type         1, 2, other, 1, 2, other, other, other, 1, other, ...
## $ attendances  21289, 813, 2850, 30210, 807, 11352, 4381, 19562, ...
## $ breaches     2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322...
## $ admissions   5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0,...

Lots of factors. I’m actually very grateful for this package, as it caused me major issues when I first tried to plot this data using an earlier version of runcharter. I hadn’t considered factors as a possible grouping variable, which was a pretty big miss. All sorted now.

There’s way too much data for my tiny laptop screen, so I will filter the data for type 1 departments – the package help gives us a great link to explain what this means

type1 <- ae %>%
  filter(type == 1) %>%
  arrange(period)

# plot attendances
p <- runcharter(type1,
                med_rows = 13,       # median of first 13 points
                runlength = 9,       # find a run of 9 consecutive points
                direction = "above", # find run above the original median
                datecol = "period",
                grpvar = "org_code",
                yval = "attendances")

The runcharter function returns both a plot and a data.table/data.frame showing a summary of any runs in the desired direction (I’m assuming folk reading this have a passing knowledge of run charts, but if not, you can look at the package vignette, which is the cause of most of my commits!!).

Don’t try loading the plot right now, because it is huge and takes ages. If we look at the summary dataframe, we can see 210 rows, a fairly decent portion of which relate to significant increases above the original median value.

p$sustained
##      org_code median start_date   end_date  extend_to  run_type
##   1:      R0A  21430 2017-10-01 2018-10-01 2019-03-01  baseline
##   2:      R1F   3477 2016-04-01 2017-04-01 2017-05-01  baseline
##   3:      R1H  28843 2016-04-01 2017-04-01 2019-03-01  baseline
##   4:      R1K  11733 2016-04-01 2017-04-01 2019-03-01  baseline
##   5:      RA2   5854 2016-04-01 2017-04-01 2018-03-01  baseline
##  ---                                                           
## 206:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
## 207:      RLT   6977 2018-03-01 2018-11-01 2019-03-01 sustained
## 208:      RQ8   8456 2018-03-01 2018-11-01 2019-03-01 sustained
## 209:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
## 210:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained

Let’s use skimr to get a sense of this

skimr::skim(p$sustained)
## Skim summary statistics
##  n obs: 210 
##  n variables: 6 
## 
## -- Variable type:character -----------------------------------------------------------
##  variable missing complete   n min max empty n_unique
##  run_type       0      210 210   8   9     0        2
## 
## -- Variable type:Date ----------------------------------------------------------------
##    variable missing complete   n        min        max     median n_unique
##    end_date       0      210 210 2017-04-01 2019-03-01 2017-04-01        9
##   extend_to       0      210 210 2017-05-01 2019-03-01 2019-03-01        7
##  start_date       0      210 210 2016-04-01 2018-07-01 2016-04-01        9
## 
## -- Variable type:factor --------------------------------------------------------------
##  variable missing complete   n n_unique                     top_counts
##  org_code       0      210 210      139 RA4: 3, RDD: 3, RDE: 3, RGN: 3
##  ordered
##     TRUE
## 
## -- Variable type:numeric -------------------------------------------------------------
##  variable missing complete   n   mean      sd   p0     p25  p50      p75
##    median       0      210 210 9389.8 4317.54 3477 6468.25 8413 11311.25
##   p100     hist
##  29102 

To keep this manageable, I’m going to filter for areas that have a median above 10,000 attendances (based on the first 13 data points).

high_admits <- p$sustained %>%
  filter(median > 10000 & run_type == "sustained") %>%
  select(org_code)

Then I change the org_code from factor to character, and pull out unique values. I’m sure there is a slicker way of doing this, but it’s getting late, and I don’t get paid for this..

I use the result to create a smaller data frame

high_admits$org_code <- as.character(high_admits$org_code)

type1_high <- type1 %>%
  filter(org_code %in% high_admits$org_code)
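
For the record, the “slicker way” mentioned above could look something like this one-step version (just a sketch; the two-step code above works fine):

high_admit_codes <- p$sustained %>%
  filter(median > 10000, run_type == "sustained") %>%
  distinct(org_code) %>%
  pull(org_code) %>%
  as.character()

type1_high <- type1 %>%
  filter(org_code %in% high_admit_codes)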

And now I can produce a plot that fits on screen. I’ve made the individual scales free along the y axis, and added titles etc

p2 <- runcharter(type1_high,
                 med_rows = 13, # median of first 13 points as before
                 runlength = 9, # find a run of 9 consecutive points
                 direction = "above",
                 datecol = "period",
                 grpvar = "org_code",
                 yval = "attendances",
                 facet_scales = "free_y",
                 facet_cols = 4,
                 chart_title = "Increased attendances in selected Type 1 AE depts",
                 chart_subtitle = "Data covers 2016/17 to 2018/19",
                 chart_caption = "Source : NHSRDatasets",
                 chart_breaks = "6 months")

Let’s look at the sustained dataframe

p2$sustained
##     org_code median start_date   end_date  extend_to  run_type
##  1:      RCB   9121 2016-04-01 2017-04-01 2018-03-01  baseline
##  2:      RDD  11249 2016-04-01 2017-04-01 2017-05-01  baseline
##  3:      RDE   7234 2016-04-01 2017-04-01 2017-05-01  baseline
##  4:      RGN   7912 2016-04-01 2017-04-01 2017-05-01  baseline
##  5:      RJ1  12240 2016-04-01 2017-04-01 2018-03-01  baseline
##  6:      RJE  14568 2016-04-01 2017-04-01 2018-05-01  baseline
##  7:      RJL  11262 2016-04-01 2017-04-01 2018-03-01  baseline
##  8:      RQM  16478 2016-04-01 2017-04-01 2018-03-01  baseline
##  9:      RRK   9584 2016-04-01 2017-04-01 2018-03-01  baseline
## 10:      RTE  11303 2016-04-01 2017-04-01 2017-05-01  baseline
## 11:      RTG  11344 2016-04-01 2017-04-01 2018-07-01  baseline
## 12:      RTR  10362 2016-04-01 2017-04-01 2018-03-01  baseline
## 13:      RVV  12700 2016-04-01 2017-04-01 2017-05-01  baseline
## 14:      RW6  22114 2016-04-01 2017-04-01 2017-05-01  baseline
## 15:      RWE  12275 2016-04-01 2017-04-01 2017-05-01  baseline
## 16:      RWF  11939 2016-04-01 2017-04-01 2018-03-01  baseline
## 17:      RWP   9976 2016-04-01 2017-04-01 2018-03-01  baseline
## 18:      RXC   9396 2016-04-01 2017-04-01 2018-03-01  baseline
## 19:      RXH  12494 2016-04-01 2017-04-01 2018-03-01  baseline
## 20:      RXP  10727 2016-04-01 2017-04-01 2017-05-01  baseline
## 21:      RYR  11578 2016-04-01 2017-04-01 2018-03-01  baseline
## 22:      RCB  10062 2018-03-01 2018-11-01 2019-03-01 sustained
## 23:      RDD  12093 2017-05-01 2018-01-01 2018-03-01 sustained
## 24:      RDE   7637 2017-05-01 2018-01-01 2018-03-01 sustained
## 25:      RGN  11896 2017-05-01 2018-01-01 2018-05-01 sustained
## 26:      RJ1  13570 2018-03-01 2018-11-01 2019-03-01 sustained
## 27:      RJE  15183 2018-05-01 2019-01-01 2019-03-01 sustained
## 28:      RJL  11972 2018-03-01 2018-11-01 2019-03-01 sustained
## 29:      RQM  18560 2018-03-01 2018-11-01 2019-03-01 sustained
## 30:      RRK  29102 2018-03-01 2018-11-01 2019-03-01 sustained
## 31:      RTE  11772 2017-05-01 2018-01-01 2018-05-01 sustained
## 32:      RTG  17169 2018-07-01 2019-03-01 2019-03-01 sustained
## 33:      RTR  10832 2018-03-01 2018-11-01 2019-03-01 sustained
## 34:      RVV  13295 2017-05-01 2018-01-01 2018-03-01 sustained
## 35:      RW6  22845 2017-05-01 2018-01-01 2019-03-01 sustained
## 36:      RWE  18173 2017-05-01 2018-01-01 2019-03-01 sustained
## 37:      RWF  12793 2018-03-01 2018-11-01 2019-03-01 sustained
## 38:      RWP  10358 2018-03-01 2018-11-01 2019-03-01 sustained
## 39:      RXC  10279 2018-03-01 2018-11-01 2019-03-01 sustained
## 40:      RXH  13158 2018-03-01 2018-11-01 2019-03-01 sustained
## 41:      RXP  11314 2017-05-01 2018-01-01 2019-03-01 sustained
## 42:      RYR  11970 2018-03-01 2018-11-01 2019-03-01 sustained
## 43:      RDD  12776 2018-03-01 2018-11-01 2019-03-01 sustained
## 44:      RDE  15322 2018-03-01 2018-11-01 2019-03-01 sustained
## 45:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
## 46:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
## 47:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained
##     org_code median start_date   end_date  extend_to  run_type

And of course, the plot itself

p2$runchart

runcharter

I haven’t looked into the actual data too much, but there are some interesting little facets here – what’s the story with RDE, RRK and RTG for example? I don’t know which Trusts these codes represent, but they show a marked increase.

The RGN (top right) and RVV (mid left) show the reason why I worked on this package – we can see that there has been more than one increase. Performing this analysis in Excel is not much fun after a while.

There is a lot more I can look at with this package, and we in the NHS-R community are always happy to receive more datasets for inclusion, so please contribute if you can.


RStudio’s Commercial Desktop License is now RStudio Desktop Pro

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

We have good news for our commercial desktop IDE customers. We are giving the commercial version of our desktop IDE a new name and some great new features, including support for the RStudio Professional Drivers, at no additional cost to you.

As part of this release, we are renaming RStudio Commercial Desktop License to RStudio Desktop Pro. Existing Commercial Desktop customers will be migrated as part of their next renewal. However, if you would like to migrate to the new release before then, or have any questions, please contact your RStudio Customer Success representative for more information.

Updates for Users

As with the current Commercial Desktop License offering, RStudio Desktop Pro has all the great features of the RStudio Desktop Open Source Edition, plus a commercial license for organizations not able to use AGPL software, and access to priority support.

Beyond that, RStudio Desktop Pro also provides:

  • RStudio Professional Drivers: These drivers provide ODBC data connectors for many of the most popular data sources. These drivers can be downloaded and configured directly from within RStudio Desktop Pro. See RStudio Pro Drivers Documentation for details, and this blog post for the most recent updates. (A generic connection sketch follows this list.)
  • License Activation and Management: To help users ensure compliance with their organization’s policies against AGPL software, RStudio Desktop Pro is a separate download from the RStudio Desktop Open Source Edition, with the AGPL license removed. A commercial license manager is integrated into the software, and the license itself is delivered as part of the purchase process. The usage of RStudio Desktop Pro is governed by a time-limited license tied to the renewal date. The time-limited license prevents users from accidentally reverting to the AGPL license if their subscription lapses.
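
As a rough illustration of what using one of those ODBC drivers looks like from R (a generic DBI/odbc sketch; the DSN name and table are placeholders, not an RStudio-specific API):

library(DBI)

# "MyDataSource" is a placeholder DSN configured against one of the professional drivers
con <- dbConnect(odbc::odbc(), dsn = "MyDataSource")
dbGetQuery(con, "SELECT COUNT(*) AS n FROM some_table")
dbDisconnect(con)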

RStudio professional products are designed to work together. For example, RStudio Desktop Pro will use the same professional drivers as RStudio Server Pro and RStudio Connect, ensuring a consistent user experience across platforms. We’d love to get your input on what you’d like to see in the future, to better integrate RStudio Desktop Pro with our other professional products. Please email sales@rstudio.com with your feedback.

For more information


Introducing Mark Padgham, rOpenSci’s new Software Research Scientist

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

We’re thrilled to be introducing a new member of our team. Mark Padgham has joined rOpenSci as a Software Research Scientist working full-time from Münster, Germany. Mark will play a key role in research and development of statistical software standards and expanding our efforts in software peer review, enabled by new funding from the Sloan Foundation. He will work closely with Noam Ross, rOpenSci Leadership team member, and Scientist at EcoHealth Alliance and Karthik Ram, rOpenSci Project Lead.

Mark has made many contributions to rOpenSci prior to joining the team. He has developed four peer-reviewed packages: osmdata to import Open Street Map data [1]; osmplotr to make bespoke plots with that data; bikedata to access historical trip data from public bicycle-hire systems [2]; and most recently, mapscanner to print maps, draw on them, and scan them back in. Mark also contributed to the development of stats19 [3]. He has reviewed three packages: piggyback, tradestatistics, and bomrang [4], authored two blog posts, and participated in rOpenSci unconf18.

Note from Stefanie: When I met Mark in Seattle at rOpenSci’s unconf18, I was struck by his personal commitment to “walking the talk” of getting around by bicycle. I knew him only from his blog post on getting data from public bicycle rental systems, and there he was with colleagues, walking down the street with a brightly-colored rental bike. He said he made a point of using such clearly marked rental bikes when visiting other cities since they’re great conversation-starters about sustainable cities.

Headshot of Mark Padgham

I’ve got a lot of concrete visions for extending testing beyond current computational or quantitative approaches. Convincing leaders in statistics and statistical software to update and innovate on current practices is certainly going to be challenging and very rewarding.

We talked with Mark to learn more about him, and his work.

What do you want people to know about your background?

I am originally from Australia, have lived and worked in a few countries along the way (including Japan and Austria), and now live in Germany. I have an undergraduate degree in physics and maths, a PhD in ecology, qualifications in climatology and meteorology, as well as official teaching qualifications. All of these were obtained from universities in Melbourne, Australia – each one from a different university. Following my PhD, I worked in post-doctoral academic positions at the interface of ecology and climatology, before changing country and continent to Germany, and changing academic direction to the study of urban systems.

I have always said I’m dedicated to changing the world one walking trip or one bike trip at a time to change the way people live life in cities, but now I have a much more “meta” personal motivation. I develop software in the hope that it will make the world a better place. It’s focussed on a particular kind of change. My software is primarily motivated by a desire to make the understanding of cities easier and more accessible – from accessing the raw data of street networks (osmdata), to fine-scaled data on mobility (bikedata), to understanding and mapping how, where, and when people move through cities (through the dodgr package).

What is your favorite thing about R?

I like to use R (rather than another mainstream language) because R is the best language to develop master control systems from your raw data input through to documentation outputs, as websites, published reports, dashboards and every step in between. They can all be controlled from the one language. There’s no other language that reaches that level of master control system. It is an accolade of what RStudio has done to transform the utility of the language.

When I sit down with R, the first question I ask myself is, how do I get into C++ from here?

What’s an R thing you’d like to learn?

Yeah…lots of things [long pause while Mark tries to choose one thing]. To really understand how the C-level language of R interfaces with objects compiled from other languages and environments – that would be really cool. And at the level of C code, how lazy-evaluation works. Everyone’s gotten so used to that but you don’t even have to think about it. That’s a piece of amazing magic under the hood that I don’t understand.

You’ve made so many concrete contributions to rOpenSci, as well as less-visible community contributions. What made you keep coming back to give more?

It comes back to my belief in software as a tool to make the world a better place. Part of that is an active effort on the part of myself and the others with whom I work and develop packages to develop a comprehensive ecosystem. I make an active personal effort to develop a comprehensive ecosystem of packages that presents an open source alternative to closed source commercial environments. And that open source alternative will hopefully help to openly transform the world.

It’s in my own interest to continue to contribute to the community because it’s only in that way that what I have contributed continues to maintain its value. Without ongoing contributions to rOpenSci that whole thing stops at the point at which I left it and all I’ve got is to sit back and hope something happens. That’s worth nothing in comparison to me continuing to actively strive to foster, cultivate, grow, expand and enrich this ecosystem of tools in our little niche.

What will you be working on and what do you think are some of the key challenges?

I’ll be working on developing robust tools to assess packages. As an inveterate coder, I love nothing more than to get up in the morning and find myself still sitting there at 4:00 in the afternoon in my “coding-hole”. But a big part of the initial stuff, which I’m really excited about, is about connecting various people who will ultimately be involved in guiding the expansion of rOpenSci into the statistical methods side of things. So it’s much more people-centered than the typical life of a 100% coder. Like so many rOpenSci projects, this aims to be very community-driven.

I’ll really enjoy being between two worlds trying to think in general terms about developing this whole new endeavour [statistical software peer review] and ensuring the development is as general and generalizable as possible. While on the other side, having to be involved with profoundly important brilliant academic experts in various aspects of statistics & statistical software so it will be both very niche and very general at the same time. One of the key challenges will be working with experts in the statistical and software communities and having to foster consensus of opinion on their part in realms that have basically never been tried before. And on the other side, I’ve got a lot of concrete visions for extending testing beyond current computational or quantitative approaches. Convincing leaders in statistics and statistical software to update and innovate on current practices is certainly going to be challenging and very rewarding.

What would you like to see in the statistical software standards or package review landscape in two years?

Actual testing of the statistical routines themselves. To date, all systems for evaluating software rely on signature keys of what reliable software might look like externally. It’s like it’s packaged up in a box and you can only examine all the things on the outside of the box because getting inside the box is almost impossible. But the R package system is also quite unique in that way, that it allows you to standardize things enough that you know what to look for and then pry into that.

I’m excited about the opportunity to develop new packages in a new field for me. All of my packages to date have been intended to meet very concrete, applied needs, and to enable users to do things in R that were not otherwise possible. I’m already cultivating a different mindset in this position, towards developing packages that enable users to do things that might already be possible, yet to do them better – along the lines of how Jim Hester’s glue package (largely) replaces current functionality with better alternatives. I’m excited to be thinking about the language itself more deeply and the structure of the way R works.

Also, the generalizability of everything. At the end I imagine there is this document that says “This is how you construct a new community of software peer review”, regardless of the academic field. That’s incredibly exciting.

Find Mark on GitHub, Twitter, Website, rOpenSci


  1. osmdata Padgham M, Rudis B, Lovelace R, Salmon M (2017). “osmdata.” The Journal of Open Source Software, 2(14), https://doi.org/10.21105/joss.00305. ↩
  2. bikedata Padgham M, Ellison R (2017). “bikedata.” The Journal of Open Source Software, 2(20), https://doi.org/10.21105/joss.00471. ↩
  3. stats19 Lovelace R, Morgan M, Hama L, Padgham M, Ranzolin D, Sparks A (2019). “stats 19: A package for working with open road crash data.” The Journal of Open Source Software, 4(33), 1181, https://doi.org/10.21105/joss.01181. ↩
  4. bomrang Sparks AH, Padgham M, Parsonage H, Pembleton K (2017). “bomrang: Fetch Australian Government Bureau of Meteorology Weather Data.” The Journal of Open Source Software, 2(17), https://doi.org/10.21105/joss.00411. ↩

The Rt of good package READMEs

[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers].

A recent topic of the Write The Docs’ great newsletter was READMEs. It read: “As they’re often the first thing people see about a code project, READMEs are pretty important to get right.” In this post, we’ll share some insights around the READMEs of R packages: why they’re crucial; what they usually contain; how you can best write yours. Let’s dive in! 🏊

Why is a good README key

As mentioned above, the WTD newsletter stated that READMEs are often the first entry point to a project. For a package you could think of other entry points such as the CRAN homepage, but the README remains quite important as seen in the poll below

When I try to become acquainted with a new (to me) #rstats package, I prefer to read ___________

— Jonathan Carroll (@carroll_jono) March 2, 2018

A good README is crucial to recruit users that’ll actually gain something from using your package. As written by noffle in the Art of README,

“your job, when you’re doing it with optimal altruism in mind, isn’t to “sell” people on your work. It’s to let them evaluate what your creation does as objectively as possible, and decide whether it meets their needs or not – not to, say, maximize your downloads or userbase.”

Furthermore, you can recycle the content of your README in other venues (more on how to do that – without copy-pasting – later) like that vignette mentioned in the poll. If you summarize your package in a good one-liner for the top of the README,

> Connect to R-hub, from R

you can re-use it

  • as Package Title in DESCRIPTION,
Title: Connect to 'R-hub'
  • in the GitHub repo description,

  • in your introduction at a social event (ok, maybe not).

Other parts of a pitch are good talk fodder, blog post introductions, vignette sections, etc. Therefore, the time you spend pitching your package in the best possible way is a gift that’ll keep on giving, to your users and you.

What is a good README

In the Art of README, noffle includes a checklist; and the rOpenSci dev guide features guidance about the README. Now, what about good READMEs in the wild? In this section, we’ll have a look at a small sample of READMEs.

Sampling READMEs

We shall start by merging the lists of top downloaded and trending CRAN packages one can obtain using pkgsearch.

library("magrittr")trending <- pkgsearch::cran_trending()top <- pkgsearch::cran_top_downloaded()pkglist <- unique(c(trending[["package"]], top[["package"]]))

This is a list of 184 package names, including effectsize, Copula.Markov, leaflet.providers, farver, renv and httptest. Then, again with pkgsearch, we’ll extract their metadata, before keeping only those that have a GitHub README. More arbitrary choices. 😬

meta <- pkgsearch::cran_packages(pkglist)

meta <- meta %>%
  dplyr::mutate(URL = strsplit(URL, "\\,")) %>%
  tidyr::unnest(URL) %>%
  dplyr::filter(stringr::str_detect(URL, "github\\.com")) %>%
  dplyr::mutate(URL = stringr::str_remove_all(URL, "\\(.*")) %>%
  dplyr::mutate(URL = stringr::str_remove_all(URL, "\\#.*")) %>%
  dplyr::mutate(URL = trimws(URL)) %>%
  dplyr::select(Package, Title, Date, Version,
                URL) %>%
  dplyr::mutate(path = urltools::path(URL)) %>%
  dplyr::mutate(path = stringr::str_remove(path, "\\/$")) %>%
  tidyr::separate(path, sep = "\\/", into = c("owner", "repo"))

str(meta)
## Classes 'tbl_df', 'tbl' and 'data.frame': 122 obs. of  7 variables:
##  $ Package: chr  "effectsize" "leaflet.providers" "farver" "httptest" ...
##  $ Title  : chr  "Indices of Effect Size and Standardized Parameters" "Leaflet Providers" "High Performance Colour Space Manipulation" "A Test Environment for HTTP Requests" ...
##  $ Date   : chr  NA NA NA NA ...
##  $ Version: chr  "0.0.1" "1.9.0" "2.0.1" "3.2.2" ...
##  $ URL    : chr  "https://github.com/easystats/effectsize" "https://github.com/rstudio/leaflet.providers" "https://github.com/thomasp85/farver" "https://github.com/nealrichardson/httptest" ...
##  $ owner  : chr  "easystats" "rstudio" "thomasp85" "nealrichardson" ...
##  $ repo   : chr  "effectsize" "leaflet.providers" "farver" "httptest" ...

At this point we have 122 packages with 122 unique GitHub repo URLs, pfiew.

We’ll then extract their preferred README from the GitHub V3 API. Some of them won’t even have one so we’ll lose them from the sample.

gh <- memoise::memoise(ratelimitr::limit_rate(gh::gh,
                                              ratelimitr::rate(1, 1)))

get_readme <- function(owner, repo){
  readme <- try(gh("GET /repos/:owner/:repo/readme",
                   owner = owner, repo = repo),
                silent = TRUE)

  if(inherits(readme, "try-error")){
    return(NULL)
  }

  lines <- suppressWarnings(
    readLines(
      readme$download_url
      )
    )

  if (length(lines) == 1){
    sub(readme$path, lines,
        readme$download_url) -> link
  } else {
    link <- readme$download_url
  }

  tibble::tibble(owner = owner,
                 repo = repo,
                 readme = list(suppressWarnings(readLines(link))))
}

readmes <- purrr::map2_df(.x = meta$owner, .y = meta$repo,
                          .f = get_readme)

The readmes data.frame has 117 lines so we lost a few more packages.

Assessing README size

Number of lines

A first metric we’ll extract is the number of lines of the README.

count_lines <- function(readme_lines){
  readme_lines %>%
    purrr::discard(. == "") %>% # empty lines
    purrr::discard(stringr::str_detect(., "\\<\\!\\-\\-")) %>% # html comments
    length()
}

readmes <- dplyr::group_by(readmes, owner, repo) %>%
  dplyr::mutate(lines_no = count_lines(readme[[1]])) %>%
  dplyr::ungroup()

How long are usual READMEs? Their number of lines ranges from 2 to 1426, with a median of 85.

library("ggplot2")ggplot(readmes) +  geom_histogram(aes(lines_no), binwidth = 5) +  xlab("No. of lines") +  scale_y_continuous(NULL, breaks = NULL) +  hrbrthemes::theme_ipsum(base_size = 16,                          axis_title_size = 16) +  ggtitle(glue::glue("Number of lines in a sample of {nrow(readmes)} READMEs"))
Figure 1: Dot plot of the number of lines in READMEs

READMEs in our sample most often don’t have more than 200 lines. Now, this metric might indicate how much a potential user needs to take in and how long they need to scroll down but we shall now look into other indicators of size: the number of lines of R code, the number of words outside of code and output.

Other size indicators

To access the numbers we’re after without using too many regular expressions, we shall convert the Markdown content to XML via commonmark and use XPath to parse it.

get_xml <- function(readme_lines){
  readme_lines %>%
      glue::glue_collapse(sep = "\n") %>%
      commonmark::markdown_xml(normalize = TRUE,
                               hardbreaks = TRUE) %>%
      xml2::read_xml() %>%
      xml2::xml_ns_strip() -> xml

  xml2::xml_replace(xml2::xml_find_all(xml, "//softbreak"),
                    xml2::read_xml("\n"))

  list(xml)
}

readmes <- dplyr::group_by(readmes, owner, repo) %>%
  dplyr::mutate(xml_readme = get_xml(readme[[1]])) %>%
  dplyr::ungroup()

This is what the XML of a single README looks like:

readmes$xml_readme[[1]]
## {xml_document}
## 
##  [1] \n  effectsize \n   ...
##  [2] \n  Si ...
##  [4] \n  The goal of this package is to ...
##  [5] \n  Installation\n< ...
##  [6] \n  Run the following:\n
## install.packages("devtools")\n ...
##  [8] library("effectsize")\n\n Documentation\n ...
## [10] \n Click on the buttons above to ...
## [12] \n \n \n \n Features\n\n This package is focused on ind ...
## [15] \n Effect Size Computatio ...
## [16] \n Basic Indices (Cohen’s ...
## [17] \n The package provides functions ...
## [18] cohens_d(iris$Sepal.Length, ir ...
## [19] \n ANOVAs (Eta\n ...
## [20] model <- aov(Sepal.Length ~ ...
## ...

Let’s count lines of code.

get_code_lines <- function(xml) {

  if(is.null(xml)) {
    return(NULL)
  }

  xml2::xml_find_all(xml, "code_block") %>%
    purrr::keep(xml2::xml_attr(., "info") == "r") %>%
    xml2::xml_text() %>%
    length
}

loc <- readmes %>%
  dplyr::group_by(repo, owner) %>%
  dplyr::summarise(loc = get_code_lines(xml_readme[[1]]))

The number of lines of R code in READMEs ranges from 0 (for 31 READMEs) to 41, with a median of 2. The README with the most lines of code is https://github.com/r-lib/ps#readme.

What about words in text?

get_wordcount <- function(xml, package) {

  if(is.null(xml)) {
    return(NULL)
  }

  xml %>%
    xml2::xml_find_all("*[not(self::code_block) and not(self::html_block)]") %>%
    xml2::xml_text() %>%
    glue::glue_collapse(sep = " ") %>%
    tibble::tibble(text = .) %>%
    tidytext::unnest_tokens(word, text) %>%
    nrow()
}

words <- readmes %>%
  dplyr::group_by(repo, owner) %>%
  dplyr::summarise(wordcount = get_wordcount(xml_readme[[1]]))

The number of words in READMEs ranges from 13 to 2105, with a median of 278. The READMEs with respectively the most and the fewest words are https://github.com/jonclayden/RNifti#readme and https://github.com/jvbraun/AlgDesign#readme.

The README size might depend on the package interface size itself, i.e. a package with a single function/dataset probably doesn’t warrant many lines. Beyond a certain interface size or complexity, one might want to make it easier on potential users by breaking up documentation into smaller articles, instead of showing all there is in one page.

Now, a helpful way to still convey information efficiently when there are more than a few things is a good README structure. Besides, a good structure is important for READMEs of all sizes.

Glimpsing at README structure

To assess README structures a bit, we first need to extract headers.

get_headings <- function(xml) {

  if(is.null(xml)) {
    return(NULL)
  }

  xml %>%
    xml2::xml_find_all("heading") -> headings

  list(tibble::tibble(text = xml2::xml_text(headings),
                      position = seq_along(headings),
                      level = xml2::xml_attr(headings, "level")))
}

structure <- readmes %>%
  dplyr::group_by(repo, owner) %>%
  dplyr::mutate(structure = get_headings(xml_readme[[1]])) %>%
  dplyr::ungroup()

Here’s the structure of the 42nd README.

structure$structure[42][[1]]
## # A tibble: 12 x 3
##    text                                      position level
##                                             
##  1 Creating Pretty Documents From R Markdown        1 2    
##  2 Themes for R Markdown                            2 3    
##  3 The prettydoc Engine                             3 3    
##  4 Options and Themes                               4 3    
##  5 Offline Math Expressions                         5 3    
##  6 Related Projects                                 6 3    
##  7 Gallery                                          7 3    
##  8 Cayman (demo page)                               8 4    
##  9 Tactile (demo page)                              9 4    
## 10 Architect (demo page)                           10 4    
## 11 Leonids (demo page)                             11 4    
## 12 HPSTR (demo page)                               12 4

We wrote an ugly long and imperfect function to visualize the structure of any of the sampled READMEs, inspired by not ugly and better code in fs.

Click to see the function.

pc <- function(...) {
  paste0(..., collapse = "")
}

print_readme_structure <- function(structure){
  structure$parent <- NA
  for (i in seq_len(nrow(structure))) {
    possible_parents <- structure$position[structure$level < structure$level[i]
                                           & structure$position < structure$position[i]]
    if (any(possible_parents)) {
      structure$parent[i] <- max(possible_parents)
    } else {
      structure$parent[i] <- NA
    }
  }

  structure$parent[structure$level == min(structure$level)] <- 0

  for (i in seq_len(nrow(structure))) {
    if (structure$level[i] == 1) {
      cat(structure$text[i])
    } else {

      if(structure$position[i] == max(structure$position[structure$parent == structure$parent[i]])) {
        firstchar <- cli:::box_chars()$l
      } else {
        firstchar <- cli:::box_chars()$j
      }

      cat(
        rep("  ", max(
          0,
          as.numeric(structure$level[i]) - 1
        )),
        pc(
          firstchar,
          pc(rep(
            cli:::box_chars()$h,
            max(
              0,
              as.numeric(structure$level[i]) - 1
            )
          ))
        ), structure$text[i]
      )
    }
    cat("\n")
  }
}

In practice,

print_readme_structure(structure$structure[42][[1]])
##    └─ Creating Pretty Documents From R Markdown
##       ├── Themes for R Markdown
##       ├── The prettydoc Engine
##       ├── Options and Themes
##       ├── Offline Math Expressions
##       ├── Related Projects
##       └── Gallery
##          ├─── Cayman (demo page)
##          ├─── Tactile (demo page)
##          ├─── Architect (demo page)
##          ├─── Leonids (demo page)
##          └─── HPSTR (demo page)

print_readme_structure(structure$structure[7][[1]])

## A Teradata Backend for dplyr
##          ├─── Koji Makiyama (@hoxo-m)
##    ├─ 1 Overview
##    ├─ 2 Installation
##    ├─ 3 Details
##       ├── 3.1 Usage
##       ├── 3.2 Translatable functions
##          ├─── 3.2.1 lubridate friendly functions
##          ├─── 3.2.2 Treat Boolean
##          ├─── 3.2.3 to_timestamp()
##          └─── 3.2.4 cut()
##       └── 3.3 Other useful functions
##          └─── 3.3.1 blob_to_string()
##    └─ 4 Related work

print_readme_structure(structure$structure[1][[1]])

## effectsize 
##    ├─ Installation
##    └─ Documentation
## Features
##    ├─ Effect Size Computation
##       ├── Basic Indices (Cohen’s d, Hedges’ g, Glass’ delta)
##       ├── ANOVAs (Eta2, Omega2, …)
##       └── Regression Models
##    ├─ Effect Size Interpretation
##    ├─ Effect Size Conversion
##    └─ Standardization
##       └── Data standardization, normalization and rank-transformation

What are the most common headers?

structure %>%
  tidyr::unnest(structure) %>%
  dplyr::count(text, sort = TRUE) %>%
  dplyr::filter(n > 4) %>%
  knitr::kable()
text               n
Installation      85
Usage             45
Overview          23
Code of Conduct   17
License           15
Example           10
Cheatsheet         9
Features           9
Getting help       5
Related work       5

This seems to be quite in line with noffle’s checklist (e.g. having installation instructions). Headers related to the background and to examples might have a title specific to the package, in which case they don’t appear in the table above.

The headers are one thing, their order is another one which we won’t analyze in this post. In the Art of README, noffle discusses “cognitive funneling” which might help you choose an order.

How to write a good README

The previous sections aimed to give an overview of the reasons for writing a good README and of the content of READMEs in the wild; this one gives a few helpful tips.

Tools for writing and re-using content

R Markdown

In noffle’s checklist for a README one item is “Clear, runnable example of usage”. You probably want to go a step further and have “Clear, runnable, executed example of usage” for which using R Markdown is quite handy. Using R Markdown to produce your package README is our number one recommendation.
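As a minimal sketch (the package name and the example chunk are placeholders, not a prescribed template), a README.Rmd set up this way can look like:

---
output: github_document
---

# mypackage

```{r example}
library(mypackage)   # placeholder package
summary(cars)        # any runnable, executed example of usage
```

Rendering it, e.g. with rmarkdown::render("README.Rmd"), produces a README.md whose example output is guaranteed to match the current version of the package.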

usethis’ README templates

To create the README, you can use either usethis::use_readme_rmd() (for an R Markdown README) or usethis::use_readme_md() (for a Markdown README). The created README file will include a few sections for you to fill in.
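For example, run inside the package project:

usethis::use_readme_rmd()  # R Markdown README, to be knitted into README.md
# or
usethis::use_readme_md()   # plain Markdown README

use_readme_rmd() also adds a git pre-commit hook that helps keep README.Rmd and README.md in sync.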

Re-use of Rmd portions in other Rmds

Then, instead of writing everything in one .Rmd and then copying entire sections of it in a vignette, you can re-use Rmd chunks as explained in this blog post by Garrick Aden-Buie presenting an idea by Brodie Gaslam. Compared to that blog post, we recommend keeping re-usable Rmd pieces in man/rmdhunks/ so that they’re available to build vignettes. In the vignettes, use

```{r child='../man/rmdhunks/child.Rmd'} ```

In the README, use

```{r child='man/rmdhunks/child.Rmd'} ```

Re-use of Rmd portions in manual pages

Since roxygen2’s latest release, you can include md/Rmd in manual pages. What a handy thing for, say, your package-level documentation! Refer to roxygen2’s documentation.

In the R files use

#' @includeRmd man/rmdhunks/child.Rmd

Table of contents

If your README is a bit long, you might need to add a table of contents at the top, as done for the pkgsearch README. In a pkgdown website, the README does not get a table of contents in the sidebar, which might be an argument for keeping it small, as opposed to articles that do get a table of contents in the sidebar.
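If the README is an R Markdown file rendered with the github_document format, one way to get such a table of contents (a sketch; adjust the depth to taste) is via the YAML header:

---
output:
  github_document:
    toc: true
    toc_depth: 2
---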

Hide clutter in details tags

You can use the details package to hide content that would otherwise clutter the README but that you still want to be available upon click. You can see it in action in the reactor README.
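A sketch of what that can look like in a README.Rmd chunk (assuming the details and sessioninfo packages are installed; the summary text is arbitrary):

```{r, echo = FALSE}
# folds the session info behind a clickable <details> tag in the rendered README
details::details(
  sessioninfo::session_info(),
  summary = "Session info"
)
```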

How to assess a README

You should write the package README with potential users in mind, who might not understand the use case for your package, who might not know the tool you’re wrapping or porting, etc. It is not easy to change perspectives, though! Having actual external people review your README, or using the audience feedback after say a talk introducing the package, can help a lot here. Ask someone on your corridor or your usual discussion venues (forums, Twitter, etc.).

Conclusion

In this post we discussed the importance of a good package README, mentioned existing guidance, and gave some data-driven clues as to what is a usual README. We also mentioned useful tools to write a good README. For further READing we recommend the Write the Docs’ list of README-related resources as well as this curated list of awesome READMEs. We also welcome your input below… What do you like seeing in a README?


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


Modelling Cairngorm snow to 2080


[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve been doing some work for the Cairngorms National Park (CNP), with the James Hutton Institute under the ClimateXChange (CXC) lead from the Scottish Government.

I don’t get to do much snow work these days, so it has been really fun to resurrect some of my PhD.

The report was issued on Monday by the CNP, to help the CNP plan for the impacts of climate change. The official version of the report is available from CXC.

Headline: forecasting is an uncertain business. That aside, the indications are:

  • There’s been ~2 °C of warming in daily minimum temperatures (at Balmoral) over the past 100 years.
  • Into the future, it is likely there will be increased variability in snow cover over the next two decades (some very snowy winters, some with very little snow).
  • After this, global/Scottish temperatures will have increased enough that there will probably be a dramatic decrease in the number of days of snow cover each winter.

Snow is beautiful and emotive, so the report has been getting some media interest with more to come (BBC One Show – keep an eye out!). Here are a couple of recent pieces:

I made a summary of my contribution in the graphic below. True to form, it is all built on open-source tools (Scribus, Ubuntu, R, QGIS).

[Figure: CNP_snow] Michael Spencer. (2019, October 24). Cairngorm National Park snow cover duration 1960 – 2080. Zenodo. http://doi.org/10.5281/zenodo.3518297

The code for making parts of the graphic and for running the snow model is online. Take a look:

Michael Spencer. (2019, October 25). Snow Cover and Climate Change in the Cairngorms National Park – model run, analysis and plot creation (Version v1.0). Zenodo. http://doi.org/10.5281/zenodo.3519210

I ran the model on the EPIC RStudio server, 50 cores was a lot quicker than running it on my desktop!


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.



How Auth0’s Data Team uses R and Python


[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

How Auth0’s Data Team uses R and Python

The Data team is responsible for crunching, reporting, and serving data. The team also handles data integrations with other systems and builds machine learning and deep learning models.

With this post, we intend to share our favorite tools, which have proven to run on billions of records. Scaling processes in real-world scenarios is a hot topic among people new to data.

This post first appeared at: https://auth0.com/blog/how-the-auth0-data-team-uses-r-and-python/

R or Python?

Well… both!

R is a GNU project, conceived as a language for statistical computing. It appeared in the mid-1990s as an open-source implementation of the S language, which was originally developed at Bell Laboratories.

Python, developed in 1991 by Guido van Rossum, is a general-purpose language with a focus on code readability.

Both R and Python are highly extensible through packages.

We mainly use R for our data processes and ML projects, and Python to do the integrations and Deep Learning projects.

Our stack is R with RStudio, and Python 3 with Jupyter notebooks.


RStudio is an extensive open-source IDE that lets you browse data and objects created during the session, view plots, and debug code, among many other features. It also provides an enterprise-ready solution.

Jupyter is also an open-source IDE, originally aimed at interfacing Julia, Python, and R. Today it is widely used by data scientists to share their analyses. Recently, Google created "Colab", a Jupyter notebook environment that runs in the Google Drive cloud.

So is R capable of running on production?

Yes.

We run several heavy data preparations and predictive models every day, every hour, and every few minutes.

How do we run R and Python tasks on production?

We use Airflow as an orchestrator, an open-source project created by Airbnb.

Airflow is an incredible and robust project which allows us to schedule processes, assign priorities and rules, keep detailed logs, and more.

For development, we still use the form: Rscript my_awesome_script.R.

Airflow is a Python-based task scheduler that allows us to run chained processes with many complex dependencies, monitoring the current state of all of them and firing alerts to Slack if anything goes wrong. This is ideal for running import jobs that populate the Data Warehouse with fresh data every day.

Do we have a data warehouse?

Yes, and it’s huge!

It’s mounted on Amazon Redshift, a suitable option if scaling is a priority. Visit their website to learn more about it.


R connects directly to Amazon Redshift thanks to the rauth0 package, which uses the redshiftTools package, developed by Pablo Seibelt.

Generally, data is uploaded from R to Amazon Redshift using redshiftTools. This data can be either plain files or from data frames created during the R session.
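As an illustration only (this is not Auth0's actual wrapper, and the connection details are placeholders), a generic way to push a small data frame from R to Redshift is a DBI connection plus dbWriteTable():

library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),                               # Redshift speaks the Postgres protocol
  host     = "my-cluster.xyz.redshift.amazonaws.com",  # placeholder host
  port     = 5439,
  dbname   = "analytics",
  user     = Sys.getenv("REDSHIFT_USER"),
  password = Sys.getenv("REDSHIFT_PASSWORD")
)

dbWriteTable(con, "logins_daily", logins_daily, overwrite = TRUE)  # logins_daily: any data frame

For large volumes this row-by-row route is slow, which is where redshiftTools earns its keep: it stages the data in S3 and loads it with a COPY statement.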

We use Python to import and export unstructured data, since R currently lacks useful libraries for handling it.

We have experimented with JSON libraries in R, but the results were much worse than with Python in this scenario. For example, with RJSONIO the dataset is automatically transformed into an R data frame, with little control over how the transformation is done. This is only useful for very simple JSON data structures and is very difficult to manipulate in R, whereas in Python it is much easier and more natural.
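A small sketch of the kind of control we mean, using the jsonlite package as an alternative (the JSON string here is made up):

library(jsonlite)

txt <- '{"user": "u1", "events": [{"type": "login"}, {"type": "logout"}]}'

# keep the nested structure as plain R lists instead of forcing a data frame
x <- fromJSON(txt, simplifyVector = FALSE)
x$events[[1]]$type

# or let jsonlite simplify the nested records into a data frame
y <- fromJSON(txt)
y$events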

How do we deal with data preparation using R?

We have two scenarios: data preparation for data engineering, and data preparation for machine learning/AI.

One of the biggest strengths of R is the tidyverse, a set of packages developed by many talented developers, some of whom work at RStudio. The packages provide a common API and a shared philosophy for working with data. We will cover an example in the next section.


The tidyverse, especially the dplyr package, contains a set of functions that make the exploratory data analysis and data preparation quite comfortable.

For certain data preparation and visualization tasks, we use the funModeling package. It was the seed for an open-source book I published some time ago, the Data Science Live Book. It contains some good practices we follow related to deploying models on production, dealing with missing data, handling outliers, and more.

Does R scale?

One of the key points of dplyr is it can be run on databases, thanks to another package with a pretty similar name: dbplyr.

This way, we write R syntax (dplyr) and it is "automagically" converted to SQL syntax, which then runs in the database on production.

There are some cases in which these conversions from R to SQL are not made automatically. For such cases, we are still able to do a mix of SQL syntax in R.

For example, the following dplyr syntax:

flights %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay))

Generates:

SELECT month, day, AVG(dep_delay) AS delay
FROM nycflights13::flights
GROUP BY month, day

This way, dbplyr makes it transparent to the R user whether they are working with objects in RAM or in a remote database.
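As a sketch of what that looks like in practice (assuming con is a DBI connection to a database containing a flights table), tbl() builds a lazy reference and show_query() reveals the SQL that dbplyr generates:

library(dplyr)
library(dbplyr)

flights_db <- tbl(con, "flights")        # lazy reference, nothing is pulled into R yet

delays <- flights_db %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

show_query(delays)   # print the SQL that would be sent to the database
collect(delays)      # only now are the results brought into R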

Not many people know it, but many key pieces of the R ecosystem are written in C++ (in particular, via the Rcpp package).

How do we share the results?

Mostly in Tableau. We have some integrations with Salesforce.

In addition, we do have some reports deployed in Shiny, especially the ones that need complex customer interaction. Shiny allows custom reports to be built using simple R code, without having to learn JavaScript, Python, or other frontend and backend languages. Through the use of a "reactive" interface, the user can input parameters that the Shiny application uses to react and redraw any reports. In contrast with tools like Tableau, Domo, or PowerBI, which are more "drag and drop", the programmatic nature of Shiny apps allows them to do almost anything the developer can conceive, which might be more difficult or impossible in other tools.
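A minimal sketch of that reactive pattern (the dataset and inputs are arbitrary):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of rows", min = 10, max = 150, value = 50),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # redrawn automatically every time input$n changes
    plot(iris$Sepal.Length[seq_len(input$n)],
         iris$Petal.Length[seq_len(input$n)],
         xlab = "Sepal length", ylab = "Petal length")
  })
}

shinyApp(ui, server)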


For ad hoc reports (HTML), we use R Markdown, which shares some functionality with Jupyter notebooks. It allows a script containing an analysis to end up as a dashboard, a PDF report, a web-based report, or even a book!

Machine Learning / AI

We use both R and Python.

For Machine Learning projects, we mainly use the caret package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.
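A minimal caret sketch (iris and the random-forest method are stand-ins for our real data and models; method = "rf" assumes the randomForest package is installed):

library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fit <- train(
  Species ~ .,
  data       = iris,
  method     = "rf",
  trControl  = ctrl,
  tuneLength = 3                                  # small hyperparameter search
)

fit$bestTune
predict(fit, head(iris))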

For Deep Learning, we use Python, specifically the Keras library with TensorFlow as the backend. Keras is an API that lets you build many of the most complex neural networks with just a handful of lines of code. Training can easily scale by running it in the cloud, on services like AWS.

Nowadays we are also doing some experiments with the fastai library for NLP problems.

Summing up!

Open-source languages are leading the way in data. R and Python have strong communities, and there are free, top-notch resources for learning them.

Here we wanted to share the not-so-common approach of using R for data engineering tasks and our favorite R and Python libraries, with a focus on sharing results and explaining some of the practices we follow every day.

We think the most important stages in a data project are data analysis and data preparation. Choosing the right approach can save a lot of time and help the project scale.

We hope this post encourages you to try some of the suggested technologies and rock your data projects!

Any Questions? Leave it in the comments 📨


To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.


Advent of Code 2019-03 with R & JavaScript


[This article was first published on Colin Fay, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Solving Advent of Code 2019-03 with R and JavaScript.

[Disclaimer] Obviously, this post contains a big spoiler about Advent of Code, as it gives solutions for solving day 3.

[Disclaimer bis] I’m no JavaScript expert so this might not be the perfect solution. TBH, that’s also the case for the R solution.

About the JavaScript code

The JavaScript code has been written in the same RMarkdown as the R code. It runs thanks to the {bubble} package: https://github.com/ColinFay/bubble

Instructions

Find the instructions at: https://adventofcode.com/2019/day/3

R solution

Part one

library(magrittr)

ipt <- scan("input3.txt", what = character(), sep = "\n")
first <- strsplit(ipt[1], split = ",")[[1]]
sec <- strsplit(ipt[2], split = ",")[[1]]

directions <- function(ipt, x, y){
  dir <- substr(ipt, 1, 1)
  how_m <- as.numeric(substr(ipt, 2, nchar(ipt)))
  if (dir == "R"){
    x <- x + how_m
  } else if (dir == "L"){
    x <- x - how_m
  } else if (dir == "U"){
    y <- y + how_m
  } else if (dir == "D"){
    y <- y - how_m
  }
  return(list(x = x, y = y))
}

get_dir <- function(vec){
  out <- data.frame(x = 0, y = 0)
  for (i in seq_along(vec)){
    y_m_1 <- out$y[nrow(out)]
    x_m_1 <- out$x[nrow(out)]
    res <- directions(vec[i], x = x_m_1, y = y_m_1)
    out %<>% rbind(data.frame(x = x_m_1:res$x, y = y_m_1:res$y)[-1, ])
  }
  out$step <- 1:nrow(out)
  out
}

out_a <- get_dir(first)
out_b <- get_dir(sec)

res <- merge(out_a, out_b, by = c("x", "y"))
res$path <- abs(res$x) + abs(res$y)
sort(unique(res$path))[2]
## [1] 386

Part two

res$tot_step <- res$step.x + res$step.y
sort(unique(res$tot_step))[2]
## [1] 6486

JS solution

Sorry today I didn’t have time to work on the JS solution… Finding the R solution was already quite a challenge 🙂

Here is the beginning of a code to solve the problem. Might get back to it later!

Part one & Two

var fs = require('fs'); // needed when running as a plain Node script

var res = fs.readFileSync("input3.txt", 'utf8').split("\n").filter(x => x.length != 0);
var first = res[0].split(",").filter(x => x.length != 0);
var sec = res[1].split(",").filter(x => x.length != 0);

function directions(ipt, x, y) {
  var dir = ipt.substring(0, 1);
  var how_m = parseInt(ipt.substring(1, ipt.length));
  if (dir == "R") {
    x = x + how_m;
  } else if (dir == "L") {
    x = x - how_m;
  } else if (dir == "U") {
    y = y + how_m;
  } else if (dir == "D") {
    y = y - how_m;
  }
  var ret = {x: x, y: y};
  return ret;
}

function get_out(vec) {
  var out = {x: [0], y: [0]};
  for (var i = 0; i < vec.length; i++) {
    var y_m_1 = out["y"][out["y"].length - 1];
    var x_m_1 = out["x"][out["x"].length - 1];
    var res = directions(vec[i], x_m_1, y_m_1);
    out.x.push(res.x);
    out.y.push(res.y);
  }
  return out;
}

var dir_f = get_out(first);
var dir_s = get_out(sec);

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.


📦 Managing dependencies in packages


[This article was first published on Iegor Rudnytskyi, PhD, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Managing the usual dependencies of a package is clearly covered in R Packages by Hadley Wickham. Typically, that would be the end of a tutorial or a post. However, while recently teaching how to develop a package, I encountered a couple of super interesting and non-trivial questions that do not have a conventional solution. I guess this post is a perfect place to share my thoughts on the matter, as well as a nice excuse to restart blogging.

Non-CRAN packages

When developing the package, the standard place to list dependencies (i.e., external packages that your package needs) is Imports: in DESCRIPTION. Full stop here. These packages are required to be installed so that your package works. And they will be installed automatically when installing your package via install.packages() (see default behavior of dependencies argument). However, packages in Imports: field are supposed to be published on CRAN. That could be an issue if your package uses functionality from packages that are not (yet) published on CRAN. This is the exact question I was asked by one of my students: where do I specify non-CRAN dependencies?

I was sure that there exists a common workflow for this. After a minute of extensive research, I found out that the CRAN policy explains it quite vaguely. Further, there were three Stack Overflow questions about it (see References below). The answer that I found was quite satisfactory: Dirk Eddelbuettel proposes to list the package in Suggests: and specify the additional repository in the special free-form field Additional_repositories:. He also suggests using the drat package to create a CRAN-like R package repository, which from my point of view is a bit of an overkill. So my solution would be to list the name of the package in Suggests: and mention the link to its GitHub repo (almost surely the source is stored on GitHub) in Additional_repositories:.

That would be a nice end to the story, but how would you let the end user know that this package needs to be pre-installed? The workaround I found is to raise a message from the function where this dependency is used and ask the user to install it, for example:

my_function <- function() {
  if (!("nonCRANpkg" %in% rownames(installed.packages()))) {
    message("Please install package nonCRANpkg.")
  }
}

The problem is that the user is sent back to the installation process only at the point when they use my_function(). In addition, it probably affects the expected output of the function, or even worse if the function is an internal one and not exported into the namespace. That is why, from my personal point of view, the installation of all dependencies should be tackled well before the first call of my_function(). And here the function .onAttach() comes in handy. This function allows displaying messages when the package is loaded. We simply need to inform the user that they need to install the dependency before using our package (mind the difference between message() and packageStartupMessage()):

.onAttach <- function(libname, pkgname) {
  if (!("nonCRANpkg" %in% rownames(installed.packages()))) {
    packageStartupMessage(paste0(
      "Please install `nonCRANpkg` by",
      " `devtools::install_github('username/nonCRANpkg')`"
    ))
  }
}

In a nutshell: mention the package name in the Suggests: field of DESCRIPTION, link to its repo in Additional_repositories: (in the same file), and write a simple .onAttach() function.
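In DESCRIPTION, that could look like this (the package name and repository URL are made up):

Package: dummypkg
Suggests:
    nonCRANpkg
Additional_repositories: https://github.com/username/nonCRANpkg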

Shiny demo app

It is always a cool idea to complement the package with a Shiny app, so that a user has an interactive interface to play around with the functionality of the package. We typically store the scripts of those demo apps in inst/shiny-examples/name_of_app and add a function runDemo() to run them (see a wonderful post by Dean Attali in the References for details). Those apps are very likely to have their own dependencies, and they definitely require the shiny namespace to be loaded. That is why we see all these library() calls at the beginning of Shiny apps’ scripts.

Obviously, (1) we want to ensure that the user has all required packages installed, and (2) avoid using library() in package’s scripts. The solution is very simple – specify all Shiny app dependencies in Imports: and use the usual :: to access functions from respective namespaces.
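A typical runDemo() then looks roughly like this (the package and app folder names are placeholders following the layout above; treat it as a sketch):

runDemo <- function() {
  app_dir <- system.file("shiny-examples", "name_of_app", package = "yourpkg")
  if (app_dir == "") {
    stop("Could not find the example app. Try re-installing the package.", call. = FALSE)
  }
  shiny::runApp(app_dir, display.mode = "normal")
}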

To sum up all the previous take-home points, I created a dummypkg for illustration, which is stored in the GitHub repo irudnyts/dummypkg. It contains a bare-bones example of non-CRAN dependencies, as well as a tiny Shiny app with dependencies. Managing those dependencies is super important, since we do not want our packages to behave like jack-in-the-boxes.

Many thanks go to Ana Lucy Bejarano Montalvo who inspired me by asking those questions.

References

  1. R packages by Hadley Wickham
  2. Include non-CRAN package in CRAN package
  3. R package building: How to import a function from a package not on CRAN
  4. How to make R package recommend a package hosted on GitHub?
  5. R package dependencies not installed from Additional_repositories
  6. Supplementing your R package with a Shiny app
  7. Lego PNG image with transparent background

To leave a comment for the author, please follow the link and comment on their blog: Iegor Rudnytskyi, PhD.


In-Database Logistic Regression with R


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.

In a previous article we illustrated how to calculate xgboost model predictions in-database. This was referenced and incorporated into tidypredict. After learning more about what the tidypredict team is up to, I discovered another tidyverse package called modeldb that fits models in-database. It currently supports linear regression and k-means clustering, so I thought I would provide an example of how to do in-database logistic regression.

Rather than focusing on the details of logistic regression, we will focus more on how we can use R and some carefully written SQL statements to iteratively minimize a cost function. We will also use the condusco R package, which allows us to iterate through the results of a query easily.

A Simple Logistic Regression Example

Let’s start with a simple logistic regression example. We’ll simulate an outcome \(y\) based on the fact that \(Pr(y=1) = \frac{e^{\beta x}}{1+e^{\beta x}}\). Here \(\beta\) is a vector containing the coefficients we will later be estimating (including an intercept term). In the example below, our \(x\) values are uniform random values between -1 and 1.

set.seed(1)

# the number of samples
n <- 1000

# uniform random on (-1,1)
x1 <- 2*runif(n) - 1
x2 <- 2*runif(n) - 1
x <- cbind(1, x1, x2)

# our betas
beta <- c(-1, -3.0, 5.0)

probs <- exp(beta %*% t(x))/(1 + exp(beta %*% t(x)))
y <- rbinom(n, 1, probs)

sim <- data.frame(id = seq(1:n), y = y, x1 = x1, x2 = x2)

mylogit <- glm(y ~ x1 + x2, data = sim, family = "binomial")
summary(mylogit)
mylogit$coefficients

As expected, the coefficients of our logistic model successfully approximate the parameters in our beta vector.

In-database Logistic Regression

Now, let’s see if we can find a way to calculate these same coefficients in-database. In this example, we’re going to use Google BigQuery as our database, and we’ll use condusco’s run_pipeline_gbq function to iteratively run the functions we define later on. To do this, we’ll need to take care of some initial housekeeping:

library(bigrquery)
library(whisker)
library(condusco)

# Uncomment and define your own config
# config <- list(
#   project = '',
#   dataset = '',
#   table_prefix = ''
# )

# a simple whisker.render helper function for our use-case
wr <- function(s, params = config){
  whisker.render(s, params)
}

# put the simulated data in GBQ
insert_upload_job(
  project = wr('{{{project}}}'),
  dataset = wr('{{{dataset}}}'),
  table = "logreg_sim",
  values = sim,
  write_disposition = "WRITE_TRUNCATE"
)

Now, we’ll create the pipelines to do the logistic regression. Please note that the code below is quite verbose. While all of it is needed for the code to work, we’ll just focus on understanding how a couple of steps work. Once we understand one step, the rest is pretty easy. Feel free to skip ahead.

First, we create a pipeline that does two things:

  • create a main table containing all of our global settings
  • calls another pipeline (log_reg_stack) with the global settings as inputs

Importantly, note that all of the parameters (e.g., {{{project}}}) are dynamically swapped out in the query below with the wr function and the params variable. So this pipeline dynamically creates a query based on the parameters passed to it. We will call this pipeline later to run the process.

#
# Pipeline: log_reg
#
log_reg <- function(params){

  print("log_reg")

  query <- '
    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_settings
    AS
    SELECT
      "{{{project}}}" AS project,
      "{{{dataset}}}" AS dataset,
      "{{{data_table}}}" AS data_table,
      {{{max_steps}}} AS max_steps,
      {{{error_tol}}} AS error_tol,
      {{{learning_rate}}} AS learning_rate,
      "{{{id_column}}}"   AS id_column,
      "{{{label_column}}}" AS label_column,
      "{{{fieldnames}}}" AS fieldnames,
      "{{{constant_id}}}" AS constant_id,
      "{{{table_prefix}}}" AS table_prefix
  '

  query_exec(
    project = wr('{{{project}}}', params),
    query = wr(query, params),
    use_legacy_sql = FALSE
  )

  # Now run the log_reg_stack pipeline and pass the settings to it
  invocation_query <- '
    SELECT *
    FROM {{{dataset}}}.{{table_prefix}}_settings
  '
  run_pipeline_gbq(
    log_reg_stack,
    wr(invocation_query, params),
    wr('{{{project}}}', params),
    use_legacy_sql = FALSE
  )
}

The above pipeline calls another pipeline, log_reg_stack, which is defined below. log_reg_stack creates a table with the field names that we will use in the logistic regression and then runs log_reg_stack_field on each of the field names. Note that the invocation_query below contains a query that results in one or more rows containing a field name. run_pipeline_gbq takes the results and iterates over them, calling log_reg_stack_field on each one. Finally, it creates the _labels table and calls log_reg_setup, passing it the results of the global settings query.

## Pipeline: stack variables#log_reg_stack <- function(params){    print ("log_reg_stack")    # Table: _fieldnames   query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_fieldnames    AS    SELECT TRIM(fieldname) AS fieldname    FROM (      SELECT split(fieldnames,',') AS fieldname      FROM (          SELECT '{{{fieldnames}}}' AS fieldnames      )    ), UNNEST(fieldname) as fieldname    GROUP BY 1  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    # Run _stack_field  query <- "    DROP TABLE IF EXISTS {{{dataset}}}.{{{table_prefix}}}_stacked  "    tryCatch({    query_exec(      project = wr('{{{project}}}', params),      query = wr(query, params),      use_legacy_sql = FALSE    )},    error = function(e){      print(e)  })      invocation_query <- "    SELECT      a.fieldname AS fieldname,        b.*    FROM (        SELECT fieldname        FROM {{{dataset}}}.{{{table_prefix}}}_fieldnames        GROUP BY fieldname    ) a      CROSS JOIN (        SELECT *          FROM {{{dataset}}}.{{{table_prefix}}}_settings    ) b  "  run_pipeline_gbq(    log_reg_stack_field,    wr(invocation_query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )    # Table: _labels  query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_labels    AS    SELECT      {{{id_column}}} AS id,      {{{label_column}}} AS label    FROM {{{data_table}}}  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )      # Run _setup  invocation_query <- "    SELECT *        FROM {{{dataset}}}.{{{table_prefix}}}_settings  "  run_pipeline_gbq(    log_reg_setup,    wr(invocation_query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )  }

The log_reg_stack_field and log_reg_setup pipelines are not particularly interesting. They do the groundwork needed to allow the log_reg_loop pipeline to iterate. The _stacked table contains the feature names and their values, and the _feature_stats and features_stacked_vni tables contains normalized values used later. Finally, the _fit_params table contains the value of the fit parameters that will be updated as we iteratively minimize the cost function in the loop. The log_reg_setup pipeline ends by calling log_reg_loop, passing it the results of the global settings query.

log_reg_stack_field <- function(params){    print ("log_reg_stack_field")  destination_table <- '{{{dataset}}}.{{{table_prefix}}}_stacked'  query <- "    SELECT {{{id_column}}} AS id,      LTRIM('{{{fieldname}}}') AS feature_name,      CAST({{{fieldname}}} AS FLOAT64) AS vi    FROM {{{data_table}}}  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    destination_table = wr(destination_table, params),    use_legacy_sql = FALSE,    write_disposition = 'WRITE_APPEND',    create_disposition = 'CREATE_IF_NEEDED'  )  }log_reg_setup <- function(params){    print ("log_reg_setup")    query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_feature_stats    AS    SELECT feature_name,      AVG(vi) AS mean,      STDDEV(vi) AS stddev    FROM {{{dataset}}}.{{{table_prefix}}}_stacked    GROUP BY feature_name  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_features_stacked_vni    AS    SELECT      a.id AS id,      a.feature_name AS feature_name,      CASE        WHEN b.stddev > 0.0 THEN (vi - b.mean) / b.stddev        ELSE vi - b.mean      END AS vni    FROM {{{dataset}}}.{{{table_prefix}}}_stacked a    JOIN {{{dataset}}}.{{{table_prefix}}}_feature_stats b      ON a.feature_name = b.feature_name  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    query <- "    INSERT INTO {{{dataset}}}.{{{table_prefix}}}_features_stacked_vni (id, feature_name, vni)         SELECT      id,      '{{{constant_id}}}' as feature_name,      1.0 as vni    FROM {{{dataset}}}.{{{table_prefix}}}_stacked    GROUP BY 1,2,3  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_fit_params    AS    SELECT      step,      param_id,      param_value,      cost,      stop,      message    FROM (      SELECT 1 as step,      feature_name as param_id,      0.0 as param_value,      1e6 as cost,      false as stop,      '' as message      FROM {{{dataset}}}.{{{table_prefix}}}_stacked      GROUP BY param_id    ) UNION ALL (      SELECT 1 as step,      '{{{constant_id}}}' as param_id,      0.0 as param_value,      1e6 as cost,      false as stop,      '' as message    )  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    # Run _loop  invocation_query <- "    SELECT *        FROM {{{dataset}}}.{{{table_prefix}}}_settings  "  run_pipeline_gbq(    log_reg_loop,    wr(invocation_query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )}

Next, we’ll create a loop pipeline that will iteratively calculate the cost function and update the _fit_params table with the latest update.

## Pipeline: loop#log_reg_loop <- function(params){    print ("log_reg_loop")    query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_x_dot_beta_i    AS     SELECT      a.id AS id,      SUM(a.vni * b.param_value) AS x_dot_beta_i    FROM {{{dataset}}}.{{{table_prefix}}}_features_stacked_vni a      RIGHT JOIN (      SELECT param_id, param_value          FROM {{{dataset}}}.{{{table_prefix}}}_fit_params          WHERE STEP = (SELECT max(step) FROM {{{dataset}}}.{{{table_prefix}}}_fit_params)    ) b    ON a.feature_name = b.param_id    GROUP BY 1  "    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )    query <- '  INSERT INTO {{{dataset}}}.{{{table_prefix}}}_fit_params (step, param_id, param_value, cost, stop, message)  SELECT      b.step + 1 as step,      b.param_id as param_id,      b.param_value - {{{learning_rate}}} * err as param_value,    -1.0 * a.cost as cost,    CASE      WHEN ( abs((b.cost-(-1.0*a.cost))/b.cost) < {{{error_tol}}} ) OR (step+1 > {{{max_steps}}})          THEN true      ELSE false        END AS stop,      CONCAT( "cost: ", CAST(abs((b.cost-(-1.0*a.cost))/b.cost) AS STRING), " error_tol: ", CAST({{{error_tol}}} AS STRING)) as message    FROM (      SELECT        param_id,        avg(err) as err,        avg(cost) as cost      FROM (        SELECT          a.id,        param_id,        (1.0/(1.0 + EXP(-1.0 * (c.x_dot_beta_i))) - CAST(label AS FLOAT64)) * vni as err,        CAST(label AS FLOAT64) * LOG( 1.0/(1.0 + EXP(-1.0 * (c.x_dot_beta_i))) )             + (1.0-CAST(label AS FLOAT64))*(log(1.0 - (1.0/(1.0 + EXP(-1.0 * (c.x_dot_beta_i))))))  as cost      FROM (          SELECT a.id as id,          b.param_id as param_id,        a.vni as vni,        b.param_value as param_value        FROM {{{dataset}}}.{{{table_prefix}}}_features_stacked_vni a          JOIN (          SELECT param_id, param_value              FROM {{{dataset}}}.{{{table_prefix}}}_fit_params              WHERE STEP = (SELECT max(step) FROM {{{dataset}}}.{{{table_prefix}}}_fit_params)        ) b        ON a.feature_name = b.param_id        GROUP BY 1,2,3,4      ) a      JOIN {{{dataset}}}.{{{table_prefix}}}_labels b        ON a.id = b.id      JOIN {{{dataset}}}.{{{table_prefix}}}_x_dot_beta_i c      ON a.id = c.id    )      GROUP BY param_id  ) a  JOIN (    SELECT *    FROM {{{dataset}}}.{{{table_prefix}}}_fit_params    WHERE STEP = (SELECT max(step) FROM {{{dataset}}}.{{{table_prefix}}}_fit_params)  ) b  ON a.param_id = b.param_id  '    query_exec(    project = wr('{{{project}}}', params),    query = wr(query, params),    use_legacy_sql = FALSE  )      # Loop or stop  query <- "      SELECT stop  AS stop      FROM (        SELECT *          FROM {{{dataset}}}.{{{table_prefix}}}_fit_params        ORDER BY step DESC        LIMIT 1      )  "    res <- query_exec(    wr(query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )    if(res$stop == FALSE){    print("stop == FALSE")    invocation_query <- '      SELECT *      FROM {{{dataset}}}.{{table_prefix}}_settings    '    run_pipeline_gbq(      log_reg_loop,      wr(invocation_query,  params),      wr('{{{project}}}', params),      use_legacy_sql = FALSE    )  }  else {    print("stop == TRUE")    invocation_query <- '      SELECT *      FROM {{{dataset}}}.{{table_prefix}}_settings    '    run_pipeline_gbq(      log_reg_done,      wr(invocation_query,  params),      wr('{{{project}}}', params),      use_legacy_sql = FALSE    )  }  }

And finally, a log_reg_done pipeline that outputs the results:

## Pipeline: done#log_reg_done <- function(params){    print ("log_reg_done")    # Display results in norm'd coords  query <- '    SELECT "normalized coords parameters" as message,      step,        param_id,        param_value     FROM {{{dataset}}}.{{{table_prefix}}}_fit_params    WHERE step = (SELECT max(step) from {{{dataset}}}.{{{table_prefix}}}_fit_params)  '    res <- query_exec(    wr(query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )    print(res)    # Display results in original coords  query <- "    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_model_params_stacked    AS     SELECT      param_id,      param_value_rescaled    FROM (      SELECT        a.param_id AS param_id,        a.param_value + b.constant_offset AS param_value_rescaled      FROM (        SELECT          step,          param_id,          param_value        FROM {{{dataset}}}.{{{table_prefix}}}_fit_params        WHERE step = (SELECT max(step) from {{{dataset}}}.{{{table_prefix}}}_fit_params)        AND param_id = 'CONSTANT'      ) a      JOIN (        SELECT          step,          'CONSTANT' as param_id,          sum(-1.0*param_value*mean/stddev) as constant_offset        FROM {{{dataset}}}.{{{table_prefix}}}_fit_params a        JOIN {{{dataset}}}.{{{table_prefix}}}_feature_stats b          ON a.param_id = b.feature_name        WHERE step = (SELECT max(step) FROM {{{dataset}}}.{{{table_prefix}}}_fit_params)        GROUP BY 1,2      ) b      ON a.param_id = b.param_id    ) UNION ALL (      SELECT        param_id,        param_value/stddev as param_value_rescaled      FROM {{{dataset}}}.{{{table_prefix}}}_fit_params a      JOIN {{{dataset}}}.{{{table_prefix}}}_feature_stats b      ON a.param_id = b.feature_name      WHERE step = (SELECT max(step) FROM {{{dataset}}}.{{{table_prefix}}}_fit_params)      GROUP BY 1,2    )  "    res <- query_exec(    wr(query, params),    wr('{{{project}}}', params),    use_legacy_sql = FALSE  )    print(res)      # transpose the _model_params_stacked table  invocation_query <- '    SELECT      a.list,      b.*    FROM (      SELECT CONCAT("[", STRING_AGG(CONCAT("{\\"val\\": \\"",TRIM(fieldname), "\\"}")), "]") AS list      FROM rstevenson.indb_logreg_001_fieldnames    ) a    CROSS JOIN (      SELECT *        FROM rstevenson.indb_logreg_001_settings    ) b  '    run_pipeline_gbq(    log_reg_model_params,    wr(invocation_query, config),    wr('{{{project}}}', config),    use_legacy_sql = FALSE  )    print("DONE")  }

Our last pipeline, called at the end of the above pipeline, will transpose the stacked model params. In other words, it will output the parameters of the model in separate columns:

log_reg_model_params <- function(params){

  query <- "
    CREATE OR REPLACE TABLE {{{dataset}}}.{{{table_prefix}}}_model_params
    AS
    SELECT
    {{#list}}
      MAX(CASE WHEN param_id='{{val}}' THEN param_value_rescaled END ) AS {{val}},
    {{/list}}
    MAX(CASE WHEN param_id='{{constant_id}}' THEN param_value_rescaled END ) AS {{constant_id}}
    FROM {{{dataset}}}.{{{table_prefix}}}_model_params_stacked
  ;"

  res <- query_exec(
    wr(query, params),
    wr('{{{project}}}', params),
    use_legacy_sql = FALSE
  )

  print(res)
}

Running the pipeline

We are now ready to run the log_reg pipeline. We’ll set up the invocation query with all of our global parameters. These will be stored in the _settings table and then, after stacking and setup, the pipeline will iterate through the loop to calculate the logistic regression coefficients.

# Run the log_reg pipeline with the following params (2D test)
invocation_query <- '
  SELECT
  "{{{project}}}" as project,
  "{{{dataset}}}" as dataset,
  "{{{table_prefix}}}" as table_prefix,
  "{{{dataset}}}.logreg_sim" as data_table,
  "25"  as max_steps,
  "1e-6" as error_tol,
  "6.0"  as learning_rate,
  "id"   as id_column,
  "y"  as label_column,
  "x1, x2"  as fieldnames,
  "CONSTANT" as constant_id'

cat(wr(invocation_query, config))

query_exec(wr(invocation_query, config), project = config$project, use_legacy_sql = FALSE)

run_pipeline_gbq(
  log_reg,
  wr(invocation_query, config),
  project = wr('{{{project}}}', config),
  use_legacy_sql = FALSE
)

After running the above, we should be able to query the table that holds the fitted parameters:

query <- "
  SELECT *
  FROM {{{dataset}}}.{{{table_prefix}}}_model_params
;"

query_exec(
  wr(query),
  wr('{{{project}}}'),
  use_legacy_sql = FALSE
)

As expected, these results are pretty close to our original beta values.

Please keep in mind that this is not ready to be released into the wild. Further improvements include modifications to deal with categorical variables, output describing whether a logistic fit is statistically significant for a particular parameter, and options for controlling step-sizes. But it does show the concept of how an iterative process like logistic regression can be done while using the database to maintain state.

Prediction

Now that we have fit the logistic regression model and the model is stored in the database, we can predict values using the model. We just need a prediction pipeline:

#
# Pipeline: predict
#
log_reg_predict <- function(params){

  query <- '
  SELECT
    1/(1+exp(-1.0*(CONSTANT + {{#list}}a.{{val}}*b.{{val}} + {{/list}} + 0))) as probability
  FROM {{{dataset}}}.{{{table_prefix}}}_model_params a
  CROSS JOIN {{{data_table}}} b
  ORDER BY {{{id_column}}}
  '

  res <- query_exec(
    wr(query, params),
    wr('{{{project}}}', params),
    use_legacy_sql = FALSE
  )
}

Note that the above uses whisker to calculate the dot product \(x\beta\) by expanding a JSON-formatted array of field names into {{#list}}a.{{val}}*b.{{val}} + {{/list}} code. In the code below, we will create a JSON-formatted array of field names. Now let’s run the predictions:
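To see what the expansion does, here is a small standalone sketch with whisker, building the list directly in R rather than from the JSON string used in the pipeline (field names follow the simulated data):

library(whisker)

params <- list(
  list = list(list(val = "x1"), list(val = "x2"))
)

template <- "CONSTANT + {{#list}}a.{{val}}*b.{{val}} + {{/list}} + 0"
whisker.render(template, params)
# expands the section once per field name, giving something like:
# "CONSTANT + a.x1*b.x1 + a.x2*b.x2 +  + 0"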

# Run the prediction pipeline with the following params
invocation_query <- '
  SELECT
    "{{{project}}}" as project,
    "{{{dataset}}}" as dataset,
    "{{{table_prefix}}}" as table_prefix,
    "{{{dataset}}}.logreg_sim" as data_table,
    "id" as id_column,
    CONCAT("[", STRING_AGG(CONCAT("{\\"val\\": \\"",TRIM(fieldname), "\\"}")), "]") AS list
  FROM {{{dataset}}}.{{{table_prefix}}}_fieldnames'

predictions <- run_pipeline_gbq(
  log_reg_predict,
  wr(invocation_query, config),
  project = wr('{{{project}}}', config),
  use_legacy_sql = FALSE
)

Let’s test the rounded predictions to see how well they approximate the outcomes:

# inspect first 5 true probs vs. predicted probabilities
head(probs[1:5])
head(predictions[[1]]$probability[1:5])

# mean relative error between true probs and predicted probabilities
mean((abs(probs - predictions[[1]]$probability))/probs)

Our in-database logistic regression model predicts the true probabilities with a mean relative error of about 7%.

Next steps

We have shown how to train and store a logistic regression model in a database. We can then predict outcomes given features that are also stored in the database without having to move data back and forth to a prediction server. In this particular example, it would likely be much faster to move the data to a computer and run the predictions there. However, certain use cases exist where in-database modeling could be an avenue for consideration. Further, since logistic models are fundamental to many types of tree and forest predictors, in-database logistic regression would be a necessary step in developing in-database tree methods. It remains to be seen if this approach can be easily translated into the tidyverse modeldb package.



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Job Posting: Research Software Engineer at EcoHealth Alliance


[This article was first published on noamross.net: R posts , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’m recruiting a Research Software Engineer to join my team at EcoHealth Alliance in New York. Details and how to apply can be found at https://www.ecohealthalliance.org/career/research-software-engineer.


To leave a comment for the author, please follow the link and comment on their blog: noamross.net: R posts .

