Channel: R-bloggers

RDieHarder 0.2.1

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new version, 0.2.1, of the random-number generator tester RDieHarder (based on the DieHarder suite developed and maintained by Robert Brown, with contributions by David Bauer and myself) is now on CRAN.

This version has only internal changes. Brian Ripley, tireless as always, is testing the impact of gcc 10 on CRAN code and found that the ‘to-be-default’ option -fno-common throws off a few (older) C code bases, this one (which is indeed old) included. So in a nutshell, we declared all global variables extern and defined them once and only once in new file globals.c. Needless to say, this affects the buildability options. In the past we used to rely on an external library libdieharder (which e.g. I had put together for Debian) but we now just build everything internally in the package.

This builds on the changes in RDieHarder 0.2.0, which I apparently had not blogged about when it came out on December 21 last year. In that release I refactored the package to use either the until-then-required (but now optional) external library, or the included library code. Doing so meant more builds on more systems, including Windows.

This (very old) package has no NEWS.Rd file to take a summary from, but the ChangeLog file has all the details.

Thanks to CRANberries, you can also look at a diff from 0.2.1 to 0.2.0, or the older diff from 0.2.0 to 0.1.4.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


Practical Tidy Evaluation

[This article was first published on jessecambon-R, and kindly contributed to R-bloggers.]

Tidy evaluation is a framework for controlling how expressions and variables in your code are evaluated by tidyverse functions. This framework, housed in the rlang package, is a powerful tool for writing more efficient and elegant code. In particular, you’ll find it useful for passing variable names as inputs to functions that use tidyverse packages like dplyr and ggplot2.

The goal of this post is to offer accessible examples and intuition for putting tidy evaluation to work in your own code. Because of this I will keep conceptual explanations brief, but for more comprehensive documentation you can refer to dplyr’s website, rlang’s website, the ‘Tidy Evaluation’ book by Lionel Henry and Hadley Wickham, and the Metaprogramming Section of the ‘Advanced R’ book by Hadley Wickham.

Motivating Example

To begin, let’s consider a simple example of calculating summary statistics with the mtcars dataset. Below we calculate maximum and minimum horsepower (hp) by the number of cylinders (cyl) using the group_by and summarize functions from dplyr.

library(dplyr)

hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))
cyl   min_hp   max_hp
4     52       113
6     105      175
8     150      335

Now let’s say we wanted to repeat this calculation multiple times while changing which variable we group by. A brute force method to accomplish this would be to copy and paste our code as many times as necessary and modify the group by variable in each iteration. However, this is inefficient especially if our code gets more complicated, requires many iterations, or requires further development.

To avoid this inelegant solution you might think to store the name of a variable inside of another variable, like this: groupby_var <- "vs". Then you could attempt to use your newly created “groupby_var” variable in your code: group_by(groupby_var). However, if you try this you will find it doesn’t work. The “group_by” function expects the name of the variable you want to group by as an input, not the name of a variable that contains the name of the variable you want to group by.
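For illustration, here is a quick sketch of that naive attempt (not from the original post):

groupby_var <- "vs"

# This does NOT group by the vs column: dplyr creates a grouping column
# named groupby_var that contains the constant string "vs"
mtcars %>%
  group_by(groupby_var) %>%
  summarize(min_hp = min(hp), max_hp = max(hp))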

This is the kind of headache that tidy evaluation can help you solve. In the example below we use the quo function and the “bang-bang” !! operator to set “vs” (engine shape, 0 = V-shaped, 1 = straight) as our group by variable. The “quo” function allows us to store the variable name in our “groupby_var” variable and “!!” extracts the stored variable name.

groupby_var <- quo(vs)

hp_by_vs <- mtcars %>%
  group_by(!!groupby_var) %>%
  summarize(min_hp = min(hp),
            max_hp = max(hp))
vs   min_hp   max_hp
0    91       335
1    52       123

The code above provides a method for setting the group by variable by modifying the input to the “quo” function when we define “groupby_var”. This can be useful, particularly if we intend to reference the group by variable multiple times. However, if we want to use code like this repeatedly in a script then we should consider packaging it into a function. This is what we will do next.

Making Functions with Tidy Evaluation

To use tidy evaluation in a function, we will still use the “!!” operator as we did above, but instead of “quo” we will use the enquo function. Our new function below takes the group by variable and the measurement variable as inputs so that we can now calculate maximum and minimum values of any variable we want. Also note two new features I have introduced in this function:

  • The as_label function extracts the string value of the “measure_var” variable (“hp” in this case). We use this to set the value of the “measure_var” column.
  • The “walrus operator” := is used to create a column named after the variable name stored in the “measure_var” argument (“hp” in the example). The walrus operator allows you to use strings and evaluated variables (such as “measure_var” in our example) on the left hand side of an assignment operation (where there would normally be a “=” operator) in functions such as “mutate” and “summarize”.

Below we define our function and use it to group by “am” (transmission type, 0 = automatic, 1 = manual) and calculate summary statistics with the “hp” (horsepower) variable.

car_stats <- function(groupby_var, measure_var) {
  groupby_var <- enquo(groupby_var)
  measure_var <- enquo(measure_var)
  return(mtcars %>%
    group_by(!!groupby_var) %>%
    summarize(min = min(!!measure_var),
              max = max(!!measure_var)) %>%
    mutate(measure_var = as_label(measure_var),
           !!measure_var := NA))
}

hp_by_am <- car_stats(am, hp)
am   min   max   measure_var   hp
0    62    245   hp            NA
1    52    335   hp            NA

We now have a flexible function that contains a dplyr workflow. You can experiment with modifying this function for your own purposes. Additionally, as you might suspect, you could use the same tidy evaluation functions we just used with tidyverse packages other than dplyr.

As an example, below I’ve defined a function that builds a scatter plot with ggplot2. The function takes a dataset and two variable names as inputs. You will notice that the dataset argument “df” needs no tidy evaluation. The as_label function is used to extract our variable names as strings to create a plot title with the “ggtitle” function.

library(ggplot2)
library(stringr)   # for str_c(), used to build the plot title below

scatter_plot <- function(df, x_var, y_var) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)
  return(ggplot(data = df, aes(x = !!x_var, y = !!y_var)) +
    geom_point() +
    theme_bw() +
    theme(plot.title = element_text(lineheight = 1, face = "bold", hjust = 0.5)) +
    geom_smooth() +
    ggtitle(str_c(as_label(y_var), " vs. ", as_label(x_var))))
}

scatter_plot(mtcars, disp, hp)

As you can see, we’ve plotted the “hp” (horsepower) variable against “disp” (displacement) and added a regression line. Now, instead of copying and pasting ggplot code to create the same plot with different datasets and variables, we can just call our function.

The “Curly-Curly” Shortcut and Passing Multiple Variables

To wrap things up, I’ll cover a few additional tricks and shortcuts for your tidy evaluation toolbox.

  • The “curly-curly” {{ }} operator directly extracts a stored variable name from “measure_var” in the example below. In the prior example we needed both “enquo” and “!!” to evaluate a variable like this so the “curly-curly” operator is a convenient shortcut. However, note that if you want to extract the string variable name with the “as_label” function, you will still need to use “enquo” and “!!” as we have done below with “measure_name”.
  • The syms function and the “!!!” operator are used for passing a list of variables as a function argument. In prior examples “!!” was used to evaluate a single group by variable; we now use “!!!” to evaluate a list of group by variables. One quirk is that to use the “syms” function we will need to pass the variable names in quotes.
  • The walrus operator “:=” is again used to create new columns, but now the column names are defined with a combination of a variable name stored in a function argument and another string (“_min” and “_max” below). We use the “enquo” and “as_label” functions to extract the string variable name from “measure_var” and store it in “measure_name” and then use the “str_c” function from stringr to combine strings. You can use similar code to build your own column names from variable name inputs and strings.

Our new function is defined below and is first called to group by the “cyl” variable and then called to group by the “am” and “vs” variables. Note that the “!!!” operator and “syms” function can be used with either a list of strings or a single string.

get_stats <- function(data, groupby_vars, measure_var) {
  groupby_vars <- syms(groupby_vars)
  measure_name <- as_label(enquo(measure_var))
  return(data %>%
    group_by(!!!groupby_vars) %>%
    summarize(!!str_c(measure_name, "_min") := min({{measure_var}}),
              !!str_c(measure_name, "_max") := max({{measure_var}})))
}

cyl_hp_stats <- mtcars %>% get_stats("cyl", mpg)
gear_stats <- mtcars %>% get_stats(c("am", "vs"), gear)
cyl   mpg_min   mpg_max
4     21.4      33.9
6     17.8      21.4
8     10.4      19.2
am   vs   gear_min   gear_max
0    0    3          3
0    1    3          4
1    0    4          5
1    1    4          5

This concludes my introduction to tidy evaluation. Hopefully this serves as a useful starting point for using these concepts in your own code.

To leave a comment for the author, please follow the link and comment on their blog: jessecambon-R.

How to precompute package vignettes or pkgdown articles

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

As of earlier this year, we are now automatically building binaries and pkgdown documentation for all rOpenSci packages. One issue we encountered is that some packages include vignettes that require some special tools/data/credentials, which are unavailable on generic build servers.

This post explains how to include such vignettes and articles in your package.

On package vignettes

By default, R automatically recreates vignettes during R CMD check or when generating pkgdown sites by running all R code. This is useful because it provides some extra testing of your code and ensures that documentation is reproducible. However, sometimes it is not a good idea to run the code on every build server, every time. For example:

  • The vignette examples require some special local software or private data.
  • The code connects to a web service that requires authentication or has limits.
  • You don’t want to hammer web services for every CMD check.
  • The vignette code takes very long to execute.

In such cases it is better to execute the rmarkdown code locally, and ship a vignette in the package which already contains the rendered R output.

The solution: locally knitting rmarkdown

Suppose you have a vignette called longexample.Rmd. To pre-compute the vignette, rename the input file to something that is not recognized by R as rmarkdown such as: longexample.Rmd.orig. Then run knitr in the package directory to execute and replace R code in the rmarkdown:

# Execute the code from the vignette
knitr::knit("vignettes/longexample.Rmd.orig", output = "vignettes/longexample.Rmd")

The new output file longexample.Rmd now contains markdown with the already executed R output. So it can be treated as a regular vignette, but R can convert it to html instantaneously without having to re-run R code from the rmarkdown.

The jsonlite package shows a real world example. In this case I pre-computed vignettes that access web APIs to prevent services from getting hammered (and potentially banning the check servers).

Saving vignette figures

One gotcha with this trick is that if the vignette output includes figures, you need to store the images in the vignettes folder. It is also a good idea to explicitly name your rmarkdown knitr chunks, so that the images have sensible filenames.

Our recently onboarded package eia by Matt Leonawicz is a good example. This package provides an R client for the US Energy Information Administration Open Data API. The eia documentation gets automatically generated for each commit on the rOpenSci docs server, even though the code in the vignettes actually requires an API key (which the docs server does not have).

[Screenshot of the eia documentation site]

The eia vignettes directory contains the Rmd.orig input files and the .Rmd files as pre-computed by the package author. Also note the vignettes directory contains a handy script precompile.R that makes it easy for the package author to refresh the output vignettes locally.

Don’t forget to update

The drawback of this approach is that documents no longer automatically update when the package changes. Therefore you should only pre-compute the vignettes and articles that are problematic, and make a note for yourself to re-knit the vignette occasionally, e.g. before a package release. Adding a script to your vignette folders that does so can be a helpful reminder.
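For illustration, such a script might look roughly like this (a hedged sketch that assumes a single pre-computed vignette called longexample.Rmd.orig and is run from the package root; the figure-moving step is only needed if chunks produce plots):

# vignettes/precompile.R
knitr::knit("vignettes/longexample.Rmd.orig",
            output = "vignettes/longexample.Rmd")

# Move any generated figures next to the vignette so the pre-rendered
# .Rmd can reference them (knitr writes to ./figure/ by default)
figs <- list.files("figure", full.names = TRUE)
if (length(figs) > 0) {
  file.copy(figs, "vignettes/", overwrite = TRUE)
  unlink("figure", recursive = TRUE)
}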

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Inset maps with ggplot2

[This article was first published on the Geocomputation with R website, and kindly contributed to R-bloggers.]

Inset maps enable multiple places to be shown in the same geographic data visualisation, as described in the Inset maps section (8.2.7) of our open source book Geocomputation with R. The topic of inset maps has gained attention and recently Enrico Spinielli asked how inset maps could be created for data in unusual coordinate systems:

Speaking of insets, do you know of any ggplot2 examples with an inset for showing where the (bbox of the) map is in an orthographic/satellite proj? It is usually done to provide geographic context… NYT/WP have fantastic maps like that

— espinielli (@espinielli) November 4, 2019

R’s flexibility allows inset maps to be created in various ways, using different approaches and packages. However, the main idea stays the same: we need to create at least two maps: a larger one, called the main map, that shows the central story and a smaller one, called the inset map, that puts the main map in context.

This blog post shows how to create inset maps with ggplot2 for visualization. The approach also uses the sf package for spatial data reading and handling, cowplot to arrange inset maps, and rcartocolor for additional color palettes. To reproduce the results on your own computer, install these packages and then attach them as follows:

library(sf)
library(ggplot2)
library(cowplot)
library(rcartocolor)

Basic inset map

Let’s start by creating a basic inset map.

Data preparation

The first step is to read and prepare the data we want to visualize. We use the us_states data from the spData package as the source of the inset map, and north_carolina from the sf package as the source of the main map.

library(spData)
data("us_states", package = "spData")
north_carolina = read_sf(system.file("shape/nc.shp", package = "sf"))

Both objects should have the same coordinate reference system (crs). Here, we use crs = 2163, which represents the US National Atlas Equal Area projection.

us_states_2163 = st_transform(us_states, crs = 2163)
north_carolina_2163 = st_transform(north_carolina, crs = 2163)

We also need the borders of the area we want to highlight (used in the main map). This can be done by extracting the bounding box of our north_carolina_2163 object.

north_carolina_2163_bb = st_as_sfc(st_bbox(north_carolina_2163))

Maps creation

The second step is to create both inset and main maps independently. The inset map should show the context (larger area) and highlight the area of interest.

ggm1 = ggplot() +
  geom_sf(data = us_states_2163, fill = "white") +
  geom_sf(data = north_carolina_2163_bb, fill = NA, color = "red", size = 1.2) +
  theme_void()

ggm1

The main map’s role is to tell the story. Here we show the number of births between 1974 and 1978 in the North Carolina counties (the BIR74 variable) using the Mint color palette from the rcartocolor package. We also customize the legend position and size – this way, the legend is a part of the map, instead of being somewhere outside the map frame.

ggm2 = ggplot() +
  geom_sf(data = north_carolina_2163, aes(fill = BIR74)) +
  scale_fill_carto_c(palette = "Mint") +
  theme_void() +
  theme(legend.position = c(0.4, 0.05),
        legend.direction = "horizontal",
        legend.key.width = unit(10, "mm"))

ggm2

Maps joining

The final step is to join the two maps. This can be done using functions from the cowplot package. We create an empty ggplot layer using ggdraw(), fill it with our main map (draw_plot(ggm2)), and add the inset map by specifying its position and size:

gg_inset_map1 = ggdraw() +
  draw_plot(ggm2) +
  draw_plot(ggm1, x = 0.05, y = 0.65, width = 0.3, height = 0.3)

gg_inset_map1

The final map can be saved using the ggsave() function.

ggsave(filename = "01_gg_inset_map.png",
       plot = gg_inset_map1,
       width = 8,
       height = 4,
       dpi = 150)

Advanced inset map

Let’s expand the idea of the inset map in ggplot2 based on the previous example.

Data preparation

This map will use the US states borders (states()) as the source of the inset map and the Kentucky Senate legislative districts (state_legislative_districts()) as the main map.

library(tigris)
us_states = states(cb = FALSE, class = "sf")
ky_districts = state_legislative_districts("KY", house = "upper",
                                           cb = FALSE, class = "sf")

The states() function, in addition to the 50 states, also returns the District of Columbia, Puerto Rico, American Samoa, the Commonwealth of the Northern Mariana Islands, Guam, and the US Virgin Islands. For our purpose, we are interested in the continental 48 states and the District of Columbia only; therefore, we remove the rest of the divisions using subset().

us_states = subset(us_states,
                   !NAME %in% c(
                     "United States Virgin Islands",
                     "Commonwealth of the Northern Mariana Islands",
                     "Guam",
                     "American Samoa",
                     "Puerto Rico",
                     "Alaska",
                     "Hawaii"
                   ))

As in the example above, we transform both objects to the same projection.

ky_districts_2163 = st_transform(ky_districts, crs = 2163)
us_states_2163 = st_transform(us_states, crs = 2163)

We also extract the bounding box of the main object here. However, instead of using it directly, we add a buffer of 10,000 meters around it. This output will be handy in both inset and main maps.

ky_districts_2163_bb = st_as_sfc(st_bbox(ky_districts_2163))
ky_districts_2163_bb = st_buffer(ky_districts_2163_bb, dist = 10000)

The ky_districts_2163 object does not have any interesting variables to visualize, so we create some random values here. However, we could also join the districts’ data with another dataset in this step.

ky_districts_2163$values = runif(nrow(ky_districts_2163))
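For example, a hedged sketch of such a join (the my_data object and the join key are made up for illustration):

library(dplyr)
# my_data would contain one row per district plus a matching NAMELSAD column
ky_districts_2163 = left_join(ky_districts_2163, my_data, by = "NAMELSAD")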

Map creation

The inset map should be as clear and simple as possible.

ggm3 = ggplot() +
  geom_sf(data = us_states_2163, fill = "white", size = 0.2) +
  geom_sf(data = ky_districts_2163_bb, fill = NA, color = "blue", size = 1.2) +
  theme_void()

ggm3

On the other hand, the main map looks better when we provide some additional context to our data. One of the ways to achieve it is to add the borders of the neighboring states.

Importantly, we also need to limit the extent of our main map to the range of the frame in the inset map. This can be done with the coord_sf() function.

ggm4 = ggplot() +
  geom_sf(data = us_states_2163, fill = "#F5F5DC") +
  geom_sf(data = ky_districts_2163, aes(fill = values)) +
  scale_fill_carto_c(palette = "Sunset") +
  theme_void() +
  theme(legend.position = c(0.5, 0.07),
        legend.direction = "horizontal",
        legend.key.width = unit(10, "mm"),
        plot.background = element_rect(fill = "#BFD5E3")) +
  coord_sf(xlim = st_bbox(ky_districts_2163_bb)[c(1, 3)],
           ylim = st_bbox(ky_districts_2163_bb)[c(2, 4)])

ggm4

Finally, we draw two maps together, trying to find the best location and size for the inset map.

gg_inset_map2 = ggdraw() +
  draw_plot(ggm4) +
  draw_plot(ggm3, x = 0.02, y = 0.65, width = 0.35, height = 0.35)

gg_inset_map2

The final map can be saved using the ggsave() function.

ggsave(filename = "02_gg_inset_map.png",
       plot = gg_inset_map2,
       width = 7.05,
       height = 4,
       dpi = 150)

Summary

The above examples can be adjusted to any spatial data and location. It is also possible to put more context on the map, including adding main cities’ names, neighboring states’ names, and annotations (using geom_text(), geom_label()). The main map can also be enhanced with the north arrow and scale bar using the ggsn package.

As always with R, there are many possible options to create inset maps. You can find two examples of inset maps created using the tmap package in the Geocomputation with R book. The second example is a classic map of the United States, which consists of the contiguous United States, Hawaii, and Alaska. However, Hawaii and Alaska are displayed at different geographic scales than the main map there. This problem can also be solved in R, which you can see in the Making maps of the USA with R: alternative layout blogpost and the Alternative layout for maps of the United States repository.

The presented approaches also apply to other areas. For example, you can find three ways on how to create an inset map of Spain in the Alternative layout for maps of Spain repository. Other examples of inset maps with ggplot2 can be found in the Inset Maps vignette by Ryan Peek and the blog post Drawing beautiful maps programmatically with R, sf and ggplot2 by Mel Moreno and Mathieu Basille.

The decision of which option to use depends on the expected map type, preferred R packages, etc. Try different approaches on your own data and decide what works best for you!

To leave a comment for the author, please follow the link and comment on their blog: the Geocomputation with R website.

Building a statistical model for field goal kicker accuracy

[This article was first published on R on Jacob Long, and kindly contributed to R-bloggers.]

This is the methodological companion to my post on my proposed method for evaluating NFL kickers. To get all the results and some extra info about the data, check it out.

When you have a hammer, everything looks like a nail, right? Well, I’m a big fan of multilevel models and especially the ability of MCMC estimation to fit these models with many, often sparse, groups. Great software implementations like Stan and my favorite R interface to it, brms, make doing applied work pretty straightforward and even fun.

As I spent time lamenting the disappointing season my Chicago Bears have been having, I tried to think about how seriously I should take the shakiness of their new kicker, Eddy Pineiro. Just watching casually, it can be hard to really know whether a kicker has had a very difficult set of kicks or not and what an acceptable FG% would be. This got me thinking about how I could use what I know to satisfy my curiosity.

Previous efforts

Of course, I’m not exactly the first person to want to do something to account for those differences. Chase Stuart over at Football Perspectives used field goal distance to adjust FG% for difficulty (as well as comparing kickers across eras by factoring in generational differences in kicking success). Football Outsiders does something similar — adjusting for distance — when grading kickers. Generally speaking, the evidence suggests that just dealing with kick distance gets you very far along the path to identifying the best kickers.

Chris Clement at the Passes and Patterns blog provides a nice review of the statistical and theoretical approaches to the issue. There are several things that are clear from previous efforts, besides the centrality of kick distance. Statistically, methods based on the logistic regression model are appropriate and the most popular — logistic regression is a statistical method designed to predict binary events (e.g., making/missing a field goal) using multiple sources of information. And besides kick distance, there are certainly elements of the environment that matter — wind, temperature, elevation among them — although just how much and how easily they can be measured is a matter of debate.

There has also been a lot of interest in game situations, especially clutch kicking. Do kickers perform worse under pressure, like when their kicks will tie the game or give their team the lead late in the game? Does “icing” the kicker, by calling a timeout before the kick, make the kicker less likely to be successful? Do they perform worse in playoff games?

On icing, Moskowitz and Wertheim (2011), Stats, Inc., Clark, Johnson, and Stimpson (2013), and LeDoux (2016) do not find compelling evidence that icing the kicker is effective. On the other hand, Berry and Wood (2004), Goldschmied, Nankin, and Cafri (2010), and Carney (2016) do find some evidence that icing hurts the kicker. All these have some limitations, including which situations qualify as “icing” and whether we can identify those situations in archival data. In general, to the extent there may be an effect, it looks quite small.

Most important in this prior work is the establishment of a few approaches to quantification. A useful way to think about comparing kickers is to know what their expected FG% (eFG%) is. That is, given the difficulty of their kicks, how would some hypothetical average kicker have fared? Once we have an expected FG%, we can more sensibly look at the kicker’s actual FG%. If we have two kickers with an actual FG% of 80%, and kicker A had an eFG% of 75% while kicker B had an eFG% of 80%, we can say kicker A is doing better because he clearly had a more difficult set of kicks and still made them at the same rate.

Likewise, once we have eFG%, we can compute points above average (PAA). This is fairly straightforward since we’re basically just going to take the eFG% and FG% and weight them by the number of kicks. This allows us to appreciate the kickers who accumulate the most impressive (or unimpressive) kicks over the long haul. And since coaches generally won’t try kicks they expect to be missed, it rewards kickers who win the trust of their coaches and get more opportunities to kick.

Extensions of these include replacement FG% and points above replacement, which use replacement level as a reference point rather than average. This is useful because if you want to know whether a kicker is playing badly enough to be fired, you need some idea of who the competition is. PAA and eFG% are more useful when you’re talking about greatness and who deserves a pay raise.

Statistical innovations

The most important — in my view — entries into the “evaluating kickers with statistical models” literature are papers by Pasteur and Cunningham-Rhoads (2014) as well as Osborne and Levine (2017).

Pasteur and Cunningham-Rhoads — I’ll refer to them as PC-R for short — gathered more data than most predecessors, particularly in terms of auxiliary environmental info. They have wind, temperature, and presence/absence of precipitation. They show fairly convincingly that while modeling kick distance is the most important thing, these other factors are important as well. PC-R also find the cardinal direction of every NFL stadium (i.e., does it run north-south, east-west, etc.) and use this information along with wind direction data to assess the presence of cross-winds, which are perhaps the trickiest for kickers to deal with. They can’t know about headwinds/tailwinds because as far as they (and I) can tell, nobody bothers to record which end zone teams defend at the game’s coin toss, so we don’t know without looking at video which direction the kick is going. They ultimately combine the total wind and the cross wind, suggesting they have some meaningful measurement error that makes them not accurately capture all the cross-winds. Using their logistic regressions that factor for these several factors, they calculate an eFG% and use it and its derivatives to rank the kickers.

PC-R include some predictors that, while empirically justifiable based on their results, I don’t care to include. These are especially indicators of defense quality, because I don’t think this should logically affect the success of a kick and is probably related to the selection bias inherent to the coach’s decision to try a kick or not. They also include a “kicker fatigue” variable that appears to show that kickers attempting 5+ kicks in a game are less successful than expected. I don’t think this makes sense and so I’m not going to include it for my purposes.

They put some effort into defining a “replacement-level” kicker which I think is sensible in spite of some limitations they acknowledge. In my own efforts, I decided to do something fairly similar by using circumstantial evidence to classify a given kicker in a given situation as a replacement or not.

PC-R note that their model seems to overestimate the probability of very long kicks, which is not surprising from a statistical standpoint given that there are rather few such kicks, they are especially likely to only be taken by those with an above-average likelihood of making them, and the statistical assumption of linearity is most likely to break down on the fringes like this. They also mention it would be nice to be able to account for kickers having different leg strengths and not just differing in their accuracy.

Osborne and Levine (I’ll call them OL) take an important step in trying to improve upon some of these limitations. Although they don’t use this phrasing, they are basically proposing to use multilevel models, which treat each kicker as his own group, thereby accounting for the possibility — I’d say it’s a certainty — that kickers differ from one another in skill.

A multilevel model has several positive attributes, especially that it not only adjusts for the apparent differences in kickers but also that it looks skeptically upon small sample sizes. A guy who makes a 55-yard kick in his first career attempt won’t be dubbed the presumptive best kicker of all time because the model will by design account for the fact that a single kick isn’t very informative. This means we can simultaneously improve the prediction accuracy on kicks, but also use the model’s estimates of kicker ability without over-interpreting small sample sizes. They also attempt to use a quadratic term for kick distance, which could better capture the extent to which the marginal difference of a few extra yards of distance is a lot different when you’re at 30 vs. 40 vs. 50 yards. OL are unsure about whether the model justifies including the quadratic term but I think on theoretical grounds it makes a lot of sense.

OL also discuss using a complementary log-log (cloglog) link rather than the logistic link, showing that it has better predictive accuracy under some conditions. I am going to ignore that advice for a few reasons, most importantly because the advantage is small and also because the cloglog link is computationally intractable with the software I’m using.

Model

Code and data for reproducing these analyses can be found on GitHub.

My tool is a multilevel logistic regression fit via MCMC using the wonderful brms R package. I actually considered several models for model selection.

In all cases, I have random intercepts for kicker and stadium. I also use random slopes for both kick distance and wind at the stadium level. Using random wind slopes at the stadium level will hopefully capture the prevailing winds at that stadium. If they tend to be helpful, it’ll have a larger absolute slope. Some stadiums may have swirling winds and this helps capture that as well. The random slope for distance hopefully captures some other things, like elevation. I also include interaction terms for wind and kick distance as well as temperature and kick distance, since the elements may only affect longer kicks.

There are indicators for whether the kick was “clutch” — game-tying or go-ahead in the 4th quarter — whether the kicker was “iced,” and whether the kick occurred in the playoffs. There is an interaction term between clutch kicks and icing to capture the ideal icing situation as well.

I have a binary variable indicating whether the kicker was, at the time, a replacement. In the main post, I describe the decision rules involved in that. I have interaction terms for replacement kickers and kick distance as well as replacement kickers and clutch kicks.

I have two random slopes at the kicker level:

  • Kick distance (allowing some kickers to have stronger legs)
  • Season (allowing kickers to have a career trajectory)

Season is modeled with a quadratic term so that kickers can decline over time — it also helps with the over-time ebb and flow of NFL FG%. It would probably be better to use a GAM for this to be more flexible, but they are a pain.

All I’ve disclosed so far is enough to have one model. But I also explore the form of kick distance using polynomials. OL used a quadratic term, but I’m not sure even that is enough. I compare 2nd, 3rd, and 4th degree polynomials for kick distance to try to improve the prediction of long kicks in particular. Of course, going down the road of polynomials can put you on a glide path towards the land of overfitting.

I fit several models, with combinations of the following:

  • 2nd, 3rd, or 4th degree polynomial
  • brms default improper priors on the fixed and random effects or weakly informative normal priors on the fixed and random effects
  • Interactions with kick distance with either all polynomial terms or just the first and second degree terms

That last category is one that I suspected — and later confirmed — could cause some weird results. Allowing all these things to interact with a 3rd and 4th degree polynomial term made for some odd predictions on the fringes, like replacement-level kickers having a predicted FG% of 70% at 70 yards out.
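To make the specification concrete, here is a minimal sketch of what one of these candidate fits might look like in brms. It is not the author's actual code; the data frame and column names (kicks, made, distance, wind, temp, clutch, iced, playoff, replacement, season, kicker, stadium) are placeholders, and it is shown with the 3rd-degree distance polynomial and interactions limited to the linear distance term:

library(brms)

fit <- brm(
  made ~ poly(distance, 3) + wind + temp +           # kick distance (cubic) plus weather
    wind:distance + temp:distance +                  # elements matter more on long kicks
    clutch * iced + playoff +                        # game-situation indicators
    replacement + replacement:distance + replacement:clutch +
    poly(season, 2) +                                # league-wide / career trajectory
    (1 + distance + wind | stadium) +                # stadium effects, incl. prevailing wind
    (1 + distance + poly(season, 2) | kicker),       # leg strength and career arc by kicker
  data = kicks,
  family = bernoulli(),
  chains = 4, cores = 4
)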

Model selection

I looked at several criteria to compare models.

A major one was approximate leave-one-out cross-validation. I will show the LOOIC, which is interpreted like AIC/BIC/DIC/WAIC in terms of lower numbers being better. This produces the same ordering as the ELPD, which has the opposite interpretation in that higher numbers are better. Another thing I looked at was generating prediction weights for the models via Bayesian model stacking. I also calculated Brier scores, which are a standard tool for looking at prediction accuracy for binary outcomes and are simply the mean squared prediction error. Last among the quantitative measures is the AUC (area under the curve), which is another standard tool in the evaluation of binary prediction models.
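A hedged sketch of how these metrics can be computed for brms fits (fit1, fit2, and the 0/1 outcome column kicks$made are placeholders):

library(brms)

loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo_compare(loo1, loo2)          # ELPD / LOOIC ordering
loo_model_weights(fit1, fit2)    # stacking weights ("stacking" is the default method)

p <- fitted(fit1)[, "Estimate"]            # posterior-mean predicted probabilities
brier <- mean((p - kicks$made)^2)          # Brier score: mean squared prediction error

# AUC: probability that a randomly chosen made kick gets a higher
# predicted probability than a randomly chosen miss (ties ignored)
auc <- mean(outer(p[kicks$made == 1], p[kicks$made == 0], ">"))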

Beyond these, I also plotted predictions in areas of interest where I’d like the model to perform well (like on long kicks) and checked certain cases where external information not shown directly to the model gives me a relatively strong prior. Chief among these was whether it separated the strong-legged kickers well.

Below I’ve summarized the model comparison results (for LOOIC and the Brier score, lower numbers are better; for the model weight and AUC, higher numbers are better). The row marked with an asterisk is the model I used.

Polynomial degree   Interaction degree   Priors     LOOIC      Model weight   Brier score   AUC
3                   2                    Proper     8030.265   0.5083         0.1165        0.7786
4                   2                    Proper     8032.332   0.1533         0.1164        0.7792
3                   2                    Improper   8032.726   0.0763         0.1160        0.7811   * (chosen model)
3                   3                    Proper     8033.879   0.0001         0.1165        0.7787
3                   3                    Improper   8035.616   0.2042         0.1158        0.7815
4                   2                    Improper   8036.614   0.0001         0.1158        0.7816
4                   4                    Proper     8038.266   0.0000         0.1163        0.7793
2                   2                    Proper     8043.221   0.0000         0.1169        0.7778
2                   2                    Improper   8043.450   0.0577         0.1165        0.7798
4                   4                    Improper   8043.741   0.0000         0.1156        0.7825

So why did I choose that model? The approximate LOO-CV procedure picks it as the third model, although there’s enough uncertainty around those estimates that it could easily be the best — or not as good as some ranked below it. It’s not clear that the 4th degree polynomial does a lot of good in the models that have it and it increases the risk of overfitting. It seems to reduce the predicted probability of very long kicks, but as I’ve thought about it more I’m not sure it’s a virtue.

Compared to the top two models, which distinguish themselves from the chosen model by their use of proper priors, the model I chose does better on the in-sample prediction accuracy metrics without looking much different on the approximate out-of-sample ones. It doesn’t get much weight because it doesn’t have much unique information compared to the slightly-preferred version with proper priors. But as I looked at the models’ predictions, it appeared to me that the regularization with the normal priors was a little too aggressive and wasn’t picking up on the differences among kickers in leg strength.

That being said, the choices among these top few models are not very important at all when it comes to the basics of who are the top and bottom ranked kickers.

Notes on predicted success

I initially resisted including things like replacement status and anything else that is a fixed characteristic of a kicker (at least within a season) or kicker-specific slopes in the model because I planned to extract the random intercepts and use that as my metric. Adding those things would make the random intercepts less interpretable; if a kicker is bad and there’s no “replacement” variable, then the intercept will be negative, but with the “replacement” variable the kicker may not have a negative intercept after the adjustment for replacement status.

Instead, I decided to focus on model predictions. Generating the expected FG% and replacement FG% was pretty straightforward. For eFG%, take all kicks attempted and set replacement = 0. For rFG%, take all kicks and set replacement = 1.
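A hedged sketch of what that looks like with the fitted brms model (object and column names are placeholders as before):

# Expected FG%: every attempted kick taken by a non-replacement kicker
kicks_exp <- transform(kicks, replacement = 0)
# Replacement FG%: the same kicks taken by a replacement-level kicker
kicks_rep <- transform(kicks, replacement = 1)

# re_formula = NA drops the kicker- and stadium-specific effects, so the
# predictions describe an average kicker in an average stadium
efg <- mean(fitted(fit, newdata = kicks_exp, re_formula = NA)[, "Estimate"])
rfg <- mean(fitted(fit, newdata = kicks_rep, re_formula = NA)[, "Estimate"])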

To generate kicker-specific probabilities, though, I had to decide how to incorporate this information. I’d clearly overrate new, replacement-level kickers. My solution to this was to, before generating predictions on hypothetical data, set each kicker’s replacement variable to his career average.

For season, on the other hand, I could eliminate the kicker-specific aspect of this by intentionally zeroing these effects out in the predictions. If I wanted to predict success in a specific season, of course, I could include this.

To leave a comment for the author, please follow the link and comment on their blog: R on Jacob Long.

Advent of Code 2019 challenge with R

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

I have decided to tackle this year’s Advent of Code using R (more or less). I know there are more popular choices for this, such as Python, C#, Java, JavaScript, Go, Kotlin, C++ or Elixir, but it was worth trying.

Eight days into the competition, at the time of writing this blog post, I have had few problems using R. The day 2 puzzle initialized the array of arrays with a 0-based first position, and since R is 1-based, I had some fun (and aggravation) tracking down the problem. In the end, I rewrote everything in Python 🙂 The same set of instructions continued on days 5 and 6, but R should be just as good as any other language.

On day 3, the Manhattan distance (as in Manhattan wiring) from the starting point to the closest intersection of two different wires needed to be calculated. That was fun, of course, but an extra treat was visualizing both wires in 2-D space.
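As a quick refresher, the Manhattan (taxicab) distance is just the sum of absolute coordinate differences; a small helper (not part of the solution below) could look like this:

# Manhattan distance between two points given as c(x, y) vectors
manhattan <- function(p, q = c(0, 0)) sum(abs(p - q))

manhattan(c(3, -4))   # distance from the starting point (0, 0): 7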

The following dataset is the one I received; yours will be different. The code was the product of about 10 minutes of writing (hence, not really optimized).

My dataset is presented as:

library(ggplot2)# read instructions for both wiresd <- c("R1007","D949","R640","D225","R390","D41","R257","D180","L372","U62","L454","U594","L427","U561","R844","D435","L730","U964","L164","U342","R293","D490","L246","U323","L14","D366","L549","U312","L851","U959","L255","U947","L179","U109","R850","D703","L310","U175","L665","U515","R23","D633","L212","U650","R477","U131","L838","D445","R999","D229","L772","U716","L137","U355","R51","D565","L410","D493","L312","U623","L846","D283","R980","U804","L791","U918","L641","U258","R301","U727","L307","U970","L748","U229","L225","U997","L134","D707","L655","D168","L931","D6","R36","D617","L211","D453","L969","U577","R299","D804","R910","D898","R553","U298","L309","D912","R757","U581","R228","U586","L331","D865","R606","D163","R425","U670","R156","U814","L168","D777","R674","D970","L64","U840","L688","U144","L101","U281","L615","D393","R277","U990","L9","U619","L904","D967","L166","U839","L132","U216","R988","U834","R342","U197","L717","U167","L524","U747","L222","U736","L149","D156","L265","U657","L72","D728","L966","U896","R45","D985","R297","U38","R6","D390","L65","D367","R806","U999","L840","D583","R646","U43","L731","D929","L941","D165","R663","U645","L753","U619","R60","D14","L811","D622","L835","U127","L475","D494","R466","U695","R809","U446","R523","D403","R843","U715","L486","D661","L584","U818","L377","D857","L220","U309","R192","U601","R253","D13","L95","U32","L646","D983","R13","U821","L1","U309","L425","U993","L785","U804","L663","U699","L286","U280","R237","U388","L170","D222","L900","U204","R68","D453","R721","U326","L629","D44","R925","D347","R264","D767","L785","U249","R989","D469","L446","D384","L914","U444","R741","U90","R424","U107","R98","U20","R302","U464","L808","D615","R837","U405","L191","D26","R661","D758","L866","D640","L675","U135","R288","D357","R316","D127","R599","U411","R664","D584","L979","D432","R887","D104","R275","D825","L338","D739","R568","D625","L829","D393","L997","D291","L448","D947","L728","U181","L137","D572","L16","U358","R331","D966","R887","D122","L334","D560","R938","D159","R178","D29","L832","D58","R37")d2 <- 
c("L993","U121","L882","U500","L740","D222","R574","U947","L541","U949","L219","D492","R108","D621","L875","D715","R274","D858","R510","U668","R677","U327","L284","U537","L371","U810","L360","U333","L926","D144","R162","U750","L741","D360","R792","D256","L44","D893","R969","D996","L905","D524","R538","U141","R70","U347","L383","U74","R893","D560","L39","U447","L205","D783","L244","D40","R374","U507","L946","D934","R962","D138","L584","U562","L624","U69","L77","D137","L441","U671","L849","D283","L742","D459","R105","D265","R312","D734","R47","D369","R676","D429","R160","D814","L881","D830","R395","U598","L413","U817","R855","D377","L338","D413","L294","U321","L714","D217","L15","U341","R342","D480","R660","D11","L192","U518","L654","U13","L984","D866","R877","U801","R413","U66","R269","D750","R294","D143","R929","D786","R606","U816","L562","U938","R484","U32","R136","U30","L393","U209","L838","U451","L387","U413","R518","D9","L847","D605","L8","D805","R348","D174","R865","U962","R926","U401","R445","U720","L843","U785","R287","D656","L489","D465","L192","U68","L738","U962","R384","U288","L517","U396","L955","U556","R707","U329","L589","U604","L583","U457","R545","D504","L521","U711","L232","D329","L110","U167","R311","D234","R284","D984","L778","D295","R603","U349","R942","U81","R972","D505","L301","U422","R840","U689","R225","D780","R379","D200","R57","D781","R166","U245","L865","U790","R654","D127","R125","D363","L989","D976","R993","U702","L461","U165","L747","U656","R617","D115","L783","U187","L462","U838","R854","D516","L978","U846","R203","D46","R833","U393","L322","D17","L160","D278","R919","U611","L59","U709","L472","U871","L377","U111","L612","D177","R712","U628","R858","D54","L612","D303","R205","U430","R494","D306","L474","U848","R816","D104","L967","U886","L866","D366","L120","D735","R694","D335","R399","D198","R132","D787","L749","D612","R525","U163","R660","U316","R482","D412","L376","U170","R891","D202","R408","D333","R842","U965","R955","U440","L26","U747","R447","D8","R319","D188","L532","D39","L863","D599","R307","U253","R22")

We create two empty data frames (one for each wire) and a helper function to decode the instructions: R1007 should be understood as “right for 1007 steps”, and the coordinates change accordingly.

gCoord2 <- function(pos, prevX, prevY) {
  direction <- substring(pos, 1, 1)
  val <- as.integer(substring(pos, 2, nchar(pos)))
  #a <- c(0,0)
  if (direction == "R") {
    a <- c(prevX, prevY + val)
    return(a)
  }
  if (direction == "L") {
    a <- c(prevX, prevY - val)
    return(a)
  }
  if (direction == "U") {
    a <- c(prevX + val, prevY)
    return(a)
  }
  if (direction == "D") {
    a <- c(prevX - val, prevY)
    return(a)
  }
}

And now to iterate through both datasets and bind them into a single one:

df <- data.frame(x = c(0), y = c(0), group = 1)
df2 <- data.frame(x = c(0), y = c(0), group = 2)

for (i in 1:length(d)) {
  ii <- d[i]
  print(ii)
  # get last value from df
  X <- tail(df$x, 1)
  Y <- tail(df$y, 1)
  x1 <- gCoord2(d[i], X, Y)[1]
  y1 <- gCoord2(d[i], X, Y)[2]
  df <- rbind(df, c(x = x1, y = y1, group = 1))
}

for (i in 1:length(d2)) {
  ii <- d2[i]
  print(ii)
  # get last value from df
  X <- tail(df2$x, 1)
  Y <- tail(df2$y, 1)
  x1 <- gCoord2(d2[i], X, Y)[1]
  y1 <- gCoord2(d2[i], X, Y)[2]
  df2 <- rbind(df2, c(x = x1, y = y1, group = 2))
}

df3 <- rbind(df, df2)

Finally, a touch of ggplot2:

ggplot(df3, aes(x = x, y = y, group = group, colour = group)) +
  geom_path(size = 0.75, show.legend = FALSE) +
  # theme_void() +
  ggtitle('Advent Of Code 2019; Day 3 - Graph; starting point (0,0)')

And the graph with both wires:

[Figure: both wire paths plotted in 2-D space]

Happy R-coding! 🙂

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

Training courses in San Francisco

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers.]

Jumping Rivers are coming to San Francisco in January 2020! We’ll be running a number of R training courses with Paradigm Data. You can find the booking links and more details over at our courses page. Don’t be afraid to get in contact if you have any questions!

22nd January – Intro to R

This is a one-day intensive course on R and assumes no prior knowledge. By the end of the course, participants will be able to import, summarise and plot their data. At each step, we avoid using "magic code", and stress the importance of understanding what R is doing.

23rd January – Getting to Grips with the Tidyverse

The tidyverse is essential for any data scientist who deals with data on a day-to-day basis. By focusing on small key tasks, the tidyverse suite of packages removes the pain of data manipulation. This training course covers key aspects of the tidyverse, including dplyr, lubridate, tidyr and tibbles.

24th January – Advanced Graphics with R

The ggplot2 package can create advanced and informative graphics. This training course stresses understanding – not just one off R scripts. By the end of the session, participants will be familiar with themes, scales and facets, as well as the wider ggplot2 world of packages.


The post Training courses in San Francisco appeared first on Jumping Rivers.

To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.

Advent of Code 2019-08 with R & JavaScript

[This article was first published on Colin Fay, and kindly contributed to R-bloggers.]

Solving Advent of Code 2019-08 with R and JavaScript.

[Disclaimer] Obviously, this post contains a big spoiler about Advent of Code, as it gives solutions for solving day 8.

About the JavaScript code

The JavaScript code has been written in the same RMarkdown as the R code. It runs thanks to the {bubble} package: https://github.com/ColinFay/bubble

Instructions

Find the instructions at: https://adventofcode.com/2019/day/8

R solution

Part one

library(magrittr)
library(purrr)

ipt <- read.delim("input8.txt", header = FALSE, colClasses = "character")$V1
ipt <- strsplit(ipt, "")[[1]] %>% as.numeric()

layers_size <- 6 * 25

l <- list()
for (i in 1:(length(ipt) / layers_size)) {
  l[[i]] <- ipt[1:150]
  ipt <- ipt[151:length(ipt)]
}

mn <- l %>%
  lapply(table) %>%
  map_dbl("0") %>%
  which.min()

l[[mn]] %>% table()
## .
##   0   1   2
##   7  14 129
14*129
## [1] 1806

Part two

v <- c()
for (i in seq_len(layers_size)) {
  idx <- map_dbl(l, i)
  v[i] <- idx[idx %in% c(0, 1)][1]
}

library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

matrix(v, ncol = 6) %>%
  as.data.frame() %>%
  rowid_to_column() %>%
  gather(key = key, value = value, V1:V6) %>%
  mutate(key = gsub("V(.)", "\\1", key) %>% as.numeric()) %>%
  ggplot(aes(rowid, key, fill = as.factor(value))) +
  geom_tile() +
  coord_fixed() +
  scale_fill_viridis_d() +
  scale_y_reverse()

JS solution

var ipt = fs.readFileSync("input8.txt", 'utf8')
  .split("")
  .filter(x => x.length != 0 & x != '\n')
  .map(x => parseInt(x));

var layers_size = 6 * 25;
var layer_n = ipt.length / layers_size;
var res = [];

function table(vec) {
  var tbl = {};
  vec.map(function(x) {
    if (tbl[x]) {
      tbl[x] = tbl[x] + 1;
    } else {
      tbl[x] = 1;
    }
  });
  return tbl;
}

for (var i = 0; i < layer_n; i++) {
  res[i] = ipt.splice(0, layers_size);
}

var res_b = res.map(x => table(x));
var minim = Math.min.apply(Math, res_b.map(x => x['0']));
var smallest = res_b.filter(x => x['0'] == minim);
smallest[0]["1"]*smallest[0]["2"];
## 1806
var v = [];
for (var i = 0; i < layers_size; i++) {
  var idx = res.map(x => x[i]);
  v[i] = idx.find(z => z == 0 | z == 1);
}

var nn = [];
for (var i = 0; i < 6; i++) {
  nn[i] = v.splice(0, 25).join(" ").replace(/0/g, " ");
}
nn
## [ '    1 1     1 1     1 1 1 1   1 1 1       1 1    ',##   '      1   1     1   1         1     1   1     1  ',##   '      1   1     1   1 1 1     1     1   1     1  ',##   '      1   1 1 1 1   1         1 1 1     1 1 1 1  ',##   '1     1   1     1   1         1   1     1     1  ',##   '  1 1     1     1   1         1     1   1     1  ' ]
To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.


5 Data Science Technologies for 2020 (and Beyond)

[This article was first published on business-science.io, and kindly contributed to R-bloggers.]

Moving into 2020, three things are clear – organizations want Data Science, Cloud, and Apps. Here are the top 5 essential skills for Data Scientists who need to build and deploy applications in 2020 and beyond.

This is part of a series of articles on key Data Science skills for 2020.

Top 20 Tech Skills 2014-2019

Indeed, the popular employment-related search engine, released an article showing changing trends from 2014 to 2019 in “Technology-Related Job Postings”, examining the 5-Year Change of the most requested technology skills.

Today's Top Tech Skills

Top 20 Tech Skills 2014-2019 Source: Indeed Hiring Lab.

I’m generally not a big fan of these reports because the technology landscape changes so quickly. But I was pleasantly surprised at the length of time covered by the analysis – Indeed looked at changes over a 5-year period, which gives a much better sense of the long-term trends.

Why No R, Shiny, Tableau, PowerBI, Alteryx?

The skills reported are not “Data Science”-specific (which is why you don’t see R, Tableau, PowerBI, or Alteryx on the list).

However, we can glean insights based on the technologies present…

Cloud, Machine Learning, Apps Driving Growth

From the technology growth, it’s clear that Businesses need Cloud + ML + Apps.

Key Technologies Driving Tech Skill Growth

Technologies Driving Tech Skill Growth

My Takeaway

This assessment has led me to my key technologies for Data Scientists heading into 2020. I focus on key technologies related to Cloud + ML + Apps.

Top 5 Data Science Technologies for Cloud + ML + Apps

These are the technologies Data Scientists should learn for 2020 and beyond – they are geared towards the business demands: Cloud + ML + Apps. In other words, businesses need data-science and machine-learning-powered web applications deployed into production via the Cloud.

Here’s what you need to learn to build ML-Powered Web Applications and deploy in the Cloud.

*Note that R and Python are skills that you should be learning before you jump into these.


5 Key Data Science Technologies for Cloud + Machine Learning + Applications

1. AWS Cloud Services

The most popular cloud service provider. EC2 is a staple for apps, running jupyter/rstudio in the cloud, and leveraging cloud resources rather than investing in expensive computers & servers.

AWS Resource: AWS for Data Science Apps – 14% Share, 400% Growth

2. Shiny Web Apps

A comprehensive web framework designed for data scientists with a rich ecosystem of extension libraries (dubbed the “shinyverse”).

Shiny Resource (Coming Soon): Shiny Data Science Web Applications

3. H2O Machine Learning

Automated machine learning library available in Python and R. Works well on structured data (format for 95% of business problems). Automation drastically increases productivity in machine learning.
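As a rough illustration of what that automation looks like, here is a minimal sketch of H2O's AutoML interface in R; the data frame train and the outcome column churn are hypothetical, and the call assumes the h2o package is installed and can start a local cluster.

library(h2o)
h2o.init()                                   # starts a local H2O cluster

train_h2o <- as.h2o(train)                   # 'train' is a hypothetical data frame
aml <- h2o.automl(y = "churn",               # hypothetical outcome column
                  training_frame = train_h2o,
                  max_runtime_secs = 300)    # let AutoML search for 5 minutes

aml@leaderboard                              # ranked models found by AutoML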

H2O Resource (Coming Soon): H2O Automated Machine Learning (AutoML)

4. Docker for Web Apps

Creating docker environments drastically reduces the risk of software incompatibility in production. DockerHub makes it easy to share your environment with other Data Scientists or DevOps. Further, Docker and DockerHub make it easy to deploy applications into production.

Docker Resource: Docker for Data Science Apps – 4000% Growth

5. Git Version Control

Git and GitHub are staples for reproducible research and web application development. Git tracks past versions and enables software upgrades to be performed on branches. GitHub makes it easy to share your research and/or web applications with other Data Scientists, DevOps, or Data Engineering. Further, Git and GitHub make it easy to deploy changes to apps in production.

Git Resource (Coming Soon): Git Version Control for Data Science Apps

Other Technologies Worth Mentioning

  1. dbplyr for SQL – For data scientists who need to create complex SQL queries but don’t have time to deal with messy SQL, dbplyr is a massive productivity booster. It converts R (dplyr) to SQL, and you can use it for 95% of SQL queries (see the sketch after this list).

  2. Bootstrap – For data scientists who build apps, Bootstrap is a front-end web framework that Shiny is built on top of, and it powers much of the web (e.g. Twitter’s app). Bootstrap makes it easy to control the User Interface (UI) of your application.

  3. MongoDB – For data scientists who build apps, MongoDB is a NoSQL database that is useful for storing complex user information of your application in one table. Much easier than creating a multi-table SQL database.
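To make the dbplyr point above concrete, here is a minimal sketch; the connection object con and the table "sales" with columns region and amount are hypothetical, not part of the original article.

library(dplyr)
library(dbplyr)

sales_tbl <- tbl(con, "sales")      # 'con' is a hypothetical DBI connection

query <- sales_tbl %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE))

show_query(query)                   # prints the SQL that dbplyr generated
collect(query)                      # runs the query and returns the result as a tibble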

Real Shiny App + AWS + Docker Case Example

In my Shiny Developer with AWS Course (NEW), you use the following application architecture that uses AWS EC2 to create an Ubuntu Linux Server that hosts a Shiny App in the cloud called the Stock Analyzer.

Data Science Web Application Architecture From Shiny Developer with AWS Course

You use AWS EC2 to build a server to run your Stock Analyzer application along with several other web apps.

AWS EC2 Instance used for Cloud Deployment From Shiny Developer with AWS Course

Next, you use a DockerFile to containerize the application’s software environment.

DockerFile for Stock Analyzer App From Shiny Developer with AWS Course

You then deploy your “Stock Analyzer” application so it’s accessible anywhere via the AWS Cloud.

DockerFile for Stock Analyzer App From Shiny Developer with AWS Course

If you are ready to learn how to build and deploy Shiny Applications in the cloud using AWS, then I recommend my NEW 4-Course R-Track System.

I look forward to providing you the best data science for business education.

Matt Dancho

Founder, Business Science

Lead Data Science Instructor, Business Science University


To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


RcppClassic 0.9.12


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A maintenance release 0.9.12 of the RcppClassic package arrived earlier today on CRAN. This package provides a maintained version of the otherwise deprecated initial Rcpp API; no new projects should use it, as the normal Rcpp API is so much better.

Changes are all internal. Testing is now done via tinytest, vignettes are now pre-built and at the request of CRAN we no longer strip the resulting library. No other changes were made.

CRANberries also reports the changes relative to the previous release from July of last year.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


About URLs in DESCRIPTION


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Among DESCRIPTION's usual fields is the free-text URL field, where package authors can store various links: to the development website, docs, upstream tool, etc. In this post, we shall explain why storing URLs in DESCRIPTION is important, where else you should add URLs, and what kind of URLs are stored in CRAN packages these days.

Why put URLs in DESCRIPTION?

In the following we’ll assume your package has some sort of online development repository (GitHub? GitLab? R-Forge?) and a documentation website (handily created via pkgdown?). Adding URLs to your package’s online homes is extremely useful for several reasons.

As a side note: Yes, you can store several URLs under URL, even if the field name is singular. See for instance rhub’s DESCRIPTION🔗🔗

URL: https://github.com/r-hub/rhub, https://r-hub.github.io/rhub/


  • It will help your users find your package’s pretty documentation from the CRAN page, instead of just the less pretty PDF manual.

  • Likewise, from the CRAN page your contributors can directly find where to submit patches.

  • If your package has a package-level man page, and it should (e.g. as drafted by usethis::use_package_doc() and then generated by roxygen2), then after typing say library("rhub") and then ?rhub, your users will find the useful links.

  • Other tools such as helpdesk and the pkgsearch RStudio addin can help surface the URLs you store in DESCRIPTION.

  • Indirectly, having a link to the docs website and development repo will increase their page rank (see the useful comments in this discussion), so potential users and contributors can find them more easily simply by searching for your package.

Quick tip, you can add GitHub URLs (URL and BugReports) to DESCRIPTION by running usethis::use_github_links(). 🚀

Where else put your URLs?

For the same reasons as previously, you should make the most of all places that can store your package’s URL(s). Have you put your package’s docs URL

Have you used any of your package’s URLs

Don’t miss any opportunity to point users and contributors in the right direction!

What URLs do people use in DESCRIPTION files of CRAN packages?

In the following, we shall parse the URL field of the CRAN packages database.

db <- tools::CRAN_package_db()
db <- tibble::as_tibble(db[, c("Package", "URL")])
db <- dplyr::distinct(db)

There are 15315 packages on CRAN at the time of writing, among which 8040 have something written in the URL field. We can parse this data.

db <- db[!is.na(db$URL),]

library("magrittr")

# function from https://github.com/r-hub/pkgsearch/blob/26c4cc24b9296135b6238adc7631bc5250509486/R/addin.R#L490-L496
url_regex <- function() "(https?://[^\\s,;>]+)"

find_urls <- function(txt) {
  mch <- gregexpr(url_regex(), txt, perl = TRUE)
  res <- regmatches(txt, mch)[[1]]
  if(length(res) == 0) {
    return(list(NULL))
  } else {
    list(unique(res))
  }
}

db %>%
  dplyr::group_by(Package) %>%
  dplyr::mutate(actual_url = find_urls(URL)) %>%
  dplyr::ungroup() %>%
  tidyr::unnest(actual_url) %>%
  dplyr::group_by(Package, actual_url) %>%
  dplyr::mutate(url_parts = list(urltools::url_parse(actual_url))) %>%
  dplyr::ungroup() %>%
  tidyr::unnest(url_parts) %>%
  dplyr::mutate(scheme = trimws(scheme)) -> parsed_db

There are 7192 with at least one valid URL.

What are the packages with most links?

mostlinks <- dplyr::count(parsed_db, Package, sort = TRUE)
mostlinks
## # A tibble: 7,192 x 2
##    Package           n
##    <chr>         <int>
##  1 RcppAlgos         7
##  2 BIFIEsurvey       5
##  3 BigQuic           5
##  4 dendextend        5
##  5 PGRdup            5
##  6 vwline            5
##  7 ammistability     4
##  8 augmentedRCBD     4
##  9 dcGOR             4
## 10 dialr             4
## # … with 7,182 more rows

The package with the most links in URL is RcppAlgos.

What is the most popular scheme, http or https?

dplyr::count(parsed_db, scheme, sort = TRUE)
## # A tibble: 2 x 2
##   scheme     n
##   <chr>  <int>
## 1 https   5910
## 2 http    2496

A bit less than one third of the links use http.

Can we identify popular domains?

dplyr::count(parsed_db, domain, sort = TRUE)
## # A tibble: 1,855 x 2
##    domain                    n
##    <chr>                 <int>
##  1 github.com             4660
##  2 www.r-project.org       164
##  3 cran.r-project.org      143
##  4 r-forge.r-project.org    82
##  5 bitbucket.org            67
##  6 sites.google.com         54
##  7 arxiv.org                52
##  8 gitlab.com               44
##  9 docs.ropensci.org        38
## 10 www.github.com           32
## # … with 1,845 more rows

GitHub seems to be the most popular development platform, at least from this sample of CRAN packages that indicate a URL. It is also possible that some developers set up their own GitLab server with its own domain. Many packages link to www.r-project.org, which is not very informative, or to their own CRAN page, which can be informative.

Other relatively popular domains are sites.google.com and arxiv.org. There are probably also links to venues for scientific publications other than arxiv.org. What about doi.org?

dplyr::filter(parsed_db, domain %in% c("doi.org", "dx.doi.org")) %>%
  dplyr::select(Package, actual_url)
## # A tibble: 44 x 2
##    Package                actual_url                                    
##    <chr>                  <chr>                                         
##  1 abcrlda                https://dx.doi.org/10.1109/LSP.2019.2918485   
##  2 adwave                 https://doi.org/10.1534/genetics.115.176842   
##  3 ammistability          https://doi.org/10.5281/zenodo.1344756        
##  4 anMC                   https://doi.org/10.1080/10618600.2017.1360781 
##  5 ANOVAreplication       https://dx.doi.org/10.17605/OSF.IO/6H8X3      
##  6 AssocAFC               https://doi.org/10.1093/bib/bbx107            
##  7 augmentedRCBD          https://doi.org/10.5281/zenodo.1310011        
##  8 CorrectOverloadedPeaks http://dx.doi.org/10.1021/acs.analchem.6b02515
##  9 dataMaid               https://doi.org/10.18637/jss.v090.i06         
## 10 disclapmix             http://dx.doi.org/10.1016/j.jtbi.2013.03.009  
## # … with 34 more rows

The “earlier but no longer preferred” dx.doi.org is still in use.

rOpenSci's docs server also makes an appearance.

Note that you could do a similar analysis of the BugReports field. We’ll leave that as an exercise to the reader. 😉

Conclusion

In this note, we explained why having URLs in DESCRIPTION of your package can help users and contributors find the right venues for their needs, and we had a look at URLs currently stored in the DESCRIPTIONs of CRAN packages, in particular discussing current popular domains. How do you ensure the users of your package can find its best online home(s)? How do you look for online home(s) of the packages you use?


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


Repeated measures can improve estimation when we only care about a single endpoint


[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’ve been participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, involving multiple clusters, three interventions, a sequential and adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.

There was one issue that came up that should be fairly generalizable to other studies. In this case, individual measures will be collected repeatedly over time but the primary outcome of interest will be the measure collected during the last follow-up period. I wanted to explore what, if anything, can be gained by analyzing all of the available data rather than focusing only on the final end point.

Data generation

In this simulation scenario, there will be 200 subjects randomized at the individual level to one of two treatment arms, intervention (\(rx = 1\)) and control (\(rx = 0\)). Each person will be followed for 5 months, with a binary outcome measure collected at the end of each month. In the data, period 0 is the first month, and period 4 is the final month.

library(simstudy)

set.seed(281726)

dx <- genData(200)
dx <- trtAssign(dx, grpName = "rx")
dx <- addPeriods(dx, nPeriods = 5)

Here are the data for a single individual:

dx[id == 142]
##     id period rx timeID
## 1: 142      0  1    706
## 2: 142      1  1    707
## 3: 142      2  1    708
## 4: 142      3  1    709
## 5: 142      4  1    710

The probabilities of the five binary outcomes for each individual are a function of time and intervention status.

defP <- defDataAdd(varname = "p",
                   formula = "-2 + 0.2*period + 0.5*rx",
                   dist = "nonrandom", link = "logit")

dx <- addColumns(defP, dx)

The outcomes for a particular individual are correlated, with outcomes in two adjacent periods more highly correlated than outcomes collected further apart. (I use an auto-regressive correlation structure to generate these data.)

dx <- addCorGen(dtOld = dx, idvar = "id", nvars = 5, rho = 0.6,
                corstr = "ar1", dist = "binary", param1 = "p",
                method = "ep", formSpec = "-2 + 0.2*period + 0.5*rx",
                cnames = "y")

dx[id == 142]
##     id period rx timeID    p y
## 1: 142      0  1    706 0.18 0
## 2: 142      1  1    707 0.21 0
## 3: 142      2  1    708 0.25 1
## 4: 142      3  1    709 0.29 0
## 5: 142      4  1    710 0.33 0

In the real world, there will be loss to follow-up – not everyone will be observed until the end. In the first case, I will be assuming the data are missing completely at random (MCAR), where missingness is independent of all observed and unobserved variables. (I have mused on missingness before.)

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE)

dm <- genMiss(dx, MCAR, "id", repeated = TRUE, periodvar = "period")
dObs <- genObs(dx, dm, idvars = "id")

dObs[id == 142]
##     id period rx timeID    p  y
## 1: 142      0  1    706 0.18  0
## 2: 142      1  1    707 0.21  0
## 3: 142      2  1    708 0.25  1
## 4: 142      3  1    709 0.29 NA
## 5: 142      4  1    710 0.33 NA

In this data set only about 70% of the total sample is observed – though by chance there is different dropout for each of the treatment arms:

dObs[period == 4, .(prop.missing = mean(is.na(y))), keyby = rx]
##    rx prop.missing
## 1:  0         0.28
## 2:  1         0.38

Estimating the intervention effect

If we are really only interested in the probability of a successful outcome in the final period, we could go ahead and estimate the treatment effect using a simple logistic regression using individuals who were available at the end of the study. The true value is 0.5 (on the logistic scale), and the estimate here is close to 1.0 with a standard error just under 0.4:

fit.l <- glm(y ~ rx, data = dObs[period == 4], family = binomial)
coef(summary(fit.l))
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -1.25       0.28    -4.4  9.9e-06
## rx              0.99       0.38     2.6  9.3e-03

But, can we do better? Fitting a longitudinal model might provide a more stable and possibly less biased estimate, particularly if the specified model is the correct one. In this case, I suspect it will be an improvement, since the data was generated using a process that is amenable to a GEE (generalized estimating equation) model.

library(geepack)

fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
                data = dObs, corstr = "ar1")
coef(summary(fit.m))
##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.33   0.259   81  0.00000
## period          0.30   0.072   17  0.00003
## rx              0.83   0.263   10  0.00152

And finally, it is reasonable to expect that a model that is based on a data set without any missing values would provide the most efficient estimate. And that does seem to be the case if we look at the standard error of the effect estimate.

fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
                data = dx, corstr = "ar1")
coef(summary(fit.f))
##             Estimate Std.err Wald Pr(>|W|)
## (Intercept)    -2.15   0.227 89.2  0.0e+00
## period          0.30   0.062 23.1  1.5e-06
## rx              0.54   0.233  5.4  2.1e-02

Of course, we can’t really learn much of anything from a single simulated data set. Below is a plot of the mean estimate under each modeling scenario (along with the blue line that represents \(\pm 2\)sd) based on 2500 simulated data sets with missingness completely at random. (The code for these replications is included in the addendum.)

It is readily apparent that under an assumption of MCAR, all estimation models yield unbiased estimates (the true effect size is 0.5), though using the last period only is inherently more variable (given that there are fewer observations to work with).

Missing at random

When the data are MAR (missing at random), using the last period only no longer provides an unbiased estimate of the effect size. In this case, the probability of missingness is a function of time, intervention status, and the outcome from the prior period, all of which are observed. This is how I’ve defined the MAR process:

MAR <- defMiss(varname = "y",
               formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE)

The mean plots based on 2500 iterations reveal the bias of the last period only. It is interesting to see that the GEE model is not biased, because we have captured all of the relevant covariates in the model. (It is well known that a likelihood method can yield unbiased estimates in the case of MAR, and while GEE is not technically a likelihood, it is a quasi-likelihood method.)

Missing not at random

When missingness depends on unobserved data, such as the outcome itself, then GEE estimates are also biased. For the last set of simulations, I defined missingness of \(y\) in any particular time period to be a function of itself. Specifically, if the outcome was successful and the subject was in the intervention, the subject would be more likely to be observed:

NMAR <- defMiss(varname = "y",
                formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE)

Under the assumption of missingness not at random (NMAR), both estimation approaches based on the observed data set with missing values yield a biased estimate, though using all of the data appears to reduce the bias somewhat:

Addendum: generating replications

iter <- function(n, np, defM) {
  
  dx <- genData(n)
  dx <- trtAssign(dx, grpName = "rx")
  dx <- addPeriods(dx, nPeriods = np)
  
  defP <- defDataAdd(varname = "p", formula = "-2 + 0.2*period + .5*rx",
                     dist = "nonrandom", link = "logit")
  
  dx <- addColumns(defP, dx)
  dx <- addCorGen(dtOld = dx, idvar = "id", nvars = np, rho = .6,
                  corstr = "ar1", dist = "binary", param1 = "p",
                  method = "ep", formSpec = "-2 + 0.2*period + .5*rx",
                  cnames = "y")
  
  dm <- genMiss(dx, defM, "id", repeated = TRUE, periodvar = "period")
  dObs <- genObs(dx, dm, idvars = "id")
  
  fit.f <- geeglm(y ~ period + rx, id = id, family = binomial,
                  data = dx, corstr = "ar1")
  
  fit.m <- geeglm(y ~ period + rx, id = id, family = binomial,
                  data = dObs, corstr = "ar1")
  
  fit.l <- glm(y ~ rx, data = dObs[period == (np - 1)], family = binomial)
  
  return(data.table(full = coef(fit.f)["rx"],
                    miss = coef(fit.m)["rx"],
                    last = coef(fit.l)["rx"])
         )
}

## defM

MCAR <- defMiss(varname = "y", formula = "-2.6",
                logit.link = TRUE, monotonic = TRUE)

MAR <- defMiss(varname = "y",
               formula = "-2.9 + 0.2*period - 2*rx*LAG(y)",
               logit.link = TRUE, monotonic = TRUE)

NMAR <- defMiss(varname = "y",
                formula = "-2.9 + 0.2*period - 2*rx*y",
                logit.link = TRUE, monotonic = TRUE)

##

library(parallel)

niter <- 2500

resMCAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MCAR)))
resMAR  <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, MAR)))
resNMAR <- rbindlist(mclapply(1:niter, function(x) iter(200, 5, NMAR)))
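As a small addition (not from the original post), the mean estimate and the ±2 sd bounds shown in the plots could be tabulated from these replications like so, assuming the res* objects created above:

sumEst <- function(res) {
  sapply(res, function(est) c(mean  = mean(est),
                              lower = mean(est) - 2 * sd(est),
                              upper = mean(est) + 2 * sd(est)))
}

sumEst(resMCAR)
sumEst(resMAR)
sumEst(resNMAR)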

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


A Comparative Review of the R AnalyticFlow GUI for R


[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

R AnalyticFlow (RAF) is a free and open source graphical user interface (GUI) for the R language that focuses on beginners looking to point-and-click their way through analyses.  What sets it apart from the other half-dozen GUIs for R is that it uses a flowchart-like workflow diagram to control the analysis instead of only menus. In my first programming class back in the Pleistocene Era, my professor told us to never begin a program without doing a flowchart of what you were trying to accomplish. With workflow tools, you get the benefit of the diagram outlining the big picture, while the dialog box settings in each node control what happens at each step. In Figure 1 you can get a good idea of what is happening without any further information.

Another advantage you get with most workflow tools is the ability to reuse workflows very easily because the dataset is read in only once at the beginning. Unfortunately, most of that advantage is missing from R AnalyticFlow (hereafter, “RAF”) since you must specify which dataset is used in every node. The downside to workflow tools is that they’re slightly harder to learn than menu-based systems. This involves learning how to draw a diagram, what flows through it (e.g. datasets, models), and how to generate a single comprehensive report for the entire analysis.

This post is one of a series of comparative reviews which aim to help non-programmers choose the GUI that is best for them. The reviews all follow a standard template to make comparisons across products easier. These reviews also include a cursory description of the programming support that each GUI offers.

Figure 1. An example workflow from R AnalyticFlow.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!

RAF is available for Windows, Mac, and Linux. Its installation takes four steps:

  1. Install Java, if you don’t already have it installed. This can be tricky as you must match the type of Java to the type of R you use. Most computers these days have 64-bit operating systems. Whether 32-bit or 64-bit, you must use the same “bitness” on all of these steps, or it will not work.
  2. Next, install R if you haven’t already (available here).
  3. Install RAF itself after downloading it from here.
  4. Start RAF. It will prompt you to install some R packages, notably rJava. This step requires Internet access. To install if you don’t have such access, see the RAF website’s About R Packages section for important details on how to proceed (from another machine that does have Internet access, of course).

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

RAF does not offer any plug-in modules, though its developers do provide instruction on how you can create your own.

Startup

Some user interfaces for R, such as BlueSky and jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start RAF directly by double-clicking its icon from your desktop or choosing it from your Start Menu (i.e. not from within R itself). On my system, I had to right-click the icon and choose, “Run as Administrator” or I would get the message, “Failed to Launch R. Confirm Settings?” If I responded “Yes”, it showed the path to my installation of R, which was already correct. I tried a second computer and it did start, but when it tried to install the JavaGD and rJava packages, it said, “Warning in install.packages (c(“JavaGD”,”rJava”)) : ‘lib = “C:/Program Files/R/R-3.6.1/library” ‘ is not writable. Would you like to use a personal library instead?”

Upon startup, it displays its startup screen, shown in Figure 2. Quick Start puts you into the software with a new Flow window open. New Project starts a new workflow, and Bookmarks give you quick access to existing workflows.

Figure 2. R AnalyticFlow’s Startup Screen.

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

To start entering data, choose “Input> Enter Data” and drag the selection onto the workflow editor window. An empty spreadsheet will appear (Figure 3). You can enter variable names on the first line if you check the “Header: Use 1st Row” box at the bottom of the window. This is the first hint you’ll see that RAF leans on R terminology that can be somewhat esoteric. RAF’s developers could have labeled this choice as “Column Names” but went with the R terminology of “Header” instead. This approach may be confusing for beginners, but if their goal is to learn R, it will help in the long run.

To enter factors (R’s categorical variables), choose the “Options” tab and check, “Convert Characters to Factors”, then RAF will convert the character string variables you enter to factors. Otherwise, it will leave them as characters. Dates remain stored as characters; you have to use “Processing> Set Data Type” node to change them, and they must be entered in the form yyyy-mm-dd.

Figure 3. R Analytic Flow’s data entry screen.

There is no limit to the number of rows and columns you can enter initially. However, once you choose “Run”, the data frame is created and can no longer be edited!

Saving the workflow is done with the standard “File > Save As” menu. You must save each one to its own file. To save the flow and the various objects that it uses such as data frames and models, use “Project > Export”. When receiving a project from a colleague, use “Project> Import” to begin using it.

Data Import

To analyze data, you must first read it. While many R GUIs can import a wide range of data formats such as files created by other statistics programs and databases, RAF can import only text and R objects.

RAF’s text import feature is well done. Once you select an Input File, it quickly scans the file and figures out if variable names are present, the delimiters it uses to separate the columns, and so on. It then displays a “preview” (Figure 4, bottom). It does this quickly since its preview uses only the first 100 rows of data. If the preview displays errors, you then manually change the settings and check the preview until it’s correct. When the preview looks good, you click “Run”, and it will then read all the data.

Figure 4. The Read Text File window.

Data Export

The ability to export data to a wide range of file types helps when you, or other members of your research team, have to use multiple tools to complete a task. Unfortunately, this is a very weak area for R GUIs. Deducer offers no data export at all, and R Commander, and rattle can export only delimited text files (an earlier version of this listed jamovi as having very limited data export; that has now been expanded). Only BlueSky offers a fairly comprehensive set of export options. Unfortunately, RAF falls into the former group, being able only to export data in text and R object files.

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as BlueSky and the R Commander, can handle many, but not all, of them.

RAF handles a fairly basic set of data management tools:

  1. Add/Edit Columns
  2. Rename – Variables in a data frame
  3. Set Data Type
  4. Select Rows
  5. Select Columns
  6. Missing Values – Sets values as missing (no imputation)
  7. Sort
  8. Sampling
  9. Aggregate
  10. Merge – Various joins
  11. Merge – Adds rows
  12. Manage Objects (copies, deletes, renames)

Workflows, Menus & Dialog Boxes

The goal of pointing & clicking your way through an analysis is to save time by recognizing dialog box settings rather than performing the more difficult task of recalling  programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.

RAF uses a unique interface. There are two ways to build a workflow that guides your analysis. First, you can click on a toolbar icon, which drops down a menu. Click on a selection, and – without releasing the mouse button – drag your selection onto the flow window. In that case, the dialog box with its options opens below the flow area (Figure 3, bottom right).

The second way to use it is to click on a toolbar icon, drop down its menu, click on a selection and immediately release the mouse button. This causes the dialog box to appear floating in the middle of the screen (not shown). When you finish choosing your settings, there is a “Drag to Add” button at the top of the dialog. Clicking that button causes the dialog box to collapse into an icon which you can then drag onto the workflow surface.

Regardless of which method you choose, if you drop the new icon onto the top of one that is already in the workflow, it will move the new icon to the right and draw an arrow (called an “edge”) connecting the older one to the new. If you don’t drop it onto an icon that’s already in your workflow, you can add a connecting arrow later by clicking on the first icon, then choose “Draw Edge” and an arrow will appear aimed to the right (workflows go mostly left to right). The arrow will float around as you move your mouse, until you click on the second icon. A third way to connect the nodes in a flow is to click one icon, hold the Alt key down, then drag to the second icon.

Figure 3 shows the entire RAF window. On the top right is the workflow. Here are the steps I followed to create it:

  1. I chose “Input> Read Text File” and dragged it onto the workflow. The icon’s settings appeared in the bottom right window.
  2. I filled in the dialog box’s settings, then clicked “Run”. It named the icon after the file mydata.csv and a spreadsheet appeared in the upper-right.
  3. I chose “Statistics> Cross Tabulation”, and dragged its icon onto the data icon.
  4. I clicked the downward-facing arrow in the “Group By” box, and chose the variables. The first one I chose (workshop) formed the rows and the second (gender) formed the columns. Unlike most GUIs, there’s no indication of row and column roles.
  5. I clicked “Run Node” at the top of the cross tabulation dialog box. The cross tabulation output appeared in the upper left window (right half). The code that RAF wrote to perform the task appears in the R Console window in the lower left.

You can run an entire flow by clicking “Run Flow” at the top left of the Flow window. While describing the process of building a workflow is tedious, learning to build one is quite easy.

Figure 3. The entire R Analytic Flow window, with Cross Tabulation highlighted. In the top row are the viewer window (left) and flow window (right). In the bottom row are the R console (left) and the dialog box for the chosen icon (right). The Cross Tabulation icon is selected, so its dialog box is shown.

The goal of using a GUI is to make analysis easy, so GUI dialog boxes are usually quite simple to use and include everything that’s relevant within a single box. I looked at all the options in this dialog but could not find one to do a very common test for such a cross-tabulation table: the chi-squared test. RAF uses an aspect of R objects that ends up essentially creating two different types of dialog boxes in separate parts of its interface. R objects contain multiple bits of output. You can display them using generic R functions such as summary() and print(). The output window has radio buttons for those functions (Figure 3, right above the cross-tabulation table). Clicking the “summary” button will call R’s summary() function to display the chi-squared results where the table is currently shown. To study the pattern in the table and the chi-squared results requires clicking back and forth on Table and summary; you can’t get them to both appear on your screen at the same time.
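To see the print/summary distinction outside of RAF, here is a small base-R illustration (my own sketch, not RAF's code); mydata, workshop and gender mirror the variables used in the flow above:

tab <- table(mydata$workshop, mydata$gender)

print(tab)      # the counts, i.e. what RAF shows under "Table"
summary(tab)    # adds the chi-squared test of independence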

Correlations provide another example. The statistics are shown, but their p-values are not shown until you click on the “summary” button. This approach is confusing for beginners, but good for people wishing to learn R.

A common data analysis task is repeating the same analysis across many variables. For example, you might want to repeat the above cross tabulation (or t-tests, etc.) on many variables at once. This is usually quite easy to accomplish in most GUIs, but not in RAF. Since R’s functions may not offer that ability without using R’s “apply” family of functions (or loops), and RAF does not support such functions, such simple tasks become quite a lot of work when using RAF. You need to add a node to your flow for each and every variable!
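For comparison, repeating the cross tabulation over several variables takes only a few lines of plain R; this is a hedged sketch, with mydata and the variable names in vars assumed rather than taken from the review:

vars <- c("workshop", "q1", "q2")    # hypothetical variable names

crosstabs <- lapply(vars, function(v) table(mydata[[v]], mydata$gender))
names(crosstabs) <- vars

lapply(crosstabs, summary)           # chi-squared test for each table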

Each dialog box has an “Advanced” tab which allows you to enter the name of any R argument(s) in one column, and any value(s) you would like to pass to that argument in another. That’s a nice way to offer graphical control over common tasks, while assuring that every task a function is capable of is still available.

In a complex analysis, workflows can become quite complex and hard to read. A solution to this problem is the concept of a “metanode”. Metanodes allow you to take an entire section of your workflow and collapse it into what appears to be a single node. For example, you might commonly use eight nodes to prepare a dataset for analysis. You could combine all eight into a new node you call “Data Prep”, greatly simplifying the workflow. Unfortunately, RAF does not offer metanodes, unlike other workflow-driven data science tools such as KNIME and RapidMiner, which do.

One of the most surprising aspects of RAF’s workflow style is that every node specifies its input and output objects. That means that you can run any analysis with no connecting arrows in your diagram! Rather than being a required feature, as with many workflow-based tools, in RAF the connections offer only the convenience of re-running an entire flow at once.

During GUI-driven analysis, the fact that R is doing the work is quite obvious as the code and any resulting messages appear in the Console window.

Documentation & Training

The only written documentation for RAF is the brief, but easy to follow, R AnalyticFlow 3 Starter Guide. Kamala Valarie has also done a 15-minute video on YouTube showing how to use RAF.

Help

R GUIs provide simple task-by-task dialog boxes that generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.

The level of help that RAF offers is only the built-in R help file for the particular function you’re using. However, I had problems with the help getting stuck and showing me the help file from previous tasks rather than the one I was currently using.

Graphics

The various GUIs available for R handle graphics in several ways. Some, such as R Commander and RKWard, focus on R’s built-in graphics. Others, such as BlueSky Statistics use the popular ggplot2 package. Still others, such as jamovi, use their own functions and integrate them into analysis steps.

GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it. That’s how BlueSky and jamovi work.

RAF uses the very flexible lattice package for all of its graphics. That makes it particularly easy to display “small multiples” of the same plot repeated by levels of another variable or two. There does not appear to be any way to control the style of the plots.

More…


To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.


Why I don’t use the Tidyverse


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

There seems to be some revolution going on in the R sphere… people seem to be jumping at what is commonly known as the tidyverse, a collection of packages developed and maintained by the Chief Scientist of RStudio, Hadley Wickham.

In this post, I explain what the tidyverse is and why I resist using it, so read on!

Ok, so this post is going to be controversial, I am fully aware of that. The easiest way to deal with it if you are a fan of the tidyverse is to put it into the category “this guy is a dinosaur and hasn’t yet got the point of it all”… Fine, this might very well be the case and I cannot guarantee that I will change my mind in the future, so bear with me as I share some of my musings on the topic as I feel about it today… and do not hesitate to comment below!

According to his own website, the tidyverse is an opinionated collection of R packages designed for data science [highlighting my own]. “Opinionated”… when you google that word it says:

characterized by conceited assertiveness and dogmatism. “an arrogant and opinionated man”

If you ask me it is no coincidence that this is the first statement on the webpage!

Before continuing, I want to make clear that I believe that Hadley Wickham does what he does out of a strong commitment to the R community and that his motivations are well-meaning. He obviously is also a person who is almost eerily productive (and to add the obvious: RStudio is a fantastic integrated development environment (IDE) which is looking for its equal in the Python world!). Having said that I think the tidyverse is creating some conflict within the community which at the end could have detrimental ramifications:

The tidyverse is creating some meta-layer on top of Base R, which changes the character of the language considerably. Just take the highly praised pipe operator %>%:

# Base R
temp <- mean(c(123, 987, 756))
temp
## [1] 622

# tidyverse
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

temp <- c(123, 987, 756) %>% mean
temp
## [1] 622

The problem I have with this is that the direction of the data flow is totally inconsistent: it starts in the middle with the numeric vector, goes to the right into the mean function (by the pipe operator %>%) and after that back to the left into the variable (by the assignment operator <-). It is not only longer but also less clear in my opinion.

I know fans of the tidyverse will hasten to add that it can make code clearer when you have many nested functions but I would counter that there are also other ways to make your code clearer in this regard, e.g. by separating the functions into different lines of code, each with an assignment operator… which used to be the standard way!
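As a small illustration of that alternative (my own sketch, not code from the tidyverse discussion), a nested call can be unrolled into one assignment per step:

x <- c(123, 987, 756)

# nested version
result <- round(sqrt(mean(x)), 1)

# unrolled version: one step per line, each with its own assignment
m      <- mean(x)
root   <- sqrt(m)
result <- round(root, 1)
result
## [1] 24.9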

But I guess my main point is that R is becoming a different beast this way: we all know that R – as any programming language – has its quirks and idiosyncrasies. The same holds true for the tidyverse (remember: any!). My philosophy has always been to keep any programming language as pure as possible, which doesn’t mean that you have to program everything from scratch… it means that you should only e.g. add packages for functional requirements and only very cautiously for structural ones.

This is, by the way, one of my criticisms of Python: you have the basic language, but in order to do serious data science you need all kinds of additional packages, which change the structure of the language (to read more on that see here: Why R for Data Science – and not Python?)!

In the end you will in most cases have some kind of strange mixture of the different data and programming approaches, which makes the whole thing even more messy. As a professor, I also see the difficulties in teaching that stuff without totally confusing my students. This is often the problem with Python + NumPy + SciPy + PANDAS + SciKit-Learn + Matplotlib, and I see the same kind of problems with R + ggplot2 + dplyr + tidyr + readr + purrr + tibble + stringr + forcats!

On top of that, the ever-growing complexity is a problem because of all the dependencies. I am always skeptical of code where dozens of packages have to be installed and loaded first. Even in the simple code above, just by loading the dplyr package (which is only one out of the eight tidyverse packages), several base R functions are being overwritten: filter, lag, intersect, setdiff, setequal and union.
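When the masking matters, the usual workaround is an explicit namespace prefix; a brief illustration (mine, not from the original post):

library(dplyr)

# dplyr's filter() masks stats::filter(), but both stay reachable:
stats::filter(1:10, rep(1/3, 3))    # the stats moving-average filter
dplyr::filter(mtcars, cyl == 6)     # dplyr's data-frame filter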

In a way, the tidyverse feels (at least to me) like some kind of land grab, some kind of takeover. It is almost like a religion… and that I do not like! This is different with other popular packages, like Rcpp: with Rcpp you do the same stuff but faster… with the tidyverse you do the same stuff but only differently (I know, in some cases it is faster as well but that is often not the reason for using it… contrary to the excellent data.table package)!

One final thought: Hadley Wickham was asked the following question in 2016 (source: Quora):

Do you expect the tidyverse to be the part of core R packages someday?

His answer is telling:

It’s extremely unlikely because the core packages are extremely conservative so that base R code is stable, and backward compatible. I prefer to have a more utopian approach where I can be quite aggressive about making backward-incompatible changes while trying to figure out a better API.

Wow, that is something! To be honest with you: when it comes to software I like conservative! I like stable! I like backward compatible, especially in a production environment!

Everybody can (and should) do their own experiments “to figure out a better [whatever]” but please never touch a running system (if it ain’t broke don’t fix it!), especially not when millions of critical business and science applications depend on it!

Ok, so this was my little rant… now it is your turn to shoot!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


Running different versions of R in the RStudio IDE is, on occasion, required to load older packages.


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This is a re-post by the author from: https://www.bsetmet.com/index.php/2019/05/26/gist-rstudio-vsersions-ubuntu/

I got fed up with the manual process, so I started to automate the entire process on Ubuntu in a bash script. This script should work for most Debian based distros.

TLDR -> Get to the Code on Github!

(Throughout this process stackoverflow was my friend.)

Generally the process for 3.4.4 (for example) is to:

1. Download R from CRAN

https://cran.r-project.org/src/base/R-3/R-3.4.4.tar.gz

2. Un-archive files

tar -xzvf R-3.4.4.tar.gz

3. Make from source

(from inside the un-archived directory)

sudo ./configure --prefix=/opt/R/$r_ver --enable-R-shlib && \
sudo make && \
sudo make install

4. Update environment variable  for R-Studio

export RSTUDIO_WHICH_R="/opt/R/3.4.4/bin/R"

5. Launch R-Studio (in the context of this environment variable)

rstudio

I started down the road of manually downloading all the .tar.gz files of the versions that I might want to install, so then I grabbed a method for un-archiving all these files at one time.

find . -name '*.tar.gz' -execdir tar -xzvf '{}' \;

Here is where I started to build and install R from source in an automated script; to do it all at once I use this:

#!/bin/bash
# run with sudo
function config_make_install_r(){
  folder_path=$1
  r_version=$2
  cd $folder_path
  sudo ./configure --prefix=/opt/R/$r_version --enable-R-shlib && \
  sudo make && \
  sudo make install
}
config_make_install_r ~/Downloads/R-3.4.4 3.4.4

From here I added a menu system I found on Stack Overflow. This script prompts to install whatever version of R you are attempting to launch if not yet installed.

https://gist.github.com/seakintruth/95d49b6ea316c2b8a6e283e1ee1a3f3a

This file can be downloaded, inspected and run from the GitHub RAW link; of course, to run a .sh file it needs to be made executable with:

sudo chmod +x RSTUDIO_Version_Launcher.sh

I have found that by piping the gist’s contents directly into bash I can skip that step!

bash <(curl -s https://gist.githubusercontent.com/seakintruth/95d49b6ea316c2b8a6e283e1ee1a3f3a/raw)

Executing this script also places a copy of itself in the current user’s local profile, and a link to a new .desktop file for the local Unity Launcher on first run. This allows me to run this custom launcher from the application launcher. I then pin it as a favorite to the Dock manually.

The completed script can be found on Github HERE, or view the RAW script from Github.

Jeremy D. Gerdes, seakintruth@gmail.com, April 2019, Creative Commons by-sa 4.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.



Predicting and visualizing user-defined data point with K-Nearest Neighbors


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

K-nearest neighbors is an easy-to-understand supervised learning method which is often used to solve classification problems. The algorithm assumes that similar objects are closer to each other.

To understand it better while keeping things simple, explore this shiny app, where the user can define a data point and the number of neighbors and predict the outcome. The data used here is the popular Iris dataset, and ggplot2 is used to visualize the existing data and the user-defined data point. The scatter plot and table are updated for each new observation.

shinyapp   

The data is most interesting in the overlapping region between versicolor and virginica. Especially at K=2 the error is more vivid, as the same data point can be classified into different categories. In the overlapping region, if there is a tie between categories, the outcome is decided randomly.

One can also see accuracy with a 95% confidence interval for different values of K using an 80/20 training/validation split. Classification is done with the 'class' package, while accuracy is calculated with the 'caret' library; a minimal sketch of that workflow is shown below.
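As a rough sketch of that workflow, here is a minimal example combining the 'class' and 'caret' packages on an 80/20 split of iris. The choice of Petal.Length/Petal.Width as features and k = 2 are assumptions for illustration; the app may use different columns and settings.

library(class)   # knn()
library(caret)   # confusionMatrix()

set.seed(42)
idx   <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train <- iris[idx, ]
valid <- iris[-idx, ]

# Classify the validation points from two features (assumed for this sketch)
feats <- c("Petal.Length", "Petal.Width")
pred  <- knn(train = train[, feats], test = valid[, feats], cl = train$Species, k = 2)

# Accuracy with its 95% confidence interval
confusionMatrix(pred, valid$Species)$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]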


Workshop (Presidency University): Politics with big data – social science analysis with R


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Politics with big data: data analysis in social research using R

20-22 December 2019

Presidency University

Department of Political Science, Presidency University

In association with Association SNAP

Invites you to a workshop on R Statistical software designed exclusively for social science researchers. The workshop will introduce basic statistical concepts and provide the fundamental R programming skills necessary for analyzing policy and political data in India.

This is an applied course for researchers and scientists with little to no programming experience, and it aims to teach best practices for data analysis and the skills needed to conduct reproducible research.

The workshop will also introduce available datasets in India, along with hands-on training on data management and analysis using the R software.

Course: The broad course contents include: a) use of big data in democracy, b) familiarization with basic operations in R, c) data management, d) observing data relationships: statistics and visualization, e) finding statistically meaningful relationships, and f) text analysis of policy documents. The full course module is available upon registration.

For whom: Ideal for early-career researchers, academics, researchers with think tanks, journalists and students with an interest in political data.

Fees: Rs 1500 (inclusive of working lunch, tea/coffee and workshop kit). Participants need to arrange their own accommodation and travel. Participants must bring their own computer (Wi-Fi access will be provided by Presidency University).

To register: Please visit the website www.google.com to register interest. If your application is successful, we will notify you by email with payment details.

The last date for receiving applications is 10 December 2019.

For further details, visit https://presiuniv.ac.in/web/ or email zaad.polsc@presiuniv.ac.in

Resource Persons:

Dr. Neelanjan Sircar, Assistant Professor of Political Science at Ashoka University and Visiting Senior Fellow at the Centre for Policy Research in New Delhi

Dr. Pravesh Tamang, Assistant Professor of Economics at Presidency University

Sabir Ahmed, National Research Coordinator Pratichi (India) Trust- Kolkata

 


The Elements of Variance


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Partial Moments Equivalences

Below are some basic equivalences demonstrating partial moments’ role as the elements of variance.

Why is this relevant?

The additional information generated from partial moments permits a level of analysis simply not possible with traditional summary statistics. There is further introductory material on partial moments and their extension into nonlinear analysis & behavioral finance applications available at:

https://www.linkedin.com/pulse/elements-variance-fred-viole

Installation

require(devtools); install_github('OVVO-Financial/NNS',ref = "NNS-Beta-Version")

Mean

A difference between the upside area and the downside area of f(x).

set.seed(123); x=rnorm(100); y=rnorm(100)

> mean(x)
[1] 0.09040591

> UPM(1,0,x)-LPM(1,0,x)
[1] 0.09040591

Variance

A sum of the squared upside area and the squared downside area.

> var(x)
[1] 0.8332328

# Sample Variance:
> UPM(2,mean(x),x)+LPM(2,mean(x),x)
[1] 0.8249005

# Population Variance:
> (UPM(2,mean(x),x)+LPM(2,mean(x),x))*(length(x)/(length(x)-1))
[1] 0.8332328

# Variance is also the co-variance of itself:
> (Co.LPM(1,1,x,x,mean(x),mean(x))+Co.UPM(1,1,x,x,mean(x),mean(x))-D.LPM(1,1,x,x,mean(x),mean(x))-D.UPM(1,1,x,x,mean(x),mean(x)))*(length(x)/(length(x)-1))
[1] 0.8332328

Standard Deviation

> sd(x)
[1] 0.9128159

> ((UPM(2,mean(x),x)+LPM(2,mean(x),x))*(length(x)/(length(x)-1)))^.5
[1] 0.9128159

Covariance

> cov(x,y)
[1] -0.04372107

> (Co.LPM(1,1,x,y,mean(x),mean(y))+Co.UPM(1,1,x,y,mean(x),mean(y))-D.LPM(1,1,x,y,mean(x),mean(y))-D.UPM(1,1,x,y,mean(x),mean(y)))*(length(x)/(length(x)-1))
[1] -0.04372107

Covariance Elements and Covariance Matrix

> cov(cbind(x,y))
            x           y
x  0.83323283 -0.04372107
y -0.04372107  0.93506310

> cov.mtx=PM.matrix(LPM.degree = 1,UPM.degree = 1,target = 'mean', variable = cbind(x,y), pop.adj = TRUE)

> cov.mtx
$clpm
          x         y
x 0.4033078 0.1559295
y 0.1559295 0.3939005

$cupm
          x         y
x 0.4299250 0.1033601
y 0.1033601 0.5411626

$dlpm
          x         y
x 0.0000000 0.1469182
y 0.1560924 0.0000000

$dupm
          x         y
x 0.0000000 0.1560924
y 0.1469182 0.0000000

$matrix
            x           y
x  0.83323283 -0.04372107
y -0.04372107  0.93506310

Pearson Correlation

> cor(x,y)
[1] -0.04953215

> cov.xy=(Co.LPM(1,1,x,y,mean(x),mean(y))+Co.UPM(1,1,x,y,mean(x),mean(y))-D.LPM(1,1,x,y,mean(x),mean(y))-D.UPM(1,1,x,y,mean(x),mean(y)))*(length(x)/(length(x)-1))

> sd.x=((UPM(2,mean(x),x)+LPM(2,mean(x),x))*(length(x)/(length(x)-1)))^.5

> sd.y=((UPM(2,mean(y),y)+LPM(2,mean(y),y))*(length(y)/(length(y)-1)))^.5

> cov.xy/(sd.x*sd.y)
[1] -0.04953215

Skewness*

A normalized difference between upside area and downside area.

> skewness(x)
[1] 0.06049948

> ((UPM(3,mean(x),x)-LPM(3,mean(x),x))/(UPM(2,mean(x),x)+LPM(2,mean(x),x))^(3/2))
[1] 0.06049948

UPM/LPM – a more intuitive measure of skewness. (Upside area / Downside area)

> UPM(1,0,x)/LPM(1,0,x)
[1] 1.282673

Kurtosis*

A normalized sum of upside area and downside area.

> kurtosis(x)
[1] -0.161053

> ((UPM(4,mean(x),x)+LPM(4,mean(x),x))/(UPM(2,mean(x),x)+LPM(2,mean(x),x))^2)-3
[1] -0.161053

CDFs

> P=ecdf(x)

> P(0);P(1)
[1] 0.48
[1] 0.83

> LPM(0,0,x);LPM(0,1,x)
[1] 0.48
[1] 0.83

# Vectorized targets:
> LPM(0,c(0,1),x)
[1] 0.48 0.83

# Joint CDF:
> Co.LPM(0,0,x,y,0,0)
[1] 0.28

# Vectorized targets:
> Co.LPM(0,0,x,y,c(0,1),c(0,1))
[1] 0.28 0.73

PDFs

> tgt=sort(x)

# Arbitrary d/dx approximation
> d.dx=(max(x)+abs(min(x)))/100

> PDF=(LPM.ratio(1,tgt+d.dx,x)-LPM.ratio(1,tgt-d.dx,x))

> plot(sort(x),PDF,col='blue',type='l',lwd=3,xlab="x")

Numerical Integration – [UPM(1,0,f(x))-LPM(1,0,f(x))]=[F(b)-F(a)]/[b-a]

# x is uniform sample over interval [a,b]; y = f(x)
> x=seq(0,1,.001);y=x^2

> UPM(1,0,y)-LPM(1,0,y)
[1] 0.3335

Bayes’ Theorem

https://github.com/OVVO-Financial/NNS/blob/NNS-Beta-Version/examples/Bayes’%20Theorem%20From%20Partial%20Moments.pdf

*Functions are called from the PerformanceAnalytics package

require(PerformanceAnalytics)

Analyzing Relational Contracts with R: Part I


[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One goal of game theory is to understand how effective cooperation can be sustained in relationships. You may have heard of Axelrod’s famous tournaments that studied how effective different submitted strategies are in sustaining cooperation in repeated Prisoners’ dilemma games.

A related approach to understanding the scope for cooperation in relationships is to characterize game-theoretic equilibria, in which every player always acts optimally given the strategies of everybody else. This blog post illustrates the new R package RelationalContracts that facilitates such equilibrium analysis. I also want to introduce you to the problem of multiple equilibria and how one can select equilibria by assuming parties explicitly negotiate the terms of their relationship.

The second part of this blog series will introduce stochastic games that allow for more complex relationships with endogenous states. It also illustrates the vulnerability paradox and shows how one can account for hold-up concerns.

The best-known game in game theory is probably the prisoners’ dilemma. The following code creates a variation of an infinitely repeated prisoners’ dilemma with a somewhat larger action space:

library(RelationalContracts)

g = rel_game("Mutual Gift Game") %>%
  rel_state("x0",
    # Both players can pick effort
    # on a grid between 0 and 1
    A1 = list(e1 = seq(0, 1, by = 0.1)),
    A2 = list(e2 = seq(0, 1, by = 0.1)),
    # Stage game payoffs
    pi1 = ~ e2 - 0.5*e1^2,
    pi2 = ~ e1 - 0.5*e2^2
  )

Each player $i$ chooses an effort level $e_i$ on a grid between 0 and 1 that directly benefits the other player. Effort costs player $i$ $\frac{1}{2} e_i^2$ and grants the other player a benefit of $e_i$. While it would be jointly efficient if both players chose full effort $e_1=e_2=1$, the unique Nash equilibrium of the (one-shot) stage game is that both players choose zero effort.
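A quick way to see the last claim: player $i$'s stage game payoff is strictly decreasing in her own effort no matter what the other player does, so zero effort is a dominant action in the stage game. The following grid check is my own verification in plain R, not part of the RelationalContracts package:

# Player 1's stage game payoff is e2 - 0.5*e1^2, so her best reply is e1 = 0
# for every value of e2 (the same holds symmetrically for player 2)
e1 <- seq(0, 1, by = 0.1)
best_reply_1 <- function(e2) e1[which.max(e2 - 0.5 * e1^2)]
sapply(c(0, 0.5, 1), best_reply_1)
#> [1] 0 0 0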

Yet, if players interact repeatedly, positive effort levels can be sustained in subgame perfect equilibria. The common assumption in the relational contracting literature is that players interact for infinitely many periods and discount future payoffs with a discount factor $\delta \in [0,1)$. Discounting can be due to positive interest rates, or one can alternatively interpret $\delta$ as 1 minus an exogenous probability that the relationship breaks down after a period.

The following code computes the highest effort levels that can be implemented in any subgame perfect equilibrium for a discount factor of $\delta=0.3$:

# 1. Solve SPE and store it in game object
g = rel_spe(g, delta = 0.3)

# 2. Get equilibrium description as data frame
#    and select actions on equilibrium path
get_eq(g) %>%
  select(ae.e1, ae.e2)

## # A tibble: 1 x 2
##   ae.e1 ae.e2
##   <dbl> <dbl>
## 1   0.6   0.6

The information above tells us that there is a subgame perfect equilibrium on whose equilibrium path both players will choose efforts of $e_1=e_2=0.6$ in every period.

An equilibrium has to specify what players do after every possible history of play in the game. In particular, to induce positive effort levels, an equilibrium needs to describe some form of punishment that would follow should a player deviate and choose effort below the agreed upon 0.6. The harsher the punishment for deviations, the higher effort levels one can incentivize on the equilibrium path.

In our example game, so-called grim-trigger strategies implement the harshest possible punishment: should any player ever deviate from the agreed upon effort level, then in all future periods players will play the Nash equilibrium of the stage game, i.e. they both choose zero effort.

One can easily show that such grim-trigger strategies can implement a symmetric effort level $e$ on the equilibrium path if and only if $e - 0.5 \cdot e^2 \geq (1-\delta) e + \delta \cdot 0$. The left hand side is the average discounted payoff of a player on the equilibrium path. Average discounted payoffs are just the discounted sum of payoffs multiplied by $(1-\delta)$. This normalization puts payoffs of the repeated game on the same scale as payoffs of the stage game. The right hand side is the average discounted continuation payoff if a player deviates to zero effort. In the period of deviation she saves her effort cost, but in all future periods payoffs are zero.
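Rearranging the constraint gives $\delta e \geq 0.5 e^2$, i.e. $e \leq 2\delta$, so with $\delta=0.3$ the highest sustainable symmetric effort on our grid is 0.6, exactly the value rel_spe() reported above. A quick hand check in plain R (not a package function):

# Highest symmetric grid effort satisfying the grim-trigger constraint
# e - 0.5*e^2 >= (1 - delta)*e, which is equivalent to e <= 2*delta
delta  <- 0.3
e.grid <- seq(0, 1, by = 0.1)
ok     <- e.grid - 0.5 * e.grid^2 >= (1 - delta) * e.grid - 1e-9  # small tolerance for floating point
max(e.grid[ok])
#> [1] 0.6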

However, there are many more strategy profiles that constitute a subgame perfect equilibrium (SPE). For example, another SPE of the repeated game is to always repeat the stage game Nash equilibrium of choosing zero effort. In other SPE, players may pick positive but lower effort levels than $0.6$. Effort levels may also be asymmetric or vary over time. Also different punishment schemes can be used, e.g. punishments that reduce effort only for a finite number of periods or asymmetric punishments that are more costly for the deviator than the other player.

For sufficiently large discount factors there are infinitely many different equilibria and infinitely many different equilibrium payoffs.

Equilibrium payoff sets with and without transfers

Numerically computing the set of SPE payoffs is a hard problem and tractable algorithms require additional assumptions.

One assumption is that at the beginning of a period all players observe the realization of a standard uniform random variable that can be used as a public correlation device. This makes the SPE payoff set a convex polyhedron. Another restriction is that one only looks at pure strategy equilibria, i.e. chosen actions are a deterministic function of the history of play and the outcome of the commonly observed correlation device.

The following code solves and illustrates the SPE payoff set of our game assuming a public correlation device.

library(RSGSolve)

# Transform game in format used by RSGSolve
rsg = make.rsg.game(g)

# Solve SPE payoff set
rsg.sol = solveSG(rsg = rsg, delta = 0.3)

# Show SPE payoff set
plot.rsg.payoff.set(rsg.sol)

# Mark payoff of the equilibrium described above
u.sym = 0.6 - 0.5*0.6^2
points(x = u.sym, y = u.sym, pch = 19)

The library RSGSolve is just a simple R interface to Benjamin Brooks' package SGSolve (developed with Dilip Abreu and Yuliy Sannikov) that implements their algorithm described here (2016).

Many things become easier and much faster to compute (very relevant for larger games) if one makes the additional assumption that at the beginning of each period, before actions take place, players can make voluntary monetary transfers to each other (or to an uninvolved third party) and that players are risk-neutral. This is a common assumption in the relational contracting literature. Susanne Goldlücke and I have developed corresponding methods to compute and characterize equilibrium payoff sets (2012 for repeated games with transfers and 2017 for more general stochastic games with transfers).

The following code shows the SPE payoff set of our game assuming monetary transfers are possible.

plot_eq_payoff_set(g)
plot.rsg.payoff.set(rsg.sol, fill = NULL, lwd = 2, add = TRUE)

The thick black line shows for comparison the SPE payoff set without transfers. We see that transfers allow to implement more equilibrium payoffs, in particular more efficient unequal payoffs.

More importantly, for every repeated game with transfers the SPE payoff set is such a simplex (just a triangle for two-player games) with a linear Pareto-frontier that has a slope of -1. This means that in two-player repeated games with transfers the SPE payoff set is characterized by just 3 numbers:

get_eq(g) %>%
  select(v1, v2, U)

## # A tibble: 1 x 3
##      v1    v2     U
##   <dbl> <dbl> <dbl>
## 1     0     0  0.84

The values $v_1=0$ and $v_2=0$ describe the lowest SPE payoffs of player 1 and 2, respectively, that is the lower left point of the payoff set. The value $U=0.84$ is the highest possible sum of player 1 and player 2’s SPE payoff. Every payoff $u$ on our linear Pareto-frontier satisfies that $u_1+u_2=U$.

Simple equilibria

We have shown in our research that in repeated games with transfers every pure SPE payoff can be implemented with the following simple class of equilibria. On the equilibrium path, in every period the same action profile $a^e$ is played and possibly some transfers are conducted. Punishment has the following structure. If player $i$ deviates from the agreed upon action profile then she can redeem herself by paying a monetary fine to the other player at the beginning of the next period. If she does not pay the fine (or deviates from some other required payment on the equilibrium path) then for one period a punishment action profile $a^i$ is played. Afterwards, she again has the opportunity to stop the punishment by paying a fine, and so on.

The following code shows us the 3 relevant action profiles $a^e$, $a^1$ and $a^2$ for our example.

get_eq(g) %>%
  select(ae = ae.lab, a1 = a1.lab, a2 = a2.lab, U, v1, v2)

## # A tibble: 1 x 6
##   ae        a1    a2        U    v1    v2
##   <chr>     <chr> <chr> <dbl> <dbl> <dbl>
## 1 0.6 | 0.6 0 | 0 0 | 0  0.84     0     0

On the equilibrium path ae is played, i.e. both players pick effort 0.6. Both punishment profiles a1 and a2 are simply the stage game Nash equilibrium in which both players choose zero effort. (It is not optimal in every game to punish with the stage game Nash equilibrium, though.) A simple equilibrium also specifies monetary transfers and the size of fines, but the exact structure of these transfers is not shown. A key theoretical result is that by adapting the transfers in an incentive compatible way, one can implement every pure SPE payoff of the game (blue area in the plot above) without changing the equilibrium path and punishment actions.

Many equilibria: What do we learn? What do we assume?

No matter whether we allow for transfers or not: for sufficiently large discount factors most infinitely repeated games and their generalizations have a huge set of equilibrium outcomes. And in games with more than 2 players, even more diverse behavior can be part of equilibrium strategies. For example, there could be equilibria in which, for no particular reason, one player is a “leader” and the other players mutually punish each other if they don’t make gifts to that particular leader.

When loosely interpreting a subgame perfect equilibrium in real-world terms, I would call it a convention that is stable in the sense that no individual alone ever has an incentive to deviate. So game theory supports the view that in our world, where many people are part of many relationships and interact repeatedly, a lot of stable conventions that guide behavior could arise. And looking at the world, we know that there are and have been many different systems of laws and conventions, differing quite a bit in dimensions such as efficiency and equity.

On the other hand, conventions and codified laws don’t just fall from the sky but people can reason, discuss and coordinate how they want to structure their relationships.

Look at our example game above. Why should two rational parties follow the worst possible equilibrium and never exert any effort when both can be strictly better off by coordinating on a convention (picking an equilibrium) that supports positive effort levels?

A common assumption in the relational contracting literature is that players don’t pick a Pareto-dominated equilibrium. This means they don’t pick a particular equilibrium if there is another equilibrium that would make both players better off (and at least one player strictly better off).

Of course, assuming that in real relationships parties really manage to always find all incentive compatible Pareto improvements is fairly optimistic. But even if real-world conventions are not Pareto optimal, one might be hopeful that looking at Pareto-optimal equilibria yields useful qualitative insights into how to structure relationships and which institutions facilitate cooperation.

While I share this view for repeated games, I want to later convince you that once we move to more general games with endogenous states, the predictions of Pareto-optimal equilibria can be quite unintuitive because this equilibrium selection rules out plausible hold-up concerns. But let’s postpone this discussion to the second blog post…

What happens if a player becomes vulnerable?

We are often interested in how changes in the economic environment, or in the institutions that govern the rules of the game, affect the implementable equilibrium payoffs.

Consider the following variation of our mutual gift game. Player 2 now also has the option to choose, at zero cost, a negative effort of size -vul that hurts player 1. The parameter vul=1 measures player 1’s vulnerability:

g.vul = rel_game("Mutual Gift Game with Vulnerability") %>%
  rel_param(vul = 1, delta = 0.3) %>%
  rel_state("x0",
    # Action spaces
    A1 = list(e1 = seq(0, 1, by = 0.1)),
    A2 = list(e2 = ~ c(-vul, seq(0, 1, by = 0.1))),
    # Stage game payoffs
    pi1 = ~ e2 - 0.5*e1^2,
    pi2 = ~ e1 - 0.5*pmax(e2, 0)^2
  )

What will be the impact of player 1’s vulnerability on the set of SPE payoffs? Let us solve the game with transfers:

g.vul = rel_spe(g.vul)
get_eq(g.vul) %>%
  select(ae.lab, a1.lab)

## # A tibble: 1 x 2
##   ae.lab    a1.lab
##   <chr>     <chr>
## 1 0.9 | 0.9 0 | -1

If player 1 is vulnerable, both players can cooperate more effectively and implement higher efforts of $e_1=e_2=0.9$ in an SPE. We see from $a^1$ that a deviation by player 1 is indeed punished by negative effort of player 2. The stronger punishment makes it possible to incentivize higher effort by player 1. If player 1 chooses higher effort on the equilibrium path, it is also less attractive for player 2 to deviate, since she would forego higher equilibrium path payoffs. Therefore higher effort by player 2 can also be implemented. (Transfers additionally facilitate smoothing of incentive constraints such that we can implement the same effort level for both players.) This means that even player 1 could in principle benefit from being vulnerable.

Let us take a look at the equilibrium payoff sets with and without vulnerability of player 1:

plot_eq_payoff_set(g.vul, colors = "#ffaaaa", add.state.label = FALSE, plot.r = FALSE)
plot_eq_payoff_set(g, add = TRUE, add.state.label = FALSE, plot.r = FALSE)

We indeed see that every point in the (blue) equilibrium payoff set without vulnerability is strictly Pareto-dominated by some point in the (red) equilibrium payoff set given a vulnerable player 1.

Does this mean that in this game it is beneficial for player 1 to be vulnerable? Not necessarily. We see that with vulnerability there are also many equilibrium payoffs that make player 1 worse off than her worst equilibrium payoff without vulnerability. Restricting attention to Pareto-optimal payoffs does not change this fact.

Nash bargaining to select an equilibrium payoff

To make sharper predictions, we need stronger assumptions on equilibrium selection. Let us assume that at the beginning of their relationship the parties select an equilibrium via bargaining. The mathematical formulation of a bargaining problem requires a feasible set of payoffs, which here is the set of SPE payoffs, and a disagreement payoff that will be implemented if no agreement is reached.

In general, a player benefits in bargaining if disagreement gives her a high payoff and the other player a low payoff. Unfortunately, there is no clear rule that specifies a single appropriate disagreement point when players bargain over repeated game equilibria. In our example, let us assume that under disagreement both players choose in every period the stage game Nash equilibrium in which everybody picks the lowest possible effort level; this means negative effort by player 2 if player 1 is vulnerable. This corresponds to the lower left point $(-1,0)$ of the SPE payoff set.

Given a linear Pareto-frontier, the Nash bargaining solution splits the gains over the disagreement point equally. In our example, the Nash bargaining solution is just the center point of the Pareto frontier of SPE payoffs. Let us recall the payoff sets with and without vulnerability:

plot_eq_payoff_set(g.vul, colors = "#ffaaaa", add.state.label = FALSE)
plot_eq_payoff_set(g, add = TRUE, add.state.label = FALSE)

The Nash bargaining solutions for both cases are now marked with black dots. We see that player 1 would be worse off here from being vulnerable. Even though vulnerability makes it possible to incentivize higher effort levels and thus creates efficiency gains, this positive effect is outweighed for player 1 by the strong weakening of her bargaining position that the vulnerability brings about.
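A quick back-of-the-envelope check of those dots (my own arithmetic, not a call to the package): with transfers the frontier is $u_1+u_2=U$, so the Nash bargaining point simply gives each player her disagreement payoff plus half of the joint surplus. The joint payoff for the vulnerable game is computed by hand from the reported efforts of 0.9.

# Nash bargaining point on a linear frontier u1 + u2 = U
# with disagreement point (d1, d2): split the surplus equally
nash_point <- function(U, d1, d2) {
  s <- U - d1 - d2
  c(u1 = d1 + s / 2, u2 = d2 + s / 2)
}

nash_point(U = 0.84, d1 = 0, d2 = 0)                      # without vulnerability: (0.42, 0.42)
nash_point(U = 2 * (0.9 - 0.5 * 0.9^2), d1 = -1, d2 = 0)  # with vulnerability: about (-0.005, 0.995)

Player 1's bargained payoff thus drops from 0.42 to roughly -0.005 once she is vulnerable, which is exactly the comparison visible in the plot.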

This finishes the first part of this blog series. In the next part we will continue our study by looking at stochastic games with endogenous states and discuss hold-up concerns and the vulnerability paradox.


a journey from basic prototype to production-ready Shiny dashboard


[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

a custom dark mode for shiny dashboard

# Abstract

As web dashboards have become the norm for interacting with data, looks and added functionality have taken a more prominent role. Dashboard users expect the typical look, feel and interactivity they get when surfing the web. To help fill the gap between data scientists and developers, packages that simplify and streamline dashboard creation have become a more important part of the workflow.

We will start with a basic shiny dashboard that uses no additional libraries. This is how most shiny dashboards look when someone begins working on them. Then I will show you how we can enhance it using two different approaches based on the most popular UI frameworks: Bootstrap and Semantic UI. Most of you use Bootstrap, but I encourage you to try a new one!

Then I will wrap up by giving you a taste of how far you can go using only r/shiny to create a purely custom solution.

# Agenda

– Base shiny

– Bootstrap approach: `shinydashboard`

– Semantic UI approach: `shiny.semantic`

– Custom approach: no framework? No problem.

# Shiny

We will start with a proof of concept dashboard created using base shiny (This particular dashboard was created for our company’s internal tools for our hiring processes, but all data in the examples is randomly generated).

a semantic UI dashboard example

As a great example of the advantages of shiny for creating simple dashboards, our UI can be easily defined:

fluidPage(
  headerPanel("Hiring Funnel"),
  sidebarPanel(...),
  mainPanel(...)
)

Coupled with some server behavior for the different inputs and outputs, we end up with a complete dashboard that, even though it's simple, will become the base for our next examples.

# shinydashboard

Shiny already includes Bootstrap as its default framework. `shinydashboard` is probably the best-known extension to shiny when it comes to styling your dashboards in a simple and quick way. In fact, enabling it is as simple as loading the library and replacing the wrappers for our main dashboard components:

library(shinydashboard)

dashboardPage(
  dashboardHeader(...),
  dashboardSidebar(...),
  dashboardBody(...)
)

A simple way to breathe new life into your dashboards! However, it is important to keep in mind that:

 – Customization is limited (CSS is an option).

 – Color themes are available.

 – You'll be pigeon-holed into the Bootstrap ecosystem.

# shiny.semantic

Another possible solution, especially if you would like more customization and would like to drop Bootstrap in favor of Semantic UI, is to use shiny.semantic in conjunction with semantic.dashboard. This opens up a different set of UI elements, so elements such as tabs and inputs might need to be updated if you are making the switch from shiny or shinydashboard.

For the main layout, `semantic.dashboard` works very similarly to `shinydashboard`. A few differences exist when it comes to function arguments, but the general structure remains the same:

library(semantic.dashboard)
library(shiny.semantic)

dashboardPage(
  dashboardHeader(...),
  dashboardSidebar(...),
  dashboardBody(...)
)

Some changes are needed for some components. Two good examples are:

 – Date inputs. Switching from Bootstrap means we no longer have access to Bootstrap components; the date input is one example that can be replaced by its semantic version:

# Bootstrap
dateInput("date_from", "Start date", value = "2019-02-01", weekstart = 1)

# Semantic UI
date_input("date_from", "Start date", value = "2019-02-01", icon = "")

 – Tab sets. Another example of this is tab sets; Bootstrap tabsets need to be replaced by their semantic counterpart:

# Bootstrap
tabsetPanel(
  tabPanel("General overview", ...),
  tabPanel("Channels overview", ...),
  tabPanel("Days to decision", ...),
  tabPanel("Funnel visualization", ...)
)

# Semantic UI
tabset(
  list(menu = div("General overview"), content = div(...)),
  list(menu = div("Channels overview"), content = div(...)),
  list(menu = div("Days to decision"), content = div(...)),
  list(menu = div("Funnel visualization"), content = div(...))
)

The Semantic UI version is a bit more verbose, but it does allow for more customization when it comes to the internal HTML structure. You also get access to all of the Semantic UI library.

semantic.dashboard

# custom

We can also go a different route and dive into a fully custom solution. This allows a much higher level of customization, but it also requires a lot more knowledge when it comes to HTML, CSS and JS. This solution is usually reserved for applications where the layout, theme or overall experience is completely different from what existing packages offer.

As an example, our internal app using a fully custom approach ended up like this:

example of a fully custom dashboard

Also with dark mode:

a custom “dark mode” for shiny dashboard

Our approach included:

 – Extended usage of CSS (in our case, SASS) for styling, layout and themes,

 – Custom HTML elements using `shiny::tags` (a minimal sketch of this follows below),

 – JavaScript for extra functionality (loading screens, custom tab behavior).
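To give a flavor of that tag-based route, here is a minimal, hypothetical sketch of a hand-rolled UI component styled by your own stylesheet. The custom_card() helper, its class names and the custom.css file are invented for illustration; they are not the actual components of the dashboard shown above.

library(shiny)

# A hypothetical card component built directly from HTML tags;
# the class names are placeholders that a custom stylesheet would target
custom_card <- function(title, value) {
  tags$div(
    class = "custom-card",
    tags$span(class = "custom-card-title", title),
    tags$span(class = "custom-card-value", value)
  )
}

ui <- fluidPage(
  # Styles (including a dark-mode variant) would live in www/custom.css
  tags$head(tags$link(rel = "stylesheet", href = "custom.css")),
  custom_card("Candidates", 128)
)

server <- function(input, output, session) {}

shinyApp(ui, server)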

There is a lot more going on in the real dashboard, so I will go over all of the different things that went into creating it in more detail in a future post. But it's good to remember: no matter the complexity, a truly custom solution is always possible! What is your favorite package? Do you think it's better to spend more time on custom solutions, or are the standard packages enough for you? Add your comments below!

# References

– [shiny](https://shiny.rstudio.com/)

– [shinydashboard](https://rstudio.github.io/shinydashboard/)

– [semantic.dashboard](https://github.com/Appsilon/semantic.dashboard/)

– [shiny.semantic](https://github.com/Appsilon/shiny.semantic/)

– [Bootstrap](https://getbootstrap.com/)

– [Semantic UI](https://semantic-ui.com/)

Follow Appsilon Data Science on Social Media

 

Article a journey from basic prototype to production-ready Shiny dashboard comes from Appsilon Data Science | End­ to­ End Data Science Solutions.

