
Learning things we already know about stocks


(This article was first published on R Views, and kindly contributed to R-bloggers)

This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking; you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.

Despite the obvious result, the process of teasing out latent groupings from historic price data is interesting. That’s the focus of this example. A central idea of the approach taken here comes from the great paper of Ledoit and Wolf, “Honey, I Shrunk the Sample Covariance Matrix” (http://www.ledoit.net/honey.pdf). This example employs an alternative approach based on a matrix eigenvalue decomposition, but it’s the same general idea.

This note follows an informal, how-to format. Rather than focus on mathematical analysis, which is well-detailed in the references, I try to spell out the hows and whys: how to do things step by step (using R), and a somewhat non-rigorous rationale for each step that’s hopefully at least convincing and intuitive.

For emphasis, allow me to restate the first sentence as an objective:

  • Group stocks together in a network that highlights associations within and between the groups using only historical price data

That’s what the rest of this example will do, hopefully illuminating some key ideas about regularization along the way.

Software used in the example

The example uses R, of course, and the following R packages, all available on CRAN (some of the packages themselves have dependencies):

  • quantmod (at least version 0.4-10)
  • igraph (at least version 1.1.2)
  • threejs (at least version 0.3.1)

Getting data

NOTE: You can skip ahead to the Sample correlation section by simply downloading a sample copy of processed log(return) data as follows:

library(quantmod)
load(url("http://illposed.net/logreturns.rdata"))

Otherwise, follow the next two sections to download the raw stock daily price data and process those data into log(returns).

Download daily closing price data from Google Finance

The quantmod package (Ulrich and Ryan, http://www.quantmod.com/) makes it ridiculously easy to download (and visualize) financial time series data. The following code uses quantmod to download daily stock price data for about 100 companies with the largest market capitalizations listed on the Standard & Poor’s 500 index at the time of this writing. The code downloads daily closing prices from 2012 until the present. Modify the code to experiment with different time periods or stocks as desired!

Because stock symbol names may change and companies may come and go, it’s possible that some of the data for some time periods are not available. The tryCatch() block in the code checks for a download error and flags problems by returning NA, which is later removed from the result. The upshot is that the output matrix of stock price time series may have fewer columns than the input list of stock symbols.

The output of the following code is an xts time series matrix of stock prices called prices, whose rows correspond to days and columns to stock symbols.

library(quantmod)
from = "2012-05-17"
sym = c("AAPL", "ABBV", "ABT", "ACN", "AGN", "AIG", "ALL", "AMGN", "AMZN", "AXP",
        "BA", "BAC", "BIIB", "BK", "BLK", "BMY", "BRK.B", "C", "CAT", "CELG", "CL",
        "CMCSA", "COF", "COP", "COST", "CSCO", "CVS", "CVX", "DD", "DHR", "DIS", "DOW",
        "DUK", "EMR", "EXC", "F", "FB", "FDX", "FOX", "FOXA", "GD", "GE", "GILD", "GM",
        "GOOG", "GOOGL", "GS", "HAL", "HD", "HON", "IBM", "INTC", "JNJ", "JPM", "KHC",
        "KMI", "KO", "LLY", "LMT", "LOW", "MA", "MCD", "MDLZ", "MDT", "MET", "MMM",
        "MO", "MON", "MRK", "MS", "MSFT", "NEE", "NKE", "ORCL", "OXY", "PCLN", "PEP",
        "PFE", "PG", "PM", "PYPL", "QCOM", "RTN", "SBUX", "SLB", "SO", "SPG", "T",
        "TGT", "TWX", "TXN", "UNH", "UNP", "UPS", "USB", "UTX", "V", "VZ", "WBA",
        "WFC", "WMT", "XOM")
prices = Map(function(n)
             {
               print(n)
               tryCatch(getSymbols(n, src="google", env=NULL, from=from)[, 4], error = function(e) NA)
             }, sym)
N = length(prices)
# identify symbols returning valid data
i = ! unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol with valid data
prices = Reduce(cbind, prices[i])
colnames(prices) = sym[i]

Clean up and transform data

Not every stock symbol may have prices available for every day. Trading can be suspended for some reason, companies get acquired or go private, new companies form, etc.

Let’s fill in missing values going forward in time using the last reported price (piecewise constant interpolation) – a reasonable approach for stock price time series. After that, if there are still missing values, just remove those symbols that contain them, possibly further reducing the universe of stock symbols we’re working with.

for(j in 1:ncol(prices)) prices[, j] = na.locf(prices[, j])       # fill in
prices = prices[, apply(prices, 2, function(x) ! any(is.na(x)))]  # omit stocks with missing data
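As a side note, here is what that last-observation-carried-forward fill looks like on a tiny made-up price vector (the vector p below is purely illustrative; na.locf() comes from the zoo package, which quantmod loads):

```r
library(zoo)                          # provides na.locf()
p = c(100, NA, NA, 101.5, NA, 102)    # made-up prices with missing days
na.locf(p)                            # each NA replaced by the last observed price
# [1] 100.0 100.0 100.0 101.5 101.5 102.0
```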

Now that we have a universe of stocks with valid price data, convert those prices to log(returns) for the remaining analysis (by returns I mean simply the ratio of prices relative to the first price).

Why log(returns) instead of prices?

The log(returns) are closer to normally distributed than prices, especially in the long run. Pat Burns wrote a note about this (with a Tom Waits soundtrack): http://www.portfolioprobe.com/2012/01/23/the-distribution-of-financial-returns-made-simple/.

But why care about getting data closer to normally distributed?

That turns out to be important to us because later we’ll use a technique called partial correlation, which generally works better for normally distributed data than otherwise; see, for example, a nice technical discussion about this by Baba, Shibata, and Sibuya here: https://doi.org/10.1111%2Fj.1467-842X.2004.00360.x

The following simple code converts our prices matrix into a matrix of log(returns):

log_returns = apply(prices, 2, function(x) diff(log(x)))
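A side benefit of log(returns), shown here on a made-up price vector: they add up across time, so the sum of the daily log(returns) equals the log of the total return over the whole period.

```r
p = c(100, 105, 102, 110)    # made-up prices
r = diff(log(p))             # daily log(returns)
sum(r)                       # whole-period log(return)...
log(p[length(p)] / p[1])     # ...same as log(110/100), about 0.0953
```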

Sample correlation matrix

It’s easy to convert the downloaded log(returns) data into a Pearson’s sample correlation matrix X:

X = cor(log_returns)

The (i, j)th entry of the sample correlation matrix X above is a measurement of the degree of linear dependence between the log(return) series for the stocks in columns i and j.

At least two issues can lead to serious problems with the interpretation of the sample correlation values:

  1. As Ledoit and Wolf point out, it’s well known that empirical correlation estimates may contain lots of error.
  2. Correlation estimates between two stock log(return) series can be misleading for many reasons, including spurious correlation or existence of confounding variables related to both series (http://www.tylervigen.com/spurious-correlations).

A Nobel-prize winning approach to dealing with the second problem considers cointegration between series instead of correlation; see for example notes by Eric Zivot (https://faculty.washington.edu/ezivot/econ584/notes/cointegrationslides.pdf), Bernhard Pfaff’s lovely book “Analysis of Integrated and Cointegrated Time Series with R” (http://www.springer.com/us/book/9780387759661), or Wikipedia (https://en.wikipedia.org/wiki/Cointegration). (I also have some weird technical notes on the numerics of cointegration at http://illposed.net/cointegration.html.)

Cointegration is a wonderful but fairly technical topic. Instead, let’s try a simpler approach.

We can try to address issue 2 above by controlling for confounding variables, at least partially. One approach considers partial correlation instead of correlation (see for example the nice description in Wikipedia https://en.wikipedia.org/wiki/Partial_correlation). That approach works best in practice with approximately normal data – one reason for the switch to log(returns) instead of prices. We will treat the entries of the precision matrix as measures of association in a network of stocks below.

It’s worth stating that our simple approach basically treats the log(returns) series as a bunch of vectors rather than bona fide time series, and it can’t handle as many of the pathologies that might occur as cointegration can. But as we will see, this simple technique is still pretty effective at finding structure in our data. (And, indeed, related methods as discussed by Ledoit and Wolf and elsewhere are widely used in portfolio and risk analyses in practice.)

The partial correlation coefficients between all stock log(returns) series are the entries of the inverse of the sample correlation matrix (https://www.statlect.com/glossary/precision-matrix).
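To make that concrete, here is a small synthetic sketch (the variables x, y, and z below are made up for illustration, not taken from the stock data). Strictly speaking, the partial correlation between variables i and j is the (i, j) entry of the inverse correlation matrix after a sign flip and normalization by the diagonal, and it measures the association between i and j after controlling for the remaining variables:

```r
set.seed(1)
z = rnorm(1000)                   # a confounding variable
x = z + rnorm(1000, sd=0.5)       # x and y are both driven by z
y = z + rnorm(1000, sd=0.5)
P = solve(cor(cbind(x, y, z)))    # precision matrix (inverse correlation matrix)
pcor_xy = -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
c(raw = cor(x, y), partial = pcor_xy)
# The raw correlation is large because x and y share z; after controlling
# for z, the partial correlation is near zero.
```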

Market trading of our universe of companies, with myriad known and unknown associations between them and the larger economy, produced the stock prices we downloaded. Our objective is a kind of inverse problem: given a bunch of historical stock prices, produce a network of associations.

You may recall from some long-ago class that, numerically speaking, inverting matrices is generally a bad idea. Even worse, issue 1 above says that our estimated correlation coefficients contain error (noise), and even a tiny amount of noise can be hugely amplified if we invert the matrix. That’s because, as we will soon see, the sample correlation matrix contains tiny eigenvalues, and matrix inversion effectively divides the noise by those tiny values. Simply stated, dividing by a tiny number returns a big number: matrix inversion tends to blow the noise up. This is a fundamental issue (in a sense, the fundamental issue) common to many inverse problems.
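A tiny made-up example makes the point. The matrix A below has eigenvalues 1 and 10^-6; perturbing it by "noise" of size about 10^-8 changes its inverse by thousands:

```r
A = diag(c(1, 1e-6))            # one tiny eigenvalue
E = matrix(1e-8, 2, 2)          # a tiny symmetric "noise" perturbation
norm(solve(A + E) - solve(A))   # huge: inversion amplifies the noise...
norm(E)                         # ...even though the noise itself is minuscule
```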

Ledoit and Wolf’s sensible answer to reducing the influence of noise is regularization. Regularization replaces models with different, but related, models designed to reduce the influence of noise on their output. LW use a form of regularization related to ridge regression (a.k.a., Tikhonov regularization) with a peculiar regularization operator based on a highly structured estimate of the covariance. We will use a simpler kind of regularization based on an eigenvalue decomposition of the sample correlation matrix X.

Regularization

Here is an eigenvalue decomposition of the sample correlation matrix:

L = eigen(X, symmetric=TRUE)

Note that R’s eigen() function takes care to return the (real-valued) eigenvalues of a symmetric matrix in decreasing order for us. (Technically, the correlation matrix is symmetric positive semi-definite, and will have only non-negative real eigenvalues.)

Each eigenvector represents an orthogonal projection of the sample correlation matrix into a line (a 1-d shadow of the data). The first two eigenvectors define a projection of the sample correlation matrix into a plane (2-d), and so on. The eigenvalues estimate the proportion of information (or variability, if you prefer) from the original sample correlation matrix contained in each eigenvector. Because the eigenvectors are orthogonal, these measurements of projected information are additive.
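As a quick sanity check of that additivity (on a small synthetic correlation matrix made up for this sketch, not the stock data): the eigenvalues of a correlation matrix are non-negative and sum to the number of variables, so the cumulative fractions below measure how much of the total information each leading subspace carries.

```r
set.seed(2)
M = matrix(rnorm(200 * 10), 200, 10)
M[, 1:5] = M[, 1:5] + rnorm(200)     # give half the columns a common factor
S = cor(M)
L = eigen(S, symmetric=TRUE)
sum(L$values)                        # equals ncol(S) = 10, the matrix trace
cumsum(L$values) / sum(L$values)     # cumulative proportion of information
```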

Here is a plot of all the sample correlation matrix eigenvalues (along with a vertical line that will be explained in a moment):

plot(L$values, ylab="eigenvalues")
abline(v=10)

The eigenvalues fall off rather quickly in our example! That means that a lot of the information in the sample correlation matrix is contained in the first few eigenvectors.

Let’s assume, perhaps unreasonably, that the errors in our estimate of the correlation matrix are equally likely to occur in any direction (that the errors are white noise, basically). As we can see above, most of the information is concentrated in the subspace corresponding to the first few eigenvectors. But white noise will have information content in all the dimensions more or less equally.

One regularization technique replaces the sample correlation matrix with an approximation defined by only its first few eigenvectors. Because they represent a large amount of the information content, the approximation can be pretty good. More importantly, because we assumed noise to be more or less equally represented across the eigenvector directions and we’re cutting most of those off, this approximation tends to damp the noise more than the underlying information. Most importantly, we’re cutting off the subspace associated with tiny eigenvalues, avoiding the problem of division by tiny values and significantly reducing amplified noise in the inverse of the sample correlation matrix (the precision matrix).

The upshot is, we regularize the sample correlation matrix by approximating it by a low-rank matrix that substantially reduces the influence of noise on the precision matrix. See Per Christian Hansen’s classic paperback, “Rank-Deficient and Discrete Ill-Posed Problems” (http://epubs.siam.org/doi/book/10.1137/1.9780898719697), for insight into related topics.
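For the curious, here is the truncation idea on the same kind of small synthetic correlation matrix (again made up, not the stock data): keeping the first k eigenvectors gives a rank-k approximation whose error in the spectral norm is exactly the first discarded eigenvalue, so a fast-decaying spectrum means a good approximation.

```r
set.seed(3)
M = matrix(rnorm(200 * 10), 200, 10)
M[, 1:5] = M[, 1:5] + rnorm(200)     # a common factor, as before
S = cor(M)
L = eigen(S, symmetric=TRUE)
k = 3
S_k = L$vectors[, 1:k] %*% (L$values[1:k] * t(L$vectors[, 1:k]))  # rank-k approximation
norm(S - S_k, "2")    # equals L$values[k + 1], the largest eigenvalue cut off
L$values[k + 1]
```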

But how to choose a cut-off rank?

There is a substantial mathematical literature on just this topic (regularization parameter selection), complete with deep theory as well as lots of heuristics. Let’s keep things simple for this example and form our approximation by cutting off eigenvectors beyond where the eigenvalue plot starts to flatten out – close to the vertical line in the above plot.

Alternatively, consider the lovely short 2004 paper by Chris Ding and Xiaofeng He (http://dl.acm.org/citation.cfm?id=1015408) that illuminates connections (that I happen to find fascinating) between k-means clustering and projections like truncated eigenvalue expansions. Although we aren’t interested in k-means clustering per se, our objective is connected to clustering. Ding and He show that we can find at least k (k-means) clusters using the first k – 1 eigenvectors above. This gives us another heuristic way to choose a projection dimension, at least if we have an idea about the number of clusters to look for.

A precision matrix, finally

Finally, we form the precision matrix P from the regularized sample correlation matrix. The inversion is less numerically problematic now because of regularization. Feel free to experiment with the projected rank N below!

N = 10  # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
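One way to see what this formula computes (a side check on made-up data, not the stock data): inverting only the kept eigenvalues yields the Moore–Penrose pseudoinverse of the rank-N approximation, which MASS::ginv() can confirm.

```r
library(MASS)                        # for ginv(), the pseudoinverse
set.seed(4)
S = cor(matrix(rnorm(100 * 6), 100, 6))   # a small synthetic correlation matrix
L = eigen(S, symmetric=TRUE)
N = 3
S_N = L$vectors[, 1:N] %*% (L$values[1:N] * t(L$vectors[, 1:N]))       # rank-N approximation
P   = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N])) # regularized "inverse"
max(abs(P - ginv(S_N)))              # essentially zero
```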

Other approaches

I’m not qualified to write about them, but you should be aware that Bayesian approaches to solving problems like this are also effectively (and effective!) regularization methods. I hope to someday better understand the connections between classical inverse problem solution methods that I know a little bit about, and Bayesian methods that I know substantially less about.

Put a package on it

There is a carefully written R package to construct regularized correlation and precision matrices: the corpcor package (https://cran.r-project.org/package=corpcor; also see http://strimmerlab.org/software/corpcor/) by Juliane Schafer, Rainer Opgen-Rhein, Verena Zuber, Miika Ahdesmaki, A. Pedro Duarte Silva, and Korbinian Strimmer. Their package includes the original Ledoit-Wolf-like regularization method, as well as refinements to it and many other methods. The corpcor package, like the Ledoit-Wolf method, includes ways to use sophisticated regularization operators, and it applies more broadly than the simple approach taken in this post.

You can use the corpcor package to form a Ledoit-Wolf-like regularized precision matrix P, and you should try it! The result is pretty similar to what we get from our simple truncated eigenvalue decomposition regularization in this example.

Networks and clustering

The (i, j)th entry of the precision matrix P is a measure of association between the log(return) time series for the stocks in columns i and j, with larger values corresponding to more association.

An interesting way to group related stocks together is to think of the precision matrix as an adjacency matrix defining a weighted, undirected network of stock associations. Thresholding entries of the precision matrix to include, say, only the top 10% results in a network of only the most strongly associated stocks.
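On a made-up 3×3 adjacency matrix, quantile thresholding looks like this (the analysis below uses the 90th percentile; the 75th is used here so something survives in such a small example):

```r
A = matrix(c(0, 5, 1,
             5, 0, 4,
             1, 4, 0), 3, 3)           # a toy symmetric adjacency matrix
quantile(A, probs=0.75)                # the cut-off value
A * (A > quantile(A, probs=0.75))      # only the strongest associations survive
```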

Thinking in terms of networks opens up a huge and useful toolbox: graph theory. We gain access to all kinds of nifty ways to analyze and visualize data, including methods for clustering and community detection.

R’s comprehensive igraph package by Gábor Csárdi (https://cran.r-project.org/package=igraph) includes many network cluster detection algorithms. The example below uses Blondel and co-authors’ fast community detection algorithm implemented by igraph’s cluster_louvain() function to segment the thresholded precision matrix of stocks into groups. The code produces an igraph graph object g, with vertices colored by group membership.

suppressMessages(library(igraph))
threshold = 0.90
Q = P * (P > quantile(P, probs=threshold))                           # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
# The rest of the code lumps any singletons lacking edges into a single 'unassociated' group shown in gray
# (also assigning distinct colors to the other groups).
x = groups(cluster_louvain(g))
i = unlist(lapply(x, length))
d = order(i, decreasing=TRUE)
x = x[d]
i = i[d]
j = i > 1
s = sum(j)
names(x)[j] = seq(1, s)
names(x)[! j] = s + 1
grp = as.integer(rep(names(x), i))
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
g = set_vertex_attr(g, "color", value=clrs)

Use the latest threejs package to make a nice interactive visualization of the network (you can use your mouse/trackpad to rotate, zoom and pan the visualization).

library(threejs)
graphjs(g, vertex.size=0.2, vertex.shape=colnames(X), edge.alpha=0.5)

[Interactive threejs visualization: a rotatable 3-D network of the 94 stocks, with vertices labeled by ticker symbol and colored by detected group (gray for unassociated stocks).]

The stock groups identified by this method are uncanny, but hardly all that surprising really. Look closely and you will see clusters made up of bank-like companies (AIG, BAC, BK, C, COF, GS, JPM, MET, MS, USB, WFC), pharmaceutical companies (ABT, AMGN, BIIB, BMY, CELG, GILD, JNJ, LLY, MRK, PFE), computer/technology-driven companies (AAPL, ACN, CSCO, IBM, INTC, MSFT, ORCL, QCOM, T, TXN, VZ – except oddly, the inclusion of CAT in this list), and so on. With the threshold value of 0.9 above, a few stocks aren’t connected to any others; they appear in gray.

The groups more or less correspond to what we already know!

The group that includes FB, GOOG, and AMZN (Facebook, Alphabet/Google, and Amazon) is interesting and a bit mysterious. It includes the credit card companies V (Visa), MA (Mastercard), and AXP (American Express). Perhaps the returns of FB, GOOG, and AMZN are more closely connected to consumer spending than technology! But oddly, this group also includes a few energy companies (DUK, EXC, NEE), and I’m not sure what to make of that…

This way of looking at things also nicely highlights connections between groups. For instance, we see that a group containing consumer products companies (PEP, KO, PG, CL, etc.) is connected to both the Pharma group, and the credit card company group. And see the appendix below for a visualization that explores different precision matrix threshold values, including lower values with far greater network connectivity.

Review

We downloaded daily closing stock prices for 100 stocks from the S&P 500, and, using basic tools of statistics and analysis like correlation and regularization, we grouped the stocks together in a network that highlights associations within and between the groups. The structure teased out of the stock price data is reasonably intuitive.


Appendix: threejs tricks

The following self-contained example shows how the network changes with threshold value. It performs the same steps as we did above, but uses some tricks in threejs and an experimental extension to the crosstalk package and a few additional R packages to present an interactive animation. Enjoy!

suppressMessages({
library(quantmod)
library(igraph)
library(threejs)
library(crosstalk)
library(htmltools)
# using an experimental extension to crosstalk:
library(crosstool) # devtools::install_github('bwlewis/crosstool')
})
# Download the processed log(returns) data:
suppressMessages(load(url("http://illposed.net/logreturns.rdata")))
X = cor(log_returns)
L = eigen(X, symmetric=TRUE)
N = 10  # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
colnames(P) = colnames(X)
# A function that creates a network for a given threshold and precision matrix
f = function(threshold, P)
{
  Q = P * (P > quantile(P, probs=threshold))                           # thresholded precision matrix
  g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph
  x = groups(cluster_louvain(g))
  i = unlist(lapply(x, length))
  d = order(i, decreasing=TRUE)
  x = x[d]
  i = i[d]
  j = i > 1
  s = sum(j)
  names(x)[j] = seq(1, s)
  names(x)[! j] = s + 1
  grp = as.integer(rep(names(x), i))
  clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
  g = set_vertex_attr(g, "color", value=clrs)
  set_vertex_attr(g, "shape", value=colnames(P))
}
threshold = c(0.99, 0.95, 0.90, 0.85, 0.8)
g = Map(f, threshold, MoreArgs=list(P=P)) # list of graphs, one for each threshold
# Compute force-directed network layouts for each threshold value.
# A bit expensive to compute, so run in parallel!
library(parallel)
l = mcMap(function(x) layout_with_fr(x, dim=3, niter=150), g, mc.cores=detectCores())
sdf = SharedData$new(data.frame(key=paste(seq(0, length(threshold) - 1))), key=~key)
slider = crosstool(sdf, "transmitter",
               sprintf("<input type='range' min='0' max='%d' value='0'/>", length(threshold) - 1),
               width="100%", height=20, channel="filter")
vis = graphjs(g, l, vertex.size=0.2, main=as.list(threshold), defer=TRUE, edge.alpha=0.5, deferfps=30,
        crosstalk=sdf, width="100%", height=900)
browsable(div(list(HTML(""), tags$h3("Precision matrix quantile threshold (adjust slider to change)"), slider, vis)))

[Interactive visualization: a slider labeled “Precision matrix quantile threshold (adjust slider to change)” selects among the thresholds 0.99, 0.95, 0.90, 0.85, and 0.80, showing the corresponding stock network for each.]
9,0.61880368,0.090738077,-0.14081327,0.84191051,0.68181034,-0.8258029,0.64762671,0.1501069,-0.24263714,0.017802143,1,-0.1214969,0.93580845,0.48186739,0.24363244,0.67906388,0.83221267,-0.86891658,-0.045669956,0.023644385,0.63721781,-0.044615599,-0.22304718,-0.068128496,-0.072713883,-0.067596772,0.8520368,0.12732808,-0.0067019214,0.50681309,-0.2207519,-0.13163149,0.57835845,0.59980342,0.72544833,-0.075641243,0.88128365,-0.20871993,-0.78608026,0.18748621,0.73926971,-0.23937113,0.55130293,0.93380812,0.36740197,0.8428627,0.57057374,-0.98763207,0.47616648,0.24355021,-0.95356848,-0.059928397,0.21868909,-0.30983483,-0.3627649,0.34938618,0.23521562,0.67779186,0.41414475,-0.099944325,1,0.12361002,-0.37985352,0.35342963,-0.29276267,0.86610856,0.11837677,0.20581289,-0.63888545,0.83177494,0.48206895,-0.032137973,-0.33042275,0.32386167,-0.015495521,-0.074122031,0.86351385,0.668921,-0.05054882,0.45270525,-0.13168092,0.086126262,0.29014065,0.61299387,0.80687637,0.12769712,-0.065099214,0.8552075,0.11997095,-0.52720014,0.51283813,-0.31112665,0.20502667,0.70982566,-0.14789728,-0.22129596,0.082214308,0.79209098,-0.64083889,0.83451452,0.11839134,-0.39694073,-0.31975508,0.21018794,-0.023802091,0.48234786,0.60952316,-0.13152538,-0.21084877,0.82449771,0.45419687,-0.27241156,0.10171764],[0.29498012,-0.051676614,0.88114869,0.053803772,-0.64347291,-0.60977954,0.068979755,-1,-1,0.064696719,-0.79888651,-0.79303772,-0.12042887,-0.29250861,0.038599396,0.59092902,-0.019264082,-0.25598341,0.7144458,0.41403624,0.54114319,-0.16613154,-0.084209921,0.42889226,-0.46976179,0.19548432,-0.136996,-0.12738639,0.79086827,-0.20666761,0.25303589,0.11908117,-0.26954702,-0.56295852,0.17734303,0.31495818,-0.42972828,0.10419678,0.58612948,-0.43041519,0.35137033,-0.33511656,-0.14644253,0.13591988,1,0.2286899,0.80403671,-0.22961549,-0.85528712,0.26773786,0.24045131,0.43555518,-0.041885506,-0.16769886,-0.5581478,0.51916294,0.37809099,-0.13521379,0.070110529,0.8088866,-0.69072642,0.724775,0.049880207,0.15510666,0.76527
848,0.8602613,0.96413648,0.52699144,0.068215155,0.34948595,-0.17622393,0.76209082,0.45698169,0.59827811,-0.067900394,0.020763816,0.82251443,0.54759668,0.0035695035,0.28227452,0.20252798,-0.80644588,-0.049829445,0.54749232,-0.48532573,-0.011813012,0.80377457,-0.38504877,-0.30118433,0.51718462,0.97109301,0.0084120856,0.37913456,-0.18846656,0.48007814,-0.45962786,0.83810802,0.11525142,0.29270352,0.88340236,0.40826331,-0.083828644,0.90535697,0.25995911,0.57777973,-0.54999766,0.042945869,-0.28292869,-0.19633172,-0.065927328,0.84397316,-0.37713791,0.088794543,0.89704794,-0.22328637,0.095028449,-0.23380484,-0.2875974,0.91426647,0.001426968,0.37111061,-0.072663199,-0.35252794,1,0.25009664,0.11421185,0.55503679,0.33536136,0.77973305,0.74028372,0.27550279,0.77475355,0.036425837,0.9241254,-0.1574548,0.28921691,0.80101945,0.69105015,0.9182832,0.66099833,0.19437289,-0.52712995,-0.0093606521,-0.062710334,0.55953699,-0.21268609,0.70306145,0.28355512,-0.16465504,0.59135933,-0.21785536,0.81656516,0.18550857,0.05485125,0.76979468,-0.04226185,0.42674796,0.65326313,0.64580137,-0.59286619,0.50538197,0.70768292,0.28462063,-0.16067528,-0.09021372,0.78434229,0.66114642,0.0867676,-0.91824033,0.53406617,0.23745522,-1,0.26585768,0.40353013,-0.41325876,0.69687955,0.82326724,0.25078931,0.44664854,-0.40358349,-0.40109946,0.52497262,0.52694458,-0.45994162,0.93452725,0.20864557,0.40557951,-0.16477419,0.18183418,0.29417426,0.92153257,-0.041106072,0.4700325,-0.015730125,0.71941398,0.98555884,-0.024080231,0.18673364,0.67236397,0.77940275,0.016207733,0.24035548,-0.32567167,0.67192093,0.044283969,0.86036691,0.79303001,0.74369385,0.51289904,0.75088783,-0.48551194,0.37909843,0.63336617,0.54725051,-0.17643219,-0.10821638,-0.73412599,0.52073124,0.50497944,-0.70903641,0.332258,0.52440218,0.52625927,-0.16863175,0.54398957,-0.24027628,0.2213652,-0.3330157,0.66555741,0.73151351,0.30302073,0.19262366,0.93524692,0.6133587,0.87999351,-0.12564236,0.30633236,0.79797715,-0.17095463,0.14464797,0.33289094,-0.015811798
,0.50293465,0.71481698,0.45755217,0.1965265,-0.39158373,-0.15910228,0.75805285,0.24007526,-0.36288981,0.5001102,0.28030939,1,0.27452473,-0.1216658,0.47609072,-0.26565301,-0.014544523,0.089065274,-0.45002301,-0.85402381,0.08171067,-0.038320812,-0.31454388,-0.071628391,-0.17109076,-0.65564093,0.8086416,0.34062761,0.5764021,0.069279055,0.49432005,-0.80656621,-0.10441833,0.23456617,0.92688211,0.46642557,0.27396897,-0.13703001,0.81813945,0.66643483],[0.17783955,-0.15759442,0.053900243,-0.86224094,0.068242165,0.42905516,-0.57104026,-0.19492671,-0.11609721,-0.23398215,-0.72510014,0.51154527,0.0097656279,-1,-1,-0.93735646,0.082434949,0.20536061,1,-0.1557461,-0.082038814,0.84495779,0.43207132,0.28509484,-0.2126891,0.68436687,0.4286967,-0.069021616,-0.33071024,0.96496552,-1,0.13523493,0.014265857,-0.29784594,-0.32231936,0.97718355,-0.12705862,-0.65213029,-0.14412301,-0.69006355,0.20207098,-0.11577638,-0.05356529,-0.76621916,-0.52096107,0.16252653,-0.25943725,0.99417429,-0.3371341,0.20928828,0.71983804,-0.96296611,-0.032869266,-0.077951049,0.22859105,0.043997261,-0.67524898,-0.60174516,0.055767511,-0.613242,-0.077044618,-0.55093747,0.77729501,0.16598228,0.71309767,-0.44762075,0.56504264,-0.065034098,0.57811898,-0.10525385,-0.19631659,-0.2360667,0.69221667,-0.053695675,0.77865069,-0.088969506,0.8113929,-0.48914215,0.23379846,0.69321757,0.46177638,-0.54812699,0.58550497,0.85019587,-0.4585469,0.094516029,-0.84925933,-0.34789913,0.79390519,0.078275121,0.44465417,0.31911358,-0.084508512,0.070618411,0.65596829,0.79668584,0.6030019,0.27332729,-0.072039388,0.9052591,0.089806378,0.2073919,0.89065059,0.020844478,-0.28947498,0.08123573,0.62359743,0.27674576,-0.43052165,-0.015021548,-0.56740377,-0.3285224,0.04621971,-0.68605747,-0.23809833,0.75509322,0.59122687,0.23657496,-0.75374604,0.37671945,-0.93623461,-0.13501666,0.21889085,0.71649395,0.11021251,0.072514614,0.69808228,-0.15154566,-0.066050182,0.67882101,-0.085392565,-0.23176996,-0.16386907,-0.19434657,0.79435357,0.10381841,0.79391598
,-0.14601359,0.8105671,-0.10424397,0.56633109,-0.19354134,0.64270251,0.8488561,-0.20994259,-0.28586828,0.24075239,-0.28727506,-0.18459444,0.10118974,-0.78261994,0.21414468,-0.35205515,-0.014664979,-0.12465937,0.98540888,0.39229951,0.67667937,-0.2977038,0.43041134,0.012776219,-0.89848826,-0.75519572,-0.041400856,0.050429088,0.79810349,0.017515242,0.46469786,0.87291237,-0.21758058,-0.44620877,0.19926903,0.15422877,-0.98494941,0.19162418,0.078199478,1,-0.0070351805,0.85394399,0.76432107,0.30667086,-0.12790249,-0.78618387,-0.42783362,-0.80646643,0.079332945,-0.59931745,0.0031783694,0.166797,0.15889373,-0.2735752,0.69772872,0.12190797,-0.29653825,-0.070842279,0.55987646,0.4511242,-0.21821816,0.88305135,-0.23155642,0.38198347,-0.32201332,-0.28013355,-0.26285771,0.09957063,0.93645869,-0.37789481,0.84476686,-0.40773958,-0.11343638,0.48414142,-0.053542954,-0.70421396,-0.8780995,0.32692424,0.11940501,0.04412555,0.076368363,-0.81914995,0.52451242,0.17462485,-0.74574305,-0.013139605,0.038552844,-0.076741031,-0.36592078,0.62058886,0.56806554,0.84165951,-0.27154944,0.17178052,-0.079609282,0.94877122,-0.26365281,0.73753923,0.43121791,-0.088979591,0.68288563,0.55831752,0.14244879,-0.20926609,-0.15725029,-0.13305365,0.49583779,0.070498999,0.25934405,-0.27064546,0.23130921,-0.55438073,-0.16231657,-0.46605081,-0.12801623,-0.41298173,0.99868078,-0.044327012,-0.44271785,0.79379781,0.43434449,-0.060016688,0.91028358,0.47124026,-0.37951939,-0.2359872,0.84302637,-0.051520617,0.55325596,0.61377308,0.6640757,-0.2747188,-0.4921214,-0.096735661,-0.11563012,-0.016182455,0.14556072,-0.518098,0.81064751,0.50961036,-0.13644442,0.35261199,0.24427052,1,-0.17536701],[0.090444878,0.62316679,-0.76544911,-0.43874931,0.5585405,0.57059156,0.53699378,0.54024513,-0.70006719,-0.74735458,-0.60343254,0.7552075,-0.098686649,-0.4858817,0.74287394,-0.09250827,0.55347837,0.91881724,0.33962659,-0.088408251,-0.59601963,-0.88255117,-0.22617861,0.16667798,0.87673626,0.15051681,0.40733963,-0.65422004,-0.49362175,0.5318
9692,-0.36647082,0.48013327,0.82639251,-0.52398778,-0.70289541,0.62631558,-0.14393107,-0.74242882,-0.9471916,0.085818395,0.20766915,0.57934949,-0.13287829,-0.81795928,0.59090366,-0.82080674,-0.49499937,0.3372954,0.60995827,0.31725478,-0.4648659,0.031359768,0.52808175,0.7813918,0.3870821,-0.50183745,0.45941968,-0.94678443,0.66622353,-0.10483007,-0.71325119,-0.30032159,0.70446599,-0.21051338,-0.24398266,-0.82567644,-0.56744029,0.25902925,-0.79359114,0.27697338,0.75350607,-0.63761406,-0.43467484,0.21235066,-1,-0.29761942,-0.46456225,-0.40330295,0.60622297,-0.051022189,0.70445764,0.7406306,0.22287457,0.72970015,-1,0.50300157,-0.21233105,0.61494559,-0.078833427,-0.0071765688,-0.068565755,0.23170479,0.1585404,0.78541866,0.23561102,-0.0078247949,-0.41421291,0.10525416,0.37872665,-0.707218,0.27155597,-0.2170626,0.091632749,0.048892826,-0.058454987,0.38633395,0.20838727,0.42897542,-0.66559423,0.6561716,-0.12980131,-0.72171151,0.73182492,-0.32217755,0.65418604,0.18911056,0.42669836,-0.18127124,-1,-0.3649093,-0.26428525,0.45825535,1,-0.13169968,0.2841267,-0.12598756,0.25206862,0.1689498,-0.39568818,0.10433478,0.13599512,-0.44131032,-0.52140957,-0.60537411,0.34299696,0.074576574,-0.31926124,-0.76454755,-0.34266012,0.22033362,-0.78683191,0.9057084,0.0084676105,0.5689532,0.018473819,0.89169442,-0.67800224,0.11342537,0.79437689,-0.31627777,0.0092255027,0.089701364,0.90545145,-0.68237278,-0.55064203,0.10076807,-0.37788673,-0.094631641,-0.22551786,0.47148387,-0.72518702,0.34340334,-0.23039673,0.75340627,0.57168672,-0.67857701,0.14065672,-0.64384516,0.51151842,-0.39675145,-0.43567193,0.088727008,-0.54499977,-0.20692518,-0.80427251,-0.72740513,-0.095119783,0.84618392,0.40536163,-0.058508082,0.51292179,-0.55679693,0.077153147,0.41010044,0.77835641,0.27778947,-0.1129663,0.76862415,0.25133267,-0.82738435,-0.31070451,0.46256336,0.21436181,0.69160139,-0.45870563,-0.27807289,0.18640338,0.30254875,-0.31746923,0.16896196,-0.47552432,0.12997388,1,-0.46148639,-0.12816807,-0.47919782,-0.8147078,
0.18916395,0.10764332,-0.82733922,0.29591236,-0.66333732,0.42821748,-0.090815702,0.74419338,0.66841573,0.13991613,-0.67749308,0.21031054,0.48551531,-0.75126836,0.052738497,-0.058822976,0.67613349,-0.086098628,0.69477516,0.31109647,0.24642122,-0.11496791,0.022906621,-0.78363817,0.055922556,-0.58818935,-0.68413395,-0.43481941,0.16754447,0.18879061,-0.71390383,0.33083569,0.17898298,-0.25646518,0.70450833,-0.26919903,-0.41622978,0.38934065,-0.4760137,-0.73924328,0.72441969,0.15093962,0.43584904,0.82122686,-0.32512881,-0.61421395,0.95557407,0.036175476,0.66369765,-0.048701513,0.34605424,1,0.27540523,0.39466834,-0.79088675,-0.7732933,0.24841042,0.80622574,0.16379773,0.1417157,0.37482136,-0.42042645,-0.27419871,-0.24875536,0.86539303,-0.25194428,-0.53212002,-0.82081097,0.27398402,-0.4865771,0.52296718,-0.58136423,-0.43190923,-0.48389361,-0.72281603],[0.72695768,-0.24186884,0.4217796,1,-0.46326036,0.12396433,0.34389471,-0.050543947,-0.013374582,-0.51358385,0.61552134,0.78464296,-0.45791659,0.30004685,0.30647945,0.79708265,-0.39241913,-0.078515896,-0.46448807,-0.2456046,0.60399103,-0.18902606,0.29448352,0.57194525,-0.35104951,-0.47010496,-0.18126393,-0.22629762,0.57366411,0.55930844,0.58593296,-0.54327962,-0.11992728,-0.50704066,0.58592238,0.46061679,0.11137274,1,0.60516048,0.094522196,-0.38844015,-0.090057252,-0.61556756,0.40849876,0.22352238,0.0056749144,0.47478043,0.59701276,0.20987992,-0.39070937,0.047347731,0.74827689,-0.47115241,-0.036100826,-0.62856071,-0.11806924,0.023141685,0.7643932,-0.33101829,1,-0.33190443,0.44775836,0.76201138,-0.75111454,-0.52828083,0.64020809,0.075170078,-0.54225103,0.66317893,0.79770242,-0.16966464,0.081833814,-0.021955924,-0.67050096,0.31424713,-0.70704863,-0.39264633,0.50178387,-0.68958338,-0.31967381,-0.1852611,-0.27528373,-0.51085328,-0.57362307,0.30990823,-0.40447587,0.98130276,-0.6580299,-0.46888932,-0.025374094,0.055352759,-0.19994315,0.24433517,-0.030441848,-0.47231459,-0.3829947,-0.0063208085,0.014492193,0.51417423,0.13383203,-0.3696
371,0.65292696,-0.026263344,-0.25538123,0.49785146,-0.2517771,-0.27330329,-0.10649721,0.43939775,-0.25206122,0.85678301,0.40830768,-0.28784913,0.93195268,-0.18266714,-0.45944329,-0.30578837,-0.68110333,-1,-1,0.71809223,-0.52912083,0.045678625,-0.0042593059,-0.35627552,0.26647169,0.056749394,-0.17259209,0.56107261,-0.023605575,-0.15852605,0.63456701,-0.083648973,0.43175527,0.69678661,-0.64084014,-0.57355488,0.48951855,-0.064038938,-0.59486,0.62377941,-0.34781735,-0.45075459,-0.39592212,0.90767564,-0.10437462,0.14705625,0.67718858,-0.22338655,0.20180172,0.25914304,-0.25682104,-0.14296918,-0.13234697,0.43532575,0.49204897,-0.23739277,-0.36195779,0.54562282,-0.55268903,0.067986067,0.065371825,0.69123252,-0.36493883,0.10458183,-0.0038211906,-0.62197535,0.53120305,-0.64581902,-0.19411319,0.30342028,-0.61275427,-0.29080326,0.19928911,-0.24401344,0.7543634,0.69609938,0.043353992,-0.29967533,-0.33501341,-0.59232218,-0.027457448,-0.056050934,-0.400793,-0.21842112,-0.0010161562,0.79353924,-0.3726374,0.23015649,-0.24757359,0.48473053,0.57178665,0.63330624,-0.11427977,0.35965063,-0.039601323,-0.10178698,0.38656961,-0.20159333,-0.47990204,0.38848638,0.98280471,-0.16938313,0.17653658,-1,-0.44805063,0.48689473,-0.20811081,-0.25027098,0.80136225,-0.53543522,0.013103004,-0.12170203,0.57559131,-0.52909817,0.15439371,-0.62834577,-0.06370128,0.15034425,-0.7248267,0.05030799,0.070435073,0.57548446,-0.21995816,0.35620504,-0.26046908,-0.51957697,-0.24630938,-0.31241993,-0.44022566,0.68131314,-0.94562462,-0.54532374,0.47357205,0.076003982,0.011229871,0.4726402,0.089994004,-0.21748104,0.68980045,0.51480924,-0.17382545,0.4711731,0.14280041,-0.46720046,0.52243634,0.51303293,-0.18530757,0.9083314,0.53448928,-0.23451984,-0.017955101,0.20322685,-0.68657356,-0.10641699,-0.48584759,-0.40700248,-0.29286355,-0.45872745,-0.45946246,-0.53716966,-0.39056919,0.66658013,0.64719607,-0.22440679,-0.3600049,-0.35322205,-0.49050933,-0.16367338,0.36987599,0.51429659,-0.21597532,0.48994862,-0.17059984,0.58469808
,0.68267232,0.28059311,-0.49859575,0.54015628,-0.47784917,-0.46134894,0.46571671]],"color":[["#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#CC00FF","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#CCFF00","#BEBEBE","#BEBEBE","#00FF66","#FF0000","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#CCFF00","#BEBEBE","#0066FF","#BEBEBE","#0066FF","#BEBEBE","#BEBEBE","#BEBEBE","#CCFF00","#CCFF00","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#CC00FF","#CC00FF","#BEBEBE","#BEBEBE","#00FF66","#BEBEBE","#BEBEBE","#FF0000","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#00FF66","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#FF0000","#0066FF","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#FF0000","#BEBEBE","#BEBEBE","#BEBEBE","#0066FF","#BEBEBE","#FF0000","#00FF66","#CCFF00","#FF0000","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#FF0000","#BEBEBE","#00FF66","#BEBEBE"],["#FF0000","#BEBEBE","#BEBEBE","#BEBEBE","#BEBEBE","#33FF00","#CC00FF","#BEBEBE","#FF9900","#0066FF","#33FF00","#BEBEBE","#BEBEBE","#33FF00","#BEBEBE","#0066FF","#BEBEBE","#33FF00","#00FFFF","#3300FF","#BEBEBE","#00FF66","#CCFF00","#FF0000","#CCFF00","#00FF66","#BEBEBE","#BEBEBE","#3300FF","#BEBEBE","#FF0099","#FF9900","#FF0099","#CCFF00","#CC00FF","#FF9900","#3300FF","#3300FF","#FF9900","#BEBEBE","#33FF00","#CCFF00","#CC00FF","#CC00FF","#0066FF","#00FF66","#CCFF00","#FF9900","#FF0000","#FF0000","#BEBEBE","#0066FF","#00FF66","#00FFFF","#33FF00","#CCFF00","#BEBEBE","#BEBEBE","#BEBEBE","#FF9900","#00FFFF","#BEBEBE","#33FF00","#0066FF","#FF0000","#FF0099","#CCFF00","#FF0000","#00FF66","#CC00FF","#00FFFF","#33FF00","#00FFFF","#00FFFF","#FF0000","#FF9900","#CCFF00","#00FF66","#FF0099","#FF0099","#FF0000","#CCFF00","#3300FF","#FF0000","#BEBEBE","#FF9900","#FF9900","#BEBEBE","#FF9900","#BEBEBE","#FF0000","#BEBEBE","#CCFF00","#00FF66"],["#AAFF00","#00FFAA","#A
AFF00","#00FF00","#BEBEBE","#00FFAA","#FF0000","#FF0000","#FFAA00","#00FF00","#00FFAA","#00FF00","#BEBEBE","#00FFAA","#BEBEBE","#00FF00","#AAFF00","#00FFAA","#0000FF","#FF00AA","#00FF00","#AA00FF","#00AAFF","#AAFF00","#00AAFF","#AA00FF","#FFAA00","#FFAA00","#FF00AA","#FFAA00","#FF0000","#FFAA00","#FF0000","#00AAFF","#FF0000","#FFAA00","#FF00AA","#FF00AA","#FFAA00","#BEBEBE","#00FFAA","#00AAFF","#FF0000","#FF0000","#00FF00","#AA00FF","#00AAFF","#FFAA00","#AAFF00","#AAFF00","#00FFAA","#00FF00","#AA00FF","#0000FF","#00FFAA","#00AAFF","#FF0000","#0000FF","#00FF00","#FFAA00","#0000FF","#BEBEBE","#00FFAA","#00FF00","#AAFF00","#FF0000","#00AAFF","#AAFF00","#AA00FF","#FF0000","#0000FF","#00FFAA","#0000FF","#0000FF","#AAFF00","#FFAA00","#FF0000","#AA00FF","#FF0000","#FF0000","#AAFF00","#00AAFF","#FF00AA","#AAFF00","#BEBEBE","#FFAA00","#FFAA00","#00FF00","#FFAA00","#FF0000","#AAFF00","#00FF00","#00AAFF","#AA00FF"],["#AAFF00","#00FF00","#AAFF00","#FFAA00","#00FFAA","#00FF00","#00AAFF","#FFAA00","#FF0000","#FFAA00","#00FF00","#FFAA00","#BEBEBE","#00FF00","#FFAA00","#FFAA00","#AAFF00","#00FF00","#00FFAA","#FF00AA","#FFAA00","#AA00FF","#0000FF","#AAFF00","#0000FF","#AA00FF","#FF0000","#FF0000","#FF00AA","#FF0000","#00AAFF","#FF0000","#00AAFF","#0000FF","#00AAFF","#FF0000","#FF00AA","#FF00AA","#FF0000","#BEBEBE","#00FF00","#0000FF","#00AAFF","#00AAFF","#FFAA00","#AA00FF","#0000FF","#FF0000","#AAFF00","#AAFF00","#00FF00","#FFAA00","#AA00FF","#00FFAA","#00FF00","#0000FF","#00FFAA","#00FFAA","#FFAA00","#FF0000","#00FFAA","#FF0000","#00FF00","#FFAA00","#AAFF00","#00AAFF","#0000FF","#AAFF00","#AA00FF","#00AAFF","#00FFAA","#00FF00","#00FFAA","#00FFAA","#AAFF00","#FF0000","#0000FF","#AA00FF","#00AAFF","#00AAFF","#AAFF00","#0000FF","#FF00AA","#AAFF00","#00FF00","#FF0000","#FF0000","#FFAA00","#FF0000","#00FFAA","#AAFF00","#FFAA00","#0000FF","#AA00FF"],["#00FF40","#00FFFF","#00FF40","#FF0000","#FF0000","#00FFFF","#80FF00","#FF0000","#FFBF00","#FF0000","#00FFFF","#FF0000","#FF0000","#00FFFF"
,"#FF0000","#FF0000","#00FF40","#00FFFF","#80FF00","#FF00BF","#FF0000","#8000FF","#0040FF","#00FF40","#0040FF","#8000FF","#FFBF00","#FFBF00","#FF00BF","#FFBF00","#FF0000","#FFBF00","#FF0000","#0040FF","#80FF00","#FFBF00","#FF00BF","#FF00BF","#FFBF00","#FFBF00","#00FFFF","#0040FF","#80FF00","#80FF00","#FF0000","#8000FF","#0040FF","#FFBF00","#00FF40","#00FF40","#00FFFF","#FF0000","#8000FF","#80FF00","#00FFFF","#0040FF","#80FF00","#80FF00","#FF0000","#FFBF00","#80FF00","#FFBF00","#00FFFF","#FF0000","#00FF40","#FF0000","#0040FF","#00FF40","#8000FF","#80FF00","#80FF00","#00FFFF","#80FF00","#80FF00","#00FF40","#FFBF00","#0040FF","#8000FF","#FF0000","#FF0000","#00FF40","#0040FF","#FF00BF","#00FF40","#00FFFF","#FFBF00","#FFBF00","#FF0000","#FFBF00","#80FF00","#00FF40","#FF0000","#0040FF","#8000FF"]],"alpha":[[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]],"from":[[6,6,19,19,22,22,23,28,28,30,30,30,32,32,36,36,37,42,46,46,46,49,49,49,49,49,55,55,65,74,74,80,81],[0,0,0,0,5,5,5,5,5,5,6,6,6,6,6,8,8,8,8,8,9,9,9,9,10,10,10,10,10,15,15,15,17,17,17,17,18,18,18,18,18,19,19,19,19,21,21,21,21,21,21,
22,22,22,22,23,23,23,23,23,23,23,23,24,25,25,25,25,28,28,28,30,30,30,30,30,30,31,32,32,32,32,32,32,33,33,33,34,34,34,34,35,35,35,36,36,36,37,38,38,38,38,38,40,40,41,41,42,42,42,43,43,44,45,45,45,45,46,46,46,46,46,47,47,48,48,48,48,48,48,48,49,49,49,49,49,49,49,51,52,52,53,53,53,53,54,54,55,55,55,55,59,60,60,60,62,62,62,62,64,64,64,64,64,65,65,66,66,67,67,67,68,68,70,70,72,74,74,74,75,75,75,76,77,78,80,80,81,81,90],[0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,11,11,11,11,11,11,13,13,13,13,13,13,13,15,15,15,15,15,15,15,16,16,16,17,17,17,17,17,18,18,18,18,18,18,18,18,19,19,19,19,19,19,20,20,20,20,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,23,23,23,23,23,23,23,23,24,24,24,24,24,25,25,25,25,25,26,26,26,27,27,27,28,28,28,29,29,29,29,30,30,30,30,30,30,30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,35,35,35,35,35,35,35,36,36,36,36,37,37,37,38,38,38,38,38,38,40,40,40,41,41,41,41,41,41,41,41,41,42,42,42,42,42,43,43,43,43,43,43,44,44,44,44,44,45,45,45,45,45,46,46,46,46,46,47,47,47,47,47,48,48,48,48,48,48,48,49,49,49,49,49,49,49,50,50,50,50,51,51,51,51,52,52,52,52,52,52,52,53,53,53,53,53,53,54,54,54,54,54,55,55,55,55,56,56,56,56,57,57,57,57,58,59,59,59,60,60,60,60,62,62,62,62,63,63,64,64,64,64,64,65,65,65,65,66,66,66,67,67,67,67,68,68,69,69,70,70,70,72,73,74,74,74,74,74,75,75,75,76,76,76,77,78,79,79,80,80,80,80,80,81,81,82,83,85,85,86,87,90],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,16,16,16,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,18,1
8,18,18,18,18,18,18,18,18,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,22,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,36,37,37,37,37,37,38,38,38,38,38,38,38,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,43,43,43,43,43,43,43,43,44,44,44,44,44,44,45,45,45,45,45,45,46,46,46,46,46,46,46,47,47,47,47,47,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,51,51,51,51,51,52,52,52,52,52,52,52,52,52,53,53,53,53,53,53,53,54,54,54,54,54,54,55,55,55,55,55,55,56,56,56,56,56,56,56,56,57,57,57,57,57,57,58,58,58,59,59,59,59,59,60,60,60,60,62,62,62,62,62,62,62,62,63,63,63,63,64,64,64,64,64,65,65,65,65,65,66,66,66,66,66,66,66,67,67,67,67,68,68,69,69,70,70,70,71,71,71,72,72,72,73,74,74,74,74,74,75,75,75,76,76,76,77,78,78,78,79,79,79,80,80,80,80,80,80,81,81,82,83,83,85,85,86,87,90],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,12,12,12,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,15,15,15,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,17,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,19,19,19,19,19,19
,19,19,19,20,20,20,20,20,20,20,20,20,21,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,22,22,22,23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,25,25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,38,38,38,38,38,38,38,38,38,38,38,39,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,44,44,44,44,44,44,44,44,45,45,45,45,45,45,45,45,45,46,46,46,46,46,46,46,46,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,52,52,52,52,52,53,53,53,53,53,53,53,53,54,54,54,54,54,54,54,55,55,55,55,55,55,55,55,56,56,56,56,56,56,56,56,56,56,57,57,57,57,57,57,57,57,58,58,58,59,59,59,59,59,59,59,59,59,60,60,60,60,60,61,61,61,61,61,62,62,62,62,62,62,62,62,62,62,63,63,63,63,64,64,64,64,64,64,65,65,65,65,65,65,65,66,66,66,66,66,66,66,66,66,67,67,67,67,68,68,68,68,69,69,70,70,70,71,71,71,71,71,72,72,72,73,74,74,74,74,74,74,75,75,75,75,75,76,76,76,76,77,78,78,78,78,78,79,79,79,79,80,80,80,80,80,80,81,81,81,82,83,83,83,85,85,86,87,90,90]],"to":[[42,43,36,37,81,92,49,36,37,32,65,78,65,78,37,82,82,43,55,81,92,64,74,80,83,90,81,92,78,80,90,90,92],[23,49,64,74,10,17,40,54,62,71,34,42,43,69,76,35,38,47,75,88,15,44,51,63,13,17,40,54,71,44,51,63,40,54,62,71,53,60,70,72,73,28,36,37,82,25,45,52,68,77,93,46,55,81,92,48,49,64,67,74,80,83,90,81,45,68,77,93,36,37,82,32,41,65,74,78,79,75,34,41,52,65,78
,79,41,81,92,42,43,65,69,38,75,88,37,80,82,82,47,75,85,86,88,54,71,81,92,43,64,69,64,69,51,52,68,77,93,55,66,76,81,92,75,88,49,64,67,74,80,83,90,62,64,67,74,80,83,90,63,68,77,60,70,72,73,62,71,66,76,81,92,75,70,72,73,71,74,80,90,67,74,80,83,90,78,79,76,81,74,80,83,77,93,72,73,73,80,83,90,85,86,88,81,93,79,81,90,90,92,92],[23,42,43,48,49,64,67,74,80,81,83,90,5,10,13,17,40,54,62,71,23,48,49,64,67,83,9,10,13,17,40,50,54,62,71,34,42,43,56,66,69,76,89,32,65,78,26,27,29,31,35,38,47,59,75,85,86,88,11,15,20,44,51,58,63,87,91,13,17,40,50,54,62,71,15,44,51,63,87,91,17,18,40,50,54,71,75,20,44,51,58,63,87,91,31,49,74,40,50,54,62,71,53,56,57,60,70,72,73,89,28,36,37,80,82,90,44,51,63,91,25,45,52,68,77,81,93,24,33,46,55,66,76,81,90,92,48,49,64,67,74,80,83,90,46,55,66,81,92,45,52,68,77,93,29,38,75,38,75,88,36,37,82,38,45,68,75,32,33,34,35,41,42,43,52,65,74,78,79,88,35,38,47,59,75,85,86,88,33,34,35,41,42,43,52,63,65,74,78,79,41,46,55,66,78,79,81,92,42,43,56,65,66,69,76,78,89,38,47,65,75,85,86,88,37,80,82,90,80,82,90,47,59,75,85,86,88,54,62,71,46,55,65,66,74,78,79,81,92,43,64,69,76,78,56,64,65,69,76,78,51,58,63,87,91,52,68,77,81,93,55,66,76,81,92,59,75,85,86,88,49,64,67,74,80,83,90,62,64,67,74,80,83,90,54,62,71,72,58,63,87,91,65,68,77,78,79,81,93,57,60,70,72,73,89,62,71,74,80,90,66,76,81,92,69,73,76,89,60,70,72,73,63,75,86,88,70,72,73,89,71,74,80,90,87,91,67,74,80,83,90,74,78,79,82,76,81,92,74,80,83,90,77,93,76,89,72,73,89,73,89,78,80,82,83,90,85,86,88,81,89,92,93,79,81,92,81,82,83,90,92,90,92,90,90,86,88,88,91,92],[2,16,23,42,43,48,49,64,67,74,80,81,83,90,92,5,10,13,17,40,49,50,54,62,71,80,90,23,48,49,59,60,64,67,80,83,9,11,15,44,51,63,87,91,9,11,15,18,32,51,53,60,63,65,70,73,91,10,13,17,40,50,54,62,71,74,34,42,43,45,52,56,64,66,68,69,73,76,77,89,9,11,15,20,30,32,44,51,63,65,78,87,91,13,26,27,29,31,35,38,47,59,61,75,85,86,88,11,14,15,20,32,44,51,58,63,78,87,91,13,17,34,40,50,54,62,71,84,14,15,20,44,51,58,63,87,91,17,18,24,26,29,38,40,50,54,70,71,72,75,84,89,15,18,51,53,70,72,87,20,3



To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Introducing routr – Routing of HTTP and WebSocket in R


(This article was first published on Data Imaginist - R posts, and kindly contributed to R-bloggers)

routr is now available on CRAN, and I couldn’t be happier. Its release marks the completion of an idea that stretches back longer than my attempts to bring network visualization and ggplot2 together (see this post for ref). While my PhD was still concerned with proteomics, I began developing GUIs based on shiny for managing different parts of the proteomics workflow. I soon came to realize that I was spending an inordinate amount of time battling shiny itself, because I wanted more than it was meant for. Thus began my idea of creating an expressive and powerful web server framework for R, in the vein of express.js and the like, that could be made to do anything. The idea lingered in my head for a long time and went through several iterations until I finally released fiery in the late summer of 2016. fiery was never meant to stand alone though, and I boldly proclaimed that routr would come next. That didn’t seem to happen. I spent most of the following year developing tools for visualization and network analysis while having a guilty conscience about the project I’d put on hold. Fortunately I’ve been able to put in some time to take up development of the fiery ecosystem once again, so without further ado…

routr

While I spent some time in the introduction talking about the development path of fiery, I would like to start here by saying that routr is a server-agnostic tool. Sure, I’ve built it for use with fiery, but I’ve been very deliberate in making it completely independent of it, except for the code involved in the fiery plugin functionality. So you’re completely free to use routr with whatever server framework you wish (e.g. hook it directly to an httpuv instance). But how does it work? Read on…

The design

routr is basically built up of two different concepts: routes and route stacks. Routes are a collection of handlers attached to specific HTTP request methods (e.g. GET, POST, PUT) and paths. When a request lands at a route, one of the handlers is chosen and called, based on the nature of the request. A route stack is a collection of routes. When a request lands at a route stack, it is passed through all the routes the stack contains sequentially, potentially stopping if one of the handlers signals it. In the following, these two concepts are discussed in detail.

Routes

In its essence, a router is a decision mechanism for redirecting HTTP requests to the correct handler function based on the request URL. It makes sure that e.g. requests for http://example.com/info end up in a different handler than requests for http://example.com/users/thomasp85. This functionality is encapsulated in the Route class. The basic use is illustrated below:

library(routr)

route <- Route$new()

route$add_handler('get', '/info', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = 'This is a test server')
  TRUE
})

route$add_handler('get', '/users/thomasp85', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = 'This is the user information for thomasp85')
  TRUE
})

route

## A route with 2 handlers
## get: /users/thomasp85
##    : /info

Let’s walk through what happened here. First we created a new Route object, and then we added two handlers to it using the eponymous add_handler() method. Both handlers respond to the GET method but differ in the path they listen for. routr uses reqres under the hood, so each handler is passed a Request and Response pair (we’ll get back to the keys argument). Lastly, each handler must return either TRUE, indicating that the next route should be called, or FALSE, indicating that no further routes should be called. As the request and response objects are R6 objects, any changes to them will persist outside of the handler, and there is thus no need to return them.

Now, consider the situation where I have built my super fancy web service into a thriving business with millions of users – would I need to add a handler for every user? No. This would be a case for a parameterized path.

route$add_handler('get', '/users/:user_id', function(request, response, keys, ...) {
  response$status <- 200L
  response$body <- list(h1 = paste0('This is the user information for ', keys$user_id))
  TRUE
})

route

## A route with 3 handlers
## get: /users/thomasp85
##    : /users/:user_id
##    : /info

As can be seen, prefixing a path element with : turns it into a variable that matches any value in that position and adds the matched value as an element of the keys argument. Paths can contain as many variable elements as wanted, in order to reuse handlers as efficiently as possible.
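For illustration, a handler with two parameterized elements might look like this (a hypothetical route following the same pattern as above; the path and handler body are not from the original post):

```r
# Hypothetical route with two parameterized elements; both matched
# values are delivered through the `keys` argument
route$add_handler('get', '/users/:user_id/files/:file_id',
  function(request, response, keys, ...) {
    response$status <- 200L
    response$body <- list(
      h1 = paste0('File ', keys$file_id, ' for user ', keys$user_id)
    )
    TRUE
  }
)
```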

There’s one last piece of path functionality left to discuss: the wildcard. While parameterized path elements only match a single element (e.g. /users/:user_id will match /users/johndoe but not /users/johndoe/settings), the wildcard matches anything. Let’s try one of these:

route$add_handler('get', '/setting/*', function(request, response, keys, ...) {
  response$status_with_text(403L) # Forbidden
  FALSE
})

route$add_handler('get', '/*', function(request, response, keys, ...) {
  response$status <- 404L
  response$body <- list(h1 = 'We really couldn\'t find your page')
  FALSE
})

route

## A route with 5 handlers
## get: /users/thomasp85
##    : /users/:user_id
##    : /setting/*
##    : /info
##    : /*

Here we add two new handlers: one preventing access to anything under the /setting location, and one implementing a custom 404 - Not found page. Both return FALSE, as they are meant to prevent any further processing.

Now there’s a slight pickle with the current situation. If I ask for /users/thomasp85, it can match three different handlers: /users/thomasp85, /users/:user_id, and /*. Which to choose? routr decides on the handler based on path specificity, where handlers are prioritized by the number of elements in the path (the more the better), the number of parameterized elements (the fewer the better), and the existence of wildcards (better with none). In the above case this means that the /users/thomasp85 handler will be chosen. The handler priority can always be seen when printing the Route object.
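The priority rule can be sketched in base R (a toy scoring function for illustration only – this is not routr’s actual implementation):

```r
# Score a path by the three criteria used for prioritization:
# more elements, fewer parameters, no wildcard
path_score <- function(path) {
  parts <- strsplit(sub('^/', '', path), '/')[[1]]
  c(
    n_elements = length(parts),                # more is better
    n_params   = -sum(startsWith(parts, ':')), # fewer is better
    wildcard   = -as.integer('*' %in% parts)   # none is better
  )
}

paths <- c('/*', '/users/:user_id', '/users/thomasp85')
scores <- t(vapply(paths, path_score, numeric(3)))

# Highest scores first, comparing the criteria in order
paths[order(-scores[, 1], -scores[, 2], -scores[, 3])]
#> [1] "/users/thomasp85" "/users/:user_id"  "/*"
```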

The request method is less complicated than the path. It simply matches the method used in the request, ignoring the case. There’s one special method: all. This one will match any method, but only if a handler does not exist for that specific method.

Route Stacks

Conceptually, route stacks are much simpler than routes, in that they are just a sequential collection of routes, with the means to pass requests through them. Let’s create some additional routes and collect them in a RouteStack:

parser <- Route$new()
parser$add_handler('all', '/*', function(request, response, keys, ...) {
  request$parse(reqres::default_parsers)
})

formatter <- Route$new()
formatter$add_handler('all', '/*', function(request, response, keys, ...) {
  response$format(reqres::default_formatters)
})

router <- RouteStack$new()
router$add_route(parser, 'request_prep')
router$add_route(route, 'app_logic')
router$add_route(formatter, 'response_finish')
router

## A RouteStack containing 3 routes
## 1: request_prep
## 2: app_logic
## 3: response_finish

Now, when our router receives a request, it will first pass it to the parser route and attempt to parse the body. If it is unsuccessful, it will abort (the parse() method returns FALSE if it fails); if not, it will pass the request on to the route we built up in the prior section. If the chosen handler returns TRUE, the request will then end up in the formatter route, and the response body will be formatted based on content negotiation with the request. As can be seen, route stacks are an effective way to extract common functionality into well-defined handlers.

If you’re using fiery, RouteStack objects are also what will be used as plugins. Whether to use the router for request, header, or message (WebSocket) events is decided by the attach_to field.

app <- fiery::Fire$new()
app$attach(router)
app

## 🔥 A fiery webserver
## 🔥  💥   💥   💥
## 🔥           Running on: 127.0.0.1:8080
## 🔥     Plugins attached: request_routr
## 🔥 Event handlers added
## 🔥              request: 1

Predefined routes

Lastly, routr comes with a few predefined routes, which I will briefly mention: The ressource_route maps files on the server to handlers. If you wish to serve static content in some way, this facilitates it, and takes care of a lot of HTTP header logic such as caching. It will also automatically serve compressed files if they exist and the client accepts them:

static_route <- ressource_route('/' = system.file(package = 'routr'))
router$add_route(static_route, 'static', after = 1)
router

## A RouteStack containing 4 routes
## 1: request_prep
## 2: static
## 3: app_logic
## 4: response_finish

Now, you can get the package description file by visiting /DESCRIPTION. If a file is found it will return FALSE in order to simply return the file. If nothing is found it will return TRUE so that other routes can decide what to do.

If you wish to limit the size of requests, you can use the sizelimit_route and e.g. attach it to the header event in a fiery app, so that requests that are too big will get rejected before the body is fetched.

sizelimit <- sizelimit_route(10 * 1024^2) # 10 mb
reject_router <- RouteStack$new(size = sizelimit)
reject_router$attach_to <- 'header'
app$attach(reject_router)
app

## 🔥 A fiery webserver
## 🔥  💥   💥   💥
## 🔥           Running on: 127.0.0.1:8080
## 🔥     Plugins attached: request_routr
## 🔥                       header_routr
## 🔥 Event handlers added
## 🔥               header: 1
## 🔥              request: 1

Wrapping up

As I started by saying, the release of routr marks a point of maturity for my fiery ecosystem. I’m extremely happy with this, but it is in no way the end of development. I will pivot to working on more specialized plugins now concerned with areas such as security and scalability, but the main approach to building fiery server side logic is now up and running – I hope you’ll take it for a spin.


To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist - R posts.


Understanding gender roles in movies with text mining


(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.

The R code behind this analysis is publicly available on GitHub.

I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


How to Create an Online Choice Simulator


(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Choice Simulator

Choice model simulator

What is a choice simulator?

A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.

Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.


How to create a choice simulator

In this post, I show how to create an online choice simulator, with the calculations done using R and the simulator hosted in Displayr.


Step 1: Import the model(s) results

First of all, choice simulators are based on models. So, the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).

If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then be easily automatically weighted and/or filtered. If you have case level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.

The table below shows estimated parameters of respondents from a discrete choice experiment of the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).


Step 2: Simplify calculations using variable sets

Variable sets are a novel and very useful aspect of Displayr: they are groups of related variables. We can simplify the calculations of a choice simulator by using variable sets, with one variable set for each attribute.

Variable sets in data tree

In this step, we group the variables for each attribute into separate variable sets, so that they appear as shown on the right. This is done as follows:

  1. If the variables are already grouped into a variable set, select the variable set, and select Data Manipulation > Split (Variables). In the dataset that I am using, all the variables I need for my calculation are already grouped into a single variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM, so I click on this and split it.
  2. Next, select the first attribute’s variables. In my example, this is the four variables that start with Weight:, each of which represents the respondent-level parameters for different egg weights. (The first of these contains only 0s, as dummy coding was used.)
  3. Then, go to Data Manipulation > Combine (Variables).
  4. Next set the Label for the new variable set to something appropriate. For reasons that will become clearer below, it is preferable to set it to a single, short word. For example, Weight.
  5. Set the Label field for each of the variables to whatever label you plan to show in the choice simulator. For example, if you want people to be able to choose an egg weight of 55g (about 2 ounces), set the Label to 55g.
  6. Finally, repeat this process for all the attributes. If you have any numeric attributes, then leave these as a single variable, like Price in the example here.

Step 3: Create the controls

Choice model simulation inputs

In my choice simulator, I have separate columns of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them:

  1. Insert > Control (More).
  2. Type the levels, separated by semi-colons, into the Item list. These must exactly match the labels that you entered for the first attribute in point 5 of the previous step. For example: 55g; 60g; 65g; 70g. I recommend using copy and paste, because any typos will be difficult to track down. Where you have a numeric attribute, such as Price in the example, enter the range of values that you wish the user to be able to choose from (e.g., 1.50; 2.00; 2.50; 3.00; 3.50; 4.00; 4.50; 5.00).
  3. Select the Properties tab in the Object Inspector and set the Name of the control to whatever you set as the Label for the corresponding variable set with the number 1 affixed at the end. For example, Weight.1 (You can use any label, but following this convention will save you time later on.)
  4. Click on the control and select the first level. For example, 55g.
  5. Repeat these steps until you have created controls for each of the attributes, each under each other, as shown above.
  6. Select all the controls that you have created, and then select Home >Copy and Home > Paste, and move the new set of labels to the right of the previous labels. Repeat this for as many sets of alternatives as you wish to include. In my example, there are four alternatives.
  7. Finally, add labels for the brands and attributes: Insert > TextBox (Text and Images).

See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.


Step 4: Calculate preference shares

  1. Insert an R Output (Insert > R Output (Analysis)), setting it to Automatic with the appropriate code, and positioning it underneath the first column of combo boxes. Press the Calculate button, and it should calculate the share for the first alternative. If you paste the code below and everything is set up properly, you will get a value of 25%.
  2. Now, click on the R Output you just created, and copy-and-paste it. Position the new version immediately below the second column of combo boxes.
  3. Modify the very last line of code, replacing [1] with [2], which tells it to show the results of the second alternative.
  4. Repeat steps 2 and 3 for alternatives 3 and 4.

The code below can easily be modified for other models. A few key aspects of the code:

  • It works with four alternatives and is readily modified to deal with different numbers of alternatives.
  • The formulas for the utility of each alternative are expressed as simple mathematical expressions. Because I was careful with the naming of the variable sets and the controls, they are easy to read. If you are using Displayr, you can hover over the various elements of the formula and you will get a preview of their data.
  • The code is already setup to deal with weights. Just click on the R Output that contains the formula and apply a weight (Home > Weight).
  • It is set up to automatically deal with any filters. More about this below.
R Code to paste:
# Computing the utility for each alternative
u1 = Weight[, Weight.1] + Organic[, Organic.1] + Charity[, Charity.1] +
     Quality[, Quality.1] + Uniformity[, Uniformity.1] + Feed[, Feed.1] +
     Price * as.numeric(gsub("\\$", "", Price.1))
u2 = Weight[, Weight.2] + Organic[, Organic.2] + Charity[, Charity.2] +
     Quality[, Quality.2] + Uniformity[, Uniformity.2] + Feed[, Feed.2] +
     Price * as.numeric(gsub("\\$", "", Price.2))
u3 = Weight[, Weight.3] + Organic[, Organic.3] + Charity[, Charity.3] +
     Quality[, Quality.3] + Uniformity[, Uniformity.3] + Feed[, Feed.3] +
     Price * as.numeric(gsub("\\$", "", Price.3))
u4 = Weight[, Weight.4] + Organic[, Organic.4] + Charity[, Charity.4] +
     Quality[, Quality.4] + Uniformity[, Uniformity.4] + Feed[, Feed.4] +
     Price * as.numeric(gsub("\\$", "", Price.4))

# Computing preference shares
utilities = as.matrix(cbind(u1, u2, u3, u4))
eutilities = exp(utilities)
shares = prop.table(eutilities, 1)

# Filtering the shares, if a filter is applied.
shares = shares[QFilter, ]

# Filtering the weight variable, if required.
weight = if (is.null(QPopulationWeight)) rep(1, length(u1)) else QPopulationWeight
weight = weight[QFilter]

# Computing shares for the total sample
shares = sweep(shares, 1, weight, "*")
shares = as.matrix(apply(shares, 2, sum))
shares = 100 * prop.table(shares, 2)[1]
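To see what the share calculation is doing, here is a toy multinomial-logit computation for a single respondent with made-up utilities (the numbers are illustrative only, not from the egg study):

```r
# Made-up utilities for four alternatives
u <- c(1.2, 0.4, -0.3, 0.4)

# Preference share of alternative i is exp(u_i) / sum_j exp(u_j)
shares <- exp(u) / sum(exp(u))
round(100 * shares, 1)
#> [1] 47.1 21.2 10.5 21.2
```

The code above applies exactly this formula per respondent (via prop.table(eutilities, 1)) and then averages over respondents, optionally weighted and filtered.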

Step 5: Make it pretty

If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop, and then added using Insert > Image.


Step 6: Add filters

If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.


Step 7: Share

To share the dashboard, go to the Export tab in the ribbon (at the top of the screen), and click on the black triangle under the Web Page button. Next, check the option for Hide page navigation on exported page and then click Export… and follow the prompts.

Note, the URL for the choice simulator I am using in this example is https://app.displayr.com/Dashboard?id=21043f64-45d0-47af-9797-cd4180805849. This URL is public. For security reasons, it cannot be guessed or found by web-searching. If, however, you give the URL to someone, then they can access the document. Alternatively, if you have an annual Displayr account, you can instead go into Settings for the document (the cog at the top-right of the screen) and press Disable Public URL. This will limit access to only people who are set up as users for your organization. You can set up people as users in the company’s Settings, accessible by clicking on the cog at the top-right of the screen. If you don’t see these settings, contact support@displayr.com to buy a license.


Worked example of a choice simulator

You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.


Some Neat New R Notations


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package seplyr supplies a few neat new coding notations.


An Abacus, which gives us the term “calculus.”

The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:

library("seplyr")

names <- c('a', 'b')
names := c('x', 'y')
#>   a   b 
#> "x" "y"

This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand-sides are concrete values even if not quoted).
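For comparison, the base-R idiom this replaces (per the post’s reference to stats::setNames()) would be:

```r
# Same result using setNames(); the names on the left can come from a variable
keys <- c('a', 'b')
setNames(c('x', 'y'), keys)
#>   a   b 
#> "x" "y"
```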

A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with R‘s “c()” combine/concatenate function. That is: c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly, this means the two operations play well with each other.

The second notation is an operator called “anonymous function builder“. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types).

The function construction is written as: “variables := { code }” (the braces are required) and the semantics are roughly the same as “function(variables) { code }“. This is derived from some of the work of Konrad Rudolph who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).

This notation allows us to write the squares of 1 through 4 as:

sapply(1:4, x:={x^2})

instead of writing:

sapply(1:4, function(x) x^2)

It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be to directly use lambda-calculus notation such as “(λx.x^2)“. In the development version of seplyr we are experimenting with the following additional notations:

sapply(1:4, lambda(x)(x^2))
sapply(1:4, λ(x, x^2))

(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Onboarding visdat, a tool for preliminary visualisation of whole dataframes


(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Take a look at the data

This is a phrase that comes up when you first get a dataset.

It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?

Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix-ups of data types: height in cm coded as a factor, categories that are numerics with decimals, strings that are really datetimes, and somehow a datetime that is one long number. And let's not forget everyone's favourite: missing data.

These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.

The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to "get a look at the data".

Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more.

  • I felt like the code was a little sloppy, and that it could be better.
  • I wanted to know whether others found it useful.

What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.

Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.

rOpenSci onboarding basics

Onboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.

What's in it for the author?

  • Feedback on your package
  • Support from rOpenSci members
  • Maintain ownership of your package
  • Publicity from it being under rOpenSci
  • Contribute something to rOpenSci
  • Potentially a publication

What can rOpenSci do that CRAN cannot?

The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:

  • Assess documentation readability / usability
  • Provide a code review to find weak points / points of improvement
  • Determine whether a package is overlapping with another.

So I submitted visdat to the onboarding process. For me, I did this for three reasons.

  1. So visdat could become a better package
  2. Pending acceptance, I would get a publication in JOSS
  3. I get to contribute back to rOpenSci

Submitting the package was actually quite easy – you go to submit an issue on the onboarding page on GitHub, and it provides a magical template for you to fill out 2, with no submission gotchas – this could be the future 3. Within 2 days of submitting the issue, I had a response from the editor, Noam Ross, and two reviewers assigned, Mara Averick, and Sean Hughes.

I submitted visdat and waited, somewhat apprehensively. What would the reviewers think?

In fact, Mara Averick wrote a post: "So you (don't) think you can review a package" about her experience evaluating visdat as a first-time reviewer.

Getting feedback

Unexpected extras from the review

Even before the review started officially, I got some great concrete feedback from Noam Ross, the editor for the visdat submission.

  • Noam used the goodpractice package, to identify bad code patterns and other places to immediately improve upon in a concrete way. This resulted in me:
    • Fixing error prone code such as using 1:length(...), or 1:nrow(...)
    • Improving testing using the visualisation testing software vdiffr
    • Reducing long code lines to improve readability
    • Defining global variables to avoid a NOTE ("no visible binding for global variable")

So before the review even started, visdat was in better shape, with 99% test coverage and clearance from goodpractice.
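The 1:length(...) pattern flagged by goodpractice is worth a quick illustration; it misbehaves on empty inputs, which is exactly the kind of edge case that slips past casual testing:

```r
x <- integer(0)

1:length(x)   # 1:0 counts *down*: c(1, 0), so a loop over it runs twice on empty input
seq_along(x)  # integer(0): the loop correctly never runs

# similarly, seq_len(nrow(df)) is the safe replacement for 1:nrow(df)
seq_len(nrow(mtcars))
```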

The feedback from reviewers

I received prompt replies from the reviewers, and I got to hear really nice things like "I think visdat is a very worthwhile project and have already started using it in my own work.", and "Having now put it to use in a few of my own projects, I can confidently say that it is an incredibly useful early step in the data analysis workflow. vis_miss(), in particular, is helpful for scoping the task at hand …". In addition to these nice things, there was also great critical feedback from Sean and Mara.

A common thread in both reviews was that the way I initially had visdat set up was to have the first row of the dataset at the bottom left, and the variable names at the bottom. However, this doesn't reflect what a dataframe typically looks like, with the names of the variables at the top, and the first row also at the top. There were also suggestions to add the percentage of missing data in each column.

On the left are the old vis_dat and vis_miss plots, and on the right are the new ones.

Changing this makes the plots make a lot more sense, and read better.

Mara made me aware of the warning and error messages that I had let crop up in the package. This was something I had grown to accept – the plot worked, right? But Mara pointed out that from a user perspective, seeing these warnings and messages can be a negative experience for the user, and something that might stop them from using it – how do they know if their plot is accurate with all these warnings? Are they using it wrong?

Sean gave practical advice on reducing code duplication, explaining how to write a general construction method to prepare the data for the plots. Sean also explained how to write C++ code to improve the speed of vis_guess().

From both reviewers I got nitty-gritty feedback about my writing – places where the documentation was just a bunch of notes I made, or where I had reversed the order of a statement.

What did I think?

I think that getting feedback in general on your own work can be a bit hard to take sometimes. We get attached to our ideas, we've seen them grow from little thought bubbles all the way to "all growed up" R packages. I was apprehensive about getting feedback on visdat. But the feedback process from rOpenSci was, as Tina Turner put it, "simply the best".

Boiling the onboarding review process down to a few key points, I would say it is transparent, friendly, and thorough.

Having the entire review process on GitHub means that everyone is accountable for what they say, and means that you can track exactly what everyone said about it in one place. No email chain hell with (mis)attached documents, accidental reply-alls or single replies. The whole internet is cc'd in on this discussion.

Being an rOpenSci initiative, the process is incredibly friendly and respectful of everyone involved. Comments are upbeat, but also, importantly, thorough, providing constructive feedback.

So what does visdat look like?

library(visdat)
vis_dat(airquality)

visdat-example

This shows us a visual analogue of our data: the variable names are shown at the top, and the class of each variable is shown, along with where data are missing.

You can focus in on missing data with vis_miss()

vis_miss(airquality)

vis-miss-example

This shows only missing and present information in the data. Going beyond vis_dat(), it shows the percentage of missing data for each variable and also the overall amount of missing data. vis_miss() will also indicate when a dataset has no missing data at all, or only a very small percentage.

The future of visdat

There are some really exciting changes coming up for visdat. The first is a plotly version of all of the figures, providing useful tooltips and interactivity. Further down the track, we want to include the idea of visualising expectations, where the user can search their data for particular things, such as characters like "~", values like -99 or -0, or conditions like "x > 101", and visualise the matches. Another idea is to make it easy to visually compare two dataframes of differing size. We also want to provide consistent palettes for particular datatypes: for example, character, numeric, integer, and datetime columns would each get their own (consistently different) colours.
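To make the "visualising expectations" idea concrete, here is a plain base-R sketch of locating suspicious values. This is not visdat's API (that feature does not exist yet), just an illustration of the concept on a hypothetical toy dataframe:

```r
# toy data with two "suspicious" cells: a -99 sentinel and a "~" placeholder
df <- data.frame(x = c(1, -99, 3),
                 y = c("a", "~", "b"),
                 stringsAsFactors = FALSE)

# row/column positions of cells matching the values we are wary of
which(df == -99 | df == "~", arr.ind = TRUE)
```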

I am very interested to hear how people use visdat in their work, so if you have suggestions or feedback I would love to hear from you! The best way to leave feedback is by filing an issue, or perhaps sending me an email at nicholas [dot] tierney [at] gmail [dot] com.

The future of your R package?

If you have an R package you should give some serious thought about submitting it to the rOpenSci through their onboarding process. There are very clear guidelines on their onboarding GitHub page. If you aren't sure about package fit, you can submit a pre-submission enquiry– the editors are nice and friendly, and a positive experience awaits you!


  1. CRAN is an essential part of what makes the r-project successful and certainly without CRAN R simply would not be the language that it is today. The tasks provided by the rOpenSci onboarding require human hours, and there just isn't enough spare time and energy amongst CRAN managers. 

  2. Never used GitHub? Don't worry, creating an account is easy, and the template is all there for you. You provide very straightforward information, and it's all there at once. 

  3. With some journals, the submission process means you aren't always clear what information you need ahead of time. Gotchas include things like "what is the residential address of every co-author", or getting everyone to sign a copyright notice. 


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


So you (don’t) think you can review a package


(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Contributing to an open-source community without contributing code is an oft-vaunted idea that can seem nebulous. Luckily, putting vague ideas into action is one of the strengths of the rOpenSci Community, and their package onboarding system offers a chance to do just that.

This was my first time reviewing a package, and, as with so many things in life, I went into it worried that I'd somehow ruin the package-reviewing process — not just the package itself, but the actual onboarding infrastructure…maybe even rOpenSci on the whole.

Barring the destruction of someone else's hard work and/or an entire organization, I was fairly confident that I'd have little to offer in the way of useful advice. What if I have absolutely nothing to say other than, yes, this is, in fact, a package?!

rOpenSci package review: what I imagined

So, step one (for me) was: confess my inadequacies and seek advice. It turns out that much of the advice vis-à-vis how to review a package is baked right into the documents. The reviewer template is a great trail map, the utility of which is fleshed out in the rOpenSci Package Reviewing Guide. Giving these a thorough read, and perusing a recommended review or two (links in the reviewing guide) will probably have you raring to go. But, if you're feeling particularly neurotic (as I almost always am), the rOpenSci onboarding editors and larger community are endless founts of wisdom and resources.

visdat📦👀

I knew nothing about Nicholas Tierney's visdat package prior to receiving my invitation to review it. So the first (coding-y) thing I did was play around with it in the same way I do for other cool R packages I encounter. This is a totally unstructured mish-mash of running examples, putting my own data in, and seeing what happens. In addition to being amusing, it's a good way to sort of "ground-truth" the package's mission, and make sure there isn't some super helpful feature that's going unsung.

If you're not familiar with visdat, it "provides a quick way for the user to visually examine the structure of their data set, and, more specifically, where and what kinds of data are missing."1 With early-stage EDA (exploratory data analysis), you're really trying to get a feel of your data. So, knowing that I couldn't be much help in the "here's how you could make this faster with C++" department, I decided to fully embrace my role as "naïve user".2

Questions I kept in mind as ~myself~ resident naïf:

  • What did I think this thing would do? Did it do it?
  • What are things that scare me off?

The latter question is key, and, while I don't have data to back this up, can be a sort of "silent" usability failure when left unexamined. Someone who tinkers with a package, but finds it confusing doesn't necessarily stop to give feedback. There's also a pseudo curse-of-knowledge component. While messages and warnings are easily parsed, suppressed, dealt with, and/or dismissed by the veteran R user/programmer, unexpected, brightly-coloured text can easily scream Oh my gosh you broke it all!! to those with less experience.

Myriad lessons learned 💡

I can't speak for Nick per the utility or lack thereof of my review (you can see his take here), but I can vouch for the package-reviewing experience as a means of methodically inspecting the innards of an R package. Methodical is really the operative word here. Though "read the docs," or "look at the code" sounds straight-forward enough, it's not always easy to coax oneself into going through the task piece-by-piece without an end goal in mind. While a desire to contribute to open-source software is noble enough (and is how I personally ended up involved in this process, with some help/coaxing from Noam Ross), it's also an abstraction that can leave one feeling overwhelmed, and not knowing where to begin.3

There are also self-serving bonus points that one simply can't avoid, should you go the rOpenSci-package-reviewing route– especially if package development is new to you.4 Heck, the package reviewing guide alone was illuminating.

Furthermore, the wise-sage 🦉 rOpenSci onboarding editors5 are excellent matchmakers, and ensure that you're actually reviewing a package authored by someone who wants their package to be reviewed. This sounds simple enough, but it's a comforting thought to know that your feedback isn't totally unsolicited.


  1. Yes, I'm quoting my own review. 

  2. So, basically just playing myself… Also I knew that, if nothing more, I can proofread and copy edit. 

  3. There are lots of good resources out there re. overcoming this obstacle, though (e.g. First Timers Only; or Charlotte Wickham's Collaborative Coding from useR!2017 is esp. 👍 for the R-user). 

  4. OK, so I don't have a parallel world wherein a very experienced package-developer version of me is running around getting less out of the process, but if you already deeply understand package structure, you're unlikely to stumble upon quite so many basic "a-ha" moments. 

  5. 👋Noam Ross, Scott Chamberlain, Karthik Ram, & Maëlle Salmon 


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


Caching httr Requests? This means WAR[C]!


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are jwatjars (the JAR files) and jwatr (the working package).

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame':	67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                  "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                  "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type           "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                   "response", "response", "response"
## $ content_length              2984, 511564, 688
## $ payload_type                "application/json", "application/json", "application/json"
## $ profile                     NA, NA, NA
## $ date                        2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code            200, 200, 200
## $ http_protocol_content_type  "application/json", "application/json", "application/json"
## $ http_version                "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers            [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id              "", "",...
## $ payload                     [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"                                   
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02" 

The URLs are all there, so it will be easier to map the original calls to them.
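One consequence of having target_uri in the tibble is that the cached responses can be indexed by URL. A small sketch building on the xdf object above (this indexing helper is my own, not part of jwatr):

```r
# name the raw payloads by the URL that produced them (assumes xdf from above)
cache <- setNames(xdf$payload, xdf$target_uri)

# replay a "request" from the local cache instead of hitting the API again
neighbourhoods <- jsonlite::fromJSON(
  readBin(cache[["https://data.police.uk/api/leicestershire/neighbourhoods"]],
          "character")
)
```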

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of  9 variables:
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of  3 variables:
##   ..$ latitude : chr  "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of  2 variables:
##   ..$ longitude: chr  "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr  "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of  2 variables:
##   ..$ category: chr  NA NA NA NA ...
##   ..$ date    : chr  NA NA NA NA ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ month           : chr  "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of  9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of  3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of  2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of  2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But, I wanted to toss this out there to get feedback on the idiom and what possible desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.



Gender roles in film direction, analyzed with R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things.

[Chart: words that follow "she" vs. "he" in stage directions]

This is all based on an analysis of almost 2,000 film scripts mostly from 1990 and after. The words come from pairs of words beginning with "he" and "she" in the stage directions (but not the dialogue) in the screenplays — directions like "she snuggles up to him, strokes his back" and "he straps on a holster under his sealskin cloak". The essay also includes an analysis of words by the writer and character's gender, and includes lots of lovely interactive elements (including the ability to see examples of the stage directions).

The analysis, including the chart above, was created using the R language, and the R code is available on GitHub. The screenplay analysis makes use of the tidytext package, which simplifies the process of handling the text-based data (the screenplays), extracting the stage directions, and tabulating the word pairs.
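As a rough sketch of the tidytext approach (the real analysis works on the full screenplay corpus; the two stage directions below are just the examples quoted above, used as a hypothetical mini-corpus):

```r
library(dplyr)
library(tidytext)

directions <- tibble::tibble(
  text = c("she snuggles up to him, strokes his back",
           "he straps on a holster under his sealskin cloak")
)

# split the text into word pairs (bigrams) and keep those led by a pronoun
directions %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(grepl("^(he|she) ", bigram)) %>%
  count(bigram, sort = TRUE)
```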

You can find the complete essay linked below, and it's well worth checking out to experience the interactive elements.

ThePudding: She Giggles, He Gallops


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


useR!2017 Roundup


(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

Organising useR!2017 was a challenge but a very rewarding experience. With about 1200 attendees of over 55 nationalities exploring an interesting program, we believe it is appropriate to call it a success – something the aftermovie only seems to confirm.

Behind the Scenes

To give you a glimpse behind the scenes of the conference organization, Maxim Nazarov held a lightning talk on ‘redmineR and the story of automating useR!2017 abstract review process’

You can find the R package on the Open Analytics Github and slides are available here.

Laure Cougnaud presented during the useR! Newbies session in a talk called ‘Making the most of useR!’ and assisted the newbies throughout the conference as a conference buddy. She also served as the chair of the Bioinformatics I session.

In spite of recent appearances, Open Analytics does more than organize useR! conferences and, as a platinum sponsor, Tobias Verbeke had the opportunity to present Open Analytics in a sponsorship talk.

Open Analytics offers its services in four different service lines:

  • statistical consulting,
  • scientific programming,
  • application development & integration and
  • data analysis hardware & hosting.

The talks our consultants contributed can be nicely laid out along these service lines.

Statistical Consulting

On the methodological side (statistical consulting) Kathy Mutambanengwe held a talk on ‘A restricted composite likelihood approach to modelling Gaussian geostatistical data’.

Adriaan Blommaert and Nicolas Sauwen co-authored the lightning talk by Tor Maes on Multivariate statistics for PAT data analysis: short overview of existing R packages and methods. Finally, the poster session presented work by Machteld Varewyck on Analyzing Digital PCR Data in R, and Rytis Bagdziunas presented a poster on BIGL: assessing and visualizing drug synergy.

Scientific Programming

In the scientific programming area, Nicolas Sauwen held a talk on the ‘Differentiation of brain tumor tissue using hierarchical non-negative matrix factorization’

The application to differentiation of brain tumor tissue is an interesting case, but the hNMF method is currently the fastest NMF implementation and can be put to use in many other contexts of unsupervised learning. If interested, the discussed hNMF package can be found on CRAN.

Kirsten Van Hoorde held a lightning talk on the ‘R (‘template’) package to automate analysis workflow and reporting’,

co-authored by Laure Cougnaud. For an example package demonstrating the approach, please see the Open Analytics Github– slides can be found here.

Application Development and Integration

Regarding application development and integration, Marvin Steijaert and Griet Laenen held a long talk on ‘Biosignature-Based Drug Design: from high dimensional data to business impact’ demonstrating how machine learning is put to use to design drugs (slides here).

Data Analysis Hardware and Hosting

Regarding hosting of data analysis applications, Tobias Verbeke held a well-attended talk on ShinyProxy, a fully open source product that allows you to run Shiny apps at scale in an enterprise context.

All information can be found on the ShinyProxy website and sources are on Github.

We hope you enjoyed the conference as much as we did. Let us know if you have any questions or comments on these talks or on Open Analytics and its services offer.

By popular demand: here you can find the source code of the Poissontris game – that other Shiny app 😉

Poissontris screenshot


To leave a comment for the author, please follow the link and comment on their blog: Open Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Simple practice: data wrangling the iris dataset


(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

In last week's post, I emphasized the importance of practicing R and the Tidyverse with small, simple problems, drilling them until you are competent.

In that post, I gave you a few very small scripts to practice (which I suggest that you memorize).

This week, I want to give you another small example. We’re going to clean up the iris dataset.

More specifically, we’re going to:

  1. Coerce the iris dataset from an old-school data frame into a tibble.
  2. Rename the variables, such that the characters are lower case, and such that “snake case” is applied in place of periods.

Like last week, this is a very simple example. However, as I've mentioned before, this is the sort of small task that you'll need to be able to execute fluidly if you want to work on larger projects.

If you want to do large, complex analyses, it really pays to first master techniques on a small scale using much simpler datasets.

Ok, let’s dive in.

First, let’s take a look at the complete block of code.

library(tidyverse)
library(stringr)


#------------------
# CONVERT TO TIBBLE
#------------------
# – the iris dataframe is an old-school dataframe
#   ... this means that by default, it prints out
#   large numbers of records.
# - By converting to a tibble, functions like head()
#   will print out a small number of records by default

df.iris <- as_tibble(iris)


#-----------------
# RENAME VARIABLES
#-----------------
# - here, we're just renaming these variables to be more
#   consistent with common R coding "style"
# - We're changing all characters to lower case
#   and changing variable names to "snake case"

colnames(df.iris) <- df.iris %>%
  colnames() %>%
  str_to_lower() %>%
  str_replace_all("\\.","_")

# INSPECT

df.iris %>% head()

What have we done here? We’ve combined several discrete functions of the Tidyverse together in order to perform a small amount of data wrangling.

Specifically, we’ve turned the iris dataset into a tibble, and we’ve renamed the variables to be more consistent with modern R code standards and naming conventions.

This example is quite simple, but useful. This is the sort of small task that you’ll need to be able to do in the context of a large analysis.

Breaking down the script

To make this a little clearer, let’s break this down into its component parts.

In the section where we renamed the variables, we only used three core functions:

  • colnames()
  • str_to_lower()
  • str_replace_all()

Each of these individual pieces is pretty straightforward.

We are using colnames() to retrieve the column names.

Then, we pipe the output into the stringr function str_to_lower() to convert all the characters to lower case.

Next, we use str_replace_all() to replace the periods (“.”) with underscores (“_”). This effectively transforms the variable names to “snake case.” (Keep in mind that str_replace_all() uses regular expressions. You have learned regular expressions, right?)
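Because the pattern argument is a regular expression, the period must be escaped; a bare "." would match any character. Here is a minimal, self-contained sketch of this one step, run on the iris column names directly:

```r
library(stringr)

# The default iris column names, for illustration
nms <- c("Sepal.Length", "Petal.Width")

# "\\." matches a literal period; an unescaped "." would match ANY character
str_replace_all(nms, "\\.", "_")
#> [1] "Sepal_Length" "Petal_Width"
```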

Finally, using the assignment operator (at the upper left-hand side of the code), we assign the resulting transformed column names back to the tibble via colnames(df.iris).

I will point out that we have used these functions in a "waterfall" pattern; we have combined them using the pipe operator, %>%, such that the output of one step becomes the immediate input for the next step. This is a key feature of the Tidyverse: we can combine very simple functions in new ways to accomplish tasks. This might not seem like a big deal, but it is extremely powerful. The modular nature of the Tidyverse functions, when used with the pipe operator, makes the Tidyverse flexible and syntactically powerful, while allowing the code to remain clear and easy to read.
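To see what the "waterfall" buys you, compare the piped form with the equivalent nested call. A small sketch using the same stringr functions (magrittr supplies %>% if you haven't loaded the tidyverse):

```r
library(magrittr)  # provides %>% (also re-exported by the tidyverse)
library(stringr)

x <- c("Sepal.Length", "Petal.Width")

# Piped form: read top to bottom
piped <- x %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_")

# Equivalent nested form: read inside out
nested <- str_replace_all(str_to_lower(x), "\\.", "_")

identical(piped, nested)
#> [1] TRUE
```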

A test of skill: can you write this fluently?

The functions that we just used are all critical for doing data science in R. With that in mind, this script is a good test of your skill: can you write code like this fluently, from memory?

That should be your goal.

To get there, you need to know how the individual functions work, which means studying them. But to put them into practice, you need to drill. So after you understand how the functions work, drill each one until you can write it from memory. Next, drill small scripts (like the one in this blog post). You ultimately want to be able to "put the pieces together" quickly and seamlessly in order to solve problems and get things done.

I’ve said it before: if you want a great data science job, you need to be one of the best. If you want to be one of the best, you need to master the toolkit. And to master the toolkit, you need to drill.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to practice.

You need to know what to practice, and you need to know how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons. You’ll learn:

  • What data science tools you should learn (and what not to learn)
  • How to practice those tools
  • How to put those tools together to execute analyses and machine learning projects
  • … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW

The post Simple practice: data wrangling the iris dataset appeared first on SHARP SIGHT LABS.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.


Rcpp now used by 10 percent of CRAN packages


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

10 percent of CRAN packages

Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of 11275 packages. The graph on the left shows the growth of both outright usage numbers (in darker blue, left axis) and relative usage (in lighter blue, right axis).

Older posts on this blog took note when Rcpp passed each round hundred of packages, most recently in April for 1000 packages. The growth rates for both Rcpp, and of course CRAN, are still staggering. A big thank you to everybody who makes this happen, from R Core and CRAN to all package developers, contributors, and of course all users driving this. We have built ourselves a rather impressive ecosystem.

So with that a heartfelt Thank You! to all users and contributors of R, CRAN, and of course Rcpp, for help, suggestions, bug reports, documentation, encouragement, and, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Going Bayes #rstats


(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Some time ago I started working with Bayesian methods, using the great rstanarm package. Besides the fantastic package vignettes, and books like Statistical Rethinking or Doing Bayesian Data Analysis, I also found the resources from Tristan Mahr helpful for better understanding both Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis into my packages as well.

Due to the latest tidyr update, I had to update some of my packages in order to make them work again, so – besides some other features – some Bayes stuff is now available in my packages on CRAN.

Finding shape or location parameters from distributions

The following functions are included in the sjstats-package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta(), find_normal() or find_cauchy() help finding the parameters for a distribution. Taking the example from here, the plot indicates that the mean value for the normal distribution is somewhat above 50. We can find the exact parameters with find_normal(), using the information given in the text:

library(sjstats)

find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8)
#> $mean
#> [1] 53.78387
#>
#> $sd
#> [1] 30.48026

High Density Intervals for MCMC samples

The hdi()-function computes the high density interval for posterior samples. This is nothing special, since there are other packages with such functions as well – however, you can use this function not only on vectors, but also on stanreg-objects (i.e. the results from models fitted with rstanarm). And, if required, you can also transform the HDI-values, e.g. if you need these intervals on an exponentiated scale.

library(rstanarm)

fit <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 1)
hdi(fit)
#>          term   hdi.low  hdi.high
#> 1 (Intercept) 32.158505 42.341421
#> 2          wt -6.611984 -4.022419
#> 3          am -2.567573  2.343818
#> 4       sigma  2.564218  3.903652

# fit logistic regression model
fit <- stan_glm(
  vs ~ wt + am,
  data = mtcars,
  family = binomial("logit"),
  chains = 1
)
hdi(fit, prob = .89, trans = exp)
#>          term      hdi.low     hdi.high
#> 1 (Intercept) 4.464230e+02 3.725603e+07
#> 2          wt 6.667981e-03 1.752195e-01
#> 3          am 8.923942e-03 3.747664e-01

Marginal effects for rstanarm-models

The ggeffects-package creates tidy data frames of model predictions, which are ready to use with ggplot (though there's a plot()-method as well). ggeffects supports a wide range of models, and makes it easy to plot marginal effects for specific predictors, including interaction terms. In the past updates, support for more model types was added, for instance polr (pkg MASS), hurdle and zeroinfl (pkg pscl), betareg (pkg betareg), truncreg (pkg truncreg), coxph (pkg survival) and stanreg (pkg rstanarm).

ggpredict() is the main function that computes marginal effects. Predictions for stanreg models are based on the posterior distribution of the linear predictor (posterior_linpred()), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()) for inference and model checking, and you can do so using the ppd argument when calling ggpredict(). However, especially for binomial or Poisson models, it is harder (and much slower) to compute the "confidence intervals". That's why relying on posterior_linpred() is the default for stanreg models with ggpredict().

Here is an example with two plots, one without raw data and one including data points:

library(sjmisc)
library(rstanarm)
library(ggeffects)

data(efc)

# make categorical
efc$c161sex <- to_label(efc$c161sex)

# fit model
m <- stan_glm(neg_c_7 ~ c160age + c12hour + c161sex, data = efc)

dat <- ggpredict(m, terms = c("c12hour", "c161sex"))
dat
#> # A tibble: 128 x 5
#>        x predicted conf.low conf.high  group
#>  1     4  10.80864 10.32654  11.35832   Male
#>  2     4  11.26104 10.89721  11.59076 Female
#>  3     5  10.82645 10.34756  11.37489   Male
#>  4     5  11.27963 10.91368  11.59938 Female
#>  5     6  10.84480 10.36762  11.39147   Male
#>  6     6  11.29786 10.93785  11.61687 Female
#>  7     7  10.86374 10.38768  11.40973   Male
#>  8     7  11.31656 10.96097  11.63308 Female
#>  9     8  10.88204 10.38739  11.40548   Male
#> 10     8  11.33522 10.98032  11.64661 Female
#> # ... with 118 more rows

plot(dat)
plot(dat, rawdata = TRUE)

As you can see, if you work with labelled data, the model-fitting functions from the rstanarm package preserve all value and variable labels, making it easy to create annotated plots. The "confidence bands" are actually high density intervals, computed with the above-mentioned hdi()-function.

Next…

Next I will integrate ggeffects into my sjPlot-package, making sjPlot more generic and supporting more model types. Furthermore, sjPlot shall get a generic plot_model()-function which will replace former single functions like sjp.lm(), sjp.glm(), sjp.lmer() or sjp.glmer(). plot_model() should then produce a plot – marginal effects, forest plots, interaction terms and so on – and accept (m)any model class. This should help make sjPlot more convenient to work with, more stable and easier to maintain…

Tagged: Bayes, data visualization, ggplot, R, rstanarm, rstats, sjPlot, Stan


To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.


Basics of data.table: Smooth data exploration


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The data.table package provides perhaps the fastest way to do data wrangling in R. The syntax is concise and made to resemble SQL. After studying the basics of data.table and finishing this exercise set successfully, you will be able to start easing into using data.table for all your data manipulation needs.
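As a taste of that SQL-like syntax before the exercises, the general form is DT[i, j, by]: i filters rows, j computes, and by groups. A minimal sketch on a made-up toy table (not the Census data used below):

```r
library(data.table)

# Toy data, for illustration only
dt <- data.table(race = c("white", "white", "black"),
                 works = c(10, 40, 52))

# Roughly: SELECT AVG(works) FROM dt WHERE works > 20 GROUP BY race
dt[works > 20, .(mean_works = mean(works)), by = race]
```

Here dt[works > 20, ...] keeps two rows, and by = race returns one mean per remaining group.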

We will use data drawn from the 1980 US Census on married women aged 21–35 with two or more children. The data includes gender of first and second child, as well as information on whether the woman had more than two children, race, age and number of weeks worked in 1979. For more information please refer to the reference manual for the package AER.

Answers are available here.

Exercise 1 Load the data.table package. Furthermore, (install and) load the AER package and run the command data("Fertility"), which loads the dataset Fertility to your workspace. Turn it into a data.table object.

Exercise 2 Select rows 35 to 50 and print to console its age and work entry.

Exercise 3 Select the last row in the dataset and print to console.

Exercise 4 Count how many women proceeded to have a third child.

Learn more about the data.table package in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to

  • work with different data manipulation packages,
  • know how to import, transform and prepare your dataset for modelling,
  • and much more.

Exercise 5 There are four possible gender combinations for the first two children. Which is the most common? Use the by argument.

Exercise 6 By racial composition, what is the proportion of women working four weeks or less in 1979?

Exercise 7 Use %between% to get a subset of women aged between 22 and 24, and calculate the proportion who had a boy as their firstborn.

Exercise 8 Add a new column, age squared, to the dataset.

Exercise 9 Out of all the racial compositions in the dataset, which had the lowest proportion of boys as firstborn? With the same command, display the number of observations in each category as well.

Exercise 10 Calculate the proportion of women who have a third child, by gender combination of the first two children.


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Recreating and updating Minard with ggplot2


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Minard's chart depicting Napoleon's 1812 march on Russia is a classic of data visualization that has inspired many homages using different time-and-place data. If you'd like to recreate the original chart, or create one of your own, Andrew Heiss has created a tutorial on using the ggplot2 package to re-envision the chart in R:


The R script provided in the tutorial is driven by historical data on the location and size of Napoleon's armies during the 1812 campaign, but you could adapt the script to use new data as well. Andrew also shows how to combine the chart with a geographical or satellite map, which is how the cities appear in the version above (unlike in Minard's original). 


The data behind the Minard chart is available from Michael Friendly and you can find the R scripts at this Github repository. For the complete tutorial, follow the link below.

Andrew Heiss: Exploring Minard’s 1812 plot with ggplot2 (via Jenny Bryan)

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.



Sentiment analysis using tidy data principles at DataCamp


(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched!

The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. The course is organized into four case studies (one per chapter), and I don’t think it’s too much of a spoiler to say that I wear a costume for part of it. I’m just saying you should probably check out the course trailer.

Course description

Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in four case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

Learning objectives

  • Learn the principles of sentiment analysis from a tidy data perspective
  • Practice manipulating and visualizing text data using dplyr and ggplot2
  • Apply sentiment analysis skills to several real-world text datasets

Check the course out, have fun, and start practicing those text mining skills!


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


Digit fifth powers: Euler Problem 30


(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler problem 30 is another number-crunching problem that deals with numbers to the power of five. Two other Euler problems dealt with raising numbers to a power: the previous problem looked at permutations of powers, and problem 16 asks for the sum of the digits of 2^{1000}.

Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.

Euler Problem 30 Definition

Surprisingly, there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4

8208 = 8^4 + 2^4 + 0^4 + 8^4

9474 = 9^4 + 4^4 + 7^4 + 4^4

As 1 = 1^4 is not a sum, it is not included.

The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.

Proposed Solution

The problem asks for a brute-force solution, but we have a halting problem: how far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9 and 9^5=59049, so the digit fifth-power sum of a six-digit number is at most 6 \times 9^5=354294, itself a six-digit number, while a seven-digit number can sum to at most 7 \times 9^5=413343, which has only six digits. Hence 6 \times 9^5 is a safe endpoint for the loop. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.

largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
    power.sum <- 0
    i <- n
    while (i > 0) {
        d <- i %% 10
        i <- floor(i / 10)
        power.sum <- power.sum + d^5
    }
    if (power.sum == n) {
        print(n)
        answer <- answer + n
    }
}
print(answer)
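As a variation (not part of the original post), the digit extraction can also be vectorized over all candidates at once with integer division, removing the inner while loop. A sketch using the same 6 * 9^5 bound:

```r
# Vectorized sketch: accumulate digit fifth-power sums for every candidate
n <- 2:(6 * 9^5)
s <- numeric(length(n))  # running sums of fifth powers of digits
m <- n                   # working copy, repeatedly divided by 10
while (any(m > 0)) {
    s <- s + (m %% 10)^5
    m <- m %/% 10
}
sum(n[s == n])
#> [1] 443839
```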

View the most recent version of this code on GitHub.

The post Digit fifth powers: Euler Problem 30 appeared first on The Devil is in the Data.


To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.


Hard-nosed Indian Data Scientist Gospel Series – Part 1: Incertitude around Tools and Technologies


(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

Before the recession, a commercial tool was popular in the country, so there was little uncertainty around tools and technology; after the recession, however, incertitude (i.e., uncertainty) around tools and technology has preoccupied, and continues to occupy, data science learning, delivery and deployment.

While Python continued as a general programming language, R was the best remaining choice (it became more popular with the advent of an IDE, i.e. RStudio), and the author still sees its popularity among data scientists from non-programming backgrounds (i.e. other than computer scientists). Yet, in local meetups, panel discussions and webinars, the author still notices aspiring data scientists asking for clarity on which is better, as shown in the image below.

The author undertook several projects, courses and programs in data science for more than a decade; the views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details. Find more about the author at http://in.linkedin.com/in/pradeepmavuluri

To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.


Analyzing Google Trends Data in R


Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.

This post describes how you can use R to download data from Google Trends and then include it in a chart or other analysis. We'll discuss first how to get overall (global) data on a search term (query) and plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.


Analyzing Google Trends in R

I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked.

The following R code retrieves the global search history since 2004 for Blu-ray.

library(gtrendsR)
library(reshape2)
 
google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms; in this case, we have just one item. The second argument, gprop, is the medium searched on and can be any of web, news, images or youtube. The third argument, time, can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range, e.g. 2010-12-31 2011-06-30.

Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:

Sys.setenv(TZ = "UTC")

After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.


Plotting Google Trends data: Identifying seasonality and trends

Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.

Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the absolute volume of Google searches, but we can say that, as a proportion of all searches, Blu-ray was about half as frequent in June 2008 as in December 2008. An explanation of the Google Trends methodology is here.
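To make the scaling concrete, here is a toy illustration (with made-up counts) of the normalization Google applies: each point is divided by the series maximum and rescaled so the peak is 100.

```r
# Made-up search counts; Google reports each point relative to the peak
counts <- c(120, 300, 60)
round(counts / max(counts) * 100)
#> [1]  40 100  20
```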


Google Trends by geographic region

Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below.

google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]]

The new argument to gtrends is geo, which allows the user to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example CA and NZ in this case. We could also use region codes such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page.

An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).

library(gtrendsR)
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 Winter Olympics caused a notable spike in both countries, particularly Canada.


Create your own analysis

In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or create their own analysis (just sign into Displayr first).


Boston EARL Keynote speaker announcement: Tareef Kawaf


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mango Solutions are thrilled to announce that Tareef Kawaf, President of RStudio, will be joining us at EARL Boston as our third Keynote Speaker.

Tareef is an experienced software startup executive and a member of teams that built up ATG’s eCommerce offering and Brightcove’s Online Video Platform, helping both companies grow from early startups to publicly traded companies. He joined RStudio in early 2013 to help define its commercial product strategy and build the team. He is a software engineer by training, and an aspiring student of advanced analytics and R.

This will be Tareef's second time speaking at EARL Boston. We're big supporters of RStudio's mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment, so we're looking forward to him taking to the podium again this year.

Want to join Tareef at EARL Boston?

Speak

Abstract submissions close on 31 August, so time is running out to share your R adventures and innovations with fellow R users.

All accepted speakers receive a 1-day Conference pass and a ticket to the evening networking reception.

Submit your abstract here.

Buy a ticket

Early bird tickets are now available! Save more than $100 on a Full Conference pass.

Buy tickets here.


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

