
An Amazon SDK for R!?


[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers.]


Intro:

For a long time I found it difficult to appreciate the benefits of “cloud compute” in my R model builds, largely because I did not understand how to set up R in cloud compute environments. When I noticed that AWS was bringing out a new product, AWS Sagemaker, the possibilities it could provide seemed like a dream come true.

Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. Your models get to production faster with much less effort and lower cost. (https://aws.amazon.com/sagemaker/)

A question about AWS Sagemaker came to mind: does it work for R developers? Well… not exactly. True, it provides a simple way to set up an R environment in the cloud, but it doesn't give R the means to access other AWS products, for example AWS S3 and AWS Athena, out of the box. For Python this is not a problem: Amazon provides a Software Development Kit (SDK) for Python called boto3, which comes pre-installed on AWS Sagemaker.

It isn't all bad news: RStudio has developed a package called reticulate that lets R interface with Python, so using reticulate in combination with boto3 gives R full access to all AWS products from Sagemaker, just like Python. But are there any other methods for R users to connect to AWS?
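For completeness, here is a rough sketch of that reticulate + boto3 route (assuming boto3 is installed in the Python environment that reticulate points to and that AWS credentials are already configured; the bucket and prefix names are placeholders):

library(reticulate)

boto3 <- import("boto3")
s3 <- boto3$Session()$client("s3")

# list objects under a prefix, mirroring the boto3/paws comparison further down
obj <- s3$list_objects(Bucket = "mybucket", Prefix = "prefix_1/")
sapply(obj$Contents, function(x) x$Key)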

AWS interfaces for R:

paws an R SDK:

Paws is a Package for Amazon Web Services in R. Paws provides access to the full suite of AWS services from within R. (https://github.com/paws-r/paws)

When I want to connect to AWS I usually turn to Python. AWS's boto3 is an excellent means of connecting to AWS and exploiting its resources. However, R now has its own SDK for AWS: paws. This came as a little surprise to me, as I had started to accept that R might never have an SDK for AWS. How wrong I was.

What pleased me was how well developed and easy to use the package is. It felt natural to switch between boto3 and paws, almost like it was a long lost brother.

Here is a quick example comparing boto3 and paws, returning a list of all objects in S3 under a given prefix:

Python

import boto3

s3 = boto3.Session().client("s3")
obj = s3.list_objects(Bucket = 'mybucket', Prefix = "prefix_1/")
[x.get("Key") for x in obj.get("Contents")]

R

s3 <- paws::s3()

obj <- s3$list_objects(Bucket = 'mybucket', Prefix = "prefix_1/")
lapply(obj$Contents, function(x) x$Key)

From this quick example it is clear that the paws syntax is extremely similar to boto3, although with an R twist. This can only be a good thing: many people already know boto3, so they will be familiar with paws by association. I can't overstate the potential that paws gives R users. A good project that utilises the paws SDK is the package noctua. noctua wraps the paws connection to AWS Athena and develops a DBI interface for R users. We will go into noctua in the next blog post. First, here is an example of how to work with AWS Athena using paws.

Querying AWS Athena using paws

# create an AWS Athena object
athena <- paws::athena()

# Submit query to AWS Athena
res <- athena$start_query_execution(
            QueryString = "show Databases",
            ResultConfiguration =
                list(OutputLocation = "s3://mybucket/queries/"))

# Get status of the query
# (in practice you may need to poll until the state is no longer QUEUED/RUNNING)
result <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

# Return results if the query was successful
if (result$QueryExecution$Status$State == "FAILED") {
  stop(result$QueryExecution$Status$StateChangeReason, call. = FALSE)
} else {
  output <- athena$get_query_results(
      QueryExecutionId = res$QueryExecutionId,
      MaxResults = 1)
}

From an initial view it might look daunting, however this is exactly the same interface that boto3 provides when working with AWS Athena. The good news is that noctua wraps all of this and provides the DBI method dbGetQuery for paws, as in the sketch below.
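A minimal sketch of what that looks like with noctua (a taste of the next blog post); the staging bucket is a placeholder and the exact connection arguments may differ for your setup:

library(DBI)

con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://mybucket/queries/")

# noctua handles the submit/poll/fetch cycle behind dbGetQuery
dbGetQuery(con, "show Databases")

dbDisconnect(con)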

paws is an excellent R SDK for AWS, so please download paws and give it a go. I am sure you will be pleasantly surprised, as I was.

install.packages("paws")

Note: For more examples, the developers of paws have created some code examples https://github.com/paws-r/paws/tree/master/examples and a documentation website https://paws-r.github.io/.

botor:

This R package provides raw access to the ‘Amazon Web Services’ (‘AWS’) ‘SDK’ via the ‘boto3’ Python module and some convenient helper functions (currently for S3 and KMS) and workarounds, eg taking care of spawning new resources in forked R processes. (https://daroczig.github.io/botor/)

When using botor on AWS Sagemaker, R users can interact with all AWS products in exactly the same manner as a Python user. botor's convenient helper functions certainly make the experience of working on AWS Sagemaker easier. Here is a quick example to demonstrate how easy and useful these helper functions are:

Upload iris data.frame to s3 bucket

library(botor)

s3_write(iris, data.table::fwrite, "s3://mybucket/iris.csv")

Read s3 file back into R as a data.frame

s3_read("s3://mybucket/iris.csv", data.table::fread)

These convenient helper functions are not limited to reading/writing data in csv format. They can also be used to upload R models, which is really useful when you want to store pre-built models. Here is a quick example of what I like to call a crap model.

train <- iris[1:20,1:4]
test <- iris[21:40,1:4]
 
model <- lm(Petal.Width ~., train)

Uploading and downloading R models to S3

s3_write(model, saveRDS, "s3://mybucket/crap_model.RDS")
s3_model <- s3_read("s3://mybucket/crap_model.RDS", readRDS)

It is clear to see how useful botor is when working with AWS S3.

Cloudyr Project:

I personally haven't used the AWS cloudyr packages, however I don't want to leave them out. The cloudyr project's aim is to bring R to cloud computing:

The goal of this initiative is to make cloud computing with R easier, starting with robust tools for working with cloud computing platforms.(https://cloudyr.github.io/)

As I haven't utilised the wide range of packages that the cloudyr project provides, I won't give examples. Please go to the cloudyr GitHub https://github.com/cloudyr, as a lot of work has gone into making it easier to work with cloud computing from R. They have a lot of documentation and are actively developing R packages to improve the user experience.

Summary:

I believe that all of these packages have advantages when working with AWS from R. Now that R has an AWS SDK in paws, it would be great if it were added to the base image, as it would allow R developers to utilise AWS products in their AWS Sagemaker environments. Alternatively, the botor package would be another package for AWS to consider putting in their AWS Sagemaker image.




Extracting basic Plots from Novels: Dracula is a Man in a Hole


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

In 1965 the University of Chicago rejected Kurt Vonnegut’s college thesis, which claimed that all stories shared common structures, or “shapes”, including “Man in a Hole”, “Boy gets Girl” and “Cinderella”. Many years later the then already legendary Vonnegut gave a hilarious lecture on this idea – before continuing to read on please watch it here (about 4 minutes):

When you think about it the shape “Man in a Hole” (characters plunge into trouble and crawl out again) really is one of the most popular – even the Bible follows this simple script (see below)!

A colleague of mine, Professor Matthew Jockers from the University of Nebraska, has analyzed 50,000 novels and found that Vonnegut was really onto something: there are indeed only about half a dozen plots that most novels follow.

You can read more about this project here: The basic plots of fiction

Professor Jockers has written a whole book about this topic: “The Bestseller Code”. But what is even more mind-blowing than this unifying pattern of all stories is that you can do these analyses yourself – with any text of your choice! Professor Jockers made the syuzhet package publicly available on CRAN (“Syuzhet” is the Russian term for narrative construction).

A while ago I finished Dracula, the (grand-)father of all vampire and zombie stories. What a great novel that is! Admittedly it is a little slow-moving but the atmosphere is better than in any of the now popular TV series. Of course, I wanted to do an analysis of the publicly available Dracula text.

The following code should be mostly self-explanatory. First the original text (downloaded from Project Gutenberg: Bram Stoker: Dracula) is broken down into separate sentences. After that the sentiment of each sentence is evaluated and the values are smoothed out (using a specialized low-pass filter). Finally the transformed values are plotted:

library(syuzhet)

dracula <- get_text_as_string("data/pg345.txt")
Dracula <- get_sentences(dracula)
Dracula_sent <- get_sentiment(Dracula, method = "bing")
ft_values <- get_dct_transform(Dracula_sent, low_pass_size = 3, scale_range = TRUE)
plot(ft_values, type = "l", main = "Dracula using Transformed Values", xlab = "Narrative Time", ylab = "Emotional Valence", col = "red")
abline(h = 0)

In a way R has “read” the novel in no time and extracted the basic plot – pretty impressive, isn't it! As you can see the story follows the “Man in a Hole” script rather faithfully, which makes sense: at the beginning everything seems to be fine and well, then Dracula appears and, of course, bites several protagonists, but in the end they catch and kill him and everything is fine again.

THE END

…as a bonus, here is the plot that shows that the Bible also follows a simple “Man in a Hole” narrative (paradise, paradise lost, paradise regained). Fortunately, you can conveniently install the King James Bible as a package: https://github.com/JohnCoene/sacred

# devtools::install_github("JohnCoene/sacred")
library(sacred)

KJV_sent <- get_sentiment(king_james_version$text, method = "bing")
ft_values <- get_dct_transform(KJV_sent, low_pass_size = 3, scale_range = TRUE)
plot(ft_values, type = "l", main = "King James Bible using Transformed Values", xlab = "Narrative Time", ylab = "Emotional Valence", col = "red")
abline(h = 0)

Simple stories often work best!



What’s new in DALEX v 0.4.9?


[This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers.]

A few days ago a new version of DALEX (v 0.4.9) was accepted by CRAN. Here you will find a short overview of what was added and changed.

DALEX is an R package with methods for visual explanation and exploration of predictive models. Here you will find a short overview with examples based on the Titanic data. For real-world use cases: here you will find a conference talk related to credit scoring based on FICO data, and here an example use case for insurance pricing.

Major changes in the last version

Verbose model wrapping

The explain() function is now more verbose. During model wrapping it checks the consistency of its arguments, which works like a set of unit tests for the model. This way the most common problems with model exploration can be identified early.
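As a rough sketch of what this looks like (assuming the titanic_imputed data shipped with DALEX; the glm formula is only illustrative):

library(DALEX)

model <- glm(survived ~ gender + age + fare,
             data = titanic_imputed, family = "binomial")

# explain() prints a series of checks on the data, y and the predict function
# while it builds the wrapper
explainer <- explain(model,
                     data = titanic_imputed,
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")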

Support for new families of models

We have added more convenient support for gbm models; the n.trees argument for the predict_function is guessed from the model structure. Support for mlr, scikit-learn, h2o and mljar was moved to DALEXtra in order to limit the number of dependencies.

Integration with other packages

DALEX now has better integration with the auditor package. DALEX explainers can be used with any function from the auditor package, so you can easily create a ROC plot or LIFT chart, or perform an analysis of residuals. This way we also get access to a large number of loss functions.

Richer explainers

Explainers now have new elements. They store information about the packages that were used for model development, along with their versions. The latest version of the explainer also stores sampling weights for the data argument.

A bit of philosophy

Cross-comparison of models is tricky because predictive models may have very different structures and interfaces. DALEX is based on the idea of a universal adapter that can transform any model into a wrapper with a unified interface that can be digested by any model-agnostic tool.

In this medium article you will find a longer overview of this philosophy.



Enlarging the eBook supply


[This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers.]

There has been a campaign by members of the Canadian library community to ask international publishers to reduce prices to libraries for eBooks.  If my experience in the scientific publishing arena is any guide, the publishers will continue to charge the maximum they can get until the marketplace — that is us — forces them to change or get out of business.

This does not mean that the campaign is without merit. I just feel it needs to be augmented by some explorations of alternatives that could lead to a different ecosystem for writers and readers, possibly without traditional publishers.

Where am I coming from?

  • I am a retired university professor with a number of technical and scientific books to my name over a long period. Oddly, the first one published, in February 1979, is a computer book that is still in print on real paper. All the others have gone out of print, and I have put those for which I could generate an e-version on archive.org.
  • Since retiring in 2008, I have published a traditional book with Wiley. The royalties were pitiful, and the publisher has chosen to ignore requests to fix the web-site. However, I believe they are still selling the book, though I've not had any reports. If it were worthwhile, I'd ask a lawyer to prod them; however, I know the amounts involved are much, much less than any fee a lawyer would charge.
  • I've also written four novels as well as many shorter pieces, and am one of a collective that has published three creative writing anthologies. My novels are all in epub format and I have put them on obooko.com, where they are freely available to individual readers.

How does this help libraries?

First, I'd be really happy if libraries where my novels are of local interest would make them freely available. There are clearly some minor costs to doing this, but we are talking about a few tens of dollars of labour to add a web page and a link or two, and some cents of storage cost. I offered my first novel, which involves Ottawa, to the Ottawa Public Library, but was told I would have to pay an external (indeed US) company for DRM to be applied, even though I offered an unlimited number of downloads! Is it surprising that I went to archive.org and obooko.com? The library has fewer offerings for its readers, and I miss out on local readers being offered my work.

Second, I believe there are other authors whose motivations are not primarily financial who may be willing to offer their works in a similar fashion. My own motivations are to record anecdotes and events not otherwise easily accessed, if at all. I put these in a fictional shell to provide readability and structure. Given that obooko.com seems to be doing reasonably well, though some of the offerings are less than stellar, there are clearly other authors willing to make their works available to readers.

Third, there may be authors willing to offer their works for free for a limited time to allow initiatives by libraries to be tried out, even if they do want or need to be remunerated eventually.

What about the longer term?

DRM imposes considerable costs on both libraries and readers. Whether it protects authors can be debated. Its main damage is that it cedes control of works from authors and readers to foreign, often very greedy, companies, whose interests are not in creative works but in selling software systems. Worse, it only really protects the income of those companies. All DRM schemes are breakable, though they are always a nuisance.

Some workers talk of “social DRM” or watermarking. I believe this could be a useful tool. The borrower/buyer is identified on the work. I have done this with my books in the past, using a scheme developed to put a name on each page of student examination papers so each student had a unique paper. This avoided the need for strict invigilation, since copied answers could be wrong. In an ebook, the idea could be enhanced with invisible encoding of the name (steganography). However, each additional feature is another cost, another obstacle to getting the work to the reader. And no scheme is unbreakable.

Publishers now insist on “renewal” of the license for an eBook. The library may not get a “reference” copy. Personally, I believe one of the important roles of libraries is to serve as repositories of works. A single disk can store an awful lot of eBooks, so the cost is small. As a possibility, reading the archival copies could be restricted to in-library access only, but users would have a chance to verify material for study and reference purposes. Audiobooks require more storage, and are less easy to use for reference purposes, but could be similarly archived, as the audiobook reader's voice may be of interest.

For a sustainable writer-library-reader system, there does need to be a way to reward writers. The current scheme, with DRM, counts downloads. This is costly to libraries. How many times have you abandoned a book after a few pages? This might be overcome with some sort of sampling mechanism providing more material than currently offered as a “tease”.

If eBooks are available in unlimited quantities, authors could actually benefit more. There are often temporary fads or surges of interest, or else book club members all want to read a title. At the moment, I know some club members will decide not to bother with a given book, or will find ways to “share”. Clearly, libraries will not want to have open-ended costs, but it is likely that works will be temporarily popular. As I understand things, traditional publishers  allow a fixed number of borrowings, so there is in essence a per-borrowing charge. In effect this is infinite for a never-borrowed work. Those of us in the non-traditional community who still want some reward might be happy with a smaller per-borrowing reward, and may also be willing to accept a cap or a per-year maximum or some similar arrangement that still keeps costs lower for the library.

My view

I want my work read. Readers want choice and diversity. Libraries want to grow the community of written work. It is time to think out of the DRM box.



Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively


[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers.]

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a very simple population of N^{\textrm{pop}} items whose values are normally distributed with standard deviation known to be 2,

y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ n \in 1:N^{\textrm{pop}}.

To complete the Bayesian model, we’ll assume a standard normal prior on \mu,

\mu \sim \textrm{normal}(0, 1).

Now we're not going to observe all y_n, but only a sample of the N^{\textrm{pop}} elements. If the model is correct, our inferences will be calibrated in expectation given a random sample of items y_n from the population.

Missing data

Now let’s assume the sample of y_n we observe is not drawn at random from the population. Imagine instead that we have a subset of N items from the population, and for each item n, there is a probability \pi_n that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

\pi_n = \textrm{logit}^{-1}(y_n).

Now when we collect our sample, we’ll do something like poll N = 2000 people from the population, but each person n only has a \pi_n chance of responding. So we only wind up with N^{\textrm{obs}} observations, with N^{\textrm{miss}} = N - N^{\textrm{obs}} observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

\sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
  target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);

If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate \hat{\mu} is the true value \mu). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed N items when we’ve only observed N^{\textrm{obs}} items. That is, their weights are what we’d expect to get if we’d observed N items.

So my next thought was to standardize. Let's take the inverse weights and rescale them so that they sum to N^{\textrm{obs}} (a sketch of this rescaling is below). That also fails: the posterior intervals are still too narrow under simulation.
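For reference, the standardization amounts to something like this in R (a minimal sketch; pi_obs is assumed to hold the inclusion probabilities of the observed items):

w <- 1 / pi_obs                  # raw inverse-probability weights
w_std <- w * length(w) / sum(w)  # rescale so that the weights sum to N_obs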

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generatively? We poll N people out of a population of N^{\textrm{pop}}, each of whom has a \pi_n chance of responding, leading to a set of responses of size N^{\textrm{obs}}.

Given that we know how \pi relates to y, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the N^{\textrm{miss}} = N - N^{\textrm{obs}} missing items each get parameters y^{\textrm{miss}}_n representing how they would’ve responded had they responded. We also model response, so we have an extra term \textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}})) for the unobserved values and an extra term \textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n)) for the observed values.

This works. Here’s the Stan program.

data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
}
parameters {
  real mu;
  vector[N_miss] y_miss;
}
model {
  // prior
  mu ~ normal(0, 1);

  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);

  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);
}

The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]));

And here’s the simulation code, including a cheap run at SBC:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(logical = FALSE))

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {
  # SIMULATE DATA
  mu <- rnorm(1, 0, 1)
  N_tot <- 1000
  y <- rnorm(N_tot, mu, 2)
  z <- rbinom(N_tot, 1, inv_logit(y))
  y_obs <- y[z == 1]
  N_obs <- length(y_obs)
  N_miss <- N_tot - N_obs

  # FIT STAN MODEL
  fit <- sampling(model,
                  data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                  chains = 1, iter = 5000, refresh = 0)
  mu_ss <- extract(fit)$mu
  mu_hat <- mean(mu_ss)
  q25 <- quantile(mu_ss, 0.25)
  q75 <- quantile(mu_ss, 0.75)
  printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
         mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))
}

Here's some output with random seeds, showing mu, mu_hat, the 50% interval, and an indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81

The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-respondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.



The Mysterious Case of the Ghost Interaction


[This article was first published on R on I Should Be Writing: Est. 1641, and kindly contributed to R-bloggers.]

This spooky post was written in collaboration with Yoav Kessler (@yoav_kessler) and Naama Yavor (@namivor).


Experimental psychology is moving away from repeated-measures ANOVAs, and towards linear mixed models (LMMs1). LMMs have many advantages over rmANOVA, including (but not limited to):

  • Analysis of single trial data (as opposed to aggregated means per condition).
  • Specifying more than one random factor (typically crossed random intercepts of subject and item).
  • The use of continuous variables as predictors.
  • Making you look like you know what you’re doing.
  • Defeating the un-dead / reviewer 2.
  • The ability to specify custom models.2

This post will focus on this last point. Specifically, why you should always include main effects when modeling interactions, and what happens if you don't (spooky).

Fitting the Right (yet oh so wrong) Model

Say you’ve finally won that grant you submitted to study candy consumption during ghostly themed holidays. As part of your first study, you decide to measure the effects of costume type (scary / cute) and level of neighborhood decor (high / low levels of house decorations) on the total weight of collected candy (in Kg). A simple, yet informative 2-by-2 design.

Being the serious scientist you are, you have several hypotheses:

  1. A main effect for decor level– neighborhoods with more decorations will overall give out more candy.
  2. No main effect for costume– overall, children with cute and scary costumes will receive the same amount of candy (in Kg).
  3. A decor level\(\times\)costume interaction– high decor neighborhoods will favor scary costumes, while low decor neighborhoods will favor cute costumes.

It would only make sense to specify your statistical model accordingly – after all, why shouldn’t your model represent your hypotheses?

In R, such a model is described as candy_kg ~ decor + decor:costume, instructing R to model candy_kg as a function of the effect for decor + the interaction decor:costume.

And so, you fit the model:

options(contrasts = c('contr.sum', 'contr.poly')) # set effects coding (just once)
fit <- aov(candy_kg ~ decor + decor:costume, data = spooky_data)
Term            df    SS       MS      F       p-value   \(\eta^2\)
decor           1     30.00    30.00   23.64   <0.001    0.10
decor:costume   2     120.00   60.00   47.28   <0.001    0.40
Residuals       116   147.20   1.27

As predicted, you find both a significant main effect for decor and the interaction decor\(\times\)costume, with the interaction explaining 40% of the variance in collected candy weight. So far so good – your results reflect your hypotheses!

But then you plot your data, and to your horror you find…

It looks like there is no interaction at all! Your interaction was nothing more than a ghost! An apparition! How is this possible?? Where has all of variance explained by it gone???

What IS This??

In fact, had you fit the full model, you would have found:

fit <- aov(candy_kg ~ decor * costume, data = spooky_data)
Term            df    SS       MS       F       p-value   \(\eta^2\)
decor           1     30.00    30.00    23.64   <0.001    0.10
costume         1     120.00   120.00   94.56   <0.001    0.40
decor:costume   1     0.00     0.00     0.00    1.000     0.00
Residuals       116   147.20   1.27

The interaction actually explains 0% of the variance! And the effect of costume is the one that explains 40% of the variance!3 How could this be?? Have we angered Fisher’s spirit somehow?

What happened was that, because we did not account for costume in our model, the variance explained by costume was swallowed by the interaction decor\(\times\)costume! The simulation sketch below shows this pooling in action.
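Here is a minimal simulation sketch. spooky_data is not supplied with the post, so the seed, cell size and effect sizes below are assumptions, chosen so that costume has a main effect but there is no interaction:

set.seed(31)
n_per_cell <- 30  # gives the 116 residual df reported above

spooky_data <- expand.grid(
  decor   = c("high", "low"),
  costume = c("scary", "cute"),
  id      = seq_len(n_per_cell)
)

# additive effects only: decor and costume, but no interaction
spooky_data$candy_kg <- 1 +
  0.5 * (spooky_data$decor == "high") +
  1.0 * (spooky_data$costume == "scary") +
  rnorm(nrow(spooky_data), sd = 1)

options(contrasts = c('contr.sum', 'contr.poly'))
summary(aov(candy_kg ~ decor + decor:costume, data = spooky_data))  # interaction "absorbs" costume
summary(aov(candy_kg ~ decor * costume, data = spooky_data))        # interaction SS is ~0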

The Math

If you find the math too scary, feel free to skip to the conclusion.

Travel back to Intro to Stats, and recall that the interaction’s sum-of-squares – \(SS_{A\times B}\)– is calculated as:

\(SS_{A\times B} = \sum \sum (\bar{x}_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{\bar{x}}_{..})^2\)

This is a simplification of the following equation:

\(SS_{A\times B} = \sum \sum (\bar{x}_{ij} - (\bar{x}_{i.} - \bar{\bar{x}}_{..}) - (\bar{x}_{.j} - \bar{\bar{x}}_{..}) - \bar{\bar{x}}_{..})^2\)

Where \((\bar{x}_{i.} - \bar{\bar{x}}_{..})\) represents the main effect for \(A\) and \((\bar{x}_{.j} - \bar{\bar{x}}_{..})\) represents the main effect for \(B\). We can see that \(SS_{A\times B}\) represents the deviation from the additive model – i.e., it is the degree by which the observed cells' means deviate from what would be expected if there were only the two main effects.

When we exclude the main effect of \(B\) from our model, we are telling our model that there is no need to estimate that main effect. That is, we set \((\bar{x}_{.j} - \bar{\bar{x}}_{..})=0\). The resulting \(SS_{A\times B}\) is computed not as above, but as:

\(SS_{A\times B} = \sum \sum (\bar{x}_{ij} - (\bar{x}_{i.} - \bar{\bar{x}}_{..}) - \bar{\bar{x}}_{..})^2\)

This formula represents the degree by which the observed cells’ means deviate from what would be expected if there was only the main effect of \(A\). But now if the cells’ means deviate in a way that would otherwise have been part of a main effect for \(B\), the cells’ deviations from the main effect for \(A\) will now include the deviations that would otherwise have been accounted for by a main effect of \(B\)! This results in the main effect for \(B\) essentially getting “pooled” into \(SS_{A\times B}\). Furthermore, had you also excluded a main effect for \(A\), this effect too would have been “pooled” into the so-called \(A\times B\) interaction.

In other words:

When we don't estimate (model) main effects, we change the meaning of interactions – they no longer represent a deviation from the additive model.

Conclusion

Sure, you can specify a model with no main effects and only interactions, but in such a case the interactions no longer mean what we expect them to mean. If we want interactions to represent deviations from the additive model, our model must also include the additive model!

For simplicity’s sake, this example has focused on a simple 2-by-2 between subject design, but the conclusions drawn here are relevant for any design in which a factor interacts with or moderates the effect of another factor or continuous variable.


  1. Or hierarchical linear models (HLM)… or mixed linear models (MLM)…

  2. Whereas in an ANOVA analysis with 4 factors you always have: four main effects + six 2-way interactions + four 3-way interactions + one 4-way interaction.

  3. Note also that the \(df_{residual}\) is the same for both models, indicating the same number of parameters overall have been estimated in both. E.g., while in the full model we would have 3 parameters – one for each main effect + one for the interaction, in the misspecified model we have one for the main effect, and two for the interaction. That is, no matter how you tell the model to split the \(SS\)s, the number of parameters needed to model 4 cells will always be 3.



A brief primer on Variational Inference


[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers.]


Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!

Preliminaries

Bayes’ rule states that

\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,

where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.1 Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral — the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much easier and generally faster than the latter, and so our goal will be to reframe this integration problem as one of optimization.

Variational objective

We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:

q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,

where $\text{KL}(. \lvert \lvert.)$ denotes the Kullback-Leibler divergence:

\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .

We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:

\begin{aligned}\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,\end{aligned}

where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:

\begin{aligned}\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em]&= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace .\end{aligned}

We can expand the joint probability to get more insight into this equation:

\begin{aligned}\text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em]&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace .\end{aligned}

This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbir, & McAuliffe, 2017).

ELBO stands for evidence lower bound. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:

\begin{aligned}\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em]\text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em]\text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace ,\end{aligned}

since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see this and this blog post.

Why variational?

Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the single best value that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example this blog post). In our setting, we do not want to find a single best value but rather a single best function. To do this, we can use variational calculus from which variational inference derives its name (Bishop, 2006, p. 462).

A function takes an input value and returns an output value. We can define a functional which takes a whole function and returns an output value. The entropy of a probability distribution is a widely used functional:

\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d} x \enspace ,

which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional.

In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.

Mean-field variational family

A frequently used approximation is to assume that the latent variables $z_j$ for $j = \{1, \ldots, m\}$ are mutually independent, each governed by their own variational density:

q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .

Note that this mean-field variational family cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.

We start from our definition of the ELBO and apply the mean-field assumption:

\begin{aligned}\text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace .\end{aligned}

In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:

\begin{aligned}\text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em]&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em]&\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]&= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]&= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace .\end{aligned}

One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution

\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,

where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:

\begin{aligned}\text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]&= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em]&= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em]&= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace .\end{aligned}

Thus, maximizing the ELBO with respect to $q_j(z_j)$ is equivalent to minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; it is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:

\begin{aligned}q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em]&= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace ,\end{aligned}

see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all the others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each conditional on the current values of the others. Such a procedure is known as Coordinate Ascent Variational Inference (CAVI). Further, note that

p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,

which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks a lot like Gibbs sampling, which we discussed in detail in a previous blog post. In the next section, we implement CAVI for a simple linear regression problem.
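Before turning to the example, here is a schematic of CAVI as a plain R loop. This is only a sketch with made-up names: update_fns is assumed to hold one update function per factor and elbo_fn to evaluate the ELBO at the current variational parameters.

cavi <- function(update_fns, init, elbo_fn, tol = 1e-8, max_iter = 1000) {
  q <- init                          # list of variational parameters, one element per factor
  elbo_old <- -Inf
  for (iter in seq_len(max_iter)) {
    for (j in seq_along(update_fns)) {
      q[[j]] <- update_fns[[j]](q)   # update factor j given the current values of all others
    }
    elbo_new <- elbo_fn(q)           # monitor convergence via the ELBO
    if (abs(elbo_new - elbo_old) < tol) break
    elbo_old <- elbo_new
  }
  q
}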

Application: Linear regression

In a previous blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap the estimation procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:

\begin{aligned}y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em]\beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em]p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace ,\end{aligned}

where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification:

\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .

As a heads up, we have to do a surprising amount of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.

Variational density for $\sigma^2$

Our optimal variational density $q^\star(\sigma^2)$ is given by:

q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .

To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:

\begin{aligned}p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em]&= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em]&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]&\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace ,\end{aligned}

which is proportional to an inverse Gamma distribution. Moving on, we exploit the linearity of the expectation and write:

\begin{aligned}q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em]&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em]&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em]&= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em]&= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace .\end{aligned}

This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:

q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .

Note that this quantity depends on $\beta$. In the next section, we derive the variational density for $\beta$.

Variational density for $\beta$

Our optimal variational density $q^\star(\beta)$ is given by:

q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,

and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:

\begin{aligned}p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em]&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-1}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]&\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em]&=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em]&\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,\end{aligned}

where we have “completed the square” (see also this blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:

\begin{aligned}q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em]&= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em]&= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,\end{aligned}

which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:

q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace ,

Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.

To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian distribution and $\sigma^2$ a Jeffreys’s prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.

Removing expectations

Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, for the variational density $q^\star(\sigma^2)$ we write:

\begin{aligned}\mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em]&= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace .\end{aligned}

Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:

\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2 = \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,

the expectation becomes:

\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .

For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. Note that since $\sigma^2$ follows an inverse Gamma distribution, $1 / \sigma^2$ follows a Gamma distribution which has mean:

\begin{aligned}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em]&= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace .\end{aligned}
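To make these two expressions concrete, here is a small R sketch; the helper names update_nu and E_inv_sigma2 are mine, not from the derivation, but the formulas are exactly the ones above:

# Minimal sketch: compute nu = E_q[A]/2 and E_q[1/sigma^2] from the data
# and the current parameters of q(beta).
update_nu <- function(y, x, beta_mu, beta_sd, tau2) {
  E_qA <- sum(y^2) - 2 * sum(y * x) * beta_mu +
    (beta_sd^2 + beta_mu^2) * (sum(x^2) + 1 / tau2)
  E_qA / 2
}

E_inv_sigma2 <- function(n, nu) {
  # mean of a Gamma((n + 1) / 2, rate = nu) random variable
  ((n + 1) / 2) / nu
}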

Monitoring convergence

The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:

\begin{aligned}\text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em]&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em]&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace .\end{aligned}

Let’s take a deep breath and tackle the second term first:

\begin{aligned}\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\sigma^2\tau^2}{\sigma^2_\beta} + \frac{\frac{1}{\sigma^2\tau^2} \beta^2}{\frac{1}{\sigma^2_\beta} (\beta - \mu_\beta)^2}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]&= \mathbb{E}_{q(\sigma^2)}\left[\text{log}\frac{\sigma^2\tau^2}{\sigma^2_\beta} + \frac{\sigma^2_\beta + \mu_\beta^2}{\sigma^2\tau^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]&= \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]&= \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace .\end{aligned}

Note that there are three expectations left. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte Carlo integration:

\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\text{log } q(\sigma^2)}_{\sigma^2 \, \sim \, q(\sigma^2)} \enspace ,
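As an illustration of this Monte Carlo step, a minimal R sketch could look like the following; the helper name is mine, and I assume the dinvgamma()/rinvgamma() parameterization from the {MCMCpack} package, which is also what the implementation below uses:

library('MCMCpack')   # provides rinvgamma() and dinvgamma()

# Monte Carlo estimate of E_q[log q(sigma^2)], where q(sigma^2) is an
# inverse Gamma density with shape (n + 1)/2 and rate/scale nu.
mc_E_log_q_sigma2 <- function(n, nu, nr_samples = 1e4) {
  sigma2 <- rinvgamma(nr_samples, (n + 1) / 2, nu)   # draws from q(sigma^2)
  mean(log(dinvgamma(sigma2, (n + 1) / 2, nu)))      # average log density
}

# e.g. mc_E_log_q_sigma2(n = 100, nu = 46)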

We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:

\begin{aligned}\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma \mathrm{d}\beta \\[.5em]&=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma \mathrm{d}\beta \\[.5em]&= \frac{n}{4} \text{log}\left(2\pi\right)\int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \, \mathrm{d}\beta\int q(\sigma^2) \, \text{log} \left(\sigma^2\right)\frac{1}{\sigma^2} \, \mathrm{d}\sigma \enspace .\end{aligned}

Since we have solved a similar problem already above, we evaluate the expectation with respect to $q(\beta)$ analytically:

\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .

In the next section, we implement the algorithm for our linear regression problem in R.

Implementation in R

Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:

library('MCMCpack')

#' Computes the ELBO for the linear regression example
#' 
#' @param y univariate outcome variable
#' @param x univariate predictor variable
#' @param beta_mu mean of the variational density for \beta
#' @param beta_sd standard deviation of the variational density for \beta
#' @param nu parameter of the variational density for \sigma^2
#' @param nr_samples number of samples for the Monte carlo integration
#' @returns ELBO
compute_elbo <- function(y, x, beta_mu, beta_sd, nu, tau2, nr_samples = 1e4) {
  n <- length(y)
  sum_y2 <- sum(y^2)
  sum_x2 <- sum(x^2)
  sum_yx <- sum(x*y)
  
  # Takes a function and computes its expectation with respect to q(\beta)
  E_q_beta <- function(fn) {
    integrate(function(beta) {
      dnorm(beta, beta_mu, beta_sd) * fn(beta)
    }, -Inf, Inf)$value
  }
  
  # Takes a function and computes its expectation with respect to q(\sigma^2)
  E_q_sigma2 <- function(fn) {
    integrate(function(sigma) {
      dinvgamma(sigma^2, (n + 1)/2, nu) * fn(sigma)
    }, 0, Inf)$value
  }
  
  # Compute expectations of log p(\sigma^2)
  E_log_p_sigma2 <- E_q_sigma2(function(sigma) log(1/sigma^2))
  
  # Compute expectations of log p(\beta \mid \sigma^2)
  E_log_p_beta <- (log(tau2/beta_sd^2) * E_q_sigma2(function(sigma) log(sigma^2)) +
                   (beta_sd^2 + tau2)/(tau2) * E_q_sigma2(function(sigma) 1/sigma^2))
  
  # Compute expectations of the log variational densities q(\beta)
  E_log_q_beta <- E_q_beta(function(beta) dnorm(beta, beta_mu, beta_sd, log = TRUE))
  # E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu))) # fails
  
  # Compute expectations of the log variational densities q(\sigma^2)
  sigma2 <- rinvgamma(nr_samples, (n + 1)/2, nu)
  E_log_q_sigma2 <- mean(log(dinvgamma(sigma2, (n + 1)/2, nu)))
  
  # Compute the expected log likelihood
  E_log_y_b <- sum_y2 - 2*sum_yx*beta_mu + (beta_sd^2 + beta_mu^2)*sum_x2
  E_log_y_sigma2 <- E_q_sigma2(function(sigma) log(sigma^2) * 1/sigma^2)
  E_log_y <- n/4 * log(2*pi) * E_log_y_b * E_log_y_sigma2
  
  # Compute and return the ELBO
  ELBO <- E_log_y + E_log_p_beta + E_log_p_sigma2 - E_log_q_beta - E_log_q_sigma2
  ELBO
}

The function below implements coordinate ascent mean-field variational inference for our simple linear regression problem. Recall that the variational parameters are:

\begin{aligned}\nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em]\mu_\beta &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]\sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace .\end{aligned}

The following function implements the iterative updating of these variational parameters until the ELBO has converged.

#' Implements CAVI for the linear regression example
#' 
#' @param y univariate outcome variable
#' @param x univariate predictor variable
#' @param tau2 prior variance for the standardized effect size
#' @returns parameters for the variational densities and ELBO
lmcavi <- function(y, x, tau2, nr_samples = 1e5, epsilon = 1e-2) {
  n <- length(y)
  sum_y2 <- sum(y^2)
  sum_x2 <- sum(x^2)
  sum_yx <- sum(x*y)
  
  # is not being updated through variational inference!
  beta_mu <- sum_yx / (sum_x2 + 1/tau2)
  
  res <- list()
  res[['nu']] <- 5
  res[['beta_mu']] <- beta_mu
  res[['beta_sd']] <- 1
  res[['ELBO']] <- 0
  
  j <- 1
  has_converged <- function(x, y) abs(x - y) < epsilon
  ELBO <- compute_elbo(y, x, beta_mu, 1, 5, tau2, nr_samples = nr_samples)
  
  # while the ELBO has not converged
  while (!has_converged(res[['ELBO']][j], ELBO)) {
    
    nu_prev <- res[['nu']][j]
    beta_sd_prev <- res[['beta_sd']][j]
    
    # used in the update of beta_sd and nu
    E_qA <- sum_y2 - 2*sum_yx*beta_mu + (beta_sd_prev^2 + beta_mu^2)*(sum_x2 + 1/tau2)
    
    # update the variational parameters for sigma2 and beta
    nu <- 1/2 * E_qA
    beta_sd <- sqrt(((n + 1)/E_qA) / (sum_x2 + 1/tau2))
    
    # update results object
    res[['nu']] <- c(res[['nu']], nu)
    res[['beta_sd']] <- c(res[['beta_sd']], beta_sd)
    res[['ELBO']] <- c(res[['ELBO']], ELBO)
    
    # compute new ELBO
    j <- j + 1
    ELBO <- compute_elbo(y, x, beta_mu, beta_sd, nu, tau2, nr_samples = nr_samples)
  }
  
  res
}

Let’s run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values of $\lvert \beta \rvert$ smaller than one standard deviation ($0.50$) receive about $0.68$ prior probability.

gen_dat <- function(n, beta, sigma) {
  x <- rnorm(n)
  y <- 0 + beta*x + rnorm(n, 0, sigma)
  data.frame(x = x, y = y)
}

set.seed(1)
dat <- gen_dat(100, 0.30, 1)

mc <- lmcavi(dat$y, dat$x, tau2 = 0.50^2)
mc
## $nu
## [1]  5.00000 88.17995 45.93875 46.20205 46.19892 46.19895
## 
## $beta_mu
## [1] 0.2800556
## 
## $beta_sd
## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512
## 
## $ELBO
## [1]       0.0000 -297980.0495     493.4807    -281.4578    -265.1289
## [6]    -265.3197

From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.

Comparison with Stan

Whenever one goes down a rabbit hole of calculations, it is good to sanity check one’s results. Here, we use Stan’s variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a “black-box” way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors arrived at by Markov chain Monte Carlo. The simple linear regression model in Stan is:

data {
  int n;
  vector[n] y;
  vector[n] x;
  real tau;
}

parameters {
  real b;
  real<lower=0> sigma;
}

model {
  target += -log(sigma);
  target += normal_lpdf(b | 0, sigma*tau);
  target += normal_lpdf(y | b*x, sigma);
}

We use Stan’s black-box variational inference scheme:

library('rstan')

# save the above model to a file and compile it
# model <- stan_model(file = 'regression.stan')

stan_dat <- list('n' = nrow(dat), 'x' = dat$x, 'y' = dat$y, 'tau' = 0.50)

fit <- vb(
  model, data = stan_dat, output_samples = 20000, adapt_iter = 10000,
  init = list('b' = 0.30, 'sigma' = 1), refresh = FALSE, seed = 1
)

This gives similar estimates as ours:

fit
## Inference for Stan model: variational-regression.
## 1 chains, each with iter=20000; warmup=0; thin=1; 
## post-warmup draws per chain=20000, total post-warmup draws=20000.
## 
##       mean   sd 2.5%  25%  50%  75% 97.5%
## b     0.28 0.13 0.02 0.19 0.28 0.37  0.54
## sigma 0.99 0.09 0.82 0.92 0.99 1.05  1.18
## lp__  0.00 0.00 0.00 0.00 0.00 0.00  0.00
## 
## Approximate samples were drawn using VB(meanfield) at Wed Oct 30 13:20:01 2019.
## We recommend genuine 'sampling' from the posterior distribution for final inferences!

Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems that it has converged to a different local optimum for each run. (For seed = 3, Stan gives completely nonsensical results). Stan warns that the algorithm is experimental and may be unstable, and it is probably wise to not use it in production.
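A quick way to see this instability for yourself, assuming the compiled model object and the stan_dat list from above, is to re-run vb() over a handful of seeds and compare the approximate posterior summaries for b; note that some seeds may even error out:

# Re-run Stan's variational scheme across a few seeds and compare the
# approximate posterior mean and sd of b (a sketch, not part of the original post).
seeds <- 1:5
res <- sapply(seeds, function(s) {
  fit_s <- vb(model, data = stan_dat, output_samples = 20000,
              refresh = FALSE, seed = s)
  summary(fit_s, pars = 'b')$summary[, c('mean', 'sd')]
})
colnames(res) <- paste0('seed_', seeds)
round(res, 2)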

Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed-form (see the Post Scriptum), we check our results against exact inference using Markov chain Monte carlo by visual inspection.

fit <- sampling(model, data = stan_dat, iter = 8000, refresh = FALSE, seed = 1)

The Figure below overlays our closed-form results to the histogram of posterior samples obtained using Stan.

[Figure: closed-form variational densities for $\beta$ and $\sigma^2$ overlaid on histograms of posterior samples obtained with Stan]

Note that the posterior variance of $\beta$ is slightly overestimated when using our variational scheme. This is in contrast to the fact that variational inference generally underestimates variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear by increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less than a tenth of a second!

Conclusion

In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic “black-box” algorithms which avoid problem-specific calculations hold great promise.

We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan’s variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (2017).


I would like to thank Don van den Bergh for helpful comments on this blog post.


Post Scriptum

Normal-inverse-gamma Distribution

The posterior distribution is a Normal-inverse-gamma distribution:

p(\beta, \sigma^2 \mid \mathbf{y}) = \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,

where

\begin{aligned}\mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]\lambda &= \sum_{i=1}^n x_i^2 + \frac{1}{\tau^2} \\[.5em]\alpha &= \frac{n + 1}{2} \\[.5em]\gamma &= \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}\right)\right) \enspace .\end{aligned}

Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, contrary to what we assume in our variational inference scheme.

References

  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
  • Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic variational inference in Stan. In Advances in Neural Information Processing Systems (pp. 568-576).
  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1), 430-474.

Footnotes

  1. The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation. ↩


To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.


81st TokyoR Meetup Roundup: A Special Session in {Shiny}!


[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As another sweltering summer ends, another TokyoR Meetup! With global warming in full swing and it still being around 30 degrees at the end of September, this month’s meetup was held at DIP Corporation, a personnel/recruitment services company, in their headquarters in Roppongi, Tokyo. This month’s session was another special-themed session involving Shiny apps!

In line with my previous round up posts:

I will be going over around half of all the talks. Hopefully, my efforts will help spread the vast knowledge of Japanese R users to the wider R community. Throughout I will also post helpful blog posts and links from other sources if you are interested in learning more about the topic of a certain talk. You can follow Tokyo.R by searching for the #TokyoR hashtag on Twitter.

Anyways…

Let’s get started!

BeginneR Session

As with every TokyoR meetup, we began with a set of beginner user focused talks:

Main Talks

hoxo_m: Asynchronous Programming for Shiny!

@hoxo_m of HOXO-M Inc. talked about asynchronous programming with Shiny. He started off with an introduction to the history of single- and multi-threading in R and explained how the growing popularity of Shiny has led to a demand for multi-threadedness to cater to the multitude of users accessing a Shiny app at once!

The main package that facilitates this in both R and Shiny is the {future} package, which allows one to evaluate R expressions asynchronously (in parallel or concurrently on a cluster). By using {future} in a Shiny context you can shift resource-intensive tasks (e.g. grabbing data from an API) triggered by one user into another process, freeing up time for other users’ tasks and reducing their waiting time.

The plan() function allows you to choose from a variety of options for launching/attaching R processes. The choices are multisession, multicore, and multiprocess. You can read more about it here.

There’s not a whole lot you need to do to grab the results from another process, as all the render_*() functions are able to take “promise” objects. As a reminder, a “promise” in this context is an object that will hold the result of an asynchronous process that completes later. A “promise” object takes the result of a {future} code block and waits until the other process finishes running the code and a result appears.

Another important component of the asynchronous framework in R is the {promises} package. It’s this package that provides the actual abstractions within your R code for asynchronous programming, such as the “promise pipe”, %...>%! You insert whatever long-running code you have into the future() function and then use the “promise pipe” to pass its result to the rest of the code. As a future/promise object is not a data frame, you can’t call filter() or other functions on it directly; you have to go through the “promise pipe” before other regular functions can be run.

In a Shiny context, you can’t use reactives inside a future() function so one needs to assign a reactive as an object before the future() code and then pass that object into the function.

You also need to carefully think about WHERE (as in which process) the code is running. For example in the code below, the results are the same in both the top and bottom code. The code in black is done by the main process while the code in green is done by another process.

Although the above code works in both cases, some functions such as plot() and print() can be run in another process but their output cannot be returned to the main process. The solution is to use the “promise pipe” to make sure that plot()/print() is run by the main process instead. You can still use promises within observe*() and eventReactive*()/reactive() code; you just have to remember to use the “promise pipes”.
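To make the pattern concrete, here is a minimal, self-contained sketch, not code from the talk, of a Shiny app that pushes a slow task into another process with future() and hands the result back through the promise pipe:

library(shiny)
library(future)
library(promises)

plan(multisession)   # run long tasks in separate R processes

ui <- fluidPage(
  actionButton('go', 'Run long task'),
  tableOutput('result')
)

server <- function(input, output, session) {
  output$result <- renderTable({
    n_clicks <- input$go        # read reactives *before* entering future()
    future({
      Sys.sleep(5)              # stand-in for a slow API call or query
      head(mtcars, n_clicks + 1)
    }) %...>%                   # the "promise pipe"
      subset(mpg > 15)          # downstream steps come after the promise pipe
  })
}

# shinyApp(ui, server)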

Np_Ur: A Simple Shiny App in 30 Minutes!

@Np_Ur is known in the Japan community for his love of {shiny}, he even wrote a book on it called “Building Web Applications in R with Shiny”. This presentation was largely a demonstration as @Np_Ur explained, from the ground up, a lot of the basic functions that can get you a working Shiny app in the space of 30 minutes! From creating a navigation bar via navbarPage() to creating different drop-down options for customizing a plot and talking about extra functionality from other packages such as {DT} and {shinycssloaders}, @Np_Ur took us through the code and showed us completed Shiny apps for each step of the way.

I recommend going through his slides (also hosted on Shiny) as well as checking out the code for each of the Shiny apps he made for all different functionalities he talked about by clicking on the link below!

kashitan: Making {shiny} Maps with {leaflet}!

@kashitan presented some tips (that you normally won’t see in books/articles) for using {leaflet} with Shiny for map applications! He took us through four different functions that he found very useful for making {leaflet} maps with Japan census data.

The first function: updateSelectInput() allows you to update a drop-down menu with new values after selecting a certain input. In @kashitan’s case using the Japan census Shiny app, he wanted to be able to update the choices of the city/district after choosing a different prefecture on the app. Using the updateSelectInput() function the list of choices from the drop down menu updates to the city/districts of the newly chosen prefecture!

You can check out the documentation here.
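A minimal sketch of this pattern, with a made-up city_table data frame rather than @kashitan’s census data, looks like this:

library(shiny)

# made-up lookup table standing in for the census data
city_table <- data.frame(
  prefecture = c('Tokyo', 'Tokyo', 'Osaka'),
  city       = c('Shinjuku', 'Shibuya', 'Namba')
)

ui <- fluidPage(
  selectInput('pref', 'Prefecture', choices = unique(city_table$prefecture)),
  selectInput('city', 'City/District', choices = NULL)
)

server <- function(input, output, session) {
  observeEvent(input$pref, {
    new_choices <- city_table$city[city_table$prefecture == input$pref]
    updateSelectInput(session, 'city', choices = new_choices)
  })
}

# shinyApp(ui, server)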

The second function: leafletProxy() allows you to customize a {leaflet} map even after it has been rendered by Shiny. For @kashitan this was necessary as he didn’t want the map’s active zoom level and center coordinates to change even after choosing a new prefecture to look at.

The third function: fitBounds() allows you to set the boundaries of the map. Similar to the previous function, @kashitan wanted the view, after a change in the city/district, to always fit within a certain bounding box.
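A minimal sketch combining leafletProxy() and fitBounds(), with made-up bounding boxes standing in for the census polygons used in the talk, could look like this:

library(shiny)
library(leaflet)

# made-up bounding boxes (lng1, lat1, lng2, lat2)
bounds <- list(Tokyo = c(139.6, 35.6, 139.9, 35.8),
               Osaka = c(135.4, 34.6, 135.6, 34.8))

ui <- fluidPage(
  selectInput('area', 'Area', choices = names(bounds)),
  leafletOutput('map')
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>% addTiles()   # rendered only once
  })

  observeEvent(input$area, {
    b <- bounds[[input$area]]
    leafletProxy('map', session) %>%   # modify the map without redrawing it
      fitBounds(lng1 = b[1], lat1 = b[2], lng2 = b[3], lat2 = b[4])
  })
}

# shinyApp(ui, server)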

The last function: input${id}_shape_click gives you information about the polygon shape you just clicked on the map. {leaflet}’s “click” event currently only provides the coordinate and id values through this function.
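For completeness, a small fragment, to be placed inside the server function of a leaflet app like the sketch above, showing how such a click event is typically consumed (it assumes a map output with outputId 'map' and polygons added with layerId set):

# react to a click on a polygon layer of the 'map' output
observeEvent(input$map_shape_click, {
  click <- input$map_shape_click
  print(click$id)                  # id of the clicked polygon
  print(c(click$lat, click$lng))   # coordinates of the click
})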

okiyuki: Software Engineering for Shiny!

@okiyuki presented on the various R packages used for the software engineering that supports Shiny apps.

  • {memoise}: Caches data when certain function is run for the first time (useful for dashboard shiny apps where similar use cases can be predicted)
  • {pool}: Easy database connection management in an interactive context. After inserting/accessing SQL database connection info, the connection is closed when app itself closes!
  • {shinyProxy}: Deploy Shiny apps with LDAP authentication/authorization and TLS protocols for an enterprise context. It uses Docker so that each user is using the app in their own single Docker container.
  • {shinyloadtest}: Helps analyze load tests and Shiny app performance with multiple users.

@okiyuki also talked about some of his personal struggles and pitfalls that he has come across when building Shiny apps at work. These include:

  • Deployed on ShinyServer but there was an error! Even though it was working fine a minute ago!

    • Solution: Use {Shinytest} and {testthat} to test deployment and other actions in Shiny
  • Unknowingly/unintentionally using functions from a different namespace

    • Solution: Make sure to explicitly :: your functions
    • Also restart your app via altering restart.txt in your Shiny app directory

An extra section talked about various helpful packages for Shiny app aesthetics such as:

  • {shinycssloaders}:
  • {shinyace}:
  • dreamRs’ suite of Shiny packages such as {shinyWidgets}
  • I introduced some of dreamRs’ packages in my useR!2019 blog post here.
  • Various packages to create Shiny Templates: {bs4dash}, {shinymaterial}, {fullpage}, {shiny.semantic}

LTs

igjit: Edit Your Photos with {shinyroom}!

You might remember from a few months back, @igjit presented on “RAW image processing with R” (TokyoR #79). Continuing where he left off he decided to create a photo-editing UI using the power of {shiny}. Motivated by comments following the previous presentation, @igjit decided to base it on “Adobe Lightroom”, and call it the {shinyroom} package. You can take a look at it here.

In terms of actually building the Shiny app, he used the {imager} package for the actual photo-editing functionality, while {golem} was used as the package framework for the app. For appearances, @igjit used the {shinythemes} package.

During the course of building the app, @igjit came across a peculiar problem concerning the screen when the Shiny app was busy. By default, a particular panel becomes slightly opaque when the server is busy doing stuff in the background, but this is annoying when you are working on editing images. To get around this problem, @igjit created another package called {shinyloadermessage} so that a small message appears instead of the screen graying out.

flaty13: Reproducible Shiny with {shinymeta}!

@flaty13 talked about the recently released {shinymeta} package and reproducibility with Shiny apps. This is a topic that has taken a while to develop due to the complexity of the issue, where the end goal was to find a way to document and describe the actions of users who interacted with very dynamic Shiny apps with many different features. With the {shinymeta} package you can now download R scripts that highlight the steps you took when interacting with the app.

The next step that is currently in development is to output an .RMD report among a number of other features as the package is still in the experimental phase. See the resources below for more details, especially Joe Cheng’s keynote for all the stuff under-the-hood that’s making this exciting new development possible!

Other talks

Food, Drinks, and Conclusion

TokyoR happens almost monthly and it’s a great way to mingle with Japanese R users as it’s the largest regular meetup here in Japan. The next meetup will be on October 26 and I will also be one of the presenters!

Talks in English are also welcome so if you’re ever in Tokyo come join us!


To leave a comment for the author, please follow the link and comment on their blog: R by R(yo).



Gold-Mining Week 9 (2019)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Week 9 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 9 (2019) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.


Offensive Programming in action (part III)


[This article was first published on NEONIRA, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the third post on offensive programming, dedicated to using offensive programming (OP).

You may refer to easy introduction to offensive programming to acquire some basic knowledge about offensive programming, and to discover offensive programming to get access to online materials.

Let’s see how OP helps in defining functions with semantic naming for parameter names, and in providing expected function return types. This will make use of many tools provided by package wyz.code.offensiveProgramming.

Focus on a given type suffix

Suppose you want to know more about the type suffix ’s’. Here is a typical session you could use.

library('wyz.code.offensiveProgramming')

# get the default type factory
tf <- retrieveFactory()
# is 's' a recorded type suffix?
tf$checkSuffix('s')
## [1] TRUE
# what does the 's' type suffix mean?
tf$getType('s')
## [1] "string"
# get verification function for type suffix 's'
tf$getVerificationFunction('s')
## function (o_1_) 
## {
##     if (!is.character(o_1_)) 
##         return(FALSE)
##     if (length(o_1_) == 0) 
##         return(TRUE)
##     all(is.na(o_1_) == FALSE)
## }
# get type factory information for suffix 's'
tf$getRecordedTypes()[suffix == 's']
##    suffix   type verify_function category
## 1:      s string                    basic

A real use case

Raw approach

Let’s consider the following function that helps organize seats for your friends around the table (with no constraints).

organizeTable <- function(guestNames_s) {
  guestNames_s[order(runif(length(guestNames_s)))]
}

guests <- c('Marie', 'Yves-Marie', 'Fabien', 'Marcello', 'Tina')
organizeTable(guests)
## [1] "Marcello"   "Marie"      "Tina"       "Yves-Marie" "Fabien"
organizeTable(character(0))
## character(0)

Are those calls legal?

From an R perspective, obviously yes. From an OP perspective, the answer is more subtle and depends on the evaluation mode you choose.

Evaluation modes

Currently, 3 evaluation modes are supported. They are incremental modes.

  1. standard_R_evaluation mode,
  2. enhanced_R_evaluation mode, adds function return type evaluation to previous mode,
  3. type_checking_enforcement mode, adds function parameter type verification to previous mode.
# Available evaluation modes
defineEvaluationModes()
## [1] "standard_R_evaluation"     "enhanced_R_evaluation"    ## [3] "type_checking_enforcement"

Standard R evaluation mode

em <- EvaluationMode('standard_R_evaluation')
efrt <- 'x_s' # expected function returned type

runTransientFunction(organizeTable, list(guests), em, efrt)
## $status
## [1] TRUE
## 
## $value
## [1] "Tina"       "Yves-Marie" "Marcello"   "Fabien"     "Marie"     
## 
## $mode
## [1] "standard_R_evaluation"

Using the standard_R_evaluation mode returns status, value and mode results. It does not provide any more information than a call in your R console would. This mode is provided to ease comparisons and is not meant to be used on its own.

Enhanced evaluation mode

This mode checks function returned value against expected function return type.

em <- EvaluationMode('enhanced_R_evaluation')
runTransientFunction(organizeTable, list(guests), em, efrt)
## $status## [1] TRUE## ## $value## [1] "Tina"       "Marcello"   "Marie"      "Fabien"     "Yves-Marie"## ## $mode## [1] "enhanced_R_evaluation"## ## $function_return_type_check##    parameter_name                       parameter_value validity## 1:            x_s Tina,Marcello,Marie,Fabien,Yves-Marie     TRUE##                message## 1: good type in values

Using the enhanced_R_evaluation mode adds function_return_type_check results, as a data.table. The validity column provides information about type concordance between the resulting value and the expected type.

Type checking enforcement mode

This mode checks parameter values against parameter semantic name specifications.

em <- EvaluationMode('type_checking_enforcement')
runTransientFunction(organizeTable, list(guests), em, efrt)
## $status## [1] TRUE## ## $value## [1] "Yves-Marie" "Tina"       "Marcello"   "Fabien"     "Marie"     ## ## $mode## [1] "type_checking_enforcement"## ## $parameter_type_checks##    parameter_name                       parameter_value validity## 1:   guestNames_s Marie,Yves-Marie,Fabien,Marcello,Tina     TRUE##                message## 1: good type in values## ## $function_return_type_check##    parameter_name                       parameter_value validity## 1:            x_s Yves-Marie,Tina,Marcello,Fabien,Marie     TRUE##                message## 1: good type in values

Using the type_checking_enforcement mode adds parameter_type_checks results, as a data.table. Here also, the validity column provides information about type concordance between the provided parameter value and the parameter's semantic name specification. The detail level is one line per function parameter, to ease discovery and remediation of non-compliances.

So, legal or not ?

Let’s check them

em <- EvaluationMode('type_checking_enforcement')
runTransientFunction(organizeTable, list(guests), em, efrt)
## $status## [1] TRUE## ## $value## [1] "Tina"       "Fabien"     "Yves-Marie" "Marcello"   "Marie"     ## ## $mode## [1] "type_checking_enforcement"## ## $parameter_type_checks##    parameter_name                       parameter_value validity## 1:   guestNames_s Marie,Yves-Marie,Fabien,Marcello,Tina     TRUE##                message## 1: good type in values## ## $function_return_type_check##    parameter_name                       parameter_value validity## 1:            x_s Tina,Fabien,Yves-Marie,Marcello,Marie     TRUE##                message## 1: good type in values
runTransientFunction(organizeTable,list(character(0)),em,efrt)
## $status## [1] TRUE## ## $value## character(0)## ## $mode## [1] "type_checking_enforcement"## ## $parameter_type_checks##    parameter_name parameter_value validity             message## 1:   guestNames_s                     TRUE good type in values## ## $function_return_type_check##    parameter_name parameter_value validity             message## 1:            x_s                     TRUE good type in values

From R's point of view, both calls are legal. From the offensive programming point of view, only calls returning a TRUE status are valid.

Whether or not to accept the second case is really a matter of input scope specification. Does organizing a table for nobody make any sense?

If your answer is yes, then keep this function definition. If not, how could you improve the previous implementation?

Refined approach

We will change the parameter name and specify a length constraint on it. Note that the function body is semantically exactly the same as for the previous function; there is no change in the implementation algorithm, only the parameter renaming is propagated.

organizeTableBis <- function(guestNames_s_1m) {
  guestNames_s_1m[order(runif(length(guestNames_s_1m)))]
}

organizeTableBis(guests)
## [1] "Fabien"     "Tina"       "Marie"      "Yves-Marie" "Marcello"
organizeTableBis(character(0))
## character(0)
runTransientFunction(organizeTableBis,list(guests),em,efrt)
## $status## [1] TRUE## ## $value## [1] "Tina"       "Fabien"     "Marcello"   "Yves-Marie" "Marie"     ## ## $mode## [1] "type_checking_enforcement"## ## $parameter_type_checks##     parameter_name                       parameter_value validity## 1: guestNames_s_1m Marie,Yves-Marie,Fabien,Marcello,Tina     TRUE##                message## 1: good type in values## ## $function_return_type_check##    parameter_name                       parameter_value validity## 1:            x_s Tina,Fabien,Marcello,Yves-Marie,Marie     TRUE##                message## 1: good type in values
runTransientFunction(organizeTableBis,list(character(0)),em,efrt)
## $status## [1] FALSE## ## $value## character(0)## ## $mode## [1] "type_checking_enforcement"## ## $parameter_type_checks##     parameter_name parameter_value validity## 1: guestNames_s_1m                    FALSE##                                       message## 1: wrong length, was expecting [1m] , got [0]## ## $function_return_type_check##    parameter_name parameter_value validity             message## 1:            x_s                     TRUE good type in values

Now the second case is flagged as invalid from the OP point of view. The reason is given explicitly in the parameter_type_checks data.table.

To conclude …

You have

  1. been introduced to semantic naming for function parameter names,
  2. been sensitized to semantic naming length specification impact,
  3. been shown how and when to use predefined evaluation modes,
  4. been initiated to use common OP tools, runTransientFunction, EvaluationMode, and function parameter type factory operations,
  5. been trained to interpret evaluation results

Great, we are more than half the way. Next post will be about registering your own types, and managing your classes and their related functional tests. Stay tuned.


To leave a comment for the author, please follow the link and comment on their blog: NEONIRA.


LongCART – Regression tree for longitudinal data


[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Longitudinal changes in a population of interest are often heterogeneous and may be influenced by a combination of baseline factors. The longitudinal tree (that is, a regression tree with longitudinal data) can be very helpful to identify and characterize the sub-groups with distinct longitudinal profiles in a heterogeneous population. This blog presents the capabilities of the R package LongCART for constructing a longitudinal tree according to the LongCART algorithm (Kundu and Harezlak 2019). In addition, this package can also be used to formally evaluate whether any particular baseline covariate affects the longitudinal profile via a parameter instability test. In this blog, the construction of a longitudinal tree is illustrated with an R dataset in a step by step approach and the results are explained.

Installing and Loading LongCART package

R> install.packages("LongCART")
R> library(LongCART)

Get the example dataset

The ACTG175 dataset in the speff2trial package contains longitudinal observations of CD4 counts from a clinical trial in HIV patients. This dataset is in “wide” format, and we first need to convert it to “long” format.

R> library(speff2trial)
R> data("ACTG175", package = "speff2trial")
R> adata<- reshape(data=ACTG175[,!(names(ACTG175) %in% c("cd80", "cd820"))],
+ varying=c("cd40", "cd420", "cd496"), v.names="cd4",
+ idvar="pidnum", direction="long", times=c(0, 20, 96))
R> adata<- adata[order(adata$pidnum, adata$time),]

Longitudinal model of interest

Since count data, including CD4 counts, are often log transformed before modeling, a simple longitudinal model for CD4 counts would be: log(CD4 count_it) = beta0 + beta1*t + b_i + epsilon_it
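As a point of reference (this is not part of the LongCART workflow itself), such a random-intercept model could be fitted with {nlme} roughly as follows, reusing the adata object created above:

# Sketch only: a random-intercept fit of the model above with nlme
library(nlme)

fit_lme <- lme(log(cd4 + 1) ~ time, random = ~ 1 | pidnum,
               data = adata, na.action = na.omit)
summary(fit_lme)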

Do the fixed parameters of the above longitudinal model vary with the level of a baseline covariate?

Categorical baseline covariate

Suppose we want to evaluate whether any of the fixed model parameters changes with the levels of a baseline categorical partitioning variable, say, gender.

R> adata$Y=ifelse(!is.na(adata$cd4),log(adata$cd4+1), NA)
R> StabCat(data=adata, patid="pidnum", fixed=Y~time, splitvar="gender")
Stability Test for Categorical grouping variable
Test.statistic=0.297, p-value=0.862

The p-value is 0.862 which indicates that we don’t have any evidence that fixed parameters vary with the levels of gender.

Continuous baseline covariate

Now suppose we are interested in evaluating whether any of the fixed model parameters changes with the levels of a baseline continuous partitioning variable, say, wtkg.

R> StabCont(data=adata, patid="pidnum", fixed=Y~time, splitvar="wtkg")
Stability Test for Continuous grouping variable
Test.statistic=1.004 1.945, Adj. p-value=0.265 0.002

The result returns two p-values – the first corresponds to the parameter instability test of beta0 and the second corresponds to beta1.

Constructing tree for longitudinal profile

The ACTG175 dataset contains several baseline variables including gender, hemo (presence of hemophilia), homo (homosexual activity), drugs (history of intravenous drug use), oprior (prior non-zidovudine antiretroviral therapy), z30 (zidovudine use 30 days prior to treatment initiation), zprior (zidovudine use prior to treatment initiation), race, str2 (antiretroviral history), treat (treatment indicator), offtrt (indicator of off-treatment before 96 weeks), age, wtkg (weight) and karnof (Karnofsky score). We can construct a longitudinal tree to identify sub-groups defined by these baseline variables such that the individuals within a given sub-group are homogeneous with respect to the longitudinal profile of CD4 counts, while the longitudinal profiles among the sub-groups are heterogeneous.

R> gvars=c("age", "gender", "wtkg", "hemo", "homo", "drugs",
+ "karnof", "oprior", "z30", "zprior", "race",
+ "str2", "symptom", "treat", "offtrt", "strat")
R> tgvars=c(1, 0, 1, 0, 0, 0,
+ 1, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0)
R> out.tree<- LongCART(data=adata, patid="pidnum", fixed=Y~time,
+ gvars=gvars, tgvars=tgvars, alpha=0.05,
+ minsplit=100, minbucket=50, coef.digits=3)

All the baseline variables are listed in the gvars argument. The gvars argument is accompanied by the tgvars argument, which indicates the type of each partitioning variable (0 = categorical, 1 = continuous). Note that the LongCART() function currently can handle categorical variables with numerical levels only; for nominal variables, please create the corresponding numerically coded dummy variable(s).

Now let’s view the tree results

R> out.tree$Treeout
ID n yval var index p (Instability) loglik improve Terminal
1 1 2139 5.841-0.003time offtrt 1.00 0.000 -4208 595 FALSE
2 2 1363 5.887-0.002time treat 1.00 0.000 -2258 90 FALSE
3 4 316 5.883-0.004time str2 1.00 0.005 -577 64 FALSE
4 8 125 5.948-0.002time symptom NA 1.000 -176 NA TRUE
5 9 191 5.84-0.005time symptom NA 0.842 -378 NA TRUE
6 5 1047 5.888-0.001time wtkg 68.49 0.008 -1645 210 FALSE
7 10 319 5.846-0.002time karnof NA 0.260 -701 NA TRUE
8 11 728 5.907-0.001time age NA 0.117 -849 NA TRUE
9 3 776 5.781-0.007time karnof 100.00 0.000 -1663 33 FALSE
10 6 360 5.768-0.009time wtkg NA 0.395 -772 NA TRUE
11 7 416 5.795-0.005time z30 1.00 0.014 -883 44 FALSE
12 14 218 5.848-0.003time treat NA 0.383 -425 NA TRUE
13 15 198 5.738-0.007time age NA 0.994 -444 NA TRUE

In the above output, each row corresponds to a single node, including the 7 terminal nodes identified by TERMINAL=TRUE. Now let’s visualize the tree results:

R> par(xpd = TRUE)
R> plot(out.tree, compress = TRUE)
R> text(out.tree, use.n = TRUE)

The resultant tree is as follows: [Figure: fitted LongCART tree for the ACTG175 CD4 data]


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


New package: simfinR


[This article was first published on R on msperlin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In my latest post I wrote about package GetEdgarData, which downloaded structured data from the SEC. I’ve been working on this project and soon realized that the available data at the SEC/DERA section is not complete. For example, all Q4 statements are missing. This seems to be the way all exchanges release the financial documents. I’ve found the same problem here in the Brazilian exchange.

It came to my attention that there is an alternative way of fetching corporate data and adjusted prices, the SimFin project. From its own website:

Our core goal is to make financial data as freely available as possible because we believe that  having the right tools for investing/research shouldn't be the privilege of those that can afford to spend thousands of dollars per year on data.

The platform is free with a daily limit of 2000 api calls. This is not bad and should suffice for most users. If you need more calls, the premium version is just 10 euros a month, a fraction of what other data vendors usually request.

Package simfinR, available on GitHub and soon on CRAN, facilitates all calls to the simfin API. It first makes sure the requested data exists and only then calls the API. As usual, all API queries are saved locally using package memoise. This means that the second time you ask for particular data about a company/year, the function will load a local copy and will not call the web API.
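To illustrate the kind of caching {memoise} provides (a generic illustration, not simfinR’s internal code):

library(memoise)

# A slow function standing in for a web API call
slow_query <- function(id) {
  Sys.sleep(2)
  data.frame(id = id, value = rnorm(1))
}

cached_query <- memoise(slow_query)

system.time(cached_query(111052))   # first call actually "hits the API"
system.time(cached_query(111052))   # second call returns the cached copy instantly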

Package GetEdgarData, however, will be discontinued. I’ll keep the files in Github but will no longer develop it. It takes a lot of time to write and maintain R packages, and I feel that simfinR has far more potential.

As mentioned before, both new packages, GetQuandlData and simfinR will be part of my next book, “Analyzing Financial and Economic Data with R”, which should be released in early 2020.

Installation

# not in CRAN yet (need to test it further)
# install.packages('simfinR')

# from Github
devtools::install_github('msperlin/simfinR')

Example 01 – Apples Quarterly Net Profit

The first step in using simfinR is finding information about available companies:

library(simfinR)
library(tidyverse)

# You need to get your own api key at https://simfin.com/
my_apy_key <- readLines('~/Dropbox/.api_key_simfin.txt')

# get info
df_info_companies <- simfinR_get_available_companies(my_apy_key)

# check it
glimpse(df_info_companies)
## Observations: 2,564## Variables: 3## $ simId   171401, 901704, 901866, 45730, 378251, 896477, 418866, 79…## $ ticker  "ZYXI", "ZYNE", "ZVO", "ZUMZ", "ZTS", "ZS", "ZNGA", "ZIOP…## $ name    "ZYNEX INC", "Zynerba Pharmaceuticals, Inc.", "Zovio Inc"…

We find information about 2564 companies. Digging deeper we find that the simfin id of Apple is 111052. Let’s use it to download the annual financial information since 2009.

id_companies <- 111052  # id of APPLE INC
type_statements <- 'pl' # profit/loss
periods = 'FY'          # final year
years = 2009:2018

df_fin_FY <- simfinR_get_fin_statements(id_companies,
                                        type_statements = type_statements,
                                        periods = periods,
                                        year = years,
                                        api_key = my_apy_key)

glimpse(df_fin_FY)
## Observations: 580## Variables: 13## $ company_name    "APPLE INC", "APPLE INC", "APPLE INC", "APPLE INC…## $ company_sector  "Computer Hardware", "Computer Hardware", "Comput…## $ type_statement  pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, p…## $ period          FY, FY, FY, FY, FY, FY, FY, FY, FY, FY, FY, FY, F…## $ year            2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2…## $ ref_date        2009-12-31, 2009-12-31, 2009-12-31, 2009-12-31, …## $ acc_name        "Revenue", "Sales & Services Revenue", "Financing…## $ acc_value       4.2905e+10, NA, NA, NA, -2.5683e+10, NA, NA, NA, …## $ tid             "1", "3", "5", "6", "2", "7", "8", "9", "4", "10"…## $ uid             "1", "0", "0", "0", "2", "0", "0", "0", "4", "10"…## $ parent_tid      "4", "1", "1", "1", "4", "2", "2", "2", "19", "19…## $ display_level   "0", "1", "1", "1", "0", "1", "1", "1", "0", "0",…## $ check_possible  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

And now we plot the results of the “Net Income” (profit/loss) for all years:

net_income <- df_fin_FY %>%
  filter(acc_name == 'Net Income')

p <- ggplot(net_income,
            aes(x = ref_date, y = acc_value)) +
  geom_col() +
  labs(title = 'Yearly Profit of APPLE INC',
       x = '',
       y = 'Yearly Profit',
       subtitle = '',
       caption = 'Data from simfin ') +
  theme_bw()

print(p)

Not bad!

We can also grab data for all quarters:

type_statements <- 'pl'             # profit/loss
periods = c('Q1', 'Q2', 'Q3', 'Q4') # final year
years = 2009:2018

df_fin_quarters <- simfinR_get_fin_statements(id_companies,
                                              type_statements = type_statements,
                                              periods = periods,
                                              year = years,
                                              api_key = my_apy_key)

glimpse(df_fin_quarters)
## Observations: 2,320## Variables: 13## $ company_name    "APPLE INC", "APPLE INC", "APPLE INC", "APPLE INC…## $ company_sector  "Computer Hardware", "Computer Hardware", "Comput…## $ type_statement  pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, pl, p…## $ period          Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q…## $ year            2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2…## $ ref_date        2009-03-31, 2009-03-31, 2009-03-31, 2009-03-31, …## $ acc_name        "Revenue", "Sales & Services Revenue", "Financing…## $ acc_value       1.188e+10, NA, NA, NA, -7.373e+09, NA, NA, NA, 4.…## $ tid             "1", "3", "5", "6", "2", "7", "8", "9", "4", "10"…## $ uid             "1", "0", "0", "0", "2", "0", "0", "0", "4", "10"…## $ parent_tid      "4", "1", "1", "1", "4", "2", "2", "2", "19", "19…## $ display_level   "0", "1", "1", "1", "0", "1", "1", "1", "0", "0",…## $ check_possible  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

And plot the results:

net_income <- df_fin_quarters %>%
  filter(acc_name == 'Net Income')

p <- ggplot(net_income,
            aes(x = period, y = acc_value)) +
  geom_col() +
  facet_grid(~year, scales = 'free') +
  labs(title = 'Quarterly Profit of APPLE INC',
       x = 'Quarters',
       y = 'Net Profit') +
  theme_bw()

print(p)

Nice and impressive profit record. The first quarter (Q1) seems to present the best performance, probably due to end of year holidays.

Example 02 – Quarterly Net Profit of Many Companies

Package simfinR can also fetch information for many companies in a single call. Let’s run another example by selecting four random companies and creating the same previous graph:

set.seed(5)
my_ids <- sample(df_info_companies$simId, 4)

type_statements <- 'pl' # profit/loss
periods = 'FY'          # final year
years = 2010:2018

df_fin <- simfinR_get_fin_statements(id_companies = my_ids,
                                     type_statements = type_statements,
                                     periods = periods,
                                     year = years,
                                     api_key = my_apy_key)

net_income <- df_fin %>%
  filter(acc_name == 'Net Income')

p <- ggplot(net_income,
            aes(x = ref_date, y = acc_value)) +
  geom_col() +
  labs(title = 'Annual Profit/Loss of Four Companies',
       x = '',
       y = 'Net Profit/Loss') +
  facet_wrap(~company_name, scales = 'free_y') +
  theme_bw()

print(p)

Example 03: Fetching price data

The simfin project also provides adjusted prices of stocks. Have a look:

set.seed(5)
my_ids <- sample(df_info_companies$simId, 4)

type_statements <- 'pl' # profit/loss
periods = 'FY' # final year
years = 2009:2018

df_price <- simfinR_get_price_data(id_companies = my_ids,
                                   api_key = my_apy_key)

p <- ggplot(df_price,
            aes(x = ref_date, y = close_adj)) +
  geom_line() +
  labs(title = 'Adjusted stock prices for four companies',
       x = '',
       y = 'Adjusted Stock Prices') +
  facet_wrap(~company_name, scales = 'free_y') +
  theme_bw()

print(p)

As you can see, the data is comprehensive and should suffice for many different corporate finance research topics.

Give it a try and, if you’ve found any problem or bug, please let me know at marceloperlin@gmail.com.



#FunDataFriday – ATX GIS Day


[This article was first published on #FunDataFriday - Little Miss Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

What is it?

ATX GIS day is being held in Austin, TX on Nov 13 to celebrate the use of GIS (Geographic Information Systems) data! You can think of it as a bunch of super cool data people getting together to talk about geography-based data and analysis techniques.

 

[Image: GISDay2019.png]

 

Why is it Awesome?

[Image: AustinLib3.jpg]

If the description didn’t already hook you, it’s awesome that someone has organized a space for data nerds geeking out over GIS data techniques.

Truthfully, GIS data analysis often creates some of the most beautiful data visualizations out there. Examples: ggmap, rayshader and so many more.

The gathering will be held in the beautiful Austin Central Library. There will be talks, VR activities, musical guests and drinks at the Cookbook Cafe (a cafe which serves literary-themed food and cocktails).

 

How to Get Started?

Come join us! I’ll be speaking in the afternoon about making beautiful maps with R and ggmap.

Learn more about the day and sign up here.



Command Centre amplification with predictive analytics and machine learning


[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Recently, our team at Draper and Dash has been busy creating an NHS operational command centre. This command centre is different, as it uses a collection and ensemble of cutting-edge predictive and machine learning techniques. You can read the full blog post below:

We have really enjoyed the work and are now in the process of creating the technical documentation for the R scripts and releasing the code. Please watch this space for further updates on the models we have used.

All our solutions are created in R and are validated with key NHS colleagues.

If you like the sound of the command centre, then please contact our team at sales-support@draperanddash.com.



AzureRMR 2.3.0 now on CRAN


[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This post is to announce that the latest update to AzureRMR is now available on CRAN. Version 2.3.0 brings several changes to make life easier when managing resources in Azure.

New in this version is a facility for parallelising connections to Azure, using a pool of background processes. Some operations, such as downloading many small files or interacting with a cluster of VMs, can be sped up significantly by carrying them out in parallel rather than sequentially. The code for this is currently duplicated in multiple packages including AzureStor and AzureVM; moving it into AzureRMR removes the duplication and also makes it available to other packages that may benefit. See the vignette for more details.
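As a rough illustration only (this is my own sketch, not code from the announcement), a pool is started and shut down with the init_pool() and delete_pool() functions that also appear in the masking messages below; the pool size argument used here is an assumption on my part:

library(AzureRMR)

# start a pool of background R processes (size chosen purely for illustration)
init_pool(5)

# ... AzureStor/AzureVM operations that support it will now run in parallel ...

# shut the pool down again when finished
delete_pool()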

One side-effect of this change is that loading a current version of AzureStor or AzureVM, along with AzureRMR, will bring up a message in the R console:

> library(AzureRMR)
> library(AzureVM)

Attaching package: 'AzureVM'

The following objects are masked from 'package:AzureRMR':

    delete_pool, init_pool

Similarly, if you load the SAR package, you will receive a warning:

> library(SAR)
Warning messages:
1: replacing previous import 'AzureRMR::init_pool' by 'AzureStor::init_pool' when loading 'SAR'
2: replacing previous import 'AzureRMR::delete_pool' by 'AzureStor::delete_pool' when loading 'SAR'

These messages are because the pool functions in AzureRMR have the same names as those in the other packages. You can safely ignore them; everything will still function correctly, and I'll be submitting updated versions to CRAN in the next few days (as soon as the AzureRMR update propagates to CRAN mirrors).

Other changes in 2.3.0 include:

  • Subscription and resource group objects now have do_operation methods, like resource objects. This allows you to carry out arbitrary operations on a sub or RG, if you know the REST call.
  • AzureGraph is now a direct import, which should help ensure your credentials are consistent for Resource Manager and Microsoft Graph.
  • Error messages should now be much more informative, especially when deploying templates.

If you run into problems, or to send feedback, please open an issue at the GitHub repo.




Multiple data imputation and explainability


[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Imputing missing values is quite an important task, but in my experience it is very often performed using simplistic approaches. The basic approach is to impute missing values for numerical features using the average of each feature, or the mode for categorical features. There are better ways of imputing missing values, for instance by predicting the values using a regression model or KNN. However, imputing only once is not enough, because each imputed value carries with it a certain level of uncertainty. To account for this, it is better to perform multiple imputation. This means that if you impute your dataset 10 times, you end up with 10 different datasets. Then you should perform your analysis 10 times: if you are training a machine learning model, for instance, you should train it on each of the 10 datasets (doing a train/test split for each, and potentially even tuning a model for each). Finally, you should pool the results of these 10 analyses.

I have come across this approach in the social sciences and the statistical literature in general, but very rarely in machine learning. Usually, in the social sciences, explainability is the goal of fitting statistical models to data, and the approach I described above is very well suited to this. Fit 10 (linear) regressions to the 10 imputed datasets, and then pool the estimated coefficients/weights together. Rubin’s rule is used to pool these estimates. You can read more about this rule here. In machine learning, the task is very often prediction; in this case, you should pool the predictions. Computing the average and other statistics of the predictions seems to work just fine in practice.
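To make that workflow concrete before moving on, here is a minimal sketch of the standard {mice} pipeline, using the nhanes toy dataset that ships with {mice} (my own illustration, not part of the analysis below):

library(mice)

# impute 10 times, fit the same model on each completed dataset,
# then pool the coefficients and standard errors with Rubin's rules
imp  <- mice(nhanes, m = 10, printFlag = FALSE, seed = 42)
fits <- with(imp, lm(bmi ~ age + hyp + chl))
summary(pool(fits))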

However, if you are mainly interested in explainability, how should you proceed? I’ve thought a bit about it, and the answer is “exactly the same way”… I think. What I’m sure about is that you should impute m times, run the analysis m times (which in this case will include getting explanations) and then pool. So the idea is to be able to pool explanations.

Explainability in the “standard” case (no missing values)

To illustrate this idea, I’ll be using the {mice} package for multiple imputation, {h2o} for the machine learning bit and {iml} for explainability. Note that I could have used any other machine learning package instead of {h2o}, as {iml} is totally package-agnostic. However, I have been experimenting with {h2o}’s automl implementation lately, so I happened to have code on hand. Let’s start with the “standard” case where the data does not have any missing values.

First, let’s load the needed packages and start an h2o cluster with h2o.init():

library(tidyverse)
library(Ecdat)
library(mice)
library(h2o)
library(iml)

h2o.init()

I’ll be using the DoctorContacts data. Here’s a description:

Click to view the description of the data

DoctorContacts              package:Ecdat              R Documentation

Contacts With Medical Doctor

Description:

     a cross-section from 1977-1978

     _number of observations_ : 20186

Usage:

     data(DoctorContacts)

Format:

     A time serie containing :

     mdu       number of outpatient visits to a medical doctor
     lc        log(coinsrate+1) where coinsurance rate is 0 to 100
     idp       individual deductible plan ?
     lpi       log(annual participation incentive payment) or 0 if no payment
     fmde      log(max(medical deductible expenditure)) if IDP=1 and MDE>1
               or 0 otherw
     physlim   physical limitation ?
     ndisease  number of chronic diseases
     health    self-rate health (excellent, good, fair, poor)
     linc      log of annual family income (in \$)
     lfam      log of family size
     educdec   years of schooling of household head
     age       exact age
     sex       sex (male, female)
     child     age less than 18 ?
     black     is household head black ?

Source:

     Deb, P. and P.K. Trivedi (2002) “The Structure of Demand for
     Medical Care: Latent Class versus Two-Part Models”, _Journal of
     Health Economics_, *21*, 601-625.

References:

     Cameron, A.C. and P.K. Trivedi (2005) _Microeconometrics :
     methods and applications_, Cambridge, pp. 553-556 and 565.

The task is to predict "mdu", the number of outpatient visits to an MD. Let’s prepare the data and split it into three sets: training, validation and holdout.

data("DoctorContacts")

contacts <- as.h2o(DoctorContacts)

splits <- h2o.splitFrame(data = contacts, ratios = c(0.7, 0.2))

original_train <- splits[[1]]
validation <- splits[[2]]
holdout <- splits[[3]]

features_names <- setdiff(colnames(original_train), "mdu")

As you see, the ratios argument c(0.7, 0.2) does not add up to 1. This means that the first of the splits will have 70% of the data, the second split 20% and the final 10% will be the holdout set.

Let’s first go with a Poisson regression. To obtain the same results as with R’s built-in glm() function, I use the options below, as per H2O’s GLM FAQ.

If you read Cameron and Trivedi’s Microeconometrics, where this data is presented in the context of count models, you’ll see that they also fit a negative binomial (NB2) model to this data, as it allows for overdispersion. Here, I’ll stick to a simple Poisson regression, simply because the goal of this blog post is not to get the best model; as explained in the beginning, this is an attempt at pooling explanations when doing multiple imputation (and it’s also because GBMs, which I use below, do not handle the negative binomial distribution).

glm_model <- h2o.glm(y = "mdu", x = features_names,
                     training_frame = original_train,
                     validation_frame = validation,
                     compute_p_values = TRUE,
                     solver = "IRLSM",
                     lambda = 0,
                     remove_collinear_columns = TRUE,
                     score_each_iteration = TRUE,
                     family = "poisson",
                     link = "log")

Now that I have this simple model, which returns (almost) the same results as R’s glm() function, I can take a look at the coefficients and see which ones are important, because GLMs are easily interpretable:

Click to view h2o.glm()’s output

summary(glm_model)
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model Key:  GLM_model_R_1572735931328_5 
## GLM Model: summary
##    family link regularization number_of_predictors_total
## 1 poisson  log           None                         16
##   number_of_active_predictors number_of_iterations  training_frame
## 1                          16                    5 RTMP_sid_8588_3
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  17.6446
## RMSE:  4.200547
## MAE:  2.504063
## RMSLE:  0.8359751
## Mean Residual Deviance :  3.88367
## R^2 :  0.1006768
## Null Deviance :64161.44
## Null D.o.F. :14131
## Residual Deviance :54884.02
## Residual D.o.F. :14115
## AIC :83474.52
## 
## 
## H2ORegressionMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  20.85941
## RMSE:  4.56721
## MAE:  2.574582
## RMSLE:  0.8403465
## Mean Residual Deviance :  4.153042
## R^2 :  0.09933874
## Null Deviance :19667.55
## Null D.o.F. :4078
## Residual Deviance :16940.26
## Residual D.o.F. :4062
## AIC :25273.25
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration iterations negative_log_likelihood
## 1 2019-11-03 00:33:46  0.000 sec          0             64161.43611
## 2 2019-11-03 00:33:46  0.004 sec          1             56464.99004
## 3 2019-11-03 00:33:46  0.020 sec          2             54935.05581
## 4 2019-11-03 00:33:47  0.032 sec          3             54884.19756
## 5 2019-11-03 00:33:47  0.047 sec          4             54884.02255
## 6 2019-11-03 00:33:47  0.063 sec          5             54884.02255
##   objective
## 1   4.54015
## 2   3.99554
## 3   3.88728
## 4   3.88368
## 5   3.88367
## 6   3.88367
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
##        variable relative_importance scaled_importance  percentage
## 1    black.TRUE          0.67756097        1.00000000 0.236627982
## 2   health.poor          0.48287163        0.71266152 0.168635657
## 3  physlim.TRUE          0.33962316        0.50124369 0.118608283
## 4   health.fair          0.25602066        0.37785627 0.089411366
## 5      sex.male          0.19542639        0.28842628 0.068249730
## 6      ndisease          0.16661902        0.24591001 0.058189190
## 7      idp.TRUE          0.15703578        0.23176627 0.054842384
## 8    child.TRUE          0.09988003        0.14741114 0.034881600
## 9          linc          0.09830075        0.14508030 0.034330059
## 10           lc          0.08126160        0.11993253 0.028379394
## 11         lfam          0.07234463        0.10677213 0.025265273
## 12         fmde          0.06622332        0.09773781 0.023127501
## 13      educdec          0.06416087        0.09469387 0.022407220
## 14  health.good          0.05501613        0.08119732 0.019213558
## 15          age          0.03167598        0.04675000 0.011062359
## 16          lpi          0.01938077        0.02860373 0.006768444

As a bonus, let’s see the output of the glm() function:

Click to view glm()’s output

train_tibble <- as_tibble(original_train)

r_glm <- glm(mdu ~ ., data = train_tibble,
             family = poisson(link = "log"))

summary(r_glm)
## 
## Call:
## glm(formula = mdu ~ ., family = poisson(link = "log"), data = train_tibble)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7039  -1.7890  -0.8433   0.4816  18.4703  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.0005100  0.0585681   0.009   0.9931    
## lc          -0.0475077  0.0072280  -6.573 4.94e-11 ***
## idpTRUE     -0.1794563  0.0139749 -12.841  < 2e-16 ***
## lpi          0.0129742  0.0022141   5.860 4.63e-09 ***
## fmde        -0.0166968  0.0042265  -3.951 7.80e-05 ***
## physlimTRUE  0.3182780  0.0126868  25.087  < 2e-16 ***
## ndisease     0.0222300  0.0007215  30.811  < 2e-16 ***
## healthfair   0.2434235  0.0192873  12.621  < 2e-16 ***
## healthgood   0.0231824  0.0115398   2.009   0.0445 *  
## healthpoor   0.4608598  0.0329124  14.003  < 2e-16 ***
## linc         0.0826053  0.0062208  13.279  < 2e-16 ***
## lfam        -0.1194981  0.0106904 -11.178  < 2e-16 ***
## educdec      0.0205582  0.0019404  10.595  < 2e-16 ***
## age          0.0041397  0.0005152   8.035 9.39e-16 ***
## sexmale     -0.2096761  0.0104668 -20.032  < 2e-16 ***
## childTRUE    0.1529588  0.0179179   8.537  < 2e-16 ***
## blackTRUE   -0.6231230  0.0176758 -35.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 64043  on 14096  degrees of freedom
## Residual deviance: 55529  on 14080  degrees of freedom
## AIC: 84052
## 
## Number of Fisher Scoring iterations: 6

I could also use the excellent {ggeffects} package to see the marginal effects of different variables, for instance "linc":

library(ggeffects)

ggeffect(r_glm, "linc") %>%
    ggplot(aes(x, predicted)) +
    geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "#0f4150") +
    geom_line(colour = "#82518c") +
    brotools::theme_blog()

We can see that as “linc” increases (with the other covariates held constant), the target variable increases too.

Let’s also take a look at the marginal effect of a categorical variable, namely "sex":

Click to view another example of marginal effects

library(ggeffects)

ggeffect(r_glm, "sex") %>%
    ggplot(aes(x, predicted)) +
    geom_point(colour = "#82518c") +
    geom_errorbar(aes(x, ymin = conf.low, ymax = conf.high), colour = "#82518c") +
    brotools::theme_blog()

In the case of the "sex" variable, men have significantly fewer doctor contacts than women.

Now, let’s suppose that I want to train a model with a more complicated name, in order to justify my salary. Suppose I go with one of those nifty black-box models, for instance a GBM, which very likely will perform better than the GLM from before. GBMs are available in {h2o} via the h2o.gbm() function:

gbm_model <- h2o.gbm(y = "mdu", x = features_names,
                     training_frame = original_train,
                     validation_frame = validation,
                     distribution = "poisson",
                     score_each_iteration = TRUE,
                     ntrees = 110,
                     max_depth = 20,
                     sample_rate = 0.6,
                     col_sample_rate = 0.8,
                     col_sample_rate_per_tree = 0.9,
                     learn_rate = 0.05)

To find a set of good hyper-parameter values, I actually used h2o.automl() and then used the returned parameter values from the leader model. Maybe I’ll write another blog post about h2o.automl() one day, it’s quite cool. Anyways, now, how do I get me some explainability out of this? The model does perform better than the GLM as indicated by all the different metrics, but now I cannot compute any marginal effects, or anything like that. I do get feature importance by default with:

h2o.varimp(gbm_model)
## Variable Importances: 
##    variable relative_importance scaled_importance percentage
## 1       age       380350.093750          1.000000   0.214908
## 2      linc       282274.343750          0.742143   0.159492
## 3  ndisease       245862.718750          0.646412   0.138919
## 4       lpi       173552.734375          0.456297   0.098062
## 5   educdec       148186.265625          0.389605   0.083729
## 6      lfam       139174.312500          0.365911   0.078637
## 7      fmde        94193.585938          0.247650   0.053222
## 8    health        86160.679688          0.226530   0.048683
## 9       sex        63502.667969          0.166958   0.035881
## 10       lc        50674.968750          0.133232   0.028633
## 11  physlim        45328.382812          0.119175   0.025612
## 12    black        26376.841797          0.069349   0.014904
## 13      idp        24809.185547          0.065227   0.014018
## 14    child         9382.916992          0.024669   0.005302

but that’s it. And had I chosen a different “black-box” model, not based on trees, then I would not even have that. Thankfully, there’s the amazing {iml} package, which contains a lot of functions for model-agnostic explanations. If you are not familiar with this package and the methods it implements, I highly encourage you to read the free online ebook written by the package’s author, Christoph Molnar (who you can follow on Twitter).

Out of the box, {iml} works with several machine learning frameworks, such as {caret} or {mlr} but not with {h2o}. However, this is not an issue; you only need to create a predict function which returns a data frame (h2o.predict() used for prediction with h2o models returns an h2o frame). I have found this interesting blog post from business-science.io which explains how to do this. I highly recommend you read this blog post, as it goes much deeper into the capabilities of {iml}.

So let’s write a predict function that {iml} can use:

# source: https://www.business-science.io/business/2018/08/13/iml-model-interpretability.html
predict_for_iml <- function(model, newdata){
  as_tibble(h2o.predict(model, as.h2o(newdata)))
}

And let’s now create a Predictor object. These objects are used by {iml} to create explanations:

just_features <- as_tibble(holdout[, 2:15])
actual_target <- as_tibble(holdout[, 1])

predictor_original <- Predictor$new(
  model = gbm_model, 
  data = just_features, 
  y = actual_target, 
  predict.fun = predict_for_iml
  )

predictor_original can now be used to compute all kinds of explanations. I won’t go into much detail here, as this blog post is already quite long (and I haven’t even reached what I actually want to write about yet); you can read more in the aforementioned blog post or directly in Christoph Molnar’s ebook linked above.

First, let’s compute a partial dependence plot, which shows the marginal effect of a variable on the outcome. This is to compare it to the one from the GLM model:

feature_effect_original <- FeatureEffect$new(predictor_original, "linc", method = "pdp")

plot(feature_effect_original) +
    brotools::theme_blog()


Quite similar to the marginal effects from the GLM! Let’s now compute model-agnostic feature importances:

feature_importance_original <- FeatureImp$new(predictor_original, loss = "mse")

plot(feature_importance_original)

And finally, the interaction strength of the sex variable with all the others:

interaction_sex_original <- Interaction$new(predictor_original, feature = "sex")

plot(interaction_sex_original)

Ok, so let’s assume that I’m happy with these explanations and do not need or want to go further. This would be the end of it in an ideal world. Unfortunately this is not an ideal world (though it’s the best we’ve got), and in the real world it often happens that data comes with missing values.

Missing data and explainability

As explained in the beginning, I’ve been wondering how to deal with missing values when the goal of the analysis is explainability. How can the explanations be pooled? Let’s start by creating a dataset with missing values, then perform multiple imputation, and then run the analysis.

First, let me create a patterns matrix that I will pass to the ampute() function from the {mice} package. This function creates a dataset with missing values, and by using its patterns argument, I can decide which columns should have missing values:

patterns <- -1*(diag(1, nrow = 15, ncol = 15) - 1)

patterns[ , c(seq(1, 6), c(9, 13))] <- 0

amputed_train <- ampute(as_tibble(original_train), prop = 0.1, patterns = patterns, mech = "MNAR")
## Warning: Data is made numeric because the calculation of weights requires
## numeric data

Let’s take a look at the missingness pattern:

naniar::vis_miss(amputed_train$amp) +
    brotools::theme_blog() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Ok, so now let’s suppose that this was the dataset I was given. As a serious data scientist, I decide to perform multiple imputation first:

imputed_train_data <- mice(data = amputed_train$amp, m = 10)

long_train_data <- complete(imputed_train_data, "long")

So because I performed multiple imputation 10 times, I now have 10 different datasets. I should now perform my analysis on these 10 datasets, which means I should run my GBM on each of them, and then get out the explanations for each of them. So let’s do just that. But first, let’s change the columns back to how they were; to perform amputation, the factor columns were converted to numbers:

long_train_data <- long_train_data %>%
    mutate(idp = ifelse(idp == 1, FALSE, TRUE),
           physlim = ifelse(physlim == 1, FALSE, TRUE),
           health = as.factor(case_when(health == 1 ~ "excellent",
                                        health == 2 ~ "fair",
                                        health == 3 ~ "good",
                                        health == 4 ~ "poor")),
           sex = as.factor(ifelse(sex == 1, "female", "male")),
           child = ifelse(child == 1, FALSE, TRUE),
           black = ifelse(black == 1, FALSE, TRUE))

Ok, so now we’re ready to go. I will use the h2o.gbm() function on each imputed dataset. For this, I’ll use the group_by()-nest() trick, which consists of grouping the dataset by the .imp column, then nesting it, then mapping the h2o.gbm() function to each imputed dataset. If you are not familiar with this, you can read this other blog post, which explains the approach. I define a custom function, train_on_imputed_data(), to run h2o.gbm() on each imputed dataset:

train_on_imputed_data <- function(long_data){

    long_data %>%
        group_by(.imp) %>%
        nest() %>%
        mutate(model = map(data, ~h2o.gbm(y = "mdu", x = features_names,
            training_frame = as.h2o(.),
            validation_frame = validation,
            distribution = "poisson",
            score_each_iteration = TRUE,
            ntrees = 110,
            max_depth = 20,
            sample_rate = 0.6,
            col_sample_rate = 0.8,
            col_sample_rate_per_tree = 0.9,
            learn_rate = 0.05)))
}

Now the training takes place:

imp_trained <- train_on_imputed_data(long_train_data)

Let’s take a look at imp_trained:

imp_trained
## # A tibble: 10 x 3
## # Groups:   .imp [10]
##     .imp            data model     
##  1     1   [14,042 × 16] 
##  2     2   [14,042 × 16] 
##  3     3   [14,042 × 16] 
##  4     4   [14,042 × 16] 
##  5     5   [14,042 × 16] 
##  6     6   [14,042 × 16] 
##  7     7   [14,042 × 16] 
##  8     8   [14,042 × 16] 
##  9     9   [14,042 × 16] 
## 10    10   [14,042 × 16] 

We see that the column model contains one model for each imputed dataset. Now comes the part I wanted to write about (finally): getting explanations out of this. Getting the explanations from each model is not the hard part; that’s easily done using some {tidyverse} magic (if you’re following along, run the bit of code below, then go make dinner, have dinner, and wash the dishes, because it takes time to run):

make_predictors <- function(model){
    Predictor$new(
        model = model, 
        data = just_features, 
        y = actual_target, 
        predict.fun = predict_for_iml
        )
}

make_effect <- function(predictor_object, feature = "linc", method = "pdp"){
    FeatureEffect$new(predictor_object, feature, method)
}

make_feat_imp <- function(predictor_object, loss = "mse"){
    FeatureImp$new(predictor_object, loss)
}

make_interactions <- function(predictor_object, feature = "sex"){
    Interaction$new(predictor_object, feature = feature)
}

imp_trained <- imp_trained %>%
    mutate(predictors = map(model, make_predictors)) %>% 
    mutate(effect_linc = map(predictors, make_effect)) %>% 
    mutate(feat_imp = map(predictors, make_feat_imp)) %>% 
    mutate(interactions_sex = map(predictors, make_interactions))

Ok, so now that I’ve got these explanations, I am done with my analysis. This is the time to pool the results together. Remember, in the case of regression models as used in the social sciences, this means averaging the estimated model parameters and using Rubin’s rule to compute their standard errors. But in this case, it is not so obvious. Should the explanations be averaged? Should I instead analyse them one by one, and see if they differ? My gut feeling is that they shouldn’t differ much, but who knows? Perhaps the answer is doing a bit of both. I have looked online for a paper that would shed some light on this, but have not found any. So let’s take a closer look at the explanations, starting with feature importance:

Click to view the 10 feature importances

imp_trained %>%
    pull(feat_imp)
## [[1]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1 ndisease     1.0421605   1.362672      1.467244          22.03037
## 2     fmde     0.8611917   1.142809      1.258692          18.47583
## 3      lpi     0.8706659   1.103367      1.196081          17.83817
## 4   health     0.8941010   1.098014      1.480508          17.75164
## 5       lc     0.8745229   1.024288      1.296668          16.55970
## 6    black     0.7537278   1.006294      1.095054          16.26879
## 
## [[2]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1      age      0.984304   1.365702      1.473146          22.52529
## 2     linc      1.102023   1.179169      1.457907          19.44869
## 3 ndisease      1.075821   1.173938      1.642938          19.36241
## 4     fmde      1.059303   1.150112      1.281291          18.96944
## 5       lc      0.837573   1.132719      1.200556          18.68257
## 6  physlim      0.763757   1.117635      1.644434          18.43379
## 
## [[3]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1      age     0.8641304   1.334382      1.821797          21.62554
## 2    black     1.0553001   1.301338      1.429119          21.09001
## 3     fmde     0.8965085   1.208761      1.360217          19.58967
## 4 ndisease     1.0577766   1.203418      1.651611          19.50309
## 5     linc     0.9299725   1.114041      1.298379          18.05460
## 6      sex     0.9854144   1.091391      1.361406          17.68754
## 
## [[4]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1 educdec     0.9469049   1.263961      1.358115          20.52909
## 2     age     1.0980269   1.197441      1.763202          19.44868
## 3  health     0.8539843   1.133338      1.343389          18.40753
## 4    linc     0.7608811   1.123423      1.328756          18.24649
## 5     lpi     0.8203850   1.103394      1.251688          17.92118
## 6   black     0.9476909   1.089861      1.328960          17.70139
## 
## [[5]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1     lpi     0.9897789   1.336405      1.601778          22.03791
## 2 educdec     0.8701162   1.236741      1.424602          20.39440
## 3     age     0.8537786   1.181242      1.261411          19.47920
## 4    lfam     1.0185313   1.133158      1.400151          18.68627
## 5     idp     0.9502284   1.069772      1.203147          17.64101
## 6    linc     0.8600586   1.042453      1.395231          17.19052
## 
## [[6]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##   feature importance.05 importance importance.95 permutation.error
## 1      lc     0.7707383   1.208190      1.379422          19.65436
## 2     sex     0.9309901   1.202629      1.479511          19.56391
## 3    linc     1.0549563   1.138404      1.624217          18.51912
## 4     lpi     0.9360817   1.135198      1.302084          18.46696
## 5 physlim     0.7357272   1.132525      1.312584          18.42349
## 6   child     1.0199964   1.109120      1.316306          18.04274
## 
## [[7]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1     linc     0.9403425   1.262994      1.511122          20.65942
## 2       lc     1.0481333   1.233136      1.602796          20.17103
## 3 ndisease     1.1612194   1.212454      1.320208          19.83272
## 4  educdec     0.7924637   1.197343      1.388218          19.58554
## 5     lfam     0.8423790   1.178545      1.349884          19.27805
## 6      age     0.9125829   1.168297      1.409525          19.11043
## 
## [[8]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1      age     1.1281736   1.261273      1.609524          20.55410
## 2   health     0.9134557   1.240597      1.432366          20.21716
## 3     lfam     0.7469043   1.182294      1.345910          19.26704
## 4      lpi     0.8088552   1.160863      1.491139          18.91779
## 5 ndisease     1.0756671   1.104357      1.517278          17.99695
## 6     fmde     0.6929092   1.093465      1.333544          17.81946
## 
## [[9]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1  educdec     1.0188109   1.287697      1.381982          20.92713
## 2      lpi     0.9853336   1.213095      1.479002          19.71473
## 3     linc     0.8354715   1.195344      1.254350          19.42625
## 4      age     0.9980451   1.179371      1.383545          19.16666
## 5 ndisease     1.0492685   1.176804      1.397398          19.12495
## 6     lfam     1.0814043   1.166626      1.264592          18.95953
## 
## [[10]]
## Interpretation method:  FeatureImp 
## error function: mse
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##    feature importance.05 importance importance.95 permutation.error
## 1      age     0.9538824   1.211869      1.621151          19.53671
## 2      sex     0.9148921   1.211253      1.298311          19.52678
## 3     lfam     0.8227355   1.093094      1.393815          17.62192
## 4 ndisease     0.8282127   1.090779      1.205994          17.58459
## 5       lc     0.7004401   1.060870      1.541697          17.10244
## 6   health     0.8137149   1.058324      1.183639          17.06138

As you can see, the feature importances are quite different from each other, but I don’t think this comes from the imputations; rather, it comes from the fact that feature importance depends on shuffling the feature, which adds randomness to the measurement (source: https://christophm.github.io/interpretable-ml-book/feature-importance.html#disadvantages-9). To mitigate this, Christoph Molnar suggests repeating the permutation and averaging the importance measures; I think that this would be my approach for pooling as well (see the sketch below).
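Here is that sketch (my own code, not from the original analysis); it relies on the results element of each FeatureImp object, which contains the feature and importance columns shown in the output above:

feat_imp_results <- imp_trained %>% 
    mutate(feat_imp_results = map(feat_imp, function(x)(x$results))) %>% 
    pull(feat_imp_results)

# bind the 10 importance tables together and average the importances per feature
feat_imp_results %>% 
    bind_rows() %>% 
    group_by(feature) %>% 
    summarise(mean_importance = mean(importance),
              sd_importance = sd(importance)) %>% 
    arrange(desc(mean_importance))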

Let’s now take a look at interactions:

Click to view the 10 interactions

imp_trained %>%
    pull(interactions_sex)
## [[1]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.07635197
## 2      idp:sex   0.08172754
## 3      lpi:sex   0.10704357
## 4     fmde:sex   0.11267146
## 5  physlim:sex   0.04099073
## 6 ndisease:sex   0.16314524
## 
## [[2]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.10349820
## 2      idp:sex   0.07432519
## 3      lpi:sex   0.11651413
## 4     fmde:sex   0.18123926
## 5  physlim:sex   0.12952808
## 6 ndisease:sex   0.14528876
## 
## [[3]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.05919320
## 2      idp:sex   0.05586197
## 3      lpi:sex   0.24253335
## 4     fmde:sex   0.05240474
## 5  physlim:sex   0.06404969
## 6 ndisease:sex   0.14508072
## 
## [[4]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.02775529
## 2      idp:sex   0.02050390
## 3      lpi:sex   0.11781130
## 4     fmde:sex   0.11084240
## 5  physlim:sex   0.17932694
## 6 ndisease:sex   0.07181589
## 
## [[5]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.12873151
## 2      idp:sex   0.03681428
## 3      lpi:sex   0.15879389
## 4     fmde:sex   0.16952900
## 5  physlim:sex   0.07031520
## 6 ndisease:sex   0.10567463
## 
## [[6]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.15320481
## 2      idp:sex   0.08645037
## 3      lpi:sex   0.16674641
## 4     fmde:sex   0.14671054
## 5  physlim:sex   0.09236257
## 6 ndisease:sex   0.14605618
## 
## [[7]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.04072960
## 2      idp:sex   0.05641868
## 3      lpi:sex   0.19491959
## 4     fmde:sex   0.07119644
## 5  physlim:sex   0.05777469
## 6 ndisease:sex   0.16555363
## 
## [[8]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.04979709
## 2      idp:sex   0.06036898
## 3      lpi:sex   0.14009307
## 4     fmde:sex   0.10927688
## 5  physlim:sex   0.08761533
## 6 ndisease:sex   0.20544585
## 
## [[9]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.08572075
## 2      idp:sex   0.12254979
## 3      lpi:sex   0.17532347
## 4     fmde:sex   0.12557420
## 5  physlim:sex   0.05084209
## 6 ndisease:sex   0.13977328
## 
## [[10]]
## Interpretation method:  Interaction 
## 
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##       .feature .interaction
## 1       lc:sex   0.08636490
## 2      idp:sex   0.04807331
## 3      lpi:sex   0.17922280
## 4     fmde:sex   0.05728403
## 5  physlim:sex   0.09392774
## 6 ndisease:sex   0.13408956

It would seem that the interactions are a bit more stable. Let’s average the values; for this I need to access the results element of each interactions object and pull the results out:

interactions_sex_result <- imp_trained %>%
    mutate(interactions_results = map(interactions_sex, function(x)(x$results))) %>%
    pull()

interactions_sex_result is a list of dataframes, which means I can bind the rows together and compute whatever I need:

interactions_sex_result %>%
    bind_rows() %>%
    group_by(.feature) %>%
    summarise_at(.vars = vars(.interaction),
                 .funs = funs(mean, sd, low_ci = quantile(., 0.05), high_ci = quantile(., 0.95)))
## # A tibble: 13 x 5
##    .feature       mean     sd low_ci high_ci
##  1 age:sex      0.294  0.0668 0.181    0.369
##  2 black:sex    0.117  0.0286 0.0763   0.148
##  3 child:sex    0.0817 0.0308 0.0408   0.125
##  4 educdec:sex  0.148  0.0411 0.104    0.220
##  5 fmde:sex     0.114  0.0443 0.0546   0.176
##  6 health:sex   0.130  0.0190 0.104    0.151
##  7 idp:sex      0.0643 0.0286 0.0278   0.106
##  8 lc:sex       0.0811 0.0394 0.0336   0.142
##  9 lfam:sex     0.149  0.0278 0.125    0.198
## 10 linc:sex     0.142  0.0277 0.104    0.179
## 11 lpi:sex      0.160  0.0416 0.111    0.221
## 12 ndisease:sex 0.142  0.0356 0.0871   0.187
## 13 physlim:sex  0.0867 0.0415 0.0454   0.157

That seems pretty good. Now, what about the partial dependence? Let’s take a closer look:

Click to view the 10 pdps

imp_trained %>%
    pull(effect_linc)
## [[1]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.652445   pdp
## 2 0.5312226 1.687522   pdp
## 3 1.0624453 1.687522   pdp
## 4 1.5936679 1.687522   pdp
## 5 2.1248905 1.685088   pdp
## 6 2.6561132 1.694112   pdp
## 
## [[2]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.813449   pdp
## 2 0.5312226 1.816195   pdp
## 3 1.0624453 1.816195   pdp
## 4 1.5936679 1.816195   pdp
## 5 2.1248905 1.804457   pdp
## 6 2.6561132 1.797238   pdp
## 
## [[3]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.906515   pdp
## 2 0.5312226 2.039318   pdp
## 3 1.0624453 2.039318   pdp
## 4 1.5936679 2.039318   pdp
## 5 2.1248905 2.002970   pdp
## 6 2.6561132 2.000922   pdp
## 
## [[4]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.799552   pdp
## 2 0.5312226 2.012634   pdp
## 3 1.0624453 2.012634   pdp
## 4 1.5936679 2.012634   pdp
## 5 2.1248905 1.982425   pdp
## 6 2.6561132 1.966392   pdp
## 
## [[5]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.929158   pdp
## 2 0.5312226 1.905171   pdp
## 3 1.0624453 1.905171   pdp
## 4 1.5936679 1.905171   pdp
## 5 2.1248905 1.879721   pdp
## 6 2.6561132 1.869113   pdp
## 
## [[6]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 2.147697   pdp
## 2 0.5312226 2.162393   pdp
## 3 1.0624453 2.162393   pdp
## 4 1.5936679 2.162393   pdp
## 5 2.1248905 2.119923   pdp
## 6 2.6561132 2.115131   pdp
## 
## [[7]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.776742   pdp
## 2 0.5312226 1.957938   pdp
## 3 1.0624453 1.957938   pdp
## 4 1.5936679 1.957938   pdp
## 5 2.1248905 1.933847   pdp
## 6 2.6561132 1.885287   pdp
## 
## [[8]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 2.020647   pdp
## 2 0.5312226 2.017981   pdp
## 3 1.0624453 2.017981   pdp
## 4 1.5936679 2.017981   pdp
## 5 2.1248905 1.981122   pdp
## 6 2.6561132 2.017604   pdp
## 
## [[9]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.811189   pdp
## 2 0.5312226 2.003053   pdp
## 3 1.0624453 2.003053   pdp
## 4 1.5936679 2.003053   pdp
## 5 2.1248905 1.938150   pdp
## 6 2.6561132 1.918518   pdp
## 
## [[10]]
## Interpretation method:  FeatureEffect 
## features: linc[numerical]
## grid size: 20
## 
## Analysed predictor: 
## Prediction task: unknown 
## 
## 
## Analysed data:
## Sampling from data.frame with 2013 rows and 14 columns.
## 
## Head of results:
##        linc   .y.hat .type
## 1 0.0000000 1.780325   pdp
## 2 0.5312226 1.850203   pdp
## 3 1.0624453 1.850203   pdp
## 4 1.5936679 1.850203   pdp
## 5 2.1248905 1.880805   pdp
## 6 2.6561132 1.881305   pdp

As you can see, the values are quite similar. I think that in the case of plots, the best way to visualize the impact of the imputation is to simply plot all the lines in a single plot:

effect_linc_results <- imp_trained %>%
    mutate(effect_linc_results = map(effect_linc, function(x)(x$results))) %>%
    select(.imp, effect_linc_results) %>%
    unnest(effect_linc_results)

effect_linc_results %>%
    bind_rows() %>%
    ggplot() +
    geom_line(aes(y = .y.hat, x = linc, group = .imp), colour = "#82518c") +
    brotools::theme_blog()

Overall, the partial dependence plot seems to behave in a very similar way across the different imputed datasets!

To conclude, I think that the approach I suggest here is nothing revolutionary; it is consistent with the way one should conduct an analysis with multiply imputed datasets. However, the pooling step is non-trivial and there is no magic recipe; it really depends on the goal of the analysis and what you want or need to show.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub.




binb 0.0.5: More improvements


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The fifth release of the binb package just arrived on CRAN. binb regroups four rather nice themes for writing LaTeX Beamer presentations much more easily in (R)Markdown. As a teaser, a quick demo combining all four themes follows; documentation and examples are in the package.

This release contains some nice extensions to the Monash theme by Rob Hyndman (https://robjhyndman.com/). You can see a longer demo in this pdf and the extended options (i.e. for the titlepage) in this pdf. David Selby also corrected a minor internal wart in Presento.
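If you have not used binb before, switching a presentation to one of its themes is just a matter of pointing the R Markdown output format at it; a minimal sketch (the file name below is a placeholder):

# in the .Rmd YAML header you would typically set:
#   output: binb::monash
# or render an existing file from the R console:
rmarkdown::render("demo.Rmd", output_format = binb::monash())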

Changes in binb version 0.0.5 (2019-11-02)

  • The Monash theme was updated with new titlepage and font handling and an expanded demo (Rob in #20).

  • The presento theme is now correctly labeled as exported (David Selby in #22).

  • The two Monash demos are now referenced from README.md (Dirk).

CRANberries provides a summary of changes to the previous version. For questions or comments, please use the issue tracker at GitHub.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Re-creating survey microdata from marginal totals by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I recently did some pro bono work for Gun Control NZ reviewing the analysis by a market research firm of the survey that led to this media release: “Most New Zealanders back stronger gun laws”. The analysis all checked out ok. The task at that time was to make sure that any claims about different perceptions of different groups in New Zealand society were backed by adequately severe (for the context) tests, and in the end I just performed a bunch of pragmatic Chi-square tests with the aggregate results provided by the original market research company.
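Such tests need nothing more than the published cross-tabulated counts. A minimal sketch with invented numbers (purely illustrative, not the actual survey results):

# illustrative counts only, made up for this sketch
tab <- matrix(c(520, 180,
                480, 230),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("female", "male"),
                              view   = c("support", "oppose")))

chisq.test(tab)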

However, my first thought when I saw that I had access only to the marginal aggregate totals, not the original microdata, was that it might be easier to analyse if I could re-create the original microdata. Amongst other things, this would have meant I could cycle through various bits of analysis with less code. Creating an adequate set of microdata was a harder job than I had hoped, so I didn’t use that method for the original task. But since then I have been pecking away at it to see what it would take, partly for potential future use in this situation, partly as a general self-learning exercise in further understanding issues relating to statistical disclosure control (confidentialisation by cell suppression, random rounding, etc), which is a more relevant set of techniques today than ever.

See this story for an example of the claim that an extended version of the sort of methods I describe here, in combination with repeated queries of marginal totals that have been perturbed for statistical disclosure control, could be used to recover original sensitive census data. Me, I’m sceptical of the practicality of the alleged “exploit” in that story, which is more theoretical than actual, but it’s true that there is a potential vulnerability here that is worth dabbling my toes in to understand a little. It’s certainly true that repeated queries of a tool that delivers marginal totals (i.e. crosstabs) can get around the more basic forms of statistical disclosure control perturbation, which is why national stats offices are investing heavily in methods of control such as fixed random rounding to mitigate that particular vulnerability.
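As an aside for readers unfamiliar with random rounding, here is a rough sketch of unbiased random rounding to base 3 (my own illustration; "fixed" random rounding additionally ensures the same cell always rounds the same way):

# round counts to a multiple of 3, up or down with probabilities chosen
# so that the expected value equals the original count
rr3 <- function(n){
  r <- n %% 3
  ifelse(r == 0, n, n - r + 3 * rbinom(length(n), size = 1, prob = r / 3))
}

set.seed(123)
rr3(c(1, 2, 17, 100))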

But to be clear on this up front – I don’t think a data snooper re-creating sensitive data is a realistic fear with regard to the sort of survey, with limited number of variables, such as that used for a demo in today’s post. And my analysis below bears this out.

Example – Survey of New Zealand views on gun control

When tidied up, the available data from the survey I am working with looks like this, a typical set of high level results from a market research poll:

This chart shows one of two substantive questions in the survey, which is all I will be focusing on today. The visually prominent differences in response, such as by ethnicity, region, gender and rural or urban location, are statistically significant. The published data show cross-tabulated counts by one question and one variable at a time (which I will be calling “marginal totals” in this post) but no interactions.

The code for making these nice diverging stacked bar charts for Likert-like data is shown a bit later down this post.

My task today is, it it possible to recreate a set of microdata that adds up to the aggregates in the available data? This is a special case of creating synthetic data, a common statistical disclosure control technique used to make microdata available for analysis that has the same properties as original microdata but without revealing sensitive details. Sometimes the data is analysed directly, sometimes it is used as a training set of data with which statisticians can develop their models safely before sending them off to a secure environment to be run on the real, gold standard data.

Of course, this isn’t a new problem. A nice overview of some model-based methods is provided by Alan Lee in this discussion of creating SURFs (Simulated Unit Record Files) with New Zealand data. The synthpop R package was developed as part of the Synthetic Data Estimation for UK Longitudinal Studies project and seems to be the most advanced software in this area. However, the motivation of these efforts is to create a SURF that has plausibly been generated from the same underlying joint distribution of the population as the actual sample; I have given myself the slightly extended task of generating a SURF that has exactly the same joint distribution as the actual sample (not the population), at least to the degree of detail revealed by the published marginal totals.
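For orientation, the conventional workflow with synthpop looks roughly like the sketch below. This is not part of the analysis in this post; `microdata` stands in for a hypothetical unit-record data frame, and the calls shown are the package's standard entry points rather than anything specific to this survey.

# A minimal sketch of conventional synthetic data generation with synthpop,
# assuming `microdata` is a hypothetical data frame of original unit records:
library(synthpop)

synth <- syn(microdata, seed = 123)  # fit sequential models and simulate a synthetic copy
summary(synth)                       # summarise the synthesised variables
compare(synth, microdata)            # compare synthetic and observed distributions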

Is this possible? There must be at least one solution, which is the actual sample (unless the aggregate totals have been deliberately fuzzed eg by random rounding or other perturbation for disclosure control). Can I find that solution? Is there more than one solution; if not, could this method be something a snooper could do to recover sensitive information? If there are multiple solutions, does finding one allow us to get extra insight from the data, for example by fitting a model that looks for effects from one variable while controlling for others, something that isn’t possible with just the univariate aggregates? These are interesting questions.

It turns out that the answers are:

  • It is possible to find a solution that very closely matches the published marginal totals, but very difficult to get an exact match
  • The close solutions aren’t unique
  • The existence of multiple solutions means that a snooper is unlikely to recover sensitive information by this method
  • Unless you combine considerably more variables than I have done in this instance, the process of re-synthesising microdata from the marginal totals does not yield any additional insights.

Data management

Import

First step as always is data management. This first chunk of code below downloads the data, tidies it and produces the first chart.

Post continues after R code

library(tidyverse)
library(survey)
library(tidyverse)
library(scales)
library(readxl)
library(janitor)
library(rmarkdown)
library(kableExtra)
library(knitr)
library(RColorBrewer)

download.file("https://github.com/ellisp/blog-source/raw/master/data/NZ%20gunlaw%20survey%202019%20Sep%20supplementary%20tables.xlsx",
              destfile = "gun-law-survey-tables.xlsx", mode = "wb")

q1 <- "How strongly do you support or oppose strengthening New Zealand’s existing gun laws?"

orig1 <- read_excel("gun-law-survey-tables.xlsx", sheet = "Strengthening of exisiting laws", skip = 1)
names(orig1)[1:2] <- c("variable", "value")

clean <- function(x){
  x <- gsub("/", " ", x, fixed = TRUE)
  x <- gsub(" ", "_", str_squish(x), fixed = TRUE)
  x <- gsub("M.ori", "Maori", x)
  return(x)
}

freq1c <- orig1 %>%
  select(-`Total support`, -`Total oppose`) %>%
  fill(variable) %>%
  filter(!is.na(value)) %>%
  gather(answer, prop, -variable, -value, -n) %>%
  group_by(variable, value) %>%
  mutate(Freq = pmax(0.01, n * prop / 100)) %>%
  ungroup() %>%
  select(variable, value, answer, Freq)

ans_levs <- c("Strongly oppose", "Somewhat oppose", "Neither support or oppose",
              "Unsure", "Strongly support", "Somewhat support")

# Graphic
freq1c %>%
  filter(variable != "All") %>%
  mutate(answer = ordered(answer, levels = ans_levs)) %>%
  group_by(variable, value) %>%
  mutate(prop = Freq / sum(Freq)) %>%
  filter(!answer %in% c("Unsure", "Neither support or oppose")) %>%
  mutate(prop = ifelse(grepl("oppose", answer), -prop, prop)) %>%
  ungroup() %>%
  mutate(variable = gsub("_", " ", variable),
         value = fct_relevel(str_wrap(value, 30),
                             c("Other ethnicity", "$50k-$100k", "Under $50k", "Canterbury",
                               "Wellington/ Wairarapa", "Lower North Is.", "Upper North Is.", "Auckland"),
                             after = Inf)) %>%
  ggplot(aes(weight = prop, x = value, fill = answer)) +
  facet_wrap(~variable, scales = "free_y") +
  geom_bar(position = "stack") +
  coord_flip() +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  scale_fill_manual(breaks = c("Strongly oppose", "Somewhat oppose", "Somewhat support", "Strongly support"),
                    values = brewer.pal(4, "Spectral")[c(1, 2, 4, 3)]) +
  labs(y = "'Unsure' and 'Neither' responses omitted from chart, but not from calculation of percentages",
       x = "",
       fill = "",
       subtitle = q1,
       title = "Support and opposition to gun control in New Zealand")

Approach

We will be performing this resurrection of the microdata in four steps:

  1. create a table of all possible combinations of each variable (income, age, ethnicity, possible answers to the question on gun control)
  2. create a set of weights for each combination that add up to the observed marginal distributions (eg combinations of age with particular answers to the question), from which we can select samples that represent the same population we have inferred the original sample was drawn from
  3. draw a sample from that population,
  4. evaluate our sample’s marginal totals against those of the original sample, and “improve” our simulated sample by dropping excess individuals and replacing them with new ‘respondents’ that give the sample properties that more closely resemble the original sample

The usual process of generating a SURF performs just steps 1 to 3, which turn out to be fairly straightforward with our relatively small number of variables. It is task four that looks to be the difficult one; a good thing too, or it would be too easy for snoopers to re-create microdata from aggregates.

BTW if people are wondering about my use of the term “snooper”, this isn’t me being whimsical; this is the term generally used in the statistical disclosure control literature for the adversary that we build our confidentialisation methods to guard against.

Dealing with some variables’ idiosyncrasies

When we look at counts of responses by each of the reported variables, straight away we have two interesting problems.

  • The “ethnicity” variable, while reported in aggregate the same as variables such as region and income, is different because in the standard New Zealand classifications it is possible for a respondent to report two ethnicities. So we have more than 1,000 total responses by ethnicity despite the reported sample size being 1,000
  • Other variables have fewer than 1,000 responses, almost certainly due to non-response; and the “unknown” categories for those variables are not provided.
variable sum(Freq)
Age 1000.0000
All 1000.0000
Dependent children 1000.0000
Employment 1000.0000
Ethnicity 1098.8538
Gender 995.7287
Household income 857.4557
Living Situation 928.8421
Region 1000.0100
Rural 956.4756

… which was generated with this:

freq1c %>%
  group_by(variable) %>%
  summarise(sum(Freq))

It looks like age, dependent children, employment and region were all fully responded to (perhaps mandatory) whereas other variables had varying degrees of partial response, with income being the least answered question (unsurprising because of its sensitivity).

For income (and similar variables) we can recover the marginal totals of those for whom income is unknown by comparing the total responses for each level of the substantive question for “All” to those for whom we do have income information.

For ethnicity we need five yes/no variables, one for each of the possible ethnicities. When we generate our full “all combinations” population, we will eliminate those with more than two ethnicities, which helps keep size down to reasonable levels.

A bit of mucking around lets us deduce the values of those unknowns, and turn ethnicity into multiple variables.

#--------------Ethnicity variables and unknowns (income etc)--------------------
freq1c %>%
  group_by(variable) %>%
  summarise(sum(Freq)) %>%
  kable() %>%
  kable_styling() %>%
  write_clip()

freq1a <- freq1c %>%
  mutate(variable = ifelse(variable == "Ethnicity", value, variable),
         variable = clean(variable),
         value = gsub("$100,000k", "$100k", value, fixed = TRUE))

ethnicity_vars <- freq1a %>%
  filter(variable == clean(value)) %>%
  distinct(variable) %>%
  pull(variable)

totals <- freq1a %>%
  filter(variable == "All") %>%
  select(answer, answer_total = Freq)

unknowns <- freq1a %>%
  filter(variable != "All") %>%
  group_by(variable, answer) %>%
  summarise(variable_total = sum(Freq)) %>%
  left_join(totals, by = "answer") %>%
  mutate(Freq = round(answer_total - variable_total, 1),
         value = ifelse(variable %in% ethnicity_vars,
                        paste("Not", variable),
                        paste("Unknown", variable))) %>%
  ungroup() %>%
  select(variable, value, answer, Freq) %>%
  filter(Freq > 0)

freq1b <- freq1a %>%
  filter(variable != "All") %>%
  rbind(unknowns) %>%
  mutate(Freq = round(Freq)) %>%
  filter(Freq > 0)

Step 1 – creating all possible combinations

The first few rows of the object freq1b above look like this:

variable value answer Freq
Region Upper North Is. Strongly support 99
Region Auckland Strongly support 155
Region Wellington/ Wairarapa Strongly support 69
Region Lower North Is. Strongly support 52
Region Canterbury Strongly support 72
Region Other South Is. Strongly support 53
Rural Yes Strongly support 50
Rural No Strongly support 436
Gender Male Strongly support 216
Gender Female Strongly support 283

Here is some code that takes that freq1b object and turns it into a big data frame of all the possible combinations of demographic variables and answers to question 1, hence representing the full population of New Zealand in scope for the survey (although not yet in the actual proportions of that population).

#-------------------Create population dataset of all possible combinations of variables---------
poss_answers1 <- freq1b %>%
  distinct(answer) %>%
  rename(value = answer) %>%
  mutate(variable = "q1")

all_vars <- freq1b %>%
  distinct(variable, value) %>%
  rbind(poss_answers1) %>%
  group_by(variable) %>%
  mutate(sequence = 1:n()) %>%
  ungroup()

all_vars_l <- with(all_vars, split(value, variable))

all_combos <- do.call(expand.grid, all_vars_l) %>%
  as_tibble() %>%
  mutate_all(as.character) %>%
  mutate(wt = 1) %>%
  filter(!(q1 == "Unsure" & Living_Situation == "Renting from Housing New Zealand or other social housing organisation")) %>%
  filter(!(q1 == "Unsure" & Region == "Wellington/ Wairarapa")) %>%
  filter(!(Gender == "Unknown Gender" & q1 %in% c("Neither support or oppose", "Somewhat oppose", "Unsure"))) %>%
  # remove people with more than 2 ethnicities, to save 1+ million impossible combinations:
  mutate(number_ethnicities = (Asian == "Asian") +
           (NZ_European_Other_European == "NZ European / Other European") +
           (NZ_Maori == "NZ Māori") +
           (Other_ethnicity == "Other ethnicity") +
           (Pasifika == "Pasifika")) %>%
  filter(number_ethnicities %in% 1:2) %>%
  select(-number_ethnicities)

Note that I have eliminated by hand a few combinations of variables that are implicitly non-existent in terms of our original sample, and also that the last steps in the chunk above are to cut down our combinations of variables to those that have only one or two ethnicities.

Step 2 – weights that resemble the marginal totals from the actual sample

So far so good. The next step is the one that gets most attention in the synthetic data literature: creating a model or set of weights that will allow our table of all possible combinations of variables to actually be representative of the in-scope population, as understood from the original sample. There are many ways to go about this, with and without explicit statistical models, and I have opted for the simplest, which is to use iterative proportional fitting (or “raking”) to create a set of weights for my all-combinations table that adds up to every one of the observed combinations of aggregate marginal totals. I use Thomas Lumley’s survey package as a short cut here, noting the irony that I am creating weights for a population to match a sample, rather than (as is usually the case) for a sample to match a population. Sometimes this makes it hard to keep track of exactly what I mean by population and sample in the code below.

# Create a list with 13 different population tibbles, for each combination of a variable with Q1
pops1 <- freq1b %>%
  group_split(variable) %>%
  lapply(., function(x){
    names(x)[2:3] <- c(unique(x$variable), "q1")
    x <- x[, -1]
    return(x)
  })

full_data <- all_combos

d1 <- svydesign(~1, weights = ~wt, data = full_data)

d2 <- rake(d1,
           sample = list(~Age + q1, ~Asian + q1, ~Dependent_children + q1, ~Employment + q1,
                         ~Gender + q1, ~Household_income + q1, ~Living_Situation + q1,
                         ~NZ_European_Other_European + q1, ~NZ_Maori + q1, ~Other_ethnicity + q1,
                         ~Pasifika + q1, ~Region + q1, ~Rural + q1),
           population = list(pops1[[1]], pops1[[2]], pops1[[3]], pops1[[4]], pops1[[5]],
                             pops1[[6]], pops1[[7]], pops1[[8]], pops1[[9]], pops1[[10]],
                             pops1[[11]], pops1[[12]], pops1[[13]]))

full_data$wt <- weights(d2)

This only takes a few seconds to run on my laptop.

So we now have the full_data object, which has 2.5 million rows and 15 columns. The first 10 rows look like this:

Age Asian Dependent_children Employment Gender Household_income Living_Situation NZ_European_Other_European NZ_Maori Other_ethnicity Pasifika q1 Region Rural wt
18-29 Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 2e-07
30-44 Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07
45-59 Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07
60+ Not Asian Yes 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 3e-07
18-29 Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 6e-07
30-44 Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 8e-07
45-59 Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 9e-07
60+ Not Asian No 30 hours or more Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-06
18-29 Not Asian Yes Less than 30 hours Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-07
30-44 Not Asian Yes Less than 30 hours Male Under $50k Renting from a private landlord or property management company Not NZ_European_Other_European NZ Maori Other ethnicity Pasifika Strongly support Upper North Is. Yes 1e-07

Within rounding errors, that final wt column adds up to the published survey sample totals. For example:

> full_data %>%
+   group_by(Gender, q1) %>%
+   summarise(full_data_total = sum(wt)) %>%
+   left_join(pops1[[5]]) %>%
+   rename(original_survey_total = Freq)
Joining, by = c("Gender", "q1")
# A tibble: 15 x 4
# Groups:   Gender [3]
   Gender         q1                        full_data_total original_survey_total
   <chr>          <chr>                               <dbl>                 <dbl>
 1 Female         Neither support or oppose            55                      55
 2 Female         Somewhat oppose                      38                      38
 3 Female         Somewhat support                    108.                    107
 4 Female         Strongly oppose                      21                      21
 5 Female         Strongly support                    282.                    283
 6 Female         Unsure                                9                       9
 7 Male           Neither support or oppose            63                      63
 8 Male           Somewhat oppose                      49.4                    49
 9 Male           Somewhat support                     96.5                    96
10 Male           Strongly oppose                      51                      51
11 Male           Strongly support                    216.                    216
12 Male           Unsure                                8                       8
13 Unknown Gender Somewhat support                      1.00                    1
14 Unknown Gender Strongly oppose                       1                       1
15 Unknown Gender Strongly support                      2.00                    2

Step 3 – a sample from that population

So, a sample drawn at random from full_data with probability of selection proportionate to wt has a fighting chance of coming from the true population with the same joint distribution that the original sample came from, that is, the actual in-scope population of New Zealand. If we were just trying to create a SURF for analysts to use we would stop here, but we are interested not just in whether it is drawn from the same population, but how exactly it resembles the original sample. So in the code below I create the function evaluate_dissim(), which is going to get some heavy duty use later on. It compares the discrepancies in marginal totals between the new sample and the population totals it is meant to have, returning the sum of the absolute values.

# we can draw a sample from the set of all combinations
set.seed(877)
latest_sample <- sample_n(full_data, 1000, replace = TRUE, weight = wt)

#' Evaluate the dissimilarity of a current sample's marginal totals compared to
#' those in the various margin "population" totals
#'
#' @param latest_sample a sample to evaluate
#' @param pops a list of population data frames with three columns; first two are variables, third is Freq
#' @details provides the sum of the absolute differences of the marginal totals from the sample with those
#' in pops that it is trying to resemble
evaluate_dissim <- function(latest_sample, pops){
  total_discrepency <- 0

  for(j in 1:length(pops)){
    var_names <- names(pops[[j]])[1:2]

    x <- latest_sample[, var_names] %>%
      group_by_all() %>%
      summarise(sample = n()) %>%
      left_join(pops[[j]], by = var_names) %>%
      mutate(discrepency = abs(sample - Freq)) %>%
      ungroup() %>%
      summarise(discrepency = sum(discrepency)) %>%
      pull(discrepency)

    total_discrepency <- total_discrepency + x
  }
  return(total_discrepency)
}

discrepencies <- evaluate_dissim(latest_sample, pops1)

Our starting sample is out from the marginal totals of the original sample by a total of 1,546. For example, using gender again:

> latest_sample %>%
+   group_by(Gender, q1) %>%
+   summarise(latest_sample_total = n()) %>%
+   left_join(pops1[[5]]) %>%
+   rename(original_sample_total = Freq)
Joining, by = c("Gender", "q1")
# A tibble: 15 x 4
# Groups:   Gender [3]
   Gender         q1                        latest_sample_total original_sample_total
   <chr>          <chr>                                   <int>                 <dbl>
 1 Female         Neither support or oppose                  64                    55
 2 Female         Somewhat oppose                            45                    38
 3 Female         Somewhat support                          102                   107
 4 Female         Strongly oppose                            19                    21
 5 Female         Strongly support                          265                   283
 6 Female         Unsure                                      7                     9
 7 Male           Neither support or oppose                  71                    63
 8 Male           Somewhat oppose                            44                    49
 9 Male           Somewhat support                          100                    96
10 Male           Strongly oppose                            29                    51
11 Male           Strongly support                          240                   216
12 Male           Unsure                                      8                     8
13 Unknown Gender Somewhat support                            2                     1
14 Unknown Gender Strongly oppose                             1                     1
15 Unknown Gender Strongly support                            3                     2

The new sample has similar results to the original – it’s plausibly from the same population, which is all we’ve aimed for so far – but it’s clearly not exactly the same. For example, my sample has 64 female respondents who neither support or oppose the gun control in question, whereas the original sample published a figure of 55. Close but not exact.

Step 4 – improving the simulated sample to make it resemble the original sample’s marginal totals

The fourth step is the tricky stuff. I tried several ways to do the next step and am far from convinced I’ve got it right; in fact even for the approach I’ve ended up with, I have far from optimised code. But it’s good enough for illustrative purposes.

My first idea was to find a row of data (representing a whole respondent) that was in “excess” with regard to at least one of its variables, remove that respondent from the sample and replace them with someone else chosen at random, with extra probability given to new respondents that would improve the marginal totals.

To do this, I defined a function called improve_rnd() which takes the latest sample and confronts it with a single marginal total, just like I do with gender in the table above.

  • From that I identify types of respondents who are in excess (eg Female who Neither support or oppose) and types that are missing (Female who strongly support).
  • I sort my sample with the rows that most represent excess values at the top (with some randomness given there will be lots of ties)
  • I lop off the top row of that sorted sample (noting that this means removing a whole respondent ie also their values for income, age, region, ethnicity, etc)
  • I create a table of under-represented rows from the population with regard to this particular variable, multiplying their original population weights by how badly they are under represented
  • I sample one row from that under-represented set based on those new weights and add it to my new candidate sample
  • I evaluate the candidate sample overall against all marginal totals to see if overall the change improves things or makes them worse (for example, I might have worsened the totals for income even though I improved them for gender)
  • If the candidate sample is better, I return this; otherwise I repeat the last two steps, systematically going through my under-represented set of data until I find one that unambiguously improves the data

As you can probably guess, that second last point above proved to be important – before I did that, I ended up cycling around a modestly improved equilibrium where improving the marginal totals for any one variable was making it worse for the others.

This function then becomes the workhorse for an iterative process where I work through all my variables (gender, income, age, etc) one at a time at random, looking to drop one of the rows in my sample that makes it worst for that variable and replace it with an improvement that doesn’t degrade the other variables’ totals. I repeat this until the total discrepancy gets down to some acceptable level that is probably due to rounding error (noting I never had the exact frequencies from the original data, only proportions and sub-population values of n).

Here’s the code that does that procedure. This is the expensive part of the computation; it took me many hours to converge to a combined discrepancy of less than 400, so in the code below I stop it once it gets below 1,200 – still much higher than I was hoping to get.

Post continues after R code

#' Improve a sample on basis of discrepency with a single set of marginal totals, while
#' not making it worse based on the other marginal totals it is matching
#'
#' @param latest_sample A sample that is trying to look like the original survey
#' @param marg_totals A single data frame of marginal totals with three columns, of which the first
#' two must be variables from full data and the third, Freq, is counts we are trying to match
#' @param full_data A dataset of all possible combinations that the sample is drawn from,
#' with population weights
#' @param disc tolerance used when deciding whether a combination counts as under-represented
#' @param strict if TRUE, whether to insist the end result sample has the same number of rows
#' as the original latest_sample
#' @param pops A full set of all marginal totals (as opposed to marg_totals which is just one
#' of them) - needed for evaluating how well candidate samples go compared to their aspiration.
#' @param verbose Whether or not to print out which row of the extra sample data is being tried
#' to replace an excess row in the sample
#' @param max_attempts the maximum number of candidate rows to try as a replacement for an excess
#' row in the sample
improve_rnd <- function(latest_sample, marg_totals, full_data, disc = 0.1,
                        strict = FALSE, pops, verbose = TRUE, max_attempts = 100){
  # initial discrepency:
  starting_disc <- evaluate_dissim(latest_sample, pops)

  # initial number of rows (useful for checking if we lose any later):
  starting_rows <- nrow(latest_sample)

  # variable names we are checking against for just this marginal total
  var_names <- names(marg_totals)[1:2]

  # in case we have any excesses recorded from previously, make this NULL
  latest_sample$excesses <- NULL

  # for convenience, renaming, and being able to go back if necessary, copy the current sample data:
  x <- latest_sample
  names(x)[names(x) == var_names[1]] <- "var1"
  names(x)[names(x) == var_names[2]] <- "var2"

  # identify which combinations of the variables listed in marg_data are in latest_sample in excess:
  new_excesses <- x %>%
    group_by(var1, var2) %>%
    summarise(sample_freq = n()) %>%
    full_join(marg_totals, by = c("var1" = var_names[1], "var2" = var_names[2])) %>%
    mutate(sample_freq = replace_na(sample_freq, 0)) %>%
    mutate(excesses = sample_freq - Freq) %>%
    select(var1, var2, excesses)

  names(new_excesses)[1:2] <- var_names

  y <- latest_sample %>%
    left_join(new_excesses, by = var_names)

  under_rep_vars <- new_excesses %>%
    ungroup() %>%
    filter(excesses < -disc)

  # Create and sort the under-represented rows of original data, with those that
  # are more under-represented more likely to be at the top
  under_rep_data <- full_data %>%
    inner_join(under_rep_vars, by = var_names) %>%
    mutate(wt = wt * abs(excesses),
           id = runif(n() * wt)) %>%
    arrange(desc(id)) %>%
    select(-id) %>%
    slice(1:max_attempts)

  if(nrow(under_rep_data) > 0){
    z <- y %>%
      # knock off one of the worst rows with too many excesses of some variable:
      arrange(desc(jitter(excesses))) %>%
      slice(-1)

    # cycle through all the possible candidates to replace a row of our sample with
    # one that is under represented, and choose the first one that reduces the overall
    # discrepency between our marginal totals and the ones we are aiming for:
    for(i in 1:nrow(under_rep_data)){
      if(verbose){cat(i)}

      candidate_sample <- z %>%
        # replace it with a row from our under-represented set of data:
        rbind(under_rep_data[i, ])

      # evaluate our candidate.
      new_disc <- evaluate_dissim(candidate_sample, pops)

      # Have we made things worse for the other variables by trying to fix it for this one?:
      if(new_disc < starting_disc){
        # If things are better, we take this sample and break out of the loop
        latest_sample <- candidate_sample
        break()
      }
      # if things aren't better, we keep cycling through under_rep_data until we find one that is
    }
  } else {
    # if there's nothing to replace:
    new_disc <- starting_disc
  }

  if(strict){
    if(nrow(latest_sample) != starting_rows){
      print(c(nrow(latest_sample), starting_rows))
      stop("Somehow gained or lost some rows")
    }
  }

  return(list(latest_sample, new_disc))
}

latest_sample %>%
  group_by(Gender, q1) %>%
  summarise(latest_sample_total = n()) %>%
  left_join(pops1[[5]]) %>%
  rename(original_sample_total = Freq)

discrepencies <- evaluate_dissim(latest_sample, pops1)

# best result from random version is 344
while(min(discrepencies) > 1200){
  # cycle through each of our sets of marginal totals (in random order),
  # looking to improve the match if we can
  for(i in sample(1:length(pops1))){
    y <- latest_sample %>%
      improve_rnd(marg_totals = pops1[[i]],
                  full_data = full_data,
                  disc = 0.2,
                  pops = pops1,
                  max_attempts = 5,
                  verbose = FALSE)

    latest_sample <- y[[1]]
    discrepencies <- c(discrepencies, y[[2]])
  }
  # at the end of a cycle of variables, print how well we're going at improving things:
  print(min(discrepencies))
}

So it turned out to be a pretty inefficient solution to knock out a whole person at a time and grab a new one in the hope they would be better. In retrospect, this didn’t make much sense as a method – I clearly came to it because I was thinking in terms of “whole people” as though my simulated sample really represented people and I had no choice but to accept and reject. I was thinking in these terms because I liked the idea of continually sampling from my hypothetical population, and felt this would somehow better mimic the known joint distribution. But I don’t think that makes much sense now that I’ve tried it.

An alternative, perhaps an obvious one, is to look at my current sample, find the particular combinations of some variable that is over-represented (eg nine extra females who neither support or oppose), pick nine of them at random and make them male. Clearly not possible in the real world, but that’s the whole point of synthetic data. I’d resisted this originally because I’d been thinking of all the other characteristics that are jointly distributed with being female that I was now changing, but on a bit of reflection this is actually little different from chucking those nine women out of the sample and keeping grabbing more people from the bag until I found some who were right.

So to implement this I built a new function, improve_fix, which does just that. This was much quicker than my first, resampling-based method, but is still prone to getting stuck. So in the implementation below I actually combine both methods: using the “swap people’s characteristics” method as the main way of “improving” my sample, but if after 20 efforts of doing this no improvements are happening, trying my original random resampling method to kick it out of the bad equilibrium.

Post continues after R code.

#-----------------Alternative approach - brute force, one variable at a time----

#' @details This method changes the sample by seeking a way to flick over var1 to a new value
#' that will improve the marginal combinations of var1 and var2 while leaving all other variables
#' unchanged. Unlike improve_rnd, which swaps out whole rows at a time.
improve_fix <- function(latest_sample, marg_totals, full_data, pops){
  # initial discrepency:
  starting_disc <- evaluate_dissim(latest_sample, pops)

  # variable names we are checking against for just this marginal total
  var_names <- names(marg_totals)[1:2]

  # in case we have any excesses recorded from previously, make this NULL
  latest_sample$excesses <- NULL
  latest_sample$id <- 1:nrow(latest_sample)

  # for convenience, renaming, and being able to go back if necessary, copy the current sample data:
  x <- latest_sample
  names(x)[names(x) == var_names[1]] <- "var1"
  names(x)[names(x) == var_names[2]] <- "var2"

  # identify which combinations of the variables listed in marg_data are in latest_sample in excess:
  changes_possible <- x %>%
    group_by(var1, var2) %>%
    summarise(sample_freq = n()) %>%
    full_join(marg_totals, by = c("var1" = var_names[1], "var2" = var_names[2])) %>%
    mutate(sample_freq = replace_na(sample_freq, 0)) %>%
    mutate(excesses = jitter(sample_freq - Freq)) %>%
    filter(round(abs(excesses)) > 0) %>%
    select(var1, var2, excesses) %>%
    ungroup()

  candidate_var2 <- changes_possible %>%
    group_by(var2) %>%
    filter(min(excesses) < 0 & max(excesses) > 0) %>%
    select(-excesses)

  changes_needed <- changes_possible %>%
    inner_join(candidate_var2, by = c("var1", "var2")) %>%
    arrange(desc(rnorm(n()))) %>%
    filter(var2 == var2[1]) %>%
    filter(excesses %in% range(excesses)) %>%
    mutate(excesses = round(excesses))

  names(changes_needed)[1:2] <- var_names

  number_changes <- min(abs(pull(changes_needed, excesses)))
  change_from <- filter(changes_needed, excesses > 0) %>% select(-excesses)
  change_to <- filter(changes_needed, excesses < 0)

  if(nrow(change_from) == 1 & nrow(change_to) == 1){
    knock_out <- latest_sample %>%
      inner_join(change_from, by = var_names) %>%
      sample_n(number_changes)

    replace_with <- knock_out
    replace_with[, var_names[1]] <- change_to[, var_names[1]]
    replace_with[, var_names[2]] <- change_to[, var_names[2]]

    latest_sample <- latest_sample %>%
      filter(!id %in% knock_out$id) %>%
      rbind(replace_with)
  }

  latest_sample$id <- NULL

  # new discrepency:
  new_disc <- evaluate_dissim(latest_sample, pops)

  # Note that if change_to has zero rows then
  stopifnot(new_disc <= starting_disc)

  return(list(latest_sample, new_disc))
}

while(min(discrepencies) > 10){
  # cycle through each of our sets of marginal totals (in random order),
  # looking to improve the match if we can
  for(i in sample(1:length(pops1))){
    y <- improve_fix(latest_sample,
                     marg_totals = pops1[[i]],
                     full_data = full_data,
                     pops = pops1)

    if(is.na(y[[2]])){stop("found an NA")}

    latest_sample <- y[[1]]
    discrepencies <- c(discrepencies, y[[2]])
  }
  # at the end of a cycle of variables, print how well we're going at improving things:
  print(min(discrepencies))

  # sometimes we get prematurely stuck with the "fixed" method and it is worth substituting
  # in and out some whole rows of data to kick-start the process:
  if(length(unique(tail(discrepencies, 20))) == 1){
    for(i in sample(1:length(pops1))){
      y <- latest_sample %>%
        improve_rnd(marg_totals = pops1[[i]],
                    full_data = full_data,
                    disc = 0.2,
                    pops = pops1,
                    max_attempts = 5,
                    verbose = FALSE)

      latest_sample <- y[[1]]
      discrepencies <- c(discrepencies, y[[2]])
    }
  }
}

This method delivers good results in only a few minutes of processing time. Here’s a plot of the decreasing total discrepancies:

You can see the first, slowish descent in total discrepancies in the first 100 or so samples, which represents my first, replacement-based method with improve_rnd(). Then from a total discrepancy of about 1,200 down to about 250 there is a rapid improvement from using the improve_fix() method. That method stalls at a discrepancy level of about 250, but a return to a cycle of improve_rnd() gets it out of that equilibrium; this pattern recurs a few times until we minimise our total discrepancy at 22. Importantly, we can get to that sample with a discrepancy of 22 by multiple routes.

So here’s the comparison of just one variable and question 1 of the sample’s marginal totals and the target, to compare with my first synthetic sample. Note that we now have nearly an exact match – we have just one too few females who “strongly support”, and one too many of unknown gender who “somewhat support”. My method as implemented doesn’t have a clear way to improve the sample in this case. Intuitively, we would think we could find one of those two “unknown gender – somewhat support” people and change them to “female – strongly support”, but I’ve written my program only to change one variable at a time (in this case Gender), to minimise complications with other sets of variables. So further improvement is still possible, if anyone really wants to pursue this approach.

> #-----------------End results--------------
> latest_sample %>%
+   group_by(Gender, q1) %>%
+   summarise(latest_sample_total = n()) %>%
+   left_join(pops1[[5]]) %>%
+   rename(original_sample_total = Freq)
Joining, by = c("Gender", "q1")
# A tibble: 15 x 4
# Groups:   Gender [3]
   Gender         q1                        latest_sample_total original_sample_total
   <chr>          <chr>                                   <int>                 <dbl>
 1 Female         Neither support or oppose                  55                    55
 2 Female         Somewhat oppose                            38                    38
 3 Female         Somewhat support                          107                   107
 4 Female         Strongly oppose                            21                    21
 5 Female         Strongly support                          282                   283
 6 Female         Unsure                                      9                     9
 7 Male           Neither support or oppose                  63                    63
 8 Male           Somewhat oppose                            49                    49
 9 Male           Somewhat support                           96                    96
10 Male           Strongly oppose                            51                    51
11 Male           Strongly support                          216                   216
12 Male           Unsure                                      8                     8
13 Unknown Gender Somewhat support                            2                     1
14 Unknown Gender Strongly oppose                             1                     1
15 Unknown Gender Strongly support                            2                     2

Reflections

A few conclusions:

  • This was considerably harder to do than I’d realised; and would get harder again with more sets of marginal totals to match.
  • We would need several orders of magnitude more sets of marginal totals before we have to worry about there being a unique solution that gave a snooper (eg someone who knows the one Female Pasifika survey respondent in region X) confidence they had obtained new, sensitive information ie their answers to the survey.
  • Because I have not included any external information (eg on the ethnic characteristics of different regions), I will have only very weak interaction relations between any variables – big underestimates of the true interactions. Considerably more information of that sort would be needed for this exercise to add any value in terms of generating insights beyond what the original marginal totals did.

Overall, at the end of a 1,000+ line blog post, I would say my full method set out here is unlikely to be pragmatically helpful for any realistic use case that comes to mind for me. I think the traditional generation of synthetic data (my steps 1 to 3) is potentially useful for the purpose of developing models that might then be applied to the gold standard original microdata in a secure setting; but step 4 really doesn’t add much value either for a bona fide researcher or for an ill-intentioned snooper.

Acknowledgement and caveat

Gun Control NZ, who commissioned the original survey, shared the data and gave me permission to write a blog post on the data here, but they have not seen the content in advance or been involved in any other way; I am not affiliated with them and they are in no sense responsible for any of the content, findings, errors or omissions in the above.


To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.


R Owl of Athena


[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

RBloggers|RBloggers-feedburner

Intro:

After developing the package RAthena, I stumbled quite accidentally into the R SDK for AWS, paws. As RAthena utilises Python’s SDK boto3, I thought the development of another AWS Athena package couldn’t hurt. As mentioned in my previous blog, the paws syntax is very similar to boto3, so a lot of my RAthena code was very portable, and this gave me my final excuse to develop my next R package.

paws and AWS Athena:

Before getting into the next package, let’s first look at how the SDK interacts with AWS Athena.

For example: return all databases in AWS Athena

# create an AWS Athena object
athena <- paws::athena()

# Submit query to AWS Athena
res <- athena$start_query_execution(
            QueryString = "show Databases",
            ResultConfiguration = 
                list(OutputLocation = "s3://mybucket/queries/"))

# Get Status of query
result <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

# Return results if query is successful
if(result$QueryExecution$Status$State == "FAILED") {
  stop(result$QueryExecution$Status$StateChangeReason, call. = FALSE)
} else {output <- 
          athena$get_query_results(
              QueryExecutionId = res$QueryExecutionId,
              MaxResults = 10)}

This isn’t the prettiest code for running SQL against AWS Athena (in the above example, SHOW DATABASES), and it only returns the top 10 results. It gets even more “interesting” if you wish to return the entire result set as a data frame. This is where noctua comes in.
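To give a flavour of what that involves, here is a rough sketch of how you might page through a complete result set with paws alone, by following the NextToken returned by get_query_results. This is only an illustration of the extra work required, not noctua’s actual implementation, and it assumes the query submitted above has already finished successfully.

# Illustrative sketch only: paging through a finished query's results with paws,
# following NextToken until every row has been returned.
athena <- paws::athena()

rows <- list()
token <- NULL
repeat {
  page <- athena$get_query_results(
    QueryExecutionId = res$QueryExecutionId,
    MaxResults = 1000,
    NextToken = token)
  rows <- c(rows, page$ResultSet$Rows)
  token <- page$NextToken
  if (is.null(token) || length(token) == 0) break
}
# `rows` still has to be parsed into a data frame by hand.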

noctua

To start off with I will go through the same 3 questions I went through in my Athena and R … there is another way!? blog.

  • What is noctua?

noctua is an R package that creates a DBI (Database Interface) for R, using the R package DBI and the R SDK paws as the backend (so basically the same as RAthena).

  • Why was noctua created when there are already methods for connecting to Athena?

noctua was created to provide an extra method for R users to connect to Athena. Plus it seemed natural to create noctua due to how it connects to AWS Athena (through an SDK), which is the same way RAthena connects to AWS Athena.

  • Why is noctua called noctua?

This is a tricky one, as RAthena was already taken. So I looked for a historic reference to link the new package to AWS Athena. I settled on noctua because Athena/Minerva is the Greek/Roman goddess of wisdom, handicraft, and warfare; one of Athena’s main symbols is the owl, and noctua is the Latin word for owl.

How to install:

noctua is currently available on CRAN and GitHub:

CRAN version:

install.packages("noctua")

Github development version:

remotes::install_github("dyfanjones/noctua")

Usage:

As with all DBI interface packages, the key functions are exactly the same, which means there is little to no upskilling required. The only real differences between backends are how they connect and how they send data back to the database, so we will focus mainly on those two aspects.
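For example, the usual DBI generics work as you would expect. The sketch below assumes credentials are already set up as described in the next section, and the staging bucket is a placeholder.

library(DBI)

# assumes credentials are available (see the connection methods below);
# the s3_staging_dir bucket is a placeholder
con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://mybucket/queries/")

dbListTables(con)                           # list available tables
dbGetQuery(con, "SELECT 1 AS test_column")  # run a query and collect the result
dbDisconnect(con)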

Connecting to AWS Athena:

noctua offers a wide range of connection methods, from hard-coded credentials to Amazon Resource Name (ARN) roles, which is very similar to the RAthena package.

Hard-Coding Method:

This method isn’t recommended as your credentials are hard-coded.

library(DBI)
con <- dbConnect(noctua::athena(),
                 aws_access_key_id = "YOUR AWS KEY ID",
                 aws_secret_access_key = "YOUR SECRET ACCESS KEY",
                 s3_staging_dir = "s3://path/to/query/bucket/")

Note: s3_staging_dir needs to be an S3 URI, for example “s3://path/to/query/bucket/”.

If you do not wish to create AWS Profiles then setting environmental variables would be the recommended method.

Environment Variable Method:

noctua supports AWS credentials set as environment variables, which avoids hard-coding. From what I have found out, an easy way to set up environment variables (that persist) in R is to use the file.edit function like so:

file.edit("~/.Renviron")

And now you can simply add in your environment variables in the file you are editing for example:

AWS_ACCESS_KEY_ID = YOUR AWS KEY ID
AWS_SECRET_ACCESS_KEY = YOUR SECRET ACCESS KEY

Once you have set your environment variables you can connect to Athena in the following method:

library(DBI)
con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://path/to/query/bucket/")

You can also set the s3_staging_dir parameter as an environment variable; to do this, set the following environment variable:

AWS_ATHENA_S3_STAGING_DIR = s3://path/to/query/bucket/

This allows for the following connection:

library(DBI)
con <- dbConnect(noctua::athena())

AWS Profile Names:

Another method is to use AWS profile names. AWS profile names can be set up either manually in the ~/.aws directory or by using the AWS Command Line Interface (AWS CLI). Once you have set up your profile name you can connect to AWS Athena:

Using Default Profile Name:

library(DBI)
con <- dbConnect(noctua::athena())

Using Non-Default Profile Name:

library(DBI)
con <- dbConnect(noctua::athena(),
                 profile_name = "rathena")

ARN Roles:

ARN roles are fairly useful if you need to assume a role that can connect to another AWS account and use AWS Athena in that account, or if you want to create a temporary connection with different permissions than your current role (AWS ARN Documentation).

Assuming ARN role credentials before connecting to AWS Athena:

library(noctua)
library(DBI)
assume_role(profile_name = "YOUR_PROFILE_NAME",
            role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
            set_env = TRUE)

# Connect to Athena using ARN Role
con <- dbConnect(athena(),
                s3_staging_dir = "s3://path/to/query/bucket/")

Connect to AWS Athena directly using ARN role:

library(DBI)
con <- dbConnect(noctua::athena(),
                  profile_name = "YOUR_PROFILE_NAME",
                  role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
                  s3_staging_dir = 's3://path/to/query/bucket/')

Note: ARN roles have a duration timer before they expire. To change the default you can increase the duration_seconds parameter from the default 3600 seconds (1 hour).

Temporary Sessions:

Finally you can create temporary credentials before connecting to AWS Athena:

library(noctua)
library(DBI)

# Create Temporary Credentials duration 1 hour
get_session_token("YOUR_PROFILE_NAME",
                  serial_number='arn:aws:iam::123456789012:mfa/user',
                  token_code = "531602",
                  set_env = TRUE)

# Connect to Athena using temporary credentials
con <- dbConnect(athena(),
                s3_staging_dir = "s3://path/to/query/bucket/")

Note: This method will work for users who have set up Multi-Factor Authentication (MFA).

Querying:

Querying AWS Athena with noctua is very similar to querying any other database through DBI:

library(DBI)

con <- dbConnect(noctua::athena())

dbGetQuery(con, "show databases")

That is it! If we look back at the initial paws code for working with AWS Athena, it was very intimidating for basic AWS Athena queries. noctua packages all of that up and makes it super easy to work with.

Uploading Data:

It is all very well querying data from AWS Athena, but it is even more useful to be able to upload data as well. noctua addresses this with a dbWriteTable method.

dbWriteTable(con, "mtcars", mtcars,
             partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")),
             s3.location = "s3://mybucket/data/")

Once you have uploaded data into AWS Athena you can query it as follows:

dbGetQuery(con, "select * from mtcars")

Here are the parameters for the dbWriteTable method (a short usage sketch follows the list):

conn: An AthenaConnection object, produced by dbConnect()

name: A character string specifying a table name. Names will be automatically quoted so you can use any sequence of characters, not just any valid bare table name.

value: A data.frame to write to the database.

overwrite: Allow overwriting the destination table. Cannot be ‘TRUE’ if ‘append’ is also ‘TRUE’.

append: Allow appending to the destination table. Cannot be ‘TRUE’ if ‘overwrite’ is also ‘TRUE’.

row.names: Either TRUE, FALSE, NA or a string. If TRUE, always translate row names to a column called “row_names”. If FALSE, never translate row names. If NA, translate rownames only if they’re a character vector. A string is equivalent to TRUE, but allows you to override the default name. For backward compatibility, NULL is equivalent to FALSE.

field.types: Additional field types used to override derived types.

partition: Partition Athena table (needs to be a named list or vector) for example: c(var1 = “2019-20-13”)

s3.location: S3 bucket to store the Athena table; must be set as an S3 URI, for example (“s3://mybucket/data/“).

file.type: What file type to store the data.frame on S3; noctua currently supports [“csv”, “tsv”, “parquet”]. Note: file.type “parquet” is supported by the R package arrow and it will need to be installed separately if you wish to upload data.frames in “parquet” format.

…: Other arguments used by individual methods.
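Putting a few of those parameters together, here is a hedged sketch of appending a new, date-partitioned batch of data stored as parquet; the bucket is a placeholder and the parquet option assumes the arrow package is installed.

library(DBI)
con <- dbConnect(noctua::athena())

# Append today's data to an existing table as parquet, partitioned by date;
# "s3://mybucket/data/" is a placeholder location.
dbWriteTable(con, "mtcars", mtcars,
             append = TRUE,
             file.type = "parquet",
             partition = c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")),
             s3.location = "s3://mybucket/data/")

dbGetQuery(con, "SELECT COUNT(*) AS n FROM mtcars")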

Conclusion:

noctua is a package that gives R users access to AWS Athena using the R AWS SDK paws. Thus no external software is required and it can all be installed from CRAN. If you are interested in how to connect R to AWS Athena please check out RAthena as well (my other AWS Athena connectivity R package). All feature requests/suggestions/issues are welcome; please add them to: Github Issues.

Finally, please star the GitHub repositories if you like the work that has been done with R and AWS Athena: noctua, RAthena.


To leave a comment for the author, please follow the link and comment on their blog: Dyfan Jones Brain Dump HQ.


Exploratory analysis of a banana


[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This post is just me amusing myself by exploring a tiny data set I have lying around. The dataset and the code is on Github.

In 2014 (I think), I was teaching the introductory cell biology labs (pictures in the linked post) in Linköping. We were doing a series of simple preparations to look at cells and organelles: a cheek swab gives you a view of dead mammalian cells with bacteria on them; Elodea gives you a nice chloroplast view; a red bell pepper gives you chromoplasts; and a banana stained with iodine gives you amyloplasts. Giving the same lab six times in a row, it became apparent how the number of stained amyloplasts decreased as the banana ripened.

I took one banana, sliced it into five pieces (named A-E), and left it out to ripen. Then I stained (with Lugol’s iodine solution) and counted the number of amyloplasts per cell in a few cells (scraped off with a toothpick) from the end of each piece at day 1, 5, and 9.

First, here is an overview of the data. On average, we go from 17 stained amyloplasts on day 1, to 5 on day five and 2 on day nine.

If we break the plot up by slices, we see a decline in every slice and variability between them. Because I only sampled each slice once per day, there is no telling whether this is variation between parts of the banana or between samples taken (say, hypothetically, because I might have stuck the toothpick in more or less deeply, or because the ripeness varies from the middle to the peel).
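The plots themselves are not reproduced here, but a minimal sketch of how such overview and per-slice plots could be drawn, assuming the banana data frame with the day, slice and amyloplasts columns used throughout the code below, is:

library(ggplot2)

# Overview: stained amyloplasts per cell by day, with the mean per day as a line
ggplot(banana, aes(x = day, y = amyloplasts)) +
  geom_jitter(width = 0.1, height = 0) +
  stat_summary(fun = mean, geom = "line", colour = "blue")

# The same data, broken up by slice
ggplot(banana, aes(x = day, y = amyloplasts)) +
  geom_jitter(width = 0.1, height = 0) +
  facet_wrap(~ slice)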

How can we model this? Let’s first fit a linear model where the number of amyloplasts declines at a constant rate per day, allowing for different starting values and different declines for each slice. We can anticipate that a Gaussian linear model will have some problems in this situation.

We fit a linear model and pull out the fitted values for each day–slice combination:

model_lm <- lm(amyloplasts ~ day * slice,
               data = banana)

levels <- expand.grid(slice = unique(banana$slice),
                      day = unique(banana$day),
                      stringsAsFactors = FALSE)

pred_lm <- cbind(levels,
                 predict(model_lm,
                         newdata = levels,
                         interval = "confidence"))

Then, to investigate the model’s behaviour, we can simulate data from the model, allowing for uncertainty in the fitted parameters, with the sim function from the arm package.

We make a function to simulate data from the linear model given a set of parameters, then simulate parameters and feed the first parameter combination to the function to get ourselves a simulated dataset.

library(arm)  # for the sim() function

y_rep_lm <- function(coef_lm, sigma, banana) {
    slice_coef <- c(0, coef_lm[3:6])
    names(slice_coef) <- c("A", "B", "C", "D", "E")
    slice_by_day_coef <- c(0, coef_lm[7:10])
    names(slice_by_day_coef) <- c("A", "B", "C", "D", "E")

    banana$sim_amyloplasts <-
        coef_lm[1] +
        slice_coef[banana$slice] +
        banana$day * (coef_lm[2] + slice_by_day_coef[banana$slice]) +
        rnorm(nrow(banana), 0, sigma)
    banana
}

sim_lm <- sim(model_lm)
sim_banana <- y_rep_lm(sim_lm@coef[1,], sim_lm@sigma[1], banana)

The result looks like this (black dots) compared with the real data (grey dots).

The linear model doesn’t know that the number of amyloplasts can’t go below zero, so it happily generates absurd negative values. While not apparent from the plots, the linear model also doesn’t know that amyloplasts counts are restricted to be whole numbers. Let’s fit a generalized linear model with a Poisson distribution, which should be more suited to this kind of discrete data. The log link function will also turn the linear decrease into an exponential decline, which seems appropriate for the decline in amyloplasts.

model_glm <- glm(amyloplasts ~ day * slice,
                 data = banana,
                 family = poisson(link = log))

pred_glm <- predict(model_glm,
                    newdata = levels,
                    se.fit = TRUE)

results_glm <- data.frame(levels,
                          average = pred_glm$fit,
                          se = pred_glm$se.fit,
                          stringsAsFactors = FALSE)

y_rep_glm <- function(coef_glm, banana) {
    slice_coef <- c(0, coef_glm[3:6])
    names(slice_coef) <- c("A", "B", "C", "D", "E")
    slice_by_day_coef <- c(0, coef_glm[7:10])
    names(slice_by_day_coef) <- c("A", "B", "C", "D", "E")

    latent <- exp(coef_glm[1] +
        slice_coef[banana$slice] +
        banana$day * (coef_glm[2] + slice_by_day_coef[banana$slice]))

    banana$sim_amyloplasts <- rpois(n = nrow(banana),
                                    lambda = latent)
    banana
}

sim_glm <- sim(model_glm)
sim_banana_glm <- y_rep_glm(sim_glm@coef[2,], banana)

This code is the same deal as above, with small modifications: glm instead of lm, with some differences in the interface. Then a function to simulate data from a Poisson model with a logarithmic link, which we apply to one set of parameter values.

There are no impossible zeros anymore. However, there seem to be many more zeros in the real data than in the simulated data, and consequently, as the number of amyloplasts grows small, we overestimate how many there should be.

Another possibility among the standard arsenal of models is a generalised linear model with a negative binomial distribution. As opposed to the Poisson, this allows greater spread among the values. We can fit a negative binomial model with Stan.

library(rstan)

model_nb <- stan(file = "banana.stan",
                 data = list(n = nrow(banana),
                             n_slices = length(unique(banana$slice)),
                             n_days = length(unique(banana$day)),
                             amyloplasts = banana$amyloplasts,
                             day = banana$day - 1,
                             slice = as.numeric(factor(banana$slice)),
                             prior_phi_scale = 1))

y_rep <- rstan::extract(model_nb, pars = "y_rep")[[1]]

Here is the Stan code in banana.stan:

data {
    int n;
    int n_slices;
    int amyloplasts[n];
    real day[n];
    int slice[n];
    real prior_phi_scale;
}
parameters {
    real initial_amyloplasts[n_slices];
    real decline[n_slices];
    real<lower = 0> phi_rec;
}
model {
    // half-normal prior on the reciprocal square root of the overdispersion,
    // with the scale passed in as data
    phi_rec ~ normal(0, prior_phi_scale);
    for (i in 1:n) {
        amyloplasts[i] ~ neg_binomial_2_log(initial_amyloplasts[slice[i]] +
                                            day[i] * decline[slice[i]],
                                            (1/phi_rec)^2);
    }
}
generated quantities {
    vector[n] y_rep;
    for (i in 1:n) {
        y_rep[i] = neg_binomial_2_rng(exp(initial_amyloplasts[slice[i]] +
                                          day[i] * decline[slice[i]]),
                                      (1/phi_rec)^2);
    }
}

This model is similar to the Poisson model, except that the negative binomial allows an overdispersion parameter, a small value of which corresponds to large variance. Therefore, we put the prior on the reciprocal of the square root of the parameter.
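As a small numerical illustration of that parameterisation (this is my own sketch rather than code from the original analysis), recall that under the neg_binomial_2 convention the variance is mu + mu^2/phi, and here phi = (1/phi_rec)^2:

# Illustration of how the prior scale maps to overdispersion under the
# neg_binomial_2 parameterisation, where Var(y) = mu + mu^2 / phi.
phi_rec <- c(0.1, 1, 3)  # example values of the parameter the prior acts on
phi <- (1 / phi_rec)^2   # the overdispersion parameter used in the model
mu <- 10                 # an example mean count

data.frame(phi_rec = phi_rec,
           phi = phi,
           variance = mu + mu^2 / phi)
# small phi_rec -> large phi -> variance close to the Poisson value (mu);
# large phi_rec -> small phi -> much larger variance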

Conveniently, Stan can also make the simulated replicated data for us in the generated quantities block.

What does the simulated data look like?

Here we have a model that allows for more spread, but in the process, generates some extreme data, with hundreds of amyloplasts per cell in some slices. We can try to be draconian with the prior and constrain the overdispersion to smaller values instead:

model_nb2 <- stan(file = "banana.stan",
                  data = list(n = nrow(banana),
                              n_slices = length(unique(banana$slice)),
                              n_days = length(unique(banana$day)),
                              amyloplasts = banana$amyloplasts,
                              day = banana$day - 1,
                              slice = as.numeric(factor(banana$slice)),
                              prior_phi_scale = 0.1))

y_rep2 <- rstan::extract(model_nb2, pars = "y_rep")[[1]]

That looks a little better. Now, we’ve only looked at single simulated datasets, but we can get a better picture by looking at replicate simulations. We need some test statistics, so let us use the number of zeros in each dataset, the maximum value, and the sample variance, and then do some visual posterior predictive checks.

check_glm <- data.frame(n_zeros = numeric(1000),
                        max_value = numeric(1000),
                        variance = numeric(1000),
                        model = "Poisson",
                        stringsAsFactors = FALSE)

check_nb <- data.frame(n_zeros = numeric(1000),
                       max_value = numeric(1000),
                       variance = numeric(1000),
                       model = "Negative binomial",
                       stringsAsFactors = FALSE)

check_nb2 <- data.frame(n_zeros = numeric(1000),
                        max_value = numeric(1000),
                        variance = numeric(1000),
                        model = "Negative binomial 2",
                        stringsAsFactors = FALSE)

for (sim_ix in 1:1000) {
    y_rep_data <- y_rep_glm(sim_glm@coef[sim_ix,], banana)
    check_glm$n_zeros[sim_ix] <- sum(y_rep_data$sim_amyloplasts == 0)
    check_glm$max_value[sim_ix] <- max(y_rep_data$sim_amyloplasts)
    check_glm$variance[sim_ix] <- var(y_rep_data$sim_amyloplasts)

    check_nb$n_zeros[sim_ix] <- sum(y_rep[sim_ix,] == 0)
    check_nb$max_value[sim_ix] <- max(y_rep[sim_ix,])
    check_nb$variance[sim_ix] <- var(y_rep[sim_ix,])

    check_nb2$n_zeros[sim_ix] <- sum(y_rep2[sim_ix,] == 0)
    check_nb2$max_value[sim_ix] <- max(y_rep2[sim_ix,])
    check_nb2$variance[sim_ix] <- var(y_rep2[sim_ix,])
}

check <- rbind(check_glm,
               check_nb,
               check_nb2)

melted_check <- gather(check, "variable", "value", -model)

check_data <- data.frame(n_zeros = sum(banana$amyloplasts == 0),
                         max_value = max(banana$amyloplasts),
                         variance = var(banana$amyloplasts))

Here is the resulting distribution of these three discrepancy statistics in 1000 simulated datasets for the three models (the generalised linear model with Poisson distribution and the two negative binomial models). The black line is the value for the real data.
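A sketch of how such a figure can be drawn with ggplot2, using the melted_check and check_data objects from above (the original plotting code may differ):

library(ggplot2)
library(tidyr)

## Long format for the real-data values, to draw one reference line per column of panels
melted_check_data <- gather(check_data, "variable", "value")

ggplot(melted_check, aes(x = value)) +
    geom_histogram(bins = 30) +
    geom_vline(aes(xintercept = value), data = melted_check_data) +
    facet_grid(model ~ variable, scales = "free_x")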

When viewed like this, it becomes apparent that the models do not fit all that well. The Poisson model struggles with the variance and the number of zeros. The negative binomial models come closer to the number of zeros in the real data, but they still have too few, while at the same time having far too high maximum values and variance.

Finally, let’s look at the fitted means and intervals from all the models. We can use the predict function for the linear model and Poisson model, and for the negative binomial models, we can write our own:

pred_stan <- function(model, newdata) {
    samples <- rstan::extract(model)
    initial_amyloplasts <- data.frame(samples$initial_amyloplasts)
    decline <- data.frame(samples$decline)
    names(initial_amyloplasts) <- names(decline) <- c("A", "B", "C", "D", "E")

    ## Get posterior for levels
    pred <- matrix(0,
                   ncol = nrow(newdata),
                   nrow = nrow(initial_amyloplasts))

    for (obs in 1:ncol(pred)) {
        pred[,obs] <- initial_amyloplasts[,newdata$slice[obs]] +
            (newdata$day[obs] - 1) * decline[,newdata$slice[obs]]
    }

    ## Get mean and interval
    newdata$fit <- exp(colMeans(pred))
    intervals <- lapply(data.frame(pred), quantile, probs = c(0.025, 0.975))
    newdata$lwr <- exp(unlist(lapply(intervals, "[", 1)))
    newdata$upr <- exp(unlist(lapply(intervals, "[", 2)))

    newdata
}

pred_nb <- pred_stan(model_nb, levels)
pred_nb2 <- pred_stan(model_nb2, levels)
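For the record, here is one way the comparison plot might be put together. This is a sketch: it assumes the results_glm, pred_nb and pred_nb2 objects from above, back-transforms the Poisson predictions that predict() returned on the link scale, and uses a normal approximation for the Poisson interval.

library(ggplot2)

## Poisson predictions: back-transform from the link (log) scale
pred_poisson <- transform(results_glm,
                          fit = exp(average),
                          lwr = exp(average - 1.96 * se),
                          upr = exp(average + 1.96 * se),
                          model = "Poisson")

columns <- c("slice", "day", "fit", "lwr", "upr", "model")
pred_all <- rbind(pred_poisson[, columns],
                  transform(pred_nb, model = "Negative binomial")[, columns],
                  transform(pred_nb2, model = "Negative binomial 2")[, columns])

## Fitted means with 95 % intervals, one panel per model and slice
ggplot(pred_all, aes(x = day, y = fit, ymin = lwr, ymax = upr)) +
    geom_pointrange() +
    facet_grid(model ~ slice)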

In summary, the three generalised linear models with a log link function pretty much agree about the decline in amyloplasts during the later days, which looks more appropriate than a linear decline. They disagree about the uncertainty in the numbers on the first day, which is when there are a lot of them. Perhaps coincidentally, this must also be where the quality of my counts is the lowest, because it is hard to count amyloplasts that lie on top of each other.
