
rOpenSci Dev Guide 0.3.0: Updates


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

As announced in February, we now have an online book containing all things related to rOpenSci software review. Our goal is to update it approximately quarterly – it’s time to present the third version. You can read the changelog or this blog post to find out what’s new in our dev guide 0.3.0!

Updates to our policies and guidance

Scope

We’ve introduced an important change for anyone thinking of submitting a package. We ask authors to select a category in our Aims and Scope under which to submit. We found that the “reproducibility” category was confusing, as almost any R package can be related to reproducibility. We split “reproducibility” into four new categories that fit the narrower topics it previously included. They are:

  • data validation and testing: Tools that enable automated validation and checking of data quality and completeness as part of scientific workflows.

  • workflow automation: Tools that automate and link together workflows, such as build systems and tools to manage continuous integration.

  • version control: Tools that facilitate the use of version control in scientific workflows.

  • citation management and bibliometrics: Tools that facilitate managing references, such as for writing manuscripts, creating CVs or otherwise attributing scientific contributions.

For discussion and changes see ropensci/software-review-meta#81 and ropensci/dev_guide#184.

Documentation

rOpenSci’s Jeroen Ooms has built a new way for rOpenSci package authors to improve the documentation of their packages. Many of you have likely been building package documentation websites with pkgdown. There were already ways to automate the building of these documentation pages, but nothing as easy as what we have now. Docs for each rOpenSci package are now built automatically and can be found at https://docs.ropensci.org/{package}, where {package} is the package name; e.g., the page for the bowerbird package is https://docs.ropensci.org/bowerbird/. Read Jeroen’s post, Introducing the new rOpenSci docs server, for more details, and see the developer guide section Automatic deployment of the documentation website.

We’ve changed our guidelines regarding READMEs and vignettes: if you have content that is shared between the README and any vignettes, we suggest using R Markdown chunks for it. Related issues: ropensci/dev_guide#159 and ropensci/dev_guide#161.

Lastly, we’ve removed our recommendation to use the roxygen2 development version, because the latest stable version has what is needed regarding tags and the rev contributor role (thanks again, Hugo Gruson: ropensci/dev_guide#165).

CRAN checks

We’ve added explanations of CRAN checks to the packaging section, including how maintainers can use the CRAN checks API. In particular, see rodev::use_cchecks_badge() for adding a badge to your README that lets users know the status of your CRAN checks.
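A minimal sketch of how that helper might be used from your package’s root directory (the function name comes from the guide above; calling it without arguments is an assumption, so check ?rodev::use_cchecks_badge for the actual interface):

# rodev is rOpenSci's development helper package; see its README for installation instructions
library(rodev)
use_cchecks_badge()  # assumed default call: adds a CRAN checks badge to your README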

R helpers

usethis is a helper package for R package maintainers. We’ve added recommendations to the guide for usethis functions to use when adding testing or vignette infrastructure.
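For instance, a minimal sketch using standard usethis helpers (the names "myfunction" and "mypackage" are placeholders, and the guide’s exact recommendations may differ):

library(usethis)
use_testthat()             # set up tests/testthat/ and add testthat to Suggests
use_test("myfunction")     # create a test file skeleton for a function
use_vignette("mypackage")  # add vignette infrastructure and a vignette stub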

Built on usethis, the rodev package is a new helper package, maintained by Maëlle. The goal of rodev is to help rOpenSci package developers with common tasks, and to promote best practices. We’ve added mentions of rodev functions throughout the guide where appropriate.

New section: Changing package maintainers

We’ve added a new section to the book on dealing with changing package maintainers. Part of the reason for rOpenSci to exist is the sustainability of important R packages. Where possible, we strive to find new maintainers for packages when the current maintainers need to step away. To help the rOpenSci community (including rOpenSci staff), this section offers tips and guidance for people wanting to give up the maintainer role, for those taking over that role, and for staff involved in helping.

JOSS

We’ve updated the JOSS submission instructions at https://devguide.ropensci.org/approvaltemplate.html (see ropensci/dev_guide#187). Instead of submitting with your Zenodo DOI, submit to JOSS with your rOpenSci GitHub repo URL. In addition, add a comment that the package has been reviewed by rOpenSci.

Security

We’d like to have fleshed out the security chapter more, but in the meantime, we’ve added a link to an rOpenSci community call on Security for R, where you can find lots of resources and discussion on security in R packages.

Other changes

  • Matt Fidler amended our section on console messages (ropensci/dev_guide#178). We previously said to use print() and/or cat() only in your print.* methods; the language now suggests using those two functions in either print.* or str.* methods.

  • We previously briefly mentioned “git flow”. We now expand that a bit to discuss two aspects of “git flow”: keep master shippable, and use branches because they’re cheap.

For editors

We’ve added a book release guide, i.e., a guide for shepherding a new version of the developer guide, which culminates in this blog post!

Guide tweaks

The developer guide now lives at https://devguide.ropensci.org/, its very own subdomain!

We fixed the formatting of Appendices B-D in the PDF version of the book (see the PDF button in the navbar), following an issue reported by Indrajeet Patil.

For book contributors, we’ve updated the instructions in the README for rendering the developer guide locally (https://github.com/ropensci/dev_guide/issues/192).

Hugo Gruson helped us with a number of things: he added a note for package maintainers about the importance of GitHub recognizing your repository’s language as R, and helped us update links from http to https.

Conclusion

In this post we summarized the changes incorporated into our book “rOpenSci Packages: Development, Maintenance, and Peer Review” over the last four months. We are grateful for all the contributions that made this release possible. If you have any feedback about the book, head to the issue tracker! We are already working on updates for our next version, such as documenting R6 objects, guidance on package citations, and more content for the security chapter. Check out the issue tracker if you’d like to contribute.




Part II: Deploying a Dash Application to Operationalize Machine Learning Models


[This article was first published on R – Modern Data, and kindly contributed to R-bloggers.]

Note: This guide is intended for use with version 3.3 of Plotly On-Premise.

In Part I of the tutorial, we created and configured an Azure VM. We then installed Dash Enterprise on the Azure VM. Finally, we configured Dash Enterprise and verified that our hostname pointed to the Azure VM’s public IP address.

We proceeded with a Self-Signed SSL Certificate to easily create a testing production environment for this tutorial. In Part II of the tutorial, we continue our introduction to Dash Enterprise by deploying a machine learning application available from the Dash Gallery.

Initialization on App Manager

Within the Plotly On-Premise Server Manager, either click Open in the Dashboard view or copy and paste the URL provided for the Dash server under Settings; either way, the App Manager portal should load.

The first time you attempt to load the App Manager portal page, you’ll likely be prompted to accept the self-signed SSL certificate you created, and to provide the administrator credentials shown on the Settings page.

Figure 20: Dash App Manager Portal

Before we deploy an application, we’ll need to initialize it so that a local repository is set up and ready to receive the app via git push. Navigate to App Manager and click the blue Initialize App button at the upper right, as seen in Figure 21 below.

Figure 21: Initializing your first Dash for R application

The warning message above can be safely ignored when using an SSH connection for the purposes of this tutorial. However, note that having a recognized certificate for HTTPS communication is crucial in a production environment, where you’ll want an appropriate, recognized SSL certificate to secure web traffic to and from the portal.

Next, we’ll want to enter a name for the app we’re initializing. Fill in the name field, then click the blue Create button seen in Figure 22 below. For future reference, note that you may also optionally link a database in the same configuration pane.

Figure 22: Initialization of Dash App – II

Adding your SSH key to App Manager

Prior to pushing code to the repository we’ve just created by initializing our Dash app, it can be convenient to add our SSH public key to the App Manager via the SSH Keys menu at the top of the window. We’ll reuse the SSH key we created earlier to access our Azure VM.

Figure 23: Adding an SSH public key to App Manager

Configuring your local environment for deployment

We’re almost ready to deploy our Dash app for the first time! Just a couple of brief steps remain; we’ll need to return to the command line interface (CLI) to continue. Don’t worry, it’s straightforward and we’ll walk you through it here.

First, download the application from its GitHub repository by cloning it:

$ git clone git@github.com:plotly/dashr-svm

Next, we’ll need to change directories:

$ cd dashr-svm

The App Manager UI provides instructions on how to configure Git so that it’s able to locate our remote repository on the host:

Figure 24: App Manager Instructions for SSH Configuration

Now we need to add the remote repository:

$ git remote add plotly [dokku@dashr-azure.plotly.host:3022]:dashr-svm

We can confirm that our repository settings are properly configured by retrieving the list of Git remotes:

$ git remote -v

Typing the above command will result in the output shown in Figure 25:

Figure 25: Adding the remote repository and verifying the configuration

Preparing Application Files

We’ve reached the final step prior to “pushing” our code to the remote repository on the Dash Enterprise server. It’s always a good idea to ensure that everything is in place before we attempt to deploy our application, to help avoid potential complications.

The files below will be the same for all Dash for R applications deployed via Dash Enterprise. You won’t need to edit any of these for the purposes of this tutorial, since the dashr-svm repo contains everything you need to get started right away.

This information is just for reference; feel free to skip over what’s described below.

.buildpacks: The (hidden) .buildpacks file defines which buildpack should be used for deployment. In this tutorial, we assume you’re running version 3.3 of Plotly On-Premise. The underlying OS version for this product is Ubuntu 18.04, which also matches that of the heroku-18 “stack”.

We recommend using our fork of Chris Stefano’s R buildpack, which will retrieve version 3.6.1 of R and all the packages required to quickly deploy your Dash app: https://github.com/plotly/heroku-buildpack-r#heroku-18

Procfile: When the app is deployed, a Docker container is initialized within the VM. Within this container, there’s a folder named app which contains the files required to start your Dash application. The Procfile tells the server where to find the script which launches your app. In most cases, it should look something like this:

web: R -f app/app.R

app.R: This file contains the application code which launches the Dash server. While you can provide absolute paths to application resources, it might be easier to invoke setwd() and pass it the application path at the top of your app.R:

setwd("/app")

init.R: This file describes the R packages and options required to run your Dash application. You’ll want to add any R dependencies your app has here.
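As a rough sketch of what such an init.R might contain, following the common convention that init.R installs the packages the app needs at build time (the package names below are hypothetical placeholders, not the actual contents of the dashr-svm repo):

# init.R -- install the R packages the Dash app needs (hypothetical example)
my_packages <- c("dash", "dashHtmlComponents", "dashCoreComponents")

install_if_missing <- function(p) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cloud.r-project.org")
  }
}

invisible(sapply(my_packages, install_if_missing))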

Deploying the Application

OK, we’re ready to push our Dash app! Since this is the first commit, we can use Git shorthand to add all the files in our directory (since they’re all new to Git), and then we’ll commit them with a comment:

git add .
git commit -m "Added App Files"

The changes are now committed locally, but they still need to be uploaded (pushed) to the remote repository. We’ll want to provide the remote name (“plotly”), followed by the branch name.

Here we go!

git push plotly master

Assuming our deployment was successful, we’ll see a message like the one below in Figure 26:

Figure 26: A successful deployment!

Using the URL provided above next to Application deployed:, navigate to the app in your browser. Once you’ve authenticated (assuming your app isn’t configured to be publicly accessible within the App Manager) and accepted the SSL certificate, the application which appears should resemble Figure 27:

Figure 27: SVM Explorer

Congratulations, you successfully configured a Dash Enterprise server within Microsoft Azure and deployed a machine learning app! Let’s recap what we accomplished in this tutorial:

  • We selected a Microsoft Azure virtual machine ready for use with Dash Enterprise
  • We provisioned and configured the VM
  • We installed Dash Enterprise on the VM
  • We successfully deployed a machine learning application written using Dash for R

We hope you’re as excited as we are about harnessing the power of analytics in R using Dash and Microsoft Azure. If you’re ready to take the next step, click here to learn more about Dash and Plotly’s enterprise product offerings.



rBokeh – Tips and Tricks with JavaScript and beyond!


[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers.]

You want a nice interactive visualization, e.g., in R Shiny, where you can zoom in and out, subset the plot, or even have hover and click effects? I guess your first shot would be plotly, and rightfully so. However, there is also a great alternative: rBokeh.

Lately, I had a project where the functionality of plotly did not meet my needs. In particular, plotly does not have a function to draw rectangles. Just converting a ggplot object into plotly by ggplotly() does not allow for a properly customized callback. Thankfully, one of my STATWORX colleagues pointed me to rBokeh. Its functionality and syntax are quite similar to ggplot or plotly since it also follows the rules of grammar of graphics.

In the course of this project, I learned a lot and grew to love rBokeh. Therefore, I want to share some ideas and solutions for some rather specific problems I came across. If you are looking for a general introduction to rBokeh, check out these links. And now, let’s start!

Write complex JavaScript callbacks

An excellent quality of rBokeh is its capability to write customized callbacks with JavaScript (JS). We’re going to dive into this topic next: There are already some examples for rbokeh, but I found them rather basic and insufficiently explained. A more extensive presentation of tasks you can solve with callbacks is available for Python. I will guide you through my approach to write a fancy JS hover callback in R. 🙂

My goal is to trigger something in my plot by hovering over different elements. For this showcase, I use the classic iris data again. When you hover over a point, you should see the corresponding species name in a highlighted box (check the plot!).

The complete code for this task is shown below but let me guide you through it step by step:

Set up the information to trigger

First, I set up the data to be triggered. This includes the names of the different iris species and where this "legend" should appear in the plot.

my_data <- tibble(name = unique(iris$Species),
                  x = c(2.1, 3.1, 4.1),
                  y = c(7.5, 8.0, 7.5))

As usual for an rBokeh plot, we need to call the figure() function first. Since we want to build a sort of legend ourselves, I remove the built-in legend by setting its location to NULL. The scatter plot is then initialized by ly_points(). For visual purposes, I colorize the points according to their species and increase the point size. The next line, however, is quite essential for the functionality of my callback: you must define a name for the glyph (lname) in order to refer to it in the callback. Note that the name does not have to match the glyph function itself; I do this just for convenience.

plot <- 
  figure(data = iris,
         legend_location = NULL) %>% 
  ly_points(x = Sepal.Width,
            y = Sepal.Length,
            color = Species, 
            size = 20,
            lname = "points") %>% 

The names of the species can be included in the plot by ly_text(). We have to specify the data argument and overwrite the default data that was specified before in figure(). I specify the location and text arguments and align them appropriately. Furthermore, I set the color argument to transparent. This predefined CSS color ensures that the text is "hidden" when the plot is first rendered and before we first hover over it. Of course, we have to define the lname argument again.

  ly_text(data = my_data,
          x = x,
          y = y, 
          text = name, 
          align = "center", 
          baseline = "middle",
          color = "transparent",
          lname = "text") %>% 

The centered rectangles are made by ly_crect(). The first special argument I set here is alpha. As with ly_text(), I want the glyph to be transparent in the initial plot. Strangely, however, some glyphs do not recognize transparent as a predefined color, and it is not possible to use an alpha channel inside hex color codes in rBokeh. The only remaining option is to set the alpha argument explicitly to zero, yet this requires an additional step later on. Nevertheless, we need to define some arbitrary color so we can change it later. Also, we must not forget to name the glyph!

  ly_crect(data = my_data,
           x = x,
           y = y, 
           height = 0.5,
           width = 1,
           alpha = 0,
           color = "blue",
           lname = "rects")

That’s it for the initial plot. Now I need to find a linkage between each point in the scatter plot and its corresponding species in my_data. This will heavily depend on the specific case you’re dealing with, but the following lapply does the job here. It returns a list in which each element refers to one line of the iris data and consequently to one point. Each list element is a list by itself and contains the position of the row in my_data with the correct species. Since we use this information in our JS code, you have to account for JavaScript’s zero-based indexing: when we want to refer to the first line of my_data, this is addressed by 0, and so on. One final note here: make sure that you store everything in lists, since JS is not capable of digesting R vectors.

linkages <- lapply(iris$Species, 
                   # To keep it general, I store all values in a list.
                   # This is not necessary if you link to just one value.
                   # However, whenever dealing with one-to-many relationships
                   # you need to store them in a list rather than in a vector
                   FUN = function(x) if (x == "setosa") list(0) # JS is zero-based!
                         else if (x == "versicolor") list(1) 
                         else if (x == "virginica") list(2))

Now we finally arrive at the long-anticipated callback part. Since it should be initialized by hovering, we have to call tool_hover() and define which layer should trigger the callback (ref_layer). In our case, it’s the scatter plot layer, of course. Since we want to include JS code, we need to call custom_callback(). In it, we can define which layers we want to have access to within JS (lnames) and even make further R objects available inside of JS via the args argument. This must be a named list with the R object on the right-hand side and the object name you can address within JS on the left-hand side.

plot %>%
  tool_hover(
    ref_layer = "points",
    callback = custom_callback(
      lnames = c("points", "rects", "text"),
      args = list(links = linkages),

Let’s start with the actual JS code. In the first two lines, I invoke a debugger that makes it easier to develop and debug the code, but more on this topic at the end of my post. To make the code more readable and reduce overhead, I define some variables. The first variable (indices) stores the hover information (cb_data) or, more precisely, the indices of the currently hovered points. The further variables contain the underlying data of the respective glyphs.

      code = paste0("
      debugger;
      console.log(points_data.get('data'));
      var indices = cb_data.index['1d'].indices;
      var pointsdata = points_data.get('data');
      var rectsdata = rects_data.get('data');
      var textdata = text_data.get('data');",

The following code snippet is necessary because we had to set alpha = 0 within ly_crect(). Since we do not always want to hide this glyph, we have to overwrite and increase this value here. When the plot is first called, alpha is zero; as soon as we first hover over the plot, alpha is set to 0.2. Note that we have to change both arguments, fill_alpha and line_alpha. Now we would see the rectangles, but we immediately change their color to transparent; this is achieved in the next step.

      "rects_glyph.get('fill_alpha').value = 0.2      rects_glyph.get('line_alpha').value = 0.2",

Each time the callback is triggered, that is, whenever the cursor changes its position, the color of all text and rectangle elements is set to transparent. This is achieved by calling the JS rgba function (red, green, blue, alpha) and setting the alpha value to 0 for all glyphs. A simple for loop does this job here. For those who are not familiar with this kind of loop specification: we define an iterator variable i whose initial value is zero. After each iteration, i is incremented by 1 (note that the ++ comes after i) as long as i is smaller than, e.g., rectsdata.fill_color.length, that is, the length of the vector.

      "for (var i=0; i < rectsdata.fill_color.length; i++){      rectsdata.line_color[i] = 'rgba(255, 255, 255, 0)';      rectsdata.fill_color[i] = 'rgba(255, 255, 255, 0)';      }      for (var i=0; i < textdata.text_color.length; i++){      textdata.text_color[i] = 'rgba(255, 255, 255, 0)';      }",

We are now approaching the core of the JS callback. These two loops actually trigger the behavior we desire. The outer loop is necessary in case we hover over multiple points simultaneously. For each hovered point, its index position in the data set is stored in the variable ind. In the next step, we finally code the visible callback effect: the element of the links list that corresponds to the currently hovered point is looked up, and its value (the position in my_data) is used to change the color of the desired rectangle and text glyph. The inner loop is just included to keep my example as general as possible, in case the lists stored in links contain multiple elements themselves.

      "for(var i=0; i < indices.length; i++){      var ind = indices[i];      for (var j=0;j< links[ind].length; j++){      rectsdata.fill_color[links[ind][j]] = '#0085AF';      rectsdata.line_color[links[ind][j]] = '#0085AF';      textdata.text_color[links[ind][j]] = '#013848';      }      }")

And that’s it! We built a fancy JS hover callback with rBokeh. And, as promised, here is the full code 🙂

library(tibble)  # for tibble()
library(rbokeh)

my_data <- tibble(name = unique(iris$Species),
                  x = c(2.1, 3.1, 4.1),
                  y = c(7.5, 8.0, 7.5))

plot <- 
  figure(data = iris,
         # Remove legend
         legend_location = NULL) %>% 
  ly_points(x = Sepal.Width,
            y = Sepal.Length,
            # To check the correct functionality of the callback
            color = Species, 
            # Increase size of the points
            size = 20,
            # IMPORTANT: define a name for the glyph!
            lname = "points") %>% 
  ly_text(data = my_data,
          x = x,
          y = y, 
          text = name, 
          # Specify the correct alignment
          align = "center", 
          baseline = "middle",
          # IMPORTANT: Make the text invisible in the initial plot!
          color = "transparent",
          # IMPORTANT: define a name for the glyph!
          lname = "text") %>% 
  ly_crect(data = my_data,
           x = x,
           y = y, 
           # Adapt the size of the rectangles
           height = 0.5,
           width = 1,
           # IMPORTANT: make the rectangles transparent in the initial plot
           alpha = 0,
           # IMPORTANT: nevertheless, define an arbitrary color to refer to it in the callback
           color = "blue",
           # IMPORTANT: define a name for the glyph!
           lname = "rects")

linkages <- lapply(iris$Species, 
                   # To keep it general, I store all values in a list.
                   # This is not necessary if you link to just one value.
                   # However, whenever dealing with one-to-many relationships
                   # you need to store them in a list rather than in a vector
                   FUN = function(x) if (x == "setosa") list(0) # JS is zero-based!
                         else if (x == "versicolor") list(1) 
                         else if (x == "virginica") list(2))

plot %>%
  tool_hover(
    custom_callback(
      code = paste0("
      debugger;
      console.log(points_data.get('data'));
      var indices = cb_data.index['1d'].indices;
      var pointsdata = points_data.get('data');
      var rectsdata = rects_data.get('data');
      var textdata = text_data.get('data');
      rects_glyph.get('fill_alpha').value = 0.2
      rects_glyph.get('line_alpha').value = 0.2
      for (var i=0; i < rectsdata.fill_color.length; i++){
        rectsdata.line_color[i] = 'rgba(255, 255, 255, 0)';
        rectsdata.fill_color[i] = 'rgba(255, 255, 255, 0)';
      }
      for (var i=0; i < textdata.text_color.length; i++){
        textdata.text_color[i] = 'rgba(255, 255, 255, 0)';
      }
      for (var i=0; i < indices.length; i++){
        var ind0 = indices[i];
        for (var j=0; j < links[ind0].length; j++){
          rectsdata.fill_color[links[ind0][j]] = '#0085AF';
          rectsdata.line_color[links[ind0][j]] = '#0085AF';
          textdata.text_color[links[ind0][j]] = '#013848';
        }
      }
      "), 
      lnames = c("points", "rects", "text"),
      args = list(links = linkages)),
    ref_layer = "points")

Final remarks on debugging

As mentioned above, it is advisable to include a debugger; statement in your JS code. This directly triggers the debugger to start and allows you to inspect your plot properly. Just right-click on the "Viewer" window in RStudio and select "Inspect Element".

(Screenshot: Inspect Element in the RStudio Viewer)

As soon as you hover over the plot, this invokes the debugger and allows you to see the complete JS element with all data and attributes.

(Screenshot: the JavaScript debugger)

That’s basically the equivalent of calling rbokeh::debug_callback() if you didn’t use your own JS code. Another nice JS function to know is console.log(). Whatever element or variable you want to have a closer look at, just put it inside this function and check what’s behind it via "Inspect Element"/"Console". This is very helpful for seeing whether something looks like, or contains, what you expect. Here is an example with console.log(points_data.get('data')).

(Screenshot: console.log output in the browser console)

Is there more we can do to customize rBokeh..?

This is all I got for you concerning callbacks in rBokeh. In my next blog post, I will show you how to manipulate a standard rBokeh plot as extensively as its counterpart in Python. Stay tuned!

About the author

Matthias Nistler

I am a data scientist at STATWORX, passionate about wrangling data and getting the most out of it. Outside of the office, I use every second for cycling until the sun goes down.

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

Sign Up Now!


The post rBokeh – Tips and Tricks with JavaScript and beyond! first appeared on STATWORX.



Wikipedia Page View Statistics Late 2007 and Beyond


[This article was first published on petermeissner, and kindly contributed to R-bloggers.]

The {wikipediatrend} package

This blog post covers the major release of the {wikipediatrend} package – namely version 2.1.4.

❤ Thanks to all CRAN people ❤

The History

Introduction

{wikipediatrend} dates back to late 2014. It was my first R package to make it to CRAN, and at that time it was the first and only R package to allow access to Wikipedia page view statistics from within R.

Wikipedia is a repository of vast amounts of knowledge. It is also one of the world’s most used web pages, so numbers on page visits are fun to play with and of public interest at the same time. In addition, Wikipedia and Wikipedia page views are the subject of scientific research. Page views in particular have a history of serving as a measure of public interest and thus provide a basis for public attention research in, e.g., sociology, politics, and economics.

Data Dumps

Wikipedia page view log dumps can be downloaded from 2007 onwards from the following pages:

  • https://dumps.wikimedia.org/other/pagecounts-raw/
  • https://dumps.wikimedia.org/other/ /.

These dumps are huge and not really suited for analyses (~20 TB of compressed text files or ~200 TB of uncompressed data):

  • files are per hour aggregates of server requests
  • all wiki projects are in one file
  • all languages are in one file
  • labels are raw (bytes) server requests instead of standardized article titles
  • download bandwidth is limited

Data Wrapper

Back in 2014 an external page – stats.grok.se – took on the effort to provide daily page view aggregates, with data available from December 2007 onwards. The admin of the page put in the hours to provide the data in a clean way, and {wikipediatrend} provided a wrapper to access it. The aim was – and still is today – to provide analytics-ready data in neat data.frames. The user specifies what she wants and the package does all the rest (a minimal usage sketch follows the list below):

  • sending web requests
  • parsing response data
  • rectangularizing the data
  • caching to keep the traffic burden at a minimum
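For example, a minimal usage sketch (wp_trend() is the package’s main workhorse; the exact argument names shown here are assumptions and may differ between versions, and "Climate_change" is just an illustrative article):

library(wikipediatrend)

# daily page views for one article, returned as a tidy data.frame
trend <- wp_trend(
  page = "Climate_change",
  lang = "en",
  from = "2008-01-01",
  to   = "2015-12-31"
)
head(trend)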

Dark Years

In late 2015/early 2016, several things happened within a couple of months:

  • Wikipedia chose to provide a (high performance) page views web API themselves. 🎉
  • Wikipedia provided their very own convenience R wrapper, {pageviews}, for the page views web API. 👍
  • {wikipediatrend} followed suit by binding both data sources since Wikipedia’s own wrapper only provided data from late 2015 onwards. ✅
  • Later on in 2016 stats.grok.se went down and {wikipediatrend} again followed suit and coped with the changes. 😭

From 2016 onwards {wikipediatrend} only provided a wrapper to the official API plus some further convenience features like the following:

  • a function to query available languages for an article
  • a somewhat smoother and less low level data retrieval flow

During 2016 the expectation was that Wikipedia would extend its API to fill the gap that was left behind by stats.grok.se going out of service – providing page views from late 2007 up to late 2015. During 2017 I was trying to convince Wikipedia’s analytics team to take on the task of providing the data. Wikipedia, I thought, is the root source of the data, so this is the right place to build a data pipeline. Providing hardware is easier for an organization than for an individual, I thought. And they already had the web API running anyway and were knowledgeable about how to process their very own dump files. Unfortunately, communications led to nothing, and I myself had neither the time nor the money to set something like this up all by myself.

Rising Hopes

In 2018 – two and a half years after stats.grok.se went down – some lucky events came together. First, I was leaving my employment at the time and toying with the idea of freelancing – so, suddenly I had a lot of time at my disposal. Second, Simon Munzert, an R web scraping pioneer, political scientist and Professor at the Hertie School of Governance, was using Wikipedia page views for his research. Like me, he felt the need to get the data back online, and like me he is a fan of open data for the public good. Most fortunately, motivation, expertise and time were also met by money in the form of Simon’s research grant from the Mercedes and Benz Foundation. So, plans could be made, rents be paid, and children be fed.

Challenges

The big challenges were time, money and data quality.

It wouldn’t be fun if it was easy – right?

To give some perspective: the data from late 2007 up to late 2015 comprises roughly 3 trillion (3,183,334,099,176) data points for the 45 million (45,008,117) pages of the 20 languages covered (en, de, fr, es, it, cs, da, et, fi, el, hu, no, pl, pt, sk, sl, sv, tr, ru, zh). Those 3 trillion data points had to be aggregated from hourly counts to daily counts, reducing the number of data points to 130 billion (132,638,920,799).

My estimates were that a single machine with some cores would take several months to get all the processing steps done: data download, decompression, filtering, cleaning, transformation, aggregation, and storage.

In addition, my personal goal was to be able to host the clean data aggregates on cheap standard servers – this implies that the data should be well below one terabyte in size and that database queries should not take too much time to deliver a complete time series for a specific article in a specific language.

Solutions

Most of the space stems from the requested article names – long strings taking up loads of bytes. Furthermore, those labels were duplicated due to the low aggregation level and the rawness of the data. To solve this size problem I chose to build up a dictionary of pages available at certain points in time, using data dumps from here and here. All page views that could not be found in the dictionary were simply excluded from the database. Time series values were referenced by dictionary id instead of using the large article labels, and tables were modeled to minimize data duplication.

To tackle the query time I optimized data storage for the standard use case: looking up a specific article for a specific language and requesting all available time series values. The labels in the dictionary got an index for fast lookups, while time series data was stored in such a way that one row contains a whole month of page view values as comma-separated values, reducing the number of rows to look up by a factor of 30.
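To make the idea concrete, here is a toy illustration (hypothetical values and column names, not the actual database schema): one row holds a whole month of daily counts packed into a comma-separated string, keyed by a dictionary id rather than the full article title.

# one article-month packed into a single row (toy example)
month_row <- data.frame(
  page_id = 42L,                          # id from the article-title dictionary
  month   = "2013-07",
  views   = "120,98,101,87,93,110,140",   # first seven days only, for illustration
  stringsAsFactors = FALSE
)

# unpacking the row back into daily values on the client side
daily_views <- as.integer(strsplit(month_row$views, ",")[[1]])
daily_views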

Hurray

The result is a database with a REST API that once again allows open and free-of-charge access to Wikipedia page view statistics for 2007 up to 2016: http://petermeissner.de:8880/. For later dates {wikipediatrend} uses Wikipedia’s own page views API.

Word of Caution

In regard to data quality some words of caution are appropriate: the data has been gathered to allow for the highest possible quality, but as real-life data projects go, there is always some bug waiting to bite you. One issue is article coverage. As described above, which articles are included and which are not depends on the article lists used and their quality. This means, e.g., that you will not get any data for articles that did not exist in 2008 or 2018.

Database API Endpoints

To allow for a public REST API, a lightweight webserver works on top of the PostgreSQL database. The API has several endpoints that serve data in JSON format (a short R example follows the endpoint list):

Endpoints

  • API description can be found at http://petermeissner.de:8880/
  • article stats will return page view stats for the specified article title, given that it’s part of the database: http://petermeissner.de:8880/article/exact/{lang}/{articlename}
  • simple search and stats will do a simple string matching search and return the first 100 title matches as well as their time series data: http://petermeissner.de:8880/article/search/{lang}/{articlename}
  • regex search without stats will do a regular expression search but only return title matches and no page view numbers: http://petermeissner.de:8880/search/{lang}/{regex}
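A quick way to try the API from R (a minimal sketch; "Climate_change" is just an example title and the exact shape of the JSON response is an assumption, so inspect the result with str()):

library(jsonlite)

# page view stats for one article via the "article stats" endpoint above
res <- fromJSON("http://petermeissner.de:8880/article/exact/en/Climate_change")
str(res, max.level = 2)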

The Release

With this release {wikipediatrend} once again has access to data from late 2007 onwards by wrapping the self-hosted API described above and the official Wikipedia {pageviews} package.

Although the package API (how functions are used and what they return) has stayed the same, I decided to celebrate this big release with a major version bump from 1.x.x to 2.x.x.

One new function has been added to wrap the search API endpoint: wpd_search() allows you to search for page titles via regular expressions within the database.

In addition several attempts have been made to improve the overall robustness, especially against missing data.



cran checks API: an update


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

If you have an R package on CRAN, you probably know about CRAN checks. Each package on CRAN that is not archived1 has a checks page, like this one for ropenaq: https://cloud.r-project.org/web/checks/check_results_ropenaq.html

The table on that page shows the results of running R CMD check on the package under a combination of different operating systems, R versions and compilers. CRAN maintainers presumably use these as a basis for getting in touch with maintainers when the checks are failing.

It’s great that packages are being tested in a variety of scenarios to make sure each package will work for as many users as possible.

The only drawback to CRAN checks is that they are not available in a machine-readable way. When information is made available in a machine-readable form, many additional things become possible.

CRAN Checks API

For a recent overview of the API, check out the Overview of the CRAN checks API blog post on the R-Hub blog.

To make CRAN check data available in a machine-readable way, we built out the CRAN checks API. Over the two years since it began, the API has gotten some tweaks and fixes, some of which are very recent.

When the API first came out the only routes were:

  • /heartbeat
  • /docs
  • /pkgs
  • /pkgs/:pkg_name:

Now the routes are:

  • /heartbeat
  • /docs
  • /pkgs
  • /pkgs/:pkg_name:
  • /pkgs/:pkg_name:/history
  • /history/:date
  • /maintainers
  • /maintainers/:email:
  • /badges/:type/:package
  • /badges/flavor/:flavor/:package

The big new additions since the beginning are maintainers, badges, and history; I’ll briefly talk about each.

Maintainers

The maintainers route gives you access to the data on the summary pages by package maintainer https://cloud.r-project.org/web/checks/check_summary_by_maintainer.html. /maintainers gives all maintainers data, while with /maintainers/:email: you can get data for a specific maintainer (by their email address). Note that with /maintainers/:email: at this time you need to use the format CRAN uses for emails to get a match in the API; e.g., with the package ropenaq, the maintainer’s (Maëlle Salmon) email is maelle.salmon@yahoo.se, but it needs to be formatted like /maintainers/maelle.salmon_at_yahoo.se in the API. I will at some point make it so that you can just use the actual email address in the route.

The data returned for the /maintainers API routes is not the detailed data you get on the /pkgs routes. There are two main data fields.

The table slot has an array of hashes, one for each package, with a summary of how many checks were ok, note, warn, and error.

The packages slot has an array of hashes, one for each package, but includes the URL for the CRAN checks page and, similar to the table slot, a summary of the number of checks in each category.
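For example, fetching and inspecting that maintainer data from R might look like this (a sketch; the email formatting follows the convention described above, and the exact nesting of the response is an assumption):

library(jsonlite)

# note the "_at_" formatting of the email address, as described above
res <- fromJSON("https://cranchecks.info/maintainers/maelle.salmon_at_yahoo.se")
str(res, max.level = 3)  # look for the "table" and "packages" slots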

Badges

Badges routes are by far the most widely used routes on this API, with ~900 R packages using a cran checks badge. Within the /badges/:type/:package route, you can get a badge for either /badges/summary/:package or /badges/worst/:package, where the former gives you a badge as Not OK (red) or OK (green), and the latter gives the worst check result (error, warn, note, and ok; each w/ a different color).

The /badges/flavor/:flavor/:package route allows you to get a badge for a specific “flavor”, that is, an operating system OR an R version. For example, /badges/flavor/osx/ggplot2 or /badges/flavor/release/ggplot2.

If you request a badges route with an unknown flavor or package, you get a gray "unknown" badge.

To use a CRAN checks badge, simply copy/paste the below text into your README, changing the package name to your own, and selecting the type of badge you want:

[![cran checks](https://cranchecks.info/badges/summary/reshape)](https://cran.r-project.org/web/checks/check_results_reshape.html)

or

[![cran checks](https://cranchecks.info/badges/summary/reshape)](https://cranchecks.info/pkgs/reshape)

The former links to the CRAN checks page, while the latter links to the API route for the package.

History

We’ve had the /pkgs/:pkg_name:/history route for quite a while now, but it’s been very slow to respond because the SQL database was in a less-than-ideal state (we had no indexes and a ton of data). It’s now fixed, and you can very quickly get up to the last 30 days of check history. We prune any data older than 30 days; all older data gets put in an Amazon S3 bucket to save disk space on the server and to make it available for the /history/:date route.

If you want to go more than 30 days into the past, we’ve got a new route, /history/:date, to get all historical data by day, across all packages. It has daily data back to December 2018. There are a few days missing here and there, as I was learning and making mistakes. To get the data, send a request like /history/2019-10-01, and you’ll get a 302 redirect to a temporary URL (expires in 15 min) for the gzipped JSON file. You can easily get these in R like:

x <- jsonlite::stream_in(curl::curl("https://cranchecks.info/history/2019-10-01"))
str(x)
#> 'data.frame':    15167 obs. of  6 variables:
#> $ package      : chr  "localIV" "di" "GAR" "MetABEL" ...
#> $ summary      : chr  "{\"any\":false,\"ok\":12,\"note\":0,\"warn\":0,\"error\":0,\"fail\":0}" "{\"any\":false,\"ok\":12,\"note\":0,\"warn\":0,\"error\":0,\"fail\":0}" "{\"any\":false,\"ok\":12,\"note\":0,\"warn\":0,\"error\":0,\"fail\":0}" "{\"any\":true,\"ok\":0,\"note\":12,\"warn\":0,\"error\":0,\"fail\":0}" ...
#> $ checks       : chr  "[{\"flavor\":\"r-devel-linux-x86_64-debian-clang\",\"version\":\"0.2.1\",\"tinstall\":2.21,\"tcheck\":40.68,\"t"| __truncated__ "[{\"flavor\":\"r-devel-linux-x86_64-debian-clang\",\"version\":\"1.1.4\",\"tinstall\":2.54,\"tcheck\":24.8,\"tt"| __truncated__ "[{\"flavor\":\"r-devel-linux-x86_64-debian-clang\",\"version\":\"1.1\",\"tinstall\":1.87,\"tcheck\":20.85,\"tto"| __truncated__ "[{\"flavor\":\"r-devel-linux-x86_64-debian-clang\",\"version\":\"0.2-0\",\"tinstall\":2.59,\"tcheck\":19.27,\"t"| __truncated__ ...
#> $ check_details: chr  "null" "null" "null" "{\"version\":\"0.2-0\",\"check\":\"package dependencies\",\"result\":\"NOTE\",\"output\":\"Package suggested bu"| __truncated__ ...
#> $ date_updated : chr  "2019-10-01 15:02:40 UTC" "2019-10-01 15:02:40 UTC" "2019-10-01 15:02:40 UTC" "2019-10-01 15:02:40 UTC" ...

Note above that we use stream_in() because the JSON file is new-line delimited, or ND-JSON; each line in the file is valid JSON, but the entire file is not valid JSON. stream_in() pulls all that data into a nice data.frame, where you can parse as you like from there.

What else is possible now that we have an API?

One concrete idea I hope to pursue at some point:

  • Notify package maintainers. Exactly what they get notified about would be up to the maintainer. Now that the API exists, we can much more easily build a notification system. For example, a maintainer could say they want to get an email whenever CRAN checks are failing more than 3 days in a row; or when CRAN check details match a certain character string (e.g., segfault); or to ignore CRAN check results that are just failures due to temporary problems with a dependency package.

I’m sure there’s many other ideas I haven’t thought of 😸

What’s next?

  • Notifications (see ropenscilabs/cchecksapi#13) may or may not happen – chime in there if you’re interested in helping with this.
  • Add the ability to ignore specific NOTEs for badges. It would be nice to be able to ignore benign NOTEs so you aren’t getting a “something’s wrong” badge when nothing important is wrong (i.e., when CRAN maintainers wouldn’t contact the maintainer because it’s something rather trivial). See ropenscilabs/cchecksapi#29
  • Have an idea or comment? open an issue

  1. old versions of the package exist, but binaries are no longer built and checks are no longer performed ↩


Building Interactive World Maps in Shiny


[This article was first published on R Views, and kindly contributed to R-bloggers.]

Florianne Verkroost is a PhD candidate at Nuffield College at the University of Oxford. With a passion for data science and a background in mathematics and econometrics, she applies her interdisciplinary knowledge to computationally address societal problems of inequality.

In this post, I will show you how to create interactive world maps and how to present them in the form of an R Shiny app. As the Shiny app cannot be embedded into this blog, I will direct you to the live app and show you in the version of this post on my GitHub how to embed a Shiny app in your R Markdown files, which is a really cool and innovative way of preparing interactive documents. To show you how to adapt the interface of the app to the choices of the users, we’ll make use of two data sources, such that the user can choose what data they want to explore and the app adapts the possible input choices to the user’s previous choices. The data sources here are about childlessness and gender inequality, which is the focus of my PhD research, in which I computationally analyse the effects of gender and parental status on socio-economic inequalities.

We’ll start by loading and cleaning the data, after which we will build our interactive world maps in R Shiny. Let’s first load the required packages into RStudio.
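The packages are loaded piecemeal throughout the code below; as a convenience, here is a consolidated loading block for the ones used in this part of the post (plus shiny for the app itself, which is an assumption about the later, unshown sections):

library(magrittr)   # pipe operator
library(rvest)      # scraping the ISO code table
library(readxl)     # reading the UN childlessness Excel file
library(dplyr)      # data manipulation
library(maps)       # geographical data
library(ggplot2)    # map_data() and plotting
library(shiny)      # the interactive app (assumed; not used in the excerpt shown here)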

Importing, exploring and cleaning the data

Now, we can continue with loading our data. As we’ll make world maps, we need a way to map our data sets to geographical data containing coordinates (longitude and latitude). As different data sets use different formats for country names (e.g., “United Kingdom of Great Britain and Northern Ireland” versus “United Kingdom”), we’ll match country names to ISO3 codes to easily merge all data sets later on. Therefore, we first scrape an HTML table of country names and ISO3, ISO2 and UN codes for all countries worldwide. We use the rvest package with an XPath expression to indicate which part of the web page contains our table of interest. We use the pipe (%>%) from the magrittr package to feed our URL of interest into functions that read the HTML table using the XPath and convert it to a data frame in R. One can obtain the XPath by hovering over the HTML table in the browser’s developer mode and having it show the XPath.

The first element in the resulting list contains our table of interest, and as the first column is empty, we delete it. Also, as you can see from the HTML table in the link, there are some rows that show the letter of the alphabet before starting with a list of countries of which the name starts with that letter. As these rows contain the particular letter in all columns, we can delete these by deleting all rows for which all columns have equal values.

library(magrittr)
library(rvest)

url <- "https://www.nationsonline.org/oneworld/country_code_list.htm"
iso_codes <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="CountryCode"]') %>%
  html_table()
iso_codes <- iso_codes[[1]][, -1]
iso_codes <- iso_codes[!apply(iso_codes, 1, function(x){ all(x == x[1]) }), ]
names(iso_codes) <- c("Country", "ISO2", "ISO3", "UN")
head(iso_codes)

##          Country ISO2 ISO3  UN
## 2    Afghanistan   AF  AFG 004
## 3  Aland Islands   AX  ALA 248
## 4        Albania   AL  ALB 008
## 5        Algeria   DZ  DZA 012
## 6 American Samoa   AS  ASM 016
## 7        Andorra   AD  AND 020

Next, we’ll collect our first data set, which is a data set on childlessness provided by the United Nations. We download the file from the link, save it locally, and then load it into RStudio using the read_excel() function in the readxl package.

library(readxl)

url <- "https://www.un.org/en/development/desa/population/publications/dataset/fertility/wfr2012/Data/Data_Sources/TABLE%20A.8.%20%20Percentage%20of%20childless%20women%20and%20women%20with%20parity%20three%20or%20higher.xlsx"
destfile <- "dataset_childlessness.xlsx"
download.file(url, destfile)
childlessness_data <- read_excel(destfile)
head(childlessness_data)

## # A tibble: 6 x 17
##   `United Nations… ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##   <chr>            <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "TABLE  A.8. PE…
## 2 Country          ISO … Peri… Refe… Perc…     Perc…
## 3                        35-39 40-44 45-49 35-39 40-44 45-49
## 4 Afghanistan      4     Earl… ..    ..    ..    ..    ..    ..    ..
## 5 Afghanistan      4     Midd… ..    ..    ..    ..    ..    ..    ..
## 6 Afghanistan      4     Late… 2010  2.6   2.6   2.1   93.8  94.5  94
## # … with 7 more variables: ...11 <chr>, ...12 <chr>, ...13 <chr>,
## #   ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>

We can see that the childlessness data are a bit messy, especially when it comes to the first couple of rows and column names. We only want to maintain the columns that have country names, periods, and childlessness estimates for different age groups, as well as the rows that refer to data for specific countries. The resulting data look much better. Note that when we convert the childlessness percentage columns to numeric type later on, the “..” values will automatically change to NA.

cols <- which(grepl("childless", childlessness_data[2, ]))
childlessness_data <- childlessness_data[-c(1:3), c(1, 3, cols:(cols + 2))]
names(childlessness_data) <- c("Country", "Period", "35-39", "40-44", "45-49")
head(childlessness_data)

## # A tibble: 6 x 5
##   Country     Period  `35-39` `40-44` `45-49`
##   <chr>       <chr>   <chr>   <chr>   <chr>
## 1 Afghanistan Earlier ..      ..      ..
## 2 Afghanistan Middle  ..      ..      ..
## 3 Afghanistan Latest  2.6     2.6     2.1
## 4 Albania     Earlier 7.2     5.5     5.2
## 5 Albania     Middle  ..      ..      ..
## 6 Albania     Latest  4.8     4.3     3.3

Our second data set is about measures of gender inequality, provided by the World Bank. We read this .csv file directly into RStudio from the URL link.

gender_index_data <- read.csv("https://s3.amazonaws.com/datascope-ast-datasets-nov29/datasets/743/data.csv")
head(gender_index_data)

##   Country.ISO3 Country.Name Indicator.Id
## 1          AGO       Angola        27959
## 2          AGO       Angola        27960
## 3          AGO       Angola        27961
## 4          AGO       Angola        27962
## 5          AGO       Angola        28158
## 6          AGO       Angola        28159
##                                                            Indicator
## 1                                   Overall Global Gender Gap Index
## 2                  Global Gender Gap Political Empowerment subindex
## 3                  Global Gender Gap Political Empowerment subindex
## 4                                   Overall Global Gender Gap Index
## 5 Global Gender Gap Economic Participation and Opportunity Subindex
## 6 Global Gender Gap Economic Participation and Opportunity Subindex
##   Subindicator.Type   X2006    X2007    X2008    X2009   X2010   X2011
## 1             Index  0.6038   0.6034   0.6032   0.6353  0.6712  0.6624
## 2              Rank 81.0000  92.0000 103.0000  36.0000 24.0000 24.0000
## 3             Index  0.0696   0.0696   0.0711   0.2007  0.2901  0.2898
## 4              Rank 96.0000 110.0000 114.0000 106.0000 81.0000 87.0000
## 5              Rank 69.0000  87.0000  87.0000  96.0000 76.0000 96.0000
## 6             Index  0.5872   0.5851   0.5843   0.5832  0.6296  0.5937
##   X2012   X2013    X2014   X2015   X2016   X2018
## 1    NA  0.6659   0.6311   0.637   0.643   0.633
## 2    NA 34.0000  38.0000  38.000  40.000  58.000
## 3    NA  0.2614   0.2402   0.251   0.251   0.206
## 4    NA 92.0000 121.0000 126.000 117.000 125.000
## 5    NA 92.0000 111.0000 116.000 120.000 113.000
## 6    NA  0.6163   0.5878   0.590   0.565   0.602

Luckily, these data are better structured than the childlessness data. The data contain gender inequality measures per year, and for convenience we add a new column with the values for the most recent year for which data are available. In this post, we'll only look at the rank indicators rather than the indices and normalized scores. We drop the Subindicator.Type and Indicator.Id columns using the select() function from the dplyr package, as we won't need them further.

library(dplyr)

gender_index_data["RecentYear"] <- apply(gender_index_data, 1, function(x){
  as.numeric(x[max(which(!is.na(x)))])
})
gender_index_data <- gender_index_data[gender_index_data$Subindicator.Type == "Rank", ] %>% 
  select(-Subindicator.Type, -Indicator.Id)
names(gender_index_data) <- c("ISO3", "Country", "Indicator", as.character(c(2006:2016, 2018)), "RecentYear")
head(gender_index_data)
##    ISO3 Country
## 2   AGO  Angola
## 4   AGO  Angola
## 5   AGO  Angola
## 7   AGO  Angola
## 9   AGO  Angola
## 11  AGO  Angola
##                                                                                           Indicator
## 2                                                  Global Gender Gap Political Empowerment subindex
## 4                                                                   Overall Global Gender Gap Index
## 5                                 Global Gender Gap Economic Participation and Opportunity Subindex
## 7                                                 Global Gender Gap Educational Attainment Subindex
## 9                                                    Global Gender Gap Health and Survival Subindex
## 11 Wage equality between women and men for similar work (survey data, normalized on a 0-to-1 scale)
##    2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2018 RecentYear
## 2    81   92  103   36   24   24   NA   34   38   38   40   58         58
## 4    96  110  114  106   81   87   NA   92  121  126  117  125        125
## 5    69   87   87   96   76   96   NA   92  111  116  120  113        113
## 7   107  119  122  127  125  126   NA  127  138  141  138  143        143
## 9     1    1    1    1    1    1   NA    1   61    1    1    1          1
## 11   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  135   94         94

Next, we load our world data with geographical coordinates directly from the ggplot2 package. These data contain the geographical coordinates of all countries worldwide, which we'll need later to plot the world maps.

library(maps)
library(ggplot2)

world_data <- ggplot2::map_data('world')
world_data <- fortify(world_data)
head(world_data)
##     long   lat group order region subregion
## 1 -69.90 12.45     1     1  Aruba
## 2 -69.90 12.42     1     2  Aruba
## 3 -69.94 12.44     1     3  Aruba
## 4 -70.00 12.50     1     4  Aruba
## 5 -70.07 12.55     1     5  Aruba
## 6 -70.05 12.60     1     6  Aruba

To map our data, we need to merge the childlessness, gender gap index, and world map data. As I said before, these all have different notations for country names, which is why we'll use the ISO3 codes. However, even between the ISO code data and the other data sets, there are discrepancies in country names. Unfortunately, to solve this, we need to manually change some country names in our data to match those in the ISO code data set. The code for doing so is long and tedious, so I won't show it here, but for your reference you can find it here; a brief sketch of the idea follows.
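In essence, the recoding is just a lookup from the names used in our data to the names used in the ISO code data set. A minimal, hypothetical sketch (the country names below are only examples; the full mapping is in the linked code):

renames <- c("United States of America" = "United States",
             "Viet Nam"                 = "Vietnam")
hit <- childlessness_data$Country %in% names(renames)
childlessness_data$Country[hit] <- renames[childlessness_data$Country[hit]]
# ... and similarly for world_data$region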

Now that the name changes for countries have been made, we can add the ISO3 codes to our childlessness and world map data. The gender gap index data already contain these codes, so there’s no need for us to add these there.

childlessness_data['ISO3'] <- iso_codes$ISO3[match(childlessness_data$Country, iso_codes$Country)]
world_data["ISO3"] <- iso_codes$ISO3[match(world_data$region, iso_codes$Country)]

Next, we melt the childlessness and gender gap index data into long format such that they will have similar shape and column names for merging. The melt() function is included in package reshape2. The goal here is to create variables that have different unique values for the different data, such that I can show you how to adapt the R Shiny app input to the users’ choices. For example, we’ll create a DataType column that has value Childlessness for the rows of the childlessness data and value Gender Gap Index for all rows of the gender gap index data. We’ll also create a column Period that contains earlier, middle and later periods for the childlessness data, and different years for the gender gap index data. As such, when the user chooses to explore the childlessness data, the input for the period will only contain the choices relevant to the childlessness data (i.e., earlier, middle, and later periods and no years). When the user chooses to explore the gender gap index data, they will only see different years as choices for the input of the period, and not earlier, middle, and later periods. The same goes for the Indicator column. This may sound slightly vague at this point, but we’ll see this in practice later on when building the R Shiny app.

library(reshape2)

childlessness_melt <- melt(childlessness_data, id = c("Country", "ISO3", "Period"), 
                           variable.name = "Indicator", value.name = "Value")
childlessness_melt$Value <- as.numeric(childlessness_melt$Value)
gender_index_melt <- melt(gender_index_data, id = c("ISO3", "Country", "Indicator"), 
                          variable.name = "Period", value.name = "Value")

After melting the data and ensuring they're in the same format, we combine them using the rbind() function, which works here because the data have the same column names.

childlessness_melt["DataType"] <- rep("Childlessness", nrow(childlessness_melt))
gender_index_melt["DataType"] <- rep("Gender Gap Index", nrow(gender_index_melt))
df <- rbind(childlessness_melt, gender_index_melt)

Creating an interactive world map

Next, it’s time to define the function that we’ll use for building our world maps. The inputs to this function are the merged data frame, the world data containing geographical coordinates, and the data type, period and indicator the user will select in the R Shiny app. We first define our own theme, my_theme(), for setting the aesthetics of the plot. Next, we select only the data that the user has selected to view, resulting in plotdf. We keep only the rows for which the ISO3 code has been specified (some countries, e.g., Channel Islands in the childlessness data, are not contained in the ISO code data). We then add the data the user wants to see to the geographical world data. Finally, we plot the world map. The most important part of this plot is that contained in the geom_polygon_interactive() function from the ggiraph package. This function draws the world map in white with grey lines, fills it up according to the value of the data selected (either childlessness or gender gap rank) in a red-to-blue color scheme set using the brewer.pal() function from the RColorBrewer package, and interactively shows in the tooltip the ISO3 code and value when hovering over the plot.

worldMaps <- function(df, world_data, data_type, period, indicator){
  
  # Function for setting the aesthetics of the plot
  my_theme <- function () { 
    theme_bw() + theme(axis.text = element_text(size = 14),
                       axis.title = element_text(size = 14),
                       strip.text = element_text(size = 14),
                       panel.grid.major = element_blank(), 
                       panel.grid.minor = element_blank(),
                       panel.background = element_blank(), 
                       legend.position = "bottom",
                       panel.border = element_blank(), 
                       strip.background = element_rect(fill = 'white', colour = 'white'))
  }
  
  # Select only the data that the user has selected to view
  plotdf <- df[df$Indicator == indicator & df$DataType == data_type & df$Period == period,]
  plotdf <- plotdf[!is.na(plotdf$ISO3), ]
  
  # Add the data the user wants to see to the geographical world data
  world_data['DataType'] <- rep(data_type, nrow(world_data))
  world_data['Period'] <- rep(period, nrow(world_data))
  world_data['Indicator'] <- rep(indicator, nrow(world_data))
  world_data['Value'] <- plotdf$Value[match(world_data$ISO3, plotdf$ISO3)]
  
  # Create caption with the data source to show underneath the map
  capt <- paste0("Source: ", ifelse(data_type == "Childlessness", "United Nations", "World Bank"))
  
  # Specify the plot for the world map
  library(RColorBrewer)
  library(ggiraph)
  g <- ggplot() + 
    geom_polygon_interactive(data = world_data, color = 'gray70', size = 0.1, 
                             aes(x = long, y = lat, fill = Value, group = group, 
                                 tooltip = sprintf("%s %s", ISO3, Value))) + 
    scale_fill_gradientn(colours = brewer.pal(5, "RdBu"), na.value = 'white') + 
    scale_y_continuous(limits = c(-60, 90), breaks = c()) + 
    scale_x_continuous(breaks = c()) + 
    labs(fill = data_type, color = data_type, title = NULL, x = NULL, y = NULL, caption = capt) + 
    my_theme()
  
  return(g)
}

Building an R Shiny app

Now that we have our data and world mapping function ready and specified, we can start building our R Shiny app. (If you're not familiar with R Shiny, I recommend that you have a look at the Getting Started guide first.) We can build our app by specifying the UI and server components. In the UI, we include a fixed user input selection where the user can choose whether they want to see the childlessness or gender gap index data. We further include dynamic inputs for the period and indicators the user wants to see. As mentioned before, these are dynamic because the choices shown will depend on the selections made by the user on previous inputs. We then use the ggiraph package to output our interactive world map. We use the sidebarLayout() function to show the input selections on the left side and the world map on the right side, rather than the two stacked vertically.

Everything that depends on the inputs by the user needs to be specified in the server function, which in this case is not only the world map creation, but also the second and third input choices, since these depend on the previous inputs made by the user. For example, when we run the app later, we’ll see that when the user selects the childlessness data for the first input for data type, the third indicator input will only show age groups, and the text above the selector will also show “age group”, whereas when the user selects the gender gap index data, the third indicator will show different measures and the text above the selector will show “indicator” rather than “age group”.
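For orientation, here is a minimal sketch of how the UI and server could be wired together. Input IDs, labels and the ggiraph output functions are illustrative, not necessarily those of the live app (which may use the older ggiraphOutput()/renderggiraph() interface):

library(shiny)
library(ggiraph)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("data_type", "Data type:", c("Childlessness", "Gender Gap Index")),
      uiOutput("period_ui"),    # dynamic: choices depend on the selected data type
      uiOutput("indicator_ui")  # dynamic: choices depend on the selected data type
    ),
    mainPanel(girafeOutput("map"))
  )
)

server <- function(input, output) {
  output$period_ui <- renderUI({
    selectInput("period", "Period:", unique(df$Period[df$DataType == input$data_type]))
  })
  output$indicator_ui <- renderUI({
    selectInput("indicator", "Indicator:", unique(df$Indicator[df$DataType == input$data_type]))
  })
  output$map <- renderGirafe({
    req(input$period, input$indicator)
    girafe(ggobj = worldMaps(df, world_data, input$data_type, input$period, input$indicator))
  })
}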

Finally, we can run our app either by clicking "Run App" at the top of the RStudio IDE, or by running

shinyApp(ui = ui, server = server)

Below is a screenshot of the app. You can check out the live app here. In this post on my GitHub, you can also see how to embed a Shiny app in your R Markdown files, which is a really cool and innovative way of preparing interactive documents. Finally, the source code used to build the live app can also be found on my GitHub here.

Now try selecting different inputs and see how the input choices change when doing so. Also, don’t forget to try hovering over the world map to see different data values for different countries interactively!



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Twitter data analysis in R


[This article was first published on poissonisfish, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

[Image: Game of Thrones, Season 7, Episode 4] What can you tell from over half a million tweets reacting to the last season of Game of Thrones? [source]

Like many other fans of the show, I had great expectations for the eighth and last season of Game of Thrones (GoT) that premiered on 14 April 2019. The much anticipated moment coincided with the last few days I spent finalising my last post on Bayesian models. Nonetheless, it provided a good testing ground for quantitative text analysis – by scouring and analysing tweets from the US in the course of the eighth and final GoT season, spanning the period between 7 April and 28 May 2019.

Introduction

In many regards, this post will be very different from previous entries. While the focus is the usual R-based statistical analysis, data collection is also discussed in depth, and this in turn calls for basic Unix / macOS terminal commands. MS Windows users can refer to VirtualBox or Ubuntu installations. In summary, this post will demonstrate how to

  • Schedule a Twitter API program to harvest up to 100,000 tweets on a daily basis;
  • Process and aggregate harvested tweets;
  • Explore the resulting dataset using geocoding, document-feature and feature co-occurrence matrices, wordclouds and time-resolved sentiment analysis.

Kaggle

The large size of the resulting Twitter dataset (714.5 MB), also unusual in this blog series and prohibitive by GitHub standards, had me resorting to Kaggle Datasets for hosting it. Kaggle not only offers considerable storage capacity for both private and public datasets, but also supports writing and reproducibly executing R and Python scripts, also called kernels or notebooks. Notably, execution is served with up to 16 GB RAM and 5 GB storage at the time of this writing (ATTOW).

You can find the dataset page here and download the files either manually or using the Kaggle API with the following command,

kaggle datasets download monogenea/game-of-thrones-twitter -p INSERT_PATH

The Twitter dataset gotTwitter.csv shows up under Data Sources along with the code used for data collection. The code was split between the complementary scripts harvest.R and process.R that deal with tweet harvest and processing, respectively. To glean some basic insights from the data, I also wrote a Get-Started kernel that you can re-run, fork or modify. You are welcome to publish your own kernel!

The cron scheduler

Before turning to the R analysis I would like to introduce the cron scheduler. The cron is a convenient Unix scheduler that can assist with repeating tasks over regular time intervals, and a powerful tool for searching tweets using the Twitter API. To initialise a cron job you need to launch a text editor from the terminal by running crontab -e and add a line as shown below, before saving.

* * * * * INSERT_PATH_INTERPRETER INSERT_PATH_SCRIPT

The first five arguments, separated by whitespace, specify minute, hour, day of month, month and day of the week, respectively. An asterisk in place of any of the five instructs the job to be executed at every such instance. Say, if you want a job to run every Monday at 17:30, you could use 30 17 * * 1. As another example, to run every minute between 6:00 and 7:00 on the tenth day of every month, use * 06 10 * *. One can additionally specify ranges (e.g. 00-14 in the first position for the first 15 minutes) or arrays (e.g. 01,05 in the second position for 1:00 and 5:00). This is testament to the versatility of the cron scheduler. Once the job is finished you can simply kill the scheduler by running crontab -r. If, however, you have additional unfinished jobs, enter the editor with crontab -e instead, delete the line above and save. More information about setting up a cron job can be found here.

Sharing is caring

Much of this post will review, in finer detail, both harvest and processing steps as well as the analysis from the accompanying kernel. Due to the large volume of all harvested files combined and the restricted access to tweets from that period, data collection and analysis are decoupled. I will first describe the data collection so that you can familiarise yourself with the process and reutilise the two scripts. The R analysis, on the other hand, is based on the provided dataset and should be fully reproducible both locally and on Kaggle.

Take the utmost responsibility when handling demographic information. The present records captured from the Twitter API are in the public domain and licensed as such, sensitive to the extent they associate with usernames and geographical coordinates. Use these tools for the common good, and always aim at making your data and work visible and accessible.

Let’s get started with R

Harvest

The quality of data collection and downstream analyses is dictated by the scheduled daily tweet search, or as I call it, the harvest. The harvest was executed using the script harvest.R, which will next be broken down into three separate parts – the Twitter API, the optional Google Maps API and the actual tweet search. We will then define the cron job setup that brings all three together.

Twitter API

The standard Twitter API, which is free of charge, offers a seven-day endpoint to your tweet searches. This means you can only retrieve tweets that are at most seven-day old. If you are planning a search to trace back longer than that, the obvious alternative to paid subscriptions is to repeatedly search tweets weekly or daily, using the standard option. Here is how.

To get started you first need a set of credentials from a Twitter API account. You can use your regular Twitter account, if you have one. All it takes is creating an app, then extracting consumer key, consumer secret, access token and access secret. Handle these carefully and do not share them. You can then create a token with create_token from rtweet by passing the app name and the four keys as shown below.

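The embedded gist is not reproduced here, but the call is essentially a single create_token() with your app name and keys. A sketch with placeholder credentials:

library(rtweet)

token <- create_token(
  app             = "MY_GOT_APP",       # the name of your Twitter app
  consumer_key    = "CONSUMER_KEY",
  consumer_secret = "CONSUMER_SECRET",
  access_token    = "ACCESS_TOKEN",
  access_secret   = "ACCESS_SECRET",
  set_renv        = FALSE               # see the note below about .rtweet_token.rds
)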

Only much later did I notice that a single authentication creates the hidden file .rtweet_token.rds to be used in future executions via .Renviron. Eventually, repeated searches will write the same file with different number endings, creating unnecessary clutter. Because it makes no difference in this process, you can prevent this from happening by switching off the set_renv option of create_token.

Google Maps API (optional)

The existence of geographical coordinates from users or devices can substantially empower studies based on social media content. If you want to geocode tweets you also need a Maps JavaScript API key from the Google Maps Platform. This can be done for free after setting up an account, creating a project, setting up a billing account with your credit card details and generating a set of credentials you can then pass to rtweet. You should get about US$200 monthly free credit, which is plenty for searching tweets. You can find more information here.

After setting up your Google Maps token, you can pass the key to lookup_cords as apikey, immediately after the region, country or city you want to study. In the present case we are interested in the whole of the US, so we have lookup_coords("usa", apikey = apiKey) as will be seen later.

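Again, as a sketch with a placeholder key:

apiKey <- "MY_GOOGLE_MAPS_KEY"
usCoords <- lookup_coords("usa", apikey = apiKey)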

Tweet search

The key player in the harvest process is the actual tweet search, which is singlehandedly managed by the function search_tweets. Inside this function you can define keywords q, whether or not to use retryonratelimit, language lang, geocoding geocode, whether or not to include retweets include_rts and search size n, among other options. I used the simple query game of thrones with retry-on-rate-limit, English language, US geocoding, discarding retweets and limiting the search to 100,000 tweets. The resulting object newTweets is then written into a CSV file in a directory called tweets. I found it convenient to call Sys.Date() to set the name of each CSV file to the corresponding day of harvesting, i.e. YYYY-MM-DD.csv.

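The gist itself is not shown here; a sketch of the search call, following the settings just described, could look like:

newTweets <- search_tweets(
  q = "game of thrones",
  n = 100000,
  retryonratelimit = TRUE,
  lang = "en",
  geocode = lookup_coords("usa", apikey = apiKey),
  include_rts = FALSE
)

write_as_csv(newTweets, file_name = paste0("tweets/", Sys.Date(), ".csv"))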

Cron job setup

At this stage the tweet search harvest.R is fully set up and ready to be scheduled. The interpreter for executing R scripts is Rscript and the script we want to execute on a regular basis is harvest.R. As put forth in the Get-Started kernel, to reproduce my cron job you need to launch a text editor from the terminal by running crontab -e and add the following line before saving,

0 04 * * * /Library/Frameworks/R.framework/Resources/Rscript ~/Documents/INSERT_PATH/harvest.R

My cron job was instructed to execute the script harvest.R every day at 4:00. You might also have noticed the very first line in harvest.R, carrying a #! prefix. This is a shebang, a special header that invokes an interpreter for executing the script. By using it we no longer need to specify the interpreter in the cron instance above. However, execution permissions must first be granted from the terminal, e.g. with the command chmod u+x harvest.R. Only then will the following alternative to the above crontab work,

0 04 * * * ~/Documents/INSERT_PATH/harvest.R

Processing

I ran into some issues at the start of the harvest, so the first effective batch from 17 April 2019 had to stretch back well over three days, to when the first episode aired. To address the problem, I exceptionally set the search size to 300,000 tweets on the first day and to 100,000 on all successive days until the end of the harvest. Taking 100,000 per day was clearly more than needed, as reflected by large overlaps in successive files (data not shown). Despite the redundancy, it nevertheless assured capturing most activity, including peaks coinciding with the air dates of all six episodes. By the end of the harvest all batches but the first weighed in at ~100 MB.

[Figure: file sizes of the daily tweet collections]

The process generated a large volume of data (4.02 GB), so it made little sense to keep the individual files. Therefore, I wrote the script processing.R that

i) Lists all CSV files in the directory tweets;

ii) Defines a function mergeTweets to extract records from a donor table to a reference recipient table, using the column status_id to identify unique tweets;

iii) Iterates the function over all files listed in i) to populate a recipient table called allTweets;

iv) Writes the resulting allTweets table as gotTwitter.csv with UTF-8 encoding.

Note that I make use of rtweet functions throughout, namely read_twitter_csv, do_call_rbind and write_as_csv. These functions are optimised for handling rtweet objects, they run considerably faster and are cross-compatible, compared to alternative methods. Also note the unflatten = T option in read_twitter_csv that prevents inadvertently writing the flat long-lat coordinates back to CSV. This was one of the issues I ran into, as the default mode will return a table whose coordinates can no longer be used with the maps package.

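A sketch of what processing.R does (the actual script may differ in details):

library(rtweet)

files <- list.files("tweets", pattern = "\\.csv$", full.names = TRUE)

# ii) append donor records to the recipient only if their status_id is new
mergeTweets <- function(recipient, donor) {
  new <- donor[!donor$status_id %in% recipient$status_id, ]
  do_call_rbind(list(recipient, new))
}

# iii) iterate over all daily files
allTweets <- read_twitter_csv(files[1], unflatten = TRUE)
for (f in files[-1]) {
  allTweets <- mergeTweets(allTweets, read_twitter_csv(f, unflatten = TRUE))
}

# iv) write the merged table with UTF-8 encoding
write_as_csv(allTweets, file_name = "gotTwitter.csv", fileEncoding = "UTF-8")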

To my relief, the resulting gotTwitter.csv file carried tweets dated to between 7 April and 28 May 2019, therefore covering all six episode air dates and some more. It amassed an impressive total of 760,660 unique tweets.

Get-Started analysis

This analysis is largely based on the quanteda and maps packages and fully described in the Get-Started kernel. We will first load all required packages and read the CSV Twitter file. The tidyverse package and downstream dependencies work seamlessly with rtweet, maps and lubridate. The package reshape2 will be used later to convert timestamps from wide to long format. The Twitter dataset will then be read and unflattened using read_twitter_csv.

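In outline, a sketch of the setup code:

library(rtweet)
library(tidyverse)
library(quanteda)
library(maps)
library(lubridate)
library(reshape2)

allTweets <- read_twitter_csv("gotTwitter.csv", unflatten = TRUE)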

Before proceeding, you might want to convert the UTC timestamps under created_at to an appropriate US timezone. In the kernel, created_at is overwritten with the corresponding lubridate encoding via as_datetime, and then converted to EDT (NY time) using with_tz.

Next, we can look into the US-wide geographical distribution of the harvested tweets. We first overwrite allTweets using lat_lng(), which simply adds two new columns, lat and lng, carrying valid long-lat coordinates. Then we can create an instance of maps, which will require very large par margins in Kaggle kernels. To draw the US map with state borders we use map("state") with an appropriate lwd option to set the line width. Now we can add the data points to the canvas by passing the long-lat coordinates.

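A sketch of these steps (timezone conversion, geocoding and the base-graphics map):

# created_at: UTC -> New York time
allTweets$created_at <- with_tz(as_datetime(allTweets$created_at), "America/New_York")

# add lat/lng columns derived from the tweet coordinate fields
allTweets <- lat_lng(allTweets)

par(mar = c(0, 0, 0, 0))
map("state", lwd = 0.25)
points(allTweets$lng, allTweets$lat, pch = 16, cex = 0.3, col = rgb(0, 0, 1, alpha = 0.2))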
[Figure: map of tweet locations across the contiguous US]

Aside from having more GoT reactions on Twitter in more densely populated areas, the above figure clearly hints at the representation of all 48 contiguous US states. My apologies to all Alaskans and Hawaiians; please modify the code above to visualise either state or the whole of the 50-state US. Interestingly, the dataset also covers some activity outside of the US.

Let us move on to textual analysis and the underlying mathematical representation. We will clean up the tweet text by removing irrelevant elements, generate tokens from the resulting content and build a document-feature matrix (DFM). Strictly speaking, tokens are terms comprising single or multiple words (i.e. n-grams) delineated by whitespace, and the number of occurrences per tweet shows up in the corresponding DFM column. Because tokenisation is exhaustive, DFMs tend to be extremely sparse.

The proposed tokenisation process strips off Twitter-specific characters, separators, symbols, punctuation, URLs, hyphens and numbers. Then, it identifies all possible uni- and bigrams. The inclusion of bigrams is important to capture references to characters like Night King and Grey Worm. Then, after setting all alphabetical characters to lower-case and removing English stop-words (e.g. the, and, or, by) we create the DFM gotDfm, with a total of 2,025,121 features.

We can now investigate co-occurrence of character names in the DFM. In this context, if we let CO_{A,B } represent the co-occurrence of a pair of any characters A and B, we have

CO_{A,B} = \sum_{i=1}^{n} 1_{T_i}(\{A,B\}), \qquad 1_{T_i}(\{A,B\}) = \begin{cases} 1, & \text{if } \{A,B\} \subseteq T_i \\ 0, & \text{otherwise} \end{cases}

where T_i is the ith tweet, and a mathematical set with as many elements as tokens. The summation over the indicator function 1_{T_i}(\{A,B\}) counts tweets where both A and B are mentioned, and hence serves as a measure of association between the two characters.

I chose a set of 20 GoT character names whose matching features will be pulled out from the DFM and used to prepare a feature co-occurrence matrix (FCM). Because CO_{A,B} = CO_{B,A} , this is a simple 20 \times 20 symmetric matrix carrying all co-occurrence counts in the DFM that very much resembles a covariance matrix. We can easily visualise the FCM as a network of GoT characters using textplot_network. The min_freq argument applies a cutoff to discard co-occurrences with small relative frequencies.

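A sketch of the tokenisation, DFM and co-occurrence network steps. With recent quanteda versions textplot_network() lives in quanteda.textplots, and the character vector below is only an illustrative subset of the 20 names:

library(quanteda)
library(quanteda.textplots)

tkn <- tokens(allTweets$text, remove_punct = TRUE, remove_symbols = TRUE,
              remove_url = TRUE, remove_numbers = TRUE, remove_separators = TRUE)
tkn <- tokens_tolower(tkn)
tkn <- tokens_remove(tkn, stopwords("en"))
tkn <- tokens_ngrams(tkn, n = 1:2)

gotDfm <- dfm(tkn)

characters <- c("jon", "daenerys", "cersei", "jaime", "arya", "sansa", "bran",
                "brienne", "missandei", "grey_worm", "night_king")
gotFcm <- fcm(dfm_select(gotDfm, pattern = characters))
textplot_network(gotFcm, min_freq = 0.1)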

[Figure: co-occurrence network of the 20 GoT characters]

The undirected graph above represents my choice of 20 GoT characters as nodes, and the underlying co-occurrences by means of connecting edges, which grow thicker as their relative frequency increases. Do these results make sense? For one, the strongest associations occur between lovers or enemies. In the first case we have couples like Daenerys and Jon, Jaime and Cersei, Arya and Gendry, Brienne and Jaime with a dash of Tormund. Although not popular overall, Grey Worm associated more closely to his beloved Missandei too. In the second case, we have Arya and the Night King, Cersei and Daenerys and Arya and Cersei. This is clearly a very crude way to characterise their relationships, and one could argue this should be done on separate time intervals, as these same relationships can be assumed to evolve throughout the show.

Next we will consider the popularity dynamics of all 20 GoT characters to flesh out patterns based on their interventions in all six episodes. We will create the object popularity to carry binary values that ascertain whether individual tweets mention any of the characters, from the tkn list object. Then, we append the created_at column from allTweets and use the melt function to expand created_at over all 20 columns. Finally, entries where any of the 20 GoT characters is mentioned are selected and the results can be visualised.

Since character popularity can be expected to fluctuate over time, pinpointing the exact air dates of all six episodes can greatly improve our analysis. They all occurred at regular intervals of one week, at 21:00 EST starting 14 April 2019. Again with the help of lubridate, we can encode this using ymd_hms("2019-04-14 21:00:00", tz="EST") + dweeks(0:5). We can now use epAirTime to highlight the exact time of all six air dates. Finally, a bit of ggplot and ggridges magic will help plot the distribution of character references over time.

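A sketch of the popularity step, reusing gotDfm and characters from the sketch above (plot details are illustrative):

library(ggridges)

popularity <- as.data.frame(as.matrix(dfm_select(gotDfm, pattern = characters)) > 0)
popularity$created_at <- allTweets$created_at

popLong <- melt(popularity, id.vars = "created_at",
                variable.name = "Character", value.name = "Mentioned")
popLong <- popLong[popLong$Mentioned, ]

epAirTime <- ymd_hms("2019-04-14 21:00:00", tz = "EST") + dweeks(0:5)

ggplot(popLong, aes(x = created_at, y = Character)) +
  geom_density_ridges(fill = "steelblue", alpha = 0.5) +
  geom_vline(xintercept = epAirTime, linetype = "dashed") +
  theme_minimal()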

 

[Figure: distribution of character mentions over time, with episode air dates highlighted]

Here too we can relate the results to key events in the show, specifically

  • Brienne knighted by Jaime in Ep.2;
  • The affair between Arya and Gendry in Ep.2;
  • The demise of Melisandre, Jorah, Theon and Night King in Ep. 3;
  • The beheading and execution of Missandei and Varys, respectively, in Ep.4;
  • The demise of both the Hound and Cersei in Ep.5;
  • The crowning of Bran in Ep.6.

Up to this point all materials and analysis we discussed are described and available from the Kaggle dataset page. Time for a break? ☕

Since you are here

To avoid making this post a mere recycling of what I uploaded to Kaggle a few months ago, I propose moving on to discuss Twitter bots, building wordclouds and conducting sentiment analysis.

Dealing with Twitter bots

In working with Twitter data, one can argue that the inexpressive and pervasive nature of ads and news put out by bot accounts can severely bias analyses aimed at user sentiment, which we will use shortly. One strategy to identify and rule out bots is to simply summarise the number of tweets, as there should be a human limit to how many you can write in the period between 7 April and 28 May 2019. An appropriate tweet count limit can then be used, beyond which users are considered bots.

But before kicking off, I would like to bring out some intuition about tweeting behaviour. I would expect that users are generally less reactive to corporate than to personal tweets (your thoughts?). How reactive users are about a certain tweet, for example, can manifest itself in retweeting. Considering the point discussed above, I would therefore expect users that tweet less to gather more retweets on average. The following piece will test this hypothesis in the present dataset, and investigate that association in relation to the number of followers.

The dataset can be partitioned based on the usernames available under allTweets$screen_name, from which we can then derive

  • num_tweets, how often each user tweeted during this period;
  • median_rt, the median number of retweets from each user (median over mean, to avoid large fluctuations due to extreme cases of null or exaggerated retweeting);
  • mean_followers, the average number of followers in that period (average since of course it is dynamic).

We should also visualise the data in log-scale, as they are expectedly highly skewed. The code below uses a log_{10} transformation with a unit offset.

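A sketch of the per-user summaries and the log-scale plot (the column names are the standard rtweet fields):

userStats <- allTweets %>%
  group_by(screen_name) %>%
  summarise(num_tweets = n(),
            median_rt = median(retweet_count, na.rm = TRUE),
            mean_followers = mean(followers_count, na.rm = TRUE))

plot(log10(userStats$num_tweets + 1), log10(userStats$median_rt + 1),
     cex = log10(userStats$mean_followers + 1) / 3, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "log10(no. of tweets + 1)", ylab = "log10(median retweets + 1)")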

[Figure: median retweets vs. tweet counts per user, point size proportional to log10 followers]

It seems that indeed there is a negative relationship between the frequency of tweeting and the average number of retweets. Here, the plotting character size is proportional to the log_{10} of the number of followers. In the original scale, we are talking about users with a number of followers ranging from zero to 77,581,90. This impressive number in particular comes from @TheEllenShow. Nonetheless, there is no clear association with either median retweets or tweet counts, which suggests that both high- and low-profile users can be equally prolific and equally contribute to viral retweeting.

Before turning our attention to wordclouds, I propose we first remove potential bots. How likely is it for a human user to publish 10^3 tweets about GoT in seven weeks? My approach removes all users that tweeted as many as 10^2  times during this period. You may find it to be too strict, so feel free to update the cutoff.
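Using the per-user summaries sketched above, the filter could simply be:

bots <- userStats$screen_name[userStats$num_tweets >= 100]
allTweets <- allTweets[!allTweets$screen_name %in% bots, ]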

Wordclouds

Wordclouds are an effective way of summarising textual data by displaying the top frequent terms in a DFM, where word size is proportional to its relative frequency. To better understand the audience, the following wordcloud construction procedure will be focussed on tweets published right after each episode.

This will require cleaning the textual data more thoroughly than before. I mentioned UTF-8 previously, which is the encoding used on the text from our dataset. UTF-8 helps encode special characters and symbols, such as those from non-Latin alphabets, and can easily be decoded back. Since special characters and symbols can tell us little about opinion, we can easily remove them using a trick proposed by Ken Benoit in this Stack Overflow thread. Then we also have Unicode for emojis, equally irrelevant. Under allTweets$text, emojis share a common encoding structure between angle brackets, so we might be better off using a regex (i.e. regular expression) to replace all emoji encodings with single whitespaces while avoiding off-targets. In the code below, I propose using the regex <[A-Z+0-9]+>. You can read this pattern as something like: identify all occurrences of < followed by one or more capital letters, digits and +, followed by >. Regex patterns are really useful and I might cover them in greater depth in a future post.

The tokenisation recipe above to remove symbols, punctuation and more can be re-used here. In preparing the subsequent DFM we will convert all letters to lower case and remove stop-words, as before, but also stem words. Stemming, as the name suggests, clips the last few characters of a word and is effective in resolving the differences among singular, plural and verbal forms of semantically related terms (e.g. imagined, imagination and imagining can be stemmed to imagin). Further down the line, irrelevant terms such as the show title and single-letter features can also be discarded from the DFM. Finally, we subset the resulting DFM to the window between two hours and four days after the airing of each of the six episodes. The six subsets will be used to construct six separate wordclouds with a word limit of 100. You might be warned about words not fitting the plot; worry not!

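A sketch for a single episode (the kernel loops over all six; epAirTime is defined above, and the two-hours-to-four-days window is my reading of the interval described):

library(quanteda.textplots)

# strip encoded emojis/special characters
allTweets$text <- gsub("<[A-Z+0-9]+>", " ", allTweets$text)

crp <- corpus(allTweets, text_field = "text")
ep1 <- corpus_subset(crp, created_at >= epAirTime[1] + hours(2) &
                          created_at <= epAirTime[1] + days(4))

ep1Dfm <- dfm(tokens(ep1, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE))
ep1Dfm <- dfm_remove(ep1Dfm, c(stopwords("en"), "game", "thrones"))
ep1Dfm <- dfm_wordstem(ep1Dfm)
textplot_wordcloud(ep1Dfm, max_words = 100)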

Move your mouse over the different panels to reveal the episode number and title. We can now inquire about why some of these frequent terms emerge:

  • Ep.1 – premier, first and one certainly allude to this being the season premiere. Another interesting emerging term is recap, hinting on fans going back to previous seasons;
  • Ep.2 – night, week and battl are clear references to the much anticipated battle of Winterfell that aired the week after (Ep.3, The Long Night); arya and sex likely refer to the sex scene with Gendry;
  • Ep.3 – night, battl and winterfel as explained above. There are also some curious references to Avengers: Endgame, the film that premiered on 24 April 2019;
  • Ep.4 – If you watched the show, you certainly remember the controversial Starbucks coffee cup scene from this episode;
  • Ep.5 – final, last and similar terms reflect some anticipation about the show finale. Fans discuss petitions for a remake (remak, petit) and express disappointment (disappoint). We will encounter disappointment again when working with sentiment analysis;
  • Ep.6 – end and final, evidently; bran and king also emerge, as Bran is crowned the king of the six kingdoms. Some disappointment, as noticed before.

Sentiment analysis

I would like to conclude the post with sentiment analysis, i.e. determining the balance between positive and negative emotions over time, from matching tokens to a sentiment dictionary from quanteda. By framing the analysis against the six air dates we can make statements about the public opinion on the last GoT season. 

Conducting sentiment analysis is deceptively simple. The tokens from the wordcloud exercise are initially matched against the dictionary data_dictionary_LSD2015 and processed just as before, to build a DFM. In contrast to previous DFMs, this instance carries counts of words associated with either positive or negative emotions. Alternative methods, including the package SentimentAnalysis, also list emotionally neutral words, thereby relying on a three-level response. Finally, we pool counts from the same day accordingly and can now set out to investigate the evolution of sentiment regarding the eighth season of GoT.

In accordance with this quanteda tutorial, we can derive a relative sentiment score whose sign tells us which sentiment was most expressed on any particular day. The sentiment score for any given day is simply calculated as

sentiment = \frac{\sum_{i=1}^{n}(P_i-N_i)}{\sum_{i=1}^{n}(P_i+N_i)}

where P_i and N_i are counts of words associated with positive and negative sentiment, respectively, in the ith tweet from that day. As a result, this score ranges between -1 and 1 and takes positive (resp. negative) values when positive (resp. negative) emotions dominate, with the advantage of normalising to the total counts. Let us have a look.

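A sketch of the sentiment scoring, matching tokens against the Lexicoder dictionary shipped with quanteda and pooling counts by day:

sentToks <- tokens(allTweets$text, remove_punct = TRUE)
sentDfm <- dfm(tokens_lookup(sentToks, dictionary = data_dictionary_LSD2015[1:2]))

day <- as.character(as.Date(allTweets$created_at))
byDay <- dfm_group(sentDfm, groups = day)

pos <- as.numeric(byDay[, "positive"])
neg <- as.numeric(byDay[, "negative"])
sentScore <- (pos - neg) / (pos + neg)

plot(as.Date(docnames(byDay)), sentScore, type = "b",
     xlab = "Date", ylab = "Relative sentiment score")
abline(h = 0, lty = 2)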

[Figure: daily relative sentiment score over the course of the season]

It is interesting to observe the dominant positive sentiment around Ep.1 and Ep.2 that gives way to an uninterrupted period of negative sentiment between Ep.3 and Ep.5, which in turn evolves to a more neutral sentiment by Ep.6 and the end of the show.

While no statistical tests were used, I think it is fair to make some assumptions. There was a lot of excitement with the approaching premiere, as fans had waited over a year for the final season. In that light, it would make sense to observe overall positive emotions leading up to Ep.1. Then, the negative emotions that dominate from Ep.2 to Ep.5 are in fine agreement with the disappointment picked up from our wordclouds and indeed echoed in various media. Finally, with the approaching end of the season and the show by Ep.6, users presumably reviewed it as a whole, perhaps explaining why the negative reactions subsided. Had we used a permutation test, e.g. shuffling the date order of the tweets and re-analysing the data, we could have drawn a confidence interval and determined the significance of the sentiment changes. Up for the challenge?

Wrap-up

This was a long journey, six months in the making, a very productive one ☕. Various different aspects of Twitter data analysis were considered, including

  • How to collect Twitter data, first by harvesting tweets using a cron job with daily access to the Twitter API, then by processing the harvested tweets;
  • How to judiciously clean textual data, remove stop-words, stem words and manipulate date-time variables;
  • How to build DFMs, FCMs and co-occurrence networks;
  • How to analyse the popularity dynamics of individual GoT characters;
  • How to distinguish human users and bots;
  • How to build basic wordclouds using well defined time intervals;
  • How to conduct a basic sentiment analysis.

I hope this gives you a glimpse of the value of this dataset and the powerful combination of R and other scripting languages. I learned much about Unix with The Unix Workbench free course from Coursera, which I highly recommend to beginners. 

My first contact with quantitative textual analysis took place in the Cambridge AI Summit 2018 organised by Cambridge Spark after a brilliant talk by Kenneth Benoit, professor at the London School of Economics and Political Science in the UK. Ken is the main developer of quanteda and demonstrated its use on a Twitter analysis of Brexit vote intentions. I am much indebted to him for inspiration.

Finally, those that follow my blog surely also noticed it has a new face. After much struggling with the ugly and poorly functional Syntax Highlighter plugin from WordPress.com, I found Github Gists a neat alternative. I also finally registered the website, now poissonisfish.com and totally ad-free. I hope you like it!

And yes, I have a thing with coffee. ☕


To leave a comment for the author, please follow the link and comment on their blog: poissonisfish.


RcppArmadillo 0.9.800.1.0


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Another month, another Armadillo upstream release! Hence a new RcppArmadillo release arrived on CRAN earlier today, and was just shipped to Debian as well. It brings a faster solve() method and other goodies. We also switched to the (awesome) tinytest unit test framework, and Min Kim made the configure.ac script more portable for the benefit of NetBSD and other non-bash users; see below for more details. Once again we ran two full sets of reverse-depends checks, no issues were found, and the package was auto-admitted at CRAN after less than two hours despite there being 665 reverse depends. Impressive stuff, so a big Thank You! as always to the CRAN team.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 665 other packages on CRAN.

Changes in RcppArmadillo version 0.9.800.1.0 (2019-10-09)

  • Upgraded to Armadillo release 9.800 (Horizon Scraper)

    • faster solve() in default operation; iterative refinement is no longer applied by default; use solve_opts::refine to explicitly enable refinement

    • faster expmat()

    • faster handling of triangular matrices by rcond()

    • added .front() and .back()

    • added .is_trimatu() and .is_trimatl()

    • added .is_diagmat()

  • The package now uses tinytest for unit tests (Dirk in #269).

  • The configure.ac script is now more careful about shell portability (Min Kim in #270).

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .



Simple Tabulations Made Simple


[This article was first published on petermeissner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The {tabit} package

This is a blog post announcing the brand new micro package {tabit} that just made it to CRAN.

🙌 Thanks to all CRAN people 🙌

Motivation

{tabit} is a package about making simple tabulations simple. My motivation for writing it was the realization that I missed the way I could do tabulations in Stata: easily getting an idea of the data, very fast.

While R of course has a tabulation function built in, I was always struggling to get it to show what I wanted without specifying too many arguments.

The way I want it to work:

  • I want frequencies.
  • I want percentages.
  • I do not want it to ignore NAs – never ever, no way.
  • I want to see how things look without NAs.
  • I want it in a format that is easy to use later on or re-use for non interactive data glancing tasks.
  • I want results to be sorted by decreasing frequency.
  • I want it to be generic so I can make it work for vectors and lists and data.frames and all the things that might come up in the future.
  • I want it to be configurable via parameters.
  • I really do not want to have to touch those parameters ever.
  • I want it to have zero dependencies.
  • I want it all, I want it now

Over the last years I realized I was rewriting more or less the same function over and over again for projects and also for packages. After having gone through some iterations, I am now quite happy with the outcome – however little code it might be.

Giving it a try

At the moment only one-dimensional tables are implemented. Multidimensional tabulations are planned, but a well balanced design for such a function is still under development.

Let's have a demo using the built-in "New York Air Quality" data set. The data set consists of several variables, some of which contain missing values. The variable of interest, Solar.R, measures solar radiation in Langleys.

To get a quick overview I round the radiation measures to the nearest hundred and use ti_tab1() to get a frequency table. The result of a call to ti_tab1() is a data.frame with one line per variable value, sorted by decreasing frequency and, by default, including frequencies for missing values as well. Since percentages differ depending on whether missing values (NA) are included, there is one column excluding NAs and one including them.

library(tabit)

ti_tab1(x=round(airquality$Solar.R,-2))
##   value count   pct pct_all
## 3   200    50 34.25   32.68
## 4   300    45 30.82   29.41
## 2   100    34 23.29   22.22
## 1     0    17 11.64   11.11
## 5             7    NA    4.58

If sorting by frequency is not what I want I can easily turn it off by setting the sort parameter to FALSE:

ti_tab1(x=round(airquality$Solar.R,-2),sort=FALSE)
##   value count   pct pct_all
## 1     0    17 11.64   11.11
## 2   100    34 23.29   22.22
## 3   200    50 34.25   32.68
## 4   300    45 30.82   29.41
## 5             7    NA    4.58

The same is true for the numbers of digits to show for the percentage columns:

ti_tab1(x=round(airquality$Solar.R,-2),digits=0)
##   value count pct pct_all
## 3   200    50  34      33
## 4   300    45  31      29
## 2   100    34  23      22
## 1     0    17  12      11
## 5             7  NA       5

ti_tab1(x=round(airquality$Solar.R,-2),digits=4)
##   value count     pct pct_all
## 3   200    50 34.2466 32.6797
## 4   300    45 30.8219 29.4118
## 2   100    34 23.2877 22.2222
## 1     0    17 11.6438 11.1111
## 5             7      NA  4.5752

Since ti_tab1() is implemented as a generic it can handle multiple data types – i.e. vectors, data.frames, and lists – and can be extended to cover other data types as well.

Again, ti_tab1() returns a data.frame. This time a column named name has been added, which captures the name of the column on which the frequencies and percentages are based.

ti_tab1(x=lapply(airquality,round,-2))
##       name value count    pct pct_all
## 1    Ozone     0    82  70.69   53.59
## 2    Ozone          37     NA   24.18
## 3    Ozone   100    33  28.45   21.57
## 4    Ozone   200     1   0.86    0.65
## 5  Solar.R   200    50  34.25   32.68
## 6  Solar.R   300    45  30.82   29.41
## 7  Solar.R   100    34  23.29   22.22
## 8  Solar.R     0    17  11.64   11.11
## 9  Solar.R           7     NA    4.58
## 10    Wind     0   153 100.00  100.00
## 11    Wind           0     NA    0.00
## 12    Temp   100   153 100.00  100.00
## 13    Temp           0     NA    0.00
## 14   Month     0   153 100.00  100.00
## 15   Month           0     NA    0.00
## 16     Day     0   153 100.00  100.00
## 17     Day           0     NA    0.00

Last but not least the fact that ti_tab1() returns simple data.frames means that R provides a large array of things I can do with them – plotting, filtering, writing to file – and that every R user instantly knows how to handle them.

# get all counts
ti_tab1(x=airquality$Wind)$count
## [1] 15 11 11 11 10  9  8  8  8  8  8  6  5  4  4  3  3  3  3  3  2  1  1  1  1  1  1  1  1  1  1  0

# get the highest percentage
tab <- ti_tab1(x=round(airquality$Solar.R,-2))
tab$pct[1]
## [1] 34.25

# get percentage of NAs
tab$pct_all[is.na(tab$value)]
## [1] 4.58

Things to come

As mentioned beforehand, one of the things planned for this micro package is to add multidimensional tables. Another option is to extend the tabulation functions to allow for user-defined aggregation functions producing statistics other than counts and percentages.

Other than that I think the package really has quite a narrow scope and we should keep it like that.


To leave a comment for the author, please follow the link and comment on their blog: petermeissner.


microbenchmark_1.4-7 on CRAN


[This article was first published on FOSS Trading, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I pushed an updated microbenchmark to CRAN a couple weeks ago. There were two noteworthy changes, thanks to great contributions from @MichaelChirico and @harvey131.

Michael fixed a bug in the check for whether the unit argument was a character string (#9, #10). The prior behavior was an uninformative error.

Harvey added a feature to allow you to use a string for common checks: “equal”, “identical”, and “equivalent” (#16). So you don’t need to create a custom function to use all.equal(), all.equal(…, check.attributes = FALSE), and identical, respectively.
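For example, a quick illustration of the string shortcut (my own toy comparison, not taken from the package documentation):

library(microbenchmark)

x <- runif(100)
# check = "equal" verifies the expressions return equal results before timing them
microbenchmark(sqrt(x), x^0.5, check = "equal")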

I also converted the unit tests to use RUnit, and made some changes to the repo, including adding a contributing guide and issue/pull-request templates.


To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading.


Gold-Mining Week 6 (2019)


[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Week 6 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 6 (2019) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.


Back in the GSSR


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The General Social Survey, or GSS, is one of the cornerstones of American social science and one of the most-analyzed datasets in Sociology. It is routinely used in research, in teaching, and as a reference point in discussions about changes in American society since the early 1970s. It is also a model of open, public data. The National Opinion Research Center already provides many excellent tools for working with the data, and has long made it freely available to researchers. Casual users of the GSS can examine the GSS Data Explorer, and social scientists can download complete datasets directly.

At present, the GSS is provided to researchers in a choice of two commercial formats, Stata (.dta) and SPSS (.sav). It’s not too difficult to get the data into R (especially now that the Haven package is pretty reliable), but it can be a little annoying to have to do it repeatedly. After doing it one too many times, I got tired of it and I made a package instead.

The gssr package provides the GSS Cumulative Data File (1972-2018) and the GSS Three Wave Panel Data File (2006-2010), together with their codebooks, in a format that makes it straightforward to get started working with them in R. The gssr package makes the GSS a little more accessible to users of R, the free software environment for statistical computing, and thus helps in a small way to make the GSS even more open than it already is. The package presently lives at http://kjhealy.github.io/gssr/, as it is still in development. There is a vignette providing an overview of what’s included, and you can see the source code on GitHub.
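If you want to try it out, the development version can be installed from GitHub (repository name taken from the links above):

# install.packages("remotes")
remotes::install_github("kjhealy/gssr")
library(gssr)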


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


issuer: Local issue tracking, no net required


[This article was first published on R – William Doane, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The goal of issuer is to provide a simple issue tracker, hosted on your local file system, for those users who don’t want to or are disallowed from using cloud-based code repositories.

Online code repositories often provide an issue tracker to allow developers, reviewers, and users to report bugs, submit feature requests, and so on. However, many developers either choose to work offline or work on enterprise networks where use of cloud services may be prohibited.

issuer is an Add-in for use in RStudio’s desktop IDE. It works entirely locally with no requirement for a cloud service or even a network connection.

Read more about issuer at https://github.com/WilDoane/issuer

You can install the development version of issuer from Github with:

devtools::install_github("WilDoane/issuer")


To leave a comment for the author, please follow the link and comment on their blog: R – William Doane.


Mediation, confounding, and measurement error


[This article was first published on R Psychologist - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Mediation might be the ultimate example of how a method continues to be used despite a vast number of papers and textbooks describing the extremely strong assumptions required to estimate unbiased effects. My aim with this post is not to show some fancy method that could help reduce bias; rather, I just want to present a small simulation-based example of the underappreciated consequences of measurement error and confounding. Many other people have made the same point; for instance, Dunn & Bentall (2007) expressed some strong concerns about investigating mediators in psychological treatment studies:

“The assumptions concerning the lack of hidden confounding and measurement errors are very rarely stated, let alone their validity discussed. One suspects that the majority of investigators are oblivious of these two requirements. One is left with the unsettling thought that the thousands of investigations of mediational mechanisms in the psychological and other literatures are of unknown and questionable value.” (p. 4743)

The causal mediation model

In all examples, I assume that mediation is investigated in a randomized controlled trial where treatment allocation is randomized. The treatment is a cognitive-behavioral therapy (CBT), we want to estimate the indirect effect of homework completion, and the hypothesis is that a non-trivial amount of the treatment effect is mediated by exposure-based homework adherence. The figure below presents three different scenarios that I will simulate.

  • In (a), the relationship between the mediator and the outcome is confounded, but neither the mediator nor the confounder is measured with error.
  • In (b), the confounder is measured with error; I assume independent and nondifferential measurement error (i.e., classical measurement error).
  • In (c), there’s no confounding, but now the mediator is measured with error.

The causal estimands are most clearly expressed using the potential outcomes framework, where the indirect effect for a single patient (Imai, Keele, & Tingley, 2010) is written as

$$ \text{indirect effect} = Y_i(1, M_i(1)) - Y_i(1, M_i(0)) $$

and the direct effect of the treatment is,

$$ \text{direct effect} = Y_i(1, M_i(t)) - Y_i(0, M_i(t)). $$

\(M_i(1)\) is the level of the mediator under the treatment and \(M_i(0)\) under the control, and \(Y_i(1, M_i(1))\) is thus the outcome after treatment with the mediator at the natural level realized under the treatment. The subscript i indicates that these effects can be different for each individual. Just as with treatment effects, all these potential outcomes cannot be observed for every patient, but we can estimate the average causal effects. The indirect effect tells us “[w]hat change would occur to the outcome if one changes the mediator from the value that would be realized under the control condition, \(M_i(0)\), to the value that would be observed under the treatment condition, \(M_i(1)\), while holding the treatment status at t” (Imai, Keele, & Tingley, 2010, p. 311).
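For completeness (this identity is standard in the causal mediation literature rather than quoted from the original post), the total effect is simply the sum of the direct and indirect effects, with the two components evaluated at complementary treatment levels:

$$ \text{total effect} = Y_i(1, M_i(1)) - Y_i(0, M_i(0)) = \text{direct effect} + \text{indirect effect}. $$

This is the quantity labeled total in the simulation code further down.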

Generate the data

We’ll use the following packages. The simulations are performed using powerlmm, and the models are fit using brms.

library(brms)
library(purrr)
# 0.5.0 DEV VERSION, not on CRAN
library(powerlmm)
library(dplyr)
library(ggplot2)

We need to create a custom function that simulates the data.

#' Create mediation data
#' using potential outcomes
#'
#' @param n total number of participants
#' @param b_pre_M Effect of pretest values on M1
#' @param b_pre_Y Effect of pretest values on Y
#' @param b_M1 Effect of M1 on outcomes
#' @param b_TX Direct effect of TX
#' @param pre_M Mean of pre
#' @param M1_M Mean of M1, ignoring contribution of confounder
#' @param M_me_sd SD of mediator's measurement error
#' @param pre_me_sd SD of pretest confound's measurement error
#' @param ...
#'
#' @return a tibble
sim_data <- function(n, b_pre_M, b_pre_Y, b_M1, b_TX,
                     pre_M = 10, M1_M = 2.5,
                     M_me_sd = 0, pre_me_sd = 0, ...) {
    tibble::tibble(
        # pretest for Y
        pre = rnorm(n, pre_M, 2),
        # treatment assignment
        TX = rbinom(n, 1, 0.5),
        # Mediator in control, 0 for all
        M0 = 0,
        # Mediator under treatment
        M1 = rnorm(n, M1_M, 1) + b_pre_M * pre,
        # Y(0, M(0)), outcome in control when mediator at control levels
        Y0_M0 = 3 + b_pre_Y * pre + rnorm(n, 0, 2),
        # Y(0, M(1)), outcomes in control when mediator at TX levels
        Y0_M1 = Y0_M0 + b_M1 * M1,
        # Y(1, M(0)), outcomes in TX when mediator at control levels
        Y1_M0 = 3 + b_TX + b_pre_Y * pre + rnorm(n, 0, 2),
        # Y(1, M(1)), outcomes in TX when mediator at TX levels
        Y1_M1 = Y1_M0 + b_M1 * M1,
        # Mediator
        M = (TX == 0) * M0 + (TX == 1) * M1,
        # Mediator with error
        M_me = (TX == 0) * M0 + (TX == 1) * (M1 + rnorm(n, 0, M_me_sd)),
        # Pretest with error
        pre_me = pre + rnorm(n, 0, pre_me_sd),
        # Outcome
        y = (TX == 0) * Y0_M0 + (TX == 1) * Y1_M1
    )
}

Let’s pass this function to powerlmm as a custom model.

ds <- study_design(custom = TRUE)

# confounding
p <- study_parameters(ds,
                      n = 100,
                      b_pre_M = -0.25,
                      b_pre_Y = 0.5,
                      b_TX = -3,
                      b_M1 = -0.6,
                      pre_M = 10,
                      M1_M = 7.5,
                      pre_me_sd = 1.5,
                      M_me_sd = 1,
                      data_gen = sim_data)

Since this is a custom model, we need to define the true parameter values if we want to calculate the coverage of the CIs automatically.

# The true parameter values
# used by powerlmm to calculate CI coverage etc
#
# Uninteresting paras are set to 0, could prob. be NA instead
indirect <- with(p, b_M1 * (M1_M + b_pre_M * pre_M))
direct <- p$b_TX
params <- list(
    "fixed" = list(
        "M_Intercept" = 0,
        "y_Intercept" = 0,
        "M_TX" = 0,
        "y_M" = 0,
        "y_TX" = 0,
        "indirect" = indirect,
        "direct" = direct,
        "total" = indirect + direct,
        "prop_mediated" = indirect / (indirect + direct)
    ),
    "random" = list(
        "sigma_M" = 0,
        "sigma_y" = 0
    )
)
p$params <- params
params$fixed
## $indirect
## [1] -3
##
## $direct
## [1] -3
##
## $total
## [1] -6
##
## $prop_mediated
## [1] 0.5

Let’s generate a large data set to look at the values for the true causal mediation model.

pn <- p
pn$n <- 5e4
dn <- simulate_data(pn)

dn %>%
    summarise(indirect = mean(Y1_M1 - Y1_M0),
              direct = mean(Y1_M1 - Y0_M1),
              Z_M = mean(M1 - M0),
              total = mean(Y1_M1 - Y0_M0),
              prop_mediated = indirect / total)
## # A tibble: 1 x 5
##   indirect direct   Z_M total prop_mediated
##      <dbl>  <dbl> <dbl> <dbl>         <dbl>
## 1    -3.00  -2.99  4.99 -5.99         0.500

We can see that the average indirect effect of exposure-based homework is -3, and that the average direct effect is -3 (effects transmitted via other mechanisms). Thus, the total treatment effect is a 6-point reduction, and 50% of that effect is mediated by homework adherence.

We can also take a random sample of 100 participants and look at the individual-level effects. The figure below shows the direct, indirect, and total effects for these 100 participants. We see that the effects vary substantially on the individual level. In reality, we can’t know if the individual-level effects vary or if they are constant for all participants.
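The code for that figure is not reproduced in this extract; a minimal sketch of how such a plot could be built from the potential outcomes in dn (using the column names created by sim_data above) might look like this:

library(tidyr)

set.seed(1)
dn %>%
    sample_n(100) %>%                     # random sample of 100 participants
    mutate(id = row_number(),
           direct = Y1_M1 - Y0_M1,        # individual-level direct effect
           indirect = Y1_M1 - Y1_M0,      # individual-level indirect effect
           total = Y1_M1 - Y0_M0) %>%     # individual-level total effect
    gather(effect, value, direct, indirect, total) %>%
    ggplot(aes(id, value)) +
    geom_point(alpha = 0.5) +
    facet_wrap(~effect, ncol = 1) +
    labs(x = "Participant", y = "Effect")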

Run the simulation

Let’s first define the simulations for the scenarios with confounding, i.e., (a) and (b). We’ve already defined the measurement error such that cor(pre, pre*) ≈ 0.8.

dn %>% summarise(cor(pre, pre_me))
## # A tibble: 1 x 1
##   `cor(pre, pre_me)`
##                <dbl>
## 1              0.799

We’ll fit all models using brms. There are other packages that can fit these models (e.g., mediation, which includes a bunch of useful tools), but I’ll use brms since powerlmm already has methods to extract its results.

# No adjustment
d <- simulate_data(p)
fit_b <- brm(bf(M ~ TX) +
                 bf(y ~ M + TX) +
                 set_rescor(FALSE),
             data = d)

# Adjust for pretest of outcome
fit_b_pre <- brm(bf(M ~ pre + TX) +
                     bf(y ~ pre + M + TX) +
                     set_rescor(FALSE),
                 data = d)

We also need to add a function that will calculate the indirect and direct effects.

summarize_model <- function(fit, d) {
    summary_func <- function(x) {
        data.frame("estimate" = mean(x),
                   "se" = sd(x),
                   "pval" = NA,
                   "df" = NA,
                   "df_bw" = NA,
                   "CI_lwr" = quantile(x, 0.025),
                   "CI_upr" = quantile(x, 0.975))
    }
    posterior_samples(fit) %>%
        transmute(indirect = b_M_TX * b_y_M,
                  direct = b_y_TX,
                  total = indirect + direct,
                  prop_mediated = indirect / total) %>%
        map_df(summary_func, .id = "parameter")
}

We can then create three simulation formulas.

f0 <- sim_formula(fit_b, post_test = summarize_model)
f1 <- sim_formula(fit_b_pre, post_test = summarize_model)

# Just rename pre_me to pre
# pre now has measurement error
add_pre_me <- function(d, ...) {
    d$pre <- d$pre_me
    d
}
f1_me <- sim_formula(fit_b_pre,
                     post_test = summarize_model,
                     data_transform = add_pre_me)

Then we just run the simulation. This code can also be used to calculate power for a mediation study.

# manually start clusters
# need to load packages
cl <- parallel::makeCluster(12)
parallel::clusterEvalQ(cl, {
    library(dplyr)
    library(purrr)
})

res <- simulate(p,
                nsim = 1000,
                cores = 12,
                cl = cl,
                formula = sim_formula_compare("M" = f0,
                                              "M_pre" = f1,
                                              "M_pre_me" = f1_me))
saveRDS(res, "med_sim.Rds")

The simulation for the scenario with measurement error in the mediator is performed in the same way. The correlation between the mediator measured with error (M* = M_me) and the true mediator (M) is about 0.7, in the treatment group.
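That correlation can be checked directly in the large simulated data set from earlier (a quick sketch; dn was generated above with M_me_sd = 1):

dn %>%
    filter(TX == 1) %>%
    summarise(cor_M_Mme = cor(M, M_me))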


# Remove confounding
p1 <- p
p1$b_pre_M <- 0
p1$M1_M <- 5

# Sim formulas
f0 <- sim_formula(fit_b, post_test = summarize_model)

add_M_me <- function(d, ...) {
    d$M <- d$M_me
    d
}
f1_me <- sim_formula(fit_b,
                     post_test = summarize_model,
                     data_transform = add_M_me)
f1_me_pre <- sim_formula(fit_b_pre,
                         post_test = summarize_model,
                         data_transform = add_M_me)

# manually start clusters
# need to load packages
cl <- parallel::makeCluster(12)
parallel::clusterEvalQ(cl, {
    library(dplyr)
    library(purrr)
})

res2 <- simulate(p1,
                 nsim = 1000,
                 cores = 12,
                 cl = cl,
                 formula = sim_formula_compare("M" = f0,
                                               "M_me" = f1_me,
                                               "M_me_pre" = f1_me_pre))
saveRDS(res2, "med_me_sim.Rds")

Simulation results

Now we just have to summarize the results. First, we create two functions to extract the relevant results.


res <- readRDS("med_sim.Rds")
res_me <- readRDS("med_me_sim.Rds")
sum_res <- summary(res)
sum_res_me <- summary(res_me)

extract_summary <- function(model) {
    model$FE %>%
        filter(parameter %in% c("indirect", "direct",
                                "total", "prop_mediated"))
}
summary_table <- function(res) {
    map_df(res$summary, extract_summary, .id = "label") %>%
        transmute(label,
                  parameter,
                  M_est,
                  theta,
                  "%_RB" = (M_est - theta) / theta * 100,
                  Power,
                  CI_Cover)
}

Then we can plot the results for the indirect effects.


library(tidyr)

x <- summary_table(sum_res)
x <- x %>%
    filter(parameter == "indirect") %>%
    mutate(sim = "confounding",
           label = factor(label,
                          levels = c("M_pre", "M_pre_me", "M"),
                          labels = c("Adjusted",
                                     "Adjusted (with measurement error)",
                                     "Unadjusted")))

x_me <- summary_table(sum_res_me)
x_me <- x_me %>%
    filter(parameter == "indirect") %>%
    mutate(sim = "ME",
           label = factor(label,
                          levels = c("M", "M_me", "M_me_pre"),
                          labels = c("Mediator (perfect)",
                                     "Mediator (with measurement error)",
                                     "Mediator (with measurement error) + Adjusted")))

tmp <- rbind(x, x_me)
tmp_long <- gather(tmp, variable, value, -sim, -label, -parameter, -theta)
variables <- c("M_est", "%_RB", "Power", "CI_Cover")
tmp_long <- mutate(tmp_long,
                   variable = factor(variable,
                                     levels = variables,
                                     labels = c("Estimate", "% RB",
                                                "Power", "CI Coverage")),
                   sim = factor(sim,
                                levels = c("confounding", "ME"),
                                labels = c("Confounding \n M - Y",
                                           "Measurement error \n in mediator")))
tmp_hline <- data.frame(variable = unique(tmp_long$variable),
                        yintercept = c(-3, 0, 0.8, 0.95))

p_res <- ggplot(tmp_long, aes(label, value, color = sim)) +
    geom_line(aes(group = variable)) +
    geom_point() +
    geom_hline(data = tmp_hline,
               aes(yintercept = yintercept),
               color = "black",
               linetype = "dashed",
               alpha = 0.5) +
    facet_grid(sim ~ variable, drop = TRUE, scales = "free") +
    labs(x = NULL, y = NULL) +
    coord_flip() +
    scale_color_manual(values = c("#0984e3", "black")) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.y = element_text(color = "black"))

[Figure: simulation results for the indirect effect under confounding and under measurement error in the mediator]

For the scenarios with confounding we see that:

  • failing to account for baseline values of the outcome variable in the mediation analysis leads to an overestimation of the indirect effect of homework adherence. Participants with fewer problems at baseline are more likely to complete more homework, and they are also likely to have fewer problems at posttest,
  • adjusting for a confounder that’s perfectly measured yields unbiased estimates (assuming no other hidden confounding); adjusting for a confounder measured with error is an improvement, but there’s still residual confounding leading to bias.

When there’s measurement error in the mediator we see that:

  • the indirect effect is attenuated.
  • In this case, adjusting for pretest values does not reduce bias, but it does reduce the standard errors and leads to increased power.

Here are tables with the results for the direct and total effects as well.


summary_table(sum_res) %>% kable(digits = 2)
|label    |parameter     | M_est| theta|   %_RB| Power| CI_Cover|
|:--------|:-------------|-----:|-----:|------:|-----:|--------:|
|M        |indirect      | -5.09|  -3.0|  69.64|  0.94|     0.72|
|M        |direct        | -0.90|  -3.0| -69.98|  0.08|     0.71|
|M        |total         | -5.99|  -6.0|  -0.17|  1.00|     0.96|
|M        |prop_mediated |  0.86|   0.5|  71.83|  0.94|     0.72|
|M_pre    |indirect      | -3.07|  -3.0|   2.49|  0.60|     0.96|
|M_pre    |direct        | -2.92|  -3.0|  -2.52|  0.51|     0.96|
|M_pre    |total         | -6.00|  -6.0|  -0.01|  1.00|     0.95|
|M_pre    |prop_mediated |  0.52|   0.5|   3.34|  0.59|     0.96|
|M_pre_me |indirect      | -3.84|  -3.0|  27.84|  0.77|     0.92|
|M_pre_me |direct        | -2.17|  -3.0| -27.76|  0.31|     0.93|
|M_pre_me |total         | -6.00|  -6.0|   0.04|  1.00|     0.96|
|M_pre_me |prop_mediated |  0.64|   0.5|  28.96|  0.76|     0.92|


summary_table(sum_res_me) %>% kable(digits = 2)
|label    |parameter     | M_est| theta|   %_RB| Power| CI_Cover|
|:--------|:-------------|-----:|-----:|------:|-----:|--------:|
|M        |indirect      | -2.94|  -3.0|  -1.94|  0.45|     0.94|
|M        |direct        | -3.09|  -3.0|   2.84|  0.44|     0.94|
|M        |total         | -6.03|  -6.0|   0.45|  1.00|     0.95|
|M        |prop_mediated |  0.49|   0.5|  -1.41|  0.44|     0.94|
|M_me     |indirect      | -1.47|  -3.0| -51.03|  0.26|     0.73|
|M_me     |direct        | -4.56|  -3.0|  51.91|  0.94|     0.75|
|M_me     |total         | -6.03|  -6.0|   0.44|  1.00|     0.95|
|M_me     |prop_mediated |  0.25|   0.5| -50.61|  0.25|     0.74|
|M_me_pre |indirect      | -1.47|  -3.0| -51.13|  0.30|     0.68|
|M_me_pre |direct        | -4.56|  -3.0|  52.06|  0.96|     0.70|
|M_me_pre |total         | -6.03|  -6.0|   0.47|  1.00|     0.95|
|M_me_pre |prop_mediated |  0.25|   0.5| -50.80|  0.30|     0.68|

Summary

Measurement error and confounding are huge problems for mediation analyses, and there’s no easy solution. In real life, we can expect both confounding and measurement error in the mediator and confounders. There are likely to be multiple sources of confounding, related both to baseline variables and to post-randomization variables (i.e., things happening after treatment allocation). Assumptions regarding the lack of hidden confounding and measurement error are very hard to defend.

References

  • Dunn, G., & Bentall, R. (2007). Modelling treatment-effect heterogeneity in randomized controlled trials of complex interventions (psychological treatments). Statistics in Medicine, 26(26), 4719–4745. https://doi.org/10.1002/sim.2891
  • Imai, K., Keele, L., & Tingley, D. (2010). A general approach to causal mediation analysis. Psychological Methods, 15(4), 309–334. https://doi.org/10.1037/a0020761



To leave a comment for the author, please follow the link and comment on their blog: R Psychologist - R.


Gosset part 2: small sample statistics


[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Simulation was the key to achieving world beer dominance.

‘Scientific’ Brewing at scale in the early 1900s

Beer bottles cheers


This post is an explainer about the small sample experiments performed by William S. Gosset. This post contains some R code that simulates his simulations1 and the resulting determination of the ideal sample size for inference.

If you brew your own beer, or if you want to know how many samples you need to say something useful about your data, this post is for you.

I am a big fan of Gosset; just look at my work room:

Yes, that is the Wikipedia portrait of William Gosset! I am absolutely fascinated with the work that Gosset did. In fact, I think he was the first data scientist.

If you are not interested in how beer is made and just want the simulation, go to the heading Simulation.

Brewing beer (at scale)

One of the problems William S. Gosset worked on was determining the quality of malt. To brew beer you need at least three ingredients: yeast, hops and a cereal grain2. On a high level it is very simple: you start by extracting the starch from the grain into water. You then flavor the resulting sweet liquid with hops and ferment that mixture with yeast. Put it in a barrel or bottles and you are done! This is quite doable for small batches, but boy does it get difficult at scale!

Gosset’s work touched on all three ingredients of beer, but in this post I will look into the cereal grain.

Beer, malts and beer technique

Now, beer brewing has been done since at least the 6th century BCE3, and all the steps in beer brewing have their own specialized names4 , which are different in all languages. So I will be talking about malt, but remember that it is just a source of starch, a source of sugars (I’m sorry chemists/ biologist, I have to simplify this a bit).

The source of starch in beer is barley. This grain is first dried, cleaned and then wetted. This starts the biological process of germination in the grain (it gets ready to start a new barley plant). In that process some enzymes are made that will break down starch into smaller sugars. The grains are now ready to start a new plant, and then we sort of kill that process by forced drying. So we end up with sad, half-germinated barley. This half-germinated barley is called malt or malted barley.

Grain (barley) transforming into malt, as seen here. Malt and grain: detail of individual sprouted malt.

Remember that we use the barley for its starch, which is converted into sugar. Sugar is converted by yeast into alcohol. Therefore the amount of sugar in the mixture determines the maximum amount of alcohol the yeast can create. Higher alcohol content keeps the beer for longer periods, but it also changes the taste of the beer. The Guinness brewers wanted consistent taste and consistent alcohol levels.

Too high alcohol levels would increase the tax Guinness was paying the British government. A lower level of alcohol would make the beer go bad earlier and, crucially, customers would go to a competitor, and they might hate you.

Customers would maybe go to a competitor when I lower the alcohol content?


So consistency is key if you want people to keep drinking your beer5.

Malt and sugars

Guinness malt in Gosset’s time came from Irish and English barley stock. Since the sugar content in malt can vary from batch to batch (more or less sun, different amounts of rain, etc.), and since we determined that you want the exact same taste in all your beer, brewers need to check the sugar content of the barley. At the time, brewers checked the sugar content of each batch manually by sniffing the barley, crumbling the material and visually inspecting it.

However there are only so many brewers and checking every batch is not scalable. The Guinness brewery was one of the largest in the world and there were simply not enough master brewers.

The sugar content of malt extract was measured in degrees saccharine per barrel of 168 pounds malt weight. In earlier experiments the brewers had determined that an extract around 133 degrees saccharine gave the desired alcohol level.

So you better make sure the malt extract sugar content is close to 133 degrees, if you want consistent beer. In Gosset’s view, 0.5 degrees was a difference or error in malt extract level which Guinness and its customers could accept.

“It might be maintained,” he said, that malt extract “should be [estimated] within .5 of the true result with a probability of 10 to 1” – Gosset, according to Guinessometrics – Stephen T. Ziliak (see full reference below)

However, how can we be certain that a barrel has a sugar content of 133 degrees?

They could take samples and average those, but how many samples should you average to give enough certainty (that is, with odds of 10 to 1 that the sample average is within 0.5 degree of the real value)? And every sample takes time and destroys that sample from the barrel.

Simulation

Malt: a handful of green malt. So how do we find out how many samples you need to have an accurate estimation of the real value of the barrel? Gosset and his co-workers actually used simulation: from one representative barrel of malt extract they had taken a lot of samples. Gosset6 simulated what would happen if they drew and averaged multiple samples.

  • What if we take 2 samples and average that out, is that close enough to the real value?
  • What if we take 3?
  • etc?

By simulating these draws from one barrel7 they generalized this pattern to what would happen if they sampled from all of the barrels.

And that is what I will show in R code.

So we are talking about sampling, but maybe not this kind of sampling… Randy Marshal sampling beer and wine (South Park)

Simulating drawing from a barrel

First, the goal in clearer language: we want to know how many samples you have to draw to know what the real degree saccharine level of the barrel is, with 10 to 1 odds of being within 0.5 degree.

In economic terms: You are running a large brewery and you want to make sure that the beer you produce has the same alcohol content in every batch. Therefore you have to check the barrels of malt extract. How many samples do you have to get from every barrel to be certain enough of its actual sugar content?

Let’s first transform those odds into a probability, because I don’t know what those odds mean8. A 10 to 1 odds means ten successes and one failure, so it is really 10 successes out of 11; 10/11 gives us a probability of approximately 0.909.

A coding example

Let’s create a vector of samples from the same barrel of malt extract, sample from those samples, take the average, see if we are within the range that Gosset defined, calculate how many times the sample was a correct representation of the barrel and finally determine how many samples are enough.

library(tidyverse) # I am just lazy, purrr, and tibble and dplyr are enough

First we create a dataset. Say we have a large amount of malt extract. Like the brewers at Guinness, we have taken many many many many samples from one barrel, and so we know what the actual saccharine level of the barrel is. This is relevant for simulation, but don’t forget that we don’t know the truth when we are working with actual data.

So this degrees_sacharine vector represents 3000 samples from one barrel of malt extract.

set.seed(7334)
degrees_sacharine = rnorm(3000, mean = 133, sd = 0.6) # this is really just a guess

Then I create some functions to take a sample, a function to determine if that value is within the range, and finally I combine those functions.

take_sample_average <- function(bag = degrees_sacharine, sample_size){
  mean(sample(bag, sample_size, replace = TRUE))
}
within_range <- function(value){
  ifelse(value > 132.5 & value < 133.5, TRUE, FALSE)
}
is_sample_within_range <- function(sample_size){
  take_sample_average(bag = degrees_sacharine, sample_size = sample_size) %>%
    within_range()
}

For example, we take 2 samples out of the bag and average them: (142.2745 + 119.4484)/2 = 130.86. Is this within the range of 132.5 to 133.5? No.

# and what if we take 3 samples?
take_sample_average(bag = degrees_sacharine, 3)
## [1] 132.8

But how often, on average, am I within the real value of the bag? We simulate taking 2 to 20 samples from the barrel and averaging each set of samples.

sampling_experiments <-
  tibble(N = seq_len(2500)) %>%
  mutate( # there is probably a more concise way to do this
    sample_02 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 2)),
    sample_03 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 3)),
    sample_04 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 4)),
    sample_05 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 5)),
    sample_06 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 6)),
    sample_07 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 7)),
    sample_08 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 8)),
    sample_09 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 9)),
    sample_10 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 10)),
    sample_15 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 15)),
    sample_20 = purrr::map_lgl(N, ~is_sample_within_range(sample_size = 20)),
    )

So how many times are the samples within the range?

sampling_experiments %>%
  summarize_all(.funs = ~sum(.)/n()) %>% # this doesn't read that well
  # So I reshape the data, remove the N column and remove the sample_ part
  gather(key = "sample_size", value = "prob_in_range", sample_02:sample_20) %>%
  select(-N) %>%
  mutate(sample_size = str_remove(sample_size, pattern = "sample_"))
## # A tibble: 11 x 2
##    sample_size prob_in_range
##    <chr>               <dbl>
##  1 02                  0.755
##  2 03                  0.854
##  3 04                  0.901
##  4 05                  0.932
##  5 06                  0.954
##  6 07                  0.966
##  7 08                  0.982
##  8 09                  0.986
##  9 10                  0.991
## 10 15                  0.999
## 11 20                  0.999

Gosset found in his experiments that he needed at least 4 samples for an estimation with odds of at least 10 to 1, which is a probability of approximately 0.909.

In our case, for our bag of estimations, we would say we need at least 5 samples to get these odds or better.

Practical results

Armed with this knowledge, the Guinness brewery knew it could test the malt extract barrels by taking 4 samples out of every batch to get an approximation of the true sugar content of a batch that would be correct 10 out of 11 times.

That meant that the brewery could use this technique to check the barrels of malt extract in a chemical way, instead of a master brewer sampling and manually investigating the malt or barley. You can scale the number of tests, but not the number of brewers / checkers.

From beer to general statistics

The Guinness owners were happy, and Gosset probably was too. But he realized there must be a systematic way to determine how sure we can be about the values in the sample compared with the true value. He took a year-long sabbatical to work with the statistician Karl Pearson on this problem. He found a relation by which we can approximate the ‘true’ mean and standard deviation based on the sample mean and sample standard deviation, as a function of the sample size.

And that is what we today call the t-distribution.
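As a reminder (standard notation, not Gosset's original), the one-sample t statistic that grew out of this work is

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}}, $$

which for normally distributed data follows a t-distribution with \(n - 1\) degrees of freedom. The heavier tails of that distribution are exactly the small-sample correction you need when the standard deviation \(s\) is itself estimated from only a handful of samples.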

Image credits

State of the machine

At the moment of creation (when I knitted this document) this was the state of my machine:

sessioninfo::session_info()
## ─ Session info ──────────────────────────────────────────────────────────##  setting  value                       ##  version  R version 3.6.1 (2019-07-05)##  os       macOS Mojave 10.14.6        ##  system   x86_64, darwin15.6.0        ##  ui       X11                         ##  language (EN)                        ##  collate  en_US.UTF-8                 ##  ctype    en_US.UTF-8                 ##  tz       Europe/Amsterdam            ##  date     2019-10-11                  ## ## ─ Packages ──────────────────────────────────────────────────────────────##  package     * version date       lib source        ##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)##  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)##  blogdown      0.16    2019-10-01 [1] CRAN (R 3.6.0)##  bookdown      0.14    2019-10-01 [1] CRAN (R 3.6.0)##  broom         0.5.2   2019-04-07 [1] CRAN (R 3.6.0)##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.0)##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)##  digest        0.6.21  2019-09-20 [1] CRAN (R 3.6.0)##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.6.0)##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)##  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.6.0)##  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.6.0)##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.0)##  ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.0)##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)##  haven         2.1.1   2019-07-04 [1] CRAN (R 3.6.0)##  hms           0.5.1   2019-08-23 [1] CRAN (R 3.6.0)##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)##  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)##  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.0)##  knitr         1.25    2019-09-18 [1] CRAN (R 3.6.0)##  lattice       0.20-38 2018-11-04 [1] CRAN (R 3.6.1)##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.0)##  lifecycle     0.1.0   2019-08-01 [1] CRAN (R 3.6.0)##  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.6.0)##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)##  modelr        0.1.5   2019-08-08 [1] CRAN (R 3.6.0)##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)##  nlme          3.1-141 2019-08-01 [1] CRAN (R 3.6.0)##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.6.0)##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)##  purrr       * 0.3.2   2019-03-15 [1] CRAN (R 3.6.0)##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.0)##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)##  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.0)##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.6.0)##  rmarkdown     1.16    2019-10-01 [1] CRAN (R 3.6.0)##  rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)##  rvest         0.3.4   2019-05-15 [1] CRAN (R 3.6.0)##  scales        1.0.0   2018-08-09 [1] CRAN (R 3.6.0)##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)##  tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.6.0)##  tidyr       * 1.0.0   2019-09-11 [1] CRAN (R 3.6.0)##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.6.0)##  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.6.0)##  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.0)##  vctrs         
0.2.0   2019-07-05 [1] CRAN (R 3.6.0)##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)##  xfun          0.10    2019-10-01 [1] CRAN (R 3.6.0)##  xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.0)##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)##  zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.6.0)## ## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

  1. see what I did there?↩

  2. If you are fancy you can add other cool ingredients; if you are a purist and keep to the 1516 ‘Reinheitsgebot’ you only use barley, water and hops. Yeast is necessary but not mentioned in the rules, because its existence wasn’t known; maybe they thought it was generatio spontanea or something?↩

  3. The secret to create alcohol out of literally everything, has been rediscovered again and again!↩

  4. from the Wikipedia article I found the words: wort=the sugary liquid, mashing = mixing malted barley with hot water, liquor = hot water with sugar in it, grist =crushed malt, sparging = washing of grains, lautering = separation of wort with grain itself↩

  5. Heineken beer may taste like piss to some people but at least it consistently tastes like piss everywhere you drink it, I think I’m allowed to say that about a Dutch product right?↩

  6. And coworkers, because boy this must have taken some time, and think about the calculations! all done by hand.↩

  7. simulated because the samples had already been taken, so it was more a matter of choosing one of the values at random from that piece of paper↩

  8. maybe this is a country specific thing, in the UK everyone seems to know about odds and betting, but I can’t get my head around it↩


To leave a comment for the author, please follow the link and comment on their blog: Category R on Roel's R-tefacts.



#FunDataFriday – gTrendsR


[This article was first published on #FunDataFriday - Little Miss Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

What is it?

The gtrendsR package is an R package that can be used to programmatically gather and display Google trend information. Lately, I seem to be finding a lot of fun use cases for it, so I figured I would share the joy in my #FunDataFriday series!

Why is it awesome?

It’s an awesome package because it’s so simple! In three lines of code, you can pull Google trend data and visualize the results. Because you get the raw trend data, you can very easily extend your analysis to do almost anything.

How to get started?

Getting started is easy. With just three lines, you can plot your own gTrendsR graph in R.

 

library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)

With three more lines, you can make the graph interactive

 

library(plotly)
p <- plot(trends)
ggplotly(p)

 

A gTrendsR graph inspired by Epi Ellie’s outrageous competition on the popularity of Nerds and Smarties!


 

With a little more effort, you can start diving into the data and merging it with other sources. Or, if you want to stay on the data visualization path, you can easily exploit the full benefits of ggplot2 to analyze the results! If you want to learn more, I have a few more examples in my recent blog post analyzing the relative popularity of The Bachelor franchise series over time.
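As a small sketch of that path (assuming the trends object created above; gtrends() returns a list whose interest_over_time element holds date, hits, and keyword columns, and hits can occasionally be a character value such as "<1"):

library(gtrendsR)
library(ggplot2)

trends <- gtrends(c("Nerds", "Smarties"), geo = "CA")

# Plot the raw weekly interest for each search term
ggplot(trends$interest_over_time, aes(date, hits, color = keyword)) +
    geom_line() +
    labs(x = NULL, y = "Relative search interest",
         title = "Google search interest in Canada")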


To leave a comment for the author, please follow the link and comment on their blog: #FunDataFriday - Little Miss Data.


Explaining Predictions: Boosted Trees Post-hoc Analysis (Xgboost)


[This article was first published on R on notast, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Recap

We’ve covered various approaches to explaining model predictions globally. Today we will learn about another model-specific post-hoc analysis: we will work through the workings of gradient boosting predictions.

Like past posts, the Cleveland heart dataset as well as tidymodels principles will be used. Refer to the first post of this series for more details.

Gradient Boosting

Besides random forest, introduced in a past post, another tree-based ensemble model is gradient boosting. In gradient boosting, a shallow and weak tree is first trained, and then the next tree is trained based on the errors of the first tree. The process continues with new trees being sequentially added to the ensemble, each successive tree improving on the errors of the ensemble of preceding trees. On the other hand, random forest is an ensemble of deep independent trees.

# library
library(tidyverse)
library(tidymodels)
theme_set(theme_minimal())

# import
heart <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F)

# Renaming var
colnames(heart) <- c("age", "sex", "rest_cp", "rest_bp", "chol",
                     "fast_bloodsugar", "rest_ecg", "ex_maxHR", "ex_cp",
                     "ex_STdepression_dur", "ex_STpeak", "coloured_vessels",
                     "thalassemia", "heart_disease")

# elaborating cat var
## simple ifelse conversion
heart <- heart %>% mutate(
  sex = ifelse(sex == "1", "male", "female"),
  fast_bloodsugar = ifelse(fast_bloodsugar == "1", ">120", "<120"),
  ex_cp = ifelse(ex_cp == "1", "yes", "no"),
  heart_disease = ifelse(heart_disease == "0", "no", "yes")) # remember to leave it as numeric for DALEX

## complex ifelse conversion using `case_when`
heart <- heart %>% mutate(
  rest_cp = case_when(rest_cp == "1" ~ "typical",
                      rest_cp == "2" ~ "atypical",
                      rest_cp == "3" ~ "non-CP pain",
                      rest_cp == "4" ~ "asymptomatic"),
  rest_ecg = case_when(rest_ecg == "0" ~ "normal",
                       rest_ecg == "1" ~ "ST-T abnorm",
                       rest_ecg == "2" ~ "LV hyperthrophy"),
  ex_STpeak = case_when(ex_STpeak == "1" ~ "up/norm",
                        ex_STpeak == "2" ~ "flat",
                        ex_STpeak == "3" ~ "down"),
  thalassemia = case_when(thalassemia == "3.0" ~ "norm",
                          thalassemia == "6.0" ~ "fixed",
                          thalassemia == "7.0" ~ "reversable"))

# convert missing value "?" into NA
heart <- heart %>% mutate_if(is.character, funs(replace(., . == "?", NA)))

# convert char into factors
heart <- heart %>% mutate_if(is.character, as.factor)

# train/test set
set.seed(4595)
data_split <- initial_split(heart, prop = 0.75, strata = "heart_disease")
heart_train <- training(data_split)
heart_test <- testing(data_split)

The gradient boosting package which we’ll use is xgboost. xgboost only accepts numeric values, thus one-hot encoding is required for categorical variables. However, I was still able to train an xgboost model without one-hot encoding when I used the parsnip interface.
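If you did want to handle the encoding yourself, a minimal sketch with recipes could look like the following (illustrative only; the recipe actually used in this post is defined next, and step_dummy() with one_hot = TRUE is just one way to do it):

library(recipes)

# One-hot encode every nominal predictor (not used further in this post)
onehot_recipe <- recipe(heart_disease ~ ., data = heart_train) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

onehot_prep <- prep(onehot_recipe, training = heart_train)
juice(onehot_prep) %>% head()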

# create recipe object
heart_recipe <- recipe(heart_disease ~ ., data = heart_train) %>%
  step_knnimpute(all_predictors())

# process the training set / prepare recipe (non-cv)
heart_prep <- heart_recipe %>% prep(training = heart_train, retain = TRUE)

No tuning was done; the hyperparameters are the default settings, which were made explicit.

# boosted tree model
bt_model <- boost_tree(learn_rate = 0.3, trees = 100, tree_depth = 6,
                       min_n = 1, sample_size = 1, mode = "classification") %>%
  set_engine("xgboost", verbose = 2) %>%
  fit(heart_disease ~ ., data = juice(heart_prep))

Feature Importance (global level)

The resulting gradient boosting model bt_model$fit, represented as a parsnip object, does not inherently contain feature importance, unlike a random forest model represented as a parsnip object.

summary(bt_model$fit)
##               Length Class              Mode
## handle            1  xgb.Booster.handle externalptr
## raw           66756  -none-             raw
## niter             1  -none-             numeric
## call              7  -none-             call
## params            9  -none-             list
## callbacks         1  -none-             list
## feature_names    20  -none-             character
## nfeatures         1  -none-             numeric

We can extract the important features from the boosted tree model with xgboost::xgb.importance. Although we did the pre-processing and modelling using tidymodels, we ended up using the original Xgboost package to explain the model. Perhaps tidymodels could consider integrating prediction explanation for more of the models that they support in the future.

library(xgboost)
xgb.importance(model = bt_model$fit) %>% head()
##                Feature       Gain      Cover  Frequency
## 1:     thalassemianorm 0.24124439 0.05772889 0.01966717
## 2: ex_STdepression_dur 0.17320374 0.15985018 0.15279879
## 3:            ex_maxHR 0.10147873 0.12927719 0.13615734
## 4:                 age 0.07165646 0.09136876 0.12859304
## 5:                chol 0.06522754 0.10151576 0.15431165
## 6:             rest_bp 0.06149660 0.09178222 0.11497731

Variable importance score

Feature importance are computed using three different importance scores.

  1. Gain: Gain is the relative contribution of the corresponding feature to the model calculated by taking each feature’s contribution for each tree in the model. A higher score suggests the feature is more important in the boosted tree’s prediction.
  2. Cover: Cover is the relative number of observations associated with a predictor. For example, feature X is used to determine the terminal node for 10 observations in tree A and 20 observations in tree B. The absolute number of observations associated with feature X is 30, and the relative value is 30 divided by the sum of the absolute observations for all features.
  3. Frequency: Frequency refers to the relative frequency with which a variable occurs in the ensemble of trees. For instance, feature X occurs in 1 split in tree A and 2 splits in tree B. The absolute occurrence of feature X is 3, and the (relative) frequency is 3 divided by the sum of the absolute occurrences for all features.

Categorical variables, especially those with minimal cardinality, will have low frequency scores, as these variables are seldom used in each tree compared to continuous variables or, to some extent, categorical variables with high cardinality, which are likely to have a larger range of values and thus better odds of appearing in the model. Thus, the developers of xgboost discourage using the frequency score unless you’re clear about your rationale for selecting frequency as the feature importance score. Rather, the gain score is the most valuable score to determine variable importance. xgb.importance selects the gain score as the default measurement and arranges features in descending order of gain score, so the most important feature is displayed at the top.

Plotting variable importance

xgboost provides two options to plot variable importance.

  1. Using basic R graphics via xgb.plot.importance
  2. Using ggplot interface via xgb.ggplot.importance. I’ll be using the latter.

The xgb.ggplot.importance function uses the gain measurement by default to calculate variable importance. The default argument measure=NULL can be changed to use other variable importance measurements. However, based on the previous section, it will be wiser to leave the argument at its default. The xgb.ggplot.importance graph also displays clusters of variables that have similar variable importance scores. By default, the graph displays each variable’s gain score as a relative contribution to the overall model importance; the sum of all the gain scores will equal 1.

xgb.importance(model=bt_model$fit) %>% xgb.ggplot.importance(top_n=6, measure=NULL, rel_to_first = F) 

Alternatively, the gain scores can be presented relative to the most important feature. In this case, the most important feature will have a score of 1 and the gain scores of the other variables will be scaled to the gain score of the most important feature. This alternate presentation of the gain score can be achieved by changing the default argument rel_to_first=F to rel_to_first=T.

xgb.importance(model=bt_model$fit) %>% xgb.ggplot.importance(top_n=6, measure=NULL, rel_to_first = T) 

Sum up

This is the last post of this series looking at explaining model predictions at a global level. We first started this series explaining predictions using white-box models such as logistic regression and decision trees. Next, we did model-specific post-hoc evaluation on black-box models, specifically random forest and Xgboost.


To leave a comment for the author, please follow the link and comment on their blog: R on notast.


Loading packages efficiently


[This article was first published on woodpeckR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Problem

Especially in a project with many different scripts, it can be challenging to keep track of all the packages you need to load. It’s also easy to lose track of whether or not you’ve incorporated package loading into the script itself until you switch to a new computer or restart R and all of a sudden, your packages need to be re-loaded.

Context

When I was first starting out in R, I learned quickly to load packages all together at the top of a script, not along the way as I needed them. But it took a while, until I started using R Projects, before I decided to centralize package loading above the script level. I was sick of having to deal with loading the right packages at the right times, so I decided to just streamline the whole thing.

Solution

Make a separate R script, called “libraries.R” or “packages.R” or something. Keep it consistent. Mine is always called “libraries,” and I keep it in my project folder.

[Screenshot: the libraries.R file sitting in the project folder]

It looks something like this (individual libraries may vary, of course):

[Screenshot: the contents of the libraries.R script]
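Since the screenshot doesn't reproduce here, a minimal sketch of what such a libraries.R file might contain (the particular packages are just examples):

# libraries.R -- every package this project needs, loaded in one place
library(tidyverse)
library(lubridate)
library(janitor)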

Then, at the top of each analysis script, I can simply source the libraries script, and all the libraries I need load automatically.

[Screenshot: sourcing libraries.R at the top of an analysis script]
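Again as a sketch, the corresponding line at the top of each analysis script (the relative path assumes the analysis script and libraries.R sit in the same R Project folder):

# Load all project packages in one go
source("libraries.R")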

Outcome

I can easily load libraries in the context of a single R Project, keep track of which ones are loaded, and not have to worry about making my scripts look messy with a whole chunk of library() commands at the top of each one. It’s also straightforward to pop open the “libraries” script whenever I want to add a new library or delete one.

 


To leave a comment for the author, please follow the link and comment on their blog: woodpeckR.


if ifelse() had more if’s


[This article was first published on woodpeckR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Problem

The ifelse() function only allows for one “if” statement, two cases. You could add nested “if” statements, but that’s just a pain, especially if the 3+ conditions you want to use are all on the same level, conceptually. Is there a way to specify multiple conditions at the same time?

Context

I was recently given some survey data to clean up. It looked something like this (but obviously much larger):

[Screenshot: example survey data with pass/fail results for two tests]

I needed to classify people in this data set based on whether they had passed or failed certain tests.

I wanted to separate the people into three groups:

  • People who passed both tests: Group A
  • People who passed one test: Group B
  • People who passed neither test: Group C

I thought about using a nested ifelse statement, and I certainly could have done that. But that approach didn’t make sense to me. The tests are equivalent and not given in any order; I simply want to sort the people into three equal groups. Any nesting of “if” statements would seem to imply a hierarchy that doesn’t really exist in the data. Not to mention that I hate nesting functions. It’s confusing and hard to read. 

Solution

Once again, dplyr to the rescue! I’m becoming more and more of a tidyverse fan with each passing day. 

Turns out, dplyr has a function for exactly this purpose: case_when(). It’s also known as “a general vectorised if,” but I like to think of it as “if ifelse() had more if’s.” 

Here’s the syntax:

library(dplyr)

df <- df %>%
  mutate(group = case_when(test1 & test2 ~ "A",     # both tests: group A
                           xor(test1, test2) ~ "B", # one test: group B
                           !test1 & !test2 ~ "C"    # neither test: group C
                           ))

Output:

[Screenshot: the data with the new group column added]

Let me translate the above into English. After loading the package, I reassign df, the name of my data frame, to a modified version of the old df. Then (%>%), I use the mutate function to add a new column called group. The contents of the column will be defined by the case_when() function.

case_when(), in this example, took three conditions, which I’ve lined up so you can read them more easily. The condition is on the left side of the ~, and the resulting category (A, B, or C) is on the right. I used logical operators for my conditions. The newest one to me was the xor() function, which is an exclusive or: only one of the conditions in the parentheses can be TRUE, not both. 

Outcome

Easily make conditional assignments within a data frame. This function is a little less succinct than ifelse(), so I’m probably not going to use it for applications with only two cases, where ifelse() would work fine. But for three or more cases, it can’t be beat. Notice that I could have added any number of conditions to my case_when() statement, with no other caveats.

I love this function, and I think we should all be using it.


To leave a comment for the author, please follow the link and comment on their blog: woodpeckR.


Opioid prescribing habits in Texas


[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A paper I worked on was just published in a medical journal. This is quite an odd thing for me to be able to say, given my academic background and the career path I have had, but there you go! The first author of this paper is a long-time friend of mine working in anesthesiology and pain management, and he obtained data from the Texas Prescription Drug Monitoring Program (PDMP) about controlled substance prescriptions from April 2015 to 2018. The DEA also provides data about controlled substances transactions between manufacturers and distributors (available in R) but PDMP data is somewhat different as it monitors prescriptions directly, down to the individual prescriber level. Each state maintains a separate PDMP, and access is often limited to licensed providers in that state. My coauthor/friend is, among other things, a licensed provider in Texas and was able to obtain this data for research purposes!

Clean and tidy the data

The first step in this analysis was to read in, clean, and tidy the PDMP data. This is a dataset of prescriptions for controlled substances, aggregated at the county and month level for us by the state agency; we requested data at two separate times and received data in two different formats. First, we have an Excel file.

library(tidyverse)
library(readxl)
library(lubridate)
library(googlesheets)

path <- "CountyDrugPillQty_2017_07.xlsx"

opioids_raw <- path %>%
    excel_sheets() %>%
    set_names() %>%
    map_df(~ read_excel(path = path, sheet = .x), .id = "sheet") %>%
    mutate(Date = dmy(str_c("01-", sheet))) %>%
    select(-sheet) %>%
    rename(Name = `Generic Name`)

Then we have a second batch of data in Google Sheets.

new_opioids_sheet <- gs_title("TX CS Qty By Drug Name-County")new_opioids_raw <- new_opioids_sheet %>%    gs_read("TX CS RX By Generic Name-County",            col_types = "cnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn",            skip = 4,            verbose = FALSE) %>%    rename(Name = `Date/Month Filter`)  %>%     mutate(Date = case_when(str_detect(Name,                                        "^[a-zA-Z]{3}-[0-9]{2}$") ~ Name,                            TRUE ~ NA_character_)) %>%    fill(Date, .direction = "down") %>%    select(-`Grand Total`) %>%    filter(Name != Date) %>%    mutate(Date = dmy(str_c("01-", Date))) %>%    select(Name, Date, everything())

We have overlapping measurements for the same drugs and counties from February to June of 2017. Most measurements were close, but the newer data reports modestly higher prescription quantities, which tells us something about data quality and how this data is collected. When both are available, we use the newer values. My coauthor/friend placed the individual drugs into larger categories so that we can look at groupings between the individual drug level and the schedule level. Using all that, finally, we have a tidy dataset of prescriptions per county per month.

categories_sheet <- gs_title("Drug categories")

drug_categories <- categories_sheet %>%
    gs_read("Sheet1", verbose = FALSE) %>%
    rename(Name = `Generic Name`) %>%
    bind_rows(categories_sheet %>%
                  gs_read("Sheet2", verbose = FALSE) %>%
                  rename(Name = `Generic Name`)) %>%
    filter(Schedule %in% c("II", "III", "IV", "V"))

opioids_tidy <- opioids_raw %>%
    gather(County, PillsOld, ANDERSON:ZAVALA) %>%
    full_join(new_opioids_raw %>%
                  gather(County, PillsNew, ANDERSON:ZAVALA),
              by = c("Name", "Date", "County")) %>%
    mutate(Pills = coalesce(PillsNew, PillsOld),
           Pills = ifelse(Pills > 1e10, NA, Pills)) %>%
    replace_na(replace = list(Pills = 0)) %>%
    mutate(County = str_to_title(County)) %>%
    select(-PillsNew, -PillsOld) %>%
    left_join(drug_categories, by = "Name") %>%
    select(County, Date, Name, Category, Schedule, Pills) %>%
    filter(Name != "Unspecified",
           !is.na(Schedule)) %>%
    filter(Date < "2018-05-01")

opioids_tidy
## # A tibble: 1,234,622 x 6
##    County  Date       Name                        Category   Schedule Pills
##    <chr>   <date>     <chr>                       <chr>      <chr>    <dbl>
##  1 Anders… 2015-04-01 ACETAMINOPHEN WITH CODEINE… Opioid     III      37950
##  2 Anders… 2015-04-01 ACETAMINOPHEN/CAFFEINE/DIH… Opioid     III        380
##  3 Anders… 2015-04-01 ALPRAZOLAM                  Benzodiaz… IV       52914
##  4 Anders… 2015-04-01 AMITRIPTYLINE HCL/CHLORDIA… Benzodiaz… IV         180
##  5 Anders… 2015-04-01 AMPHETAMINE SULFATE         Amphetami… IV          60
##  6 Anders… 2015-04-01 ARMODAFINIL                 Stimulant  IV         824
##  7 Anders… 2015-04-01 ASPIRIN/CAFFEINE/DIHYDROCO… Opioid     III          0
##  8 Anders… 2015-04-01 BENZPHETAMINE HCL           Amphetami… III         30
##  9 Anders… 2015-04-01 BROMPHENIRAMINE MALEATE/PH… Opioid     V            0
## 10 Anders… 2015-04-01 BROMPHENIRAMINE MALEATE/PS… Opioid     III          0
## # … with 1,234,612 more rows

In this step, we removed the very small number of prescriptions that were missing drug and schedule information (“unspecified”). Now it’s ready to go!

Changing prescribing habits

The number of pills prescribed per month is changing at about -0.00751% each month, or about -0.0901% each year. This is lower (negative, even) than the rate of Texas' population growth, estimated by the US Census Bureau at about 1.4% annually. Given what we find further below about the racial/ethnic patterns in population-level opioid use in Texas, and which groups are driving the state's population growth, this likely makes sense.

opioids_tidy %>%
    count(Schedule, Date, wt = Pills) %>%
    mutate(Schedule = factor(Schedule, levels = c("II", "III", "IV", "V",
                                                  "Unspecified"))) %>%
    ggplot(aes(Date, n, color = Schedule)) +
    geom_line(alpha = 0.8, size = 1.5) +
    expand_limits(y = 0) +
    labs(x = NULL, y = "Pills prescribed per month",
         title = "Controlled substance prescriptions by schedule",
         subtitle = "Schedule IV drugs account for the most doses, with Schedule II close behind")
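
The model behind the overall rate of change quoted above isn't shown in this post. As a rough sketch only (not the analysis used in the paper), one way to estimate a monthly percent change is a simple log-linear regression of total pills per month on time; the object names here are mine:

# A sketch, not the published model: regress log(total pills per month) on time
monthly_total <- opioids_tidy %>%
  count(Date, wt = Pills, name = "Pills")

growth_fit <- lm(log(Pills) ~ Date, data = monthly_total)

# The Date slope is per day; scale it to an approximate monthly percent change
coef(growth_fit)[["Date"]] * 30.44 * 100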

We can also fit models to find which individual drugs are increasing or decreasing. The most commonly prescribed drugs that exhibited significant change in prescribing volume are amphetamines (increasing) and barbiturates (decreasing).
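
Those per-drug models aren't shown here either; a minimal sketch of the kind of many-models workflow one could use, with nest() and broom (the 0.05 cutoff is my choice for illustration):

library(broom)

drug_trends <- opioids_tidy %>%
  count(Name, Date, wt = Pills, name = "Pills") %>%
  group_by(Name) %>%
  filter(sum(Pills) > 0) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(Pills ~ Date, data = .x)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  ungroup() %>%
  filter(term == "Date", p.value < 0.05) %>%
  arrange(desc(abs(estimate)))

# Positive estimates are drugs increasing in monthly volume, negative are decreasing
drug_trends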

Connecting to Census data

When I started to explore how this prescription data varied spatially, I knew I wanted to connect this PDMP dataset to Census data. My favorite way to use Census data from R is the tidycensus package. Texas is an interesting place. It's not only where I grew up (and where my coauthor and friend lives), but also the second largest state in the United States by both land area and population. It contains 3 of the 10 largest cities in the United States, yet also 3 of the 4 least densely populated counties. It is the seventh most ethnically diverse state, with a substantially higher Hispanic population than the United States as a whole but a similar proportion of white and black residents. We can download Census data to explore these issues.

library(tidycensus)

population <- get_acs(geography = "county",
                      variables = "B01003_001",
                      state = "TX",
                      geometry = TRUE)

household_income <- get_acs(geography = "county",
                            variables = "B19013_001",
                            state = "TX",
                            geometry = TRUE)

To look at geographical patterns, we will take the median number of pills prescribed per month for each county over the period for which we have data.

opioids_joined <- opioids_tidy %>%
    group_by(County, Date) %>%
    summarise(Pills = sum(Pills)) %>%
    ungroup %>%
    mutate(Date = case_when(Date > "2017-01-01" ~ "2017 and later",
                            TRUE ~ "Before 2017")) %>%
    group_by(County, Date) %>%
    summarise(Pills = median(Pills)) %>%
    ungroup %>%
    mutate(County = str_to_lower(str_c(County, " County, Texas")),
           County = ifelse(County == "de witt county, texas",
                           "dewitt county, texas", County)) %>%
    inner_join(population %>% mutate(County = str_to_lower(NAME)), by = "County") %>%
    mutate(OpioidRate = Pills / estimate)

What are the controlled substance prescription rates in the top 10 most populous Texas counties?

library(knitr)

opioids_joined %>%
  filter(Date == "2017 and later") %>%
  top_n(10, estimate) %>%
  arrange(desc(estimate)) %>%
  select(NAME, OpioidRate) %>%
  kable(col.names = c("County", "Median monthly pills per capita"), digits = 2)
County                      Median monthly pills per capita
Harris County, Texas        5.68
Dallas County, Texas        6.20
Tarrant County, Texas       7.74
Bexar County, Texas         7.41
Travis County, Texas        6.40
Collin County, Texas        7.02
Hidalgo County, Texas       3.31
El Paso County, Texas       4.43
Denton County, Texas        7.58
Fort Bend County, Texas     5.17

These rates vary a lot; the controlled substance prescription rate in Tarrant County is almost 40% higher than the rate in Harris County.
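
As a quick arithmetic check against the table above (not code from the original post), the relative difference works out to roughly 36%, consistent with the "almost 40%" figure:

# Relative difference between the Tarrant and Harris County rates in the table
(7.74 - 5.68) / 5.68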

library(sf)
library(viridis)

opioids_map <- opioids_joined %>%
  mutate(OpioidRate = ifelse(OpioidRate > 16, 16, OpioidRate))

opioids_map %>%
  mutate(Date = factor(Date, levels = c("Before 2017", "2017 and later"))) %>%
  st_as_sf() %>%
  ggplot(aes(fill = OpioidRate, color = OpioidRate)) +
  geom_sf() +
  coord_sf() +
  facet_wrap(~Date) +
  scale_fill_viridis(labels = scales::comma_format()) +
  scale_color_viridis(guide = FALSE) +
  labs(fill = "Monthly pills\nper capita",
       title = "Controlled substance prescriptions across Texas",
       subtitle = "The prescription rate was higher overall before 2017")

This strong geographic trend is one of the most interesting results from this analysis. There are low rates in the Rio Grande Valley and high rates in north and east Texas. When I saw that pattern, I knew we needed to look into how race/ethnicity was related to these controlled substance prescription rates. Also, notice the change over time as these rates have decreased.

We don’t see a direct or obvious relationship with household income, but, as the maps hint at, race is another matter.
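
The income comparison isn't plotted in this post. Before turning to race, here is a rough sketch of how one could check it, joining the prescription rates to the household income estimates downloaded earlier (the plot itself is my addition, not a figure from the original analysis):

opioids_joined %>%
  left_join(household_income %>%
              as.data.frame() %>%
              select(GEOID, Income = estimate),
            by = "GEOID") %>%
  filter(Date == "2017 and later") %>%
  ggplot(aes(Income, OpioidRate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(x = "Median household income",
       y = "Median monthly pills per capita",
       title = "Household income and controlled substance prescriptions")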

race_vars <- c("P005003", "P005004", "P005006", "P004003")

texas_race <- get_decennial(geography = "county",
                            variables = race_vars,
                            year = 2010,
                            summary_var = "P001001",
                            state = "TX")

race_joined <- texas_race %>%
  mutate(PercentPopulation = value / summary_value,
         variable = fct_recode(variable,
                               White = "P005003",
                               Black = "P005004",
                               Asian = "P005006",
                               Hispanic = "P004003")) %>%
  inner_join(opioids_joined %>%
               filter(OpioidRate < 20) %>%
               group_by(GEOID, Date) %>%
               summarise(OpioidRate = median(OpioidRate)))

race_joined %>%
  group_by(NAME, variable, GEOID) %>%
  summarise(Population = median(summary_value),
            OpioidRate = median(OpioidRate),
            PercentPopulation = median(PercentPopulation)) %>%
  ggplot(aes(PercentPopulation, OpioidRate,
             size = Population, color = variable)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~variable) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_discrete(guide = FALSE) +
  labs(x = "% of county population in that racial/ethnic group",
       y = "Median monthly pills prescribed per capita",
       title = "Race and controlled substance prescriptions",
       subtitle = "The more white a county is, the higher the median monthly pills prescribed there",
       size = "County\npopulation")

The more white a county is, the higher its rate of controlled substance prescriptions. The more Hispanic a county is, the lower its rate of controlled substance prescriptions. The relationships for Black and Asian populations are not clear in Texas.

Building a model

We used straightforward multiple linear regression to understand how prescription rates are associated with various factors. We fit a single model to all the counties to understand how their characteristics affect the opioid prescription rate. We explored including and excluding the various relevant predictors to build the best explanatory model that can account for the relationships that exist in this integrated PDMP and US Census Bureau dataset.

This was the first time I had used the huxtable package for a publication, and it was so convenient!

library(huxtable)

opioids <- race_joined %>%
  select(GEOID, OpioidRate, TotalPop = summary_value,
         variable, PercentPopulation, Date) %>%
  spread(variable, PercentPopulation) %>%
  left_join(household_income %>%
              select(GEOID, Income = estimate)) %>%
  select(-geometry, -GEOID) %>%
  mutate(Income = Income / 1e5,
         OpioidRate = OpioidRate,
         Date = factor(Date, levels = c("Before 2017", "2017 and later")),
         Date = fct_recode(Date, ` 2017 and later` = "2017 and later"))

lm1 <- lm(OpioidRate ~ Income + White, data = opioids)
lm2 <- lm(OpioidRate ~ Income + White + Date, data = opioids)
lm3 <- lm(OpioidRate ~ Income + Date + log(TotalPop), data = opioids)
lm4 <- lm(OpioidRate ~ Income + White + Date + log(TotalPop), data = opioids)

huxreg(lm1, lm2, lm3, lm4)
                          (1)           (2)           (3)           (4)
(Intercept)           5.640 ***     6.468 ***     8.394 ***     3.668 ***
                     (0.524)       (0.508)       (0.847)       (0.785)
Income               -3.209 **     -3.229 ***    -0.239        -4.432 ***
                     (0.973)       (0.922)       (1.063)       (0.941)
White                 7.120 ***     7.134 ***                   7.718 ***
                     (0.560)       (0.531)                     (0.536)
Date 2017 and later                -1.650 ***    -1.640 ***    -1.649 ***
                                   (0.216)       (0.251)       (0.211)
log(TotalPop)                                     0.081         0.309 ***
                                                 (0.077)       (0.067)
N                       507           507           507           507
R2                    0.243         0.322         0.080         0.349
logLik            -1194.782     -1166.829     -1244.045     -1156.282
AIC                2397.563      2343.658      2498.091      2324.564
*** p < 0.001; ** p < 0.01; * p < 0.05.

Model metrics such as AIC and log likelihood indicate that the model including income, percent white population, date, and total population on a log scale provides the most explanatory power for the opioid rate. Using the proportion of the population that is Hispanic gives a model that is about as good; these predictors are basically interchangeable but opposite in effect. Overall, the \(R^2\) of these models is not extremely high (the best model has an adjusted \(R^2\) of 0.359) because they are estimating population-level characteristics, and there is substantial county-to-county variation that these four predictors alone do not explain. The population-level trends themselves are statistically significant, with effect sizes at the levels shown here.
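
For the metric comparison itself, a minimal sketch using broom::glance() to line up the fit statistics across the four candidate models (this is my shorthand, not how the huxreg() table above was produced):

library(broom)

list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4) %>%
  map_df(glance, .id = "model") %>%
  select(model, adj.r.squared, logLik, AIC) %>%
  arrange(AIC)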

We can more directly explore the factors involved in this explanatory model (income, ethnicity, time) visually.

race_joined %>%
  filter(variable == "White") %>%
  left_join(household_income %>%
              as.data.frame() %>%
              select(GEOID, Income = estimate)) %>%
  filter(!is.na(Income)) %>%
  mutate(Income = ifelse(Income <= median(Income, na.rm = TRUE),
                         "Low income", "High income"),
         PercentPopulation = cut_width(PercentPopulation, 0.1)) %>%
  group_by(PercentPopulation, Income, Date) %>%
  summarise(OpioidRate = median(OpioidRate)) %>%
  mutate(Date = factor(Date, levels = c("Before 2017", "2017 and later"))) %>%
  ggplot(aes(PercentPopulation, OpioidRate, color = Income, group = Income)) +
  geom_line(size = 1.5, alpha = 0.8) +
  geom_smooth(method = "lm", lty = 2, se = FALSE) +
  scale_y_continuous(labels = scales::comma_format(),
                     limits = c(0, NA)) +
  scale_x_discrete(labels = paste0(seq(0, 0.9, by = 0.1) * 100, "%")) +
  theme(legend.position = "top") +
  facet_wrap(~Date) +
  labs(x = "% of county population that is white",
       y = "Median monthly pills prescribed per 1k population",
       color = NULL,
       title = "White population, income, and controlled substance usage",
       subtitle = "Before 2017, the more white a county was, the more low income was associated with more controlled substance usage")

This plot illustrates the relationship between white population percentage and income, and how that has changed with time. The difference in controlled substance usage between lower and higher income counties (above and below the median in Texas) changes along the spectrum of counties’ population that is white.

The first effect to notice here is that the more white a county is, the higher the rate of controlled substance prescriptions. This was true both before 2017 and for 2017 and later, and for both low-income and high-income groups of counties. The second effect, though, is to compare the slopes of the two lines. Before 2017, the slope was shallower for higher income counties (above the median in Texas), but in lower income counties (below the median in Texas), the slope was steeper, i.e., the increase in prescription rate with white percentage was more dramatic. For 2017 and later, there is no longer a difference between low-income and high-income counties, although the trend with white population percentage remains.

What have we learned here? In the discussion of our paper, we focus on the difference, or disparity, in opioid prescription rates with race/ethnicity, and how that may be related to the subjective nature of how medical practitioners evaluate pain. A racial/ethnic difference in opioid prescribing rates has been found in other studies using alternative data sources. It also helps us understand the differences in how the media, the healthcare system, and the culture at large have portrayed the opioid epidemic compared to previous drug epidemics (such as those of the 1980s), given which populations are affected.

Learn more

If you want to read more about this new analysis and related work, check out the paper. You can also look at the GitHub repo where I have various bits of code for this analysis, which is now public.


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

