
Conference abstract bi-grams – FOSS4GUK

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers.]

I helped run a conference last week. As part of this I produced a wordcloud from the conference abstracts; although pretty, it could have been more informative about the conference content. This blog post shows you how to make a network of conference bi-grams.

FOSS4GUK 2019 abstract bigrams

A bi-gram is a pair of words. In the last sentence “is a”, “a pair” and “pair of” are all bi-grams: pairs of words which are adjacent. I based this blog post on Julia and David’s excellent tidytext book. As before, each abstract is stored in a separate file, so I’ve read each of those in and then turned them into a tidy bi-gram table:

library(tidyverse)
library(tidytext)
library(tidygraph)
library(ggraph)
library(extrafont)

# ----------------------------

data("stop_words")

f = list.files("~/Cloud/Michael/FOSS4G/talks/abstracts_clean/")

abstracts = lapply(f, function(i){
   read_table(paste0("~/Cloud/Michael/FOSS4G/talks/abstracts_clean/", i),
              col_names = F) %>%
      gather(key, word) %>%
      select(-key) %>%
      add_column(author = str_remove(i, ".txt")) %>%
      unnest_tokens(bigram, word, token = "ngrams", n = 2)
})
abstracts = do.call("rbind.data.frame", abstracts)

bigrams = abstracts %>%
   separate(bigram, c("word1", "word2"), sep = " ") %>%
   filter(!word1 %in% stop_words$word[stop_words$word != "open"]) %>%
   filter(!word2 %in% stop_words$word[stop_words$word != "open"]) %>%
   filter(!str_detect(word1, "[0-9]")) %>%
   filter(!str_detect(word2, "[0-9]")) %>%
   filter(!str_detect(word1, "NA")) %>%
   filter(!str_detect(word2, "NA"))

bigram_counts = bigrams %>%
   count(word1, word2, sort = TRUE)
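If you want to see exactly what the bi-gram tokenisation does, here is a minimal sketch (not from the original post) on the example sentence above, using the packages already loaded:

# Quick illustration: bi-grams from the example sentence
tibble(text = "A bi-gram is a pair of words") %>%
   unnest_tokens(bigram, text, token = "ngrams", n = 2)
# unnest_tokens() lowercases and drops punctuation, so the result includes
# "is a", "a pair" and "pair of" (and "bi-gram" itself is split into "bi" and "gram")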

Then I write out a graph to a png. There’s some nifty stuff on the repel line which keeps labels on the plot, and I’ve even put the text into the conference font:

png("~/Cloud/Michael/FOSS4G/talks/abstract_bigram.png",    width=1200, height=850, res=110)bigram_counts %>%   filter(n > 1) %>%   as_tbl_graph() %>%   ggraph(layout = "fr") +   geom_edge_link(width = 1.1, colour = "#f49835") +   geom_node_point(colour = "#497fbf") +   geom_node_text(aes(label = name),                  colour = "grey10",                  vjust = 1, hjust = 1,                  repel = T, force = 0.1, box.padding = 0)+   labs(title = "FOSS4GUK 2019 - Edinburgh",        subtitle = "Abstract bigrams") +   theme(text = element_text(family = "Aileron SemiBold",                             colour = "grey10"))dev.off()

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.



101 Data Science Interview Questions, Answers, and Key Concepts

[This article was first published on R Programming - Data Science Blog | AI, ML, big data analytics, and kindly contributed to R-bloggers.]


In October 2012, the Harvard Business Review described “Data Scientist” as the “sexiest” job of the 21st century. Well, as we approach 2020, the description still holds true! The world needs more data scientists than there are available for hire. All companies – from the smallest to the biggest – want to hire for a job role that has “Data” somewhere in its name: “Data Scientist”, “Data Analyst”, “Data Engineer”, etc.

On the other hand, there’s a large number of people trying to get a break in the Data Science industry, including people with considerable experience in other functional domains such as marketing, finance, insurance, and software engineering. You might have already invested in learning data science (maybe even at a data science bootcamp), but how confident are you about your next Data Science interview?

This blog is intended to give you a tour of the questions asked in a Data Science interview. After thorough research, we have compiled a list of 101 actual data science interview questions that were asked between 2016 and 2019 at some of the largest recruiters in the data science industry – Amazon, Microsoft, Facebook, Google, Netflix, Expedia, etc.

If you want to know more regarding the tips and tricks for acing the interviews, watch the data science interview AMA with some of our own Data Scientists.

Data Science is an interdisciplinary field that sits at the intersection of computer science, statistics/mathematics, and domain knowledge. To perform well, one needs a good foundation in not one but multiple fields, and this is reflected in the interview. We’ve divided the questions into 6 categories:

  • Machine Learning
  • Data Analysis
  • Statistics, Probability, and Mathematics
  • Programming
  • SQL
  • Situational/Behavioral Questions

We’ve also provided brief answers and key concepts for each question. Once you’ve gone through all the questions, you’ll have a good understanding of how well you’re prepared for your next data science interview!    

Machine Learning

As one would expect, data science interviews focus heavily on questions that help the company test your concepts, applications, and experience in machine learning. Each question included in this category has been recently asked in one or more actual data science interviews at companies such as Amazon, Google, Microsoft, etc. These questions will give you a good sense of which sub-topics appear more often than others. You should also pay close attention to the way these questions are phrased in an interview.

Data Analysis

Machine learning concepts are not the only area in which you’ll be tested in the interview. Data pre-processing and data exploration are other areas where you can always expect a few questions. We’re grouping all such questions under this category. Data analysis is the process of evaluating data using analytical and statistical tools to discover useful insights. Once again, all these questions have been recently asked in one or more actual data science interviews at the companies listed above.  

Statistics, Probability and Mathematics

As we’ve already mentioned, data science builds its foundation on statistics and probability concepts. Having a strong foundation in statistics and probability concepts is a requirement for data science, and these topics are always brought up in data science interviews. Here is a list of statistics and probability questions that have been asked in actual data science interviews.

Programming

When you appear for a data science interview, your interviewers are not expecting you to come up with highly efficient code that uses minimal hardware resources and executes quickly. However, they do expect you to be able to use R, Python, or SQL so that you can access the data sources and at least build prototypes for solutions.

You should expect a few programming/coding questions in your data science interviews. Your interviewer might want you to write a short piece of code on a whiteboard to assess how comfortable you are with coding, as well as to get a feel for how many lines of code you typically write in a given week.

Here are some programming and coding questions that companies like Amazon, Google, and Microsoft have asked in their data science interviews.  

Structured Query Language (SQL)

Real-world data is stored in databases and it ‘travels’ via queries. If there’s one language a data science professional must know, it’s SQL – or “Structured Query Language”. SQL is widely used across all job roles in data science and is often a ‘deal-breaker’. SQL questions are placed early on in the hiring process and used for screening. Here are some SQL questions that top companies have asked in their data science interviews.    

Situational/Behavioral Questions

Capabilities don’t necessarily guarantee performance. It’s for this reason employers ask you situational or behavioral questions in order to assess how you would perform in a given situation. In some cases, a situational or behavioral question would force you to reflect on how you behaved and performed in a past situation. A situational question can help interviewers in assessing your role in a project you might have included in your resume, can reveal whether or not you’re a team player, or how you deal with pressure and failure. Situational questions are no less important than any of the technical questions, and it will always help to do some homework beforehand. Recall your experience and be prepared!

Here are some situational/behavioral questions that large tech companies typically ask:  

Thanks for reading! We hope this list is able to help you prepare and eventually ace the interview!

Like the 101 machine learning algorithms blog post, the accordion drop down lists are available for you to embed on your own site/blog post. Simply click the ’embed’ button in the lower left-hand corner, copy the iframe, and paste it within the page.


To leave a comment for the author, please follow the link and comment on their blog: R Programming - Data Science Blog | AI, ML, big data analytics .


100% Stacked Chicklets

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

I posted a visualization of email safety status (a.k.a. DMARC) of the Fortune 500 (2017 list) the other day on Twitter and received this spiffy request from @MarkAltosaar:

Would you be willing to add the R code used to produce this to your vignette for ggchicklet? I would love to see how you arranged the factors since it is a proportion. Every time I try something like this I feel like my code becomes very complex.

— Mark Altosaar (@MarkAltosaar) September 26, 2019

There are many ways to achieve this result. I’ll show one here and walk through the process starting with the data (this is the 2018 DMARC evaluation run):

library(hrbrthemes) # CRAN or fav social coding site using hrbrmstr/pkgname
library(ggchicklet) # fav social coding site using hrbrmstr/pkgname
library(tidyverse)

f500_dmarc <- read_csv("https://rud.is/dl/f500-industry-dmarc.csv.gz", col_types = "cc")

f500_dmarc
## # A tibble: 500 x 2
##    industry               p
##  1 Retailing              Reject
##  2 Technology             None
##  3 Health Care            Reject
##  4 Wholesalers            None
##  5 Retailing              Quarantine
##  6 Motor Vehicles & Parts None
##  7 Energy                 None
##  8 Wholesalers            None
##  9 Retailing              None
## 10 Telecommunications     Quarantine
## # … with 490 more rows

The p column is the DMARC classification for each organization (org names have been withheld to protect the irresponsible) and comes from the p=… value in the DMARC DNS TXT record field. It has a limited set of values, so let’s enumerate them and assign some colors:

dmarc_levels <- c("No DMARC", "None", "Quarantine", "Reject")
dmarc_cols <- set_names(c(ft_cols$slate, "#a6dba0", "#5aae61", "#1b7837"), dmarc_levels)

We want the aggregate value of each p, thus we need to do some counting:

(dmarc_summary <- count(f500_dmarc, industry, p))
## # A tibble: 63 x 3
##    industry            p              n
##  1 Aerospace & Defense No DMARC       9
##  2 Aerospace & Defense None           3
##  3 Aerospace & Defense Quarantine     1
##  4 Apparel             No DMARC       4
##  5 Apparel             None           1
##  6 Business Services   No DMARC       9
##  7 Business Services   None           7
##  8 Business Services   Reject         4
##  9 Chemicals           No DMARC      12
## 10 Chemicals           None           2
## # … with 53 more rows

We’re also going to want to sort the industries by those with the most DMARC (sorted bars/chicklets FTW!). We’ll need a factor for that, so let’s make one:

(dmarc_summary %>%
  filter(p != "No DMARC") %>% # we don't care abt this `p` value
  count(industry, wt=n, sort=TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry                      n
##  1 Financials                   54
##  2 Technology                   25
##  3 Health Care                  24
##  4 Retailing                    23
##  5 Wholesalers                  16
##  6 Energy                       12
##  7 Transportation               12
##  8 Business Services            11
##  9 Industrials                   8
## 10 Food, Beverages & Tobacco     6
## # … with 11 more rows

Now, we can make the chart:

dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

Doh! We rly want them to be 100% width. Thankfully, {ggplot2} has a position_fill() we can use instead of position_dodge():

dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill()) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

Doh! We forgot to use reverse = TRUE in the call to position_fill(), and even then everything is out of order. Kinda. It’s in the order we told it to be in, but that’s not right b/c we need it ordered by the in-industry percentages. If each industry had the same number of organizations, there would not have been an issue. Unfortunately, the folks who make up these lists care not about our time. Let’s re-compute the industry factor by computing the percents:

(dmarc_summary %>%
  group_by(industry) %>%
  mutate(pct = n/sum(n)) %>%
  ungroup() %>%
  filter(p != "No DMARC") %>%
  count(industry, wt=pct, sort=TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry               n
##  1 Transportation     0.667
##  2 Technology         0.641
##  3 Wholesalers        0.615
##  4 Financials         0.614
##  5 Health Care        0.6
##  6 Business Services  0.55
##  7 Food & Drug Stores 0.5
##  8 Retailing          0.5
##  9 Industrials        0.444
## 10 Telecommunications 0.375
## # … with 11 more rows

Now, we can go back to using position_fill() as before:

dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill(reverse = TRUE)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_percent(expand = c(0, 0.001), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

FIN

As noted, this is one way to handle this situation. I’m not super happy with the final visualization here as it doesn’t have the counts next to the industry labels and I like to have the ordering by both count and more secure configuration (so, conditional on higher prevalence of Quarantine or Reject when there are ties). That is an exercise left to the reader 😎.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Mapping the Underlying Social Structure of Reddit

[This article was first published on Posts on Data Science Diarist, and kindly contributed to R-bloggers.]

Reddit is a popular website for opinion sharing and news aggregation. The site consists of thousands of user-made forums, called subreddits, which cover a broad range of subjects, including politics, sports, technology, personal hobbies, and self-improvement. Given that most Reddit users contribute to multiple subreddits, one might think of Reddit as being organized into many overlapping communities. Moreover, one might understand the connections among these communities as making up a kind of social structure.

Uncovering a population’s social structure is useful because it tells us something about that population’s identity. In the case of Reddit, this identity could be uncovered by figuring out which subreddits are most central to Reddit’s network of subreddits. We could also study this network at multiple points in time to learn how this identity has evolved and maybe even predict what it’s going to look like in the future.

My goal in this post is to map the social structure of Reddit by measuring the proximity of Reddit communities (subreddits) to each other. I’m operationalizing community proximity as the number of posts to different communities that come from the same user. For example, if a user posts something to subreddit A and posts something else to subreddit B, subreddits A and B are linked by this user. Subreddits connected in this way by many users are closer together than subreddits connected by fewer users. The idea that group networks can be uncovered by studying shared associations among the people that make up those groups goes way back in the field of sociology (Breiger 1974). Hopefully this post will demonstrate the utility of this concept for making sense of data from social media platforms like Reddit.1

Data

The data for this post come from an online repository of subreddit submissions and comments that is generously hosted by data scientist Jason Baumgartner. If you plan to download a lot of data from this repository, I implore you to donate a bit of money to keep Baumgartner’s database up and running (pushshift.io/donations/).

Here’s the link to the Reddit submissions data – files.pushshift.io/reddit/submissions/. Each of these files has all Reddit submissions for a given month between June 2005 and May 2019. Files are JSON objects stored in various compression formats that range between .017Mb and 5.77Gb in size. Let’s download something in the middle of this range – a 710Mb file for all Reddit submissions from May 2013. The file is called RS_2013-05.bz2. You can double-click this file to unzip it, or you can use the following command in the Terminal: bzip2 -d RS_2013-05.bz2. The file will take a couple of minutes to unzip. Make sure you have enough room to store the unzipped file on your computer – it’s 4.51Gb. Once I have unzipped this file, I load the relevant packages, read the first line of data from the unzipped file, and look at the variable names.

read_lines("RS_2013-05", n_max = 1) %>% fromJSON() %>% names
##  [1] "edited"                 "title"      ##  [3] "thumbnail"              "retrieved_on"      ##  [5] "mod_reports"            "selftext_html"      ##  [7] "link_flair_css_class"   "downs"      ##  [9] "over_18"                "secure_media"      ## [11] "url"                    "author_flair_css_class"      ## [13] "media"                  "subreddit"      ## [15] "author"                 "user_reports"      ## [17] "domain"                 "created_utc"      ## [19] "stickied"               "secure_media_embed"      ## [21] "media_embed"            "ups"      ## [23] "distinguished"          "selftext"      ## [25] "num_comments"           "banned_by"      ## [27] "score"                  "report_reasons"      ## [29] "id"                     "gilded"      ## [31] "is_self"                "subreddit_id"      ## [33] "link_flair_text"        "permalink"      ## [35] "author_flair_text"

For this project, I’m only interested in three of these variables: the user name associated with each submission (author), the subreddit to which a submission has been posted (subreddit), and the time of submission (created_utc). If we could figure out a way to extract these three pieces of information from each line of JSON we could greatly reduce the size of our data, which would allow us to store multiple months worth of information on our local machine. Jq is a command-line JSON processor that makes this possible.

To install jq on a Mac, you need to make sure you have Homebrew (brew.sh/), a package manager that works in the Terminal. Once you have Homebrew, in the Terminal type brew install jq. I’m going to use jq to extract the variables I want from RS_2013-05 and save the result as a .csv file. To select variables with jq, list the JSON field names that you want like this: [.author, .created_utc, .subreddit]. I return these as raw output (-r) and render this as a csv file (@csv). Here’s the command that does all this:

jq -r '[.author, .created_utc, .subreddit] | @csv' RS_2013-05 > parsed_json_to_csv_2013_05

Make sure the Terminal directory is set to wherever RS_2013-05 is located before running this command. The file that results from this command will be saved as “parsed_json_to_csv_2013_05”. This command parses millions of lines of JSON (every Reddit submission from 05-2013), so this process can take a few minutes. In case you’re new to working in the Terminal, if there’s a blank line at the bottom of the Terminal window, that means the process is still running. When the directory name followed by a dollar sign reappears, the process is complete. This file, parsed_json_to_csv_2013_05, is about 118 MB, much smaller than 4.5 GB.

Jq is a powerful tool for automating the process of downloading and manipulating data right from your hard drive. I’ve written a bash script that lets you download multiple files from the Reddit repository, unzip them, extract the relevant fields from the resulting JSON, and delete the unparsed files (Reddit_Download_Script.bash). You can modify this script to pull different fields from the JSON. For instance, if you want to keep the content of Reddit submissions, add .selftext to the fields that are included in the brackets.

Now that I have a reasonably sized .csv file with the fields I want, I am ready to bring the data into R and analyze them as a network.

Analysis

Each row of the data currently represents a unique submission to Reddit from a user. I want to turn this into a dataframe where each row represents a link between subreddits through a user. One problem that arises from this kind of data manipulation is that there are more rows in the network form of this data than there are in the current form of the data. To see this, consider a user who has submitted to 10 different subreddits. These submissions would take up ten rows of our dataframe in its current form. However, this data would be represented by 10 choose 2, or 45, rows of data in its network form. This is every combination of 2 subreddits among those to which the user has posted. This number grows quadratically as the number of submissions from the same user increases. For this reason, the only way to convert the data into a network form without causing R to crash is to convert the data into a Spark dataframe. Spark is a distributed computing platform that partitions large datasets into smaller chunks and operates on these chunks in parallel. If your computer has a multicore processor, Spark allows you to work with big-ish data on your local machine. I will be using a lot of functions from the sparklyr package, which supplies a dplyr backend to Spark. If you’re new to Spark and sparklyr, check out RStudio’s guide for getting started with Spark in R (spark.rstudio.com/).
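The post doesn’t show the Spark connection itself; a minimal local setup might look like the sketch below (the driver-memory value is my assumption and should be tuned to your machine):

# Sketch of a local Spark connection (not the author's exact configuration)
library(sparklyr)
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8G"  # assumed value; adjust to your hardware
sc <- spark_connect(master = "local", config = conf)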

Once I have Spark configured, I import the data into R as a Spark dataframe.

reddit_data <- spark_read_csv(sc, "parsed_json_to_csv_2013_05", header = FALSE)

To begin, I make a few changes to the data – renaming columns, converting the time variable from utc time to the day of the year, and removing submissions from deleted accounts. I also remove submissions from users who have posted only once – these would contribute nothing to the network data – and submissions from users who have posted 60 or more times – these users are likely bots.

reddit_data <- reddit_data %>%
  rename(author = V1, created_utc = V2, subreddit = V3) %>%
  mutate(dateRestored = timestamp(created_utc + 18000)) %>%
  mutate(day = dayofyear(dateRestored)) %>%
  filter(author != "[deleted]") %>%
  group_by(author) %>%
  mutate(count = count()) %>%
  filter(count < 60 & count > 1) %>%
  ungroup()

Next, I create a key that gives a numeric id to each subreddit. I add these ids to the data, and select the variables “author”, “day”, “count”, “subreddit”, and “id” from the data. Let’s have a look at the first few rows of the data.

subreddit_key <- reddit_data %>% distinct(subreddit) %>% sdf_with_sequential_id()

reddit_data <- left_join(reddit_data, subreddit_key, by = "subreddit") %>%
  select(author, day, count, subreddit, id)

head(reddit_data)
## # Source: spark [?? x 5]
##   author           day count subreddit             id
## 1 Bouda            141     4 100thworldproblems  2342
## 2 timeXalchemist   147     4 100thworldproblems  2342
## 3 babydall1267     144    18 123recipes          2477
## 4 babydall1267     144    18 123recipes          2477
## 5 babydall1267     144    18 123recipes          2477
## 6 babydall1267     144    18 123recipes          2477

We have 5 variables. The count variable shows the number of times a user has posted to Reddit in May 2013, the id variable gives the subreddit’s numeric id, the day variable tells us what day of the year a submission has been posted, and the author and subreddit variables give user and subreddit names. We are now ready to convert this data to network format. The first thing I do is take an “inner_join” of the data with itself, merging by the “author” variable. For each user, the number of rows this returns will be the square of the number of submissions from that user. I filter this down to “number of submissions choose 2” rows for each user. This takes two steps. First, I remove rows that link subreddits to themselves. Then I remove duplicate rows. For instance, AskReddit-funny is a duplicate of funny-AskReddit. I remove one of these.

The subreddit id variable will prove useful for removing duplicate rows. If we can mutate two id variables into a new variable that gives a unique identifier to each subreddit pair, we can filter duplicates of this identifier. We need a mathematical equation that takes two numbers and returns a unique number (i.e. a number that can only be produced from these two numbers) regardless of number order. One such equation is the Cantor Pairing Function (wikipedia.org/wiki/Pairing_function):
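id_pair = (1/2) * (id + id2) * (id + id2 + 1) + max(id, id2)

(This is the Cantor pairing function with max(id, id2) in place of the usual second argument, which is what makes the result independent of the order of the two ids; it matches the id_pair calculation in the function below.)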

Let’s define a function in R that takes a dataframe and two id variables, runs the id variables through Cantor’s Pairing Function and appends this to the dataframe, filters duplicate cantor ids from the dataframe, and returns the result. We’ll call this function cantor_filter.

cantor_filter <- function(df, id, id2){
  df %>%
    mutate(id_pair = .5*(id + id2)*(id + id2 + 1) + pmax(id, id2)) %>%
    group_by(author, id_pair) %>%
    filter(row_number(id_pair) == 1) %>%
    return()
}

Next, I apply an inner_join to the Reddit data and apply the filters described above to the resulting dataframe.

reddit_network_data <- inner_join(reddit_data,
                                  reddit_data %>%
                                    rename(day2 = day, count2 = count,
                                           subreddit2 = subreddit, id2 = id),
                                  by = "author") %>%
  filter(subreddit != subreddit2) %>%
  group_by(author, subreddit, subreddit2) %>%
  filter(row_number(author) == 1) %>%
  cantor_filter() %>%
  select(author, subreddit, subreddit2, id, id2, day, day2, id_pair) %>%
  ungroup %>%
  arrange(author)

Let’s take a look at the new data.

reddit_network_data
## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
## Please use `call_name()` instead.
## This warning is displayed once per session.
## Warning: `lang()` is deprecated as of rlang 0.2.0.
## Please use `call2()` instead.
## This warning is displayed once per session.
## # Source:     spark [?? x 8]
## # Ordered by: author
##    author     subreddit     subreddit2        id   id2   day  day2  id_pair
##  1 --5Dhere   depression    awakened        7644 29936   135   135   7.06e8
##  2 --Adam--   AskReddit     techsupport    15261 28113   135   142   9.41e8
##  3 --Caius--  summonerscho… leagueoflegen…    79     3   124   142   3.48e3
##  4 --Gianni-- AskReddit     videos         15261  5042   125   138   2.06e8
##  5 --Gianni-- pics          AskReddit       5043 15261   126   125   2.06e8
##  6 --Gianni-- movies        pics           20348  5043   124   126   3.22e8
##  7 --Gianni-- gaming        videos         10158  5042   131   138   1.16e8
##  8 --Gianni-- gaming        pics           10158  5043   131   126   1.16e8
##  9 --Gianni-- movies        AskReddit      20348 15261   124   125   6.34e8
## 10 --Gianni-- movies        videos         20348  5042   124   138   3.22e8
## # … with more rows

We now have a dataframe where each row represents a link between two subreddits through a distinct user. Many pairs of subreddits are connected by multiple users. We can think of subreddit pairs connected through more users as being more connected than subreddit pairs connected by fewer users. With this in mind, I create a “weight” variable that tallies the number of users connecting each subreddit pair and then filters the dataframe to unique pairs.

reddit_network_data <- reddit_network_data %>%
  group_by(id_pair) %>%
  mutate(weight = n()) %>%
  filter(row_number(id_pair) == 1) %>%
  ungroup

Let’s have a look at the data and see how many rows it has.

reddit_network_data
## # Source:     spark [?? x 9]
## # Ordered by: author
##    author     subreddit   subreddit2    id   id2   day  day2 id_pair weight
##  1 h3rbivore  psytrance   DnB            8     2   142   142      63      1
##  2 StRefuge   findareddit AlienBlue     23     5   133   134     429      1
##  3 DylanTho   blackops2   DnB           28     2   136   138     493      2
##  4 TwoHardCo… bikewrench  DnB           30     2   137   135     558      1
##  5 Playbook4… blackops2   AlienBlue     28     5   121   137     589      2
##  6 A_Jewish_… atheism     blackops2      6    28   139   149     623     14
##  7 SirMechan… Terraria    circlejerk    37     7   150   143    1027      2
##  8 Jillatha   doctorwho   facebookw…    36     9   131   147    1071      2
##  9 MeSire     Ebay        circlejerk    39     7   132   132    1120      3
## 10 Bluesfan6… SquaredCir… keto          29    18   126   134    1157      2
## # … with more rows

reddit_network_data %>% sdf_nrow
## [1] 744939

We’re down to ~750,000 rows. The weight column shows that many of the subreddit pairs in our data are only connected by 1 or 2 users. We can substantially reduce the size of the data without losing the subreddit pairs we’re interested in by removing these rows. I decided to remove subreddit pairs that are connected by 3 or fewer users. I also opt at this point to stop working with the data as a Spark object and bring the data into the R workspace as a dataframe. The network analytic tools I use next require working on a regular dataframes and our data is now small enough that we can do this without any problems. Because we’re moving into the R workspace, I save this as a new dataframe called reddit_edgelist.

reddit_edgelist <- reddit_network_data %>%
  filter(weight > 3) %>%
  select(id, id2, weight) %>%
  arrange(id) %>%
  # Bringing the data into the R workspace
  dplyr::collect()

Our R dataframe consists of three columns: two id columns that provide information on connections between nodes and a weight column that tells us the strength of each connection. One nice thing to have would be a measure of the relative importance of each subreddit. A simple way to get this would be to count how many times each subreddit appears in the data. I compute this for each subreddit by adding the weight values in the rows where that subreddit appears. I then create a dataframe called subreddit_imp_key that lists subreddit ids by subreddit importance.

subreddit_imp_key <- full_join(reddit_edgelist %>%
                                 group_by(id) %>%
                                 summarise(count = sum(weight)),
                               reddit_edgelist %>%
                                 group_by(id2) %>%
                                 summarise(count2 = sum(weight)),
                               by = c("id" = "id2")) %>%
  mutate(count = ifelse(is.na(count), 0, count)) %>%
  mutate(count2 = ifelse(is.na(count2), 0, count2)) %>%
  mutate(id = id, imp = count + count2) %>%
  select(id, imp)

Let’s see which subreddits are the most popular on Reddit according to the subreddit importance key.

left_join(subreddit_imp_key, subreddit_key %>% dplyr::collect(), by = "id") %>%
  arrange(desc(imp))
## # A tibble: 5,561 x 3
##       id    imp subreddit
##  1 28096 107894 funny
##  2 15261 101239 AskReddit
##  3 20340  81208 AdviceAnimals
##  4  5043  73119 pics
##  5 10158  51314 gaming
##  6  5042  47795 videos
##  7 17856  47378 aww
##  8  2526  37311 WTF
##  9 22888  31702 Music
## 10  5055  26666 todayilearned
## # … with 5,551 more rows

These subreddits are mostly about memes and gaming, which are indeed two things that people commonly associate with Reddit.

Next, I reweight the edge weights in reddit_edgelist by subreddit importance. The reason I do this is that the number of users connecting subreddits is partially a function of subreddit popularity. Reweighting by subreddit importance, I control for the influence of this confounding variable.

reddit_edgelist <- left_join(reddit_edgelist, subreddit_imp_key, by = c("id" = "id")) %>%
  left_join(., subreddit_imp_key %>% rename(imp2 = imp), by = c("id2" = "id")) %>%
  mutate(imp_fin = (imp + imp2)/2) %>%
  mutate(weight = weight/imp_fin) %>%
  select(id, id2, weight)

reddit_edgelist
## # A tibble: 56,257 x 3
##       id   id2   weight
##  1     1 12735 0.0141
##  2     1 10158 0.000311
##  3     1  2601 0.00602
##  4     1 17856 0.000505
##  5     1 22900 0.000488
##  6     1 25542 0.0185
##  7     1 15260 0.00638
##  8     1 20340 0.000320
##  9     2  2770 0.0165
## 10     2 15261 0.000295
## # … with 56,247 more rows

We now have our final edgelist. There are about 56,000 rows in the data, though most edges have very small weights. Next, I use the igraph package to turn this dataframe into a graph object. Graph objects can be analyzed using igraph’s clustering algorithms. Let’s have a look at what this graph object looks like.

reddit_graph <- graph_from_data_frame(reddit_edgelist, directed = FALSE)

reddit_graph
## IGRAPH 2dc5bc4 UNW- 5561 56257 --
## + attr: name (v/c), weight (e/n)
## + edges from 2dc5bc4 (vertex names):
##  [1] 1--12735 1--10158 1--2601  1--17856 1--22900 1--25542 1--15260
##  [8] 1--20340 2--2770  2--15261 2--18156 2--20378 2--41    2--22888
## [15] 2--28115 2--10172 2--5043  2--28408 2--2553  2--2836  2--28096
## [22] 2--23217 2--17896 2--67    2--23127 2--2530  2--2738  2--7610
## [29] 2--20544 2--25566 2--3     2--7     2--7603  2--12931 2--17860
## [36] 2--6     2--2526  2--5055  2--18253 2--22996 2--25545 2--28189
## [43] 2--10394 2--18234 2--23062 2--25573 3--264   3--2599  3--5196
## [50] 3--7585  3--10166 3--10215 3--12959 3--15293 3--20377 3--20427
## + ... omitted several edges

Here we have a list of all of the edges from the dataframe. I can now use a clustering algorithm to analyze the community structure that underlies this subreddit network. The clustering algorithm I choose to use here is the Louvain algorithm. This algorithm takes a network and groups its nodes into different communities in a way that maximizes the modularity of the resulting network. By maximizing modularity, the Louvain algorithm groups nodes in a way that maximizes the number of within-group ties and minimizes the number of between-group ties.
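For reference, the quantity being maximized here is the standard weighted modularity score (not written out in the original post):

Q = (1 / 2m) * Σ_ij [ A_ij − (k_i * k_j) / 2m ] * δ(c_i, c_j)

where A_ij is the edge weight between subreddits i and j, k_i is the weighted degree of subreddit i, m is the total edge weight in the network, and δ(c_i, c_j) equals 1 when i and j are placed in the same community and 0 otherwise. Groupings that increase Q have more within-community edge weight than we would expect by chance.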

Let’s apply the algorithm and see if the groupings it produces make sense. I store the results of the algorithm in a tibble with other relevant information. See code annotations for a more in-depth explanation of what I’m doing here.

reddit_communities <- cluster_louvain(reddit_graph, weights = reddit_edgelist$weight)

subreddit_by_comm <- tibble(
  # Using map from purrr to extract subreddit ids from reddit_communities
  id = map(reddit_communities[], as.numeric) %>% unlist,
  # Creating a community id column and using rep with map to populate
  # a column with the community ids created by the Louvain alg
  comm = rep(reddit_communities[] %>% names,
             map(reddit_communities[], length) %>% unlist) %>% as.numeric) %>%
  # Adding subreddit names
  left_join(., subreddit_key %>% dplyr::collect(), by = "id") %>%
  # Keeping subreddit name, subreddit id, community id
  select(subreddit, id, comm) %>%
  # Adding subreddit importance
  left_join(., subreddit_imp_key, by = "id")

Next, I calculate community importance by summing the subreddit importance scores of the subreddits in each community.

subreddit_by_comm <- subreddit_by_comm %>%
  group_by(comm) %>%
  mutate(comm_imp = sum(imp)) %>%
  ungroup

I create a tibble of the 10 most important communities on Reddit according to the subreddit groupings generated by the Louvain algorithm. This tibble displays the 10 largest subreddits in each of these communities. Hopefully, these subreddits will be similar enough that we can discern what each community represents.

comm_ids <- subreddit_by_comm %>%
  group_by(comm) %>%
  slice(1) %>%
  arrange(desc(comm_imp)) %>%
  .[["comm"]]

top_comms <- list()
for(i in 1:10){
  top_comms[[i]] <- subreddit_by_comm %>%
    filter(comm == comm_ids[i]) %>%
    arrange(desc(imp)) %>%
    .[["subreddit"]] %>%
    .[1:10]
}

comm_tbl <- tibble(Community = 1:10,
                   Subreddits = map(top_comms, ~paste(.x, collapse = " ")) %>% unlist)

Let’s have a look at the 10 largest subreddits in each of the 10 largest communities. These are in descending order of importance.

options(kableExtra.html.bsTable = TRUE)

comm_tbl %>%
  kable("html") %>%
  kable_styling("hover", full_width = F) %>%
  column_spec(1, bold = T, border_right = "1px solid #ddd;") %>%
  column_spec(2, width = "30em")
Community Subreddits
1 funny AskReddit AdviceAnimals pics gaming videos aww WTF Music todayilearned
2 DotA2 tf2 SteamGameSwap starcraft tf2trade Dota2Trade GiftofGames SteamTradingCards Steam vinyl
3 electronicmusic dubstep WeAreTheMusicMakers futurebeats trap edmproduction electrohouse EDM punk ThisIsOurMusic
4 hockey fantasybaseball nhl Austin DetroitRedWings sanfrancisco houston leafs BostonBruins mlb
5 cars motorcycles Autos sysadmin carporn formula1 Jeep subaru Cartalk techsupportgore
6 web_design Entrepreneur programming webdev Design windowsphone SEO forhire startups socialmedia
7 itookapicture EarthPorn AbandonedPorn HistoryPorn photocritique CityPorn MapPorn AnimalPorn SkyPorn Astronomy
8 wow darksouls Diablo Neverwinter Guildwars2 runescape diablo3 2007scape swtor Smite
9 blackops2 battlefield3 dayz Eve Planetside aviation airsoft WorldofTanks Warframe CallOfDuty
10 soccer Seattle Fifa13 Portland MLS Gunners reddevils chelseafc football LiverpoolFC

The largest community in this table, community 1, happens to contain the ten most popular subreddits on Reddit. Although some of these subreddits are similar in terms of their content – many of them revolve around memes, for example – a couple of them do not (e.g. videos and gaming). One explanation is that this first group of subreddits represents mainstream Reddit. In other words, the people who post to these subreddits are generalist posters – they submit to a broad enough range of subreddits that categorizing these subreddits into any of the other communities would reduce the modularity of the network.

The other 9 communities in the figure are easier to interpret. Each one revolves around a specific topic. Communities 2, 8, and 9 are gaming communities dedicated to specific games; communities 4 and 10 are sports communities; the remaining communities are dedicated to electronic music, cars, web design, and photography.

In sum, we have taken a month’s worth of Reddit submissions, converted them into a network, and identified subreddit communities from them. How successful were we? On one hand, the Louvain algorithm correctly identified many medium-sized communities revolving around specific topics. It’s easy to imagine that the people who post to these groups of subreddits contribute almost exclusively to them, and that it therefore makes sense to think of them as communities. On the other hand, the largest community has some pretty substantively dissimilar subreddits. These also happen to be the largest subreddits on Reddit. The optimistic interpretation of this grouping is that these subreddits encompass a community of mainstream users. However, the alternative possibility is that this community is really just a residual category of subreddits that don’t really belong together but also don’t have any obvious place in the other subreddit communities. Let’s set this issue to the side for now.

In the next section, I visualize these communities as a community network and examine how this network has evolved over time.

Visualizations

In the last section, I generated some community groupings of subreddits. While these give us some idea of the social structure of Reddit, one might want to know how these communities are connected to each other. In this section, I take these community groupings and build a community-level network from them. I then create some interactive visualizations that map the social structure of Reddit and show how this structure has evolved over time.

The first thing I want to do is return to the subreddit edgelist, our dataframe of subreddit pairs and the strength of their connections, and merge this with community id variables corresponding to each subreddit. I filter the dataframe to only include unique edges, and add a variable called weight_fin, which is the average of the subreddit edge weights between each pair of communities. I also filter out links in the community-level edgelist that connect communities to themselves. I realize that there’s a lot going on in the code below. Feel free to contact me if you have any questions about what I’m doing here.

community_edgelist <- left_join(reddit_edgelist, subreddit_by_comm %>% select(id, comm), by = "id") %>%
  left_join(., subreddit_by_comm %>% select(id, comm) %>% rename(comm2 = comm), by = c("id2" = "id")) %>%
  select(comm, comm2, weight) %>%
  mutate(id_pair = .5*(comm + comm2)*(comm + comm2 + 1) + pmax(comm, comm2)) %>%
  group_by(id_pair) %>%
  mutate(weight_fin = mean(weight)) %>%
  slice(1) %>%
  ungroup %>%
  select(comm, comm2, weight_fin) %>%
  filter(comm != comm2) %>%
  arrange(desc(weight_fin))

I now have a community-level edgelist, with which we can visualize a network of subreddit communities. I first modify the edge weight variable to discriminate between communities that are more and less connected. I choose an arbitrary cutoff point (.007) and set all weights below this cutoff to 0. Although doing this creates a risk of imposing structure on the network where there is none, this cutoff will help highlight significant ties between communities.

community_edgelist_ab <- community_edgelist %>%
  mutate(weight = ifelse(weight_fin > .007, weight_fin, 0)) %>%
  filter(weight != 0) %>%
  mutate(weight = abs(log(weight)))

The visualization tools that I use here come from the visnetwork package. For an excellent set of tutorials on network visualizations in R, check out the tutorials section of Professor Katherine Ognyanova’s website (kateto.net/tutorials/). Much of what I know about network visualization in R I learned from the “Static and dynamic network visualization in R” tutorial.

Visnetwork’s main function, visNetwork, requires two arguments, one for nodes data and one for edges data. These dataframes need to have particular column names for visnetwork to be able to make sense of them. Let’s start with the edges data. The column names for the nodes corresponding to edges in the edgelist need to be called “from” and “to”, and the column name for edge weights needs to be called “weight”. I make these adjustments.

community_edgelist_mod <- community_edgelist_ab %>%
  rename(from = comm, to = comm2) %>%
  select(from, to, weight)

Also, visnetwork’s default edges are curved. I prefer straight edges. To ensure edges are straight, add a smooth column and set it to FALSE.

community_edgelist_mod$smooth <- F

I’m now ready to set up the nodes data. First, I extract all nodes from the community edgelist.

community_nodes <- c(community_edgelist_mod %>% .[["from"]],
                     community_edgelist_mod %>% .[["to"]]) %>% unique

Visnetwork has this really cool feature that lets you view node labels by hovering over them with your mouse cursor. I’m going to label each community with the names of the 4 most popular subreddits in that community.

comm_by_label <- subreddit_by_comm %>%
  arrange(comm, desc(imp)) %>%
  group_by(comm) %>%
  slice(1:4) %>%
  summarise(title = paste(subreddit, collapse = " "))

Next, I put node ids and community labels in a tibble. Note that the label column in this tibble has to be called “title”.

community_nodes_fin <- tibble(comm = community_nodes) %>% left_join(., comm_by_label, by = "comm")

I want the nodes of my network to vary in size based on the size of each community. To do this, I create a community importance key. I’ve already calculated community importance above. I extract this score for each community from the subreddit_by_comm dataframe and merge these importance scores with the nodes data. I rename the community importance variable “size” and the community id variable “id”, which are the column names that visnetwork recognizes.

comm_imp_key <- subreddit_by_comm %>%
  group_by(comm) %>%
  slice(1) %>%
  arrange(desc(comm_imp)) %>%
  select(comm, comm_imp)

community_nodes_fin <- inner_join(community_nodes_fin, comm_imp_key, by = "comm") %>%
  rename(size = comm_imp, id = comm)

One final issue is that my “mainstream Reddit/residual subreddits” community is so much bigger than the other communities that the network visualization will be overtaken by it if I don’t adjust the size variable. I remedy this by raising community size to the power of 0.3 (close to the cube root).

community_nodes_fin <- community_nodes_fin %>% mutate(size = size^.3)

I can now enter the nodes and edges data into the visNetwork function. I make a few final adjustments to the default parameters. Visnetwork now lets you use layouts from the igraph package. I use visIgraphLayout to set the position of the nodes according to the Fruchterman-Reingold Layout Algorithm (layout_with_fr). I also adjust edge widths and set highlightNearest to TRUE. This lets you highlight a node and the nodes it is connected to by clicking on it. Without further ado, let’s have a look at the network.
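The original post embeds the interactive widget itself rather than the call that produces it; a sketch of what that call presumably looks like is below (the edge width and option values are my assumptions, not the author’s exact settings):

# Sketch of the visualization call, using the nodes and edges built above
visNetwork(community_nodes_fin, community_edgelist_mod) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%  # Fruchterman-Reingold node positions
  visEdges(width = 2) %>%                         # assumed width; the post says edge widths were adjusted
  visOptions(highlightNearest = TRUE)             # click a node to highlight its neighbours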

2013 Reddit Network.

The communities of Reddit do not appear to be structured into distinct categories. We don’t see a cluster of hobby communities and a different cluster of advice communities, for instance. Instead, we have some evidence to suggest that the strongest ties are among some of the larger subcultures of Reddit. Many of the nodes in the large cluster of communities above are ranked in the 2-30 range in terms of community size. On the other hand, the largest community (mainstream Reddit) is out on an island, with only a few small communities around it. This suggests that the ties between mainstream Reddit and some of Reddit’s more niche communities are weaker than the ties among the latter. In other words, fringe subcultures of Reddit are more connected to each other than they are to Reddit’s mainstream.

The substance of these fringe communities lends credence to this interpretation. Many of the communities in the large cluster are somewhat related in their content. There are a lot of gaming communities, several drug and music communities, a couple of sports communities, and a few communities that combine gaming, music, sports, and drugs in different ways. Indeed, most of the communities in this cluster revolve around activities commonly associated with young men. One might even infer from this network that Reddit is organized into two social spheres, one consisting of adolescent men and the other consisting of everybody else. Still, I should caution the reader against extrapolating too much from the network above. These ties are based on 30 days of submissions. It’s possible that something occurred during this period that momentarily brought certain Reddit communities closer together than they would be otherwise. There are links among some nodes in the network that don’t make much logical sense. For instance, the linux/engineering/3D-Printing community (which only sort of makes sense as a community) is linked to a “guns/knives/coins” community. This strikes me as a bit strange, and I wonder if these communities would look the same if I took data from another time period. Still, many of the links here make a lot of sense. For example, the Bitcoin/Conservative/Anarcho_Capitalism community is tied to the Anarchism/progressive/socialism/occupywallstreet community. The Drugs/bodybuilding community is connected to the MMA/Joe Rogan community. That one makes almost too much sense. Anyway, I encourage you to click on the network nodes to see what you find.

One of the coolest things about the Reddit repository is that it contains temporally precise information on everything that’s happened on Reddit from its inception to only a few months ago. In the final section of this post, I rerun the above analyses on all the Reddit submissions from May 2017 and May 2019. I’m using the bash script I linked to above to do this. Let’s have a look at the community networks from 2017 and 2019 and hopefully gain some insight into how Reddit has evolved over the past several years.

2017 Reddit Network.

Perhaps owing to the substantial growth of Reddit between 2013 and 2017, we start to see a hierarchical structure among the communities that we didn’t see in the previous network. A few of the larger communities now have smaller communities budding off of them. I see four such “parent communities”. One of them is the music community. There’s a musicals/broadway community, a reggae community, an anime music community, and a “deepstyle” (whatever that is) community stemming from this. Another parent community is the sports community, which has a few location-based communities, a lacrosse community, and a Madden community abutting it. The other two parent communities are porn communities. I won’t name the communities stemming from these, but as you might guess many of them revolve around more niche sexual interests.

This brings us to another significant change between this network and the one from 2013: the emergence of porn on Reddit. We now see that two of the largest communities involve porn. We also start to see some differentiation among the porn communities. There is a straight porn community, a gay porn community, and a sex-based kik community (kik is a messenger app). It appears that since 2013, Reddit has increasingly come to serve some of the same functions as Craigslist, providing users with a place to arrange to meet up, either online or in person, for sex. As we’ll see in the 2019 network, this function has only continued to grow. This is perhaps due to the Trump Administration’s sex trafficking bill and Craigslist’s decision to shut down its “casual encounters” personal ads in 2018.

Speaking of Donald Trump, where is he in our network? As it turns out, this visualization belies the growing presence of Donald Trump on Reddit between 2013 and 2017. The_Donald is a subreddit for fans of Donald Trump that quickly became one of the most popular subreddits on Reddit during this time. The reason that we don’t see it here is that it falls into the mainstream Reddit community, and despite its popularity it is not one of the four largest subreddits in this community. The placement of The_Donald in this community was one of the most surprising results of this project. I had expected The_Donald to fall into a conservative political community. The reason The_Donald falls into the mainstream community, I believe, is that much of The_Donald consists of news and memes, the bread and butter of Reddit. Many of the most popular subreddits in the mainstream community are meme subreddits – Showerthoughts, dankmemes, funny – and the overlap between users who post to these subreddits and users who post to The_Donald is substantial.

2019 Reddit Network.

That brings us to May 2019. What’s changed from 2017? The network structure is similar – we have two groups, mainstream Reddit and an interconnected cluster of more niche communities. This cluster has the same somewhat hierarchical structure that we saw in the 2017 network, with a couple of large “parent communities” that are porn communities. This network also shows the rise of Bitcoin on Reddit. While Bitcoin was missing from the 2017 network, in 2019 it constitutes one of the largest communities on the entire site. It’s connected to a conspiracy theory community, a porn community, a gaming community, an exmormon/exchristian community, a tmobile/verizon community, and an architecture community. While some of these ties may be coincidental, some of them likely reflect real sociocultural overlaps.

Recap/Next Steps

That’s all I have for now. My main takeaway from this project is that Reddit consists of two worlds: a “mainstream” Reddit that is comprised of meme and news subreddits, and a more fragmented, “fringe” Reddit that is made up of groups of porn, gaming, hobbyist, Bitcoin, sports, and music subreddits. This raises the question of how these divisions map onto real social groups. It appears that the Reddit communities outside the mainstream revolve around topics that are culturally associated with young men (e.g. gaming, vaping, Joe Rogan). Is the reason for this that young men are more likely to post exclusively to a handful of somewhat culturally subversive subreddits that other users are inclined to avoid? Unfortunately, we don’t have the data to answer this question, but this hypothesis is supported by the networks we see here.

The next step to take on this project will be to figure out how to allow for overlap between subreddit communities. As I mentioned, the clustering algorithm I used here forces subreddits into single communities. This distorts how communities on Reddit are really organized. Many subreddits appeal to multiple and distinct interests of Reddit users. For example, many subreddits attract users with a common political identity while also providing users with a news source. City-based subreddits attract fans of cities’ sports teams but also appeal to people who want to know about non-sports-related local events. That subreddits can serve multiple purposes could mean that the algorithm I use here lumped together subreddits that belong in distinct and overlapping communities. It also suggests that my mainstream Reddit community could really be a residual community of liminal subreddits that do not have a clear categorization. A clustering algorithm that allowed for community overlap would elucidate which subreddits span multiple communities. SNAP (Stanford Network Analysis Project) has tools in Python that seem promising for this kind of research. Stay tuned!


  1. For some recent applications of Breiger’s ideas in computer science, see Yang et al. 2013; Yang and Leskovec 2012.↩


To leave a comment for the author, please follow the link and comment on their blog: Posts on Data Science Diarist.


Coding algorithms in R for models written in Stan


[This article was first published on R – Statisfaction, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

[Image: Stanislaw Ulam’s autobiography, “Adventures of a Mathematician”, originally published in 1976]

Hi all,

On top of recommending the excellent autobiography of Stanislaw Ulam, this post is about using the software Stan, not directly to perform inference, but to obtain R functions that evaluate a target’s probability density function and its gradient. With these, one can implement custom methods while still benefiting from the great work of the Stan team on the “modeling language” side. As a proof of concept I have implemented a plain Hamiltonian Monte Carlo sampler for a random effect logistic regression model (taken from a course on Multilevel Models by Germán Rodríguez), a coupling of that HMC algorithm (as in “Unbiased Hamiltonian Monte Carlo with couplings“; see also this very recent article on the topic of coupling HMC), and then upper bounds on the total variation distance between the chain and its limiting distribution, as in “Estimating Convergence of Markov chains with L-Lag Couplings“.

The R script is here: https://github.com/pierrejacob/statisfaction-code/blob/master/2019-09-stan-logistic.R and is meant to be as simple as possible, and self-contained; warning, this is all really proof of concept and not thoroughly tested.

Basically, the R script starts like a standard script that would use rstan for inference; it runs Stan’s default algorithm for a little while, then extracts some information from the resulting “stanfit” object. From there, a pure R implementation of TV upper bounds for a naive HMC algorithm follows, relying on functions called “stan_logtarget” and “stan_gradlogtarget” to evaluate the target log-pdf and its gradient.
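Here is a minimal sketch, my own and not the script’s exact code, of how such functions can be wrapped around a fitted “stanfit” object with rstan; “fit” is an assumed stanfit object from a short preliminary run, and the parameter vector lives on Stan’s unconstrained scale:

library(rstan)

# 'fit' is assumed to be a stanfit object obtained from a short rstan run
stan_logtarget <- function(x) {
  # log density of the target, evaluated at unconstrained parameters x
  rstan::log_prob(fit, upars = x, adjust_transform = TRUE)
}

stan_gradlogtarget <- function(x) {
  # gradient of the log density at x
  rstan::grad_log_prob(fit, upars = x, adjust_transform = TRUE)
}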

The script takes a few minutes to run in total. Some time is first needed to compile the Stan code and to run Stan for a few steps. Then some time is spent towards the end of the script to generate 250 independent meeting times with a lag of 500 between the chains; the exact run time will of course depend a lot on the number of available processors (on my machine it takes around one minute). The script produces this plot:

[Figure: 2019-09-stan-logistic]

This plot suggests that vanilla HMC as implemented in the script converges in less than 1000 iterations to its stationary distribution. This is probably quite conservative, but it’s still usable.

In passing, profiling the code of the function that generates each meeting time shows that half of the time is spent in Stan‘s “grad_log_prob” function (which computes the gradient of the log-pdf of the target). This suggests that not much efficiency is lost by coding the algorithms in pure R, at least for this model.
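To make the gradient bookkeeping concrete, here is a minimal sketch of a single leapfrog-based HMC transition built on those two functions. This is my own simplified illustration, not the script’s implementation; the step size and number of leapfrog steps are arbitrary. Each transition calls the gradient nsteps + 1 times, which is why grad_log_prob dominates the run time.

hmc_step <- function(x, stepsize = 0.1, nsteps = 10) {
  x0 <- x
  v0 <- rnorm(length(x))                 # refresh the momentum
  v <- v0 + 0.5 * stepsize * stan_gradlogtarget(x)
  # leapfrog integration of the Hamiltonian dynamics
  for (i in seq_len(nsteps)) {
    x <- x + stepsize * v
    if (i < nsteps) v <- v + stepsize * stan_gradlogtarget(x)
  }
  v <- v + 0.5 * stepsize * stan_gradlogtarget(x)
  # Metropolis-Hastings accept/reject step
  log_ratio <- (stan_logtarget(x) - 0.5 * sum(v^2)) -
    (stan_logtarget(x0) - 0.5 * sum(v0^2))
  if (log(runif(1)) < log_ratio) x else x0
}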


To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction.


Harry Potter and the Power of Bayesian Constrained Inference


[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

If you are reading this, you are probably a Ravenclaw. Or a Hufflepuff. Certainly not a Slytherin … but maybe a Gryffindor?

In this blog post, we let three subjective Bayesians predict the outcome of ten coin flips. We will derive prior predictions, evaluate their accuracy, and see how fortune favours the bold. We will also discover a neat trick that allows one to easily compute Bayes factors for models with parameter restrictions compared to models without such restrictions, and use it to answer a question we truly care about: are Slytherins really the bad guys?

Preliminaries

As in a previous blog post, we start by studying coin flips. Let $\theta \in [0, 1]$ be the bias of the coin and let $y$ denote the number of heads out of $n$ coin flips. We use the Binomial likelihood

p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,

and a Beta prior for $\theta$:

p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .

This prior is conjugate for this likelihood which means that the posterior is again a Beta distribution. The Figure below shows two examples of this.

plot of chunk unnamed-chunk-1
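For completeness (a standard conjugacy result, added here rather than taken from the original post), combining the Binomial likelihood with the Beta prior gives the posterior

p(\theta \mid y) = \frac{1}{\text{B}(a + y, b + n - y)} \theta^{a + y - 1} (1 - \theta)^{b + n - y - 1} \enspace .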

In this blog post, we will use a prior predictive perspective on model comparison by means of Bayes factors. For an extensive contrast with a perspective based on posterior prediction, see this blog post. The Bayes factor indicates how much better a model $\mathcal{M}_1$ predicts the data $y$ relative to another model $\mathcal{M}_0$:

\text{BF}_{10} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)} \enspace ,

where we can write the marginal likelihood of a generic model $\mathcal{M}$ more complicatedly to see the dependence on the model’s priors:

p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .

After these preliminaries, in the next section, we visit Ron, Harry, and Hermione in Hogwarts.

The Hogwarts prediction contest

Ron, Harry, and Hermione just came back from a strenuous adventure — Death Eaters and all. They deserve a break, and Hermione suggests a small prediction contest to relax. Ron is put off initially; relaxing by thinking? That’s not his style. Harry does not care either way; both are eventually convinced.

The goal of the contest is to accurately predict the outcome of $n = 10$ coin flips. Luckily, this is not a particularly complicated problem to model, and we can use the Binomial likelihood we have discussed above. In the next section, Ron, Harry, and Hermione — all subjective Bayesians — clearly state their prior beliefs, which is required to make predictions.

Prior beliefs

Ron is not big on thinking, and so trusts his previous intuitions that coins are usually unbiased; he specifies a point mass on $\theta = 0.50$ as his prior. Harry spreads his bets evenly, and believes that all chances governing the coin flip’s outcome are equally likely; he puts a uniform prior on $\theta$. Hermione, on the other hand, believes that the coin cannot be biased towards tails; instead, she believes that all values $\theta \in [0.50, 1]$ are equally likely. She thinks this because Dobby — the house elf — is the one who throws the coin, and she has previously observed him passing time by flipping coins, which strangely almost always landed heads up. To sum up, their priors are:

\begin{aligned}
\text{Ron} &: \theta = 0.50 \\[.5em]
\text{Harry} &: \theta \sim \text{Beta}(1, 1) \\[.5em]
\text{Hermione} &: \theta \sim \text{Beta}(1, 1)\mathbb{I}(0.50, 1) \enspace ,
\end{aligned}

which are visualized in the Figure below.

plot of chunk unnamed-chunk-2

In the next section, the three use their beliefs to make probabilistic predictions.

Prior predictions

Ron, Harry, and Hermione are subjective Bayesians and therefore evaluate their performance by their respective predictive accuracy. Each of the trio has a prior predictive distribution. For Ron, true to character, this is the easiest to derive. We associate model $\mathcal{M}_0$ with him and write:

\begin{aligned}
p(y \mid \mathcal{M}_0) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} 0.50^y (1 - 0.50)^{n - y} \enspace ,
\end{aligned}

where the integral — the sum! — is just over the value $\theta = 0.50$. Ron’s prior predictive distribution is simply a Binomial distribution. He is delighted by this fact, and enjoys a short rest while the others derive their predictions.

It is Harry’s turn, and he is a little put off by his integration problem. However, he realizes that the integrand is an unnormalized Beta distribution, and swiftly writes down its normalizing constant, the Beta function. Associating $\mathcal{M}_1$ with him, his steps are:

\begin{aligned}
p(y \mid \mathcal{M}_1) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \frac{1}{\text{B}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} \text{B}(y + 1, n - y + 1) \enspace ,
\end{aligned}

which is a Beta-Binomial distribution with $\alpha = \beta = 1$.

Hermione’s integral is the most complicated of the three, but she is also the smartest of the bunch. She is a master of the wizardry that is computer programming, which allows her to solve the integral numerically.1 We associate $\mathcal{M}_r$, which stands for restricted model, with her and write:

\begin{aligned}
p(y \mid \mathcal{M}_r) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta \\[.5em]
&= \int_{0.50}^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \, 2 \, \mathrm{d}\theta \\[.5em]
&= 2{n \choose y}\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \enspace .
\end{aligned}

We can draw from the prior predictive distributions by simulating from the prior and then making predictions through the likelihood. For Hermione, for example, this yields:

nr_draws <- 20
theta_Hermione <- runif(n = nr_draws, min = 0.50, max = 1)
predictions_Hermione <- rbinom(n = nr_draws, size = 10, prob = theta_Hermione)
predictions_Hermione
##  [1] 10 10 10  3  7 10  8  9  6  9  9  6  8  9  8 10  6 10  5  7

Let’s visualize Ron’s, Harry’s, and Hermione’s prior predictive distributions to get a better feeling for what they believe are likely coin flip outcomes. First, we implement their prior predictions in R:

Ron <- function(y, n = 10) {
  choose(n, y) * 0.50^n
}

Harry <- function(y, n = 10) {
  choose(n, y) * beta(y + 1, n - y + 1)
}

Hermione <- function(y, n = 10) {
  int <- integrate(function(theta) theta^y * (1 - theta)^(n - y), 0.50, 1)
  2 * choose(n, y) * int$value
}

Even though Ron believes that $\theta = 0.50$, this does not mean that his prior prediction puts all mass on $y = 5$; deviations from this value are plausible. Harry’s prior predictive distribution also makes sense: since he believes all values for $\theta$ to be equally likely, he should believe all outcomes are equally likely. Hermione, on the other hand, believes that $\theta \in [0.50, 1]$, so her prior probabilities for outcomes with few heads ($y < 5$) drastically decrease.

plot of chunk unnamed-chunk-5
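As a quick numeric check (my own addition, using the functions just defined), Harry’s prior predictive indeed assigns probability $\frac{1}{11} \approx 0.091$ to every outcome, and all three prior predictive distributions sum to one:

round(sapply(0:10, Harry), 3)   # all equal to 1/11
c(Ron = sum(sapply(0:10, Ron)),
  Harry = sum(sapply(0:10, Harry)),
  Hermione = sum(sapply(0:10, Hermione)))   # each sums to 1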

After the three have clearly stated their prior beliefs and derived their prior predictions, Dobby throws a coin ten times. The coin comes up heads nine times. In the next section, we discuss the relative predictive performance of Ron, Harry, and Hermione based on these data.

Evaluating predictions

To assess the relative predictive performance of Ron, Harry, and Hermione, we need to compute the probability mass of $y = 9$ for their respective prior predictive distributions. Compared to Ron, Hermione did roughly 19 times better:

Hermione(9)/Ron(9)
## [1] 18.50909

Harry, on the other hand, did about 9 times better than Ron:

Harry(9)/Ron(9)
## [1] 9.309091

With these two comparisons, we also know by how much Hermione outperformed Harry, since by transitivity we have:

\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_0)} \times \frac{p(y \mid \mathcal{M}_0)}{p(y \mid \mathcal{M}_1)} = \text{BF}_{r0} \times \frac{1}{\text{BF}_{10}} \approx 2 \enspace ,

which is indeed correct:

Hermione(9)/Harry(9)
## [1] 1.988281

Note that this is also immediately apparent from the visualizations above, where Hermione’s allocated probability mass is about twice as large as Harry’s for the case where $y = 9$.

Hermione was bold in her prediction, and was rewarded with being favoured by a factor of two in predictive performance. Note that if her predictions had been even bolder, say restricting her prior to $\theta \in [0.80, 1]$, she would have reaped a reward greater than a Bayes factor of two in her favour. Contrast this with Dobby throwing the coin ten times with only one head showing. Then Harry’s marginal likelihood is still ${10 \choose 1} \text{B}(2, 10) = \frac{1}{11}$. However, Hermione’s is not twice as much; instead, it is a mere $0.001065$, which would result in a Bayes factor of about $85$ in favour of Harry!

Harry(1)/Hermione(1)
## [1] 85.33333

This means that with bold predictions, one can also lose a lot. However, this is tremendously insightful, since Hermione would immediately realize where she went wrong. For a discussion that also points out the flexibility of Bayesian model comparison, see Etz, Haaf, Rouder, & Vandekerckhove (2018).

In the next section, we will discover a nice trick which simplifies the computation of the Bayes factor; we do not need to derive marginal likelihoods, but can simply look at the prior and the posterior distribution of the parameter of interest in the unrestricted model.

Prior / Posterior trick

As it turns out, the relative predictive performance of Hermione compared to Harry is given by the ratio of the purple area to the blue area in the figure below.

plot of chunk unnamed-chunk-10

In other words, the Bayes factor in favour of the restricted model (i.e., Hermione) compared to the unrestricted or encompassing model (i.e., Harry) is given by the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. We can check this numerically:

# (1 - pbeta(0.50, 10, 2)) / 0.50 would also work
integrate(function(theta) dbeta(theta, 10, 2), 0.50, 1)$value / 0.50
## [1] 1.988281
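In this Beta-Binomial setting, the two probabilities in the ratio are just Beta distribution functions, so the trick can be packaged into a small helper. This is my own sketch, not code from the original post; lo and hi delimit the interval restriction on $\theta$:

bf_restricted <- function(y, n, lo, hi, a = 1, b = 1) {
  # posterior and prior probability that theta lies in [lo, hi]
  posterior_mass <- pbeta(hi, a + y, b + n - y) - pbeta(lo, a + y, b + n - y)
  prior_mass <- pbeta(hi, a, b) - pbeta(lo, a, b)
  posterior_mass / prior_mass
}

bf_restricted(y = 9, n = 10, lo = 0.50, hi = 1)  # ~1.99, matching the value above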

This is a very cool result which, to my knowledge, was first described in Klugkist & Hoijtink (2005). In the next section, we prove it.

Proof

The proof uses two insights. First, note that we can write the priors in the restricted model, $\mathcal{M}_r$, as priors in the encompassing model, $\mathcal{M}_1$, subject to some constraints. In the Hogwarts prediction context, Hermione’s prior was a restricted version of Harry’s prior:

\begin{aligned}
p(\theta \mid \mathcal{M}_r) &= p(\theta \mid \mathcal{M}_1)\mathbb{I}(0.50, 1) \\[1em]
&= \begin{cases} \frac{p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} & \text{if} \hspace{1em} \theta \in [0.50, 1] \\[1em] 0 & \text{otherwise}\end{cases}
\end{aligned}

plot of chunk unnamed-chunk-12

We have to divide by the term

K = \int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta = 0.50 \enspace ,

so that the restricted prior integrates to 1, as all proper probability distributions must. As a direct consequence, note that the density of a value $\theta = \theta^{\star}$ is given by:

p(\theta^{\star} \mid \mathcal{M}_r) = p(\theta^{\star} \mid \mathcal{M}_1) \cdot \frac{1}{K} \enspace ,

where $K$ is the renormalization constant. This means that we can rewrite terms which include the restricted prior in terms of the unrestricted prior from the encompassing model. This also holds for the posterior!

To see that we can also write the restricted posterior in terms of the unrestricted posterior from the encompassing model, note that the likelihood is the same under both models and that:

\begin{aligned}
p(\theta \mid y, \mathcal{M}_r) &= \frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K}}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \\[.5em]
&= \frac{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta}}{\int_{0.50}^1 \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(\theta \mid y, \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta} \enspace ,
\end{aligned}

where we have to renormalize by

Z = \int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta \enspace ,

which is

1-pbeta(0.50,10,2)
## [1] 0.9941406

The figure below visualizes Harry’s and Hermione’s posteriors. Sensibly, since Hermione excluded all $\theta < 0.50$ in her prior, such values receive zero credence in her posterior. However, the difference between Harry’s and Hermione’s posterior distributions is much smaller than the difference between their priors. This is reflected in $Z$ being close to 1.

plot of chunk unnamed-chunk-14

Similar to the prior, we can write the density of a value $\theta = \theta^\star$ in terms of the encompassing model:

p(\theta^{\star} \mid y, \mathcal{M}_r) = p(\theta^{\star} \mid y, \mathcal{M}_1) \cdot \frac{1}{Z} \enspace .

Now that we have established that we can write both the prior and the posterior density of parameters in the restricted model in terms of the parameters in the unrestricted model, as a second step, note that we can swap the posterior and the marginal likelihood terms in Bayes’ rule such that:

p(y \mid \mathcal{M}_1) = \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)} \enspace ,

from which it follows that:

\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_1)} = \frac{\frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{p(\theta \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)}} \enspace .

Now suppose that we have values that are in line with the restriction, i.e., $\theta = \theta^{\star}$. Then:

\begin{aligned}\text{BF}_{r1} = \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star\mid \mathcal{M}_r)}{p(\theta^\star \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star \mid \mathcal{M}_1) \, \frac{1}{K}}{p(\theta^\star \mid y, \mathcal{M}_1) \, \frac{1}{Z}}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, \frac{1}{K}}{\frac{1}{Z}}}{p(y \mid \theta^\star, \mathcal{M}_1)} = \frac{\frac{1}{K}}{\frac{1}{Z}} = \frac{Z}{K} \enspace ,\end{aligned}

where we have used the previous insights and the fact that the likelihood under $\mathcal{M}_r$ and $\mathcal{M}_1$ is the same. If we expand the constants for our previous problem, we have:

\text{BF}_{r1} = \frac{Z}{K} = \frac{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace ,

which is, as claimed above, the posterior probability of values for $\theta$ that are in line with the restriction divided by the prior probability of values for $\theta$ that are in line with the restriction. Note that this holds for arbitrary restrictions of an arbitrary number of parameters (see Klugkist & Hoijtink, 2005). In the limit where we take the restriction to be infinitesimally small, that is, constrain the parameter to be a point value, this results in the Savage-Dickey density ratio (Wetzels, Grasman, & Wagenmakers, 2010).
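As a quick numerical illustration of the Savage-Dickey limit (my own addition, reusing the functions defined earlier), the Bayes factor for Ron’s point restriction $\theta = 0.50$ against Harry’s unrestricted model equals the ratio of Harry’s posterior to prior density at $\theta = 0.50$:

dbeta(0.50, 10, 2) / dbeta(0.50, 1, 1)   # Savage-Dickey density ratio, ~0.107
Ron(9) / Harry(9)                        # ratio of marginal likelihoods, same value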

In the next section, we apply this idea to a data set that relates Hogwarts Houses to personality traits.

Hogwarts Houses and personality

So, are you a Slytherin, Hufflepuff, Ravenclaw, or Gryffindor? And what does this say about your personality?

Inspired by Crysel et al. (2015), Lea Jakob, Eduardo Garcia-Garzon, Hannes Jarke, and I analyzed self-reported personality data from 847 people as well as their self-reported Hogwarts House affiliation.2 We wanted to answer questions such as: do people who report belonging to Slytherin tend to score highest on Narcissism, Machiavellianism, and Psychopathy? Are Hufflepuffs the most agreeable, and Gryffindors the most extraverted? The Figure below visualizes the raw data.

We used a between-subjects ANOVA as our model and, in the case of, for example, Agreeableness, compared the following hypotheses:

\begin{aligned}
\mathcal{H}_0 &: \mu_H = \mu_G = \mu_R = \mu_S \\[.5em]
\mathcal{H}_r &: \mu_H > (\mu_G , \mu_R , \mu_S) \\[.5em]
\mathcal{H}_1 &: \mu_H , \mu_G , \mu_R , \mu_S
\end{aligned}

We used the BayesFactor R package to compute the Bayes factor in favour of $\mathcal{H}_1$ compared to $\mathcal{H}_0$. For the restricted hypotheses $\mathcal{H}_r$, we used the prior/posterior trick outlined above; and indeed, we found strong evidence in favour of the notion that, for example, Hufflepuffs score highest on Agreeableness. Curious about Slytherin and the other Houses? You can read the published paper with all the details here.
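In practice, applying the trick here amounts to comparing the proportions of prior and posterior draws from the unrestricted model that satisfy the ordering. A generic sketch, my own and not the paper’s code, assuming prior_draws and posterior_draws are matrices of MCMC draws of the four house means with hypothetical column names "H", "G", "R", and "S":

prop_consistent <- function(draws) {
  # proportion of draws in which Hufflepuff's mean exceeds the other three
  mean(draws[, "H"] > draws[, "G"] &
       draws[, "H"] > draws[, "R"] &
       draws[, "H"] > draws[, "S"])
}

BF_r1 <- prop_consistent(posterior_draws) / prop_consistent(prior_draws)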

Conclusion

In a relaxing prediction contest, we saw how three subjective Bayesians named Ron, Harry, and Hermione formalized their beliefs and derived their predictions about the likely outcome of ten coin flips. By restricting her prior beliefs about the bias of the coin to exclude values smaller than $\theta = 0.50$, Hermione was the boldest in her predictions and was ultimately rewarded. However, if the outcome of the coin flips had turned out differently, say $y = 2$, then Hermione would have immediately realized how wrong her beliefs were. I think we as scientists need to be more like Hermione: we need to make more precise predictions, allowing us to construct more powerful tests and “fail” in insightful ways.

We also saw a neat trick by which one can compute the Bayes factor in favour of a restricted model compared to an unrestricted model by estimating the proportion of prior and posterior values of the parameter that are in line with the restriction — no painstaking computation of marginal likelihoods required! We used this trick to find evidence for what we all knew deep in our hearts already: Hufflepuffs are so agreeable.


I would like to thank Sophia Crüwell and Lea Jakob for helpful comments on this blog post.


References

  • Klugkist, I., Kato, B., & Hoijtink, H. (2005). Bayesian model selection using encompassing priors. Statistica Neerlandica, 59(1), 57-69.
  • Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (2010). An encompassing prior generalization of the Savage–Dickey density ratio. Computational Statistics & Data Analysis, 54(9), 2094-2102.
  • Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (2018). Bayesian inference and testing any hypothesis you can specify. Advances in Methods and Practices in Psychological Science, 1(2), 281-295.
  • Crysel, L. C., Cook, C. L., Schember, T. O., & Webster, G. D. (2015). Harry Potter and the measures of personality: Extraverted Gryffindors, agreeable Hufflepuffs, clever Ravenclaws, and manipulative Slytherins. Personality and Individual Differences, 83, 174-179.
  • Jakob, L., Garcia-Garzon, E., Jarke, H., & Dablander, F. (2019). The Science Behind the Magic? The Relation of the Harry Potter “Sorting Hat Quiz” to Personality and Human Values. Collabra: Psychology, 5(1).

Footnotes

  1. The analytical solution is unpleasant↩

  2. You can discover your Hogwarts House affiliation at https://www.pottermore.com/↩


To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.


Just Quickly: The unexpected use of functions as arguments


[This article was first published on Blog on Credibly Curious, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I think I have learnt, forgotten, and then re-learnt this feature of R a few times over the past 4 years. The idea (I think) is this:

  1. R allows you to pass functions as arguments
  2. Functions can be modified inside a function

So what the hell does that mean?

Well, I think I can summarise it down to this crazy piece of magic:

my_fun <- function(x, fun){
  fun(x)
}

Now we can pass in some input, and any function.

Let’s take the storms data from dplyr.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
storms
## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##                     
##  1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi… -1          25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi… -1          30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0           35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi… 0           40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter ,
## #   hu_diameter 

Let’s take the mean of wind:

my_fun(storms$wind, mean)
## [1] 53.495

And we can also take the standard deviation, the variance, or the median:

my_fun(storms$wind, sd)
## [1] 26.21387
my_fun(storms$wind, var)
## [1] 687.1668
my_fun(storms$wind, median)
## [1] 45
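You can also pass an anonymous function defined on the fly; for example (my own addition), the overall range of wind speeds:

my_fun(storms$wind, function(x) max(x) - min(x))  # overall range of wind speeds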

Why would you want to do this?

Let’s say you want to summarise the storms data further, for each month.

We take storms, group by month, then take the mean of wind for each month.

storms %>% 
  group_by(month) %>%
  summarise(wind_summary = mean(wind))
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         45.7
##  2     4         44.6
##  3     5         36.3
##  4     6         37.8
##  5     7         41.2
##  6     8         52.1
##  7     9         58.0
##  8    10         54.6
##  9    11         52.5
## 10    12         47.9

You could repeat the code again, varying mean to be, say, sd:

storms %>% 
  group_by(month) %>%
  summarise(wind_summary = sd(wind))
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         9.08
##  2     4         5.94
##  3     5         9.57
##  4     6        13.4 
##  5     7        19.1 
##  6     8        26.0 
##  7     9        28.2 
##  8    10        25.3 
##  9    11        22.0 
## 10    12        14.6

Over the years, every time I repeat some code like this, I have felt a tug somewhere in my brain – a little spidey sense saying (something like): “Don’t repeat yourself, Nick”.

We can avoid repeating ourselves by using the template from earlier, here with dplyr. We want to vary the summary function used (mean), so we could also take the median, variance, etc.

We can write the following:

storms_wind_summary <- function(fun){
  storms %>%
    group_by(month) %>%
    summarise(wind_summary = fun(wind))
}

And now we can pass the function name, say, mean.

storms_wind_summary(mean)
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         45.7
##  2     4         44.6
##  3     5         36.3
##  4     6         37.8
##  5     7         41.2
##  6     8         52.1
##  7     9         58.0
##  8    10         54.6
##  9    11         52.5
## 10    12         47.9

Or, any other function!

storms_wind_summary(sd)
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         9.08
##  2     4         5.94
##  3     5         9.57
##  4     6        13.4 
##  5     7        19.1 
##  6     8        26.0 
##  7     9        28.2 
##  8    10        25.3 
##  9    11        22.0 
## 10    12        14.6
storms_wind_summary(var)
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         82.5
##  2     4         35.3
##  3     5         91.5
##  4     6        180. 
##  5     7        365. 
##  6     8        678. 
##  7     9        793. 
##  8    10        638. 
##  9    11        482. 
## 10    12        213.
storms_wind_summary(median)
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1         50  
##  2     4         45  
##  3     5         35  
##  4     6         35  
##  5     7         37.5
##  6     8         45  
##  7     9         50  
##  8    10         50  
##  9    11         50  
## 10    12         45

We could even make our own!

range_diff <- function(x){
  diff(range(x))
}
  
storms_wind_summary(range_diff)
## # A tibble: 10 x 2
##    month wind_summary
##            
##  1     1           25
##  2     4           15
##  3     5           35
##  4     6           70
##  5     7          130
##  6     8          140
##  7     9          145
##  8    10          145
##  9    11          120
## 10    12           50

Looks like there was a pretty huge range in July through to November!

Pretty neat, eh? You can manipulate the function itself!

Going slightly further

The above was an example demonstrating how you can manipulate a function being passed.

But, there are other ways to do this with dplyr that I might use instead. We could use summarise_at here, to specify a function in a different, equivalent, way.

storms_wind_summary <- function(fun){
  storms %>%
    group_by(month) %>%
    summarise_at(.vars = vars(wind),
                 .funs = list(fun))
}

storms_wind_summary(mean)
## # A tibble: 10 x 2
##    month  wind
##     
##  1     1  45.7
##  2     4  44.6
##  3     5  36.3
##  4     6  37.8
##  5     7  41.2
##  6     8  52.1
##  7     9  58.0
##  8    10  54.6
##  9    11  52.5
## 10    12  47.9
storms_wind_summary(median)
## # A tibble: 10 x 2
##    month  wind
##     
##  1     1  50  
##  2     4  45  
##  3     5  35  
##  4     6  35  
##  5     7  37.5
##  6     8  45  
##  7     9  50  
##  8    10  50  
##  9    11  50  
## 10    12  45

What if we want to provide many functions? Say, the mean, median, sd, and variance, all together, as they belong?

We can do this.

This is done by passing dots (ellipsis) ... to the function. This allows for any number of inputs.

storms_wind_summary <- function(...){
  storms %>%
    group_by(month) %>%
    summarise_at(.vars = vars(wind),
                 .funs = list(...))
}


storms_wind_summary(median, mean, max)
## # A tibble: 10 x 4
##    month   fn1   fn2   fn3
##       
##  1     1  50    45.7    55
##  2     4  45    44.6    50
##  3     5  35    36.3    60
##  4     6  35    37.8    80
##  5     7  37.5  41.2   140
##  6     8  45    52.1   150
##  7     9  50    58.0   160
##  8    10  50    54.6   160
##  9    11  50    52.5   135
## 10    12  45    47.9    75
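As an aside (my own addition, and it assumes a newer dplyr, version 1.0.0 or later, than the post otherwise requires), the same idea can be written with across(), and naming the functions in the list gives more informative column names:

storms_wind_summary_across <- function(...) {
  storms %>%
    group_by(month) %>%
    summarise(across(wind, list(...)))
}

storms_wind_summary_across(median = median, mean = mean, max = max)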

What’s the point of this?

So, this might not be the most useful summary of the storms data…and writing functions like this might not be the most general use case. dplyr provides some amazingly flexible syntax to summarise data. Sometimes the answer isn’t writing a function, and you should be wary of simply replicating the flexibility of dplyr itself.

That said, with a task like this, or any section of code, I really think it can be useful to wrap it in a function that describes more broadly what that section does. And, with features like the ones I wrote about here, I think you can more clearly and flexibly wrap up these steps for your own use.

R is flexible enough to make that quite straightforward, and I think that is pretty darn neat!

Fin

Go forth, and use the power of functions in your work!

PS

Upon reflection, I’m pretty sure Mitchell O’Hara-Wild was the one who helped really solidify this into my brain. Thanks, Mitch!


To leave a comment for the author, please follow the link and comment on their blog: Blog on Credibly Curious.


Does news coverage boost support for presidential candidates in the Democratic primary?


[This article was first published on R on Jacob Long, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Matt Grossmann noted the close relationship between the amount of news coverage candidates in the Democratic primary have been receiving and their polling numbers.

Presidential candidate poll standing & prior week cable TV news share are correlated at .9

Harris got a bump in news coverage after the 1st debate that faded with her polls; Warren's news coverage rose relative to Sanders with her polls; Biden leads both weekly pic.twitter.com/Z5njfyYdJg

— Matt Grossmann (@MattGrossmann) September 27, 2019

This got me thinking about what the available data can bring to bear on this question. I have ongoing interest in longitudinal data and the software for analyzing it, so this seemed like a fun, quick project. Luckily, there are several great resources to take the pain out of data collection in this case.

The GDELT project offers a TV API that allows anyone to look at how much cable news channels mention the candidates (specifically, the number of 15-second windows of coverage that mention the candidate by name). Media Cloud also lets you look at how often candidates are mentioned in online news articles. Helpfully, the fine folks at FiveThirtyEight have compiled these data as well as polls, already.

Now I’m going to walk through how to get these data into R. Skip to the analysis by clicking here.

Getting the data

As mentioned, FiveThirtyEight has compiled most of the data we’re interested in, albeit in different places. We will read them into R as separate data frames and join them later. There are some warnings from the CSV parser but they aren’t important for our purposes.

library(tidyverse)
library(jtools)
library(tsibble)

cable_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/cable_weekly.csv")
online_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/online_weekly.csv")

# Immediately convert `end_date` to date class
polls <- read_csv("https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv")

Now we have the data, but we still have to get it in shape. First, we deal with the polls.

Polls

These data are formatted such that every row is a unique combination of candidate and poll. So if a poll included 20 candidates, there would be 20 rows to cover the results of that single poll. This is actually a good thing for our purposes.

I first create two vectors of candidate names. The first is the candidates who will be retained for analysis, in the format they are named in the polling data. The second is the same set of candidates, but with their less formal names that are used in the media data.

candidates <- c("Amy Klobuchar", "Andrew Yang", "Bernard Sanders",
                "Beto O'Rourke", "Bill de Blasio", "Cory A. Booker",
                "Elizabeth Warren", "Eric Swalwell", "Jay Robert Inslee",
                "Joe Sestak", "John Hickenlooper", "John K. Delaney",
                "Joseph R. Biden Jr.", "Julián Castro", "Kamala D. Harris",
                "Kirsten E. Gillibrand", "Marianne Williamson",
                "Michael F. Bennet", "Pete Buttigieg", "Seth Moulton",
                "Steve Bullock", "Tim Ryan", "Tom Steyer", "Tulsi Gabbard",
                "Wayne Messam")

candidates_clean <- c("Amy Klobuchar", "Andrew Yang", "Bernie Sanders",
                      "Beto O'Rourke", "Bill de Blasio", "Cory Booker",
                      "Elizabeth Warren", "Eric Swalwell", "Jay Inslee",
                      "Joe Sestak", "John Hickenlooper", "John Delaney",
                      "Joe Biden", "Julian Castro", "Kamala Harris",
                      "Kirsten Gillibrand", "Marianne Williamson",
                      "Michael Bennet", "Pete Buttigieg", "Seth Moulton",
                      "Steve Bullock", "Tim Ryan", "Tom Steyer",
                      "Tulsi Gabbard", "Wayne Messam")

Now we do some filtering and data cleaning for polls. See the inline comments for some explanations, but basically we’re using only polls of known quality, that cover the time period for which we have media data, and only national polls.

polls <- polls %>%
  # Convert date to date format
  mutate(end_date = as.Date(end_date, format = "%m/%d/%y")) %>%
  filter(
    # include only polls of at least modest quality
    fte_grade %in% c("C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"),
    # only include polls ending on or after 12/30/2018
    end_date >= as.Date("12/30/2018", "%m/%d/%Y"),
    # only include *Democratic* primary polls
    party == "DEM",
    # only include the selected candidates
    candidate_name %in% candidates,
    # only national polls
    is.na(state),
    # Exclude some head-to-head results, etc.
    notes %nin% c("head-to-head poll",
                  "HarrisX/SR Democrat LV, definite voter",
                  "open-ended question")
  ) %>%
  mutate(
    # Have to add 1 to the date to accommodate tsibble's yearweek()
    # starting on Monday rather than Sunday like our other data sources
    week = as.Date(yearweek(end_date + 1)) - 1,
    # Convert candidate names to factor so I can relabel them
    candidate_name = factor(candidate_name, levels = candidates, labels = candidates_clean)
  )

Now we aggregate by week, forming a weekly polling average by candidate. If we were trying to build a forecast, we would do this in a better way that wouldn’t have so much variation. For now, all I do is weight the results by (logged) sample size. Note that pct refers to the percentage of the “votes” the candidate received in the poll.

polls_agg <- polls %>%
  group_by(week, candidate_name) %>%
  summarize(
    pct_polls = weighted.mean(pct, log(sample_size))
  )

For a quick sanity check, let’s plot these data to see if things line up (I omit the relatively lower-polling candidates for simplicity).

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(polls_agg, candidate_name %in% top_candidates),
       aes(x = week, y = pct_polls, color = candidate_name)) +
  geom_line() +
  theme_nice()
[Figure: unnamed-chunk-6-1.png]

Okay, it’s a bit more variable than other aggregators but it’s showing us the same basic trends.

Media

We have two data frames with media coverage info, cable_mentions and online_mentions. These are in much better shape to begin with, but we do need to combine them and make a couple of changes. Each row in these data represents a candidate and week, so there are $weeks \times candidates$ rows.

This is a good example of a time to use an inner join. Note that our key variables are, of all news clips/articles that mention any candidate, the proportion that mention the candidate in question. In other words, we’re ignoring variation in how much the primary gets discussed in the news and instead focusing on how big each candidate’s share of the coverage is.

media <- 
  inner_join(cable_mentions, online_mentions, by = c("date", "name")) %>%
  mutate(
    # Create new variables that put the media coverage variables on 
    # same scale as poll numbers
    pct_cable = 100 * pct_of_all_candidate_clips,
    pct_online = 100 * pct_of_all_candidate_stories
  )

Let’s look at the trends for cable news…

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(media, name %in% top_candidates),
       aes(x = date, y = pct_cable, color = name)) +
  geom_line() +
  theme_nice()
[Figure: unnamed-chunk-8-1.png]

This looks a bit similar to the polling trends, although more variable over time.

And now online news…

library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
                    "Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
                    "Cory Booker")
ggplot(filter(media, name %in% top_candidates),
       aes(x = date, y = pct_online, color = name)) +
  geom_line() +
  theme_nice()
[Figure: unnamed-chunk-9-1.png]

This one’s a bit more all over the place, with the minor candidates especially having higher highs.

Combine data

Now we just need to get all this information in the same place for analysis. More inner joins!

joined <- inner_join(polls_agg, media,                      by = c("candidate_name" = "name", "week" = "date"))

Now we have everything in a single data frame where each row represents one week and one candidate. To make things work for statistical analysis, I’m going to do a couple conversions — one to the panel_data format, from my panelr package, and another to pdata.frame format, from the plm package. We’ll be using both packages for analysis.

library(panelr)

# panel_data needs a number or ordered factor as wave variable
joined$wave <- as.ordered(joined$week)
joined_panel <- panel_data(ungroup(joined), id = candidate_name, wave = wave)
joined_pdata <- as_pdata.frame(joined_panel)

Analysis

Okay, so we have multiple time series for each candidate: their status in the polls, how much of the cable news coverage they’re getting, and how much of the online news coverage they’re getting. We’d like to know whether any of these are causing the others. Most interesting is whether the news coverage drives better results in the polls.

The kinds of analyses we can do all share the idea of comparing each candidate to himself or herself in the past. If Elizabeth Warren’s share of news coverage goes from 10% to 12%, up 2 percentage points, where do we expect her share in the polls to go? If her polling share goes from 15% to 17%, that is also a gain of 2 percentage points. This is treated the same as Andrew Yang going from 0% to 2% of news coverage and then seeing his polls go from 1% to 3%.

Of course, this still doesn’t sort out the problem of reverse causality. If we see that news coverage and polls change at the same time, it’s not obvious which caused the other (and we’ll ignore the possibility that something else caused both for the time being). There are several methods for dealing with this and I’ll focus on ones that use past values of polls to predict future ones.

Fixed effects models

Fixed effects models are a common way to remove the influence of certain kinds of confounding variables, like a candidate’s pre-existing popularity. It doesn’t fix the problem of confounders that change over time (like a change in the candidate’s campaign strategy or a new scandal), but it’s a workhorse model for longitudinal data.

The process we’re looking at is dynamic, meaning candidates’ support in the past affects the present; people don’t pick their favorite candidate anew every week, they have a favorite candidate who will remain in that position unless something changes. We model this statistically by using last week’s polling average as a predictor of this week’s polling average. In the panel data literature, using so-called fixed effects models with a lagged value of the dependent variable is a big no-no. This is because of something called Nickell bias, which basically means that models like this give you wrong results in a predictable way.

Luckily, these data are not quite the same as the kind that the Nickell bias affects the most. We have 24 candidates with up to 38 weeks of data for each. The Nickell bias tends to be most problematic when you have relatively few time points and relatively many people (in this case candidates). So we’ll start with fixed effects models and assume the Nickell bias isn’t too serious.
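In rough equation form (my own notation; $i$ indexes candidates, $t$ indexes weeks, and $\alpha_i$ is the candidate fixed effect), the model being estimated is

$\text{polls}_{it} = \alpha_i + \rho \, \text{polls}_{i,t-1} + \beta_1 \, \text{cable}_{it} + \beta_2 \, \text{cable}_{i,t-1} + \beta_3 \, \text{online}_{it} + \beta_4 \, \text{online}_{i,t-1} + \varepsilon_{it}$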

I’m going to use the wbm() function from my panelr package to do this analysis.

fix_mod <- wbm(pct_polls ~ lag(pct_polls) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online),
               data = joined_panel, model = "fixed")
summary(fix_mod)
MODEL INFO:
Entities: 24
Time periods: 2019-01-13-2019-09-15
Dependent variable: pct_polls
Model type: Linear mixed effects
Specification: within

MODEL FIT:
AIC = 2233.15, BIC = 2269.64
Pseudo-R² (fixed effects) = 0.03
Pseudo-R² (total) = 0.98
Entity ICC = 0.98

-------------------------------------------------------------
                         Est.   S.E.   t val.     d.f.      p
--------------------- ------- ------ -------- -------- ------
(Intercept)              3.85   1.45     2.65    23.01   0.01
lag(pct_polls)           0.64   0.03    24.74   678.01   0.00
pct_cable                0.08   0.01     6.29   678.01   0.00
lag(pct_cable)           0.05   0.01     4.05   678.01   0.00
pct_online              -0.03   0.01    -2.29   678.01   0.02
lag(pct_online)         -0.02   0.01    -1.39   678.01   0.17
-------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------------
     Group         Parameter    Std. Dev. 
---------------- ------------- -----------
 candidate_name   (Intercept)     7.108   
    Residual                      1.007   
------------------------------------------

Here’s what the output is saying:

  • First of all, there’s evidence of momentum. If your poll numbers went up last week, all else being equal they’ll probably be up this week too.
  • Gains in cable news coverage both this week and last week are associated with gains in the polls this week.
  • Gains in online news coverage this week are associated (very weakly) with declines in the polls this week, assuming no change in cable news coverage.

I will note that as far as the online coverage is concerned, if I drop cable news coverage from the model then suddenly online coverage appears to have a positive effect. I think what’s going on there is that both online and cable news cover candidates in a way that helps them, but online coverage is sometimes harmful in a way that is not true of cable coverage. Either that or there’s just a lot more noise in the online data.

Adjusting for trends

This was the simplest analysis I can do. I can also try to remove any trends in the data to try to account for something that isn’t in the model that drives some candidates up or down over time. Basically, for each candidate we subtract their over-time trend from each week’s polling numbers and news coverage and see if deviations from their trend predict each other.

The risk with this approach is that it really is news that has most of the influence and you’re modeling away some of the “real” effects along with the stuff you don’t want around.

fix_mod <- wbm(pct_polls ~ lag(pct_polls) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online),
               data = joined_panel, model = "fixed",
               detrend = TRUE)
summary(fix_mod)
MODEL INFO:
Entities: 24
Time periods: 2019-01-13-2019-09-15
Dependent variable: pct_polls
Model type: Linear mixed effects
Specification: within

MODEL FIT:
AIC = 2169.99, BIC = 2206.48
Pseudo-R² (fixed effects) = 0.91
Pseudo-R² (total) = 0.97
Entity ICC = 0.7

-------------------------------------------------------------
                         Est.   S.E.   t val.     d.f.      p
--------------------- ------- ------ -------- -------- ------
(Intercept)              1.01   0.34     2.95    16.39   0.01
lag(pct_polls)           0.69   0.02    29.60   339.36   0.00
pct_cable                0.08   0.01     6.43   683.75   0.00
lag(pct_cable)           0.04   0.01     3.52   688.28   0.00
pct_online              -0.03   0.01    -2.11   692.90   0.04
lag(pct_online)         -0.01   0.01    -1.00   690.93   0.32
-------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------------
     Group         Parameter    Std. Dev. 
---------------- ------------- -----------
 candidate_name   (Intercept)     1.559   
    Residual                      1.012   
------------------------------------------

Okay, same story here. Some good evidence of cable news helping and some very weak evidence of online news possibly hurting.

Driven by minor candidates?

Responding to Grossmann’s tweet, Jonathan Ladd raises an interesting question:

I wonder how much of this is driven only by the non-Biden candidates, since it seems to show that much of poll movement is driven by name recognition and need to coordinate on a non-Biden alternative.

— Jonathan Ladd (@jonmladd) September 28, 2019

There are a couple of ways to look at this. First of all, let’s think about this as less of a Biden vs. all others phenomenon and more about whether this effect of news on candidate support is concentrated among those with relatively low support.

We can deal with this via an interaction effect, seeing whether the effects are stronger or weaker among candidates with higher/lower absolute levels of support. I need to fit a slightly different model here to accommodate the inclusion of the lagged dependent variable without subtracting its mean (as is done for the conventional fixed effects analysis). Our focus will be on the “within” effects and cross-level interactions in the output below.

int_mod <- wbm(pct_polls ~ 
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online) | lag(pct_polls) |
                 lag(pct_polls) * pct_cable +
                 lag(pct_polls) * lag(pct_cable) +
                 lag(pct_polls) * pct_online +
                 lag(pct_polls) * lag(pct_online),
               data = joined_panel, model = "w-b")
summary(int_mod)
MODEL INFO:
Entities: 24
Time periods: 2019-01-13-2019-09-15
Dependent variable: pct_polls
Model type: Linear mixed effects
Specification: within-between

MODEL FIT:
AIC = 2109.28, BIC = 2173.14
Pseudo-R² (fixed effects) = 0.98
Pseudo-R² (total) = 0.98
Entity ICC = 0.21

WITHIN EFFECTS:
-------------------------------------------------------------
                         Est.   S.E.   t val.     d.f.      p
--------------------- ------- ------ -------- -------- ------
pct_cable                0.09   0.02     4.68   672.05   0.00
lag(pct_cable)           0.02   0.02     1.20   671.39   0.23
pct_online              -0.01   0.02    -0.54   673.16   0.59
lag(pct_online)          0.04   0.02     2.55   672.97   0.01
-------------------------------------------------------------

BETWEEN EFFECTS:
---------------------------------------------------------------
                           Est.   S.E.   t val.     d.f.      p
----------------------- ------- ------ -------- -------- ------
(Intercept)               -0.16   0.17    -0.92    18.51   0.37
imean(pct_cable)           0.31   0.03     9.27    38.40   0.00
imean(pct_online)          0.00   0.02     0.03    17.29   0.98
lag(pct_polls)             0.63   0.02    26.58   589.12   0.00
---------------------------------------------------------------

CROSS-LEVEL INTERACTIONS:
----------------------------------------------------------------------------
                                        Est.   S.E.   t val.     d.f.      p
------------------------------------ ------- ------ -------- -------- ------
pct_cable:lag(pct_polls)                0.00   0.00     0.43   671.50   0.67
lag(pct_cable):lag(pct_polls)           0.00   0.00     4.05   674.01   0.00
pct_online:lag(pct_polls)              -0.00   0.00    -2.20   671.51   0.03
lag(pct_online):lag(pct_polls)         -0.01   0.00    -5.66   674.15   0.00
----------------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------------
     Group         Parameter    Std. Dev. 
---------------- ------------- -----------
 candidate_name   (Intercept)    0.4972   
    Residual                      0.955   
------------------------------------------

Okay so there’s a lot going on here. First of all, we see that the instantaneous effect of changes in cable news coverage does not appear to depend on the candidate’s previous standing in the polls. For the other interaction terms, we have some evidence of the effects changing depending on the candidate’s standing in the polls.

Let’s examine them one by one, with help from my interactions package. I’ll show predicted values of poll numbers depending on different values of news coverage to give a gist of what’s going on.

Last week’s cable news coverage

Each line represents the predicted standing in this week’s polls at different levels of last week’s standing in the polls. What we really care about is the slope of the lines.

library(interactions)
interact_plot(int_mod, `lag(pct_cable)`, `lag(pct_polls)`,
              modx.values = c(2, 10, 20),
              x.label = "Last week's % change in cable news coverage",
              y.label = "This week's polling average",
              legend.main = "Last week's polling average")
[Figure: interact_plot output — predicted polling average vs. last week's change in cable news coverage, by last week's polling average]

So what we see here is that the higher a candidate’s standing in the polls, the more they benefit from news coverage! This stands somewhat in contradiction to Ladd’s speculation. Another way to think about it is that these changes in news coverage tend to have more staying power for candidates with more support.

Last week’s online coverage

For last week's online coverage, we see in the model output that for a candidate with hypothetically zero polling support, increases in online news coverage are good for future polling, but there's a negative interaction term. Let's look at how that plays out.

interact_plot(int_mod, `lag(pct_online)`, `lag(pct_polls)`,
              modx.values = c(2, 10, 20),
              x.label = "Last week's % change in online news coverage",
              y.label = "This week's polling average",
              legend.main = "Last week's polling average")
[Figure: interact_plot output — predicted polling average vs. last week's change in online news coverage, by last week's polling average]

Here we see that for higher polling candidates, the lagged changes in online coverage are a detriment while for lower polling candidates, such changes are a much-needed (small) boost.

This week’s online coverage

Let’s do the same test with the effect of this week’s online coverage on this week’s polls.

interact_plot(int_mod, pct_online, `lag(pct_polls)`,
              modx.values = c(2, 10, 20),
              x.label = "This week's % change in online news coverage",
              y.label = "This week's polling average",
              legend.main = "Last week's polling average")
[Figure: interact_plot output — predicted polling average vs. this week's change in online news coverage, by last week's polling average]

Quite similar to last week’s online coverage, except not even the low-polling candidates seem to benefit.

Just drop Biden from the analysis

Another thing we can do is just drop Biden, who for most of the campaign cycle has dominated the polls and news coverage.

no_biden <- wbm(pct_polls ~ lag(pct_polls) +
                  pct_cable + lag(pct_cable) +
                  pct_online + lag(pct_online),
                data = filter(joined_panel, candidate_name != "Joe Biden"),
                model = "fixed")
summary(no_biden)
MODEL INFO:Entities: 23Time periods: 2019-01-13-2019-09-15Dependent variable: pct_pollsModel type: Linear mixed effectsSpecification: withinMODEL FIT:AIC = 1879.42, BIC = 1915.49Pseudo-R² (fixed effects) = 0.07Pseudo-R² (total) = 0.97Entity ICC = 0.96-------------------------------------------------------------                         Est.   S.E.   t val.     d.f.      p--------------------- ------- ------ -------- -------- ------(Intercept)              2.70   0.92     2.93    22.01   0.01lag(pct_polls)           0.65   0.02    26.80   643.01   0.00pct_cable                0.10   0.01     7.91   643.02   0.00lag(pct_cable)           0.04   0.01     2.98   643.01   0.00pct_online              -0.02   0.01    -1.92   643.02   0.05lag(pct_online)          0.01   0.01     1.30   643.01   0.20-------------------------------------------------------------p values calculated using Satterthwaite d.f. RANDOM EFFECTS:------------------------------------------     Group         Parameter    Std. Dev. ---------------- ------------- ----------- candidate_name   (Intercept)     4.414       Residual                     0.8483   ------------------------------------------

And in this case, the results are basically the same, although the benefits of news coverage are perhaps a bit stronger.

A more advanced model

Let’s push a bit further to make sure we’re not making a mistake on the basic claim that (cable) news coverage appears to be beneficial. A more robust approach is to use an analysis that more deliberately addresses these issues of reverse causality and endogeneity.

Normally, I’d reach for the dynamic panel models featured in my dpm package, but these can’t handle data with so many time points and so few people. Instead, I’ll use the Arellano-Bond estimator1, which the models in dpm were meant to replace — they are both unbiased, but Arellano-Bond models tend to be inefficient. In other words, this method is more conservative.

For this, I need the plm package and its pgmm() function. I’ll skip the technicalities and just say the interpretations will be similar to what I just did, but the underlying algorithm is more rigorous at ruling out reverse causality.

library(plm)
ab_mod <- pgmm(pct_polls ~ lag(pct_polls, 1) +
                 pct_cable + lag(pct_cable) +
                 pct_online + lag(pct_online) |
                 lag(pct_polls, 2:15),
               data = joined_pdata, effect = "individual", model = "twosteps",
               transformation = "ld")
summary(ab_mod)
Oneway (individual) effect Two steps modelCall:pgmm(formula = pct_polls ~ lag(pct_polls, 1) + pct_cable + lag(pct_cable) +     pct_online + lag(pct_online) | lag(pct_polls, 2:15), data = joined_pdata,     effect = "individual", model = "twosteps", transformation = "ld")Unbalanced Panel: n = 24, T = 11-37, N = 731Number of Observations Used: 1361Residuals:     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. -11.44106  -0.27460   0.00000  -0.00144   0.25889   9.50637 Coefficients:                    Estimate Std. Error z-value  Pr(>|z|)    lag(pct_polls, 1)  0.8953903  0.0196164 45.6449 < 2.2e-16 ***pct_cable          0.0741411  0.0165264  4.4862  7.25e-06 ***lag(pct_cable)     0.0109705  0.0065706  1.6696   0.09499 .  pct_online        -0.0108026  0.0120645 -0.8954   0.37057    lag(pct_online)    0.0095829  0.0140126  0.6839   0.49405    ---Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Sargan test: chisq(437) = 18.55703 (p-value = 1)Autocorrelation test (1): normal = -2.282824 (p-value = 0.022441)Autocorrelation test (2): normal = -1.182395 (p-value = 0.23705)Wald test for coefficients: chisq(5) = 146865.5 (p-value = < 2.22e-16)

Okay so what does this all mean? Basically, the same story we saw with the other, simpler analyses.

Conclusions

Does news coverage help candidates in the Democratic primary race? Probably. There are some limitations of the analyses at hand. It is possible, for instance, that there is something else that changes the news coverage. In fact, that is likely — early on, it appeared Elizabeth Warren drove news coverage by releasing new policy proposals on a fairly frequent schedule. Did the policy proposals themselves boost her support rather than the news coverage of them? That’s hard to separate, especially given the kind of birds-eye view we’re taking here. We’re not saying what’s in the news coverage.

Matt Grossmann suggested sentiment analysis:

Warren has had more TV coverage than Sanders over the last 7 weeks. Anecdotally, it seems to have been much more positive media coverage than for Biden & Sanders, enabling her rise. Is anyone doing real-time media sentiment analysis?

— Matt Grossmann (@MattGrossmann) September 28, 2019

and that’s probably a wise choice. Maybe once I’m off the job market! 😄


  1. Actually, I’ll use the Blundell-Bond estimator, which is a tweaked version that is a bit more efficient. ^


To leave a comment for the author, please follow the link and comment on their blog: R on Jacob Long.



Tidy forecasting in R


[This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The fable package for doing tidy forecasting in R is now on CRAN. Like tsibble and feasts, it is also part of the tidyverts family of packages for analysing, modelling and forecasting many related time series (stored as tsibbles).

For a brief introduction to tsibbles, see this post from last month.

Here we will forecast Australian tourism data by state/region and purpose. This data is stored in the tourism tsibble where Trips contains domestic visitor nights in thousands.

library(tidyverse)
library(tsibble)
library(lubridate)
library(fable)
tourism
## # A tsibble: 24,320 x 5 [1Q]## # Key:       Region, State, Purpose [304]##    Quarter Region   State           Purpose  Trips##                          ##  1 1998 Q1 Adelaide South Australia Business  135.##  2 1998 Q2 Adelaide South Australia Business  110.##  3 1998 Q3 Adelaide South Australia Business  166.##  4 1998 Q4 Adelaide South Australia Business  127.##  5 1999 Q1 Adelaide South Australia Business  137.##  6 1999 Q2 Adelaide South Australia Business  200.##  7 1999 Q3 Adelaide South Australia Business  169.##  8 1999 Q4 Adelaide South Australia Business  134.##  9 2000 Q1 Adelaide South Australia Business  154.## 10 2000 Q2 Adelaide South Australia Business  169.## # … with 24,310 more rows

There are 304 combinations of Region, State and Purpose, each one defining a time series of 80 observations.
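If you want to verify this structure yourself, the key metadata can be inspected directly (a quick sketch; n_keys() comes from the tsibble package loaded above):

n_keys(tourism)                    # 304 key combinations
tourism %>%
  as_tibble() %>%
  count(Region, State, Purpose)    # each key should have n = 80 quarters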

To simplify the outputs, we will abbreviate the state names.

tourism <- tourism %>%
  mutate(
    State = recode(State,
      "Australian Capital Territory" = "ACT",
      "New South Wales" = "NSW",
      "Northern Territory" = "NT",
      "Queensland" = "QLD",
      "South Australia" = "SA",
      "Tasmania" = "TAS",
      "Victoria" = "VIC",
      "Western Australia" = "WA"
    )
  )

Forecasting a single time series

Although the fable package is designed to handle many time series, we will begin by demonstrating its use on a single time series. For this purpose, we will extract the tourism data for holidays in the Snowy Mountains region of NSW.

snowy <- tourism %>%
  filter(
    Region == "Snowy Mountains",
    Purpose == "Holiday"
  )
snowy
## # A tsibble: 80 x 5 [1Q]## # Key:       Region, State, Purpose [1]##    Quarter Region          State Purpose Trips##                      ##  1 1998 Q1 Snowy Mountains NSW   Holiday 101. ##  2 1998 Q2 Snowy Mountains NSW   Holiday 112. ##  3 1998 Q3 Snowy Mountains NSW   Holiday 310. ##  4 1998 Q4 Snowy Mountains NSW   Holiday  89.8##  5 1999 Q1 Snowy Mountains NSW   Holiday 112. ##  6 1999 Q2 Snowy Mountains NSW   Holiday 103. ##  7 1999 Q3 Snowy Mountains NSW   Holiday 254. ##  8 1999 Q4 Snowy Mountains NSW   Holiday  74.9##  9 2000 Q1 Snowy Mountains NSW   Holiday 118. ## 10 2000 Q2 Snowy Mountains NSW   Holiday 114. ## # … with 70 more rows
snowy %>% autoplot(Trips)

For this data set, a reasonable benchmark forecast method is the seasonal naive method, where forecasts are set to be equal to the last observed value from the same quarter. Alternative models for this series are ETS and ARIMA models. All these can be included in a single call to the model() function like this.

fit <- snowy %>%
  model(
    snaive = SNAIVE(Trips ~ lag("year")),
    ets = ETS(Trips),
    arima = ARIMA(Trips)
  )
fit
## # A mable: 1 x 6## # Key:     Region, State, Purpose [1]##   Region          State Purpose snaive   ets         arima                 ##                                        ## 1 Snowy Mountains NSW   Holiday  

The returned object is called a “mable” or model table, where each cell corresponds to a fitted model. Because we have only fitted models to one time series, this mable has only one row.
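As an aside, the fitted models inside a mable can be summarised with the usual tidy accessors (a sketch; glance() and tidy() are re-exported by fabletools, and the columns returned depend on the model class):

glance(fit)
tidy(fit)    # coefficient-level summaries, where available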

To forecast all models, we pass the object to the forecast function.

fc <- fit %>%
  forecast(h = 12)
fc
## # A fable: 36 x 7 [1Q]## # Key:     Region, State, Purpose, .model [3]##    Region          State Purpose .model Quarter Trips .distribution##                                ##  1 Snowy Mountains NSW   Holiday snaive 2018 Q1 119.  N(119,  666) ##  2 Snowy Mountains NSW   Holiday snaive 2018 Q2 124.  N(124,  666) ##  3 Snowy Mountains NSW   Holiday snaive 2018 Q3 378.  N(378,  666) ##  4 Snowy Mountains NSW   Holiday snaive 2018 Q4  84.7 N( 85,  666) ##  5 Snowy Mountains NSW   Holiday snaive 2019 Q1 119.  N(119, 1331) ##  6 Snowy Mountains NSW   Holiday snaive 2019 Q2 124.  N(124, 1331) ##  7 Snowy Mountains NSW   Holiday snaive 2019 Q3 378.  N(378, 1331) ##  8 Snowy Mountains NSW   Holiday snaive 2019 Q4  84.7 N( 85, 1331) ##  9 Snowy Mountains NSW   Holiday snaive 2020 Q1 119.  N(119, 1997) ## 10 Snowy Mountains NSW   Holiday snaive 2020 Q2 124.  N(124, 1997) ## # … with 26 more rows

The return object is a “fable” or forecast table with the following characteristics:

  • the .model column becomes an additional key;
  • the .distribution column contains the estimated probability distribution of the response variable in future time periods;
  • the Trips column contains the point forecasts equal to the mean of the probability distribution.

The autoplot() function will produce a plot of all forecasts. By default, level=c(80,95) so 80% and 95% prediction intervals are shown. But to avoid clutter, we will set level=NULL to show no prediction intervals.

fc %>%
  autoplot(snowy, level = NULL) +
  ggtitle("Forecasts for Snowy Mountains holidays") +
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast"))

If you want to compute the prediction intervals, the hilo() function can be used:

hilo(fc, level = 95)
## # A tsibble: 36 x 7 [1Q]## # Key:       Region, State, Purpose, .model [3]##    Region          State Purpose .model Quarter Trips          `95%`##                                 ##  1 Snowy Mountains NSW   Holiday snaive 2018 Q1 119.  [ 68.5, 170]95##  2 Snowy Mountains NSW   Holiday snaive 2018 Q2 124.  [ 73.9, 175]95##  3 Snowy Mountains NSW   Holiday snaive 2018 Q3 378.  [327.6, 429]95##  4 Snowy Mountains NSW   Holiday snaive 2018 Q4  84.7 [ 34.1, 135]95##  5 Snowy Mountains NSW   Holiday snaive 2019 Q1 119.  [ 47.5, 191]95##  6 Snowy Mountains NSW   Holiday snaive 2019 Q2 124.  [ 53.0, 196]95##  7 Snowy Mountains NSW   Holiday snaive 2019 Q3 378.  [306.6, 450]95##  8 Snowy Mountains NSW   Holiday snaive 2019 Q4  84.7 [ 13.1, 156]95##  9 Snowy Mountains NSW   Holiday snaive 2020 Q1 119.  [ 31.4, 207]95## 10 Snowy Mountains NSW   Holiday snaive 2020 Q2 124.  [ 36.9, 212]95## # … with 26 more rows

Forecasting many series

To scale this up to include all series in the tourism data set requires no more work — we use exactly the same code.

fit <- tourism %>%
  model(
    snaive = SNAIVE(Trips ~ lag("year")),
    ets = ETS(Trips),
    arima = ARIMA(Trips)
  )
fit
## # A mable: 304 x 6## # Key:     Region, State, Purpose [304]##    Region       State Purpose  snaive   ets        arima                   ##                                        ##  1 Adelaide     SA    Business     ##  4 Adelaide     SA    Visiting     ##  6 Adelaide Hi… SA    Holiday             ##  7 Adelaide Hi… SA    Other     ##  8 Adelaide Hi… SA    Visiting            ##  9 Alice Sprin… NT    Business  ## 10 Alice Sprin… NT    Holiday   ## # … with 294 more rows

Now the mable includes models for every combination of keys in the tourism data set.

We can extract information about some specific model using the filter, select and report functions.

fit %>%
  filter(Region == "Snowy Mountains", Purpose == "Holiday") %>%
  select(arima) %>%
  report()
## Series: Trips ## Model: ARIMA(1,0,0)(0,1,2)[4] ## ## Coefficients:##         ar1    sma1    sma2##       0.216  -0.371  -0.190## s.e.  0.116   0.128   0.116## ## sigma^2 estimated as 592.9:  log likelihood=-350## AIC=707   AICc=708   BIC=716

When the mable is passed to the forecast() function, forecasts are computed for every model and every key combination.

fc <- fit %>%
  forecast(h = "3 years")
fc
## # A fable: 10,944 x 7 [1Q]## # Key:     Region, State, Purpose, .model [912]##    Region   State Purpose  .model Quarter Trips .distribution##                          ##  1 Adelaide SA    Business snaive 2018 Q1  129. N(129, 2018) ##  2 Adelaide SA    Business snaive 2018 Q2  174. N(174, 2018) ##  3 Adelaide SA    Business snaive 2018 Q3  185. N(185, 2018) ##  4 Adelaide SA    Business snaive 2018 Q4  197. N(197, 2018) ##  5 Adelaide SA    Business snaive 2019 Q1  129. N(129, 4036) ##  6 Adelaide SA    Business snaive 2019 Q2  174. N(174, 4036) ##  7 Adelaide SA    Business snaive 2019 Q3  185. N(185, 4036) ##  8 Adelaide SA    Business snaive 2019 Q4  197. N(197, 4036) ##  9 Adelaide SA    Business snaive 2020 Q1  129. N(129, 6054) ## 10 Adelaide SA    Business snaive 2020 Q2  174. N(174, 6054) ## # … with 10,934 more rows

Note the use of natural language to specify the forecast horizon. The forecast() function is able to interpret many different time specifications. For quarterly data, h = "3 years" is equivalent to setting h = 12.

Plots of individual forecasts can also be produced, although filtering is helpful to avoid plotting too many series at once.

fc %>%
  filter(Region == "Snowy Mountains") %>%
  autoplot(tourism, level = NULL) +
  xlab("Year") + ylab("Overnight trips (thousands)")

Forecast accuracy calculations

To compare the forecast accuracy of these models, we will create a training data set containing all data up to 2014. We will then forecast the remaining years in the data set and compare the results with the actual values.

train <- tourism %>%
  filter(year(Quarter) <= 2014)
fit <- train %>%
  model(
    ets = ETS(Trips),
    arima = ARIMA(Trips),
    snaive = SNAIVE(Trips)
  ) %>%
  mutate(mixed = (ets + arima + snaive) / 3)

Here we have introduced an ensemble forecast (mixed) which is a simple average of the three fitted models. Note that forecast() will produce distributional forecasts from the ensemble as well, taking into account the correlations between the forecast errors of the component models.

fc <- fit %>% forecast(h = "3 years")
fc %>%
  filter(Region == "Snowy Mountains") %>%
  autoplot(tourism, level = NULL)

Now to check the accuracy, we use the accuracy() function. By default it computes several point forecasting accuracy measures such as MAE, RMSE, MAPE and MASE for every key combination.

accuracy(fc, tourism)
## # A tibble: 1,216 x 12##    .model Region State Purpose .type    ME  RMSE   MAE      MPE  MAPE  MASE##                     ##  1 arima  Adela… SA    Busine… Test  22.5  28.5  25.3    11.9    14.0 0.765##  2 arima  Adela… SA    Holiday Test  21.9  34.8  28.0     9.93   14.8 1.31 ##  3 arima  Adela… SA    Other   Test   4.71 17.5  14.6     0.529  20.2 1.11 ##  4 arima  Adela… SA    Visiti… Test  32.8  37.1  32.8    13.7    13.7 1.05 ##  5 arima  Adela… SA    Busine… Test   1.31  5.58  3.57 -Inf     Inf   1.09 ##  6 arima  Adela… SA    Holiday Test   6.46  7.43  6.46   37.4    37.4 1.14 ##  7 arima  Adela… SA    Other   Test   1.35  2.79  1.93  -31.0    99.4 1.71 ##  8 arima  Adela… SA    Visiti… Test   8.37 12.6  10.4    -3.98   72.3 1.35 ##  9 arima  Alice… NT    Busine… Test   9.85 12.2  10.7    34.4    44.3 1.74 ## 10 arima  Alice… NT    Holiday Test   4.80 11.3   9.30    4.46   35.2 1.00 ## # … with 1,206 more rows, and 1 more variable: ACF1 

But because we have generated distributional forecasts, it is also interesting to look at the accuracy using CRPS (Continuous Rank Probability Scores) and Winkler Scores (for 95% prediction intervals).

fc_accuracy <- accuracy(fc, tourism,
  measures = list(
    point_accuracy_measures,
    interval_accuracy_measures,
    distribution_accuracy_measures
  )
)
fc_accuracy %>%
  group_by(.model) %>%
  summarise(
    RMSE = mean(RMSE),
    MAE = mean(MAE),
    MASE = mean(MASE),
    Winkler = mean(winkler),
    CRPS = mean(CRPS)
  ) %>%
  arrange(RMSE)
## # A tibble: 4 x 6##   .model  RMSE   MAE  MASE Winkler  CRPS##           ## 1 mixed   19.8  16.0 0.997    104.  11.4## 2 ets     20.2  16.4 1.00     128.  11.9## 3 snaive  21.5  17.3 1.17     121.  12.8## 4 arima   21.9  17.8 1.07     140.  13.0

In this case, the mixed model is doing best on all accuracy measures.

Moving from forecast to fable

Many readers will be familiar with the forecast package and will wonder about the differences between forecast and fable. Here are some of the main differences.

  • fable is designed for tsibble objects, forecast is designed for ts objects.
  • fable handles many time series at a time, forecast handles one time series at a time.
  • fable can fit multiple models at once, forecast fits one model at a time (see the short sketch after this list).
  • forecast produces point forecasts and prediction intervals. fable produces point forecasts and distribution forecasts. In fable, you can get prediction intervals from the forecast object using hilo() and in plots using autoplot().
  • fable handles ensemble forecasting easily whereas forecast has no facilities for ensembles.
  • fable has a more consistent interface with every model specified as a formula.
  • Automated modelling in fable is obtained by simply not specifying the right hand side of the formula. This was shown in the ARIMA() and ETS() functions here.
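As a rough illustration of the workflow difference, here is a hedged sketch fitting an automatic ARIMA model to the Snowy Mountains series both ways. The forecast-package half assumes the series has been converted to a ts object, and attaching both packages at once can create a forecast() conflict, so treat this as schematic rather than a script to run verbatim.

# forecast: one ts object, one model at a time
library(forecast)
snowy_ts <- ts(snowy$Trips, start = c(1998, 1), frequency = 4)
fc_old <- forecast(auto.arima(snowy_ts), h = 12)

# fable: the same step scales to many keys and many models via a mable
snowy %>%
  model(arima = ARIMA(Trips)) %>%
  forecast(h = 12)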

Subsequent posts will explore other features of the fable package.


To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.


Getting started with {golem}


[This article was first published on Rtask, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A little blog post about where to look if you want to get started with {golem}, and an invitation to code with us in October. go-what? If you've never heard about it before, {golem} is a tool for building production-grade Shiny applications. With {golem}, Shiny developers have a toolkit for making stable, easy-to-maintain, and robust web applications for production.

The article Getting started with {golem} appeared first on Rtask.


To leave a comment for the author, please follow the link and comment on their blog: Rtask.


bamlss: A Lego Toolbox for Flexible Bayesian Regression


[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Modular R tools for Bayesian regression are provided by bamlss: From classic MCMC-based GLMs and GAMs to distributional models using the lasso or gradient boosting.

Citation

Umlauf N, Klein N, Simon T, Zeileis A (2019). “bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond).” arXiv:1909.11784, arXiv.org E-Print Archive. https://arxiv.org/abs/1909.11784

Abstract

Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing both as an appealing framework for regularizing or penalizing model estimation as well as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that allows to easily combine all three aspects has not yet been available. For filling this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as “Lego bricks” encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.

Software

CRAN package: https://CRAN.R-project.org/package=bamlss Replication script: bamlss.R Project web page: http://www.bamlss.org/

Quick overview

To illustrate that bamlss follows the same familiar workflow as other regression packages, such as the basic stats package or the well-established mgcv and gamlss, two quick examples are provided: a Bayesian logit model and a location-scale model where both the mean and the variance of a normal response depend on a smooth term.

The logit model is a basic labor force participation model, a standard application in microeconometrics. Here, the data are loaded from the AER package and the same model formula is specified that would also be used for glm() (as shown on ?SwissLabor).

data("SwissLabor", package = "AER")f <- participation ~ income + age + education + youngkids + oldkids + foreign + I(age^2)

Then, the model can be estimated with bamlss() using essentially the same look-and-feel as for glm(). The default is to use Markov chain Monte Carlo after obtaining initial parameters via backfitting.

library("bamlss")set.seed(123)b <- bamlss(f, family = "binomial", data = SwissLabor)summary(b)## Call:## bamlss(formula = f, family = "binomial", data = SwissLabor)## ---## Family: binomial## Link function: pi = logit## *---## Formula pi:## ---## participation ~ income + age + education + youngkids + oldkids +##     foreign + I(age^2)## -## Parametric coefficients:##                 Mean     2.5%      50%    97.5% parameters## (Intercept)  6.15503  1.55586  5.99204 11.11051      6.196## income      -1.10565 -1.56986 -1.10784 -0.68652     -1.104## age          3.45703  2.05897  3.44567  4.79139      3.437## education    0.03354 -0.02175  0.03284  0.09223      0.033## youngkids   -1.17906 -1.51099 -1.17683 -0.83047     -1.186## oldkids     -0.24122 -0.41231 -0.24099 -0.08054     -0.241## foreignyes   1.16749  0.76276  1.17035  1.55624      1.168## I(age^2)    -0.48990 -0.65660 -0.49205 -0.31968     -0.488## alpha        0.87585  0.32301  0.99408  1.00000         NA## ---## Sampler summary:## -## DIC = 1033.325 logLik = -512.7258 pd = 7.8734## runtime = 1.417## ---## Optimizer summary:## -## AICc = 1033.737 converged = 1 edf = 8## logLik = -508.7851 logPost = -571.3986 nobs = 872## runtime = 0.012## ---

The summary is based on the MCMC samples, which suggest “significant” effects for all covariates, except for variable education, since the 95% credible interval contains zero. In addition, the acceptance probabilities alpha are reported and indicate proper behavior of the MCMC algorithm. The column parameters shows respective posterior mode estimates of the regression coefficients, which are calculated by the upstream backfitting algorithm.

To show a more flexible regression model, we fit a distributional location-scale model to the well-known simulated motorcycle accident data, provided as mcycle in the MASS package.

Here, the relationship between head acceleration and time after impact is captured by smooth terms in both the mean and the variance. See also ?gaulss in the mgcv package for the same type of model estimated with REML rather than MCMC. Below, we load the data, set up a list of two formulas with smooth terms (and an increased number of knots k for more flexibility), fit the model almost as usual, and then visualize the fitted terms along with 95% credible intervals.

data("mcycle", package = "MASS")f <- list(accel ~ s(times, k = 20), sigma ~ s(times, k = 20))set.seed(456)b <- bamlss(f, data = mcycle, family = "gaussian")plot(b, model = c("mu", "sigma"))

mcycle distributional regression
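For comparison, the analogous REML fit in mgcv mentioned above would look something like this (a sketch adapted from the ?gaulss examples; there, the smooth for the standard deviation is given as a one-sided formula):

library("mgcv")
m <- gam(list(accel ~ s(times, k = 20), ~ s(times, k = 20)),
         data = mcycle, family = gaulss())
plot(m, pages = 1)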

Flexible count regression for lightning reanalysis

Finally, we show a more challenging case study. Here, emphasis is given to the illustration of the workflow. For more details on the background for the data and interpretation of the model, see Section 5 in the full paper linked above. The goal is to establish a probabilistic model linking positive counts of cloud-to-ground lightning discharges in the European Eastern Alps to atmospheric quantities from a reanalysis dataset.

The lightning measurements form the response variable, and the regressors are atmospheric quantities from ECMWF's ERA5 reanalysis. Both have a temporal resolution of 1 hour for the years 2010-2018 and a spatial mesh size of approximately 32 km. The subset of the data analyzed, along with the fitted bamlss model, is provided in the FlashAustria data package on R-Forge, which can be installed by

install.packages("FlashAustria", repos = "http://R-Forge.R-project.org")

To model only the observations with at least one lightning discharge, we employ a negative binomial count distribution truncated at zero. The data can be loaded and the regression formula set up as follows:

data("FlashAustria", package = "FlashAustria")f <- list(  counts ~ s(d2m, bs = "ps") + s(q_prof_PC1, bs = "ps") +    s(cswc_prof_PC4, bs = "ps") + s(t_prof_PC1, bs = "ps") +    s(v_prof_PC2, bs = "ps") + s(sqrt_cape, bs = "ps"),  theta ~ s(sqrt_lsp, bs = "ps"))

The expectation mu of the underlying untruncated negative binomial model is modeled by various smooth terms for the atmospheric variables while the overdispersion parameter theta only depends on one smooth regressor. To fit this challenging model, gradient boosting is employed in a first step to obtain initial values for the subsequent MCMC sampler. Running the model takes about 30 minutes on a well-equipped standard PC. In order to move quickly through the example we load the pre-computed model from the FlashAustria package:

data("FlashAustriaModel", package = "FlashAustria")b <- FlashAustriaModel

But, of course, the model can also be refitted:

set.seed(111)
b <- bamlss(f, family = "ztnbinom", data = FlashAustriaTrain,
  optimizer = boost, maxit = 1000,          ## Boosting arguments.
  thin = 5, burnin = 1000, n.iter = 6000)   ## Sampler arguments.

To explore this model in some more detail, we show a couple of visualizations. First, the contribution to the log-likelihood of individual terms during gradient boosting is depicted.

pathplot(b, which = "loglik.contrib", intercept = FALSE)

flash model, likelihood contributions

Subsequently, we show traceplots of the MCMC samples (left) along with their autocorrelation for two spline coefficients of the term s(sqrt_cape) in the model for mu.

plot(b, model = "mu", term = "s(sqrt_cape)", which = "samples")

flash model, MCMC samples

Next, the effects of the terms s(sqrt_cape) and s(q_prof_PC1) from the model for mu and term s(sqrt_lsp) from the model for theta are shown along with 95% credible intervals derived from the MCMC samples.

plot(b, term = c("s(sqrt_cape)", "s(q_prof_PC1)", "s(sqrt_lsp)"),  rug = TRUE, col.rug = "#39393919")

flash model, fitted smooth effects

Finally, estimated probabilities for observing 10 or more lightning counts (within one grid box) are computed and visualized. The reconstructions for four time points on September 15-16, 2001 are shown.

fit <- predict(b, newdata = FlashAustriaCase, type = "parameter")
fam <- family(b)
FlashAustriaCase$P10 <- 1 - fam$p(9, fit)
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
library("ggplot2")
ggplot() + geom_sf(aes(fill = P10), data = FlashAustriaCase) +
  colorspace::scale_fill_continuous_sequential("Oslo", rev = TRUE) +
  geom_sf(data = world, col = "white", fill = NA) +
  coord_sf(xlim = c(7.95, 17), ylim = c(45.45, 50), expand = FALSE) +
  facet_wrap(~time, nrow = 2) + theme_minimal() +
  theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0))

flash model, spatial maps


To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis.


Fall & Winter Workshop Roundup


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Join RStudio at one of our Fall and Winter workshops! We’ll be hosting a few different workshops in a variety of cities across the US and UK. Topics range from building tidy tools, to teaching data science, to mastering machine learning. See below for more details on each workshop and how to register.

Building Tidy Tools with Hadley Wickham

When: October 14 & 15, 2019

Where: Loudermilk Conference Center in Atlanta, GA

Who: Hadley Wickham, Chief Scientist at RStudio

Register here: https://cvent.me/2YXxr

RStudio Chief Scientist Hadley Wickham is hosting his popular "Building Tidy Tools" workshop in Atlanta, Georgia this October.

You should take this workshop if you have experience programming in R and want to learn how to tackle larger scale problems. You’ll get the most from it if you’re already familiar with functions and are comfortable with R’s basic data structures (vectors, matrices, arrays, lists, and data frames). Note: There is ~30% overlap in the material with Hadley’s previous “R Masterclass”. However, the material has been substantially reorganized, so if you’ve taken the R Masterclass in the past, you’ll still learn a lot in this class.

This course has three primary goals. You will:

  1. Learn efficient workflows for developing high-quality R functions, using the set of conventions codified by a package. You’ll also learn workflows for unit testing, which helps ensure that your functions do exactly what you think they do.

  2. Master the art of writing functions that do one thing well and can be fluently combined together to solve more complex problems. We’ll cover common function writing pitfalls and how to avoid them.

  3. Learn how to write collections of functions that work well together and adhere to existing conventions so they’re easy to pick up for newcomers. We’ll discuss API design, functional programming tools, the basics of object design in S3, and the tidy eval system for NSE.

Welcome to the Tidyverse: An Introduction to R for Data Science

When: The one-day workshop is hosted on both October 14 & October 15

Where: Loudermilk Conference Center in Atlanta, GA

Who:

  • Carl Howe, Director of Education at RStudio
  • Christina Koch, University of Wisconsin
  • Teon Brooks, Data Scientist at Mozilla

Register Here: https://cvent.me/ZlvXL

Join RStudio’s Director of Education, Carl Howe, and some special teachers for their “Welcome to the Tidyverse: An Introduction to R for Data Science”. This workshop is designed for folks who are new to R and want to learn more.

Looking for an effective way to learn R? This one-day course will teach you a workflow for doing data science with the R language. It focuses on using R’s Tidyverse, which is a core set of R packages that are known for their impressive performance and ease of use. We will focus on doing data science, not programming.

In this course, you’ll learn to:

  1. Visualize data with R’s ggplot2 package
  2. Wrangle data with R’s dplyr package
  3. Fit models with base R
  4. Document your work reproducibly with R Markdown

Machine Learning Workshop with Max Kuhn

When: November 18 & 19, 2019

Where: Hilton London Paddington in London, UK

Who: Max Kuhn , Software Engineer at RStudio

Register here: https://cvent.me/bKoXk

See Max Kuhn teach his Machine Learning workshop this fall in London. This is a great chance to hear Max teach and experience this class while he is across the pond.

This two-day course will provide an overview of using R for supervised learning. The session will step through the process of building, visualizing, testing, and comparing models that are focused on prediction. The goal of the course is to provide a thorough workflow in R that can be used with many different regression or classification techniques. Case studies on real data will be used to illustrate the functionality, and several different predictive models will be demonstrated.

Introduction to Machine Learning with the Tidyverse

When: December 12 & 13, 2019

Where: RStudio’s Boston Office

Who:

  • Garrett Grolemund, Data Scientist and Professional Educator at RStudio
  • Alison Hill, Data Scientist and Professional Educator at RStudio

Register here: https://cvent.me/brM1M

Get a sneak peek at Garrett and Alison’s rstudio::conf2020 workshop, “Introduction to Machine Learning with the Tidyverse”. If you can’t make it to the conference this year, this is your chance to experience one of the workshops and help them test drive their content.

This is a test run for a workshop in the final stages of development. The workshop provides a gentle introduction to machine learning and to the tidyverse packages that do machine learning. You’ll learn how to train and assess predictive models with several common machine learning algorithms, as well as how to do feature engineering to improve the predictive accuracy of your models. We will focus on learning the basic theory and best practices that support machine learning, and we will do it with a modern suite of R packages known as tidymodels. Tidymodels packages, like parsnip, recipes, and rsample provide a grammar for modeling and work seamlessly with R’s tidyverse packages.

Since this is a test run, the workshop is limited to a small number of seats. The low price reflects the experimental nature of the material. Students will be asked to provide constructive feedback in a course survey.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Understanding Bootstrap Confidence Interval Output from the R boot Package


[This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Nuances of Bootstrapping

Most applied statisticians and data scientists understand that bootstrapping is a method that mimics repeated sampling by drawing some number of new samples (with replacement) from the original sample in order to perform inference. However, it can be difficult to understand output from the software that carries out the bootstrapping without a more nuanced understanding of how uncertainty is quantified from bootstrap samples.

To demonstrate the possible sources of confusion, start with the data described in Efron and Tibshirani’s (1993) text on bootstrapping (page 19). We have 15 paired observations of student LSAT scores and GPAs. We want to estimate the correlation between LSAT and GPA scores. The data are the following:

student   lsat   gpa
1         576    3.39
2         635    3.30
3         558    2.81
4         578    3.03
5         666    3.44
6         580    3.07
7         555    3.00
8         661    3.43
9         651    3.36
10        605    3.13
11        653    3.12
12        575    2.74
13        545    2.76
14        572    2.88
15        594    2.96

The correlation turns out to be 0.776. For reasons we’ll explore, we want to use the nonparametric bootstrap to get a confidence interval around our estimate of \(r\). We do so using the boot package in R. This requires the following steps:

  1. Define a function that returns the statistic we want.
  2. Use the boot function to get R bootstrap replicates of the statistic.
  3. Use the boot.ci function to get the confidence intervals.

For step 1, the following function is created:

get_r <- function(data, indices, x, y) {
  d <- data[indices, ]
  r <- round(as.numeric(cor(d[x], d[y])), 3)
  r
}

Steps 2 and 3 are performed as follows:

set.seed(12345)
boot_out <- boot(
  tbl,
  x = "lsat",
  y = "gpa",
  R = 500,
  statistic = get_r
)
boot.ci(boot_out)
## Warning in boot.ci(boot_out): bootstrap variances needed for studentized## intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out)## ## Intervals : ## Level      Normal              Basic         ## 95%   ( 0.5247,  1.0368 )   ( 0.5900,  1.0911 )  ## ## Level     Percentile            BCa          ## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )  ## Calculations and Intervals on Original Scale## Some BCa intervals may be unstable

Looking at the boot.ci output, the following questions come up:

  1. Why are there multiple CIs? How are they calculated?
  2. What are the bootstrap variances needed for studentized intervals?
  3. What does it mean that the calculations and intervals are on the original scale?
  4. Why are some BCa intervals unstable?

To understand this output, let’s review statistical inference, confidence intervals, and the bootstrap.

Statistical Inference

The usual test statistic for determining if \(r \neq 0\) is:

\[ t = \frac{r}{SE_r} \]

where

\[ SE_r = \sqrt{\frac{1-r^2}{n-2}} \]

In our case:

\[ SE_r = \sqrt{\frac{1-r^2}{n-2}} = \sqrt{\frac{1-0.776^2}{15-2}} = 0.175 \]

Dividing \(r\) by \(SE_r\) yields our \(t\) statistic

\[ t = \frac{r}{SE_r} = \frac{0.776}{0.175} = 4.434 \]

We compare this to a \(t\) distribution with \(n-2 = 13\) degrees of freedom and easily find it to be significant.

In words: If the null hypothesis were true, and we repeatedly draw samples of size \(n\), and we calculate \(r\) each time, then the probability that we would observe an estimate of \(|r| = 0.776\) or larger is less than 5%.
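In R, the same test is available directly through cor.test() (using the tbl data frame that holds the lsat and gpa columns below):

cor.test(tbl$lsat, tbl$gpa)
# t is about 4.43 on 13 degrees of freedom, matching the hand calculation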

An important caveat. The above formula for the standard error is only correct when \(r = 0\). The closer we get to \(\pm 1\), the less correct it is.

Confidence Intervals

We can see why the standard error formula above becomes less correct the further we get from zero by considering the 95% confidence interval for our estimate. The usual formula you see for a confidence interval is the estimate plus or minus the 97.5th percentile of the normal or \(t\) distribution times the standard error. In this case, the \(t\)-based formula would be:

\[ \text{95% CI} = r \pm t_{df = 13} SE_r \]

If we were to sample 15 students repeatedly from the population and calculate this confidence interval each time, the interval should include the true population value 95% of the time. So what happens if we use the standard formula for the confidence interval?

\[ \begin{align} \text{95% CI} &= r \pm t_{df = 13}SE_r \\ &= 0.776 \pm 2.16\times 0.175 \\ &= [0.398, 1.154] \end{align} \]

Recall that correlations are bounded in the range \([-1, +1]\), but our 95% confidence interval contains values greater than one!

Alternatives:

  • Use Fisher's \(z\)-transformation. This is what your software will usually do, but it doesn't work for most other statistics. (A quick sketch of this approach follows the list.)
  • Use the bootstrap. While not necessary for the correlation coefficient, its advantage is that it can be used for almost any statistic.
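For reference, the Fisher \(z\) approach mentioned in the first bullet looks like this (a quick sketch, not part of the bootstrap workflow itself): transform \(r\), build a normal interval on the \(z\) scale, and back-transform, which keeps the endpoints inside \([-1, +1]\).

r <- 0.776
n <- 15
z  <- atanh(r)               # Fisher z-transformation
se <- 1 / sqrt(n - 3)        # approximate standard error on the z scale
tanh(z + c(-1, 1) * qnorm(0.975) * se)
# roughly [0.44, 0.92]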

The next sections review the nonparametric and parametric bootstrap.

Nonparametric Bootstrap

We do not know the true population distribution of LSAT and GPA scores. What we have instead is our sample. Just like we can use our sample mean as an estimate of the population mean, we can use our sample distribution as an estimate of the population distribution.

In the absence of supplementary information about the population (e.g. that it follows a specific distribution like bivariate normal), the empirical distribution from our sample contains as much information about the population distribution as we can get. If statistical inference is typically defined by repeated sampling from a population, and our sample provides a good estimate of the population distribution, we can conduct inferential tasks by repeatedly sampling from our sample.

(Nonparametric) bootstrapping thus works as follows for a sample of size N:

  1. Draw a random sample of size N with replacement from our sample, which is the first bootstrap sample.
  2. Estimate the statistic of interest using the bootstrap sample.
  3. Draw a new random sample of size N with replacement, which is the second bootstrap sample.
  4. Estimate the statistic of interest using the new bootstrap sample.
  5. Repeat \(k\) times.
  6. Use the distribution of estimates across the \(k\) bootstrap samples as the sampling distribution.

Note that the sampling is done with replacement. As an aside, most results from traditional statistics are based on the assumption of random sampling with replacement. Usually, the population we sample from is large enough that we do not bother noting the “with replacement” part. If the sample is large relative to the population, and sampling without replacement is used, we would typically be advised to use a finite population correction. This is just to say that the “with replacement” requirement is a standard part of the definition of random sampling.
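To make the resampling step concrete, here is a minimal by-hand version of the loop, independent of the boot package (it assumes the paired data live in the tbl data frame used below):

set.seed(1)
boot_r <- replicate(500, {
  idx <- sample(nrow(tbl), replace = TRUE)   # resample rows with replacement
  cor(tbl$lsat[idx], tbl$gpa[idx])           # statistic on the bootstrap sample
})
mean(boot_r)   # centre of the bootstrap distribution
sd(boot_r)     # bootstrap standard error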

Let’s take our data as an example. We will draw 500 bootstrap samples, each of size \(n = 15\) chosen with replacement from our original data. The distribution across repeated samples is:

Note a few things about this distribution.

  1. The distribution is definitely not normal.
  2. The mean estimate of \(r\) across the 500 bootstrap samples is 0.771. The difference between the mean of the bootstrap estimates \((\mathbb{E}(r_b) = 0.771)\) and the original sample estimate \((r = 0.776)\) is the bias.
  3. The bootstrap standard error is the standard deviation of the bootstrap sampling distribution. Here the value is 0.131, which is much smaller than our earlier estimate of 0.175. This is because 0.175 was based on a formula that is only valid when \(r = 0\) and does not account for the fact that a correlation’s values are bounded at -1 and +1.

The non-normality of the sampling distribution means that, if we divide \(r\) by the bootstrap standard error, we will not get a statistic that is distributed standard normal or \(t\). Instead, we decide that it is a better idea to summarize our uncertainty using a confidence interval. Yet we also want to make sure that our confidence intervals are bounded within the \([-1, +1]\) range, so the usual formula will not work.

Before turning to different methods for obtaining bootstrap confidence intervals, for completeness the next section describes the parametric bootstrap.

Parametric Bootstrap

The prior section noted that, in the absence of supplementary information about the population, the empirical distribution from our sample contains as much information about the population distribution as we can get.

An example of supplementary information that may improve our estimates would be that we know the LSAT and GPA scores are distributed bivariate normal. If we are willing to make this assumption, we can use our sample to estimate the distribution parameters. Based on our sample, we find:

\[ \begin{pmatrix} \text{LSAT} \\ \text{GPA} \end{pmatrix}\sim N\left(\begin{pmatrix} 600.27 \\ 3.09 \end{pmatrix},\begin{pmatrix} 1746.78 & 7.90 \\ 7.90 & 0.06 \end{pmatrix}\right). \]

The distribution looks like the following:

We can draw 500 random samples of size 15 from this specific bivariate normal distribution and calculate the correlation between the two variables for each.

get_cor <- function(iteration, n) {
  dta <- MASS::mvrnorm(
    n,
    mu = c(600, 3),
    Sigma = matrix(c(1747, 7.9, 7.9, .06), 2)
  ) %>%
    as.data.frame()
  tibble(
    iteration = iteration,
    r = cor(dta$V1, dta$V2)
  )
}
par_boot_tbl <- map_dfr(1:500, ~ get_cor(.x, 15))

The distribution of the correlation estimates across the 500 samples represents our parametric bootstrap sampling distribution. It looks like the following:

The average correlation across the 500 samples was 0.767, and the standard deviation (our estimate of the standard error) was 0.111.

This is smaller than our non-parametric bootstrap estimate of the standard error, 0.131, which is reflective of the fact that our knowledge of the population distribution gives us more information. This in turn reduces sampling variability.

Of course, we often will not feel comfortable saying that the population distribution follows a well-defined shape, and hence we will typically default to the non-parametric version of the bootstrap.

Bootstrap Confidence Intervals

Recall that the usual formula for estimating a confidence around a statistic \(\theta\) is something like:

\[ \text{95% CI} = \theta \pm t_{df,1-\alpha/2} SE_{\theta} \]

We saw that using the textbook standard error estimate for a correlation led us astray because we ended up with an interval outside of the range of plausible values. There are a variety of alternative approaches to calculating confidence intervals based on the bootstrap.

Standard Normal Interval

The first approach starts with the usual formula for calculating a confidence interval, using the normal distribution value of 1.96 as the multiplier of the standard error. However, there are two differences. First, we use our bootstrap estimate of the standard error in the formula. Second, we make an adjustment for the estimated bias, -0.005.

In our example, we get

\[ \begin{align} \text{95% CI} &= r – \text{bias} \pm 1.96 \times SE_r \\ &= 0.776 + .005 \pm 1.96 \times 0.131 \\ &= [0.524, 1.038] \end{align} \]

This matches R’s output (given our hand calculations did some rounding along the way).

boot.ci(boot_out, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = "norm")## ## Intervals : ## Level      Normal        ## 95%   ( 0.5247,  1.0368 )  ## Calculations and Intervals on Original Scale

Problems:

  • If a normal approximation were valid, we probably don’t need to bootstrap.
  • We still have a CI outside the appropriate range.

We generally won’t use this method.

Studentized (t) Intervals

Recall that, when we calculate a \(t\)-statistic, we mean-center the original statistic and divide by the sample estimate of the standard error. That is,

\[ t = \frac{\hat{\theta} – \theta}{\widehat{SE}_{\theta}} \]

where \(\hat{\theta}\) is the sample estimate of the statistic, \(\theta\) is the “true” population value (which we get from our null hypothesis), and \(\widehat{SE}_{\theta}\) is the sample estimate of the standard error.

There is an analog to this process for bootstrap samples. In the bootstrap world, we can convert each bootstrap sample into a \(t\)-score as follows:

\[ t = \frac{\tilde{\theta} – \hat{\theta}}{\widehat{SE}_{\hat{\theta}}} \]

Here \(\tilde{\theta}\) is the statistic estimated from a single bootstrap sample, and \(\hat{\theta}\) is the estimate from the original (non-bootstrap) sample.

But where does \(\widehat{SE}_{\hat{\theta}}\) come from?

Just like for a \(t\)-test, where we estimated the standard error using our one sample, we estimate the standard error separately for each bootstrap sample. That is, we need an estimate of the bootstrap sample variance. (Recall the message from the R output above).

If we are lucky enough to have a formula for a sample standard error, we use that in each sample. For the mean, each bootstrap sample would return the following (a short sketch of such a statistic function appears after this list):

  1. The bootstrap sample mean, \(\frac{1}{n}\sum(s_{bi})\)
  2. The bootstrap sample variance: \(\frac{s^2_b}{n}\).
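A sketch of what such a statistic function could look like for the mean (the column name x is hypothetical; the key point is that the function returns both the statistic and an estimate of its variance, which is what boot.ci(type = "stud") expects):

get_mean_var <- function(data, indices) {
  d <- data[indices, ]
  c(mean(d$x), var(d$x) / nrow(d))   # statistic and its estimated variance
}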

We don’t have such a formula that works for any correlation, so we need another means to estimate the variance. The delta method is one choice. Alternatively, there is the nested bootstrap.

Nested bootstrap algorithm:

  1. Draw a bootstrap sample.
  2. Estimate the statistic.
  3. Bootstrap the bootstrap sample, using the variance of estimates across the bootstrapped estimates as the estimate of the variance.
  4. Save the bootstrap estimate of the statistic and the nested bootstrap estimate of the variance.
  5. For each bootstrap sample, estimate \(t = \frac{\tilde{\theta} – \hat{\theta}}{\widehat{SE}_{\hat{\theta}}}\)

We now have the information we need to calculate the studentized confidence interval. The formula for the studentized bootstrap confidence interval is:

\[ 95\% \text{ CI} = [\hat{\theta} – sq_{1-\alpha/2}, \hat{\theta} – sq_{\alpha/2}] \]

The terms are:

  1. \(\hat{\theta}\): Our sample statistic (without performing the bootstrap)
  2. \(s\): Our bootstrap estimate of the standard error (the standard deviation of bootstrap estimates, not the nested bootstrap part)
  3. \(q_{1-\alpha/2}\): For \(\alpha = .05\), the 97.5th percentile of our bootstrap \(t\) estimates.
  4. \(q_{\alpha/2}\): For \(\alpha = .05\), the 2.5th percentile of our bootstrap \(t\) estimates.

For each bootstrap sample, we calculated a \(t\) statistic. The \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\) are identified by taking the appropriate quantile of these \(t\) estimates. This is akin to creating our own table of \(t\)-statistics, rather than using the typical tables for the \(t\) distribution you’d find in text books.

What does this look like in R? We need a second function for bootstrapping inside our bootstrap. The following will work.

get_r_var <- function(x, y, data, indices, its) {
  d <- data[indices, ]
  r <- cor(d[x], d[y]) %>%
    as.numeric() %>%
    round(3)
  n <- nrow(d)
  v <- boot(
    x = x,
    y = y,
    R = its,
    data = d,
    statistic = get_r
  ) %>%
    pluck("t") %>%
    var(na.rm = TRUE)
  c(r, v)
}
boot_t_out <- boot(
  x = "lsat", y = "gpa", its = 200,
  R = 1000, data = tbl, statistic = get_r_var
)

We now have our bootstrap estimates of \(t\), and we can use the quantiles of the distribution to plug into the formula. We find that \(q_{1-\alpha/2} = 8.137\) and that \(q_{\alpha/2} = -1.6\). Substituting:

\[ \begin{align} \text{95% CI} &= [\hat{\theta} – sq_{1-\alpha/2}, \hat{\theta} – sq_{\alpha/2}] \\ &= [0.776 – 0.142 \times 8.137, 0.776 – 0.142 \times -1.6] \\ &= [-0.383,1.004] \end{align} \]

Checking our by-hand calculations, the studentized confidence interval from boot.ci is:

boot.ci(boot_t_out, type = "stud")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 1000 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_t_out, type = "stud")## ## Intervals : ## Level    Studentized     ## 95%   (-0.3828,  1.0039 )  ## Calculations and Intervals on Original Scale

Problems:

  • The nested bootstrap part is computationally intensive, even for simple problems like this.
  • May still produce estimates outside the range of plausible values.
  • Here our sample was small, so the variance of each bootstrap estimate was large.
  • Has been found to be erratic in practice.

Basic Bootstrap Confidence Interval

Another way of writing a confidence interval:

\[ 1-\alpha = P(q_{\alpha/2} \leq \theta \leq q_{1-\alpha/2}) \]

In non-bootstrap confidence intervals, \(\theta\) is a fixed value while the lower and upper limits vary by sample. In the basic bootstrap, we flip what is random in the probability statement. Define \(\tilde{\theta}\) as a statistic estimated from a bootstrap sample. We can write

\[ 1-\alpha = P(q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}) \]

Recall that the bias of a statistic is the difference between its expected value (mean) across many samples and the true population value:

\[ \text{bias} = \mathbb{E}(\hat{\theta}) – \theta \]

We estimate this using our bootstrap samples, \(\mathbb{E}(\tilde{\theta}) – \hat{\theta}\), where \(\hat{\theta}\) is the estimate from the original sample (before bootstrapping).
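With the boot object from earlier in the post, this bias estimate is a one-liner (sketch):

# Estimated bias: mean of the bootstrap estimates minus the original-sample estimate
mean(boot_out$t) - boot_out$t0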

We can add in the bias-correction term to each side of our inequality as follows.

\[ \begin{align} 1-\alpha &= P(q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}) \\ &= P(q_{\alpha/2} – \hat{\theta} \leq \tilde{\theta} – \hat{\theta} \leq q_{1-\alpha/2} – \hat{\theta}) \end{align} \]

The key additional step is to assume that the distribution of \(\tilde{\theta} – \hat{\theta}\) across bootstrap samples approximates the distribution of \(\hat{\theta} – \theta\) across (hypothetical) repeated samples. Substituting one for the other and rearranging leads to:

\[ 1-\alpha = P(2\hat{\theta} – q_{1-\alpha/2} \leq \theta \leq 2\hat{\theta} – q_{\alpha/2} ) \]

The limits inside the probability statement give us the formula for the basic bootstrap confidence interval.

Because we started out with \(\tilde{\theta}\) as the random variable, we can use our bootstrap quantiles for the values of \(q_{1-\alpha/2}\) and \(q_{\alpha/2}\). To do so, arrange the estimates in order from lowest to highest, then use a percentile function to find the value at the 2.5th and 97.5th percentiles (given two-tailed \(\alpha = .05\)). We find that \(q_{1-\alpha/2} = 0.962\) and that \(q_{\alpha/2} = 0.461\). Substituting into the inequality:

\[ \begin{align} 1-\alpha &= P(2\hat{r} – q_{1-\alpha/2} \leq r \leq 2\hat{r} – q_{\alpha/2} ) \\ &= P(2(0.776) – 0.962 \leq r \leq 2(0.776) – 0.461) \\ &= P(0.59 \leq r \leq 1.091) \end{align} \]

The basic bootstrap interval is \([0.59, 1.091]\).
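The same calculation can be done directly from the boot object (a sketch; the type-6 quantiles match the percentile calculation used in the percentile section below):

# Basic bootstrap interval by hand: 2 * theta_hat minus the upper/lower bootstrap quantiles
theta_hat <- boot_out$t0
q <- quantile(boot_out$t, probs = c(.025, .975), type = 6)
c(lower = 2 * theta_hat - q[2], upper = 2 * theta_hat - q[1])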

To confirm:

boot.ci(boot_out, type = "basic")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = "basic")## ## Intervals : ## Level      Basic         ## 95%   ( 0.5900,  1.0911 )  ## Calculations and Intervals on Original Scale

But we’re still outside the range we want.

Percentile Confidence Intervals

Here’s an easy solution. Line up the bootstrap estimates from lowest to highest, then take the 2.5th and 97.5th percentile.

quantile(boot_out$t, probs = c(.025, .975), type = 6)
##     2.5%    97.5% ## 0.460775 0.962000

Compare:

boot.ci(boot_out, type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = "perc")## ## Intervals : ## Level     Percentile     ## 95%   ( 0.4609,  0.9620 )  ## Calculations and Intervals on Original Scale

(The slight difference is due to boot using its own quantile function.)

Looks like we have a winner. Our confidence interval will necessarily be limited to the range of plausible values. But let’s look at one other.

Bias Corrected and Accelerated (BCa) Confidence Intervals

BCa intervals require estimating two terms: a bias term and an acceleration term.

Bias is by now a familiar concept, though the calculation for the BCa interval is a little different. For BCa confidence intervals, estimate the bias correction term, \(\hat{z}_0\), as follows:

\[ \hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B}\right) \]

where \(\#\) is the counting operator. The formula looks complicated but can be thought of as estimating something close to the median bias transformed into normal deviates (\(\Phi^{-1}\) is the inverse standard normal cdf).
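In R, this term can be computed directly from the bootstrap replicates (sketch):

# Bias-correction term: proportion of bootstrap estimates below the original estimate,
# converted to a normal deviate
z0 <- qnorm(mean(boot_out$t < boot_out$t0))
z0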

The acceleration term is estimated as follows:

\[ \hat{a} = \frac{\sum^n_{i=1}(\hat{\theta}_{(\cdot)} – \hat{\theta}_{(i)})^3}{6\{\sum^n_{i=1}(\hat{\theta}_{(\cdot)} – \hat{\theta}_{(i)})^2\}^{3/2}} \]

where \(\hat{\theta}_{(\cdot)}\) is the mean of the jackknife estimates and \(\hat{\theta}_{(i)}\) the estimate after deleting the \(i\)th case. The process of estimating a statistic \(n\) times, each time dropping one of the \(n\) observations, is known as the jackknife.
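A sketch of the jackknife calculation for the correlation, assuming the original data frame is tbl with columns lsat and gpa as in the earlier boot() calls:

# Jackknife: recompute the correlation n times, leaving out one observation each time
n <- nrow(tbl)
theta_jack <- sapply(seq_len(n), function(i) cor(tbl$lsat[-i], tbl$gpa[-i]))
theta_bar  <- mean(theta_jack)

# Acceleration term from the jackknife values
a_hat <- sum((theta_bar - theta_jack)^3) /
  (6 * sum((theta_bar - theta_jack)^2)^(3 / 2))
a_hat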

The purpose of the acceleration term is to account for situations in which the standard error of an estimator changes depending on the true population value. This is exactly what happens with the correlation (the SE estimator we provided at the start of the post only works when \(r = 0\)). An equivalent way of thinking about this is that it accounts for skew in the sampling distribution, like what we have seen in the prior histograms.

Armed with our bias correction and acceleration term, we now estimate the quantiles we will use for establishing the confidence limits.

\[ \alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha/2)}}{1-\hat{a}(\hat{z}_0 + z^{(\alpha/2)}) } \right) \]

\[ \alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(1 – \alpha/2)}}{1-\hat{a}(\hat{z}_0 + z^{(1-\alpha/2)}) } \right) \]

where \(\alpha\) is our Type-I error rate, usually .05.

Our confidence limits are:

\[ 95\% \text{ CI} = [\hat{\theta}^{*(\alpha_1)}, \hat{\theta}^{*(\alpha_2)}] \]

Based on the formulas above, it should be clear that \(\alpha_1\) and \(\alpha_2\) reduce to the percentile limits when the bias and acceleration terms are zero. The effect of the bias and acceleration corrections is to change the percentiles we use to establish our limits.
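Putting the pieces together, a sketch of the adjusted percentiles using the z0 and a_hat objects from the sketches above (and alpha = .05):

# Adjusted percentiles for the BCa interval
alpha  <- 0.05
z_lo   <- qnorm(alpha / 2)
z_hi   <- qnorm(1 - alpha / 2)
alpha1 <- pnorm(z0 + (z0 + z_lo) / (1 - a_hat * (z0 + z_lo)))
alpha2 <- pnorm(z0 + (z0 + z_hi) / (1 - a_hat * (z0 + z_hi)))

# The BCa limits are the bootstrap estimates at these adjusted percentiles
quantile(boot_out$t, probs = c(alpha1, alpha2), type = 6)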

If we perform all of the above calculations, we get the following:

boot.ci(boot_out, type = "bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = "bca")## ## Intervals : ## Level       BCa          ## 95%   ( 0.3948,  0.9443 )  ## Calculations and Intervals on Original Scale## Some BCa intervals may be unstable

We get a warning message that BCa intervals may be unstable. This is because the accuracy of the bias and acceleration terms requires a large number of bootstrap samples, and, especially when using the jackknife to get the acceleration parameter, the calculation can be computationally intensive. If that is a concern, there is another type of confidence interval, known as the ABC interval, that provides a satisfactory approximation to BCa intervals and is less computationally demanding. Type ?boot::abc.ci at the command line for how to implement this in R.

Transformations

What does it mean that calculations and intervals are on the original scale?

There are sometimes advantages to transforming a statistic so that it is on a different scale. An example is the correlation coefficient. We mentioned briefly above that the usual way of performing inference is to use the Fisher-\(z\) transformation.

\[ z = \frac{1}{2}\text{ln}\left(\frac{1+r}{1-r} \right) \]

The transformed statistic is approximately normally distributed with standard error \(\frac{1}{\sqrt{N – 3}}\), so we can construct confidence intervals the usual way and then reverse-transform the limits using the function’s inverse. For Fisher-\(z\), the inverse of the transformation function is:

\[ r = \frac{\text{exp}(2z) – 1}{\text{exp}(2z) + 1} \]

If we prefer to work with the transformed statistic, we can include the transformation function and its inverse in the boot.ci function. Define the transformations:

fisher_z <- function(r) .5 * log((1 + r) / (1 - r))
inv_fisher_z <- function(z) (exp(2 * z) - 1) / (exp(2 * z) + 1)

We can use these functions within a call to boot.ci. What we get in return will depend on which functions we specify.

  • If only the transformation function is applied, the confidence intervals are on the transformed scale.
  • If the transformation and the inverse transformation functions are applied, the confidence intervals are calculated on the transformed scale but returned on the original scale.

Recall that not specifying either function returns:

boot.ci(  boot_out,   type = c("norm", "basic", "perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc", ##     "bca"))## ## Intervals : ## Level      Normal              Basic         ## 95%   ( 0.5247,  1.0368 )   ( 0.5900,  1.0911 )  ## ## Level     Percentile            BCa          ## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )  ## Calculations and Intervals on Original Scale## Some BCa intervals may be unstable

Specifying the transformation only returns:

boot.ci(  boot_out,  h = fisher_z,  type = c("norm", "basic", "perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc", ##     "bca"), h = fisher_z)## ## Intervals : ## Level      Normal              Basic         ## 95%   ( 0.212,  1.688 )   ( 0.098,  1.572 )  ## ## Level     Percentile            BCa          ## 95%   ( 0.498,  1.972 )   ( 0.417,  1.777 )  ## Calculations and Intervals on  Transformed Scale## Some BCa intervals may be unstable

Specifying the transformation and its inverse returns the following:

boot.ci(  boot_out,  h = fisher_z,   hinv = inv_fisher_z,  type = c("norm", "basic", "perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS## Based on 500 bootstrap replicates## ## CALL : ## boot.ci(boot.out = boot_out, type = c("norm", "basic", "perc", ##     "bca"), h = fisher_z, hinv = inv_fisher_z)## ## Intervals : ## Level      Normal              Basic         ## 95%   ( 0.2091,  0.9340 )   ( 0.0981,  0.9173 )  ## ## Level     Percentile            BCa          ## 95%   ( 0.4609,  0.9620 )   ( 0.3948,  0.9443 )  ## Calculations on Transformed Scale;  Intervals on Original Scale## Some BCa intervals may be unstable

Conclusion

It is hoped that this post clarifies the output from boot::boot.ci, and in particular facilitates understanding the messages the function produces. We saw that the percentile and BCa methods were the only ones considered here that were guaranteed to return a confidence interval that respected the statistic’s sampling space. It turns out that there are theoretical grounds to prefer BCa in general. It is “second-order accurate”, meaning that it converges faster to the correct coverage. Unless you have a reason to do otherwise, make sure to perform a sufficient number of bootstrap replicates (a few thousand is usually not too computationally intensive) and go with reporting BCa intervals.

To leave a comment for the author, please follow the link and comment on their blog: Rstats on pi: predict/infer.

More models, more features: what’s new in ‘parameters’ 0.2.0


[This article was first published on R on easystats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The easystats project continues to grow, expanding its capabilities and features, and the parameters package 0.2.0 update is now on CRAN.

The primary goal of this package is to provide utilities for processing the parameters of various statistical models. It is useful for end-users as well as developers, as it is a lightweight and open-developed package.

The main function, model_parameters(), can be seen as an alternative to broom::tidy(). However, the package also includes many more useful features, some of which are described in our improved documentation:

Improved Support

Besides stabilizing and improving the functions for the most popular models (glm(), glmer(), stan_glm(), psych and lavaan…), the functions p_value(), ci(), standard_error(), standardize() and most importantly model_parameters() now support many more model objects, including mixed models from packages nlme, glmmTMB or GLMMadaptive, zero-inflated models from package pscl, other regression types from packages gam or mgcv, fixed effects regression models from panelr, lfe, feisr or plm, and structural models from FactoMineR.
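As a minimal illustration of the shared interface (the model and variables below are arbitrary examples, not taken from the easystats documentation):

library(parameters)

# Any supported model object works; here, a simple logistic regression
model <- glm(vs ~ wt + disp, data = mtcars, family = binomial)

model_parameters(model)  # tidy table of coefficients, SEs, CIs and p-values
ci(model)                # confidence intervals only
standard_error(model)    # standard errors only
p_value(model)           # p-values only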

Improved Printing

For models with special components, in particular zero-inflated models, model_parameters() separates these components for a clearer output.

## # Conditional component## ## Parameter   | Coefficient |   SE |         95% CI |     z |      p## ------------------------------------------------------------------## (Intercept) |       -0.36 | 0.28 | [-0.90,  0.18] | -1.30 | > .1  ## spp (PR)    |       -1.27 | 0.24 | [-1.74, -0.80] | -5.27 | < .001## spp (DM)    |        0.27 | 0.14 | [ 0.00,  0.54] |  1.95 | 0.05  ## spp (EC-A)  |       -0.57 | 0.21 | [-0.97, -0.16] | -2.75 | < .01 ## spp (EC-L)  |        0.67 | 0.13 | [ 0.41,  0.92] |  5.20 | < .001## spp (DES-L) |        0.63 | 0.13 | [ 0.38,  0.87] |  4.96 | < .001## spp (DF)    |        0.12 | 0.15 | [-0.17,  0.40] |  0.78 | > .1  ## mined (no)  |        1.27 | 0.27 | [ 0.74,  1.80] |  4.72 | < .001## ## # Zero-Inflated component## ## Parameter   | Coefficient |   SE |         95% CI |     z |      p## ------------------------------------------------------------------## (Intercept) |        0.79 | 0.27 | [ 0.26,  1.32] |  2.90 | < .01 ## mined (no)  |       -1.84 | 0.31 | [-2.46, -1.23] | -5.87 | < .001

Join the team

There is still room for improvement, and some new exciting features are already planned. Feel free to let us know how we could further improve this package!

Note that easystats is a new project in active development, looking for contributors and supporters. Thus, do not hesitate to contact one of us if you want to get involved 🙂

  • Check out our other blog posts here!
To leave a comment for the author, please follow the link and comment on their blog: R on easystats.

Cleaning Anomalies to Reduce Forecast Error by 9% with anomalize


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this tutorial, we’ll show how we used clean_anomalies() from the anomalize package to reduce forecast error by 9%.

R Packages Covered:

  • anomalize– Time series anomaly detection

Cleaning Anomalies to Reduce Forecast Error by 9%

We can often improve forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function from anomalize into your forecast workflow.

Forecast Workflow

Forecast With Anomalize

We’ll use the following workflow to remove time series anomalies prior to forecasting.

  1. Identify the anomalies– Decompose the time series with time_decompose() and anomalize() the remainder (residuals)

  2. Clean the anomalies– Use the new clean_anomalies() function to reconstruct the time series, replacing anomalies with the trend and seasonal components

  3. Forecast– Use a forecasting algorithm to predict new observations from a training set, then compare to test set with and without anomalies cleaned

Step 1 – Load Libraries

First, load the following libraries to follow along.

library(tidyverse)   # Core data manipulation and visualization libraries
library(tidyquant)   # Used for business-ready ggplot themes
library(anomalize)   # Identify and clean time series anomalies
library(timetk)      # Time Series Machine Learning Features
library(knitr)       # For kable() function

Step 2 – Get the Data

This tutorial uses the tidyverse_cran_downloads dataset that comes with anomalize. These are the historical downloads of several “tidy” R packages from 2017-01-01 to 2018-03-01.

Let’s take one package with some extreme events. We’ll home in on lubridate (but you could pick any).

tidyverse_cran_downloads %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(ncol = 3, alpha_dots = 0.3)

plot of chunk unnamed-chunk-2

We’ll filter() downloads of the lubridate R package.

lubridate_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate")

Here’s a visual representation of the forecast experiment setup. Training data will be any data before “2018-01-01”.

plot of chunk unnamed-chunk-4
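For reference, the split itself is just a date filter; the actual split happens inside the forecast_downloads() helper shown in the Appendix, but an equivalent sketch is:

# Illustrative train/test split at 2018-01-01
train_tbl <- lubridate_tbl %>% filter(date <  ymd("2018-01-01"))
test_tbl  <- lubridate_tbl %>% filter(date >= ymd("2018-01-01"))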

Step 3 – Workflow for Cleaning Anomalies

The workflow to clean anomalies:

  1. We decompose the “count” column using time_decompose() – This returns a Seasonal-Trend-Loess (STL) Decomposition in the form of “observed”, “season”, “trend” and “remainder”.

  2. We fix any negative values – If present, they can throw off forecasting transformations (e.g. log and power transformations)

  3. We identify anomalies with anomalize() on the “remainder” column – Returns “remainder_l1” (lower limit), “remainder_l2” (upper limit), and “anomaly” (Yes/No).

  4. We use the function, clean_anomalies(), to add a new column called “observed_cleaned” that repairs the anomalous data by replacing all anomalies with the trend + seasonal components from the decompose operation.

lubridate_anomalized_tbl <- lubridate_tbl %>%
  # 1. Decompose download counts and anomalize the STL decomposition remainder
  time_decompose(count) %>%
  # 2. Fix negative values if any in observed
  mutate(observed = ifelse(observed < 0, 0, observed)) %>%
  # 3. Identify anomalies
  anomalize(remainder) %>%
  # 4. Clean & repair anomalous data
  clean_anomalies()

# Show change in observed vs observed_cleaned
lubridate_anomalized_tbl %>%
  filter(anomaly == "Yes") %>%
  select(date, anomaly, observed, observed_cleaned) %>%
  head() %>%
  kable()
date        anomaly  observed  observed_cleaned
2017-01-12  Yes             0          3522.194
2017-04-19  Yes          8549          5201.716
2017-09-01  Yes             0          4136.721
2017-09-07  Yes          9491          4871.176
2017-10-30  Yes         11970          6412.571
2017-11-13  Yes         10267          6640.871

Here’s a visual of the “observed” (uncleaned) vs the “observed_cleaned” (cleaned) training sets. We’ll see what influence these anomalies have on a forecast regression (next).

plot of chunk unnamed-chunk-6

Step 4 – Forecasting Downloads of the Lubridate Package

First, we’ll make a function, forecast_downloads(), that can take either the uncleaned or the cleaned series as input and return the forecasted downloads versus the actual downloads. The modeling function is described in the Appendix – Forecast Downloads Function.

Step 4.1 – Before Cleaning with anomalize

We’ll first perform a forecast without cleaning anomalies (high leverage points).

  • The forecast_downloads() function trains on the “observed” (uncleaned) data and returns predictions versus actual.
  • Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
  • The model uses a linear regression of the form sqrt(observed) ~ numeric index + year + quarter + month + day of week.
lubridate_forecast_with_anomalies_tbl <- lubridate_anomalized_tbl %>%
  # See Appendix - Forecast Downloads Function
  forecast_downloads(
    col_train = observed,      # First train with anomalies included
    sep       = "2018-01-01",  # Separate at 1st of year
    trans     = "sqrt"         # Perform sqrt() transformation
  )

Forecast vs Actual Values

The forecast is overplotted against the actual values.

plot of chunk unnamed-chunk-9

We can see that the forecast is shifted vertically, an effect of the high leverage points.

plot of chunk unnamed-chunk-10

Forecast Error Calculation

The mean absolute error (MAE) is 1570, meaning on average the forecast is off by 1570 downloads each day.

lubridate_forecast_with_anomalies_tbl%>%summarise(mae=mean(abs(prediction-actual)))
## # A tibble: 1 x 1##     mae##   ## 1 1570.

Step 4.2 – After Cleaning with anomalize

We’ll next perform a forecast this time using the repaired data from clean_anomalies().

  • The forecast_downloads() function trains on the “observed_cleaned” (cleaned) data and returns predictions versus actual.
  • Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
  • The model uses a linear regression of the form sqrt(observed_cleaned) ~ numeric index + year + quarter + month + day of week
lubridate_forecast_without_anomalies_tbl <- lubridate_anomalized_tbl %>%
  # See Appendix - Forecast Downloads Function
  forecast_downloads(
    col_train = observed_cleaned,  # Forecast with cleaned anomalies
    sep       = "2018-01-01",      # Separate at 1st of year
    trans     = "sqrt"             # Perform sqrt() transformation
  )

Forecast vs Actual Values

The forecast is overplotted against the actual values. The cleaned data is shown in Yellow.

plot of chunk unnamed-chunk-13

Zooming in on the forecast region, we can see that the forecast does a better job following the trend in the test data.

plot of chunk unnamed-chunk-14

Forecast Error Calculation

The mean absolute error (MAE) is 1435, meaning on average the forecast is off by 1435 downloads each day.

lubridate_forecast_without_anomalies_tbl%>%summarise(mae=mean(abs(prediction-actual)))
## # A tibble: 1 x 1##     mae##   ## 1 1435.

8.6% Reduction in Forecast Error

Using the new anomalize function, clean_anomalies(), prior to forecasting results in an 8.6% reduction in forecast error as measured by Mean Absolute Error (MAE).

((1435-1570)/1570)
## [1] -0.08598726

Conclusion

Forecasting with clean anomalies is a good practice that can provide substantial improvement to forecasting accuracy by removing high leverage points. The new clean_anomalies() function in the anomalize package provides an easy workflow for removing anomalies prior to forecasting. Learn more in the anomalize documentation.

Data Science Training

Interested in Learning Anomaly Detection?

Business Science offers two 1-hour labs on Anomaly Detection:

Interested in Improving Your Forecasting?

Business Science offers a 1-hour lab on increasing Forecasting Accuracy:

  • Learning Lab 5– 5 Strategies to Improve Forecasting Performance by 50% (or more) using arima and glmnet

Interested in Becoming an Expert in Data Science for Business?

Business Science offers a 3-Course Data Science for Business R-Track designed to take students from no experience to expert data scientist (advanced machine learning and web application development) in under 6 months.

Appendix – Forecast Downloads Function

The forecast_downloads() function uses the following procedure:

  • Split the data into training and testing data using a date specified using the sep argument.
  • Apply a statistical transformation: none, log-1-plus (log1p()), or power (sqrt())
  • Model the daily time series of the training data set from observed (demonstrates no cleaning) or observed_cleaned (demonstrates the improvement from cleaning). The column is specified by the col_train argument.
  • Compare the predictions to the observed values.
forecast_downloads <- function(data, col_train, sep = "2018-01-01",
                               trans = c("none", "log1p", "sqrt")) {
  
  predict_expr <- enquo(col_train)
  trans <- trans[1]
  
  # Split into training/testing sets
  train_tbl <- data %>% filter(date < ymd(sep))
  test_tbl  <- data %>% filter(date >= ymd(sep))
  
  # Apply Transformation
  pred_form <- quo_name(predict_expr)
  if (trans != "none") pred_form <- str_glue("{trans}({pred_form})")
  
  # Make the model formula
  model_formula <- str_glue("{pred_form} ~ index.num + half
                             + quarter + month.lbl + wday.lbl") %>%
    as.formula()
  
  # Apply model formula to data that is augmented with time-based features
  model_glm <- train_tbl %>%
    tk_augment_timeseries_signature() %>%
    glm(model_formula, data = .)
  
  # Make Prediction
  suppressWarnings({
    # Suppress rank-deficient warning
    prediction <- predict(model_glm, newdata = test_tbl %>% tk_augment_timeseries_signature())
    actual     <- test_tbl %>% pull(observed)  # compare against the raw observed downloads
  })
  
  if (trans == "log1p") prediction <- expm1(prediction)
  if (trans == "sqrt")  prediction <- ifelse(prediction < 0, 0, prediction)^2
  
  # Return predictions and actual
  tibble(
    date       = tk_index(test_tbl),
    prediction = prediction,
    actual     = actual
  )
}
To leave a comment for the author, please follow the link and comment on their blog: business-science.io.


Meetup Recap: Survey and Measure Development in R


[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Have you ever taken a survey at the doctor or for a job interview and wondered what exactly that data was used for? There is a long-standing series of methodologies, many coming from psychology, on how to reliably measure “latent” traits, such as depression or loyalty, from self-report survey data. 

While measurement is a common method in fields ranging from kinesiology to education, it’s usually conducted in proprietary tools like SPSS or MPlus. There is really not much training available online for survey development in R, but R is more than capable of conducting it through packages like psych and lavaan.

Over the summer, I presented the following hour-long workshop on survey development in R to the Greater Cleveland R Users meetup group. The video and slides are below. You can also access all code, files and assets used in the presentation on RStudio Cloud.


Slides:

 

To learn more about survey development, check out my DataCamp course, “Survey and Measure Development in R.” The first chapter is free. 

Is your meetup looking to learn about survey measurement? Would your organization like to build validated survey instruments? Does your organization do it, but wants to move to R? Drop me a line

To leave a comment for the author, please follow the link and comment on their blog: George J. Mount.

EARL London 2019 Conference Recap


[This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Damian Rodziewicz of Appsilon Data Science Consulting at EARL London 2019

I had an awesome time at the Enterprise Applications of the R Language (EARL) Conference held in London in September, 2019. EARL reminded me that it is good to keep showing up at conferences. I entered and the first thing I heard was organisers at the table welcoming me “Damian is that you? Awesome to see you again!” Feels great to be a part of the amazing R community. Within minutes I already met a couple of people. I like the vibe at EARLs. I ran into some of the people from other conferences who remembered me from Insurance Data Science in Zurich and EARL 2018 in Seattle

What a fantastic view from the #earlconf venue! pic.twitter.com/jtIS3y6ETX

— Damian Rodziewicz (@D_Rodziewicz) September 11, 2019

At EARL 2018 in Seattle I presented about analyzing satellite imagery with deep learning networks, but this time I was a pure attendee, which was a different sort of intense and fantastic. The conference kicked off with a keynote by Julia Silge, the Jane Austen loving astrophysicist and author of “Text Mining with R.”. Julia shared very interesting insights about Data Scientists all over the world that were gathered in a huge Stack Overflow poll.

% of @StackOverflow questions by technology presented by @juliasilge #earlconf @appsilon pic.twitter.com/6c6tz98es2

— Damian Rodziewicz (@D_Rodziewicz) September 11, 2019

I have worked on 50 something Shiny dashboard applications, but I always learn something new at conferences. Christel Swift had an interesting presentation about using shiny at BBC. I also learned quite a few tricks during the workshop about R Markdown and Interactive Dashboards that took place the day before the conference.

Useful and practical talk by Christel Swift from BBC about building insights and sharing them through a shiny app! #earlconf pic.twitter.com/IaKq0FeQd0

— Damian Rodziewicz (@D_Rodziewicz) September 11, 2019

I didn’t know that it took 15 years to develop a new drug. I heard a fascinating presentation by Bayer about where data science meets the world of pharmaceuticals. The topic is close to us as we also face many challenges when working with data in the pharma industry.  

It takes 15 years to develop a new medicine! Data Science helps Bayer tremendously throughout the process #earlconf @appsilon pic.twitter.com/mTxwlpbFvW

— Damian Rodziewicz (@D_Rodziewicz) September 11, 2019

I thought this slide was hilarious: 

Perfect image to explain regression testing 😀 #earlconf @appsilon pic.twitter.com/wdKPc9PHth

— Damian Rodziewicz (@D_Rodziewicz) September 11, 2019

At the conference you could also learn how Transport for London (TFL) uses Data Science to reduce station overcrowding and closures – take a look at Mark Samuels blog post about the presentation – https://diginomica.com/how-tfl-using-data-science-reduce-station-overcrowding-and-closures.

Before/between/after the presentations I had so many fascinating conversations, some of which continued into the wee hours. I think everyone, even competitors, recognized that there was so much to gain from sharing information and bouncing ideas off of each other. 

Many people I met were just starting with R and introducing R in their companies. I heard a lot of questions about using R in production – our pitch about big pharma that introduced R with our support fit perfectly there. 

Note to self: pick up a Catch Box for when we host a conference. You can throw it at people instead of awkwardly leaning over crowds of people trying to hand them the microphone. It was entertaining each time they tossed the Catch Box at an audience member.    

Thank you @MangoTheCat for organizing #earlconf! Worth noting that audio and logistics were perfect. Also, the microphone that you can throw at people is I think a dream of every presenter – awesome idea @rstudio! #rstats @appsilon pic.twitter.com/qz3XILhidJ

— Damian Rodziewicz (@D_Rodziewicz) September 12, 2019

EARL was exciting and well organized. I got to know the Mango Solutions founders, a lot of RStudio folks and plenty of different Data Scientists from various companies. EARL is the conference that you don’t want to miss.

We are in the middle of the WhyR Warsaw conference at the moment. We’re so excited to host Dr. Kenneth Benoit from the London School of Economics and creator of the quanteda R package. I will co-present with him on the topic of Natural Language Processing for non-programmers. But that is a post for another time! Thanks for stopping by. Questions? Comments? You can find me on Twitter @D_Rodziewicz.

 

Article EARL London 2019 Conference Recap comes from Appsilon Data Science | End­ to­ End Data Science Solutions.

To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End to End Data Science Solutions.

New package: GetQuandlData


[This article was first published on R on msperlin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Quandl is one of the best platforms for finding and downloading financial and economic time series. The collection of free databases is solid and I’ve used it intensively in my research and class material.

But, a couple of things from the native package Quandl always bothered me:

  • Data for multiple series is always returned in the wide (column-oriented) format (why??);
  • No local caching of data;
  • No control over import errors and status;
  • Not easy to work with inside the tidyverse collection of packages

As you suspect, I decided to tackle the problem over the weekend. The result is package GetQuandlData. This is what it does differently:

  • It uses the json api (and not the Quandl native function), so that some metadata is also returned;
  • The resulting dataframe is always returned in the long format, even for multiple series;
  • Users can set custom names for input series. This is very useful when plotting with ggplot or making tables;
  • Uses package memoise to set a local caching system. This means that the second time you ask for a particular time series, it will grab it from your hard drive (and not the internet);
  • Always compares the requested dates against dates available in the platform.

Installation

# not in CRAN yet (need to test it further)
# install.packages('GetQuandlData')

# from github
devtools::install_github('msperlin/GetQuandlData')

Example 01 – Inflation in the US

Let’s download and plot information about inflation in the US:

library(GetQuandlData)
library(tidyverse)

my_id <- c('Inflation USA' = 'RATEINF/INFLATION_USA')
my_api <- readLines('~/Dropbox/.quandl_api.txt') # you need your own API (get it at https://www.quandl.com/sign-up-modal?defaultModal=showSignUp)

first_date <- '2000-01-01'
last_date <- Sys.Date()

df <- get_Quandl_series(id_in = my_id, 
                        api_key = my_api, 
                        first_date = first_date,
                        last_date = last_date, 
                        cache_folder = tempdir())

glimpse(df)
## Observations: 236## Variables: 4## $ series_name  "Inflation USA", "Inflation USA", "Inflation USA", "…## $ ref_date     2019-08-31, 2019-07-31, 2019-06-30, 2019-05-31, 201…## $ value        1.750, 1.811, 1.648, 1.790, 1.996, 1.863, 1.520, 1.5…## $ id_quandl    "RATEINF/INFLATION_USA", "RATEINF/INFLATION_USA", "R…

As you can see, the data is in the long format. Let’s plot it:

p <- ggplot(df, aes(x = ref_date, y = value/100)) + 
  geom_col() + 
  labs(y = 'Inflation (%)', 
       x = '',
       title = 'Inflation in the US') + 
  scale_y_continuous(labels = scales::percent)

p

Beautiful!

Example 02 – Inflation for many countries

Next, let’s have a look at a more realistic case, where we need inflation data for several countries:

First, we need to see which datasets are available from database RATEINF:

library(GetQuandlData)
library(tidyverse)

db_id <- 'RATEINF'
my_api <- readLines('~/Dropbox/.quandl_api.txt') # you need your own API

df <- get_database_info(db_id, my_api)

knitr::kable(df)
codenamedescriptionrefreshed_atfrom_dateto_datequandl_codequandl_db
CPI_ARGConsumer Price Index – ArgentinaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:581988-01-312013-12-31RATEINF/CPI_ARGRATEINF
CPI_AUSConsumer Price Index – AustraliaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591948-09-302019-06-30RATEINF/CPI_AUSRATEINF
CPI_CANConsumer Price Index – CanadaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591989-01-312019-08-31RATEINF/CPI_CANRATEINF
CPI_CHEConsumer Price Index – SwitzerlandPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:581983-01-312019-08-31RATEINF/CPI_CHERATEINF
CPI_DEUConsumer Price Index – GermanyPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591991-01-312019-08-31RATEINF/CPI_DEURATEINF
CPI_EURConsumer Price Index – Euro AreaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591990-01-312019-08-31RATEINF/CPI_EURRATEINF
CPI_FRAConsumer Price Index – FrancePlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591990-01-312019-08-31RATEINF/CPI_FRARATEINF
CPI_GBRConsumer Price Index – UKPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:581988-01-312019-08-31RATEINF/CPI_GBRRATEINF
CPI_ITAConsumer Price Index – ItalyPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:592001-01-312019-08-31RATEINF/CPI_ITARATEINF
CPI_JPNConsumer Price Index – JapanPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591970-01-312019-08-31RATEINF/CPI_JPNRATEINF
CPI_NZLConsumer Price Index – New ZealandPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591988-03-312019-06-30RATEINF/CPI_NZLRATEINF
CPI_RUSConsumer Price Index – RussiaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591995-01-312019-07-31RATEINF/CPI_RUSRATEINF
CPI_USAConsumer Price Index – USAPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591913-01-312019-08-31RATEINF/CPI_USARATEINF
INFLATION_ARGInflation YOY – ArgentinaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:581989-01-312013-12-31RATEINF/INFLATION_ARGRATEINF
INFLATION_AUSInflation YOY – AustraliaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591949-03-312019-06-30RATEINF/INFLATION_AUSRATEINF
INFLATION_CANInflation YOY – CanadaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591990-01-312019-08-31RATEINF/INFLATION_CANRATEINF
INFLATION_CHEInflation YOY – SwitzerlandPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591984-01-312019-08-31RATEINF/INFLATION_CHERATEINF
INFLATION_DEUInflation YOY – GermanyPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591992-01-312019-08-31RATEINF/INFLATION_DEURATEINF
INFLATION_EURInflation YOY – Euro AreaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591991-01-312019-08-31RATEINF/INFLATION_EURRATEINF
INFLATION_FRAInflation YOY – FrancePlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591991-01-312019-08-31RATEINF/INFLATION_FRARATEINF
INFLATION_GBRInflation YOY – UKPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591989-01-312019-08-31RATEINF/INFLATION_GBRRATEINF
INFLATION_ITAInflation YOY – ItalyPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:592002-01-312019-08-31RATEINF/INFLATION_ITARATEINF
INFLATION_JPNInflation YOY – JapanPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591971-01-312019-08-31RATEINF/INFLATION_JPNRATEINF
INFLATION_NZLInflation YOY – New ZealandPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:592001-03-312019-06-30RATEINF/INFLATION_NZLRATEINF
INFLATION_RUSInflation YOY – RussiaPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591996-01-312019-07-31RATEINF/INFLATION_RUSRATEINF
INFLATION_USAInflation YOY – USAPlease visit http://www.rateinflation.com/inflation-information/calculate-inflation for more information.2019-09-28 02:19:591914-01-312019-08-31RATEINF/INFLATION_USARATEINF

Nice. Now we only need to filter the series with YOY inflation:

idx <- stringr::str_detect(df$name, 'Inflation YOY')
df_series <- df[idx, ]

and grab the data:

my_id <- df_series$quandl_code
names(my_id) <- df_series$name

first_date <- '2010-01-01'
last_date <- Sys.Date()

df_inflation <- get_Quandl_series(id_in = my_id, 
                                  api_key = my_api,
                                  first_date = first_date,
                                  last_date = last_date)

glimpse(df_inflation)
## Observations: 897## Variables: 4## $ series_name  "Inflation YOY - Argentina", "Inflation YOY - Argent…## $ ref_date     2013-12-31, 2013-11-30, 2013-10-31, 2013-09-30, 201…## $ value        10.95, 10.54, 10.55, 10.49, 10.55, 10.61, 10.46, 10.…## $ id_quandl    "RATEINF/INFLATION_ARG", "RATEINF/INFLATION_ARG", "R…

And, finally, an elegant plot:

p <- ggplot(df_inflation, aes(x = ref_date, y = value/100)) + 
  geom_col() + 
  labs(y = 'Inflation (%)', 
       x = '',
       title = 'Inflation in the World',
       subtitle = paste0(first_date, ' to ', last_date)) + 
  scale_y_continuous(labels = scales::percent) + 
  facet_wrap(~series_name)

p

To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.

Fast adaptive spectral clustering in R (brain cancer RNA-seq)


[This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Spectral clustering refers to a family of algorithms that cluster the eigenvectors derived from the matrix representing the input data’s graph. An important step in this method is applying a kernel function to the input data to generate an N×N similarity matrix or graph (where N is the number of input observations). Subsequent steps include computing the normalised graph Laplacian from this similarity matrix, getting the eigensystem of this graph, and lastly applying k-means on the top K eigenvectors to get the K clusters. Clustering in this way adds flexibility in the range of data that may be analysed, and spectral clustering will often outperform k-means. It is an excellent option for image and bioinformatic cluster analysis, including single platform and multi-omics.

‘Spectrum’ is a fast adaptive spectral clustering algorithm for R programmed by myself from QMUL and David Watson from Oxford. It contains both novel methodological advances and implementation of pre-existing methods. Spectrum has the following features; 1) A new density-aware kernel that increases similarity between observations that share common nearest neighbours, 2) A tensor product graph data integration and noise reduction system, 3) The eigengap method to decide on the number of clusters, 4) Gaussian Mixture Modelling for the final clustering of the eigenvectors, 5) Implementation of a Fast Approximate Spectral Clustering method for very big datasets, 6) Data imputation for multi-view analyses. Spectrum has been recently published as an article in Bioinformatics. It is available to download from CRAN: https://cran.r-project.org/web/packages/Spectrum/index.html.

In this demonstration, we are going to use Spectrum to cluster brain cancer RNA-seq to find distinct patient groups with different survival times. This is the link, braincancer_test_data, to download the test data for the analysis. The data can also be accessed through Synapse (https://www.synapse.org/#!Synapse:syn18911542/files/).

## load libraries required for analysis
library(Spectrum)
library(plot3D)
library(survival)
library(survminer)
library(scales)

The next block of code runs Spectrum, then does a t-sne analysis to visualise the clusters embedded in a lower dimensional space.

I have found 3D t-sne can work well to see complex patterns of Gaussian clusters which can be found in omic data (as in this cancer data), compared to PCA which is better for a more straightforward structure and outlier detection.

## run Spectrum
r <- Spectrum(brain[[1]])

## do t-sne analysis of results
y <- Rtsne::Rtsne(t(brain[[1]]), dim = 3)
scatter3D(y$Y[,1], y$Y[,2], y$Y[,3], phi = 0, # bty = "g", ex = 2,
          ticktype = "detailed", colvar = r$assignments,
          col = gg.col(100), pch = 20, cex = 2, # type = 'h',
          xlab = 'Dim1', ylab = 'Dim2', zlab = 'Dim3')

Each eigenvector of the graph Laplacian numerically indicates membership of an observation to a block (cluster) in the data's graph Laplacian, and each eigenvalue represents how disconnected that cluster is from the other clusters. In the ideal case, if we had eigenvalues reading 1, 1, 1, 1, 0.2 in the below plot, that would tell us we have 4 completely disconnected clusters, followed by one that is much more connected (undesirable when trying to find separate clusters). This principle can be extended to look for the greatest gap in the eigenvalues to find the number of clusters (K).
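As a toy illustration of the eigengap heuristic (the eigenvalues below are invented for the example):

# Toy eigengap example: the largest drop between consecutive eigenvalues suggests K
evals <- c(1.00, 0.99, 0.98, 0.96, 0.55, 0.52)  # hypothetical sorted eigenvalues
gaps  <- -diff(evals)                           # size of each successive drop
which.max(gaps)                                 # biggest gap follows the 4th eigenvalue, so K = 4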

This first plot, produced automatically by Spectrum for the brain cancer data, shows the eigenvalues for each eigenvector of the graph Laplacian. Here the biggest gap is between the 4th and 5th eigenvectors, thus corresponding to K=4.

egap_brain.png

This next plot shows the 3D t-sne of the brain cancer RNA-seq clusters. We can see clear separation of the groups. On this type of complex data, Spectrum tends to excel due to its prioritisation of local similarities and its ability to reduce noise by making the k nearest neighbour graph and diffusing on it.

brain_tsne_rnaseq.png

Next, we can run a survival analysis using a Cox proportional hazards model; the log rank test is used to test whether there is a difference between the survival times of the different groups (clusters).

## do test
clinicali <- brain[[2]]
clinicali$Death <- as.numeric(as.character(clinicali$Death))

coxFit <- coxph(Surv(time = Time, event = Death) ~ as.factor(r$assignments),
                data = clinicali, ties = "exact")
coxresults <- summary(coxFit)
print(coxresults$logtest[3])

## survival curve
survival_p <- coxresults$logtest[3]
fsize <- 18

clinicalj <- clinicali
# shift cluster labels to start at 1 if any are labelled 0
if (0 %in% r$assignments) {
  r$assignments <- r$assignments + 1
}
clinicalj$cluster <- as.factor(r$assignments)

fit <- survfit(Surv(Time, Death) ~ cluster, data = clinicalj)

## plot
gg <- ggsurvplot(fit, data = clinicalj, legend = 'right',
                 font.x = fsize, font.y = fsize, font.tickslab = fsize,
                 palette = gg.col(4),
                 legend.title = 'Cluster',  # palette = "Dark2",
                 ggtheme = theme_classic2(base_family = "Arial", base_size = fsize))

gg$plot + theme(legend.text = element_text(size = fsize, color = "black")) + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = fsize, colour = 'black'),
        axis.text.x = element_text(size = fsize, colour = 'black'),
        axis.title.x = element_text(size = fsize),
        axis.title.y = element_text(size = fsize),
        legend.title = element_text(size = fsize),
        legend.text = element_text(size = fsize))

This is the survival curve for the analysis. The code above also gives us the p value (4.03E-22) for the log rank test which is highly significant, suggesting Spectrum yields good results. More extensive comparisons are included in our manuscript.

brain_survival.png

Spectrum is effective at reducing noise on a single-view (like here) as well as multi-view because it performs a tensor product graph diffusion operation in either case, and it uses a k nearest neighbour graph. Multi-omic clustering examples are included in the package vignette. Spectrum will also be well suited to a broad range of different types of data beyond this demonstration given today.

The density adaptive kernel enhances the intra-cluster similarity which helps to improve the quality of the similarity matrix. A good quality similarity matrix (that represents the data’s graph) is key to the performance of spectral clustering, where intra-cluster similarity should be maximised and inter-cluster similarity minimised.

Spectrum is a fast new algorithm for spectral clustering in R. It is largely based on work by Zelnik-Manor, Ng, and Zhang, and includes implementations of pre-existing methods as well as new innovations. This is the second of the pair of clustering tools we developed for precision medicine, the first being M3C, a consensus clustering algorithm which is on the Bioconductor. In many ways, spectral clustering is a more elegant and powerful method than consensus clustering type algorithms, but both are useful to examine data from a different perspective.

Spectrum can be downloaded here: https://cran.r-project.org/web/packages/Spectrum/index.html

Source code is available here: https://github.com/crj32/Spectrum

Reference

Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, Myles J Lewis, Spectrum: fast density-aware spectral clustering for single and multi-omic data, Bioinformatics, , btz704, https://doi.org/10.1093/bioinformatics/btz704

To leave a comment for the author, please follow the link and comment on their blog: R – intobioinformatics.

Learning Data Science: The Supermarket knows you are pregnant before your Dad does


[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A few months ago I posted about market basket analysis (see Customers who bought…). In this post we will see another form of it, done with Logistic Regression, so read on…

A big supermarket chain wanted to target (wink, wink) certain customer groups better. In this special case we are talking about pregnant women. The story goes that they identified a young girl as being pregnant and kept sending her coupons for baby care products. Now, the father got angry because she was “too young”… and complained to the supermarket. The whole story took a turn when his daughter confessed that… well, you know what! We are now going to reproduce a similar model here!

In this example, we have a dataset with products bought by customers with the additional information whether the respective buyer was pregnant or not. This is coded in the last column as 1 for pregnant and 0 for not pregnant, 500 instances each. As always all kinds of analyses could be used but we stick with good old logistic regression because, first, it works quite well, and second, as we will see, the results are interpretable in this case.

Have a look at the following code (the data is from the book Data Smart by John Foreman and can be downloaded here: ch06.zip):

RetailMart <- read.csv("data/RetailMart.csv") # load data
head(RetailMart)
##   Male Female Home Apt Pregnancy.Test Birth.Control Feminine.Hygiene
## 1    1      0    0   1              1             0                0
## 2    1      0    1   0              1             0                0
## 3    1      0    1   0              1             0                0
## 4    0      0    1   0              0             0                0
## 5    0      1    0   1              0             0                0
## 6    0      1    1   0              0             0                0
##   Folic.Acid Prenatal.Vitamins Prenatal.Yoga Body.Pillow Ginger.Ale
## 1          0                 1             0           0          0
## 2          0                 1             0           0          0
## 3          0                 0             0           0          0
## 4          0                 0             0           0          1
## 5          0                 0             1           0          0
## 6          0                 1             0           0          0
##   Sea.Bands Stopped.buying.ciggies Cigarettes Smoking.Cessation
## 1         0                      0          0                 0
## 2         0                      0          0                 0
## 3         1                      0          0                 0
## 4         0                      0          0                 0
## 5         0                      0          0                 0
## 6         0                      1          0                 0
##   Stopped.buying.wine Wine Maternity.Clothes PREGNANT
## 1                   0    0                 0        1
## 2                   0    0                 0        1
## 3                   0    0                 0        1
## 4                   0    0                 0        1
## 5                   1    0                 0        1
## 6                   0    0                 0        1
tail(RetailMart)
##      Male Female Home Apt Pregnancy.Test Birth.Control Feminine.Hygiene
## 995     1      0    1   0              0             0                1
## 996     1      0    0   1              0             0                0
## 997     0      1    0   1              0             0                0
## 998     1      0    1   0              0             0                1
## 999     0      0    1   0              0             0                0
## 1000    1      0    0   1              0             0                0
##      Folic.Acid Prenatal.Vitamins Prenatal.Yoga Body.Pillow Ginger.Ale
## 995           0                 0             0           0          0
## 996           0                 0             0           0          0
## 997           0                 0             0           0          0
## 998           0                 0             0           0          0
## 999           0                 0             0           0          0
## 1000          0                 0             0           0          1
##      Sea.Bands Stopped.buying.ciggies Cigarettes Smoking.Cessation
## 995          1                      0          0                 0
## 996          0                      0          0                 0
## 997          0                      0          0                 0
## 998          0                      0          0                 0
## 999          0                      0          0                 0
## 1000         0                      0          0                 0
##      Stopped.buying.wine Wine Maternity.Clothes PREGNANT
## 995                    0    0                 0        0
## 996                    0    0                 0        0
## 997                    0    0                 0        0
## 998                    0    0                 0        0
## 999                    0    0                 0        0
## 1000                   0    0                 1        0
table(RetailMart$PREGNANT)
## 
##   0   1 
## 500 500
str(RetailMart)
## 'data.frame':    1000 obs. of  20 variables:
##  $ Male                  : int  1 1 1 0 0 0 1 0 0 0 ...
##  $ Female                : int  0 0 0 0 1 1 0 1 1 1 ...
##  $ Home                  : int  0 1 1 1 0 1 1 1 1 1 ...
##  $ Apt                   : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ Pregnancy.Test        : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ Birth.Control         : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Feminine.Hygiene      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Folic.Acid            : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Prenatal.Vitamins     : int  1 1 0 0 0 1 1 0 0 1 ...
##  $ Prenatal.Yoga         : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Body.Pillow           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Ginger.Ale            : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Sea.Bands             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Stopped.buying.ciggies: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Cigarettes            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Smoking.Cessation     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Stopped.buying.wine   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Wine                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Maternity.Clothes     : int  0 0 0 0 0 0 0 1 0 1 ...
##  $ PREGNANT              : int  1 1 1 1 1 1 1 1 1 1 ...

The metadata for each feature are the following:

  • Account holder is Male/Female/Unknown by matching surname to census data.
  • Account holder address is a home, apartment, or PO box.
  • Recently purchased a pregnancy test
  • Recently purchased birth control
  • Recently purchased feminine hygiene products
  • Recently purchased folic acid supplements
  • Recently purchased prenatal vitamins
  • Recently purchased prenatal yoga DVD
  • Recently purchased body pillow
  • Recently purchased ginger ale
  • Recently purchased Sea-Bands
  • Bought cigarettes regularly until recently, then stopped
  • Recently purchased cigarettes
  • Recently purchased smoking cessation products (gum, patch, etc.)
  • Bought wine regularly until recently, then stopped
  • Recently purchased wine
  • Recently purchased maternity clothing

To build the actual model we use glm (short for generalized linear model):

logreg <- glm(PREGNANT ~ ., data = RetailMart, family = binomial) # logistic regression - glm stands for generalized linear model
summary(logreg)
## 
## Call:
## glm(formula = PREGNANT ~ ., family = binomial, data = RetailMart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2012  -0.5566  -0.0246   0.5127   2.8658  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.204470   0.422738  -0.484 0.628613    
## Male                   -0.595820   0.315546  -1.888 0.058997 .  
## Female                 -0.141939   0.307588  -0.461 0.644469    
## Home                   -0.170115   0.334798  -0.508 0.611375    
## Apt                     0.002813   0.336432   0.008 0.993329    
## Pregnancy.Test          2.370554   0.521781   4.543 5.54e-06 ***
## Birth.Control          -2.300272   0.365270  -6.297 3.03e-10 ***
## Feminine.Hygiene       -2.028558   0.342398  -5.925 3.13e-09 ***
## Folic.Acid              4.077666   0.761888   5.352 8.70e-08 ***
## Prenatal.Vitamins       2.479469   0.369063   6.718 1.84e-11 ***
## Prenatal.Yoga           2.922974   1.146990   2.548 0.010822 *  
## Body.Pillow             1.261037   0.860617   1.465 0.142847    
## Ginger.Ale              1.938502   0.426733   4.543 5.55e-06 ***
## Sea.Bands               1.107530   0.673435   1.645 0.100053    
## Stopped.buying.ciggies  1.302222   0.342347   3.804 0.000142 ***
## Cigarettes             -1.443022   0.370120  -3.899 9.67e-05 ***
## Smoking.Cessation       1.790779   0.512610   3.493 0.000477 ***
## Stopped.buying.wine     1.383888   0.305883   4.524 6.06e-06 ***
## Wine                   -1.565539   0.348910  -4.487 7.23e-06 ***
## Maternity.Clothes       2.078202   0.329432   6.308 2.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.29  on 999  degrees of freedom
## Residual deviance:  744.11  on 980  degrees of freedom
## AIC: 784.11
## 
## Number of Fisher Scoring iterations: 7

Concerning interpretability, have a look at the output of the summary function above. First, you can see that some features have more stars than others. This has to do with their statistical significance (see also here: From Coin Tosses to p-Hacking: Make Statistics Significant Again!) and hints at whether the respective feature has a real influence on the outcome or is just random noise. We see that, e.g., Pregnancy.Test, Birth.Control and Folic.Acid, but also the alcohol- and cigarette-related features, get the maximum of three stars and are therefore considered highly significant for the model.
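
If you prefer to pull the significant features out programmatically instead of eyeballing the stars, the coefficient matrix returned by summary() contains the p-values. A minimal sketch (the 5% cut-off is just the conventional choice, not something prescribed by the model):

coefs <- summary(logreg)$coefficients           # columns: Estimate, Std. Error, z value, Pr(>|z|)
round(coefs[coefs[, "Pr(>|z|)"] < 0.05, ], 4)   # keep only features significant at the 5% level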

Another value is the estimate given for each feature, which shows how strongly each feature influences the final model and in which direction (the estimates are directly comparable here because all feature values are either 0 or 1). We can see, for example, that buying pregnancy tests and quitting smoking are quite strong predictors of being pregnant (no surprises here). An interesting case is the sex of the customers: neither coefficient is statistically significant and both point in the same direction. The answer to this seeming paradox is of course that men also buy items for their pregnant girlfriends or wives.
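
Because this is a logistic regression, the estimates live on the log-odds scale; exponentiating them turns them into odds ratios, which are often easier to communicate. A small sketch (the numbers in the comments assume the coefficients from the summary above):

exp(coef(logreg)["Pregnancy.Test"])   # ~10.7: buying a pregnancy test multiplies the odds of PREGNANT roughly tenfold
exp(coef(logreg)["Birth.Control"])    # ~0.1: buying birth control divides the odds by roughly ten
round(sort(exp(coef(logreg)), decreasing = TRUE), 2)  # all features, ranked by odds ratio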

The predictions coming out of the model are probabilities of being pregnant. Now, because a woman is obviously either pregnant or not, and the supermarket has to decide whether to send a coupon or not, we employ a naive approach which draws the line at 50%:

pred <- ifelse(predict(logreg, RetailMart[ , -ncol(RetailMart)], "response") < 0.5, 0, 1) # naive approach to predict whether pregnant
results <- data.frame(actual = RetailMart$PREGNANT, prediction = pred)
results[460:520, ]
##     actual prediction
## 460      1          1
## 461      1          1
## 462      1          1
## 463      1          1
## 464      1          1
## 465      1          1
## 466      1          1
## 467      1          0
## 468      1          1
## 469      1          1
## 470      1          1
## 471      1          1
## 472      1          1
## 473      1          1
## 474      1          1
## 475      1          1
## 476      1          1
## 477      1          1
## 478      1          0
## 479      1          0
## 480      1          1
## 481      1          1
## 482      1          0
## 483      1          0
## 484      1          0
## 485      1          1
## 486      1          1
## 487      1          1
## 488      1          0
## 489      1          1
## 490      1          0
## 491      1          1
## 492      1          0
## 493      1          1
## 494      1          1
## 495      1          1
## 496      1          1
## 497      1          1
## 498      1          0
## 499      1          1
## 500      1          0
## 501      0          1
## 502      0          1
## 503      0          0
## 504      0          0
## 505      0          0
## 506      0          1
## 507      0          0
## 508      0          0
## 509      0          0
## 510      0          0
## 511      0          0
## 512      0          0
## 513      0          0
## 514      0          0
## 515      0          0
## 516      0          0
## 517      0          0
## 518      0          0
## 519      0          0
## 520      0          0

As can be seen in the next code section, the accuracy (all correct predictions divided by all predictions) is well over 80 percent, which is not too bad for a naive out-of-the-box model:

(conf <- table(pred, RetailMart$PREGNANT)) # create confusion matrix
##     
## pred   0   1
##    0 450 115
##    1  50 385
sum(diag(conf)) / sum(conf) # calculate accuracy
## [1] 0.835
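
One caveat: this accuracy is computed on the same data the model was fitted on, so it is likely a bit optimistic. As a rough check, one could hold out part of the data and score the model on that. A minimal sketch, assuming a simple 70/30 split (seed and split ratio are arbitrary choices):

set.seed(42)                                         # arbitrary seed for reproducibility
idx <- sample(nrow(RetailMart), 0.7 * nrow(RetailMart))
train <- RetailMart[idx, ]
test  <- RetailMart[-idx, ]
logreg_cv <- glm(PREGNANT ~ ., data = train, family = binomial)  # refit on the training part only
pred_test <- ifelse(predict(logreg_cv, test, type = "response") < 0.5, 0, 1)
mean(pred_test == test$PREGNANT)                     # hold-out accuracy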

Now, how does logistic regression work? One hint lies in the name of the function: generalized linear model. Whereas standard linear regression (see e.g. here: Learning Data Science: Modelling Basics) tries, in the 2D case, to find the best-fitting line through all points, logistic regression tries to find the best line that separates the two classes (in this case pregnant vs. not pregnant). Under the hood, glm fits a weighted sum of the features (plus an intercept) and pushes it through the logistic (sigmoid) function, which squashes the result into a probability between 0 and 1; thresholding that probability at 50% then gives a linear decision boundary. In the n-D case (i.e. with n features) this line becomes a separating hyperplane, e.g. a plane in the 3D case.
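
To make the mechanics concrete, here is a minimal sketch that reproduces glm's probabilities by hand: the linear predictor (the weighted sum of the features) is pushed through the logistic function 1 / (1 + e^(-x)):

eta <- predict(logreg, type = "link")        # linear predictor: intercept + weighted sum of features
p_manual <- 1 / (1 + exp(-eta))              # logistic (sigmoid) function squashes it into (0, 1)
p_glm <- predict(logreg, type = "response")  # the probabilities glm gives us directly
all.equal(unname(p_manual), unname(p_glm))   # should be TRUE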

One learning from all of this is, again, that simple models are oftentimes quite good and more interpretable than complicated models! Another learning is that even with simple models and enough data, very revealing (and sometimes embarrassing) information can be inferred… you should keep that in mind too!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

