
Data cleaning with Kamehamehas in R

[This article was first published on Musings on R, and kindly contributed to R-bloggers.]

Background

Given present circumstances in the world, I thought it might be nice to write a post on a lighter subject.

Recently, I came across an interesting Kaggle dataset that features the power levels of Dragon Ball characters at different points in the franchise. Whilst the dataset itself is quite simple with only four columns (Character, Power_Level, Saga_or_Movie, Dragon_Ball_Series), I noticed that you do need to do a fair amount of data and string manipulation before you can perform any meaningful data analysis with it. Therefore, if you’re a fan of Dragon Ball and interested in learning about string manipulation in R, this post is definitely for you!

The Kamehameha – image from Giphy

For those who aren’t as interested in Dragon Ball but still interested in general R tricks, please do read ahead anyway – you won’t need to understand the references to know what’s going on with the code. But you have been warned for spoilers! 😂

Functions or techniques that are covered in this post:

  • Basic regular expression (regex) matching
  • stringr::str_detect()
  • stringr::str_remove_all() or stringr::str_remove()
  • dplyr::anti_join()
  • Example of ‘dark mode’ ggplot in themes

Getting started

You can download the dataset from Kaggle; you’ll need to register an account in order to do so. I would highly recommend registering if you haven’t already, since Kaggle hosts tons of datasets that you can practise on.

The next thing I’ll do is to set up my R working directory in this style, and ensure that the dataset is saved in the datasets subfolder. I’ll use the {here} workflow for this example, which is generally good practice as here::here implicitly sets the path root to the top level of the current project.

Let’s load our packages and explore the data using glimpse():

library(tidyverse)
library(here)

dball_data <- read_csv(here("datasets", "Dragon_Ball_Data_Set.csv"))

dball_data %>% glimpse()
## Observations: 1,244
## Variables: 4
## $ Character           "Goku", "Bulma", "Bear Thief", "Master Roshi", "...
## $ Power_Level         "10", "1.5", "7", "30", "5", "8.5", "4", "8", "2...
## $ Saga_or_Movie       "Emperor Pilaf Saga", "Emperor Pilaf Saga", "Emp...
## $ Dragon_Ball_Series  "Dragon Ball", "Dragon Ball", "Dragon Ball", "Dr...

…and also tail() to view the last six rows of the data, just so we get a more comprehensive picture of what some of the other observations in the data look like:

dball_data %>% tail()
## # A tibble: 6 x 4
##   Character              Power_Level       Saga_or_Movie       Dragon_Ball_Seri~
## 1 Goku (base with SSJG ~ 448,000,000,000   Movie 14: Battle o~ Dragon Ball Z
## 2 Goku (MSSJ with SSJG'~ 22,400,000,000,0~ Movie 14: Battle o~ Dragon Ball Z
## 3 Goku (SSJG)            224,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z
## 4 Goku                   44,800,000,000    Movie 14: Battle o~ Dragon Ball Z
## 5 Beerus (full power, n~ 896,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z
## 6 Whis (full power, nev~ 4,480,000,000,00~ Movie 14: Battle o~ Dragon Ball Z

Who does the strongest Kamehameha? 🔥

In the Dragon Ball series, there is an energy attack called Kamehameha, which is a signature (and perhaps the most well recognised) move by the main character Goku. This move is however not unique to him, and has also been used by other characters in the series, including his son Gohan and his master Muten Roshi.

Goku and Muten Roshi – image from Giphy

As you’ll see, this dataset includes observations which detail the power level of the notable occasions when this attack was used. Our task here is to get some understanding of this attack move from the data, and see if we can figure out whose Kamehameha is actually the strongest out of all the characters.

Data cleaning

Here, we use regex (regular expression) string matching to filter on the Character column. The str_detect() function from the {stringr} package detects whether a pattern or expression exists in a string, and returns a logical value of either TRUE or FALSE (which is what dplyr::filter() takes in the second argument). I also used the stringr::regex() function and set the ignore_case argument to TRUE, which makes the filter case-insensitive, such that cases of ‘Kame’ and ‘kAMe’ are also picked up if they do exist.

dball_data %>%
  filter(str_detect(Character, regex("kameha", ignore_case = TRUE))) -> dball_data_1

dball_data_1 %>% head()
## # A tibble: 6 x 4
##   Character                     Power_Level Saga_or_Movie      Dragon_Ball_Seri~
## 1 Master Roshi's Max Power Kam~ 180         Emperor Pilaf Saga Dragon Ball
## 2 Goku's Kamehameha             12          Emperor Pilaf Saga Dragon Ball
## 3 Jackie Chun's Max power Kame~ 330         Tournament Saga    Dragon Ball
## 4 Goku's Kamehameha             90          Red Ribbon Army S~ Dragon Ball
## 5 Goku's Kamehameha             90          Red Ribbon Army S~ Dragon Ball
## 6 Goku's Super Kamehameha       740         Piccolo Jr. Saga   Dragon Ball

If this filter feels convoluted, it’s for a good reason: the dataset uses a variation of cases and spellings, which a ‘straightforward’ case-sensitive filter wouldn’t have picked up. Here are the two rows that such a filter would have missed:

dball_data %>%
  filter(str_detect(Character, "Kamehameha")) -> dball_data_1b

## Show the rows which do not appear in BOTH datasets
dball_data_1 %>%
  dplyr::anti_join(dball_data_1b, by = "Character")
## # A tibble: 2 x 4
##   Character                        Power_Level Saga_or_Movie   Dragon_Ball_Seri~
## 1 Jackie Chun's Max power Kameham~ 330         Tournament Saga Dragon Ball
## 2 Android 19 (Goku's kamehameha a~ 230,000,000 Android Saga    Dragon Ball Z

Before we go any further, we’ll also need to do something about Power_Level: it is currently stored as character/text, which means we can’t do any meaningful analysis until we convert it to numeric. To do this, we can start by removing the comma separators with stringr::str_remove_all(), and then run as.numeric().

In ‘real life’, you often get data saved with k and m suffixes for thousands and millions, which requires a bit more cleaning, so here I’m just thankful that all I have to do is remove some comma separators.
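As an aside, here is a minimal sketch of how such suffixed values could be cleaned up (the power_raw vector below is made up for illustration and is not part of the Kaggle dataset):

library(dplyr)
library(stringr)

# Made-up example values with "k"/"m" suffixes
power_raw <- c("1.5k", "30", "2m", "450k")

tibble(power_raw) %>%
  mutate(
    multiplier = case_when(
      str_detect(power_raw, "k$") ~ 1e3,
      str_detect(power_raw, "m$") ~ 1e6,
      TRUE ~ 1
    ),
    power = as.numeric(str_remove(power_raw, "[km]$")) * multiplier
  )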

dball_data_1 %>%
  mutate_at("Power_Level", ~str_remove_all(., ",")) %>%
  mutate_at("Power_Level", ~as.numeric(.)) -> dball_data_2

dball_data_2 %>% tail()
## # A tibble: 6 x 4
##   Character           Power_Level Saga_or_Movie                Dragon_Ball_Seri~
## 1 Goku's Super Kame~  25300000000 OVA: Plan to Eradicate the ~ Dragon Ball Z
## 2 Family Kamehameha  300000000000 Movie 10: Broly- The Second~ Dragon Ball Z
## 3 Krillin's Kameham~      8000000 Movie 11: Bio-Broly          Dragon Ball Z
## 4 Goten's Kamehameha    950000000 Movie 11: Bio-Broly          Dragon Ball Z
## 5 Trunk's Kamehameha    980000000 Movie 11: Bio-Broly          Dragon Ball Z
## 6 Goten's Super Kam~   3000000000 Movie 11: Bio-Broly          Dragon Ball Z

Now that we’ve fixed the Power_Level column, the next step is to isolate the information about the characters from the Character column. The reason why we have to do this is because, inconveniently, the column provides information for both the character and the occasion of when the kamehameha is used, which means we won’t be able to easily filter or group the dataset by the characters only.

One way to overcome this problem is to use the apostrophe (or single quote) as a delimiter to extract the characters from the column. Before I do this, I will take another manual step to remove the rows corresponding to absorbed kamehamehas, e.g. Android 19 (Goku’s kamehameha absorbed), as it refers to the character’s power level after absorbing the attack, rather than the attack itself. (Yes, some characters are able to absorb kamehameha attacks and make themselves stronger..!)

After applying the filter, I use mutate() to create a new column called Character_Single, and then str_remove_all() to remove all the characters that appear after the apostrophe:

dball_data_2 %>%
  filter(!str_detect(Character, "absorbed")) %>% # Remove 2 rows unrelated to kamehameha attacks
  mutate(Character_Single = str_remove_all(Character, "\\'.+")) %>% # Remove everything after apostrophe
  select(Character_Single, everything()) -> dball_data_3
## # A tibble: 10 x 5
##    Character_Single Character       Power_Level Saga_or_Movie   Dragon_Ball_Ser~
##  1 Master Roshi     Master Roshi's~         180 Emperor Pilaf ~ Dragon Ball
##  2 Goku             Goku's Kameham~          12 Emperor Pilaf ~ Dragon Ball
##  3 Jackie Chun      Jackie Chun's ~         330 Tournament Saga Dragon Ball
##  4 Goku             Goku's Kameham~          90 Red Ribbon Arm~ Dragon Ball
##  5 Goku             Goku's Kameham~          90 Red Ribbon Arm~ Dragon Ball
##  6 Goku             Goku's Super K~         740 Piccolo Jr. Sa~ Dragon Ball
##  7 Goku             Goku's Kameham~         950 Saiyan Saga     Dragon Ball Z
##  8 Goku             Goku's Kameham~       36000 Saiyan Saga     Dragon Ball Z
##  9 Goku             Goku's Kameham~       44000 Saiyan Saga     Dragon Ball Z
## 10 Goku             Goku's Angry K~   180000000 Frieza Saga     Dragon Ball Z

Note that the apostrophe is escaped here by adding two backslashes before it. The dot (.) matches any character, and + tells R to match the preceding pattern one or more times. Regex is a very useful thing to learn, and I would highly recommend reading through the linked references below if you’ve never used regular expressions before.1
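To see the pattern in isolation, here is a tiny standalone check on a made-up string (not a row from the dataset):

library(stringr)

x <- "Gohan's Father-Son Kamehameha" # made-up example string
str_remove_all(x, "\\'.+")           # drops the apostrophe and everything after it
## [1] "Gohan"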

Analysis

Now that we’ve got a clean dataset, what can we find out about the Kamehamehas?

The Kamehameha – image from Giphy

My approach is to start by calculating the average power levels of Kamehamehas, grouped by Character_Single. The resulting table tells us that, on average, Goku’s Kamehameha is the most powerful, followed by Gohan’s:

dball_data_3 %>%
  group_by(Character_Single) %>%
  summarise_at(vars(Power_Level), ~mean(.)) %>%
  arrange(desc(Power_Level)) -> kame_data_grouped # Sort by descending

kame_data_grouped
## # A tibble: 11 x 2
##    Character_Single           Power_Level
##  1 Goku                           3.46e14
##  2 Gohan                          1.82e12
##  3 Family Kamehameha              3.00e11
##  4 Super Perfect Cell             8.00e10
##  5 Perfect Cell                   3.02e10
##  6 Goten                          1.98e 9
##  7 Trunk                          9.80e 8
##  8 Krillin                        8.00e 6
##  9 Student-Teacher Kamehameha     1.70e 4
## 10 Jackie Chun                    3.30e 2
## 11 Master Roshi                   1.80e 2

However, it’s not helpful to directly visualise this on a bar chart, as the Power Level of the strongest Kamehameha is 175,433 times greater than the median!

kame_data_grouped %>%
  pull(Power_Level) %>%
  summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 1.800e+02 4.008e+06 1.975e+09 3.170e+13 1.900e+11 3.465e+14

A way around this is to log-transform the Power_Level variable prior to visualising it, saving the result into a new column called Power_Index. Then, we can pipe the data directly into a ggplot chain, and set a dark mode using theme():

kame_data_grouped %>%
  mutate(Power_Index = log(Power_Level)) %>% # Log transform Power Levels
  ggplot(aes(x = reorder(Character_Single, Power_Level),
             y = Power_Index,
             fill = Character_Single)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal() +
  geom_text(aes(y = Power_Index,
                label = round(Power_Index, 1),
                hjust = -.2),
            colour = "#FFFFFF") +
  ggtitle("Power Levels of Kamehamehas", subtitle = "By Dragon Ball characters") +
  theme(plot.background = element_rect(fill = "grey20"),
        text = element_text(colour = "#FFFFFF"),
        panel.grid = element_blank(),
        plot.title = element_text(colour = "#FFFFFF", face = "bold", size = 20),
        axis.line = element_line(colour = "#FFFFFF"),
        legend.position = "none",
        axis.title = element_text(colour = "#FFFFFF", size = 12),
        axis.text = element_text(colour = "#FFFFFF", size = 12)) +
  ylab("Power Levels (log transformed)") +
  xlab(" ")

So as it turns out, the results aren’t too surprising. Goku’s Kamehameha is the strongest of all the characters on average, although it has been referenced several times in the series that his son Gohan’s latent powers are beyond Goku’s.

Also, it is perhaps unsurprising that Master Roshi’s Kamehameha is the least powerful, given a highly powered comparison set of characters. Interestingly, Roshi’s Kamehameha is stronger as ‘Jackie Chun’ than as himself.

We can also see the extent to which Goku’s Kamehameha has grown more powerful across the series, using the Saga_or_Movie column. Taking the same approach as above, we can group the data by Saga_or_Movie and pipe it into a ggplot bar chart:

dball_data_3 %>%
  filter(Character_Single == "Goku") %>%
  mutate(Power_Index = log(Power_Level)) %>% # Log transform Power Levels
  group_by(Saga_or_Movie) %>%
  summarise(Power_Index = mean(Power_Index)) %>%
  ggplot(aes(x = reorder(Saga_or_Movie, Power_Index),
             y = Power_Index)) +
  geom_col(fill = "#F85B1A") +
  theme_minimal() +
  geom_text(aes(y = Power_Index,
                label = round(Power_Index, 1),
                vjust = -.5),
            colour = "#FFFFFF") +
  ggtitle("Power Levels of Goku's Kamehamehas", subtitle = "By Saga/Movie") +
  scale_y_continuous(limits = c(0, 40)) +
  theme(plot.background = element_rect(fill = "grey20"),
        text = element_text(colour = "#FFFFFF"),
        panel.grid = element_blank(),
        plot.title = element_text(colour = "#FFFFFF", face = "bold", size = 20),
        plot.subtitle = element_text(colour = "#FFFFFF", face = "bold", size = 12),
        axis.line = element_line(colour = "#FFFFFF"),
        legend.position = "none",
        axis.title = element_text(colour = "#FFFFFF", size = 10),
        axis.text.y = element_text(colour = "#FFFFFF", size = 8),
        axis.text.x = element_text(colour = "#FFFFFF", size = 8, angle = 45, hjust = 1)) +
  ylab("Power Levels (log transformed)") +
  xlab(" ")

I don’t have full knowledge of the chronology of the franchise, but I do know that Emperor Pilaf Saga, Red Ribbon Army Saga, and Piccolo Jr. Saga are the earliest story arcs where Goku’s martial arts abilities are still developing. It also appears that if I’d like to witness Goku’s most powerful Kamehameha attack, I should find this in the Baby Saga!

Notes

Hope this was an interesting read for you, and that this tells you something new about R or Dragon Ball.

There is certainly more you can do with this dataset, especially once it is processed into a usable, tidy format.

If you have any related datasets that will help make this analysis more interesting, please let me know!

In the meantime, please stay safe and take care, all!



Package lconnect: patch connectivity metrics and patch prioritization

[This article was first published on R Code – Geekcologist, and kindly contributed to R-bloggers.]

Today I’m revisiting an older blog post on our package lconnect, which is available on CRAN (here). If you want to learn about the available connectivity metrics, check this post.

It is intended to be a very simple approach to deriving landscape connectivity metrics. Many of these metrics come from the interpretation of landscapes as graphs.

It also provides a function to prioritize landscape patches based on their contribution to the overall landscape connectivity. For now this function works only with the Integral Index of Connectivity (IIC) by Pascual-Hortal & Saura (2006).

Here’s a brief tutorial!

First install the package:

# load package from CRAN
# install.packages("lconnect")
library(lconnect)

Then, upload the landscape shapefile …

# Load data
vec_path <- system.file("extdata/vec_projected.shp", package = "lconnect")

…and create a ‘lconnect’ class object:

# upload landscape
land <- upload_land(vec_path, habitat = 1, max_dist = 500)
class(land)
## [1] "lconnect"

And now, let’s plot it:

plot(land, main="Landscape clusters")

fig1

If we wish we can derive patch importance (the contribution of each individual patch to the overall connectivity):

land1 <- patch_imp(land, metric="IIC")
##  [1]  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.1039501
##  [7]  0.1039501  0.0000000  0.1039501  0.0000000  0.0000000  0.1039501
## [13]  0.3118503 21.9334719  0.0000000 15.5925156  2.5987526  0.1039501
## [19]  0.1039501  0.2079002  0.0000000  0.0000000  0.0000000  0.0000000
## [25]  0.9355509  0.0000000 14.2411642  2.9106029  0.2079002 12.9937630
## [31]  0.3118503  0.7276507  0.0000000  7.5883576  0.5197505 70.2702703
class(land1)

Which produces an object of the class ‘pimp’:

## [1] "pimp"

And, finally, we can also plot the relative contribution of each patch to the landscape connectivity:

plot(land1, main="Patch prioritization (%)")

fig2
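Beyond patch prioritization, overall connectivity metrics can also be computed for the whole landscape. Here is a quick sketch, reusing the land object created above and assuming the con_metric() function behaves as described in the earlier post (the metric codes are my assumption, so check ?con_metric):

# Overall connectivity metrics for the landscape uploaded above
# "NC" = number of components, "IIC" = Integral Index of Connectivity (assumed codes)
con_metric(land, metric = c("NC", "IIC"))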

And that’s it!


J is for Join

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]

Today, we’ll start digging into the wonderful world of joins! The tidyverse offers several different types of joins between two datasets, X and Y:

  • left_join – keeps all rows from X and adds columns from Y to any that match cases in X; if there is no matching record from Y, the Y columns will be NA for those cases
  • right_join – keeps all rows from Y and adds columns from X; if there is no matching record in X, the row from Y will have NAs for all X variables
  • full_join – keeps all rows from both X and Y, matching where it can and filling in NAs where data are missing
  • inner_join – only keeps rows in common between X and Y
  • semi_join – does not add any columns from Y, but only keeps rows in X that have a matching record in Y
  • anti_join – keeps only the rows in X that have no matching case in Y (a small toy example of these joins follows this list)

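As a quick illustration of how a few of these behave, here is a toy example with two made-up tibbles (these are not from the reading dataset):

library(dplyr)

books   <- tibble(Book.ID = c(1, 2, 3), Title  = c("Dune", "Emma", "It"))
ratings <- tibble(Book.ID = c(2, 3, 4), Rating = c(5, 3, 4))

left_join(books, ratings, by = "Book.ID")  # 3 rows; Dune gets NA for Rating
inner_join(books, ratings, by = "Book.ID") # 2 rows; only Book.IDs 2 and 3
anti_join(books, ratings, by = "Book.ID")  # 1 row; Dune has no match in ratings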
Joins make it essential to follow tidy data principles. You should always have a variable or set of variables that serve as identifiers, so that you can use those to match up data. And, in fact, you may have multiple identifiers depending on what kind of data you’re merging together. For instance, at work, I have many identifiers I use for different data problems. For item characteristics (such as item difficulty), we can match on a unique item identifier. Multiple people may receive the same item, but the characteristics of each item should be unique. For matching users with their certification records, we have their unique board ID. Data should be clear of any unnecessary duplicates, because they can either cause the merge to stop and throw error messages, or create an incredibly large dataset that contains all possible combinations of merges. Always know what your unit of measurement is (People? Books? Experimental conditions?), and figure out what variable or variables allow you to uniquely identify cases by that unit.

For instance, with my reading dataset, I have multiple variables I could use as identifiers. Book.ID uniquely identifies each book in the set, and I could use that variable to match up additional information about the book. Author is another identifier, and I could use that to match up with a dataset that includes more information about the authors. It’s okay that my book dataset includes multiple books by the same author, as long as the author information set I merge with only has one record for each author. I could even use the dates in the dataset to match up with a dataset providing information on what I was doing on those days, such as whether I was traveling, working, or at home. Any of those variables could be used as an identifier for a merge, depending on what research question I want to answer.

In the dummy_cols post, I used the left_join to merge the book genres with my reading data (download the genre flag file and reading set to test out this code):

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 -- 
##  ggplot2 3.2.1      purrr   0.3.3
##  tibble  2.1.3      dplyr   0.8.3
##  tidyr   1.0.0      stringr 1.4.0
##  readr   1.3.1      forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   NewRating = col_double(),
##   FinalRating = col_double()
## )
genres <- read_csv("~/Downloads/Blogging A to Z/genres.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Book.ID = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
reads2019 <- reads2019 %>%
  left_join(genres, by = "Book.ID")

For joins, variable names have to be formatted as strings (meaning, in quotes). We’re joining on a single variable (Book.ID), which has the same name in both files, so we only need one string after by =. If we were joining on two variables, we’d need a character vector of two strings. To demonstrate, I created a file by Title and Author that gives the main reason why I read the book: 1) it was recommended to me (either a single book or a whole series), 2) it was a personal goal to read/re-read this book, 3) I randomly found the book at a shop, or 4) I planned to buy that specific book (usually because I follow the author’s work). Let’s merge that file into reads2019 by Title and Author:

reason <- read_csv("~/Downloads/Blogging A to Z/reason.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Author = col_character(),
##   Reason = col_character()
## )
reads2019 <- reads2019 %>%
  left_join(reason, by = c("Title", "Author"))

reads2019 %>%
  ggplot(aes(Reason)) +
  geom_bar() +
  coord_flip()

Based on this figure, the #1 reason I read a book or series of books is because of a recommendation from someone. Random finds and personal goals are tied for 2nd.

Let’s try a full_join. I created a dataset based on a friend’s reading from 2019. A full join will combine all rows from both datasets and merge any matching rows.

friendreads <- read_csv("~/Downloads/Blogging A to Z/friendreads.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Author = col_character()
## )
reads2019 <- reads2019 %>%
  full_join(friendreads, by = c("Title"))

Since I joined only by title, and both datasets include an Author variable, the dataset was changed to have Author.x and Author.y. So every row that has NA for Author.x is in my friend’s dataset and not mine, and every row that has NA for Author.y is in my dataset but not my friend’s. Rows that contain values for both are our areas of overlap.

The nice thing about joins is that the variables you join on don’t need to have the same name (they do, however, need to be the same type, such as numeric or character). You just add an additional equals sign inside by = to tell R which variables match up. Let’s demonstrate a new join type with mismatching variable names while using one of my favorite R packages, tidytext.

library(tidytext)

titlewords <- reads2019 %>%
  unnest_tokens(titleword, Title)

titlewords <- titlewords %>%
  anti_join(stop_words, by = c("titleword" = "word"))

First, I tokenized the book titles, which expanded the dataset so that each word in the title received its own row. Then, I used anti_join to remove stop words (common words, such as “a”, “the”, and “of”). Because my tokenized dataset used the variable name titleword, I needed to tell R that it should match with the variable called word to do the anti-join. Both variables are characters, so the types match and can easily be joined. I could then do further analysis to see what the most common words are among my book titles.

That’s all for now. Next time, we’ll talk about keeping and dropping variables from a dataset!


Live data extraction with Cron and R

[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers.]

Learn how to schedule robust data extraction with Cron and Docker containers.

The schema of the public live data set on Google Big Query we are going to extract

This article was also published on https://www.r-bloggers.com/.

This is part two of a multi article series on building and deploying an API robustly with R and Docker that lets you extract live data from Google Big Query. For part one, see Google Big Query with R. For a short introduction into Docker, see part two of the article series about building and deploying a Dashboard, Deploying a Shiny Flexdashboard with Docker.

Scheduled data extractions via database queries or API calls are an important part of building data infrastructure. They let you copy a data source easily and keep the most recent data, with some lag. Another use case is pre-computing aggregated/filtered data from a constantly updated source to enhance the performance of a service. I will show you how to use a Unix cron job to schedule an extraction. More advanced methods to minimize stale data would be setting up a listener or a webhook.

We will extract data from the public Big Query Real-time Air Quality data set from openAQ. This is an open source project which provides real-time data (if you stretch the definition of “real time”) from 5,490 worldwide air quality measurement stations, yet we will only extract measurements from Indian stations. The global air quality data set gets updated regularly; however, older entries are omitted, probably to save storage costs. To hold the older measurements, we will set up data extraction via a cron job inside a Docker container. For a short introduction to Docker and why we use it, see this article. For an introduction to Google Big Query, how to get access to this public data set and querying it with dplyr verbs, see part one of this series, Google Big Query with R.
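As a rough sketch of what the query side might look like, following the bigrquery/dplyr approach from part one (the billing project is a placeholder, and the country column with ISO codes is my assumption about the table schema):

library(DBI)
library(bigrquery)
library(dplyr)

# "your-gcp-project" is a placeholder for your own billing project
con <- dbConnect(
  bigquery(),
  project = "bigquery-public-data",
  dataset = "openaq",
  billing = "your-gcp-project"
)

# Pull only measurements from Indian stations
india_aq <- tbl(con, "global_air_quality") %>%
  filter(country == "IN") %>%
  collect()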

R extraction script for scheduling

The following script will be used to extract data if the data set was updated. You can find the script at cron/src/get_data_big_query.R in the project github repo.

[Embedded code: cron/src/get_data_big_query.R — see the project GitHub repo]

Scheduling the extraction with Cron

Cron is a scheduling program already included in most modern Unix-based distributions. The scheduling of so-called cron jobs is managed via crontab. You can see the cron jobs for the current user in the crontab table via crontab -l or edit them via crontab -e. The following syntax is used to define the execution interval through five time parameters:

* * * * *  command to be executed
- - - - -
| | | | |
| | | | ----- Day of week (0 - 7) (Sunday=0 or 7)
| | | ------- Month (1 - 12)
| | --------- Day of month (1 - 31)
| ----------- Hour (0 - 23)
------------- Minute (0 - 59)

Instead, the interval can also be defined by special strings:

string         meaning
------         -------
@reboot        Run once, at startup.
@yearly        Run once a year, "0 0 1 1 *".
@annually      (same as @yearly)
@monthly       Run once a month, "0 0 1 * *".
@weekly        Run once a week, "0 0 * * 0".
@daily         Run once a day, "0 0 * * *".
@midnight      (same as @daily)
@hourly        Run once an hour, "0 * * * *".

You can check how to set up specific intervals at https://crontab.guru/. Note that there are various cron monitoring tools worth a look at such as https://deadmanssnitch.com/ or https://cronitor.io/.

We will set up our cron job for data extraction to run the R script every 12 hours at the 11th minute. This is best practice to avoid conflicts with any processes that run at full hours or five-minute intervals. It is easy to get the file paths wrong the first time, as cron jobs are executed in the home directory. Check that you have the right file paths to R, to the R script and, inside the R script, for the dependencies. In the cron job, >> var/log/cron.log 2>&1 appends the script output to a log file and redirects standard error to standard output, so we have all the printed R output as well as the warnings and errors logged.

[Embedded code: crontab entry — see the project GitHub repo]

Building the Docker image

This assumes basic knowledge of Docker, if not see Deploying a Shiny Flexdashboard with Docker. To run our scheduled extraction containerized we build an image, constructed through recipes in a Dockerfile. We will use the rocker/tidyverse image from Dockerhub as base image and add layers on top in the recipe with the needed R libraries and system dependencies, copy the directory with the R script and cronjob to the image and finally the CMD will start cron and tail the log file, so the output gets shown in the Docker container logs:

[Embedded code: Dockerfile — see the project GitHub repo]

Then in the directory of the Dockerfile run docker build -t openaq_extraction ., this will build the image from the Dockerfile and tag it as openaq_extraction.

You can either export the image and deploy the container on a server or cloud service such as AWS, Google Cloud or DigitalOcean, or deploy it locally. Start the container via:

$ docker run -d \
    --restart=always \
    --name openaq_extraction_container \
    --rm \
    --mount type=bind,source=/filepath_to/openaq_extraction/shared-data,target=/src/shared-data \
    openaq_extraction

This runs the container in detached mode, always restarts it and removes the saved filesystem on exit. Additionally, this mounts the directory where the extracted data is saved to an existing source directory on the host, which you need in order to retain the extracted data if the container gets stopped.

Notice: Querying open data sets gets billed to your Google Cloud billing account; however, you have 5TB of querying free per month. Still, remember to stop this Docker container if you do not need the data extraction.

Now we have a robust, shareable and reproducible scheduled data extraction up and running. In the last part of the project we will build a REST API with R in a Docker container network to enable easy access to the now permanent records of Indian air quality that are being extracted. For this, see part three of this article series.


Live data extraction with Cron and R was originally published in Analytics Vidhya on Medium.


New Package — {cdccovidview} — To Work with the U.S. CDC’s New COVID-19 Trackers: COVIDView and COVID-NET

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

The United States Centers for Disease Control (CDC from now on) has set up two new public surveillance resources for COVID-19. Together, COVIDView and COVID-NET provide similar weekly surveillance data as FluView does for influenza-like illnesses (ILI).

The COVIDView resources are HTML tables (O_O) and, while the COVID-NET interface provides a “download” button, there is no exposed API to make it easier for the epidemiological community to work with these datasets.

Enter {cdccovidview} — https://cinc.rud.is/web/packages/cdccovidview/ — which scrapes the tables and uses the hidden API in the same way {cdcfluview} (https://cran.rstudio.com/web/packages/cdcfluview/index.html) does for the FluView data.

Weekly case, hospitalization, and mortality data is available at the national, state and regional levels (where provided) and I tried to normalize the fields across each of the tables/datasets (I hate to pick on them when they’re down, but these two sites are seriously sub-optimal from a UX and just general usage perspective).

After you follow the above URL for information on how to install the package, it should “just work”. No API keys are needed, but the CDC may change the layout of tables and fields structure of the hidden API at any time, so keep an eye out for updates.

Using it is pretty simple, just use one of the functions to grab the data you want and then work with it.

library(cdccovidview)
library(hrbrthemes)
library(tidyverse)

hosp <- laboratory_confirmed_hospitalizations()

hosp
## # A tibble: 4,590 x 8
##    catchment      network   year  mmwr_year mmwr_week age_category cumulative_rate weekly_rate
##  1 Entire Network COVID-NET 2020  2020      10        0-4 yr                   0           0
##  2 Entire Network COVID-NET 2020  2020      11        0-4 yr                   0           0
##  3 Entire Network COVID-NET 2020  2020      12        0-4 yr                   0           0
##  4 Entire Network COVID-NET 2020  2020      13        0-4 yr                   0.3         0.3
##  5 Entire Network COVID-NET 2020  2020      14        0-4 yr                   0.6         0.3
##  6 Entire Network COVID-NET 2020  2020      15        0-4 yr                  NA          NA
##  7 Entire Network COVID-NET 2020  2020      16        0-4 yr                  NA          NA
##  8 Entire Network COVID-NET 2020  2020      17        0-4 yr                  NA          NA
##  9 Entire Network COVID-NET 2020  2020      18        0-4 yr                  NA          NA
## 10 Entire Network COVID-NET 2020  2020      19        0-4 yr                  NA          NA
## # … with 4,580 more rows

c(
  "0-4 yr", "5-17 yr", "18-49 yr", "50-64 yr", "65+ yr", "65-74 yr", "75-84 yr", "85+"
) -> age_f

mutate(hosp, start = mmwr_week_to_date(mmwr_year, mmwr_week)) %>%
  filter(!is.na(weekly_rate)) %>%
  filter(catchment == "Entire Network") %>%
  select(start, network, age_category, weekly_rate) %>%
  filter(age_category != "Overall") %>%
  mutate(age_category = factor(age_category, levels = age_f)) %>%
  ggplot() +
  geom_line(
    aes(start, weekly_rate)
  ) +
  scale_x_date(
    date_breaks = "2 weeks", date_labels = "%b\n%d"
  ) +
  facet_grid(network~age_category) +
  labs(
    x = NULL, y = "Rates per 100,000 pop",
    title = "COVID-NET Weekly Rates by Network and Age Group",
    caption = sprintf("Source: COVID-NET: COVID-19-Associated Hospitalization Surveillance Network, Centers for Disease Control and Prevention.\n; Accessed on %s", Sys.Date())
  ) +
  theme_ipsum_es(grid="XY")

FIN

This is brand new and — as noted — things may change or break due to CDC site changes. I may have also missed a table or two (it’s a truly terrible site).

If you notice things are missing or would like a different interface to various data endpoints, drop an issue or PR wherever you’re most comfortable.

Stay safe!


Is COVID-19 as bad as all that? Yes it probably is.

[This article was first published on Econometrics By Simulation, and kindly contributed to R-bloggers.]

As an economist without formal training in epidemiology, I have done my best to leave the modelling up to the experts. But the world has shut down around me, my life is suddenly so much more complicated, and I have to wonder: is this COVID thing as dangerous as it seems? When things got bad in Italy, my optimistic friends said, “that’s just Italy”. When things got bad in Spain, they said the same. But now New York has more deaths per capita than either Italy or Spain, and I am starting to sweat a little. Is there something particularly bad about the New York health care system which has made them more vulnerable to this disease than others?

Looking at the mortality rate of previous flu seasons, New York has for the last four years been among the top 10 best-performing states according to the CDC. In 2017 New York had the 7th lowest death rate by state, only beaten by states with lower elderly populations (Utah, Alaska, California, Colorado, Texas, and Washington).
Does New York have a particularly large elderly population which has made it more vulnerable? Nope. New York state, at 14.66%, ranks squarely toward the youngish center of states (29/50 youngest) while New York city is younger in general than the rest of the state with only 13% of the city population older than 65.
Well maybe the mortality rate of COVID-19 just seems high because it is about to peak? After all, the flu kills somewhere between 12 and 70 thousand people in the US every year and 290 to 650 thousand globally. COVID-19 with an estimated number of deaths in the US of around 19 thousand and 104 thousand globally doesn’t seem that dangerous.
Yet, the very reasonable concern is that this disease is just getting started. Wikipedia numbers suggest the total number of cases globally is 1.7 million, which we know is a lower bound of the true number of cases, as many of those who have COVID-19 have not been tested.
We don’t know how many people currently have COVID but we can imagine a few different scenarios.

Scenario 1: COVID-19 reported cases are close to true cases

Let’s imagine that the number of people with COVID is approximately the number that we have record of. There are some unreported cases but not that many. If this is the case, we are in an extremely frightening world because so far the disease has killed about 104 thousand people out of the 1.7 million it has affected, a 6 % mortality rate and almost all of those infected are not yet recovered, meaning some of them will die, increasing the observed mortality rate. The small consolation under this scenario is that cases are largely detected and therefore with enough government and individual intervention ongoing transmission likely could be slowed and stopped through thorough and diligent contact tracing.

Scenario 2: COVID-19 reported cases are reasonable fraction of true cases

Let’s imagine that the true number of cases is somewhere between 2 and 10 times the number reported. Under this scenario, the current mortality rate is calculated by dividing the observed mortality rate by the factor of unknown cases: 6/2 = 3% if there are twice as many cases, and 6/10 = 0.6% if there are ten times as many. In this scenario contact tracing will by and large fail, as there are simply too many unknown cases. The best thing governments and individuals can do in this scenario is shut off potential avenues of transmission between individuals until either a vaccine can be found or the number of new cases is so small that contact tracing becomes feasible. Sadly, even in the scenario in which the true number of cases is 10x the reported number, the mortality rate of COVID, at a minimum of 0.6%, is still much higher than that of the seasonal flu and, if left unchecked, would result in 2.28 million fatalities in the US alone (0.6% * 380 million), which is greater than the top ten leading causes of death in the US combined:

Table 1

    Heart disease: 647,457
    Cancer: 599,108
    Accidents (unintentional injuries): 169,936
    Chronic lower respiratory diseases: 160,201
    Stroke (cerebrovascular diseases): 146,383
    Alzheimer’s disease: 121,404
    Diabetes: 83,564
    Influenza and pneumonia: 55,672
    Nephritis, nephrotic syndrome, and nephrosis: 50,633
    Intentional self-harm (suicide): 47,173
Total 2.08 million
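As a quick back-of-the-envelope check in R of the implied mortality rates discussed under Scenarios 1 and 2 (using the rounded figures quoted above):

# Implied mortality rate under different assumptions about unreported cases
deaths         <- 104e3       # ~104 thousand deaths globally, as quoted above
reported_cases <- 1.7e6       # ~1.7 million reported cases globally
undercount     <- c(1, 2, 10) # true cases = undercount * reported cases

round(100 * deaths / (reported_cases * undercount), 1)
## roughly 6.1, 3.1 and 0.6 percent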

Scenario 3: COVID is already everywhere and most people have it or have already had it

Strangely this is the best-case scenario. Under this scenario only those who have severe outcomes from COVID-19 are being reported while the vast majority (like 99%) of individuals are asymptomatic. Under this scenario, shutting down state, national, and international travel and social activities is futile for any extended period of time as the virus is already everywhere and we just need to treat the severe cases that pop up the best we can and suck it up. This scenario is appealing as it means the worst has already come or is soon to.

So which scenario are we in?

Reviewing the scenarios, it is impossible to know with certainty which scenario reality lies in. However, does the evidence point against any given scenario?
Scenario 1 seems unlikely to me due to the tens of thousands of cases popping up each day (Figure 5). This rate of new infections seems to indicate that there is a sizable infected population which has not yet been detected and has continued to spread the virus despite national, state, and local recommendations and mandates intended to limit spread.
Scenario 3, in which COVID-19 is already everywhere, seems unlikely due to the lumpiness of the mortality numbers. If COVID-19 were everywhere, we would expect people across all states and countries to be dying from the disease more or less proportionately, and mortality numbers to be mostly homogeneous across states. However, this is not what we are seeing: mortality numbers are highly heterogeneous across states and countries. New York currently has around 400 deaths per million, while New Jersey has 218, Michigan 108, Florida around 19, California 14, Texas 8, and Montana 6.
These numbers suggest that COVID is spreading from infected communities to non-infected communities in a hotspot community spread model rather than that of a widespread dispersal characteristic of Scenario 3.
But maybe one might ask: is it possible that deaths previously assigned to other causes were actually caused by COVID-19 before the virus was known and publicized? Yes, there are very likely deaths caused by COVID-19 which have not yet been correctly attributed to the disease. If accounted for, could these deaths correct the heterogeneity in the data and place us back in Scenario 3? Figure 1 shows the known deaths from COVID-19 in New York compared with flu mortality numbers from 2014-2017. Already, COVID-19 has or will soon double the mortality of the flu for these years, and unfortunately the number of infections has continued to grow at an alarming pace (Figure 2).
Figure 1
So, while it is impossible to know, I believe it extremely unlikely that a disease twice as deadly as a typical flu (at least in New York) could go undetected in thousands of hospitals and laboratories across the United States.

Assuming Scenario 2

With some reports saying 80% of cases are asymptomatic, an estimate of 5x as many people infected with COVID-19 as reported might not be crazy. This would mean that the actual number of people infected in New York is something like 850,000, which is encouraging in that 9,000 deaths out of 850,000 (1%) is much better than 9,000 deaths out of 170,000 people (5.3%).
The problem, of course, is that even the inflated number of 850 thousand is only 10% of the city’s population and 4.6% of the total population of the state, meaning we still have a vast potential population to infect. Combine that with the fact that somewhere between 6,000 and 10,000 new cases are popping up every day in the state despite a ‘stay at home’ order having been in effect for two and a half weeks.
Looking at the graph (Figure 2), the number of cases in New York has grown very rapidly. Yet, presumably the number of cases would be even greater if the lockdown order were not in effect.
Yet most of us don’t live in New York. How much should we be worried?
As New York has an above-average health care system and a relatively low proportion of at-risk elderly residents, New York could be seen as a lower-risk state compared with many. Yet New York City is also the densest city in the country, with perhaps the highest use of public transportation and correspondingly highest use of potential public infection points such as grocery stores, theaters, restaurants, etc.
Looking only at states which have reported more than 5,000 cases and scaling counts by log10, we get Figure 3. In Figure 3 it is hard to make out much except that the overall shape of the infection curve seems to be similar across states.
Figure 3: Total number of cases by state for states reporting at least 5,000 cases.   
It is difficult to make comparisons between states and to make predictions from Figure 3. However, one technique often used is to pick a point in time with a certain number of cases and then compare how the growth rate in cases changed for other states after they reached the same point. In this case, I will pick the earliest date in my dataset, March 18th, on which New York had around 2,500 reported cases of COVID-19. This number was reached later by different states: New Jersey on the 23rd, California on the 25th, Washington on the 26th, Michigan, Florida, and Illinois on the 27th, and so on.
Plotting cases starting at this common point now gives us a means of comparing case growth by state (Figure 4). Under this technique, New York definitely appears to have a higher growth rate, followed by New Jersey, with Michigan, California, Louisiana, Massachusetts, Pennsylvania, Illinois, Texas, Georgia, and many other states following a less aggressive but still positive growth trajectory.

Figure 4: Day 0 is the first day a state passes 2,480 cases of reported COVID-19.
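For reference, here is a rough dplyr sketch of the alignment technique described above (the covid_by_state data frame and its column names are assumptions for illustration only; the author's actual code is linked on GitHub at the end of the post):

library(tidyverse)

# Assume covid_by_state has columns: state, date, cases (cumulative reported cases)
aligned <- covid_by_state %>%
  group_by(state) %>%
  filter(cases >= 2480) %>%
  mutate(day = as.integer(date - min(date))) %>% # day 0 = first day above the threshold
  ungroup()

ggplot(aligned, aes(day, cases, colour = state)) +
  geom_line() +
  scale_y_log10() +
  labs(x = "Days since state passed ~2,480 reported cases",
       y = "Reported cases (log10 scale)")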

Conclusions

COVID-19 appears to be really bad, and New York has been hit the hardest – so far. How bad? We won’t know until after the crisis has passed. Fortunately, other states had lower rates around the time the country (the President) started taking this crisis seriously. Since then those states appear to be on a more gradual growth trajectory than that of New York.
Yet despite widespread concern over COVID-19 and instructions and mandates to help reduce the spread, new infections are still on the rise (Figure 5). And this is under conditions in which we have put a stop to in-person social gatherings, closed restaurants, and ordered residents to stay indoors in many states. If we were to start withdrawing these restrictions, it seems likely that growth rates of new infections would start rising rapidly once again.

Figure 5

Graphs created in R – code on GitHub


How to basic: bar plots

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]

This blog post shows how to make bar plots and area charts. It’s mostly a list of recipes, intended for myself. These are plots I often have to make in reports, and I would like to have the code handy somewhere. Maybe this will be helpful to some of you as well. Actually, this is exactly how I started my blog: I wanted to have a repository of recipes, and with time the blog grew to what it is now (tutorials and me exploring methods and datasets with R).

Bar charts

Bar charts are quite simple plots, but there are enough variations of them that they deserve one single blog post. However, don’t expect many explanations.

Let’s first start by loading some data, and the usually required packages:

library(tidyverse)
library(lubridate)
library(janitor)
library(colorspace)
data(gss_cat)

Very often, what one wants to show are counts:

gss_cat %>%  count(marital, race)
## # A tibble: 18 x 3
##    marital       race      n
##  1 No answer     Other     2
##  2 No answer     Black     2
##  3 No answer     White    13
##  4 Never married Other   633
##  5 Never married Black  1305
##  6 Never married White  3478
##  7 Separated     Other   110
##  8 Separated     Black   196
##  9 Separated     White   437
## 10 Divorced      Other   212
## 11 Divorced      Black   495
## 12 Divorced      White  2676
## 13 Widowed       Other    70
## 14 Widowed       Black   262
## 15 Widowed       White  1475
## 16 Married       Other   932
## 17 Married       Black   869
## 18 Married       White  8316

Let’s lump marital statuses that appear less than 10% of the time into an “Other” category:

(
  counts_marital_race <- gss_cat %>%
    mutate(marital = fct_lump(marital, prop = 0.1)) %>%
    count(marital, race)
)
## # A tibble: 12 x 3
##    marital       race      n
##  1 Never married Other   633
##  2 Never married Black  1305
##  3 Never married White  3478
##  4 Divorced      Other   212
##  5 Divorced      Black   495
##  6 Divorced      White  2676
##  7 Married       Other   932
##  8 Married       Black   869
##  9 Married       White  8316
## 10 Other         Other   182
## 11 Other         Black   460
## 12 Other         White  1925

The simplest bar plot:

ggplot(counts_marital_race) +
  geom_col(aes(x = marital, y = n, fill = race)) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

Now with position = "dodge":

ggplot(counts_marital_race) +
  geom_col(aes(x = marital, y = n, fill = race), position = "dodge") +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

Moving the legend around with theme(legend.position = ...):

ggplot(counts_marital_race) +
  geom_col(aes(x = marital, y = n, fill = race), position = "dodge") +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog() +
  theme(legend.position = "left")

Counting by year as well:

(
  counts_marital_race_year <- gss_cat %>%
    mutate(marital = fct_lump(marital, prop = 0.1)) %>%
    count(year, marital, race) %>%
    ungroup()
)
## # A tibble: 96 x 4
##     year marital       race      n
##  1  2000 Never married Other    60
##  2  2000 Never married Black   157
##  3  2000 Never married White   495
##  4  2000 Divorced      Other    20
##  5  2000 Divorced      Black    60
##  6  2000 Divorced      White   361
##  7  2000 Married       Other    78
##  8  2000 Married       Black   121
##  9  2000 Married       White  1079
## 10  2000 Other         Other    17
## # … with 86 more rows

When you want to show how a variable evolves through time, area charts are handy:

counts_marital_race_year %>%
  group_by(year, marital) %>%
  summarise(n = sum(n)) %>%
  ggplot() +
  geom_area(aes(x = year, y = n, fill = marital)) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

Now with facets:

counts_marital_race_year %>%
  ggplot() +
  geom_area(aes(x = year, y = n, fill = marital)) +
  facet_wrap(facets = vars(race), ncol = 1) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

But what if I want each plot to have its own y axis?

counts_marital_race_year %>%
  ggplot() +
  geom_area(aes(x = year, y = n, fill = marital)) +
  facet_wrap(facets = vars(race), ncol = 1, scales = "free_y") +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

Now doing an area chart but with relative frequencies:

counts_marital_race_year %>%
  group_by(year, marital) %>%
  summarise(n = sum(n)) %>%
  mutate(freq = n/sum(n)) %>%
  ggplot() +
  geom_area(aes(x = year, y = freq, fill = marital)) +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

With facet_wrap():

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%   ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free_y") +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

Want to replace 2000 with "2000-01-01"? First, we need to create a vector of prettier dates and a vector of the corresponding positions:

pretty_dates <- counts_marital_race_year %>%
  mutate(pretty_dates = paste0(year, "-01-01")) %>%
  pull(pretty_dates) %>%
  unique()

position_dates <- counts_marital_race_year %>%
  pull(year) %>%
  unique() %>%
  sort()

scale_x_continuous() can now use these vectors. We also use guide = guide_axis(n.dodge = 2) to avoid overlapping labels:

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%  ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free_y") +  scale_x_continuous("Year of survey", labels = pretty_dates,                     breaks = position_dates, guide = guide_axis(n.dodge = 2)) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

Adding labels is not trivial. Here is a first attempt that does not work:

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%   ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free_y") +  scale_x_continuous("Year of survey", labels = pretty_dates,                     breaks = position_dates, guide = guide_axis(n.dodge = 2)) +  geom_label(aes(x = year, y = freq, label = round(100 * freq))) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

Another failed attempt, which I leave here for posterity. My first idea was to sort the grouped data set by descending frequency, and then to reorder the factor variable marital by descending position, where position is the cumulative percentage. This would work fine if the factor levels had the same order in each of the race categories. However, this is not the case: for the Black category, the most frequent level is "Never married". As you can see below, this trick worked well for 2 categories out of 3:

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%  group_by(year, race) %>%    arrange(desc(freq)) %>%   mutate(position = cumsum(freq)) %>%   mutate(marital = fct_reorder(marital, desc(position))) %>%   ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free") +  scale_x_continuous("Year of survey", labels = pretty_dates,                     breaks = position_dates, guide = guide_axis(n.dodge = 2)) +  geom_label(aes(x = year, y = position, label = round(100 * freq))) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

So the remedy is not to reorder too early: first, we reorder the factor variable by frequency; then we arrange the data by the now-reordered marital variable; and only then do we compute the position using the cumulative frequency.

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%  group_by(year, race) %>%    mutate(marital = fct_reorder(marital, freq)) %>%   arrange(desc(marital)) %>%   mutate(position = cumsum(freq)) %>%   ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free") +  scale_x_continuous("Year of survey", labels = pretty_dates,                     breaks = position_dates, guide = guide_axis(n.dodge = 2)) +  geom_label(aes(x = year, y = position, label = round(100 * freq))) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

We can place the labels a bit better (in the middle of their respective areas), like so:

counts_marital_race_year %>%   group_by(race, year, marital) %>%   summarise(n = sum(n)) %>%    mutate(freq = n/sum(n)) %>%  group_by(year, race) %>%    mutate(marital = fct_reorder(marital, freq)) %>%   arrange(desc(marital)) %>%   mutate(position = cumsum(freq)) %>% mutate(prev_pos = lag(position, default = 0)) %>%  mutate(position = (position + prev_pos)/2) %>%    ggplot() +  geom_area(aes(x = year, y = freq, fill = marital)) +  facet_wrap(facets = vars(race), ncol = 1, scales = "free") +  scale_x_continuous("Year of survey", labels = pretty_dates,                     breaks = position_dates, guide = guide_axis(n.dodge = 2)) +  geom_label(aes(x = year, y = position, label = round(100 * freq))) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  brotools::theme_blog() 

Now let’s focus on the variable tvhours. We want to show the total watched hours, but also the total across all the categories of race and marital in a faceted bar plot:

(  total_tv <- gss_cat %>%    group_by(year, race, marital) %>%    summarise(total_tv = sum(tvhours, na.rm = TRUE)))
## # A tibble: 127 x 4## # Groups:   year, race [24]##     year race  marital       total_tv##                  ##  1  2000 Other No answer            2##  2  2000 Other Never married      103##  3  2000 Other Separated           16##  4  2000 Other Divorced            17##  5  2000 Other Widowed             24##  6  2000 Other Married            122##  7  2000 Black Never married      452##  8  2000 Black Separated          135##  9  2000 Black Divorced           156## 10  2000 Black Widowed            183## # … with 117 more rows

This tibble has the total hours watched by year, race and marital status. How do we add the total for each year and race combination? For this, we are first going to use group_split():

total_tv_split <- total_tv %>%  select(race, year, marital, total_tv) %>%  mutate(year = as.character(year)) %>%    group_split(year, race)
## Warning: ... is ignored in group_split(), please use## group_by(..., .add = TRUE) %>% group_split()

I have to re-order the columns with select(), because janitor::adorn_totals(), which I will be using below to add totals, requires the first column to be a character column (it serves as an identifier column).
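To make that constraint concrete, here is a tiny toy example of what adorn_totals() does (my own illustration, not part of the original analysis):

library(janitor)
library(tibble)

toy <- tibble(group = c("a", "b"), n = c(3, 4))

# adorn_totals() appends a "Total" row; the label goes into the first
# column, which is why that column needs to hold character values
adorn_totals(toy)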

This creates a list with 3 races times 8 years, so 24 elements. Each element of the list is a tibble with each unique combination of year and race:

length(total_tv_split)
## [1] 24
total_tv_split[1:2]
## ##     year    : character##     marital : factor<82ceb>##     total_tv: integer##   >## >[2]>## [[1]]## # A tibble: 6 x 4##   race  year  marital       total_tv##                 ## 1 Other 2000  No answer            2## 2 Other 2000  Never married      103## 3 Other 2000  Separated           16## 4 Other 2000  Divorced            17## 5 Other 2000  Widowed             24## 6 Other 2000  Married            122## ## [[2]]## # A tibble: 5 x 4##   race  year  marital       total_tv##                 ## 1 Black 2000  Never married      452## 2 Black 2000  Separated          135## 3 Black 2000  Divorced           156## 4 Black 2000  Widowed            183## 5 Black 2000  Married            320

Why do this? To use janitor::adorn_totals(), which adds a totals row to a data frame, or to each data frame if a list of data frames gets passed to it. I still need to transform the data a little bit. After using adorn_totals(), I bind my list of data frames together, and then fill down the year column (when using adorn_totals(), character columns like year are filled with "-", but I chose to fill them with NA_character_). I then replace the NA values in the marital column with the string "Total" and then reorder the marital column by the value of total_tv:

total_tv_split <- total_tv_split %>%  adorn_totals(fill = NA_character_) %>%  map(as.data.frame) %>%    bind_rows() %>%  fill(year, .direction = "down") %>%  mutate(marital = ifelse(is.na(marital), "Total", marital)) %>%  mutate(marital = fct_reorder(marital, total_tv))

I can finally create my plot. Because I have added “Total” as a level in the marital column, it now appears seamlessly in the plot:

ggplot(total_tv_split) +  geom_col(aes(x = marital, y = total_tv, fill = race)) +  facet_wrap(facets = vars(year), nrow = 2) +  scale_fill_discrete_qualitative(palette = "Dark 3") +  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +  brotools::theme_blog() 

To finish this list of recipes, let's now do a pyramid plot (inspiration from here):

data_pyramid <- gss_cat %>%
  filter(year == "2000", marital %in% c("Married", "Never married")) %>%
  group_by(race, marital, rincome) %>%
  summarise(total_tv = sum(tvhours, na.rm = TRUE))

ggplot(data_pyramid, aes(x = rincome, y = total_tv, fill = marital)) +
  geom_col(data = filter(data_pyramid, marital == "Married")) +
  geom_col(data = filter(data_pyramid, marital == "Never married"), aes(y = total_tv * (-1))) +
  facet_wrap(facets = vars(race), nrow = 1, scales = "free_x") +
  coord_flip() +
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  brotools::theme_blog()

Happy Easter!

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub.



To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


#24: Test, test, test, … those R 4.0.0 binaries with Ubuntu and Rocker


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Welcome to the 24th post in the relentlessly regular R ravings series, or R4 for short.

R 4.0.0 will be released in less than two weeks, and testing is very important. I had uploaded two alpha release builds (at the end of March and a good week ago) as well as a first beta release yesterday, all to the Debian ‘experimental’ distribution (as you can see here) tracking the release schedule set by Peter Dalgaard. Because R 4.0.0 will require reinstallation of all packages, it makes some sense to use a spare machine. Or a Docker container. So to support that latter mode, I have now complemented the binaries created from the r-base source package with all base and recommended packages, providing a starting point for actually running simple tests. Which is what we do in the video, using again the ‘R on Ubuntu (18.04)’ Rocker container:

(embedded video)

Slides from the video are at this link.
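To give an idea of what such simple tests can look like once the container is running, here are a few one-liners (my own suggestions, not a transcript of the video):

# Quick sanity checks in a fresh R 4.0.0 session
R.version.string              # confirm the version we are running
capabilities()                # check which capabilities were compiled in
install.packages("digest")    # exercise package installation
library(digest)
digest("hello, R 4.0.0")      # and confirm the installed package works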

This container based on 18.04 is described here on the Docker Hub; a new 20.04 container with the pre-release of the next Ubuntu LTS should be there shortly once it leaves the build queue.

What we showed does of course also work on direct Ubuntu (or Debian, using those source repos) installations; the commands shown in the Rocker use case generally apply equally to a normal installation.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .



Android Smartphone Analysis in R [Code + Video]


[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we'll learn how to analyse your Android smartphone usage data.

Steps:

  1. Download your MyActivity data from Google Takeout – https://takeout.google.com/ (after selecting the JSON format instead of the default HTML format)

  2. When the download is available, save the zip file and unzip it to find MyActivity.json inside the last level of the folder

  3. Create a new R project (using your RStudio) with the MyActivity.json file in the project folder

  4. Follow the below codes along with this Video

Youtube – https://www.youtube.com/watch?v=fv0idLNWfqg

Loading libraries

library(jsonlite)
library(tidyverse)
library(patchwork)

Reading Input json using jsonlite package

android <- jsonlite::fromJSON("MyActivity.json")

Glimpse of the dataframe

glimpse(android)
## Observations: 37,773## Variables: 5## $ header    "Sense Home Launcher-News,Theme", "Google Chrome: Fast & S...## $ title     "Used Sense Home Launcher-News,Theme", "Used Google Chrome...## $ titleUrl  "https://play.google.com/store/apps/details?id=com.htc.lau...## $ time      "2020-05-06T15:50:53.817Z", "2020-05-06T15:47:53.613Z", "2...## $ products  ["Android", "Android", "Android", "Android", "Android", "...

Data Preprocessing – Date Time

android$time <- parse_datetime(android$time,locale = locale(tz = "Asia/Calcutta"))
summary(android$time)
##                  Min.               1st Qu.                Median ## "2017-01-06 16:08:01" "2019-07-21 18:30:18" "2019-10-14 19:53:11" ##                  Mean               3rd Qu.                  Max. ## "2019-10-04 14:01:15" "2020-01-17 17:10:47" "2020-05-06 21:20:53"
android %>%   mutate(date = lubridate::date(time),         year = lubridate::year(time)) -> android
android_latest <- android %>%   filter(year %in% c(2019,2020))

Number of Unique Apps

android_latest %>%   count(header, sort = TRUE)  %>%   head(20) %>%   mutate(header = fct_reorder(header, n)) %>%   ggplot() + geom_col(aes(y = header, x = n)) +  theme_minimal() +  labs(title = "Most used Apps - Overall",       subtitle = "Android Smartphone usage",       caption = "Data:Google Takeout")
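As a small addition of mine (not in the video): if you also want the actual count of unique apps that this section title refers to, a one-liner on the same android_latest data frame does it:

# Number of distinct apps appearing in the activity log
n_distinct(android_latest$header)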

Apps Comparison

android_latest %>%
  filter(year %in% '2019') %>%
  group_by(year, header) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10) %>% #View()
  mutate(header = fct_reorder(header, n)) %>%
  ggplot() + geom_col(aes(y = header, x = n)) +
  # facet_wrap(~year, scales = "free") +
  theme_minimal() +
  labs(title = "Most used Apps - 2019",
       subtitle = "Android Smartphone usage",
       caption = "Data:Google Takeout") -> p2019

android_latest %>%
  filter(year %in% '2020') %>%
  group_by(year, header) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10) %>% #View()
  mutate(header = fct_reorder(header, n)) %>%
  ggplot() + geom_col(aes(y = header, x = n)) +
  # facet_wrap(~year, scales = "free") +
  theme_minimal() +
  labs(title = "Most used Apps - 2020",
       subtitle = "Android Smartphone usage",
       caption = "Data:Google Takeout") -> p2020
p2019 / p2020

Usage Timeline

android_latest %>%    count(date) %>%   ggplot() + geom_line(aes(date,n))

If you liked this tutorial, please SUBSCRIBE to my YouTube channel for more R programming videos, and share your feedback – it'd be a great help!


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.


#25: Test, test, test, … those R 4.0.0 binaries with Ubuntu 20.04 and Rocker


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Welcome to the 25th post in the randomly recurring R recitations series, or R4 for short.

Just yesterday, we posted a short post along with a video and supporting slides. It covered how to test the soon-to-be-released R 4.0.0 on a custom Ubuntu 18.04 Rocker container.

A container for Ubuntu 20.04, which is itself in final beta stages, was being built while the video was made. As it is available now, we created a quick follow-up video showing the use under Ubuntu 20.04:

(embedded video)

The updated supporting slides from the video are still at this link.

What we showed in both videos does of course also work directly on Ubuntu (or Debian, using those source repos) installations; the commands shown in the Rocker use case generally apply equally to a normal installation.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Estimating and testing GLMs with `emmeans`


[This article was first published on R on I Should Be Writing: COVID-19 Edition, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post was written in collaboration with Almog Simchon (@almogsi) and Shachar Hochman (@HochmanShachar). Go follow them.

The fictional simplicity of Generalized Linear Models

Who doesn’t love GLMs? The ingenious idea of taking a response level variable (e.g. binary or count) and getting some link function magic to treat it as if it was our long-time friend, linear regression.

In the last few days, a preprint by McCabe et al. popped up in our twitter feed (recommended reading!) and re-focused our attention on the problem with interpreting effects and interactions within the GLM framework. McCabe et al. state in the abstract that:

“To date, typical practice in evaluating interaction effects in GLMs extends directly from linear approaches, in which the product term coefficient between variables of interest are used to provide evidence of an interaction effect. However, unlike linear models, interaction effects in GLMs are not equal to product terms between predictor variables and are instead a function of all predictors of a model.”

The what-now?

When fitting a GLM we think of the response level (the binary outcome or the counts that interest us), however, the model is fit (and often all statistical inferences are done) on the latent level – in the realm of the link function.

For interactions (but as you will see soon, not only), it means that when we test the effect of X1 at different levels of X2, we’re testing these effects on the latent (e.g., logistic) level, which might not represent these effects on the response level!

For example, both plots above represent corresponding predictions from the same interaction-model – on the left we have those predictions on the latent level, and on the right these have been transformed back to the response level (the probability). We can see that at the latent level, the effect of X1 on y is very different at all levels of X2, but at the response level these differences can shrink or possibly disappear (e.g., red vs. blue lines), or get larger (red and blue vs. purple line).
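As a quick numeric illustration of the same point (my own sketch, not code from the post): two pairs of values that are exactly 2 units apart on the logit scale end up very different distances apart on the probability scale:

# Two latent-scale (logit) differences of exactly 2 units each
low  <- c(-3, -1)   # out in the tail of the logistic curve
high <- c(-1,  1)   # around the middle of the curve

diff(plogis(low))   # ~0.22 difference in probability
diff(plogis(high))  # ~0.46 difference in probability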

This is true regardless of whether or not an interaction was included in the model! And in fact, even main effects on the latent level do not always correspond to the response level the way we would have intuitively imagined.

What follows are 3 methods for testing interactions in GLMs, using emmeans. Again, we highly recommend reading McCabe et al.’s original paper.

Let’s load up some packages:

library(emmeans)  # 1.4.5
library(magrittr) # 1.5

The Model

The model used here is a logistic regression model, using data adapted from McCabe et al., except we’re using a binomial outcome (see code for data generation at the end of the post):

y <- plogis(xb) < 0.01
model <- glm(y ~ x1 * female + x2,
             data = df,
             family = "binomial")

Using emmeans for estimation / testing

If you’re not yet familiar with emmeans, it is a package for estimating, testing, and plotting marginal and conditional means / effects from a variety of linear models, including GLMs.
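For readers who have never used it, a minimal example on a built-in dataset (my own, unrelated to the model below) gives the flavour:

library(emmeans)

# Estimated marginal means of chick weight for each feed type
fit <- lm(weight ~ feed, data = chickwts)
emmeans(fit, ~ feed)

# Pairwise comparisons between feed types
pairs(emmeans(fit, ~ feed))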

So let’s answer the question:

Does the effect of sex (female) differ as a function of x1, and how does this interaction differ as a function of x2.

We will use the pick-a-point method for both continuous variables:

  • For x1: -1, +1
  • For x2: mean +- sd

1. On the latent level

Predictions from the model on the latent scale.

emmeans(model, ~ x1 + female + x2,        at = list(x1 = c(-1, 1)),        cov.reduce = list(x2 = mean_sd)) %>%  contrast(interaction = c("pairwise", "pairwise"),           by = "x2")
## x2 = -1.020:##  x1_pairwise female_pairwise estimate    SE  df z.ratio p.value##  -1 - 1      0 - 1              -1.17 0.497 Inf -2.350  0.0188 ## ## x2 = -0.007:##  x1_pairwise female_pairwise estimate    SE  df z.ratio p.value##  -1 - 1      0 - 1              -1.17 0.497 Inf -2.350  0.0188 ## ## x2 =  1.006:##  x1_pairwise female_pairwise estimate    SE  df z.ratio p.value##  -1 - 1      0 - 1              -1.17 0.497 Inf -2.350  0.0188 ## ## Note: contrasts are still on the log.o.r. scale

It seems that on the latent level the (estimated) difference of differences (the interaction) between female and x1 is unaffected by which level of x2 they are conditioned on. This makes sense – we did not model a 3-way interaction, so why should it? Everything is acting as expected.

Or is it? Well, that depends…

2. On the response level (the delta method)

We can also try and answer the same question on the response level using the delta method (baked into emmeans). Here we have two options for defining an “effect”:

  1. An effect is a difference in probabilities.
  2. An effect is a ratio of probabilities.
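As a bare-bones numeric illustration of these two definitions (toy numbers of mine, not estimates from the model):

p_low  <- 0.15   # predicted probability at one level of the predictor
p_high <- 0.30   # predicted probability at the other level

p_high - p_low   # effect as a difference in probabilities: 0.15
p_high / p_low   # effect as a ratio of probabilities (risk ratio): 2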

Predictions from the model on the response scale.

2.1. Differences in probabilities

For this, we just need to add trans = "response" in the call to emmeans():

emmeans(model, ~ x1 + female + x2,        at = list(x1 = c(-1, 1)),        cov.reduce = list(x2 = mean_sd),        trans = "response") %>%   contrast(interaction = c("pairwise", "pairwise"),           by = "x2")
## x2 = -1.020:##  x1_pairwise female_pairwise estimate     SE  df z.ratio p.value##  -1 - 1      0 - 1            -0.0265 0.0787 Inf -0.336  0.7365 ## ## x2 = -0.007:##  x1_pairwise female_pairwise estimate     SE  df z.ratio p.value##  -1 - 1      0 - 1             0.0976 0.0470 Inf  2.075  0.0380 ## ## x2 =  1.006:##  x1_pairwise female_pairwise estimate     SE  df z.ratio p.value##  -1 - 1      0 - 1             0.0371 0.0163 Inf  2.279  0.0227
# Difference of differences when x2 = -1.02
(0.69457 - 0.44178) - (0.32986 - 0.05059)
## [1] -0.02648

It seems that on the response level, we get different results than on the latent level. And not only that, but even though the model did not include a 3-way interaction, the 2-way female:x1 interaction is conditional on the level of x2, changing in size as a function of x2, and is not significant at low levels of x2!

2.2. Ratios of probabilities

(Also called risk ratios.)

For this, we just need to add trans = "log" and type = "response" in the call to emmeans():

emmeans(model, ~ x1 + female + x2,        at = list(x1 = c(-1, 1)),        cov.reduce = list(x2 = mean_sd),        trans = "log",        type = "response") %>%   contrast(interaction = c("pairwise", "pairwise"),            by = "x2")
## x2 = -1.02:##  x1_pairwise female_pairwise ratio    SE  df z.ratio p.value##  -1 / 1      0 / 1           0.241 0.097 Inf -3.534  0.0004 ## ## x2 = -0.007:##  x1_pairwise female_pairwise ratio    SE  df z.ratio p.value##  -1 / 1      0 / 1           0.268 0.124 Inf -2.856  0.0043 ## ## x2 = 1.006:##  x1_pairwise female_pairwise ratio    SE  df z.ratio p.value##  -1 / 1      0 / 1           0.299 0.146 Inf -2.480  0.0131 ## ## Tests are performed on the log scale
# Ratio of ratios when x2 = -1.02
(0.69457 / 0.44178) / (0.32986 / 0.05059)
## [1] 0.2411265

It seems that even on the response level, different delta methods produce different results!1 Although we maintain the finding that the size of the 2-way female:x1 interaction is conditional on the level of x2, here it decreases as a function of x2 (but is significant across all (tested) values of x2)!

(Note: as we are interested in the slope of x1, we could have used emtrends instead of emmeans. See code at the end of the post for what that would look like.)

Not just a problem with interactions…

As stated above, this is not only an issue of interactions. For example, when looking at the response level, the effect of x2 is itself conditional on the value of x2!

emmeans(model, ~ x2,        cov.reduce = list(x2 = mean_sd),        trans = "response") %>%   contrast(method = "consec")
##  contrast       estimate      SE  df z.ratio p.value##  -0.007 - -1.02  -0.2361 0.02190 Inf -10.782 <.0001 ##  1.006 - -0.007  -0.0965 0.00825 Inf -11.695 <.0001 ## ## Results are averaged over the levels of: female ## P value adjustment: mvt method for 2 tests

Note how the estimated effect of x2 is smaller for larger values of x2!

Conclusion

The growing popularity of GLMs (and GLMMs) in social research seems to come with another source of researcher degrees of freedom (and we all know how well that works for us)…

What should you do?

Honestly, we don’t know. Some of us feel that since the response variable is our variable of interest, that’s what we should be focusing on; some of us feel that with no common practice, we should stick to the latent level; some of us are agnostic (that covers all of us). We can’t recommend one approach, but we do think that when fitting and working with GLMs, this is a consideration one has to face.2 Good luck!

Make the data and model

set.seed(1678)
b0 <- -3.8      # Intercept
b1 <- .35       # X1 Effect
b2 <- .9        # X2 Effect
b3 <- 1.1       # Sex covariate effect
b13 <- .2       # product term coefficient
n <- 1000       # Sample Size
mu <- rep(0, 2) # Specify means
# Specify covariance matrix
S <- matrix(c(1, .5, .5, 1),
            nrow = 2, ncol = 2)
sigma <- 1     # Level 1 error
# simulates our continuous predictors from a multivariate
# normal distribution
rawvars <- MASS::mvrnorm(n = n, mu = mu, Sigma = S)
cat <- rbinom(n = n, 1, .5)
id <- seq(1:n)
eij <- rep(rnorm(id, 0, sigma))
xb <- (b0) +
  (b1) * (rawvars[, 1]) +
  (b2) * (rawvars[, 2]) +
  (b3) * cat +
  b13 * cat * (rawvars[, 1]) +
  eij
df <- data.frame(x1 = rawvars[, 1],
                 x2 = rawvars[, 2],
                 female = cat)
y <- plogis(xb) < 0.01
model <- glm(y ~ x1 * female + x2,
             data = df,
             family = "binomial")

Using emtrends

Note that the inferential results (\(z\) and \(p\) values) are similar (though not identical) to those obtained using emmeans.

# log(odds)
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd)) %>%
  contrast(method = "pairwise", by = "x2")
## x2 = -1.020:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.584 0.248 Inf 2.350   0.0188 ## ## x2 = -0.007:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.584 0.248 Inf 2.350   0.0188 ## ## x2 =  1.006:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.584 0.248 Inf 2.350   0.0188
# diffs
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd),
         trans = "response") %>%
  contrast(method = "pairwise", by = "x2")
## x2 = -1.020:##  contrast estimate      SE  df z.ratio p.value##  0 - 1      0.0107 0.04117 Inf  0.259  0.7957 ## ## x2 = -0.007:##  contrast estimate      SE  df z.ratio p.value##  0 - 1     -0.0542 0.02399 Inf -2.259  0.0239 ## ## x2 =  1.006:##  contrast estimate      SE  df z.ratio p.value##  0 - 1     -0.0195 0.00793 Inf -2.457  0.0140
# ratios
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd),
         trans = "log", type = "response") %>%
  contrast(method = "pairwise", by = "x2")
## x2 = -1.020:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.727 0.207 Inf 3.509   0.0004 ## ## x2 = -0.007:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.663 0.233 Inf 2.848   0.0044 ## ## x2 =  1.006:##  contrast estimate    SE  df z.ratio p.value##  0 - 1       0.605 0.244 Inf 2.476   0.0133

  1. Note that for Poisson models with a “log” link function, this is the same as working on the latent level!↩

  2. But hey, whatever you do – don’t model binary / count data with a linear model, okay?↩


To leave a comment for the author, please follow the link and comment on their blog: R on I Should Be Writing: COVID-19 Edition.


psychonetrics 0.7, meta-analysis preprint and online SEM course


[This article was first published on r – psychonetrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Version 0.7 of the psychonetrics package is now on CRAN! This version is a major restructure of the package leading to a lot of new functionality as well as much faster computations. In addition, a new pre-print is now online describing meta-analysis procedures now implemented in psychonetrics.

Free course on Structural Equation Modeling

I am teaching a course now on structural equation modeling (SEM), and am uploading every video lecture to Youtube. This playlist will be updated in the coming weeks and includes tutorial videos on psychonetrics and other R packages. For the Structural Equation Modeling course, students formed a questionnaire on measures of interest during the COVID-19 pandemic. We will release the data after the course for research and teaching purposes. If you have around 10 to 15 minutes to spare, it would be absolutely amazing if you could assist in this by filling in the questionnaire. Thanks!

Faster computations

Version 0.7 of the psychonetrics package features a major internal restructure of the computational engine. Most functions have now been translated to C++ code using the amazing RcppArmadillo package. This leads to a substantial increase in speed:

Comparison between version 0.6 (left) and 0.7 (right) versions of psychonetrics.

Note that because of the speed increase, verbose now defaults to FALSE, which can be altered for a model with the new setverbose function after forming the model. The default optimizer is still the R based nlminb function, but new optimizers using the package roptim have also been implemented:

Comparison between the default (left) and C++ based optimizer in psychonetrics 0.7.

These optimizers are still somewhat unstable however, especially in the more complicated psychonetrics models. To this end, please use these with care for the moment.
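For orientation, the setverbose() step mentioned above would typically slot into a psychonetrics pipeline roughly like this (a sketch under my own assumptions: ggm() and runmodel() stand for the usual model-specification and estimation steps, and mydata is a placeholder data frame):

library(psychonetrics)
library(dplyr)

# Specify a GGM model, switch verbose printing back on, then estimate.
# mydata is a placeholder; replace it with your own data frame.
model <- ggm(mydata) %>%
  setverbose(TRUE) %>%
  runmodel()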

Meta-analysis methods in psychonetrics

Together with Adela Isvoranu and Mike Cheung, we extended meta-analytic SEM to be used in estimating Gaussian graphical models (GGM). We call this framework meta-analytic Gaussian network aggregation (MAGNA) and describe it in our new pre-print, which has been submitted for publication in Psychometrika. The pre-print features an extensive overview of functionality in psychonetrics. First, the pre-print discusses the engine behind psychonetrics, which is used in every model framework:

The method in which psychonetrics computes the Gradient for every implemented model.

As such, the pre-print can be read as an introduction to everything psychonetrics does. Second, the pre-print includes an extensive tutorial on how to estimate GGMs from correlation matrices, including multi-group models with correlation matrices from different datasets. The latter can be used, for example, to partially pool GGM edges across groups:

Example of partially pooled network structures in which most edges, but not all, are constrained equal across groups. Based on correlation matrices collected by Eiko Fried.

Finally, the pre-print discusses how heterogeneity across datasets (e.g., PTSD studies featuring very different traumas) can be taken into account in estimating a single GGM across potentially many datasets:

Example of a pooled GGM structure estimated from four datasets while taking cross-study heterogeneity into account.

Simulation studies show that not taking such heterogeneity into account can lead to poor estimation of the common network structure:

Simulation results discussed in the paper.

Not taking heterogeneity into account while aiming to estimate a single fixed-effects model can lead to a false positive rate of more than 50%. This has some severe implications for studies aiming to aggregate or compare network models from different datasets (e.g., to study replicability) that do not take heterogeneity into account.

More new features

More functionality of psychonetrics can be found in the NEWS file on CRAN. Some earlier changes include that most latent variable models will now automatically set the factor loadings to an identity matrix if lambda is not supplied, and the new ml_tsdlvm1 allows for specifying the panel data model using the same syntax as mlVAR and graphicalVAR. Finally, the new ml_lvm function marks the first iteration of multi-level modeling. This function includes full-information maximum likelihood estimation for random-intercept lvm models (including GGMs). Of note, the computation speed is a function of the number of cases in each cluster and becomes very slow with many cases in a cluster. To this end, ml_lvm works well for dyad or small family datasets, but not well for datasets with larger clusters such as classrooms or countries.


To leave a comment for the author, please follow the link and comment on their blog: r – psychonetrics.


K is for Keep or Drop Variables


[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A few times in this series, I’ve wanted to display part of a dataset, such as key variables, like Title, Rating, and Pages. The tidyverse allows you to easily keep or drop variables, either temporarily or permanently, with the select function. For instance, we can use select along with other tidyverse functions to create a quick descriptive table of my dataset. Let’s filter down to books that are fantasy and/or sci-fi and that took me the longest to read, then select a few descriptives to display.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- 
##  ggplot2 3.2.1      purrr   0.3.3 ##  tibble  2.1.3      dplyr   0.8.3 ##  tidyr   1.0.0      stringr 1.4.0 ##  readr   1.3.1      forcats 0.4.0 
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag()    masks stats::lag() 
reads2019<-read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",col_names=TRUE)
## Parsed with column specification: ## cols( ##   Title = col_character(), ##   Pages = col_double(), ##   date_started = col_character(), ##   date_read = col_character(), ##   Book.ID = col_double(), ##   Author = col_character(), ##   AdditionalAuthors = col_character(), ##   AverageRating = col_double(), ##   OriginalPublicationYear = col_double(), ##   read_time = col_double(), ##   MyRating = col_double(), ##   Gender = col_double(), ##   Fiction = col_double(), ##   Childrens = col_double(), ##   Fantasy = col_double(), ##   SciFi = col_double(), ##   Mystery = col_double(), ##   SelfHelp = col_double() ## ) 
reads2019%>%group_by(Fantasy, SciFi)%>%filter(read_time==max(read_time)&(Fantasy==1|SciFi==1))%>%select(Title, Author, Pages, read_time)
## Adding missing grouping variables: `Fantasy`, `SciFi` 
## # A tibble: 4 x 6 ## # Groups:   Fantasy, SciFi [3] ##   Fantasy SciFi Title                              Author        Pages read_time ##                                                    ## 1       1     1 1Q84                               Murakami, Ha…   925         7 ## 2       0     1 The End of All Things (Old Man's … Scalzi, John    380        10 ## 3       0     1 The Long Utopia (The Long Earth #… Pratchett, T…   373        10 ## 4       1     0 Tik-Tok of Oz (Oz, #8)             Baum, L. Fra…   272        25 

Of course, I can also permanently change the reads2019 dataset to only keep those variables or create a new dataset with just those variables. The select function can also be used to drop single variables, by putting a – sign before the variable name. Let’s say I decided I no longer wanted to keep the Self Help genre flag. I could throw that out of my dataset like this.

reads2019<-reads2019%>%select(-SelfHelp)

That variable is now gone. You can use this same code to drop multiple variables at once, by putting – signs before each variable name.

small_reads2019<-reads2019%>%select(-AdditionalAuthors,-AverageRating,-OriginalPublicationYear)

Whichever you do, keeping or dropping, choose the option that minimizes how many things you have to type. If you have a large number of variables and want a dataset with only a handful, I’d use the names of the variables I want to keep with select. If you only want to drop 1 or 2 variables, using select to drop will be faster.
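For comparison, the "keep" version of that same idea would look something like this (a quick sketch of mine, using column names from the dataset above):

# Keep only a handful of columns by naming them explicitly
small_reads2019 <- reads2019 %>%
  select(Title, Author, Pages, read_time, MyRating)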

Tomorrow we’ll talk about a variable transformation that makes plotting skewed variables much easier. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


Biterm topic modelling for short texts


[This article was first published on bnosac :: open analytical helpers - bnosac :: open analytical helpers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A few weeks ago, we published an update of the BTM (Biterm Topic Models for text) package on CRAN.

Biterm Topic Models are especially useful if you want to find topics in collections of short texts. Short texts are typically a Twitter message, a short answer on a survey, the title of an email, search questions, … . For these types of short texts, traditional topic models like Latent Dirichlet Allocation are less suited as most information is available in short word combinations. The R package BTM finds topics in such short texts by explicitly modelling word-word co-occurrences (biterms) in a short window.

The update which was pushed to CRAN a few weeks ago now allows you to explicitly provide a set of biterms to cluster upon. Let us show an example of clustering a subset of R package descriptions on CRAN. The resulting cluster visualisation looks like this.

biterm topic model example

If you want to reproduce this, the following snippets show how to do this. Steps are as follows

1. Get some data of R packages and their description in plain text

## Get list of packages in the NLP/Machine Learning Task Views
library(ctv)
pkgs <- available.views()
names(pkgs) <- sapply(pkgs, FUN=function(x) x$name)
pkgs <- c(pkgs$NaturalLanguageProcessing$packagelist$name,
          pkgs$MachineLearning$packagelist$name)

## Get package descriptions of these packages
library(tools)
x <- CRAN_package_db()
x <- x[, c("Package", "Title", "Description")]
x$doc_id <- x$Package
x$text   <- tolower(paste(x$Title, x$Description, sep = "\n"))
x$text   <- gsub("'", "", x$text)
x$text   <- gsub("<.+>", "", x$text)
x        <- subset(x, Package %in% pkgs)

2. Use the udpipe R package to perform Parts of Speech tagging on the package title and descriptions and use udpipe as well for extracting cooccurrences of nouns, adjectives and verbs within 3 words distance.

library(udpipe)
library(data.table)
library(stopwords)
anno    <- udpipe(x, "english", trace = 10)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN", "ADJ", "VERB") &
                                             nchar(lemma) > 2 & !lemma %in% stopwords("en"),
                                  skipgram = 3),
                   by = list(doc_id)]

3. Build the biterm topic model with 9 topics and provide the set of biterms to cluster upon

library(BTM)
set.seed(123456)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "VERB") & !lemma %in% stopwords("en") & nchar(lemma) > 2)
traindata <- traindata[, c("doc_id", "lemma")]
model     <- BTM(traindata, biterms = biterms, k = 9, iter = 2000, background = TRUE, trace = 100)

4. Visualise the biterm topic clusters using the textplot package available at https://github.com/bnosac/textplot. This creates the plot show above.

library(textplot)
library(ggraph)
plot(model, top_n = 10,
     title = "BTM model", subtitle = "R packages in the NLP/Machine Learning task views",
     labels = c("Garbage", "Neural Nets / Deep Learning", "Topic modelling",
                "Regression/Classification Trees/Forests", "Gradient Descent/Boosting",
                "GLM/GAM/Penalised Models", "NLP / Tokenisation",
                "Text Mining Frameworks / API's", "Variable Selection in High Dimensions"))

 

Enjoy!

 

 

 

 


To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers - bnosac :: open analytical helpers.


Build a static website with R Shiny


[This article was first published on R & Census, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Sounds stupid? Yes, it’s kind of throwing away 99% of Shiny’s power; and you can always build a static website with R markdown, blogdown, or bookdown.

Anyway, please keep reading as it will save you time if you are an R user who

  • want to make a portfolio website to showcase your work,
  • but know little about web development,
  • and yet prefer designing your own interface rather than using a popular theme.

The idea is to take advantage of Shiny’s convenient user interface design without relying on a Shiny server. Here is my portfolio website. It is nowhere near fancy, but is exactly what I want.

Without further ado, let's get started. We will host the website on GitHub Pages. Follow these instructions to create a repository called johndoe.github.io (replace johndoe with your username) and clone the repository to your computer.

Now let’s create a Shiny project in the cloned directory. All the files can be found here. The file structure for this project is:

As a simple example, this ui.R contains a picture (birds.png), some text (birds.Rmd), an internal link to a section in the same page, an external link to Wikipedia and a link to sample.html in subdirectory /samples. For the link to sample.html, we will use the true url at your website: href = "https://johndoe.github.io/samples/sample.html".

library(shiny)

ui <- fluidPage(
    sidebarLayout(
        sidebarPanel(
            width = 3,
            h2("My Static website"),
            h3("Internal link example"),
            a(h4("Birds"),
              href = "#birds"),
            hr(),
            h3("External link example"),
            a(h4("Wikipedia"),
              href = "https://en.wikipedia.org/wiki/Main_Page",
              target = "blank")
        ),

        mainPanel(
            width = 9,
            h2("Link to a saved sample.html"),
            p("The url is https://johndoe.github.io/samples/sample.html"),
            a(h3("Lovely Birds"),
              href = "https://johndoe.github.io/samples/sample.html",
              target = "blank"),
            hr(),
            h2("Text and image example", id = "birds"),
            fluidRow(
                column(
                    7,
                    includeMarkdown("markdown/birds.Rmd")
                ),
                column(
                    5,
                    img(src = "birds.png", width = "100%")
                )
            ),
        )
    )
)

The server.R file has an empty function as we do not have any interactive features in ui.R.

library(shiny)

server <- function(input, output) {
    # empty
}

With all the files in place, follow these steps to create the static website on github page.

Step 1: Run the Shiny app in RStudio and then open it in a browser. The webpage is at localhost 127.0.0.1:6230 (the address may differ from yours) and looks like:

Step 2: Save this webpage as index.html under the directory johndoe.github.io as complete HTML. A subdirectory index_files containing the associated files is created automatically.

Step 3: Open index.html in a text editor.

  • delete all xxxx

  • replace all https://127.0.0.1:6230 with https://johndoe.github.io (this find-and-replace can also be done from R; see the sketch after these steps)

Step 4: Push all files to the Github repository. Wait for a couple of minutes and your website is ready at https://johndoe.github.io.
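If you would rather do Step 3's find-and-replace from R than in a text editor, something along these lines works (my own sketch; adjust the local address to whatever port your app actually used):

# Swap the local Shiny address for the GitHub Pages address in index.html
html <- readLines("index.html", warn = FALSE)
html <- gsub("https://127.0.0.1:6230", "https://johndoe.github.io", html, fixed = TRUE)
writeLines(html, "index.html")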


To leave a comment for the author, please follow the link and comment on their blog: R & Census.



Hosting a Virtual useR Meetup


[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

By Rachael Dempsey, Senior Enterprise Advocate at RStudio / Greater Boston useR Organizer

Last month, the Boston useR Group held our very first virtual meetup and opened this up to anyone that was interested in joining. While I wasn’t sure what to expect at first, I was so happy with the turnout and reminded again of just how great the R community is. Everyone was so friendly and appreciative of the opportunity to meet together during this time. It was awesome to see that people joined from all over the world – not just from the Boston area. We had attendees from the Netherlands, Spain, Mexico, Chile, Canada, Ireland, and I’m sure many other places!

Our event was a virtual TidyTuesday Meetup held over Zoom, which can hold up to 100 people without having to purchase the large meeting add-on. (If you’re worried about the number of people being over this, keep in mind that often half the people that register will attend.)

This was our agenda:

  • 5:30: Introductions to useR Meetup & TidyTuesday (Rachael Dempsey & Tom Mock)
  • 5:35: Presentation #1 – Meghan Hall: “Good to Great: Making Custom Themes in ggplot2”
  • 5:50: Presentation #2 – Kevin Kent: “The science of (data science) teaching and learning”
  • 6:00: Introduction to R for Data Science Slack Channel – Jon Harmon
  • 6:05: Breakout into groups to work on TidyTuesday dataset – groups will be open for two-hours but you can come and go as you want!
  • 7:30: Come back together to the Main Room for an opportunity to see a few of the examples that people would like to share

If you’re thinking of keeping your monthly event and want to host it virtually, I’ve included a few tips below:

Find someone (or multiple people) to co-host with you!

Thank you, Kevin Kent and Asmae Toumi! Kevin, a member of the Boston useR Group was originally going to be the lead for our in-person TidyTuesday meetup and posted about the meetup on Twitter, where we both met our other co-host, Asmae Toumi. Asmae then introduced us to one of our presenters, Meghan Hall. Having co-hosts not only made me feel more comfortable, but gave me a chance to bounce ideas off of someone and made it much easier to market the event to different groups of people. While I often share events on LinkedIn, Kevin and Asmae have a much bigger presence on Twitter. Aside from your own meetup group and social media, another helpful place to find potential co-hosts may be on the events thread of community.rstudio.com. Instead of co-hosting, you could also just ask people if they would be willing to volunteer to help at the meetup. Thank you to Carl Howe, Jon Harmon, Josiah Parry, Meghan Hall, Priyanka Gagneja, and Tom Mock for your support. If I can help you with finding volunteers, please don’t hesitate to reach out on LinkedIn.

Have a practice session on Zoom!

The day before the event we held a practice session on Zoom to work out a few of the kinks. As we were hosting a TidyTuesday meetup, we wanted to be able to meet in smaller groups too, as we would if we were in-person. I had never used Zoom breakout rooms before and wanted to test this out first. After the initial presentations, we broke out into 7 smaller groups. These groups worked well to help facilitate conversation among attendees. During the test, we confirmed that you can move people from different breakout groups if needed. This was helpful for keeping the groups even as some attendees had to leave before the end of the event.

Have a Slack Channel or a way for people to chat if they have questions

During the meetup, we used the R for Data Science Online Learning Community Slack Channel as a venue to ask questions and share examples of what people were working on. You can join this Slack channel by going to r4ds.io/join. We used the channel, #chat-tidytuesday which you can find by using the search bar within Slack.

Accept that it won’t be perfect

You can practice and plan how you want things to go, but I think it’s helpful to recognize that this is the first time doing this and it’s okay if things aren’t perfect. For example, we were going to create separate breakout groups based on people’s interests and have everyone use a Google doc to indicate this at the start. While it was good in theory, we determined this would be a bit too hard to manage and complicate things so I just automatically split people up into the 7 different groups. It wasn’t perfect, but it worked!

Think about Zoom best practices

This came up in discussion during our practice call and I think we’ve all seen recently that there can be a few bad-actors out there trying to ruin open meetings. @alexlmiller shared a few tips on Twitter that I’d like to cross post here as well.

You can start with the Main Settings on your Zoom account and do the following:

1) Disable “Join Before Host”

2) Give yourself some moderation help by enabling “Co-Host” – this lets you assign the same host controls to another person in the call

3) Change “Screen sharing” to “Host Only”

4) Disable “File Transfer”

5) Disable “Allow Removed Participants to Rejoin”

And also to make the overall experience a little nicer:

1) Disable “Play sounds when participants join or leave”

2) Enable “Mute participants upon entry”

3) Turn on "Host Video" and "Participants Video" (if you want that)

One more thing, if you want to split meeting participants into separate, smaller rooms you have to enable “Breakout Rooms”.

Market your event on social media

Once your event is posted to Meetup, share it with others through multiple channels. Maybe that’s a mix of your internal Slack channel, Twitter, your LinkedIn page and/or the “R Project Group” on LinkedIn…or wherever you prefer to connect with people online. Keep in mind that this could be a different audience than your usual meetups because it’s now accessible to people all over the world. Ask a few people to share your post so that you can leverage their networks as well.

Have fun!

Reflecting back on our meetup, some of us found that with the use of Zoom breakout groups and a Slack channel our event was surprisingly more interactive than our actual in-person meetups. It was also an awesome opportunity to do something social and get together with others from the community during this crazy time. If you have any tips from your own experiences, please let me know and don’t hesitate to reach out if I can assist in any way. Hope this helps! 

The post Hosting a Virtual useR Meetup appeared first on R Consortium.


To leave a comment for the author, please follow the link and comment on their blog: R Consortium.


The ‘spam comments’ puzzle: tidy simulation of stochastic processes in R


[This article was first published on Variance Explained, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Previously in this series:

I love 538’s Riddler column, and the April 10 puzzle is another interesting one. I’ll quote:

Over the course of three days, suppose the probability of any spammer making a new comment on this week’s Riddler column over a very short time interval is proportional to the length of that time interval. (For those in the know, I’m saying that spammers follow a Poisson process.) On average, the column gets one brand-new comment of spam per day that is not a reply to any previous comments. Each spam comment or reply also gets its own spam reply at an average rate of one per day.

For example, after three days, I might have four comments that were not replies to any previous comments, and each of them might have a few replies (and their replies might have replies, which might have further replies, etc.).

After the three days are up, how many total spam posts (comments plus replies) can I expect to have?

This is a great opportunity for tidy simulation in R, and also for reviewing some of the concepts of stochastic processes (this is known as a Yule process). As we’ll see, it’s even thematically relevant to current headlines, since it involves exponential growth.

Solving a puzzle generally involves a few false starts. So I recorded this screencast showing how I originally approached the problem. It shows not only how to approach the simulation, but how to use those results to come up with an exact answer.

Simulating a Poisson process

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length x is distributed according to \mbox{Pois}(\lambda x), for some rate parameter \lambda (for this puzzle, the rate is described as one per day, \lambda=1).

How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().

# The rate parameter, 1, is the expected events per day
waiting <- rexp(10, 1)
waiting
##  [1] 0.1417638 2.7956808 1.2725448 0.3452203 0.5303130 0.2647746 2.6195738
##  [8] 1.2933250 0.5539181 0.9835380

For example, in this case we waited 0.14 days for the first comment, then 2.8 after that for the second one, and so on. On average, we’ll be waiting one day for each new comment, but it could be a lot longer or shorter.

You can take the cumulative sum of these waiting periods to come up with the event times (new comments) in the Poisson process.

qplot(cumsum(waiting),0)

[Plot: the ten simulated event times plotted along a timeline]

Simulating a Yule process

Before the first comment happened, the rate of new comments/replies was 1 per day. But as soon as the first comment happened, the rate increased: the comment could spawn its own replies, so the rate went up to 2 per day. Once there were two comments, the rate goes up to 3 per day, and so on.

This is a particular case of a stochastic process known as a Yule process (which is a special case of a birth process). We could prove a lot of mathematical properties of that process, but let’s focus on simulating it.

The waiting time for the first comment would be \mbox{Exponential}(1), but the waiting time for the second is \mbox{Exponential}(2), then \mbox{Exponential}(3), and so on. We can use the vectorized rexp() function to simulate those. The waiting times will, on average, get shorter and shorter as there are more comments that can spawn replies.

set.seed(2020)
waiting_times <- rexp(20, 1:20)

# Cumulative time
cumsum(waiting_times)
##  [1] 0.2938057 0.9288308 1.0078320 1.1927956 1.4766987 1.6876352 2.5258522
##  [8] 2.5559037 2.6146623 2.6634295 2.7227323 2.8380710 2.9404016 2.9460719
## [15] 2.9713356 3.0186731 3.1340060 3.2631936 3.2967087 3.3024576
# Number before the third day
sum(cumsum(waiting_times) < 3)
## [1] 15

In this case, the first 15 events happened before the third day. Notice that in this simulation, we’re not keeping track of which comment received a reply: we’re treating all the comments as interchangeable. This lets our simulation run a lot faster since we just have to generate the waiting times.

All combined, we could perform this simulation in one line:

sum(cumsum(rexp(20,1:20))<3)
## [1] 6

So, wrapping that one line in replicate(), here are one million simulations. We simulate 300 waiting periods in each, and see how many happen before the end of the third day.

sim <- replicate(1e6, sum(cumsum(rexp(300, 1:300)) < 3))
mean(sim)
## [1] 19.10532

It looks like it’s about 19.1.

Turning this into an exact solution

Why 19.1? Could we get an exact answer that is intuitively satisfying?

One trick to get a foothold is to vary one of our inputs: rather than looking at 3 days, let’s look at the expected comments after time t. That’s easier if we expand this into a tidy simulation, using one of my favorite functions crossing().

library(tidyverse)
set.seed(2020)

sim_waiting <- crossing(trial = 1:25000,
                        observation = 1:300) %>%
  mutate(waiting = rexp(n(), observation)) %>%
  group_by(trial) %>%
  mutate(cumulative = cumsum(waiting)) %>%
  ungroup()

sim_waiting
## # A tibble: 7,500,000 x 4
##    trial observation waiting cumulative
##  1     1           1  0.294       0.294
##  2     1           2  0.635       0.929
##  3     1           3  0.0790      1.01
##  4     1           4  0.185       1.19
##  5     1           5  0.284       1.48
##  6     1           6  0.211       1.69
##  7     1           7  0.838       2.53
##  8     1           8  0.0301      2.56
##  9     1           9  0.0588      2.61
## 10     1          10  0.0488      2.66
## # … with 7,499,990 more rows

We can confirm that the average number of comments in the first three days is about 19.

sim_waiting %>%
  group_by(trial) %>%
  summarize(num_comments = sum(cumulative <= 3)) %>%
  summarize(average = mean(num_comments))
## # A tibble: 1 x 1
##   average
## 1    18.9

But we can also use crossing() (again) to look at the expected number of cumulative comments as we vary t.

average_over_time <- sim_waiting %>%
  crossing(time = seq(0, 3, .25)) %>%
  group_by(time, trial) %>%
  summarize(num_comments = sum(cumulative < time)) %>%
  summarize(average = mean(num_comments))

(Notice how often “solve the problem for one value” can be turned into “solve the problem for many values” with one use of crossing(): one of my favorite tricks).

How does the average number of comments increase over time?

ggplot(average_over_time,aes(time,average))+geom_line()

[Plot: average number of comments over time, from the simulation]

At a glance, this looks like an exponential curve. With a little experimentation, and noticing that the curve starts at (0, 0), we can find that the expected number of comments at time t follows e^t-1. This fits with our simulation: e^3 - 1 is 19.0855.

ggplot(average_over_time, aes(time, average)) +
  geom_line(aes(y = exp(time) - 1), color = "red") +
  geom_point() +
  labs(y = "Average # of comments",
       title = "How many comments over time?",
       subtitle = "Points show simulation, red line shows exp(time) - 1.")

[Plot: simulated averages (points) with the curve exp(time) - 1 overlaid in red]

Intuitively, it makes sense that on average the growth is exponential. If we’d described the process as “bacteria in a dish, each of which could divide at any moment”, we’d expect exponential growth. The “minus one” is because the original post is generating comments just like all the others do, but doesn’t itself count as a comment.1
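A short derivation (my addition, not from the original post) of why the average is exactly e^t-1: the original post and each of the m(t)=E[X(t)] existing comments spawn new spam at rate one per day, so

m'(t) = 1 + m(t), \qquad m(0)=0 \quad\Longrightarrow\quad m(t)=e^{t}-1

which matches the simulated 19.1 at t=3.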

Distribution of comments at a given time

It’s worth noting we’re still only describing an average path. There could easily be more, or fewer, spam comments by the third day. Our tidy simulation gives us a way to plot many such paths.

sim_waiting %>%
  filter(trial <= 50, cumulative <= 3) %>%
  ggplot(aes(cumulative, observation)) +
  geom_line(aes(group = trial), alpha = .25) +
  geom_line(aes(y = exp(cumulative) - 1), color = "red", size = 1) +
  labs(x = "Time",
       y = "# of comments",
       title = "50 possible paths of comments over time",
       subtitle = "Red line shows e^t - 1")

[Plot: 50 simulated comment-count paths over time, with e^t - 1 shown as a red line]

The red line shows the overall average, reaching about 19.1 at 3 days. However, we can see that it can sometimes be much smaller or much larger (even more than 100).

What is the probability distribution of comments after three days: the probability there is one comment, or two, or three? Let’s take a look at the distribution.

# We'll use the million simulated values from earlier
num_comments <- tibble(num_comments = sim)

num_comments %>%
  ggplot(aes(num_comments)) +
  geom_histogram(binwidth = 1)

[Plot: histogram of the total number of comments after three days, across one million simulations]

Interestingly, at a glance this looks a lot like an exponential curve. Since it’s a discrete distribution (with values 0, 1, 2…), this suggests it’s a geometric distribution: the expected number of “tails” flipped before we see the first “heads”.

We can confirm that by comparing it to the probability mass function, (1-p)^n p. If it is a geometric distribution, then because we know the expected value is e^3-1, we know the parameter p (the probability of a success on each flip) is \frac{1}{e^3}=e^{-3}.

p <- exp(-3)

num_comments %>%
  filter(num_comments <= 150) %>%
  ggplot(aes(num_comments)) +
  geom_histogram(aes(y = ..density..), binwidth = 1) +
  geom_line(aes(y = (1 - p) ^ num_comments * p), color = "red")

[Plot: histogram of comment counts with the geometric probability mass function overlaid in red]

This isn’t a mathematical proof, but it’s very compelling. So what we’ve learned overall is:

X(t)\sim \mbox{Geometric}(e^{-t}), \qquad E[X(t)] = e^{t}-1

These are true because the rate of comments is one per day. If the rate of new comments were \lambda, you’d replace t above with \lambda t.
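As a quick check of that claim (my own sketch, not from the original post), we can rerun the one-line simulation with a rate of \lambda = 2 per day and compare against e^{\lambda t}-1:

lambda <- 2
t <- 1.5

# waiting time for the k-th event is Exponential(k * lambda)
sim_lambda <- replicate(1e5, sum(cumsum(rexp(300, lambda * (1:300))) < t))

mean(sim_lambda)     # simulated average
exp(lambda * t) - 1  # theoretical value, about 19.09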

I don’t have an immediate intuition for why the distribution is geometric. Though it’s interesting that the parameter p=e^{-t} for the geometric distribution (the probability of a “success” on the coin flip that would stop the process) is equal to the probability that there are no events in time t for a Poisson process.

Conclusion: Yule process

I wasn’t familiar with it when I first tried out the riddle, but this is known as a Yule process. For confirmation of some of the results above you can check out this paper or the Wikipedia entry, among others.

What I love about simulation is how it builds an intuition for these processes from the ground up. These simulated datasets and visualizations are a better “handle” for me to grasp the concepts than mathematical equations would be. After I’ve gotten a feel for the distributions, I can check my answer by looking through the mathematical literature.

  1. If you don’t like the -1, you could have counted the post as a comment, started everything out at X(0)=1, and then you would find that E[X(t)]=e^t. This is the more traditional definition of a Yule process. ↩


To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.


Saving R Graphics across OSs


[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R is known for it’s amazing graphics. Not only ggplot2, but also plotly, and the other dozens of packages at the graphics task view. There seems to be a graph for every scenario. However once you’ve created your figure, how do you export it? This post compares standard methods for exporting R plots as PNGs/PDFs across different OSs. As R has excellent cross-platform capabilities, we may expect this to follow through to exporting graphics. But as we’ll see, this isn’t the case!

Saving Graphics

Suppose we create a simple ggplot2 scatter plot

library("ggplot2")g = ggplot(mtcars, aes(disp, mpg)) +   geom_point()

But once created, how do you save the plot to a file? If you want to save the scatter plot as a PDF file, then the standard route is something like

pdf("figure1.pdf")print(g)dev.off()

The pdf() function is part of the grDevices package that comes with base R. When you call pdf(), R starts a graphics device driver for producing PDF files. The function dev.off() then closes the file driver.

The documentation page on the pdf() function is very detailed – see ?pdf. It highlights the tension between documentation for the developer and documentation for the user. The former cares about details, such as the fact that larger circles use a Bezier curve. But users just want to save a graph. I suspect that the vast majority of people using the pdf() function don’t really care about the details. They just want a PDF file that contains their plot!

There are other functions for creating PDF graphics. You could use Cairo::CairoPDF() or grDevices::cairo_pdf(). As you might gather, both of these functions use cairo graphics. Cairo is a 2D graphics library with support for multiple output devices. These R functions use the Cairo API. What’s not clear from the documentation is how the functions differ (but we’ll see differences later).
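For completeness, here is a minimal sketch (my own, not from the original post) of saving the same plot with the two Cairo-based PDF devices; the second call assumes the Cairo package is installed, and the filenames are just illustrative.

# Cairo device from the grDevices package (base R)
grDevices::cairo_pdf("figure1_cairo_pdf.pdf")
print(g)
dev.off()

# Cairo device from the Cairo package
Cairo::CairoPDF("figure1_CairoPDF.pdf")
print(g)
dev.off()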

You can determine if you have Cairo capabilities via

capabilities()["cairo"]#> cairo #>  TRUE

Most standard systems have Cairo support. Part of the difficulty I found when writing this post is that graphics support in R has changed over the years. So it’s very easy to find blog posts that contain outdated information, especially around Cairo.

A similar situation applies to PNG graphics. You can use the default graphics device

png("figure1.png")print(g)dev.off()

or you could specify the type via png(..., type = "cairo") or png(..., type = "cairo-png"). There’s also a relatively new package, ragg, that can save graphics as PNGs.
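A minimal sketch (my own, reusing the plot g from above) of those PNG alternatives; the last call assumes the ragg package is installed.

# cairo-backed PNG device from base R
png("figure1_cairo.png", type = "cairo-png")
print(g)
dev.off()

# AGG-backed PNG device from the ragg package
ragg::agg_png("figure1_ragg.png")
print(g)
dev.off()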

Intuitively, these functions must produce different outputs – otherwise why have them? But what is the difference? Is it file size? Speed of creating graphics? Or something else?

Cross Platform

One of R’s outstanding features is that it is cross platform. You write R code and it magically works under Linux, Windows and Mac. Indeed, the above the code “runs” under all three operating systems. But does it produce the same graphic under each platform? Spoiler! None of the above functions produce identical output across OS’s. So for “same”, I going to take a lax view and I just want figures that look the same.

Canonical Graphic

To create a test graphic, we first make some data

library("ggplot2")set.seed(1)df = data.frame(x = rnorm(1000), y = rnorm(1000))

Then create a graphic that has a few challenging aspects

ggplot(df, aes(x = x, y = y)) +
  # Anti-aliasing check
  geom_abline(intercept = 3, slope = 1, size = 3) +
  # Font check & newline
  geom_label(x = 2.5, y = -1,
             label = "This is italic text\n in Arial Narrow",
             family = "Arial Narrow",
             fontface = "italic", size = 6, label.size = 0) +
  # Font check
  theme_bw(base_family = "Times") +
  # Font check
  theme(axis.title = element_text(face = "italic"),
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        plot.caption = element_text(face = "plain")) +
  # Transparency
  geom_point(alpha = 0.4)

Many of the aspects of this test graphic have been taken from other blog posts. I’ve provided links at the end of this post.

The Challenge

The above graphic was created under all three operating systems, using the graphics drivers listed above. The complete script can be downloaded from this GitHub gist. In this post, I don’t care about file size or the speed of the graphics device. As most use cases for R graphics don’t really depend on a few KB or an extra second generating the graph, this seems a reasonable compromise for this test. All tests were performed using R 3.6.3 or R 3.6.2.

grDevices::pdf()

All plots failed due to fonts. Interestingly, the PDF version was 1.4, compared to 1.5 under the other methods. Careful reading of the pdf() help page suggests this is expected behaviour due to non-standard fonts. From the documentation, I’m pretty sure I could embed the necessary fonts in the PDF file. However, it seems clear that there are differences between OSs, so my fix under Linux might not be cross-platform. Also, if we changed font, the issue would appear again.

Cairo::CairoPDF()

Under all OSs, this function call executed without giving an error. However, there are severe font issues. The x on each plot is different and the text from geom_label() is different. Under Macs, it isn’t even in italics.

When we compare these graphics to the output from cairo_pdf() and the different png functions, it appears that the output from CairoPDF() is incorrect across all OSs.

Also, using the pdfinfo tool in Linux, each figure was created using a different version of Cairo: Windows (v1.10.2), Mac (v1.14.6) and Linux (v1.15.0).

grDevices::cairo_pdf()

All generated PDFs look the same, but are not identical! They again use different versions of Cairo (ranging from v1.14.6 to v1.16.0) and so have different file sizes.

If we compare Cairo::CairoPDF() to grDevices::cairo_pdf() under Windows, we can see the graphics created are significantly different.

Overall, if you are generating PDF files then it’s clear you should use grDevices::cairo_pdf() if you want any chance of your code working across different OSs.

grDevices::png()

The png function produces graphics under Linux and Mac. However, the placement of the text is slightly different, e.g. a few pixels. Under Windows, the font appears to be the default and not Times or Arial.

Using the pnginfo tool also highlights that the three PNGs differ by

  • Colour Type: Linux: paletted (256 colours) vs Mac: RGB vs Windows: paletted (156 colours)
  • Channels: 1 vs 4 vs 1

Also under Windows, the graphic doesn’t use anti-aliasing. This is a technique for smoothing over pixels on straight lines. If we zoom into the line on Linux/Mac vs Windows we can see the “stair-case” effect.

grDevices::png(…, type = “cairo”)

All OSs produce a graphic that looks similar. Placement of text still differs by a few pixels between the OSs, but it’s barely visible. The axis lines look lighter under Mac, and pnginfo indicates that colour type and channels differ.

grDevices::png(…, type = "cairo-png")

All OSs produce a graphic that looks similar and text placement appears (to the naked eye) to be identical. Using pnginfo indicates that colour type and channels are now the same.

The ragg package

Linux, Mac & Windows produced a graphic that looked similar and pnginfo indicates that all attributes were identical. However the linebreak in geom_label() uncovered a bug in the ragg package, which was fixed a few days later.

Conclusion

Overall, it appears that getting graphics to be identical across different OSs is more difficult than one would first assume! For PDF plots, the least worst option is grDevices::cairo_pdf(); this doesn’t produce identical graphics, as there are different versions of Cairo in play, but this test indicates the graphics are very similar. Frustratingly, you would typically save your bar/line/scatter plot as a PDF due to the superior resolution, but it also appears this format isn’t particularly well suited to being cross-platform.

For PNG graphics, it’s clear you should always use type = "cairo-png" with the png() function. However, I will be moving to the ragg package in the near future, especially as RStudio are incorporating it into their IDE. The quality and performance are impressive, and its goal is to produce identical cross-platform graphics.

See our latest blog post on setting graphics devices inside an Rmarkdown document, for details on how to use cairo within knitr.


Jumping Rivers are full service, RStudio certified partners. Part of our role is to offer support in RStudio Pro products. If you use any RStudio Pro products, feel free to contact us (info@jumpingrivers.com). We may be able to offer free support.


Links and thanks

This post used bits and pieces from a wide variety of sources. Hopefully, I’ve not forgotten anyone.

  • This post actually started with an e-conversation with Bob Rudis four years ago, which resulted in this initial test.
  • This post made me think about Cairo graphics.
  • Another post on Cairo, with a look at anti-aliasing issues.
  • Thanks to flaticon for nice OS images.
  • A big thank you to Catherine Hurley for running the Mac tests.
  • I don’t think I used this post, but I certainly read it. The post provides an excellent description for working with images and figures.

The post Saving R Graphics across OSs appeared first on Jumping Rivers.


To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.


poorman: Select helpers, bug fixes and tests, tests, tests!


[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Introduction

Hello all and welcome to another edition of the poorman series of blog posts. In this series I am discussing my progress in writing a base R equivalent of dplyr. What’s nice about this series is that if you’re not into poorman and would prefer just to use dplyr, then that’s absolutely OK! By highlighting poorman functionality, this series of blog posts simultaneously highlights dplyr functionality too!

Today I want to showcase some column selection helper features from the tidyselect package – often used in conjunction with dplyr – which I have now finished replicating within poorman, of course using base R only. I’ll also discuss a little bit about what is happening in the background of poorman’s development with regard to testing.

Select Helpers

The first official release version of poorman (v 0.1.9) was the first version that I considered to contain all of the “core” functionality of dplyr; everything from select() to group_by() and summarise(). Now that that functionality is nailed down, it gives me time to focus on some of the smaller features of dplyr and the wider tidyverse, and so over the last couple of weeks I have been working on adding the tidyselect::select_helpers to poorman. For those that are unaware, select_helpers are a collection of functions that help the user to select variables based on their names. For example, you may wish to select all columns which start with a certain prefix, or maybe select columns matching a particular regular expression. Let’s take a look at some examples.

Selecting Columns Based On Partial Column Names

If your data contain lots of columns whose names share a similar structure, you can use partial matching by adding starts_with(), ends_with() or contains() in your select()/relocate() statement.

library(poorman, warn.conflicts = FALSE)
iris %>% select(starts_with("Petal"), ends_with("Width")) %>% head()
#   Petal.Length Petal.Width Sepal.Width
# 1          1.4         0.2         3.5
# 2          1.4         0.2         3.0
# 3          1.3         0.2         3.2
# 4          1.5         0.2         3.1
# 5          1.4         0.2         3.6
# 6          1.7         0.4         3.9

Reordering Columns

The columns of the iris dataset come in the following order.

colnames(iris)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

But what if we wanted all of the “Width” columns at the start of this data.frame? There are a couple of ways in which we can achieve this. Firstly, we can use select() in combination with the select helper everything().

iris %>% select("Petal.Width", "Sepal.Width", everything()) %>% head()
#   Petal.Width Sepal.Width Sepal.Length Petal.Length Species
# 1         0.2         3.5          5.1          1.4  setosa
# 2         0.2         3.0          4.9          1.4  setosa
# 3         0.2         3.2          4.7          1.3  setosa
# 4         0.2         3.1          4.6          1.5  setosa
# 5         0.2         3.6          5.0          1.4  setosa
# 6         0.4         3.9          5.4          1.7  setosa

poorman here first selects the columns “Petal.Width” and “Sepal.Width” before selecting everything else. This is great, but if your data contain a lot of columns containing “Width” then you will have to write out a lot of column names. Well this is where we can use relocate() and the select helper contains() to move those columns to the start of iris.

iris %>% relocate(contains("Width")) %>% head()
#   Sepal.Width Petal.Width Sepal.Length Petal.Length Species
# 1         3.5         0.2          5.1          1.4  setosa
# 2         3.0         0.2          4.9          1.4  setosa
# 3         3.2         0.2          4.7          1.3  setosa
# 4         3.1         0.2          4.6          1.5  setosa
# 5         3.6         0.2          5.0          1.4  setosa
# 6         3.9         0.4          5.4          1.7  setosa

By default, relocate() will move all selected columns to the start of the data.frame. You can adjust this behaviour with the .before and .after parameters. Let’s move the “Petal” columns to appear after the “Species” column.

iris %>% relocate(contains("Petal"), .after = Species) %>% head()
#   Sepal.Length Sepal.Width Species Petal.Length Petal.Width
# 1          5.1         3.5  setosa          1.4         0.2
# 2          4.9         3.0  setosa          1.4         0.2
# 3          4.7         3.2  setosa          1.3         0.2
# 4          4.6         3.1  setosa          1.5         0.2
# 5          5.0         3.6  setosa          1.4         0.2
# 6          5.4         3.9  setosa          1.7         0.4

Select Columns Using a Regular Expression

The previous helper functions work with exact pattern matches. If you have similar patterns within your column names that are not quite the same, you can use regular expressions with the matches() helper function to identify them. Here I will use the mtcars dataset and look to extract all columns which start with a “w” or a “d” and end with a “t”.

mtcars %>% select(matches("^[wd].*[t]$")) %>% head()
#                   drat    wt
# Mazda RX4         3.90 2.620
# Mazda RX4 Wag     3.90 2.875
# Datsun 710        3.85 2.320
# Hornet 4 Drive    3.08 3.215
# Hornet Sportabout 3.15 3.440
# Valiant           2.76 3.460

The Select Helper List

We have seen a few examples of select helpers now available in poorman. There are more, however, and the following list details each of them. Remember that these functions can be used to help users select() and relocate() columns within data.frames; a short sketch of the less common helpers follows the list.

  • starts_with(): Starts with a prefix.
  • ends_with(): Ends with a suffix.
  • contains(): Contains a literal string.
  • matches(): Matches a regular expression.
  • num_range(): Matches a numerical range like x01, x02, x03.
  • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
  • any_of(): Same as all_of(), except that no error is thrown for names that don’t exist.
  • everything(): Matches all variables.
  • last_col(): Select last variable, possibly with an offset.
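As promised, here is a hedged sketch (my own example, not from the original post) of a few of the helpers that aren’t demonstrated above, using a small throwaway data.frame.

df <- data.frame(x1 = 1, x2 = 2, x3 = 3, y = 4)

df %>% select(num_range("x", 1:2))    # x1 and x2
df %>% select(any_of(c("y", "z")))    # y, with no error for the missing "z"
df %>% select(all_of(c("x1", "y")))   # x1 and y; errors if either is missing
df %>% select(last_col())             # y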

Docker

There was a request on Twitter to put together a Docker image for poorman. This has now been done and can be seen on Dockerhub. This means if you have Docker installed, you can run a containerised version of poorman easily with the following line of code.

docker run --rm -it nathaneastwood/poorman

Test, Test, Test!

Since the last release of poorman (v 0.1.9) to CRAN, I have been working on a few bugs that I and other users of the package had identified. I’m happy to say that these have now been squashed and the issues list is looking very empty. As a brief overview, the following problems are now fixed (a short demonstration follows the list):

  • mutate() column creations are immediately available, e.g. mtcars %>% mutate(mpg2 = mpg * 2, mpg4 = mpg2 * 2) will create columns named mpg2 and mpg4
  • group_by() groups now persist in selections, e.g. mtcars %>% group_by(am) %>% select(mpg) will return am and mpg columns
  • slice() now duplicates rows, e.g. mtcars %>% slice(2, 2, 2) will return row 2 three times
  • summarize() is now exported
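A quick, hedged demonstration of those fixed behaviours (my own examples, assuming the version of poorman described in this post):

mtcars %>% mutate(mpg2 = mpg * 2, mpg4 = mpg2 * 2) %>% select(mpg, mpg2, mpg4) %>% head(2)
mtcars %>% group_by(am) %>% select(mpg) %>% head(2)  # the grouping column `am` is kept
mtcars %>% slice(2, 2, 2)                            # row 2, repeated three times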

dplyr is a very well known and extremely well developed package. In order for poorman to have any credibility, it needs to work correctly. Therefore a large amount of effort and energy has gone into testing poorman to ensure it produces the results one would expect. Since adding all of the new features and bug fixes described in this blog, poorman has surpassed 100 tests!

tinytest::test_all()                                    
# Running test_arrange.R................    5 tests OK
# Running test_filter.R.................    6 tests OK
# Running test_groups.R.................    5 tests OK
# Running test_joins_filter.R...........    4 tests OK
# Running test_joins.R..................    7 tests OK
# Running test_mutate.R.................    6 tests OK
# Running test_pull.R...................    6 tests OK
# Running test_relocate.R...............    6 tests OK
# Running test_rename.R.................    4 tests OK
# Running test_rownames.R...............    2 tests OK
# Running test_select_helpers.R.........   25 tests OK
# Running test_select.R.................   13 tests OK
# Running test_slice.R..................    5 tests OK
# Running test_summarise.R..............    6 tests OK
# Running test_transmute.R..............    3 tests OK
# Running test_utils.R..................    1 tests OK
# [1] "All ok, 104 results"

This also means that the package coverage is extremely high.

covr::package_coverage()
# poorman Coverage: 97.87%
# R/utils.R: 80.00%
# R/group.R: 90.91%
# R/joins.R: 92.86%
# R/arrange.R: 100.00%
# R/filter.R: 100.00%
# R/init.R: 100.00%
# R/joins_filtering.R: 100.00%
# R/mutate.R: 100.00%
# R/pipe.R: 100.00%
# R/pull.R: 100.00%
# R/relocate.R: 100.00%
# R/rename.R: 100.00%
# R/rownames.R: 100.00%
# R/select_helpers.R: 100.00%
# R/select.R: 100.00%
# R/slice.R: 100.00%
# R/summarise.R: 100.00%
# R/transmute.R: 100.00%

I hope this gives users of poorman that extra confidence when using the package. I have now submitted this updated version of poorman to CRAN and I am just waiting on their feedback so hopefully in the coming days it will be available. Watch this space!


To leave a comment for the author, please follow the link and comment on their blog: Random R Ramblings.


wrapped Normal distribution


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One version of the wrapped Normal distribution on (0,1) is expressed as a sum of Normal densities with means shifted by all integers

\psi(x;\mu,\sigma)=\sum_{k\in\mathbb Z}\varphi(x;\mu+k,\sigma)\mathbb I_{(0,1)}(x)

which, while a parameterised density, has imho no particular statistical appeal over the use of other series. It was nonetheless the centre of a series of questions on X validated in the past weeks, curiously used as the basis of a random walk type move over the unit cube, along with a uniform component. Simulating from this distribution is easily done when seeing it as an infinite mixture of truncated Normal distributions, since the weights are easily computed

\sum_{k\in\mathbb Z}\overbrace{[\Phi_\sigma(1-\mu-k)-\Phi_\sigma(-\mu-k)]}^{p_k(\mu,\sigma)}\times\dfrac{\varphi_\sigma(x-\mu-k)\mathbb I_{(0,1)}(x)}{\Phi_\sigma(1-\mu-k)-\Phi_\sigma(-\mu-k)}

Hence coding simulations as

wrap <- function(x, mu, sig) {
  ter = trunc(5 * sig + 1)
  return(sum(dnorm(x + (-ter):ter, mu, sig)))
}

# The siw() sampler was truncated in the feed; only its opening survives:
siw = function(N = 1e4, beta = .5, mu, sig) {
  unz = (runif(N)

and checking that the harmonic mean estimator was functioning for this density, predictably since it is lower bounded on (0,1). The prolix originator of the question was also wondering at the mean of the wrapped Normal distribution, which I derived as (predictably)

\mu+\sum_{k\in\mathbb Z} k\,p_k(\mu,\sigma)

but could not simplify any further except for \mu=0,½,1, when it is ½. A simulated evaluation of the mean as a function of μ shows a vaguely sinusoidal pattern, also predictably periodic and unsurprisingly antisymmetric, and apparently independent of the scale parameter σ…
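Since the siw() sampler above was cut off in the feed, here is a hedged sketch (my own, following the mixture representation described above, not the original code) of simulating from this wrapped Normal: pick an integer shift k with probability p_k(μ,σ), then draw from the corresponding Normal truncated to (0,1).

rwrap <- function(N = 1e4, mu = 0.3, sig = 0.2) {
  ter <- trunc(5 * sig + 1)
  k <- (-ter):ter
  # mixture weights p_k = Phi_sigma(1 - mu - k) - Phi_sigma(-mu - k)
  p <- pnorm(1 - mu - k, sd = sig) - pnorm(-mu - k, sd = sig)
  ks <- sample(k, N, replace = TRUE, prob = p)
  # inverse-CDF draw from N(mu + k, sig) truncated to (0, 1)
  lo <- pnorm(0, mu + ks, sig)
  hi <- pnorm(1, mu + ks, sig)
  qnorm(lo + runif(N) * (hi - lo), mu + ks, sig)
}

x <- rwrap()
hist(x, breaks = 50, probability = TRUE)
curve(Vectorize(wrap)(x, 0.3, 0.2), add = TRUE, col = "red")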


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

