
How to standardize group colors in data visualizations in R


[This article was first published on r – paulvanderlaken.com, and kindly contributed to R-bloggers.]

One best practice in visualization is to make your color scheme consistent across figures.

For instance, if you’re making multiple plots of the same dataset — say a group of 5 companies — you want each company to have the same, consistent color across all of these plots.

R has some great data visualization capabilities. The ggplot2 package in particular makes it easy to spin up a good-looking visualization quickly.

The default in R is to look at the number of groups in your data, and pick “evenly spaced” colors across a hue color wheel. This looks great straight out of the box:

# install.packages('ggplot2')
library(ggplot2)
theme_set(new = theme_minimal()) # sets a default theme
set.seed(1) # ensure reproducibility

# generate some data
n_companies = 5
df1 = data.frame(
  company = paste('Company', seq_len(n_companies), sep = '_'),
  employees = sample(50:500, n_companies),
  stringsAsFactors = FALSE)

# make a simple column/bar plot
ggplot(data = df1) + 
  geom_col(aes(x = company, y = employees, fill = company))

However, it can be challenging to make coloring consistent across plots.

For instance, suppose we want to visualize a subset of these data points.

index_subset1 = c(1, 3, 4, 5) # specify a subset

# make a plot using the subsetted dataframe
ggplot(data = df1[index_subset1, ]) + 
  geom_col(aes(x = company, y = employees, fill = company))

As you can see, the color scheme has now changed. With one group / company fewer, R now picks 4 new colors, evenly spaced around the color wheel. All but the first are different from the original colors the companies had.

One way to deal with this in R and ggplot2 is to add a scale_* layer to the plot.

Here we manually set hex color values in the scale_fill_manual function. These are the default hue colors R picks for five groups, minus the one that belonged to Company 2.

# install.packages('scales')
# the hue_pal function from the scales package looks up a number of evenly spaced colors
# which we can save as a vector of character hex values
default_palette = scales::hue_pal()(5)

# these colors we can then use in a scale_* function to manually override the color scheme
ggplot(data = df1[index_subset1, ]) +
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = default_palette[-2]) # we remove the element that belonged to company 2

As you can see, the colors are now aligned with the previous scheme. Only Company 2 is dropped, and all other companies retain their color.

However, this was very much hard-coded into our program. We had to specify which company to drop using default_palette[-2].

If the subset changes, which often happens in real life, our solution will break as the values in the palette no longer align with the groups R encounters:

index_subset2 = c(1, 2, 5) # but the subset might change

# and all manually-set colors will immediately misalign
ggplot(data = df1[index_subset2, ]) +
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = default_palette[-2])

Fortunately, R is a smart language, and you can work your way around this!

All we need to do is create what I call a named color palette!

It’s as simple as specifying a vector of hex color values! Alternatively, you can use the grDevices::rainbow() or grDevices::colors() functions, or one of the many functions included in the scales package.

# you can hard-code a palette using color strings
c('red', 'blue', 'green')

# or you can use the rainbow or colors functions of the grDevices package
rainbow(n_companies)
colors()[seq_len(n_companies)]

# or you can use the scales::hue_pal() function
palette1 = scales::hue_pal()(n_companies)
print(palette1)
[1] "#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3"

Now we need to assign names to this vector of hex color values. And these names have to correspond to the labels of the groups that we want to colorize.

You can use the names function for this.

names(palette1) = df1$company
print(palette1)
Company_1 Company_2 Company_3 Company_4 Company_5 
"#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3" 

But I prefer to use the setNames function so I can do the initialization, assignment, and naming simultaneously. It’s all the same though.

palette1_named = setNames(object = scales::hue_pal()(n_companies), nm = df1$company)
print(palette1_named)
Company_1 Company_2 Company_3 Company_4 Company_5 
"#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3" 

With this named color vector and the scale_*_manual functions we can now manually override the fill and color schemes in a flexible way. This results in the same plot we had without using the scale_*_manual function:

ggplot(data = df1) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette1_named)

However, now it does not matter if the dataframe is subsetted, as we specifically tell R which colors to use for which group labels by means of the named color palette:

# the colors remain the same if some groups are not found
ggplot(data = df1[index_subset1, ]) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette1_named)

# and also if other groups are not found
ggplot(data = df1[index_subset2, ]) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette1_named)

Once you are aware of these superpowers, you can do so much more with them!

How about highlighting a specific group?

Just set all the other colors to ‘grey’…

# lets create an all grey color palette vector
palette2 = rep('grey', times = n_companies)
palette2_named = setNames(object = palette2, nm = df1$company)
print(palette2_named)

Company_1 Company_2 Company_3 Company_4 Company_5 
   "grey"    "grey"    "grey"    "grey"    "grey" 

# this looks terrible in a plot
ggplot(data = df1) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette2_named)

… and assign one of the companies a different color.

# override one of the 'grey' elements using an index by name
palette2_named['Company_2'] = 'red'
print(palette2_named)

Company_1 Company_2 Company_3 Company_4 Company_5 
   "grey"     "red"    "grey"    "grey"    "grey" 

# and our plot is professionally highlighting a certain group
ggplot(data = df1) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette2_named)

We can apply these principles to other types of data and plots.

For instance, let’s generate some time series data…

timepoints = 10
df2 = data.frame(
  company = rep(df1$company, each = timepoints),
  employees = rep(df1$employees, each = timepoints) + round(rnorm(n = nrow(df1) * timepoints, mean = 0, sd = 10)),
  time = rep(seq_len(timepoints), times = n_companies),
  stringsAsFactors = FALSE)

… and visualize these using a line plot, adding the color palette in the same way as before:

ggplot(data = df2) + 
  geom_line(aes(x = time, y = employees, col = company), size = 2) +
  scale_color_manual(values = palette1_named)

If we miss one of the companies — let’s skip Company 2 — the palette makes sure the others remain colored as specified:

ggplot(data = df2[df2$company %in% df1$company[index_subset1], ]) + 
  geom_line(aes(x = time, y = employees, col = company), size = 2) +
  scale_color_manual(values = palette1_named)

Also the highlighted color palette we used before will still work like a charm!

ggplot(data = df2) + 
  geom_line(aes(x = time, y = employees, col = company), size = 2) +
  scale_color_manual(values = palette2_named)

Now, let’s scale up the problem! Pretend we have not 5, but 20 companies.

The code will work all the same!

set.seed(1) # ensure reproducibility

# generate new data for more companies
n_companies = 20
df1 = data.frame(
  company = paste('Company', seq_len(n_companies), sep = '_'),
  employees = sample(50:500, n_companies),
  stringsAsFactors = FALSE)

# lets create an all grey color palette vector
palette2 = rep('grey', times = n_companies)
palette2_named = setNames(object = palette2, nm = df1$company)

# highlight one company in a different color
palette2_named['Company_2'] = 'red'
print(palette2_named)

# make a bar plot
ggplot(data = df1) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = palette2_named) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) # rotate and align the x labels

Also for the time series line plot:

timepoints = 10
df2 = data.frame(
  company = rep(df1$company, each = timepoints),
  employees = rep(df1$employees, each = timepoints) + round(rnorm(n = nrow(df1) * timepoints, mean = 0, sd = 10)),
  time = rep(seq_len(timepoints), times = n_companies),
  stringsAsFactors = FALSE)

ggplot(data = df2) + 
  geom_line(aes(x = time, y = employees, col = company), size = 2) +
  scale_color_manual(values = palette2_named)

The possibilities are endless; the power is now yours!

Just think of the efficiency gain if you made a custom color palette with, for instance, your company’s brand colors!
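For example, here is a minimal sketch of such a palette (the company names match the data generated above, but the hex values are made-up stand-ins for real brand colors):

# hypothetical brand colors, keyed by the group labels used in the data
brand_palette = c(Company_1 = "#003366",  # hypothetical primary color
                  Company_2 = "#E31B23",  # hypothetical accent color
                  Company_3 = "#7F7F7F")  # hypothetical neutral color

ggplot(data = df1[df1$company %in% names(brand_palette), ]) + 
  geom_col(aes(x = company, y = employees, fill = company)) +
  scale_fill_manual(values = brand_palette)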

For more R tricks to up your programming productivity and effectiveness, visit the R tips and tricks page!




Introducing bwsTools: A Package for Case 1 Best-Worst Scaling (MaxDiff) Designs


[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers.]

Case 1 best-worst scaling (also known as MaxDiff) designs involve presenting respondents with a number of items and asking them to pick which is “best” and “worst” of the set. More generally, the respondent is asked which items have the most and least of any given feature (importance, attractiveness, interest, and so on). Respondents complete many sets of items in a row, and from this we can learn how the items rate and rank against one another. One of the reasons I like them as a prejudice researcher is that they can help hide the purpose of the measurement tool: If I ask about 13 different items over 13 different trials, but the one about prejudice only comes up in 4 of the 13 trials, it masks what the questionnaire is actually about. But the most standard use cases involve marketing.

There is a lot of literature out there on how to calculate rating scores for items in these designs across the entire sample (aggregate scores) and within a respondent (individual scores). With a focus on individual-level measurement, I put together a package called bwsTools that provides functions for creating these designs and analyzing them at both the aggregate and individual level. The package is on CRAN and can be installed with install.packages("bwsTools").
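As a minimal sketch of getting started (only the installation call above comes from the post; the rest is generic R for exploring any new package):

install.packages("bwsTools")
library(bwsTools)

# browse the package's help index and vignettes
help(package = "bwsTools")
browseVignettes("bwsTools")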

Some resources to get you started using the package:



Working with Statistics Canada Data in R, Part 5: Retrieving Census Data


[This article was first published on Data Enthusiast's Blog, and kindly contributed to R-bloggers.]

Back to Working with Statistics Canada Data in R, Part 4.

Introduction

Now that we are ready to start working with Canadian Census data, let’s first briefly address the question of why you may need to use it. After all, CANSIM data is often more up-to-date and covers a much broader range of topics than the national census data, which is gathered every five years and covers a limited number of questions.

The main reason is that CANSIM data is far less granular geographically. Most of it is collected at the provincial or even higher regional level. You may be able to find CANSIM data on a limited number of questions for some of the country’s largest metropolitan areas, but if you need the data for a specific census division, city, town, or village, you’ll have to use the Census.

To illustrate the use of the cancensus package, let’s do a small research project. First, in this post we’ll retrieve these key labor force characteristics of the largest metropolitan areas in each of the five geographic regions of Canada:

  • Labor force participation rate, employment rate, and unemployment rate.
  • Percent of workers by work situation: full time vs part time, by gender.
  • Education levels of people aged 25 to 64, by gender.

The cities (metropolitan areas) that we are going to look at are: Calgary, Halifax, Toronto, Vancouver, and Whitehorse. We’ll also get these data for Canada as a whole, for comparison and to illustrate the retrieval of data at different geographic levels.

Next, in the upcoming Part 6 of the “Working with Statistics Canada Data in R” series, we will visualize these data, including making a faceted plot and writing a function to automate repetitive plotting tasks.

Keep in mind that cancensus also allows you to retrieve geospatial data, that is, borders of census regions at various geographic levels, in sp and sf formats. Retrieving and visualizing Statistics Canada geospatial data will be covered later in this series.

So, let’s get started by loading the required packages:

library(cancensus)
library(tidyverse)

Searching for Data

cancensus retrieves census data with the get_census function. get_census can take a number of arguments, the most important of which are dataset, regions, and vectors, which have no defaults. Thus, in order to be able to retrieve census data, you’ll first need to figure out:

  • your dataset,
  • your region(s), and
  • your data vector(s).

Find Census Datasets

Let’s see which census datasets are available through the CensusMapper API:

list_census_datasets()

Currently, datasets earlier than 2001 are not available, so if you need to work with the 20th century census data, you won’t be able to retrieve it with cancensus.

Find Census Regions

Next, let’s find the regions that we’ll be getting the data for. To search for census regions, use the search_census_regions function.

Let’s take a look at what region search returns for Toronto. Note that cancensus functions return their output as dataframes, so it is easy to subset. Here I limited the output to the most relevant columns to make sure it fits on screen. You can run the code without [c(1:5, 8)] to see all of it.

# all census levels
search_census_regions(searchterm = "Toronto", 
                      dataset = "CA16")[c(1:5, 8)]

Returns:

# A tibble: 3 x 6
  region  name    level     pop municipal_status PR_UID
1 35535   Toronto CMA   5928040 B                35    
2 3520    Toronto CD    2731571 CDR              35    
3 3520005 Toronto CSD   2731571 C                35    

You may have expected to get only one region: the city of Toronto, but instead you got three! So, what is the difference? Look at the column ‘level’ for the answer. Often, the same geographic region can be represented by several census levels, as is the case here. There are three levels for Toronto, which is simultaneously a census metropolitan area, a census division, and a census sub-division. Note also the ‘PR_UID’ column that contains numeric codes for Canada’s provinces and territories, which can help you distinguish between different census regions that have the same or similar names. For an example, run the code above replacing “Toronto” with “Windsor”.
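For instance (this simply re-runs the search above with the suggested substitution):

# a name that appears in more than one province
search_census_regions(searchterm = "Windsor", 
                      dataset = "CA16")[c(1:5, 8)]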

Remember that we were going to plot the data for census metropolitan areas? You can choose the geographic level with the level argument, which can take the following values: ‘C’ for Canada (national level), ‘PR’ for province, ‘CMA’ for census metropolitan area, ‘CD’ for census division, ‘CSD’ for census sub-division, or NA:

# specific census level
search_census_regions("Toronto", "CA16", level = "CMA")

Let’s now list census regions that may be relevant for our project:

# explore available census regions
names <- c("Canada", "Calgary", "Halifax", 
           "Toronto", "Vancouver", "Whitehorse")
map_df(names, ~ search_census_regions(., dataset = "CA16"))

The purrr::map_df function applies search_census_regions iteratively to each element of the names vector and returns the output as a single dataframe. Note also the ~ . syntax. Think of it as the tilde taking names and passing it as an argument to the place indicated by the dot in the search_census_regions function. You can find more about the tilde-dot syntax here. It may be a good idea to read the whole tutorial: purrr is a super-useful package, but not the easiest to learn, and this tutorial does a great job explaining the basics.
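As a tiny standalone illustration of the tilde-dot syntax (a toy example, not from the original post):

# each element of the input vector is passed to the position marked by the dot
purrr::map_df(1:3, ~ data.frame(x = ., x_squared = .^2))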

As you can see, there are multiple entries for each search term, so we’ll need to choose the results for census metropolitan areas (and for the census sub-division in the case of Whitehorse, since Whitehorse is too small to be considered a census metropolitan area):

# select only the regions we need: CMAs (and CSD for Whitehorse)
regions <- list_census_regions(dataset = "CA16") %>% 
  filter(grepl("Calgary|Halifax|Toronto|Vancouver", name) &
         grepl("CMA", level) | 
         grepl("Canada|Whitehorse$", name)) %>% 
  as_census_region_list()

Pay attention to the use of logical operators to filter the output by several conditions at once; also note the use of the $ regex meta-character to choose the entry ending with ‘Whitehorse’ from the ‘name’ column (to filter out ‘Whitehorse, Unorganized’).

Finally, as_census_region_list converts list_census_regions output to a data object of type list that can be passed to the get_census function as its regions argument.

Find Census Vectors

Canadian census data is made up of individual variables, aka census vectors. Vector number(s) is another argument you need to specify in order to retrieve data with the get_census function.

cancensus has two functions that allow you to search through census data variables: list_census_vectors and search_census_vectors.

list_census_vectors returns all available vectors for a given dataset as a single dataframe containing vectors and their descriptions:

# structure of list_census_vectors output
str(list_census_vectors(dataset = 'CA16'))

# count variables in 'CA16' dataset
nrow(list_census_vectors(dataset = 'CA16'))

As you can see, there are 6623 variables (at the time of writing) in the 2016 census dataset, so list_census_vectors won’t be the most convenient function for finding a specific vector. Note, however, that there are situations (such as when you need to select a lot of vectors at once) in which list_census_vectors would be appropriate.

Usually it is more convenient to use search_census_vectors to search for vectors. Just pass the text string of what you are looking for as the searchterm argument. You don’t have to be precise: this function works even if you make a typo or are uncertain about the spelling of your search term.

Let’s now find census data vectors for labor force involvement rates:

# get census data vectors for labor force involvement rates
lf_vectors <- 
  search_census_vectors(searchterm = "employment rate", 
                        dataset = "CA16") %>% 
  union(search_census_vectors("participation rate", "CA16")) %>% 
  filter(type == "Total") %>% 
  pull(vector)

Let’s take a look at what this code does. Since searchterm doesn’t have to be a precise match, “employment rate” search term retrieves unemployment rate vectors too. In the next line, union merges dataframes returned by search_census_vectors into a single dataframe. Note that in this case union could be substituted with bind_rows. I recommend using union in order to avoid data duplication. Next, we choose only the “Total” numbers, since we are not going to plot labor force indicators by gender. Finally, the pull command extracts a single vector from the dataframe, just like the $ subsetting operator: we need ‘lf_vectors’ to be a data object of type vector in order to pass it to the vectors argument of the get_census function.

The second labor force indicator we are looking for, is the number of people who work full-time and part-time, broken down by gender. But before we proceed with getting the respective vectors, let me show you another way to figure out search terms to put inside the search_census_vectors function: use Statistics Canada online Census Profile tool. It can be used to quickly explore census data as well as to figure out variable names (search terms) and their hierarchical structure.

For example, let’s look at census labor data for Calgary metropolitan area. Scrolling down, you will quickly find the numbers and text labels for full-time and part-time workers:

Now we know the exact search terms, so we can get precisely the vectors we need, free from any extraneous data:

# get census data vectors for full and part time work
# get vectors and labels
work_vectors_labels <- 
  search_census_vectors("full year, full time", "CA16") %>% 
  union(search_census_vectors("part year and/or part time", "CA16")) %>% 
  filter(type != "Total") %>% 
  select(1:3) %>% 
  mutate(label = str_remove(label, ".*, |.*and/or ")) %>% 
  mutate(type = fct_drop(type)) %>% 
  setNames(c("vector", "gender", "type"))

# extract vectors
work_vectors <- work_vectors_labels$vector

Note how this code differs from the code with which we extracted labor force involvement rates: since we need the data to be sub-divided both by the type of work and by gender (hence no “Total” values here), we create a dataframe that assigns respective labels to each vector number. This work_vectors_labels dataframe will supply categorical labels to be attached to the data retrieved with get_census.

Also, note these three lines:

  mutate(label = str_remove(label, ".*, |.*and/or ")) %>% 
  mutate(type = fct_drop(type)) %>% 
  setNames(c("vector", "gender", "type"))

The first mutate call removes all text up to and including ‘, ‘ and ‘and/or ‘ (spaces included) from the ‘label’ column. The second drops unused factor level “Total” – it is a good practice to make sure there are no unused factor levels if you are going to use ggplot2 to plot your data. Finally, setNames renames variables for convenience.
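To see what that first str_remove call does, here is a quick illustration on made-up label strings (the actual census labels differ, so treat these inputs as hypothetical):

stringr::str_remove(c("Worked full year, full time", 
                      "Worked part year and/or part time"), 
                    ".*, |.*and/or ")
# [1] "full time" "part time"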

Finally, let’s retrieve vectors for the education data for the age group from 25 to 64 years, by gender. Before we do this, I’d like to draw your attention to the fact that some of the census data is hierarchical, which means that some variables (census vectors) are included in parent variables and/or include child variables. It is very important to choose vectors at the proper hierarchical levels so that you do not double-count or omit your data.

Education data is a good example of hierarchical data. You can explore data hierarchy using parent_census_vectors and child_census_vectors functions as described here. However, you may find exploring the hierarchy visually using Statistics Canada Census Profile tool to be more convenient:

So, let’s now retrieve and label the education data vectors:

# get census vectors for education levels data
# get vectors and labels
ed_vectors_labels <- 
  search_census_vectors("certificate", "CA16") %>% 
  union(search_census_vectors("degree", "CA16")) %>% 
  union(search_census_vectors("doctorate", "CA16")) %>% 
  filter(type != "Total") %>% 
  filter(grepl("25 to 64 years", details)) %>% 
  slice(-1,-2,-7,-8,-11:-14,-19,-20,-23:-28) %>% 
  select(1:3) %>% 
  mutate(label = 
           str_remove_all(label, 
                          " cert.*diploma| dipl.*cate|, CEGEP| level|")) %>% 
  mutate(label = 
           str_replace_all(label, 
                           c("No.*" = "None", 
                             "Secondary.*" = "High school or equivalent", 
                             "other non-university" = "equivalent", 
                             "University above" = "Cert. or dipl. above", 
                             "medicine.*" = "health**", 
                             ".*doctorate$" = "Doctorate*"))) %>% 
  mutate(type = fct_drop(type)) %>% 
  setNames(c("vector", "gender", "level"))

# extract vectors
ed_vectors <- ed_vectors_labels$vector

Note the slice function, which allows you to manually select specific rows from a dataframe: positive numbers choose rows to keep, negative numbers choose rows to drop. I used slice to drop the hierarchical levels from the data that are either too generalized or too granular. Note also that I had to edit text strings in the data. Finally, I added asterisks after “Doctorate” and “health”. These are not regex symbols, but actual asterisks that will be used to refer to footnotes in plot captions later on.

Now that we have figured out our dataset, regions, and data vectors (and labelled the vectors, too), we are finally ready to retrieve the data itself.

Retrieve Census Data

To retrieve census data, feed the dataset, regions, and data vectors into get_census as its respective arguments. Note also that get_census has a use_cache argument (set to TRUE by default), which tells get_census to retrieve data from cache if available. If there is no cached data, the function will query the CensusMapper API for the data and will save it in cache, while use_cache = FALSE will force get_census to query the API and update the cache.
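For example, to force a fresh API query rather than reading from the cache (a minimal sketch using the objects defined above; the labor_fresh name is just for illustration):

labor_fresh <- get_census(dataset = "CA16", 
                          regions = regions, 
                          vectors = lf_vectors, 
                          use_cache = FALSE)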

# get census data for labor force involvement rates
# feed regions and vectors into get_census()
labor <- 
  get_census(dataset = "CA16", 
             regions = regions, 
             vectors = lf_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  setNames(c("region", "employment rate", 
             "unemployment rate", 
             "participation rate")) %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  pivot_longer("employment rate":"participation rate", 
               names_to = "indicator", 
               values_to = "rate") %>% 
  mutate_if(is.character, as_factor)

The select call drops columns with irrelevant data, setNames renames columns to remove vector numbers from variable names, which will be then converted to values in the ‘indicator’ column; str_remove inside the mutate call drops municipal status codes ‘(B)’ and ‘(CY)’ from region names; finally, mutate_if converts characters to factors for subsequent plotting.

An important function here is tidyr::pivot_longer. It converts the dataframe from wide to long format. It takes three columns: ‘employment rate’, ‘unemployment rate’, and ‘participation rate’, and converts their names into values of the ‘indicator’ variable, while their numeric values are passed to the ‘rate’ variable. The reason for the conversion is that we are going to plot the data for all three labor force indicators in the same graphic, which makes it necessary to store the indicators as a single factor variable.

Next, let’s retrieve census data about the percent of full time vs part time workers, by gender, and the data about the education levels of people aged 25 to 64, by gender:

# get census data for full time and part time work
work <- 
  get_census(dataset = "CA16", 
             regions = regions, 
             vectors = work_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  rename(region = "Region Name") %>% 
  pivot_longer(2:5, names_to = "vector", 
                    values_to = "count") %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  mutate(vector = str_remove(vector, ":.*")) %>% 
  left_join(work_vectors_labels, by = "vector") %>% 
  mutate(gender = str_to_lower(gender)) %>% 
  mutate_if(is.character, as_factor)

# get census data for education levels
education <- 
  get_census(dataset = "CA16", 
             regions = regions, 
             vectors = ed_vectors) %>% 
  select(-c(1, 2, 4:7)) %>% 
  rename(region = "Region Name") %>% 
  pivot_longer(2:21, names_to = "vector", 
                     values_to = "count") %>% 
  mutate(region = str_remove(region, " (.*)")) %>% 
  mutate(vector = str_remove(vector, ":.*")) %>% 
  left_join(ed_vectors_labels, by = "vector") %>% 
  mutate_if(is.character, as_factor)

Note one important difference from the code I used to retrieve the labor force involvement data: here I added the dplyr::left_join function that joins labels to the census data.

We now have the data and are ready to visualize it, which will be done in the next post.

Annex: Notes and Definitions

For those of you who are outside of Canada, Canada’s geographic regions and their largest metropolitan areas are:

  • The Atlantic Provinces – Halifax
  • Central Canada – Toronto
  • The Prairie Provinces – Calgary
  • The West Coast – Vancouver
  • The Northern Territories – Whitehorse

These regions should not be confused with 10 provinces and 3 territories, which are Canada’s sub-national administrative divisions, much like states in the U.S. Each region consists of several provinces or territories, except the West Coast, which includes only one province – British Columbia. You can find more about Canada’s geographic regions and territorial structure here (pages 44 to 51).

For the definitions of employment rate, unemployment rate, labour force participation rate, full-time work, and part-time work, see Statistics Canada’s Guide to the Labour Force Survey.

You can find more about census geographic areas here and here. There is also a glossary of census-related geographic concepts.




Modeling pandemics (1)


[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

The most popular model to model epidemics is the so-called SIR model – or Kermack-McKendrick model. Consider a population of size N, and assume that S is the number of susceptible, I the number of infectious, and R the number of recovered (or immune) individuals:

\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta I S}{N},\\
\frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I,\\
\frac{dR}{dt} &= \gamma I,
\end{aligned}

so that

\frac{dS}{dt} + \frac{dI}{dt} + \frac{dR}{dt} = 0,

which implies that S + I + R = N. In order to be more realistic, consider some (constant) birth rate \mu, so that the model becomes

\begin{aligned}
\frac{dS}{dt} &= \mu(N - S) - \frac{\beta I S}{N},\\
\frac{dI}{dt} &= \frac{\beta I S}{N} - (\gamma + \mu) I,\\
\frac{dR}{dt} &= \gamma I - \mu R.
\end{aligned}

Note that in this model people get sick (infected), but they do not die: they recover. So here we can model chickenpox, for instance, not SARS.

The dynamics of the infectious class depends on the following ratio:

R_{0} = \frac{\beta}{\gamma + \mu},

which is the so-called basic reproduction number (or reproductive ratio). The effective reproductive ratio is R_0 S/N, and the turnover of the epidemic happens exactly when R_0 S/N = 1, i.e. when the fraction of remaining susceptibles is R_0^{-1}. As shown in Directly transmitted infectious diseases: Control by vaccination, if S/N < R_0^{-1} the disease (the number of people infected) will start to decrease.

Want to see it? Start with \mu = 0, \beta = 2 and \gamma = 1/2 for the parameters (as set in the code below). Here, R_0 = 4. We also need starting values

epsilon = .001
N = 1
S = 1 - epsilon
I = epsilon
R = 0

Then use the ordinary differential equation solver in R. The idea is to say that \boldsymbol{Z} = (S, I, R) and that we have the gradient \frac{\partial \boldsymbol{Z}}{\partial t} = SIR(\boldsymbol{Z}), where SIR is a function of the various parameters. Hence, set

p = c(mu = 0, N = 1, beta = 2, gamma = 1/2)
start_SIR = c(S = 1 - epsilon, I = epsilon, R = 0)

Then we must define the time, and the function that returns the gradient,

times = seq(0, 10, by = .1)
SIR = function(t, Z, p) {
  S = Z[1]; I = Z[2]; R = Z[3]; N = S + I + R
  mu = p["mu"]; beta = p["beta"]; gamma = p["gamma"]
  dS = mu * (N - S) - beta * S * I / N
  dI = beta * S * I / N - (mu + gamma) * I
  dR = gamma * I - mu * R
  dZ = c(dS, dI, dR)
  return(list(dZ)) # deSolve expects the derivatives wrapped in a list
}

To solve this problem use

library(deSolve)
resol = ode(y = start_SIR, times = times, func = SIR, parms = p)

We can visualize the dynamics below

par(mfrow = c(1, 2))
t = resol[, "time"]
plot(t, resol[, "S"], type = "l", xlab = "time", ylab = "")
lines(t, resol[, "I"], col = "red")
lines(t, resol[, "R"], col = "blue")
plot(t, t * 0 + 1, type = "l", xlab = "time", ylab = "", ylim = 0:1)
polygon(c(t, rev(t)), c(resol[, "R"], rep(0, nrow(resol))), col = "blue")
polygon(c(t, rev(t)), c(resol[, "R"] + resol[, "I"], rev(resol[, "R"])), col = "red")

We can also visualize the effective reproductive number R_0 S/N, where

R0 = p["beta"] / (p["gamma"] + p["mu"])

The effective reproductive number is on the left, and as we mentioned above, when we reach 1, we actually reach the maximum of the infected,

plot(t, resol[, "S"] * R0, type = "l", xlab = "time", ylab = "")
abline(h = 1, lty = 2, col = "red")
abline(v = max(t[resol[, "S"] * R0 >= 1]), col = "darkgreen")
points(max(t[resol[, "S"] * R0 >= 1]), 1, pch = 19)
plot(t, resol[, "S"], type = "l", xlab = "time", ylab = "", col = "grey")
lines(t, resol[, "I"], col = "red", lwd = 3)
lines(t, resol[, "R"], col = "light blue")
abline(v = max(t[resol[, "S"] * R0 >= 1]), col = "darkgreen")
points(max(t[resol[, "S"] * R0 >= 1]), max(resol[, "I"]), pch = 19)

And when adding a \mu parameter, we can obtain some interesting dynamics on the number of infected,

times = seq(0, 100, by = .1)
p = c(mu = 1/100, N = 1, beta = 50, gamma = 10)
start_SIR = c(S = 0.19, I = 0.01, R = 0.8)
resol = ode(y = start_SIR, t = times, func = SIR, p = p)
plot(resol[, "time"], resol[, "I"], type = "l", xlab = "time", ylab = "")



Modeling Pandemics (3)


[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

In Statistical Inference in a Stochastic Epidemic SEIR Model with Control Intervention, a more complex model than the one we’ve seen yesterday was considered (it is called the SEIR model). Consider a population of size N, and assume that S is the number of susceptible, E the number of exposed, I the number of infectious, and R the number of recovered (or immune) individuals:

\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{I}{N} S\\
\frac{dE}{dt} &= \beta \frac{I}{N} S - aE\\
\frac{dI}{dt} &= aE - bI\\
\frac{dR}{dt} &= bI
\end{aligned}

Between S and I, the transition rate is \beta I, where \beta is the average number of contacts per person per unit of time, multiplied by the probability of disease transmission in a contact between a susceptible and an infectious subject. Between I and R, the transition rate is b (simply the rate of recovery or death, that is, the number of recovered or dead during a period of time, divided by the total number of infected over that same period). Finally, the incubation period is a random variable with an exponential distribution with parameter a, so that the average incubation period is a^{-1}.

Probably more interestingly, Understanding the dynamics of ebola epidemics suggested a more complex model, with susceptible people S, exposed E, infectious people either in the community I or in hospital H, people who have died F, and finally those who either recovered or are buried and are therefore no longer susceptible, R.

Thus, the following dynamic model is considered:

\begin{aligned}
\frac{dS}{dt} &= -(\beta_I I + \beta_H H + \beta_F F)\frac{S}{N}\\
\frac{dE}{dt} &= (\beta_I I + \beta_H H + \beta_F F)\frac{S}{N} - \alpha E\\
\frac{dI}{dt} &= \alpha E - \theta\gamma_H I - (1-\theta)(1-\delta)\gamma_R I - (1-\theta)\delta\gamma_F I\\
\frac{dH}{dt} &= \theta\gamma_H I - \delta\lambda_F H - (1-\delta)\lambda_R H\\
\frac{dF}{dt} &= (1-\theta)(1-\delta)\gamma_R I + \delta\lambda_F H - \nu F\\
\frac{dR}{dt} &= (1-\theta)(1-\delta)\gamma_R I + (1-\delta)\lambda_R H + \nu F
\end{aligned}

In that model, the parameters are: \alpha^{-1} the (average) incubation period (7 days), \gamma_H^{-1} the time from onset to hospitalization (5 days), \gamma_F^{-1} from onset to death (9 days), \gamma_R^{-1} from onset to “recovery” (10 days), \lambda_F^{-1} from hospitalisation to death (4 days), \lambda_R^{-1} from hospitalisation to recovery (5 days), and \nu^{-1} from death to burial (2 days). Here, the numbers are from Understanding the dynamics of ebola epidemics (in the context of ebola). The other parameters are \beta_I the transmission rate in the community (0.588), \beta_H the transmission rate in hospital (0.794) and \beta_F the transmission rate at funerals (7.653). Thus

epsilon = 0.001
Z = c(S = 1 - epsilon, E = epsilon, I = 0, H = 0, F = 0, R = 0)
p = c(alpha = 1/7*7, theta = 0.81, delta = 0.81, betai = 0.588,
      betah = 0.794, blambdaf = 7.653, N = 1, gammah = 1/5*7,
      gammaf = 1/9.6*7, gammar = 1/10*7, lambdaf = 1/4.6*7,
      lambdar = 1/5*7, nu = 1/2*7)

If \boldsymbol{Z} = (S, E, I, H, F, R), we can write \frac{\partial \boldsymbol{Z}}{\partial t} = SEIHFR(\boldsymbol{Z}), where SEIHFR is

SEIHFR = function(t, Z, p) {
  S = Z[1]; E = Z[2]; I = Z[3]; H = Z[4]; F = Z[5]; R = Z[6]
  alpha = p["alpha"]; theta = p["theta"]; delta = p["delta"]
  betai = p["betai"]; betah = p["betah"]; gammah = p["gammah"]
  gammaf = p["gammaf"]; gammar = p["gammar"]; lambdaf = p["lambdaf"]
  lambdar = p["lambdar"]; nu = p["nu"]; blambdaf = p["blambdaf"]
  N = S + E + I + H + F + R
  dS = -(betai*I + betah*H + blambdaf*F) * S/N
  dE = (betai*I + betah*H + blambdaf*F) * S/N - alpha*E
  dI = alpha*E - theta*gammah*I - (1-theta)*(1-delta)*gammar*I - (1-theta)*delta*gammaf*I
  dH = theta*gammah*I - delta*lambdaf*H - (1-delta)*lambdaf*H
  dF = (1-theta)*(1-delta)*gammar*I + delta*lambdaf*H - nu*F
  dR = (1-theta)*(1-delta)*gammar*I + (1-delta)*lambdar*H + nu*F
  dZ = c(dS, dE, dI, dH, dF, dR)
  list(dZ)
}

We can solve it, or at least study the dynamics from some starting values

library(deSolve)
times = seq(0, 50, by = .1)
resol = ode(y = Z, times = times, func = SEIHFR, parms = p)

For instance, the proportion of people infected is the following

plot(resol[, "time"], resol[, "I"], type = "l", xlab = "time", ylab = "", col = "red")
lines(resol[, "time"], resol[, "H"], col = "blue")



Impact of a country’s age breakdown on COVID-19 case fatality rate by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

Italy is routinely and correctly described as particularly vulnerable to COVID-19 because of its older age profile. I set out to understand for myself how important this factor is. What would happen if the case fatality rates observed in Italy were applied to demographic profiles of other countries?

Fatality rates by age and sex so far in Italy

The Istituto Superiore di Sanità, Roma is publishing regular bulletins with the latest data on COVID-19 cases in Italy. I used the 19 March 2020 version. These are the observed case fatality rates for around 35,000 cases to that point:

It’s worth pointing out that the snapshots presented in these bulletins change fast, including the raw fatality rate (for both sexes), which has increased from 5.8% seven days earlier to 8.5% on 19 March. Further rapid change is to be expected, remembering that deaths lag the beginning of the illness by days or weeks, and diagnoses lag infections by days (symptoms start on average around five days after exposure).

It’s also worth pointing out how much worse this disease seems to be for men. Of the deaths in Italy at the time of this bulletin, 71% were men. Of diagnosed cases, 59% were male (more than 200 Italian boys have the illness but none had died at the time of the bulletin). There were more male fatalities aged 80 and over than female of all ages. Also, it’s worth pointing out that while it is definitely worse for older people, fatality rates are pretty bad for middle-aged people – about 1% for those between 30 and 59. That’s bad for a disease expecting as many cases as this one.

Population profiles in selected countries

I took population breakdowns by age and sex from the United Nations’ World Population Prospects. To illustrate I chose nine countries representing a range of cultural and economic situations. I’ve chosen to present these as density charts, not population pyramids (which I find difficult to make comparisons with). We can readily see the contrast between Italy and (for an extreme example) economically poor Timor Leste:

Applying fatality rates to population profiles

It’s straightforward to take a country’s population and apply the Italian case fatality rates to it to get a weighted average fatality rate. In effect, this tells us what the fatality rate would be in a country, if the Italian rates applied to its whole population or a subpopulation that was representative of the overall age and sex balance. Here’s what we get for our nine ‘countries’ (including the World aggregate):
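To make the arithmetic concrete, here is a toy version of that weighted average (the numbers below are made up; the real calculation over all age-sex groups is in the full script at the end of the post):

cfr  <- c(0.007, 0.017, 0.178)  # hypothetical fatality rates for three age bands
prop <- c(0.40, 0.35, 0.25)     # hypothetical population shares of those bands
sum(cfr * prop) / sum(prop)     # weighted average fatality rate: 0.05325, i.e. about 5.3%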

Two things stand out.

First, the different demographics of the different countries make a huge difference. On these sorts of age-based rates, Italy can expect twice the fatality rate of China (and nearly five times that of Timor Leste).

Second, the death rate for Italy from this method is much lower than the actual fatality rate in the 19 March bulletin – 3.9% compared to 8.5%. This isn’t a mistake – it comes about because the profile of Italians diagnosed with COVID-19 is older and more male than Italians in general.

Older people and men are not just more likely to die if they get COVID-19, they are also more likely to be diagnosed with it in the first place.

As I note on the graphic, this could be due to women and younger people of either sex being less likely to be diagnosed given they have the disease; or it might mean they are less likely to have the disease at all. There is no way to tell with this data.

We can adjust the fatality rates by scaling them up to match Italy’s 19 March observed level. This gives a more realistic but still very rough answer to the question “what would Italy’s case fatality rates mean, translated to other countries”. It’s very rough because doing this assumes away a whole bunch of possible complexities and interactions between variables, but it’s probably as thorough a method as is warranted at the moment with the fast changing data. Here’s those scaled results:
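In code, that rescaling amounts to the following (a one-line sketch consistent with the full script below; 0.085 is Italy's observed raw case fatality rate to 19 March):

projected_cfr %>% 
  mutate(cfr_adj = cfr / cfr[country == "Italy"] * 0.085)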

What does it all mean?

Well, the danger to people over 50, particularly but not only men, is very very real from this disease. And the age profiles of countries vary enough for this to make big differences to the overall impact.

But regardless of this, the necessary actions are clear. Work hard to avoid getting this horrible disease and to avoid passing it on. Work to help others do the same, and pull together to manage society through some difficult months ahead. Wash your hands and practice social distancing.

Here’s the code behind those charts. The Italian data is just entered by hand because it’s only 20 numbers, not worth trying to automate.

#------------setup---------------
# 59% cases male
20686 / (20686 + 14378)
# 71% deaths men (no boys)
2139 / (2139 + 890)

library(tidyverse)
library(scales)
library(wpp2019)

# colours for male and female used by Washington Post 2017; see https://blog.datawrapper.de/gendercolor/
sex_cols <- c(Male = "#F4BA3B", Female = "#730B6D")

#---------------------Italian fatality rates---------
italy_rates <- tibble(
  age_grp = rep(c('0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+'), 2),
  sex = rep(c("Male", "Female"), each = 10),
  cfr = c(0, 0, 0, 0.6, 0.7, 1.7, 6.0, 17.8, 26.4, 32.5,
          0, 0, 0, 0.2, 0.4, 0.6, 2.8, 10.7, 19.1, 22.3) / 100,
  age_midpt = rep(c(5, 15, 25, 35, 45, 55, 65, 75, 85, 95), 2)
)

italy_rates %>%
  ggplot(aes(x = age_midpt, y = cfr, colour = sex)) +
  geom_point() +
  geom_text(data = filter(italy_rates, cfr > 0.01),
            aes(label = percent(cfr), y = cfr + 0.012), size = 3) +
  geom_line() +
  scale_x_continuous(breaks = italy_rates$age_midpt, labels = italy_rates$age_grp) +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  scale_colour_manual(values = sex_cols) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank()) +
  labs(x = "Age group", colour = "", y = "Observed case fatality rate",
       title = "Observed fatality rate of diagnosed COVID-19 cases in Italy to 19 March 2020",
       subtitle = "20,686 men and boys with case fatality rate of 10.3%; 14,378 women and girls with case fatality rate of 6.2%",
       caption = "Source: Istituto Superiore di Sanità, Roma")

#----------------Population rates ------------------
data(popF)
data(popM)

selected_countries <- c("Australia", "Italy", "Timor-Leste", "United States of America",
                        "World", "China", "Brazil", "Japan", "Germany")

age_lu <- tibble(age = unique(popF$age),
                 age_grp = c(rep(unique(italy_rates$age_grp), each = 2), "90+")) %>%
  mutate(age_grp = factor(age_grp, levels = unique(age_grp)))

# Visual check that this shorthand worked ok
# View(age_lu)

pop_2020 <- popF %>%
  mutate(sex = "Female") %>%
  rbind(mutate(popM, sex = "Male")) %>%
  select(country = name, age, pop = `2020`, sex) %>%
  left_join(age_lu, by = "age") %>%
  group_by(country, age_grp, sex) %>%
  summarise(pop = sum(pop)) %>%
  ungroup() %>%
  filter(country %in% selected_countries) %>%
  mutate(country = fct_drop(country)) %>%
  group_by(country) %>%
  mutate(prop = pop / sum(pop)) %>%
  ungroup()

# check no misspellings in countries
stopifnot(sum(!selected_countries %in% unique(pop_2020$country)) == 0)

pop_2020 %>%
  ggplot(aes(x = as.numeric(age_grp), y = prop, colour = sex)) +
  geom_line() +
  facet_wrap(~country) +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = 1:10, labels = levels(pop_2020$age_grp)) +
  scale_colour_manual(values = sex_cols) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Age group", y = "", colour = "",
       title = "Estimated proportion of the population in 2020",
       subtitle = "By age group and sex",
       caption = "Source: UN World Population Prospects 2019")

#----------Combine fatality rate with population--------------------
the_caption = "Source: Italian case fatality rates to 19 March 2020 from Istituto Superiore di Sanità, Roma, combined with UN World Population Prospects 2019"

projected_cfr <- pop_2020 %>%
  mutate(age_grp = as.character(age_grp)) %>%
  left_join(italy_rates, by = c("age_grp", "sex")) %>%
  group_by(country) %>%
  summarise(cfr = sum(cfr * prop) / sum(prop)) %>%
  ungroup() %>%
  mutate(country = fct_reorder(country, -cfr))

xlabel <- "Case fatality rate if rates observed in Italy applied to each country's total age and sex profile.\nDo not treat these as forecasts of actual case fatality rate."

# Version 1:
projected_cfr %>%
  ggplot(aes(y = country, x = cfr)) +
  geom_point(colour = "steelblue") +
  geom_text(aes(label = percent(cfr, accuracy = 0.1)), nudge_x = 0.001, size = 3) +
  geom_segment(aes(yend = country, xend = 0), colour = "steelblue") +
  scale_x_continuous(label = percent_format(accuracy = 0.1)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank()) +
  labs(subtitle = xlabel, y = "",
       title = "Different age profiles can make a big difference to overall fatality rates, based on Italian data",
       x = "Note that in observed situations (eg Italy 8.5% to 19 March 2020), raw case fatality rates are more than double
those shown here, suggesting younger cases are either not diagnosed or not occurring.",
       caption = the_caption)

# Version 2, calibrated to actual Italy case fatality rate so far
projected_cfr %>%
  mutate(cfr_adj = cfr / cfr[country == "Italy"] * 0.085) %>%
  ggplot(aes(y = country, x = cfr_adj)) +
  geom_point(colour = "steelblue") +
  geom_text(aes(label = percent(cfr_adj, accuracy = 0.1)), nudge_x = 0.002, size = 3) +
  geom_segment(aes(yend = country, xend = 0), colour = "steelblue") +
  scale_x_continuous(label = percent_format(accuracy = 0.1)) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank()) +
  labs(subtitle = xlabel, y = "",
       title = "Different age profiles can make a big difference to overall fatality rates, based on Italian data",
       x = "Estimates have been scaled to match Italy's raw case fatality rate to 19 March, to
reflect likely patterns in younger people's case rate and diagnosis.",
       caption = the_caption)


Setting up R with Visual Studio Code quickly and easily with the languageserversetup package


[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers.]

Introduction

Over the past years, R has been gaining popularity, bringing new tools to life along with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code extensions for R, the most popular development environment amongst developers across the world now has very strong support for R as well.

In this post, we will look at the languageserversetup package that aims to make the setup of the R Language Server robust and easy to use by installing it into a separate, independent library and adjusting R startup in a way that initializes the language server when relevant.

Visual Studio Code and R

According to the 2019 StackOverflow developer survey, Visual Studio Code is the most popular development environment across the board, with amazing support for many languages and extensions ranging from improved code editing to advanced version control support and Docker integration.

Until recently the support for R in Visual Studio Code was in my view not comprehensive enough to justify switching from other tools such as RStudio (Server) to using VS Code exclusively. This has changed with the work done by the team implementing the following 3 tools:

The features now include all that we need to work efficiently, including auto-complete, definition provider, code formatting, code linting, information on functions on hover, color provider, code sections and more.

If you are interested in more steps around the setup and the overview of features I recommend the Writing R in VSCode: A Fresh Start blogpost by Kun Ren. I also recommend that you follow Kun on Twitter if you are interested in the latest developments.

Setup considerations, issues, and tweaks: creating the languageserversetup package

With my current team, we have almost fully embraced Visual Studio Code as an IDE for our work in R, which is especially great as the work is multi-language and multi-environment in nature and we can do our development in Scala, R and more, including implementing and testing Jenkins pipelines and designing Docker images without leaving VS Code.

Setting up for the team on multiple systems and platforms we have found the following interesting points which were my motivation to write a small R package, languageserversetup, that should make the installation and setup of the R language server as easy and painless as possible.

Managing package libraries

One of the specifics of R is that all extensions (packages) are installed into package libraries, be it the packages we develop and use for our applications or the tools we use mostly as means to make our development life easier. We can therefore often end up in a situation where we need to use different versions of R packages for different purposes. For example, the languageserver package currently needs R6 (>= 2.4.1), stringr (>= 1.4.0) and more; in total it recursively requires 75 other R packages to be installed. When installing and running the package we can run into conflicting versions of what our current applications need versus what the languageserver package requires to function properly.

Managing library paths

The second consideration, related to the first, is that if we simply install the language server into the default library with, for instance, install.packages, it will change the library to a state that is possibly not desired. We can also run into unexpected crashes, where the languageserver will function properly for a time, until one of the non-triggered dependencies with a hidden conflict gets triggered.

A solution – Complete library separation and smart initialization

One possible solution to the above issues is to:

  1. Keep the package libraries of the languageserver and the other libraries that the user uses (perhaps apart from the main system library containing the base and recommended packages that come with the R installation itself) completely separated, including all non-base dependencies

  2. Initialize that library only when the R process in question is triggered by the language server, otherwise, keep the process untouched and use the user libraries as usual

Solving it with 2 R commands – the languageserversetup package

To make the above solution easily accessible, I have created a small R package called languageserversetup that will do all the work for you. It can be installed from CRAN and it has no dependencies on other R packages:

install.packages("languageserversetup")

Now the entire setup has only 2 steps:

  1. Install the languageserver package and all of its dependencies into a separate, independent library (will ask for confirmation before taking action) using:
languageserversetup::languageserver_install()
  2. Add code to .Rprofile to automatically align the library paths for the language server functionality if the process is an instance of the languageserver; otherwise, the R session will run as usual with library paths unaffected. This is achieved by running (will also ask for confirmation; see the sketch after this list for what the added code conceptually does):
languageserversetup::languageserver_add_to_rprofile()

That’s it. Now you can enjoy the functionality without caring about the setup of libraries or any package version conflicts. Thanks to the full separation of libraries, the removal is as trivial as deleting the library directory.

In action with VS Code

Installing languageserversetup and using languageserver_install()

Installing the language server

Initializing the functionality with languageserver_add_to_rprofile()

Adding the language server to startup

All done, now enjoy the awesomeness!

Technical details

If you are interested in more technical details,

  • please visit the package’s openly accessible GitHub repository.
  • the README.md has information on options configuration, installation, uninstallation, platforms and more
  • the help files for the functions can be accessed from R with ?languageserver_install, ?languageserver_startup, ?languageserver_add_to_rprofile and ?languageserver_remove_from_rprofile for more details on their arguments and customization
  • for testing, GitHub actions are set up for multiple platforms and to run all CRAN checks on the package on each commit



Covid 19 Tracking


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Get Your Epidemiology from Epidemiologists

The COVID-19 pandemic continues to rage. I’m strongly committed to what should be the uncontroversial view that we should listen to the recommendations of those institutions and individuals with strong expertise in the relevant fields of Public Health, Epidemiology, Disease Control, and Infection Modeling. I also think that the open availability of data, and the free availability of methods to look at data, is generally a good thing. The tricky part is when these potentially conflict. For example, in a period of crisis it is reasonable to want to find out what’s happening and to inform yourself as much as possible about how events are unfolding. People who work with data of some sort will naturally want to look at the available trends themselves. But maybe those same people don’t know a great deal about how disease works, or how information about it is collected and processed, or what is likely to happen in a situation like the one we’re experiencing. At such times, there’s a balance to be struck between using the available tools to come to an informed opinion and recklessly mucking about with data when you don’t really know what you’re doing. This is especially important when, as is the case now, the Executive response to the crisis in the United States (and in several other countries) has been criminally irresponsible, to the point where even elementary facts about the spread of the disease over the past few months are being distorted.

Speaking for myself, I definitely want to look at what the trends are, and I prefer to do so by working directly with the data that official agencies and reliable reporting produce. So in this post I'll show how I'm doing that. But I definitely don't want to publicly mess around beyond this. While I might idly fit some models or play with various extrapolations of the data, I'm very conscious that I am not in a position to do this in a professional capacity. So that part I will firmly set aside. There are already many well-qualified people working publicly to actually analyze and model the data, as opposed to looking descriptively at what is happening.

I’m going to show you how to get the data to draw this graph.

Cumulative COVID-19 Deaths

Looking at COVID-19 Data from the European Centers for Disease Control

Each day, the ECDC publishes a summary spreadsheet of global case and death counts since the beginning of the epidemic. This is good data collated by an EU-wide agency, and it's what I've been using to keep up with the trends. There are other reliable sources, too, most notably the Johns Hopkins Coronavirus Dashboard. Here's what I've been doing to get it into R. Again, my principal reason for sharing this code is not to add much of anything on the public side. It's much more of a pedagogical exercise. If you want to look at this data, here's one way to do that. Along the way I'll talk about a few of the things needed to work with the data in a reasonably clean way. Then I'll end up drawing the plot that everyone draws—showing cumulative trends by country in deaths, counted in days since a threshold level of fatalities.

Preparation

First we load some libraries to help us out.

library(tidyverse)
library(lubridate)
library(here)
library(janitor)
library(socviz)
library(ggrepel)
library(paletteer)

Next, we set things up by writing some functions that will help us grab and clean the data. In reality, of course, these functions got written piecemeal and were then cleaned up and moved to the front of the file. I didn’t sit down and write them off the top of my head.

The first one is going to grab the spreadsheet from the ECDC and both save the .xlsx file to our data/ folder and create a tibble of the results.

## Download today's excel file, saving it to data/ and reading it in
get_ecdc_data <- function(url = "https://www.ecdc.europa.eu/sites/default/files/documents/",
                          fname = "COVID-19-geographic-distribution-worldwide-",
                          date = lubridate::today(),
                          ext = "xlsx",
                          dest = "data") {

  target <- paste0(url, fname, date, ".", ext)
  message("target: ", target)

  destination <- fs::path(here::here("data"), paste0(fname, date), ext = ext)
  message("saving to: ", destination)

  tf <- tempfile(fileext = ext)
  curl::curl_download(target, tf)
  fs::file_copy(tf, destination)

  switch(ext,
         xls = janitor::clean_names(readxl::read_xls(tf)),
         xlsx = janitor::clean_names(readxl::read_xlsx(tf)))
}

Things to notice: We have to use curl_download() to get the file, because read_xls cannot directly grab an Excel file from a URL in the way that e.g. read_csv() can for a .csv file. So we create a temporary file handle and use curl to download the data file to it. Then we copy the file to its permanent home in our data/ folder, and we read the target file into R with the appropriate readxl function.

As we’ll see in a moment, the country codes contained in the ECDC data are not quite standard. It will be useful in the long run to make sure that every country has standardized two- and three-letter abbreviations. Some of the countries in the ECDC’s geo_id variable are missing these. This is a very common situation in data cleaning, where we have a big table with some data we know is missing (e.g., a country code), and we know for sure which cases the data are missing for, and we have a little lookup table that can fill in the blanks. The operation we will need to perform here is called a coalescing join. Before I knew that’s what it was called, I used to do this manually (I’ll show you below). But a little googling eventually revealed both the proper name for this operation and also a very useful function, written by Edward Visel that does exactly what I want:

coalesce_join <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))

  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(to_coalesce, 1,
                               nchar(to_coalesce) - nchar(suffix_used)))

  coalesced <- purrr::map_dfc(to_coalesce,
                              ~ dplyr::coalesce(joined[[paste0(.x, suffix[1])]],
                                                joined[[paste0(.x, suffix[2])]]))
  names(coalesced) <- to_coalesce

  dplyr::bind_cols(joined, coalesced)[cols]
}

Next we set up some country codes using ISO2 and ISO3 abbreviations.

iso3_cnames <- read_csv("data/countries_iso3.csv")
iso2_to_iso3 <- read_csv("data/iso2_to_iso3.csv")

cname_table <- left_join(iso3_cnames, iso2_to_iso3)
cname_table

# A tibble: 249 x 3
   iso3  cname               iso2
   <chr> <chr>               <chr>
 1 AFG   Afghanistan         AF
 2 ALA   Åland Islands       AX
 3 ALB   Albania             AL
 4 DZA   Algeria             DZ
 5 ASM   American Samoa      AS
 6 AND   Andorra             AD
 7 AGO   Angola              AO
 8 AIA   Anguilla            AI
 9 ATA   Antarctica          AQ
10 ATG   Antigua and Barbuda AG
# … with 239 more rows

eu <- c("AUT","BEL","BGR","HRV","CYP","CZE","DNK","EST","FIN","FRA","DEU","GRC","HUN","IRL","ITA","LVA","LTU","LUX","MLT","NLD","POL","PRT","ROU","SVK","SVN","ESP","SWE","GBR")

europe <- c("ALB","AND","AUT","BLR","BEL","BIH","BGR","HRV","CYP","CZE","DNK","EST","FRO","FIN","FRA","DEU","GIB","GRC","HUN","ISL","IRL","ITA","LVA","LIE","LTU","LUX","MKD","MLT","MDA","MCO","NLD","NOR","POL","PRT","ROU","RUS","SMR","SRB","SVK","SVN","ESP","SWE","CHE","UKR","GBR","VAT","RSB","IMN","MNE")

north_america <- c("AIA","ATG","ABW","BHS","BRB","BLZ","BMU","VGB","CAN","CYM","CRI","CUB","CUW","DMA","DOM","SLV","GRL","GRD","GLP","GTM","HTI","HND","JAM","MTQ","MEX","SPM","MSR","ANT","KNA","NIC","PAN","PRI","KNA","LCA","SPM","VCT","TTO","TCA","VIR","USA","SXM")

south_america <- c("ARG","BOL","BRA","CHL","COL","ECU","FLK","GUF","GUY","PRY","PER","SUR","URY","VEN")

africa <- c("DZA","AGO","SHN","BEN","BWA","BFA","BDI","CMR","CPV","CAF","TCD","COM","COG","DJI","EGY","GNQ","ERI","ETH","GAB","GMB","GHA","GNB","GIN","CIV","KEN","LSO","LBR","LBY","MDG","MWI","MLI","MRT","MUS","MYT","MAR","MOZ","NAM","NER","NGA","STP","REU","RWA","STP","SEN","SYC","SLE","SOM","ZAF","SHN","SDN","SWZ","TZA","TGO","TUN","UGA","COD","ZMB","TZA","ZWE","SSD","COD")

asia <- c("AFG","ARM","AZE","BHR","BGD","BTN","BRN","KHM","CHN","CXR","CCK","IOT","GEO","HKG","IND","IDN","IRN","IRQ","ISR","JPN","JOR","KAZ","PRK","KOR","KWT","KGZ","LAO","LBN","MAC","MYS","MDV","MNG","MMR","NPL","OMN","PAK","PHL","QAT","SAU","SGP","LKA","SYR","TWN","TJK","THA","TUR","TKM","ARE","UZB","VNM","YEM","PSE")

oceania <- c("ASM","AUS","NZL","COK","FJI","PYF","GUM","KIR","MNP","MHL","FSM","UMI","NRU","NCL","NZL","NIU","NFK","PLW","PNG","MNP","SLB","TKL","TON","TUV","VUT","UMI","WLF","WSM","TLS")

Now Actually Get the Data

The next step is to read the data. The file should be called COVID-19-geographic-distribution-worldwide- with the date appended and the extension .xlsx. But as it turns out there is a typo in the filename. The distribution part is misspelled disbtribution. I think it must have been introduced early on in the data collection process and so far—possibly by accident, but also possibly so as not to break a thousand scripts like this one—they have not been fixing the typo.

covid_raw <- get_ecdc_data(url = "https://www.ecdc.europa.eu/sites/default/files/documents/",
                           fname = "COVID-19-geographic-disbtribution-worldwide-",
                           ext = "xlsx")
covid_raw

# A tibble: 6,012 x 8
   date_rep              day month  year cases deaths countries_and_t…
   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
 1 2020-03-21 00:00:00    21     3  2020     2      0 Afghanistan
 2 2020-03-20 00:00:00    20     3  2020     0      0 Afghanistan
 3 2020-03-19 00:00:00    19     3  2020     0      0 Afghanistan
 4 2020-03-18 00:00:00    18     3  2020     1      0 Afghanistan
 5 2020-03-17 00:00:00    17     3  2020     5      0 Afghanistan
 6 2020-03-16 00:00:00    16     3  2020     6      0 Afghanistan
 7 2020-03-15 00:00:00    15     3  2020     3      0 Afghanistan
 8 2020-03-11 00:00:00    11     3  2020     3      0 Afghanistan
 9 2020-03-08 00:00:00     8     3  2020     3      0 Afghanistan
10 2020-03-02 00:00:00     2     3  2020     0      0 Afghanistan
# … with 6,002 more rows, and 1 more variable: geo_id <chr>

That’s our base data. The get_ecdc_data() function uses file_copy() from the fs library to move the temporary file to the data/ folder. It will not overwrite a file if it finds one with that name already there. So if you grab the data more than once a day, you’ll need to decide what to do with the file you already downloaded.

The geo_id country code column isn’t visible here. We’re going to duplicate it (naming it iso2) and then join our table of two- and three-letter country codes. It has an iso2 column as well.

covid <- covid_raw %>%
  mutate(date = lubridate::ymd(date_rep),
         iso2 = geo_id)

## merge in the iso country names
covid <- left_join(covid, cname_table)
covid

# A tibble: 6,012 x 12
   date_rep              day month  year cases deaths countries_and_t…
   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
 1 2020-03-21 00:00:00    21     3  2020     2      0 Afghanistan
 2 2020-03-20 00:00:00    20     3  2020     0      0 Afghanistan
 3 2020-03-19 00:00:00    19     3  2020     0      0 Afghanistan
 4 2020-03-18 00:00:00    18     3  2020     1      0 Afghanistan
 5 2020-03-17 00:00:00    17     3  2020     5      0 Afghanistan
 6 2020-03-16 00:00:00    16     3  2020     6      0 Afghanistan
 7 2020-03-15 00:00:00    15     3  2020     3      0 Afghanistan
 8 2020-03-11 00:00:00    11     3  2020     3      0 Afghanistan
 9 2020-03-08 00:00:00     8     3  2020     3      0 Afghanistan
10 2020-03-02 00:00:00     2     3  2020     0      0 Afghanistan
# … with 6,002 more rows, and 5 more variables: geo_id <chr>,
#   date <date>, iso2 <chr>, iso3 <chr>, cname <chr>

At this point we can notice a couple of things about the dataset. For example, not everything in the dataset is a country. This one’s a cruise ship:

## Looks like a missing data code
covid %>%
  filter(cases == -9)

# A tibble: 1 x 12
  date_rep              day month  year cases deaths countries_and_t…
  <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1 2020-03-10 00:00:00    10     3  2020    -9      1 Cases_on_an_int…
# … with 5 more variables: geo_id <chr>, date <date>, iso2 <chr>,
#   iso3 <chr>, cname <chr>

We can also learn, using an anti_join(), that not all the ECDC's geo_id country codes match up with the ISO codes:

anti_join(covid, cname_table) %>%
  select(geo_id, countries_and_territories, iso2, iso3, cname) %>%
  distinct()

# A tibble: 7 x 5
  geo_id   countries_and_territories               iso2    iso3  cname
  <chr>    <chr>                                   <chr>   <chr> <chr>
1 JPG11668 Cases_on_an_international_conveyance_J… JPG116… <NA>  <NA>
2 PYF      French_Polynesia                        PYF     <NA>  <NA>
3 EL       Greece                                  EL      <NA>  <NA>
4 XK       Kosovo                                  XK      <NA>  <NA>
5 NA       Namibia                                 NA      <NA>  <NA>
6 AN       Netherlands_Antilles                    AN      <NA>  <NA>
7 UK       United_Kingdom                          UK      <NA>  <NA>

Let's fix this. I made a small crosswalk file that can be coalesced into the missing values. In an added little wrinkle, we need to specify the na argument in read_csv() explicitly, because the missing country codes include Namibia, which has an ISO country code of “NA”! This is different from the missing data code NA, but read_csv() won't know this by default.
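As a tiny demonstration of why this matters (my own snippet, separate from the actual pipeline), compare the default behaviour with na = "" on a two-row column typed inline:

library(readr)
# With the default na strings, the literal "NA" is read as a missing value ...
read_csv("iso2\nNA\nAF\n")$iso2           # NA  "AF"
# ... while na = "" keeps Namibia's "NA" as an actual country code
read_csv("iso2\nNA\nAF\n", na = "")$iso2  # "NA" "AF"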

cname_xwalk <- read_csv("data/ecdc_to_iso2_xwalk.csv",
                        na = "")
cname_xwalk

# A tibble: 4 x 3
  geo_id iso3  cname
  <chr>  <chr> <chr>
1 UK     GBR   United Kingdom
2 EL     GRC   Greece
3 NA     NAM   Namibia
4 XK     XKV   Kosovo

I used to do coalescing like this:

# covid <- covid %>%
#   left_join(cname_xwalk, by = "geo_id") %>%
#   mutate(iso3 = coalesce(iso3.x, iso3.y),
#          cname = coalesce(cname.x, cname.y)) %>%
#   select(-iso3.x, -iso3.y, cname.x, cname.y)

Actually, I used to do it using match() and some index vectors, like an animal. But now I can use Edward Visel’s handy function instead.

covid <- coalesce_join(covid, cname_xwalk,
                       by = "geo_id", join = dplyr::left_join)

## Take a look again
anti_join(covid, cname_table) %>%
  select(geo_id, countries_and_territories, iso2, iso3, cname) %>%
  distinct()

# A tibble: 7 x 5
  geo_id   countries_and_territories         iso2    iso3  cname
  <chr>    <chr>                             <chr>   <chr> <chr>
1 JPG11668 Cases_on_an_international_convey… JPG116… <NA>  <NA>
2 PYF      French_Polynesia                  PYF     <NA>  <NA>
3 EL       Greece                            EL      GRC   Greece
4 XK       Kosovo                            XK      XKV   Kosovo
5 NA       Namibia                           NA      NAM   Namibia
6 AN       Netherlands_Antilles              AN      <NA>  <NA>
7 UK       United_Kingdom                    UK      GBR   United Kin…

Looks like a couple of new territories have been added to the ECDC file since I made the crosswalk file. I’ll have to update that soon.

Calculate and Plot Cumulative Mortality

Now we can actually analyze the data (in the privacy of our own home). Let’s draw the plot that everyone draws, looking at cumulative counts. I think it’s better at this point to plot cumulative deaths rather than cumulative reported cases, given that there’s so much unevenness in case reporting. The mortality counts aren’t free of that, but it’s not as much of a problem. We’ll take an arbitrary threshold for number of deaths, let’s say ten, start every country from zero days when they hit ten deaths, and count the cumulative deaths since that day.

cov_curve <- covid %>%
  select(date, cname, iso3, cases, deaths) %>%
  drop_na(iso3) %>%
  group_by(iso3) %>%
  arrange(date) %>%
  mutate(cu_cases = cumsum(cases),
         cu_deaths = cumsum(deaths)) %>%
  filter(cu_deaths > 9) %>%
  mutate(days_elapsed = date - min(date),
         end_label = ifelse(date == max(date), cname, NA))

cov_curve

# A tibble: 245 x 9
# Groups:   iso3 [21]
   date       cname iso3  cases deaths cu_cases cu_deaths days_elapsed
   <date>     <chr> <chr> <dbl>  <dbl>    <dbl>     <dbl> <drtn>
 1 2020-01-22 China CHN     140     11      526        17 0 days
 2 2020-01-23 China CHN      97      0      623        17 1 days
 3 2020-01-24 China CHN     259      9      882        26 2 days
 4 2020-01-25 China CHN     441     15     1323        41 3 days
 5 2020-01-26 China CHN     665     15     1988        56 4 days
 6 2020-01-27 China CHN     787     25     2775        81 5 days
 7 2020-01-28 China CHN    1753     25     4528       106 6 days
 8 2020-01-29 China CHN    1466     26     5994       132 7 days
 9 2020-01-30 China CHN    1740     38     7734       170 8 days
10 2020-01-31 China CHN    1980     43     9714       213 9 days
# … with 235 more rows, and 1 more variable: end_label <chr>

See how at the end there we create an end_label variable for use in the plot. It only has values for the most recent day in the dataset (i.e. the country name if date is max(date), otherwise NA).

Now we’ll narrow our focus to a few countries and make the plot.

focus_cn <- c("CHN", "GBR", "USA", "IRN", "JPN",
              "KOR", "ITA", "FRA", "ESP")

cov_curve %>%
  filter(iso3 %in% focus_cn) %>% ## focus on just a few countries, defined above
  mutate(end_label = recode(end_label,
                            `United States` = "USA",
                            `Iran, Islamic Republic of` = "Iran",
                            `Korea, Republic of` = "South Korea",
                            `United Kingdom` = "UK")) %>%
  ggplot(mapping = aes(x = days_elapsed, y = cu_deaths,
                       color = cname, label = end_label,
                       group = cname)) +
  geom_line(size = 0.8) +
  geom_text_repel(nudge_x = 1.1, nudge_y = 0.1, segment.color = NA) +
  guides(color = FALSE) +
  scale_color_manual(values = prismatic::clr_darken(paletteer_d("ggsci::category20_d3"), 0.2)) +
  scale_y_continuous(labels = scales::comma_format(accuracy = 1),
                     breaks = 2^seq(4, 11),
                     trans = "log2") +
  labs(x = "Days Since 10th Confirmed Death",
       y = "Cumulative Number of Deaths (log scale)",
       title = "Cumulative Deaths from COVID-19, Selected Countries",
       subtitle = paste("Data as of", format(max(cov_curve$date), "%A, %B %e, %Y")),
       caption = "Kieran Healy @kjhealy / Data: ECDC") +
  theme(plot.title = element_text(size = rel(2), face = "bold"),
        plot.subtitle = element_text(size = rel(1.5)),
        axis.text.y = element_text(size = rel(2)),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5)),
        axis.text.x = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)))

Again, a few small details polish the plot. We do a quick bit of recoding on the end_label to shorten some country names, and use geom_text_repel() to put the labels at the end of the line. We get our y-axis breaks with 2^seq(4, 11), which (as case numbers rise) will be easier to extend than manually typing all the numbers. I use a base 2 log scale for the reasons Dr Drang gives here. It’s useful to look at the doubling time, which base 2 helps you see, rather than powers of ten. (The graphs won’t look any different.) Finally on the thematic side we can date-stamp the title of the graph using the opaque but standard UNIX date formatting codes, with paste("Data as of", format(max(cov_curve$date), "%A, %B %e, %Y")).
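If you want to see what those two pieces do on their own, here is a quick check (my own snippet; the date will obviously be whatever today happens to be):

# The y-axis breaks used above: powers of two from 16 to 2048
2^seq(4, 11)
# [1]   16   32   64  128  256  512 1024 2048

# The date stamp for the subtitle, here using today's date
paste("Data as of", format(Sys.Date(), "%A, %B %e, %Y"))
# e.g. "Data as of Saturday, March 21, 2020"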

And here’s our figure.

Cumulative COVID-19 Deaths

The GitHub repository for this post also has some code to pull U.S. data from the COVID Tracking Project currently being run by a group of volunteers.




Infectious diseases and nonlinear differential equations


[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Last summer, I wrote about love affairs and linear differential equations. While the topic is cheerful, linear differential equations are severely limited in the types of behaviour they can model. In this blog post, which I wrote in self-quarantine to prevent further spread of SARS-CoV-2 — take that, cheerfulness — I introduce nonlinear differential equations as a means to model infectious diseases. In particular, we will discuss the simple SIR and SIRS models, the building blocks of many of the more complicated models used in epidemiology.

Before doing so, however, I discuss some of the basic tools of nonlinear dynamics applied to the logistic equation as a model for population growth. If you are already familiar with this, you can skip ahead. If you have had no prior experience with differential equations, I suggest you first check out my earlier post on the topic.

I should preface this by saying that I am not an epidemiologist, and that no analysis I present here is specifically related to the current SARS-CoV-2 pandemic, nor should anything I say be interpreted as giving advice or making predictions. I am merely interested in differential equations, and as with love affairs, infectious diseases make a good illustrating case. So without further ado, let’s dive in!

Modeling Population Growth

Before we start modeling infectious diseases, it pays to introduce the concepts required to analyze nonlinear differential equations with a simple example: modeling population growth. Let $N > 0$ denote the size of a population and assume that its growth depends on itself:

\frac{dN}{dt} = \dot{N} = r N \enspace .

As shown in a previous blog post, this leads to exponential growth for $r > 0$:

N(t) = N_0 e^{r t} \enspace ,

where $N_0 = N(0)$ is the initial population size at time $t = 0$. The figure below visualizes the differential equation (left panel) and its solution (right panel) for $r = 1$ and an initial population of $N_0 = 2$.
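If you want to reproduce the solution shown in the right panel yourself, a minimal base R sketch (my own, not code from the original post) is:

# Exponential growth N(t) = N0 * exp(r * t) with r = 1 and N0 = 2
r  <- 1
N0 <- 2
curve(N0 * exp(r * x), from = 0, to = 3, xlab = "t", ylab = "N(t)")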


This is clearly not a realistic model since the growth of a population depends on resources, which are finite. To model finite resources, we write:

\dot{N} = rN \left(1 - \frac{N}{K}\right) \enspace ,

where $r > 0$ and $K$ is the so-called carrying capacity, that is, the maximum sized population that can be sustained by the available resources. Observe that as $N$ grows and if $K > N$, then $(1 - N / K)$ gets smaller, slowing down the growth rate $\dot{N}$. If on the other hand $N > K$, then the population needs more resources than are available, and the growth rate becomes negative, resulting in population decrease.

For simplicity, let $K = 1$ and interpret $N \in [0, 1]$ as the proportion with respect to the carrying capacity; that is, $N = 1$ implies that we are at carrying capacity. The figure below visualizes the differential equation and its solution for $r = 1$ and an initial condition $N_0 = 0.10$.


In contrast to exponential growth, the logistic equation leads to sigmoidal growth which approaches the carrying capacity. This is much more interesting behaviour than the linear differential equation above allows. In particular, the logistic equation has two fixed points— points at which the population neither increases nor decreases but stays fixed, that is, where $\dot{N} = 0$. These occur at $N = 0$ and at $N = 1$, as can be inferred from the left panel in the figure above.

Analyzing the Stability of Fixed Points

What is the stability of these fixed points? Intuitively, $N = 0$ should be unstable; if there are individuals, then they procreate and the population increases. Similarly, $N = 1$ should be stable: if $N < 1$, then $\dot{N} > 0$ and the population grows towards $N = 1$, and if $N > 1$, then $\dot{N} < 0$ and individuals die until $N = 1$.

To make this argument more rigorous, and to get a more quantitative assessment of how quickly perturbations move away from or towards a fixed point, we derive a differential equation for these small perturbations close to the fixed point (see also Strogatz, 2015, p. 24). Let $N^{\star}$ denote a fixed point and define $\eta(t) = N(t) - N^{\star}$ to be a small perturbation close to the fixed point. We derive a differential equation for $\eta$ by writing:

\frac{d\eta}{dt} = \frac{d}{dt}\left(N(t) - N^{\star}\right) = \frac{dN}{dt} \enspace ,

since $N^{\star}$ is a constant. This implies that the dynamics of the perturbation equal the dynamics of the population. Let $f(N)$ denote the differential equation for $N$ and observe that $N = N^{\star} + \eta$, such that $\dot{N} = \dot{\eta} = f(N) = f(N^{\star} + \eta)$. Recall that $f$ is a nonlinear function, and nonlinear functions are messy to deal with. Thus, we simply pretend that the function is linear close to the fixed point. More precisely, we approximate $f$ around the fixed point using a Taylor series (see this excellent video for details) by writing:

f(N^{\star} + \eta) = f(N^{\star}) + \eta f'(N^{\star}) + \mathcal{O}(\eta^2) \enspace ,

where we have ignored higher-order terms. Note that, by definition, there is no change at the fixed point, that is, $f(N^{\star}) = 0$. Assuming that $f'(N^{\star}) \neq 0$ — otherwise the higher-order terms matter, as they would be all that is left — we have that, close to a fixed point,

\dot{\eta} \approx \eta f'(N^{\star}) \enspace ,

which is a linear differential equation with solution:

\eta(t) = \eta_0 e^{f'(N^{\star})t} \enspace .

Using this trick, we can assess the stability of $N^{\star}$ as follows. If $f'(N^{\star}) < 0$, the small perturbation $\eta(t)$ around the fixed point decays towards zero, and so the system returns to the fixed point — the fixed point is stable. On the other hand, if $f'(N^{\star}) > 0$, then the small perturbation $\eta(t)$ close to the fixed point grows, and so the system does not return to the fixed point — the fixed point is unstable. Applying this to our logistic equation, we see that:

\begin{aligned}
f'(N) &= \frac{d}{dN} \left(rN(1 - N)\right) \\[0.50em]
&= \frac{d}{dN} \left(rN - rN^2\right) \\[0.50em]
&= r - 2rN \\[0.50em]
&= r(1 - 2N) \enspace .
\end{aligned}

Plugging in our two fixed points $N^{\star} = 0$ and $N^{\star} = 1$, we find that $f'(0) = r$ and $f'(1) = -r$. Since $r > 0$, this confirms our suspicion that $N^{\star} = 0$ is unstable and $N^{\star} = 1$ is stable. In addition, this analysis tells us how quickly the perturbations grow or decay; for the logistic equation, this is given by $r$.
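We can let R double-check this small calculation with the base symbolic derivative function D() (a quick sanity check of my own, not part of the original derivation):

# Symbolic derivative of the logistic right-hand side with respect to N
fprime <- D(expression(r * N * (1 - N)), "N")

# Evaluate the derivative at the two fixed points for r = 1
eval(fprime, list(r = 1, N = 0))   #  1, so N = 0 is unstable
eval(fprime, list(r = 1, N = 1))   # -1, so N = 1 is stable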

In sum, we have linearized a nonlinear system close to fixed points in order to assess the stability of these fixed points, and how quickly perturbations close to these fixed points grow or decay. This technique is called linear stability analysis. In the next two sections, we discuss two ways to solve differential equations using the logistic equation as an example.

Analytic Solution

In contrast to linear differential equations, which was the topic of a previous blog post, nonlinear differential equations can usually not be solved analytically; that is, we generally cannot get an expression that, given an initial condition, tells us the state of the system at any time point $t$. The logistic equation can, however, be solved analytically and it might be instructive to see how. We write:

\begin{aligned}
\frac{dN}{dt} &= rN (1 - N) \\
\frac{dN}{N(1 - N)} &= r dt \\
\int \frac{1}{N(1 - N)} dN &= r t \enspace .
\end{aligned}

Staring at this for a bit, we realize that we can use partial fractions to split the integral. We write:

\begin{aligned}
\int \frac{1}{N(1 - N)} dN &= r t \\[0.50em]
\int \frac{1}{N} dN + \int \frac{1}{1 - N} dN &= rt \\[0.50em]
\text{log}N - \text{log}(1 - N) + Z &= rt \\[0.50em]
e^{\text{log}N - \text{log}(1 - N) + Z} &= e^{rt} \enspace .
\end{aligned}

The exponents and the logs cancel each other nicely. We write:

\begin{aligned}
\frac{e^{\text{log}N}}{e^{\text{log}(1 - N)}}e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} &= e^{rt - Z} \\[0.50em]
N &= e^{rt - Z} - N e^{rt - Z} \\[0.50em]
N\left(1 + e^{rt - Z}\right) &= e^{rt - Z} \\[0.50em]
N &= \frac{e^{rt - Z}}{1 + e^{rt - Z}} \enspace .
\end{aligned}

One last trick is to multiply by $e^{-rt + Z}$, which yields:

N = \frac{\left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)}{\left(e^{-rt + Z}\right) + \left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)} = \frac{1}{1 + e^{-rt + Z}} \enspace ,

where $Z$ is the constant of integration. To solve for it, we need the initial condition. Suppose that $N(0) = N_0$, which, using the third line in the derivation above and the fact that $t = 0$, leads to:

\begin{aligned}
\text{log}N_0 - \text{log}(1 - N_0) + Z &= 0 \\[0.50em]
\text{log}N_0 - \text{log}(1 - N_0) &= -Z \\[0.50em]
\frac{N_0}{1 - N_0} &= e^{-Z} \\[0.50em]
\frac{1 - N_0}{N_0} &= e^{Z} \enspace .
\end{aligned}

Plugging this into our solution from above yields:

N(t) = \frac{1}{1 + e^{-rt + Z}} = \frac{1}{1 + \frac{1 - N_0}{N_0} e^{-rt}} \enspace .
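For later comparison with the numerical solvers, this closed-form solution is a one-liner in R (the helper name logistic_solution is my own, not from the original post):

# Closed-form solution of the logistic equation with N(0) = N0 and carrying capacity 1
logistic_solution <- function(t, N0, r = 1) {
  1 / (1 + ((1 - N0) / N0) * exp(-r * t))
}

logistic_solution(0, N0 = 0.1)    # 0.1, recovers the initial condition
logistic_solution(10, N0 = 0.1)   # close to 1, the carrying capacity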

While this was quite a hassle, other nonlinear differential equations are much, much harder to solve, and most do not admit a closed-form solution — or at least if they do, the resulting expression is generally not very intuitive. Luckily, we can compute the time-evolution of the system using numerical methods, as illustrated in the next section.

Numerical Solution

A differential equation implicitly encodes how the system we model changes over time. Specifically, given a particular (potentially high-dimensional) state of the system at time point $t$, $\mathbf{x}_t$, we know in which direction and how quickly the system will change because this is exactly what is encoded in the differential equation $f = \frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t}$. This suggests the following numerical approximation: Assume we know the state of the system at a (discrete) time point $n$, denoted $\mathbf{x}_n$, and that the change in the system is constant over a small interval $\Delta t$. Then, the position of the system at time point $n + 1$ is given by:

\mathbf{x}_{n + 1} = \mathbf{x}_n + \Delta t \cdot f(\mathbf{x}_n) \enspace .

$\Delta t$ is an important parameter, encoding over what time period we assume the change $f$ to be constant. We can code this up in R for the logistic equation:

solve_logistic <- function(N0, r = 1, delta_t = 0.01, times = 1000) {
  N <- rep(N0, times)
  dN <- function(N) r * N * (1 - N)

  for (i in seq(2, times)) {
    # Euler
    N[i] <- N[i-1] + delta_t * dN(N[i-1])

    # Improved Euler
    # k <- N[i-1] + delta_t * dN(N[i-1])
    # N[i] <- N[i-1] + 1/2 * delta_t * (dN(N[i-1]) + dN(k))

    # Runge-Kutta 4th order
    # k1 <- dN(N[i-1]) * delta_t
    # k2 <- dN(N[i-1] + k1/2) * delta_t
    # k3 <- dN(N[i-1] + k2/2) * delta_t
    # k4 <- dN(N[i-1] + k3) * delta_t
    #
    # N[i] <- N[i-1] + 1/6 * (k1 + 2*k2 + 2*k3 + k4)
  }

  N
}

Clearly, the accuracy of this approximation is a function of $\Delta t$. To see this, the left panel below shows the approximation for various values of $\Delta t$, while the right panel shows the (log) absolute error as a function of (log) $\Delta t$. The error is defined as:

E = |N(10) - \hat{N}(10)| \enspace ,

where $\hat{N}$ is the Euler approximation.


The right panel approximately shows the relationship:

\begin{aligned}
\text{log } E &\propto \text{log } \Delta t \\[0.50em]
E &\propto \Delta t \enspace .
\end{aligned}

Therefore, the error goes down linearly with $\Delta t$. Other methods, such as the improved Euler method or Runge-Kutta solvers (see commented out code above) do better. However, it is ill-advised to choose $\Delta t$ extremely small, because this leads to an increase in computation time and can lead to accuracy errors which get exacerbated over time.
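Using the solve_logistic() function above together with the logistic_solution() helper sketched earlier, here is a small check of my own that tabulates the absolute error at $t = 10$ for a few step sizes:

deltas <- c(0.1, 0.01, 0.001)
errors <- sapply(deltas, function(dt) {
  n_steps <- round(10 / dt) + 1     # N[1] is t = 0, so N[n_steps] is t = 10
  est <- solve_logistic(N0 = 0.1, r = 1, delta_t = dt, times = n_steps)
  abs(logistic_solution(10, N0 = 0.1) - est[n_steps])
})
errors   # shrinks by roughly a factor of 10 with each 10-fold smaller delta_t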

In summary, we have seen that nonlinear differential equations can model interesting behaviour such as multiple fixed points; how to classify the stability of these fixed points using linear stability analysis; and how to numerically solve nonlinear differential equations. In the remainder of this post, we study coupled nonlinear differential equations — the SIR and SIRS models — as a way to model the spread of infectious diseases.

Modeling Infectious Diseases

Many models have been proposed as tools to understand epidemics. In the following sections, I focus on the two simplest ones: the SIR and the SIRS model (see also Hirsch, Smale, Devaney, 2013, ch. 11).

The SIR Model

We use the SIR model to understand the spread of infectious diseases. The SIR model is the most basic compartmental model, meaning that it groups the overall population into distinct sub-populations: a susceptible population $S$, an infected population $I$, and a recovered population $R$. We make a number of further simplifying assumptions. First, we assume that the overall population is $1 = S + I + R$ so that $S$, $I$, and $R$ are proportions. We further assume that the overall population does not change, that is,

\frac{d}{dt} \left(S + I + R\right) = 0 \enspace .

Second, the SIR model assumes that once a person has been infected and has recovered, the person cannot become infected again — we will relax this assumption later on. Third, the model assumes that the rate of transmission of the disease is proportional to the number of encounters between susceptible and infected persons. We model this by setting

\frac{dS}{dt} = - \beta IS \enspace ,

where $\beta > 0$ is the rate of infection. Fourth, the model assumes that the growth of the recovered population is proportional to the proportion of people that are infected, that is,

\frac{dR}{dt} = \gamma I \enspace ,

where $\gamma > 0$ is the recovery rate. Since the overall population is constant, these two equations naturally lead to the following equation for the infected:

\begin{aligned}
\frac{d}{dt} \left(S + I + R\right) &= 0 \\[0.50em]
\frac{dI}{dt} &= - \frac{dS}{dt} - \frac{dR}{dt} \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace .
\end{aligned}

where $\beta I S$ gives the proportion of newly infected individuals and $\gamma I$ gives the proportion of newly recovered individuals. Observe that since we assumed that the overall population does not change, we only need to focus on two of these subgroups, since $R(t) = 1 - S(t) - I(t)$. The system is therefore fully characterized by

\begin{aligned}
\frac{dS}{dt} &= - \beta IS \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace .
\end{aligned}

Before we analyze this model mathematically, let’s implement Euler’s method and visualize some trajectories.

solve_SIR <- function(S0, I0, beta = 1, gamma = 1, delta_t = 0.01, times = 8000) {
  res <- matrix(NA, nrow = times, ncol = 3,
                dimnames = list(NULL, c('S', 'I', 'R')))
  res[1, ] <- c(S0, I0, 1 - S0 - I0)

  dS <- function(S, I) -beta * I * S
  dI <- function(S, I) beta * I * S - gamma * I

  for (i in seq(2, times)) {
    S <- res[i-1, 1]
    I <- res[i-1, 2]

    res[i, 1] <- res[i-1, 1] + delta_t * dS(S, I)
    res[i, 2] <- res[i-1, 2] + delta_t * dI(S, I)
  }

  res[, 3] <- 1 - res[, 1] - res[, 2]
  res
}

plot_SIR <- function(res, main = '') {
  cols <- brewer.pal(3, 'Set1')

  matplot(res, type = 'l', col = cols, axes = FALSE, lty = 1, lwd = 2,
          ylab = 'Subpopulations(t)', xlab = 'Time t',
          xlim = c(0, 4000), ylim = c(0, 1), main = main,
          cex.main = 1.75, cex.lab = 1.5, font.main = 1,
          xaxs = 'i', yaxs = 'i')

  axis(1, cex.axis = 1.25)
  axis(2, las = 2, cex.axis = 1.25)
  legend(3000, 0.65, col = cols, legend = c('S', 'I', 'R'),
         lty = 1, lwd = 2, bty = 'n', cex = 1.5)
}

The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$ and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$.
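As a usage example (my own call; the specific $\beta$ is just an illustrative outbreak value), a single set of trajectories can be computed and plotted like this. Note that plot_SIR() assumes the RColorBrewer package is available:

library(RColorBrewer)   # plot_SIR() calls brewer.pal()
res <- solve_SIR(S0 = 0.95, I0 = 0.05, beta = 3/8, gamma = 1/8)
plot_SIR(res, main = 'SIR trajectories')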


For $\beta = 1/8$, no outbreak occurs (left panel). Instead, the proportions of susceptible and infected people monotonically decrease while the proportion of recovered people monotonically increases. The middle panel, on the other hand, shows a small outbreak. The proportion of infected people rises, but then falls again. Similarly, the right panel shows an outbreak as well, but a more severe one, as the proportion of infected people rises more starkly before it eventually decreases again.

How do things change when we change the recovery rate $\gamma$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller recovery rate $\gamma = 1/12$.


We again observe no outbreak in the left panel, and outbreaks of increasing severity in both the middle and the right panel. In contrast to the results for $\gamma = 1/8$, the outbreak is more severe, as we would expect since the recovery rate with $\gamma = 1/12$ is now lower. In fact, whether an outbreak occurs or not and how severe it will be depends not on $\beta$ and $\gamma$ alone, but on their ratio. This ratio is known as $R_0 = \beta / \gamma$, pronounced “R-naught”. (Note the unfortunate choice of well-established terminology in this context, as $R_0$ also denotes the initial proportion of recovered people; it should be clear from the context which one is meant, however.) We can think of $R_0$ as the average number of people an infected person will infect before she gets better. If $R_0 > 1$, an outbreak occurs. In the next section, we look for the fixed points of this system and assess their stability.

Analyzing Fixed Points

A glance at the above figures suggests that the SIR model allows for multiple stable states. The left panels, for example, show that if there is no outbreak, the proportion of susceptible people stays above the proportion of recovered people. If there is an outbreak, however, then it always fades and the proportion of recovered people will be higher than the proportion of susceptible people; how much higher depends on the severity of the outbreak.

While we could play around some more with visualisations, it pays to do a formal analysis. Note that in contrast to the logistic equation, which only modelled a single variable — population size — an analysis of the SIR model requires us to handle two variables, $S$ and $I$; the third one, $R$, follows from the assumption of a constant population size. At the fixed points, nothing changes, that is, we have:

\begin{aligned}
0 &= - \beta IS \\[0.50em]
0 &= \beta IS - \gamma I \enspace .
\end{aligned}

This can only happen when $I = 0$, irrespective of the value of $S$. In other words, all $(I^{\star}, S^{\star}) = (0, S)$ are fixed points; if nobody is infected, the disease cannot spread — and so everybody stays either susceptible or recovered. To assess the stability of these fixed points, we again derive a differential equation for the perturbations close to the fixed point. However, note that in contrast to the one-dimensional case studied above, perturbations can now be with respect to $I$ or to $S$. Let $u = S - S^{\star}$ and $v = I - I^{\star}$ be the respective perturbations, and let $\dot{S} = f(S, I)$ and $\dot{I} = g(S, I)$. We first derive a differential equation for $u$, writing:

\dot{u} = \frac{d}{dt}\left(S - S^{\star}\right) = \dot{S} \enspace ,

since $S^{\star}$ is a constant. This implies that $u$ behaves as $S$. In contrast to the one-dimensional case above, we have two coupled differential equations, and so we have to take into account how $u$ changes as a function of both $S$ and $I$. We Taylor expand at the fixed point $(S^{\star}, I^{\star})$:

\begin{aligned}
\dot{u} &= f(u + S^{\star}, v + I^{\star}) \\[0.50em]
&= f(S^{\star}, I^{\star}) + u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} + \mathcal{O}(u^2, v^2, uv) \\[0.50em]
&\approx u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} \enspace ,
\end{aligned}

since $f(S^{\star}, I^{\star}) = 0$ and we drop higher-order terms. Note that taking the partial derivative of $f$ with respect to $S$ (or $I$) yields a function, and the subscripts $(S^{\star}, I^{\star})$ mean that we evaluate this function at the fixed point $(S^{\star}, I^{\star})$. We can similarly derive a differential equation for $v$:

\dot{v} \approx u \frac{\partial g}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial g}{\partial I}_{(S^{\star}, I^{\star})} \enspace .

We can write all of this concisely using matrix algebra:

\begin{pmatrix}
\dot{u} \\
\dot{v}
\end{pmatrix} =
\begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}
\begin{pmatrix}
u \\
v
\end{pmatrix} \enspace ,

where

J = \begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}

is called the Jacobian matrix at the fixed point $(S^{\star}, I^{\star})$. The Jacobian gives the linearized dynamics close to a fixed point, and therefore tells us how perturbations will evolve close to a fixed point.

In contrast to unidimensional systems, where we simply check whether the slope is positive or negative, that is, whether $f'(x^\star) < 0$ or $f'(x^\star) > 0$, the test for whether a fixed point is stable is slightly more complicated in multidimensional settings. In fact, and not surprisingly, since we have linearized this nonlinear differential equation, the check is the same as in linear systems: we compute the eigenvalues $\lambda_1$ and $\lambda_2$ of $J$, observing that negative eigenvalues mean exponential decay and positive eigenvalues mean exponential growth along the directions of the respective eigenvectors. (Note that this does not work for all types of fixed points, see Strogatz (2015, p. 152).)

What does this mean for our SIR model? First, let’s derive the Jacobian:

\begin{aligned}
J &= \begin{pmatrix}
-\frac{\partial}{\partial S} \beta I S & -\frac{\partial }{\partial I} \beta I S \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
&= \begin{pmatrix}
-\beta I & -\beta S \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned}

Evaluating this at the fixed point $(S^{\star}, I^{\star}) = (S, 0)$ results in:

J_{(S, 0)} = \begin{pmatrix} 0 & -\beta S \\ 0 & \beta S - \gamma \end{pmatrix} \enspace .

Since this matrix is upper triangular — all entries below the diagonal are zero — the eigenvalues are given by the diagonal, that is, $\lambda_1 = 0$ and $\lambda_2 = \beta S – \gamma$. $\lambda_1 = 0$ implies a constant solution, while $\lambda_2 > 0$ implies exponential growth and $\lambda_2 < 0$ exponential decay of the perturbations close to the fixed point. Observe that $\lambda_2$ is not only a function of the parameters $\beta$ and $\gamma$, but also of the proportion of susceptible individuals $S$. We find that $\lambda_2 > 0$ for $S > \gamma / \beta$, which results in an unstable fixed point. On the other hand, we have that $\lambda_2 < 0$ for $S < \gamma / \beta$, which results in a stable fixed point. In the next section, we will use vector fields in order to get more intuition for the dynamics of the system.
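A quick numerical check of my own, using the outbreak parameters from above and an arbitrary $S$:

# Jacobian of the SIR model at a fixed point (S, 0), here with S = 0.95
beta <- 3/8; gamma <- 1/8; S <- 0.95
J <- matrix(c(0, -beta * S,
              0, beta * S - gamma),
            nrow = 2, byrow = TRUE)
eigen(J)$values   # 0 and beta * S - gamma = 0.23125; the positive one makes this point unstable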

Vector Field and Nullclines

A vector field shows for any position $(S, I)$ in which direction the system moves, which we indicate by the head of an arrow, and how quickly, which we indicate by the length of an arrow. We use the R code below to visualize such a vector field and selected trajectories on it.

library('fields')

plot_vectorfield_SIR <- function(beta, gamma, main = '', ...) {
  S <- seq(0, 1, 0.05)
  I <- seq(0, 1, 0.05)

  dS <- function(S, I) -beta * I * S
  dI <- function(S, I) beta * I * S - gamma * I

  SI <- as.matrix(expand.grid(S, I))
  SI <- SI[apply(SI, 1, function(x) sum(x) <= 1), ] # S + I <= 1 must hold
  dSI <- cbind(dS(SI[, 1], SI[, 2]), dI(SI[, 1], SI[, 2]))

  draw_vectorfield(SI, dSI, main, ...)
}

draw_vectorfield <- function(SI, dSI, main, ...) {
  S <- seq(0, 1, 0.05)
  I <- seq(0, 1, 0.05)

  plot(S, I, type = 'n', axes = FALSE, xlab = '', ylab = '', main = '',
       cex.main = 1.5, xlim = c(-0.2, 1), ylim = c(-0.2, 1.2), ...)

  lines(c(-0.1, 1), c(0, 0), lwd = 1)
  lines(c(0, 0), c(-0.1, 1), lwd = 1)

  arrow.plot(SI, dSI, arrow.ex = .075, length = .05,
             lwd = 1.5, col = 'gray82', xpd = TRUE)

  cx <- 1.5
  cn <- 2
  text(0.5, 1.05, main, cex = 1.5)
  text(0.5, -.075, 'S', cex = cn, font = 1)
  text(-.05, 0.5, 'I', cex = cn, font = 1)
  text(-.03, -.04, 0, cex = cx, font = 1)
  text(-.03, .975, 1, cex = cx, font = 1)
  text(0.995, -0.04, 1, cex = cx, font = 1)
}

For $\beta = 1/8$ and $\gamma = 1/8$, we know from above that no outbreak occurs. The vector field shown in the left panel below further illustrates that, since $S \leq \gamma / \beta = 1$, all fixed points $(S^{\star}, I^{\star}) = (S, 0)$ are stable. In contrast, we know that $\beta = 3/8$ and $\gamma = 1/8$ result in an outbreak. The vector field shown in the right panel below indicates that fixed points with $S > \gamma / \beta = 1/3$ are unstable, while fixed points with $S < 1/3$ are stable; the dotted line is $S = 1/3$.


Can we find some structure in such vector fields? One way to “organize” them is by drawing so-called nullclines. In our case, the $I$-nullcline gives the set of points for which $\dot{I} = 0$, and the $S$-nullcline gives the set of points for which $\dot{S} = 0$. We find these points in a similar manner to finding fixed points, but instead of setting both $\dot{S}$ and $\dot{I}$ to zero, we tackle them one at a time.

The $S$-nullclines are given by the $S$- and the $I$-axes, because $\dot{S} = 0$ when $S = 0$ or when $I = 0$. Along the $I$-axis we have $\dot{I} = - \gamma I$ since $S = 0$, resulting in exponential decay of the infected population; this is indicated by the grey arrows along the $I$-axis, which become progressively shorter as they approach the origin.

The $I$-nullclines are given by $I = 0$ and by $S = \gamma / \beta$. For $I = 0$, we have $\dot{S} = 0$ and so these yield fixed points. For $S = \gamma / \beta$ we have $\dot{S} = - \gamma I$, resulting in exponential decay of the susceptible population, but since $\dot{I} = 0$, the proportion of infected people does not change; this is indicated in the left vector field above, where we have horizontal arrows at the dashed line given by $S = \gamma / \beta$. However, this only holds for the briefest of moments, since $S$ decreases and for $S < \gamma / \beta$ we again have $\dot{I} < 0$, and so the proportion of infected people goes down to the left of the line. Similarly, to the right of the line we have $S > \gamma / \beta$, which results in $\dot{I} > 0$, and so the proportion of infected people grows.

In summary, we have seen how the SIR model allows for outbreaks whenever the rate of infection is higher than the rate of recovery, that is, whenever $R_0 = \beta / \gamma > 1$. If this occurs, then we have a growing proportion of infected people while $S > \gamma / \beta$. As illustrated by the vector field, the proportion of susceptible people $S$ decreases over time. At some point, therefore, we have that $S < \gamma / \beta$, resulting in a decrease in the proportion of infected people until finally $I = 0$. Observe that, in the SIR model, infections always die out. In the next section, we extend the SIR model to allow for diseases to become established in the population.

The SIRS Model

The SIR model assumes that people, once infected, are immune to the disease forever, and so any disease occurs only once and then never comes back. More interesting dynamics occur when we allow for the reinfection of recovered people; we can then ask, for example, under what circumstances the disease becomes established in the population. The SIRS model extends the SIR model, allowing the recovered population to become susceptible again (hence the extra ‘S’). It assumes that the susceptible population increases proportionally to the recovered population such that:

\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \\[0.50em]
\frac{dR}{dt} &= \gamma I - \mu R \enspace ,
\end{aligned}

where, since we added $\mu R$ to the change in the proportion of susceptible people, we had to subtract $\mu R$ from the change in the proportion of recovered people. We again make the simplifying assumption that the overall population does not change, and so it suffices to study the following system:

\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace ,
\end{aligned}

since $R(t) = 1 - S(t) - I(t)$. We adjust our implementation of Euler’s method:

solve_SIRS <- function(S0, I0, beta = 1, gamma = 1, mu = 1, delta_t = 0.01, times = 1000) {
  res <- matrix(NA, nrow = times, ncol = 3,
                dimnames = list(NULL, c('S', 'I', 'R')))
  res[1, ] <- c(S0, I0, 1 - S0 - I0)

  dS <- function(S, I, R) -beta * I * S + mu * R
  dI <- function(S, I, R) beta * I * S - gamma * I

  for (i in seq(2, times)) {
    S <- res[i-1, 1]
    I <- res[i-1, 2]
    R <- res[i-1, 3]

    res[i, 1] <- res[i-1, 1] + delta_t * dS(S, I, R)
    res[i, 2] <- res[i-1, 2] + delta_t * dI(S, I, R)
    res[i, 3] <- 1 - res[i, 1] - res[i, 2]
  }

  res
}

plot_SIRS <- function(res, main = '') {
  cols <- brewer.pal(3, 'Set1')

  matplot(res, type = 'l', col = cols, axes = FALSE, lty = 1, lwd = 2,
          ylab = 'Subpopulations(t)', xlab = 'Time t',
          ylim = c(0, 1), main = main, cex.main = 1.75, cex.lab = 1.25,
          font.main = 1, xlim = c(0, 4000), font.main = 1,
          xaxs = 'i', yaxs = 'i')

  axis(1, cex.axis = 1.5)
  axis(2, las = 2, cex.axis = 1.5)
  legend(3000, 0.95, col = cols, legend = c('S', 'I', 'R'),
         lty = 1, lwd = 2, bty = 'n', cex = 1.5)
}

The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$, a fixed reinfection rate of $\mu = 1/8$, and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$.


As for the SIR model, we again find that no outbreak occurs for $R_0 = \beta / \gamma < 1$, which is the case for the left panel. Most interestingly, however, we find that the proportion of infected people does not, in contrast to the SIR model, decrease to zero for the other panels. Instead, the disease becomes established in the population when $R_0 > 1$, and the middle and the right panel show different fixed points.

How do things change when we vary the reinfection rate $\mu$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller reinfection rate $\mu$.


We again find no outbreak in the left panel, and outbreaks of increasing severity in the middle and right panel. Both these outbreaks are less severe compared to the outbreaks in the previous figures, as we would expect given a decrease in the reinfection rate. Similarly, the system seems to stabilize at different fixed points. In the next section, we provide a more formal analysis of the fixed points and their stability.

Analyzing Fixed Points

To find the fixed points of the SIRS model, we again seek solutions for which:

\begin{aligned}
0 &= - \beta IS + \mu (1 - S - I) \\[0.50em]
0 &= \beta IS - \gamma I \enspace ,
\end{aligned}

where we have substituted $R = 1 - S - I$ and from which it follows that also $\dot{R} = 0$ since we assume that the overall population does not change. We immediately see that, in contrast to the SIR model, $I = 0$ cannot be a fixed point for any $S$ because of the added term which depends on $\mu$. Instead, it is a fixed point only for $S = 1$. To get the other fixed point, note that the last equation gives $S = \gamma / \beta$, which plugged into the first equation yields:

\begin{aligned}
0 &= -I\gamma + \mu\left(1 - \frac{\gamma}{\beta} - I\right) \\[0.50em]
I\gamma &= \mu\left(1 - \frac{\gamma}{\beta}\right) - \mu I \\[0.50em]
I(\gamma + \mu) &= \mu\left(1 - \frac{\gamma}{\beta}\right) \\[0.50em]
I &= \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} \enspace .
\end{aligned}

Therefore, the fixed points are:

\begin{aligned}
(S^{\star}, I^{\star}) &= (1, 0) \\[0.50em]
(S^{\star}, I^{\star}) &= \left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \enspace .
\end{aligned}

Note that the second fixed point does not exist when $\gamma / \beta > 1$, since the proportion of infected people cannot be negative. Another, more intuitive perspective on this is to write $\gamma / \beta > 1$ as $R_0 = \beta / \gamma < 1$. This allows us to see that the second fixed point, which would have a non-zero proportion of infected people in the population, does not exist when $R_0 < 1$, as then no outbreak occurs. We will come back to this in a moment.

To assess the stability of the fixed points, we derive the Jacobian matrix for the SIRS model:

\begin{aligned}
J &= \begin{pmatrix}
\frac{\partial}{\partial S} \left(-\beta I S + \mu(1 - S - I)\right) & \frac{\partial }{\partial I} \left(-\beta I S + \mu(1 - S - I)\right) \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
&= \begin{pmatrix}
-\beta I - \mu & -\beta S - \mu \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned}

For the fixed point $(S^{\star}, I^{\star}) = (1, 0)$ we have:

J_{(1, 0)} = \begin{pmatrix}
- \mu & -\beta - \mu \\
0 & \beta - \gamma
\end{pmatrix} \enspace ,

which is again upper-triangular and therefore has eigenvalues $\lambda_1 = -\mu$ and $\lambda_2 = \beta - \gamma$. This means it is unstable whenever $\beta > \gamma$ since then $\lambda_2 > 0$, and any infected individual spreads the disease. The Jacobian at the second fixed point is:

J_{\left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right)} = \begin{pmatrix}
-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu & -\gamma - \mu \\
\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} & 0
\end{pmatrix} \enspace ,

which is more daunting. However, we know from the previous blog post that to classify the stability of the fixed point, it suffices to look at the trace $\tau$ and determinant $\Delta$ of the Jacobian, which are given by

\begin{aligned}
\tau &= -\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu \\[0.50em]
\Delta &= \left(-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu\right) \cdot 0 - \left(- \gamma - \mu\right)\left(\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \\[0.50em]
&= \beta\mu\left(1 - \frac{\gamma}{\beta}\right) \enspace .
\end{aligned}

The trace can be written as $\tau = \lambda_1 + \lambda_2$ and the determinant can be written as $\Delta = \lambda_1 \lambda_2$, as shown in a previous blog post. Here, we have that $\tau < 0$ because both of its terms are negative, and $\Delta > 0$ whenever $\gamma / \beta < 1$, that is, whenever this fixed point exists at all. This constrains $\lambda_1$ and $\lambda_2$ to have negative real parts, and thus the fixed point is stable.
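As a quick numerical sanity check (my own code, not from the original post), we can evaluate this Jacobian for $\beta = 3/8$, $\gamma = 1/8$, and $\mu = 1/8$, the parameter values used in the right panel below, and inspect its eigenvalues:

beta <- 3/8; gamma <- 1/8; mu <- 1/8
S_star <- gamma / beta                              # 1/3
I_star <- mu * (1 - gamma / beta) / (gamma + mu)    # 1/3

J <- matrix(c(-beta * I_star - mu, -gamma - mu,
               beta * I_star,       0),
            nrow = 2, byrow = TRUE)
eigen(J)$values   # -0.125 + 0.125i and -0.125 - 0.125i: negative real parts, so stable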

Vector Fields and Nullclines

As previously done for the SIR model, we can again visualize the directions in which the system changes at any point using a vector field.

plot_vectorfield_SIRS <- function(beta, gamma, mu, main = '', ...) {
  S <- seq(0, 1, 0.05)
  I <- seq(0, 1, 0.05)

  dS <- function(S, I) -beta * I * S + mu * (1 - S - I)
  dI <- function(S, I) beta * I * S - gamma * I

  SI <- as.matrix(expand.grid(S, I))
  SI <- SI[apply(SI, 1, function(x) sum(x) <= 1), ] # S + I <= 1 must hold
  dSI <- cbind(dS(SI[, 1], SI[, 2]), dI(SI[, 1], SI[, 2]))

  draw_vectorfield(SI, dSI, main, ...)
}

The figure below visualizes the vector field for the SIRS model, several trajectories, and the nullclines for $\gamma = 1/8$ and $\mu = 1/8$ for $\beta = 1/8$ (left panel) and $\beta = 3/8$ (right panel). The left panel shows that there exists only one stable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$ to which all trajectories converge.

[Figure: SIRS vector fields, trajectories, and nullclines for β = 1/8 (left panel) and β = 3/8 (right panel)]

The right panel, on the other hand, shows two fixed points: one unstable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$, which we only reach when $I_0 = 0$, and a stable one at

$$
(S^{\star}, I^{\star}) = \left(\frac{1/8}{3/8}, \frac{\frac{1}{8}\left(1 - \frac{1/8}{3/8}\right)}{1/8 + 1/8}\right) = (1/3, 1/3) \enspace .
$$

In contrast to the SIR model, therefore, there exists a stable fixed point constituting a population which includes infected people, and so the disease is not eradicated but stays in the population.

The dashed lines give the nullclines. The $I$-nullcline gives the set of points where $\dot{I} = 0$, which are — as in the SIR model above — given by $I = 0$ and $S = \gamma / \beta$. The $S$-nullcline is given by:

$$
\begin{aligned}
0 &= -\beta I S + \mu(1 - S - I) \\[0.50em]
\beta I S &= \mu(1 - S) - \mu I \\[0.50em]
I &= \frac{\mu(1 - S)}{\beta S + \mu} \enspace ,
\end{aligned}
$$

which is a nonlinear function in $S$. The nullclines help us again in “organizing” the vector field. This can be seen best in the right panel above. In particular, and similar to the SIR model, we will again have a decrease in the proportion of infected people to the left of the line given by $S = \gamma / \beta$, that is, when $S < \gamma / \beta$, and an increase to the right of the line, that is, when $S > \gamma / \beta$. Similarly, the proportion of susceptible people increases when the system is “below” the $S$-nullcline, while it decreases when the system is “above” the $S$-nullcline.
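To make the nullclines concrete, here is a small sketch (base R only; the vector field and trajectories from the figure are omitted) that draws them for the right-panel parameter values:

# S-nullcline: I = mu * (1 - S) / (beta * S + mu); I-nullclines: I = 0 and S = gamma / beta
beta <- 3/8; gamma <- 1/8; mu <- 1/8
curve(mu * (1 - x) / (beta * x + mu), from = 0, to = 1,
      xlab = 'S', ylab = 'I', lty = 2)      # S-nullcline
abline(h = 0, lty = 2)                      # I-nullcline: I = 0
abline(v = gamma / beta, lty = 2)           # I-nullcline: S = gamma / beta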

Bifurcations

In the vector fields above we have seen that the system can go from having only one fixed point to having two fixed points. Whenever a fixed point is destroyed or created or changes its stability as an internal parameter is varied — here the ratio of $\gamma / \beta$ — we speak of a bifurcation.

As pointed out above, the second equilibrium point only exists for $\gamma / \beta \leq 1$. As long as $\gamma / \beta < 1$, we have two distinct fixed points. At $\gamma / \beta = 1$, the second fixed point becomes:

$$
(S^{\star}, I^{\star}) = \left(1, \frac{\mu\left(1 - 1\right)}{\gamma + \mu}\right) = (1, 0) \enspace ,
$$

which equals the first fixed point. Thus, at $\gamma / \beta = 1$, the two fixed points merge into one; this is the bifurcation point. This makes sense: if $\gamma / \beta < 1$, we have that $\beta / \gamma > 1$, and so an outbreak occurs, which establishes the disease in the population since we allow for reinfections.

We can visualize this change in fixed points in a so-called bifurcation diagram. A bifurcation diagram shows how the fixed points and their stability change as we vary an internal parameter. Since we deal with two-dimensional fixed points, we split the bifurcation diagram into two: the left panel shows how the $I^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$, and the right panel shows how the $S^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$.

[Figure: bifurcation diagram showing the fixed points I* (left panel) and S* (right panel) as a function of γ/β]

The left panel shows that as long as $\gamma / \beta < 1$, which implies that $\beta / \gamma > 1$, we have two fixed points, and the stable one is the fixed point with a non-zero proportion of infected people: the disease becomes established. These fixed points lie on the diagonal line, indicated as black dots. Interestingly, this shows that the proportion of infected people can never be stable at a value larger than $1/2$. There also exist unstable fixed points for which $I^{\star} = 0$. These fixed points are unstable because even a single infected person will spread the disease, resulting in more infected people. At the point where $\beta = \gamma$, the two fixed points merge: the disease can no longer become established in the population, and the proportion of infected people always goes to zero.
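A rough sketch of the left panel (holding $\gamma = \mu = 1/8$ fixed while $\beta$ varies, as in the figures above) shows how $I^{\star}$ traces out this line and collapses to zero at $\gamma / \beta = 1$:

gamma <- 1/8; mu <- 1/8
ratio <- seq(0.01, 1.5, length.out = 300)                        # gamma / beta
I_star <- ifelse(ratio <= 1, mu * (1 - ratio) / (gamma + mu), NA)
plot(ratio, I_star, type = 'l', xlab = 'gamma / beta', ylab = 'I*')
abline(h = 0, lty = 2)  # the I* = 0 fixed point, which is stable only for gamma / beta > 1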

Similarly, the right panel shows how the fixed points $S^{\star}$ change as a function of $\gamma / \beta$. Since the infection spreads for $\beta > \gamma$, the fixed point $S^{\star} = 1$ is unstable, as the proportion of susceptible people must decrease when they become infected. As outbreaks become increasingly mild for $\gamma / \beta \rightarrow 1$, the stable proportion of susceptible people increases, reaching $S^{\star} = 1$ when finally $\gamma = \beta$.

In summary, we have seen how the SIRS model extends the SIR model by allowing reinfections. This resulted in the possibility of more interesting fixed points, which include a non-zero proportion of infected people. In the SIRS model, then, a disease can become established in the population. In contrast to the SIR model, we have also seen that the SIRS model allows for bifurcations, going from two fixed points in times of outbreaks ($\beta > \gamma$) to one fixed point in times of no outbreaks ($\beta < \gamma$).

Conclusion

In this blog post, we have seen that nonlinear differential equations are a powerful tool to model real-world phenomena. They allow us to model vastly more complicated behaviour than is possible with linear differential equations, yet they rarely provide closed-form solutions. Luckily, the time-evolution of a system can be straightforwardly computed with basic numerical techniques such as Euler’s method. Using the simple logistic equation, we have seen how to analyze the stability of fixed points: simply pretend the system is linear close to a fixed point.

The logistic equation has only one state variable: the size of the population. More interesting dynamics occur when variables interact, and we have seen how the simple SIR model can help us understand the spread of infectious disease. The SIR model has only two parameters, and we have seen that an outbreak occurs only when $R_0 = \beta / \gamma > 1$. Moreover, its stable fixed points always include $I = 0$, implying that the disease always gets eradicated. This is not true for all diseases, because recovered people might become reinfected. The SIRS model amends this by introducing a parameter $\mu$ that quantifies how quickly recovered people become susceptible again. As expected, this led to stable states in which the disease is established in the population.

On our journey to understand these systems, we have seen how to quantify the stability of a fixed point using linear stability analysis, how to visualize the dynamics of a system using vector fields, how nullclines give structure to such vector fields, and how bifurcations can drastically change the dynamics of a system.

The SIR and the SIRS models discussed here are without a doubt crude approximations of the real dynamics of the spread of infectious diseases. There exist several ways to extend them. One way, for example, is to add an exposed group of people who are infected but not yet infectious; see here for a visualization of an elaborated version of this model in the context of SARS-CoV-2. These basic compartment models also assume spatial homogeneity, which is a substantial simplification. There are various ways to include spatial structure (e.g., Watts et al., 2005; Riley, 2007), but that is for another blog post.


I would like to thank Adam Finnemann, Anton Pichler, and Oísin Ryan for very helpful comments on this blog post.


References

  • Strogatz, S. H. (2015). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.
  • Hirsch, M. W., Smale, S., & Devaney, R. L. (2013). Differential equations, dynamical systems, and an introduction to chaos. Boston, US: Academic Press.
  • Riley, S. (2007). Large-scale spatial-transmission models of infectious disease. Science, 316(5829), 1298-1301.
  • Watts, D. J., Muhamad, R., Medina, D. C., & Dodds, P. S. (2005). Multiscale, resurgent epidemics in a hierarchical metapopulation model. Proceedings of the National Academy of Sciences, 102(32), 11157-11162.

To leave a comment for the author, please follow the link and comment on their blog: Fabian Dablander.


Version Control is a Time Machine That Translates Common Hindsight Into Valuable Foresight


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

For data science projects I recommend using source control or version control, and committing changes at a very fine level of granularity. This means checking in possibly broken code and possibly weak commit messages (so when working in a shared project, you may want a private branch or a second source control repository).

Please read on for our justification.

The issue we are facing is: Chesterton’s Fence

In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”

How this appears in software or data science projects is often: “harmless cleanup” steps break your project, and you don’t detect this until much later.

The Chesterton’s Fence parable always amused me, as it doesn’t have an actual example of adverse consequences (though I always mis-remember it as having one). Nobody who does actual work is in fact careful enough or knowledgeable enough to always avoid removing Chesterton’s fence as a matter of foresight. However, in hindsight you often can see the problem. Luckily, version control is a time machine that translates common hindsight into more valuable foresight.

So, let’s add a minor data science example.

I’ve recently been playing around with a Keras/TensorFlow project, which I will probably write up later. At some point I “cleaned up” the code by replacing an unsightly tensor slice of the form x[:, (j-1):j] with a more natural looking indexing x[:, j-1]. What I neglected is that TensorFlow uses the tensor rank/shape details to record the difference between a single data-column and a data-frame containing a single data-column (a small distinction that is very important to maintain in data science projects). This “cleanup” broke the code in a non-signaling way, as additional TensorFlow re-shaping rules allowed the calculation to move forward with incorrect values. A few changes later I re-ran the project evaluation, and the model performance fell precipitously. I had no idea why a model that recently performed well now didn’t work.

The saving grace was: I had committed at a very fine granularity, even during the “harmless code clean-up”, using git version control. Exactly the set of commits you would be embarrassed to share. These “useless” commits saved me. I could quickly run a bisection search for the poison commit. The concept is illustrated in chapter 11 of Practical Data Science with R (please check it out!) as follows:

[Figure: a git commit history, illustrating the bisection search for a breaking commit (from chapter 11 of Practical Data Science with R)]

Now git is a bit of “when you walk with it you need fear no other” protector. In the process of finding the breaking change I accidentally checked out the repository to a given version (instead of a specific file), causing the dreaded “git Detached HEAD” issue in my source control repository. But the win was: that was a common researchable problem with known fixes. I was happy to trade my “why did this stop working for no reason” mystery for the routine maintenance task of fixing the repository after finding the root cause.

And that is the nature of source control or version control: it is a bunch of technical considerations that end up being a net positive, as they can save you from worse issues.

After note: a much worse, and more memorable, parable on the value of source control is the following. I remember a master’s candidate in mathematics at UC Berkeley losing an entire draft of her dissertation when she accidentally typed “rm * .log” instead of “rm *.log” to clean up side-effect files in her working directory. The extra space allowed the remove command to nuke important files. Without source control, this set her back a month.

For a nice lecture on the inevitability of errors (and thus why we need to mitigate them, as they can not be fully eliminated) I recommend The Lead Developer’s “Who Destroyed Three Mile Island” presentation.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Deploying RMarkdown Online


[This article was first published on jessecambon-R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

RMarkdown is a great tool for creating a variety of documents with R code and it’s a natural choice for producing blog posts such as this one. However, depending on which blog software you use, you may run into some problems related to the file paths for figure images (such as ggplot charts) which will require tweaks in your RMarkdown workflow.

This blog post demonstrates a simple solution to this problem that will also give you central control over RMarkdown knit settings across your site. I use this solution for this blog and a GitHub repository of data science resources. You can also find the RMarkdown file that generated this blog post here.

Note that in this post I will be talking about implementing a solution for a Jekyll blog that is hosted via GitHub pages. Some modifications may be required if you are using another blog or website platform. However, this solution should be adaptable to all blogs or websites that use Markdown.

For Jekyll there are two steps to building web content (HTML) from an RMarkdown file. The first is to knit the RMarkdown (.Rmd) file which creates the Markdown (.md) file. The second step is to use the jekyll build command to create HTML content which is what will be displayed online.

1. Knit:          RMarkdown (.Rmd) ----> Markdown (.md)
2. Jekyll Build:  Markdown (.md)   ----> HTML (.html)
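The knit step can also be run from R directly, which is handy in scripts or a Makefile. A minimal sketch (the filename is a placeholder; clicking Knit in RStudio does the same thing):

# Knit a post; the output format comes from the YAML header of the .Rmd file
rmarkdown::render("2020-03-22-deploying-rmarkdown-online.Rmd")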

The Problem

When I first used RMarkdown to create a post for this blog, none of my figures showed up in the post. The issue was that Jekyll creates the HTML file for a blog post in a different location than the RMarkdown (.Rmd) and Markdown (.md) files and this breaks figure file paths. This blog post describes the problem in more detail.

Also, by default RMarkdown stores files for figures two folder levels deep, using the RMarkdown file location as its root (i.e. <filename>_files/figure-gfm/image.png). I find it more convenient to organize figure files in a separate root directory from my RMarkdown files and store the images only one folder level deep (i.e. /rmd_images/<filename>/image.png). You can see this folder structure in action here (posts are in the _posts folder and figures are in the rmd_images folder).

The Solution

This solution uses a single R script file (.R) which contains knit settings adjustments and is referenced by all RMarkdown (.Rmd) files. This allows you to edit knit settings in one central location and use these settings whenever you knit an RMarkdown file. Modifications are made to the knit process so that figure image files are saved in a well organized folder structure and the HTML files display figures properly.

The contents of this central R script which I have named rmd_config.R is below. It lives in the root directory of my Github repository and the contents of this file will be run (via source) when each RMarkdown file is knit.

library(knitr)
library(stringr)
library(here)

# get name of file during knitting and strip file extension
rmd_filename <- str_remove(knitr::current_input(), "\\.Rmd")

# Figure path on disk = base.dir + fig.path
# Figure URL online   = base.url + fig.path
knitr::opts_knit$set(base.dir = str_c(here::here(), '/'), base.url = '/')
knitr::opts_chunk$set(fig.path = str_c("rmd_images/", rmd_filename, '/'), echo = TRUE)

Here is what is going on in the above script:

  • The filename of our RMarkdown file is extracted using knitr::current_input() and stored in the variable rmd_filename (str_remove is used to remove the .Rmd file extension).
  • The here package establishes our ‘base’ directory (the root folder of our GitHub repository). The base directory path could change based on which computer we use and where we put our GitHub repository files so the here package allows us to automatically find this path.
  • The fig.path, which is where our figures will be stored, is set to a folder named after the RMarkdown file being run that resides in the ‘/rmd_images’ root directory.

To utilize the above script in an RMarkdown file, we simply insert the code below as a chunk into the RMarkdown file. This will source the script to apply all the necessary knit settings when an RMarkdown file is knit.

library(here)
source(here::here("rmd_config.R"))

For a Jekyll blog, you’ll also want to include certain YAML header content such as tags or the title of the post. To do this we can use the preserve_yaml output setting in generating our .md file and then insert whatever YAML content we need into the header. Below is an example YAML header (the first part of your RMarkdown document) which will generate a github-style (“gfm”) .md document. In this example I’ve added the fields “layout”, “title”, “date”, “author”, and “tags” which are all used by Jekyll in creating the blog post.

---
layout: post
title: "Deploying RMarkdown Online"
date: 2020-03-22
author: Jesse Cambon
tags: [rmarkdown, data-science]
output:
  md_document:
    pandoc_args: ["--wrap=none"]
    variant: gfm
    preserve_yaml: TRUE
---

Note that the pandoc_args setting is to prevent the knit process from inserting extra line breaks into the Markdown file that don’t exist in our RMarkdown file. The extra line breaks normally are invisible, but I found they showed up when I pushed content to R-Bloggers which caused paragraphs to be broken up.

One other note on Jekyll is that it uses the liquid template language. If you want to display characters on your blog that are used by liquid such as {{}} (R’s “curly-curly” operator) then you will need to wrap these problematic characters with the raw and endraw liquid tags as described here. This prevents Jekyll from attempting to parse these characters as liquid syntax and passes them on in raw form to the HTML file for display.

Conclusion

To see this solution in action, you can look at the GitHub repository that produces this blog here and the RMarkdown file for this specific blog post here. To provide a self-contained example of a figure displaying, I’ve created a simple histogram plot below and you’ll find the image file neatly filed away in the rmd_images directory underneath a subfolder named after this blog post.

One caveat is that this approach does assume that each RMarkdown filename is unique. If this is not the case then you’ll need to make some modifications to the central rmd_config.R file above; otherwise figure images from different RMarkdown files may save in the same directory (and possibly overwrite each other). However, the solution described here is quite flexible and could be adapted to a variety of use cases with tweaks to the rmd_config.R file.

hist(mtcars$disp)


To leave a comment for the author, please follow the link and comment on their blog: jessecambon-R.


Analyzing churn with chaid


[This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This post tries to accomplish several things concisely. I’m making available a new function (chaid_table()) inside my own little CGPfunctions package, reviewing some graphing options, and revisiting our old friend CHAID – Chi-Squared \(\chi^2\) Automatic Interaction Detection – to look at modeling a “real world” business problem.

It’s based on a blog post from Learning Machines and investigates customer churn for a wireless provider. The original blog post does a nice job of describing a package called rattle and how it was used. I’m going to use RStudio with a variety of packages that I’ll mention along the way.

The situation is drawn from the book Quantitative Methods for Management: A Practical Approach (authors: Canela, Miguel Angel; Alegre, Inés; Ibarra, Alberto). The data are publicly available; you can access them from their GitHub repository as churn.csv and save the file to a directory of your choice. We’re being asked to imagine we’re a wireless provider and we have a sample of 5,000 customers with 13 possible predictor variables, a unique ID number, and data about whether a customer did (Yes) leave us for a different provider, or stayed with us (No).

The predictors are a nice mix of things we might know about our customers: how long they have been with us, what sort of data and international calling plan they have, how often they call us for support, how often they call other customers of ours, etc. The original blog post has a nice little table listing all the variables that I won’t reproduce here.

Looks like in the book they took a classic linear regression approach, and the blog post builds on a classic decision tree model. Both well known approaches with plenty of supporters. As I’ve written before I actually love CHAID as an early modeling and explanation tool. Not necessarily because it is the most accurate, or the fastest, or the most modern but rather because it is easy to explain to our business customer, easy to understand the results, easy to use, and makes very few assumptions about the nature of our data.

Our objective

Let’s imagine that this is our first attempt at using the data we have on hand to predict “churn” and to look for opportunities to reduce it. Note that it is also good in a business setting to avoid unnecessary costs and reduce waste. So yes obviously we’d like to stop people from dropping our service and look for potential ways to retain them. We’d also like to avoid spending money or resources to induce people to stay when there is very little likelihood of them leaving. In other words, we are nearly equally interested in predicting who will leave and who will stay.

I’d also press the case that since this is our first attempt at modelling and since we are likely to be explaining our results to people who don’t natively talk about “p values” or “decision trees” or “model accuracy” that we should focus on being able to clearly explain our results rather than focus on how deep we go or how utterly accurate our model is.

For this post then, we’re going to explore a relatively shallow CHAID model and some tools for exploring the results in tabular and graphical format. Read on! I’ll say it again below but comments and critique are always welcomed via disqus or email. You’ll have no trouble finding the icon links in a couple of places.

Let’s load dplyr and CHAID (which requires partykit) and grab the dataset from github.

library(dplyr)
library(ggplot2)
theme_set(theme_bw())
library(forcats)
library(ggmosaic)
# library(ggrepel)
# install.packages("partykit")
# install.packages("CHAID", repos="http://R-Forge.R-project.org")
library(CHAID)
# devtools::install_github("ibecav/CGPfunctions", build_vignettes = TRUE)
library(CGPfunctions)
library(knitr)

# churn <- read.csv("https://raw.githubusercontent.com/quants-book/CSV_Files/master/churn.csv")
churn <- read.csv("churn.csv")
str(churn)
## 'data.frame':    5000 obs. of  15 variables:
##  $ ID      : Factor w/ 5000 levels "350-1149","350-1404",..: 2915 4687 2525 2883 3989 1281 2466 1968 3481 4789 ...
##  $ ACLENGTH: int  77 105 121 115 133 95 50 157 35 96 ...
##  $ INTPLAN : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ DATAPLAN: int  0 0 1 0 1 1 0 1 1 0 ...
##  $ DATAGB  : Factor w/ 7 levels "0","1.5G","100M",..: 1 1 2 1 2 7 1 6 7 1 ...
##  $ OMMIN   : num  80.8 131.8 212.1 186.1 166.5 ...
##  $ OMCALL  : int  70 66 57 64 61 85 96 73 56 99 ...
##  $ OTMIN   : num  166 132 195 231 176 ...
##  $ OTCALL  : int  67 105 140 125 74 98 73 71 77 99 ...
##  $ NGMIN   : num  18.6 5.1 14.9 26.5 36.1 11.1 34.5 15.3 21.6 12.4 ...
##  $ NGCALL  : int  6 6 14 16 11 2 10 8 7 2 ...
##  $ IMIN    : num  9.5 6.7 28.6 9.9 5.3 0 18.4 11.3 0 5.2 ...
##  $ ICALL   : int  4 2 8 4 2 0 7 3 0 2 ...
##  $ CUSCALL : int  1 0 1 1 1 1 1 3 0 0 ...
##  $ CHURN   : int  0 0 0 0 0 1 1 0 1 0 ...

Prepping the data

Okay we have the data we need. The original blog post does a good job of talking about the raw data and exploring distributions. I’m not going to repeat that work here. I’ll simply focus on what we need to do to prepare it for our CHAID analysis.

The variables INTPLAN, DATAPLAN, and CHURN are currently stored as integers 0/1 let’s make them true factors and label them No/Yes respectively just for clarity. We’ll do that with a mutate_at command.

DATAGB needs a little cleanup. It’s stored as a factor but the order is wrong because it was initially a character string. Far more convenient to store it as an ordered factor and specify the right order. You could reorder using base R but I’m going to use forcats::fct_relevel as clearer cleaner code.

The remainder of the variables are either real numbers or integers. These we’re going to convert to factors by cutting them into five more or less equal bins per variable. We’ll also use consistent ordering and labeling (“Low”, “MedLow”, “Medium”, “MedHigh”, “High”). For a much longer discussion of this see my earlier article.

But first we have CUSCALL, which doesn’t want to be broken into 5 bins so we’ll make it 4 bins and label them clearly (“0”, “1”, “2”, and “More than 2”) using fct_lump from forcats.

NB: There is a real difference between how factors and ordered factors are handled by CHAID because there are differences between nominal and ordinal variables. Factors can be split in any fashion. Ordered factors will always have sequence honored so you can’t have a split of 1:5 as 1, 2, and 5 vs 3, and 4. Make sure you know which you wish to use and why.

Our code and the resulting dataframe look like this.

churn <- churn %>%
  mutate_at(c("INTPLAN", "DATAPLAN", "CHURN"),
            factor,
            labels = c("No", "Yes"))

churn$DATAGB <- as.ordered(forcats::fct_relevel(churn$DATAGB,
                                                "0",
                                                "100M",
                                                "250M",
                                                "500M",
                                                "1G",
                                                "1.5G",
                                                "2G"))
table(churn$DATAGB)
## 
##    0 100M 250M 500M   1G 1.5G   2G 
## 3449   74  168  291  410  522   86
churn$CUSCALL <- as.ordered(fct_lump(as_factor(churn$CUSCALL),
                                     other_level = "More than 2"))
table(churn$CUSCALL)
## 
##           0           1           2 More than 2 
##        1095        1680        1277         948
churn <- churn %>%
  mutate_if(is.numeric,
            ~ ggplot2::cut_number(.,
                                  n = 5,
                                  labels = FALSE)
            ) %>%
  mutate_if(is.integer,
            ~ factor(.,
                     labels = c("Low", "MedLow", "Medium", "MedHigh", "High"),
                     ordered = TRUE)
            )
churn <- churn %>% select(-ID)
str(churn)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ACLENGTH: Ord.factor w/ 5 levels "Low"<"MedLow"<..: 2 3 4 4 4 3 1 5 1 3 ...
##  $ INTPLAN : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ DATAPLAN: Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 2 2 1 ...
##  $ DATAGB  : Ord.factor w/ 7 levels "0"<"100M"<"250M"<..: 1 1 6 1 6 4 1 7 4 1 ...
##  $ OMMIN   : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 1 1 4 3 3 4 2 1 3 3 ...
##  $ OMCALL  : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 3 2 2 2 2 4 5 3 2 5 ...
##  $ OTMIN   : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 2 1 3 4 2 3 2 1 2 3 ...
##  $ OTCALL  : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 1 4 5 5 1 3 1 1 2 3 ...
##  $ NGMIN   : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 3 1 3 4 5 2 5 3 4 2 ...
##  $ NGCALL  : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 2 2 5 5 4 1 3 2 2 1 ...
##  $ IMIN    : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 3 2 5 3 2 1 5 3 1 2 ...
##  $ ICALL   : Ord.factor w/ 5 levels "Low"<"MedLow"<..: 4 2 5 4 2 1 5 3 1 2 ...
##  $ CUSCALL : Ord.factor w/ 4 levels "0"<"1"<"2"<"More than 2": 2 1 2 2 2 2 2 4 1 1 ...
##  $ CHURN   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...

Build the model

Now that we have the data organized the way we want it, we can let CHAID do its thing and tell us what we want to know: what combination of our 13 predictor variables best explains or predicts why our customers are leaving us (or, as I mentioned before, which customers are most likely to stay). The way it does that is devilishly simple and elegant. For all 13 predictors it runs a \(\chi^2\) test of independence (a.k.a. association) between the predictor and our outcome churn. Unlike some other tree models it can make a multi-way split. It computes a p value for all these possible splits (see the note above about ordinal versus nominal splits) and chooses the split, out of all possible splits, with the smallest Bonferroni-adjusted p value. That is the simplest explanation; please see the ?chaid help page for complete details.
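To make that concrete, here is a rough illustration (a sketch of the idea only, not the CHAID internals, which additionally consider merging adjacent categories): a plain \(\chi^2\) test of each predictor against CHURN at the root node, with a Bonferroni-style correction for the number of tests run.

# Chi-square test of each predictor against the outcome at the root node
predictors <- setdiff(names(churn), "CHURN")
root_pvals <- sapply(predictors, function(x) {
  chisq.test(table(churn[[x]], churn$CHURN))$p.value
})
# the predictors with the smallest adjusted p values are the strongest split candidates
head(sort(p.adjust(root_pvals, method = "bonferroni")))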

Just to break up the long passages of text and because I’m a visual learner let’s make a mosaic plot using ggmosaic of CUSCALL vs CHURN. We’ll even do a little magic to display the percentage of churn within the categories of CUSCALL. chaid will test not only this 4 x 2 table but also see if combining any adjacent categories is a better split.

p <- ggplot(data = churn) +
  geom_mosaic(aes(x = product(CUSCALL), fill = CHURN))

xxx <- ggplot_build(p)$data[[1]]
XXX <- xxx %>%
  group_by_at(vars(ends_with("__CUSCALL"))) %>%
  mutate(NN = sum(.wt)) %>%
  mutate(pct = paste0(round(.wt/NN*100, 1), "%")) %>%
  select(-(xmin:weight))

p + geom_text(data = xxx,
              aes(x = (xmin + xmax)/2,
                  y = (ymin + ymax)/2,
                  label = XXX$pct)) +
  labs(y = NULL,
       x = "Number of Customer Calls",
       title = "Amount of Churn by # of Customer Calls") +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1.0),
                     breaks = seq(from = 0,
                                  to = 1,
                                  by = 0.10),
                     minor_breaks = seq(from = 0.05,
                                        to = 0.95,
                                        by = 0.10))

Clearly, visually, something is going on but we’ll let the algorithm decide what the best splits are. It will keep on going until it runs out of significant splits or some other criteria we set such as the size of the bins or the number of levels we want in the model.

Since we’re pretending this is our first time through the data, and since we want clear, easy to understand, recommendations to give our business leaders to act on, we’re going to tell chaid to limit itself to just three “levels” of prediction. If you want a better understanding of what you can change to control the chaid model I suggest you go back to one of my earlier posts here.

Let’s build a “solution” and put it in an object called solution and then print and plot it using the built-in methods from partykit.

solution <- CHAID::chaid(CHURN ~ .,
                         data = churn,
                         control = chaid_control(maxheight = 3))
print(solution)
## 
## Model formula:
## CHURN ~ ACLENGTH + INTPLAN + DATAPLAN + DATAGB + OMMIN + OMCALL + 
##     OTMIN + OTCALL + NGMIN + NGCALL + IMIN + ICALL + CUSCALL
## 
## Fitted party:
## [1] root
## |   [2] INTPLAN in No
## |   |   [3] CUSCALL in 0
## |   |   |   [4] OMMIN in Low: No (n = 206, err = 3.4%)
## |   |   |   [5] OMMIN in MedLow, Medium, MedHigh: No (n = 593, err = 7.4%)
## |   |   |   [6] OMMIN in High: No (n = 179, err = 15.6%)
## |   |   [7] CUSCALL in 1
## |   |   |   [8] OMMIN in Low, MedLow: No (n = 577, err = 6.4%)
## |   |   |   [9] OMMIN in Medium, MedHigh: No (n = 597, err = 10.9%)
## |   |   |   [10] OMMIN in High: No (n = 330, err = 19.1%)
## |   |   [11] CUSCALL in 2
## |   |   |   [12] OMMIN in Low, MedLow: No (n = 466, err = 8.4%)
## |   |   |   [13] OMMIN in Medium, MedHigh, High: No (n = 679, err = 19.0%)
## |   |   [14] CUSCALL in More than 2
## |   |   |   [15] OMMIN in Low, MedLow, Medium: No (n = 516, err = 20.0%)
## |   |   |   [16] OMMIN in MedHigh, High: No (n = 334, err = 36.8%)
## |   [17] INTPLAN in Yes
## |   |   [18] OMMIN in Low, MedLow
## |   |   |   [19] OMCALL in Low, MedLow, Medium: No (n = 186, err = 46.2%)
## |   |   |   [20] OMCALL in MedHigh, High: Yes (n = 32, err = 21.9%)
## |   |   [21] OMMIN in Medium, MedHigh, High
## |   |   |   [22] IMIN in Low, MedLow, Medium: No (n = 33, err = 45.5%)
## |   |   |   [23] IMIN in MedHigh, High: Yes (n = 272, err = 25.0%)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 14
plot(solution,
     main = "churn dataset, maxheight = 3",
     gp = gpar(
       lty = "solid",
       lwd = 2,
       fontsize = 8
     ))

With an overall churn rate of about 20% we can see that even this simple chaid model gives us a much clearer picture of where our risks and opportunities lie. (A reminder that you can review how to interpret the print and plot output here.) We can identify hundreds of customers who pose little risk of changing carriers on the left side of the plot and conversely several high risk hotspots towards the right with the prediction for churn approaching 80%!

Makes sense at this point to check the accuracy of our model. If we were primarily interested in squeezing out as much accuracy as possible we might follow the approach I demonstrate here, but for a very simple first step I’m happy to sacrifice some detail and accept the risk of over-fitting in exchange for the simplicity of this design and the ease with which we can make suggestions for changing business practice.

caret::confusionMatrix(predict(solution), churn$CHURN)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3957  739
##        Yes   75  229
##                                           
##                Accuracy : 0.8372          
##                  95% CI : (0.8267, 0.8473)
##     No Information Rate : 0.8064          
##     P-Value [Acc > NIR] : 1.013e-08       
##                                           
##                   Kappa : 0.2948          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9814          
##             Specificity : 0.2366          
##          Pos Pred Value : 0.8426          
##          Neg Pred Value : 0.7533          
##              Prevalence : 0.8064          
##          Detection Rate : 0.7914          
##    Detection Prevalence : 0.9392          
##       Balanced Accuracy : 0.6090          
##                                           
##        'Positive' Class : No              
## 

As with the original blog post we can clearly tell our management team that additional focus is needed on our clients who have international calling plans.

More with what we have

I have to give partykit credit, the print and plot methods pack a lot of information into some efficient space. But they also leave me wanting for information, especially with regards to the “inner nodes”. To be fair, part of the problem is that chaid is a bit old and doesn’t take maximum advantage of partykit but regardless I’m a big fan of getting back an object I can do more analysis on.

That’s what I told myself when I started writing a function I called chaid_table(). It takes our solution object and converts it to a tibble that is chock full of information about our analysis/model.

A quick look at the first 5 rows should give you an idea of what’s in there. Hopefully nodeID, parent, NodeN, No, and Yes are obvious. Note that “no” and “yes” are actually pulled from the levels of the outcome variable so it will match your data. “ruletext” is a plain English summary of the complete rule to arrive at this node. “split.variable” is the variable that will be used to split the current node and produce child nodes.

“chisq” is obvious, “df” is the degrees of freedom, “adjustedp” is the p value after the Bonferroni correction, while “rawpvalue” is the uncorrected value. The split rule columns are the R code that would produce the split.

review_me <- CGPfunctions::chaid_table(solution)
kable(review_me[1:5, ])
nodeID | parent | NodeN | No | Yes | ruletext | split.variable | chisq | df | adjustedp | rawpvalue | splitrule | split1 | split2 | split3
1 | NA | 5000 | 4032 | 968 | NA | INTPLAN | 715.70973 | 1 | 0.0000000 | 0.00e+00 | NA | NA | NA | NA
2 | 1 | 4477 | 3839 | 638 | INTPLAN is 'No' | CUSCALL | 149.78455 | 3 | 0.0000000 | 0.00e+00 | INTPLAN %in% c('No') | INTPLAN %in% c('No') | NA | NA
3 | 2 | 978 | 899 | 79 | INTPLAN is 'No' & CUSCALL is '0' | OMMIN | 20.21648 | 2 | 0.0002445 | 4.07e-05 | INTPLAN %in% c('No') & CUSCALL %in% c('0') | INTPLAN %in% c('No') | CUSCALL %in% c('0') | NA
4 | 3 | 206 | 199 | 7 | INTPLAN is 'No' & CUSCALL is '0' & OMMIN is 'Low' | NA | NA | NA | NA | NA | INTPLAN %in% c('No') & CUSCALL %in% c('0') & OMMIN %in% c('Low') | INTPLAN %in% c('No') | CUSCALL %in% c('0') | OMMIN %in% c('Low')
5 | 3 | 593 | 549 | 44 | INTPLAN is 'No' & CUSCALL is '0' & OMMIN is 'MedLow', 'Medium', 'MedHigh' | NA | NA | NA | NA | NA | INTPLAN %in% c('No') & CUSCALL %in% c('0') & OMMIN %in% c('MedLow', 'Medium', 'MedHigh') | INTPLAN %in% c('No') | CUSCALL %in% c('0') | OMMIN %in% c('MedLow', 'Medium', 'MedHigh')

The best part, IMHO, about this tibble format is that you can ask the data lots of additional questions in a dplyr pipeline. Here are two obvious ones:

  1. What percentage of customers are leaving if they have an international plan versus don’t? (14% versus 63%)

  2. Can you provide an ordered list of where our churns are most likely to occur? (Of course – this allows us to make good business decisions. For example, while node #20 has the most churn at 78%, there are only 30-some people in that node, while #23 has slightly less churn and a lot more people to influence.)

# Question #1
review_me %>%
  select(nodeID:ruletext) %>%
  mutate(pctLeaving = Yes/NodeN * 100) %>%
  filter(parent == 1) %>%
  kable(digits = 1, caption = "Question #1 answer")
Question #1 answer

nodeID | parent | NodeN | No   | Yes | ruletext          | pctLeaving
2      | 1      | 4477  | 3839 | 638 | INTPLAN is 'No'   | 14.3
17     | 1      | 523   | 193  | 330 | INTPLAN is 'Yes'  | 63.1
# Question #2
review_me %>%
  select(nodeID:split.variable) %>%
  mutate(pctLeaving = Yes/NodeN * 100) %>%
  filter(is.na(split.variable)) %>%
  select(-parent, -split.variable) %>%
  arrange(desc(pctLeaving)) %>%
  kable(digits = 1, caption = "Question #2 answer")
Question #2 answer

nodeID | NodeN | No  | Yes | ruletext | pctLeaving
20 | 32  | 7   | 25  | INTPLAN is 'Yes' & OMMIN is 'Low', 'MedLow' & OMCALL is 'MedHigh', 'High' | 78.1
23 | 272 | 68  | 204 | INTPLAN is 'Yes' & OMMIN is 'Medium', 'MedHigh', 'High' & IMIN is 'MedHigh', 'High' | 75.0
19 | 186 | 100 | 86  | INTPLAN is 'Yes' & OMMIN is 'Low', 'MedLow' & OMCALL is 'Low', 'MedLow', 'Medium' | 46.2
22 | 33  | 18  | 15  | INTPLAN is 'Yes' & OMMIN is 'Medium', 'MedHigh', 'High' & IMIN is 'Low', 'MedLow', 'Medium' | 45.5
16 | 334 | 211 | 123 | INTPLAN is 'No' & CUSCALL is 'More than 2' & OMMIN is 'MedHigh', 'High' | 36.8
15 | 516 | 413 | 103 | INTPLAN is 'No' & CUSCALL is 'More than 2' & OMMIN is 'Low', 'MedLow', 'Medium' | 20.0
10 | 330 | 267 | 63  | INTPLAN is 'No' & CUSCALL is '1' & OMMIN is 'High' | 19.1
13 | 679 | 550 | 129 | INTPLAN is 'No' & CUSCALL is '2' & OMMIN is 'Medium', 'MedHigh', 'High' | 19.0
6  | 179 | 151 | 28  | INTPLAN is 'No' & CUSCALL is '0' & OMMIN is 'High' | 15.6
9  | 597 | 532 | 65  | INTPLAN is 'No' & CUSCALL is '1' & OMMIN is 'Medium', 'MedHigh' | 10.9
12 | 466 | 427 | 39  | INTPLAN is 'No' & CUSCALL is '2' & OMMIN is 'Low', 'MedLow' | 8.4
5  | 593 | 549 | 44  | INTPLAN is 'No' & CUSCALL is '0' & OMMIN is 'MedLow', 'Medium', 'MedHigh' | 7.4
8  | 577 | 540 | 37  | INTPLAN is 'No' & CUSCALL is '1' & OMMIN is 'Low', 'MedLow' | 6.4
4  | 206 | 199 | 7   | INTPLAN is 'No' & CUSCALL is '0' & OMMIN is 'Low' | 3.4

Those are just a sample of what you can do with the data in a tibble. Feel free to experiment.

A plot is worth a thousand words

I’m a huge fan of displaying data graphically any time it can be done. I find it helps to drive home your messaging from data science projects. The other new function I added to my package recently is PlotXTabs2() which is built to display bivariate crosstabs. I borrowed heavily from ggstatsplot to allow you to optionally include statistical information.

Let’s use it to display information about the relationship between churn and our clients’ international plan. First, some simple plots, just displaying different x and y axis orientations, with and without pipelining. There are too many options to talk about here, so please refer to the online pages. My next project is to bring mosaic plots to the function.

CGPfunctions::PlotXTabs2(churn,
                         CHURN,
                         INTPLAN,
                         bf.display = "sensible")

churn %>% CGPfunctions::PlotXTabs2(INTPLAN,
                                   CHURN,
                                   bf.display = "sensible")

The ability to pipe from the data to the graph is immensely useful for displaying practical data. The two final plots will focus on the second level of the model, where our customers have no international calling plan. Once again we’ll display the same data in reversed x/y order and in the second plot since there are more levels we’ll select a palette that is more friendly to users with potential sight challenges.

churn %>%
  filter(INTPLAN == "No") %>%
  CGPfunctions::PlotXTabs2(CUSCALL,
                           CHURN,
                           bf.display = "sensible")

churn %>%
  filter(INTPLAN == "No") %>%
  CGPfunctions::PlotXTabs2(CHURN,
                           CUSCALL,
                           bf.display = "sensible",
                           package = "ggthemes",
                           palette = "Color Blind")

A final look at why it matters

So I wanted to end this post by reinforcing why chaid can be a useful tool even if it isn’t the newest or coolest algorithm. The accuracy wasn’t quite as good as some other methods, but it is a fantastic tool for initial modelling: it is very robust and gives you answers that are easily explained to business leaders.

Done

Hope you enjoyed the post. Comments always welcomed. Especially please let me know if you actually use the tools and find them useful.

Chuck

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.


Faster R package installation


[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Faster package installation

Every few weeks or so, a tweet pops up asking about how to speed up package installation in R

Depending on the luck of twitter, the author may get a few suggestions.

The bigger picture is that package installation time is starting to become more of an issue for a number of reasons. For example, packages are getting larger and more complex (tidyverse and friends), so installation just takes longer. Or we are using more continuous integration strategies such as Travis or GitLab-CI, and want quick feedback. Or we are simply updating a large number of packages via update.packages(). This is a problem we often solve for our clients – optimising their CI/CD pipelines.

The purpose of this blog post is to pull together a few different methods for tackling this problem. If I’ve missed any, let me know (https://twitter.com/csgillespie)!

Faster installation with Ncpus

The first tactic you should use is the Ncpus argument in install.packages() and update.packages(). This installs packages in parallel. It doesn’t speed up an individual package install, but it does allow dependencies to install in parallel, e.g. the tidyverse. Using it is easy; it’s just an additional argument to install.packages(). So to use six cores, we would simply use

install.packages("tidyverse", Ncpus = 6)

When installing a fresh version of the tidyverse and all dependencies, this can give a two-fold speed-up.

Ncpus | Elapsed (Secs) | Ratio
1     | 409            | 2.26
2     | 224            | 1.24
4     | 196            | 1.08
6     | 181            | 1.00

Not bad for a simple tweak with no downsides. For further information, see our blog post from a few years ago.

In short, this is something you should definitely use and add to your .Rprofile, as shown below. It would in theory also speed up continuous integration pipelines, but only if you have multiple cores available. The free version of Travis only comes with a single core, but if you hook up a multi-core Kubernetes cluster to your CI (we sometimes do this at Jumping Rivers), then you can achieve a large speed-up.
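For example, a line like the following in your ~/.Rprofile makes the parallel behaviour the default for every install.packages() and update.packages() call (the exact core count is up to you and your machine):

# ~/.Rprofile -- pick a core count suited to your machine
options(Ncpus = 4)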

Faster installation with ccache

If you are installing packages from source, i.e. tar.gz files, then most of the installation time is spent on compiling source code, such as C, C++ & Fortran. A few years ago, Dirk Eddelbuettel wrote a great blog post on leveraging the ccache utility for reducing the compile time step. Essentially, ccache stores the resulting object file created when compiling. If that file is ever compiled again, instead of rebuilding, ccache returns the object code, resulting in a significant speed up. It’s the classic trade-off between memory (caching) and CPU.

Dirk’s post gives clear details on how to implement ccache (so I won’t repeat them). He also compares re-installation times of packages, with RQuantLib going from 500 seconds to a few seconds. However, for ccache to be effective, the source files have to be static. Obviously, when you update an R package, things change!

As an experiment, I downloaded the last seventeen versions of dplyr from CRAN. This takes us back to version 0.5.0 from 2016. Next I installed each version in turn, via

# Avoid tidyverse packages, as we are messing about with dplyr
f = list.files("data", full.names = TRUE)
elapsed = numeric(length(f))
for (i in seq_along(f)) {
  elapsed[i] = system.time(install.packages(f[i], repos = NULL))["elapsed"]
}

As all package dependencies have been installed and the source code has already been downloaded, the above code times the installation of just dplyr. If we then implement ccache, we can easily rerun the above code. After a little manipulation we can plot the absolute installation times.

The first (slightly obvious) takeaway is that there is no speed-up with dplyr v0.5.0. This is simply because ccache relies on previous installations. As v0.5.0 is the first version in our study, there is no difference between standard and ccache installations.

Over the seventeen versions of dplyr, we achieved a 24-fold speed-up for three versions, and a more modest two- to four-fold speed-up for a further three versions. Averaged over all seventeen versions, a typical speed-up is around 50%.

Overall, using ccache is a very effective and easy strategy. It requires a single, simple set-up, and doesn’t require root access. Of course it doesn’t always work, but it never really slows anything down.

At the start of this section, I mentioned the trade off between memory and CPU. I’ve been using ccache since 2017, and the current cache size is around 6GB. Which on a modern hard drive isn’t much (and I install a lot of packages)!

Using Ubuntu Binaries

On Linux, the standard way of installing packages is via source and install.packages(). However, it is also possible to install packages using binary packages. This has two main benefits

  • It’s faster – typically a few seconds
  • It (usually) solves any horrible dependency problems by installing the necessary dev-libraries.

If you are using continuous integration, such as GitLab runners, then this is a straightforward step to reduce the package installation time. The key idea is to add an additional binary source to your sources.list file; see, for example, the line in rocker. After that, you can install most CRAN packages via

sudo apt install r-cran-dplyr

The one big downside here is that the user requires root access to install an R package, so this solution isn’t suitable in all situations.

There’s lots of documentation available, both on CRAN and in blog posts, so I won’t repeat it by adding more here.

Using RStudio Package Manager

The RStudio Package Manager is one of RStudio’s Pro products that is used to ultimately pay for their open source work, e.g. the RStudio desktop IDE and all of their tidyverse R packages.

CRAN mirrors have for a long time distributed binary packages for Windows and Mac. The RSPM provides precompiled binaries for CRAN packages for

  • Ubuntu 16.04 (Xenial), Ubuntu 18.04 (Bionic)
  • CentOS/RHEL 7, CentOS/RHEL 8
  • openSUSE 42/SLES 12, openSUSE 15/SLES 15
  • Windows (soon, currently in beta)

The big advantage of RSPM over the Ubuntu binaries solution above is that root access is no longer necessary. Users can just install packages via the usual install.packages().
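In practice that just means pointing your repos option at the Package Manager URL for your distribution (the URL below is only an illustrative placeholder for the public instance; the address of your own RSPM server will differ):

# Point install.packages() at an RSPM binary repository (placeholder URL)
options(repos = c(RSPM = "https://packagemanager.rstudio.com/all/__linux__/bionic/latest"))
install.packages("dplyr")  # resolves to a pre-built binary on supported distributions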


Jumping Rivers are full service, RStudio certified partners. Part of our role is to offer support in RStudio Pro products. If you use any RStudio Pro products, feel free to contact us (info@jumpingrivers.com). We may be able to offer free support.

The post Faster R package installation appeared first on Jumping Rivers.


To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.


Seed germination: fitting hydro-time models with R


[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I am locked at home, due to the COVID-19 emergency in Italy. Luckily I am healthy, but there is not much to do, inside. I thought it might be nice to spend some time to talk about seed germination models and the connections with survival analysis.

We all know that seeds need water to germinate. Indeed, the absorption of water activates the hydrolytic enzymes, which break down food resources stored in seeds and provide energy for germination. As a consequence, there is a very close relationship between water content in the substrate and germination velocity: the higher the water content, the quicker the germination, as long as the availability of oxygen does not become a problem (well, water and oxygen in soil may compete for space, and a high water content may result in oxygen shortage).

Indeed, it is relevant to build germination models, linking the proportion of germinated seeds to water availability in the substrate; these models are usually known as hydro-time (HT) models. The starting point is the famous equation of Bradford (1992), where the germination rate (GR) for the \(i-th\) seed in the lot is expressed as a linear function of water potential in the substrate (\(\Psi\)):

\[ GR_i = \textrm{max} \left( \frac{\Psi - \Psi_{b(i)}}{\theta_H}; 0 \right) \quad \quad \quad \quad (1)\]

In that equation, \(\Psi_{b(i)}\) is the base water potential for the \(i-th\) seed and \(\theta_H\) is the hydro-time constant, expressed as MPa day or MPa hour. The concept is relatively simple: we just need to remember that the water can only move from a position with a higher water potential to a position with a lower water potential. Therefore, a seed cannot germinate when its base water potential is higher than the water potential in the substrate.

When \(\Psi > \Psi_b(i)\), the germination rate of the \(i-th\) seed is linearly related to \(\Psi\): the higher this latter value, the higher the germination rate. Now we should consider that the germination rate is the inverse of the germination time (\(GR = 1/t\)); thus, the higher the GR, the shortest the germination time. Germination is achieved at the time \(t\) when:

\[ t \, \left( \Psi - \Psi_{b(i)} \right) = \theta_H \quad \quad \quad (2)\]
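As a tiny illustration of Equation 1 (a sketch only; the function name and arguments below are illustrative and not part of drcSeedGerm), the germination rates for a batch of seeds with different base water potentials can be computed as:

# Equation 1, vectorised over seeds: GR is zero when psi <= psi_b
GR <- function(psi, psi_b, theta_H) pmax((psi - psi_b) / theta_H, 0)
GR(psi = -0.5, psi_b = c(-1.2, -0.8, -0.4), theta_H = 0.7)
# the seed with a base water potential above the substrate's -0.5 MPa does not germinate (GR = 0)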

Elsewhere in this website, I show that Equation 1 can be fitted to germination data in a two-steps fashion. In this page we will see how we can embed Equation 1 into a germination model, to predict the proportion of germinated seeds, depending on time and water content in the substrate. As usual, let’s start from a practical example.


The dataset

The germination of rapeseed (Brassica napus L. var. oleifera, cv. Excalibur) was tested at thirteen different water potentials (-0.03, -0.15, -0.3, -0.4, -0.5, -0.6, -0.7, -0.8, -0.9, -1, -1.1, -1.2, -1.5 MPa), which were created by using a polyethylene glycol solution (PEG 6000). For each water potential level, three replicated Petri dishes with 50 seeds were incubated at 20°C. Germinated seeds were counted and removed every 2-3 days for 14 days.

The dataset was published by Pace et al. (2012). It is available as rape in the drcSeedGerm package, which needs to be installed from github (see below). The following code loads the necessary packages, loads the dataset rape and shows the first six lines.

# library(devtools)
# install_github("OnofriAndreaPG/drcSeedGerm")
library(drc)
library(drcSeedGerm)
library(lmtest)
library(sandwich)
data(rape)
head(rape)
##   Psi Dish timeBef timeAf nSeeds nCum propCum
## 1   0    1       0      3     49   49    0.98
## 2   0    1       3      4      0   49    0.98
## 3   0    1       4      5      0   49    0.98
## 4   0    1       5      7      0   49    0.98
## 5   0    1       7     10      0   49    0.98
## 6   0    1      10     14      0   49    0.98

We can see that the data are grouped by assessment interval: 'timeAf' represents the moment when germinated seeds were counted, while 'timeBef' represents the previous inspection time (or the beginning of the assay). The column 'nSeeds' is the number of seeds that germinated during the time interval between 'timeBef' and 'timeAf'. The 'propCum' column contains the cumulative proportions of germinated seeds and is not necessary for time-to-event models. The 'drcSeedGerm' package contains some service functions which might help prepare the dataset in this form (see the documentation for the functions 'makeDrm()' and 'makeDrm2()').
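As a quick check of the data structure (a small sketch that is not part of the original analysis; it uses {dplyr}, which comes with the tidyverse), we can count how many seeds are accounted for in each Petri dish across all assessment intervals:

library(dplyr)
rape %>%
  group_by(Psi, Dish) %>%
  summarise(nTotal = sum(nSeeds)) %>%
  head()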


Building hydro-time models

Models based on the distribution of germination time

How can we rework Equation 1 to predict the proportion of germinated seeds as a function of time and water potential? One line of attack follows the proposal we made in a relatively recent paper (Onofri et al., 2018). We started from the idea that the time course of the proportion of germinated seeds (\(P\)) is expected to increase over time according to an S-shaped curve, such as the usual log-logistic cumulative probability function (other cumulative distribution functions can be used; see our original paper):

\[P(t) = \frac{ P_{MAX} }{1 + \exp \left\{ - b \left[ \log(t) - \log(t_{50} ) \right] \right\} } \quad \quad \quad (3)\]

where \(t_{50}\) is the median germination time, \(b\) is the slope at the inflection point and \(P_{MAX}\) is the maximum germinated proportion. Considering that the germination rate is the inverse of germination time, we can write:

\[P(t) = \frac{ P_{MAX} }{1 + \exp \left\{ - b \left[ \log(t) - \log(1 / GR_{50} ) \right] \right\} } \quad \quad \quad (4)\]

where \(GR_{50}\) is the median germination rate in the population.

We can now express \(GR_{50}\), \(b\) and \(P_{MAX}\) as linear/nonlinear functions of \(\Psi\) (temperature and other environmental variables can be included as well; see our original paper). For \(GR_{50}\), we used Equation 1 above. For \(P_{MAX}\), we used a shifted exponential distribution, which implies that the germination capability is fully determined by the distribution of base water potential within the population and that no germination occurs at \(\Psi \leq \Psi_b\):

\[P_{MAX} = h_1(\Psi ) = \textrm{max} \left\{ G \, \left[ 1 - \exp \left( - \frac{ \Psi - \Psi_b }{\sigma_{\Psi_b}} \right) \right]; 0 \right\} \quad \quad \quad (5)\]

In the above equation, \(\sigma_{\Psi_b}\) represents the variability of \(\Psi_b\) within the population, which determines the steepness of the increase in \(P_{MAX}\) as \(\Psi\) increases. \(G\) is the germinable fraction, accounting for the fact that \(P_{MAX}\) may not reach 1, regardless of time and water potential.

The parameter \(b\) was assumed to be constant and independent of \(\Psi\). In the end, our hydro-time model is composed of four sub-models:

  1. a cumulative probability function (log-logistic, in our example), based on the three parameters \(P_{MAX}\), \(b\) and \(GR_{50}\);
  2. a sub-model expressing \(P_{MAX}\) as a function of \(\Psi\);
  3. a sub-model expressing \(GR_{50}\) as a function of \(\Psi\);
  4. a sub-model expressing \(b\) as a function of \(\Psi\), although this was indeed a simple identity model, \(b(\Psi) = b\).

The equation is:

\[P(t) = \frac{ h_1(\Psi) }{1 + \exp \left\{ - b \left[ \log(t) - \log \left( 1 / GR_{50}(\Psi) \right) \right] \right\} } \quad \quad \quad (6)\]

This hydro-time model was implemented in R as the HTE1() function, which is available within the drcSeedGerm package, together with the appropriate self-starting routine. It can be fitted by using the drm() function in the drc package. Please note that the argument type has to be set to "event".

modHTE <- drm(nSeeds ~ timeBef + timeAf + Psi,
              data = rape, fct = HTE1(), type = "event")
summary(modHTE)
## 
## Model fitted: Hydro-time model with shifted exponential for Pmax and linear model for GR50
## 
## Parameter estimates:
## 
##                         Estimate Std. Error  t-value   p-value    
## G:(Intercept)          0.9577943  0.0063663  150.448 < 2.2e-16 ***
## Psib:(Intercept)      -1.0397178  0.0047014 -221.152 < 2.2e-16 ***
## sigmaPsib:(Intercept)  0.1108836  0.0087593   12.659 < 2.2e-16 ***
## thetaH:(Intercept)     0.9060853  0.0301585   30.044 < 2.2e-16 ***
## b:(Intercept)          4.0272972  0.1960877   20.538 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As seeds are clustered in Petri dishes, in order not to violate the independence assumption, it is preferable to obtain cluster-robust standard errors. One possibility is to use the grouped version of the sandwich estimator, as available in the sandwich package (Berger et al., 2017). The coeftest() function is available in the lmtest package (Zeileis and Hothorn, 2002):

coeftest(modHTE, vcov = vcovCL, cluster = rape$Dish)
## 
## t test of coefficients:
## 
##                         Estimate Std. Error   t value  Pr(>|t|)    
## G:(Intercept)          0.9577943  0.0080918  118.3661 < 2.2e-16 ***
## Psib:(Intercept)      -1.0397178  0.0047067 -220.9003 < 2.2e-16 ***
## sigmaPsib:(Intercept)  0.1108836  0.0121872    9.0983 < 2.2e-16 ***
## thetaH:(Intercept)     0.9060853  0.0410450   22.0754 < 2.2e-16 ***
## b:(Intercept)          4.0272972  0.1934579   20.8174 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An alternative way to obtain cluster robust standard errors is to use the delete-a-group jackknife technique, which I described in one of my previous papers (Onofri et al., 2014). This is available in the jackGroupSE() function in the drcSeedGerm package. It takes quite a bit of computing time, so you may need to be patient, especially if you have a lot of Petri dishes.

jack <- jackGroupSE(modHTE, data = rape, cluster = rape$Dish)

Once the model is fitted, we may be interested in using the fitted curve to retrieve some biologically relevant information. For example, it may be interesting to retrieve the germination rates for some selected percentiles (e.g., the 30th, 20th and 10th percentiles). This is possible using the GRate() function, which is a wrapper for the original ED() function in the drc package. It reverses the behaviour of the ED() function, in the sense that it considers, by default, the percentiles for the whole population, including the ungerminated fraction, which is, in our opinion, the most widespread interpretation of germination rates in seed science. The GRate() function works very much like the ED() function, although additional variables, such as the selected \(\Psi\) level, can be specified by using the argument x2.

# Naive standard errors
GRate(modHTE, x2 = -1, respLev = c(30, 20, 10))
##           Estimate         SE
## GR:1:30 0.00000000 0.00000000
## GR:1:20 0.03578644 0.01515519
## GR:1:10 0.05129734 0.01546193

In this example, we see that the \(GR_{30}\) cannot be calculated, as the germination capacity did not reach 30% at the selected water potential level (\(-1 \,\, MPa\)).

As we said, cluster-robust standard errors are recommended. The GRate() function allows entering a user-defined variance-covariance matrix, which can be obtained by using the vcovCL() function in the sandwich package. If necessary, germination times can be obtained in a similar way, by using the GTime() function.

# Cluster robust standard errors
sc <- vcovCL(modHTE, cluster = rape$Dish)
GRate(modHTE, x2 = -1, respLev = c(30, 20, 10), vcov. = sc)
##           Estimate          SE
## GR:1:30 0.00000000 0.000000000
## GR:1:20 0.03578644 0.005452517
## GR:1:10 0.05129734 0.005870701

# Germination times
GTime(modHTE, x2 = -1, respLev = c(30, 20, 10), vcov. = sc)
##        Estimate       SE
## T:1:30      Inf       NA
## T:1:20 27.94355 4.257553
## T:1:10 19.49419 2.231004

Last, but not least, we can predict the proportion of germinated seeds at a given time and water potential level.

predictSG(modHTE, se.fit = T, vcov. = vcovCL,
          newdata = data.frame(Time = c(10, 10, 10),
                               Psi = c(-1.5, -0.75, 0))
          )
##      Prediction         SE
## [1,]  0.0000000 0.00000000
## [2,]  0.8794104 0.03907374
## [3,]  0.9576615 0.01493018
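To see how the sub-models combine, we can reproduce the second of these predictions 'by hand', coding Equations 1, 5 and 6 directly with the estimated coefficients (a minimal sketch, not the internal code of predictSG(); the coefficients are extracted positionally, in the order shown in the summary above):

cf <- coef(modHTE)   # order as in the summary: G, Psib, sigmaPsib, thetaH, b
G <- cf[1]; Psib <- cf[2]; sigmaPsib <- cf[3]; thetaH <- cf[4]; b <- cf[5]
Psi <- -0.75; t <- 10
Pmax <- max(G * (1 - exp(-(Psi - Psib)/sigmaPsib)), 0)   # Equation 5
GR50 <- max((Psi - Psib)/thetaH, 0)                      # Equation 1
Pmax / (1 + exp(-b * (log(t) - log(1/GR50))))            # Equation 6: roughly 0.879, as above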

Models based on the distribution of \(\Psi_b\)

Another type of hydro-time model was proposed by Bradford (2002) and later extended by Mesgaran et al. (2013). This approach always starts from Equation 1; from that equation, considering that the germination time is the inverse of the GR, we can easily get to the following equation:

\[\Psi_b = \Psi - \frac{\theta_H}{t} \quad \quad \quad (7)\]

where \(t\) is the germination time. What does this equation tell us? Let's assume that the hydro-time to germination is 10 \(MPa \, d\) and the environmental water potential is -1 \(MPa\). A single seed germinates in exactly one day if its base water potential is \(-1 - 10/1 = -11\). If the base water potential is higher, germination will take more than one day; if it is lower, germination will take less than one day. But now the following questions arise: how many seeds in a population will be able to germinate in one day? And in two days? And in \(t\) days?

We know that the seeds within a population do not all germinate at the same moment, as they have different individual values of base water potential. If the population is big enough, we can describe the variation of \(\Psi_b\) within the population by using some density function, possibly parameterised by way of a location (\(\mu\)) and a scale (\(\sigma\)) parameter:

\[ \Psi_b \sim \phi \left( \frac{\Psi_b - \mu}{\sigma} \right) \quad \quad \quad (8)\]

This is easier to understand if we make a specific example. Let’s assume that the distribution of \(\Psi_b\) values within the population is gaussian, with mean \(\mu = -9\) and standard deviation \(\sigma = 1\). Let’s also assume that the hydro-time parameter (\(\theta_H\)) is constant within the population. We have the situation depicted in the figure below.

The red left tail represents the proportion of seeds that germinate during the first day, as they have base water potentials equal to or lower than -11. By using the gaussian cumulative distribution function we can easily see that this proportion is roughly 0.023:

pnorm(-1 - 10/1, mean = -9, sd = 1)
## [1] 0.02275013

More generally, we can write:

\[ G(t, \Psi) = \Phi \left\{ \frac{\Psi - (\theta_H / t) - \mu }{\sigma} \right\} \quad \quad \quad (9)\]

where \(\Phi\) is the selected cumulative distribution function. The above model returns the proportion of germinated seeds (G) as a function of time and water potential in the substrate. According to Bradford (2002), \(\Phi\) is the gaussian cumulative distribution function.

Let's think more deeply about Equation 9 (Bradford, 2002). This function was built to represent the cumulative distribution function of base water potential within the population. However, it can just as well be taken to represent the cumulative distribution function of germination time within the population. Obviously, while the first distribution is gaussian, the second one is not: indeed, the germination time appears in the denominator of the expression \(\theta_H / t\). It doesn't matter: every cumulative distribution function for germination time can be fitted by using time-to-event methods!
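As a small illustration (a sketch using the same made-up values as in the gaussian example above: \(\theta_H = 10\), \(\mu = -9\), \(\sigma = 1\)), Equation 9 can be coded directly and evaluated over a few germination times:

# Equation 9 with a gaussian Phi, using the illustrative values from above
G_psi_t <- function(t, psi, thetaH = 10, mu = -9, sigma = 1) {
  pnorm((psi - thetaH/t - mu)/sigma)
}
G_psi_t(t = c(1, 2, 5, 10), psi = -1)
## roughly 0.023, 0.999, 1 and 1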

We implemented this model in R as the function HTnorm(), which is available within the drcSeedGerm package and is meant to be used with the drm() function in the drc package.

Mesgaran et al. (2013) suggested that the distribution of base water potential within the population may not be gaussian and proposed several alternatives, which we have all implemented within the package. In all, drcSeedGerm contains six possible distributions:

  1. gaussian distribution (function HTnorm())
  2. logistic distribution (function HTL())
  3. Gumbel (function HTG())
  4. log-logistic (function HTLL())
  5. Weibull (Type I) (function HTW1())
  6. Weibull (Type II) (function HTW2())

The equations are given at the end of this page; for gaussian, logistic and log-logistic distributions, \(\Psi_{b(50)}\) is the median base water potential within the population. For the gaussian distribution, \(\sigma_{\Psi b}\) corresponds to the standard deviation of \(\Psi_b\) within the population.

Distributions based on logarithms (the log-logistic and all the distributions listed thereafter) are only defined for positive values. On the contrary, we know that the base water potential is mostly negative. Therefore, shifted distributions need to be used, by introducing a shifting parameter \(\delta\) which 'moves' the distribution to the left along the x-axis, so that negative values become possible (see Mesgaran et al., 2013).

Let's fit the above functions to the 'rape' dataset. But first, let me highlight that providing starting values is not necessary, as self-starting routines are already implemented for all models.

library(drc)
mod1 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTnorm(), type = "event")
mod2 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTL(), type = "event")
mod3 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTG(), type = "event")
mod4 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTLL(), type = "event")
mod5 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTW1(), type = "event")
mod6 <- drm(nSeeds ~ timeBef + timeAf + Psi, data = rape,
            fct = HTW2(), type = "event")

What is the best model for this dataset? Let's use Akaike's Information Criterion (AIC) to decide:

AIC(mod1, mod2, mod3, mod4, mod5, mod6, modHTE)
##         df      AIC
## mod1   291 3516.914
## mod2   291 3300.824
## mod3   291 3097.775
## mod4   290 2886.609
## mod5   290 2889.307
## mod6   290 2998.915
## modHTE 289 2832.481

The first model, modHTE, explicitly considers the distribution of germination times and is the best fitting of all. The other models explicitly consider the distribution of base water potential, while the distribution of germination times is only indirectly included. Among these models, the gaussian is the worst fitting, while the log-logistic is the best one (mod4).
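If you prefer to look at AIC differences rather than raw AIC values, a couple of extra lines are enough (a small add-on, not part of the original code):

aicTab <- AIC(mod1, mod2, mod3, mod4, mod5, mod6, modHTE)
aicTab$deltaAIC <- aicTab$AIC - min(aicTab$AIC)   # difference from the best-fitting model
aicTab[order(aicTab$deltaAIC), ]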

For this latter model, let's take a look at the values of the estimated parameters. Cluster-robust standard errors can be obtained as before, by way of the sandwich estimator or a fully iterated delete-a-group jackknife estimator.

sand <- coeftest(mod4, vcov = vcovCL, cluster = rape$Dish)
# jack <- jackGroupSE(mod4, data = rape, cluster = rape$Dish)
sand
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value Pr(>|t|)    
## thetaH:(Intercept)  0.677472   0.072902   9.2930  < 2e-16 ***
## delta:(Intercept)   1.136963   0.102440  11.0988  < 2e-16 ***
## Psib50:(Intercept) -0.948101   0.020341 -46.6097  < 2e-16 ***
## sigma:(Intercept)   0.372172   0.173360   2.1468  0.03264 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# jack

Germination rates and times for a certain percentile (e.g., GR50, GR30) can be obtained by using the GRate() and GTime() functions in drcSeedGerm. Again, the use of cluster-robust standard errors is highly recommended.

GRate(mod4, respLev = c(30, 50, 70), x2 = 0, vcov. = vcovCL)
##         Estimate        SE
## GR:1:30 1.474866 0.1796681
## GR:1:50 1.399469 0.1648390
## GR:1:70 1.296120 0.1561949

GTime(mod4, respLev = c(30, 50, 70), x2 = 0, vcov. = vcovCL)
##         Estimate         SE
## T:1:30 0.6780277 0.08259731
## T:1:50 0.7145569 0.08416538
## T:1:70 0.7715337 0.09297725

We can also make predictions about the germinated proportion for a certain time and water potential level. The code below returns the predicted proportions of germinated seeds after ten days at -1.5, -0.75, and 0 MPa.

predictSG(mod4, se.fit = T, vcov. = vcovCL,
          newdata = data.frame(Time = c(10, 10, 10),
                               Psi = c(-1.5, -0.75, 0))
          )
##        Prediction           SE
## [1,] 6.658526e-15 1.124604e-13
## [2,] 8.038034e-01 5.594629e-02
## [3,] 9.906059e-01 9.465305e-03

Let’s use the predictSG() function to plot the ‘modHTE’ and ‘mod4’ objects together in the same graph.
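The code for that figure is not shown in this extract; the following is a minimal sketch of how such a plot could be built with predictSG() and ggplot2 (the grid of times and water potentials is arbitrary, and the original figure may have been drawn differently):

library(ggplot2)
newd <- expand.grid(Time = seq(0.25, 14, by = 0.25),
                    Psi = c(-1, -0.5, 0))
newd$HTE1 <- predictSG(modHTE, se.fit = TRUE, vcov. = vcovCL, newdata = newd)[, 1]
newd$HTLL <- predictSG(mod4,   se.fit = TRUE, vcov. = vcovCL, newdata = newd)[, 1]

ggplot(newd, aes(x = Time)) +
  geom_line(aes(y = HTE1, linetype = "HTE1 (modHTE)")) +
  geom_line(aes(y = HTLL, linetype = "HTLL (mod4)")) +
  facet_wrap(~ Psi) +
  labs(y = "Proportion of germinated seeds", linetype = "Model")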

Thanks for reading!

Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Borgo XX Giugno 74
I-06121 Perugia


References

  1. Berger, S., Graham, N., Zeileis, A., 2017. Various versatile variances: An object-oriented implementation of clustered covariances in R. Faculty of Economics and Statistics, University of Innsbruck, Innsbruck.
  2. Bradford, K.J., 2002. Applications of hydrothermal time to quantifying and modeling seed germination and dormancy. Weed Science 50, 248–260.
  3. Mesgaran, M.B., Mashhadi, H.R., Alizadeh, H., Hunt, J., Young, K.R., Cousens, R.D., 2013. Importance of distribution function selection for hydrothermal time models of seed germination. Weed Research 53, 89–101. https://doi.org/10.1111/wre.12008
  4. Onofri, A., Benincasa, P., Mesgaran, M.B., Ritz, C., 2018. Hydrothermal-time-to-event models for seed germination. European Journal of Agronomy 101, 129–139.
  5. Onofri, A., Mesgaran, M.B., Neve, P., Cousens, R.D., 2014. Experimental design and parameter estimation for threshold models in seed germination. Weed Research 54, 425–435. https://doi.org/10.1111/wre.12095
  6. Pace, R., Benincasa, P., Ghanem, M.E., Quinet, M., Lutts, S., 2012. Germination of untreated and primed seeds in rapeseed (Brassica napus var. oleifera Del.) under salinity and low matric potential. Experimental Agriculture 48, 238–251.
  7. Ritz, C., Jensen, S.M., Gerhard, D., Streibig, J.C., 2019. Dose-Response Analysis Using R. CRC Press.
  8. Zeileis, A., Hothorn, T., 2002. Diagnostic Checking in Regression Relationships. R News 2(3), 7–10. URL: https://CRAN.R-project.org/doc/Rnews/

Some further detail

Let us conclude this page by giving some detail on all equations.

The equation for the model HTnorm() was already given above (Equation 9). Here we show all the other equations, as implemented in our package.

HTL()

\[ G(t, \Psi) = \frac{1}{1 + \exp \left[ - \frac{ \Psi - \left( \theta_H/t \right) - \Psi_{b(50)} } {\sigma} \right] }\]

HTG()

\[ G(t, \Psi) = \exp \left\{ - \exp \left[ - \left( \frac{\Psi - (\theta_H / t ) - \mu }{\sigma } \right) \right] \right\} \]

HTLL()

\[ G(t, \Psi) = \frac{1}{1 + \exp \left\{ - \frac{ \log \left[ \Psi - \left( \frac{\theta_H}{t} \right) + \delta \right] - \log(\Psi_{b(50)} + \delta) }{\sigma}\right\} }\]

HTW1()

\[ G(t, \Psi) = \exp \left\{ - \exp \left[ - \frac{ \log \left[ \Psi - \left( \frac{\theta_H}{t} \right) + \delta \right] - \log(\Psi_{b(50)} + \delta) }{\sigma}\right] \right\}\]

HTW2()

\[ G(t, \Psi) = 1 - \exp \left\{ - \exp \left[ \frac{ \log \left[ \Psi - \left( \frac{\theta_H}{t} \right) + \delta \right] - \log(\Psi_{b(50)} + \delta) }{\sigma}\right] \right\}\]


To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.


How to create a simple Coronavirus dashboard specific to your country in R


[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Coronavirus dashboard: the case of Belgium


Introduction

The Novel COVID-19 Coronavirus is the hottest topic right now. Every day, the media and newspapers share the number of new cases and deaths in several countries, try to measure the impacts of the virus on citizens and remind us to stay home in order to stay safe. The Coronavirus is on everyone’s lips.

In addition to governments, media and companies discussing it, data scientists and data professionals in general are putting their knowledge and time at the service of the fight against the virus. This leads to a proliferation of applications, dashboards, blog posts, videos, datasets and code analyzing, in one way or another, the expansion of COVID-19 and how it spreads in the population.

Top R resources on Coronavirus

As a data lover myself, I discovered a multitude of great resources about the Coronavirus. However, these resources were spread all over the internet and were often hidden by the mass of information of another type (e.g., alarming headlines, names of infected celebrities, companies showing off how they helped health care agencies, etc.). To tackle this issue, I collected and then shared in a previous article the best R resources on the Coronavirus I came across.

Note that only resources on R are shared in this article as R is my favorite statistical program and the one I am most familiar with. The fact that I use this program almost daily makes it easier for me to realize the complexity and time put behind a resource, and appreciate its quality and its potential. I am sure that there are other very interesting resources online (see for example the probably most popular dashboard by the Johns Hopkins Coronavirus Resource Center).1 Nonetheless, a lot of people are in a better position than I am to judge the quality of resources made with programming languages for which I am not competent.

This article made me discover so many great resources about the Coronavirus and I keep receiving data visualizations and data analyses from scientists all over the world so that I include them in the collection. Thanks for that, it continuously improves the quality and completeness of the collection.

In addition to receiving R resources, a question often asked by readers was “How can I create a dashboard myself?” or “How can I build a dashboard specific to my country?”. I therefore thought it would serve some people if I created a dashboard specific to my country (Belgium) and detailed the steps on how to build it.

Questions on how to develop such dashboards came mostly from R beginners as advanced R users most probably know how to do one, or at least can easily use the resources I collected here as sources of inspiration for their own work. Furthermore, in response to the craze about the Coronavirus, interested users were quite in a hurry and wanted to have their own dashboard running as quickly as possible.

These questions led me to the idea of creating a simple (yet powerful and visually appealing) dashboard, as opposed to a Shiny app. Shiny apps have the advantage that they are interactive in the sense that users can edit the outputs and the visualizations by simply changing some inputs in a user-friendly way, while dashboards are static and cannot be modified by the final user. On the other hand, the advantage of a dashboard over a Shiny app is that it is much easier to code, especially if you are already proficient in R Markdown.

Coronavirus dashboard: the case of Belgium

Amongst all the visualizations I have seen so far, one stands out by its simplicity and, at the same time, by its completeness and the quality of its visualizations. I thus decided to create a Coronavirus dashboard based on this already existing dashboard by Rami Krispin (which comes with a license that allows it to be freely adapted and shared) and adapt it so that it is specific to Belgium. Note that I also removed some visualizations and tables from the initial dashboard to keep it really simple and straight to the point.

Before reading further, here is my Coronavirus dashboard adapted to Belgium and previews of the main sections below:

The dashboard is segmented into several sections that can be selected at the top:

  • The summary section provides key measures about the Coronavirus (total cases, active cases and deaths) and a plot displaying the cumulative number of active cases and deaths from January 22, 2020 to the latest available date.
  • The comparison section presents a comparison of the number of daily new cases (left panel) and the distribution of cases by type (right panel) with other European countries (you can also change these countries by replacing them in the code).
  • The map section shows a world map of the confirmed cases, recovered cases and deaths. You can uncheck one or several types of cases (top right corner) and zoom in or out (top left corner) to adapt the map to your needs.
  • The about section gives more information about the data, the dashboard in general and how often it is updated and some contact details in case someone finds a bug or a mistake.

I believe this simple dashboard is easy enough to be adapted to any country (and by anyone from beginner to expert), while still communicating key measures about the virus through a few visualizations. A little extra which is worth mentioning is the fact that all plots are generated with the {plotly} package. This package allows you to enhance plots by displaying additional relevant information when hovering over them (try it yourself!).
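As a tiny illustration of what {plotly} adds (this snippet is not taken from the dashboard code), any ggplot2 chart can be made interactive with a single call to ggplotly():

library(ggplot2)
library(plotly)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
ggplotly(p)  # hovering over the points now displays the underlying values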

How to create your own Coronavirus dashboard

If you want to build your own dashboard specific to a country, follow these steps:

  1. Open the dashboard here.
  2. See the entire code via the button “Source code” located in the top right corner of the dashboard, or see the code on GitHub. Copy that code.
  3. Open a new R Markdown file (.Rmd), type any title and author (they will be replaced in the next step anyway), select HTML as the output format and click on OK.
  4. Remove all the template code already present and paste the code you copied in step 2.
  5. Make sure that the required packages are installed:

install.packages(c("devtools", "flexdashboard"))
devtools::install_github("RamiKrispin/coronavirus")

  6. In the code, replace Belgium with your country. Change also the title and the author at the top of the document, and edit the about section at the bottom of the document.
  7. Knit the document (see this article if you are unfamiliar with R Markdown). Your dashboard should appear in HTML format.

Following these 7 steps, you should already have a simple dashboard specific to your country. I have intentionally kept it simple so that everyone could copy it and have their own dashboard in a limited amount of time.

If you are familiar with the {flexdashboard}, {plotly} and {leaflet} packages for the dashboard interface and the visualizations, and the {dplyr} and {tidyr} packages for the data manipulation, feel free to edit the code according to your needs and improve your dashboard.

Additional notes

Data

The input data for this dashboard is the dataset available from the {coronavirus} R package. Make sure to download the development version of the package to have the latest data:

install.packages("devtools")devtools::install_github("RamiKrispin/coronavirus")

You can update the data with the update_datasets() function.

The raw data is pulled from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository.
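If you want to explore the raw data before knitting the dashboard, a short per-country summary is easy to obtain (a sketch only; the column names Country.Region, type and cases reflect the version of the {coronavirus} package available at the time of writing and may differ in later versions):

library(coronavirus)
library(dplyr)
data(coronavirus)

coronavirus %>%
  filter(Country.Region == "Belgium") %>%   # replace with your country
  group_by(type) %>%
  summarise(total = sum(cases))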

Open source

This dashboard and the code available on GitHub are open source so feel free to copy it, adapt it and share it as much as you want.

Accuracy

Please note that this dashboard has been built mainly for educational purposes. I update the dashboard as often as possible to keep it accurate. However, there is some uncertainty concerning the COVID-19 case numbers and the testing methods vary between countries so the figures on this dashboard may be slightly different compared to other sources. Currently, the maintainer of the dataset updates it on a daily basis, but updates may become less frequent in the future.

Thanks for reading. I hope this article helped you to build your first Coronavirus dashboard in R. See these top R resources on Coronavirus if you need inspiration to further enhance your dashboard.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

Related articles:


  1. I would be glad to mention a collection of Python resources if someone is willing to create a collection of resources about the Coronavirus made with this programming language. Feel free to contact me if this is the case.↩


To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.



Tidying the Johns Hopkins Covid-19 data


[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

My guess is that by now everybody knows that the public Github repository maintained by the Johns Hopkins University Center for Systems Science and Engineering has developed into a standard resource for individuals interested in analyzing the spread of SARS-CoV-2. There are alternative resources and blog articles covering them. Also, this blog post features a nice collection of R projects related to the Coronavirus.

The main reason for me sharing this code regardless is that I did not find code that merges standardized country-level identifiers to the data in a semi-automatic way. These identifiers are important whenever you want to merge additional country-level data for further analyses, like, e.g., population data to calculate per-capita measures. Also, I thought that the steps presented below are a nice small case study on how to obtain, tidy and merge country-level data from public sources.

Pulling and tidying the Johns Hopkins Covid-19 data to long format

So, without much ado, here is the code for pulling and tidying the data.

library(tidyverse)library(lubridate)library(rvest)library(stringdist)# Function to read the raw CSV files. The files are aggregated to the country# level and then converted to long formatclean_jhd_to_long <- function(df) {  df_str <- deparse(substitute(df))  var_str <- substr(df_str, 1, str_length(df_str) - 4)    df %>% group_by(`Country/Region`) %>%    filter(`Country/Region` != "Cruise Ship") %>%    select(-`Province/State`, -Lat, -Long) %>%    mutate_at(vars(-group_cols()), sum) %>%     distinct() %>%    ungroup() %>%    rename(country = `Country/Region`) %>%    pivot_longer(      -country,       names_to = "date_str",       values_to = var_str    ) %>%    mutate(date = mdy(date_str)) %>%    select(country, date, !! sym(var_str)) }confirmed_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")deaths_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv")recovered_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv")jh_covid19_data <- clean_jhd_to_long(confirmed_raw) %>%  full_join(clean_jhd_to_long(deaths_raw)) %>%  full_join(clean_jhd_to_long(recovered_raw))# Next, I pull official country level indicators from the UN Statstics Division# to get country level identifiers.jhd_countries <- tibble(country = unique(jh_covid19_data$country)) %>% arrange(country)ctry_ids <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%  html_table()un_m49 <- ctry_ids[[1]]colnames(un_m49) <- c("country", "un_m49", "iso3c")# Merging by country name is messy. I start with a fuzzy matching approach# using the {stringdist} packagectry_names_dist <- matrix(NA, nrow = nrow(jhd_countries), ncol = nrow(un_m49))for(i in 1:length(jhd_countries$country)) {  for(j in 1:length(un_m49$country)) {     ctry_names_dist[i,j]<-stringdist(tolower(jhd_countries$country[i]),                                 tolower(un_m49$country[j]))        }  }min_ctry_name_dist <- apply(ctry_names_dist, 1, min)matched_ctry_names <- NULLfor(i in 1:nrow(jhd_countries)) {  un_m49_row <- match(min_ctry_name_dist[i], ctry_names_dist[i,])  if (length(which(ctry_names_dist[i,] %in% min_ctry_name_dist[i])) > 1) un_m49_row <- NA  matched_ctry_names <- rbind(matched_ctry_names,                         tibble(                            jhd_countries_row = i,                            un_m49_row = un_m49_row,                           jhd_ctry_name = jhd_countries$country[i],                            un_m49_name = ifelse(is.na(un_m49_row), NA,                                                 un_m49$country[un_m49_row])                         ))}# This matches most cases well but some cases need to be adjusted by hand.# In addition there are two jurisdictions (Kosovo, Taiwan)# that cannot be matched as they are no 'country' as far as the U.N.# Statistics Devision is concerned.# WATCH OUT: The data from JHU is subject to change without notice.# New countries are being added and names/spelling might change. 
# Also, in the long run, the data provided by the UNSD might change.# Inspect 'matched_ctry_names' before using the data.matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Bolivia"] <- 27matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Brunei"] <- 35matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Congo (Brazzaville)"] <- 54matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Congo (Kinshasa)"] <- 64matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "East Timor"] <- 222matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Iran"] <- 109matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Korea, South"] <- 180matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Kosovo"] <- NAmatched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Moldova"] <- 181matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Russia"] <- 184matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Taiwan*"] <- NAmatched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Tanzania"] <- 236matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "United Kingdom"] <- 235matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "US"] <- 238matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Venezuela"] <- 243# Last Step: Match country identifier data and save file (commented out here)jhd_countries %>%   left_join(matched_ctry_names %>%               select(jhd_ctry_name, un_m49_row),             by = c(country = "jhd_ctry_name")) %>%  left_join(un_m49 %>% mutate(un_m49_row = row_number()), by = "un_m49_row") %>%  rename(country = country.x) %>%  select(country, iso3c)  -> jhd_countriesjh_covid19_data <- jh_covid19_data %>% left_join(jhd_countries) %>%  select(country, iso3c, date, confirmed, deaths, recovered)  # write_csv(jh_covid19_data, sprintf("jh_covid19_data_%s.csv", Sys.Date()))

The code essentially follows these steps:

  • Read the relevant CSV files for confirmed cases, casualties and recovered patients from the Github repository
  • Aggregate the data at country level and discard data that is not required
  • Scrape official country identifiers from the U.N. Statistics Division
  • Fuzzy match these to the countries present in the JH data. Apply manual corrections that were correct as of the data pulled on March 23, 2020 (check the match when you use this code at a later date)
  • Merge the identifiers with the longitudinal data and save the result as a tidy CSV file.

Merging some World Bank data

The next code snippet pulls some World Bank data using the {wbstats} package.

library(tidyverse)library(wbstats)pull_worldbank_data <- function(vars) {  new_cache <- wbcache()  all_vars <- as.character(unique(new_cache$indicators$indicatorID))  data_wide <- wb(indicator = vars, mrv = 10, return_wide = TRUE)  new_cache$indicators[new_cache$indicators[,"indicatorID"] %in% vars, ] %>%    rename(var_name = indicatorID) %>%    mutate(var_def = paste(indicator, "\nNote:",                           indicatorDesc, "\nSource:", sourceOrg)) %>%    select(var_name, var_def) -> wb_data_def  new_cache$countries %>%    select(iso3c, iso2c, country, region, income) -> ctries  left_join(data_wide, ctries, by = "iso3c") %>%    rename(year = date,           iso2c = iso2c.y,           country = country.y) %>%    select(iso3c, iso2c, country, region, income, everything()) %>%    select(-iso2c.x, -country.x) %>%    filter(!is.na(NY.GDP.PCAP.KD),           region != "Aggregates") -> wb_data  wb_data$year <- as.numeric(wb_data$year)  wb_data_def<- left_join(data.frame(var_name = names(wb_data),                                     stringsAsFactors = FALSE),                          wb_data_def, by = "var_name")  wb_data_def$var_def[1:6] <- c(    "Three letter ISO country code as used by World Bank",    "Two letter ISO country code as used by World Bank",    "Country name as used by World Bank",    "World Bank regional country classification",    "World Bank income group classification",    "Calendar year of observation"  )  wb_data_def$type = c("cs_id", rep("factor",  4), "ts_id",                       rep("numeric", ncol(wb_data) - 6))  return(list(wb_data, wb_data_def))}vars <- c("SP.POP.TOTL", "AG.LND.TOTL.K2", "EN.POP.DNST", "EN.URB.LCTY", "SP.DYN.LE00.IN", "NY.GDP.PCAP.KD")wb_list <- pull_worldbank_data(vars)wb_data <- wb_list[[1]]wb_data_def <- wb_list[[2]]wb_data %>%  group_by(iso3c) %>%  arrange(iso3c, year) %>%  summarise(    population = last(na.omit(SP.POP.TOTL)),    land_area_skm = last(na.omit(AG.LND.TOTL.K2)),    pop_density = last(na.omit(EN.POP.DNST)),    pop_largest_city = last(na.omit(EN.URB.LCTY)),    gdp_capita = last(na.omit(NY.GDP.PCAP.KD)),    life_expectancy = last(na.omit(SP.DYN.LE00.IN))  ) %>% left_join(wb_data %>% select(iso3c, region, income) %>% distinct()) -> wb_cs# write_csv(wb_cs, "jh_add_wbank_data.csv")

Use the data

And finally, some code to use the data for typical event time visualizations.

suppressPackageStartupMessages({  library(tidyverse)  library(lubridate)  library(gghighlight)  library(ggrepel)})dta <- read_csv(  "https://joachim-gassen.github.io/data/jh_covid19_data_2020-03-23.csv",  col_types = cols()) %>%  mutate(date = ymd(date))wb_cs <- read_csv(  "https://joachim-gassen.github.io/data/jh_add_wbank_data.csv",   col_types = cols())# I define event time zero where, for a given country, the confirmed# cases match or exceed the Chinese case number at the beginning of the# data so that all countries can be compared across event time.# Also a require each country to have at least 7 days post event day 0dta %>%   group_by(country) %>%  filter(confirmed >= min(dta$confirmed[dta$country == "China"])) %>%  summarise(edate_confirmed = min(date)) -> edates_confirmeddta %>%   left_join(edates_confirmed, by = "country") %>%  mutate(    edate_confirmed = as.numeric(date - edate_confirmed)  ) %>%  filter(edate_confirmed >= 0) %>%  group_by(country) %>%  filter (n() >= 7) %>%   ungroup() %>%  left_join(wb_cs, by = "iso3c") %>%   mutate(    confirmed_1e5pop = 1e5*confirmed/population  ) -> dflab_notes <- paste0(  "Data as provided by Johns Hopkins University Center for Systems Science ",   "and Engineering (JHU CSSE)\nand obtained on March 23, 2020. ",  "The sample is limited to countries with at least seven days of positive\n",   "event days data. Code and walk-through: https://joachim-gassen.github.io.")lab_x_axis_confirmed <- sprintf(paste(  "Days since confirmed cases matched or exceeded\n",   "initial value reported for China (%d cases)\n"), min(dta$confirmed[dta$country == "China"]))gg_my_blob <- list(  scale_y_continuous(trans='log10', labels = scales::comma),    theme_minimal(),   theme(    plot.title.position = "plot",     plot.caption.position =  "plot",    plot.caption = element_text(hjust = 0),    axis.title.x = element_text(hjust = 1),    axis.title.y = element_text(hjust = 1),  ),  labs(caption = lab_notes,       x = lab_x_axis_confirmed,       y = "Confirmed cases (logarithmic scale)"),  gghighlight(TRUE,  label_key = country, use_direct_label = TRUE,              label_params = list(segment.color = NA, nudge_x = 1)))ggplot(df %>% filter (edate_confirmed <= 30),        aes(x = edate_confirmed, color = country, y = confirmed)) +  geom_line() +  labs(    title = "Focus on the first month: Confirmed Cases\n"  ) +  gg_my_blob

ggplot(df %>% filter (edate_confirmed <= 30),        aes(x = edate_confirmed, color = country, y = confirmed_1e5pop)) +  geom_line() +  gg_my_blob +  labs(    y = "Confirmed cases per 100,000 inhabitants",    title = "Cases relative to population\n"  ) 

Wrap-Up

This is it. I hope that somebody might find this useful. In any case, help #FlattenTheCurve and stay healthy, everybody!


To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.


Le Monde puzzle [#1134]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A weekly Le Monde current mathematical puzzle on gcd's and scm's:

If one replaces a pair (a,b) of integers with the pair (g,s) of their greatest common divisor and smallest common multiple, how long at most before the sequence ends? Same question when considering a collection of five integers where two are selected and replaced by the pair (g,s) of their greatest common divisor and smallest common multiple.

The first question is straightforward, as s is a multiple of g. So the sequence ends at most after one run. For five integers, a run of a brute-force R search returns 9 as “the” solution (even though the true maximum is 10, as illustrated by the quintuplet (16,24,36,54,81)):

ogcd <- function(x,y){
  r <- x %% y
  return(ifelse(r, ogcd(y,r), y))}

oscm <- function(x,y) x*y/ogcd(x,y)

divemul <- function(a,b) return(c(oscm(a,b), ogcd(a,b)))

sol = 0  # added: keeps track of the longest sequence found so far
for (t in 1:1e5){
  ini = sample(1:1e2, 5)
  i = 0; per = ker = sample(ini, 2)
  nez = divemul(per[1], per[2])
  while (!max(nez %in% per)){
    ini = c(ini[!ini %in% per], nez)
    per = sample(ini, 2)
    ker = rbind(ker, per)
    nez = divemul(per[1], per[2])
    i = i+1}
  sol = max(sol, i)}
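As a quick sanity check of the first claim, re-applying the transformation to the pair it produces leaves that pair unchanged (reusing the helpers defined above on an arbitrary pair):

per <- c(12, 18)
nez <- divemul(per[1], per[2])   # (scm, gcd) = (36, 6)
divemul(nez[1], nez[2])          # applying the step again returns (36, 6): the sequence has ended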

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


What to study if you’re under quarantine


[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

r courses and books

If you’re staying indoors more often recently because of the current COVID-19 outbreak and looking for new things to study, here’s a few ideas!

Free 365 Data Science Courses

365 Data Science is making all of their courses free until April 15. They have a variety of courses across R, Python, SQL, and more. Their platform also has courses that give a great mathematical foundation behind machine learning, which helps you a lot as you get deeper into data science. They also have courses on deep learning, which is a hot field right now.

In addition to pure data science, 365 Data Science also covers material on Git / GitHub, which is essential for any data scientist nowadays.

Another nice feature of 365 Data Science is that they also offer courses related to business intelligence tools like Tableau, thus giving a well-rounded grounding in data science.

Recommended R books

It’s never a bad time to learn more about R by reading books!

For a few suggestions (with reviews) check out this link.

Here’s some additional recommendations for R, Python, and Linux: Recommended Open Source Books.

Learn a new package

R has so many awesome packages that it’s great to spend time delving into one to learn its features. There’s the ever-popular ones like dplyr or data.table, but you can also check out ones such as MLmetrics for evaluating machine learning models, or tabulizer for extracting data from PDF files.

That’s it for now! Keep up with my latest posts by following this blog on Twitter!

The post What to study if you’re under quarantine appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.


Visualizing a Markov Chain

A Markov Chain describes a sequence of states where the probability of transitioning between states depends only on the current state. Markov chains are useful in a variety of computer science, mathematics, and probability contexts, also featuring prominently in Bayesian computation as Markov Chain Monte Carlo. Here, we're going to look at a relatively simple breed of Markov chain and build up some intuition using simulations and animations (two of my favorite things). The R code for simulating the Markov chain and creating the visualizations is at the bottom of the post.

Example

Imagine a school yard with 100 children. During recess, children can either be on the soccer field or the playground. So there are two possible states for a child to be in at any given time: (1) soccer field or (2) playground. At the beginning of recess, approximately half of the children will start on the soccer field and the other half will start on the playground. But as time goes on some children will switch from the soccer field to the playground and vice versa. Let's say we know from previous experience that at any given moment, if a child is on the soccer field they have a .10 probability (10%) of going to the playground. In contrast, if a child is on the playground they have a .05 probability (5%) of going to the soccer field. Thus, the child has a .90 and .95 probability, respectively, of staying in their current state. This can be represented pictorially. The circles (nodes) represent the two states and the arrows (edges) represent the probabilities of moving to the other state. In this fairly simple setup, we'll see ways of exploring how this system evolves over time.

Formal Definitions

Some terminology here: we're describing a finite state Markov chain. This means that there is a discrete (countable) set of possible states to be in. For example, a child can be on the soccer field OR the playground and we ignore in-between states. We also say that the Markov chain is memoryless, which means that the process is only dependent on the current state of the chain. Each child's probability of being, say, on the playground depends only on where they currently are. Let's express these statements formally:

\[\text{Pr}(X_{n+1}=x|X_n=x_n)\]

We read this as "the probability that the next state (\(n+1\)) of the random variable \(X\) is a particular value, \(x\), depends on the current state of the random variable, \(X_n=x_n\)". We can also express this process as a series of matrix operations. Don't worry if matrices aren't your cup of tea; in this example the matrices are pretty small and the math isn't very complicated. We start by defining a transition matrix, \(T\):

\[T=\begin{bmatrix} 0.90&0.05 \\ 0.10&0.95 \\ \end{bmatrix}\]

The elements in \(T\) correspond to the probabilities we defined earlier. For example, \(T_{1,1} = .90\) is the probability of staying on the soccer field and \(T_{2,1} = .10\) is the probability of moving from the soccer field to the playground (note that these sum to 1). Column 2 of \(T\) likewise corresponds to the probabilities of moving vs. staying when the child is on the playground.

Computing the Chain

The question we want to answer is: at the end of recess, how many children will be on the soccer field vs. the playground? Let's start by looking at just one moment - what happens after a single iteration of the process? We don't need to specify a time interval for each iteration (but in this example we can imagine that one iteration is ten seconds or so), we just need to be clear that the transition probabilities apply at each iteration. Here's our equation for the first iteration:

\[X_{n=1}=\begin{bmatrix} 0.90&0.05 \\ 0.10&0.95 \\ \end{bmatrix}\begin{bmatrix} 0.50\\ 0.50\\ \end{bmatrix}\]

We have our transition matrix, \(T\), from before and we're going to matrix-multiply it with a vector containing the proportion of children present in each state (remember we said that at the beginning of recess half of the children would be on the soccer field and half on the playground). Multiplying this out we get:

\[X_{n=1}=\begin{bmatrix} 0.90&0.05 \\ 0.10&0.95 \\ \end{bmatrix}\begin{bmatrix} 0.50\\ 0.50\\ \end{bmatrix} = \begin{bmatrix} 0.90\cdot0.50+0.05\cdot0.50 \\ 0.10\cdot0.50+0.95\cdot0.50 \\ \end{bmatrix} =\begin{bmatrix} 0.475\\ 0.525\\ \end{bmatrix}\]

So after one iteration we expect roughly 48 children (rounding up from \(.475\cdot100\)) to be on the soccer field and 52 children to be on the playground. If we want to see how the process unfolds over another iteration, we simply plug in the previous result and do the operation again:

\[X_{n=2}=\begin{bmatrix} 0.90&0.05 \\ 0.10&0.95 \\ \end{bmatrix}\begin{bmatrix} 0.475\\ 0.525\\ \end{bmatrix} = \begin{bmatrix} 0.90\cdot0.475+0.05\cdot0.525 \\ 0.10\cdot0.475+0.95\cdot0.525 \\ \end{bmatrix} =\begin{bmatrix} 0.45\\ 0.55\\ \end{bmatrix}\]

Running the Simulation

So, analytically, this is our expected result over a couple of iterations. But we want to know how the Markov process unfolds over many iterations. We could continue doing this analytically using
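The post is cut off at this point in this extract. As a minimal sketch of the computation described above (using the transition matrix and starting proportions from the example; this is not the author's original simulation code), the iterations can be carried out with repeated matrix multiplication:

# Transition matrix and initial proportions from the example above
T_mat <- matrix(c(0.90, 0.10,
                  0.05, 0.95), nrow = 2)   # column 1: from soccer field, column 2: from playground
x <- c(0.5, 0.5)                           # half on the field, half on the playground at the start

n_iter <- 50
res <- matrix(NA, nrow = n_iter + 1, ncol = 2,
              dimnames = list(NULL, c("soccer_field", "playground")))
res[1, ] <- x
for (i in seq_len(n_iter)) {
  x <- T_mat %*% x          # one iteration of the chain
  res[i + 1, ] <- x
}
head(round(res, 3))          # first rows: 0.500/0.500, 0.475/0.525, 0.454/0.546, ...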

Tidying the new Johns Hopkins Covid-19 time-series datasets


[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Just hours after my old blog post about tidying the Johns Hopkins CSSE Covid-19 data, the team has changed their time-series table data structure. The data of the old post is still available but won't be updated. This new blog post is based on the new time-series data structure. Currently, they only seem to provide time-series data on confirmed cases and deaths.

The main reason for me sharing this code is that I did not find code that merges standardized country-level identifiers to the data in a semi-automatic way. These identifiers are important whenever you want to merge additional country-level data for further analyses, like, e.g., population data to calculate per-capita measures. Also, I thought that the steps presented below are a nice small case study on how to obtain, tidy and merge country-level data from public sources.

Pulling and tidying the Johns Hopkins Covid-19 data to long format

So, without much ado, here is the code for pulling and tidying the data.

library(tidyverse)library(lubridate)library(rvest)library(stringdist)# Function to read the raw CSV files. The files are aggregated to the country# level and then converted to long formatclean_jhd_to_long <- function(df) {  df_str <- deparse(substitute(df))  var_str <- substr(df_str, 1, str_length(df_str) - 4)    df %>% group_by(`Country/Region`) %>%    filter(`Country/Region` != "Cruise Ship") %>%    select(-`Province/State`, -Lat, -Long) %>%    mutate_at(vars(-group_cols()), sum) %>%     distinct() %>%    ungroup() %>%    rename(country = `Country/Region`) %>%    pivot_longer(      -country,       names_to = "date_str",       values_to = var_str    ) %>%    mutate(date = mdy(date_str)) %>%    select(country, date, !! sym(var_str)) }confirmed_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")deaths_raw <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")jh_covid19_data <- clean_jhd_to_long(confirmed_raw) %>%  full_join(clean_jhd_to_long(deaths_raw))# Next, I pull official country level indicators from the UN Statstics Division# to get country level identifiers.jhd_countries <- tibble(country = unique(jh_covid19_data$country)) %>% arrange(country)ctry_ids <- read_html("https://unstats.un.org/unsd/methodology/m49/") %>%  html_table()un_m49 <- ctry_ids[[1]]colnames(un_m49) <- c("country", "un_m49", "iso3c")# Merging by country name is messy. I start with a fuzzy matching approach# using the {stringdist} packagectry_names_dist <- matrix(NA, nrow = nrow(jhd_countries), ncol = nrow(un_m49))for(i in 1:length(jhd_countries$country)) {  for(j in 1:length(un_m49$country)) {     ctry_names_dist[i,j]<-stringdist(tolower(jhd_countries$country[i]),                                 tolower(un_m49$country[j]))        }  }min_ctry_name_dist <- apply(ctry_names_dist, 1, min)matched_ctry_names <- NULLfor(i in 1:nrow(jhd_countries)) {  un_m49_row <- match(min_ctry_name_dist[i], ctry_names_dist[i,])  if (length(which(ctry_names_dist[i,] %in% min_ctry_name_dist[i])) > 1) un_m49_row <- NA  matched_ctry_names <- rbind(matched_ctry_names,                         tibble(                            jhd_countries_row = i,                            un_m49_row = un_m49_row,                           jhd_ctry_name = jhd_countries$country[i],                            un_m49_name = ifelse(is.na(un_m49_row), NA,                                                 un_m49$country[un_m49_row])                         ))}# This matches most cases well but some cases need to be adjusted by hand.# In addition there are two jurisdictions (Kosovo, Taiwan)# that cannot be matched as they are no 'country' as far as the U.N.# Statistics Devision is concerned.# WATCH OUT: The data from JHU is subject to change without notice.# New countries are being added and names/spelling might change. 
# Also, in the long run, the data provided by the UNSD might change.# Inspect 'matched_ctry_names' before using the data.matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Bolivia"] <- 27matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Brunei"] <- 35matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Congo (Brazzaville)"] <- 54matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Congo (Kinshasa)"] <- 64matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "East Timor"] <- 222matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Iran"] <- 109matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Korea, South"] <- 180matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Kosovo"] <- NAmatched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Moldova"] <- 181matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Russia"] <- 184matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Taiwan*"] <- NAmatched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Tanzania"] <- 236matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "United Kingdom"] <- 235matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "US"] <- 238matched_ctry_names$un_m49_row[matched_ctry_names$jhd_ctry_name == "Venezuela"] <- 243# Last Step: Match country identifier data and save file (commented out here)jhd_countries %>%   left_join(matched_ctry_names %>%               select(jhd_ctry_name, un_m49_row),             by = c(country = "jhd_ctry_name")) %>%  left_join(un_m49 %>% mutate(un_m49_row = row_number()), by = "un_m49_row") %>%  rename(country = country.x) %>%  select(country, iso3c)  -> jhd_countriesjh_covid19_data <- jh_covid19_data %>% left_join(jhd_countries) %>%  select(country, iso3c, date, confirmed, deaths)  # write_csv(jh_covid19_data, sprintf("jh_covid19_data_%s.csv", Sys.Date()))

The code essentially performs the following steps:

  • Read the relevant CSV files for confirmed cases and deaths from the GitHub repository
  • Aggregate the data at the country level and discard data that is not required
  • Scrape official country identifiers from the U.N. Statistics Division
  • Fuzzy-match these to the countries present in the JHU data, then apply manual corrections that were valid for the data pulled on March 23, 2020 (re-check the match when you run this code later; a small check is sketched after this list)
  • Merge the identifiers with the longitudinal data and save the result as a tidy CSV file.
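
Since the comment in the code above asks you to inspect matched_ctry_names before trusting the manual corrections, here is a minimal sketch of such a check. It is my addition, not part of the original walk-through, and only assumes that the matching code above has been run: it lists the JHU country names whose UNSD match is still missing, which should be only Kosovo and Taiwan unless the JHU data has changed since March 2020.

# Minimal sketch (not from the original post): list JHU country names that
# remain unmatched after the manual corrections, so new or renamed countries
# in the JHU data are spotted before the merge.
matched_ctry_names %>%
  filter(is.na(un_m49_row)) %>%
  select(jhd_ctry_name)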

Merging some World Bank data

The next code snippet pulls some World Bank data using the {wbstats} package.

library(tidyverse)
library(wbstats)

pull_worldbank_data <- function(vars) {
  new_cache <- wbcache()
  all_vars <- as.character(unique(new_cache$indicators$indicatorID))
  data_wide <- wb(indicator = vars, mrv = 10, return_wide = TRUE)
  new_cache$indicators[new_cache$indicators[, "indicatorID"] %in% vars, ] %>%
    rename(var_name = indicatorID) %>%
    mutate(var_def = paste(indicator, "\nNote:",
                           indicatorDesc, "\nSource:", sourceOrg)) %>%
    select(var_name, var_def) -> wb_data_def
  new_cache$countries %>%
    select(iso3c, iso2c, country, region, income) -> ctries
  left_join(data_wide, ctries, by = "iso3c") %>%
    rename(year = date,
           iso2c = iso2c.y,
           country = country.y) %>%
    select(iso3c, iso2c, country, region, income, everything()) %>%
    select(-iso2c.x, -country.x) %>%
    filter(!is.na(NY.GDP.PCAP.KD),
           region != "Aggregates") -> wb_data
  wb_data$year <- as.numeric(wb_data$year)
  wb_data_def <- left_join(data.frame(var_name = names(wb_data),
                                      stringsAsFactors = FALSE),
                           wb_data_def, by = "var_name")
  wb_data_def$var_def[1:6] <- c(
    "Three letter ISO country code as used by World Bank",
    "Two letter ISO country code as used by World Bank",
    "Country name as used by World Bank",
    "World Bank regional country classification",
    "World Bank income group classification",
    "Calendar year of observation"
  )
  wb_data_def$type = c("cs_id", rep("factor", 4), "ts_id",
                       rep("numeric", ncol(wb_data) - 6))
  return(list(wb_data, wb_data_def))
}

vars <- c("SP.POP.TOTL", "AG.LND.TOTL.K2", "EN.POP.DNST", "EN.URB.LCTY",
          "SP.DYN.LE00.IN", "NY.GDP.PCAP.KD")

wb_list <- pull_worldbank_data(vars)
wb_data <- wb_list[[1]]
wb_data_def <- wb_list[[2]]

wb_data %>%
  group_by(iso3c) %>%
  arrange(iso3c, year) %>%
  summarise(
    population = last(na.omit(SP.POP.TOTL)),
    land_area_skm = last(na.omit(AG.LND.TOTL.K2)),
    pop_density = last(na.omit(EN.POP.DNST)),
    pop_largest_city = last(na.omit(EN.URB.LCTY)),
    gdp_capita = last(na.omit(NY.GDP.PCAP.KD)),
    life_expectancy = last(na.omit(SP.DYN.LE00.IN))
  ) %>%
  left_join(wb_data %>% select(iso3c, region, income) %>% distinct()) -> wb_cs

# write_csv(wb_cs, "jh_add_wbank_data.csv")
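
As a quick, optional sanity check that is my addition and not part of the original snippet, you can look at how the resulting cross-section wb_cs is distributed across World Bank regions and whether the key numeric variables came through:

# Optional sanity check (my addition): countries per World Bank region in wb_cs
wb_cs %>% count(region, sort = TRUE)

# ... and a quick look at the key numeric variables
summary(select(wb_cs, population, gdp_capita, life_expectancy))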

Use the data

And finally, some code to use the data for typical event time visualizations.

suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
  library(gghighlight)
  library(ggrepel)
})

dta <- read_csv(
  "https://joachim-gassen.github.io/data/jh_covid19_data_2020-03-24.csv",
  col_types = cols()
) %>%
  mutate(date = ymd(date))

wb_cs <- read_csv(
  "https://joachim-gassen.github.io/data/jh_add_wbank_data.csv",
  col_types = cols()
)

# I define event time zero where, for a given country, the confirmed
# cases match or exceed the Chinese case number at the beginning of the
# data so that all countries can be compared across event time.
# Also, I require each country to have at least 7 days post event day 0.

dta %>%
  group_by(country) %>%
  filter(confirmed >= min(dta$confirmed[dta$country == "China"])) %>%
  summarise(edate_confirmed = min(date)) -> edates_confirmed

dta %>%
  left_join(edates_confirmed, by = "country") %>%
  mutate(
    edate_confirmed = as.numeric(date - edate_confirmed)
  ) %>%
  filter(edate_confirmed >= 0) %>%
  group_by(country) %>%
  filter(n() >= 7) %>%
  ungroup() %>%
  left_join(wb_cs, by = "iso3c") %>%
  mutate(
    confirmed_1e5pop = 1e5 * confirmed / population
  ) -> df

lab_notes <- paste0(
  "Data as provided by Johns Hopkins University Center for Systems Science ",
  "and Engineering (JHU CSSE)\nand obtained on March 23, 2020. ",
  "The sample is limited to countries with at least seven days of positive\n",
  "event days data. Code and walk-through: https://joachim-gassen.github.io."
)

lab_x_axis_confirmed <- sprintf(paste(
  "Days since confirmed cases matched or exceeded\n",
  "initial value reported for China (%d cases)\n"
), min(dta$confirmed[dta$country == "China"]))

gg_my_blob <- list(
  scale_y_continuous(trans = 'log10', labels = scales::comma),
  theme_minimal(),
  theme(
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 0),
    axis.title.x = element_text(hjust = 1),
    axis.title.y = element_text(hjust = 1)
  ),
  labs(caption = lab_notes,
       x = lab_x_axis_confirmed,
       y = "Confirmed cases (logarithmic scale)"),
  gghighlight(TRUE, label_key = country, use_direct_label = TRUE,
              label_params = list(segment.color = NA, nudge_x = 1))
)

ggplot(df %>% filter(edate_confirmed <= 30),
       aes(x = edate_confirmed, color = country, y = confirmed)) +
  geom_line() +
  labs(
    title = "Focus on the first month: Confirmed Cases\n"
  ) +
  gg_my_blob

ggplot(df %>% filter(edate_confirmed <= 30),
       aes(x = edate_confirmed, color = country, y = confirmed_1e5pop)) +
  geom_line() +
  gg_my_blob +
  labs(
    y = "Confirmed cases per 100,000 inhabitants",
    title = "Cases relative to population\n"
  )
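
The gg_my_blob list above highlights and labels every country via gghighlight(TRUE, ...). If you would rather emphasize only a handful of countries, gghighlight also accepts a predicate. The variation below is my own sketch, not part of the original post: the selected countries are an arbitrary illustration, and it rebuilds the shared layers instead of reusing gg_my_blob so that two gghighlight() calls are not stacked.

# Sketch (not from the original post): highlight an arbitrary set of countries;
# all other countries are drawn de-emphasized by gghighlight.
highlight_countries <- c("Italy", "Spain", "Germany", "US")

ggplot(df %>% filter(edate_confirmed <= 30),
       aes(x = edate_confirmed, color = country, y = confirmed_1e5pop)) +
  geom_line() +
  gghighlight(country %in% highlight_countries,
              label_key = country, use_direct_label = TRUE,
              label_params = list(segment.color = NA, nudge_x = 1)) +
  scale_y_continuous(trans = 'log10', labels = scales::comma) +
  theme_minimal() +
  labs(
    x = lab_x_axis_confirmed,
    y = "Confirmed cases per 100,000 inhabitants (logarithmic scale)",
    title = "Cases relative to population, selected countries\n",
    caption = lab_notes
  )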

Wrap-Up

This is it. I hope that somebody might find this useful. In any case, help #FlattenTheCurve and stay healthy, everybody!



