
Merge Covid-19 Data with Governmental Interventions Data


[This article was first published on An Accounting and Data Science Nerd's Corner and kindly contributed to R-bloggers.]

Disclaimer: I am not an epidemiologist so I will abstain from analyzing the spread of the disease or estimating the effects of non-pharmaceutical interventions. As an applied economist, however, I study the economic effects of regulatory interventions, and thus feel confident to make some general remarks about why it is complicated to assess the causal effect of regulatory interventions.

Besides this, I am leading the Open Science Data Center (OSDC) of the collaborative research center TRR 266 Accounting for Transparency, which is funded by the German Science Foundation (DFG). The OSDC's objective is to make research transparent in a way that allows others to contribute and collaborate. This is the spirit that motivated me to write this blog post and to set up its repository. By promoting the benefits of open science, I hope that it might be helpful for others interested in doing research on the Covid-19 pandemic.

While analyzing the data on the virus spread from the Johns Hopkins University can be insightful, one of the most pressing questions that we currently face is whether the non-pharmaceutical interventions (NPIs) that we observe around the globe will eventually be effective in containing the spread of the virus, aka to #FlattenTheCurve.

While it is most likely too early to tell (also see below), one key requirement for such analyses is the availability of both high-quality data on the virus spread at the country and regional level and finely grained data on the governmental measures that implement (and enforce) these non-pharmaceutical interventions.

Data on the spread of SARS-CoV-2 is readily available, thanks to the effort of the Johns Hopkins CSSE team. However, it is much harder to obtain high quality information on NPIs. While I was searching through various web resources, luckily, I received the new edition of the fabulous ‘Data are Plural’ newsletter by Jeremy Singer-Vine. It contained two pointers to NPI data and, after a quick screening of those alongside the ones that I was able to detect (full list in the GitHub repository), I decided to settle on the data provided by the Assessment Capacities Project (ACAPS), as this dataset seems well maintained and also reasonably well documented.

The code to pull the data is provided in the Github repository that I generated along with this post. Besides the data sources mentioned above it also includes additional country level information provided by the World Bank. These data allow researchers to calculate per capita measures of the virus spread and to assess the association of macro-economic variables with the development of the virus.

The code directory in the repository contains the following R code files:

  • import_jhu_csse_covid19_data.R Imports the JHU CSSE data on the global spread of the Covid-19 pandemic, aggregates the data at the country level, and adds ISO3c country indicators. See this blog post for more information on the code.
  • import_acaps_npi.R Imports the ACAPS Government measures data, and fixes some minor inconsistencies in the category labels.
  • import_wbank_data.R Imports current country-level World Bank data.
  • merge_data.R Merges data from the three data sources into a country-day panel data set (a minimal sketch of this step follows the list).
  • descriptive_analyses.R: Generates some descriptive visualizations of the data, some of which you see below. You can use it as a starting point for your own analyses.
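To give a rough idea of what the merge step boils down to, here is a minimal sketch. It is not the actual merge_data.R from the repository; it assumes the NPI data have already been aggregated to one row per country and day, and it uses hypothetical object names (jhu_covid19, acaps_npi_daily, wbank_data).

library(dplyr)

merged <- jhu_covid19 %>%                                  # country-day cases and deaths
  left_join(acaps_npi_daily, by = c("iso3c", "date")) %>%  # NPI measures per country and day
  left_join(wbank_data, by = "iso3c")                      # static country-level World Bank data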

While I will not replicate the data retrieval and merging code in this blog post, I will show you what you can do with the merged data by providing some descriptives on governmental NPIs.

My analysis is based on the data as generated by the code of the repository as of March 27, 2020. While these data are not included in the repository, you can download them as indicated by the code below.

suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
  library(gghighlight)
  library(ggrepel)
})

merged <- read_csv("https://joachim-gassen.github.io/data/merged_data_2020-03-27.csv",
                   col_types = cols()) %>%
  mutate(date = ymd(date))

read_csv("https://joachim-gassen.github.io/data/npi_acaps_2020-03-27.csv",
         col_types = cols()) %>%
  mutate(npi_date = ymd(date_implemented)) %>%
  rename(npi_type = category) %>%
  mutate(
    npi_regional = !is.na(admin_level_name),
    npi_targeted_pop_group = targeted_pop_group == "Yes",
    npi_lockdown = str_detect(measure, "General lockdown")
  ) %>%
  select(iso3c, npi_date, npi_type, npi_regional,
         npi_targeted_pop_group, npi_lockdown) -> npi

As a first question, let’s see how these interventions distribute across calendar time.

ggplot(npi, aes(x = npi_date, fill = npi_type)) +
  geom_bar(position = "stack") + theme_minimal() +
  labs(title = "Implementation of Interventions over Calendar Time",
       x = "Date",
       y = "Number of interventions")

And now: How does this look in event time, meaning normalized to the respective country’s outbreak timing? I use the day where the number of deaths reaches 10 as event day zero.

merged %>%
  group_by(iso3c) %>%
  filter(deaths >= 10) %>%
  summarise(edate = min(date)) -> ctry_edate

merged %>%
  select(iso3c, country) %>%
  unique() -> ctry_names

npi %>%
  left_join(ctry_edate, by = "iso3c") %>%
  filter(!is.na(edate)) %>%
  mutate(npi_edate = as.numeric(npi_date - edate)) %>%
  left_join(ctry_names, by = "iso3c") %>%
  select(iso3c, country, npi_date, npi_edate, npi_type, npi_lockdown) -> npi_edates

lab_x <- "Days relative to the date where the number of deaths reached 10"

ggplot(npi_edates, aes(x = npi_edate, fill = npi_type)) +
  geom_bar(position = "stack") + theme_minimal() +
  labs(title = "Implementation of Interventions over Time",
       x = lab_x,
       y = "Number of interventions")

You can clearly see from the histograms that NPIs are clustered both in calendar time and event time. This makes it harder to separate their effects from each other, yielding lower test power. Based on the graphs, it is hard to tell the different intervention types apart. For this, you can use the next display.

npi_edates %>%
  group_by(npi_edate, npi_type) %>%
  summarise(npi_count = n()) %>%
  ungroup() %>%
  arrange(npi_type, npi_edate) %>%
  group_by(npi_type) %>%
  mutate(npi_count = cumsum(npi_count)) %>%
  complete(npi_edate = min(npi_edates$npi_edate):max(npi_edates$npi_edate)) %>%
  fill(npi_count) %>%
  replace_na(list(npi_count = 0)) %>%
  ggplot(aes(x = npi_edate, fill = npi_type, y = npi_count)) +
  theme_minimal() +
  labs(
    x = lab_x,
    y = "Percentage share of all interventions at event date",
    fill = "Intervention type"
  ) +
  geom_area(position = "fill") +
  scale_y_continuous(labels = scales::percent)

You can see that, in particular, lockdown and social distancing measures are heavily clustered in the two weeks around day zero, while socio-economic and public health measures are mostly taken earlier, similar to movement restrictions. This is in line with governments taking less intrusive measures earlier and hints at the non-randomness of interventions (more on that below).

I will now focus on two types of measures that have been argued to be particularly important to flatten the curve: social distancing and the general lockdown of a country. First, let's see, based on the ACAPS data, which countries have more social distancing measures in place and which countries have implemented a lockdown.

merged %>%
  inner_join(ctry_edate, by = "iso3c") %>%
  mutate(edate = as.numeric(date - edate)) %>%
  group_by(iso3c) %>%
  mutate(
    lockdown_ctry = max(lockdown) > 0,
    soc_dist_ctry = max(soc_dist)
  ) %>%
  ungroup() %>%
  mutate(soc_dist_ctry = soc_dist_ctry > median(soc_dist_ctry)) -> df

df %>%
  select(country, soc_dist_ctry, lockdown_ctry) %>%
  unique() %>%
  arrange(country) -> npi_ctry

ggplot(npi_ctry, aes(x = soc_dist_ctry, y = lockdown_ctry)) +
  geom_label_repel(aes(label = country)) +
  theme_minimal() +
  labs(
    x = "More than median amount of social distancing measures",
    y = "Lockdown initiated",
    caption = paste0(
      "Government intervention measures as provided by ",
      "Assessment Capacities Project (ACAPS). Data as of March 27, 2020.\n",
      "All countries with 10 or more reported deaths are included. ",
      "Code: https://github.com/joachim-gassem/tidy_covid19"
    )
  )

When you look at the data and are somewhat familiar with the activities that several countries have taken, you might be surprised by some of the data points. As an example: No lockdown in China? It seems important to note that coding NPIs is far from trivial and that the ACAPS data provide much more detail on the measures than I use here. You are encouraged and advised to use this richness of the data for your own analyses. In particular, I hope that more regional-level analyses will allow us to assess the effects of NPIs in the future.

As a last step, I would like to provide two visuals to explain why it seems too early (for me) to assess the effectiveness of NPIs at the country level. For that, I use the grouping of countries of the last chart and plot the daily mean percentage increase in recorded deaths for each group in event time. I require each group to have at least five countries and I calculate the averages by event date since, luckily, there seems to be an overall declining trend in death growth rates over time in the data.

First, let's compare countries with more social distancing measures to countries that have fewer social distancing measures in place.

compare_death_growth <- function(df, var) {
  lab_caption <- paste0(
    "Deaths data as provided by Johns Hopkins University Center for Systems Science ",
    "and Engineering (JHU CSSE).\nGovernment intervention measures as provided by ",
    "Assessment Capacities Project (ACAPS). Data as of March 27, 2020.\n",
    "At least five daily country-level observations required by group for ",
    "estimation. Code: https://github.com/joachim-gassem/tidy_covid19"
  )
  lab_color <- case_when(
    var == "soc_dist_ctry" ~
      "More than median amount of\nsocial distancing measures",
    var == "lockdown_ctry" ~ "Lockdown initiated",
    TRUE ~ var
  )
  df %>%
    mutate(pct_inc_deaths = deaths/lag(deaths) - 1) %>%
    filter(edate >= 0) %>%
    group_by(edate, !! sym(var)) %>%
    filter(n() >= 5) %>%
    summarise(
      mean = mean(pct_inc_deaths),
      std_err = sd(pct_inc_deaths)/sqrt(n()),
      n = n()
    ) %>%
    ggplot(aes(x = edate, y = mean, color = !! sym(var))) +
    geom_pointrange(
      aes(ymin = mean - 1.96*std_err, ymax = mean + 1.96*std_err),
      position = position_dodge(0.4)
    ) +
    labs(
      x = lab_x,
      y = "Average daily percentage increase in reported deaths by group",
      caption = lab_caption,
      color = lab_color
    ) +
    theme_minimal() +
    theme(
      legend.position = c(0.75, 0.75),
      plot.title.position = "plot",
      plot.caption.position = "plot",
      plot.caption = element_text(hjust = 0),
      axis.title.x = element_text(hjust = 1),
      axis.title.y = element_text(hjust = 1)
    ) +
    scale_y_continuous(labels = scales::percent)
}

compare_death_growth(df, "soc_dist_ctry")

The first thing that you see is that we only have very few data points with overlapping data (13 to be precise). Also, you see the wide and overlapping standard errors. This translates to: At least using such highly aggregated data, it is much too early to assess the impact of government measures on the spread of the virus.

Another thing that you can glean from the data is that, on average, countries with more social distancing measures seem to have higher death growth rates. Why is that? While I want to abstain from interpreting exploratory displays, a potential explanation is that countries facing a faster spread of the virus are likely to adopt more rigid measures of social distancing. This non-randomness of governmental interventions is a common issue in regulatory intervention studies and makes it harder to assess the causal effect of regulatory interventions.

Finally, let’s compare this to the graph separating the death growth rates of countries with and without governmental lockdowns:

compare_death_growth(df, "lockdown_ctry")

As you can see, the graph looks reasonably similar. We will have to wait until we eventually can learn how the interventions have affected the spread of the virus.

Wrapping up

All the code to pull the data and to run the analyses of this post, and also my prior post, is available in the GitHub repository of this post. I designed the repository in a way that should make it straightforward for you to base your own analysis on it. Just fork it and code away. However, please remember:

  • Predicting the development of the virus and assessing the effects of NPIs requires domain knowledge. Let’s keep this job with the Epidemiologists!
  • Open Science implies that you share and contribute. If you find something odd in the underlying data or my code, please share it with the responsible parties.
  • Give credit where credit is due. Always provide references to the original data suppliers.
  • And finally: Stay healthy and continue to help #FlattenTheCurve!



COVID-19 Resource Gallery


[This article was first published on Mirai Solutions and kindly contributed to R-bloggers.]

We have gathered interesting COVID-19 resources and made them easily accessible within a single gallery website.

The Data Science community is showing a great spirit of collaboration, extensively contributing to the understanding of the COVID-19 pandemic by finding new and creative ways to analyze, visualize and communicate data. Many such resources are popping up daily and we at Mirai Solutions have been inspired by many such works for our Covid19 Shiny app.

While working on updates and new releases of the app, we also realized the value of being able to gather and browse in one place the interesting resources we were coming across on social media, in newspapers, on GitHub, etc. As data sets are constantly being updated, this also addresses the need to compare different data sources in an easy and quick way.

Such a need is now fulfilled by the COVID-19 Resource Gallery: an open source R Markdown website including a gallery of (embedded) pages created in a dynamic way based on simple metadata in JSON format.

The gallery website is built using a dedicated R Markdown site generator provided by the novel rmdgallery package hosted on GitHub:

remotes::install_github("riccardoporreca/rmdgallery")

The entries of our gallery are listed as metadata in JSON format like the following for our Covid19 app

{"mirai-covid-19":{"title":"Covid19 Shiny App","author":"Mirai Solutions GmbH","url":"https://github.com/miraisolutions/Covid19#readme","menu_entry":"Mirai Covid19","template":"embed-url","content":"https://miraisolutions.shinyapps.io/covid19"}}

The rmdgallery::gallery_site generator makes it possible to include the specified content (along with the other metadata) in the resulting gallery page using alternative templates. In the case above we are simply embedding the Covid19 Shiny app given its URL (see the template field), which is the typical case for including pages like dashboards and web apps.

The website is automatically built and deployed via GitHub Actions, and new gallery pages are simply added by pushing updated JSON metadata to the master branch.

The gallery currently features:

Do you like the idea and want to include your favorite COVID-19 resources in your own gallery? Feel free to fork our GitHub repository and see it live at username.github.io/covid-19-gallery!



Time series cross-validation using crossval


[This article was first published on T. Moudiki's Webpage - R and kindly contributed to R-bloggers.]

Time series cross-validation is now available in crossval, using function crossval::crossval_ts. Main parameters for crossval::crossval_ts include:

  • fixed_window: described below in sections 1 and 2; indicates whether the training set’s size is fixed or increasing through cross-validation iterations
  • initial_window: the number of points in the rolling training set
  • horizon: the number of points in the rolling testing set

Yes, this type of functionality exists in packages such as caret, or forecast, but with different flavours. We start by installing crossval from its online repository (in R’s console):

library(devtools)
devtools::install_github("thierrymoudiki/crossval")
library(crossval)

1 – Calling crossval_ts with option fixed_window = TRUE


initial_window is the length of the training set, depicted in blue, which is fixed through cross-validation iterations. horizon is the length of the testing set, in orange.
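To make the rolling scheme concrete, here is a small standalone illustration (my own sketch, not crossval internals) of which indices would form the training and testing sets in each iteration, for both the fixed and the increasing window variants:

# Illustration only: rolling-origin indices for initial_window = 10, horizon = 3
initial_window <- 10
horizon <- 3
n <- 20  # pretend the series has 20 points

for (origin in seq(initial_window, n - horizon)) {
  train_fixed <- (origin - initial_window + 1):origin  # fixed_window = TRUE: always 10 points
  train_growing <- 1:origin                            # fixed_window = FALSE: grows each iteration
  test_idx <- (origin + 1):(origin + horizon)          # the next 3 points form the testing set
  cat("train (fixed):", range(train_fixed),
      "| train (growing):", range(train_growing),
      "| test:", range(test_idx), "\n")
}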

1 – 1 Using statistical learning functions

# regressors including trend
xreg <- cbind(1, 1:length(AirPassengers))

# cross-validation with least squares regression
res <- crossval_ts(y = AirPassengers, x = xreg,
                   fit_func = crossval::fit_lm,
                   predict_func = crossval::predict_lm,
                   initial_window = 10,
                   horizon = 3,
                   fixed_window = TRUE)

# print results
print(colMeans(res))
         ME        RMSE         MAE         MPE        MAPE 
 0.16473829 71.42382836 67.01472299  0.02345201  0.22106607 

1 – 2 Using time series functions from package forecast

res <- crossval_ts(y = AirPassengers,
                   initial_window = 10,
                   horizon = 3,
                   fcast_func = forecast::thetaf,
                   fixed_window = TRUE)
print(colMeans(res))
          ME         RMSE          MAE          MPE         MAPE 
 2.657082195 51.427170382 46.511874693  0.003423843  0.155428590 

2 – Calling crossval_ts with option fixed_window = FALSE


initial_window is the length of the training set, in blue, which increases through cross-validation iterations. horizon is the length of the testing set, depicted in orange.

2 – 1 Using statistical learning functions

# regressors including trend
xreg <- cbind(1, 1:length(AirPassengers))

# cross-validation with least squares regression
res <- crossval_ts(y = AirPassengers, x = xreg,
                   fit_func = crossval::fit_lm,
                   predict_func = crossval::predict_lm,
                   initial_window = 10,
                   horizon = 3,
                   fixed_window = FALSE)

# print results
print(colMeans(res))
         ME        RMSE         MAE         MPE        MAPE 
11.35159629 40.54895772 36.07794747 -0.01723816  0.11825111 

2 – 2 Using time series functions from package forecast

res <- crossval_ts(y = AirPassengers,
                   initial_window = 10,
                   horizon = 3,
                   fcast_func = forecast::thetaf,
                   fixed_window = FALSE)
print(colMeans(res))
          ME         RMSE          MAE          MPE         MAPE 
 2.670281455 44.758106487 40.284267136  0.002183707  0.135572333 

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

Under License: Creative Commons Attribution 4.0 International.



A COVID Small Multiple


[This article was first published on R on kieranhealy.org and kindly contributed to R-bloggers.]

John Burn-Murdoch has been doing very good work at the Financial Times producing various visualizations of the progress of COVID-19. One of his recent images is a small-multiple plot of cases by country, showing the trajectory of the outbreak for a large number of countries, with the background of each small-multiple panel also showing (in grey) the trajectory of every other country for comparison. It’s a useful technique. In this example, I’ll draw a version of it in R and ggplot. The main difference is that instead of ordering the panels alphabetically by country, I’ll order them from highest to lowest current reported cases.

Here’s the figure we’ll end up with:

[Figure: Cumulative reported COVID-19 cases to date, top 50 countries]

There are two small tricks. First, getting all the data to show (in grey) in each panel while highlighting just one country. Second, for reasons of space, moving the panel labels (in ggplot’s terminology, the strip labels) inside the panels, in order to tighten up the space a bit. Doing this is really the same trick both times, viz., creating some mini-datasets to use for particular layers of the plot.

The code for this (including code to pull the data) is in my COVID GitHub repository. See the repo for details on downloading and cleaning it. Just this morning the ECDC changed how it’s supplying its data, moving from an Excel file to your choice of JSON, CSV, or XML, so this earlier post walking through the process for the Excel file is already out of date for the downloading step. There’s a new function in the repo, though.

We’ll start with the data mostly cleaned and organized.

> cov_case_curve
# A tibble: 1,165 x 9
# Groups:   iso3 [94]
   date       cname iso3  cases deaths cu_cases cu_deaths days_elapsed end_label
   <date>     <chr> <chr> <dbl>  <dbl>    <dbl>     <dbl> <drtn>       <chr>    
 1 2020-01-19 China CHN     136      1      216         3 0 days       NA       
 2 2020-01-20 China CHN      19      0      235         3 1 days       NA       
 3 2020-01-21 China CHN     151      3      386         6 2 days       NA       
 4 2020-01-22 China CHN     140     11      526        17 3 days       NA       
 5 2020-01-23 China CHN      97      0      623        17 4 days       NA       
 6 2020-01-24 China CHN     259      9      882        26 5 days       NA       
 7 2020-01-25 China CHN     441     15     1323        41 6 days       NA       
 8 2020-01-26 China CHN     665     15     1988        56 7 days       NA       
 9 2020-01-27 China CHN     787     25     2775        81 8 days       NA       
10 2020-01-28 China CHN    1753     25     4528       106 9 days       NA       
# … with 1,155 more rows

Then we pick out the top 50 countries, isolating their maximum case value. The code here is a bit inefficient as I keep having to recode some of the country names in the mini-datasets. There are other inefficiencies too, but oh well. I’ll clean them up later.

top_50 <- cov_case_curve %>%
  group_by(cname) %>%
  filter(cu_cases == max(cu_cases)) %>%
  ungroup() %>%
  top_n(50, cu_cases) %>%
  select(iso3, cname, cu_cases) %>%
  mutate(days_elapsed = 1,
         cu_cases = max(cov_case_curve$cu_cases) - 1e4,
         cname = recode(cname,
                        `United States` = "USA",
                        `Iran, Islamic Republic of` = "Iran",
                        `Korea, Republic of` = "South Korea",
                        `United Kingdom` = "UK"))

top_50
# A tibble: 50 x 4
   iso3  cname          cu_cases days_elapsed
   <chr> <chr>             <dbl>        <dbl>
 1 ARG   Argentina         75991            1
 2 AUS   Australia         75991            1
 3 AUT   Austria           75991            1
 4 BEL   Belgium           75991            1
 5 BRA   Brazil            75991            1
 6 CAN   Canada            75991            1
 7 CHL   Chile             75991            1
 8 CHN   China             75991            1
 9 CZE   Czech Republic    75991            1
10 DNK   Denmark           75991            1
# … with 40 more rows

This gives us our label layer. We’ve set days_elapsed and cu_cases values to the same thing for every country, because these are the x and y locations where the country labels will go.

Next, a data layer for the grey line traces and a data layer for the little endpoints at the current case-count value.

cov_case_curve_bg <- cov_case_curve %>%
  select(-cname) %>%
  filter(iso3 %in% top_50$iso3)

cov_case_curve_endpoints <- cov_case_curve %>%
  filter(iso3 %in% top_50$iso3) %>%
  mutate(cname = recode(cname,
                        `United States` = "USA",
                        `Iran, Islamic Republic of` = "Iran",
                        `Korea, Republic of` = "South Korea",
                        `United Kingdom` = "UK")) %>%
  group_by(iso3) %>%
  filter(cu_cases == max(cu_cases)) %>%
  select(cname, iso3, days_elapsed, cu_cases) %>%
  ungroup()

We drop cname in the cov_case_curve_bg layer, because we’re going to facet by that value with the main dataset in a moment. That’s the trick that allows the traces for all the countries to appear in each panel.
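As a standalone illustration of that trick (a toy example I added, not part of the original post), the background layer below drops the faceting variable g, so ggplot repeats those rows in every panel, while the grouping column id keeps the individual traces separate:

library(ggplot2)
library(dplyr)

d <- tibble::tibble(
  id = rep(c("a", "b", "c"), each = 10),          # plays the role of iso3
  g  = rep(c("A", "B", "C"), each = 10),          # plays the role of cname (facet variable)
  x  = rep(1:10, times = 3),
  y  = as.vector(replicate(3, cumsum(rnorm(10))))
)

ggplot(d, aes(x = x, y = y)) +
  # background: every series appears in every panel, because g is dropped here
  geom_line(data = select(d, -g), aes(group = id), color = "gray80") +
  # highlighted series belonging to the panel's own facet value
  geom_line(color = "firebrick") +
  facet_wrap(~ g)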

And now we can draw the plot. I really need to fix that country recode—a prime example of DRY.

cov_case_sm <- cov_case_curve %>%
  filter(iso3 %in% top_50$iso3) %>%
  mutate(cname = recode(cname,
                        `United States` = "USA",
                        `Iran, Islamic Republic of` = "Iran",
                        `Korea, Republic of` = "South Korea",
                        `United Kingdom` = "UK")) %>%
  ggplot(mapping = aes(x = days_elapsed, y = cu_cases)) +
  # The line traces for every country, in every panel
  geom_line(data = cov_case_curve_bg,
            aes(group = iso3),
            size = 0.15, color = "gray80") +
  # The line trace in red, for the country in any given panel
  geom_line(color = "firebrick", lineend = "round") +
  # The point at the end. Bonus trick: some points can have fills!
  geom_point(data = cov_case_curve_endpoints,
             size = 1.1,
             shape = 21,
             color = "firebrick",
             fill = "firebrick2") +
  # The country label inside the panel, in lieu of the strip label
  geom_text(data = top_50,
            mapping = aes(label = cname),
            vjust = "inward",
            hjust = "inward",
            fontface = "bold",
            color = "firebrick",
            size = 2.1) +
  # Log transform and friendly labels
  scale_y_log10(labels = scales::label_number_si()) +
  # Facet by country, order from high to low
  facet_wrap(~ reorder(cname, -cu_cases), ncol = 5) +
  labs(x = "Days Since 100th Confirmed Case",
       y = "Cumulative Number of Cases (log10 scale)",
       title = "Cumulative Number of Reported Cases of COVID-19: Top 50 Countries",
       subtitle = paste("Data as of", format(max(cov_curve$date), "%A, %B %e, %Y")),
       caption = "Kieran Healy @kjhealy / Data: https://www.ecdc.europa.eu/") +
  theme(plot.title = element_text(size = rel(1), face = "bold"),
        plot.subtitle = element_text(size = rel(0.7)),
        plot.caption = element_text(size = rel(1)),
        # turn off the strip label and tighten the panel spacing
        strip.text = element_blank(),
        panel.spacing.x = unit(-0.05, "lines"),
        panel.spacing.y = unit(0.3, "lines"),
        axis.text.y = element_text(size = rel(0.5)),
        axis.title.x = element_text(size = rel(1)),
        axis.title.y = element_text(size = rel(1)),
        axis.text.x = element_text(size = rel(0.5)),
        legend.text = element_text(size = rel(1)))

ggsave("figures/cov_case_sm.png", cov_case_sm, width = 10, height = 12, dpi = 300)


Simulating your Model Training for Robust ML Models


[This article was first published on R – Hi! I am Nagdev and kindly contributed to R-bloggers.]

In this post, I will explain why one should run simulations on their model training process so the models don’t fail in production. Traditionally, we are used to training models at certain split ratios such as 70:30 or 80:20. The fundamental issue with this is that we don’t always train models on different parts of the data in those splits, as shown in the image below.

[Image: different train/test splits cover different parts of the data]
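A quick way to convince yourself of this point (an illustration I added, not part of the original post) is to compare the training rows selected by two different random splits:

# two different random 80% training samples of mtcars share only part of their rows
set.seed(1); s1 <- sample(1:nrow(mtcars), 0.8 * nrow(mtcars))
set.seed(2); s2 <- sample(1:nrow(mtcars), 0.8 * nrow(mtcars))
length(intersect(s1, s2)) / length(s1)  # fraction of training rows the two splits have in common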

Hence, it becomes vital to make sure one trains their model with various scenarios to make sure the model is not biased. This also ensures that the model is reliable and robust enough to deploy it into production.

Below we will go over an example on how this all comes together. The steps are as follows:

  1. load libraries
  2. load mtcars data set
  3. write a function to split data at different degrees
  4. run simulation in a loop to get error rates
  5. Finally visualizing the result

We will first begin with loading libraries and our data set as shown below

# load libraries
library(MASS)
library(e1071)
library(Metrics)
library(dplyr)

# load data
data(mtcars)

Next, we will write a function that includes

  • set a seed value. This is because we want to capture new data every time (Duh! that’s the whole point of this simulation)
  • split the data in to train and test at various ratios
  • build an SVM model using train data
  • do predictions on test data
  • calculate & return error value (MAE)
# function to run simulation
runsimulation = function(i, split){
  seedValue = i*rnorm(1)
  # change seed values
  set.seed(seedValue)
  # create samples
  samples = sample(1:nrow(mtcars), split*nrow(mtcars))
  # split data to test and train
  train = mtcars[samples, ]
  test = mtcars[-samples, ]
  # build a model
  model = svm(mpg ~ ., data = train, scale = F, kernel = "radial")
  # do predictions
  prediction = predict(model, test %>% select(-mpg))
  # calculate error
  error = mae(actual = test$mpg, predicted = prediction)
  # return error values
  return(error)
}

We will create a sequence of split ratios and then run these ratios in a loop. For each split ratio, we will do around 300 runs.

# create split ratios
split = seq(from = 0.5, to = 0.95, by = 0.05) %>% rep(300) %>% sort(decreasing = FALSE)
# get the length of i for seed values
i = 1:length(split)
# get errors
errors = mapply(runsimulation, i = i, split = split)
# put data into a data frame
simResults = data.frame(split, errors)

Finally, we visualize the data and look at the results. In the below box plot we can see that the median decreases as the split ratio increases. This should be true as we are feeding in more data to the model. We also notice that the minimum error decreases as we add more data while training. This also increases the max errors. We can notice similar observation for quantile as well.

[Box plot: error simulation results by train split ratio]

Next, we will look at the summary of the mean and variance for each split ratio. We notice that the lowest average error comes with the 95% split, which also comes with the highest standard deviation, and vice versa.

# plot results
boxplot(errors ~ split,
        data = simResults,
        main = "Error Simulation Results",
        xlab = "Train Split",
        ylab = "Mean Absolute Error",
        col = "light blue")
grid(NULL, NULL, lty = 6, col = "cornsilk2")

simResults %>%
  group_by(split) %>%
  summarise(mean = mean(errors), sd = sd(errors)) %>%
  data.frame
#     split     mean        sd
# 1    0.50 4.826838 0.7090876
# 2    0.55 4.701303 0.8178482
# 3    0.60 4.674690 0.8442144
# 4    0.65 4.645363 0.8727532
# 5    0.70 4.652534 1.0769249
# 6    0.75 4.555186 1.1046217
# 7    0.80 4.588761 1.3002216
# 8    0.85 4.572775 1.6021275
# 9    0.90 4.519118 1.7865828
# 10   0.95 4.443357 2.4188333

At this point, it’s up to the decision maker to decide which model to go for: can they afford significant variations in error rates, or do they want to control the variance of the error rate? If I were the decision maker, I would go with either a 65% or 70% split and control that variance in error.
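One possible way to formalize that trade-off (my own illustration, not part of the original post) is to penalize the mean error by its standard deviation and rank the split ratios by the penalized score; the weight on the standard deviation is an arbitrary choice:

simResults %>%
  group_by(split) %>%
  summarise(mean = mean(errors), sd = sd(errors)) %>%
  mutate(score = mean + 1 * sd) %>%  # penalize unstable splits; the weight of 1 is arbitrary
  arrange(score)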

In conclusion, machine learning is hard. It’s not as simple as fitting a model with data. You need to run simulations as above to analyze your models. The above is the simplest case you could come across. Once you get to hyperparameters, it gets even more complicated. There is no one set of tools or flows that works for all. You sometimes need to get creative and come up with your own flows.

Hope you enjoyed this quick tutorial. Feel free to like, share and subscribe to this blog. 



Grid Search and Bayesian Hyperparameter Optimization using {tune} and {caret} packages


[This article was first published on R Programming – DataScience+ and kindly contributed to R-bloggers.]


A priori there is no guarantee that tuning hyperparameters (HP) will improve the performance of the machine learning model at hand. In this blog post, Grid Search and Bayesian optimization methods implemented in the {tune} package will be used to undertake hyperparameter tuning and to check whether the hyperparameter optimization leads to better performance.

We will also conduct hyperparameter optimization using the {caret} package; this will allow us to compare the performance of both packages, {tune} and {caret}.

High Level Workflow

The following picture shows the high-level workflow for performing hyperparameter tuning:

Hyperparameter Optimization Methods

In contrast to the model parameters, which are discovered by the learning algorithm of the ML model, the so-called hyperparameters (HP) are not learned during the modeling process but are specified prior to training.

Hyperparameter tuning is the task of finding optimal hyperparameter(s) for a learning algorithm for a specific data set and at the end of the day to improve the model performance.

There are three main methods to tune/optimize hyperparameters:

a) Grid Search method: an exhaustive search (blind search/unguided search) over a manually specified subset of the hyperparameter space. This method is a computationally expensive option but guaranteed to find the best combination in your specified grid.

b) Random Search method: a simple alternative, similar to the grid search method, but the candidate combinations are randomly selected. This method (also blind search/unguided search) is faster at getting a reasonable model but will not necessarily find the best combination in your grid.

c) Informed Search method: in informed search, each iteration learns from the last; the results of one model help to create the next model.

The most popular informed search method is Bayesian Optimization. Bayesian Optimization was originally designed to optimize black-box functions. To understand the concept of Bayesian Optimization, this article and this one are highly recommended.

In this post, we will focus on two methods for automated hyperparameter tuning: Grid Search and Bayesian optimization. We will optimize the hyperparameters of a random forest model using the tune library and other required packages (workflows, dials, ...).

Preparing the data

The learning problem (as an example) is a binary classification problem: predicting customer churn. We will be using the Telco Customer Churn data set, also available here.

Load needed libraries.

# Needed packages
library(tidymodels)   # packages for modeling and statistical analysis
library(tune)         # for hyperparameter tuning
library(workflows)    # streamline process
library(tictoc)       # for timing

Load data and explore it.

# load data
Telco_customer <- read.csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# Get summary of the data
skimr::skim(Telco_customer)
Name                       Telco_customer
Number of rows             7043
Number of columns          21
_______________________
Column type frequency:
  factor                   17
  numeric                  4
________________________
Group variables            None

Variable type: factor

skim_variable     n_missing  complete_rate  ordered  n_unique  top_counts
customerID                0              1  FALSE        7043  000: 1, 000: 1, 000: 1, 001: 1
gender                    0              1  FALSE           2  Mal: 3555, Fem: 3488
Partner                   0              1  FALSE           2  No: 3641, Yes: 3402
Dependents                0              1  FALSE           2  No: 4933, Yes: 2110
PhoneService              0              1  FALSE           2  Yes: 6361, No: 682
MultipleLines             0              1  FALSE           3  No: 3390, Yes: 2971, No : 682
InternetService           0              1  FALSE           3  Fib: 3096, DSL: 2421, No: 1526
OnlineSecurity            0              1  FALSE           3  No: 3498, Yes: 2019, No : 1526
OnlineBackup              0              1  FALSE           3  No: 3088, Yes: 2429, No : 1526
DeviceProtection          0              1  FALSE           3  No: 3095, Yes: 2422, No : 1526
TechSupport               0              1  FALSE           3  No: 3473, Yes: 2044, No : 1526
StreamingTV               0              1  FALSE           3  No: 2810, Yes: 2707, No : 1526
StreamingMovies           0              1  FALSE           3  No: 2785, Yes: 2732, No : 1526
Contract                  0              1  FALSE           3  Mon: 3875, Two: 1695, One: 1473
PaperlessBilling          0              1  FALSE           2  Yes: 4171, No: 2872
PaymentMethod             0              1  FALSE           4  Ele: 2365, Mai: 1612, Ban: 1544, Cre: 1522
Churn                     0              1  FALSE           2  No: 5174, Yes: 1869

Variable type: numeric

skim_variable   n_missing  complete_rate     mean       sd     p0     p25      p50      p75     p100  hist
SeniorCitizen           0              1     0.16     0.37   0.00    0.00     0.00     0.00     1.00  ▇▁▁▁▂
tenure                  0              1    32.37    24.56   0.00    9.00    29.00    55.00    72.00  ▇▃▃▃▆
MonthlyCharges          0              1    64.76    30.09  18.25   35.50    70.35    89.85   118.75  ▇▅▆▇▅
TotalCharges           11              1  2283.30  2266.77  18.80  401.45  1397.47  3794.74  8684.80  ▇▂▂▂▁
# Make copy of Telco_customer and drop the unneeded columns
data_set <- Telco_customer %>% dplyr::select(-"customerID")
# Rename the outcome variable (Churn in my case) to Target
data_in_scope <- data_set %>% plyr::rename(c("Churn" = "Target"))
# Drop rows with missing values (11 missing values, a very small percentage of our total data)
data_in_scope <- data_set %>% plyr::rename(c("Churn" = "Target")) %>% drop_na()

Check severity of class imbalance.

round(prop.table(table(data_in_scope$Target)), 2)
## 
##   No  Yes 
## 0.73 0.27

For the data at hand there is no need to conduct downsampling or upsampling, but if you have to balance your data you can use the function step_downsample() or step_upsample() to reduce the imbalance between majority and minority class.

Below we will split the data into train and test data and create resamples. The test data is saved for model evaluation and we will use it twice: once to evaluate the model with default hyperparameters, and at the end of the tuning process to test the tuning results (evaluate the final tuned model).

During the tuning process we will deal only with the resamples created on the training data. In my example we will use V-Fold Cross-Validation to split the training data into 5 folds and the repetition consists of 2 iterations.

# Split data into train and test data and create resamples for tuning
set.seed(2020)
train_test_split_data <- initial_split(data_in_scope)
data_in_scope_train <- training(train_test_split_data)
data_in_scope_test <- testing(train_test_split_data)
# create resamples
folds <- vfold_cv(data_in_scope_train, v = 5, repeats = 2)

Preprocessing the data

We create the recipe and assign the steps for preprocessing the data.

# Pre-Processing the data with {recipes}
set.seed(2020)
rec <- recipe(Target ~ .,
              data = data_in_scope_train) %>%     # Formula
  step_dummy(all_nominal(), -Target) %>%          # convert nominal data into one or more numeric columns
  step_corr(all_predictors()) %>%                 # remove variables that have large absolute
                                                  # correlations with other variables
  step_center(all_numeric(), -all_outcomes()) %>% # normalize numeric data to have a mean of zero
  step_scale(all_numeric(), -all_outcomes())      # normalize numeric data to have a standard deviation of one
  # %>% step_downsample(Target)                   # all classes should have the same frequency as the minority
                                                  # class (not needed in our case)

Next we will prep the recipe on the training data. The resulting data (train_data and test_data) will be used for modeling and fitting the model using the default hyperparameters of the model at hand. The model performance is determined by AUC (Area under the ROC Curve), which will be computed via the roc_auc {yardstick} function. This AUC value will be taken as the reference value to check whether the hyperparameter optimization leads to better performance or not.

trained_rec <- prep(rec, training = data_in_scope_train, retain = TRUE)
# create the train and test set
train_data <- as.data.frame(juice(trained_rec))
test_data  <- as.data.frame(bake(trained_rec, new_data = data_in_scope_test))

The model

We will use the {parsnip} function rand_forest() to create a random forest model and add the r-package “ranger” as the computational engine.

# Build the model (generate the specifications of the model)
model_spec_default <- rand_forest(mode = "classification") %>%
  set_engine("ranger", verbose = TRUE)

Fit the model on the training data (train_data prepared above)

set.seed(2020)
tic()
# fit the model
model_fit_default <- model_spec_default %>% fit(Target ~ ., train_data)
toc()
## 2.37 sec elapsed
# Show the configuration of the fitted model
model_fit_default
## parsnip model object
## 
## Fit time:  1.5s 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, verbose = ~TRUE, num.threads = 1, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      5274 
## Number of independent variables:  23 
## Mtry:                             4 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1344156

Predict on the testing data (test_data) and extract the model performance. How does this model perform against the holdout data (test_data, not seen before)?

# Performance and statistics:
set.seed(2020)
test_results_default <-
  test_data %>%
  select(Target) %>%
  as_tibble() %>%
  mutate(
    model_class_default = predict(model_fit_default, new_data = test_data) %>%
      pull(.pred_class),
    model_prob_default  = predict(model_fit_default, new_data = test_data, type = "prob") %>%
      pull(.pred_Yes))

The computed AUC is presented here:

# Compute the AUC value
auc_default <- test_results_default %>% roc_auc(truth = Target, model_prob_default)
cat("The default model scores", auc_default$.estimate, " AUC on the testing data")
## The default model scores 0.8235755  AUC on the testing data
# Here we can also compute the confusion matrix
conf_matrix <- test_results_default %>% conf_mat(truth = Target, model_class_default)

As we can see, the default model does not perform badly, but would a tuned model deliver better performance?

Hyperparameter Tuning Using {tune}.

Hyperparameter tuning using the {tune} package will be performed for the parsnip model rand_forest and we will use ranger as the computational engine. The list of {parsnip} models can be found here

In the next section we will define and describe the needed elements for the tuning function tune_*() (tune_grid() for Grid Search and tune_bayes() for Bayesian Optimization).

Specification of the ingredients for the tune function

Preparing the elements needed for the tuning function tune_*()

  1. model to tune: Build the model with {parsnip} package and specify the parameters we want to tune. Our model has three important hyperparameters:
    • mtry: the number of predictors that will be randomly sampled at each split when creating the tree models (default values differ for classification, sqrt(p), and regression, p/3, where p is the number of variables in the data set)
    • trees: is the number of trees contained in the ensemble (Default: 500)
    • min_n: the minimum number of data points in a node (default value: 1 for classification and 5 for regression). mtry, trees, and min_n build the hyperparameter set to tune.
# Build the model to tune and leave the tuning parameters empty (placeholders with the tune() function)
model_def_to_tune <- rand_forest(mode = "classification",
                                 mtry = tune(),      # mtry is the number of predictors that will be randomly
                                                     # sampled at each split when creating the tree models.
                                 trees = tune(),     # trees is the number of trees contained in the ensemble.
                                 min_n = tune()) %>% # min_n is the minimum number of data points in a node
                                                     # that are required for the node to be split further.
  set_engine("ranger")                               # computational engine
  2. Build the workflow object with the {workflows} package. A workflow is a container object that aggregates information required to fit and predict from a model. This information might be a recipe used in preprocessing, specified through add_recipe(), or the model specification to fit, specified through add_model().

For our example we combine the recipe (rec) and model_def_to_tune into a single object (model_wflow) via the workflow() function from the {workflows} package.

# Build the workflow object
model_wflow <-
  workflow() %>%
  add_model(model_def_to_tune) %>%
  add_recipe(rec)

Get information on all possible tunable arguments in the defined workflow (model_wflow) and check whether or not they are actually tunable.

tune_args(model_wflow)
## # A tibble: 3 x 6
##   name  tunable id    source     component   component_id
##   <chr> <lgl>   <chr> <chr>      <chr>       <chr>       
## 1 mtry  TRUE    mtry  model_spec rand_forest 
## 2 trees TRUE    trees model_spec rand_forest 
## 3 min_n TRUE    min_n model_spec rand_forest 
  3. Finalize the hyperparameter set to be tuned. Parameter updates will be done via the finalize {dials} function.
# Which parameters have been collected?
HP_set <- parameters(model_wflow)
HP_set
## Collection of 3 parameters for tuning
## 
##     id parameter type object class
##   mtry           mtry    nparam[?]
##  trees          trees    nparam[+]
##  min_n          min_n    nparam[+]
## 
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
## 
## See `?dials::finalize` or `?dials::update.parameters` for more information.
# Update the parameters which depend on the data (in our case mtry)
without_output <- select(data_in_scope_train, -Target)
HP_set <- finalize(HP_set, without_output)
HP_set
## Collection of 3 parameters for tuning
## 
##     id parameter type object class
##   mtry           mtry    nparam[+]
##  trees          trees    nparam[+]
##  min_n          min_n    nparam[+]

Now we have everything in place to run the optimization process. Before we start the Grid Search, we will build a wrapper function (my_finalize_func). It takes the result of the tuning process, the recipe object, and the model to tune as arguments, finalizes the recipe and the tuned model, and returns the AUC value, the confusion matrix, and the ROC curve. This function will be applied to the results of both the Grid Search and the Bayesian optimization process.

# Function to finalize the recipe and the model and return the AUC value and the ROC curve of the tuned model
my_finalize_func <- function(result_tuning, my_recipe, my_model) {
  # Accessing the tuning results
  bestParameters <- select_best(result_tuning, metric = "roc_auc", maximize = TRUE)
  # Finalize recipe
  final_rec <-
    rec %>%
    finalize_recipe(bestParameters) %>%
    prep()
  # Attach the best HP combination to the model and fit the model to the complete training data (data_in_scope_train)
  final_model <-
    my_model %>%
    finalize_model(bestParameters) %>%
    fit(Target ~ ., data = juice(final_rec))
  # Prepare the final trained data to use for performing model validation
  df_train_after_tuning <- as.data.frame(juice(final_rec))
  df_test_after_tuning <- as.data.frame(bake(final_rec, new_data = data_in_scope_test))
  # Predict on the testing data
  set.seed(2020)
  results_ <-
    df_test_after_tuning %>%
    select(Target) %>%
    as_tibble() %>%
    mutate(
      model_class = predict(final_model, new_data = df_test_after_tuning) %>%
        pull(.pred_class),
      model_prob  = predict(final_model, new_data = df_test_after_tuning, type = "prob") %>%
        pull(.pred_Yes))
  # Compute the AUC
  auc <- results_ %>% roc_auc(truth = Target, model_prob)
  # Compute the confusion matrix
  confusion_matrix <- conf_mat(results_, truth = Target, model_class)
  # Plot the ROC curve
  rocCurve <- roc_curve(results_, truth = Target, model_prob) %>%
    ggplot(aes(x = 1 - specificity, y = sensitivity)) +
    geom_path(colour = "darkgreen", size = 1.5) +
    geom_abline(lty = 3, size = 1, colour = "darkred") +
    coord_equal() +
    theme_light()
  new_list <- list(auc, confusion_matrix, rocCurve)
  return(new_list)
}

Hyperparameter tuning via Grid Search

To perform the Grid Search process, we need to call the tune_grid() function. Execution time will be measured via the {tictoc} package.

# Perform Grid Search
set.seed(2020)
tic()
results_grid_search <- tune_grid(
  model_wflow,                       # Model workflow defined above
  resamples = folds,                 # Resamples defined above
  param_info = HP_set,               # HP parameters to be tuned (defined above)
  grid = 10,                         # number of candidate parameter sets to be created automatically
  metrics = metric_set(roc_auc),     # metric
  control = control_grid(save_pred = TRUE, verbose = TRUE) # control the tuning process
)

results_grid_search
## #  5-fold cross-validation repeated 2 times 
## # A tibble: 10 x 6
##    splits              id      id2   .metrics          .notes           .predictions
##  1                     Repeat1 Fold1
##  2                     Repeat1 Fold2
##  3                     Repeat1 Fold3
##  4                     Repeat1 Fold4
##  5                     Repeat1 Fold5
##  6                     Repeat2 Fold1
##  7                     Repeat2 Fold2
##  8                     Repeat2 Fold3
##  9                     Repeat2 Fold4
## 10                     Repeat2 Fold5
toc()
## 366.69 sec elapsed

Results Grid Search process

Results of the executed Grid Search process:

  • Best hyperparameter combination obtained via Grid Search process:
# Select best HP combination
best_HP_grid_search <- select_best(results_grid_search, metric = "roc_auc", maximize = TRUE)
best_HP_grid_search
## # A tibble: 1 x 3
##    mtry trees min_n
##   <int> <int> <int>
## 1     1  1359    16
  • Performance: AUC value, confusion matrix, and the ROC curve (tuned model via Grid Search):
# Extract the AUC value, confusion matrix and the ROC curve with the my_finalize_func function
Finalize_grid <- my_finalize_func(results_grid_search, rec, model_def_to_tune)
cat("Model tuned via Grid Search scores an AUC value of ", Finalize_grid[[1]]$.estimate, "on the testing data", "\n")
## Model tuned via Grid Search scores an AUC value of  0.8248226 on the testing data
cat("The Confusion Matrix", "\n")
## The Confusion Matrix
print(Finalize_grid[[2]])
##           Truth
## Prediction   No  Yes
##        No  1268  404
##        Yes   19   67
cat("And the ROC curve:", "\n")
## And the ROC curve:
print(Finalize_grid[[3]])

We're done with the Grid Search method; let's now start the Bayesian hyperparameter optimization process.

Bayesian Hyperparameter tuning with tune package

How does Bayesian Hyperparameter Optimization with the {tune} package work?

As described in the package ‘tune’ vignette, the optimization starts with a set of initial results, such as those generated by tune_grid(). If none exist, the function will create several combinations and obtain their performance estimates. Using one of the performance estimates as the model outcome, a Gaussian process (GP) model is created where the previous tuning parameter combinations are used as the predictors. A large grid of potential hyperparameter combinations is predicted using the model and scored using an acquisition function. These functions usually combine the predicted mean and variance of the GP to decide the best parameter combination to try next. For more information, see the documentation for exp_improve() and the corresponding package vignette. The best combination is evaluated using resampling and the process continues.
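To illustrate the idea behind this loop, here is a deliberately tiny, self-contained sketch: a 1-D toy objective standing in for the cross-validated metric, a squared-exponential GP, and expected improvement as the acquisition function. It is purely illustrative and is not how {tune} implements the search:

# toy "CV performance" as a function of a single tuning parameter in [0, 1]
objective <- function(x) -(x - 0.3)^2 + 0.05 * sin(20 * x)

# squared-exponential kernel
sq_exp_kernel <- function(a, b, l = 0.1) {
  outer(a, b, function(x, y) exp(-(x - y)^2 / (2 * l^2)))
}

# GP posterior mean and standard deviation at new points
gp_posterior <- function(x_obs, y_obs, x_new, noise = 1e-4) {
  K     <- sq_exp_kernel(x_obs, x_obs) + diag(noise, length(x_obs))
  K_s   <- sq_exp_kernel(x_obs, x_new)
  K_ss  <- sq_exp_kernel(x_new, x_new)
  K_inv <- solve(K)
  mu  <- as.vector(t(K_s) %*% K_inv %*% (y_obs - mean(y_obs))) + mean(y_obs)
  var <- pmax(diag(K_ss - t(K_s) %*% K_inv %*% K_s), 1e-12)
  list(mu = mu, sd = sqrt(var))
}

# expected improvement for a maximization problem
expected_improvement <- function(mu, sd, best) {
  z <- (mu - best) / sd
  (mu - best) * pnorm(z) + sd * dnorm(z)
}

set.seed(123)
x_obs <- runif(4)              # initial "tuning parameter" values
y_obs <- objective(x_obs)      # their performance estimates
grid  <- seq(0, 1, length.out = 200)

for (i in 1:10) {
  post   <- gp_posterior(x_obs, y_obs, grid)
  ei     <- expected_improvement(post$mu, post$sd, max(y_obs))
  x_next <- grid[which.max(ei)]       # candidate with the highest expected improvement
  x_obs  <- c(x_obs, x_next)          # "evaluate" it and add it to the observations
  y_obs  <- c(y_obs, objective(x_next))
}

x_obs[which.max(y_obs)]               # best parameter value found by the loop
grid[which.max(objective(grid))]      # true maximizer on the grid, for comparison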

For our example we define the arguments of the tune_bayes() function as follows:

# Start the Bayesian HP search process
set.seed(1291)
tic()
search_results_bayesian <- tune_bayes(
  model_wflow,                              # workflows object defined above
  resamples = folds,                        # rset() object defined above
  param_info = HP_set,                      # HP set defined above (updated HP set)
  initial = 5,                              # here you could also use the results of the Grid Search
  iter = 10,                                # max number of search iterations
  metrics = metric_set(roc_auc),            # to optimize for the roc_auc metric
  control = control_bayes(no_improve = 8,   # cutoff for the number of iterations without better results
                          save_pred = TRUE, # output of sample predictions should be saved
                          verbose = TRUE))
toc()
## 425.76 sec elapsed

Results Bayesian Optimization Process

Results of the executed Bayesian optimization search process:

  • Best hyperparameter combination obtained via the Bayesian optimization process:
# Get the best HP combination
best_HP_Bayesian <- select_best(search_results_bayesian, metric = "roc_auc", maximize = TRUE)
best_HP_Bayesian
## # A tibble: 1 x 3
##    mtry trees min_n
##   <int> <int> <int>
## 1     2  1391    17
  • AUC value obtained with the final model (tuned model via the Bayesian optimization process):
# Build the final model (apply my_finalize_func)
Finalize_Bayesian <- my_finalize_func(search_results_bayesian, rec, model_def_to_tune)
# Get the AUC value
cat(" Tuned model via Bayesian method scores", Finalize_Bayesian[[1]]$.estimate, "AUC on the testing data", "\n")
##  Tuned model via Bayesian method scores 0.8295968 AUC on the testing data
cat("The Confusion Matrix", "\n")
## The Confusion Matrix
print(Finalize_Bayesian[[2]])
##           Truth
## Prediction   No  Yes
##        No  1178  263
##        Yes  109  208
cat("And the ROC curve:", "\n")
## And the ROC curve:
print(Finalize_Bayesian[[3]])

Summary of Achievements (with the {tune} package)

Let's summarize what we achieved with Grid Search and Bayesian Optimization so far.

# Build a new table with the achieved AUC's
xyz <- tibble(Method = c("Default", "Grid Search", "Bayesian Optimization"),
              AUC_value = c(auc_default$.estimate,
                            Finalize_grid[[1]]$.estimate,
                            Finalize_Bayesian[[1]]$.estimate))
default_value <- c(mtry = model_fit_default$fit$mtry,
                   trees = model_fit_default$fit$num.trees,
                   min_n = model_fit_default$fit$min.node.size)
vy <- bind_rows(default_value, best_HP_grid_search, best_HP_Bayesian)
all_HP <- bind_cols(xyz, vy)
all_HP %>% knitr::kable(
  caption = "AUC values and the best hyperparameter combinations: we can see that Bayesian hyperparameter tuning using the {tune} package improved the performance (AUC) of our model, but what about using the caret package?")
Method                  AUC_value  mtry  trees  min_n
Default                 0.8235755     4    500     10
Grid Search             0.8248226     1   1359     16
Bayesian Optimization   0.8295968     2   1391     17

Now, let's tune the model using the {caret} package

Hyperparameter Tuning Using {caret}

By default, the train function from the caret package automatically creates a grid of tuning parameters: if p is the number of tuning parameters, the grid size is 3^p. But in our example we set the number of hyperparameter combinations to 10.

Grid Search via {caret} package
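The training call itself would look roughly like the following sketch. It is reconstructed from the resampling settings reported in the output below (5-fold CV repeated twice, ROC as the metric, 10 candidate values) and may differ from the exact call used for this post:

set.seed(2020)
tic()
fitControl_grid <- trainControl(method = "repeatedcv",
                                number = 5, repeats = 2,           # 5-fold CV, repeated twice
                                summaryFunction = twoClassSummary, # report ROC, Sens, Spec
                                classProbs = TRUE)
ranger_fit_grid <- train(Target ~ .,
                         data = train_data,
                         method = "ranger",
                         metric = "ROC",
                         trControl = fitControl_grid,
                         tuneLength = 10)                          # 10 hyperparameter combinations
toc()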

## 186.69 sec elapsed
# print the trained model
ranger_fit_grid
## Random Forest 
## 
## 5274 samples
##   23 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 4219, 4220, 4219, 4219, 4219, 4219, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   ROC        Sens       Spec     
##    2    gini        0.8500179  0.9224702  0.4832002
##    2    extratrees  0.8469737  0.9280161  0.4631669
##    4    gini        0.8438961  0.9044102  0.5186060
##    4    extratrees  0.8435452  0.9031199  0.5075128
##    6    gini        0.8378432  0.8984766  0.5203879
##    6    extratrees  0.8383252  0.9004117  0.5050090
##    9    gini        0.8336365  0.8958967  0.5175243
##    9    extratrees  0.8336034  0.8946059  0.5046544
##   11    gini        0.8317812  0.8929298  0.5221736
##   11    extratrees  0.8313396  0.8918976  0.5092947
##   13    gini        0.8295577  0.8948648  0.5146633
##   13    extratrees  0.8296291  0.8900928  0.5067934
##   16    gini        0.8280568  0.8906072  0.5243203
##   16    extratrees  0.8282040  0.8893184  0.5032220
##   18    gini        0.8266870  0.8908655  0.5218139
##   18    extratrees  0.8270139  0.8891897  0.5089542
##   20    gini        0.8259053  0.8899628  0.5196672
##   20    extratrees  0.8264358  0.8884154  0.5064388
##   23    gini        0.8242706  0.8895753  0.5182373
##   23    extratrees  0.8259214  0.8884169  0.5025051
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
# Predict on the testing data
model_class_gr <- predict(ranger_fit_grid, newdata = test_data)
model_prob_gr <- predict(ranger_fit_grid, newdata = test_data, type = "prob")

test_data_with_pred_gr <- test_data %>%
  select(Target) %>%
  as_tibble() %>%
  mutate(model_class_ca = predict(ranger_fit_grid, newdata = test_data),
         model_prob_ca = predict(ranger_fit_grid, newdata = test_data, type = "prob")$Yes)

AUC achieved via the caret package after tuning the hyperparameters via Grid Search

# Compute the AUC
auc_with_caret_gr <- test_data_with_pred_gr %>%
  yardstick::roc_auc(truth = Target, model_prob_ca)

cat("Caret model via Grid Search method scores", auc_with_caret_gr$.estimate, "AUC on the testing data")
## Caret model via Grid Search method scores 0.8272427 AUC on the testing data

Adaptive Resampling Method

We will be using an advanced tuning method, the Adaptive Resampling method. It concentrates resampling on hyperparameter combinations with values near combinations that performed well, which makes it faster and more efficient because unneeded computations are avoided.

set.seed(2020)
tic()
fitControl <- trainControl(
                    method = "adaptive_cv",
                    number = 5, repeats = 4,                # cross-validation: 5 folds repeated 4 times (20 resamples)
                    adaptive = list(min = 3,                # minimum number of resamples per hyperparameter
                                    alpha = 0.05,           # confidence level for removing hyperparameters
                                    method = "BT",          # Bradley-Terry resampling method (alternatively "gls")
                                    complete = FALSE),      # if TRUE a full resampling set will be generated
                    search = "random",
                    summaryFunction = twoClassSummary,
                    classProbs = TRUE)

ranger_fit <- train(Target ~ .,
                    metric = "ROC",
                    data = train_data,
                    method = "ranger",
                    trControl = fitControl,
                    verbose = FALSE,
                    tuneLength = 10)                        # maximum number of hyperparameter combinations
toc()
## 22.83 sec elapsed
## Random Forest 
## 
## 5274 samples
##   23 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Adaptively Cross-Validated (5 fold, repeated 4 times) 
## Summary of sample sizes: 4219, 4220, 4219, 4219, 4219, 4219, ... 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  splitrule   ROC        Sens       Spec       Resamples
##    1             16    extratrees  0.8258154  0.8882158  0.5262459  3        
##    4              2    extratrees  0.8459167  0.9303470  0.4617981  3        
##    6              3    extratrees  0.8457763  0.9118612  0.5238479  3        
##    8              4    extratrees  0.8457079  0.9071322  0.5310207  3        
##   10             16    gini        0.8341897  0.8912221  0.5286226  3        
##   10             18    extratrees  0.8394607  0.8972503  0.5369944  3        
##   13              8    extratrees  0.8456075  0.9058436  0.5405658  3        
##   17              2    gini        0.8513404  0.9256174  0.4892473  3        
##   17             22    extratrees  0.8427424  0.8985379  0.5453320  3        
##   18             14    gini        0.8393974  0.8989635  0.5286226  3        
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 17.
# Predict on the testing data
test_data_with_pred <- test_data %>%
  select(Target) %>%
  as_tibble() %>%
  mutate(model_class_ca = predict(ranger_fit, newdata = test_data),
         model_prob_ca = predict(ranger_fit, newdata = test_data, type = "prob")$Yes)

AUC achieved via Caret package using Adaptive Resampling Method

# Compute the AUC value
auc_with_caret <- test_data_with_pred %>%
  yardstick::roc_auc(truth = Target, model_prob_ca)

cat("Caret model via Adaptive Resampling Method scores", auc_with_caret$.estimate, "AUC on the testing data")
## Caret model via  Adaptive Resampling Method scores 0.8301066  AUC on the testing data

Summary results

Conclusion and Outlook

In this case study we used the {tune} and {caret} packages to tune hyperparameters.

A) Using the {tune} package, we applied the Grid Search and Bayesian Optimization methods to optimize the mtry, trees and min_n hyperparameters of the machine learning algorithm “ranger” and found that:

  1. compared to using the default values, our model with tuned hyperparameter values had better performance.
  2. the model tuned via the Bayesian Optimization method performs better than the one tuned via Grid Search.

B) Using the {caret} package, we applied the Grid Search method and the Adaptive Resampling Method to optimize mtry, splitrule and min.node.size and found that:

  1. compared to using the default values, our model with tuned hyperparameter values had better performance.
  2. the model tuned via the Adaptive Resampling Method performs better than the one tuned via Grid Search.
  3. compared to using the relatively new {tune} package, our model built with the older {caret} package had better performance.

The results of our hyperparameter tuning experiments are displayed in the following table:

xyz <- tibble(Method = c("Default", "Grid Search", "Bayesian Optimization",
                         "Grid Search Caret", "Adaptive Resampling Method"),
              AUC_value = c(auc_default$.estimate,
                            Finalize_grid[[1]]$.estimate,
                            Finalize_Bayesian[[1]]$.estimate,
                            auc_with_caret_gr$.estimate,
                            auc_with_caret$.estimate))
Method                       AUC_value
Default                      0.8235755
Grid Search                  0.8248226
Bayesian Optimization        0.8295968
Grid Search Caret            0.8272427
Adaptive Resampling Method   0.8301066

Of course, these results depend on the data set used and on the chosen configuration (resampling scheme, number of iterations, cross-validation settings, ...); you may come to a different conclusion with another data set or configuration. Regardless of this dependency, our case study shows that the coding effort required for hyperparameter tuning with the tidymodels library is high and complex compared to the effort required with the caret package, and in our experiments caret was also more effective and led to better performance. I'm currently working on a new Shiny application for tuning the hyperparameters of almost all {parsnip} models using the {tune} package, which will hopefully reduce the complexity and the coding effort.

Thank you for your feedback; you can also reach me at kritiker2017@gmail.com.



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


E-Learning Quizzes with R/exams for Moodle and OpenOLAT


[This article was first published on R/exams, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Step-by-step tutorials for generating, importing, and customizing online tests and quizzes using exams2moodle() and exams2openolat() in R/exams.

E-Learning Quizzes with R/exams for Moodle and OpenOLAT

Motivation

E-learning resources such as online tests and quizzes or more formal e-exams are very useful in a variety of settings: formative vs. summative assessments; in-class vs. distance learning; synchronous vs. asynchronous; small vs. large groups of students. Some typical examples are outlined here.

  • Short tests/quizzes conducted in-class (synchronously) as a quick assessment of content the students had to prepare before class. In a flipped classroom approach the test/quiz might also be conducted after a collaboration phase in class.
  • Asynchronous online tests that students can do in their own time (e.g., over several days) to obtain (incentivized) feedback regarding their learning progress. Such tests might follow any form of content delivery, be it classical lectures or video screencasts or some other approach.
  • Synchronous e-exams conducted in-class or remotely (e.g., coupled with a safe exam browser) as a summative assessment at the end of a course.

R/exams can support these scenarios by creating a sufficiently large number of randomized versions of dynamic exercises that can subsequently be imported into a learning management system (LMS). The actual quiz/test/exam is then conducted in the LMS only, i.e., without the need to have R running in the background, because all exercises and corresponding solutions have been pre-computed and stored in the LMS. Popular LMS include the open-source systems Moodle, Canvas, OpenOLAT, and Ilias, as well as the commercial Blackboard system. R/exams provides suitable interfaces for all of these, but the capabilities differ somewhat between the LMS. In the following we focus on Moodle and OpenOLAT, both of which provide very flexible and powerful assessment modules.

Creation in R/exams

Just like for other R/exams interfaces the starting point is putting together a vector or list of (potentially) dynamic exercises in R. From these exercises – in either R/Markdown or R/LaTeX format – a number of random replications can be drawn using either exams2moodle() or exams2openolat(), respectively. Both interfaces support all R/exams exercise types: single-choice (schoice), multiple-choice (mchoice), numeric (num), string (string), or combinations of these (cloze). See the First Steps tutorial for more details.


Here, we use a collection of exercise templates that are all shipped within the R/exams package and that cover a broad range of different question types as well as different randomly-generated content (shuffling, random parameters, R output, graphics, simulated data sets).

Exercise template   Type      Task
swisscapital        schoice   Knowledge quiz question with basic shuffling
deriv               num       Computing the derivative of a function with randomized parameters
ttest               mchoice   Interpretation of R output from t.test()
boxplots            mchoice   Interpretation of parallel boxplots
function            string    Knowledge quiz question where the answer is the name of an R function
lm                  cloze     Conducting a simple linear regression based on a randomly-generated CSV file
fourfold2           cloze     Completing a fourfold table based on verbal description with randomized parameters

First, we load the exams package and define a vector with all exercise .Rmd file names.

library("exams")elearn_exam <- c("swisscapital.Rmd", "deriv.Rmd", "ttest.Rmd",  "boxplots.Rmd", "function.Rmd", "lm.Rmd", "fourfold2.Rmd")

Alternatively, the corresponding .Rnw files could be used, yielding virtually identical output.

Second, we generate a Moodle XML file with 3 random replications of each of the exercises.

set.seed(2020-03-15)
exams2moodle(elearn_exam, n = 3, name = "R-exams")

This yields the file R-exams.xml that can be imported into Moodle.

Analogously, exams2openolat() generates a ZIP archive containing QTI 2.1 XML files (Question & Test Interoperability standard) for import into OpenOLAT.

set.seed(2020-03-15)
rxm <- exams2openolat(elearn_exam, n = 3, name = "R-exams")

The resulting output file is R-exams.zip.

Moreover, to show that the object returned within R can also be useful we have assigned the output of exams2openolat() to an object rxm. This is not necessary but inspecting this object might be helpful when developing and testing new exercises. In particular, we can easily extract the meta-information regarding the correct answers in all randomly generated exercises.

exams_metainfo(rxm)
## 
## exam1
##     1. Swiss Capital: 4
##     2. derivative exp: 38.72 (38.71--38.73)
##     3. 2-sample t-test: 1, 2, 5
##     4. Parallel boxplots: 2, 4, 5
##     5. R functions: lm
##     6. Linear regression: FALSE, FALSE, TRUE | -0.861
##     7. fourfold: 4.44 | 19.74 | 1.56 | 74.26 | 24.18 | 75.82 | 6 | 94 | 100
## 
## exam2
##     1. Swiss Capital: 4
##     2. derivative exp: 2 (1.99--2.01)
##     3. 2-sample t-test: 3
##     4. Parallel boxplots: 2, 3, 4
##     5. R functions: vcov
##     6. Linear regression: FALSE, TRUE, FALSE | 0.531
##     7. fourfold: 5.76 | 23.92 | 2.24 | 68.08 | 29.68 | 70.32 | 8 | 92 | 100
## 
## exam3
##     1. Swiss Capital: 2
##     2. derivative exp: 2.05 (2.04--2.06)
##     3. 2-sample t-test: 2, 3
##     4. Parallel boxplots: 2, 4, 5
##     5. R functions: glm
##     6. Linear regression: TRUE, FALSE, FALSE | 0.024
##     7. fourfold: 6.5 | 22.5 | 3.5 | 67.5 | 29 | 71 | 10 | 90 | 100

Import into Moodle and OpenOLAT

Finally, the output files generated above can be imported into the Moodle and OpenOLAT learning management system, respectively. In Moodle the random exercises are imported into a question bank based on which a quiz with randomly-selected questions can be constructed. In OpenOLAT the import directly yields a test learning resource that can then be embedded in a course.


Moodle import

A step-by-step video guide to importing and customizing the quiz in Moodle is available on YouTube at https://www.youtube.com/watch?v=5K9hrE3YkPs.


OpenOLAT import

A step-by-step video guide to importing and customizing the test in OpenOLAT is available on YouTube at https://www.youtube.com/watch?v=1ZhdmoDtUSA.



To leave a comment for the author, please follow the link and comment on their blog: R/exams.


Le Monde puzzle [#1133]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A weekly Le Monde mathematical puzzle that reminded me of an earlier one (but I was too lazy to check):

If ADULE-ELUDE=POINT, what is the largest possible value of POINT? With the convention that all letters correspond to different digits and no number can start with 0. Same question when ADULE+ELUDE=POINT.

A run of a brute-force R search returns 65934 as the solution (code golf welcome!)

dify <- function(aluda)
  (sum(aluda * 10^(4:0)) - sum(rev(aluda) * 10^(4:0)))
num2dig <- function(dif) (dif %/% 10^(0:4)) %% 10
sl = NULL
for (t in 1:1e6){
  adule = sample(0:9, 5)
  while ((dify(adule) <= 0) || (!prod(adule[c(1, 5)])))
    adule = sample(0:9, 5)
  point = rev(num2dig(dify(adule)))
  if ((!sum(duplicated(point))) & (prod(point %in% (0:9)[-adule-1])))
    sl = rbind(sl, c(adule, point))
}
sl = as.matrix(distinct(as.data.frame(sl), .keep_all = TRUE))

where distinct is a dplyr R function.

> 94581-18549
[1] 76032

The code can be easily turned into solving the second question

> 31782+28713
[1] 60495

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.



A deep dive into glmnet: predict.glmnet


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, instead of looking at one of the function options of glmnet, we’ll look at the predict method for a glmnet object instead. The object returned by glmnet (call it fit) has class "glmnet"; when we run predict(fit), it runs the predict method for class "glmnet" objects, i.e. predict.glmnet(fit).

For reference, here is the full signature of the predict.glmnet function/method (v3.0-2):

predict(object, newx, s = NULL,
        type = c("link", "response", "coefficients", "nonzero", "class"),
        exact = FALSE, newoffset, ...)

In the above, object is a fitted "glmnet" object (call it fit). Recall that every glmnet fit has a lambda sequence associated with it: this will be important in understanding what follows. (This sequence can be accessed via fit$lambda.)

For the rest of this post, we will use the following data example:

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)
fit <- glmnet(x, y)

Function option: newx

newx is simply the new x matrix at which we want predictions for. So for example, if we want predictions for the training x matrix, we would do

predict(fit, x)

If no other arguments are passed, we will get a matrix of predictions, each column corresponding to predictions for each value of \lambda in fit$lambda. For our example, fit$lambda has length 68 and x consists of 100 rows/observations, so predict(fit, x) returns a 100 \times 68 matrix.

length(fit$lambda)
# [1] 68
dim(predict(fit, x))
# [1] 100  68

newx must be provided except when type="coefficients" or type="nonzero" (more on these types later).

Function option: newoffset

If the original glmnet call was fit with an offset, then an offset must be included in the predict call under the newoffset option. If not included, an error will be thrown.

set.seed(2)
offset <- rnorm(n)
fit2 <- glmnet(x, y, offset = offset)
predict(fit2, x)
# Error: No newoffset provided for prediction, yet offset used in fit of glmnet

The reverse is true, in that if the original glmnet call was NOT fit with an offset, then predict will not allow you to include an offset in the prediction, EVEN if you pass it the newoffset option. It does not throw a warning or error, but simply ignores the newoffset option. You have been warned! This is demonstrated in the code snippet below.

pred_no_offset <- predict(fit, x)
pred_w_offset <- predict(fit, x, offset = offset)
max(abs(pred_no_offset - pred_w_offset))
# [1] 0

Function option: s and exact

s indicates the \lambda values for which we want predictions at. If the user does not specify s, predict will give predictions for each of the \lambda values in fit$lambda.

(Why is this option named s and not the more intuitive lambda? In page 5 of this vignette, the authors say they made this choice “in case later we want to allow one to specify the model size in other ways”. lambda controls the model size in the sense that the larger it is, the more coefficients will be forced to zero. There are other ways to specify model size. For example, one could imagine a function option where we specify the number of non-zero coefficients we want in the model, or where we specify the maximum \ell_1 norm the coefficient vector can have. None of these other options have been implemented at the moment.)

If the user-specified s values all belong to fit$lambda, then predict pulls out the coefficients corresponding to those values and returns predictions. In this case, the exact option has no effect.

If the user-specified s value does NOT belong to fit$lambda, things get interesting. If exact=FALSE (the default), predict uses linear interpolation to make predictions. (More accurately, it does linear interpolation of the coefficients, which translates to linear interpolation of the predictions.) As stated in the documentation: “while this is often a good approximation, it can sometimes be a bit coarse”.

As a demonstration: In the snippet below, we look at the predictions at a value of \lambda that lies between the two largest values in fit$lambda. If the function does as the documentation says, the last line should give a value of 0 (to machine precision).

b1 <- as.numeric(predict(fit, x, s = fit$lambda[1]))
b2 <- as.numeric(predict(fit, x, s = fit$lambda[2]))
b3 <- as.numeric(predict(fit, x,
                         s = 0.3 * fit$lambda[1] + 0.7 * fit$lambda[2]))
max(abs(b3 - (0.3 * b1 + 0.7 * b2)))
# [1] 3.885781e-16

What happens if we have values in s that are not within the range of fit$lambda? First, I would recommend using exact=TRUE because extrapolation beyond the range of fit$lambda is dangerous in general. In my little experiments, it looks like predict simply returns the predictions for the \lambda value in fit$lambda that is closest to s.
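As a quick check of that observation on our running example (my own sketch, not from the original post), we can compare predictions at an s below the smallest value in fit$lambda with predictions at that smallest value:

# Hedged check: with exact = FALSE, an s below the fitted range appears to be
# clamped to the smallest lambda in the path (an observation, not documented behaviour).
p_below <- predict(fit, x, s = min(fit$lambda) / 10)
p_edge  <- predict(fit, x, s = min(fit$lambda))
max(abs(p_below - p_edge))
# expected to be (close to) 0 if the claim above holds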

If exact=TRUE, predict merges s with fit$lambda to get a single (decreasing) \lambda sequence, refits the glmnet model, then returns predictions at the \lambda values in s. If your training data is very large, this refitting could take a long time.

One note when using exact=TRUE is that you have to pass in additional arguments in order for the refitting to happen. That’s because the fitted glmnet object does not contain all the ingredients needed to do refitting. For our example, to predict for fit we need to supply x and y as well. For more complicated glmnet calls, more options have to be provided.

predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE)
# Error: used coef.glmnet() or predict.glmnet() with `exact=TRUE` 
# so must in addition supply original argument(s)  x and y  in order to 
# safely rerun glmnet
predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE, x = x, y = y)
# glmnet correctly returns predictions...

Function option: type

The type option determines the type of prediction returned. type="coefficients" returns the model coefficients for the \lambda values in s as a sparse matrix. type="nonzero" returns a list, with each element being a vector of the features which have non-zero coefficients. For example, the code snippet below shows that for the second and third \lambda values in fit$lambda, the features that have non-zero coefficients are feature 5 and features 3 and 5 respectively.

predict(fit, type = "nonzero", s = fit$lambda[2:3])# $`1`# [1] 5# # $`2`# [1] 3 5

For type="coefficients" and type="nonzero", the user does not have to provide a newx argument since the return value does not depend on where we want the predictions. For the rest of the possible values of type, newx is required.

For type="link" (the default) and type="response" it helps to know a little GLM theory. For a observation having values x \in \mathbb{R}^p, type="link" returns x^T \beta, where \beta is the coefficient vector corresponding to a \lambda value in s.

For type="response", x^T \beta is passed through the GLM’s inverse link function to return predictions on the y scale. For “gaussian” family it is still x^T \beta. For “binomial” and “poisson” families it is \exp(x^T \beta) / (1 + \exp(x^T \beta)) and \exp(x^T \beta) respectively. For “multinomial” it returns fitted probabilities and for “cox” it returns fitted relative risk.

The final possibility, type="class", applies only to “binomial” and “multinomial” families. For each observation, it simply returns the class with the highest predicted probability.
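To illustrate the last two types, here is a small sketch of my own (it uses a simulated binary response, which is not part of the original post): for a “binomial” fit, type="response" should equal the inverse logit of type="link", and type="class" simply picks the class with the higher fitted probability.

# Hedged sketch with a made-up binary response; reuses x and n from the running example.
set.seed(3)
yb <- rbinom(n, size = 1, prob = 0.5)
fitb <- glmnet(x, yb, family = "binomial")
lnk <- predict(fitb, x, s = fitb$lambda[10], type = "link")
rsp <- predict(fitb, x, s = fitb$lambda[10], type = "response")
cls <- predict(fitb, x, s = fitb$lambda[10], type = "class")
max(abs(rsp - 1 / (1 + exp(-lnk))))       # expected ~0: response = inverse logit of link
all(cls == ifelse(rsp > 0.5, "1", "0"))   # expected TRUE: class = thresholded probability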

Bonus: The coef method

The coef method for glmnet is actually just a special case of the predict method. This can be seen from the source code:

coef.glmnet
# function (object, s = NULL, exact = FALSE, ...) 
#     predict(object, s = s, type = "coefficients", exact = exact, 
#             ...)
# 
# 

Bonus: predict.elnet, predict.lognet, …

If you inspect the class of the object returned by a glmnet call, you will realize that it has more than one class. In the code below, we see that “gaussian” family results in an “elnet” class object. (“binomial” family returns a “lognet” object, “poisson” family returns a “fishnet” object, etc.)

class(fit)
# [1] "elnet"  "glmnet"

These classes have their own predict methods as well, but they draw on this base predict.glmnet call. As an example, here is the code for predict.fishnet:

glmnet:::predict.fishnet
# function (object, newx, s = NULL, type = c("link", "response", 
#        "coefficients", "nonzero"), exact = FALSE, newoffset, ...) 
# {
#     type = match.arg(type)
#     nfit = NextMethod("predict")
#     switch(type, response = exp(nfit), nfit)
# }
# 
# 

What happens here is that predict.glmnet is first called. If type is not "response", then we simply return whatever predict.glmnet would have returned. However, if type="response", then (i) we call predict.glmnet, and (ii) the predictions are passed through the function x \mapsto \exp(x) before being returned.

This is how predict is able to give the correct return output across the different family and type options.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.


Mean expectations


[This article was first published on R on OSM, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

We’re taking a break from our extended analysis of rebalancing to get back to the other salient parts of portfolio construction. We haven’t given up on the deep dive into the merits or drawbacks of rebalancing, but we feel we need to move the discussion along to keep the momentum. This should ultimately tie back to rebalancing, but from a different angle. We’ll now start to examine capital market expectations.

Capital market expectations are the bedrock of portfolio construction. From pension funds to hedge funds, from portfolio managers to financial advisors, one needs to make some sort of assumption about the future—explicit or otherwise. The reason: you’re building a portfolio to achieve some expected return. The starting point, then, is to estimate the expected returns of various asset classes worthy of inclusion in a satisfactory portfolio. Yet, where do these expectations come from?

Practitioners use three main methods:

  • Historical means
  • Discounted cash flow
  • Risk premia

Each has its advantages and drawbacks. For this post we’ll introduce the concept of using historical means to formulate expected returns without going into an exhaustive analysis of the benefits and drawbacks of this method.

The mean return over a sufficient period is a very economical way to estimate expected returns. The data and calculation are relatively easy to procure. And the concept is easy to understand: the future will look somewhat like the past. But there are a number of problems with this approach:

  • What’s the confidence that this mean is the right one?
  • What’s the right period to use to calculate the mean?
  • How do we know the future will resemble the past?

We’ll use the S&P 500 to explore some of these questions. We start by graphing a histogram of monthly returns from 1950 to the present. We overlay a normal distribution (red line) and boundaries for two standard deviations (vertical blue lines).

As is evident, the returns don’t fall into a normal distribution, are negatively skewed (more negative observations), and are fat-tailed (more observations pierce the red line at the ends of the graph). Why is this an issue? Using the properties of a normal distribution on non-normal data won’t yield the same degree of confidence that the calculated mean is a good estimate of the true mean. Moreover, the range of likely outcomes will be different from what a normal distribution implies. The normal distribution estimates that only 5% of the observations will fall outside of two standard deviations. For the S&P, it’s more than that and not symmetric: more observations outside of negative two standard deviations, fewer observations outside of positive two.
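As a rough check on that claim, one can count the share of monthly returns falling more than two standard deviations from the mean (a sketch of my own; it assumes the df data frame of monthly returns built in the code at the end of this post):

# Hedged sketch: empirical tail weight vs. the roughly 2.3% per tail implied by a normal distribution.
m <- mean(df$sp, na.rm = TRUE)
s <- sd(df$sp, na.rm = TRUE)
mean(df$sp < m - 2 * s, na.rm = TRUE)   # share of months below -2 standard deviations
mean(df$sp > m + 2 * s, na.rm = TRUE)   # share of months above +2 standard deviations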

There are ways to get around this. One is to bootstrap the data to arrive at non-parametric estimates. That is, pull returns from the data thousands of times to arrive at multiple samples of mean returns. Since the present data set is so large, bootstrapping won’t produce a huge difference, but we’ll show the results for completeness in the table below.

Table 1: Estimates with lower and upper confidence levels (%)
Estimate    Return  Lower  Upper
Sample         8.4    5.0   11.8
Bootstrap      8.2    4.7   11.7
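The numbers in the table can be reproduced with a short bootstrap along the following lines (a condensed sketch of the code given at the end of this post; it assumes the same df data frame of monthly returns):

# Condensed sketch of the bootstrap behind Table 1 (full version in the code at the end of the post).
library(boot)
mean_boot <- function(d, idx) mean(d$sp[idx], na.rm = TRUE)
set.seed(123)
b_strap <- boot(df, mean_boot, R = 1000)
mean(b_strap$t) * 1200                               # annualized bootstrap mean, in percent
(mean(b_strap$t) + c(-2, 2) * sd(b_strap$t)) * 1200  # approximate 95% interval, in percent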

As one can see the bootstrapped return is not that far off from the sample return and the confidence intervals are pretty close too. But should we be using the entire dataset to estimate the mean return?

There’s a good argument to claim that the period from 1950s to the 1980s was much different than the period afterward, and not because of nostalgia. The U.S., the main economy underpinning the S&P, saw a siginficant shift from a manufacturing to a consumer-driven economy. Then there’s technological change, globalization, and financial innovation.

Our point is not to argue that we’re living in a brave new world; rather, it’s to stress that the factors driving returns many decades in the past are not likely to be the same factors that drive returns in the future. Hence, why should returns from long ago carry the same weight as recent returns in our estimates of the future, which is what is implied when one uses a simple average? A graphic should help highlight this phenomenon. In the graph below, we present the average return calculated over a period that starts in every year since the beginning of our data set and runs until the present. The last start year is 2000, so even the shortest timeframe features 20 years of data.

As one can see, even for long periods of time, mean returns aren’t stable. The range is 4.8 percentage points. While that might seem small, using the top end would yield 20% more money than the bottom after five years. And this doesn’t account for the confidence interval of that estimate. We graph the mean return along with the 95% confidence interval below.

Here we see that not only is the mean return declining, but the confidence interval is getting wider. That suggests we should be less certain how close that single estimate of return is to the true return.

To see how wide the confidence intervals have become, in the chart below we graph the absolute value of the potential range of returns as a percentage of the mean. In other words, we take the standard width of the confidence interval (whether up or down) and divide it by the mean return. This gives one a standardized estimate as to how wide the outcomes could be.

A few points to note with this graph. First, even at the low end, a confidence interval that is 40% above or below the mean is pretty wide. But one that is 129% of the mean (the ratio at the high end) borders on absurdity. In truth, if your confidence interval crosses zero (implied by the 100% or greater ratio), then you might as well assume that the sample mean is not much different from noise. The red line on the plot shows where that cut-off occurs. In other words, from 1997 on, it is really hard to pinpoint what the true average return might be.

But don’t let statistics get in the way of a good story! That period included two major drawdowns—the tech bubble and the global financial crisis—along with an unprecedented drop in interest rates globally and massive money printing. So shouldn’t it feature different returns than the past? If we graph the distributions before and after 1997, an artificial cut-off to say the least, we don’t see a dramatic difference

When we bootstrap the difference in means for the period pre and post-1997, we find that only 25% of the samples produced a lower mean for the post-1997 data vs. the pre-1997 data. That suggests that the differences we do see in the mean returns (post-1997 is lower than pre on the actual data) are likely due to chance. Seems like we’re in a quandary.

Ultimately, even if we had a great deal of confidence that the mean return we calculate was the correct number, what confidence do we have that it will be the right number in the future? Don’t all those investment brochures tell us that past performance is not indicative of future results?

Part of the problem might be methodological. The approach we’re using assumes there is some true population mean, and the data we have is only one sample from that population. Our confidence that the average returns we’ve calculated are the true averages assumes there is some underlying order to the population distribution. Bootstrapping doesn’t assume as much, but it certainly doesn’t believe we’re drawing data from a chaotic soup. However, other approaches don’t make an assumption about the population. They turn it around, believing that one can make an educated guess about the distribution based on the data observed. This approach fits the distribution to the data vs. fitting the data to the distribution, as in the prior method.1 Perhaps using this second approach, we can achieve more confidence about which mean return to use. But we’d have to test that in another post.

A different approach might be to use some sort of rolling average of past returns to forecast a future return. We might average the past five years of monthly data as a way to forecast the next year or few years of returns. For example, when we regress the rolling five-year monthly annualized returns against the forward three-year annualized returns we arrive at the following statistics.

Term                 Estimate (%)   Statistic
Intercept                     8.6        18.2
Five-year returns            -0.1        -2.3

The estimates are significant and the intercept approximates the long-run mean return we calculated previously, while capturing some degree of mean reversion. We’d want to analyze this approach using a train-test split to see how well this would work out-of-sample. It could prove interesting, but that will also have to wait for another post.

Let’s review. The historical mean return is an intuitive, easy-to-use proxy for expected future returns. But having confidence in any particular estimate is relatively low. A range of likely outcomes would seem better, except that depending on the period used, that range may be too wide to be useful. Which brings us to a third point: deciding which period to use leads to some very different estimates of the future.

Clearly, this wasn’t an exhaustive examination of using the historical mean to formulate expected returns. It asked more questions than answered. But the goal was only to introduce the concept. Next we’ll examine the other methods we mentioned above; that is, discounted cash flow and risk premia methods. After we’ve discussed all these approaches, we can then analyze which approach or combination of approaches is likely to produce the most accurate forecast. Until then, here’s all the code behind this week’s post.

### Load packages
suppressPackageStartupMessages({
  library(tidyquant)
  library(tidyverse)
  library(boot)
})
options("getSymbols.warning4.0" = FALSE)

### Load data
# SP500 price
sp <- getSymbols("^GSPC", src = "yahoo", from = "1950-01-01", auto.assign = FALSE) %>%
  Ad() %>%
  `colnames<-`("sp")
sp_mon <- to.monthly(sp, indexAt = "lastof", OHLC = FALSE)
sp_ret <- ROC(sp_mon, type = 'discrete')
df <- data.frame(date = index(sp_ret), sp = as.numeric(sp_ret))

## Histogram
df %>%
  ggplot(aes(sp*100)) +
  geom_histogram(aes(y = ..density..),
                 fill = "blue",
                 bins = 100) +
  stat_function(fun = dnorm,
                args = list(mean = mean(df$sp*100, na.rm = TRUE),
                            sd = sd(df$sp*100, na.rm = TRUE)),
                color = "red",
                lwd = 1.25) +
  geom_vline(xintercept = sd(df$sp*100, na.rm = TRUE)*-2, size = 1.25, color = "slateblue") +
  geom_vline(xintercept = sd(df$sp*100, na.rm = TRUE)*2, size = 1.25, color = "slateblue") +
  geom_text(aes(x = -12, y = .08,
                label = "2 Std. deviation"),
            size = 4,
            color = "slateblue") +
  geom_segment(aes(x = -14, xend = -9,
                   y = .07, yend = .07),
               arrow = arrow(length = unit(2, "mm")),
               color = "slateblue") +
  labs(x = "Returns (%)",
       y = "Density",
       title = "S&P 500 monthly return distribution 1950-Present")

mean_ret <- round(mean(df$sp, na.rm = TRUE), 3)*1200
med_ret <- round(median(df$sp, na.rm = TRUE), 3)*1200
t_test <- t.test(df$sp)
t_stat <- t_test$statistic
low <- round(t_test$conf.int[1], 3)*1200
high <- round(t_test$conf.int[2], 3)*1200

## Create function for boot
mean_boot <- function(df, index){
  dat <- df[index,]
  mean(dat$sp, na.rm = TRUE)
}

## Run boot
set.seed(123)
b_strap <- boot(df, mean_boot, 1000)

## Create summary stats
bs_mean <- mean(b_strap$t)*1200
bs_up <- (mean(b_strap$t) + 2*sd(b_strap$t))*1200
bs_low <- (mean(b_strap$t) - 2*sd(b_strap$t))*1200
sderr <- sd(df$sp, na.rm = TRUE)/sqrt(nrow(df))
samp_mean <- mean_ret
samp_up <- (mean_ret/1200 + 2*sderr)*1200
samp_low <- (mean_ret/1200 - 2*sderr)*1200

## Compile data frame
stats <- data.frame(Estimate = c("Sample", "Bootstrap"),
                    Return = c(samp_mean, bs_mean),
                    Lower = c(samp_low, bs_low),
                    Upper = c(samp_up, bs_up))

## Print table
stats %>%
  mutate_at(vars(-Estimate), function(x) round(x, 1)) %>%
  knitr::kable(caption = "Estimates with lower and upper confidence levels (%)")

## Create rolling forward to present mean return
# create index and function
index <- data.frame(start = seq(as.Date("1950-01-01"), length = 51, by = "years"),
                    end = rep(as.Date("2019-12-31"), 51))

time_func <- function(start, end, dframe){
  out <- dframe %>%
    filter(date >= start, date <= end) %>%
    summarise(mean = mean(sp, na.rm = TRUE),
              sd = sd(sp, na.rm = TRUE),
              sderr = sd(sp, na.rm = TRUE)/sqrt(n())) %>%
    as.numeric()
  out
}

## Create rolling average data frame
roll_est <- data.frame(ret = rep(0, 51), sd = rep(0, 51), sderr = rep(0, 51))
for(i in 1:51){
  roll_est[i,] <- time_func(index[i, 1], index[i, 2], df)
}
roll_est <- roll_est %>%
  mutate(date = seq(1950, 2000, 1)) %>%
  select(date, everything())

# Graph results
roll_est %>%
  ggplot(aes(date, ret*1200)) +
  geom_bar(stat = 'identity', fill = "blue") +
  labs(x = "",
       y = "Return (%)",
       title = "Mean return from start year to present")

range <- round(max(roll_est$ret) - min(roll_est$ret), 3)*1200

# Point range plot
roll_est %>%
  mutate(upper = ret + sderr,
         lower = ret - sderr) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = ret*1200), color = "blue", size = 1) +
  geom_pointrange(aes(y = ret*1200,
                      ymin = lower*1200,
                      ymax = upper*1200),
                  color = "blue",
                  fill = "white",
                  fatten = 3,
                  size = 0.8,
                  shape = 21) +
  labs(x = "",
       y = "Return (%)",
       title = "Mean return start year to present with std error range")

# Sderr as % of mean
roll_est %>%
  mutate(ret_range = (2*sderr)/ret * 100) %>%
  ggplot(aes(date, ret_range)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_hline(yintercept = 82, color = "red") +
  geom_text(aes(x = 1975,
                y = 100,
                label = "Significance line"),
            color = "red",
            size = 4) +
  labs(x = "",
       y = "Interval over mean (%)",
       title = "Confidence interval as a percent of the mean")

conf_int <- roll_est %>%
  mutate(ret_range = (2*sderr)/ret * 100) %>%
  summarise(max = max(ret_range),
            min = min(ret_range)) %>%
  as.numeric() %>%
  round()

## Pre and post 1997 histogram
df %>%
  mutate(div = ifelse(date <= "1997-01-01", "a", "b")) %>%
  ggplot(aes(sp*100)) +
  geom_density(aes(fill = div), alpha = 0.4) +
  scale_fill_manual("", labels = c("Pre-1997", "Post-1997"),
                    values = c("blue", "pink"),
                    drop = FALSE) +
  labs(x = "Return (%)",
       y = "Density",
       title = "Monthly return distributions pre and post 1997") +
  theme(legend.position = "top")

# T-test
diff_func <- function(df, index){
  df1 <- df %>% filter(date < "1997-01-01")
  df2 <- df %>% filter(date >= "1997-01-01")
  dat1 <- df1[index,]
  dat2 <- df2[index,]
  mean(dat2 < dat1, na.rm = TRUE)
}
t_boot <- boot(df, diff_func, 1000)
t_out <- round(mean(t_boot$t), 2)*100

# Regression model
df %>%
  mutate(five_yr = rollapply(sp, width = 60, mean, align = "right", fill = NA)*1200,
         three_yr = rollapply(sp, width = 36, mean, align = "left", fill = NA)*1200) %>%
  lm(three_yr ~ five_yr, .) %>%
  broom::tidy() %>%
  mutate(Term = c("Intercept", "Five-year returns")) %>%
  select(Term, estimate, statistic) %>%
  rename("Estimate" = estimate,
         "Statistic" = statistic) %>%
  mutate_at(vars(-Term), function(x) round(x, 1)) %>%
  knitr::kable()

  1. We’re of course discussing the differences between the Frequentist and Bayesian approaches.↩


To leave a comment for the author, please follow the link and comment on their blog: R on OSM.


RProtoBuf 0.4.17: Robustified


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new release 0.4.17 of RProtoBuf is now on CRAN. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol.

This release contains small polishes related to the release 0.4.16 which added JSON support for messages, and switched to ByteSizeLong. This release now makes sure JSON functionality is only tested where available (on version 3 of the Protocol Buffers library), and that ByteSizeLong is only called where available (version 3.6.0 or later). Of course, older versions build as before and remain fully supported.

Changes in RProtoBuf version 0.4.17 (2020-03-xx)

  • Condition use of ByteSizeLong() on building with ProtoBuf 3.6.0 or later (Dirk in #71 fixing #70).

  • The JSON unit tests are skipped if ProtoBuf 2.* is used (Dirk, also #71).

  • The configure script now extracts the version from the DESCRIPTION file (Dirk, also #71).

CRANberries provides the usual diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, and the pre-print of our JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


A GARCH Tutorial in R


[This article was first published on R on msperlin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Mauro Mastella, Daniel Vancin, Henrique Ramos and I just finished a tutorial paper about GARCH models in R, and I believe it is good content for those learning financial econometrics. You can find the full paper in this link.

In a nutshell, the paper introduces the motivation behind GARCH-type models and presents an empirical application: given the recent COVID-19 crisis, we investigate how much time it would take for the Ibovespa index to reach its peak value once again. The results indicate that it would take, on average, about two and a half years for the index to recover.

All code and data used in the study are available on GitHub, so feel free to download the zip file and play around. You can find all figures of the paper in this link. Worth pointing out that you can reproduce all results on your own computer by executing the source code at GitHub.


To leave a comment for the author, please follow the link and comment on their blog: R on msperlin.


wrapr 2.0.0 up on CRAN


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

wrapr 2.0.0 is now up on CRAN.

This means the := variant of unpack[] is now easy to install.

Please give it a try!


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


What's Cooking??


[This article was first published on Stencilled, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This R Shiny app is for all the home chefs out there looking to try different recipes, especially now that we are all social distancing and looking for ways to keep us sane! (Yes, cooking is a form of meditation where, in the end, you have something to eat :D). From a database of 28,000+ recipes, just enter the ingredients you wish to filter on and the type of cuisine, and you will see all the recipes with your specified ingredients from your favorite cuisine!

Bon Appétit!!

What’s Cooking

You can find code for the app here.


To leave a comment for the author, please follow the link and comment on their blog: Stencilled.


a global pandemic on Twitter


[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Brief introduction

So, lots of linguistic variation happening in real time. coronavirus, covid19, pandemic, and more recently (the?) coronavirus pandemic. For sure, these expressions are not proper synonyms – each refer to different “aspects” of the virus. coronavirus ~ virus. covid19 ~ disease. pandemic ~ social/epi. Here, we take a super quick look at how this variation in reference is materializing on Twitter among the 535 voting members of the United States Congress since January 2020.

Twitter details

First things first, we obtain Twitter handles and some relevant biographical details (here, political affiliation) for the 100 US Senators and the 435 members of the House of Representatives from the unitedstates project.

library(tidyverse)
leg_dets <- 'https://theunitedstates.io/congress-legislators/legislators-current.csv'

twitters <- read.csv((url(leg_dets)),
                     stringsAsFactors = FALSE) %>%
  #filter(type == 'rep') %>% # & twitter!=''
  rename(state_abbrev = state,
         district_code = district)

Then we scrape the last 1000 tweets for each of the 535 members of congress using the rtweet package. Here, we are just trying to get all tweets from 2020 – 1,000 is overkill. We exclude re-tweets. The scraping process takes roughly an hour or so.

congress_tweets <- rtweet::get_timeline(
  twitters$twitter,
  n = 1000,
  check = FALSE) %>%
  mutate(created_at = as.Date(gsub(' .*$', '', created_at))) %>%
  filter(is_quote == 'FALSE' &
           is_retweet == 'FALSE' &
           created_at >= '2020-01-01' &
           display_text_width > 0)

# setwd("/home/jtimm/jt_work/GitHub/data_sets")
# saveRDS(congress_tweets, 'cong2020_tweets_tif.rds')

Then we join the two data sets and calculate the total number of tweets generated by members of Congress, by party affiliation, in 2020.

congress_tweets1 <- congress_tweets %>%
  mutate(twitter = toupper(screen_name)) %>%
  select(status_id, created_at, twitter, text) %>%
  inner_join(twitters %>% mutate(twitter = toupper(twitter)))

all_tweets <- congress_tweets1 %>%
  group_by(created_at, party) %>%
  summarise(ts = n()) %>%
  rename(date = created_at)

The figure below summarizes total tweets by party affiliation since the first of the year. Donald Trump presented his State of the Union address on February 5th, hence the spike in activity. There seems to be a slight upward trend in total tweets – perhaps one that is more prevalent among Democrats – presumably in response to the Coronavirus.

Also, Democrats do tweet more, but they also have numbers at present. And it seems that members of Congress put their phones down a bit on the weekends.

all_tweets %>%
  filter(party != 'Independent') %>% # Justin Amash & Bernie Sanders & Angus King
  ggplot() +
  geom_line(aes(x = date,
                y = ts,
                color = party),
            size = 1.25) +
  theme_minimal() +
  ggthemes::scale_color_stata() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(date_breaks = '1 week', date_labels = "%b %d") +
  theme(legend.position = 'bottom') +
  labs(title = 'Total congressional tweets by party affiliation')

2019 NOVEL CORONAVIRUS & lexical variation

For clarity purposes, we will refer to the broad conceptual category of the virus in all caps as 2019 NOVEL CORONAVIRUS. In contrast, the ways that speakers/tweeters can refer to this concept will be represented in lowercase, e.g., pandemic & covid19. From this perspective, coronavirus is one way speakers can refer to the concept 2019 NOVEL CORONAVIRUS.

So, with data sets in tow, the first step is to identify and extract the different lexical forms used to reference the virus. Referents include (1) pandemic, (2) coronavirus, corona virus, (3) covid19, covid, covid 19, covid-19, and (4) coronavirus pandemic. Some spelling variants. For simplicity, we ignore variation in orthographic case.

pan <- 'pandemic|'
cv <- 'coronavirus|corona virus|'
covid <- 'covid19|covid|covid 19|covid-19|'
cvp <- 'coronavirus pandemic'

searches <- paste0(pan, cv, covid, cvp)

covid_tweets <- lapply(1:nrow(congress_tweets1), function(x) {

  spots <- gregexpr(pattern = searches, congress_tweets1$text[x], ignore.case = TRUE)
  covid_gram <- regmatches(congress_tweets1$text[x], spots)[[1]]

  if (-1 %in% spots){} else {
    data.frame(doc_id = congress_tweets1$status_id[x],
               date = congress_tweets1$created_at[x],
               twitter = congress_tweets1$twitter[x],
               party = congress_tweets1$party[x],
               covid_gram = covid_gram,
               stringsAsFactors = FALSE)}
}) %>%
  data.table:::rbindlist()

Attested variants are highlighted below. So, some disagreement on how things should be spelled. (It will be curious to see whether this settles down moving forward and conventions become established.)

table(covid_tweets$covid_gram)
## 
##         corona virus         Corona virus         Corona Virus 
##                    4                    3                    3 
##         CORONA VIRUS          coronavirus          Coronavirus 
##                    1                 6757                 2493 
##          CoronaVirus          CORONAVIRUS coronavirus pandemic 
##                   94                   14                  340 
## Coronavirus pandemic Coronavirus Pandemic CoronaVirus Pandemic 
##                   45                    7                    1 
##                covid                Covid                COVID 
##                    7                  126                  674 
##             COVID 19             covid-19             Covid-19 
##                    6                    5                   11 
##             COVID-19              covid19              Covid19 
##                 1962                   48                   55 
##              COVID19             pandemic             Pandemic 
##                 3676                 1139                  111 
##             PANDEMIC 
##                    1

After normalizing spelling variation, a portion of the resulting table (less the tweet id) is presented below:

covid_tweets <- covid_tweets %>%
  mutate(covid_gram = tolower(covid_gram),
         covid_gram = ifelse(grepl('covid', covid_gram), 'covid19', covid_gram),
         covid_gram = ifelse(grepl('corona virus', covid_gram), 'coronavirus', covid_gram))

covid_tweets %>% sample_n(10) %>% select(-doc_id) %>% knitr::kable()
date        twitter           party       covid_gram
2020-03-20  REPBONAMICI       Democrat    coronavirus pandemic
2020-03-20  REPCARBAJAL       Democrat    covid19
2020-03-24  REPLINDASANCHEZ   Democrat    coronavirus
2020-03-06  REPLLOYDDOGGETT   Democrat    covid19
2020-03-27  REPPETEKING       Republican  pandemic
2020-03-16  LEADERHOYER       Democrat    coronavirus
2020-03-11  SENBRIANSCHATZ    Democrat    coronavirus
2020-03-20  CHELLIEPINGREE    Democrat    covid19
2020-03-20  NORMAJTORRES      Democrat    covid19
2020-03-17  REPJEFFDUNCAN     Republican  covid19

Patterns of variation over time

The table below details the first attestation of each referring expression in our 2020 Congressional Twitter corpus. coronavirus hit the scene on 1-17, followed by pandemic on 1-22, coronavirus pandemic on 2-11, and covid19 on 2-12 – the name for the disease coined by the World Health Organization on 2-11.

covid_tweets %>%
  group_by(covid_gram) %>%
  filter(date == min(date)) %>%
  arrange(date) %>%
  select(covid_gram, date, twitter) %>%
  knitr::kable()
| covid_gram           | date       | twitter         |
|----------------------|------------|-----------------|
| coronavirus          | 2020-01-17 | SENFEINSTEIN    |
| pandemic             | 2020-01-22 | MICHAELCBURGESS |
| pandemic             | 2020-01-22 | SENTOMCOTTON    |
| coronavirus pandemic | 2020-02-11 | SENATORHASSAN   |
| covid19              | 2020-02-12 | REPELIOTENGEL   |

Rates of reference to 2019 NOVEL CORONAVIRUS

So, what lexical forms are Senators and House Reps using to reference 2019 NOVEL CORONAVIRUS on Twitter? How often are they referring to 2019 NOVEL CORONAVIRUS? And how have these patterns changed over time? To get a read on this, we consider the daily rate of reference to 2019 NOVEL CORONAVIRUS, and the rate at which each lexical variant has been used to reference 2019 NOVEL CORONAVIRUS.

The reference rate for referring expression X, then, is approximated as the proportion of total tweets generated by members of Congress that contain referring expression X. The plot below illustrates daily rates of reference for each form from Jan 17 to March 27. Included is the total reference rate for 2019 NOVEL CORONAVIRUS.

all <- covid_tweets %>%
  group_by(date) %>%
  summarize(n = n()) %>%
  left_join(all_tweets %>% group_by(date) %>% summarise(ts = sum(ts))) %>%
  mutate(per = n/ts,
         covid_gram = 'total') %>%
  select(date, covid_gram, n:per)

covid_tweets %>%
  group_by(date, covid_gram) %>%
  summarize(n = n()) %>%
  left_join(all_tweets %>% group_by(date) %>% summarise(ts = sum(ts))) %>%
  mutate(per = n/ts) %>%
  bind_rows(all) %>%
  ggplot() +
  geom_line(aes(x = date,
                y = per,
                color = covid_gram),
            size = 1.5) +
  theme_minimal() +
  ggthemes::scale_color_stata() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(date_breaks = '2 days', date_labels = "%b %d") +
  theme(legend.position = 'top',
        legend.title = element_blank()) +
  ylab('Daily reference rate') +
  labs(title = 'Rates of reference to 2019 NOVEL CORONAVIRUS',
       subtitle = 'Among US Senators & House Representatives')

So, lots going on. Some 2019 NOVEL CORONAVIRUS chatter through January and most of February. Around Feb 24, we see a substantial jump. Reference to 2019 NOVEL CORONAVIRUS as coronavirus has been most frequent since the onset; however, covid19 has more recently taken the lead among members of Congress.

The plot below presents the same data disaggregated by party affiliation. The profiles are roughly the same, which is interesting, with the exception of covid19, which shows an uptick among Democrats that is not mirrored by Republicans. Likely just an anomaly.

covid_tweets %>%
  group_by(date, party, covid_gram) %>%
  summarize(n = n()) %>%
  left_join(all_tweets) %>%
  mutate(per = n/ts) %>%
  filter(party != 'Independent') %>%
  ggplot() +
  geom_line(aes(x = date,
                y = per,
                color = covid_gram),
            size = 1.25) +
  theme_minimal() +
  ggthemes::scale_color_stata() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.title = element_blank(),
        legend.position = 'top') +
  scale_x_date(date_breaks = '1 week', date_labels = "%b %d") +
  facet_wrap(~party) +
  ylab('Daily reference rate') +
  labs(title = 'Rates of reference to 2019 NOVEL CORONAVIRUS by party affiliation',
       subtitle = 'Among US Senators & House Representatives')

Probability distributions

Lastly, we consider a proportional perspective on reference to 2019 NOVEL CORONAVIRUS. Instead of total tweets, the denominator here becomes overall references to 2019 NOVEL CORONAVIRUS on Twitter among members of Congress.

The figure below, then, illustrates daily probability distributions for forms used to reference 2019 NOVEL CORONAVIRUS. covid19 has slowly become the majority form on Twitter, while coronavirus has become less and less prevalent. One explanation is that the effects of the virus in the US, i.e., the disease, have become more prevalent and, hence, the referring expression covid19 has become the more apt choice. Another explanation is that covid19 is orthographically shorter and, in the character-counting world of Twitter, a more efficient way to express the notion 2019 NOVEL CORONAVIRUS. An empirical question for sure.

x1 <- covid_tweets %>%
  filter(date > '2020-2-25') %>%
  group_by(date, covid_gram) %>%
  summarize(n = n()) %>%
  mutate(per = n/sum(n))

x2 <- x1 %>%
  ggplot(aes(x = date, y = per, fill = covid_gram)) +
  geom_bar(alpha = 0.65, stat = 'identity', width = .9) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none") +
  ggthemes::scale_fill_economist() +
  scale_x_date(date_breaks = '1 day', date_labels = "%b %d") +
  labs(title = 'Referring to 2019 NOVEL CORONAVIRUS',
       subtitle = 'Among US Senators & House Representatives')

x2 +
  annotate(geom = "text",
           x = c(rep(as.Date('2020-3-22'), 4)),
           y = c(.05, .35, .6, .8),
           label = c('pandemic', 'covid19', 'coronavirus pandemic', 'coronavirus'),
           size = 4, color = 'black')

Summary

So, the product of a weekend & social distancing. Caveats galore, but for folks interested in language change & innovation & the establishment of convention in a community of speakers, something to keep an eye on.

Perhaps more interesting is how regular folks are referencing 2019 NOVEL CORONAVIRUS on Twitter. Everyone stay home & healthy.


To leave a comment for the author, please follow the link and comment on their blog: Jason Timm.



Meet {tidycovid19}: Yet another Covid-19 related R Package


[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I have decided that the world needs another Covid-19 related R package. Not sure whether you agree, but the new package facilitates the direct download of various Covid-19 related data (including data on governmental measures) directly from authoritative sources. It also provides a flexible function and accompanying shiny app to visualize the spreading of the virus. Play around with the shiny app here if you like or hang around to learn more about the package.

Why yet another package on Covid-19?

There are at least two R packages that provide data and infrastructure related to Covid-19:

  • {nCov2019}: This package has a focus on Chinese data but also contains data on other countries and regions. It contains a shiny dashboard.
  • {coronavirus}: This package provides the Johns Hopkins University CSSE dataset together with a dashboard.

Unlike the two packages above, the key objective of the {tidycovid19} package is to provide transparent access to various data sources. It does not contain any data per se. Instead, it provides functions to pull data from publicly available sources.

However, to save processing time and bandwidth, it alternatively provides the option to download cached data from the Github repository (stored in the directory cached_data). The cached data are updated daily.

If you rather want to start your own project by customizing the download code to fit your needs, I suggest that you take a look at my Github repository “tidy_covid19” (mind the underscore). This repo presents a forkable code infrastructure for Covid 19 projects using the same data infrastructure.

The disclaimer from my previous blog post continues to apply. I am not an epidemiologist. My motivation to develop this package is to help researchers interested in studying the spread of the virus and the effects of non-pharmaceutical interventions on the virus spread. To reach this objective, the package pulls data from various sources.

Here are the download functions offered by the package:

  • download_jhu_csse_covid19_data(): Downloads and tidies Covid-19 data from the Johns Hopkins University CSSE Github Repo. These data have developed into a standard resource for researchers and the general audience interested in assessing the global spreading of the virus. The data are aggregated at the country level.
  • download_acaps_npi_data(): Downloads and tidies the Government measures dataset provided by the Assessment Capacities Project (ACAPS). These relatively new data allow researchers to study the effect of non-pharmaceutical interventions on the development of the virus.
  • download_google_trends_data(): Uses {gtrendsR} to download and tidy Google Trends data on the search volume for the term “coronavirus” (thank you to Yan Ouaknine for bringing up that idea!). This data can be used to assess the public attention to Covid-19 across countries (see plot below) and over time within a given country.
  • download_wbank_data(): Downloads and tidies additional country level information provided by the World Bank using the {wbstats} package. These data allow researchers to calculate per capita measures of the virus spread and to assess the association of macro-economic variables with the development of the virus.
  • download_merged_data(): Downloads all data sources and creates a merged country-day panel sample.

All functions can be called with the parameter cached = TRUE to download the cached data instead of accessing the original data sources. So, a quick way to use the data is to do the following.

# remotes::install_github("joachim-gassen/tidycovid19")
suppressPackageStartupMessages({
  library(tidycovid19)
  library(dplyr)
  library(ggplot2)
  library(ggrepel)
})

merged_dta <- download_merged_data(cached = TRUE)
## Downloading cached version of merged data...done. Timestamp is 2020-03-30 06:43:14
merged_dta %>%
  group_by(country) %>%
  mutate(
    reported_deaths = max(deaths),
    soc_dist_measures = max(soc_dist)
  ) %>%
  select(country, iso3c, reported_deaths, soc_dist_measures) %>%
  distinct() %>%
  ungroup() %>%
  arrange(-reported_deaths) %>%
  head(20) -> df

ggplot(df, aes(x = reported_deaths, y = soc_dist_measures)) +
  geom_point() +
  geom_label_repel(aes(label = iso3c)) +
  theme_minimal() +
  scale_x_continuous(trans = 'log10', labels = scales::comma) +
  labs(x = "Reported deaths (logarithmic scale)",
       y = "Number of governmental social distancing measures",
       annotation = "Data from JHU CSSE and ACAPS.")
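The individual sources can be pulled in the same way. Below is a minimal sketch (assuming the package is installed; the function name and the cached argument are as documented above) that downloads only the ACAPS governmental measures data from the daily cache and takes a quick look at it:

# A sketch: pull a single source from the daily cache instead of the merged panel
library(tidycovid19)
library(dplyr)

acaps_npi <- download_acaps_npi_data(cached = TRUE)
glimpse(acaps_npi)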

Visualization

The focus of the package lies on data collection and not on visualization, as there are already many great tools floating around. The function plot_covid19_spread(), however, allows you to quickly visualize the spread of the virus in relation to governmental intervention measures. It is inspired by the insightful displays created by John Burn-Murdoch from the Financial Times and offers various customization options.

plot_covid19_spread(merged_dta, highlight = c("ITA", "ESP", "FRA", "DEU", "USA"),                     intervention = "lockdown")

Shiny App

Sorry, I could not resist. The options of plot_covid19_spread() make the implementation of a shiny app a little too tempting to pass up. The command shiny_covid19_spread() starts the app. You can check it out online if you like.
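For reference, a minimal sketch of launching the app locally (assuming the package and its shiny-related dependencies are installed):

# Start the local shiny app that wraps plot_covid19_spread()
library(tidycovid19)
shiny_covid19_spread()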

Screenshot of shiny_covid19_spread() app

Wrapping up

This is it for the time being. I hope that the package might be helpful to those doing Covid-19 related research. If you have suggestions and/or feedback, consider opening an issue on Github.

Stay well and keep #FlattenTheCurve!


To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.


Installing spatial R packages on Ubuntu


[This article was first published on the Geocomputation with R website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This post explains how to quickly get key R packages for geographic research installed on Ubuntu, a popular Linux distribution.

A recent thread on the r-spatial GitHub organization alludes to many considerations when choosing a Linux set-up for work with geographic data, ranging from the choice of Linux distribution (distro) to the use of binary vs. compiled versions (binaries are faster to install). This post touches on some of these things, but its main purpose is to provide advice on getting R’s key spatial packages up-and-running on a future-proof Linux operating system (Ubuntu).

Now is an excellent time to be thinking about the topic because changes are in the pipeline and getting set-up (or preparing to get set-up) now could save hours in the future. These imminent changes include:

  • The next major release of R (4.0.0), scheduled for the 24th April (2020-04-24)
  • The next major release of Ubuntu (20.04), a Long Term Support (LTS) version that will be used by millions of servers and research computers worldwide for years to come. Coincidentally, Ubuntu 20.04 will be released a day earlier than R 4.0.0, on 23rd April (2020-04-23).
  • Ongoing changes to the OSGeo stack on which key geographic R packages depend, as documented in r-spatial repos and a recent blog post on how recent versions of PROJ enable more precise coordinate reference system definitions.

To keep up with these changes, this post will be updated in late April when some of the dust has settled around these changes. However, the advice presented here should be future-proof, including information on how to upgrade Ubuntu in section 2.

There are many ways of getting Ubuntu set up for spatial R packages. A benefit of Linux operating systems is that they offer choice and prevent ‘lock-in’. However, the guidance in the next section should reduce set-up time and improve maintainability (with updates managed by Ubuntu) compared with other ways of doing things, especially for beginners. If you’re planning to switch to Linux as the basis of your geographic work, this advice may be particularly useful. (The post was written in response to people asking how to set up R on their new Ubuntu installations. For more on getting a computer running Ubuntu, check out companies that support open source operating systems and guides to installing Ubuntu on an existing machine.)

By ‘key packages’ I mean the following, which enable the majority of day-to-day geographic data processing and visualization tasks:

  • sf for reading, writing and working with a range of geographic vector file formats and geometry types
  • raster, a mature package for working with geographic raster data (see terra for an in-development replacement for raster)
  • tmap, a flexible package for making static and interactive maps

The focus is on Ubuntu because that’s what I’ve got most experience with and it is well supported by the community. Links for installing geographic R packages on other distros are provided in a subsequent section.

1. Installing spatial R packages on Ubuntu

R’s spatial packages can be installed as pre-compiled binaries on the latest version of this popular operating system, once the appropriate repository has been set up, meaning fast install times (only a few minutes including the installation of upstream dependencies). The following bash commands should install key geographic R packages on Ubuntu 19.10:

# add a repository that ships the latest version of R:
sudo add-apt-repository ppa:marutter/rrutter3.5
# update the repositories so the software can be found:
sudo apt update
# install system dependencies:
sudo apt install libudunits2-dev libgdal-dev libgeos-dev libproj-dev libfontconfig1-dev
# binary versions of key R packages:
sudo apt install r-base-dev r-cran-sf r-cran-raster r-cran-rjava

To test that your installation of R has worked, try running R in an IDE such as RStudio or in the terminal by entering R. You should be able to run the following commands without problem:

library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
install.packages("tmap")
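As a further optional check (a sketch, assuming the apt install above completed cleanly), you can confirm that the other key packages load and report their versions:

# Confirm the binary packages installed via apt (and tmap from CRAN) are available
library(raster)
library(tmap)
packageVersion("sf")
packageVersion("raster")
packageVersion("tmap")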

If you are using an older version of Ubuntu and don’t want to upgrade to 19.10, which will be upgraded to 20.04 by the end of April 2020, see instructions at github.com/r-spatial/sf and detailed instructions on the blog rtask.thinkr.fr, which contains this additional shell command:

# for Ubuntu 18.04
sudo add-apt-repository ppa:marutter/c2d4u3.5

That adds a repository that ships hundreds of binary versions of R packages, meaning faster install times for packages (see the Binary package section of the open source book R Packages for more on binary packages). An updated repository, called c2d4u4.0 or similar, will be available for Ubuntu 20.04 in late April.

If you have issues with the instructions in this post, you can find a wealth of answers on sites such as StackOverflow, the sf issue tracker, r-sig-geo and Debian special interest group (SIG) email lists (the latter of which provided input into this blog post, thanks to Dirk Eddelbuettel and Michael Rutter).

2. Updating R packages and upstream dependencies

Linux operating systems allow you to customize your set-up in myriad ways. This can be enlightening, but it can also be wasteful. It’s worth considering the stability/cutting-edge continuum before diving into a particular set-up and potentially wasting time (if the previous section hasn’t already made up your mind).

A reliable way to keep close (but not too close) to the cutting edge on the R side is simply to keep your packages up-to-date. Running the following command (or using the Tools menu in RStudio) every week or so will ensure you have up-to-date package versions:

update.packages()

The following commands will update system dependencies including the ‘OSGeo stack’ composed of PROJ, GEOS and GDAL:

sudo apt-get update  # see if things have changed
sudo apt upgrade     # install changes

If you want to update Ubuntu to the latest version, you can with the following command (also see instructions here):

apt-get dist-upgrade

To get more up-to-date upstream geographic libraries than provided in the default Ubuntu repositories, you can add the ubuntugis repository as follows. This is a pre-requisite on Ubuntu 18.04 and earlier but also works with later versions (warning, adding this repository could cause complications if you already have software such as QGIS that uses a particular version of GDAL installed):

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
sudo apt update
sudo apt upgrade

That will give you more up-to-date versions of GDAL, GEOS and PROJ which may offer some performance improvements. Note: if you do update dependencies such as GDAL you will need to re-install the relevant packages, e.g. with install.packages("sf"). You can revert that change with the following little-known command:

sudo add-apt-repository --remove ppa:ubuntugis/ubuntugis-unstable

If you also want the development versions of key R packages, e.g. to test new features and support development efforts, you can install them from GitHub, e.g. as follows:

remotes::install_github("r-spatial/sf")remotes::install_github("rspatial/raster")remotes::install_github("mtennekes/tmaptools") # required for dev version of tmapremotes::install_github("mtennekes/tmap")

3. Installing geographic R packages on other Linux operating systems

If you are in the fortunate position of switching to Linux and being able to choose the distribution that best fits your needs, it’s worth thinking about which distribution will be user-friendly (more on that soon), performant and future-proof. Ubuntu is a solid choice, with a large user community and repositories such as ‘ubuntugis’ providing more up-to-date versions of upstream geographic libraries such as GDAL.

QGIS is also well-supported on Ubuntu.

However, you can install R and key geographic packages on other operating systems, although it may take longer. Useful links on installing R and geographic libraries are provided below for reference:

  • Installing R on Debian is covered on the CRAN website. Upstream dependencies such as GDAL can be installed on recent versions of Debian, such as buster, with commands such as apt-get install libgdal-dev, as per the instructions for rocker/geospatial.

  • Installing R on Fedora/Red Hat is straightforward, as outlined on CRAN. GDAL and other spatial libraries can be installed from Fedora’s dnf package manager, e.g. as documented here for sf.

  • Arch Linux has a growing R community. Information on installing and setting-up R can be found on the ArchLinux wiki. Installing upstream dependencies such as GDAL on Arch is also relatively straightforward. There is also a detailed guide for installing R plus geographic packages by Patrick Schratz.

4. Geographic R packages on Docker

The Ubuntu installation instructions outlined above provide an easy and future-proof set-up. But if you want an even easier way to get the power of key geographic packages running on Linux, and have plenty of RAM and HD space, running R on the ‘Docker Engine’ may be an attractive option.

Advantages of using Docker include reproducibility (code will always run the same on any given image, and images can be saved), portability (Docker can run on Linux, Windows and Mac) and scalability (Docker provides a platform for scaling-up computations across multiple nodes).

For an introduction to using R/RStudio in Docker, see the Rocker project.

Using that approach, I recommend the following Docker images for using R as a basis for geographic research:

  • rocker/geospatial which contains key geographic packages, including those listed above
  • robinlovelace/geocompr which contains all the packages needed to reproduce the contents of the book, and which you can run with the following command in a shell in which Docker is installed:
docker run -e PASSWORD=yourpassword --rm -p 8787:8787 robinlovelace/geocompr

To test-out the Ubuntu 19.10 set-up recommended above I created a Dockerfile and associated image on Dockerhub that you can test-out as follows:

docker run -it robinlovelace/geocompr:ubuntu-eoan
R
library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(raster)
library(tmap)

The previous commands should take you to a terminal inside the docker container where you try out the Linux command line and R. If you want to use more cutting-edge versions of the geographic libraries, you can use the ubuntu-bionic image (note the more recent version numbers, with PROJ 7.0.0 for example):

sudo docker run -it robinlovelace/geocompr:ubuntu-bionic
R
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0

These images do not currently contain all the dependencies needed to reproduce the code in Geocomputation with R.

However, as documented in issue 476 in the geocompr GitHub repo, there is a plan to provide Docker images with this full ‘R-spatial’ stack installed, building on strong foundations such as rocker/geospatial and the ubuntugis repositories, to support different versions of GDAL and other dependencies. We welcome any comments or tech support to help make this happen. Suggested changes to this post are also welcome, see the source code here.

5. Fin

R is an open-source language heavily inspired by Unix/Linux, so it should come as no surprise that it runs well on a variety of Linux distributions, Ubuntu (covered in this post) in particular. The guidance in this post should get geographic R packages set up quickly in a future-proof way. A sensible next step is to sharpen your system administration (sysadmin) and shell coding skills, e.g. with reference to Ubuntu wiki pages and Chapter 2 of the open source book Data Science at the Command Line.

This will take time but, building on OSGeo libraries, a well set-up Linux machine is an ideal platform to install, run and develop key geographic R packages in a performant, stable and future-proof way.

Be the FOSS4G change you want to see in the world!


To leave a comment for the author, please follow the link and comment on their blog: the Geocomputation with R website.


Coronavirus : spatially smoothed decease in France


[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

# Map of COVID-19 deaths in France
# with smoothing

# sources -----------------------------------------------------------------

fichier_covid <- "donnees/covid.csv"
url_donnees_covid <- "https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7"

fichier_pop <- "donnees/pop.xls"
# https://www.insee.fr/fr/statistiques/2012713#tableau-TCRD_004_tab1_departements
url_donnees_pop <- "https://www.insee.fr/fr/statistiques/fichier/2012713/TCRD_004.xls"

# Adminexpress:
# https://geoservices.ign.fr/documentation/diffusion/telechargement-donnees-libres.html#admin-express

# config ------------------------------------------------------------------

library(tidyverse)
library(httr)
library(fs)
library(sf)
library(readxl)
library(janitor)
library(tmap)
# + btb, raster, fasterize, plyr

#' Kernel weighted smoothing with arbitrary bounding area
#'
#' @param df sf object (points)
#' @param field weight field in the df
#' @param bandwidth kernel bandwidth (map units)
#' @param resolution output grid resolution (map units)
#' @param zone sf study zone (polygon)
#' @param out_crs EPSG (should be an equal-area projection)
#'
#' @return a raster object
#' @import btb, raster, fasterize, dplyr, plyr, sf
lissage <- function(df, field, bandwidth, resolution, zone, out_crs = 3035) {

  if (st_crs(zone)$epsg != out_crs) {
    message("reprojecting data...")
    zone <- st_transform(zone, out_crs)
  }

  if (st_crs(df)$epsg != out_crs) {
    message("reprojecting study zone...")
    df <- st_transform(df, out_crs)
  }

  zone_bbox <- st_bbox(zone)

  # grid generation
  message("generating reference grid...")
  zone_xy <- zone %>%
    dplyr::select(geometry) %>%
    st_make_grid(cellsize = resolution,
                 offset = c(plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                            plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor)),
                 what = "centers") %>%
    st_sf() %>%
    st_join(zone, join = st_intersects, left = FALSE) %>%
    st_coordinates() %>%
    as_tibble() %>%
    dplyr::select(x = X, y = Y)

  # kernel
  message("computing kernel...")
  kernel <- df %>%
    cbind(., st_coordinates(.)) %>%
    st_set_geometry(NULL) %>%
    dplyr::select(x = X, y = Y, field) %>%
    btb::kernelSmoothing(dfObservations = .,
                         sEPSG = out_crs,
                         iCellSize = resolution,
                         iBandwidth = bandwidth,
                         vQuantiles = NULL,
                         dfCentroids = zone_xy)

  # rasterization
  message("\nrasterizing...")
  raster::raster(xmn = plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                 ymn = plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor),
                 xmx = plyr::round_any(zone_bbox[3] + bandwidth, resolution, f = ceiling),
                 ymx = plyr::round_any(zone_bbox[4] + bandwidth, resolution, f = ceiling),
                 resolution = resolution) %>%
    fasterize::fasterize(kernel, ., field = field)
}

# download ----------------------------------------------------------------

if (!file_exists(fichier_covid) |
    file_info(fichier_covid)$modification_time < Sys.Date()) {
  GET(url_donnees_covid,
      progress(),
      write_disk(fichier_covid, overwrite = TRUE))
}

if (!file_exists(fichier_pop)) {
  GET(url_donnees_pop,
      progress(),
      write_disk(fichier_pop))
}

# data --------------------------------------------------------------------

covid <- read_csv2(fichier_covid)

# Adminexpress pre-downloaded
dep <- read_sf("donnees/ADE_2-0_SHP_LAMB93_FR/DEPARTEMENT.shp") %>%
  clean_names()

pop <- read_xls(fichier_pop, skip = 2) %>%
  clean_names()

# preprocessing -----------------------------------------------------------

fr <- dep %>%
  st_union() %>%
  st_sf() %>%
  st_set_crs(2154)

deces <- dep %>%
  left_join(pop, by = c("insee_dep" = "x1")) %>%
  left_join(covid %>%
              filter(jour == max(jour),
                     sexe == 0) %>%
              group_by(dep) %>%
              summarise(deces = sum(dc, na.rm = TRUE)),
            by = c("insee_dep" = "dep")) %>%
  st_point_on_surface() %>%
  st_set_crs(2154)

# smoothing ---------------------------------------------------------------

d <- deces %>%
  lissage("deces", 100000, 10000, fr, 3035)

p <- deces %>%
  lissage("x2020_p", 100000, 10000, fr, 3035)

d100k <- d * 100000 / p

# mapping -----------------------------------------------------------------

tm_layout(title = paste("Covid19 - France métropolitaine", max(covid$jour)),
          legend.position = c("left", "bottom")) +
  tm_shape(d100k) +
  tm_raster(title = "décès pour 100 000 hab.\n(lissage 100 km sur grille 10 km,\n classif. kmeans)",
            style = "kmeans",
            palette = "viridis",
            legend.format = list(text.separator = "à moins de",
                                 digits = 0),
            legend.reverse = TRUE) +
  tm_shape(dep) +
  tm_borders() +
  tm_credits("http://r.iresmi.net/\nprojection LAEA Europe\ndonnées départementales Santé publique France\nINSEE RP 2020, IGN Adminexpress 2019")

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net.


Screenager: screening times at bioRxiv


[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When a preprint is uploaded to bioRxiv, it undergoes screening before it appears online. How long does it take for Affiliates to screen preprints at bioRxiv?

tl;dr I used R to look at bioRxiv screening times. Even though bioRxiv has expanded massively, screening happens quickly (in about 24 h).

I am a bioRxiv Affiliate – one of the people who does the screening. Preprints wait in a queue to be screened. Over the years, I’ve seen the typical queue get longer and longer. In the early days the queue was maybe 10 preprints. These days it’s often over 100.

It’s a team effort and more Affiliates have been recruited over time. Yet I often wonder how we’re doing. My impression is that there are always lots of Neuroscience and Bioinformatics papers in the queue. Do any subject areas get neglected? If so, should Affiliates in these areas be recruited specifically?

To look at these questions I used this wonderful R client for the bioRxiv API written by Nicholas Fraser.

To set up:

devtools::install_github("nicholasmfraser/rbiorxiv")

# load packages
library(rbiorxiv)
library(tidyverse)
library(gridExtra)

# make directory for output if it doesn't exist
if (dir.exists("./output") == FALSE) dir.create("./output")
if (dir.exists("./output/plots") == FALSE) dir.create("./output/plots")

Use the R client to get a data frame of preprints uploaded in 2020.

data <- biorxiv_content(from = "2020-01-01", to = "2020-03-29", limit = "*", format = "df")

We only want to look at new preprints (Version 1) and not revisions, so let’s filter for that. Then, we’ll take advantage of bioRxiv’s new style DOIs to find the “submission date”.

data <- filter(data, version == 1)
data$doi_date <- substr(data$doi, 9, 18)
data$doi_date <- gsub("\\.", "-", data$doi_date)
data$days <- as.Date(data$date) - as.Date(data$doi_date)
data$category <- as.factor(data$category)

We now have a column called ‘days’ that shows the time in days from “submission” to “publication”. We will use this as a measure of screening time. Note: this is imperfect because the submission date is when an author begins uploading their preprint (they could take several days to do this) and not when it actually gets submitted to bioRxiv.
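Before plotting, a quick sanity check of the new column can be useful (a minimal sketch using the data frame built above):

# Distribution of screening lags in days (difftime converted to numeric)
summary(as.numeric(data$days))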

Let’s look at the screening time per subject area.

p1 <- ggplot(data, aes(x = as.numeric(days))) +
  geom_histogram(binwidth = 1) +
  xlim(NA, 6) +
  facet_grid(category ~ ., scales = "free") +
  labs(x = "Days",
       y = "Preprints") +
  theme(strip.text.y = element_text(angle = 0))

ggsave("./output/plots/screenLag.png", p1, height = 15, width = 6, dpi = 300)
Histogram of screening times per subject

I was surprised to see that, with the exception of “Scientific Communication and Education”, the screening times were pretty constant across categories.

The subject areas on bioRxiv are not equal in size. Look at the numbers on the axes for Zoology and for Neuroscience to get a feel for the difference. The histogram view conceals these differences.

Next, we can calculate the average screening time and see if the busiest categories suffer delayed screening.

df1 <- aggregate(as.numeric(data$days), list(data$category), mean)
colnames(df1) <- c("category", "mean_days")

# count preprints per category (dplyr)
df2 <- data %>% count(category)
colnames(df2) <- c("category", "count")

summary_df <- merge(df1, df2)

And then make some bar charts to look at the data.

p3 <- ggplot(summary_df, aes(x = category, y = mean_days)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = rev(levels(summary_df$category))) +
  labs(x = "", y = "Mean (days)") +
  coord_flip()

p4 <- ggplot(summary_df, aes(x = category, y = count)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = rev(levels(summary_df$category))) +
  labs(x = "", y = "Preprints") +
  coord_flip()

p5 <- grid.arrange(p3, p4, nrow = 1, ncol = 2)

ggsave("./output/plots/summary.png", p5, height = 8, width = 8, dpi = 300)

The average screening time is 1 day or less. Neuroscience, microbiology and bioinformatics (the biggest categories) have similar screening delays to less busy categories. So, assuming that Affiliates screen on the basis of expertise, the pool is either enriched for these popular areas, or those affiliates are more busy!

The longest lag is for “Scientific Communication and Education”, which is a very small category. Assuming the authors take a similar time to upload these manuscripts, I guess the Affiliates tend to screen these preprints as a lower priority. These papers do tend to be a bit different from other research papers and have separate screening criteria. Anyway, they still get screened in just over 2 days, which is impressive.

I was pleased to see “Cell Biology” had the shortest screening time (around half a day)!

Conclusion

Even though my impression was that Bioinformatics and Neuroscience papers linger in the queue, this is not actually the case. There’s likely more of them in the queue because there are more of them, period.

The bioRxiv team have done a great job in maintaining a pool of Affiliates that can screen the huge number of preprints that are uploaded.

The post title comes from “Screenager” by Muse from their Origin of Symmetry album.


To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.


Why R? Webinars


[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Motivated by the positive reception of the Why R? 2019 keynote talks uploaded to the Why R? YouTube channel, we have decided to start the Why R? Webinars series! We hope this will help grow the R community.

Details

  • channel: youtube.com/c/WhyRFoundation
  • date: every Thursday 8:00 pm GMT+2 (starting April 2nd!)
  • format: one 45-minute talk streamed on YouTube + 10 minutes for Q&A
  • comments: questions can be asked in the YouTube live chat

First talk

We are excited to present our first webinar speaker: Achim Zeileis from Universität Innsbruck. The topic of the webinar is R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R, which is a very good suite for teachers facing the need to teach R remotely!

The speaker’s bio and the abstract of the talk are available on the meetup event page and at the webinar URL, which is also visible from the youtube.com/c/WhyRFoundation YouTube channel.

Stay up to date


To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.



