
Shiny Contest 2020 deadline extended


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

The original deadline for Shiny Contest 2020 was this week, but given that many of us have had lots of unexpected changes to our schedules over the last week due to the COVID-19 outbreak, we have decided to extend the deadline by two weeks. If you’ve been planning to submit an entry for the contest this week (and if history is any indicator, there may be a few of you out there), please feel free to take this additional time. The new deadline for the contest is 3 April 2020 at 5pm ET.



Time Series Machine Learning (and Feature Engineering) in R


[This article was first published on business-science.io, and kindly contributed to R-bloggers.]

Machine learning is a powerful way to analyze Time Series. With innovations in the tidyverse modeling infrastructure (tidymodels), we now have a common set of packages to perform machine learning in R. These packages include parsnip, recipes, tune, and workflows. But what about Machine Learning with Time Series Data? The key is Feature Engineering. (Read the updated article at Business Science)

The timetk package includes a feature engineering innovation in version 0.1.3: a recipe step called step_timeseries_signature() that performs Time Series Feature Engineering and is designed to fit right into the tidymodels workflow for machine learning with time series data.

This small innovation creates 25+ time series features, which can have a big impact on improving our machine learning models. Further, these “core features” are the basis for creating 200+ time-series features to improve forecasting performance. Let’s see how to do Time Series Machine Learning in R.

Time Series Feature Engineering with the Time Series Signature

Use feature engineering with timetk to forecast

The time series signature is a collection of useful engineered features that describe the time series index of a time-based data set. It contains 25+ time-series features that can be used to forecast time series that contain common seasonal and trend patterns:

  • ✅Trend in Seconds Granularity: index.num

  • ✅Yearly Seasonality: Year, Month, Quarter

  • ✅Weekly Seasonality: Week of Month, Day of Month, Day of Week, and more

  • ✅Daily Seasonality: Hour, Minute, Second

  • ✅Weekly Cyclic Patterns: 2 weeks, 3 weeks, 4 weeks

We can then build 200+ new features from these core 25+ features by applying well-thought-out time series feature engineering strategies.
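If you want a quick look at these raw signature features outside of a modeling pipeline, timetk also exposes them directly via tk_get_timeseries_signature(). Here is a minimal sketch (it assumes timetk 0.1.3 and lubridate are installed, as in the tutorial below):

library(timetk)
library(lubridate)

# Ten daily timestamps
idx <- seq.Date(ymd("2011-01-01"), ymd("2011-01-10"), by = "day")

# Returns a tibble with index.num, year, half, quarter, month, wday, week,
# mday7, and the other core signature features for each timestamp
tk_get_timeseries_signature(idx)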

Time Series Forecast Strategy: 6-Month Forecast of Bike Transaction Counts

In this tutorial, you will learn how to apply machine learning to predict future outcomes in a time-based data set. The example uses a well-known time series dataset, the Bike Sharing Dataset, from the UCI Machine Learning Repository. The objective is to build a model and predict the next 6 months of daily Bike Sharing transaction counts.

Feature Engineering Strategy

I’ll use timetk to build a basic Machine Learning Feature Set using the new step_timeseries_signature() function, which is part of the preprocessing specification via the recipes package. I’ll show how you can add interaction terms, dummy variables, and more to build 200+ new features from the pre-packaged feature set.

Machine Learning Strategy

We’ll then perform Time Series Machine Learning using parsnip and workflows to construct and train a GLM-based time series machine learning model. The model is evaluated on out-of-sample data. A final model is trained on the full dataset and extended to a future dataset containing 6 months of daily timestamps.

Time Series Forecast using Feature Engineering

How to Learn Forecasting Beyond this Tutorial

I can’t possibly show you all the Time Series Forecasting techniques you need to learn in this post, which is why I have a NEW Advanced Time Series Forecasting Course on its way. The course includes detailed explanations from 3 Time Series Competitions. We go over competition solutions and show how you can integrate the key strategies into your organization’s time series forecasting projects. Check out the course page, and Sign-Up to get notifications on the Advanced Time Series Forecasting Course (Coming soon).


Need to improve forecasting at your company?

I have the Advanced Time Series Forecasting Course (Coming Soon). This course pulls forecasting strategies from experts that have placed 1st and 2nd solutions in 3 of the most important Time Series Competitions. Learn the strategies that win forecasting competitions. Then apply them to your time series projects.

Join the waitlist to get notified of the Course Launch!

Join the Advanced Time Series Course Waitlist


Prerequisites

Please use timetk 0.1.3 or greater for this tutorial. You can install via remotes::install_github("business-science/timetk") until released on CRAN.

Before we get started, load the following packages.

library(workflows)
library(parsnip)
library(recipes)
library(yardstick)
library(glmnet)
library(tidyverse)
library(tidyquant)
library(timetk)   # Use >= 0.1.3, remotes::install_github("business-science/timetk")

Data

We’ll be using the Bike Sharing Dataset from the UCI Machine Learning Repository. Download the data and select the “day.csv” file which is aggregated to daily periodicity.

# Read data
bikes <- read_csv("2020-03-18-timeseries-ml/day.csv")

# Select date and count
bikes_tbl <- bikes %>%
    select(dteday, cnt) %>%
    rename(date = dteday, value = cnt)

A visualization will help understand how we plan to tackle the problem of forecasting the data. We’ll split the data into two regions: a training region and a testing region.

# Visualize data and training/testing regions
bikes_tbl %>%
    ggplot(aes(x = date, y = value)) +
    geom_rect(xmin = as.numeric(ymd("2012-07-01")),
              xmax = as.numeric(ymd("2013-01-01")),
              ymin = 0, ymax = 10000,
              fill = palette_light()[[4]], alpha = 0.01) +
    annotate("text", x = ymd("2011-10-01"), y = 7800,
             color = palette_light()[[1]], label = "Train Region") +
    annotate("text", x = ymd("2012-10-01"), y = 1550,
             color = palette_light()[[1]], label = "Test Region") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    labs(title = "Bikes Sharing Dataset: Daily Scale", x = "") +
    theme_tq()

[Plot: Bikes Sharing Dataset: Daily Scale, with Train and Test regions highlighted]

Split the data into train and test sets at “2012-07-01”.

# Split into training and test sets
train_tbl <- bikes_tbl %>% filter(date < ymd("2012-07-01"))
test_tbl  <- bikes_tbl %>% filter(date >= ymd("2012-07-01"))

Modeling

Start with the training set, which has the “date” and “value” columns.

# Training set
train_tbl
## # A tibble: 547 x 2
##    date       value
##    <date>     <dbl>
##  1 2011-01-01   985
##  2 2011-01-02   801
##  3 2011-01-03  1349
##  4 2011-01-04  1562
##  5 2011-01-05  1600
##  6 2011-01-06  1606
##  7 2011-01-07  1510
##  8 2011-01-08   959
##  9 2011-01-09   822
## 10 2011-01-10  1321
## # … with 537 more rows

Recipe Preprocessing Specification

The first step is to add the time series signature to the training set, which will be used to learn the patterns. New in timetk 0.1.3 is integration with the recipes R package:

  • The recipes package allows us to add preprocessing steps that are applied sequentially as part of a data transformation pipeline.

  • The timetk package has step_timeseries_signature(), which is used to add a number of features that can help machine learning models.

# Add time series signature
recipe_spec_timeseries <- recipe(value ~ ., data = train_tbl) %>%
    step_timeseries_signature(date)

When we prepare the recipe with prep() and apply it with the bake() function, we go from 2 features to 29 features! Yes, 25+ new columns were added from the timestamp “date” feature. These are features we can use in our machine learning models and build on top of.

bake(prep(recipe_spec_timeseries), new_data = train_tbl)
## # A tibble: 547 x 29
##    date       value date_index.num date_year date_year.iso date_half
##    <date>     <dbl>          <dbl>     <int>         <int>     <int>
##  1 2011-01-01   985     1293840000      2011          2010         1
##  2 2011-01-02   801     1293926400      2011          2010         1
##  3 2011-01-03  1349     1294012800      2011          2011         1
##  4 2011-01-04  1562     1294099200      2011          2011         1
##  5 2011-01-05  1600     1294185600      2011          2011         1
##  6 2011-01-06  1606     1294272000      2011          2011         1
##  7 2011-01-07  1510     1294358400      2011          2011         1
##  8 2011-01-08   959     1294444800      2011          2011         1
##  9 2011-01-09   822     1294531200      2011          2011         1
## 10 2011-01-10  1321     1294617600      2011          2011         1
## # … with 537 more rows, and 23 more variables: date_quarter,
## #   date_month, date_month.xts, date_month.lbl, date_day, date_hour,
## #   date_minute, date_second, date_hour12, date_am.pm, date_wday,
## #   date_wday.xts, date_wday.lbl, date_mday, date_qday, date_yday,
## #   date_mweek, date_week, date_week.iso, date_week2, date_week3,
## #   date_week4, date_mday7

Building Engineered Features on Top of our Recipe

Next is where the magic happens. I apply various preprocessing steps to improve the modeling behavior, going from 29 features to 225 engineered features! If you wish to learn more, I have an Advanced Time Series course that will help you learn these techniques.

recipe_spec_final <- recipe_spec_timeseries %>%
    step_rm(date) %>%
    step_rm(contains("iso"), contains("second"), contains("minute"),
            contains("hour"), contains("am.pm"), contains("xts")) %>%
    step_normalize(contains("index.num"), date_year) %>%
    step_interact(~ date_month.lbl * date_day) %>%
    step_interact(~ date_month.lbl * date_mweek) %>%
    step_interact(~ date_month.lbl * date_wday.lbl * date_yday) %>%
    step_dummy(contains("lbl"), one_hot = TRUE)

bake(prep(recipe_spec_final), new_data = train_tbl)
## # A tibble: 547 x 225
##    value date_index.num date_year date_half date_quarter date_month date_day
##  1   985          -1.73    -0.705         1            1          1        1
##  2   801          -1.72    -0.705         1            1          1        2
##  3  1349          -1.71    -0.705         1            1          1        3
##  4  1562          -1.71    -0.705         1            1          1        4
##  5  1600          -1.70    -0.705         1            1          1        5
##  6  1606          -1.70    -0.705         1            1          1        6
##  7  1510          -1.69    -0.705         1            1          1        7
##  8   959          -1.68    -0.705         1            1          1        8
##  9   822          -1.68    -0.705         1            1          1        9
## 10  1321          -1.67    -0.705         1            1          1       10
## # … with 537 more rows, and 218 more variables: date_wday, date_mday,
## #   date_qday, date_yday, date_mweek, date_week, date_week2, date_week3,
## #   date_week4, date_mday7, date_month.lbl.L_x_date_day,
## #   date_month.lbl.Q_x_date_day, date_month.lbl.C_x_date_day,
## #   `date_month.lbl^4_x_date_day`, `date_month.lbl^5_x_date_day`,
## #   `date_month.lbl^6_x_date_day`, `date_month.lbl^7_x_date_day`,
## #   `date_month.lbl^8_x_date_day`, `date_month.lbl^9_x_date_day`,
## #   `date_month.lbl^10_x_date_day`, `date_month.lbl^11_x_date_day`,
## #   date_month.lbl.L_x_date_mweek, date_month.lbl.Q_x_date_mweek,
## #   date_month.lbl.C_x_date_mweek, `date_month.lbl^4_x_date_mweek`,
## #   `date_month.lbl^5_x_date_mweek`, `date_month.lbl^6_x_date_mweek`,
## #   `date_month.lbl^7_x_date_mweek`, `date_month.lbl^8_x_date_mweek`,
## #   `date_month.lbl^9_x_date_mweek`, `date_month.lbl^10_x_date_mweek`,
## #   `date_month.lbl^11_x_date_mweek`, date_month.lbl.L_x_date_wday.lbl.L,
## #   date_month.lbl.Q_x_date_wday.lbl.L, date_month.lbl.C_x_date_wday.lbl.L,
## #   `date_month.lbl^4_x_date_wday.lbl.L`, `date_month.lbl^5_x_date_wday.lbl.L`,
## #   `date_month.lbl^6_x_date_wday.lbl.L`, `date_month.lbl^7_x_date_wday.lbl.L`,
## #   `date_month.lbl^8_x_date_wday.lbl.L`, `date_month.lbl^9_x_date_wday.lbl.L`,
## #   `date_month.lbl^10_x_date_wday.lbl.L`, `date_month.lbl^11_x_date_wday.lbl.L`,
## #   date_month.lbl.L_x_date_wday.lbl.Q, date_month.lbl.Q_x_date_wday.lbl.Q,
## #   date_month.lbl.C_x_date_wday.lbl.Q, `date_month.lbl^4_x_date_wday.lbl.Q`,
## #   `date_month.lbl^5_x_date_wday.lbl.Q`, `date_month.lbl^6_x_date_wday.lbl.Q`,
## #   `date_month.lbl^7_x_date_wday.lbl.Q`, `date_month.lbl^8_x_date_wday.lbl.Q`,
## #   `date_month.lbl^9_x_date_wday.lbl.Q`, `date_month.lbl^10_x_date_wday.lbl.Q`,
## #   `date_month.lbl^11_x_date_wday.lbl.Q`, date_month.lbl.L_x_date_wday.lbl.C,
## #   date_month.lbl.Q_x_date_wday.lbl.C, date_month.lbl.C_x_date_wday.lbl.C,
## #   `date_month.lbl^4_x_date_wday.lbl.C`, `date_month.lbl^5_x_date_wday.lbl.C`,
## #   `date_month.lbl^6_x_date_wday.lbl.C`, `date_month.lbl^7_x_date_wday.lbl.C`,
## #   `date_month.lbl^8_x_date_wday.lbl.C`, `date_month.lbl^9_x_date_wday.lbl.C`,
## #   `date_month.lbl^10_x_date_wday.lbl.C`, `date_month.lbl^11_x_date_wday.lbl.C`,
## #   `date_month.lbl.L_x_date_wday.lbl^4`, `date_month.lbl.Q_x_date_wday.lbl^4`,
## #   `date_month.lbl.C_x_date_wday.lbl^4`, `date_month.lbl^4_x_date_wday.lbl^4`,
## #   `date_month.lbl^5_x_date_wday.lbl^4`, `date_month.lbl^6_x_date_wday.lbl^4`,
## #   `date_month.lbl^7_x_date_wday.lbl^4`, `date_month.lbl^8_x_date_wday.lbl^4`,
## #   `date_month.lbl^9_x_date_wday.lbl^4`, `date_month.lbl^10_x_date_wday.lbl^4`,
## #   `date_month.lbl^11_x_date_wday.lbl^4`, `date_month.lbl.L_x_date_wday.lbl^5`,
## #   `date_month.lbl.Q_x_date_wday.lbl^5`, `date_month.lbl.C_x_date_wday.lbl^5`,
## #   `date_month.lbl^4_x_date_wday.lbl^5`, `date_month.lbl^5_x_date_wday.lbl^5`,
## #   `date_month.lbl^6_x_date_wday.lbl^5`, `date_month.lbl^7_x_date_wday.lbl^5`,
## #   `date_month.lbl^8_x_date_wday.lbl^5`, `date_month.lbl^9_x_date_wday.lbl^5`,
## #   `date_month.lbl^10_x_date_wday.lbl^5`, `date_month.lbl^11_x_date_wday.lbl^5`,
## #   `date_month.lbl.L_x_date_wday.lbl^6`, `date_month.lbl.Q_x_date_wday.lbl^6`,
## #   `date_month.lbl.C_x_date_wday.lbl^6`, `date_month.lbl^4_x_date_wday.lbl^6`,
## #   `date_month.lbl^5_x_date_wday.lbl^6`, `date_month.lbl^6_x_date_wday.lbl^6`,
## #   `date_month.lbl^7_x_date_wday.lbl^6`, `date_month.lbl^8_x_date_wday.lbl^6`,
## #   `date_month.lbl^9_x_date_wday.lbl^6`, `date_month.lbl^10_x_date_wday.lbl^6`,
## #   `date_month.lbl^11_x_date_wday.lbl^6`, date_month.lbl.L_x_date_yday,
## #   date_month.lbl.Q_x_date_yday, …

Model Specification

Next, let’s create a model specification. We’ll use a glmnet.

model_spec_glmnet <- linear_reg(mode = "regression", penalty = 10, mixture = 0.7) %>%
    set_engine("glmnet")

Workflow

We can marry up the preprocessing recipe and the model using a workflow().

workflow_glmnet <- workflow() %>%
    add_recipe(recipe_spec_final) %>%
    add_model(model_spec_glmnet)

workflow_glmnet
## ══ Workflow ═══════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────
## 8 Recipe Steps
## 
## ● step_timeseries_signature()
## ● step_rm()
## ● step_rm()
## ● step_normalize()
## ● step_interact()
## ● step_interact()
## ● step_interact()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 10
##   mixture = 0.7
## 
## Computational engine: glmnet

Training

The workflow can be trained with the fit() function.

workflow_trained <- workflow_glmnet %>%
    fit(data = train_tbl)

Visualize the Test (Validation) Forecast

With a suitable model in hand, we can forecast using the “test” set for validation purposes.

prediction_tbl <- workflow_trained %>%
    predict(test_tbl) %>%
    bind_cols(test_tbl)

prediction_tbl
## # A tibble: 184 x 3
##    .pred date       value
##    <dbl> <date>     <dbl>
##  1 6903. 2012-07-01  5531
##  2 7030. 2012-07-02  6227
##  3 6960. 2012-07-03  6660
##  4 6931. 2012-07-04  7403
##  5 6916. 2012-07-05  6241
##  6 6934. 2012-07-06  6207
##  7 7169. 2012-07-07  4840
##  8 6791. 2012-07-08  4672
##  9 6837. 2012-07-09  6569
## 10 6766. 2012-07-10  6290
## # … with 174 more rows

Visualize the results using ggplot().

ggplot(aes(x = date), data = bikes_tbl) +
    geom_rect(xmin = as.numeric(ymd("2012-07-01")),
              xmax = as.numeric(ymd("2013-01-01")),
              ymin = 0, ymax = 10000,
              fill = palette_light()[[4]], alpha = 0.01) +
    annotate("text", x = ymd("2011-10-01"), y = 7800,
             color = palette_light()[[1]], label = "Train Region") +
    annotate("text", x = ymd("2012-10-01"), y = 1550,
             color = palette_light()[[1]], label = "Test Region") +
    geom_point(aes(x = date, y = value), alpha = 0.5, color = palette_light()[[1]]) +
    # Add predictions
    geom_point(aes(x = date, y = .pred), data = prediction_tbl,
               alpha = 0.5, color = palette_light()[[2]]) +
    theme_tq() +
    labs(title = "GLM: Out-Of-Sample Forecast")

[Plot: GLM: Out-Of-Sample Forecast]

Validation Accuracy (Out of Sample)

The Out-of-Sample Forecast Accuracy can be measured with yardstick.

# Calculating forecast error
prediction_tbl %>% metrics(value, .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    1377.   
## 2 rsq     standard       0.422
## 3 mae     standard    1022.

Next, we can visualize the residuals on the test set. The residuals of the model aren’t perfect, but we can work with them. The residuals show that the model predicts low in October and high in December.

prediction_tbl %>%
    ggplot(aes(x = date, y = value - .pred)) +
    geom_hline(yintercept = 0, color = "black") +
    geom_point(color = palette_light()[[1]], alpha = 0.5) +
    geom_smooth(span = 0.05, color = "red") +
    geom_smooth(span = 1.00, se = FALSE) +
    theme_tq() +
    labs(title = "GLM Model Residuals, Out-of-Sample", x = "") +
    scale_y_continuous(limits = c(-5000, 5000))

[Plot: GLM Model Residuals, Out-of-Sample]

At this point you might go back to the model and try tweaking features using interactions or polynomial terms, adding other features that may be known in the future (e.g. the temperature of the day can be forecasted relatively accurately within 7 days), or try a completely different modeling technique with the hope of better predictions on the test set. Once you feel that your model is optimized, move on to the final step of forecasting.

This accuracy can be improved significantly with Competition-Level Forecasting Strategies. And, guess what?! I teach these strategies in my NEW Advanced Time Series Forecasting Course (coming soon). Register for the waitlist to get notified. 👇


Learn algorithms that win competitions

I have the Advanced Time Series Forecasting Course (Coming Soon). This course pulls forecasting strategies from experts that have placed 1st and 2nd solutions in 3 of the most important Time Series Competitions. Learn the strategies that win forecasting competitions. Then apply them to your time series projects.

Join the waitlist to get notified of the Course Launch!


Forecasting Future Data

Let’s use our model to predict the expected future values for the next six months. The first step is to create the date sequence. Let’s use tk_get_timeseries_summary() to review the summary of the dates from the original dataset, “bikes”.

# Extract bikes index
idx <- bikes_tbl %>% tk_index()

# Get time series summary from index
bikes_summary <- idx %>% tk_get_timeseries_summary()

The first six parameters are general summary information.

bikes_summary[1:6]
## # A tibble: 1 x 6
##   n.obs start      end        units scale tzone
##   <int> <date>     <date>     <chr> <chr> <chr>
## 1   731 2011-01-01 2012-12-31 days  day   UTC

The second six parameters are the periodicity information.

bikes_summary[7:12]
## # A tibble: 1 x 6
##   diff.minimum diff.q1 diff.median diff.mean diff.q3 diff.maximum
##          <dbl>   <dbl>       <dbl>     <dbl>   <dbl>        <dbl>
## 1        86400   86400       86400     86400   86400        86400

From the summary, we know that the data is 100% regular because the median and mean differences are 86400 seconds or 1 day. We don’t need to do any special inspections when we use tk_make_future_timeseries(). If the data was irregular, meaning weekends or holidays were excluded, you’d want to account for this. Otherwise your forecast would be inaccurate.
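Since our index is perfectly regular, the simple call below is all we need. Purely for reference, here is a hedged sketch of how an irregular index might be handled; it assumes the inspect_weekdays and skip_values arguments of tk_make_future_timeseries() are available in your timetk version (check the documentation):

# NOT needed for this dataset -- illustration only
idx_future_irregular <- idx %>%
    tk_make_future_timeseries(
        n_future         = 180,
        inspect_weekdays = TRUE,               # drop weekend days if the history omits them
        skip_values      = ymd("2013-07-04")   # e.g., exclude a known holiday
    )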

idx_future <- idx %>% tk_make_future_timeseries(n_future = 180)

future_tbl <- tibble(date = idx_future)

future_tbl
## # A tibble: 180 x 1
##    date      
##    <date>    
##  1 2013-01-01
##  2 2013-01-02
##  3 2013-01-03
##  4 2013-01-04
##  5 2013-01-05
##  6 2013-01-06
##  7 2013-01-07
##  8 2013-01-08
##  9 2013-01-09
## 10 2013-01-10
## # … with 170 more rows

Retrain the model specification on the full data set, then predict the next 6-months.

future_predictions_tbl <- workflow_glmnet %>%
    fit(data = bikes_tbl) %>%
    predict(future_tbl) %>%
    bind_cols(future_tbl)

Visualize the forecast.

bikes_tbl %>%
    ggplot(aes(x = date, y = value)) +
    geom_rect(xmin = as.numeric(ymd("2012-07-01")),
              xmax = as.numeric(ymd("2013-01-01")),
              ymin = 0, ymax = 10000,
              fill = palette_light()[[4]], alpha = 0.01) +
    geom_rect(xmin = as.numeric(ymd("2013-01-01")),
              xmax = as.numeric(ymd("2013-07-01")),
              ymin = 0, ymax = 10000,
              fill = palette_light()[[3]], alpha = 0.01) +
    annotate("text", x = ymd("2011-10-01"), y = 7800,
             color = palette_light()[[1]], label = "Train Region") +
    annotate("text", x = ymd("2012-10-01"), y = 1550,
             color = palette_light()[[1]], label = "Test Region") +
    annotate("text", x = ymd("2013-4-01"), y = 1550,
             color = palette_light()[[1]], label = "Forecast Region") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    # future data
    geom_point(aes(x = date, y = .pred), data = future_predictions_tbl,
               alpha = 0.5, color = palette_light()[[2]]) +
    geom_smooth(aes(x = date, y = .pred), data = future_predictions_tbl,
                method = 'loess') +
    labs(title = "Bikes Sharing Dataset: 6-Month Forecast", x = "") +
    theme_tq()

[Plot: Bikes Sharing Dataset: 6-Month Forecast]

Forecast Error

A forecast is never perfect. We need prediction intervals to account for the variance from the model predictions to the actual data. There’s a number of methods to achieve this. We’ll follow the prediction interval methodology from Forecasting: Principles and Practice.

# Calculate standard deviation of residuals
test_resid_sd <- prediction_tbl %>%
    summarize(stdev = sd(value - .pred))

future_predictions_tbl <- future_predictions_tbl %>%
    mutate(
        lo.95 = .pred - 1.96 * test_resid_sd$stdev,
        lo.80 = .pred - 1.28 * test_resid_sd$stdev,
        hi.80 = .pred + 1.28 * test_resid_sd$stdev,
        hi.95 = .pred + 1.96 * test_resid_sd$stdev
    )

Now, plotting the forecast with the prediction intervals.

bikes_tbl %>%
    ggplot(aes(x = date, y = value)) +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_ribbon(aes(y = .pred, ymin = lo.95, ymax = hi.95),
                data = future_predictions_tbl,
                fill = "#D5DBFF", color = NA, size = 0) +
    geom_ribbon(aes(y = .pred, ymin = lo.80, ymax = hi.80, fill = key),
                data = future_predictions_tbl,
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    geom_point(aes(x = date, y = .pred), data = future_predictions_tbl,
               alpha = 0.5, color = palette_light()[[2]]) +
    geom_smooth(aes(x = date, y = .pred), data = future_predictions_tbl,
                method = 'loess', color = "white") +
    labs(title = "Bikes Sharing Dataset: 6-Month Forecast with Prediction Intervals", x = "") +
    theme_tq()

[Plot: Bikes Sharing Dataset: 6-Month Forecast with Prediction Intervals]

My Key Points on Time Series Machine Learning

Forecasting using the time series signature can be very accurate, especially when time-based patterns are present in the underlying data. As with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. the past is not a suitable predictor of future performance). However, in many situations the time series signature can provide an accurate forecast.

External Regressors – A huge benefit: One benefit of the machine learning approach that was not covered in this tutorial is that correlated features (including non-time-based ones) can be included in the analysis. This is called adding External Regressors – examples include adding data from weather, financial, energy, Google Analytics, email providers, and more. For example, one can expect that experts in Bike Sharing analytics have access to historical temperature and weather patterns, wind speeds, and so on that could have a significant effect on bicycle sharing. The beauty of this method is that these features can easily be incorporated into the model and prediction.
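As a hedged illustration (this is not part of the original tutorial), an external regressor enters the workflow as just another column added before the recipe is specified. The temperature values below are simulated placeholders; in practice you would join real weather data by date:

# Hypothetical external regressor: the temperature column is made up here
set.seed(123)
bikes_xreg_tbl <- bikes_tbl %>%
    mutate(temperature = runif(n(), min = 0, max = 35))

# The regressor then flows through the same recipe machinery as the
# calendar-based features
recipe_spec_xreg <- recipe(value ~ ., data = bikes_xreg_tbl) %>%
    step_timeseries_signature(date) %>%
    step_rm(date) %>%
    step_normalize(contains("index.num"), date_year, temperature)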

There is a whole lot more to time series forecasting that we did not cover (read on).👇

How to Learn Time Series Forecasting?

Here are some techniques you need to learn to become good at forecasting. These techniques are absolutely critical to developing forecasts that will return ROI to your company:

  • ✅ Preprocessing
  • ✅ Feature engineering using Lagged Features and External Regressors
  • ✅ Hyperparameter Tuning
  • ✅ Time series cross validation
  • ✅ Using Multiple Modeling Techniques
  • ✅ Leveraging Autocorrelation
  • ✅ and more.

All of these techniques are covered in my upcoming Advanced Time Series Course (Register Here). I teach Competition-Winning Forecast Strategies too:

  • ✅ Ensembling Strategies and Techniques
  • ✅ Deep Learning Algorithms leveraging Recurrent Neural Networks
  • ✅ Feature-Based Model Selection

And a whole lot more! It should be simple by now – Join my course waitlist.

Join the Advanced Time Series Course Waitlist


RcppCCTZ 0.2.7


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new release 0.2.7 of RcppCCTZ is now at CRAN.

RcppCCTZ uses Rcpp to bring CCTZ to R. CCTZ is a C++ library for translating between absolute and civil times using the rules of a time zone. In fact, it is two libraries: one for dealing with civil time (human-readable dates and times), and one for converting between absolute and civil times via time zones. And while CCTZ is made by Google(rs), it is not an official Google product. The RcppCCTZ page has a few usage examples and details. This package was the first CRAN package to use CCTZ; by now at least three others do—using copies in their packages, which remains less than ideal.

This version adds internal extensions, contributed by Leonardo, which support upcoming changes to the nanotime package we are working on.

Changes in version 0.2.7 (2020-03-18)

  • Added functions _RcppCCTZ_convertToCivilSecond that converts a time point to the number of seconds since epoch, and _RcppCCTZ_convertToTimePoint that converts a number of seconds since epoch into a time point; these functions are only callable from C level (Leonardo in #34 and #35).

  • Added function _RcppCCTZ_getOffset that returns the offset at a specified time point for a specified timezone; this function is only callable from C level (Leonardo in #32).

We also have a diff to the previous version thanks to CRANberries. More details are at the RcppCCTZ page; code, issue tickets etc at the GitHub repository.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


nnlib2Rcpp: a(nother) R package for Neural Networks


[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]

For anyone interested, nnlib2Rcpp is an R package containing a number of Neural Network implementations and is available on GitHub. It can be installed as follows (the usual way for packages on GitHub):

library(devtools)
install_github("VNNikolaidis/nnlib2Rcpp")

The NNs are implemented in C++ (using the nnlib2 C++ class library) and are interfaced with R via the Rcpp package (which is also required).

The package currently includes the following NN implementations:

  • A Back-Propagation (BP) multi-layer NN (supervised) for input-output mappings.
  • An Autoencoder NN (unsupervised) for dimensionality reduction (a bit like PCA) or dimensionality expansion.
  • A Learning Vector Quantization NN (LVQ, supervised) for classification.
  • A Self-Organizing Map NN (unsupervised, simplified 1-D variation of SOM) for clustering (a bit like k-means).
  • A simple Matrix-Associative-Memory NN (MAM, supervised) for storing input-output vector pairs.

For more information see the package documentation.


Extended floating point precision in R with Rmpfr


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

I learnt from a recent post on John Cook’s excellent blog that it’s really easy to do extended floating point computations in R using the Rmpfr package. Rmpfr is R’s wrapper around the C library MPFR, which stands for “Multiple Precision Floating-point Reliable”.

The main function that users will interact with is the mpfr function: it converts numeric values into (typically) high-precision numbers, which can then be used for computation. The function’s first argument is the numeric value(s) to be converted, and the second argument, precBits, represents the maximal precision to be used, in bits. For example, precBits = 53 corresponds to double precision.

In his blog post, Cook gives an example of computing π to 100 decimal places by multiplying the arctangent of 1 by 4 (recall that tan(π/4) = 1, so arctan(1) = π/4):

4 * atan(mpfr(1, 333))
# 1 'mpfr' number of precision  333   bits 
# [1] 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807

Why does he set the precision to 333 bits? This link suggests that with b bits, we get d = log10(2^b) ≈ 0.3010 b decimal digits of precision. (Reality for floating point numbers is not quite as straightforward as that: see this for a discussion. But for our purposes, this approximation will do.) Hence, to get 100 decimal places, we need around b = 100 / 0.3010 ≈ 332.2 bits, so he rounds it up to 333 bits.
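A quick base R sanity check of that rule of thumb:

log10(2^53)     # ~15.95 decimal digits for double precision (53 bits)
log10(2^333)    # ~100.2 decimal digits for 333 bits
100 / log10(2)  # ~332.2 bits needed for 100 decimal digits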

The first argument to mpfr can be a vector as well:

mpfr(1:10, 5)
# 10 'mpfr' numbers of precision  5   bits 
# [1]  1  2  3  4  5  6  7  8  9 10

As the next code snippet shows, R does NOT consider the output of a call to mpfr a numeric variable.

x <- sin(mpfr(1, 100))
x
# 1 'mpfr' number of precision  100   bits 
# [1] 0.84147098480789650665250232163005

is.numeric(x)
# [1] FALSE

We can use the asNumeric function to convert it to a numeric:

y <- asNumeric(x)
y
# [1] 0.841471

is.numeric(y)
# [1] TRUE

Can we use the more familiar as.numeric instead? According to the function’s documentation, as.numeric coerces to both “numeric” and to a vector, whereas asNumeric() should keep dim (and other) attributes. We can see this through a small example:

x <- mpfr(matrix(1:4, nrow = 2), 10)
x
# 'mpfrMatrix' of dim(.) =  (2, 2) of precision  10   bits 
#      [,1]   [,2]  
# [1,] 1.0000 3.0000
# [2,] 2.0000 4.0000

asNumeric(x)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

as.numeric(x)
# [1] 1 2 3 4

Why Is It Called That Way?! – Origin and Meaning of R Package Names


[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers.]

When I started with R, I soon discovered that, more often than not, a package name has a particular meaning. For example, the first package I ever installed was foreign. The name corresponds to its ability to read and write data from other foreign sources to R. While this and many other names are rather straightforward, others are much less intuitive. The name of a package often conveys a story, which is inspired by a general property of its functions. And sometimes I just don’t get the deeper meaning, because English is not my native language.

In this blog post, I will shed light on the wonderful world of package names. After this journey, you will not only admire the creativity of R package creators; you’ll also be king at your next class reunion! Or at least at the next R-Meetup.

Before we start, and I know that you are eager to continue, I have two remarks about this article. First: Sometimes, I refer to official explanations from the authors or other sources; other times, it’s just my personal explanation of why a package is called that way. So if you know better or otherwise, do not hesitate to contact me. Second: There are currently 15,341 packages on CRAN, and I am sure there are a lot more naming mysteries and ingenuities to discover than any curious blog reader would like to digest in one sitting. Therefore, I focussed on the most famous packages and added some of my other preferences.

But enough of the talking now, let’s start!

dplyr (diːˈplaɪə)

You might have noticed that many packages contain the string plyr, e.g. dbplyr, implyr, dtplyr, and so on. This homophone of pliers corresponds to its refining of base R apply-functions as part of the „split-apply-combine“ strategy. Instead of doing all steps for data analysis and manipulation at once, you split the problem into manageable pieces, apply your function to each piece, and combine everything together afterward. We see this approach in perfection when we use the pipe operator. The first part of each package just refers to the object it is applied upon. So the d stands for data frames, db for databases, im for Apache Impala, dt for data tables, and so on… Sources: Hadley Wickham

lubridate (ˈluːbrɪdeɪt)

This wonderful package makes it so easy and smooth to work with dates and times in R. You could say it goes like clockwork. In German, there is a proverb with the same meaning („Das läuft wie geschmiert“), which can literally be translated to:

„It works as lubricated“

ggplot2 (ʤiːʤiːplɒt tuː)

Leland Wilkinson wrote a book in which he defined multiple components that a comprehensive plot is made of. You have to define the data you want to show, what kind of plot it should be, e.g., points or lines, the scales of the axes, the legend, axis titles, etc. These parts, which he called layers, should be built on top of each other. The title of this influential work is Grammar of Graphics. Once you’ve got it, it enables you to build complex yet meaningful plots with concise styling across packages. That’s because its logic has also been used by many other packages like plotly, rBokeh, visNetwork, or apexcharter. Sources: ggplot2

data.table (ˈdeɪtə ˈteɪbl) – logo

Okay, full disclosure, I am a tidyverse guy, and one of my sons shall be named Hadley. At least one. However, this does not mean that I don’t appreciate the very powerful package data.table. Occasionally, I take the liberty and exploit its functions to improve the performance of my code (hello fread() and rbindlist()). Anyway, the name itself is pretty straightforward – but did you notice how cool the logo is?! Well, there is obviously the name „data.table“ and the square brackets that are fundamental in data.table syntax. Likewise, there is the assignment by reference operator, a.k.a. the walrus operator. „Wait, stop,“ your inner marine mammal researcher says, „isn’t this a sea lion on top there?!“ Yes indeed! The sea lion is used to highlight that it is an R package since, of course, it shouts R! R!. Source: Rdatatable

tibble (tɪbl)

Regular base R data frames are nice, but did you ever print a data frame in the console, unaware that it is 10 million rows long? Good luck with interrupting R without quitting the whole session. That might be one of the reasons why the tidyverse uses another type of data frame: tibbles. The name tibble could just stem from its similar sound to table, but I suspect there is more to it than meets the eye. Did you ever hear the story about Tibbles and Stephen Island’s Wren? NO? Then let me take you to New Zealand, AD 1894. Between the northern and southern main islands of NZ, there is a small and uninhabited island: Stephen Island. Its rocks have been the downfall of many poor souls that tried to pass the Cook Strait. Therefore, it was decided to build a lighthouse so that ships could henceforth pass safely and undamaged. Due to its isolation, Stephen Island was the only habitat for many rare species. One of these was Lyall’s wren, a small flightless passerine. It did not know any predators and lived its life in joy and harmony, until… the arrival of the first lighthouse keeper. His name was David Lyall and he was a man interested in natural history and, facing a long and lonely time on his own at Stephen Island, the owner of a cat. This cat was not satisfied by just comforting Mr. Lyall and enjoying beach walks. Shortly after his arrival, Mr. Lyall noticed the carcasses of little birds, seemingly slaughtered and dishonored by a fierce predator. Interested in biology as he was, he found out that these small birds were a distinct species. He preserved some carcasses in alcohol and sent them to a friend. This was in October 1894. A scientific article about the wren was published in an ornithology journal, soon making the specimen a sought-after collector’s item. The summer in New Zealand went on, and in February 1895, four bird-watchers arrived at Stephen Island. They were looking for this cute little wren and found… none. Within a few months, Mr. Lyall’s hungry cat had made the whole species go extinct. On March 16, 1895, the Christchurch newspaper The Press wrote: „there is very good reason to believe that the bird is no longer to be found on the island, and, as it is not known to exist anywhere else, it has apparently become quite extinct. This is probably a record performance in the way of extermination.“ The name of the cat? Tibbles. Sources: Wikipedia; All About Birds; Oddity Central. Indicator: Hadley Wickham’s birth country

purrr (pɜːɜː)

This extension of the base R apply-functions has been one of my favorites lately. The concise usage of purrr enables powerful functional programming that, in turn, makes your code faster, more readable, and more stable. Or, as Mr. Wickham states, it makes „your pure R functions purr“. Also, note its parallelized sibling furrr. Sources: Hadley Wickham

Amelia (əˈmiːlɪə)

During my Master’s degree, I had a course about missing data and multiple imputation. One of the packages we used, or rather analyzed, was Amelia. It turned out that this package is named after an impressive woman: Amelia Earhart. Living in the early 20th century, she was an aviation pioneer and feminist. She was the first woman to fly solo across the Atlantic, a remarkable achievement and an inspiration for women to start a technical career. Unfortunately, she disappeared during a flight over the central Pacific at age 39 and is thus… missing. ba dum-tss Source: Gary King – Co-Author

magrittr (maɡʁitə)

The conciseness of coding with dplyr or its siblings is not imaginable without the pipe operator %>%. This allows you to write and read code from top to bottom and from left to right, just like regular text. Pipes are no special feature of R, yet I am sure René Magritte had nothing else in mind when he painted The Treachery of Images in 1929 with its slogan: „Ceci n’est pas une pipe“. The logo designers just made a slight adjustment to his painting. Or should I say: unearthed the meaning that has always been behind it?! Sources: Vignette; revolutionanalytics.com (https://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html)

batman (ˈbætmən)

Data science could be quite fun if it weren’t for the data. Especially when working with textual data, typos and inconsistent coding can be very cumbersome. For example, you’ve got questionnaire data consisting of yes/no questions. For R, this corresponds to TRUE/FALSE, but who would write this in a questionnaire? In fact, when we try to convert such data to logical values by calling as.logical(), almost every string becomes NA. Lost and doomed? NO! Cause who is more of an expert at determining actual NA’s than nananananana…batman!

Homeric (həʊˈmɛrɪk)

Hey, you made it all the way down here?! You deserve a little treat! What about a soft, sweet, and special-sprinkled donut? And who would be better suited to present it to you than the best-known lover of donuts himself: Homer Simpson! Just help yourself: Homeric::PlotDoughnut(1, col = "magenta") Source: Homeric Documentation

fcuk (fʌk)

Error in view(my_data): could not find function "view". Are you sick and tired of this or similar error messages? Do you regularly employ your ample stock of swear words to describe the stupidity of inconsistent usage of camel or snake case function names across packages? Or do you just type faster than your shadow, causing minor typos in your otherwise excellent code? There is help! Just go and install the amazing fcuk package, and useless error messages are a thing of the past.

hellno (hɛl nəʊ)

Slip into the role of a dedicated R user: imagine the trouble you must have had with a specific default argument value of a base R function to write an entire package that just handles this case. I am talking about the tormentor of many beginRs when working with as.data.frame(): stringsAsFactors = TRUE. But you do not only change it to FALSE! You also create your own FALSE value and name it HELLNO.

Honorable mentions

  • gremlin: package for mixed-effects model REML incorporating Generalized Inverses.
  • harrietr: named after Charles Darwin’s pet giant tortoise. A package for phylogenetic and evolutionary biology data manipulations.
  • beginr: it helps where we’ve all been, searching for ages until setting pch = 16.
  • charlatan: worse than creating dubious medicine, this one makes fake data.
  • fauxpas: explains what specific HTTP errors mean.
  • fishualize: give your plots a fishy look.
  • greybox: why just thinking black or white? This is a package for time series analysis.
  • vroom: it reads data so fast to R, you almost can hear it making vroom vroom.
  • helfRlein: some little helper functions, inspired by the German word Helferlein = little helper.
About the author

Matthias Nistler

I am a data scientist at STATWORX and passionate about wrangling data and getting the most out of it. Outside of the office, I use every second for cycling until the sun goes down.

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

Sign Up Now!


The post Why Is It Called That Way?! – Origin and Meaning of R Package Names first appeared on STATWORX.


How to do a t-test or ANOVA for many variables at once in R and communicate the results in a better way


[This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]

Photo by Teemu Paananen

Introduction

As part of my teaching assistant position in a Belgian university, students often ask me for some help in their statistical analyses for their master’s thesis.

A frequent question is how to compare groups of patients in terms of several quantitative continuous variables. Most of us know that:

  • To compare two groups, a Student’s t-test should be used1
  • To compare three groups or more, an ANOVA should be performed

These two tests are quite basic and have been extensively documented online and in statistical textbooks so the difficulty is not in how to perform these tests.

In the past, I used to do the analyses by following these 3 steps:

  1. Draw boxplots illustrating the distributions by group (with the boxplot() function or thanks to the {esquisse} RStudio addin if I wanted to use the {ggplot2} package)
  2. Perform a t-test or an ANOVA depending on the number of groups to compare (with the t.test() and oneway.test() functions for t-test and ANOVA, respectively)
  3. Repeat steps 1 and 2 for each variable

This was feasible as long as there were only a couple of variables to test. Nonetheless, most students came to me asking to perform these kinds of tests not on one or two variables, but on multiple variables (sometimes up to around 100 variables!). So when there were many variables to test, I quickly realized that I was wasting my time and that there must be a more efficient way to do the job.

Perform multiple tests at once

I thus wrote a piece of code that automated the process, by drawing boxplots and performing the tests on several variables at once. Below is the code I used, illustrating the process with the iris dataset. The Species variable has 3 levels, so let’s remove one, and then draw a boxplot and apply a t-test on all 4 continuous variables at once. Note that the continuous variables that we would like to test are variables 1 to 4 in the iris dataset.

dat <- iris

# remove one level to have only two groups
dat <- subset(dat, Species != "setosa")
dat$Species <- factor(dat$Species)

# boxplots and t-tests for the 4 variables at once
for (i in 1:4) { # variables to compare are variables 1 to 4
  boxplot(dat[, i] ~ dat$Species, # draw boxplots by group
    ylab = names(dat[i]), # rename y-axis with variable's name
    xlab = "Species"
  )
  print(t.test(dat[, i] ~ dat$Species)) # print results of t-test
}

## 
##  Welch Two Sample t-test
## 
## data:  dat[, i] by dat$Species
## t = -5.6292, df = 94.025, p-value = 1.866e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8819731 -0.4220269
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                     5.936                     6.588

## 
##  Welch Two Sample t-test
## 
## data:  dat[, i] by dat$Species
## t = -3.2058, df = 97.927, p-value = 0.001819
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.33028364 -0.07771636
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                     2.770                     2.974

## 
##  Welch Two Sample t-test
## 
## data:  dat[, i] by dat$Species
## t = -12.604, df = 95.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.49549 -1.08851
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                     4.260                     5.552

## 
##  Welch Two Sample t-test
## 
## data:  dat[, i] by dat$Species
## t = -14.625, df = 89.043, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7951002 -0.6048998
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                     1.326                     2.026

As you can see, the above piece of code draws a boxplot and then prints results of the test for each continuous variable, all at once.

At some point in the past, I even wrote code to:

  1. draw a boxplot
  2. test for the equality of variances (thanks to the Levene’s test)
  3. depending on whether the variances were equal or unequal, the appropriate test was applied: the Welch test if the variances were unequal and the Student’s t-test if the variances were equal (see more details about the different versions of the t-test for two samples)
  4. apply steps 1 to 3 for all continuous variables at once (a sketch of this routine follows the list)
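Here is a minimal sketch of what that routine might look like (my reconstruction, not the author’s original code; it assumes the car package for Levene’s test and reuses the two-group dat object from the example above):

library(car) # for leveneTest()

for (i in 1:4) {
  # 1. boxplot by group
  boxplot(dat[, i] ~ dat$Species, ylab = names(dat)[i], xlab = "Species")
  # 2. Levene's test for equality of variances
  equal_var <- leveneTest(dat[, i] ~ dat$Species)[1, "Pr(>F)"] >= 0.05
  # 3. Student's t-test if variances look equal, Welch's t-test otherwise
  print(t.test(dat[, i] ~ dat$Species, var.equal = equal_var))
}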

I had a similar code for ANOVA in case I needed to compare more than two groups.

The code was doing the job relatively well. Indeed, thanks to this code I was able to test many variables in an automated way in the sense that it compared groups for all variables at once.

The only thing I had to change from one project to another was the name of the grouping variable and the numbering of the continuous variables to test (Species and 1:4 in the above code).

Concise and easily interpretable results

T-test

Although it was working quite well and applicable to different projects with only minor changes, I was still unsatisfied with another point.

Someone who is proficient in statistics and R can read and interpret the output of a t-test without any difficulty. However, as you may have noticed with your own statistical projects, most people do not know what to look for in the results and are sometimes a bit confused when they see so many graphs, code, output, results and numeric values in a document. They are quite easily overwhelmed by this mass of information.

With my old R routine, the time I was saving by automating the process of t-tests and ANOVA was (partially) lost when I had to explain R outputs to my students so that they could interpret the results correctly. Although most of the time it simply boiled down to pointing out what to look for in the outputs (i.e., p-values), I was still losing quite a lot of time because these outputs were, in my opinion, too detailed for most real-life applications. In other words, too much information seemed to be confusing for many people, so I was still not convinced that this was the optimal way to share statistical results with nonscientists.

Of course, they came to me for statistical advice, so they expected to have these results and I needed to give them answers to their questions and hypotheses. Nonetheless, I wanted to find a better way to communicate these results to this type of audience, with the minimum of information required to arrive at a conclusion. No more and no less than that.

After a long time spent online trying to figure out a way to present results in a more concise and readable way, I discovered the {ggpubr} package. This package allows you to indicate the test used and the p-value of the test directly on a ggplot2-based graph. It also facilitates the creation of publication-ready plots for non-advanced statistical audiences.

After many refinements and modifications of the initial code (available in this article), I finally came up with a rather stable and robust process to perform t-tests and ANOVA for many variables at once, and more importantly, make the results concise and easily readable by anyone (statisticians or not).

A graph is worth a thousand words, so here are the exact same tests as in the previous section, but this time with my new R routine:

library(ggpubr)

# Edit from here #
x <- which(names(dat) == "Species") # name of grouping variable
y <- which(names(dat) == "Sepal.Length" # names of variables to test
           | names(dat) == "Sepal.Width"
           | names(dat) == "Petal.Length"
           | names(dat) == "Petal.Width")
method <- "t.test" # one of "wilcox.test" or "t.test"
paired <- FALSE # if paired make sure that in the dataframe you have first all individuals at T1, then all individuals again at T2
# Edit until here

# Edit at your own risk
for (i in y) {
  for (j in x) {
    ifelse(paired == TRUE,
      p <- ggpaired(dat,
        x = colnames(dat[j]), y = colnames(dat[i]),
        color = colnames(dat[j]), line.color = "gray", line.size = 0.4,
        palette = "npg",
        legend = "none",
        xlab = colnames(dat[j]),
        ylab = colnames(dat[i]),
        add = "jitter"
      ),
      p <- ggboxplot(dat,
        x = colnames(dat[j]), y = colnames(dat[i]),
        color = colnames(dat[j]),
        palette = "npg",
        legend = "none",
        add = "jitter"
      )
    )
    #  Add p-value
    print(p + stat_compare_means(aes(label = paste0(..method.., ", p-value = ", ..p.format.., " (", ifelse(..p.adj.. >= 0.05, "not significant", ..p.signif..), ")")),
      method = method,
      paired = paired,
      # group.by = NULL,
      ref.group = NULL
    ))
  }
}

As you can see from the graphs above, only the most important information is presented for each variable:

  • a visual comparison of the groups thanks to boxplots
  • the name of the statistical test
  • the p-value of the test

Based on these graphs, it is easy, even for non-experts, to interpret the results and conclude that the versicolor and virginica species are significantly different in terms of all 4 variables (since p-values < 0.05).

Of course, experts may be interested in more advanced results. However, this simple yet complete graph, which includes the name of the test and the p-value, gives all the necessary information to answer the question: “Are the groups different?”.

In my experience, I have noticed that students and professionals (especially those from a less scientific background) understand these results far better than the ones presented in the previous section.

The only lines of code that need to be modified for your own project are the name of the grouping variable (Species in the above code), the names of the variables you want to test (Sepal.Length, Sepal.Width, etc.),2 whether you want to apply a t-test (t.test) or Wilcoxon test (wilcox.test), and whether the samples are paired or not (FALSE if samples are independent, TRUE if they are paired).

ANOVA

Below the same process with an ANOVA. Note that we reload the dataset iris to include all three Species this time:

dat <- iris

# Edit from here
x <- which(names(dat) == "Species") # name of grouping variable
y <- which(names(dat) == "Sepal.Length" # names of variables to test
  | names(dat) == "Sepal.Width"
  | names(dat) == "Petal.Length"
  | names(dat) == "Petal.Width")
method1 <- "anova" # one of "anova" or "kruskal.test"
method2 <- "t.test" # one of "wilcox.test" or "t.test"
my_comparisons <- list(c("setosa", "versicolor"), c("setosa", "virginica"), c("versicolor", "virginica")) # comparisons for post-hoc tests
# Edit until here

# Edit at your own risk
for (i in y) {
  for (j in x) {
    p <- ggboxplot(dat,
      x = colnames(dat[j]), y = colnames(dat[i]),
      color = colnames(dat[j]),
      legend = "none",
      palette = "npg",
      add = "jitter"
    )
    print(
      p + stat_compare_means(aes(label = paste0(..method.., ", p-value = ", ..p.format.., " (", ifelse(..p.adj.. > 0.05, "not significant", ..p.signif..), ")")),
        method = method1, label.y = max(dat[, i], na.rm = TRUE)
      )
      + stat_compare_means(comparisons = my_comparisons, method = method2, label = "p.format") # remove if p-value of ANOVA or Kruskal-Wallis test >= 0.05
    )
  }
}

Like the improved routine for the t-test, I have noticed that students and non-expert professionals understand ANOVA results presented this way much more easily compared to the default R outputs.

With one graph for each variable, it is easy to see that all species are different from each other in terms of all 4 variables (since all p-values of post-hoc tests < 0.05).

If you want to apply the same automated process to your data, you will need to modify the name of the grouping variable (Species), the names of the variables you want to test (Sepal.Length, etc.), whether you want to perform an ANOVA (anova) or Kruskal-Wallis test (kruskal.test) and finally specify the comparisons for the post-hoc tests.3

To go even further

As we have seen, these two improved R routines allow you to:

  1. Perform t-tests and ANOVA on a small or large number of variables with only minor changes to the code. I basically only have to replace the variable names and the name of the test I want to use. It takes almost the same time to test one or dozens of variables, so it is quite an improvement compared to testing one variable at a time.
  2. Share test results in a much cleaner and more readable way, thanks to a graph showing the observations by group with the p-value of the appropriate test included on it. This is particularly important when communicating results to a wider audience or to people from diverse backgrounds.

However, like most of my R routines, these two pieces of code are still a work in progress. Below are some additional features I have been thinking of and which could be added in the future to make the process of comparing two or more groups even more optimal:

  • Add the possibility to select variables by their position in the dataframe; for the moment it is only possible to do so via their names. This would allow the process to be automated even further because, instead of typing all variable names one by one, we could simply type 4:100 (to test variables 4 to 100, for instance); a sketch of what this could look like follows this list.
  • When comparing more than two groups, it is only possible to apply an ANOVA or Kruskal-Wallis test at the moment. A major improvement would be to add the possibility to perform a repeated measures ANOVA (i.e., an ANOVA when the samples are dependent). It is currently already possible to do a t-test with two paired samples, but it is not yet possible to do the same with more than two groups.
  • Another less important (yet still nice) feature when comparing more than 2 groups would be to automatically apply post-hoc tests only in the case where the null hypothesis of the ANOVA or Kruskal-Wallis test is rejected (so when there is at least one group different from the others, because if the null hypothesis of equal groups is not rejected we do not apply a post-hoc test). At the present time, I manually add or remove the code that displays the p-values of post-hoc tests depending on the global p-value of the ANOVA or Kruskal-Wallis test.
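For the first point, here is a hypothetical sketch of what selecting variables by position could look like (this is not part of the current routine, just an illustration):

# instead of matching variable names one by one:
y <- 4:100                          # test columns 4 to 100 of dat
# or, for example, every numeric column of the dataframe:
y <- which(sapply(dat, is.numeric))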

I will try to add these features in the future, or I would be glad to help if the author of the {ggpubr} package needs help in including these features (I hope he will see this article!).

Thanks for reading. I hope this article will help you to perform t-tests and ANOVA for multiple variables at once and make the results more easily readable and interpretable by nonscientists. Learn more about the t-test and how to compare two samples in this article.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.



  1. In theory, an ANOVA can also be used to compare two groups as it will give the same results compared to a Student’s t-test, but in practice we use the Student’s t-test to compare two samples and the ANOVA to compare three samples or more.↩

  2. Do not forget to separate the variables you want to test with |.↩

  3. Post-hoc test is only the name used to refer to a specific type of statistical test. Post-hoc tests include, among others, the Tukey HSD test, the Bonferroni correction and Dunnett’s test. Even if an ANOVA or a Kruskal-Wallis test can determine whether there is at least one group that is different from the others, it does not allow us to conclude which ones are different from each other. For this purpose, there are post-hoc tests that compare all groups two by two to determine which ones are different, after adjusting for multiple comparisons. Concretely, post-hoc tests are performed on each possible pair of groups after an ANOVA or a Kruskal-Wallis test has shown that there is at least one group which is different (hence “post” in the name of this type of test). The null and alternative hypotheses and the interpretations of these tests are similar to a Student’s t-test for two samples.↩


To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R.


parzer: Parse Messy Geographic Coordinates


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

parzer is a new package for handling messy geographic coordinates. The first version is now on CRAN, with binaries coming soon hopefully (see note about installation below). The package recently completed rOpenSci review.

parzer motivation

The idea for this package started with a tweet from Noam Ross (https://twitter.com/noamross/status/1070733367522590721) about 15 months ago.

The idea is that you sometimes have geographic coordinates in a messy format, or in many different formats. You can think of parzer as doing for geographic coordinates what lubridate does for dates.

I started off thinking about wrapping a Javascript library with Jeroen’s V8 R package, but then someone showed me, or I found (I can’t remember), some C++ code from back in 2006 that seemed appropriate. I went down the C++ track instead of the Javascript track because I figured I could get better performance out of C++ and have slightly fewer install headaches for users.

Package installation

The package is on CRAN so you can use install.packages

install.packages("parzer")

However, since this package requires compilation you probably want a binary. Binaries are not available on CRAN yet. You can install a binary like this:

install.packages("parzer", repos = "https://dev.ropensci.org/")
library(parzer)

Check out the package documentation to get started: https://docs.ropensci.org/parzer/

Package basics

The following is a summary of the functions in the package and what they do:

Parse latitude or longitude separately

  • parse_lat
  • parse_lon

Parse latitudes and longitudes at the same time

  • parse_lon_lat

Parse into separate parts of degrees, minutes, seconds

  • parse_parts_lat
  • parse_parts_lon

Pull out separately degrees, minutes, seconds, or hemisphere

  • pz_degree
  • pz_minute
  • pz_second
  • parse_hemisphere

Add/subtract degrees, minutes, seconds

  • pz_d
  • pz_m
  • pz_s

Some examples:

parse latitudes and longitudes

lats <- c("40.123°", "40.123N74.123W", "191.89", 12, "N45 04.25764")
parse_lat(lats)
#> Warning in pz_parse_lat(lat): invalid characters, got: 40.123n74.123w

#> Warning in pz_parse_lat(lat): not within -90/90 range, got: 191.89
#>   check that you did not invert lon and lat

#> [1] 40.12300      NaN      NaN 12.00000 45.07096
longs <- c("45W54.2356", "181", 45, 45.234234, "-45.98739874N")
parse_lon(longs)
#> Warning in pz_parse_lon(lon): invalid characters, got: -45.98739874n

#> [1] -45.90393 181.00000  45.00000  45.23423       NaN

In the above examples you can see there’s a mix of valid coordinate values as well as invalid values. There’s a mix of types supported as well.

Sometimes you may want to parse a geographic coordinate into its component parts; parse_parts_lat and parse_parts_lon are what you need:

x <- c("191.89", 12, "N45 04.25764")
parse_parts_lon(x)
#> Warning in pz_parse_parts_lon(scrub(str)): invalid characters, got: n45 04.25764

#>   deg min      sec
#> 1 191  53 23.99783
#> 2  12   0  0.00000
#> 3  NA  NA      NaN

Taking a cue from lubridate, we thought it would be useful to make it easier to add or subtract numbers for coordinates. Three functions help with this:

pz_d(31)
#> 31
pz_d(31) + pz_m(44)
#> 31.73333
pz_d(31) - pz_m(44)
#> 30.26667
pz_d(31) + pz_m(44) + pz_s(59)
#> 31.74972
pz_d(-121) + pz_m(1) + pz_s(33)
#> -120.9742
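The parse_lon_lat() function listed above handles both axes in one call. A small sketch (output omitted; worth double-checking against the package docs for the exact return shape):

parse_lon_lat(lon = c("45W54.2356", "181"), lat = c("40.123°", "191.89"))
# returns a data.frame with parsed lon and lat columns; unparseable values come back as NaN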

Use cases

Check out the parzer use cases vignette on the docs site. Get in touch if you have a use case that might be good to add to that vignette.

Thanks

Thanks to the reviewers Maria Munafó and Julien Brun for their time invested in improving the package.

To Do

There’s more to do. We are thinking about dropping the Rcpp dependency, supporting parsing of strings that contain both latitude and longitude together, improving error messages, and more.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.



Simulating COVID-19 interventions with R


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Tim Churches is a Senior Research Fellow at the UNSW Medicine South Western Sydney Clinical School at Liverpool Hospital, and a health data scientist at the Ingham Institute for Applied Medical Research. This post examines simulation of COVID-19 spread using R, and how such simulations can be used to understand the effects of various public health interventions designed to limit or slow its spread.

DISCLAIMER

The simulation results in this blog post, or any other results produced by the R code described in it, should not be used as actual estimates of mortality or any other aspect of the COVID-19 pandemic. The simulations are intended to explain principles and to permit exploration of the potential effects of various combinations and timings of interventions on spread. Furthermore, the code for these simulations was written hurriedly over just a few days; it has not yet been peer-reviewed and should be considered alpha quality, and the simulation parameterisation presented here has not been validated against real-world COVID-19 data.

Introduction

In a previous post, we looked at the use of some R packages developed by the REpidemics Consortium (RECON) to undertake epidemiological analyses of COVID-19 incidence data scraped from various web sources.

Undertaking such value-adding analyses of COVID-19 incidence data, as the full horror of the pandemic unfolds, is a worthwhile endeavour. But it would also be useful to be able to gain a better understanding of the likely effects of various public health interventions on COVID-19 spread.

“Flattening-the-curve” infographics such as the one shown below are now everywhere. They are a useful and succinct way of communicating a key public health message – that social distancing and related measures will help take the strain off our health care systems in the coming months.

However, as pointed out by several commentators, many of these infographics miss a crucial point: that public health measures can do more than just flatten the curve, they can also shrink it, thus reducing the total number of cases (and thus serious cases) of COVID-19 in a population, rather than just spread the same number of cases over a longer period such that the area under the curve remains the same.

This crucial point was beautifully illustrated using R in a post by Michael Höhle, which is highly recommended reading. Michael used a dynamic model of disease transmission, which is based on solving a system of ordinary differential equations (ODEs) with the tools found in base R.

Such mathematical approaches to disease outbreak simulation are elegant, and efficient to compute, but they can become unwieldy as the complexity of the model grows. An alternative is to use a more computational approach. In this post, we will briefly look at the individual contact model (ICM) simulation capability implemented in the excellent EpiModel package by Samuel Jenness and colleagues at the Emory University Rollins School of Public Health, and some extensions to it. Note also that this post is based on two more detailed posts that provide more technical details and access to relevant source code.

The EpiModel package

The EpiModel package provides facilities to explore three types of disease transmission model (or simulations): dynamic contact models (DCMs) as used by Michael Höhle, stochastic individual contact models (ICMs) and stochastic network models. The last are particularly interesting, as they can accurately model disease transmission with shifting social contact networks – they were developed to model HIV transmission, but have been used to model transmission of other diseases, including ebola, and even the propagation of memes in social media networks. These network models potentially have application to COVID-19 modelling – they could be used to model shifting household, workplace or school and wider community interactions, and thus opportunity for transmission of the virus. However, the network models as currently implemented are not quite suitable for such purposes, modifying them is complex, and they are also very computationally intensive to run. For these reasons, we will focus on the simpler ICM simulation facilities provided by EpiModel.

Interested readers should consult the extensive tutorials and other documentation for EpiModel for a fuller treatment, but in a nutshell, an EpiModel ICM simulation starts with a hypothetical population of individuals who are permitted to be in one of several groups, or compartments, at any particular time. Out-of-the-box, EpiModel supports several types of models, including the popular SIR model which uses Susceptible, Infectious and Recovered compartments. At each time step of the simulation, individuals randomly encounter and are exposed to other individuals in the population. The intensity of this population mixing is controlled by an act rate parameter, with each “act” representing an opportunity for disease transmission, or at least those “acts” between susceptible individuals and infectious individuals. Recovered individuals are no longer infectious and are assumed to be immune from further re-infection, so we are not interested in their interactions with others, nor are we interested in interactions between pairs of susceptible individuals, only the interactions between susceptible and infectious individuals. However, not every such opportunity for disease transmission will result in actual disease transmission. The probability of transmission at each interaction is controlled by an infection probability parameter.

It is easy to see that decreasing the act.rate parameter is equivalent to increasing social distancing in the population, and that decreasing the inf.prob parameter equates to increased practice of hygiene measures such as hand washing, use of hand sanitisers, not touching one’s face, and mask wearing by the infectious. This was what I explored in some detail in my first personal blog post on simulating COVID-19 spread.
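For readers who have not used EpiModel before, here is a minimal stock SIR ICM, showing where these two parameters live. This is not the extended model described below, and the recovery rate is just an illustrative value:

library(EpiModel)

# 10,000 individuals, 3 initially infectious, 8 stochastic runs of 365 days
param   <- param.icm(inf.prob = 0.05, act.rate = 10, rec.rate = 1/20)
init    <- init.icm(s.num = 9997, i.num = 3, r.num = 0)
control <- control.icm(type = "SIR", nsteps = 365, nsims = 8)

sim <- icm(param, init, control)
plot(sim)   # compartment counts over time, averaged across runs

Lowering act.rate in such a model corresponds to more social distancing; lowering inf.prob corresponds to better hygiene.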

Extensions to EpiModel

However, the SIR model type is a bit too simplistic if we want to use the model to explore the potential effect of various public health measures on COVID-19 spread. Fortunately, EpiModel provides a plug-in architecture that allows more elaborate models to be implemented. The full details of my recent extensions to EpiModel can be found in my second personal blog post on COVID-19 simulation, but the gist of it is that several new compartment types were added, as shown in the table below, with support for transition between them as shown in the diagram below the table. The dashed lines indicate infection interactions.

Compartment   Functional definition
S             Susceptible individuals
E             Exposed and infected, not yet symptomatic but potentially infectious
I             Infected, symptomatic and infectious
Q             Infectious, but (self-)quarantined
H             Requiring hospitalisation (would normally be hospitalised if capacity available)
R             Recovered, immune from further infection
F             Case fatality (death due to COVID-19, not other causes)

{"x":{"diagram":"\ndigraph SEIQHRF {\n\n # a \"graph\" statement\n graph [overlap = false, fontsize = 10] #, rankdir = LR]\n\n # several \"node\" statements\n node [shape = box,\n fontname = Helvetica]\n S[label=\"S=Susceptible\"];\n E[label=\"E=Exposed and infected,\nasymptomatic,\npotentially infectious\"];\n I[label=\"I=Infected and infectious\"];\n Q[label=\"Q=(Self-)quarantined\n(infectious)\"];\n H[label=\"H=Requires\nhospitalisation\"];\n R[label=\"R=Recovered/immune\"];\n F[label=\"F=Case fatality\"]\n\n # several \"edge\" statements\n S->E\n I->S[style=\"dashed\"]\n E->I\n E->S[style=\"dashed\"]\n I->Q\n Q->S[style=\"dashed\"]\n I->R\n I->H\n H->F\n H->R\n Q->R\n Q->H\n}\n","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}

Another capability that has been added is the ability to specify time-variant parameters, as a vector of the same length as there are time steps in the simulation. This allows us to smoothly (or step-wise) introduce, and withdraw, various interventions at arbitrary times during the course of our simulation.

We won’t cover here the details of how to obtain these extensions, which at the time of writing should still be considered alpha quality code – please see the blog post for those. Let’s just proceed to running some simulations.

Baseline simulation

First we’ll run a baseline simulation for a hypothetical population of 10,000 people, in which there are just three infectious COVID-19 cases at the outset. We’ll run it for 365 days, and we’ll set a very low rate at which infectious individuals enter self-quarantine (thereby dramatically lowering their rate of interactions with others) after they become symptomatic (or have been tested and found positive), and thus aware of their infectivity. Because it is stochastic, the simulation is run eight times, using parallel processing if available, and the results averaged.

tic()
baseline_sim <- simulate(ncores = 4)
toc()
## 58.092 sec elapsed

Let’s visualise the results as a set of time-series of the daily count of our 10,000 individuals in each compartment.

OK, that looks very reasonable. Note that almost the entire population ends up being infected. However, the S and R compartments dominate the plot (which is good, because it means humanity will survive!), so let’s re-plot leaving out those compartments so we can see a bit more detail.

Notice that the I compartment curve lags behind the E compartment curve – the lag is the incubation period, and that the Q curve lags still further as infected people only reluctantly and belatedly quarantine themselves (in this baseline scenario).

Running intervention experiments

Now we are in a position to run an experiment, by altering some parameters of our baseline model.

Let’s model the effect of decreasing the infection probability at each exposure event by smoothly decreasing the inf.prob parameters for the I compartment. The infection probability at each exposure event (for the I compartment individuals) starts at 5%, and we’ll reduce it to 2% between days 15 and 30. This models the effect of symptomatic infected people adopting better hygiene practices such as wearing masks, coughing into their elbows, using hand sanitisers, not shaking hands and so on, perhaps in response to a concerted public health advertising campaign by the government.
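The post does not show the code for experiment 1, but based on the delayed-ramp version used in experiment 2 below, it presumably looks something like this sketch (using the simulate() wrapper from the extensions; not the exact original code):

infectious_hygiene_ramp <- function(t) {
  ifelse(t < 15, 0.05, ifelse(t <= 30, 0.05 - (t - 15) * (0.05 - 0.02)/15, 0.02))
}

infectious_hygiene_sim <- simulate(inf.prob.i = infectious_hygiene_ramp(1:366))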

Let’s examine the results of experiment 1, alongside the baseline for comparison:

We can see from the plots on the left that by encouraging hygiene measures in symptomatic infectious individuals, we have not only substantially “flattened the curve”, but we have actually shrunk it. The result, as shown in the plots on the right, is that demand for hospital beds is substantially reduced, and only briefly exceeds our defined hospital capacity of 40 beds. This results in a substantially reduced mortality rate, shown by the black line.

More experiments

We can now embark on a series of experiments, exploring various interventions singly, or in combination, and with different timings.

Experiment 2

Let’s repeat experiment 1, but let’s delay the start of the hygiene campaign until day 30 and make it less intense so it takes until day 60 to achieve the desired increase in hygiene in the symptomatic infected.

infectious_hygiene_delayed_ramp <- function(t) {
  ifelse(t < 30, 0.05, ifelse(t <= 60, 0.05 - (t - 30) * (0.05 - 0.02)/30, 0.02))
}

infectious_hygiene_delayed_ramp_sim <- simulate(inf.prob.i = infectious_hygiene_delayed_ramp(1:366))

Experiment 3

Let’s repeat experiment 1, except this time instead of promoting hygiene measures in the symptomatic infected, we’ll promote, starting at day 15, prompt self-quarantine by anyone who is infected as soon as they become symptomatic. By “prompt”, we mean most such people will self-quarantine themselves immediately, but with an exponentially declining tail of such people taking longer to enter quarantine, with a few never complying. Those in self-quarantine won’t or can’t achieve complete social isolation, so we have set the act.rate parameter for the quarantined compartment to a quarter of that for the other compartments to simulate such a reduction in social mixing (an increase in social distancing) in that group.

quarantine_ramp <- function(t) {
  ifelse(t < 15, 0.0333, ifelse(t <= 30, 0.0333 + (t - 15) * (0.3333 - 0.0333)/15, 0.333))
}

quarantine_ramp_sim <- simulate(quar.rate = quarantine_ramp(1:366))

Experiment 4

Let’s add a moderate increase in social distancing for everyone (halving the act.rate), again ramping it down between days 15 and 30.

social_distance_ramp <- function(t) {
  ifelse(t < 15, 10, ifelse(t <= 30, 10 - (t - 15) * (10 - 5)/15, 5))
}

soc_dist_ramp_sim <- simulate(act.rate.i = social_distance_ramp(1:366),
                              act.rate.e = social_distance_ramp(1:366))

Experiment 5

Let’s combine experiments 3 and 4: we’ll add a moderate increase in social distancing for everyone, as well as prompt self-quarantining in the symptomatic.

quar_soc_dist_ramp_sim <- simulate(quar.rate = quarantine_ramp(1:366),
                                   act.rate.i = social_distance_ramp(1:366),
                                   act.rate.e = social_distance_ramp(1:366))

Now let’s examine the results.

Discussion

The results of our experiments almost speak for themselves, but a few things are worth highlighting:

  • Implementing interventions too late is almost worthless. Act early and decisively. You can always wind back the intervention later, whereas a failure to act early enough can never be recovered from.

  • Prompt self-quarantining of symptomatic cases is effective. In practice that means everyone with COVID-19-like symptoms, whether they actually have COVID-19 or something else, should immediately self-quarantine. Don’t wait to be tested.

  • A moderate increase in social distancing (decrease in social mixing) in everyone is also effective, mainly because it reduces exposure opportunities with both the asymptomatic-but-infected and the symptomatic infected.

  • Combining measures is even more effective, as can be seen in experiment 5. In fact, there are theoretical reasons to believe that the effect of combined measures is partially multiplicative, not just additive.

  • Public health interventions don’t just flatten the curve, they shrink it, and the result is very substantially reduced mortality due to COVID-19.

None of these insights are novel, but it is nice to be able to independently confirm the recommendations of various expert groups, such as the WHO Collaborating Centre for Infectious Disease Modelling at Imperial College London (ICL), who have recently released a report on the impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand, which recommends similar strategies to those we have just discovered from our modest simulations in R.

Two more experiments

What happens if we dramatically increase social distancing through a two week lock-down, which is then relaxed? We’ll use a step function to model this. We test such a lock-down lasting from day 15 to 30, and separately a lock-down from day 30 to day 45 instead. We’ll model the lock-down by reducing the act.rate parameters for all compartments from 10 to 2.5.

twoweek_lockdown_day15_vector <- c(rep(10, 15), rep(2.5, 15), rep(10, 336))
twoweek_lockdown_day30_vector <- c(rep(10, 30), rep(2.5, 15), rep(10, 321))

twoweek_lockdown_day15_sim <- simulate(act.rate.i = twoweek_lockdown_day15_vector,
                                       act.rate.e = twoweek_lockdown_day15_vector)
twoweek_lockdown_day30_sim <- simulate(act.rate.i = twoweek_lockdown_day30_vector,
                                       act.rate.e = twoweek_lockdown_day30_vector)

Wow, that’s a bit surprising! The two week lock-down starting at day 15 isn’t effective at all – it just stops the spread in its tracks for two weeks, and then it just resumes again. But a two-week lock-down starting at day 30 is somewhat more effective, presumably because there are more infected people being taken out of circulation from day 30 onwards. But the epidemic still partially bounces back after the two weeks are over.

What this tells us is that single lock-downs for only two weeks aren’t effective. What about a lock-down for a whole month, instead, combined with prompt quarantine with even more effective isolation and hygiene measures for those quarantined?

fourweek_lockdown_day15_vector <- c(rep(10, 15), rep(2.5, 30), rep(7.5, 321))
fourweek_lockdown_day30_vector <- c(rep(10, 30), rep(2.5, 30), rep(7.5, 306))

fourweek_lockdown_day15_sim <- simulate(act.rate.i = fourweek_lockdown_day15_vector,
                                        act.rate.e = fourweek_lockdown_day15_vector,
                                        quar.rate = quarantine_ramp(1:366),
                                        inf.prob.q = 0.01)
fourweek_lockdown_day30_sim <- simulate(act.rate.i = fourweek_lockdown_day30_vector,
                                        act.rate.e = fourweek_lockdown_day30_vector,
                                        quar.rate = quarantine_ramp(1:366),
                                        inf.prob.q = 0.01)

Well, that’s satisfying! By acting early, and decisively, we’ve managed to stop COVID-19 dead in its tracks in experiment 8, and in doing so have saved many lives – at least, we have in our little simulated world. But even experiment 9 provides a much better outcome, indicating that decisive action, even if somewhat belated, is much better than none.

Of course, the real world is far more complex and messier, and COVID-19 may not behave in exactly the same way in real-life, but at least we can see the principles of public health interventions in action in our simulation, and perhaps better understand or want to question what is being done, or not done, or being done too late, to contain the spread of the virus in the real world.

Conclusion

Although there is still a lot of work yet to be done on the extensions to EpiModel demonstrated here, it seems that they offer promise as a tool for understanding real-world action and inaction on COVID-19, and prompting legitimate questions about such actions or lack thereof.

One would hope that governments are using far more sophisticated simulation models than the one we have described here, which was built over the course of just a few days, to plan or inform their responses to COVID-19. If not, they ought to be.



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Shiny: Performance tuning with future & promises – Part 1


[This article was first published on R-Bloggers – eoda GmbH, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In our previous article about Shiny we shared our experiences with load testing and horizontal scaling of apps. We showed the design of a process from a proof of concept to a company-wide application.

The second part of the blog series focuses on the R packages future & promises, which are used for optimizations within the app. They can drastically reduce potential waiting times for the user.

In order to cover this topic in as much detail as possible, the first part of the article refers to the theory and operation of Shiny and future/promises. It also explains the asynchronous programming techniques on which the package functions are based. In the second part, a practical example is used to show how the ideas for optimization can be implemented.

Shiny workflow – connections and R processes

The following figure shows the procedure by which user access to a Shiny app is carried out:

For each user access to a Shiny app, the server decides if the user is either added to an existing R process or a new R process is started to which the user will be connected. How exactly the server handles this decision can be controlled separately for each application via certain tuning parameters.

R is single-threaded, i.e. all commands within a process are executed sequentially (not in parallel). For this reason, the process highlighted in orange may cause delays in the execution of the app. However, this depends on how the other users connected to that process are using the application. The following graphic illustrates this usage.

Since user 2 is connected to the same R process as user 1, they have to wait until the task started by user 1 is finished before they can use the app. Since it is usually impractical (or impossible) to assign each user a separate R session for the Shiny app, this leads to the above-mentioned possibilities, which are realized with future/promises and will be discussed in more detail below.

future & promises – Asynchronous programming in R

Anyone who has experience with other programming languages will probably have encountered the term "asynchronous programming" in one form or another. The idea behind it is simple: a sublist of complex tasks is outsourced from a process's task list in order to keep the initial process reactive. In our example, one of these "process blocking tasks" would be the machine learning/database task started by user 1. The R packages future & promises implement this programming paradigm for R, which natively only allows sequential ("synchronous") programming. The workflow is divided into different classes (so-called "plans"):

The reference ("sequential") plan reflects the "normal" way in which R works. Processes execute tasks one after the other, i.e. later tasks must wait for their predecessors to complete.

Tasks 1 & 2 from the asynchronous plans shown above are each outsourced to secondary/sub-processes so that the main process remains free and task 3 can be processed. The difference is that "multisession" starts two new R processes on which the tasks are executed, while "multicore" forks the main process into two sub-processes, which is only possible under Linux. Further plans allow outsourcing to distributed systems, for example.
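A rough sketch of these plans using the future package alone (the Sys.sleep() call is just a stand-in for a long-running task):

library(future)
plan(multisession)   # launch background R processes for futures
# plan(multicore)    # fork the main process instead (not available on Windows)

f <- future({
  Sys.sleep(5)       # placeholder for an expensive computation
  42
})
# the main R process stays free to do other work here ...
value(f)             # collect the result once it is needed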

Optimization: Inter-session vs. Intra-session

In order to better understand the steps in the following practical article, two types of optimization are presented, into which the term in-app performance tuning can be divided. These have a significant influence on the optimization process and are therefore essential for a common understanding.

As shown in the graphic below, performance tuning is divided into inter-session and intra-session optimization. If an app is optimized with regard to intra-session performance, the system tries to reduce the waiting time for the user running the current application.

In the Shiny context, the packages future & promises are designed for inter-session optimization. This focuses on keeping the app reactive for all other users accessing the app at the same time. The user who starts the task will still have to wait for it to complete before they can continue using the app.
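In a Shiny server, this inter-session pattern looks roughly like the following sketch (the output name and the sleeping placeholder are hypothetical; real examples follow in the second part of this series):

library(shiny)
library(promises)
library(future)
plan(multisession)

server <- function(input, output, session) {
  output$result <- renderText({
    future({
      Sys.sleep(10)        # placeholder for a machine learning or database task
      "task finished"
    }) %...>%              # promise pipe: runs once the future resolves
      paste("Status:", .)
  })
}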

Conclusion & Outlook

The theory behind the way Shiny works and the asynchronous programming paradigms is an important step towards understanding how future/promises work. Furthermore, deeper insight into the architecture of the system sharpens the view of where optimization is necessary and where it can be applied. In the second part of this article we will see how this information can be put into practice using an intuitive syntax and how it can influence the development of an app.

We are the experts for developing Shiny applications and building productive IT infrastructures in the data science context. Do you have questions on these topics? Then please get in touch; we will be happy to help.


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – eoda GmbH.


RProtoBuf 0.4.16: Now with JSON


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new release 0.4.16 of RProtoBuf is now on CRAN. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol.

This release contains a PR contributed by Siddhartha Bagaria which adds JSON support for messages, which had been an open wishlist item. I also appeased a clang deprecation warning that had come up on one of the CRAN test machines.
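As a quick reminder of the message API (using the addressbook.proto demo that RProtoBuf loads on start-up): the name of the JSON accessor below is an assumption on my part, so check the package documentation for the exact new interface.

library(RProtoBuf)
p <- new(tutorial.Person, id = 1, name = "Ada")  # "Ada" is just an example value
serialize(p, NULL)   # classic binary wire format
# p$toJSON()         # assumed name for the new JSON printing support in 0.4.16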

Changes in RProtoBuf version 0.4.16 (2020-03-19)

  • Added support for parsing and printing JSON (Siddhartha Bagaria in #68 closing wishlist #61).

  • Switched ByteSize() to ByteSizeLong() to appease clang (Dirk).

CRANberries provides the usual diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, and the pre-print of our JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


On model specification, identification, degrees of freedom and regularization


[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I had a lot of fun this week, revisiting this blog post (Monte Carlo simulation of a 2-factor interest rates model with ESGtoolkit) I wrote a few years ago in 2014 – that somehow generated a heatwave. This 2020 post is about model specification, identification, degrees of freedom and regularization. The first part is on Monte Carlo simulation for financial pricing, and the second part on optimization in deep learning neural networks. I won’t draw a lot of conclusions here, but will let you draw your own. Of course, feel free to reach out if something seems/sounds wrong to you. That’s still the best way to deal with issues.

Simulation of a G2++ short rates model

Let’s start by loading ESGtoolkit for the first part of this post:

# In R console
suppressPackageStartupMessages(library(ESGtoolkit))

G2++ Model input parameters:

# Observed maturities
u <- 1:30

# Yield to maturities
txZC <- c(0.01422, 0.01309, 0.01380, 0.01549, 0.01747, 0.01940,
          0.02104, 0.02236, 0.02348, 0.02446, 0.02535, 0.02614,
          0.02679, 0.02727, 0.02760, 0.02779, 0.02787, 0.02786,
          0.02776, 0.02762, 0.02745, 0.02727, 0.02707, 0.02686,
          0.02663, 0.02640, 0.02618, 0.02597, 0.02578, 0.02563)

# Zero-coupon prices = 'Observed' market prices
p <- c(0.9859794, 0.9744879, 0.9602458, 0.9416551, 0.9196671,
       0.8957363, 0.8716268, 0.8482628, 0.8255457, 0.8034710,
       0.7819525, 0.7612204, 0.7416912, 0.7237042, 0.7072136,
       0.6922140, 0.6785227, 0.6660095, 0.6546902, 0.6441639,
       0.6343366, 0.6250234, 0.6162910, 0.6080358, 0.6003302,
       0.5929791, 0.5858711, 0.5789852, 0.5722068, 0.5653231)

G2++ simulation function (HCSPL stands for Hermite Cubic Spline interpolation of the Yield Curve):

# Function of the number of scenarios
simG2plus <- function(n, methodyc = "HCSPL", seed = 13435,
                      b_opt = NULL, rho_opt = NULL, eta_opt = NULL,
                      randomize_params = FALSE){

    set.seed(seed)

    # Horizon, number of simulations, frequency
    horizon <- 20
    freq <- "semi-annual"
    delta_t <- 1/2

    # Parameters found for the G2++
    a_opt <- 0.50000000 + ifelse(randomize_params, 0.5*runif(1), 0)
    if(is.null(b_opt))
      b_opt <- 0.35412030 + ifelse(randomize_params, 0.5*runif(1), 0)
    sigma_opt <- 0.09416266
    if(is.null(rho_opt))
      rho_opt <- -0.99855687
    if(is.null(eta_opt))
      eta_opt <- 0.08439934

    print(paste("a:", a_opt))
    print(paste("b:", b_opt))
    print(paste("sigma:", sigma_opt))
    print(paste("rho:", rho_opt))
    print(paste("eta:", eta_opt))

    # Simulation of gaussian correlated shocks
    eps <- ESGtoolkit::simshocks(n = n, horizon = horizon,
                                 frequency = "semi-annual",
                                 family = 1, par = rho_opt)

    # Simulation of the factor x
    x <- ESGtoolkit::simdiff(n = n, horizon = horizon,
                             frequency = freq,
                             model = "OU",
                             x0 = 0, theta1 = 0, theta2 = a_opt, theta3 = sigma_opt,
                             eps = eps[[1]])

    # Simulation of the factor y
    y <- ESGtoolkit::simdiff(n = n, horizon = horizon,
                             frequency = freq,
                             model = "OU",
                             x0 = 0, theta1 = 0, theta2 = b_opt, theta3 = eta_opt,
                             eps = eps[[2]])

    # Instantaneous forward rates, with spline interpolation
    methodyc <- match.arg(methodyc)
    fwdrates <- ESGtoolkit::esgfwdrates(n = n, horizon = horizon,
                                        out.frequency = freq, in.maturities = u,
                                        in.zerorates = txZC, method = methodyc)
    fwdrates <- window(fwdrates, end = horizon)

    # phi
    t.out <- seq(from = 0, to = horizon,
                 by = delta_t)
    param.phi <- 0.5*(sigma_opt^2)*(1 - exp(-a_opt*t.out))^2/(a_opt^2) +
      0.5*(eta_opt^2)*(1 - exp(-b_opt*t.out))^2/(b_opt^2) +
      (rho_opt*sigma_opt*eta_opt)*(1 - exp(-a_opt*t.out))*
      (1 - exp(-b_opt*t.out))/(a_opt*b_opt)
    param.phi <- ts(replicate(n, param.phi),
                    start = start(x), deltat = deltat(x))
    phi <- fwdrates + param.phi
    colnames(phi) <- c(paste0("Series ", 1:n))

    # The short rates
    r <- x + y + phi
    colnames(r) <- c(paste0("Series ", 1:n))

    return(r)
}

Simulations of G2++ for 4 types of parameters’ sets:

r.HCSPL <- simG2plus(n = 10000, methodyc = "HCSPL", seed = 123)
r.HCSPL2 <- simG2plus(n = 10000, methodyc = "HCSPL", seed = 2020)
r.HCSPL3 <- simG2plus(n = 10000, methodyc = "HCSPL", seed = 123,
                      randomize_params = TRUE)
r.HCSPL4 <- simG2plus(n = 10000, methodyc = "HCSPL", seed = 123,
                      b_opt = 1, rho_opt = 0, eta_opt = 0,
                      randomize_params = FALSE)

Stochastic discount factors derived from short rates simulations:

deltat_r <- deltat(r.HCSPL)

Dt.HCSPL <- ESGtoolkit::esgdiscountfactor(r = r.HCSPL, X = 1)
Dt.HCSPL <- window(Dt.HCSPL, start = deltat_r, deltat = 2*deltat_r)

Dt.HCSPL2 <- ESGtoolkit::esgdiscountfactor(r = r.HCSPL2, X = 1)
Dt.HCSPL2 <- window(Dt.HCSPL2, start = deltat_r, deltat = 2*deltat_r)

Dt.HCSPL3 <- ESGtoolkit::esgdiscountfactor(r = r.HCSPL3, X = 1)
Dt.HCSPL3 <- window(Dt.HCSPL3, start = deltat_r, deltat = 2*deltat_r)

Dt.HCSPL4 <- ESGtoolkit::esgdiscountfactor(r = r.HCSPL4, X = 1)
Dt.HCSPL4 <- window(Dt.HCSPL4, start = deltat_r, deltat = 2*deltat_r)

Prices (observed vs Monte Carlo for previous 4 examples):

# Observed market prices
horizon <- 20
marketprices <- p[1:horizon]

# Monte Carlo prices
## Example 1
montecarloprices.HCSPL <- rowMeans(Dt.HCSPL)
## Example 2
montecarloprices.HCSPL2 <- rowMeans(Dt.HCSPL2)
## Example 3
montecarloprices.HCSPL3 <- rowMeans(Dt.HCSPL3)
## Example 4
montecarloprices.HCSPL4 <- rowMeans(Dt.HCSPL4)

Plots of observed prices vs Monte Carlo prices:

par(mfrow = c(4, 2))

ESGtoolkit::esgplotbands(r.HCSPL, xlab = 'time', ylab = 'short rate',
                         main = "short rate simulations \n for example 1")
plot(marketprices, col = "blue", type = 'l',
     xlab = "time", ylab = "prices", main = "Prices for example 1 \n (observed vs Monte Carlo)")
points(montecarloprices.HCSPL, col = "red")

ESGtoolkit::esgplotbands(r.HCSPL2, xlab = 'time', ylab = 'short rate',
                         main = "short rate simulations \n for example 2")
plot(marketprices, col = "blue", type = 'l',
     xlab = "time", ylab = "prices", main = "Prices for example 2 \n (observed vs Monte Carlo)")
points(montecarloprices.HCSPL2, col = "red")

ESGtoolkit::esgplotbands(r.HCSPL3, xlab = 'time', ylab = 'short rate',
                         main = "short rate simulations \n for example 3")
plot(marketprices, col = "blue", type = 'l',
     xlab = "time", ylab = "prices", main = "Prices for example 3 \n (observed vs Monte Carlo)")
points(montecarloprices.HCSPL3, col = "red")

ESGtoolkit::esgplotbands(r.HCSPL4, xlab = 'time', ylab = 'short rate',
                         main = "short rate simulations \n for example 4")
plot(marketprices, col = "blue", type = 'l',
     xlab = "time", ylab = "prices", main = "Prices for example 4 \n (observed vs Monte Carlo)")
points(montecarloprices.HCSPL4, col = "red")

[Figure: short rate simulation bands and observed vs Monte Carlo prices for examples 1 to 4]

What do we observe on these graphs, both on simulations and prices? What will happen if we add a third factor to this model, meaning, three more parameters; a G3++/any other hydra?

Optimization in Deep learning neural networks

On a different type of question/problem, but still on the subject of model specification, identification, degrees of freedom and regularization: Deep learning neural networks. Some people suggest that if you keep adding parameters (degrees of freedom?) to these models, you’ll still obtain a good generalization. Well, there’s this picture that I like a lot:

[Figure]

When we optimize the loss function in deep learning neural network models, we are most likely using gradient descent, which is fast and scalable. Still, no matter how sophisticated the gradient descent procedure we’re using, we will likely get stuck in a local minimum – because the loss function is rarely convex.
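A toy illustration in plain R, on a one-dimensional, non-convex "loss" (all values here are arbitrary):

loss <- function(x) sin(3 * x) + 0.1 * x^2     # has several local minima
grad <- function(x) 3 * cos(3 * x) + 0.2 * x   # its derivative

x <- 2       # starting point
lr <- 0.05   # learning rate
for (epoch in 1:200) x <- x - lr * grad(x)
x            # which minimum we land in depends on the start and the learning rate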


"Stuck" is a rather unfortunate term here, because it’s not an actual problem, but instead an indirect way to avoid overtraining. Also, in our gradient descent procedure, we tune the number of epochs (number of iterations in the descent/ascent), the learning rate (how fast we roll in the descent/ascent), in addition to the dropout (randomly dropping out some nodes in the network’s layers), etc. These are also ways to avoid learning too much, to stop the optimization relatively early, and preserve the model’s ability to generalize. They regularize the model, whereas the millions of network nodes serve as degrees of freedom. This is a different problem than the first one we examined, with different objectives, but… still on the subject of model specification, identification, degrees of freedom and regularization.
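As a sketch of where these knobs sit in practice, assuming the keras R package and hypothetical x_train/y_train matrices (layer sizes, rates and epochs are arbitrary):

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.3) %>%   # dropout: randomly drop nodes during training
  layer_dense(units = 1)

# the learning rate is a property of the optimizer object
model %>% compile(optimizer = "adam", loss = "mse")

history <- model %>% fit(
  x_train, y_train,
  epochs = 100,                   # number of passes over the training data
  validation_split = 0.2,
  callbacks = list(callback_early_stopping(patience = 5))  # stop early to preserve generalization
)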

For those who are working from home because of the COVID-19, I’d recommend this book about work-life balance, that I literally devoured a few months ago: REMOTE: Office Not Required (and nope, I’m not paid to promote this book).

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!


To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.


Domestic data science – energy use


[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I wrote previously about my home electricity use. We’ve an open home energy monitor logging our import of electricity. Our house isn’t typical for the UK, we don’t have a gas supply and our space and water heating is via an air source heat pump (which runs on electricity).

In my previous blog post I mentioned we were considering having Solar PV installed. We’ve recently done this, but this blog isn’t about solar. We’ve also been thinking of switching energy suppliers. Our current supplier, Ecotricity, charges a flat rate for kWh used, with an additional standing charge. However the wholesale price of electricity is not static for energy suppliers, it varies through the day based on the amount of energy available and how much consumers need/want to use (supply – demand). Typically there is a spike in energy costs at tea time when most folk get home from work and put on the kettle for a brew and cook their tea. Octopus is another green energy supplier, but their agile tariff varies their price of electricity through the day and even between days.

If you’d like to switch to Octopus you can use this referral code to give us both £50 credit.

This blog post compares how much our electricity cost from Ecotricity with how much it might cost from Octopus. Octopus publish their prices for the previous year; I’ve two years of their data, which have a 30 minute time step. I’ve a little more than 12 months of my own consumption data at a 10 second time step. Ecotricity charge a flat rate, but it did change in Sept; I’ve used their earlier (lower) rate for my comparison.

For data prep, I’ve removed the year from the Octopus data to compare my single year of observations to multiple years. The time period for which we have complete overlaps is from 2019-01-31 to 2019-12-20.

Code for the analysis is at the bottom of the post.

First, how do prices vary through the day? We can see that for most of the day Octopus is much cheaper, it’s only the tea time spike for which they charge more. So in order for us to save money, the bulk of our use needs to avoid the tea time spike.

price

How does our use vary through the day? I’ve adjusted our 10 second data to 30 minute windows to help with comparison. We heat our water at midday and also a top up at about 9 pm, hence those spikes. Our heating comes on at around 4 or 5 in the morning, hence the heavy use then. Our tea time use is in line with these other heavy periods.

use
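For reference, the 10-second-to-30-minute aggregation can be sketched as below. The exact kWh conversion used in the post is not shown, so the assumption here is that each reading covers 10 seconds at the recorded wattage; use_30min is a hypothetical name to avoid clashing with the post's own use_30 object.

library(dplyr)
library(lubridate)

# each 10-second reading of W watts contributes W * 10 / 3600 / 1000 kWh
use_30min <- use %>%
  mutate(slot = floor_date(datetime, "30 minutes")) %>%
  group_by(slot) %>%
  summarise(kWh = sum(Watts * 10 / 3600 / 1000))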

What if we multiply my energy use by the cost for each hour? As we’d expect, Octopus is cheaper for the majority of hours (points below the 1:1 line), but some time periods are more expensive.

hourly_cost

We can aggregate these costs to a daily time step, and include the standing charge. The Octopus standing charge is ~ 10 p a day less than Ecotricity, so price gaps start to become quite big. At their 2019 prices, Octopus is cheaper on every day of the year. This is especially so in winter when we use a lot of electricity during off-peak periods (for space heating).

daily_cost

What’s the total difference over the year (2019-01-31 to 2019-12-20)? The following figures are without us altering our electricity consumption patterns. A 30% reduction in our annual bill is huge. Unsurprisingly I’ve requested a switch of supplier.

  • Ecotricity costs £ 950 for the time period.
  • Octopus 2018 costs £ 779 for the time period.
  • Octopus 2019 costs £ 628 for the time period.

Finally, if we can eliminate the 4pm to 7pm peak, how much could we save? If we can cut electricity use/import during this expensive period we should be able to reduce our bills even further. 50-60p a day doesn’t seem a lot, but over a year it’ll add up. Over the time period from 2019-01-31 to 2019-12-20 our 4pm to 7pm Octopus cost would have been £232 at 2018 prices and £198 at 2019 prices.

peak

Analysis and graphics code below. I should probably have manipulated my data to let me plot the hourly and daily scatter plots with facets instead of patchwork. For some reason WordPress is not displaying pipes (%>%) correctly; I’ve contacted wordpress.com to see how to fix it.

# Packages
library(tidyverse)
library(readxl)
library(lubridate)
library(patchwork)

# data
tarrif_2018 = read_excel("agile_rates.2018-12-20.xlsx", sheet = "South Scotland") %>%
   select(date_2018 = date, from_2018 = from, rate_2018 = unit_rate_excl_vat) %>%
   mutate(date_2018 = date_2018 + years(1),
          date_2018 = str_sub(date_2018, 1, 10))

tarrif_2019 = read_excel("agile_rates_2019.xlsx", sheet = "South Scotland") %>%
   select(date_2019 = date, from_2019 = from, rate_2019 = unit_rate_excl_vat) %>%
   mutate(date_2019 = str_sub(date_2019, 1, 10))

f = list.files(".", pattern = "elec*")
use = lapply(f, function(i){
   read_csv(i, col_names = F) %>%
      mutate(datetime = dmy_hms(X1)) %>%
      select(datetime, Watts = X2)
})
use = do.call("rbind.data.frame", use)
use = use[!duplicated(use$datetime), ]

# Hourly price
tarrif_2018 %>%
   mutate(year = 2018) %>%
   select(year, from = from_2018, rate = rate_2018) %>%
   bind_rows(tarrif_2019 %>%
                mutate(year = 2019) %>%
                select(year, from = from_2019, rate = rate_2019)) %>%
   group_by(year, from) %>%
   summarise(rate_med = median(rate),
             rate_25 = quantile(rate, .25),
             rate_75 = quantile(rate, .75)) %>%
   mutate(from = str_sub(from, 1, 5)) %>%
   ggplot(aes(from, rate_med, colour = as.factor(year))) +
   geom_pointrange(aes(ymin = rate_25, ymax = rate_75)) +
   geom_hline(yintercept = 17.74, size = 1.2) +
   annotate("text", x = 5, y = 18.5, angle = 270,
            label = "Ecotricity 2019") +
   coord_flip() +
   scale_color_brewer(type = "qual", palette = "Dark2") +
   labs(title = "How does price vary through the day?",
        subtitle = "Octopus median and interquartile range",
        x = "Time of day",
        y = "Price (p/kWh)",
        colour = "Octopus") +
   theme_minimal() +
   theme(text = element_text(size = 15),
         plot.margin = margin(5, 10, 2, 2, "pt"))

# Chunk use to 30 min intervals
use_30 = use %>%
   mutate(mi = minute(datetime),
          mi = if_else(mi %
   group_by(date, hr, mi) %>%
   summarise(kWh = sum(kWh)) %>%
   mutate(from = paste(hr, mi, "00", sep = ":")) %>%
   ungroup()

# Hourly use
use_30 %>%
   group_by(from) %>%
   summarise(kWh_med = median(kWh),
             kWh_25 = quantile(kWh, .25),
             kWh_75 = quantile(kWh, .75)) %>%
   mutate(from = str_sub(from, 1, 5)) %>%
   ggplot(aes(from, kWh_med)) +
   geom_pointrange(aes(ymin = kWh_25, ymax = kWh_75)) +
   annotate("rect",
            xmin = "16:00", xmax = "19:00",
            ymin = 0, ymax = Inf, alpha = 0.2, fill = "red") +
   coord_flip() +
   labs(title = "How does our use vary through the day?",
        subtitle = "Highlighted area is expensive Octopus time",
        x = "Time of day",
        y = "Use (kWh)") +
   theme_minimal() +
   theme(text = element_text(size = 15),
         plot.margin = margin(5, 10, 2, 2, "pt"))

# Join price and use
cost = use_30 %>%
   left_join(tarrif_2018, by = c(date = "date_2018",
                                 from = "from_2018")) %>%
   left_join(tarrif_2019, by = c(date = "date_2019",
                                 from = "from_2019")) %>%
   filter(!is.na(rate_2018) &
             !is.na(rate_2019)) %>%
   mutate(ecotricity = kWh * .1774,
          octopus_2018 = kWh * rate_2018 / 100,
          octopus_2019 = kWh * rate_2019 / 100)

# Hourly cost
x = cost %>%
   select(from, ecotricity, octopus_2018, octopus_2019) %>%
   gather(supplier, cost, -from) %>%
   group_by(from, supplier) %>%
   summarise(cost_med = median(cost)) %>%
   ungroup() %>%
   mutate(from = str_sub(from, 1, 5)) %>%
   spread(supplier, cost_med) %>%
   mutate(time = seq(0, 23.5, by = 0.5))

ggplot(x, aes(x = ecotricity, y = octopus_2018, colour = time)) +
   geom_point() +
   geom_abline(slope = 1) +
   scale_colour_viridis_c() +
   labs(title = "Hourly cost between suppliers",
        subtitle = "Octopus 2018",
        x = "Ecotricity (£)",
        y = "Octopus (£)",
        colour = "Hour") +
   theme_minimal() +
   theme(text = element_text(size = 15),
         plot.margin = margin(5, 10, 2, 2, "pt"),
         legend.position = "none") +
   ggplot(x, aes(x = ecotricity, y = octopus_2019, colour = time)) +
   geom_point() +
   geom_abline(slope = 1) +
   scale_colour_viridis_c() +
   labs(subtitle = "Octopus 2019",
        x = "Ecotricity (£)",
        y = "Octopus (£)",
        colour = "Hour") +
   theme_minimal() +
   theme(text = element_text(size = 15),
         plot.margin = margin(5, 10, 2, 2, "pt"))

# Daily cost
cost_daily = cost %>%
   select(date, ecotricity, octopus_2018, octopus_2019) %>%
   group_by(date) %>%
   summarise(ecotricity = sum(ecotricity),
             octopus_2018 = sum(octopus_2018),
             octopus_2019 = sum(octopus_2019)) %>%
   mutate(ecotricity = ecotricity + .2959,
          octopus_2018 = octopus_2018 + .21,
          octopus_2019 = octopus_2019 + .21,
          jul = yday(as.Date(date)))

ggplot(cost_daily, aes(ecotricity, octopus_2018)) +
   geom_point(aes(colour = jul), alpha = 0.8) +
   scale_colour_viridis_c() +
   geom_abline(slope = 1) +
   labs(title = "Daily electricity cost",
        subtitle = "Octopus 2018",
        x = "Ecotricity (£)",
        y = "Octopus (£)",
        colour = "Day of year") +
   theme_minimal() +
   theme(text = element_text(size = 15),
         legend.position = "none") +
   ggplot(cost_daily, aes(ecotricity, octopus_2019)) +
   geom_point(aes(colour = jul), alpha = 0.8) +
   scale_colour_viridis_c() +
   geom_abline(slope = 1) +
   labs(subtitle = "Octopus 2019",
        x = "Ecotricity (£)",
        y = "Octopus (£)",
        colour = "Day of year") +
   theme_minimal() +
   theme(text = element_text(size = 15))

# Peak use
x = cost %>%
   select(date, from, octopus_2018, octopus_2019) %>%
   filter(from %in% c("16:00:00",
                      "16:30:00",
                      "17:00:00",
                      "17:30:00",
                      "18:00:00",
                      "18:30:00",
                      "19:00:00")) %>%
   group_by(date) %>%
   summarise(octopus_2018 = sum(octopus_2018),
             octopus_2019 = sum(octopus_2019))

x %>%
   gather(year, cost, -date) %>%
   ggplot(aes(year, cost)) +
   geom_boxplot(fill = "yellow") +
   coord_flip() +
   labs(title = "How much will we spend during peak price?",
        subtitle = "Daily spread of cost between 4pm and 7pm",
        x = "",
        y = "Cost (£/day)") +
   theme_minimal() +
   theme(text = element_text(size = 15))
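The facet alternative mentioned at the top needs only the supplier columns gathered back to long format. This is a sketch rather than code from the original analysis; it reuses the x data frame built in the hourly cost step above:

# Facet version of the hourly cost comparison (sketch only)
x %>%
   gather(tariff, octopus, octopus_2018, octopus_2019) %>%
   ggplot(aes(x = ecotricity, y = octopus, colour = time)) +
   geom_point() +
   geom_abline(slope = 1) +
   scale_colour_viridis_c() +
   facet_wrap(~ tariff) +
   labs(title = "Hourly cost between suppliers",
        x = "Ecotricity (£)",
        y = "Octopus (£)",
        colour = "Hour") +
   theme_minimal()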

To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.


Google Big Query with R


[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lightning fast database querying with the R API.

Source

What is Google Big Query?

Big Query is a highly performant cloud data storage service which started in 2011. You can manage it inside the Google Cloud Console and query the storage with standard SQL commands from the bq console or via the API. It is easy to set up, auto scales and there are a variety of established connectors to Google and other services. In this article I will show you the advantages of working with Big Query and how to use the API from R and build queries with dplyr functions.

What are the advantages of Big Query?

Google Big Query has a few advantages in comparison to other cloud data storage services. Individual technologies may be comparable or even better, but the combination of all the advantages, and especially the Google integration, is what makes Big Query really stand out. The advantages are:

  • Lightning-fast query speeds: BQ had similar benchmarking results to other modern database technologies. BQ was also compared to other data warehouse solutions with similar features, such as Amazon Redshift, Snowflake, Microsoft Azure and Presto, and all showed more or less similar performance and pricing.
Big Query shows exceptional performance for queries without GROUP BY statements (Q1) and significantly worse performance with GROUP BY statements (Q2-Q3). With many large JOINs and many GROUP BYs (Q4) it performs in the middle of the other tested technologies. Source
Big Query really shines under concurrent queries: query time stays constant in comparison to the other technologies due to fast auto-scaling. Source
  • Low costs: BQ has similar costs to other big data warehouse solutions. Costs as of today are $0.02/GB per month for storage and $5/TB for querying. The first 10 GB of storage per month and the first 1 TB of querying per month are free. Many operations, e.g. loading, copying, exporting and deleting data, as well as failed queries, are free. Furthermore, there is query caching: you do not have to pay if you run a query again on the same, unchanged data. Flat-rate prices are available, too (see the small cost sketch after this list).
  • Easy integration with Google services: Data from Google Analytics 360 can easily be stored in BQ. This is a big advantage, as Google Analytics has a limit on stored rows and only enables reports on sampled data. If you store your Analytics data in BQ you can access all of your tracking data, get a more detailed customer journey and combine every dimension with every metric. Additionally, datasets on Google Cloud Storage and Google Drive can be queried via BQ without manually importing them.
  • Easy integration with other tools: BQ has its own machine learning suite, the Big Query ML engine, which enables you to import TensorFlow models for prediction. There is also a BQ BI engine, but neither seems particularly useful to me yet, as the functionality is limited. Many services such as Tableau, Qlik or Looker have connectors to BQ.
  • No-Ops management: No prior database management knowledge is needed to set up BQ and manage security and recovery.
  • Public datasets: You have a nice selection of publicly available data on BQ, and some of the datasets are constantly updated!
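As a rough sense check of those prices, here is a back-of-the-envelope calculation in R. The storage and query volumes below are hypothetical, and the rates are simply those quoted above, so check current Google pricing before relying on it:

storage_gb <- 100  # hypothetical volume stored per month
query_tb <- 5      # hypothetical volume scanned by queries per month

# The first 10 GB of storage and the first 1 TB of querying per month are free
cost_usd <- max(storage_gb - 10, 0) * 0.02 + max(query_tb - 1, 0) * 5
cost_usd           # 1.8 + 20 = 21.8 USD per month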

Use Big Query with R

Enable Big Query and get your credentials

  1. Go to the Google Cloud Platform and login with your Google account. At the top left corner go to “Choose Project” and start a new project. If you go on your home dashboard to “Go to APIs overview” you will see the activated APIs of the Google Cloud Service. “BigQuery API” and “BigQuery Storage API” should be activated by default for all new projects.
Activated APIs for a new Google Cloud project.

2. Get your API key as described by the gargle R package here. In short, go to the Credentials section at the Google Cloud Platform in the dashboard shown above and create credentials > API key. You can rename your API key and restrict it to only certain APIs like “BigQuery API”. If you need application access to BQ you will need a service account token which you can download as JSON.

Querying with R

For querying BQ we will use the R library bigrquery. The other prominent R library for BQ is bigQueryR, which, in contrast to bigrquery, depends on the googleAuthR library and is therefore more compatible with Shiny and other packages.

First, we get the libraries and authenticate either with our created API key or with the downloaded service account token JSON.

<a href="https://medium.com/media/943f918f8d6fcc4a0e23eaea491155c3/href" rel="nofollow" target="_blank">https://medium.com/media/943f918f8d6fcc4a0e23eaea491155c3/href</a>

Now we can start querying our Big Query data or public datasets. We will query the Real-time Air Quality dataset from openAQ. This is an open source project which provides real-time data (if you stretch the definition of "real time") from 5490 worldwide air quality measurement stations, which is awesome! You can see the dataset and a short description on Big Query here if you are logged into Google. To find open datasets in the Cloud Console, scroll down in the left menu; there you should see "Big Query" under the header "Big Data". If you then go to "+Add Data" you will be able to browse public datasets.

We will wrap the bigrquery API with DBI to be able to use it with dplyr verbs; however, the bigrquery package provides a low-level API, too.

<a href="https://medium.com/media/20c4648d720d8a2425e1c2c5dde9c82d/href" rel="nofollow" target="_blank">https://medium.com/media/20c4648d720d8a2425e1c2c5dde9c82d/href</a>

In this example you can see querying with dplyr functions that are converted to SQL queries; however, you do not get the full flexibility that direct SQL querying provides. For that you could send SQL queries via DBI::dbGetQuery() from R. The global air quality dataset gets updated regularly, but older entries are omitted, probably to save storage costs. Check out my next post on how to build a Dockerized cron job to get the newest air pollution data from India while keeping older records.
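For illustration, the same aggregation as a direct SQL query over the connection above (again only a sketch, not the post's embedded code):

sql <- "
  SELECT city, AVG(value) AS mean_pm25
  FROM `bigquery-public-data.openaq.global_air_quality`
  WHERE country = 'IN' AND pollutant = 'pm25'
  GROUP BY city
  ORDER BY mean_pm25 DESC
  LIMIT 10"

DBI::dbGetQuery(con, sql)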

This article was also published on https://www.r-bloggers.com/.


Google Big Query with R was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.


To leave a comment for the author, please follow the link and comment on their blog: Stories by Tim M. Schendzielorz on Medium.


COVID-19 Tracker: Days since N


[This article was first published on R Bloggers on , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There’s no shortage of dashboards and data visualizations covering some aspect of the ongoing coronavirus pandemic, but not having come across a tool that allowed me to easily compare countries myself, I developed this COVID-19 Tracker Shiny app, both for my own personal use and to get some more experience working with Shiny.

This app was inspired by a visualization produced by John Burn-Murdoch for the Financial Times (FT) that I thought did a very nice job of allowing cross-country comparisons of the trajectories of total confirmed cases by standardizing countries using the “Number of days since the 100th case” on the x-axis.

The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) maintains a dashboard whose data source serves as the underlying data for the FT visualization, as well as for many others floating around on the internet at the moment.

At the time of writing, the JHU CSSE dashboard does not allow for an easy way to select countries for direct comparison. The Shiny app presented here allows the user to select any of the country/region units available in the entire dataset, standardize them on the x-axis using “Days since N”, and automatically generate fairly clean level- and log-plots with dynamically rendered titles and axis labels. The data in the app is timestamped and updated automatically along with the JHU CSSE repo, and there are download buttons for the plots and the filtered long-format data tables used for those plots, in PNG and CSV formats, respectively.

Currently, a maximum of six countries can be compared at a time. The limit was set simply to allow for better readability of the resulting plots. Users can select between total confirmed cases, deaths, and total recovered as the different y-axis outcome variables.

The default N number for the total confirmed cases outcome is set to 100, in keeping with the most widely used convention at the moment. For deaths, N=10 can be used.
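For reference, the “days since N” standardization itself is straightforward to compute. This is a minimal sketch (not the app’s actual code), assuming a long-format data frame df with columns country, date and cumulative cases:

library(dplyr)

days_since_n <- function(df, n = 100) {
  df %>%
    group_by(country) %>%
    arrange(date, .by_group = TRUE) %>%
    filter(cases >= n) %>%                                   # keep days at or after the Nth case
    mutate(days_since_n = as.numeric(date - min(date))) %>%  # 0 on the day the threshold is reached
    ungroup()
}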

There are a few countries that include more detailed regional breakdowns of the data. Where this is the case, the totals for those countries are given by the country name + “(all territories)”.

Additional features and edits will be added on an ongoing basis. Feedback or comments are welcome.


To leave a comment for the author, please follow the link and comment on their blog: R Bloggers on .



Apply for the Google Summer of Code and Help Us Improving The R Code Optimizer


[This article was first published on Pachá, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Are you a BSc, MSc or PhD student who would like to contribute to open source this summer (winter down south) while earning some cash?

Then you will be interested to know that this year @CancuCS and I are going to be mentors for the Google Summer of Code 2020.

Take a look at The R Code Optimizer, apply, and help us grow the R code optimizer package, rco.


To leave a comment for the author, please follow the link and comment on their blog: Pachá.


The significance of population size, year, and per cent women on the education level in Sweden


[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In twelve posts I have analysed how different factors are related to salaries in Sweden with data from Statistics Sweden. In this post, I will analyse a new dataset from Statistics Sweden: population by region, age, level of education, sex and year. Not knowing exactly what to find, I will use a criterion-based procedure to find the model that minimises the AIC. Then I will perform some tests to see how robust the model is. Finally, I will plot the findings.

First, define libraries and functions.

library (tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3## v tibble  2.1.3     v dplyr   0.8.3## v tidyr   1.0.2     v stringr 1.4.0## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --## x dplyr::filter() masks stats::filter()## x dplyr::lag()    masks stats::lag()
library (broom)
library (car)
## Loading required package: carData
## ## Attaching package: 'car'
## The following object is masked from 'package:dplyr':## ##     recode
## The following object is masked from 'package:purrr':## ##     some
library (sjPlot)
## Registered S3 methods overwritten by 'lme4':##   method                          from##   cooks.distance.influence.merMod car ##   influence.merMod                car ##   dfbeta.influence.merMod         car ##   dfbetas.influence.merMod        car
library (leaps)
library (splines)
library (MASS)
## ## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':## ##     select
library (mgcv)
## Loading required package: nlme
## ## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':## ##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
library (lmtest)
## Loading required package: zoo
## ## Attaching package: 'zoo'
## The following objects are masked from 'package:base':## ##     as.Date, as.Date.numeric
library (earth)
## Warning: package 'earth' was built under R version 3.6.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 3.6.3
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 3.6.3
library (acepack)
## Warning: package 'acepack' was built under R version 3.6.3
library (lspline)
## Warning: package 'lspline' was built under R version 3.6.3
library (lme4)
## Loading required package: Matrix
## ## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':## ##     expand, pack, unpack
## ## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':## ##     lmList
library (pROC)
## Warning: package 'pROC' was built under R version 3.6.3
## Type 'citation("pROC")' for a citation.
## ## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':## ##     cov, smooth, var
readfile <- function (file1){
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"),
            na = c("..", "NA")) %>%
    gather (starts_with("19"), starts_with("20"), key = "year", value = groupsize) %>%
    drop_na() %>%
    mutate (year_n = parse_number (year))
}

perc_women <- function(x){
  ifelse (length(x) == 2, x[2] / (x[1] + x[2]), NA)
}

nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 3, 4))

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without a header, UF0506A1.csv, from http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

I will calculate the percentage of women for the different education levels in the different regions for each year. In my later analysis I will use the number of people in each education level, region and year.

The table: Population 16-74 years of age by region, highest level of education, age and sex. Year 1985 – 2018 NUTS 2 level 2008- 10 year intervals (16-74)

tb <- readfile("UF0506A1.csv") %>%    mutate(edulevel = `level of education`) %>%  group_by(edulevel, region, year, sex) %>%  mutate(groupsize_all_ages = sum(groupsize)) %>%    group_by(edulevel, region, year) %>%   mutate (sum_edu_region_year = sum(groupsize)) %>%    mutate (perc_women = perc_women (groupsize_all_ages[1:2])) %>%   group_by(region, year) %>%  mutate (sum_pop = sum(groupsize)) %>% rowwise() %>%  mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>%  rowwise() %>%   mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>%  mutate(age_n = (age_l + age_h) / 2) %>%  left_join(nuts %>% distinct (NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en"))
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,## coercing into character vector
numedulevel <- read.csv("edulevel_1.csv")

numedulevel %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Initial approach, length of education')
Table 1: Initial approach, length of education
level.of.education | eduyears
primary and secondary education less than 9 years (ISCED97 1) | 8
primary and secondary education 9-10 years (ISCED97 2) | 9
upper secondary education, 2 years or less (ISCED97 3C) | 11
upper secondary education 3 years (ISCED97 3A) | 12
post-secondary education, less than 3 years (ISCED97 4+5B) | 14
post-secondary education 3 years or more (ISCED97 5A) | 15
post-graduate education (ISCED97 6) | 19
no information about level of educational attainment | NA
tbnum <- tb %>%   right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%  filter(!is.na(eduyears)) %>%   drop_na()
## Warning: Column `level of education`/`level.of.education` joining character## vector and factor, coercing into character vector
tbnum %>%  ggplot () +      geom_point (mapping = aes(x = NUTS2_sh,y = perc_women, colour = year_n)) +  facet_grid(. ~ eduyears)
Population by region, level of education, percent women and year, Year 1985 - 2018

Figure 1: Population by region, level of education, percent women and year, Year 1985 – 2018

summary(tbnum)
##     region              age            level of education     sex           ##  Length:22848       Length:22848       Length:22848       Length:22848      ##  Class :character   Class :character   Class :character   Class :character  ##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  ##                                                                             ##                                                                             ##                                                                             ##      year             groupsize         year_n       edulevel        ##  Length:22848       Min.   :    0   Min.   :1985   Length:22848      ##  Class :character   1st Qu.: 1634   1st Qu.:1993   Class :character  ##  Mode  :character   Median : 5646   Median :2002   Mode  :character  ##                     Mean   : 9559   Mean   :2002                     ##                     3rd Qu.:14027   3rd Qu.:2010                     ##                     Max.   :77163   Max.   :2018                     ##  groupsize_all_ages sum_edu_region_year   perc_women        sum_pop       ##  Min.   :    45     Min.   :   366      Min.   :0.1230   Min.   : 266057  ##  1st Qu.: 20033     1st Qu.: 40482      1st Qu.:0.4416   1st Qu.: 515306  ##  Median : 45592     Median : 90871      Median :0.4816   Median : 740931  ##  Mean   : 57353     Mean   :114706      Mean   :0.4641   Mean   : 823034  ##  3rd Qu.: 86203     3rd Qu.:172120      3rd Qu.:0.5217   3rd Qu.:1195658  ##  Max.   :271889     Max.   :486270      Max.   :0.6423   Max.   :1716160  ##      age_l           age_h        age_n         NUTS2_sh        ##  Min.   :16.00   Min.   :24   Min.   :20.00   Length:22848      ##  1st Qu.:25.00   1st Qu.:34   1st Qu.:29.50   Class :character  ##  Median :40.00   Median :49   Median :44.50   Mode  :character  ##  Mean   :40.17   Mean   :49   Mean   :44.58                     ##  3rd Qu.:55.00   3rd Qu.:64   3rd Qu.:59.50                     ##  Max.   :65.00   Max.   :74   Max.   :69.50                     ##     eduyears    ##  Min.   : 8.00  ##  1st Qu.: 9.00  ##  Median :12.00  ##  Mean   :12.57  ##  3rd Qu.:15.00  ##  Max.   :19.00

In a previous post, I approximated the number of years of education for every education level. Since this approximation matters for the rest of the analysis, I will see if I can do better. I use Multivariate Adaptive Regression Splines (MARS) to find the permutation of years of education, within the given limitations, that gives the highest R-Squared value. I chose not to calculate combinations outside the range of 7 to 19 years because I assessed it would take too much time. From the table, we can see that only post-graduate education gains from a higher number of years. A manual test shows that setting its years of education to 22 gives a higher R-Squared without producing large residuals.

educomb <- as_tibble(t(combn(7:19,7))) %>%   filter((V6 - V4) > 2) %>% filter((V4 - V2) > 2) %>%   filter(V2 > 8) %>%   mutate(na = NA)
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.## This warning is displayed once per session.
summary_table = vector()

for (i in 1:dim(educomb)[1]) {
  numedulevel[, 2] <- t(educomb[i,])

  suppressWarnings (tbnum <- tb %>%
    right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
    filter(!is.na(eduyears)) %>%
    drop_na())

  tbtest <- tbnum %>%
    dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women)

  mmod <- earth(eduyears ~ ., data = tbtest, nk = 12, degree = 2)

  summary_table <- rbind(summary_table, summary(mmod)$rsq)
}

which.max(summary_table)
## [1] 235
educomb[which.max(summary_table),] #235
## # A tibble: 1 x 8##      V1    V2    V3    V4    V5    V6    V7 na   ##          ## 1     8     9    10    12    13    15    19 NA
numedulevel[, 2] <- t(educomb[235,])
numedulevel[7, 2] <- 22

numedulevel %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Recalculated length of education')
Table 2: Recalculated length of education
level.of.education | eduyears
primary and secondary education less than 9 years (ISCED97 1) | 8
primary and secondary education 9-10 years (ISCED97 2) | 9
upper secondary education, 2 years or less (ISCED97 3C) | 10
upper secondary education 3 years (ISCED97 3A) | 12
post-secondary education, less than 3 years (ISCED97 4+5B) | 13
post-secondary education 3 years or more (ISCED97 5A) | 15
post-graduate education (ISCED97 6) | 22
no information about level of educational attainment | NA
tbnum <- tb %>%   right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%  filter(!is.na(eduyears)) %>%   drop_na()
## Warning: Column `level of education`/`level.of.education` joining character## vector and factor, coercing into character vector

Let’s investigate the shape of the function for the response and predictors. The shape of the predictors has a great impact on the rest of the analysis. I use acepack to fit a model and plot both the response and the predictors.

tbtest <- tbnum %>%
  dplyr::select(sum_pop, sum_edu_region_year, year_n, perc_women)
tbtest <- data.frame(tbtest)

acefit <- ace(tbtest, tbnum$eduyears)

plot(tbnum$eduyears, acefit$ty,
     xlab = "Years of education", ylab = "transformed years of education")
Plots of the response and predictors using acepack

Figure 2: Plots of the response and predictors using acepack

plot(tbtest[,1], acefit$tx[,1], xlab = "Population in region", ylab = "transformed population in region")
Plots of the response and predictors using acepack

Figure 3: Plots of the response and predictors using acepack

plot(tbtest[,2], acefit$tx[,2], xlab = "# persons with same edulevel, region, year", ylab = "transformed # persons with same edulevel, region, year")
Plots of the response and predictors using acepack

Figure 4: Plots of the response and predictors using acepack

plot(tbtest[,3], acefit$tx[,3], xlab = "Year", ylab = "transformed year")
Plots of the response and predictors using acepack

Figure 5: Plots of the response and predictors using acepack

plot(tbtest[,4], acefit$tx[,4], xlab = "Percent women", ylab = "transformed percent women")
Plots of the response and predictors using acepack

Figure 6: Plots of the response and predictors using acepack

I use MARS to fit hockey-stick functions for the predictors. I do not wish to overfit by using a better approximation at this point. I will include interactions of degree two.

tbtest <- tbnum %>%
  dplyr::select(eduyears, sum_pop, sum_edu_region_year, year_n, perc_women)

mmod <- earth(eduyears ~ ., data = tbtest, nk = 9, degree = 2)

summary (mmod)
## Call: earth(formula=eduyears~., data=tbtest, degree=2, nk=9)## ##                                                       coefficients## (Intercept)                                               9.930701## h(37001-sum_edu_region_year)                              0.000380## h(sum_edu_region_year-37001)                              0.000003## h(0.492816-perc_women)                                    9.900436## h(perc_women-0.492816)                                   49.719932## h(1.32988e+06-sum_pop) * h(37001-sum_edu_region_year)     0.000000## h(sum_pop-1.32988e+06) * h(37001-sum_edu_region_year)     0.000000## h(sum_edu_region_year-37001) * h(2004-year_n)            -0.000001## ## Selected 8 of 9 terms, and 4 of 4 predictors## Termination condition: Reached nk 9## Importance: sum_edu_region_year, perc_women, sum_pop, year_n## Number of terms at each degree of interaction: 1 4 3## GCV 3.774465    RSS 86099.37    GRSq 0.8049234    RSq 0.8052222
plot (mmod)
Hockey-stick functions fit with MARS for the predictors, Year 1985 - 2018

Figure 7: Hockey-stick functions fit with MARS for the predictors, Year 1985 – 2018

plotmo (mmod)
##  plotmo grid:    sum_pop sum_edu_region_year year_n perc_women##                   740931             90870.5 2001.5  0.4815703
Hockey-stick functions fit with MARS for the predictors, Year 1985 - 2018

Figure 8: Hockey-stick functions fit with MARS for the predictors, Year 1985 – 2018

model_mmod <- lm (eduyears ~ lspline(sum_edu_region_year, c(37001)) +
              lspline(perc_women, c(0.492816)) +
              lspline(sum_pop, c(1.32988e+06)):lspline(sum_edu_region_year, c(37001)) +
              lspline(sum_edu_region_year, c(1.32988e+06)):lspline(year_n, c(2004)),
             data = tbnum)

summary (model_mmod)$r.squared
## [1] 0.7792166
anova (model_mmod)
## Analysis of Variance Table## ## Response: eduyears##                                                                        Df## lspline(sum_edu_region_year, c(37001))                                  2## lspline(perc_women, c(0.492816))                                        2## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))     4## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))       2## Residuals                                                           22837##                                                                     Sum Sq## lspline(sum_edu_region_year, c(37001))                              292982## lspline(perc_women, c(0.492816))                                     39071## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))   9629## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))     2763## Residuals                                                            97595##                                                                     Mean Sq## lspline(sum_edu_region_year, c(37001))                               146491## lspline(perc_women, c(0.492816))                                      19535## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))    2407## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))      1382## Residuals                                                                 4##                                                                      F value## lspline(sum_edu_region_year, c(37001))                              34278.55## lspline(perc_women, c(0.492816))                                     4571.22## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880))   563.27## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))     323.30## Residuals                                                                   ##                                                                        Pr(>F)## lspline(sum_edu_region_year, c(37001))                              < 2.2e-16## lspline(perc_women, c(0.492816))                                    < 2.2e-16## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880)) < 2.2e-16## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))   < 2.2e-16## Residuals                                                                    ##                                                                        ## lspline(sum_edu_region_year, c(37001))                              ***## lspline(perc_women, c(0.492816))                                    ***## lspline(sum_edu_region_year, c(37001)):lspline(sum_pop, c(1329880)) ***## lspline(sum_edu_region_year, c(1329880)):lspline(year_n, c(2004))   ***## Residuals                                                              ## ---## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I will use regsubsets to find the model which minimises the AIC. I will also calculate the Receiver Operating Characteristic (ROC) for the model I find for each level of years of education.

b <- regsubsets (eduyears ~ (lspline(sum_pop, c(1.32988e+06)) + lspline(perc_women, c(0.492816)) +
                   lspline(year_n, c(2004)) + lspline(sum_edu_region_year, c(37001))) *
                  (lspline(sum_pop, c(1.32988e+06)) + lspline(perc_women, c(0.492816)) +
                   lspline(year_n, c(2004)) + lspline(sum_edu_region_year, c(37001))),
                 data = tbnum, nvmax = 20)

rs <- summary(b)

AIC <- 50 * log (rs$rss / 50) + (2:21) * 2

which.min (AIC)
## [1] 9
names (rs$which[9,])[rs$which[9,]]
##  [1] "(Intercept)"                                                              ##  [2] "lspline(sum_pop, c(1329880))1"                                            ##  [3] "lspline(sum_edu_region_year, c(37001))2"                                  ##  [4] "lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1"          ##  [5] "lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1"                  ##  [6] "lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1"    ##  [7] "lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1"              ##  [8] "lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1"              ##  [9] "lspline(perc_women, c(0.492816))1:lspline(sum_edu_region_year, c(37001))2"## [10] "lspline(year_n, c(2004))1:lspline(sum_edu_region_year, c(37001))2"
model <- lm(eduyears ~
  lspline(sum_pop, c(1329880)) +
  lspline(sum_edu_region_year, c(37001)) +
  lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
  lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
  lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
  lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
  lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
  lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)),
  data = tbnum)

summary (model)$r.squared
## [1] 0.8455547
anova (model)
## Analysis of Variance Table## ## Response: eduyears##                                                                            Df## lspline(sum_pop, c(1329880))                                                2## lspline(sum_edu_region_year, c(37001))                                      2## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))               4## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       4## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))         4## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                   4## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))     4## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))             4## Residuals                                                               22819##                                                                         Sum Sq## lspline(sum_pop, c(1329880))                                                 0## lspline(sum_edu_region_year, c(37001))                                  306779## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            35378## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      775## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       7224## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 8932## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   6979## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           7700## Residuals                                                                68271##                                                                         Mean Sq## lspline(sum_pop, c(1329880))                                                  0## lspline(sum_edu_region_year, c(37001))                                   153389## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))              8844## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       194## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))        1806## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                  2233## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))    1745## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))            1925## Residuals                                                                     3##                                                                          F value## lspline(sum_pop, c(1329880))                                                0.00## lspline(sum_edu_region_year, c(37001))                                  51269.26## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            2956.20## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      64.80## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       603.67## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 746.37## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   583.19## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           643.44## Residuals                                                                       ##                                                                         Pr(>F)## lspline(sum_pop, c(1329880))                                                 1## lspline(sum_edu_region_year, c(37001))                                  
<2e-16## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))           <2e-16## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                   <2e-16## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))     <2e-16## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))               <2e-16## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816)) <2e-16## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))         <2e-16## Residuals                                                                     ##                                                                            ## lspline(sum_pop, c(1329880))                                               ## lspline(sum_edu_region_year, c(37001))                                  ***## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))           ***## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                   ***## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))     ***## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))               ***## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816)) ***## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))         ***## Residuals                                                                  ## ---## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot (model)
Find the model that minimises the AIC, Year 1985 - 2018

Figure 9: Find the model that minimises the AIC, Year 1985 – 2018

Find the model that minimises the AIC, Year 1985 - 2018

Figure 10: Find the model that minimises the AIC, Year 1985 – 2018

Find the model that minimises the AIC, Year 1985 - 2018

Figure 11: Find the model that minimises the AIC, Year 1985 – 2018

Find the model that minimises the AIC, Year 1985 - 2018

Figure 12: Find the model that minimises the AIC, Year 1985 – 2018

tbnumpred <- bind_cols(tbnum, as_tibble(predict(model, tbnum, interval = "confidence")))

suppressWarnings(multiclass.roc(tbnumpred$eduyears, tbnumpred$fit))
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## ## Call:## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$fit)## ## Data: tbnumpred$fit with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.## Multi-class area under the curve: 0.8743

There are a few things I would like to investigate to improve the credibility of the analysis. First, the study is a longitudinal study. A great proportion of the people are measured each year, and the majority of the people in a region remain in the region from year to year. I will assume that each birth year and each region is a group and set them as random effects, with the rest of the predictors as fixed effects. I use the mean age in each age group to calculate the year of birth.

temp <- tbnum %>%
  mutate(yob = year_n - age_n) %>%
  mutate(region = tbnum$region)

mmodel <- lmer(eduyears ~
  lspline(sum_pop, c(1329880)) +
  lspline(sum_edu_region_year, c(37001)) +
  lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
  lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
  lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
  lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
  lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
  lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)) +
  (1|yob) +
  (1|region),
  data = temp)
## Warning: Some predictor variables are on very different scales: consider## rescaling
## boundary (singular) fit: see ?isSingular
plot(mmodel)
A diagnostic plot of the model with random effects components

Figure 13: A diagnostic plot of the model with random effects components

qqnorm (residuals(mmodel), main="")
A diagnostic plot of the model with random effects components

Figure 14: A diagnostic plot of the model with random effects components

summary (mmodel)
## Linear mixed model fit by REML ['lmerMod']## Formula: ## eduyears ~ lspline(sum_pop, c(1329880)) + lspline(sum_edu_region_year,  ##     c(37001)) + lspline(sum_pop, c(1329880)):lspline(perc_women,  ##     c(0.492816)) + lspline(sum_pop, c(1329880)):lspline(year_n,  ##     c(2004)) + lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year,  ##     c(37001)) + lspline(perc_women, c(0.492816)):lspline(year_n,  ##     c(2004)) + lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year,  ##     c(37001)) + lspline(year_n, c(2004)):lspline(sum_edu_region_year,  ##     c(37001)) + (1 | yob) + (1 | region)##    Data: temp## ## REML criterion at convergence: 90514.4## ## Scaled residuals: ##     Min      1Q  Median      3Q     Max ## -5.1175 -0.5978 -0.0137  0.5766  2.8735 ## ## Random effects:##  Groups   Name        Variance Std.Dev.##  yob      (Intercept) 0.000    0.000   ##  region   (Intercept) 1.115    1.056   ##  Residual             2.970    1.723   ## Number of obs: 22848, groups:  yob, 108; region, 8## ## Fixed effects:##                                                                             Estimate## (Intercept)                                                                2.516e+01## lspline(sum_pop, c(1329880))1                                              1.514e-04## lspline(sum_pop, c(1329880))2                                              2.912e-03## lspline(sum_edu_region_year, c(37001))1                                    2.314e-03## lspline(sum_edu_region_year, c(37001))2                                   -2.288e-03## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            5.502e-05## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            7.840e-05## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           -2.061e-05## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            1.467e-05## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -7.788e-08## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                   -1.428e-06## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -3.009e-07## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.430e-07## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -4.707e-10## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -2.387e-09## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      2.554e-13## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      1.137e-12## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -1.659e-02## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                3.580e-02## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.888e-01## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -1.008e+00## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  9.201e-05## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -4.149e-04## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2 -1.441e-04## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.086e-04## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         -1.211e-06## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          1.240e-06## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2      
   -2.615e-06## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.146e-06##                                                                           Std. Error## (Intercept)                                                                6.548e-01## lspline(sum_pop, c(1329880))1                                              1.494e-05## lspline(sum_pop, c(1329880))2                                              6.394e-03## lspline(sum_edu_region_year, c(37001))1                                    3.150e-04## lspline(sum_edu_region_year, c(37001))2                                    7.229e-05## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            1.344e-06## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            1.213e-05## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            2.853e-06## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            1.540e-05## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    7.362e-09## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    3.191e-06## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    1.349e-08## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    7.352e-08## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      9.596e-12## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      8.271e-11## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      7.991e-13## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      2.836e-12## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                4.545e-04## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                4.504e-03## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.671e-02## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                9.737e-02## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  2.688e-05## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  1.117e-05## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  2.526e-04## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.429e-05## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          1.586e-07## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          3.623e-08## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          4.441e-07## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          6.085e-08##                                                                           t value## (Intercept)                                                                38.420## lspline(sum_pop, c(1329880))1                                              10.137## lspline(sum_pop, c(1329880))2                                               0.455## lspline(sum_edu_region_year, c(37001))1                                     7.345## lspline(sum_edu_region_year, c(37001))2                                   -31.645## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            40.921## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1             6.463## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            -7.226## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2             0.952## 
lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -10.579## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    -0.448## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -22.303## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                     1.945## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -49.052## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -28.855## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2       0.320## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2       0.401## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -36.497## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                 7.949## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                10.593## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -10.350## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1   3.423## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -37.150## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  -0.571## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2   7.602## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          -7.639## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          34.226## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          -5.887## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          18.833
## ## Correlation matrix not shown by default, as p = 29 > 12.## Use print(x, correlation=TRUE)  or##     vcov(x)        if you need it
## fit warnings:## Some predictor variables are on very different scales: consider rescaling## convergence code: 0## boundary (singular) fit: see ?isSingular
anova (mmodel)
## Analysis of Variance Table##                                                                         Df## lspline(sum_pop, c(1329880))                                             2## lspline(sum_edu_region_year, c(37001))                                   2## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            4## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                    4## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))      4## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                4## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))  4## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))          4##                                                                         Sum Sq## lspline(sum_pop, c(1329880))                                                 0## lspline(sum_edu_region_year, c(37001))                                  308190## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            35415## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      589## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       7737## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 8202## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   7316## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           6809##                                                                         Mean Sq## lspline(sum_pop, c(1329880))                                                  0## lspline(sum_edu_region_year, c(37001))                                   154095## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))              8854## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                       147## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))        1934## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                  2051## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))    1829## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))            1702##                                                                           F value## lspline(sum_pop, c(1329880))                                                0.000## lspline(sum_edu_region_year, c(37001))                                  51879.188## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))            2980.805## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                      49.613## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       651.234## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 690.377## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   615.763## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           573.138
tbnumpred <- bind_cols(temp, as_tibble(predict(mmodel, temp, interval = "confidence")))
## Warning in predict.merMod(mmodel, temp, interval = "confidence"): unused## arguments ignored
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.## This warning is displayed once per session.
suppressWarnings (multiclass.roc (tbnumpred$eduyears, tbnumpred$value))
## Setting direction: controls < cases
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## ## Call:## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$value)## ## Data: tbnumpred$value with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.## Multi-class area under the curve: 0.8754

Another problem could be that the response variable is limited in its range. To get more insight into this issue, we can also fit a Poisson regression model.

pmodel <- glm(eduyears ~
    lspline(sum_pop, c(1329880)) +
    lspline(sum_edu_region_year, c(37001)) +
    lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816)) +
    lspline(sum_pop, c(1329880)):lspline(year_n, c(2004)) +
    lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001)) +
    lspline(perc_women, c(0.492816)):lspline(year_n, c(2004)) +
    lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, c(37001)) +
    lspline(year_n, c(2004)):lspline(sum_edu_region_year, c(37001)),
  family = poisson,
  data = tbnum)

plot(pmodel)
Figure 15: A diagnostic plot of Poisson regression

Figure 16: A diagnostic plot of Poisson regression

Figure 17: A diagnostic plot of Poisson regression

Figure 18: A diagnostic plot of Poisson regression

tbnumpred <- bind_cols(tbnum, as_tibble(predict(pmodel, tbnum, interval = "confidence")))
suppressWarnings(multiclass.roc(tbnumpred$eduyears, tbnumpred$value))
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## Setting direction: controls > cases
## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases## Setting direction: controls < cases
## 
## Call:
## multiclass.roc.default(response = tbnumpred$eduyears, predictor = tbnumpred$value)
## 
## Data: tbnumpred$value with 7 levels of tbnumpred$eduyears: 8, 9, 10, 12, 13, 15, 22.
## Multi-class area under the curve: 0.8716
summary (pmodel)
## ## Call:## glm(formula = eduyears ~ lspline(sum_pop, c(1329880)) + lspline(sum_edu_region_year, ##     c(37001)) + lspline(sum_pop, c(1329880)):lspline(perc_women, ##     c(0.492816)) + lspline(sum_pop, c(1329880)):lspline(year_n, ##     c(2004)) + lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, ##     c(37001)) + lspline(perc_women, c(0.492816)):lspline(year_n, ##     c(2004)) + lspline(perc_women, c(0.492816)):lspline(sum_edu_region_year, ##     c(37001)) + lspline(year_n, c(2004)):lspline(sum_edu_region_year, ##     c(37001)), family = poisson, data = tbnum)## ## Deviance Residuals: ##      Min        1Q    Median        3Q       Max  ## -2.32031  -0.33091  -0.01716   0.30301   1.40215  ## ## Coefficients:##                                                                             Estimate## (Intercept)                                                                3.403e+00## lspline(sum_pop, c(1329880))1                                              5.825e-06## lspline(sum_pop, c(1329880))2                                             -8.868e-05## lspline(sum_edu_region_year, c(37001))1                                    3.722e-04## lspline(sum_edu_region_year, c(37001))2                                   -2.310e-04## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            3.838e-06## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            8.103e-06## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           -2.276e-06## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2           -3.732e-06## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   -3.188e-09## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    4.535e-08## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -2.600e-08## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.616e-08## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -2.870e-11## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -1.718e-10## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     -2.527e-13## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2     -2.193e-14## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -9.758e-04## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                2.556e-03## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                3.188e-02## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               -1.221e-01## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 -1.020e-05## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -2.991e-05## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  1.916e-05## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  1.271e-05## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         -1.874e-07## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          1.224e-07## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         -1.952e-07## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.122e-07##                                                                           Std. 
Error## (Intercept)                                                                3.236e-02## lspline(sum_pop, c(1329880))1                                              1.792e-06## lspline(sum_pop, c(1329880))2                                              9.916e-04## lspline(sum_edu_region_year, c(37001))1                                    4.837e-05## lspline(sum_edu_region_year, c(37001))2                                    1.222e-05## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            1.962e-07## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1            2.131e-06## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            4.682e-07## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            2.516e-06## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    9.022e-10## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                    4.948e-07## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    1.917e-09## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                    1.155e-08## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      1.422e-12## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      1.343e-11## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      1.161e-13## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      4.747e-13## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                6.510e-05## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                6.648e-04## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                5.260e-03## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                1.564e-02## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  4.161e-06## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  1.813e-06## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2  3.734e-05## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2  2.408e-06## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          2.435e-08## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          6.124e-09## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          6.510e-08## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          1.002e-08##                                                                           z value## (Intercept)                                                               105.166## lspline(sum_pop, c(1329880))1                                               3.251## lspline(sum_pop, c(1329880))2                                              -0.089## lspline(sum_edu_region_year, c(37001))1                                     7.694## lspline(sum_edu_region_year, c(37001))2                                   -18.907## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            19.559## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1             3.803## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2            -4.861## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2            -1.483## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                    -3.534## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                     0.092## 
lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   -13.558## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                     1.400## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     -20.183## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     -12.790## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2      -2.176## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2      -0.046## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               -14.991## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1                 3.845## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2                 6.060## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2                -7.810## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1  -2.451## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 -16.498## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2   0.513## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2   5.280## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1          -7.698## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          19.994## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2          -2.998## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          11.202##                                                                           Pr(>|z|)## (Intercept)                                                                < 2e-16## lspline(sum_pop, c(1329880))1                                             0.001151## lspline(sum_pop, c(1329880))2                                             0.928739## lspline(sum_edu_region_year, c(37001))1                                   1.42e-14## lspline(sum_edu_region_year, c(37001))2                                    < 2e-16## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1            < 2e-16## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1           0.000143## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           1.17e-06## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2           0.138097## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   0.000410## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                   0.926973## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                    < 2e-16## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                   0.161556## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1      < 2e-16## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1      < 2e-16## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     0.029521## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2     0.963157## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1                < 2e-16## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1               0.000121## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2               1.36e-09## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               5.70e-15## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 0.014246## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1  < 2e-16## 
lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2 0.607856## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2 1.29e-07## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         1.39e-14## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1          < 2e-16## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         0.002713## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2          < 2e-16##                                                                              ## (Intercept)                                                               ***## lspline(sum_pop, c(1329880))1                                             ** ## lspline(sum_pop, c(1329880))2                                                ## lspline(sum_edu_region_year, c(37001))1                                   ***## lspline(sum_edu_region_year, c(37001))2                                   ***## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))1           ***## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))1           ***## lspline(sum_pop, c(1329880))1:lspline(perc_women, c(0.492816))2           ***## lspline(sum_pop, c(1329880))2:lspline(perc_women, c(0.492816))2              ## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))1                   ***## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))1                      ## lspline(sum_pop, c(1329880))1:lspline(year_n, c(2004))2                   ***## lspline(sum_pop, c(1329880))2:lspline(year_n, c(2004))2                      ## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))1     ***## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))1     ***## lspline(sum_pop, c(1329880))1:lspline(sum_edu_region_year, c(37001))2     *  ## lspline(sum_pop, c(1329880))2:lspline(sum_edu_region_year, c(37001))2        ## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))1               ***## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))1               ***## lspline(perc_women, c(0.492816))1:lspline(year_n, c(2004))2               ***## lspline(perc_women, c(0.492816))2:lspline(year_n, c(2004))2               ***## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))1 *  ## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))1 ***## lspline(sum_edu_region_year, c(37001))1:lspline(perc_women, c(0.492816))2    ## lspline(sum_edu_region_year, c(37001))2:lspline(perc_women, c(0.492816))2 ***## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))1         ***## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))1         ***## lspline(sum_edu_region_year, c(37001))1:lspline(year_n, c(2004))2         ** ## lspline(sum_edu_region_year, c(37001))2:lspline(year_n, c(2004))2         ***## ---## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## (Dispersion parameter for poisson family taken to be 1)## ##     Null deviance: 32122.2  on 22847  degrees of freedom## Residual deviance:  5899.4  on 22819  degrees of freedom## AIC: 105166## ## Number of Fisher Scoring iterations: 4
anova (pmodel)
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: eduyears
## 
## Terms added sequentially (first to last)
## 
##                                                                          Df Deviance Resid. Df Resid. Dev
## NULL                                                                                      22847      32122
## lspline(sum_pop, c(1329880))                                              2      0.0     22845      32122
## lspline(sum_edu_region_year, c(37001))                                    2  21027.5     22843      11095
## lspline(sum_pop, c(1329880)):lspline(perc_women, c(0.492816))             4   2729.6     22839       8365
## lspline(sum_pop, c(1329880)):lspline(year_n, c(2004))                     4     51.2     22835       8314
## lspline(sum_pop, c(1329880)):lspline(sum_edu_region_year, c(37001))       4    528.8     22831       7785
## lspline(perc_women, c(0.492816)):lspline(year_n, c(2004))                 4    601.3     22827       7184
## lspline(sum_edu_region_year, c(37001)):lspline(perc_women, c(0.492816))   4    502.2     22823       6682
## lspline(sum_edu_region_year, c(37001)):lspline(year_n, c(2004))           4    782.2     22819       5899

Now let’s see what we have found. Note that the models do not handle extrapolation well. I will plot all the models for comparison.

plot_model (model, type = "pred", terms = c("sum_pop"))

Figure 19: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("sum_pop"))

Figure 20: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("sum_pop"))

Figure 21: The significance of the population in the region on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("sum_edu_region_year"))

Figure 22: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("sum_edu_region_year"))

Figure 23: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("sum_edu_region_year"))

Figure 24: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_point(mapping = aes(x = sum_edu_region_year, y = eduyears)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 25: The significance of the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("perc_women", "sum_pop"))

Figure 26: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("perc_women", "sum_pop"))

Figure 27: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("perc_women", "sum_pop"))

Figure 28: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = perc_women, y = eduyears, colour = sum_pop)) +
  labs(
    x = "Percent women",
    y = "Years of education"
  )

Figure 29: The significance of the interaction between per cent women and population in the region on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("year_n", "sum_pop")) 

Figure 30: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("year_n", "sum_pop")) 

Figure 31: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("year_n", "sum_pop")) 

Figure 32: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_pop, y = eduyears, colour = year_n)) +
  labs(
    x = "Population in region",
    y = "Years of education"
  )

Figure 33: The significance of the interaction between the population in the region and year on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 34: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 35: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("sum_edu_region_year", "sum_pop"))

Figure 36: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = sum_pop)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 37: The significance of the interaction between the number of persons with the same level of education, region and year and population in the region on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("year_n", "perc_women"))

Figure 38: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("year_n", "perc_women"))

Figure 39: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("year_n", "perc_women"))

Figure 40: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = perc_women, y = eduyears, colour = year_n)) +
  labs(
    x = "Percent women",
    y = "Years of education"
  )

Figure 41: The significance of the interaction between per cent women and year on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 42: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 43: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("perc_women", "sum_edu_region_year"))

Figure 44: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = perc_women)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 45: The significance of the interaction between the number of persons with the same level of education, region and year and per cent women on the level of education, Year 1985 – 2018

plot_model (model, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 46: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (mmodel, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 47: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

plot_model (pmodel, type = "pred", terms = c("year_n", "sum_edu_region_year"))

Figure 48: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018

tbnum %>%
  ggplot() +
  geom_jitter(mapping = aes(x = sum_edu_region_year, y = eduyears, colour = year_n)) +
  labs(
    x = "# persons with same edulevel, region, year",
    y = "Years of education"
  )

Figure 49: The significance of the interaction between year and the number of persons with the same level of education, region and year on the level of education, Year 1985 – 2018


To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .


Rebalancing history


[This article was first published on R on OSM, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Our last post on rebalancing struck an equivocal note. We ran a thousand simulations using historical averages across different rebalancing regimes to test whether rebalancing produced better absolute or risk-adjusted returns. The results suggested it did not. But we noted many problems with the tests—namely, unrealistic return distributions and correlation scenarios. We argued that if we used actual historical data and sampled from it, we might resolve many of these issues. But we also asked our readers whether it was worthwhile to test further. Based on the responses and page views, we believe the interest is there, so we’ll proceed!

As we mentioned, historical data more closely approximates the fat-tailed, skewed distribution common to asset returns. But only if you have a long enough time series. While we weren’t able to find 50 years worth of major asset class returns, we were able to compile a 20-year series that includes two market downturns. The data isn’t all from the same source, unfortunately. But it is fairly reputable—Vanguard’s stock and US bond index funds, emerging market bond indices from the St. Louis Fed, and the S&P GSCI commodity index. The code will show how we aggregated it for those interested. Using this data series we should be able to test rebalancing more robustly.

Before we proceed, a brief word on methodology. To run the simulation, we need to sample (with replacement) from our twenty year period and combine each sample into an entire series. To capture the non-normal distribution and serial correlation of asset returns, we can’t just sample one return, however. We need to sample a block of returns. This allows us to approximate the serial correlation of individual assets as well as the correlation between assets. But how long should the block be? Trying to answer that can get pretty complicated, pretty quickly.1 We decided to take a shortcut and use a simple block of 6 periods. This equates to six months, since our series is monthly returns. There’s nothing magical about this number but it does feature as a period used in academic studies on momentum, a topic beyond the scope of this post.2

We sample six months of returns at a time. Repeat 42 times to get just over 20 years of data. Repeat that to create 1000 portfolios. From there we apply the different rebalancing regimes to each of the portfolios and then aggregate the data. As before, we first use an equal weighting, and then a 60/35/5 weighting for stocks, bonds, and commodities. Let's see what we get.
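For readers who want the mechanics before the full code listing at the end of the post, here is a minimal sketch of the block-sampling idea. It assumes a data frame of monthly asset returns called returns; the function name is ours, and it sidesteps the wrap-around logic that the full block_samp() function below handles.

# Minimal block bootstrap: draw 6-month blocks with replacement and stack
# 42 of them to build one simulated ~21-year monthly return history
block_boot <- function(returns, block = 6, cycles = 42) {
  n <- nrow(returns)
  out <- vector("list", cycles)
  for (i in seq_len(cycles)) {
    start <- sample(n - block + 1, 1)                # random block start
    out[[i]] <- returns[start:(start + block - 1), ]
  }
  do.call(rbind, out)
}

# Repeat 1000 times for the full simulation, e.g.:
# sims <- replicate(1000, block_boot(returns), simplify = FALSE)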

First, we look at the average return for equal-weighted portfolios by rebalancing regime along with the range of outcomes.

Recall, the white line is the mean and the top and bottom of the boxes represent the middle 50% of outcomes. Interestingly, no rebalancing had far more positive and far fewer negative outliers (the red dots) than any of the rebalancing regimes.

Given where the averages line up, it doesn’t look like there are significant differences. Let’s run some t-tests for completeness.

Table 1: Aggregate p-values for simulation
Comparison            P-value
None vs. Months          0.87
None vs. Quarters        0.87
None vs. Years           0.88
Months vs. Quarters      0.97
Months vs. Years         0.95
Quarters vs. Years       0.97

As expected, the p-values are quite high, meaning that any differences in mean returns are likely due to chance.
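Concretely, for a single simulated portfolio the comparison is a two-sample t-test on the monthly return series of two regimes; the post then averages these p-values over all 1,000 simulations (rebal_test is the list of simulated portfolios built in the code at the end).

# p-value for "no rebalancing" vs. monthly rebalancing in one simulation
t.test(rebal_test[[1]]$none, rebal_test[[1]]$months)$p.value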

Now we’ll check on the number of times each rebalancing strategy outperforms the others.

A dramatic result! No rebalancing beat the other strategies a majority of the time, and less frequent rebalancing outperformed more frequent rebalancing most of the time too. Now for the crux: does rebalancing lead to better risk-adjusted returns as calculated by the Sharpe ratio?

Table 2: Sharpe ratios by rebalancing period
Period      Ratio
None         0.76
Months       0.74
Quarters     0.75
Years        0.76

Not much difference. Recall that from our previous simulations, no rebalancing actually generated a slightly worse Sharpe ratio by about 30-40 bps. But that result occurred less than 90% of the time, so it could be due to randomness. Let’s check the Sharpe ratios for the present simulation.
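For reference, the Sharpe ratios here are annualized from monthly returns with the risk-free rate implicitly set to zero, mirroring the calculation in the code at the end of the post:

# Annualized Sharpe ratio from monthly returns (risk-free rate taken as zero)
sharpe_ann <- function(r) mean(r) / sd(r) * sqrt(12)

# Applied to each rebalancing regime of one simulated portfolio:
# apply(rebal_test[[1]], 2, sharpe_ann)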

Table 3: Frequency of a better Sharpe ratio (%)
Periods                Occurrence
None vs. Months              60.2
None vs. Quarters            53.4
None vs. Years               48.3
Months vs. Quarters           3.7
Months vs. Years              8.2
Quarters vs. Years           22.1

No rebalancing generates a better Sharpe ratio a majority of the time, but not enough to conclude it isn’t due to chance. Interestingly, the frequency with which quarterly and yearly rebalancing produce better Sharpe ratios than monthly rebalancing looks significant. In both cases the frequency is greater than 90% of the time. That the lower frequency rebalancing outperforms the higher frequency likely plays a role in the significance of the Sharpe ratios, but is an area of investigation we’ll shelve for now.

Let’s move to the next simulation where we weight the portfolios 60/35/5 for stocks, bonds, and commodities. First, we show the boxplot of mean returns and range of outcomes.

Like the equal-weighted simulations, the means don't look that dissimilar, and no rebalancing generates more positive and fewer negative outliers than the other rebalancing regimes. We can say almost undoubtedly that the differences in average returns, if there are any, are likely due to chance. The p-values from the t-tests we show below should prove that.

Table 4: Aggregate p-values for simulation
Comparison            P-value
None vs. Months          0.91
None vs. Quarters        0.92
None vs. Years           0.92
Months vs. Quarters      0.98
Months vs. Years         0.97
Quarters vs. Years       0.98

Now let’s calculate and present the frequency of outperformance by rebalancing strategy.

No rebalancing outperforms again! Less frequent rebalancing outperforms more frequent. And risk-adjusted returns?

Table 5: Sharpe ratios by rebalancing period
Period      Ratio
None         0.66
Months       0.66
Quarters     0.67
Years        0.68

Here, no rebalancing performed slightly worse than rebalancing quarterly or yearly. How likely is it that these results are significant?

Table 6: Frequency of a better Sharpe ratio (%)
Periods                Occurrence
None vs. Months              47.8
None vs. Quarters            40.9
None vs. Years               35.5
Months vs. Quarters           2.2
Months vs. Years              6.4
Quarters vs. Years           21.3

Slightly worse than 50/50 for no rebalancing. But less frequent rebalancing appears to have the potential to produce higher risk-adjusted returns than more frequent rebalancing.

Let's briefly sum up what we've discovered thus far. Rebalancing does not seem to produce better risk-adjusted returns. If we threw in taxation and slippage, we suspect rebalancing would likely be a significant underperformer most of the time.

Does this mean you should never rebalance? No. You should definitely rebalance if you've got a crystal ball. Barring that, if your risk-return parameters change, then you should rebalance. But that would not be bringing the weights back to their original targets; rather, it would be moving them to new targets, which is an entirely separate case.

What do we see as the biggest criticism of the foregoing analysis? That it was a straw man argument. In practice, few professional investors rebalance because it’s July 29th or October 1st. Of course, there is quarter-end and year-end rebalancing, but those dates are usually coupled with a threshold. That is, only rebalance if the weights have exceeded some threshold, say five or ten percentage points from target. Analyzing the effects of only rebalancing based on thresholds would require more involved code on our part.3 Given the results thus far, we’re not convinced that rebalancing based on the thresholds would produce meaningfully better risk-adjusted returns.
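To make the idea concrete, here is a rough sketch of what a threshold rule could look like. This is not something we ran for this post; the function and the drift check are illustrative only, and taxes and slippage are still ignored.

# Sketch: rebalance only when any weight drifts more than `band` from target
threshold_rebalance <- function(returns, target, band = 0.05) {
  wts <- target
  port_ret <- numeric(nrow(returns))
  for (t in seq_len(nrow(returns))) {
    r <- as.numeric(returns[t, ])
    port_ret[t] <- sum(wts * r)
    wts <- wts * (1 + r) / (1 + port_ret[t])          # let weights drift
    if (max(abs(wts - target)) > band) wts <- target  # rebalance on breach
  }
  port_ret
}

# e.g. on the monthly return data frame built in the code below:
# threshold_rebalance(df[, 2:7], target = rep(1/6, 6), band = 0.05)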

However, rebalancing based on changes in risk-return constraints might do so. Modeling that would be difficult since we’d also have to model (or assume) new risk-return forecasts. But we could model the traditional shift recommended by financial advisors to clients as they age; that is, slowly shifting from high-risk to low-risk assets. In other words, lower the exposure to stocks and increase the exposure to bonds.
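The toy example below (and the appendix code) builds such a glide path across six assets; a simplified three-asset sketch of the same idea, with illustrative step sizes, is:

# Linear glide path over 21 annual rebalancing dates:
# 90/5/5 stocks/bonds/commodities at the start, 40/60/0 at the end
glide <- data.frame(
  stocks = seq(0.90, 0.40, length.out = 21),
  bonds  = seq(0.05, 0.60, length.out = 21),
  cmdty  = seq(0.05, 0.00, length.out = 21)
)
rowSums(glide)   # every row still sums to 1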

As a toy example, we use our data set to compare a no rebalancing strategy with an initial 60/35/5 split between stocks, bonds, and commodities to a yearly rebalancing strategy that starts at a 90/5/5 split and changes to a 40/60/0 split over the period. The Sharpe ratio for the rebalanced portfolio is actually a bit worse than the no rebalancing one. Mean returns are very close. Here’s the graph of the cumulative return.

This is clearly one example and highly time-dependent. But we see that rebalancing wasn’t altogether different than not, and we’re not including tax and slippage effects. To test this notion we’d have to run some more simulations, but that will be for another post.

We’ll end this post with a question for our readers. Are you convinced rebalancing doesn’t improve returns or do you think more analysis is required? Please send us your answer to nbw dot osm at gmail dot com. Until next time, here’s all the code behind the simulations, analyses, and charts.

## Load packageslibrary(tidyquant)library(tidyverse)### Load data## Stockssymbols <- c("VTSMX", "VGTSX", "VBMFX", "VTIBX")prices <- getSymbols(symbols, src = "yahoo",                     from = "1990-01-01",                     auto.assign = TRUE) %>%   map(~Ad(get(.))) %>%   reduce(merge) %>%   `colnames<-`(tolower(symbols))## Bonds# Source for bond indices:# https://fred.stlouisfed.org/categories/32413em_hg <- getSymbols("BAMLEMIBHGCRPITRIV",                         src = "FRED",                         from = "1990-01-01",                         auto.assign = FALSE)em_hg <- em_hg %>% na.locf()em_hy <- getSymbols("BAMLEMHBHYCRPITRIV",                       src = "FRED",                       from = "1990-01-01",                       auto.assign = FALSE)em_hy <- em_hy %>% na.locf()# Commodity data# Source for commodity data# https://www.investing.com/indices/sp-gsci-commodity-total-return-historical-data# Unfortunately, the data doesn't open into a separate link so you'll need to download it into a # csv file unless you're good a web scraping. We're not. Note too, the dates get a little funky # when being transferred into the csv, so you'll need to clean that up. Finally, the dates are # give as beginning of the month. But when we spot checked a few, they were actually end of the# month, which lines up with the other datacmdty <- read_csv("sp_gsci.csv")cmdty$Date <- as.Date(cmdty$Date,"%m/%d/%Y")cmd_price <- cmdty %>%   filter(Date >="1998-12-01", Date <="2019-12-31")## Mergedmerged <- merge(prices[,1:3], em_hg, em_hy)colnames(merged) <- c("us_stock", "intl_stock", "us_bond", "em_hg", "em_hy")merged <- merged["1998-12-31/2019-12-31"] %>% na.locf()merge_mon <- to.monthly(merged, indexAt = "lastof", OHLC = FALSE)merge_mon$cmdty <- cmd_price$Pricemerge_yr <- to.yearly(merge_mon, indexAt = "lastof", OHLC = FALSE)merge_ret <- ROC(merge_mon, type = "discrete") %>% na.omit()merge_ret_yr <- ROC(merge_yr, type = "discrete") %>% na.omit()## Data framedf <- data.frame(date = index(merge_ret), coredata(merge_ret))df_yr <- data.frame(date = index(merge_ret_yr), coredata(merge_ret_yr))### Block sampling## Create functionblock_samp <- function(dframe,block,cycles){  idx <- seq(1,block*cycles,block)    assets <- ncol(dframe)  size <- block*cycles    mat <- matrix(rep(0,assets*size), ncol = assets)    for(i in 1:cycles){        start <-sample(size,1)        if(start <= (size - block + 1)){      end <- start + block -1      len <- start:end    }else if(start > (size - block + 1) & start < size){      end <- size      step <- block - (end - start) - 1      if(step == 1){        adder <- 1      }else{        adder <- 1:step      }            len <- c(start:end, adder)    }else{      adder <-  1:(block - 1)      len <- c(start, adder)    }        mat[idx[i]:(idx[i]+block-1),] <- data.matrix(df[len,2:7])        }    mat}# Create 1000 samplesset.seed(123)block_list <- list()for(i in 1:1000){  block_list[[i]] <- block_samp(df[,2:7], 6, 42)}### Rebalancing on simulation## Create functionrebal_func <- function(port, wt, ...){    if(missing(wt)){    wt <- rep(1/ncol(port), ncol(port))  }else{    wt <- wt  }    port <- ts(port, start = c(1999,1), frequency = 12)    port_list <- list()  rebals = c("none","months", "quarters", "years")    for(pd in rebals){    if(pd == "none"){      port_list[[pd]] <- Return.portfolio(port, wt) %>%         `colnames<-`(pd)    }else{      port_list[[pd]] <- Return.portfolio(port, wt, rebalance_on = pd)%>%         `colnames<-`(pd)    }  }    port_r <- port_list %>%     bind_cols() %>% 
   data.frame()    port_r   }## Run function on simulations# Note this may take 10 minutes to run. We hope to figure out a way to speed this up in later# versions.rebal_test <- list()for(i in 1:1000){  rebal_test[[i]] <- rebal_func(block_list[[i]])}### Analyze results## Average resultsrebal_mean_df <- data.frame(none = rep(0,1000),                            monthly = rep(0,1000),                            quarterly = rep(0,1000),                            yearly = rep(0,1000))for(i in 1:1000){  rebal_mean_df[i,] <- colMeans(rebal_test[[i]]) %>% as.vector()}port_names <-  c("None", "Months", "Quarters", "Years")# Boxplot of reultsrebal_mean_df %>%   `colnames<-`(port_names) %>%   gather(key,value) %>%  mutate(key = factor(key, levels = port_names)) %>%   ggplot(aes(key,value*1200)) +   geom_boxplot(fill = "blue", color = "blue", outlier.colour = "red") +  stat_summary(geom = "crossbar", width=0.7, fatten=0, color="white",                fun.data = function(x){ return(c(y=mean(x), ymin=mean(x), ymax=mean(x))) })+  labs(x = "",       y = "Return (%)",       title = "Range of mean annualized returns by rebalancing period")## Find percentage of time one rebalancing period generates higher returns than another# Create means comparison functionfreq_comp <- function(df){  count <- 1  opf <- data.frame(comp = rep(0,6), prob = rep(0,6))  port_names <-  c("None", "Months", "Quarters", "Years")    for(i in 1:4){    for(j in 2:4){      if(i < j & count < 7){        opf[count,1] <- paste(port_names[i], " vs. ", port_names[j])        opf[count,2] <- mean(df[,i]) > mean(df[,j])        count <- count + 1      }    }  }  opf}# Aggregate function across simulationsprop_df <- matrix(rep(0,6000), nrow = 1000)for(i in 1:1000){  prop_df[i,] <- freq_comp(rebal_test[[i]])[,2]}long_names <- c()count <- 1for(i in 1:4){  for(j in 2:4){    if(i < j & count < 7){      long_names[count] <- paste(port_names[i], " vs. 
", port_names[j])      count <- count + 1    }  }}prop_df %>%   data.frame() %>%   summarize_all(mean) %>%   `colnames<-`(long_names) %>%   gather(key, value) %>%   mutate(key = factor(key, levels = long_names)) %>%   ggplot(aes(key,value*100)) +  geom_bar(stat = "identity", fill = "blue")+  labs(x= "",       y = "Frequency (%)",       title = "Number of times one rebalancing strategy outperforms another") +  geom_text(aes(label = value*100), nudge_y = 2.5)## Run t-test# Create functiont_test_func <- function(df){  count <-  1  t_tests <- c()    for(i in 1:4){    for(j in 2:4){      if(i < j & count < 7){        t_tests[count] <- t.test(df[,i],df[,j])$p.value        count <- count +1      }    }  }    t_tests}t_tests <- matrix(rep(0,6000), ncol = 6)for(i in 1:1000){  t_tests[i,] <- t_test_func(rebal_test[[i]])}t_tests <- t_tests %>%   data.frame() %>%   `colnames<-`(long_names)t_tests %>%   summarise_all(function(x) round(mean(x),2)) %>%   gather(Comparison, `P-value`) %>%   knitr::kable(caption = "Aggregate p-values for simulation")## Sharpe ratiossharpe <- matrix(rep(0,4000), ncol = 4)for(i in 1:1000){  sharpe[i,] <- apply(rebal_test[[i]], 2, mean)/apply(rebal_test[[i]], 2, sd) * sqrt(12)}sharpe <- sharpe %>%   data.frame() %>%   `colnames<-`(port_names)# Tablesharpe %>%   summarise_all(mean) %>%   gather(Period, Ratio) %>%  mutate(Ratio = round(Ratio,2)) %>%   knitr::kable(caption = "Sharpe ratios by rebalancing period")# Permutation test for sharpesharpe_t <- data.frame(Periods = names(t_tests), Occurence = rep(0,6))count <- 1for(i in 1:4){  for(j in 2:4){    if(i  sharpe[,j])      count <- count + 1    }  }}# tablesharpe_t %>%   knitr::kable(caption = "Frequency of better Sharpe ratio")## Rum simulation pt 2# This may take 10 minutes or so to run.wt1 <- c(0.30, 0.30, 0.2, 0.075, 0.075, 0.05)rebal_wt <- list()for(i in 1:1000){  rebal_wt[[i]] <- rebal_func(block_list[[i]], wt1)}## Average resultsrebal_wt_mean_df <- data.frame(none = rep(0,1000),                               monthly = rep(0,1000),                               quarterly = rep(0,1000),                               yearly = rep(0,1000))for(i in 1:1000){  rebal_wt_mean_df[i,] <- colMeans(rebal_test[[i]]) %>% as.vector()}# Boxplotrebal_wt_mean_df %>%   `colnames<-`(port_names) %>%   gather(key,value) %>%  mutate(key = factor(key, levels = port_names)) %>%   ggplot(aes(key,value*1200)) +   geom_boxplot(fill = "blue", color = "blue", outlier.colour = "red") +  stat_summary(geom = "crossbar", width=0.7, fatten=0, color="white",                fun.data = function(x){ return(c(y=mean(x), ymin=mean(x), ymax=mean(x))) })+  labs(x = "",       y = "Return (%)",       title = "Range of mean annualized returns by rebalancing period")## Find percentage of time one rebalancing period generates higher returns than another# Aggregate function across simulationsprop_wt_df <- matrix(rep(0,6000), nrow = 1000)for(i in 1:1000){  prop_wt_df[i,] <- freq_comp(rebal_wt[[i]])[,2]}prop_wt_df %>%   data.frame() %>%   summarize_all(mean) %>%   `colnames<-`(long_names) %>%   gather(key, value) %>%   mutate(key = factor(key, levels = long_names)) %>%   ggplot(aes(key,value*100)) +  geom_bar(stat = "identity", fill = "blue")+  labs(x= "",       y = "Frequency (%)",       title = "Number of times one rebalancing strategy outperforms another") +  geom_text(aes(label = value*100), nudge_y = 2.5)## Run t-testt_tests_wt <- matrix(rep(0,6000), ncol = 6)for(i in 1:1000){  t_tests_wt[i,] <- t_test_func(rebal_wt[[i]])}t_tests_wt <- t_tests_wt %>%   data.frame() 
%>%   `colnames<-`(long_names)t_tests_wt %>%   summarise_all(function(x) round(mean(x),2)) %>%   gather(Comparison, `P-value`) %>%   knitr::kable(caption = "Aggregate p-values for simulation")## Sharpe ratiossharpe_wt <- matrix(rep(0,4000), ncol = 4)for(i in 1:1000){  sharpe_wt[i,] <- apply(rebal_wt[[i]], 2, mean)/apply(rebal_wt[[i]],2, sd) * sqrt(12)}sharpe_wt <- sharpe_wt %>%   data.frame() %>%   `colnames<-`(port_names)# tablesharpe_wt %>%   summarise_all(mean) %>%   gather(Period, Ratio) %>%  mutate(Ratio = round(Ratio,2)) %>%   knitr::kable(caption = "Sharpe ratios by rebalancing period")# Permutation test for sharpesharpe_wt_t <- data.frame(Periods = names(t_tests_wt), Occurence = rep(0,6))count <- 1for(i in 1:4){  for(j in 2:4){    if(i  sharpe_wt[,j])      count <- count + 1    }  }}sharpe_wt_t %>%   mutate(Occurence = round(Occurence,3)*100) %>%   knitr::kable(caption = "Frequency of better Sharpe ratio (%)")# Create weight change data frameweights <- data.frame(us_stock = seq(.45,.2, -0.0125),                       intl_stock =seq(.45,.2, -0.0125),                      us_bond = seq(.025, .3, .01375),                      em_hg = seq(0.0125, .15, .006875),                      em_hy = seq(0.0125, .15, .006875),                      cmdty = seq(.05, 0, -0.0025))# Change in ts objectsyr_ts <- ts(df_yr[,2:7], start = c(1999,1), frequency = 1)wts_ts <- ts(weights, start=c(1998,1), frequency = 1)# Run portfolio rebalancingno_rebal <- Return.portfolio(yr_ts,wt1)rebal <- Return.portfolio(yr_ts,wts_ts)# Convert into data framerebal_yr <- data.frame(date = index(no_rebal), no_rebal = as.numeric(no_rebal),                       rebal = as.numeric(rebal))# Graphrebal_yr %>%   gather(key, value, -date) %>%  group_by(key) %>%   mutate(value = (cumprod(1+value)-1)*100) %>%   ggplot(aes(date, value, color = key)) +  geom_line() +  scale_color_manual("", labels = c("No rebalancing", "Rebalancing"),                     values = c("black", "blue"))+  labs(x="",       y="Return(%)",       title = "Rebalancing or not?")+  theme(legend.position = "top")

  1. We started experimenting with the cross-correlation function to see which lag had a higher or more significant correlation across the assets. But it became clear, quite quickly, that choosing a reasonable lag would take longer than the time we had allotted for this post. So we opted for the easy way out. Make a simplifying assumption! If anyone can point us to cross-correlation function studies that might apply to this scenario, please let us know.↩

  2. Please see this article for more detail.↩

  3. The Performance Analytics package in R doesn’t offer a rebalancing algorithm based on thresholds unless we missed it.↩


To leave a comment for the author, please follow the link and comment on their blog: R on OSM.


3 Free Resources to Learn R – Now Open


[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The coronavirus (COVID-19) is changing our living and working lives. Social distancing is the new norm for many, and each country is dealing with this situation differently. But, we are all in this together and we will get through this together.

I want you to know that during this challenging time, Business Science is offering free educational resources as a response to the coronavirus outbreak and social distancing measures. These are free courses and resources to help you learn R safely from your home while simultaneously reducing the financial burden of those being affected.

1. Free Jumpstart with R Course

I've opened Jumpstart with R to help you take your first steps learning Data Science for Business. Jumpstart will be open until Sunday evening to help manage the demand. Each offering gets 1000+ students, which is why I need to throttle enrollment to manage forum support.

I will offer more openings as well, but I encourage you to join with this cohort.

Join Jumpstart with R

2. Free Learning Labs

Learning Labs are webinars that teach intermediate and advanced topics on 2-week intervals. Live attendance is free of charge.

Join Learning Labs

3. NEW R-Tips Newsletter

As of this week, I have a brand new Newsletter called R-Tips Weekly. With the speed that R is changing and improving, you need this to stay ahead of the curve.

R-Tips and Learning Labs will really help you stay ahead of the curve as R continues its rapidly changing ecosystem of packages. There are over 15,000 publicly available R packages. You learn the most useful ones with techniques that get results.

Get R-Tips Weekly

Moving Forward

I will continue to share more resources and updates on my Business Science blog, Learning Labs, and social channels over the coming days and weeks. I invite you to reach out to me as needed via LinkedIn.

Stay safe during this time period. Things will get better. Until then, learn, grow, and be safe.

~ Matt Dancho, CEO and Founder of Business Science



Modeling pandemics (2)


[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]

When introducing the SIR model, in our initial post, we obtained a system of ordinary differential equations, but we did not really discuss stability and periodicity. That has to do with the Jacobian matrix of the system. First of all, we had three equations for three functions, but actually

$$\frac{dS}{dt}+\frac{dI}{dt}+\frac{dR}{dt}=0,$$

so our problem is really two-dimensional here. Hence

$$X=\frac{dS}{dt}=\mu(N-S)-\frac{\beta IS}{N},\qquad Y=\frac{dI}{dt}=\frac{\beta IS}{N}-(\mu+\gamma)I,$$

and therefore the Jacobian of the system is

$$\begin{pmatrix}\dfrac{\partial X}{\partial S}&\dfrac{\partial X}{\partial I}\\[6pt]\dfrac{\partial Y}{\partial S}&\dfrac{\partial Y}{\partial I}\end{pmatrix}=\begin{pmatrix}-\mu-\beta\dfrac{I}{N}&-\beta\dfrac{S}{N}\\[6pt]\beta\dfrac{I}{N}&\beta\dfrac{S}{N}-(\mu+\gamma)\end{pmatrix}.$$

We should evaluate this Jacobian at the equilibrium, i.e.

$$S^\star=\frac{\gamma+\mu}{\beta}=\frac{1}{R_0}\qquad\text{and}\qquad I^\star=\frac{\mu(R_0-1)}{\beta},$$

and then look at the eigenvalues of the resulting matrix.
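
For reference (this step is left implicit in the post), when the Jacobian at the equilibrium has complex conjugate eigenvalues \lambda=a\pm ib with a<0, the linearised dynamics near the equilibrium are damped oscillations:

$$\begin{pmatrix}S(t)-S^\star\\ I(t)-I^\star\end{pmatrix}\approx e^{at}\big(\mathbf{u}\cos(bt)+\mathbf{v}\sin(bt)\big),$$

for some constant vectors \mathbf{u},\mathbf{v} fixed by the initial conditions. Trajectories thus spiral into (S^\star,I^\star) with damping rate |a| and pseudo-period 2\pi/b. The eigenvalues solve \lambda^2-\mathrm{tr}(J)\,\lambda+\det(J)=0, and they are complex precisely when \mathrm{tr}(J)^2<4\det(J).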

Our very last example was

times     = seq(0, 100, by = .1)
p         = c(mu = 1/100, N = 1, beta = 50, gamma = 10)
start_SIR = c(S = 0.19, I = 0.01, R = 0.8)
resol     = ode(y = start_SIR, t = times, func = SIR, p = p)
plot(resol[, "time"], resol[, "I"], type = "l", xlab = "time", ylab = "")
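
Here ode() comes from the deSolve package, and the SIR() function was defined in the initial post. For completeness, a minimal sketch consistent with the equations above (the exact definition used in that post may differ slightly) would be:

library(deSolve)   # provides ode()

# SIR model with demography (birth/death rate mu), matching X and Y above
SIR = function(t, y, p) {
  with(as.list(c(y, p)), {
    dS = mu * (N - S) - beta * S * I / N
    dI = beta * S * I / N - (mu + gamma) * I
    dR = gamma * I - mu * R
    list(c(dS, dI, dR))
  })
}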

We can compute values at the equilibrium

mu = p["mu"]; beta = p["beta"]; gamma = p["gamma"]
N = 1
S = (gamma + mu)/beta
I = mu * (beta/(gamma + mu) - 1)/beta

and the Jacobian matrix

J = matrix(c(-(mu + beta * I/N), -(beta * S/N),
                beta * I/N,       beta * S/N - (mu + gamma)),
           2, 2, byrow = TRUE)

Now, if we look at the eigenvalues,

eigen(J)$values
[1] -0.024975+0.6318831i -0.024975-0.6318831i

or, more precisely, at 2\pi/b, where a \pm ib are the conjugate eigenvalues,

2 * pi/Im(eigen(J)$values[1])
[1] 9.943588

we have a damping period of about 10 time units (10 days, or 10 weeks, depending on the time scale), which is more or less what we’ve seen above.
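
The real part of the eigenvalues is informative too (this is not in the original post): it gives the damping rate, i.e. how fast successive outbreaks shrink.

lambda = eigen(J)$values[1]
c(period    = 2 * pi / Im(lambda),          # ~9.94 time units between outbreaks
  damping   = Re(lambda),                   # negative: the spiral converges
  half_life = log(2) / abs(Re(lambda)))     # time for the oscillation amplitude to halve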

The graph above was obtained using

p         = c(mu = 1/100, N = 1, beta = 50, gamma = 10)
start_SIR = c(S = 0.19, I = 0.01, R = 0.8)
# note: this run needs a longer time grid than above (e.g. times = seq(0, 1e4, by = .1)),
# otherwise resol does not have the 1e5 rows used below
resol = ode(y = start_SIR, t = times, func = SIR, p = p)
plot(resol[1:1e5, "time"], resol[1:1e5, "I"], type = "l",
     xlab = "time", ylab = "", lwd = 3, col = "red")
yi  = resol[, "I"]
dyi = diff(yi)
i   = which((dyi[2:length(dyi)] * dyi[1:(length(dyi) - 1)]) < 0)  # sign changes of dI/dt, i.e. local extrema
t   = resol[i, "time"]
arrows(t[2], .008, t[4], .008, length = .1, code = 3)  # one full period, between two extrema of the same type

If we look carefully, at the beginning the duration is (much) longer than 10 (about 13)… but it does converge towards 9.94.

plot(diff(t[seq(2, 40, by = 2)]), type = "b")
abline(h = 2 * pi/Im(eigen(J)$values[1]))

So here, theoretically, every 10 weeks (assuming that our time length is a week) we should observe an outbreak, each one smaller than the previous. In practice, it is initially every 12 or 13 weeks, but the waiting time between outbreaks decreases (until it reaches 10 weeks).
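
A slightly more direct way to see this (not in the original post) is to keep only the local maxima of I, i.e. the extrema that are preceded by an increase, and look at the gaps between successive outbreak peaks; reusing i, dyi and resol from the code above:

max_idx    = i[dyi[i] > 0]            # extrema preceded by an increase: local maxima of I
peak_times = resol[max_idx, "time"]
diff(peak_times)                      # intervals between successive outbreaks, decreasing towards ~9.94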


