
Demo Week: Tidy Forecasting with sweep


(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We’re into the third day of Business Science Demo Week. Hopefully by now you’re getting a taste of some interesting and useful packages. For those that may have missed it, every day this week we are demo-ing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today is sweep, which has broom-style tidiers for forecasting. Let’s get going!

Previous Demo Week Demos

sweep: What’s It Used For?

sweep is used for tidying the forecast package workflow. What broom is to the stats library, sweep is to the forecast package. It has useful functions including: sw_tidy, sw_glance, sw_augment, and sw_sweep. We’ll check out each in this demo.

An added benefit to sweep and timetk is if the ts-objects are created from time-based tibbles (tibbles with date or datetime index), the date or datetime information is carried through the forecasting process as a timetk index attribute. Bottom Line: This means we can finally use dates when forecasting as opposed to the regularly spaced numeric dates that the ts-system uses!

Demo Week: sweep

Load Libraries

We’ll need four libraries today:

  • sweep: For tidying the forecast package (like broom is to stats, sweep is to forecast)
  • forecast: Package that includes ARIMA, ETS, and other popular forecasting algorithms
  • tidyquant: For getting data and loading the tidyverse behind the scenes
  • timetk: Toolkit for working with time series in R. We’ll use it to coerce from tbl to ts.

If you don’t already have these packages installed, you can install them with install.packages().
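All four packages are on CRAN, so a single call covers them:

# Install the demo packages from CRAN (skip any you already have)
install.packages(c("sweep", "forecast", "tidyquant", "timetk"))

Then load the libraries as follows.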

# Load libraries
library(sweep)     # Broom-style tidiers for the forecast package
library(forecast)  # Forecasting models and predictions package
library(tidyquant) # Loads tidyverse, financial pkgs, used to get data
library(timetk)    # Functions working with time series

Data

We’ll use the same data as in the previous post where we used timetk to forecast with time series machine learning. We get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.

# Beer, Wine, Distilled Alcoholic Beverages, in Millions USD
beer_sales_tbl <- tq_get("S4248SM144NCEN", get = "economic.data",
                         from = "2010-01-01", to = "2016-12-31")
beer_sales_tbl

## # A tibble: 84 x 2
##          date price
##  1 2010-01-01  6558
##  2 2010-02-01  7481
##  3 2010-03-01  9475
##  4 2010-04-01  9424
##  5 2010-05-01  9351
##  6 2010-06-01 10552
##  7 2010-07-01  9077
##  8 2010-08-01  9273
##  9 2010-09-01  9420
## 10 2010-10-01  9413
## # ... with 74 more rows

It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting (as we see during time series machine learning). We’ll use tidyquant charting tools: mainly geom_ma(ma_fun = SMA, n = 12) to add a 12-period simple moving average to get an idea of the trend. We can also see there appears to be both trend (moving average is increasing in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur at specific months).

# Plot Beer Sales
beer_sales_tbl %>%
    ggplot(aes(date, price)) +
    geom_line(col = palette_light()[1]) +
    geom_point(col = palette_light()[1]) +
    geom_ma(ma_fun = SMA, n = 12, size = 1) +
    theme_tq() +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    labs(title = "Beer Sales: 2010 through 2016")

(Plot: Beer Sales 2010 through 2016, with 12-month moving average)

Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!

DEMO: Tidy forecasting with forecast + sweep

We’ll use the combination of forecast and sweep to perform tidy forecasting.

Key Insight:

Forecasting with the forecast package is a non-tidy process that involves ts class objects. We have seen this pattern before: non-tidy objects that we would like to “tidy”. For the stats library we have broom, which tidies models and predictions. For the forecast package we now have sweep, which tidies models and forecasts.

Objective: We’ll work through an ARIMA analysis to forecast the next 12 months of time series data.

Step 1: Create ts object

Use timetk::tk_ts() to convert from tbl to ts. From the previous post, we learned that this has two benefits:

  1. It’s a consistent method to convert to and from ts.
  2. The ts-object contains a timetk_idx (timetk index) as an attribute, which is the original time-based index.

Here’s how to convert. Remember that ts-objects are regular time series so we need to specify a start and a freq.

# Convert from tbl to ts
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, freq = 12)
beer_sales_ts

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2010  6558  7481  9475  9424  9351 10552  9077  9273  9420  9413  9866 11455
## 2011  6901  8014  9833  9281  9967 11344  9106 10468 10085  9612 10328 11483
## 2012  7486  8641  9709  9423 11342 11274  9845 11163  9532 10754 10953 11922
## 2013  8395  8888 10109 10493 12217 11385 11186 11462 10494 11541 11139 12709
## 2014  8559  9061 10058 10979 11794 11906 10966 10981 10827 11815 10466 13303
## 2015  8398  9061 10720 11105 11505 12903 11866 11223 12023 11986 11510 14190
## 2016  8540 10158 11879 11155 11916 13291 10540 12212 11786 11424 12482 13832

We can check that the ts-object has a timetk_idx.

# Check that ts-object has a timetk index
has_timetk_idx(beer_sales_ts)
## [1] TRUE

Great. This will be important when we use sw_sweep() later. Next, we’ll model using ARIMA.

Step 2A: Model using ARIMA

We can use the auto.arima() function from the forecast package to model the time series.

# Model using auto.arima
fit_arima <- auto.arima(beer_sales_ts)
fit_arima

## Series: beer_sales_ts 
## ARIMA(3,0,0)(1,1,0)[12] with drift 
## 
## Coefficients:
##           ar1     ar2     ar3     sar1    drift
##       -0.2498  0.1079  0.6210  -0.2817  32.1157
## s.e.   0.0933  0.0982  0.0925   0.1333   5.8882
## 
## sigma^2 estimated as 175282:  log likelihood=-535.49
## AIC=1082.97   AICc=1084.27   BIC=1096.63

Step 2B: Tidy the Model

Like broom tidies the stats package, we can use sweep functions to tidy the ARIMA model. Let’s examine three tidiers, which enable tidy model evaluation:

  • sw_tidy(): Used to retrieve the model coefficients
  • sw_glance(): Used to retrieve model description and training set accuracy metrics
  • sw_augment(): Used to get model residuals

sw_tidy

The sw_tidy() function returns the model coefficients in a tibble (tidy data frame).

# sw_tidy - Get model coefficients
sw_tidy(fit_arima)

## # A tibble: 5 x 2
##    term   estimate
## 1   ar1 -0.2497937
## 2   ar2  0.1079269
## 3   ar3  0.6210345
## 4  sar1 -0.2816877
## 5 drift 32.1157478

sw_glance

The sw_glance() function returns the training set accuracy measures in a tibble (tidy data frame). We use glimpse to aid in quickly reviewing the model metrics.

# sw_glance - Get model description and training set accuracy measures
sw_glance(fit_arima) %>%
    glimpse()

## Observations: 1
## Variables: 12
## $ model.desc  "ARIMA(3,0,0)(1,1,0)[12] with drift"
## $ sigma       418.6665
## $ logLik      -535.4873
## $ AIC         1082.975
## $ BIC         1096.635
## $ ME          1.189875
## $ RMSE        373.9091
## $ MAE         271.7068
## $ MPE         -0.06716239
## $ MAPE        2.526077
## $ MASE        0.4989005
## $ ACF1        0.02215405

sw_augment

The sw_augment() function helps with model evaluation. We get the “.actual”, “.fitted” and “.resid” columns, which are useful in evaluating the model against the training data. Note that we can pass timetk_idx = TRUE to return the original date index.

# sw_augment - get model residuals
sw_augment(fit_arima, timetk_idx = TRUE)

## # A tibble: 84 x 4
##         index .actual   .fitted    .resid
##  1 2010-01-01    6558  6551.474  6.525878
##  2 2010-02-01    7481  7473.583  7.416765
##  3 2010-03-01    9475  9465.621  9.378648
##  4 2010-04-01    9424  9414.704  9.295526
##  5 2010-05-01    9351  9341.810  9.190414
##  6 2010-06-01   10552 10541.641 10.359293
##  7 2010-07-01    9077  9068.148  8.852178
##  8 2010-08-01    9273  9263.984  9.016063
##  9 2010-09-01    9420  9410.869  9.130943
## 10 2010-10-01    9413  9403.908  9.091831
## # ... with 74 more rows

We can visualize the residual diagnostics for the training data to make sure there is no pattern leftover.

# Plotting residuals
sw_augment(fit_arima, timetk_idx = TRUE) %>%
    ggplot(aes(x = index, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") +
    labs(title = "Residual diagnostic") +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    theme_tq()

(Plot: Residual diagnostic)

Step 3: Make a Forecast

Make a forecast using the forecast() function.

# Forecast next 12 months
fcast_arima <- forecast(fit_arima, h = 12)

One problem is the forecast output is not “tidy”. We need it in a data frame if we want to work with it using the tidyverse functionality. The class is “forecast”, which is a ts-based-object (its contents are ts-objects).

class(fcast_arima)
## [1] "forecast"

Step 4: Tidy the Forecast with sweep

We can use sw_sweep() to tidy the forecast output. As an added benefit, if the forecast-object has a timetk index, we can use it to return a date/datetime index as opposed to regular index from the ts-based-object.

First, let’s check that the forecast object has a timetk index. Since it does, we can use the timetk_idx argument when we apply sw_sweep().

# Check if object has timetk index
has_timetk_idx(fcast_arima)
## [1] TRUE

Now, use sw_sweep() to tidy the forecast output. Internally it projects a future time series index based on “timetk_idx” that is an attribute (this all happens because we created the ts-object originally with tk_ts() in Step 1). Bottom Line: This means we can finally use dates with the forecast package (as opposed to the regularly spaced numeric index that the ts-system uses)!!!

# sw_sweep - tidies forecast output
fcast_tbl <- sw_sweep(fcast_arima, timetk_idx = TRUE)
fcast_tbl

## # A tibble: 96 x 7
##         index    key price lo.80 lo.95 hi.80 hi.95
##  1 2010-01-01 actual  6558    NA    NA    NA    NA
##  2 2010-02-01 actual  7481    NA    NA    NA    NA
##  3 2010-03-01 actual  9475    NA    NA    NA    NA
##  4 2010-04-01 actual  9424    NA    NA    NA    NA
##  5 2010-05-01 actual  9351    NA    NA    NA    NA
##  6 2010-06-01 actual 10552    NA    NA    NA    NA
##  7 2010-07-01 actual  9077    NA    NA    NA    NA
##  8 2010-08-01 actual  9273    NA    NA    NA    NA
##  9 2010-09-01 actual  9420    NA    NA    NA    NA
## 10 2010-10-01 actual  9413    NA    NA    NA    NA
## # ... with 86 more rows

Step 5: Compare Actuals vs Predictions

We can use tq_get() to retrieve the actual data. Note that we don’t have all of the data for comparison, but we can at least compare the first several months of actual values.

actuals_tbl <- tq_get("S4248SM144NCEN", get = "economic.data",
                      from = "2017-01-01", to = "2017-12-31")

Notice that we have the entire forecast in a tibble. We can now more easily visualize the forecast.

# Visualize the forecast with ggplot
fcast_tbl %>%
    ggplot(aes(x = index, y = price, color = key)) +
    # 95% CI
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95),
                fill = "#D5DBFF", color = NA, size = 0) +
    # 80% CI
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key),
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    # Prediction
    geom_line() +
    geom_point() +
    # Actuals
    geom_line(aes(x = date, y = price), color = palette_light()[[1]], data = actuals_tbl) +
    geom_point(aes(x = date, y = price), color = palette_light()[[1]], data = actuals_tbl) +
    # Aesthetics
    labs(title = "Beer Sales Forecast: ARIMA", x = "", y = "Millions USD",
         subtitle = "sw_sweep tidies the auto.arima() forecast output") +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    scale_color_tq() +
    scale_fill_tq() +
    theme_tq()

(Plot: Beer Sales Forecast: ARIMA, forecast with 80%/95% intervals and actuals overlaid)

We can investigate the error on our test set (actuals vs predictions).

# Investigate test error
error_tbl <- left_join(actuals_tbl, fcast_tbl, by = c("date" = "index")) %>%
    rename(actual = price.x, pred = price.y) %>%
    select(date, actual, pred) %>%
    mutate(
        error     = actual - pred,
        error_pct = error / actual
    )
error_tbl

## # A tibble: 8 x 5
##         date actual      pred      error    error_pct
## 1 2017-01-01   8664  8601.815   62.18469  0.007177365
## 2 2017-02-01  10017 10855.429 -838.42908 -0.083700617
## 3 2017-03-01  11960 11502.214  457.78622  0.038276439
## 4 2017-04-01  11019 11582.600 -563.59962 -0.051147982
## 5 2017-05-01  12971 12566.765  404.23491  0.031164514
## 6 2017-06-01  14113 13263.918  849.08191  0.060163106
## 7 2017-07-01  10928 11507.277 -579.27693 -0.053008504
## 8 2017-08-01  12788 12527.278  260.72219  0.020388035

And we can calculate a few residual metrics. The MAPE is approximately 4.3%, which is slightly better than the simple linear regression from the timetk demo. Note that the RMSE is slightly worse.

# Calculate test error metrics
test_residuals <- error_tbl$error
test_error_pct <- error_tbl$error_pct * 100 # Percentage error

me   <- mean(test_residuals, na.rm = TRUE)
rmse <- mean(test_residuals^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals), na.rm = TRUE)
mape <- mean(abs(test_error_pct), na.rm = TRUE)
mpe  <- mean(test_error_pct, na.rm = TRUE)

tibble(me, rmse, mae, mape, mpe) %>% glimpse()

## Observations: 1
## Variables: 5
## $ me    6.588034
## $ rmse  561.4631
## $ mae   501.9144
## $ mape  4.312832
## $ mpe   -0.3835956

Next Steps

The sweep package is very useful for tidying the forecast package output. This demo showed some of the basics. Interested readers should check out the documentation, which goes into expanded detail on scaling analysis by groups and using multiple forecast models.
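To give a flavor of what scaling by groups looks like, here is a rough sketch (not code from this post: the tibble sales_by_group_tbl and its group, date, and price columns are hypothetical) that combines tidyr nesting, purrr::map(), and sw_sweep() to fit and tidy one model per group:

# Hypothetical grouped data: one monthly time series of prices per group
library(tidyverse)  # purrr::map(), tidyr::nest()/unnest()
library(timetk)
library(forecast)
library(sweep)

fcast_by_group_tbl <- sales_by_group_tbl %>%
    group_by(group) %>%
    nest() %>%
    mutate(
        data_ts = map(data, tk_ts, start = 2010, freq = 12),  # tbl -> ts for each group
        fit     = map(data_ts, auto.arima),                   # one ARIMA model per group
        fcast   = map(fit, forecast, h = 12),                 # 12-month forecast per group
        tidied  = map(fcast, sw_sweep, timetk_idx = TRUE)     # tidy each forecast
    ) %>%
    select(group, tidied) %>%
    unnest(tidied)

The sweep documentation covers this pattern, including combining multiple forecast models in the same nested workflow.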

Announcements

We have a busy couple of weeks. In addition to Demo Week, we have:

DataTalk

On Thursday, October 26 at 7PM EST, Matt will be giving a FREE LIVE #DataTalk on Machine Learning for Recruitment and Reducing Employee Attrition. You can sign up for a reminder at the Experian Data Lab website.

#MachineLearning for Reducing Employee Attrition @BizScienc https://t.co/vlxmjWzKCL #ML #AI #HR #IoT #DL #BigData #Tech #Cloud #Jobs pic.twitter.com/dF5Znf10Sk

— Experian DataLab (@ExperianDataLab) October 18, 2017

EARL

On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.

😀Hey #rstats. I'll be presenting @earlconf on #MachineLearning applications in #HumanResources. Get 15% off tickets: https://t.co/b6JUQ6BSTl

— Matt Dancho (@mdancho84) October 11, 2017

Courses

Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:

The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Follow Business Science on Social Media



September 2017 New Package Picks


(This article was first published on R Views, and kindly contributed to R-bloggers)

There were so many interesting ideas among the 222 new packages that made it to CRAN in September that I found it exceptionally difficult to decide on the “Top 40” packages. In the end, I only managed to limit my selection to 40 by avoiding all packages that I would normally classify under “Data”: packages that are primarily intended to provide access to some data source. I hope to make up for this by providing a list of data packages sometime soon.

Below are my picks for September’s Top 40 in six categories: Computational Methods, Machine Learning, Science, Statistics, Utilities, and Visualizations.

Computational Methods

DES v1.0.0: Implements an event-oriented approach to Discrete Event Simulation. There is a tutorial.

JuliaCall 0.9.3: Implements an interface to Julia. The vignette illustrates basic usage.

Rlinsolve v0.1.1: Implements iterative solvers for sparse linear systems of equations, including basic stationary iterative solvers using Jacobi, Gauss-Seidel, Successive Over-Relaxation and SSOR methods and non-stationary, Krylov subspace methods. There is a vignette to get started. Detailed descriptions may be found in the SIAM book.

sdpt3r v0.1: Implements the SDPT3 method of Toh, Todd, and Tutuncu to solve Semi-Definite Linear Programming problems. There are several vignettes illustrating the use of the package in various applications, including D-Optimal Experimental Design and Distance Weighted Discrimination.

VeryLargeIntegers v0.1.4: Provides tools to work with arbitrarily large integers without loss of precision.

Machine Learning

bnclassify v0.3.3: Implements algorithms for learning discrete Bayesian network classifiers from data, including a number of those described in Bielza & Larranaga. There is an Introduction and vignettes giving Runtime Information and additional Technical Information.

DMRnet v0.1.0: Provides model selection algorithms for regression and classification, where the predictors can be numerical and categorical and the number of regressors exceeds the number of observations. See the papers by Maj-Kańska et al. and Pokarowski and Mielniczuk for the mathematical details.

ELMSurv v0.4: Implements an Extreme Learning Machine for Survival Analysis. Look here for details and here to get started.

fastrtext v0.2.1: Provides an interface to Facebook’s fastText library for text representation and classification. There is a List of Commands and vignettes on Supervised and Unsupervised learning.

FSelectorRcpp v0.1.8: Provides an Rcpp-based implementation of the FSelector entropy-based feature selection algorithms, based on multi-interval discretization, with sparse matrix support. There are vignettes on Getting Started and Benchmarks.

googleLanguageR v0.1.0: Provides an interface to Google Cloud machine-learning APIs for text and speech tasks. Call the Cloud Translation API for detection and translation of text, the Natural Language API to analyse text for sentiment, entities or syntax, and the Cloud Speech API to transcribe sound files to text. There is an Introduction and vignettes for the NLP, Speech, and Translation APIs.

leabRa v0.1.0: Implements the Leabra (local, error-driven and associative, biologically realistic algorithm) that allows for the construction of artificial neural networks that are biologically realistic, and balances supervised and unsupervised learning within a single framework. See the vignette to get started and look here for details.

lime v0.3.0: Is a port of the Python package, which attempts to explain the outcome of black-box models by fitting local models around the points of interest. Look here for details. There is a vignette to get you started.

slowraker v0.1.0: Implements the RAKE algorithm, which can be used to extract keywords from documents without any training data. There is a Getting Started vignette and a list of FAQs.

udpipe v0.1.1: Provides a natural-language-processing toolkit for tokenization, parts-of-speech tagging, lemmatization, and dependency parsing of raw text. For details, see this paper and the vignettes on Annotating Text and Model Building.

Science

afpt v1.0.0: Implements the aerodynamic power model described in Klein Heerenbrink et al., and allows estimation and modelling of flight costs in vertebrate animal flight. There are vignettes on Basic Usage, the underlying Aerodynamic Model, and Multiple Birds.

soundgen v1.1.0: Provides tools for sound synthesis and acoustic analysis. There are vignettes on Acoustic Analysis and Sound Generation.

Statistics

cr17 v0.1.0: Provides tools for analyzing competing-risks models, including testing differences between groups (Gray and Fine and Gray) and visualizations of survival and cumulative incidence curves. The vignette gives examples.

EAinference v0.2.1: Provides estimator augmentation methods for statistical inference on high-dimensional data, as described in Zhou and in Zhou and Min. The vignette describes how to use the package.

fdANOVA v0.1.0: Provides functions to perform analysis of variance testing procedures for univariate and multivariate functional data. See Cuesta-Albertos and Febrero-Bande. There is a comprehensive vignette.

geex v1.0.3: Provides a general, flexible framework for estimating parameters and the empirical sandwich variance estimator from a set of unbiased estimating equations (M-estimation in the sense of Stefanski & Boos). There is an Introduction, as well as vignettes on M-estimation, Custom root solvers, Parameter Estimation, Software Design, and more.

mosaicModel v0.3.0: Provides functions for evaluating, displaying, and interpreting statistical models with the goal of abstracting the operations on models from the particular architecture of the model. The vignette shows how to use the package.

odr v0.3.2: Provides methods for calculating the optimal sample allocation that minimizes variance of treatment effects in a multilevel randomized trial under fixed budget and cost structure, and for performing power analyses with and without accommodating costs and budget. There is a vignette.

mvord v0.1.0: Provides a flexible framework for fitting multivariate ordinal regression models with composite likelihood methods. The vignette gives the details.

OutliersO3 v0.2.1: Provides methods for identifying potential outliers for all combinations of a dataset’s variables. The vignette shows how to use the package.

powerlmm v0.1.0: Implements both analytical and simulation methods to calculate power for two- and three-level multilevel longitudinal studies with missing data. The analytical calculations extend the method described in Galbraith et al. to three-level models. There are tutorials on Model Evaluation via Monte Carlo Simulation, Two-level Longitudinal Power Analysis, Three-level Longitudinal Power Analysis, and a vignette on the Details of Power Calculations.

randnet v0.1: Facilitates model-selection and parameter-tuning procedures for a class of random network models. Model selection can be done with general cross-validation frameworks (ECV and NCV), a likelihood ratio method, and spectral methods.

threshr v1.0.0: Provides functions for the selection of thresholds for use in extreme value models, based mainly on the methodology in Northrop, Attalides and Jonathan. There is a vignette.

tscount v1.4.0: Implements likelihood-based methods for model fitting and assessment, prediction, and intervention analysis of count time series following generalized linear models. The vignette provides the details.

Utilities

basictabler v0.1.0: Provides functions to create tables from data frames and matrices, manipulate tables row-by-row, column-by-column or cell-by-cell, and then publish them using HTML, HTML widgets or Excel. There is an Introduction and vignettes on Working with Cells, Outputs, Styling, Formatting, Shiny, and Excel.

bigstatsr v0.2.2: Uses file-backed matrices to provide scalable statistical tools.

keyring v1.0.0: Provides a platform-independent API to access the operating system’s credential store. It currently supports the Keychain on macOS, the Credential Store on Windows, the Secret Service API on Linux, and a simple, platform-independent store implemented with environment variables.
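As a rough sketch of the workflow this enables (the service and user names below are just placeholders), a secret can be stored once and retrieved later without ever appearing in plain text in a script:

library(keyring)

key_set("my-database", username = "analyst")        # prompts for the secret and stores it
pwd <- key_get("my-database", username = "analyst") # retrieve it later in any session
key_list()                                          # list what is currently stored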

pinp v0.0.2: Offers a PNAS-like style for rmarkdown derived from the Proceedings of the National Academy of Sciences of the United States of America. The vignette shows how to get started.

re2r v0.2.0: Provides an interface to Google’s deterministic finite-automaton-based regular expression engine that is very fast at matching large amounts of text. There is an Introduction and a vignette on Syntax.

spiderbar v0.2.0: Provides a wrapper for the rep-cpp C++ library for processing robots.txt files in accordance with the Robots Exclusion Protocol, a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Look in the README for an example of how to use the package.

tibbletime v0.0.2: Is an extension of the tibble package that allows for the creation of time-aware tibbles. Some immediate advantages include the ability to perform time-based subsetting on tibbles, to quickly summarise and aggregate results by time period, and to call functions similar in spirit to the map family from purrr on time-based tibbles. There is an Introduction and vignettes on Time-based Filtering, Changing Periodicity, and Rolling Calculations.

Visualizations

egg v0.2.0: Provides miscellaneous functions to customize ggplot2 plots, including high-level functions to post-process layouts and allow alignment between plot panels, as well as setting panel sizes to fixed values. There is an Overview and a vignette for laying out multiple plots on a page.

ggridges v0.4.1: Extends ggplot2 to enable ridgeline plots, which are a way of visualizing changes in distributions over time or space. There is an introduction and a gallery of examples.
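For a feel of the package, a minimal ridgeline plot (using a built-in ggplot2 dataset purely for illustration) looks something like this:

library(ggplot2)
library(ggridges)

# One density ridge per cut quality, stacked vertically
ggplot(diamonds, aes(x = price, y = cut)) +
    geom_density_ridges()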

linemap v0.1.0: Provides functions to create maps from lines. The README file shows examples.


create meme in R


(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

I developed a tiny toy package, meme, which is now on CRAN. As its name indicates, it was designed to create memes: captioned photos that are intended to be funny or ridiculous.

meme()

The package is quite simple. You can use the meme() function to add meme captions, and that is all the package is supposed to do:

library(meme)

u <- "http://www.happyfamilyneeds.com/wp-content/uploads/2017/08/angry8.jpg"
meme(u, "code", "all the things!")

the grammar 🙈

The meme package was implemented using the grid graphics system. Since grid is the most flexible graphics system in R, I tried to mimic ggplot2 (although very superficially) for practice.

Users can call mmplot() to read and plot the input image, and then use + mm_caption() to add meme captions.

mmplot(u) + mm_caption("calm down", "and RTFM", color="purple")

meme_save()

The meme output can be saved as an object and exported to a file using ggsave(). Since we would like to keep the original figure’s aspect ratio in the output meme, I provide a helper function, meme_save(), which takes care of the aspect ratio and then calls ggsave() to export the figure.

u2 <- "http://i0.kym-cdn.com/entries/icons/mobile/000/000/745/success.jpg"
x <- meme(u2, "please", "tell me more")
meme_save(x, file = "docs/Figs/meme.png")

the plot method

Users can plot the meme() output and change the caption or other parameters in real time.

plot(x, size = 2, "happy friday!", "wait, sorry, it's monday", color = "firebrick", font = "Courier")

the + method

Instead of passing parameters to plot() explicitly, users can use + aes() to set the plot parameters:

x + aes(upper = "#barbarplots",        lower = "friends don't let friends make bar plots",        color = firebrick, font = Courier, size=1.5)

or using + list(). The following command will also generate the figure displayed above.

x + list(upper = "#barbarplots",        lower = "friends don't let friends make bar plots",        color = "firebrick", font = "Courier", size=1.5)

multi-language support

I didn’t have to do anything special for this: multi-language support works internally. Simply select a font for your language.

y <- meme(u, "卧槽", "听说你想用中文", font = "STHeiti")
y


As the meme package was developed using grid, it makes sense to provide a function to convert the output object to a grob. Similar to ggplotGrob() for ggplot objects, I provide memeGrob() for meme objects, which makes it possible to edit the details of the graph and stay compatible with the grid ecosystem.
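A small sketch of what that can look like, based only on the description above (the viewport size here is arbitrary):

library(meme)
library(grid)

u <- "http://www.happyfamilyneeds.com/wp-content/uploads/2017/08/angry8.jpg"
x <- meme(u, "code", "all the things!")

grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5)) # draw into a smaller region of the page
grid.draw(memeGrob(x))                            # meme object -> grob, drawn with grid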

Here are the examples of using meme in grid, ggplot2 and cowplot.

grid support

library(grid)

mm <- meme(u, "code", "all the things!", size = .3, color = 'firebrick')

grid.newpage()
pushViewport(viewport(width = .9, height = .9))
grid.rect(gp = gpar(lty = "dashed"))

xx <- seq(0, 2*pi, length.out = 10)
yy <- sin(xx)

for (i in seq_along(xx)) {
    vp <- viewport(x = xx[i]/(2*pi), y = (yy[i] - min(yy))/2, width = .05, height = .05)
    print(mm, vp = vp)
}

ggplot2 support

library(ggplot2)
library(ggimage)

d <- data.frame(x = xx, y = yy)

ggplot(d, aes(x, y)) + geom_line() +
    geom_subview(mm, x = xx, y = yy, width = .3, height = .15)

ggplot(d, aes(x, y)) +
    geom_subview(mm + aes(size = 3), x = 0, y = 0, width = Inf, height = Inf) +
    geom_point() + geom_line()

cowplot support

cowplot::plot_grid(x, y, ncol=1, labels = c("A", "B"))


Robust IRIS Dataset?


(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

This blog post was born out of pure curiosity about the robustness of the IRIS Dataset. Biological datasets do not need to be that big in comparison to datasets of customers, consumption, stock and anything that might be volatile.

I remember one occasion back at university when we were measuring the length of frog legs and other froggy features. After just a couple of measurements, further predictions were steady. Moreover, any kind of sampling (RS and SRS, cluster/stratified sampling, sampling with replacement and many other creative sampling schemes) proved to be rigid and robust, and would converge quickly to a good result.

Therefore, I decided to put the IRIS dataset to the test using a simple classification method: first calculate the simple Euclidean distance, then find the nearest neighbour, and based on that check the membership of the flower type against the labels.

Accuracy was measured by comparing the original species against the predicted ones. The question was: how small can the training dataset be while still giving a good result?
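For readers who want to try the same idea directly in R (the author’s classification code below is in Python), here is a minimal sketch of a single run: take a small training subset, label each remaining flower by its nearest training neighbour in Euclidean distance, and measure the accuracy.

# One run of the experiment with a 10% training subset (a sketch, not the author's code)
set.seed(42)
train_frac <- 0.10
idx   <- sample(nrow(iris), size = round(train_frac * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

pred <- apply(test[, 1:4], 1, function(flower) {
    d <- sqrt(colSums((t(train[, 1:4]) - flower)^2))  # Euclidean distance to each training flower
    as.character(train$Species[which.min(d)])         # label of the nearest neighbour
})

mean(pred == test$Species)  # share of correctly predicted species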

After some Python and R code, the results were in.

I have tested following pairs (train:test sample size):

  • 80% – 20%
  • 60% – 40%
  • 50% – 50%
  • 30% – 70%
  • 10% – 90%

Note that the IRIS dataset has 150 observations, evenly distributed among three species. The following Python code loops through the classification script (which computes the Euclidean distances) 3,000 times:

for x in range(3000):
    exec(open("./classification.py").read(), globals())
    x += 1

At the end I have generated the file:

(Screenshot: the generated predictions file)

With these results, simple R code to generate the scatter plot was used:

library(ggplot2)

setwd("C:\\Predictions\\")
df_pred <- data.frame(read.table("results_split.txt", sep=";"))

p <- ggplot(df_pred, aes(df_pred$V3, df_pred$V1))
p <- p + geom_point(aes(df_pred$V3))
p <- p + labs(x="Accuracy (%) of predictions", y="Size of training subset")
p

Which resulted as:

(Scatter plot: accuracy (%) of predictions vs. size of training subset)

The graph clearly shows that a 10% training set (15 out of 150 observations) can generate very accurate predictions roughly every 1.35 runs.

Other pairs, taking 30% or 50% of the data for training, give close to 100% accuracy almost every time.

Snippet of Python code to compute the Euclidean distance:

import math

def eucl_dist(set1, set2, length):
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

and neighbours:

import operator

def find_neighbors(train, test, nof_class):
    distances = []
    length_dist = len(test) - 1
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour

 

In conclusion, the IRIS dataset is – due to the nature of its measurements and observations – robust and rigid; one can get very good accuracy results on a small training set. For this particular case, anything beyond 30% of the data for training the model is just additional overhead.

Happy R & Python coding! 🙂


Two upcoming webinars


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Two new Microsoft webinars are taking place over the next week that may be of interest:

AI Development in Azure using Data Science Virtual Machines

The Azure Data Science Virtual Machine (DSVM) provides a comprehensive development and production environment to Data Scientists and AI-savvy developers. DSVMs are specialized virtual machine images that have been curated, configured, tested and heavily used by Microsoft engineers and data scientists. DSVM is an integral part of the Microsoft AI Platform and is available for customers to use through the Microsoft Azure cloud. In this session, we will first introduce DSVM, familiarize attendees with the product, including our newest offering, namely Deep Learning Virtual Machines (DLVMs). That will be followed by technical deep-dives into samples of end-to-end AI development and deployment scenarios that involve deep learning. We will also cover scenarios involving cloud based scale-out and parallelization.

This webinar runs from 10-11AM Pacific Time on Thursday October 26, and is presented by Gopi Kumar, Principal Program Manager, Paul Shealy, Senior Software Engineer, and Barnam Bora, Program Manager, at Microsoft. Register here.

Document Collection Analysis

With the extremely large volumes of data, especially unstructured text data, that are being collected every day, a huge challenge facing customers is the need for tools and techniques to organize, search, and understand this vast quantity of text. This webinar demonstrates an efficient and automated end-to-end workflow around analyzing large document collections and serving your downstream NLP tasks. We’ll demonstrate how to summarize and analyze a large collection of documents, including techniques such as phrase learning, topic modeling, and topic model analysis using the Azure Machine Learning Workbench.

This webinar runs from 10-11AM Pacific Time on Tuesday October 31, and will be presented by Ke Huang, Data Scientist at Microsoft. Register here.


Easy peasy STATA-like marginal effects with R


(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Model interpretation is essential in the social sciences. If one wants to know the effect of variable x on the dependent variable y, marginal effects are an easy way to get the answer. STATA includes a margins command that has been ported to R by Thomas J. Leeper of the London School of Economics and Political Science. You can find the source code of the package on github. In this short blog post, I demo some of the functionality of margins.

First, let’s load some packages:

library(ggplot2)
library(tibble)
library(broom)
library(margins)
library(Ecdat)

As an example, we are going to use the Participation data from the Ecdat package:

data(Participation)
?Participation
Labor Force Participation

Description: a cross-section of 872 observations (individuals) from Switzerland.

Usage: data(Participation)

Format: a dataframe containing
  • lfp: labour force participation?
  • lnnlinc: the log of nonlabour income
  • age: age in years divided by 10
  • educ: years of formal education
  • nyc: the number of young children (younger than 7)
  • noc: number of older children
  • foreign: foreigner?

Source: Gerfin, Michael (1996) “Parametric and semiparametric estimation of the binary response”, Journal of Applied Econometrics, 11(3), 321-340.

References: Davidson, R. and James G. MacKinnon (2004) Econometric Theory and Methods, New York, Oxford University Press, http://www.econ.queensu.ca/ETM/, chapter 11. Journal of Applied Econometrics data archive: http://qed.econ.queensu.ca/jae/.

The variable of interest is lfp: whether the individual participates in the labour force or not. To know which variables are relevant in the decision to participate in the labour force, one could estimate a logit model, using glm().

logit_participation = glm(lfp ~ ., data = Participation, family = "binomial")

Now that we ran the regression, we can take a look at the results. I like to use broom::tidy() to look at the results of regressions, as tidy() returns a nice data.frame, but you could use summary() if you’re only interested in reading the output:

tidy(logit_participation)
##          term    estimate  std.error  statistic      p.value
## 1 (Intercept) 10.37434616 2.16685216  4.7877499 1.686617e-06
## 2     lnnlinc -0.81504064 0.20550116 -3.9661122 7.305449e-05
## 3         age -0.51032975 0.09051783 -5.6378920 1.721444e-08
## 4        educ  0.03172803 0.02903580  1.0927211 2.745163e-01
## 5         nyc -1.33072362 0.18017027 -7.3859224 1.514000e-13
## 6         noc -0.02198573 0.07376636 -0.2980454 7.656685e-01
## 7  foreignyes  1.31040497 0.19975784  6.5599678 5.381941e-11

From the results above, one can only interpret the sign of the coefficients. To know how much a variable influences the labour force participation, one has to use margins():

effects_logit_participation = margins(logit_participation) 
## Warning in warn_for_weights(model): 'weights' used in model estimation are
## currently ignored!
print(effects_logit_participation)
## Average marginal effects
## glm(formula = lfp ~ ., family = "binomial", data = Participation)
##  lnnlinc     age     educ     nyc       noc foreignyes
##  -0.1699 -0.1064 0.006616 -0.2775 -0.004584     0.2834

Using summary() on the object returned by margins() provides more details:

summary(effects_logit_participation)
##      factor     AME     SE       z      p   lower   upper
##         age -0.1064 0.0176 -6.0494 0.0000 -0.1409 -0.0719
##        educ  0.0066 0.0060  1.0955 0.2733 -0.0052  0.0185
##  foreignyes  0.2834 0.0399  7.1102 0.0000  0.2053  0.3615
##     lnnlinc -0.1699 0.0415 -4.0994 0.0000 -0.2512 -0.0887
##         noc -0.0046 0.0154 -0.2981 0.7656 -0.0347  0.0256
##         nyc -0.2775 0.0333 -8.3433 0.0000 -0.3426 -0.2123

And it is also possible to plot the effects with base graphics:

plot(effects_logit_participation)

This uses base R plotting, which is convenient because it is a simple call to plot(). But if you have been using ggplot2 and want this graph to have the same look as your other plots, you first need to save the summary in a variable. Let’s overwrite the effects_logit_participation variable with its summary:

effects_logit_participation = summary(effects_logit_participation)

And now it is possible to use ggplot2 to create the same plot:

ggplot(data = effects_logit_participation) +
  geom_point(aes(factor, AME)) +
  geom_errorbar(aes(x = factor, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))

So an infinitesimal increase of, say, 0.001 in non-labour income (lnnlinc) is associated with a decrease in the probability of labour force participation of about 0.001 * 17 ≈ 0.017 percentage points.

You can also extract the marginal effects of a single variable, with dydx():

head(dydx(Participation, logit_participation, "lnnlinc"))
##   dydx_lnnlinc
## 1  -0.15667764
## 2  -0.20014487
## 3  -0.18495109
## 4  -0.05377262
## 5  -0.18710476
## 6  -0.19586986
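A quick consistency check (my addition, not part of the original post): the AME reported by summary() is simply the average of these observation-level effects, so averaging the dydx() column should reproduce it.

# Mean of the unit-level effects = average marginal effect (AME) for lnnlinc
mean(dydx(Participation, logit_participation, "lnnlinc")$dydx_lnnlinc)
## expected to be roughly -0.17, matching the lnnlinc row of the AME table above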

Which makes it possible to extract the effects for a list of individuals that you can create yourself:

my_subjects = tribble(
    ~lfp,  ~lnnlinc, ~age, ~educ, ~nyc, ~noc, ~foreign,
    "yes",   10.780,  7.0,     4,    1,    1,    "yes",
     "no",     1.30,  9.0,     1,    4,    1,    "yes")

dydx(my_subjects, logit_participation, "lnnlinc")

##   dydx_lnnlinc
## 1  -0.09228119
## 2  -0.17953451

I used the tribble() function from the tibble package to create this test data set, row by row. Then, using dydx(), I get the marginal effect of variable lnnlinc for these two individuals. No doubt that this package will be a huge help convincing more social scientists to try out R and make a potential transition from STATA easier.


Demo Week: Tidy Time Series Analysis with tibbletime


(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We’re into the fourth day of Business Science Demo Week. We have a really cool one in store today: tibbletime, which uses a new tbl_time class that is time-aware!! For those that may have missed it, every day this week we are demo-ing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Let’s take tibbletime for a spin!

Previous Demo Week Demos

tibbletime: What’s It Used For?

  1. The future of “tidy” time series analysis: New class tbl_time rests on top of tbl and makes tibbles time aware.

  2. Time Series Functions: Can use a series of “tidy” time series functions designed specifically for tbl_time objects. Some of them are:

    • time_filter(): Succinctly filter a tbl_time object by date.

    • time_summarise(): Similar to dplyr::summarise but with the added benefit of being able to summarise by a time period such as “yearly” or “monthly”.

    • tmap(): The family of tmap functions transform a tbl_time input by applying a function to each column at a specified time interval.

    • as_period(): Convert a tbl_time object from daily to monthly, from minute data to hourly, and more. This allows the user to easily aggregate data to a less granular level.

    • time_collapse(): When time_collapse is used, the index of a tbl_time object is altered so that all dates that fall in a period share a common date.

    • rollify(): Modify a function so that it calculates a value (or a set of values) at specific time intervals. This can be used for rolling averages and other rolling calculations inside the tidyverse framework.

    • create_series(): Use shorthand notation to quickly initialize a tbl_time object containing a date column with a regularly spaced time series.

Demo Week: tibbletime

Load Libraries

The tibbletime package is under active development, and because of this we recommend downloading the package from GitHub using devtools. You’ll get the latest functionality with all of the features demo-ed in this article.

# Get tibbletime version with latest features
devtools::install_github("business-science/tibbletime")

Once installed, load the following libraries:

  • tibbletime: Enables creation of time-aware tibbles. Can use new tbl_time functions.
  • tidyquant: Loads tidyverse, and is used to get data with tq_get().
# Load librarieslibrary(tibbletime)# Future of tidy time series analysislibrary(tidyquant)# Loads tidyverse, tq_get()

Data

We’ll download the daily stock prices for the FANG stocks (FB, AMZN, NFLX, GOOG) using tq_get().

# Stock Prices from Yahoo! Finance
FANG_symbols <- c("FB", "AMZN", "NFLX", "GOOG")

FANG_tbl_d <- FANG_symbols %>%
    tq_get(get = "stock.prices", from = "2014-01-01", to = "2016-12-31")

FANG_tbl_d <- FANG_tbl_d %>%
    group_by(symbol)

FANG_tbl_d

## # A tibble: 3,024 x 8
## # Groups:   symbol [4]
##    symbol       date  open  high   low close   volume adjusted
##  1     FB 2014-01-02 54.83 55.22 54.19 54.71 43195500    54.71
##  2     FB 2014-01-03 55.02 55.65 54.53 54.56 38246200    54.56
##  3     FB 2014-01-06 54.42 57.26 54.05 57.20 68852600    57.20
##  4     FB 2014-01-07 57.70 58.55 57.22 57.92 77207400    57.92
##  5     FB 2014-01-08 57.60 58.41 57.23 58.23 56682400    58.23
##  6     FB 2014-01-09 58.65 58.96 56.65 57.22 92253300    57.22
##  7     FB 2014-01-10 57.13 58.30 57.06 57.94 42449500    57.94
##  8     FB 2014-01-13 57.91 58.25 55.38 55.91 63010900    55.91
##  9     FB 2014-01-14 56.46 57.78 56.10 57.74 37503600    57.74
## 10     FB 2014-01-15 57.98 58.57 57.27 57.60 33663400    57.60
## # ... with 3,014 more rows

We setup a function to plot facets by symbol that can be reused throughout this article. For those unfamiliar with the rlang package and tidyeval framework, it’s not necessary to understand for this article. Just recognize that we are creating a ggplot2 function that creates plots that are faceted by “symbol” by specifying the data frame, x, y, and group (if present).

# Setup plotting function that can be reused later
ggplot_facet_by_symbol <- function(data, x, y, group = NULL) {

    # Setup expressions
    x_expr     <- rlang::enquo(x)
    y_expr     <- rlang::enquo(y)
    group_expr <- rlang::enquo(group)

    if (group_expr == ~NULL) {
        # No groups
        g <- data %>%
            ggplot(aes(x = rlang::eval_tidy(rlang::`!!`(x_expr)),
                       y = rlang::eval_tidy(rlang::`!!`(y_expr)),
                       color = symbol)) +
            labs(x = quo_name(x_expr), y = quo_name(y_expr))
    } else {
        # Deal with groups
        g <- data %>%
            ggplot(aes(x = rlang::eval_tidy(rlang::`!!`(x_expr)),
                       y = rlang::eval_tidy(rlang::`!!`(y_expr)),
                       color = symbol,
                       group = rlang::eval_tidy(rlang::`!!`(group_expr)))) +
            labs(x = quo_name(x_expr), y = quo_name(y_expr),
                 group = quo_name(group_expr))
    }

    # Add faceting and theme
    g <- g +
        geom_line() +
        facet_wrap(~symbol, ncol = 2, scales = "free_y") +
        scale_color_tq() +
        theme_tq()

    return(g)
}

We can quickly visualize our data with our plotting function, ggplot_facet_by_symbol. Let’s have a look at the “adjusted” stock prices by “date”.

# Plot adjusted vs date
FANG_tbl_d %>%
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title = "FANG Stocks: Adjusted Prices 2014 through 2016")

(Plot: FANG Stocks: Adjusted Prices 2014 through 2016)

Now that we see what data we are dealing with, let’s move onto the tibbletime demo.

DEMO: tibbletime

We’ll test out the following functions today: time_filter(), time_summarise(), as_period(), and rollify().

Initialize a Tibble-Time Object

Before we can use these new functions, we need to create a tbl_time object. The new class operates almost identically to a normal tibble object. However, under the hood it tracks the time information.

Use the as_tbl_time() function to initialize the object. Specify index = date, which tells the tbl_time object which index to track.

# Convert to tbl_time
FANG_tbl_time_d <- FANG_tbl_d %>%
    as_tbl_time(index = date)

We can print the tbl_time object. It looks almost identical to a grouped tibble. Note that “Index: date” informs us that the “time tibble” is initialized properly.

# Show the tbl_time object we created
FANG_tbl_time_d

## # A time tibble: 3,024 x 8
## # Index:  date
## # Groups: symbol [4]
##    symbol       date  open  high   low close   volume adjusted
##  1     FB 2014-01-02 54.83 55.22 54.19 54.71 43195500    54.71
##  2     FB 2014-01-03 55.02 55.65 54.53 54.56 38246200    54.56
##  3     FB 2014-01-06 54.42 57.26 54.05 57.20 68852600    57.20
##  4     FB 2014-01-07 57.70 58.55 57.22 57.92 77207400    57.92
##  5     FB 2014-01-08 57.60 58.41 57.23 58.23 56682400    58.23
##  6     FB 2014-01-09 58.65 58.96 56.65 57.22 92253300    57.22
##  7     FB 2014-01-10 57.13 58.30 57.06 57.94 42449500    57.94
##  8     FB 2014-01-13 57.91 58.25 55.38 55.91 63010900    55.91
##  9     FB 2014-01-14 56.46 57.78 56.10 57.74 37503600    57.74
## 10     FB 2014-01-15 57.98 58.57 57.27 57.60 33663400    57.60
## # ... with 3,014 more rows

We can plot it with our plotting function, ggplot_facet_by_symbol(), and we see the tbl_time object reacts the same as the tbl object.

# Plot the tbl_time object
FANG_tbl_time_d %>%
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title = "Working with tbltime: Reacts same as tbl class")

(Plot: Working with tbltime: Reacts same as tbl class)

Special Time Series Functions

Let’s see what we can do with the new tbl_time object.

time_filter

The time_filter() function is used to succinctly filter a tbl_time object by date. It uses a function format (e.g. “date_operator_start ~ date_operator_end”). We specify the date operators in normal YYYY-MM-DD + HH:MM:SS, but there is also powerful shorthand to more efficiently subset by date.

Suppose we’d like to filter all observations inclusive of “2014-06-01” and “2014-06-15”. We can do this using the function notation, time_filter(2014-06-01 ~ 2014-06-15).

# time_filter by day
FANG_tbl_time_d %>%
    time_filter(2014-06-01 ~ 2014-06-15) %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    geom_point() +
    labs(title = "Time Filter: Use functional notation to quickly subset by time",
         subtitle = "2014-06-01 ~ 2014-06-15")

(Plot: Time Filter: Use functional notation to quickly subset by time, 2014-06-01 ~ 2014-06-15)

We can do the same by month. Suppose we just want observations in March, 2014. Use the shorthand functional notation “~ 2014-03”.

# time_filter by month
FANG_tbl_time_d %>%
    time_filter(~2014-03) %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    geom_point() +
    labs(title = "Time Filter: Use shorthand for even easier subsetting",
         subtitle = "~ 2014-03")

(Plot: Time Filter: Use shorthand for even easier subsetting, ~ 2014-03)

The tbl_time object also responds to bracket notation [. Here we collect all dates in 2014 for each of the groups.

# time filter bracket [] notation
FANG_tbl_time_d[~2014] %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title = "Time Filter: Bracket Notation Works Too",
         subtitle = "FANG_tbl_time_d[~ 2014]")

(Plot: Time Filter: Bracket Notation Works Too, FANG_tbl_time_d[~ 2014])

The time_filter() has a lot of capability and useful shorthand. Those interested should check out the time_filter vignette and the time_filter function documentation.

time_summarise

The time_summarise() function is similar to dplyr::summarise, but with the added benefit of being able to summarise by a time period such as “yearly” or “monthly”.

# Summarize functions over time periods such as weekly, monthly, etc
FANG_tbl_time_d %>%
    time_summarise(period = "yearly",
                   adj_min   = min(adjusted),
                   adj_max   = max(adjusted),
                   adj_range = adj_max - adj_min)

## # A time tibble: 12 x 5
## # Index:  date
## # Groups: symbol [4]
##    symbol       date   adj_min   adj_max adj_range
##  1   AMZN 2014-12-31 287.06000 407.04999 119.98999
##  2   AMZN 2015-12-31 286.95001 693.96997 407.01996
##  3   AMZN 2016-12-30 482.07001 844.35999 362.28998
##  4     FB 2014-12-31  53.53000  81.45000  27.92000
##  5     FB 2015-12-31  74.05000 109.01000  34.96000
##  6     FB 2016-12-30  94.16000 133.28000  39.12000
##  7   GOOG 2014-12-31 492.68097 606.14264 113.46167
##  8   GOOG 2015-12-31 489.85431 776.59998 286.74567
##  9   GOOG 2016-12-30 668.26001 813.10999 144.84998
## 10   NFLX 2014-12-31  44.88714  69.19857  24.31143
## 11   NFLX 2015-12-31  45.54714 130.92999  85.38285
## 12   NFLX 2016-12-30  82.79000 128.35001  45.56001

The really cool thing about time_summarise() is that we can use the functional notation to define the period to summarize over. For example if we want bimonthly, or every two months, we can use the notation 2 Months: “2~m”. Similarly we could do every 20 days as “20~d”. The summarization options are endless.

Let’s plot the min, max, and median on a Bi-Monthly frequency (2~m) with time_summarise(). This is really cool!!

# Summarize by 2-Month periods
FANG_min_max_by_2m <- FANG_tbl_time_d %>%
    time_summarise(period = 2~m,
                   adj_min = min(adjusted),
                   adj_max = max(adjusted),
                   adj_med = median(adjusted)) %>%
    gather(key = key, value = value, adj_min, adj_max, adj_med)

# Plot using our plotting function, grouping by key (min, max, and median)
FANG_min_max_by_2m %>%
    ggplot_facet_by_symbol(date, value, group = key) +
    geom_point() +
    labs(title = "Summarizing Data By 2-Months (Bi-Monthly)",
         subtitle = "2~m")

(Plot: Summarizing Data By 2-Months (Bi-Monthly), 2~m)

Those interested in furthering their understanding of time_summarise() can check out the time_summarise function documentation.

as_period

The next function, as_period(), enables changing the period of a tbl_time object. Two advantages to using this method over traditional approaches:

  1. The functions are flexible: “yearly” == “y” == “1~y”
  2. The functional notation allows for endless periodicity change combinations, for example:

    • “15~d” to change to 15-day periodicity
    • “2~m” to change to bi-monthly periodicity
    • “4~m” to change to tri-annual (semesters or trimesters)
    • “6~m” to change to bi-annual

To start off, let’s do a simple monthly periodicity change.

# Convert from daily to monthly periodicity
FANG_tbl_time_d %>%
    as_period(period = "month") %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title = "Periodicity Change from Daily to Monthly") +
    geom_point()

(Plot: Periodicity Change from Daily to Monthly)

Let’s step it up a notch. What about bi-monthly? Just use the functional notation, “2~m”.

# Convert from daily to bi-monthly periodicity
FANG_tbl_time_d %>%
    as_period(period = 2~m) %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title    = "Periodicity Change from Daily to Bi-Monthly",
         subtitle = "2~m") +
    geom_point()

plot of chunk unnamed-chunk-15

Let’s keep going. What about bi-annually? Just use “6~m”.

# Convert from daily to bi-annual periodicity
FANG_tbl_time_d %>%
    as_period(period = 6~m) %>%
    # Plotting
    ggplot_facet_by_symbol(date, adjusted) +
    labs(title    = "Periodicity Change from Daily to Bi-Annually",
         subtitle = "6~m") +
    geom_point()

plot of chunk unnamed-chunk-16

The possibilities are endless with the functional notation. Interested learners can check out the vignette on periodicity change with tibbletime.

rollify

The rollify() function is an adverb (a special type of function in the tidyverse that modifies another function). What rollify() does is turn any function into a rolling version of itself.

# Rolling 60-day mean
roll_mean_60 <- rollify(mean, window = 60)

FANG_tbl_time_d %>%
    mutate(mean_60 = roll_mean_60(adjusted)) %>%
    select(-c(open:volume)) %>%
    # Plot
    ggplot_facet_by_symbol(date, adjusted) +
    geom_line(aes(y = mean_60), color = palette_light()[[6]]) +
    labs(title = "Rolling 60-Day Mean with rollify")

plot of chunk unnamed-chunk-17

We can even do more complicated rolling functions such as correlations. We use the functional form .f = ~ fun(.x, .y, ...) within rollify().

# Rolling correlation
roll_corr_60 <- rollify(~ cor(.x, .y, use = "pairwise.complete.obs"), window = 60)

FANG_tbl_time_d %>%
    mutate(cor_60 = roll_corr_60(open, close)) %>%
    select(-c(open:adjusted)) %>%
    # Plot
    ggplot_facet_by_symbol(date, cor_60) +
    labs(title = "Rollify: 60-Day Rolling Correlation Between Open and Close Prices")

plot of chunk unnamed-chunk-18

We can even return multiple results. For example, we can create a rolling quantile.

First, create a function that returns a tibble of quantiles.

# Quantile tbl function
quantile_tbl <- function(x) {
    q <- quantile(x)
    tibble(quantile_name  = names(q),
           quantile_value = q)
}

# Test the function
quantile_tbl(1:100)
## # A tibble: 5 x 2
##   quantile_name quantile_value
## 1            0%           1.00
## 2           25%          25.75
## 3           50%          50.50
## 4           75%          75.25
## 5          100%         100.00

Great, it works. Next, use rollify to create a rolling version. We set unlist = FALSE to return a list-column.

# Rollified quantile function
roll_quantile_60 <- rollify(quantile_tbl, window = 60, unlist = FALSE)

Next, apply the rolling quantile function within mutate() to get a rolling quantile. Make sure you select(), filter() and unnest() to remove unnecessary columns, filter NA values, and unnest the list-column (“rolling_quantile”). Each date now has five values for each quantile.

# Apply rolling quantile
FANG_quantile_60 <- FANG_tbl_time_d %>%
    mutate(rolling_quantile = roll_quantile_60(adjusted)) %>%
    select(-c(open:adjusted)) %>%
    filter(!is.na(rolling_quantile)) %>%
    unnest()

FANG_quantile_60
## # A time tibble: 13,940 x 4
## # Index:  date
## # Groups: symbol [4]
##    symbol       date quantile_name quantile_value
##  1     FB 2014-03-28            0%        53.5300
##  2     FB 2014-03-28           25%        57.8750
##  3     FB 2014-03-28           50%        64.2100
##  4     FB 2014-03-28           75%        68.6275
##  5     FB 2014-03-28          100%        72.0300
##  6     FB 2014-03-31            0%        53.5300
##  7     FB 2014-03-31           25%        57.9350
##  8     FB 2014-03-31           50%        64.2100
##  9     FB 2014-03-31           75%        68.6275
## 10     FB 2014-03-31          100%        72.0300
## # ... with 13,930 more rows

Finally, we can plot the results.

FANG_quantile_60 %>%
    ggplot_facet_by_symbol(date, quantile_value, group = quantile_name) +
    labs(title = "Rollify: Create Rolling Quantiles")

plot of chunk unnamed-chunk-22

Interested learners can continue exploring rollify by checking out our vignette on rolling functions with rollify.

Changes Coming

This package is currently under active development. Don’t be shocked if the functionality increases soon… Davis Vaughan is working hard to expand the capability of tibbletime. Reproducible bug reports are welcome!

Next Steps

Interested learners can check out the tibbletime package vignettes and function documentation to further their understanding of tibbletime.

Announcements

We have a busy couple of weeks. In addition to Demo Week, we have:

DataTalk

!!TONIGHT!! Thursday, October 26 at 7PM EST, Matt will be giving a FREE LIVE #DataTalk on Machine Learning for Recruitment and Reducing Employee Attrition. You can sign up for a reminder at the Experian Data Lab website.

#MachineLearning for Reducing Employee Attrition @BizScienc https://t.co/vlxmjWzKCL #ML #AI #HR #IoTT #IoT #DL #BigData #Tech #Cloud #Jobs pic.twitter.com/dF5Znf10Sk

— Experian DataLab (@ExperianDataLab) October 18, 2017

EARL

On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.

😀Hey #rstats. I'll be presenting @earlconf on #MachineLearning applications in #HumanResources. Get 15% off tickets: https://t.co/b6JUQ6BSTl

— Matt Dancho (@mdancho84) October 11, 2017

Courses

Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles.

The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Follow Business Science on Social Media


To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.


Not Mustard – Exploring McDonalds Reviews on Yelp with R


(This article was first published on Jasmine Dumas' R Blog, and kindly contributed to R-bloggers)

Leveraging tidyverse packages httr, stringr & purrr –

Introduction

McDonald’s is a nostalgic component of America 🇺🇸 and a pioneer of fast food operations and real estate ventures, as depicted in the 2016 film, The Founder, about Ray Kroc. As a kid I traveled to different McDonald’s across the east coast and noticed a difference in the classic hamburger preparation when it came to adding mustard (e.g. in Maryland but not in Upstate New York). After some Google research, I noticed others had documented the regional differences in the use of mustard, but no aggregated data set existed detailing which McDonald’s added mustard to their hamburgers.

I hypothesized that these deviations in food prep could be identified from yelp.com reviews. The process below explains the approaches I took to gather data from the web with the yelp API and the development of a shiny web application which detects string patterns in reviews for the keyword ‘mustard’ for a specific McDonald’s.

API Process

This script highly references Jenny Bryan’s yelpr example!

library(yelpr)  # devtools::install_github("jennybc/ryelp")
library(httr)
library(stringr)
library(purrr)

# 1. Create an application on the Yelp developers site
#    (https://www.yelp.com/developers/v3/manage_app) and agree to the Terms and Agreements
## Set your credentials as environment variables
Sys.setenv(YELP_CLIENT_ID = '**************')
Sys.setenv(YELP_SECRET    = '*****************************')

# 2. Search for businesses by creating an app
yelp_app <- oauth_app("yelp",
                      key    = Sys.getenv("YELP_CLIENT_ID"),
                      secret = Sys.getenv("YELP_SECRET"))

# Authenticate an endpoint
## https://www.yelp.com/developers/documentation/v3/authentication
yelp_endpoint <- oauth_endpoint(NULL,
                                authorize = "https://api.yelp.com/oauth2/token",
                                access    = "https://api.yelp.com/oauth2/token")

# 3. Get an access token: just enter anything for the authorization code when
#    prompted in the Console of RStudio
token <- oauth2.0_token(yelp_endpoint, yelp_app,
                        user_params = list(grant_type = "client_credentials"),
                        use_oob = T)  # make this arg TRUE when interactive

# 4. Create a url to make calls to the business search endpoint: the parts of the url
#    include the endpoint and the query search parameters after the ?
(url <- modify_url("https://api.yelp.com",
                   path  = c("v3", "businesses", "search"),
                   query = list(term = "McDonalds", location = "Hartford, CT", limit = 10)))

# 5. Retrieve info from the server with the GET verb: HTTP response verbs enable the
#    client to send us back data on status, headers, and body/content. Available verbs
#    include GETting data from the server, POSTing new data to the server, PUTting new
#    data to update a partial record and DELETEing data.
response1 <- GET(url, config(token = token))

# Was this api request successful?
## HTTP status codes consist of 3 digit numeric codes for status (1xx is information,
## 2xx is success, 3xx is redirection, 4xx is client error, 5xx server error).
http_status(response1)

# What type of format does the data come back with?
response1$headers$`content-type`

# 6. Return some content with geolocation data, business info & categories
ct2 <- content(response1)

## Create an object with restaurant name and id for further calls
biz_info <- ct2$businesses %>% map_df(`[`, c("name", "id", "phone", "review_count"))
biz_info %>% knitr::kable()

# 7. Get business reviews: after getting a specific McDonald's id, restructure the url
#    as an individual value and secondly create a function to build a data.frame with
#    urls for each business from the search endpoint.
url_id <- modify_url("https://api.yelp.com",
                     path  = c("v3", "businesses", "mcdonalds-glastonbury", "reviews"),
                     query = list(locale = "en_US"))

# 8. Retrieve response data on up to 3 reviews for the specific McDonald's
response2 <- GET(url_id, config(token = token))
content2  <- content(response2)

# Detect for string of 'mustard'
content2$reviews %>% map_df(`[`, c("text")) %>% str_detect("mustard")

Here is the purrr version, which checks multiple restaurants’ text reviews for the string ‘mustard’.

# Create a function to structure the urls with the business id
url_id_f <- function(id) {
    modify_url("https://api.yelp.com",
               path  = c("v3", "businesses", id, "reviews"),
               query = list(locale = "en_US"))
}

# Create a df which maps the url function of all the restaurants
biz_reviews <- data.frame()
biz_reviews <- map_chr(biz_info$id, url_id_f) %>% data.frame(url = .)
biz_reviews$url <- as.character(biz_reviews$url)

# Get each url for the request
response3 <- map(biz_reviews$url, GET, config(token = token))
response3 %>% map_df(`[`, "status_code") == 200

# Loop through each restaurant's 3 reviews, extract the text and
# detect the presence of the string 'mustard'
for (idx in 1:length(response3)) {
    mcd <- response3[[idx]]
    ct  <- content(mcd)
    print(ct)
    result <- ct$reviews %>% map_df(`[`, c("text")) %>% str_detect("mustard")
    print(result)
}

Learnings & Gotchas

The non-premium API access only includes up to 3 reviews and only a sample of the full text, leaving obvious gaps when trying to detect the keyword ‘mustard’, and the approach is contingent on there being enough reviews that detail 🍔 preparation.

In trying to create and publish a shiny application that wraps this code, I ran into errors given that OAuth2.0 grants access to users 👩 and not applications 💻. However, here is a screenshot of the script above developed into an interactive shiny application to search for any [city, state], and the gist of the code if you’re interested in running a local version.

The name of this shiny app is a nod to Silicon Valley’s Not Hotdog application.


Image source: https://www.mcdonalds.com/us/en-us/about-us.html


To leave a comment for the author, please follow the link and comment on their blog: Jasmine Dumas' R Blog.



Analysing Cryptocurrency Market in R


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The cryptocurrency market has been growing rapidly, and being an analyst, I was intrigued by what it comprises. In this post, I’ll explain how we can analyse the cryptocurrency market in R with the help of the coinmarketcapr package, an R wrapper around the coinmarketcap API.

To get started, let us load the library into our R session and plot the top 5 cryptocurrencies.

library(coinmarketcapr)

plot_top_5_currencies()

Gives this plot:

The above plot clearly shows how Bitcoin is leading the market, but it does not give us a picture of how the market share is split among the various cryptocurrencies, so let us get the complete data of the various cryptocurrencies.

market_today <- get_marketcap_ticker_all()
head(market_today[, 1:8])

            id         name symbol rank price_usd  price_btc X24h_volume_usd market_cap_usd
1      bitcoin      Bitcoin    BTC    1   5568.99        1.0    2040540000.0  92700221345.0
2     ethereum     Ethereum    ETH    2   297.408  0.0537022     372802000.0  28347433482.0
3       ripple       Ripple    XRP    3  0.204698 0.00003696     100183000.0   7887328954.0
4 bitcoin-cash Bitcoin Cash    BCH    4   329.862  0.0595624     156369000.0   5512868154.0
5     litecoin     Litecoin    LTC    5    55.431   0.010009     124636000.0   2967255097.0
6         dash         Dash   DASH    6   287.488  0.0519109      46342600.0   2197137527.0

Having extracted the complete data of various cryptocurrencies, let us try to visualize the market share split with a treemap. For plotting, let us extract only the two columns id and market_cap_usd, convert market_cap_usd to numeric type, and apply a little bit of number formatting for the treemap labels.

library(treemap)

df1 <- na.omit(market_today[, c('id', 'market_cap_usd')])
df1$market_cap_usd <- as.numeric(df1$market_cap_usd)
df1$formatted_market_cap <- paste0(df1$id, '\n', '$',
                                   format(df1$market_cap_usd, big.mark = ',',
                                          scientific = F, trim = T))

treemap(df1,
        index = 'formatted_market_cap',
        vSize = 'market_cap_usd',
        title = 'Cryptocurrency Market Cap',
        fontsize.labels = c(12, 8),
        palette = 'RdYlGn')

Gives this plot:

The above visualization shows that the whole cryptocurrency market is propped up primarily by two currencies – Bitcoin and Ethereum – and even the second-ranked Ethereum is far behind Bitcoin, which is the driving factor of this market. But it is also fascinating (and shocking at the same time) that Bitcoin and Ethereum together create a 100 Billion Dollar (USD) market. Whether this is a sign of a bubble or not, we’ll leave for market analysts to speculate; but as data scientists or analysts, we have a lot of insights to extract from the above data, and it should be interesting to analyse such an expensive market.

    Related Post

    1. Time Series Analysis in R Part 3: Getting Data from Quandl
    2. Pulling Data Out of Census Spreadsheets Using R
    3. Extracting Tables from PDFs in R using the Tabulizer Package
    4. Extract Twitter Data Automatically using Scheduler R package
    5. An Introduction to Time Series with JSON Data

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


    Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    by Błażej Moska, computer science student and data science intern 

One of the most important things in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is strictly connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how the estimator's value ranges from dataset to dataset. It's defined as follows:

    \[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[ \widehat{f} (x)]=E[\widehat{f} (x)^{2}]-E[\widehat{f} (x)]^{2} \]

    Bias is defined as follows:

    \[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One could think of bias as the ability to approximate a function. Typically, reducing bias results in increased variance and vice versa.

\(E[X]\) is an expected value; it can be estimated using a mean, since the mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling with replacement the indexes of the original data frame, then drawing the rows which correspond to these indexes to obtain new data frames. This operation is repeated nsampl times, where nsampl is the parameter describing the number of bootstrap samples.

Variance and bias are estimated for one value at a time, that is to say, for one observation/row of the original dataset (we calculate variance and bias over the rows of predictions made on the bootstrap samples). We then obtain a vector of variances/biases with the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors, and we will treat these two means as our estimates of mean variance and mean bias. If we don't want to measure the direction of the bias, we can take absolute values of the bias.

Because bias and variance can be controlled by parameters passed to the rpart function, we can also survey how these parameters affect tree variance. The most commonly used parameters are cp (complexity parameter), which describes how much each split must decrease the overall variance of the decision variable in order to be attempted, and minsplit, which defines the minimum number of observations needed to attempt a split.

The operations mentioned above are rather expensive in computational terms: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl predictions, nrow variances and nrow biases, and repeat those operations for the number of parameters (the length of the cp or minsplit vector). For that reason the foreach package was used, to take advantage of parallelism. The above procedure still can't be considered fast, but it was much faster than without using the foreach package.

    So, summing up, the procedure looks as follows:

    1. Create bootstrap samples (by bootstrapping original dataset)
    2. Train model on each of these bootstrap datasets
    3. Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original datasets (in other words, calculate bias for each row)
    4. Calculate variance of predictions for each row (estimate variance of an estimator-regression tree)
    5. Calculate mean bias/absolute bias and mean variance 

    R Code
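Here is a minimal sketch of the procedure described above, using foreach, doParallel and rpart. The dataset (mtcars with mpg as the target), the number of bootstrap samples, and the rpart control values are illustrative assumptions rather than the author's original choices.

# Minimal sketch: mean variance and mean absolute bias of an rpart regression
# tree, estimated by bootstrapping (illustrative dataset and parameters)
library(foreach)
library(doParallel)
library(rpart)

registerDoParallel(cores = 2)

df     <- mtcars      # assumed example dataset, target = mpg
nsampl <- 100         # number of bootstrap samples
n      <- nrow(df)

# 1-2. Create bootstrap samples and grow a tree on each,
#      then predict on the original dataset (one column per bootstrap model)
preds <- foreach(b = 1:nsampl, .combine = cbind, .packages = "rpart") %dopar% {
    idx  <- sample(n, replace = TRUE)
    tree <- rpart(mpg ~ ., data = df[idx, ],
                  control = rpart.control(cp = 0.01, minsplit = 20))
    predict(tree, newdata = df)
}

# 3. Bias for each row: mean prediction minus the observed value
bias_per_row <- rowMeans(preds) - df$mpg

# 4. Variance of the predictions for each row
var_per_row <- apply(preds, 1, var)

# 5. Mean absolute bias and mean variance
mean(abs(bias_per_row))
mean(var_per_row)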


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Gold-Mining – Week 8 (2017)


    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 8 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

    The post Gold-Mining – Week 8 (2017) appeared first on Fantasy Football Analytics.


    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.


    Substitute levels in a factor or character vector


I’ve been using the ggplot2 package a lot recently. When creating a legend or tick marks on the axes, ggplot2 uses the levels of a character or factor vector. Most of the time, I am working with coded variables that use some abbreviation of the “true” meaning (e.g. “f” for female and “m” for male, or a single character for a location: “S” for Stuttgart and “M” for Mannheim).

In my plots, I don’t want these codes but the full name of the level. Since I am not aware of any super-fast and easy to use function in base R (let me know in the comments if there is one), I came up with a very simple function and put it in my .Rprofile (that means that it is available whenever I start R). I called it “.replace.levels”. The dot before the name means that it is invisible and does not show up in my Global Environment overview in RStudio. You have to call it with .replace.levels(), of course.

    This is how it works: .replace.levels takes two arguments: vec and replace.list

    vec is the vector where substitutions shall be made. replace.list is a list with named elements. The names of the elements are substituted for the contents of the elements.

    The function also checks if all elements (both old and new ones) appear in vec. If not, it throws an error. It is a very simple function, but it saves me a lot of typing.

    Example:

a <- c("B", "C", "F")
.replace.levels(a, list("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt"))
[1] "Braunschweig" "Chemnitz"     "Frankfurt"

    Here it is:

.replace.levels <- function (vec, replace.list) {
  # Checking if all levels to be replaced are in vec (and other way around)
  cur.levels <- unique(as.character(vec))
  not.in.cur.levels <- setdiff(cur.levels, names(replace.list))
  not.in.new.levels <- setdiff(names(replace.list), cur.levels)
  if (length(not.in.cur.levels) != 0 | length(not.in.new.levels) != 0) {
    stop("The following elements do not match: ",
         paste0(not.in.cur.levels, not.in.new.levels, collapse = ", "))
  }
  for (el in 1:length(replace.list)) {
    vec <- gsub(names(replace.list[el]), replace.list[[el]], vec)
  }
  vec
}
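As an aside on the design choice (the author explicitly invites base R alternatives), a named-vector lookup achieves a similar substitution for the simple case where every code has a full-name entry. This is only a sketch, not a drop-in replacement for the partial matching that gsub() provides.

# Sketch: named-vector lookup as a base R alternative (exact matches only)
lookup <- c("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt")
a <- c("B", "C", "F")
unname(lookup[a])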


    Microsoft R Open 3.4.2 now available


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open (MRO), Microsoft's enhanced distribution of open source R, has been upgraded to version 3.4.2 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R 3.4.2 and updates the bundled packages.

    MRO is 100% compatible with all R packages. MRO 3.4.2 points to a fixed CRAN snapshot taken on October 15 2017, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).
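For instance, pinning an analysis to a dated snapshot might look like the sketch below; the snapshot date shown is the MRO 3.4.2 default mentioned above, and any other date can be substituted.

# Sketch: reproduce package versions against a dated CRAN snapshot
library(checkpoint)
checkpoint("2017-10-15")   # the MRO 3.4.2 default snapshot date
# Subsequent library() / install.packages() calls resolve against this snapshot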

MRO 3.4.2 is based on R 3.4.2, a minor update to the R engine (you can see the detailed list of updates to R here). This update is backwards-compatible with R 3.4.1 (and MRO 3.4.1), so you shouldn't encounter any new issues by upgrading.

    We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

    MRAN: Download Microsoft R Open


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Projects chapter added to “Empirical software engineering using R”


    (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

    The Projects chapter of my Empirical software engineering book has been added to the draft pdf (download here).

    This material turned out to be harder to bring together than I had expected.

    Building software projects is a bit like making sausages in that you don’t want to know the details, or in this case those involved are not overly keen to reveal the data.

    There are lots of papers on requirements, but remarkably little data (Soo Ling Lim’s work being the main exception).

    There are lots of papers on effort prediction, but they tend to rehash the same data and the quality of research is poor (i.e., tweaking equations to get a better fit; no explanation of why the tweaks might have any connection to reality). I had not realised that Norden did all the heavy lifting on what is sometimes called the Putnam model; Putnam was essentially an evangelist. The Parr curve is a better model (sorry, no pdf), but lacked an evangelist.

    Accurate estimates are unrealistic: lots of variation between different people and development groups, the client keeps changing the requirements and developer turnover is high.

    I did turn up a few interesting data-sets and Rome came to the rescue in places.

    I have been promised more data and am optimistic some will arrive.

    As always, if you know of any interesting software engineering data, please tell me.

    I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

    Reliability chapter next.


    To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R.


    Stan Roundup, 27 October 2017


    (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

    I missed two weeks and haven’t had time to create a dedicated blog for Stan yet, so we’re still here. This is only the update for this week. From now on, I’m going to try to concentrate on things that are done, not just in progress so you can get a better feel for the pace of things getting done.

    Not one, but two new devs!

    This is my favorite news to post, hence the exclamation.

    • Matthijs Vákár from University of Oxford joined the dev team. Matthijs’s first major commit is a set of GLM functions for negative binomial with log link (2–6 times speedup), normal linear regression with identity link (4–5 times), Poisson with log link (factor of 7) and bernoulli with logit link (9 times). Wow! And he didn’t just implement the straight-line case—this is a fully vectorized implementation as a density, so we’ll be able to use them this way:
      int y[N];  // observations
      matrix[N, K] x;                  // "data" matrix
      vector[K] beta;                  // slope coefficients
      real alpha;                      // intercept coefficient
      
      y ~ bernoulli_logit_glm(x, beta, alpha);
      

      These stand in for what is now written as

      y ~ bernoulli_logit(x * beta + alpha);
      

      and before that was written

      y ~ bernoulli(inv_logit(x * beta + alpha)); 
      

      Matthijs also successfully defended his Ph.D. thesis—welcome to the union, Dr. Vákár.

    • Andrew Johnson from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He’s also been patching doc and other issues.

    Visitors

    • Kentaro Matsura, author of Bayesian Statistical Modeling Using Stan and R (in Japanese) visited and we talked about what he’s been working on and how we’ll handle the syntax for tuples in Stan.

    • Shuonan Chen visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics—specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro’s book!

    Other news

    • Bill Gillespie presented a Stan and Torsten tutorial at ACoP.

    • Charles Margossian had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.

    • Krzysztof Sakrejda nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).

    • Ben Goodrich has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review—suggestions welcome).

    • Bob Carpenter and Mitzi Morris taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! Mike Lawrence, of Stan tutorial YouTube fame, helped people with debugging and installs—nice to finally meet him in person.

    • Ben Bales got the metric initialization into CmdStan, so we’ll finally be able to restart (the metric used to be called the mass matrix—it’s just the inverse of a regularized estimate of global posterior covariance during warmup)

    • Michael Betancourt just returned from SACNAS (diversity in STEM conference attended by thousands).

    • Michael also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.

    • Aki Vehtari was awarded a two-year postdoc for a joint project working on Stan algorithms and models jointly supervised with Andrew Gelman; it’ll also be joint between Helsinki and New York. Sounds like fun!

    • Breck Baldwin and Sean Talts headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.

    • Imad Ali is leaving Columbia to work for the NBA league office (he’ll still be in NYC) as a statistical analyst! That’s one way to get access to the data!

    The post Stan Roundup, 27 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


    To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.



    Spring Budget 2017: Circle Visualisation


    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    It’s time to branch out into a new area of data visualisation: proportion area plots. These plots use area to show proportion between different related values. A common type of proportional area plots are tree maps. We are going to be using the same principle but with circles.

    A common subject for area visualisation is budgets and money, which will be the subject of this post.

    We are going to take a look at the Spring Budget 2017 in the UK.

    Page four of the document has two pie charts detailing how public money will be spent this year and where it will come from.

    I couldn’t find this data in any tables, so we will have to create our data frames manually:

spending <- data.frame(
  c("Social protection", "Health", "Education", "Other", "Defence", "Debt interest",
    "Transport", "Housing\n & environment", "Public order", "Social\n services", "Industry"),
  c(245, 149, 102, 50, 48, 46, 37, 36, 34, 32, 23),
  c(802, 802, 802, 802, 802, 802, 802, 802, 802, 802, 802))
names(spending) <- c("Expenditure", "Spending", "Total")

income <- data.frame(
  c("Income tax", "VAT", "National Insurance", "Other taxes", "Borrowing",
    "Other (non taxes)", "Corporation Tax", "Excise duties", "Council tax", "Business\n rates"),
  c(175, 143, 130, 80, 58, 54, 52, 48, 32, 30),
  c(802, 802, 802, 802, 802, 802, 802, 802, 802, 802))
names(income) <- c("Source", "Income", "Total")

    I am including ‘borrowing’ in the Government’s income section, although it will be the only circle that won’t be a tax.

    To draw our circles we will be using the packcircles package.

    library("packcircles")library("ggplot2")options(scipen = 999)library(extrafont)library(extrafontdb)loadfonts()fonts()t <- theme_bw() +  theme(panel.grid = element_blank(),  axis.text=element_blank(), axis.ticks=element_blank(),  axis.title=element_blank())

The first thing to do is to use the functions within the package to generate the radii and coordinates, as per the code below.

# circle areas
areas  <- income$Income
areas2 <- spending$Spending

# income
packing1 <- circleProgressiveLayout(areas)
dat1 <- circleLayoutVertices(packing1)

# spending
packing2 <- circleProgressiveLayout(areas2)
dat2 <- circleLayoutVertices(packing2)

    Next up we are going to move to the visualisation. It would be more informative if we could display income and expenditure side-by-side.

    Like last time when we compared Donald Trump and Hillary Clinton on Facebook, we can use facetting for this purpose in ggplot2.

    I adapted this code for this purpose.

dat <- rbind(cbind(dat1, set = 1),
             cbind(dat2, set = 2))

p <- ggplot(data = dat, aes(x, y)) +
  geom_polygon(aes(group = id, fill = -set), colour = "black", show.legend = FALSE) +
  theme(text = element_text(family = "Browallia New", size = 50)) +
  coord_equal() +
  ggtitle("Income and expenditure for the UK Government") +
  scale_colour_manual(values = c("white", "White")) +
  facet_grid(~set, labeller = as_labeller(c('1' = "Income", '2' = "Expenditure")))

    With no labels:

    This is looking good, but obviously we have no way of knowing which circle corresponds to which area of income or expenditure. In ggplot2 you can use geom_text() to add annotations. In order for them to be positioned correctly, we need to build a relationship between the labels and the circles. We can do this by creating centroids from the coordinates we already have for the circles.

    If we have accurate centroid coordinates then our labels will always appear in the centre of our circles.

    Going back to our dat data frame:

> str(dat)
'data.frame': 546 obs. of  4 variables:
 $ x  : num  0 -0.277 -1.092 -2.393 -4.099 ...
 $ y  : num  0 2.2 4.25 6.05 7.46 ...
 $ id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ set: num  1 1 1 1 1 1 1 1 1 1 ...

    Our ID column shows which circle it belongs to, and our set column shows whether it belongs in income or expenditure. From this we can deduce that each circle is plotted using 26 pairs of x/y coordinates (at row 27 the ID changes from one to two).

    The mean of each 26 pairs is the centre of the corresponding circle.

    Our dat data frame is 546 rows long. We need a sequence of numbers, 26 apart, going up to 546:

sequence <- data.frame(c(x = (seq(from = 1, to = 546, by = 26))),
                       y = seq(from = 26, to = 547, by = 26))
names(sequence)

head(sequence)
     x   y
x1   1  26
x2  27  52
x3  53  78
x4  79 104
x5 105 130
x6 131 156

    Next up we will create a data frame that will contain our centroids:

centroid_df <- data.frame(seq(from = 1, to = 521, by = 1))
names(centroid_df) <- c("number")

    There are probably a few different ways to run the next section, probably a bit more elegantly than what we’re about to do as well.

    This for loop will calculate the mean of every single combination of 26 coordinates and store the results in the new centroid_df.

for (i in 1:521) {
  j = i + 25
  centroid_df$x[i] <- mean(dat$x[i:j])
  centroid_df$y[i] <- mean(dat$y[i:j])
}
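As an aside, the same centroids can be computed more concisely by grouping on the circle id; this is a minimal sketch assuming the dat data frame built above (columns x, y, id and set), rather than the approach used in this post.

# Sketch: circle centroids via grouping instead of a loop
library(dplyr)

centroids <- dat %>%
  group_by(set, id) %>%
  summarise(x = mean(x), y = mean(y))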

    Now we bring in our sequence data frame to filter out the ones we don’t need in a new coords data frame, leaving just the ones that correspond correctly to the circles:

coords <- merge(sequence, centroid_df, by.x = "x", by.y = "number")
names(coords) <- c("number1", "number2", "x", "y")

    Next up for labelling purposes is to clarify which set each pair of coordinates belongs to:

coords$set[1:21]  <- 1
coords$set[11:21] <- 2 # this overrides some of the values set in the line above

    Finally we will paste in the names and amounts from our original income and spending data frames into a new label column:

coords$label <- NA
coords$label[1:21]  <- paste(income$Source, "\n£", income$Income, "bn")
coords$label[11:21] <- paste(spending$Expenditure, "\n£", spending$Spending, "bn")
coords$label <- gsub("£ ", "£", coords$label)
coords$label <- gsub(" bn", "bn", coords$label)

    Now let’s add in the labels:

p <- p +
  geom_text(data = coords,
            aes(x = x, y = y, label = label, colour = factor(set)),
            show.legend = FALSE)
p

    Gives this plot:

    There we have it, some nicely formatted labels showing what each circle refers to, and how much money it represents.

    Brief analysis

    On the left hand side we see that income tax is the biggest cash cow for the Government. Incidentally, about 28 per cent of that is paid for by the top 1% of earners:

    The top 5% are liable for almost half the UK's total income tax.

    Not enough for Jeremy Corbyn, so what would he like it to be? pic.twitter.com/poeC2N6Sn9

    — Rob Grant (@robgrantuk) June 21, 2017

    Social protection means benefits, essentially. A huge chunk of that is the State Pension, followed by various working-age benefits and tax credits.

    Notice that as a country we spend almost as much paying off the interest on our debts as we do on our own defence.

    Health and education – those two ‘classic’ public services – cost a combined £250bn.

    Conclusion

    I am a fan of using these circles to visualise this kind of data, especially money.

    If you have lots of networked data (for instance, if you had breakdowns of how exactly the social protection budget will be spent) then ggraph would be a good option for you to produce plots like this:

    Comment below if you have questions!

      Related Post

      1. Qualitative Research in R
      2. Multi-Dimensional Reduction and Visualisation with t-SNE
      3. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
      4. Analyzing Obesity across USA
      5. Can we predict flu deaths with Machine Learning and R?

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


      Demo Week: Time Series Machine Learning with h2o and timetk


      (This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

      We’re at the final day of Business Science Demo Week. Today we are demo-ing the h2o package for machine learning on time series data. What’s demo week? Every day this week we are demoing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today you’ll see how we can use timetk + h2o to get really accurate time series forecasts. Here we go!

      Previous Demo Week Demos

      h2o: What’s It Used For?

      The h2o package is a product offered by H2O.ai that contains a number of cutting edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy. One of the main benefits of H2O is that it can be deployed on a cluster (this will not be discussed today). From the R perspective, there are four main uses:

      1. Data Manipulation: Merging, grouping, pivoting, imputing, splitting into training/test/validation sets, etc.

      2. Machine Learning Algorithms: Very sophisticated algorithms in both supervised and unsupervised categories. Supervised include deep learning (neural networks), random forest, generalized linear model, gradient boosting machine, naive bayes, stacked ensembles, and xgboost. Unsupervised include generalized low rank models, k-means and PCA. There’s also Word2vec for text analysis. The latest stable release also has AutoML: automatic machine learning, which is really cool as we’ll see in this post!

      3. Auxiliary ML Functionality: Performance analysis and grid hyperparameter search

      4. Production, Map/Reduce and Cloud: Capabilities for productionizing into Java environments, cluster deployment with Hadoop / Spark (Sparkling Water), deploying in cloud environments (Azure, AWS, Databricks, etc)

      Sticking with the theme for the week, we’ll go over how h2o can be used for time series machine learning as an advanced algorithm. We’ll use h2o locally to develop a high accuracy time series model on the same data set (beer_sales_tbl) from the timetk and sweep posts. This is a supervised regression problem.

      Demo Week: H2O

      Load Libraries

      We’ll need three libraries today:

      • h2o: Awesome machine learning library
      • tidyquant: For getting data and loading the tidyverse behind the scenes
      • timetk: Toolkit for working with time series in R

      IMPORTANT FOR INSTALLING H2O

      For h2o, you must install the latest stable release. Select H2O » Latest Stable Release » Install in R. Then follow the instructions exactly.

      Installing Other Packages

      If you haven’t done so already, install the timetk and tidyquant packages:

# Install packages
install.packages("timetk")
install.packages("tidyquant")

      Loading Libraries

      Load the libraries.

# Load libraries
library(h2o)        # Awesome ML Library
library(timetk)     # Toolkit for working with time series in R
library(tidyquant)  # Loads tidyverse, financial pkgs, used to get data

      Data

      We’ll get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.

# Beer, Wine, Distilled Alcoholic Beverages, in Millions USD
beer_sales_tbl <- tq_get("S4248SM144NCEN",
                         get  = "economic.data",
                         from = "2010-01-01",
                         to   = "2017-10-27")

beer_sales_tbl
## # A tibble: 92 x 2
##          date price
##  1 2010-01-01  6558
##  2 2010-02-01  7481
##  3 2010-03-01  9475
##  4 2010-04-01  9424
##  5 2010-05-01  9351
##  6 2010-06-01 10552
##  7 2010-07-01  9077
##  8 2010-08-01  9273
##  9 2010-09-01  9420
## 10 2010-10-01  9413
## # ... with 82 more rows

      It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, and it’s a good idea to identify spots where we will split the data into training, test and validation sets.

# Plot Beer Sales with train, validation, and test sets shown
beer_sales_tbl %>%
    ggplot(aes(date, price)) +
    # Train Region
    annotate("text", x = ymd("2012-01-01"), y = 7000,
             color = palette_light()[[1]], label = "Train Region") +
    # Validation Region
    geom_rect(xmin = as.numeric(ymd("2016-01-01")),
              xmax = as.numeric(ymd("2016-12-31")),
              ymin = 0, ymax = Inf, alpha = 0.02,
              fill = palette_light()[[3]]) +
    annotate("text", x = ymd("2016-07-01"), y = 7000,
             color = palette_light()[[1]], label = "Validation\nRegion") +
    # Test Region
    geom_rect(xmin = as.numeric(ymd("2017-01-01")),
              xmax = as.numeric(ymd("2017-08-31")),
              ymin = 0, ymax = Inf, alpha = 0.02,
              fill = palette_light()[[4]]) +
    annotate("text", x = ymd("2017-05-01"), y = 7000,
             color = palette_light()[[1]], label = "Test\nRegion") +
    # Data
    geom_line(col = palette_light()[1]) +
    geom_point(col = palette_light()[1]) +
    geom_ma(ma_fun = SMA, n = 12, size = 1) +
    # Aesthetics
    theme_tq() +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    labs(title    = "Beer Sales: 2007 through 2017",
         subtitle = "Train, Validation, and Test Sets Shown")

      plot of chunk unnamed-chunk-4

      Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!

      DEMO: h2o + timetk, Time Series Machine Learning

We’ll follow a similar workflow for time series machine learning from the timetk + linear regression post on Tuesday. However, this time we’ll swap out the lm() function for h2o.automl() to get superior accuracy!

      Time Series Machine Learning

      Time series machine learning is a great way to forecast time series data, but before we get started here are a couple pointers for this demo:

      • Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.

      • Objective: We’ll predict the next 8 months of data for 2017 using the time series signature. We’ll then compare the results to the two prior demos that predicted the same data using different methods: timetk + lm() (linear regression) and sweep + auto.arima() (ARIMA).

      We’ll go through a workflow that can be used to perform time series machine learning.

      Step 0: Review data

      Just to show our starting point, let’s print out our beer_sales_tbl. We use glimpse() to take a quick peek at the data.

# Starting point
beer_sales_tbl %>% glimpse()
## Observations: 92
## Variables: 2
## $ date   2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01, 20...
## $ price  6558, 7481, 9475, 9424, 9351, 10552, 9077, 9273, 94...

      Step 1: Augment Time Series Signature

      The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use glimpse() for quick inspection. See how there are now 30 features. Not all will be important, but some will.

# Augment (adds data frame columns)
beer_sales_tbl_aug <- beer_sales_tbl %>%
    tk_augment_timeseries_signature()

beer_sales_tbl_aug %>% glimpse()
## Observations: 92
## Variables: 30
## $ date       2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01...
## $ price      6558, 7481, 9475, 9424, 9351, 10552, 9077, 9273...
## $ index.num  1262304000, 1264982400, 1267401600, 1270080000,...
## $ diff       NA, 2678400, 2419200, 2678400, 2592000, 2678400...
## $ year       2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,...
## $ year.iso   2009, 2010, 2010, 2010, 2010, 2010, 2010, 2010,...
## $ half       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,...
## $ quarter    1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2,...
## $ month      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,...
## $ month.xts  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, ...
## $ month.lbl  January, February, March, April, May, June, Jul...
## $ day        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ hour       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ minute     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ second     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hour12     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ am.pm      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wday       6, 2, 2, 5, 7, 3, 5, 1, 4, 6, 2, 4, 7, 3, 3, 6,...
## $ wday.xts   5, 1, 1, 4, 6, 2, 4, 0, 3, 5, 1, 3, 6, 2, 2, 5,...
## $ wday.lbl   Friday, Monday, Monday, Thursday, Saturday, Tue...
## $ mday       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ qday       1, 32, 60, 1, 31, 62, 1, 32, 63, 1, 32, 62, 1, ...
## $ yday       1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 30...
## $ mweek      5, 6, 5, 5, 5, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5,...
## $ week       1, 5, 9, 13, 18, 22, 26, 31, 35, 40, 44, 48, 1,...
## $ week.iso   53, 5, 9, 13, 17, 22, 26, 30, 35, 39, 44, 48, 5...
## $ week2      1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1,...
## $ week3      1, 2, 0, 1, 0, 1, 2, 1, 2, 1, 2, 0, 1, 2, 0, 1,...
## $ week4      1, 1, 1, 1, 2, 2, 2, 3, 3, 0, 0, 0, 1, 1, 1, 1,...
## $ mday7      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

      Step 2: Prep the Data for H2O

      We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps.

beer_sales_tbl_clean <- beer_sales_tbl_aug %>%
    select_if(~ !is.Date(.)) %>%
    select_if(~ !any(is.na(.))) %>%
    mutate_if(is.ordered, ~ as.character(.) %>% as.factor)

beer_sales_tbl_clean %>% glimpse()
## Observations: 92
## Variables: 28
## $ price      6558, 7481, 9475, 9424, 9351, 10552, 9077, 9273...
## $ index.num  1262304000, 1264982400, 1267401600, 1270080000,...
## $ year       2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,...
## $ year.iso   2009, 2010, 2010, 2010, 2010, 2010, 2010, 2010,...
## $ half       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,...
## $ quarter    1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2,...
## $ month      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,...
## $ month.xts  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, ...
## $ month.lbl  January, February, March, April, May, June, Ju...
## $ day        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ hour       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ minute     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ second     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hour12     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ am.pm      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wday       6, 2, 2, 5, 7, 3, 5, 1, 4, 6, 2, 4, 7, 3, 3, 6,...
## $ wday.xts   5, 1, 1, 4, 6, 2, 4, 0, 3, 5, 1, 3, 6, 2, 2, 5,...
## $ wday.lbl   Friday, Monday, Monday, Thursday, Saturday, Tu...
## $ mday       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ qday       1, 32, 60, 1, 31, 62, 1, 32, 63, 1, 32, 62, 1, ...
## $ yday       1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 30...
## $ mweek      5, 6, 5, 5, 5, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5,...
## $ week       1, 5, 9, 13, 18, 22, 26, 31, 35, 40, 44, 48, 1,...
## $ week.iso   53, 5, 9, 13, 17, 22, 26, 30, 35, 39, 44, 48, 5...
## $ week2      1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1,...
## $ week3      1, 2, 0, 1, 0, 1, 2, 1, 2, 1, 2, 0, 1, 2, 0, 1,...
## $ week4      1, 1, 1, 1, 2, 2, 2, 3, 3, 0, 0, 0, 1, 1, 1, 1,...
## $ mday7      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

      Let’s split into a training, validation and test sets following the time ranges in the visualization above.

# Split into training, validation and test sets
train_tbl <- beer_sales_tbl_clean %>% filter(year <  2016)
valid_tbl <- beer_sales_tbl_clean %>% filter(year == 2016)
test_tbl  <- beer_sales_tbl_clean %>% filter(year == 2017)

      Step 3: Model with H2O

      First, fire up h2o. This will initialize the Java Virtual Machine (JVM) that H2O uses locally.

h2o.init()  # Fire up h2o
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         46 minutes 4 seconds 
##     H2O cluster version:        3.14.0.3 
##     H2O cluster version age:    1 month and 5 days  
##     H2O cluster name:           H2O_started_from_R_mdancho_pcs046 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.51 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.1 (2017-06-30)
h2o.no_progress()  # Turn off progress bars

      We change our data to an H2OFrame object that can be interpreted by the h2o package.

# Convert to H2OFrame objects
train_h2o <- as.h2o(train_tbl)
valid_h2o <- as.h2o(valid_tbl)
test_h2o  <- as.h2o(test_tbl)

      Set the names that h2o will use as the target and predictor variables.

# Set names for h2o
y <- "price"
x <- setdiff(names(train_h2o), y)

      Apply any regression model to the data. We’ll use h2o.automl.

      • x = x: The names of our feature columns.
      • y = y: The name of our target column.
      • training_frame = train_h2o: Our training set consisting of data from 2010 to start of 2016.
      • validation_frame = valid_h2o: Our validation set consisting of data in the year 2016. H2O uses this to ensure the model does not overfit the data.
      • leaderboard_frame = test_h2o: The models get ranked based on MAE performance against this set.
      • max_runtime_secs = 60: We supply this to speed up H2O’s modeling. The algorithm has a large number of complex models so we want to keep things moving at the expense of some accuracy.
      • stopping_metric = "deviance": Use deviance as the stopping metric, which provides very good results for MAPE.
# Model with H2O AutoML (any regression model could be used here)
automl_models_h2o <- h2o.automl(
    x                 = x,
    y                 = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 60,
    stopping_metric   = "deviance")

      Next we extract the leader model.

# Extract leader model
automl_leader <- automl_models_h2o@leader

      Step 4: Predict

      Generate predictions using h2o.predict() on the test data.

      pred_h2o<-h2o.predict(automl_leader,newdata=test_h2o)

      Step 5: Evaluate Performance

There are a few ways to evaluate performance. We’ll go through the easy way, which is h2o.performance(). This yields a preset list of metrics that are commonly used to compare regression models, including root mean squared error (RMSE) and mean absolute error (MAE).

      h2o.performance(automl_leader,newdata=test_h2o)
      ## H2ORegressionMetrics: gbm## ## MSE:  340918.3## RMSE:  583.8821## MAE:  467.8388## RMSLE:  0.04844583## Mean Residual Deviance :  340918.3

Our preferred metric for this assessment is mean absolute percentage error (MAPE), which is not included above but is easy to calculate. First, we investigate the error on our test set (actuals vs predictions).

# Investigate test error
error_tbl <- beer_sales_tbl %>%
    filter(lubridate::year(date) == 2017) %>%
    add_column(pred = pred_h2o %>% as.tibble() %>% pull(predict)) %>%
    rename(actual = price) %>%
    mutate(
        error     = actual - pred,
        error_pct = error / actual
    )
error_tbl
      ## # A tibble: 8 x 5##         date actual      pred     error    error_pct##                           ## 1 2017-01-01   8664  8241.261  422.7386  0.048792541## 2 2017-02-01  10017  9495.047  521.9534  0.052106763## 3 2017-03-01  11960 11631.327  328.6726  0.027480989## 4 2017-04-01  11019 10716.038  302.9619  0.027494498## 5 2017-05-01  12971 13081.857 -110.8568 -0.008546509## 6 2017-06-01  14113 12796.170 1316.8296  0.093306142## 7 2017-07-01  10928 10727.804  200.1962  0.018319563## 8 2017-08-01  12788 12249.498  538.5016  0.042109915

For comparison’s sake, we can calculate a few residual metrics.

error_tbl %>%
    summarise(
        me   = mean(error),
        rmse = mean(error^2)^0.5,
        mae  = mean(abs(error)),
        mape = mean(abs(error_pct)),
        mpe  = mean(error_pct)
    ) %>%
    glimpse()
      ## Observations: 1## Variables: 5## $ me    440.1246## $ rmse  583.8821## $ mae   467.8388## $ mape  0.03976961## $ mpe   0.03763299

      And The Winner of Demo Week Is…

      The MAPE for the combination of h2o + timetk is superior to the two prior demos:

      • timetk + h2o: MAPE = 3.9% (This demo)
      • timetk + linear regression: MAPE = 4.3% (timetk demo)
• sweep + ARIMA: MAPE = 4.3% (sweep demo)

      A question for the interested reader to figure out: What happens to the accuracy when you average the predictions of all three different methods? Try it to find out.
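If you want a head start on that exercise, here is a minimal sketch of a simple ensemble average. It assumes you have kept each demo’s test-period predictions as numeric vectors – pred_h2o_vec, pred_lm_vec and pred_arima_vec are hypothetical names, not objects created in these posts.

# Hedged sketch: simple average of three forecasts over the same test period
# (pred_h2o_vec, pred_lm_vec, pred_arima_vec are hypothetical vectors you
#  would have saved from the three demos)
ensemble_tbl <- error_tbl %>%
    dplyr::select(date, actual) %>%
    dplyr::mutate(
        pred_avg  = (pred_h2o_vec + pred_lm_vec + pred_arima_vec) / 3,
        error     = actual - pred_avg,
        error_pct = error / actual
    )

# MAPE of the simple ensemble
ensemble_tbl %>%
    dplyr::summarise(mape = mean(abs(error_pct)))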

      Note that the accuracy of time series machine learning may not always be superior to ARIMA and other forecast techniques including those implemented by prophet and GARCH methods. The data scientist has a responsibility to test different methods and to select the right tool for the job.

      HaLLowEen TRick oR TrEat BoNuS!

      We are going to visualize the forecast compared to the actual values, but this time taking a cue from @lenkiefer’s theme_spooky described in one of his recent posts, Mortgage Rates are Low!

We’re going to need to load a few libraries to get set up. The biggest challenge is the fonts, but there’s a really cool package called extrafont that we can use. We’ll use extrafont to load the Chiller fontset. Load the bonus library.

# Libraries needed for bonus material
library(extrafont)  # More fonts!! We'll use Chiller

Next, you’ll need to set up the Chiller font. Revolutions Analytics has a great article, How to Use Your Favorite Fonts in R Charts, which will get you up and running with extrafont. IMPORTANT: Make sure you go through the process of loading your system fonts with font_import().
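If you have never run the import before, the one-time setup looks roughly like this (it scans your installed system fonts and can take several minutes):

library(extrafont)

# One-time setup: register the fonts installed on your system
# (may prompt for confirmation; can take a few minutes)
font_import()

# Check that Chiller was picked up
fonts()[grepl("Chiller", fonts())]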

Once fonts are imported, you can load them with:

# Loads Chiller and a bunch of system fonts
# Note - Your fontset may differ if you are using Mac / Linux
loadfonts(device = "win")

      We’ll use Len’s script for theme_spooky(). I highly encourage you to use theme_spooky() all month of October around the office. Very spooky, and surprisingly engaging. 🙂

# Create spooky dark theme:
theme_spooky = function(base_size = 10, base_family = "Chiller") {
    theme_grey(base_size = base_size, base_family = base_family) %+replace%
        theme(
            # Specify axis options
            axis.line = element_blank(),
            axis.text.x = element_text(size = base_size * 0.8, color = "white", lineheight = 0.9),
            axis.text.y = element_text(size = base_size * 0.8, color = "white", lineheight = 0.9),
            axis.ticks = element_line(color = "white", size = 0.2),
            axis.title.x = element_text(size = base_size, color = "white", margin = margin(0, 10, 0, 0)),
            axis.title.y = element_text(size = base_size, color = "white", angle = 90, margin = margin(0, 10, 0, 0)),
            axis.ticks.length = unit(0.3, "lines"),
            # Specify legend options
            legend.background = element_rect(color = NA, fill = "gray10"),
            legend.key = element_rect(color = "white", fill = "gray10"),
            legend.key.size = unit(1.2, "lines"),
            legend.key.height = NULL,
            legend.key.width = NULL,
            legend.text = element_text(size = base_size * 0.8, color = "white"),
            legend.title = element_text(size = base_size * 0.8, face = "bold", hjust = 0, color = "white"),
            legend.position = "none",
            legend.text.align = NULL,
            legend.title.align = NULL,
            legend.direction = "vertical",
            legend.box = NULL,
            # Specify panel options
            panel.background = element_rect(fill = "gray10", color = NA),
            #panel.border = element_rect(fill = NA, color = "white"),
            panel.border = element_blank(),
            panel.grid.major = element_line(color = "grey35"),
            panel.grid.minor = element_line(color = "grey20"),
            panel.spacing = unit(0.5, "lines"),
            # Specify facetting options
            strip.background = element_rect(fill = "grey30", color = "grey10"),
            strip.text.x = element_text(size = base_size * 0.8, color = "white"),
            strip.text.y = element_text(size = base_size * 0.8, color = "white", angle = -90),
            # Specify plot options
            plot.background = element_rect(color = "gray10", fill = "gray10"),
            plot.title = element_text(size = base_size * 1.2, color = "white", hjust = 0, lineheight = 1.25, margin = margin(2, 2, 2, 2)),
            plot.subtitle = element_text(size = base_size * 1, color = "white", hjust = 0, margin = margin(2, 2, 2, 2)),
            plot.caption = element_text(size = base_size * 0.8, color = "white", hjust = 0),
            plot.margin = unit(rep(1, 4), "lines")
        )
}

      Now let’s create the final visualization so we can see our spooky forecast… Conclusion from the plot: It’s scary how accurate h2o is.

beer_sales_tbl %>%
    ggplot(aes(x = date, y = price)) +
    # Data - Spooky Orange
    geom_point(size = 2, color = "gray", alpha = 0.5, shape = 21, fill = "orange") +
    geom_line(color = "orange", size = 0.5) +
    geom_ma(n = 12, color = "white") +
    # Predictions - Spooky Purple
    geom_point(aes(y = pred), size = 2, color = "gray", alpha = 1, shape = 21, fill = "purple", data = error_tbl) +
    geom_line(aes(y = pred), color = "purple", size = 0.5, data = error_tbl) +
    # Aesthetics
    theme_spooky(base_size = 20) +
    labs(
        title    = "Beer Sales Forecast: h2o + timetk",
        subtitle = "H2O had highest accuracy, MAPE = 3.9%",
        caption  = "Thanks to @lenkiefer for theme_spooky!"
    )

(Plot: Beer Sales Forecast – h2o + timetk, rendered with theme_spooky)

      Next Steps

We’ve only scratched the surface of h2o. There’s more to learn, including working with classifiers and unsupervised learning. Here are a few resources to help you along the way:

      Announcements

      We have a busy couple of weeks. In addition to Demo Week, we have:

      Facebook LIVE DataTalk

      Matt was recently hosted on Experian DataLabs live webcast, #DataTalk, where he spoke about Machine Learning in Human Resources. The talk already has 80K+ views and is growing!! Check it out if you are interested in #rstats, #hranalytics and #MachineLearning.


(Embedded Facebook video: “Machine Learning to Reduce Employee Attrition w/ Business Science, LLC”, posted by Experian News on Thursday, October 26, 2017)

      EARL

      On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.

      😀Hey #rstats. I'll be presenting @earlconf on #MachineLearning applications in #HumanResources. Get 15% off tickets: https://t.co/b6JUQ6BSTl

      — Matt Dancho (@mdancho84) October 11, 2017

      Courses

      Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:

      The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.

      About Business Science

      Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

      Follow Business Science on Social Media


      To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.


      R Meetup with Hadley Wickham


      (This article was first published on R – Data Science Los Angeles, and kindly contributed to R-bloggers)

      In our next meetup, we’ll have Hadley (I think no introduction needed).

Packages made easy
by Hadley Wickham

In this talk I’ll show you how easy it is to make an R package by live coding one in front of you! You’ll learn about three useful packages:

• usethis: automating the setup of key package components
• devtools: for automating many parts of the package development workflow
• testthat: to write unit tests that make you confident your code is correct
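For a flavour of how those three packages fit together, here is a generic sketch of the workflow (this is not Hadley’s live-coded example; the package name and path are placeholders):

library(usethis)
library(devtools)

# usethis: scaffold a new package and its test infrastructure
create_package("~/path/to/mypkg")   # placeholder path
use_r("greet")                      # creates R/greet.R for your function
use_testthat()                      # sets up tests/testthat/
use_test("greet")                   # creates tests/testthat/test-greet.R

# devtools: the day-to-day development loop
load_all()    # load the package code into the session
test()        # run the testthat unit tests
document()    # rebuild help files from roxygen comments
check()       # run R CMD check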

      Timeline:

– 6:30pm arrival, drinks and networking
– 7:00pm talk

You must have a confirmed RSVP, and please arrive by 6:55pm at the latest. Please RSVP here on Eventbrite.

Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one). Park underneath the building (Colorado Center); Edmunds will validate.


      To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Los Angeles.


      It’s a dirty job, but someone’s got to do it..


      (This article was first published on HighlandR, and kindly contributed to R-bloggers)

      A tidytext analysis of Faith No More lyrics –

      Is this a midlife crisis?

I wanted to ease myself back into text mining, specifically using the tidytext package, as I haven’t had to do any at work for well over a year.

      I’ve been thinking about some of the old bands of the 90’s, some of whom split up, and then reformed. I was interested to see how lyrics evolve over time, and to see what changes there were as the band matured.

      I’ve decided to look at Faith No More, because:

      • they recently reformed after an 18 year split
      • they had numerous line up changes, the main one being a change of vocalist shortly before the release of their 3rd (and most successful) album, “The Real Thing”.
      • Said vocalist, Mike Patton, was only 21 when he joined. I thought it might be interesting to see how his lyrics evolved over time
      • Their 4th album “Angel Dust” has been named the most influential rock album of all time.

Let’s get one thing out of the way – there will be some review of the band’s history as well as the R analysis, but I promise – there will be no wordclouds.

To start – let’s look at the number of tracks being analysed by album. I’ve excluded “We Care A Lot” from “Introduce Yourself”, as it’s the same as the title track from their first album and would skew the analysis. I’ve also excluded cover versions, bonus tracks, and one track from Sol Invictus, because the title is a very naughty word indeed.

      2017-10-22-Tracks-Analysed-by-Album.png

      Most of the examples here follow the code available on the most excellent http://tidytextmining.com/. Big thanks to the package authors, Julia Silge and David Robinson for this package. I had done some text mining prior to the release of this package, but as someone used to dataframes, I soon found my progress limited to following a small number of tutorials and not knowing how to manipulate term document matrices etc to conduct further analysis. Tidytext is a game-changer because the data remains in dataframe/tibble format, and we can use dplyr and ggplot2 to wrangle and visualise.

      I had originally hoped this whole exercise would give me some experience with web-scraping – but the site I chose only allowed the use of an API – and while there was a Python one, there was no R alternative. Regardless – I was able to collate the data I needed.

      The original dataframe consisted of Title, Album and a lyrics column, and this was transformed into a format suitable for text analysis:

text_df <- data %>%
    unnest_tokens(word, lyrics) %>%
    filter(word %notin% c("b", "e", "a", "g", "r", "s", "i", "v", "la")) %>%
    mutate(linenumber = row_number())

One song (“Be Aggressive”) has a cheerleader-style chorus where the title is spelled out letter by letter – so I excluded these letters, along with “la”, which was also featuring highly in my initial plot of common words. After also filtering out common “stop words” (a minimal sketch of that step is below), I arrived at the plot that follows:
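For reference, the stop word removal is just an anti_join against tidytext’s built-in stop_words table – a minimal sketch of that step, applied to the text_df built above:

library(tidytext)
library(dplyr)

# Drop common English stop words ("the", "and", "of", ...) before counting
text_df_filtered <- text_df %>%
    anti_join(stop_words, by = "word")

# Most common remaining words
text_df_filtered %>%
    count(word, sort = TRUE)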

      2017-10-22-Most-Common-Words.png

      This looks pretty boring, so let’s at least split it by album:

      2017-10-22-Most-Common-Words-by-Album.png

      We Care A Lot

      2017-10-22-WCAL.jpg

      2017-10-22-WCAL-lineup.jpg

      Unsurprisingly, “We Care A Lot” dominates the first album, given that it is repeated endlessly throughout the title track. Live favourite “As The Worm Turns” doesn’t feature though.

      Introduce Yourself

      2017-10-22-IY.jpg

      2017-10-22-IY-lineup.jpg

Looking at “Introduce Yourself”, I’m surprised the title track doesn’t feature more heavily. Of the top words shown here, I recognise a couple from “Anne’s Song”. None from my under-the-radar favourites “Faster Disco” and “Chinese Arithmetic” though.

      The Real Thing

      2017-10-22-TRT.jpg

      2017-10-22-TRT-lineup.jpg

Now on to the golden age of the band – Mike Patton unleashed his 6-octave vocal range and guitarist Jim Martin was still alternately shredding and providing short, clever solos. It took me a while to figure out why “yeah” was so high, then I remembered the chorus of “Epic”. Never heard Faith No More? Start with this one. Also featuring here are words from “Underwater Love”, “Surprise! You’re Dead!” and personal favourite “From Out of Nowhere”.

      Angel Dust

      2017-10-22-AD.jpg

      2017-10-22-AD-lineup.jpg

“Angel Dust” is dominated by words from just 2 songs – the aforementioned “Be Aggressive” and “Midlife Crisis”. Other popular tracks from this album were “A Small Victory” and their hit cover of the Commodores’ “Easy”.

      King For A Day…Fool For A Lifetime

      2017-10-22-KFAD.jpg

      2017-10-22-KFAD-lineup.jpg

      “King For A Day…Fool For A Lifetime” marks the start of the post-Jim Martin era and stands out in this plot in that it has a number of different tracks with frequently mentioned words. This album saw the band alternate many musical styles – sometimes in the same track – an example being “Just A Man”. This was one of the album’s highlights (see also – “Evidence”) but there was some filler on there too unfortunately.

      Album Of The Year

      2017-10-22-AOTY.jpg

      2017-10-22-AOTY-lineup.jpg

      The ironically titled “Album of the Year” reverts to being dominated by a few tracks, including single “Last Cup of Sorrow”. Other highlights were “Ashes To Ashes” and “Stripsearch”.

Shortly after the release of this album, and after years of in-fighting, the band called it a day. They didn’t match the success of “The Real Thing” in the US, although the rest of the world were more switched on and “Angel Dust” actually outsold it in some areas.

      Even towards the end, they were popular enough in the UK to appear on “TFI Friday” (popular 90’s live entertainment show, hosted by then top presenter Chris Evans) 2 weeks running. While Britpop was a thing, FNM played “Ashes to Ashes” and turned in one of the finest live performances on UK TV (according to the internet).

      And, for a long time (18 years) it looked like that was that. An avant-garde rock band consigned to history. But, following several years of occasional live shows, the band returned in their final lineup and released “Sol Invictus”. I have to confess I have not listened to this album, apart from a few tracks. “Sunny Side Up” being the stand-out for me so far.

      Enough history, let’s look at some more plots and R code.

song_words <- data %>%
    unnest_tokens(word, lyrics) %>%
    count(Title, word, sort = TRUE) %>%
    filter(n > 30) %>%
    ungroup()

ggplot(song_words, aes(reorder(word, n), n, fill = Title)) +
    geom_col() +
    xlab(NULL) +
    ylab(NULL) +
    coord_flip() +
    theme_ipsum() +
    scale_fill_ipsum() +
    #scale_fill_viridis(option = "C", discrete = TRUE) +
    ggtitle("Most Common Words - All Faith No More Studio Songs",
            subtitle = "Frequency >30. Including common stop words") +
    theme(legend.position = "bottom")

      2017-10-22-MostCommonWordsbySongIncludingStopWords.png

## now remove stop words and the letters from
# "Be Aggressive"
song_words_filtered <- data %>%
    unnest_tokens(word, lyrics) %>%
    anti_join(stop_words) %>%
    filter(stringr::str_detect(word, "[a-z`]$"),
           !word %in% stop_words$word) %>%
    filter(word %notin% c("b", "e", "a", "g", "r", "s", "i", "v", "la")) %>%
    count(Title, word, sort = TRUE) %>%
    filter(n > 20) %>%
    ungroup()

ggplot(song_words_filtered, aes(reorder(word, n), n, fill = Title)) +
    geom_col() +
    xlab(NULL) +
    ylab(NULL) +
    coord_flip() +
    theme_ipsum() +
    scale_fill_ipsum() +
    ggtitle("Most Common Words - All Faith No More Studio Songs",
            subtitle = "Frequency >30. Excluding common stop words") +
    theme(legend.position = "bottom")

      2017-10-22-MostCommonWordsbySongExcludingStopWords.png

Term frequency by album – the number of times a word is used in the lyrics divided by the total number of words in the lyrics. A lot of words occur rarely and very few occur frequently:

Inverse document frequency (combined with term frequency as tf-idf) – an attempt to identify words that don’t occur frequently but are important.
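tidytext makes this calculation a one-liner with bind_tf_idf(). Here is a hedged sketch of how the album-level numbers could be produced – album_words is my assumed reconstruction of the per-album word counts (Album, word, n) that also feed the correlation plot further down; the author’s actual object may have been built slightly differently:

library(tidytext)
library(dplyr)

# Assumed reconstruction: word counts per album
album_words <- text_df %>%
    count(Album, word, sort = TRUE)

# Add term frequency, inverse document frequency and tf-idf
album_tf_idf <- album_words %>%
    bind_tf_idf(word, Album, n) %>%
    arrange(desc(tf_idf))

# Top distinctive words per album
album_tf_idf %>%
    group_by(Album) %>%
    top_n(10, tf_idf) %>%
    ungroup()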

      2017-10-22-Inverse-Term-Document-Frequency-by-Album.png

      Now we’ve looked at common and important words, let’s see if we can get a handle on the sentiment of those words. Admittedly, this first plot is a bit of a gimmick, but rather than use geom_col, or geom_point, I thought I’d use the ggimage package, and mark the overall album sentiment using the relevant album covers.

      Here’s the code:

# overall album sentiment
albumsentiment <- text_df %>%
    inner_join(get_sentiments("bing")) %>%
    count(Album, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)

# create vector of images
albumsentiment$img <- c("2017-10-22-WCAL.jpg", "2017-10-22-IY.jpg", "2017-10-22-TRT.jpg",
                        "2017-10-22-AD.jpg", "2017-10-22-KFAD.jpg", "2017-10-22-AOTY.jpg",
                        "2017-10-22-SI.jpg")

ggplot(albumsentiment, aes(Album, sentiment, fill = Album)) +
    #geom_col(show.legend = FALSE) +
    theme_ipsum() +
    ggExtra::removeGrid() +
    coord_flip() +
    geom_image(aes(image = img), size = .15) +
    ggtitle("Overall Sentiment Score by Album",
            subtitle = "Faith No More Studio Albums - Excludes cover versions") +
    labs(x = NULL, y = NULL) +
    geom_text(aes(label = sentiment), hjust = 0, nudge_y = 7) +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank())

      2017-10-22-Sentiment-by-Album.png

So only “Introduce Yourself” managed to get a positive overall sentiment score. I’m surprised that “Album of the Year” scores as highly as it does given the band were on the verge of splitting up. However – this is simply a case of summing positive & negative scores by individual word and finding the overall difference – so it’s definitely not an exact science. Songs like “Collision” and “Helpless” do not convey positive emotion to me.

      Here is a more straightforward plot of sentiment by track:

      2017-10-22-Sentiment-by-track2.png

      Topic Models

      2017-10-22-Topic_Models.png
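For reference, a typical tidytext topic-modelling pipeline casts the word counts to a document-term matrix, fits an LDA model with the topicmodels package, and tidies the result back into a data frame. This is a hedged sketch (the choice of k = 4 topics is arbitrary, and this is not necessarily the exact code behind these plots):

library(tidytext)
library(topicmodels)
library(dplyr)
library(ggplot2)

# Cast per-album word counts (Album, word, n) to a document-term matrix
album_dtm <- album_words %>%
    cast_dtm(Album, word, n)

# Fit a 4-topic LDA model (k is a judgment call)
album_lda <- LDA(album_dtm, k = 4, control = list(seed = 1234))

# Tidy to one row per topic per term and plot the top terms
tidy(album_lda, matrix = "beta") %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()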

      I’m going to take the easy way out and refer you to the tidytext website to explain topic models. I have to be honest, I don’t really see any themes that make any sense to me here. So, I thought I should look at lyrics by individual vocalist:

      2017-10-22-Topic_Models_Chuck.png

      2017-10-22-Topic_Models_patton.png

      Here I take the first 5 topics by both to see if there is anything in common:

      2017-10-22-Top_Topic_Model_by_Vocalist.png

      After a bit of digging around, it turned out only 2 terms existed in both sets:

      2017-10-22-Common_Terms_between_Vocalists.png

Hmm, “Love” and “World”. Seems a bit “touchy feely”, as they say. Perhaps I should look at this by album and see what topics appear?

      2017-10-22-Topic_Models-WCAL.png

      2017-10-22-Topic_Models-IY.png

      2017-10-22-Topic_Models-TRT.png

      2017-10-22-Topic_Models-AD.png

      2017-10-22-Topic_Models-KFAD.png

      2017-10-22-Topic_Models-AOTY.png

      2017-10-22-Topic_Models-SI.png

      I’d expected perhaps that words belonging to the same song would get grouped together, but that doesn’t appear to happen very often.

I also looked at splitting the lyrics into trigrams (sequences of 3 words, e.g. “We Care A”, “Care A Lot”, “A Lot We”, “Lot We Care”, etc.). By far the most common trigram was “Separation Anxiety Hey” from “Separation Anxiety”.

I also looked at words where “Not” was the first of the 3 words in the trigram, to see if they were of positive or negative sentiment. First, the second word of the 3, then the final word of the 3.

      2017-10-22-Negating-words-2-of-3.png

      2017-10-22-Negating-words-3-of-3.png
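Here is a hedged sketch of the trigram approach described above: tokenize into three-word sequences, split them into separate columns, then check the sentiment of words that follow a negation. The column names are assumptions; the author’s actual code may differ.

library(tidytext)
library(tidyr)
library(dplyr)

# Tokenize lyrics into trigrams and split into one column per word
trigrams <- data %>%
    unnest_tokens(trigram, lyrics, token = "ngrams", n = 3) %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ")

# Most common trigrams
trigrams %>%
    count(word1, word2, word3, sort = TRUE)

# Sentiment of the word that directly follows "not"
trigrams %>%
    filter(word1 == "not") %>%
    inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
    count(word2, sentiment, sort = TRUE)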

      This can be expanded to look at other types of negating words:

      2017-10-22-Negative-trigrams-Word-2-of-3-by-negating-word.png

      2017-10-22-Negative-trigrams-Word-3-of-3-by-negating-word.png

      I also tried producing a network plot of the trigrams. Bit of a mess to be honest, but here it is:

      2017-10-22-correlation-plot.png

Finally – a correlation plot looking at how closely linked each album is to the others, in terms of the words used:

library(widyr)

album_word_cors <- album_words %>%
    pairwise_cor(Album, word, n, sort = TRUE)

album_word_cors

library(ggraph)
library(igraph)

set.seed(1234)

album_word_cors %>%
    filter(correlation > .05) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(alpha = correlation, width = correlation)) +
    geom_node_point(size = 6, color = "lightblue") +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_void()

ggsave("2017-10-22-album-word-correlations.png", width = 8, height = 5)

      2017-10-22-album-word-correlations.png

This is more interesting – although the correlations are pretty weak overall. However, The Real Thing, Angel Dust and King For A Day have the strongest correlations with each other. There appears to be less correlation over time between the lyrics used in The Real Thing compared to Album of the Year.

There appears to be little (or no) correlation between Album of the Year and Sol Invictus – that 18-year gap really made a difference. But what is interesting is the thick line between Sol Invictus and Introduce Yourself – 28 years apart, and from 2 different vocalists.

However, the sting in the tail is that Mike Patton has said that he chose words more for their sound, and to fit the music, than for their actual meaning. That would explain the lyrics of “Midlife Crisis” and “A Small Victory”, but seems to short-change the cleverness of the lyrics in tracks like “Everything’s Ruined” and “Kindergarten”. Maybe this means this has been a massive waste of time? Regardless, I have enjoyed looking back at some great music from this wonderfully unique band.

Code for the plots, and the plots themselves, is available here.


      To leave a comment for the author, please follow the link and comment on their blog: HighlandR.


      Moving to Blogdown


      (This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

As I announced last Friday on Twitter, I have moved Data Imaginist to Blogdown from Jekyll. This is going to be a short post about why I did it. There’s not going to be much tutorial about this; the migration is as painless as can be expected and, besides, other more qualified people have already made great tutorials about the matter.

A little over a year ago I started blogging, using what I expect most people were using at the time (in R-land anyway) — the Jekyll/GitHub Pages combo. I’d stolen a script from David Robinson that converted new RMarkdown files into Markdown documents for Jekyll, resulting in a workflow quite like what Blogdown offers. So when Blogdown arrived, I had little inclination to look more into it. Blogdown might be newer and more flashy, but it didn’t seem to offer enough to warrant the time investment it would take to migrate my posts and learn about Hugo. Arguably it is possible to ignore the Hugo backend when using Blogdown, but I like to tinker and would never be satisfied using an engine I didn’t understand (this might explain why I have been poking at ggplot2 to the extent I have).

Bottom line: I let it pass…

Then, little by little, I was drawn in anyway. The first seducer was the slew of blog posts describing the bliss that migrating your blog to Blogdown had brought. In addition, those posts also spoke highly of the hosting from Netlify and their free HTTPS support. Being around Bob Rudis (in the Twitterverse) had made me aware of the need for secure browsing habits, so this appealed a lot to me (even though the HTTPS support came through Let’s Encrypt – something Bob vehemently opposes). The second seducer was more of a repulsor from my current setup: my page was falling apart… Due to changes in Jekyll, my internal links were no longer functioning and I would need to go through my whole theme to fix it. What was most annoying was that this stabbed me in the back: Jekyll had been upgraded on GitHub Pages at some point, silently wreaking havoc on my site without me knowing it – clearly an unacceptable setup. Lastly, as convenient as the automatic Jekyll building by GitHub was (when it didn’t break my site), it was just as annoying to do it locally to preview a post before publishing. Maybe it is because I haven’t spent much time in Ruby-world, but I never quite managed to get everything installed correctly so I could ensure that my own system matched what GitHub was using… The final push came when I for a brief period was the designer of the Blogdown logo and decided that now was the right time to make the switch.

As I said in the beginning, this will not be a post about how to switch to Blogdown from Jekyll, as I’m horrible at writing such posts. I can, however, describe my approach to migrating. I started by browsing the available Hugo themes and, to my joy, the one I was currently using had been ported and improved upon, so the choice was easy (to be honest I’m not really, really pleased with the look of my blog, but the Cactus theme is pretty stylish in its minimalism). Once the theme was sorted out I began moving my posts into the new system one by one to see how it went. To my surprise it went predominantly without a hitch. Blogdown sets some other default figure sizes, which resulted in some of my animations becoming unbearably large, so this had to be adjusted, but other than that it was mainly a matter of switching the Jekyll macros for Hugo shortcodes. My most used macro was for linking between blog posts, which is akin to the ref shortcode, but it took me a while to figure out that the argument to ref had to be enclosed in escaped quotes to survive the RMarkdown + Pandoc conversion. This could be a nice thing for Blogdown to take care of, as it feels pretty cumbersome to write a path as '\\"my-post.html\\"'.

Once I had gotten all my posts successfully into Blogdown I needed to get the rest going, which for the most part meant porting all my hacks and changes from the Jekyll theme into the Hugo version. One thing I’m pretty serious about is how links to my posts look when shared on Twitter, so I had made a system where I could set the Twitter card type in the preamble of a post (large vs small) as well as a picture to use instead of my default logo. Porting this meant getting a bit into the Hugo templating system so that I could migrate my changes to the HTML header that gives these nice cards on Twitter/Facebook/LinkedIn. In the end, HTML templates seem to be so alike nowadays that moving from one to the next is mostly a matter of understanding the data binding and how it differs between the systems. If you have some custom templates in Jekyll I can assure you that they probably won’t take long to port over to Hugo…

My last big todo was my RSS feed. Had I only had one it would not have been an issue at all, but as it is my page actually has two – a general one and one unique to posts in the R category (this was only made for R-Bloggers). Hugo comes with automatic RSS generation for both the main site and for different tags and categories, but they all reside at different locations. My two feeds both reside at the root of my page – something not really supported by Hugo. But I couldn’t accept breaking subscriptions or links, so I scavenged the net and came to a solution – as my git commits tell, it was not straightforward…

In the end I prevailed. If you are in a similar situation, do look at my source code for the solution or contact me – I won’t bore the other 99.9 % with the details. The RSS episode was my first experience battling the Hugo system, and arguably only because I had to support a setup from Jekyll that was handled in another way by Hugo…

I wish I had a lot to tell you about my experience with Netlify, but I don’t. Their service is the poster child of easy, one-click solutions and I can wholeheartedly recommend them. The switch from GitHub to Netlify hosting took a total of around 20 minutes, including setting up HTTPS.

      As easy as I have hopefully made all of this sound, I really, really, really hope that something new and better doesn’t come out within a year. Doing this sort of infrastructure stuff is not something I particularly enjoy and I could have given my site a new look or begun developing the next version of gganimate instead of basically getting to the same page I had before. But sometimes you have to clean your room before you can play, and hopefully playtime is coming soon.


      To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist.



