
Fifteen New Zealand government Shiny web apps by @ellis2013nz


(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)


At a glance:

I had a brief look around New Zealand government agency websites and found 15 high-quality web apps built with the Shiny framework.

13 May 2018


Here are fifteen nice web applications built with RStudio’s Shiny framework. All are owned and maintained by New Zealand government departments, and their main purpose is to make public data more available and accessible for non-specialist users. I think it’s fair to say New Zealand has been a leader in using Shiny for this.

As a commissioner of several of those below and author of one, I can say the appeal of Shiny for this sort of public dissemination of data is the combination of cheap, quick and pretty good. It’s rare to have so little trade-off between those three items of the project management dilemma. Usually it’s “cheap, quick, good – pick any two”.

The modest trade-off in my view is in some small aspects of the eventual quality and performance. The server-client transfers can seem clunky (certainly compared to something written in pure JavaScript), particularly when you’re at the bottom of the world using a server in the USA. They have an annoying habit of freezing and going grey on your users if something goes wrong with the connection. And as they get more complicated, you start using more JavaScript, HTML and CSS anyway. But the results are 80% or 90% good enough for many use cases, and development and deployment costs are materially less than other options. I find Shiny more flexible, powerful and controllable (eg fonts, polish, etc which can all be done with CSS) than Tableau or Power BI, and cheaper and quicker than writing your own app from the ground up.
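That CSS polish takes very little code in Shiny. As a minimal sketch (a hypothetical toy app, not taken from any of the dashboards below; the fonts and colours are arbitrary):

```r
# Minimal sketch of styling a Shiny app with CSS (hypothetical example).
# A real app would more likely use includeCSS("www/style.css").
library(shiny)

ui <- fluidPage(
  tags$style(HTML("
    body { font-family: 'Source Sans Pro', sans-serif; }
    h2   { color: #00508F; }
  ")),
  h2("Demo dashboard"),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(100)))
}

# shinyApp(ui, server)  # uncomment to run locally
```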

So, go Shiny!

Ministry of Business, Innovation and Employment

I’m pretty sure the New Zealand Tourism Forecasts was the first use of Shiny by a government agency in New Zealand. My old team produced this tool in 2015. It’s nice and simple, not particularly ambitious, but it does the job of letting the user play around with the forecasts far more effectively, and with better presentation, than previous dissemination methods (ie Excel pivot tables, if you were lucky).
The New Zealand Tourism Dashboard was our first ambitious big Shiny project. A brilliant job by Jimmy Oh, first of his sequence of high quality boundary-pushing apps for MBIE. It combines data from MBIE itself, Stats NZ, and direct from the web. The source code is on GitHub.
Building on the style of the tourism dashboard came the New Zealand Sectors Dashboard. It aims to be a one-stop shop for all information about New Zealand’s economy by sectors. It brings together a range of economic datasets produced by MBIE and Statistics New Zealand into one easy-to-use tool.
And the New Zealand Labour Market Dashboard. It displays labour market information from many different sources in one place.
The Urban Development Capacity Dashboard, jointly branded by MBIE and the Ministry for the Environment, provides charts, maps, tables and underlying data on local markets for housing and business space.
The Modelled Territorial Authority GDP is the only Shiny app on this page I can claim to have authored personally. It was the tail end (dissemination) of a big project producing new granular estimates of value add by industry, district and city. There is a paper and presentation on this in the presentations part of my website, and source code for the app – and for the much bigger job of creating the data – is on GitHub.

Stats NZ

The Living Cost Explorer presents data from Household Living-costs Price Indexes. It shows how price changes vary depending upon the average basket of goods of different types of people, such as beneficiaries, Māori and superannuitants.
Irrigated land in New Zealand uses maps and graphs to present spatial information on irrigation of New Zealand land.
This Landcover tool shows composition and changes in land cover.
The third of these spatial / environmentally themed tools, Livestock numbers has graphs and maps showing the distribution of cattle (different types thereof), sheep and deer.
The Iwi cultural well-being from Te Kupenga 2013 app may be the first Shiny app with an option to swap the user interface into Te reo Māori.
This series of Experimental estimates of income has been derived from the tax data available in the Integrated Data Infrastructure as part of ongoing work at Stats NZ to increase the use of administrative data in the production of statistics.

Ministry of Health

An interactive tool for exploring New Zealand Health Survey data. It presents the latest results by sex, age, ethnic group and neighbourhood deprivation, as well as changes over time.
A tool providing summary data about prescriptions and dispensings funded by the New Zealand Government.

Treasury

Treasury’s Insights tool provides information drawn from a range of public sector agencies including extensive use of the Integrated Data Infrastructure.

Nice collection. Anyone know of any others?

Disclaimer: I was part of the commissioning team for several of the MBIE Shiny apps, and (as noted above) the author of one. I haven’t been involved in development of any of the others listed above.

Making animated GIFs of websites

To make the animated GIFs used in this website and keep them to a reasonable size (under 2MB each), here’s what I did.

  • I did the original screen captures using the open source CamStudio application and saved them as .avi files. Even though only a part of my screen was captured, with the original screen resolution and about 30 – 45 seconds of content, these files were large; typically 700MB or larger.
  • I used a Python program that GitHub user michaelosthege had published as a Gist to convert from .avi to .gif format. These were about a quarter to a fifth of the size, but still too large (150MB – 200MB) to use on the web.
  • I found another Python program by PaulineLc on another Gist that shrank and sped up animated GIFs.

Is it ironic that a blog post celebrating R Shiny used Python for playing around with the animated images? I don’t think so at all; it’s just a matter of using the most convenient tool for the job at hand. Python is awesome for everything to do with images. There are ways in R to do this too (or they could certainly be developed), but it was easier to find out how to do it in Python. I am not as good with Python as with R, but I know how to copy and paste a program someone else has written when it does exactly what I need!
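For what it’s worth, roughly the same GIF shrinking could be done in R with the magick package. This is a hedged sketch, not the method actually used for this post; the file names are placeholders and I haven’t compared output sizes with the Python scripts:

```r
# Sketch: shrink and speed up an animated GIF in R using the magick package.
# File names are placeholders.
library(magick)

gif <- image_read("screen-capture.gif")      # read all frames
gif <- gif[seq(1, length(gif), by = 2)]      # drop every second frame (speeds it up)
gif <- image_scale(gif, "600")               # scale frames to 600px wide
image_write(image_animate(gif, fps = 10), "screen-capture-small.gif")
```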


To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Do know Shake Shack’s locations outside of the US? You’d be surprised


(This article was first published on R – nandeshwar.info, and kindly contributed to R-bloggers)

Madison Shake

I had heard that the lines to get some food at Shake Shack are long. So when I saw a new location opening in downtown LA, I wondered how many locations it has and how fast they are spreading across the US. The answers surprised me. Using R and previous code, I created a few maps:

Shake it like a globetrotter

Read on to learn how I got the data and plotted them.

Load Libraries

First, let’s load our favourite libraries.

library(rvest)
library(readr)
library(tidyverse)
library(scales)
library(ggmap)

Figure out locations

On its site, Shake Shack fortunately has all the locations and opening dates, going back to April 23, 2012. The archive pages run from 1 to 20 with this URL structure:

https://www.shakeshack.com/location/page/

Using SelectorGadget, I figured out the XPath and CSS code to find the opening date, location name, and location page link. Then, I wrote a function to retrieve these values from a given archive page.

get_locations <- function(url) {
  page_html <- read_html(url)
  nodes <- page_html %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "span4", " " ))]')
  data.frame(opdate = html_nodes(x = nodes,
                                 xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "date", " " ))]') %>%
               html_text(trim = TRUE),
             store_loc_name = html_nodes(x = nodes, css = 'h2') %>%
               html_text(trim = TRUE),
             store_loc_link = html_nodes(x = nodes, css = 'h2 a') %>%
               html_attr("href"),
             stringsAsFactors = FALSE)
}

I applied this function to retrieve all location opening dates, names, and individual location urls:

all_loc_pages <- paste0("https://www.shakeshack.com/location/page/", 1:20, "/")

all_locations <- do.call(rbind, lapply(all_loc_pages, get_locations))

Find addresses of all locations

If you visit an individual location’s page, such as this Tokyo Dome page, you will see that often the exact address is not listed, or if it is, you can’t directly geocode it. But, luckily, there’s a Google Map right below the location. I thought they must be passing some parameters to the Google Maps API. I spent a good amount of time, but couldn’t figure out how they were getting the map. And. Then. I found out that the text “CLICK MAP FOR DIRECTIONS” block had a valid address as part of the hyperlink!!

I wrote another simple function to get the addresses from the given URL:

get_loc_cords <- function(loc_url) {
  location_html <- read_html(loc_url)
  data.frame(loc_url = loc_url,
             goog_map_url = location_html %>%
               html_nodes(xpath = '//a[text()="Click here for directions"]') %>%
               html_attr("href"),
             stringsAsFactors = FALSE)
}

location_google_maps_address <- do.call(rbind, lapply(all_locations$store_loc_link, get_loc_cords))

Then I joined the location name with the address data frame:

all_locations <- left_join(all_locations, location_google_maps_address,
                           by = c("store_loc_link" = "loc_url"))

Geocoding the addresses

Using the fantastic ggmap library and mutate_geocode function, I geocoded all the addresses:

all_locations <- all_locations %>%
  mutate(google_addr_string = str_sub(goog_map_url, start = 36)) %>%
  mutate_geocode(google_addr_string, output = "latlon")

Here’s what the data frame looks like now:

Tip

You may want to create a Google developer key for mass geocoding. Since the mutate_geocode function is shared by many users, you may not always get all the addresses geocoded without one. Use register_google(key = , account_type = 'premium', day_limit = 100000), supplying your own key, to register it with the ggmap functions.

Data manipulation

Now that we have all the geographical coordinates, we just need to do some clean-up to get the data ready for plotting.

First, get the date field in order and add opening month and year columns:

all_locations <- all_locations %>%
  mutate(open_date = as.Date(opdate, "%B %d, %Y"),
         open_month = lubridate::month(open_date),
         open_year = lubridate::year(open_date))

Second, get the cumulative count of store openings:

ss_op_data_smry <- all_locations %>%
  count(open_date) %>%
  ungroup() %>%
  arrange(open_date) %>%
  mutate(cumm_n = cumsum(n))

Third, join the summary back to the locations data frame:

all_locations_smry <- inner_join(all_locations, ss_op_data_smry,
                                 by = c("open_date" = "open_date"))

Get the maps ready

Using the ggmap library, I got the US map and a world map:

us_map <- get_stamenmap(c(left = -125, bottom = 24, right = -67, top = 49),
                        zoom = 5, maptype = "toner-lite")
ggmap(us_map)

world_map <- get_stamenmap(bbox = c(left = -180, bottom = -60, right = 179.9999, top = 70),
                           zoom = 3, maptype = "toner-lite")
ggmap(world_map)

Create functions to plot each location

Repurposing my code from the Walmart spread across the US, I wrote a similar function to plot locations at two different sizes: big, if a location opened during the mapped month, and small, if it opened before the mapped month. I did this so that new locations stand out.

my_us_plot <- function(df, plotdate, mapid) {
  g <- ggmap(us_map, darken = c("0.8", "black"), extent = "device")
  old_df <- filter(df, open_date < plotdate)
  new_df <- filter(df, open_date == plotdate)
  # old locations
  g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4)
  # new locations
  g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4)
  g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(),
                 axis.text = element_blank(), plot.title = element_blank(),
                 panel.background = element_rect(fill = "grey20"),
                 plot.background = element_rect(fill = "grey20"))
  g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -77, y = 32,
                    label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  filename <- paste0("maps/img_", str_pad(mapid, 7, pad = "0"), ".png")
  ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 120, type = "cairo-png")
}

I modified this function to map the world:

my_world_plot <- function(df, plotdate, mapid) {
  g <- ggmap(world_map, darken = c("0.8", "black"), extent = "device")
  old_df <- filter(df, open_date < plotdate)
  new_df <- filter(df, open_date == plotdate)
  g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4)
  g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4)
  g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(),
                 axis.text = element_blank(), plot.title = element_blank(),
                 panel.background = element_rect(fill = "grey20"))
  g <- g + annotate("text", x = -130, y = 0, label = "MONTH/YEAR:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -130, y = -10,
                    label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  g <- g + annotate("text", x = -130, y = -20, label = "STORE COUNT:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -130, y = -30, label = comma(unique(new_df$cumm_n)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  filename <- paste0("maps/img_", str_pad(mapid, 7, pad = "0"), ".png")
  ggsave(filename = filename, plot = g, width = 12, height = 6, dpi = 150, type = "cairo-png")
}

Create maps

Now, the exciting part: create month-by-month maps.

US maps:

all_locations_smry %>%
  mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>%
  group_by(open_date) %>%
  do(pl = my_us_plot(all_locations_smry, unique(.$open_date), unique(.$mapid)))

World maps:

all_locations_smry %>%
  mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>%
  group_by(open_date) %>%
  do(pl = my_world_plot(all_locations_smry, unique(.$open_date), unique(.$mapid)))

Create a movie

Using ffmpeg, we can put all the images together to create a movie:

# works on a mac
makemovie_cmd <- paste0("ffmpeg -framerate 8 -y -pattern_type glob -i '",
                        paste0(getwd(), "/maps/"), "*.png'",
                        " -c:v libx264 -pix_fmt yuv420p '",
                        paste0(getwd(), "/maps/"), "movie.mp4'")
system(makemovie_cmd)

We can use ImageMagick’s convert command to create a gif:

# https://askubuntu.com/a/43767
# -loop 0 loops forever
makegif_cmd <- paste0("convert -delay 8 -loop 0 ",
                      paste0(getwd(), "/maps/"), "*.png ", "animated.gif")
system(makegif_cmd)

That’s it! We get nice-looking videos showing location openings month by month. I was surprised to see how fast the company is opening locations, as well as how many it has in Asia!

Post hoc

Using the ggimage library, I tried creating the maps with Shake Shack’s burger icon, but they didn’t turn out as well:

my_us_icon_plot <- function(df, plotdate, mapid) {
  g <- ggmap(us_map, darken = c("0.8", "black"))
  old_df <- filter(df, open_date < plotdate)
  new_df <- filter(df, open_date == plotdate)
  g <- g + geom_image(data = old_df, aes(x = lon, y = lat), image = "ss-app-logo.png",
                      by = "height", size = 0.03, alpha = 0.4)
  g <- g + geom_image(data = new_df, aes(x = lon, y = lat), image = "ss-app-logo.png",
                      by = "height", size = 0.07, alpha = 0.4)
  g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(),
                 axis.text = element_blank(), plot.title = element_blank())
  g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -77, y = 32,
                    label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:",
                    color = "white", size = rel(5), hjust = 0)
  g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)),
                    color = "white", size = rel(6), fontface = 2, hjust = 0)
  filename <- paste0("maps/img_", str_pad(mapid, 7, pad = "0"), ".png")
  ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 150, type = "cairo-png")
}

Fun maps

What do you think? How else would you visualize these data points?

The post Do know Shake Shack’s locations outside of the US? You’d be surprised appeared first on nandeshwar.info.


To leave a comment for the author, please follow the link and comment on their blog: R – nandeshwar.info.


automaton: Interpretation of Automaton Diagrams (Using TikZ)


(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for assessing the interpretation of an automaton diagram (drawn with TikZ) based on randomly generated input sequences.

Name:
automaton
Type:
Related:

Description:
An automaton diagram with four states A-D is drawn with TikZ and is to be interpreted, where A is always the initial state and one state is randomly picked as the accepting state. The acceptance of five binary 0/1 input sequences has to be assessed, with approximately a quarter of all sequences being accepted. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.
Solution feedback:
Yes
Randomization:
Random numbers, text blocks, and graphics
Mathematical notation:
No
Verbatim R input/output:
No
Images:
Yes
Other supplements:
No
Raw: (1 random version)
PDF:
automaton-Rnw-pdf
automaton-Rmd-pdf
HTML:
automaton-Rnw-html
automaton-Rmd-html

Demo code:

library("exams")
set.seed(1090)
exams2html("automaton.Rnw")
set.seed(1090)
exams2pdf("automaton.Rnw")
set.seed(1090)
exams2html("automaton.Rmd")
set.seed(1090)
exams2pdf("automaton.Rmd")

To leave a comment for the author, please follow the link and comment on their blog: R/exams.


logic: Interpretation of Logic Gates (Using TikZ)


(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for matching logic gate diagrams (drawn with TikZ) to the corresponding truth table.

Name:
logic
Type:
Related:

Description:
Gate diagrams for three logical operators (sampled from: and, or, xor, nand, nor) are drawn with TikZ and have to be matched to a truth table for another randomly drawn logical operator. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.
Solution feedback:
Yes
Randomization:
Shuffling, text blocks, and graphics
Mathematical notation:
No
Verbatim R input/output:
No
Images:
Yes
Other supplements:
No
Raw: (1 random version)
PDF:
logic-Rnw-pdf
logic-Rmd-pdf
HTML:
logic-Rnw-html
logic-Rmd-html

Demo code:

library("exams")
set.seed(1090)
exams2html("logic.Rnw")
set.seed(1090)
exams2pdf("logic.Rnw")
set.seed(1090)
exams2html("logic.Rmd")
set.seed(1090)
exams2pdf("logic.Rmd")

To leave a comment for the author, please follow the link and comment on their blog: R/exams.


Tips for Ellipse Summary Plot


(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

I received some questions privately and reply here, because the answers may also help others, including me.

plot_with_factors

1. How to specify size

With plot axis parameters.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], xlim=c(4,8), ylim=c(2,5))

2. How to specify color

With plot color parameter.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'))

3. How to give names

Using builtin iris data.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')])

Digging deeper

about iris data

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The column Species is used in both the x and y data. It gives the name of each category.

example

Using fivenum instead of the default ninenum.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'), SUMMARY=fivenum)

This call produces the plot shown above.

Below may help you see the values on each axis. Here, with fivenum, each 3rd value is the (x, y) pair of each category’s median.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], SUMMARY=fivenum, plot=FALSE)
$setosa
    x   y
1 4.3 2.3
2 4.8 3.2
3 5.0 3.4
4 5.2 3.7
5 5.8 4.4

$versicolor
    x   y
1 4.9 2.0
2 5.6 2.5
3 5.9 2.8
4 6.3 3.0
5 7.0 3.4

$virginica
    x   y
1 4.9 2.2
2 6.2 2.8
3 6.5 3.0
4 6.9 3.2
5 7.9 3.8


To leave a comment for the author, please follow the link and comment on their blog: R – ЯтомизоnoR.


AphA Scotland – it’s a thing


(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Reflections on AphA Scotland launch event –

On Tues 8th May there was only one Scottish-based member of the Association of Professional Healthcare Analysts (me), but on Wed 9th May that number rose to around 80 with the launch of the Scotland AphA branch.

The event took place in the very nice Perth Concert Hall, and consisted of several great speakers plus a series of workshops. I particularly enjoyed hearing from Mohammed Mohammed, who talked about setting up the NHS-R community, and how we might overcome the resistance to R in some quarters within the NHS.

In later discussions after the event, there was some conjecture that this might be because traditional IT depts cannot provide support for R. Thinking back to when I first got it installed, I think I was told that I would get no support for it, which I was totally cool with because I just wanted to ggplot EVERYTHING. (Actually, I based, and latticed, before I ggplotted, but you get the point).

The beauty of R though is that “support” is plentiful and 24/7 in the shape of the #rstats community. I think if someone has got to the point where they want to use R, they are well into “power-user” territory and beyond the scope of regular IT support anyway. Not only that, but to get to that point, they have almost certainly mastered the black arts of Google-Fu and Stack Overflowing like a demon. Therefore they are the sort of self starters who are not going to be bothering IT in the first place. In other words, if we want to use R, let us use R (responsibly).

By virtue of being member number 1, there was a suggestion early on in the planning stage that I be involved on the actual day. Initially it was suggested that I might have to get up on stage with the Actual Proper Really Important People. Thankfully, this idea got canned and evolved into being able to co-host one of the workshops (I never want go up on a stage, unless it’s behind a drum-kit), with the very talented Mr Neil Pettinger.

This gave us a chance to demonstrate some of the patient flow graphics we’ve been iterating on – Neil had originally drafted Excel versions and then I tried my hand at replicating them in R. I blogged about this towards the end of last year, so go take a look there for some more background.

Our aim was to make the workshop a conversational, interactive affair, and I think we managed it. As Neil is based in Edinburgh, and I’m up in Inverness, most of our communication has been via email or Twitter DMs. We had maybe 2 phonecalls prior to the event. On the day before, we tried doing a rehearsal via internet video conference but my 4 year old twins managed to gatecrash that and it was a bit of a disaster.

There was a fair bit of trepidation on my part before the first session, but it went well – people asked questions, which is always a good thing. In all we had to run the workshop 4 times, which meant we missed out on the other sessions running in parallel.

I would also have liked to have been able to participate in the discussion about how AphA Scotland moves forward. My one hope is that future events remain in Perth, which is a lot more ‘central’ than the usual Edinburgh/Glasgow locations – or, another way to look at it, is “equitably inconvenient” for everyone.

The slides that I put together for our workshop are hosted here:

DataTransitions: visualising and animating patient flow

This shows the original and revised Excel plots, plus the dplyr and ggplot2 code in a step-by-step guide to creating the R equivalent.

I’m still undecided on making presentations in R. For our purposes, PowerPoint might have been absolutely fine. BUT, for those who were new to R (I’d say our workshop attendees were a 50/50 split between those who’d seen or used it and those who hadn’t), it was quite nice to demonstrate nice graphics and also say, “Yes, these slides you’re looking at were put together using R”. I had also planned to spin up a Leaflet map centred on the concert hall with a big “You are here” sign, but didn’t get round to it.

One other cool moment, at least as far as this blog is concerned, was speaking to someone, who upon realising I was from the Highlands, put 2 and 2 together and asked “are you… HighlandR?” Yay! Someone reads this stuff!

Big thanks to Paul Stroner (Apha CEO), Val Perigo ( Administrator extraordinaire), Scott Heald, Peter Knight and Neil Pettinger for organising the event, and to all those who attended. You wonderful people.

I’m looking forward to seeing what happens next, roll on the next event and hopefully, some training too 🙂


To leave a comment for the author, please follow the link and comment on their blog: HighlandR.


Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics


(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Statistics Sunday

Last week, I showed how to tokenize text. Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")
devtools::install_github("josiahparry/geniusR")
## Downloading GitHub repo josiahparry/geniusR@master
## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
## Installing geniusR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
##   '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'  \
##   --install-tests

Now you’ll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.

swift_lyrics <- genius_album(artist = "Taylor Swift", album = "Reputation")
## Joining, by = c("track_title", "track_n", "track_url") 
lorde_lyrics <- genius_album(artist = "Lorde", album = "Melodrama")
## Joining, by = c("track_title", "track_n", "track_url") 

Now we want to tokenize our datasets, remove stop words, and count word frequency – this code should look familiar, except this time, I’m combining the steps with the pipe operator (%>%) from the tidyverse, which allows you to string together multiple functions without having to nest them.
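As a quick aside, here is a tiny standalone example (unrelated to the lyrics data) of how the pipe replaces nested function calls:

```r
library(magrittr)  # provides %>%; also re-exported by the tidyverse

# Without the pipe: functions are nested inside one another,
# and must be read inside-out
nested <- mean(round(sqrt(1:10), 1))

# With the pipe: the same steps read left to right
piped <- 1:10 %>% sqrt() %>% round(1) %>% mean()

identical(nested, piped)  # TRUE
```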

library(tidytext)

tidy_swift <- swift_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word" 
head(tidy_swift)
## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word" 
head(tidy_lorde)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 5 homemade    22
## 6 light       22

Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we? Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.

tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n / sum(swift_n))

tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n / sum(lorde_n))

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist’s set, it will fill those variables in with missing values.
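A toy example (made-up words and counts, not the lyrics data) shows the difference between the two joins:

```r
library(dplyr)

a <- tibble(word = c("love", "call", "light"), a_n = c(3, 2, 1))
b <- tibble(word = c("love", "shit"), b_n = c(5, 4))

# anti_join keeps only the rows of a with no match in b ("call", "light")
anti_join(a, b, by = "word")

# full_join keeps every word from both; unmatched counts become NA
full_join(a, b, by = "word")
```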

compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")

summary(compare_words)
##      word              swift_n         swift_prop         lorde_n     
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0   
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0   
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0   
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9   
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0   
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0   
##                     NA's   :301      NA's   :301       NA's   :508    
##    lorde_prop     
##  Min.   :0.0008   
##  1st Qu.:0.0008   
##  Median :0.0008   
##  Mean   :0.0022   
##  3rd Qu.:0.0023   
##  Max.   :0.0307   
##  NA's   :508

The final dataset contains 957 tokens – unique words – and the NAs tell how many words are only present in one artist’s corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.

There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

ggplot(compare_words, aes(x = swift_prop, y = lorde_prop)) +
  geom_abline() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  labs(y = "Lorde", x = "Taylor Swift") +
  theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text). 

The warning lets me know there are 809 rows with missing values – those are the words only present in one artist’s corpus. Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses “love,” “light,” and, yes, “shit,” more than Swift, while Swift uses “call,” “wanna,” and “hands” more than Lorde. They use words like “waiting,” “heart,” and “dreams” at similar rates. Rates are low overall, but if you look at the max values for the proportion variables, Swift’s most common word only accounts for about 2.3% of her total words; Lorde’s most common word only accounts for about 3.1% of her total words.

This highlights why it’s important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like “the,” “a”, and “and.”
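To see this concretely, here is a made-up one-line lyric (not from either album) run through the same tokenize-then-anti_join steps:

```r
library(dplyr)
library(tidytext)

lyrics <- tibble(line = 1, lyric = "and the light is all we have")

# one row per word
tokens <- lyrics %>% unnest_tokens(word, lyric)
nrow(tokens)  # 7 words before removing stop words

# after anti_join, only the substantive word "light" remains
tokens %>% anti_join(stop_words, by = "word")
```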

Next Statistics Sunday, we’ll take a look at sentiment analysis!


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


CRAN Release of R/exams 2.3-1


(This article was first published on R/exams, and kindly contributed to R-bloggers)

New minor release of the R/exams package to CRAN, containing a wide range of smaller improvements and bug fixes. Notable new features include a dedicated OpenOLAT interface, and a convenience function facilitating the use of TikZ-based graphics.

CRAN Release of R/exams 2.3-1

Version 2.3-1 of the one-for-all exams generator R/exams has been published on the Comprehensive R Archive Network at https://CRAN.R-project.org/package=exams. In the next few days this will propagate to other CRAN mirrors along with Windows binary packages. The development version of the package is now version 2.3-2 on http://R-Forge.R-project.org/forum/?group_id=1337.

New features

  • Added new interface exams2openolat() for the open-source OpenOLAT learning management system. This is only a convenience wrapper to exams2qti12() or exams2qti21() with some dedicated tweaks for optimizing MathJax output for OpenOLAT.
  • New function include_tikz() that facilitates compiling standalone TikZ figures into a range of output formats, especially PNG and SVG (for HTML-based output). This is useful when including TikZ in R/Markdown exercises or when converting R/LaTeX exercises to HTML. Two examples have been added to the package that illustrate the capabilities of include_tikz(): automaton, logic. A dedicated blog post is also planned.

Written exams (NOPS)

  • Following the blog post on Written R/exams around the World several users have been kind enough to add language support for: Croatian (hr.dcf, contributed by Krunoslav Juraić), Danish (da.dcf, contributed by Tue Vissing Jensen and Jakob Messner), Slovak (sk.dcf, contributed by Peter Fabsic), Swiss German (gsw.dcf, contributed by Reto Stauffer), Turkish (tr.dcf, contributed by Emrah Er). Furthermore, Portuguese has been distinguished into pt-PT.dcf (Portuguese Portuguese) vs. pt-BR.dcf (Brazilian Portuguese), with pt.dcf defaulting to the former (contributed by Thomas Dellinger).
  • After setting a random seed exams2nops() and exams2pdf() now yield the same random versions of the exercises. Previously, this was not the case because exams2nops() internally generates a single random trial exam first for a couple of sanity checks. Now, the .GlobalEnv$.Random.seed is restored after generating the trial exam.
  • Fixed the support for nsamp argument in exams2nops(). Furthermore, current limitations of exams2nops() are pointed out more clearly in error messages and edge cases caught.
  • Allow varying points within a certain exercise in nops_eval().

HTML output and Base64-encoded supplements

  • In exams2html() and other interfaces based on make_exercise_transform_html() the option base64 = TRUE now uses Base64 encoding for all file extensions (known to the package) whereas base64 = NULL only encodes image files (previous default behavior).
  • Bug fixes and improvements in HTML transformers:
    • Only src="file.ext" (with src=") for supplementary files embedded into HTML is replaced now by the corresponding Base64-encoded version.
    • href="file.ext" is replaced by href="file.ext" download="file.ext" prior to Base 64 replacement to assure that the file name is preserved for the browser/downloader.
    • alt="file.ext" and download="file.ext" are preserved without the Base64-encoded version of file.ext.
  • Include further file URIs for Base64 supplements, in particular .sav for SPSS data files.
  • In exams2blackboard(..., base64 = FALSE, ...) the base64 = FALSE was erroneously ignored: no matter how base64 was specified, essentially base64 = TRUE was used. The argument is honored again now.

Extensions

  • \exshuffle{n} can now also be used for schoice exercises with more than one TRUE answer. In a first step only one of the TRUE answers is selected and then n - 1 items from the FALSE answers.
  • Function include_supplement(..., dir = "foo") – without a full path to "foo" – now also works if "foo" is not a local sub-directory but a sub-directory of the exercise directory edir (if specified).
  • Enable passing of envir argument from exams2html() to xweave() in case of R/Markdown (.Rmd) exercises.
  • When using exams2html(..., mathjax = TRUE) for testing purposes, mathjax.rstudio.com is used now rather than cdn.mathjax.org which is currently redirecting and will eventually be shut down completely.
  • Added support for \tightlist (as produced by pandoc) in all current LaTeX templates as well as exams2nops().

Bug fixes

  • Fixed a bug in stresstest_exercise() where the “rank” (previously called “order”) of the correct solution was computed incorrectly. Additional enhancements in plots and labels.
  • Fixed a bug for tex2image(..., tikz = TRUE) where erroneously \usetikzlibrary{TRUE} was included. Also tex2image(..., Sweave = TRUE) (the default) did not run properly on Windows, fixed now.
  • Better warnings if \exshuffle{} could not be honored due to a lack of sufficiently many (suitable) answer alternatives.
  • Bug fix in CSV export of exams2arsnova(). Recent ARSnova versions use “mc” (rather than “MC”) and “abcd” (rather than “SC”) to code multiple-choice and single-choice questions, respectively.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.



Is non-inferiority on par with superiority?


(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

It is grant season around here (actually, it is pretty much always grant season), which means another series of problems to tackle. Even with the most straightforward study designs, there is almost always some interesting twist, or an approach that presents a subtle issue or two. In this case, the investigator wants to compare two interventions, but doesn’t feel the need to show that one is better than the other. He just wants to see if the newer intervention is not inferior to the more established intervention.

The shift from a superiority trial to a non-inferiority trial leads to a fundamental shift in the hypothesis testing framework. In the more traditional superiority trial, where we want to see if an intervention is an improvement over another intervention, we can set up the hypothesis test with null and alternative hypotheses based on the difference of the intervention proportions \(p_{old}\) and \(p_{new}\) (under the assumption of a binary outcome):

\[ \begin{aligned} H_0: p_{new} - p_{old} &\le 0 \\ H_A: p_{new} - p_{old} &> 0 \end{aligned} \] In this context, if we reject the null hypothesis that the difference in proportions is less than or equal to zero, we conclude that the new intervention is an improvement over the old one, at least for the population under study. (A crucial element of the test is the \(\alpha\)-level that determines the Type 1 error, the probability of rejecting \(H_0\) when \(H_0\) is actually true. If we use \(\alpha = 0.025\), then that is analogous to doing a two-sided test with \(\alpha = .05\) and hypotheses \(H_0: p_{new} - p_{old} = 0\) and \(H_A: p_{new} - p_{old} \ne 0\).)

In the case of an inferiority trial, we add a little twist. Really, we subtract a little twist. In this case the hypotheses are:

\[ \begin{aligned} H_0: p_{new} - p_{old} &\le -\Delta \\ H_A: p_{new} - p_{old} &> -\Delta \end{aligned} \]

where \(\Delta\) is some threshold that sets the non-inferiority bounds. Clearly, if \(\Delta = 0\) then this is equivalent to a superiority test. However, for any other \(\Delta\), there is a bit of a cushion so that the new intervention will still be considered non-inferior even if we observe a lower proportion for the new intervention compared to the older intervention.

As long as the confidence interval around the observed estimate for the difference in proportions does not cross the \(-\Delta\) threshold, we can conclude the new intervention is non-inferior. If we construct a 95% confidence interval, this procedure will have a Type 1 error rate \(\alpha = 0.025\), and a 90% CI will yield an \(\alpha = 0.05\). (I will demonstrate this with a simulation.)
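The decision rule described above can be sketched directly in R. (The counts below are made up purely for illustration; only the non-inferiority margin and confidence level come from the discussion.)

```r
# Non-inferiority margin (hypothetical choice for illustration)
Delta <- 0.10

# Hypothetical trial results: 140/500 successes under the new
# intervention, 150/500 under the old one
ptest <- prop.test(x = c(140, 150), n = c(500, 500),
                   correct = FALSE, conf.level = 0.95)

# lower bound of the 95% CI for p_new - p_old
lower <- ptest$conf.int[1]

# conclude non-inferiority if the entire interval lies
# to the right of -Delta
noninferior <- (lower > -Delta)
noninferior
```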

The following figures show how different confidence intervals imply different conclusions. I’ve added an equivalence trial here as well, but won’t discuss it in detail except to say that in this situation we would conclude that the two interventions are equivalent if the confidence interval falls entirely between \(-\Delta\) and \(\Delta\). The bottom interval crosses the non-inferiority threshold, so is considered inconclusive. The second interval from the top crosses zero, but does not cross the non-inferiority threshold, so we conclude that the new intervention is at least as effective as the old one. And the top interval excludes zero, so we conclude that the new intervention is an improvement:

This next figure highlights the key challenge of the non-inferiority trial: where do we set \(\Delta\)? By shifting the threshold towards zero in this example (and not changing anything else), we can no longer conclude non-inferiority. But the superiority test is not affected, and never will be. The comparison for a superiority test is made relative to zero only, and has nothing to do with \(\Delta\). So, unless there is a principled reason for selecting \(\Delta\), the process (and conclusions) can feel a little arbitrary. (Check out this interactive post for a really cool way to explore some of these issues.)

Type 1 error rate

To calculate the Type 1 error rate, we generate data under the null hypothesis, or in this case on the rightmost boundary of the null hypothesis since it is a composite hypothesis. First, let’s generate one data set:

library(simstudy)    # defDataAdd(), genData(), trtAssign(), addColumns()
library(data.table)
library(magrittr)
library(broom)

set.seed(319281)

def <- defDataAdd(varname = "y", formula = "0.30 - 0.15*rx",
                  dist = "binary")

DT <- genData(1000) %>% trtAssign(dtName = ., grpName = "rx")
DT <- addColumns(def, DT)
DT
##         id rx y
##    1:    1  0 0
##    2:    2  1 0
##    3:    3  1 0
##    4:    4  0 0
##    5:    5  1 0
##   ---          
##  996:  996  0 1
##  997:  997  0 0
##  998:  998  1 0
##  999:  999  0 0
## 1000: 1000  0 0

And we can estimate a confidence interval for the difference between the two means:

props <- DT[, .(success = sum(y), n = .N), keyby = rx]
setorder(props, -rx)

round(tidy(prop.test(props$success, props$n,
                     correct = FALSE, conf.level = 0.95))[, -c(5, 8, 9)], 3)
##   estimate1 estimate2 statistic p.value conf.low conf.high
## 1     0.142     0.276    27.154       0   -0.184    -0.084

If we generate 1000 data sets in the same way, we can count the number of occurrences where we would incorrectly reject the null hypothesis (i.e. commit a Type 1 error):

powerRet <- function(nPerGrp, level, effect, d = NULL) {
  
  Form <- genFormula(c(0.30, -effect), c("rx"))
  def <- defDataAdd(varname = "y", formula = Form, dist = "binary")
  DT <- genData(nPerGrp * 2) %>% trtAssign(dtName = ., grpName = "rx")
  
  iter <- 1000
  ci <- data.table()
  
  # generate 1000 data sets and store results each time in "ci"
  for (i in 1:iter) {
    
    dx <- addColumns(def, DT)
    
    props <- dx[, .(success = sum(y), n = .N), keyby = rx]
    setorder(props, -rx)
    ptest <- prop.test(props$success, props$n, correct = FALSE,
                       conf.level = level)
    
    ci <- rbind(ci, data.table(t(ptest$conf.int),
                               diff = ptest$estimate[1] - ptest$estimate[2]))
  }
  
  setorder(ci, V1)
  ci[, i := 1:.N]
  
  # for sample size calculation at 80% power
  if (is.null(d)) d <- ci[i == .2 * .N, V1]
  
  ci[, d := d]
  
  # determine if interval crosses threshold
  ci[, nullTrue := (V1 <= d)]
  
  return(ci[])
}

Using 95% CIs, we expect 2.5% of the intervals to lie to the right of the non-inferiority threshold. That is, 2.5% of the time we would reject the null hypothesis when we shouldn’t:

ci <- powerRet(nPerGrp = 500, level = 0.95, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)
## [1] 2.4%

And using 90% CIs, we expect 5% of the intervals to lie to the right of the threshold:

ci <- powerRet(nPerGrp = 500, level = 0.90, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)
## [1] 5.1%

Sample size estimates

If we do not expect the effect sizes to be different across interventions, it seems reasonable to find the sample size under this assumption of no effect. Assuming we want to set \(\alpha = 0.025\), we generate many data sets and estimate the 95% confidence interval for each one. The power is merely the proportion of these confidence intervals that lie entirely to the right of \(-\Delta\).

But how should we set \(\Delta\)? I’d propose that for each candidate sample size level, we find \(-\Delta\) such that 80% of the simulated confidence intervals lie to the right of some value, where 80% is the desired power of the test (i.e., given that there is no treatment effect, 80% of the (hypothetical) experiments we conduct will lead us to conclude that the new treatment is non-inferior to the old treatment).

ci <- powerRet(nPerGrp = 200, level = 0.95, effect = 0)
p1 <- plotCIs(ci, 200, 0.95)

ci <- powerRet(nPerGrp = 500, level = 0.95, effect = 0)
p2 <- plotCIs(ci, 500, 0.95)
library(gridExtra)
grid.arrange(p1, p2, nrow = 1,
             bottom = "difference in proportion", left = "iterations")

It is clear that increasing the sample size reduces the width of the 95% confidence intervals. As a result, the non-inferiority threshold based on 80% power is shifted closer towards zero when sample size increases. This implies that a larger sample size allows us to make a more compelling statement about non-inferiority.

Unfortunately, not all non-inferiority statements are alike. If we set \(\Delta\) too large, we may expand the bounds of non-inferiority beyond a reasonable, justifiable point. Given that there is no actual constraint on what \(\Delta\) can be, I would say that the non-inferiority test is somewhat more problematic than its closely related cousin, the superiority test, where \(\Delta\) is in effect fixed at zero. But, if we take this approach, where we identify \(\Delta\) that satisfies the desired power, we can make a principled decision about whether or not the threshold is within reasonable bounds.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Sketchnotes from TWiML&AI: Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang


(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang:

Sketchnotes from TWiMLAI talk: Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang


You can listen to the podcast here.

In this episode, I’m joined by Ian Goodfellow, Staff Research Scientist at Google Brain and Sandy Huang, Phd Student in the EECS department at UC Berkeley, to discuss their work on the paper Adversarial Attacks on Neural Network Policies. If you’re a regular listener here you’ve probably heard of adversarial attacks, and have seen examples of deep learning based object detectors that can be fooled into thinking that, for example, a giraffe is actually a school bus, by injecting some imperceptible noise into the image. Well, Sandy and Ian’s paper sits at the intersection of adversarial attacks and reinforcement learning, another area we’ve discussed quite a bit on the podcast. In their paper, they describe how adversarial attacks can also be effective at targeting neural network policies in reinforcement learning. Sandy gives us an overview of the paper, including how changing a single pixel value can throw off performance of a model trained to play Atari games. We also cover a lot of interesting topics relating to adversarial attacks and RL individually, and some related areas such as hierarchical reward functions and transfer learning. This was a great conversation that I’m really excited to bring to you! https://twimlai.com/twiml-talk-119-adversarial-attacks-reinforcement-learning-agents-ian-goodfellow-sandy-huang/


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.


Generative Assessment Creation


(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a generative way so that the correct and incorrect answers are automatically generated. (This seems to be one of those use cases for why ‘everyone should learn to code’;-)

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.
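For readers who haven’t seen the format, a minimal generative exercise in the R/exams Rmd style might look something like the sketch below. The numbers are randomized on each run; the exercise name and arithmetic task are made up for illustration.

````markdown
```{r, echo = FALSE, results = "hide"}
a <- sample(2:9, 1)
b <- sample(2:9, 1)
```

Question
========
What is `r a` + `r b`?

Solution
========
`r a` + `r b` = `r a + b`.

Meta-information
================
extype: num
exsolution: `r a + b`
exname: addition-example
````

Feeding a file like this to exams2pdf() or exams2html() then yields a freshly randomized question and matching solution each time.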

You can also write templates that include the creation of graphical assets such as charts:

 

Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams created from a TikZ diagram template. For example, logic diagrams:

Or automata diagrams:

(You can see more exam templates here: www.r-exams.org/templates.)

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:

One item on the to-do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to notebooks.azure.com using OUCU@open.ac.uk credentials.) For the moment, I’m depositing them at https://notebooks.azure.com/OUsefulInfo/libraries/gettingstarted, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.


To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog….


sparklyr 0.8


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce that sparklyr 0.8 is now available on CRAN! Sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark ML. You can also learn more about Apache Spark and sparklyr at spark.rstudio.com and the sparklyr webinar series. In this version, we added support for Spark 2.3, Livy 0.5, and various enhancements and bugfixes. For this post, we’d like to highlight a new feature from Spark 2.3 and introduce the mleap and graphframes extensions.

Parallel Cross-Validation

Spark 2.3 supports parallelism in hyperparameter tuning. In other words, instead of training each model specification serially, you can now train them in parallel. This can be enabled by setting the parallelism parameter in ml_cross_validator() or ml_train_validation_split(). Here’s an example:

library(sparklyr)

sc <- spark_connect(master = "local", version = "2.3.0")
iris_tbl <- sdf_copy_to(sc, iris)

# Define the pipeline
labels <- c("setosa", "versicolor", "virginica")

pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    c("Sepal_Width", "Sepal_Length", "Petal_Width", "Petal_Length"),
    "features"
  ) %>%
  ft_string_indexer_model("Species", "label", labels = labels) %>%
  ml_logistic_regression()

# Specify hyperparameter grid
grid <- list(
  logistic = list(
    elastic_net_param = c(0.25, 0.75),
    reg_param = c(1e-3, 1e-4)
  )
)

# Create the cross validator object
cv <- ml_cross_validator(
  sc, estimator = pipeline, estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3, parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)

Once the models are trained, you can inspect the performance results by using the newly available helper function ml_validation_metrics():

ml_validation_metrics(cv_model)
##       f1 elastic_net_param_1 reg_param_1
## 1 0.9506                0.25       1e-03
## 2 0.9384                0.75       1e-03
## 3 0.9384                0.25       1e-04
## 4 0.9569                0.75       1e-04
spark_disconnect(sc)

Pipelines in Production

Earlier this year, we announced support for ML Pipelines in sparklyr, and discussed how one can persist models onto disk. While that workflow is appropriate for batch scoring of large datasets, we also wanted to enable real-time, low-latency scoring using pipelines developed with sparklyr. To enable this, we’ve developed the mleap package, available on CRAN, which provides an interface to the MLeap open source project.

MLeap allows you to use your Spark pipelines in any Java-enabled device or service. This works by serializing Spark pipelines which can later be loaded into the Java Virtual Machine (JVM) for scoring without requiring a Spark cluster. This means that software engineers can take Spark pipelines exported with sparklyr and easily embed them in web, desktop or mobile applications.

To get started, simply grab the package from CRAN and install the necessary dependencies:

install.packages("mleap")
library(mleap)
install_maven()
install_mleap()

Then, build a pipeline as usual:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars)

# Create a pipeline and fit it
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")
pipeline_model <- ml_fit(pipeline, mtcars_tbl)

Once we have the pipeline model, we can export it via ml_write_bundle():

# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)
ml_write_bundle(pipeline_model, transformed_tbl, model_path)
spark_disconnect(sc)

At this point, we’re ready to use mtcars_model.zip in other applications. Notice that the following code does not require Spark:

# Import model
model <- mleap_load_bundle(model_path)

# Create a data frame to be scored
newdata <- tibble::tribble(
  ~qsec, ~hp, ~wt,
  16.2,  101, 2.68,
  18.1,  99,  3.08
)

# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec       <dbl> 16.2, 18.1
## $ hp         <dbl> 101, 99
## $ wt         <dbl> 2.68, 3.08
## $ big_hp     <dbl> 1, 0
## $ features   <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction <dbl> 21.07, 22.37

Notice that MLeap requires Spark 2.0 to 2.2. You can find additional details in the production pipelines guide.

Graph Analysis

The other extension we’d like to highlight is graphframes, which provides an interface to the GraphFrames Spark package. GraphFrames allows us to run graph algorithms at scale using a DataFrame-based API.

Let’s see graphframes in action through a quick example, where we analyze the relationships among packages on CRAN.

library(graphframes)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")

# Grab list of CRAN packages and their dependencies
available_packages <- available.packages(
  contrib.url("https://cloud.r-project.org/")
) %>%
  `[`(, c("Package", "Depends", "Imports")) %>%
  as_tibble() %>%
  transmute(
    package = Package,
    dependencies = paste(Depends, Imports, sep = ",") %>%
      gsub("\\n|\\s+", "", .)
  )

# Copy data to Spark
packages_tbl <- sdf_copy_to(sc, available_packages, overwrite = TRUE)

# Create a tidy table of dependencies, which define the edges of our graph
edges_tbl <- packages_tbl %>%
  mutate(
    dependencies = dependencies %>%
      regexp_replace("\\\\(([^)]+)\\\\)", "")
  ) %>%
  ft_regex_tokenizer(
    "dependencies", "dependencies_vector",
    pattern = "(\\s+)?,(\\s+)?", to_lower_case = FALSE
  ) %>%
  transmute(
    src = package,
    dst = explode(dependencies_vector)
  ) %>%
  filter(!dst %in% c("R", "NA"))

Once we have an edges table, we can easily create a GraphFrame object by calling gf_graphframe() and running PageRank:

# Create a GraphFrame object
g <- gf_graphframe(edges = edges_tbl)

# Run the PageRank algorithm
pagerank <- gf_pagerank(g, tol = 0.01)
pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank))
## # Source:     table [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(pagerank)
##    id        pagerank
##    <chr>        <dbl>
##  1 methods      259. 
##  2 stats        209. 
##  3 utils        194. 
##  4 Rcpp         109. 
##  5 graphics     104. 
##  6 grDevices     60.0
##  7 MASS          53.7
##  8 lattice       34.7
##  9 Matrix        33.3
## 10 grid          32.1
## # ... with more rows

We can also collect a sample of the graph locally for visualization:

library(gh)
library(visNetwork)

list_repos <- function(username) {
  gh("/users/:username/repos", username = username) %>%
    vapply("[[", "", "name")
}

rlib_repos <- list_repos("r-lib")
tidyverse_repos <- list_repos("tidyverse")

base_packages <- installed.packages() %>%
  as_tibble() %>%
  filter(Priority == "base") %>%
  pull(Package)

top_packages <- pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank)) %>%
  head(75) %>%
  pull(id)

edges_local <- g %>%
  gf_edges() %>%
  filter(src %in% !!top_packages && dst %in% !!top_packages) %>%
  rename(from = src, to = dst) %>%
  collect()

vertices_local <- g %>%
  gf_vertices() %>%
  filter(id %in% top_packages) %>%
  mutate(
    group = case_when(
      id %in% !!rlib_repos ~ "r-lib",
      id %in% !!tidyverse_repos ~ "tidyverse",
      id %in% !!base_packages ~ "base",
      TRUE ~ "other"
    ),
    title = id
  ) %>%
  collect()

visNetwork(vertices_local, edges_local, width = "100%") %>%
  visEdges(arrows = "to")

(Interactive visNetwork graph: the 75 top-ranked CRAN packages, with nodes grouped as base, r-lib, tidyverse or other, and directed edges from each package to its dependencies.)

spark_disconnect(sc)

Notice that GraphFrames currently supports Spark 2.0 and 2.1. You can find additional details in the graph analysis guide.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Early draft of our “Feature Engineering and Selection” book


(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

model_process.png

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It’s about 60% done and we’d love to get feedback. You can take a look at http://feat.engineering and provide feedback at https://github.com/topepo/FES/issues.

The current TOC is:

  1. Introduction
  2. Illustrative Example: Predicting Risk of Ischemic Stroke
  3. A Review of the Predictive Modeling Process
  4. Exploratory Visualizations
  5. Encoding Categorical Predictors
  6. Engineering Numeric Predictors
  7. Detecting Interaction Effects (these later chapters are not finished yet)
  8. Flattening Profile Data
  9. Handling Missing Data
  10. Feature Engineering Without Overfitting
  11. Feature Selection

To leave a comment for the author, please follow the link and comment on their blog: Blog - Applied Predictive Modeling.


fruit: Image-Based Systems of Linear Equations (Numeric)


(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (numeric answer) with a problem description based on shuffled images.

Name:
fruit
Type:
Related:

Description:
A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. PDFs are best generated from the Rnw version; HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).
Solution feedback:
Yes
Randomization:
Random numbers, shuffled graphics
Mathematical notation:
Yes
Verbatim R input/output:
No
Images:
Yes
Other supplements:
No
Raw: (1 random version)
PDF:
fruit-Rnw-pdf
fruit-Rmd-pdf
HTML:
fruit-Rnw-html
fruit-Rmd-html

(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)

Demo code:

library("exams")
set.seed(1090)
exams2html("fruit.Rnw")
set.seed(1090)
exams2pdf("fruit.Rnw")
set.seed(1090)
exams2html("fruit.Rmd")
set.seed(1090)
exams2pdf("fruit.Rmd")

To leave a comment for the author, please follow the link and comment on their blog: R/exams.


fruit2: Image-Based Systems of Linear Equations (Single-Choice)


(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (single-choice) with a problem description based on shuffled images.

Name:
fruit2
Type:
Related:

Description:
A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. A set of five answer alternatives is generated based on two potential mistakes and two random solutions from a suitable range. PDFs are best generated from the Rnw version; HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).
Solution feedback:
Yes
Randomization:
Random numbers, shuffled graphics
Mathematical notation:
Yes
Verbatim R input/output:
No
Images:
Yes
Other supplements:
No
Raw: (1 random version)
PDF:
fruit2-Rnw-pdf
fruit2-Rmd-pdf
HTML:
fruit2-Rnw-html
fruit2-Rmd-html

(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)

Demo code:

library("exams")
set.seed(1090)
exams2html("fruit2.Rnw")
set.seed(1090)
exams2pdf("fruit2.Rnw")
set.seed(1090)
exams2html("fruit2.Rmd")
set.seed(1090)
exams2pdf("fruit2.Rmd")

To leave a comment for the author, please follow the link and comment on their blog: R/exams.



The Pleasing Ratio Project


(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I know almost nothing of his life or work except one thing: in the 1860s he carried out a thought-provoking experiment. It seems interesting to me for two important reasons: he called into question something widely established, and he obtained experimental data by himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. He then asked each of them to choose the most pleasing one, and obtained revealing results that I will not explain here, since doing so would bias my experiment. You can find more information about the original experiment here.

I have done a project inspired by Fechner’s that I called The Pleasing Ratio Project. Once you enter the App, you will see two rectangles. Both of them have the same area; they only vary in their length-to-width ratios. You will then be asked to select the one that seems most pleasing to you. If you have doubts, just choose I’m not sure. You can do it as many times as you want (all responses are completely anonymous). Every game confronts a pair of ratios, which can vary from 1 to 3.5. In the Results section you will find the percentage of winning games for each ratio. The one with the highest percentage will one day be officially named by myself as The Most Pleasing Ratio of the World.

Although my experiment is absolutely inspired by Fechner’s, there are some differences. I can explore a bigger set of ratios by doing an A/B test, and I also introduce the option I’m not sure. This makes this one a bit richer.

The experiment has also some interesting technical features:

  • the shinydashboard package to arrange the App
  • the shinyjs package to add JavaScript so the page refreshes when the user chooses to play again
  • saving votes in a text file
  • reading that file back to visualize the results
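To give a flavour of the Results computation, here is a minimal sketch in base R with made-up votes. The app’s real file format isn’t shown in the post, so the columns winner and loser (one row per game) are my assumption for illustration:

```r
# Made-up votes: each row is one game, recording the winning and losing ratio
votes <- data.frame(
  winner = c(1.5, 1.5, 2.0, 1.5, 2.0, 3.0),
  loser  = c(2.0, 3.0, 1.5, 2.5, 3.0, 1.5)
)

# For each ratio: games won divided by games played
ratios <- sort(unique(c(votes$winner, votes$loser)))
wins <- sapply(ratios, function(r)
  sum(votes$winner == r) / sum(votes$winner == r | votes$loser == r))

# Percentage of winning games for each ratio
setNames(round(100 * wins, 1), ratios)
```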

Will I obtain the same results as Fechner? This is a living project whose results will change over the time so you can check it regularly. Go to The Pleasing Ratio Project! The code of the project is available in GitHub. Thanks a lot for your collaboration!


To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


Getting into the Rhythm: the euRovision sweepstake


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Laura Swales, Marketing and Events Assistant

Another month, another sweepstake to raise money for the Bath Cats & Dogs home!

This time, we picked the Eurovision song contest as our sweepstake of choice. After enjoying my first experience of using R to randomise the names for the previous sweepstake I decided to give it another go, but with a few tweaks.

Soundcheck

During my first attempt in R, issues arose when I had been (innocently!) allocated the favourite horse to win. I had no way to prove that the R code had made the selection, as my work was not reproducible.

So with the cries of “cheater!” and “fix!” still ringing in my ears, we started by setting a seed. This meant that if someone else were to replicate my code they would get the same results, thereby removing the dark smudge against my good name.

At random I selected the number 6 at which to set my seed.

set.seed(6)
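The reproducibility claim is easy to verify: re-running from the same seed gives exactly the same draw. A quick base-R check, using a throwaway sample rather than the actual sweepstake data:

```r
# Same seed, same shuffle: anyone re-running this code gets the exact same order
set.seed(6)
first_draw <- sample(10)

set.seed(6)
second_draw <- sample(10)

identical(first_draw, second_draw)  # TRUE
```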

I next compiled my lists of people and Eurovision countries and associated them with correlating objects.

people_list <- c(
  "Andy M", "Adam", "Laura", "Rachel", "Owen", "Yvi", "Karis",
  "Toby", "Jen", "Matty G", "Tatiana", "Amanda", "Chrissy", "Lisa",
  "Lisa", "Ben", "Ben", "Robert", "Toby", "Matt A", "Lynn", "Ruth",
  "Julian", "Karina", "Colin", "Colin"
)

countries_list <- c(
  "Albania", "Australia", "Austria", "Bulgaria", "Cyprus",
  "Czech Rep", "Denmark", "Estonia", "Finland", "France", "Germany",
  "Hungary", "Ireland", "Israel", "Italy", "Lithuania", "Moldova",
  "Norway", "Portugal", "Serbia", "Slovenia", "Spain", "Sweden",
  "The Netherlands", "Ukraine", "United Kingdom"
)

Once I had the lists associated with objects, I followed the same steps as my previous attempt in R. I put both objects into data frames and then used the sample function to jumble up the names.

assign_countries <- data.frame(people = people_list,
                               countries = sample(countries_list))

Task complete!

Fate had delivered me Denmark, who were nowhere near the favourites at the point of selection. I sighed with relief knowing that I had no chance of winning again and that perhaps maybe now I could start to re-build my reputation as an honest co-worker…

Encore

Before I finished my latest foray into R, we decided to write a function for creating sweepstakes in R.

I was talked down from picking the name SweepstakeizzleR and decided upon the slightly more sensible sweepR.

I entered the desired workings of the function, which followed from the above work in R.

sweepR <- function(a, b, seed = 1234) {
  set.seed(seed)
  data.frame(a, sample(b))
}

Once done, I could use my newly created function to complete the work I had done before but in a much timelier fashion.

sweepR(people_list, countries_list)

My very first function worked! Using a function like sweepR will allow me to reliably reproduce the procedures I need for whatever task I’m working on. In this case it has enabled me to create a successfully random sweepstake mix of names and entries.
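One way to see why the seed argument matters: calling sweepR twice with the same inputs reproduces the identical assignment, while a different seed can shuffle things differently. A small sketch with made-up names:

```r
sweepR <- function(a, b, seed = 1234) {
  set.seed(seed)
  data.frame(a, sample(b))
}

people    <- c("Ann", "Bob", "Cat")
countries <- c("France", "Israel", "Spain")

# Same (default) seed twice: the identical sweepstake draw
draw1 <- sweepR(people, countries)
draw2 <- sweepR(people, countries)
identical(draw1, draw2)  # TRUE

# A different seed is also reproducible, but may assign countries differently
draw3 <- sweepR(people, countries, seed = 42)
```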

WinneR

With great relief Israel won Eurovision and I was very happy to hand over the prize to Amanda.

I really enjoyed learning a little more about R and how I can create functions to streamline my work. Hopefully another reason will come up for me to learn even more soon!


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Time Series of the World, Unite!


(This article was first published on R – usefulr, and kindly contributed to R-bloggers)

The R ecosystem knows a ridiculous number of time series classes. So, I decided to create a new universal standard that finally covers everyone’s use case… Ok, just kidding!

tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble or timeSeries to each other.

To install the stable version from CRAN:

install.packages("tsbox")

To get an idea how easy it is to switch from one class to another, consider this:

library(tsbox)
x.ts <- ts_c(mdeaths, fdeaths)
x.xts <- ts_xts(x.ts)
x.df <- ts_df(x.xts)
x.tbl <- ts_tbl(x.df)
x.dt <- ts_dt(x.tbl)
x.zoo <- ts_zoo(x.dt)
x.tsibble <- ts_tsibble(x.zoo)
x.timeSeries <- ts_timeSeries(x.tsibble)

We jump from good old ts objects to xts, store our time series in various data frames and convert them to some highly specialized time series formats.

tsbox is class-agnostic

Because these converters work nicely, we can use them to make functions class-agnostic. If a class-agnostic function works for one class, it works for all:

ts_scale(x.ts)
ts_scale(x.xts)
ts_scale(x.df)
ts_scale(x.dt)
ts_scale(x.tbl)

ts_scale normalizes one or multiple series, by subtracting the mean and dividing by the standard deviation. It works like a ‘generic’ function: You can apply it on any time series object, and it will return an object of the same class as its input.

So, whether we want to smooth, scale, differentiate, chain-link, forecast, regularize or seasonally adjust a series, we can use the same commands on whatever time series is at hand. tsbox offers a comprehensive toolkit for the basics of time series manipulation. Here are some additional operations:

ts_pc(x.ts)                 # percentage change rates
ts_forecast(x.xts)          # forecast, by exponential smoothing
ts_seas(x.df)               # seasonal adjustment, by X-13
ts_frequency(x.dt, "year")  # convert to annual frequency
ts_span(x.tbl, "-1 year")   # limit time span to final year

tsbox is frequency-agnostic

There are many more. Because they all start with ts_, you can use auto-complete to see what’s around. Most conveniently, there is a time series plot function that works for all classes and frequencies:

ts_plot(
  `Airline Passengers` = AirPassengers,
  `Lynx trappings` = ts_df(lynx),
  `Deaths from Lung Diseases` = ts_xts(fdeaths),
  title = "Airlines, trappings, and deaths",
  subtitle = "Monthly passengers, annual trappings, monthly deaths"
)

(Plot: airline passengers, lynx trappings and deaths from lung diseases drawn on a single time axis.)

There is also a version that uses ggplot2 and has the same syntax.

Time series in data frames

You may have wondered why we treated data frames as a time series class. The spread of dplyr and data.table has given data frames a boost and made them one of the most popular data structures in R. So, storing time series in a data frame is an obvious consequence. And even if you don’t intend to keep time series in data frames, this is still the format in which you import and export your data. tsbox makes it easy to switch from data frames to time series and back.
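For illustration, here is a minimal sketch of the "long" layout tsbox uses for time series in data frames. The id/time/value column layout matches what ts_df() produces for multiple series; the values below are the first months of mdeaths:

```r
# A time series as a long data frame: one row per observation
df <- data.frame(
  id    = "mdeaths",
  time  = seq(as.Date("1974-01-01"), by = "month", length.out = 6),
  value = c(2134, 1863, 1877, 1877, 1492, 1249)
)
df

# ts_ts(df) would convert this back into a classic "ts" object
```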

Make existing functions class-agnostic

tsbox includes tools to make existing functions class-agnostic. To do so, the ts_ function can be used to wrap any function that works with time series. For a function that works on "ts" objects, this is as simple as this:

ts_rowsums <- ts_(rowSums)
ts_rowsums(ts_c(mdeaths, fdeaths))

Note that ts_ returns a function, which can be used with or without a name.
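To see the function-factory idea behind ts_, here is a stripped-down toy version. It only handles a wide data frame with a leading time column, unlike the real ts_, which dispatches on all supported classes:

```r
# Toy sketch of the ts_ idea: convert the input towards "ts",
# apply the wrapped function, and wrap the result back up.
ts_toy <- function(f) {
  function(x, ...) {
    if (is.data.frame(x)) {
      out <- f(ts(x[-1]), ...)               # apply f to the series columns
      data.frame(time = x[[1]], value = out) # return the input's "shape"
    } else {
      f(x, ...)                              # e.g. plain "ts" input
    }
  }
}

toy_rowsums <- ts_toy(rowSums)
toy_rowsums(data.frame(time = 1:3, a = 1:3, b = 4:6))
#   time value
# 1    1     5
# 2    2     7
# 3    3     9
```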

In case you are wondering, tsbox uses data.table as a backend, and makes use of its incredibly efficient reshaping facilities, its joins and rolling joins. And thanks to anytime, tsbox will be able to recognize almost any date format without manual intervention.

So, enjoy some relief in R’s time series class struggle.

Website: www.tsbox.help


To leave a comment for the author, please follow the link and comment on their blog: R – usefulr.


icon: web icons for rmarkdown


(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Icons in R

The icon package provides a convenient interface for adding icons from popular web fonts to R Markdown documents. The project began at rOpenSci OzUnconf 2017, and was developed by Mitchell O’Hara-Wild, Earo Wang and Timothy Hyndman. The package currently supports icons from Font Awesome, Academicons, and ionicons.

Icons can be added to your R Markdown documents using short prefixes which identify the font’s library.

  • fa: Font Awesome
  • ai: Academicons
  • ii: ionicons

For example, `r icon::fa("rocket")` can be used to add the rocket icon from Font Awesome. This interface is convenient if you are familiar with the icon you want, or if you are dynamically selecting your icon.

The package also provides a second interface which allows for tab completion of font names. To include the rocket icon above, you could instead use `r icon::fa_rocket()`. This interface is useful if you are unfamiliar with the available icons or their names.

Each icon function also provides support for modifying the attributes of the icon such as size, colour, rotation, and animation. We can make the rocket icon spin using `r icon::fa_rocket(animate = "spin")`.

R in icons

The long-awaited R Project icon has been added to the Font Awesome library in version 5.0.11. The addition comes nearly four years after the icon was first requested, and after it was brought into the #rstats spotlight by LockeData’s tweet.

Hey, the #rstats logo is #40 on the @fontawesome brand board – go give it some 👍 so we can get it higher!https://t.co/jdEcVMJyip

— Locke Data (@LockeData) April 20, 2018

Finally, it is possible to use an R package to include the R icon in your R Markdown documents!

`r fa_rocket(colour = "#1FA67A")` + `r fa_r_project(colour = "#384CB7")` = `r fa_heart(colour = "red")`

Future development

There are plans to submit the icon package to CRAN; however, there are a few things we would like to work on first. The top development priority is transitioning from web-font icons to svg icons. This should drastically reduce the load times of documents using icons, and allow the user to import their own icon libraries. Before release, we would also like to add further support for manipulating icons with modifier functions, and resolve an R Markdown issue relating to using icons within headers.

You can install and try out the latest version of the icon package today via:

devtools::install_github("ropenscilabs/icon")



CHAID and R – When you need explanation – May 15, 2018


(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

A modern data scientist using R has access to an almost bewildering number of tools, libraries and algorithms to analyze the data. In my next two posts I’m going to focus on an in-depth visit with CHAID (chi-square automatic interaction detection). The title should give you a hint as to why I think CHAID is a good “tool” for your analytical toolbox. There are lots of tools that can help you predict or classify, but CHAID is especially good at helping you explain to any audience how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about the distribution or normality of your data. I’ll try to elaborate on that as we work through the example.

You can get a very brief summary of CHAID from Wikipedia, and mentions of it are scattered about in places like Analytics Vidhya or Data Flair. If you prefer a more scholarly bent, the original article can be found in places like JSTOR. As the name implies it is fundamentally based on the venerable chi-square test, and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and, more importantly, easy to tell the story with after using it.

Compared to some other techniques it’s also quite simple to use, as I hope you’ll agree by the end of these posts. To showcase it we’re going to use a dataset that comes to us from the IBM Watson Project and is packaged with the rsample library. It’s a very practical and understandable dataset, and a great use case for a tree-based algorithm. Imagine yourself in a fictional company faced with the task of trying to figure out which employees you are going to “lose”, a.k.a. attrition or turnover. There’s a steep cost involved in keeping good employees, and training and on-boarding can be expensive. Being able to predict attrition even a little bit better would save lots of money and make the company better, especially if you can understand exactly what you have to “watch out” for that might indicate a person is a high risk to leave.

Setup and library loading

If you’ve never used CHAID before you may also not have partykit. CHAID isn’t on CRAN but I have commented out the install command below. You’ll also get a variety of messages, none of which is relevant to this example so I’ve suppressed them.

# install.packages("partykit")
# install.packages("CHAID", repos = "http://R-Forge.R-project.org")
require(rsample) # for dataset and splitting; also loads broom and tidyr
require(dplyr)
require(ggplot2)
theme_set(theme_bw()) # set theme
require(CHAID)
require(purrr)
require(caret)

Predicting attrition in a fictional company

Let’s load up the attrition dataset and take a look at the variables we have.

# data(attrition)
str(attrition)
## 'data.frame':    1470 obs. of  31 variables:##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...##  $ PerformanceRating       : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...##  $ 
RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Okay, we have data on 1,470 employees. We have 30 potential predictor (independent) variables and the all-important attrition variable, which gives us a yes or no answer to whether the employee left. Our task is to build the most accurate predictive model we can that is also simple (parsimonious) and explainable. The predictors seem to be the sorts of data we might have on hand in our HR files and, thank goodness, are labelled in a way that makes them pretty self-explanatory.

The CHAID library in R requires that any variables we enter as predictors be either nominal or ordinal (see ?CHAID::chaid), which in R terms means they have to be either factor or ordered factor. The str output shows we have a bunch of variables of type integer. As it turns out, moving from integer to factor is simple in terms of code, but deciding which variables to convert requires some thought for substantive reasons. So let's see how things break down.

attrition %>% select_if(is.factor) %>% ncol
## [1] 15
attrition %>% select_if(is.numeric) %>% ncol
## [1] 16

Hmmmm, 15 factors and 16 integers. Let's explore further. Of the variables that are integers, how many have a small number of values (a.k.a. levels) and can therefore be simply and easily converted to true factors? We'll use a dplyr pipe to see how many have 5 or fewer levels and how many have 10 or fewer.

attrition %>%
  select_if(function(col) length(unique(col)) <= 5 & is.integer(col)) %>%
  head
##   JobLevel StockOptionLevel
## 1        2                0
## 2        2                1
## 4        1                0
## 5        1                0
## 7        1                1
## 8        1                0
attrition %>%
  select_if(function(col) length(unique(col)) <= 10 & is.integer(col)) %>%
  head
##   JobLevel NumCompaniesWorked StockOptionLevel TrainingTimesLastYear
## 1        2                  8                0                     0
## 2        2                  1                1                     3
## 4        1                  6                0                     3
## 5        1                  1                0                     3
## 7        1                  9                1                     3
## 8        1                  0                0                     2

2 and 4 respectively. We can be pretty confident that converting these from integer to factor won't lose much information. It's simple to run a mutate operation across the 4 we have identified. It's probably more elegant, though, to make it a mutate_if: that way, if in the future we decide we like 4 or 7 or 122 as our criterion for the change, we only have to change one number. The "if" variation is also less to type and less likely to lead to a manual mistake.

attrition %>%
  mutate(JobLevel = factor(JobLevel),
         NumCompaniesWorked = factor(NumCompaniesWorked),
         StockOptionLevel = factor(StockOptionLevel),
         TrainingTimesLastYear = factor(TrainingTimesLastYear)) %>%
  str
## 'data.frame':    1470 obs. of  31 variables:##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...##  $ NumCompaniesWorked      : Factor w/ 10 levels "0","1","2","3",..: 9 2 7 2 10 1 5 2 1 7 ...##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...##  $ PerformanceRating       : Ord.factor w/ 4 levels 
"Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...##  $ TrainingTimesLastYear   : Factor w/ 7 levels "0","1","2","3",..: 1 4 4 4 4 3 4 3 3 4 ...##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
attrition <- attrition %>%
  mutate_if(function(col) length(unique(col)) <= 10 & is.integer(col), as.factor)
summary(attrition)
##       Age        Attrition            BusinessTravel   DailyRate     ##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  ##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  ##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  ##  Mean   :36.92                                       Mean   : 802.5  ##  3rd Qu.:43.00                                       3rd Qu.:1157.0  ##  Max.   :60.00                                       Max.   :1499.0  ##                                                                      ##                 Department  DistanceFromHome         Education  ##  Human_Resources     : 63   Min.   : 1.000   Below_College:170  ##  Research_Development:961   1st Qu.: 2.000   College      :282  ##  Sales               :446   Median : 7.000   Bachelor     :572  ##                             Mean   : 9.193   Master       :398  ##                             3rd Qu.:14.000   Doctor       : 48  ##                             Max.   :29.000                      ##                                                                 ##           EducationField EnvironmentSatisfaction    Gender   ##  Human_Resources : 27    Low      :284           Female:588  ##  Life_Sciences   :606    Medium   :287           Male  :882  ##  Marketing       :159    High     :453                       ##  Medical         :464    Very_High:446                       ##  Other           : 82                                        ##  Technical_Degree:132                                        ##                                                              ##    HourlyRate       JobInvolvement JobLevel##  Min.   : 30.00   Low      : 83    1:543   ##  1st Qu.: 48.00   Medium   :375    2:534   ##  Median : 66.00   High     :868    3:218   ##  Mean   : 65.89   Very_High:144    4:106   ##  3rd Qu.: 83.75                    5: 69   ##  Max.   
:100.00                            ##                                            ##                       JobRole     JobSatisfaction  MaritalStatus##  Sales_Executive          :326   Low      :289    Divorced:327  ##  Research_Scientist       :292   Medium   :280    Married :673  ##  Laboratory_Technician    :259   High     :442    Single  :470  ##  Manufacturing_Director   :145   Very_High:459                  ##  Healthcare_Representative:131                                  ##  Manager                  :102                                  ##  (Other)                  :215                                  ##  MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  ##  Min.   : 1009   Min.   : 2094   1      :521        No :1054  ##  1st Qu.: 2911   1st Qu.: 8047   0      :197        Yes: 416  ##  Median : 4919   Median :14236   3      :159                  ##  Mean   : 6503   Mean   :14313   2      :146                  ##  3rd Qu.: 8379   3rd Qu.:20462   4      :139                  ##  Max.   :19999   Max.   :26999   7      : 74                  ##                                  (Other):234                  ##  PercentSalaryHike   PerformanceRating RelationshipSatisfaction##  Min.   :11.00     Low        :   0    Low      :276           ##  1st Qu.:12.00     Good       :   0    Medium   :303           ##  Median :14.00     Excellent  :1244    High     :459           ##  Mean   :15.21     Outstanding: 226    Very_High:432           ##  3rd Qu.:18.00                                                 ##  Max.   :25.00                                                 ##                                                                ##  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance##  0:631            Min.   
: 0.00     0: 54                 Bad   : 80     ##  1:596            1st Qu.: 6.00     1: 71                 Good  :344     ##  2:158            Median :10.00     2:547                 Better:893     ##  3: 85            Mean   :11.28     3:491                 Best  :153     ##                   3rd Qu.:15.00     4:123                                ##                   Max.   :40.00     5:119                                ##                                     6: 65                                ##  YearsAtCompany   YearsInCurrentRole YearsSinceLastPromotion##  Min.   : 0.000   Min.   : 0.000     Min.   : 0.000         ##  1st Qu.: 3.000   1st Qu.: 2.000     1st Qu.: 0.000         ##  Median : 5.000   Median : 3.000     Median : 1.000         ##  Mean   : 7.008   Mean   : 4.229     Mean   : 2.188         ##  3rd Qu.: 9.000   3rd Qu.: 7.000     3rd Qu.: 3.000         ##  Max.   :40.000   Max.   :18.000     Max.   :15.000         ##                                                             ##  YearsWithCurrManager##  Min.   : 0.000      ##  1st Qu.: 2.000      ##  Median : 3.000      ##  Mean   : 4.123      ##  3rd Qu.: 7.000      ##  Max.   :17.000      ## 

As you look at the results, this is a good time to remind you that CHAID is "non-parametric", which means we don't have to worry about the shape of the distribution (normality) or make any assumptions about the variance. We are assuming that the predictors are independent of one another, but that is true of nearly every statistical test, and this is a robust procedure. So for now, let's simply ignore all the variables that are still integers. I promise we'll come back and deal with them later, but for now I'm eager to actually use CHAID and do some predicting. We're also going to defer the issue of "over-fitting" and how to most wisely use the data we have. We're simply going to build a first model using all 1,470 cases and the 18 factors we have available as predictors, trying to predict attrition. We'll create a new dataframe called newattrit (how original, right?).

newattrit <- attrition %>% select_if(is.factor)
dim(newattrit)
## [1] 1470   19

The chaid command accepts two pieces of information in its simplest case: a formula like outcome ~ predictors, and a dataframe. We're going to make use of the ~ . shortcut on the right-hand side, with Attrition on the left and newattrit as our dataframe.

About 6 seconds later (at least on my Mac) we’ll have a solution that we can print and plot.

I'm going to output all the plots in a smaller size for the benefit of you the readers. I'm doing that via R Markdown, so it won't happen automatically for you if you download and use the code. I'll initially be using fig.height=10, fig.width=20, dpi=90, out.width="900px".

What does CHAID do? Straight from the help pages: "Select the predictor that has the smallest adjusted p-value (i.e., most significant). If this adjusted p-value is less than or equal to a user-specified alpha-level alpha4, split the node using this predictor. Else, do not split and the node is considered as a terminal node." So it will take our 18 predictors and test each one against our outcome variable, attrition. The one with the lowest p-value (a proxy for most predictive) will "anchor" our decision tree. It will then repeat this process of splitting until further splits fail to yield significant results. I'm way over-simplifying, of course, but you get the idea. The end result will be a series of terminal nodes (think of them as "prediction buckets": groups of employees who all meet the same criteria and whom we predict will either attrit or not). Let's run it.

# demonstrate a full model using chaid with defaults
chaidattrit1 <- chaid(Attrition ~ ., data = newattrit)
print(chaidattrit1)
## 
## Model formula:
## Attrition ~ BusinessTravel + Department + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + 
##     JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + 
##     OverTime + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] StockOptionLevel in 0
## |   |   |   [4] JobSatisfaction in Low
## |   |   |   |   [5] RelationshipSatisfaction in Low, Medium, High: No (n = 56, err = 42.9%)
## |   |   |   |   [6] RelationshipSatisfaction in Very_High: No (n = 28, err = 7.1%)
## |   |   |   [7] JobSatisfaction in Medium, High
## |   |   |   |   [8] JobInvolvement in Low: Yes (n = 12, err = 41.7%)
## |   |   |   |   [9] JobInvolvement in Medium, High, Very_High
## |   |   |   |   |   [10] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 181, err = 9.9%)
## |   |   |   |   |   [11] BusinessTravel in Travel_Frequently
## |   |   |   |   |   |   [12] RelationshipSatisfaction in Low: Yes (n = 8, err = 25.0%)
## |   |   |   |   |   |   [13] RelationshipSatisfaction in Medium, High, Very_High: No (n = 30, err = 16.7%)
## |   |   |   [14] JobSatisfaction in Very_High: No (n = 134, err = 7.5%)
## |   |   [15] StockOptionLevel in 1, 2, 3
## |   |   |   [16] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%)
## |   |   |   [17] EnvironmentSatisfaction in Medium, High, Very_High
## |   |   |   |   [18] Department in Human_Resources, Sales: No (n = 164, err = 8.5%)
## |   |   |   |   [19] Department in Research_Development: No (n = 314, err = 3.2%)
## |   [20] OverTime in Yes
## |   |   [21] JobLevel in 1
## |   |   |   [22] StockOptionLevel in 0, 3
## |   |   |   |   [23] JobSatisfaction in Low, Medium, High: Yes (n = 61, err = 26.2%)
## |   |   |   |   [24] JobSatisfaction in Very_High: No (n = 28, err = 46.4%)
## |   |   |   [25] StockOptionLevel in 1, 2
## |   |   |   |   [26] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 50, err = 26.0%)
## |   |   |   |   [27] BusinessTravel in Travel_Frequently: Yes (n = 17, err = 35.3%)
## |   |   [28] JobLevel in 2, 3, 4, 5
## |   |   |   [29] MaritalStatus in Divorced, Married
## |   |   |   |   [30] EnvironmentSatisfaction in Low, Medium: No (n = 60, err = 20.0%)
## |   |   |   |   [31] EnvironmentSatisfaction in High, Very_High
## |   |   |   |   |   [32] TrainingTimesLastYear in 0, 6: No (n = 10, err = 40.0%)
## |   |   |   |   |   [33] TrainingTimesLastYear in 1, 2, 3, 4, 5
## |   |   |   |   |   |   [34] EnvironmentSatisfaction in Low, Medium, High: No (n = 57, err = 0.0%)
## |   |   |   |   |   |   [35] EnvironmentSatisfaction in Very_High: No (n = 61, err = 6.6%)
## |   |   |   [36] MaritalStatus in Single
## |   |   |   |   [37] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%)
## |   |   |   |   [38] Department in Sales: Yes (n = 35, err = 40.0%)
## 
## Number of inner nodes:    18
## Number of terminal nodes: 20
plot(chaidattrit1)

chisq.test(newattrit$Attrition, newattrit$OverTime)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  newattrit$Attrition and newattrit$OverTime
## X-squared = 87.564, df = 1, p-value < 2.2e-16

I happen to be a visual learner and prefer the plot to the print, but they are obviously reporting the same information, so use them as you see fit. As you can see, the very first split it decides on is overtime, yes or no. I've run the chi-square test so that you can see the p-value is indeed very small (0.00000000000000022).

So the algorithm has decided that the most predictive way to divide our sample of employees is into 20 terminal nodes or buckets. Each one of the nodes represents a distinct set of predictors. Take a minute to look at node 19. Every person there shares the following characteristics.

  • [2] OverTime in No
  • [15] StockOptionLevel in 1, 2, 3
  • [17] EnvironmentSatisfaction in Medium, High, Very_High
  • [19] Department in Research_Development: No

There are n = 314 people in this group; our prediction is No, they will not attrit, and we were "wrong" only err = 3.2% of the time. That's some useful information. To quote an old Star Wars movie, "These are not the droids you're looking for…". In other words, this is not a group we should be overly worried about losing, and we can say that with pretty high confidence.

For contrast let’s look at node #23:

  • [20] OverTime in Yes
  • [21] JobLevel in 1
  • [22] StockOptionLevel in 0, 3
  • [23] JobSatisfaction in Low, Medium, High:

Here there are n = 61 staff; we predict they will leave (Yes) and we get it wrong err = 26.2% of the time. It's a little worrisome that we're not as accurate, but this is a group that bears watching, or intervention, if we want to retain them.

Some other things to note. Because the predictors are considered categorical, we will get splits like the one for node 22, where 0 and 3 are on one side and 1, 2 on the other. The number of people in any node can be quite variable. Finally, notice that a variable can occur at different levels of the model, like StockOptionLevel does!

On the plot side of things there are a few key options you can adjust to make things easier to read. The next blocks of code show you how to adjust some key options such as adding a title, reducing the font size, using “simple” mode, and changing colors.

# digress for plottingplot(chaidattrit1,type="simple")

plot(chaidattrit1,main="Testing Graphical Options",gp=gpar(fontsize=8),type="simple")

plot(chaidattrit1,main="Testing More Graphical Options",gp=gpar(col="blue",lty="solid",lwd=3,fontsize=10))

Exercising some control

Next let's look into varying the parameters chaid uses to build the model. chaid_control (not surprisingly) controls the behavior of the model building. When you check the documentation at ?chaid_control you can see the list of 8 parameters you can adjust. We've already run the default settings implicitly when we built chaidattrit1; let's look at three others.

  • minsplit – Number of observations in splitted response at which no further split is desired.
  • minprob – Minimum frequency of observations in terminal nodes.
  • maxheight – Maximum height for the tree.

We'll use those, but for our fourth model we'll simply require a higher significance level for alpha2 and alpha4.

ctrl <- chaid_control(minsplit = 200, minprob = 0.05)
ctrl # notice the rest of the list is there at the default value
## $alpha2
## [1] 0.05
## 
## $alpha3
## [1] -1
## 
## $alpha4
## [1] 0.05
## 
## $minsplit
## [1] 200
## 
## $minbucket
## [1] 7
## 
## $minprob
## [1] 0.05
## 
## $stump
## [1] FALSE
## 
## $maxheight
## [1] -1
## 
## attr(,"class")
## [1] "chaid_control"
chaidattrit2 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit2)
## 
## Model formula:
## Attrition ~ BusinessTravel + Department + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + 
##     JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + 
##     OverTime + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] StockOptionLevel in 0
## |   |   |   [4] JobSatisfaction in Low: No (n = 84, err = 31.0%)
## |   |   |   [5] JobSatisfaction in Medium, High
## |   |   |   |   [6] JobInvolvement in Low: Yes (n = 12, err = 41.7%)
## |   |   |   |   [7] JobInvolvement in Medium, High, Very_High
## |   |   |   |   |   [8] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 181, err = 9.9%)
## |   |   |   |   |   [9] BusinessTravel in Travel_Frequently: No (n = 38, err = 28.9%)
## |   |   |   [10] JobSatisfaction in Very_High: No (n = 134, err = 7.5%)
## |   |   [11] StockOptionLevel in 1, 2, 3
## |   |   |   [12] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%)
## |   |   |   [13] EnvironmentSatisfaction in Medium, High, Very_High
## |   |   |   |   [14] Department in Human_Resources, Sales: No (n = 164, err = 8.5%)
## |   |   |   |   [15] Department in Research_Development: No (n = 314, err = 3.2%)
## |   [16] OverTime in Yes
## |   |   [17] JobLevel in 1: Yes (n = 156, err = 47.4%)
## |   |   [18] JobLevel in 2, 3, 4, 5
## |   |   |   [19] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [20] MaritalStatus in Single: No (n = 72, err = 34.7%)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 11
plot(chaidattrit2, main = "minsplit = 200, minprob = 0.05", gp = gpar(col = "blue", lty = "solid", lwd = 3))

ctrl <- chaid_control(maxheight = 3)
chaidattrit3 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit3)
## 
## Model formula:
## Attrition ~ BusinessTravel + Department + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + 
##     JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + 
##     OverTime + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] StockOptionLevel in 0
## |   |   |   [4] JobSatisfaction in Low: No (n = 84, err = 31.0%)
## |   |   |   [5] JobSatisfaction in Medium, High: No (n = 231, err = 15.6%)
## |   |   |   [6] JobSatisfaction in Very_High: No (n = 134, err = 7.5%)
## |   |   [7] StockOptionLevel in 1, 2, 3
## |   |   |   [8] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%)
## |   |   |   [9] EnvironmentSatisfaction in Medium, High, Very_High: No (n = 478, err = 5.0%)
## |   [10] OverTime in Yes
## |   |   [11] JobLevel in 1
## |   |   |   [12] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%)
## |   |   |   [13] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%)
## |   |   [14] JobLevel in 2, 3, 4, 5
## |   |   |   [15] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [16] MaritalStatus in Single: No (n = 72, err = 34.7%)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 9
plot(chaidattrit3, main = "maxheight = 3", gp = gpar(col = "blue", lty = "solid", lwd = 3))

ctrl <- chaid_control(alpha2 = .01, alpha4 = .01)
chaidattrit4 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit4)
## 
## Model formula:
## Attrition ~ BusinessTravel + Department + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + 
##     JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + 
##     OverTime + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] StockOptionLevel in 0
## |   |   |   [4] JobSatisfaction in Low
## |   |   |   |   [5] RelationshipSatisfaction in Low, Medium, High: No (n = 56, err = 42.9%)
## |   |   |   |   [6] RelationshipSatisfaction in Very_High: No (n = 28, err = 7.1%)
## |   |   |   [7] JobSatisfaction in Medium, High, Very_High
## |   |   |   |   [8] JobInvolvement in Low: No (n = 20, err = 45.0%)
## |   |   |   |   [9] JobInvolvement in Medium, High, Very_High
## |   |   |   |   |   [10] JobLevel in 1: No (n = 139, err = 18.0%)
## |   |   |   |   |   [11] JobLevel in 2, 3, 4, 5: No (n = 206, err = 5.8%)
## |   |   [12] StockOptionLevel in 1, 2, 3: No (n = 605, err = 6.3%)
## |   [13] OverTime in Yes
## |   |   [14] JobLevel in 1
## |   |   |   [15] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%)
## |   |   |   [16] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%)
## |   |   [17] JobLevel in 2, 3, 4, 5
## |   |   |   [18] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [19] MaritalStatus in Single
## |   |   |   |   [20] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%)
## |   |   |   |   [21] Department in Sales: Yes (n = 35, err = 40.0%)
## 
## Number of inner nodes:    10
## Number of terminal nodes: 11
plot(chaidattrit4, main = "alpha2 = .01, alpha4 = .01", gp = gpar(col = "blue", lty = "solid", lwd = 3))

Let me call your attention to chaidattrit3 for a minute to highlight two important things. First, it gives a good picture of the answer we would get if we asked which predictors are most important, i.e. which variables we should focus on. An important technical detail has emerged as well: notice in inner node #3 that there is no technical reason why a node has to have a binary split in CHAID. As this example clearly shows, node #3 leads to a three-way split, namely nodes #4-6.

How good is our model?

So the obvious question is: which model is best? IMHO the joy of CHAID is in giving you a clear picture of what you would predict given the data and why. Then of course there is the usual problem every data scientist has: I have what I think is a great model, but how well will it generalize to new data? Whether that's next year's attrition numbers for the same company or data from a different company.

But it’s time to talk about accuracy and all the related ideas, so on with the show…

When it's all said and done we built a model called chaidattrit1 to be able to predict or classify the 1,470 staff members. It seems reasonable, then, that we can get back these predictions from the model for all 1,470 people and see how we did compared to the data we have about whether they attrited or not. The print and plot commands summarize that for us at the terminal node level with an error rate, but all in all, which of our four models is best?

The first step is to get the predictions for each model and put them somewhere. For that we'll use the predict command. If you inspect the object you create (in my case with a head command) you'll see it's a factor vector whose names attribute is set to the terminal node each prediction is associated with. So pmodel1 <- predict(chaidattrit1) puts our predictions using the first model we built in a nice orderly fashion. On the other side newattrit$Attrition has the actual outcome of whether the employee departed or not.
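To make that structure concrete, here's a tiny base-R sketch (with made-up values, not the attrition data) of a factor vector whose names attribute holds terminal node labels, the same shape predict returns:

```r
# A named factor, mimicking the shape of predict()'s output: the values are
# the predicted class, the names are terminal node numbers (all made up here)
pred <- factor(c("Yes", "No", "Yes"), levels = c("No", "Yes"))
names(pred) <- c("38", "19", "23")
pred
```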

What we want is a comparison of how well we did. How often did we get it right or wrong? Turns out what we need is called a confusion matrix. The caret package has a function called confusionMatrix that will give us what we want nicely formatted and printed.

There’s a nice short summary of what is produced at this url Confusion Matrix, so I won’t even try to repeat that material. I’ll just run the appropriate commands. Later we’ll revisit this topic to be more efficient. For now I want to focus on the results.
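Before running it on the real models, it may help to see that a confusion matrix is nothing more than a cross-tabulation of predictions against reference values. A minimal base-R sketch with toy vectors (not the attrition data):

```r
# Toy predicted and actual labels -- purely illustrative
predicted <- factor(c("No", "No", "Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
actual    <- factor(c("No", "Yes", "Yes", "No", "No", "No"), levels = c("No", "Yes"))

# The confusion matrix is a cross-tabulation of prediction vs. reference
cm <- table(Prediction = predicted, Reference = actual)
cm

# Accuracy is the proportion on the diagonal (the cases we got right)
accuracy <- sum(diag(cm)) / sum(cm)
```

caret's confusionMatrix starts from exactly this table and layers the other statistics (sensitivity, specificity, kappa, and so on) on top.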

# digress how accurate were we
pmodel1 <- predict(chaidattrit1)
head(pmodel1)
##  38  19  23  23  16  14 
## Yes  No Yes Yes  No  No 
## Levels: No Yes
pmodel2 <- predict(chaidattrit2)
pmodel3 <- predict(chaidattrit3)
pmodel4 <- predict(chaidattrit4)
confusionMatrix(pmodel1, newattrit$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1190  147
##        Yes   43   90
## 
##                Accuracy : 0.8707
##                  95% CI : (0.8525, 0.8875)
##     No Information Rate : 0.8388
##     P-Value [Acc > NIR] : 0.0003553
## 
##                   Kappa : 0.4192
##  Mcnemar's Test P-Value : 7.874e-14
## 
##             Sensitivity : 0.9651
##             Specificity : 0.3797
##          Pos Pred Value : 0.8901
##          Neg Pred Value : 0.6767
##              Prevalence : 0.8388
##          Detection Rate : 0.8095
##    Detection Prevalence : 0.9095
##       Balanced Accuracy : 0.6724
## 
##        'Positive' Class : No
## 
confusionMatrix(pmodel2, newattrit$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1154  148
##        Yes   79   89
## 
##                Accuracy : 0.8456
##                  95% CI : (0.8261, 0.8637)
##     No Information Rate : 0.8388
##     P-Value [Acc > NIR] : 0.2516
## 
##                   Kappa : 0.353
##  Mcnemar's Test P-Value : 6.382e-06
## 
##             Sensitivity : 0.9359
##             Specificity : 0.3755
##          Pos Pred Value : 0.8863
##          Neg Pred Value : 0.5298
##              Prevalence : 0.8388
##          Detection Rate : 0.7850
##    Detection Prevalence : 0.8857
##       Balanced Accuracy : 0.6557
## 
##        'Positive' Class : No
## 
confusionMatrix(pmodel3, newattrit$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1202  179
##        Yes   31   58
## 
##                Accuracy : 0.8571
##                  95% CI : (0.8382, 0.8746)
##     No Information Rate : 0.8388
##     P-Value [Acc > NIR] : 0.02864
## 
##                   Kappa : 0.2936
##  Mcnemar's Test P-Value : < 2e-16
## 
##             Sensitivity : 0.9749
##             Specificity : 0.2447
##          Pos Pred Value : 0.8704
##          Neg Pred Value : 0.6517
##              Prevalence : 0.8388
##          Detection Rate : 0.8177
##    Detection Prevalence : 0.9395
##       Balanced Accuracy : 0.6098
## 
##        'Positive' Class : No
## 
confusionMatrix(pmodel4, newattrit$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1188  158
##        Yes   45   79
## 
##                Accuracy : 0.8619
##                  95% CI : (0.8432, 0.8791)
##     No Information Rate : 0.8388
##     P-Value [Acc > NIR] : 0.007845
## 
##                   Kappa : 0.3676
##  Mcnemar's Test P-Value : 3.815e-15
## 
##             Sensitivity : 0.9635
##             Specificity : 0.3333
##          Pos Pred Value : 0.8826
##          Neg Pred Value : 0.6371
##              Prevalence : 0.8388
##          Detection Rate : 0.8082
##    Detection Prevalence : 0.9156
##       Balanced Accuracy : 0.6484
## 
##        'Positive' Class : No
## 

There we have it, four matrices, one for each of the models we made with the different control parameters. It helpfully provides not just Accuracy but also other common measures you may be interested in. I won't review them all; that's why I provided the link to a detailed description of all the measures. Before we leave the topic for a bit, however, I do want to highlight a way you can use the purrr package to make your life a lot easier. A special thanks to Steven at MungeX-3D for his recent post on purrr, which got me thinking about it.

We have 4 models so far (with more to come), and we have the nice neat output from caret, but honestly comparing values across the 4 models involves way too much scrolling back and forth right now. Let's use purrr to create a nice neat dataframe. purrr's map command is like lapply from base R, designed to apply some operations or functions to a list of objects. So what we'll do is as follows:

  1. Create a named list called modellist to point to our four existing models (perhaps at a later date we'll start even earlier in our modelling process).
  2. It’s a named list so we can name each model (for now with the accurate but uninteresting name Modelx)
  3. Pass the list using map to the predict function to generate our predictions
  4. Pipe %>% those results to the confusionMatrix function with map
  5. Pipe %>% the confusion matrix results to map_dfr. The results of confusionMatrix are actually a list of six items. The ones we want to capture are in $overall and $byClass. We grab them, transpose them, make them into a dataframe, and then bind the two dataframes together so everything is neatly packaged. The .id = "ModelNumb" argument tells map_dfr to add an identifying column to the dataframe. It is populated with the name of the list item we passed in modellist. Therefore the object CHAIDResults contains everything we might want to use to compare models in one neat dataframe.
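The same pattern can be sketched in base R with lapply and do.call(rbind, ...), which is useful for seeing what map and map_dfr are doing under the hood. Note that the "models" and the confusion function below are stand-ins, not CHAID or caret objects:

```r
# Stand-in "models" (plain numeric vectors) and a stand-in confusion function
# that returns a list with $overall and $byClass, mimicking caret's structure
modellist <- list(Model1 = c(1, 2, 3), Model2 = c(4, 5, 6))
fake_confusion <- function(x) {
  list(overall = c(Accuracy = mean(x)), byClass = c(Sensitivity = max(x)))
}

# map()-style step: apply the function to every element of the list
results <- lapply(modellist, fake_confusion)

# map_dfr()-style step: one row per model, plus an identifying column
df <- do.call(rbind, lapply(names(results), function(nm) {
  data.frame(ModelNumb = nm,
             t(results[[nm]]$overall),
             t(results[[nm]]$byClass),
             stringsAsFactors = FALSE)
}))
df
```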

The kable call is simply for your reading convenience. Makes it a little easier to read than a traditional print call.

library(kableExtra)
modellist <- list(Model1 = chaidattrit1, Model2 = chaidattrit2, Model3 = chaidattrit3, Model4 = chaidattrit4)
CHAIDResults <- map(modellist, ~predict(.x)) %>% 
  map(~confusionMatrix(newattrit$Attrition, .x)) %>% 
  map_dfr(~cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "ModelNumb")
kable(CHAIDResults, "html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 9)

| ModelNumb | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model1 | 0.8707483 | 0.4191632 | 0.8525159 | 0.8874842 | 0.9095238 | 0.9999996 | 0.0e+00 | 0.8900524 | 0.6766917 | 0.9651257 | 0.3797468 | 0.9651257 | 0.8900524 | 0.9260700 | 0.9095238 | 0.8095238 | 0.8387755 | 0.7833720 |
| Model2 | 0.8455782 | 0.3529603 | 0.8260781 | 0.8636860 | 0.8857143 | 0.9999985 | 6.4e-06 | 0.8863287 | 0.5297619 | 0.9359286 | 0.3755274 | 0.9359286 | 0.8863287 | 0.9104536 | 0.8857143 | 0.7850340 | 0.8387755 | 0.7080453 |
| Model3 | 0.8571429 | 0.2936476 | 0.8382017 | 0.8746440 | 0.9394558 | 1.0000000 | 0.0e+00 | 0.8703838 | 0.6516854 | 0.9748581 | 0.2447257 | 0.9748581 | 0.8703838 | 0.9196634 | 0.9394558 | 0.8176871 | 0.8387755 | 0.7610346 |
| Model4 | 0.8619048 | 0.3676334 | 0.8432050 | 0.8791447 | 0.9156463 | 1.0000000 | 0.0e+00 | 0.8826152 | 0.6370968 | 0.9635036 | 0.3333333 | 0.9635036 | 0.8826152 | 0.9212873 | 0.9156463 | 0.8081633 | 0.8387755 | 0.7598560 |

One other thing I’ll mention in passing is that the partykit package offers a way of assessing the relative importance of the variables in the model via the varimp command. We’ll come back to this concept of variable importance later but for now a simple example of text and plot output.

sort(varimp(chaidattrit1), decreasing = TRUE)
##                 JobLevel                 OverTime  EnvironmentSatisfaction 
##              0.142756888              0.114384725              0.071069051 
##         StockOptionLevel            MaritalStatus          JobSatisfaction 
##              0.058726463              0.030332565              0.029157845 
##    TrainingTimesLastYear RelationshipSatisfaction               Department 
##              0.025637743              0.015700750              0.013815233 
##           BusinessTravel           JobInvolvement 
##              0.009906245              0.009205317
plot(sort(varimp(chaidattrit1), decreasing = TRUE))

What about those other variables?

But before we go much farther we should probably circle back and make use of all those variables that were coded as integers that we conveniently ignored in building our first four models. Let’s bring them into our model building activities and see what they can add to our understanding. As a first step let’s use ggplot2 and take a look at their distribution using a density plot.

# Turning numeric variables into factors
## what do they look like
attrition %>% 
  select_if(is.numeric) %>% 
  gather(metric, value) %>% 
  ggplot(aes(value, fill = metric)) + 
  geom_density(show.legend = FALSE) + 
  facet_wrap(~metric, scales = "free")

Well, other than Age, very few of those variables appear to have especially normal distributions. That's okay; we're going to wind up cutting them into factors anyway. The only question is what are the best cut-points to use? In base R the cut function defaults to equal intervals (distances along the x axis). You can also specify your own cutpoints and your own labels, as shown below.

table(cut(attrition$YearsWithCurrManager, breaks = 5))
## 
## (-0.017,3.4]    (3.4,6.8]   (6.8,10.2]  (10.2,13.6]    (13.6,17] 
##          825          158          414           54           19
table(attrition$YearsSinceLastPromotion)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
## 581 357 159  52  61  45  32  76  18  17   6  24  10  10   9  13
table(cut(attrition$YearsSinceLastPromotion, breaks = c(-1, 0.9, 1.9, 2.9, 30), labels = c("Less than 1", "1", "2", "More than 2")))
## 
## Less than 1           1           2 More than 2 
##         581         357         159         373

ggplot2 has three helper functions I prefer to use: cut_interval, cut_number, and cut_width. cut_interval makes n groups with equal range, cut_number makes n groups with (approximately) equal numbers of observations, and cut_width makes groups of a fixed specified width. As we think about moving the numeric variables into factors any of these might be a viable alternative.
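Under the hood cut_number is essentially a quantile-based cut, so a base-R approximation (on a toy vector, not the attrition data) looks like this:

```r
# Approximate cut_number(x, n = 5) with quantile breaks in base R
x <- c(1:20, 100)   # deliberately skewed toy data
bins <- cut(x,
            breaks = quantile(x, probs = seq(0, 1, length.out = 6)),
            include.lowest = TRUE)
table(bins)   # five groups of (approximately) equal size
```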

# cut_interval makes n groups with equal range
table(cut_interval(attrition$YearsWithCurrManager, n = 5))
## 
##     [0,3.4]   (3.4,6.8]  (6.8,10.2] (10.2,13.6]   (13.6,17] 
##         825         158         414          54          19
# cut_number makes n groups with (approximately) equal numbers of observations
table(cut_number(attrition$YearsWithCurrManager, n = 5))
## 
##  [0,1]  (1,2]  (2,4]  (4,7] (7,17] 
##    339    344    240    276    271
# cut_width makes groups of width width
table(cut_width(attrition$YearsWithCurrManager, width = 2))
## 
##  [-1,1]   (1,3]   (3,5]   (5,7]   (7,9]  (9,11] (11,13] (13,15] (15,17] 
##     339     486     129     245     171      49      32      10       9

For the sake of our current example let's say that I would like to focus on groups of more or less equal size, which means that I would need to apply cut_number to each of the 12 variables under discussion. I'm not enamored of running the function 12 times, though, so I would prefer to wrap it in a mutate_if statement: if the variable is numeric, then apply cut_number with n = 5.

The problem is that cut_number will error out if it doesn’t think there are enough values to produce the bins you requested. So…

cut_number(attrition$YearsWithCurrManager, n = 6)
# Error: Insufficient data values to produce 6 bins.
cut_number(attrition$YearsSinceLastPromotion, n = 4)
# Error: Insufficient data values to produce 4 bins.
attrition %>% mutate_if(is.numeric, funs(cut_number(., n = 5)))
# Error in mutate_impl(.data, dots) : 
#   Evaluation error: Insufficient data values to produce 5 bins..
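One defensive option, sketched here with a made-up helper (not part of ggplot2 or dplyr), is to drop duplicate quantile breaks and fall back gracefully when a variable is too lumpy to bin:

```r
# safe_bin is a hypothetical helper: bin with quantile breaks, but fall back
# to a plain factor when there are too few distinct values to form the bins
safe_bin <- function(x, n = 5) {
  brks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1)))
  if (length(brks) < 3) return(factor(x))   # too few distinct values
  cut(x, breaks = brks, include.lowest = TRUE)
}

# A pile-up at zero, much like YearsSinceLastPromotion
lumpy <- c(rep(0, 50), 1, 2)
nlevels(safe_bin(lumpy))          # falls back: one level per distinct value
nlevels(safe_bin(1:100, n = 5))   # well-behaved data: the requested 5 bins
```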

A little sleuthing reveals that there is one variable among the 12 that has too few values for the cut_number function to work. That variable is YearsSinceLastPromotion. Let’s try what we would like but explicitly select out that variable.

attrition %>% 
  select(-YearsSinceLastPromotion) %>% 
  mutate_if(is.numeric, funs(cut_number(., n = 5))) %>% 
  head
##       Age Attrition    BusinessTravel          DailyRate
## 1 (38,45]       Yes     Travel_Rarely     (942,1.22e+03]
## 2 (45,60]        No Travel_Frequently          [102,392]
## 3 (34,38]       Yes     Travel_Rarely (1.22e+03,1.5e+03]
## 4 (29,34]        No Travel_Frequently (1.22e+03,1.5e+03]
## 5 [18,29]        No     Travel_Rarely          (392,656]
## 6 (29,34]        No Travel_Frequently     (942,1.22e+03]
##             Department DistanceFromHome     Education EducationField
## 1                Sales            [1,2]       College  Life_Sciences
## 2 Research_Development            (5,9] Below_College  Life_Sciences
## 3 Research_Development            [1,2]       College          Other
## 4 Research_Development            (2,5]        Master  Life_Sciences
## 5 Research_Development            [1,2] Below_College        Medical
## 6 Research_Development            [1,2]       College  Life_Sciences
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                  Medium Female   (87,100]           High        2
## 2                    High   Male    (59,73]         Medium        2
## 3               Very_High   Male   (87,100]         Medium        1
## 4               Very_High Female    (45,59]           High        1
## 5                     Low   Male    [30,45]           High        1
## 6               Very_High   Male    (73,87]           High        1
##                 JobRole JobSatisfaction MaritalStatus       MonthlyIncome
## 1       Sales_Executive       Very_High        Single (5.74e+03,9.86e+03]
## 2    Research_Scientist          Medium       Married (4.23e+03,5.74e+03]
## 3 Laboratory_Technician            High        Single  [1.01e+03,2.7e+03]
## 4    Research_Scientist            High       Married  (2.7e+03,4.23e+03]
## 5 Laboratory_Technician          Medium       Married  (2.7e+03,4.23e+03]
## 6 Laboratory_Technician       Very_High        Single  (2.7e+03,4.23e+03]
##           MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike
## 1 (1.67e+04,2.17e+04]                  8      Yes           [11,12]
## 2  (2.17e+04,2.7e+04]                  1       No           (19,25]
## 3 [2.09e+03,6.89e+03]                  6      Yes           (13,15]
## 4  (2.17e+04,2.7e+04]                  1      Yes           [11,12]
## 5 (1.18e+04,1.67e+04]                  9       No           [11,12]
## 6 (1.18e+04,1.67e+04]                  0       No           (12,13]
##   PerformanceRating RelationshipSatisfaction StockOptionLevel
## 1         Excellent                      Low                0
## 2       Outstanding                Very_High                1
## 3         Excellent                   Medium                0
## 4         Excellent                     High                0
## 5         Excellent                Very_High                1
## 6         Excellent                     High                0
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1             (5,8]                     0             Bad          (5,7]
## 2            (8,10]                     3          Better         (7,10]
## 3             (5,8]                     3          Better          [0,2]
## 4             (5,8]                     3          Better         (7,10]
## 5             (5,8]                     3          Better          [0,2]
## 6             (5,8]                     2            Good          (5,7]
##   YearsInCurrentRole YearsWithCurrManager
## 1              (2,4]                (4,7]
## 2              (4,7]                (4,7]
## 3              [0,1]                [0,1]
## 4              (4,7]                [0,1]
## 5              (1,2]                (1,2]
## 6              (4,7]                (4,7]

Yes, that appears to be it. So let's manually cut it into 4 groups and then apply the 5-group code to the other 11 variables. Once we have accomplished that, we can run the same newattrit <- attrition %>% select_if(is.factor) we ran earlier to produce a newattrit dataframe we can work with.
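The same "bin every numeric column" move can be written in base R; here it is demonstrated on the built-in iris data rather than attrition:

```r
# Replace each numeric column with quantile-based bins -- a base-R analogue
# of mutate_if(is.numeric, funs(cut_number(., n = 5))), run on iris
binned <- iris
num_cols <- vapply(binned, is.numeric, logical(1))
binned[num_cols] <- lapply(binned[num_cols], function(x) {
  cut(x,
      breaks = unique(quantile(x, probs = seq(0, 1, length.out = 6))),
      include.lowest = TRUE)
})
str(binned)   # every column is now a factor
```

The unique() call guards against duplicate quantile breaks, the same failure mode cut_number hit above.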

attrition$YearsSinceLastPromotion <- cut(attrition$YearsSinceLastPromotion,
                                         breaks = c(-1, 0.9, 1.9, 2.9, 30),
                                         labels = c("Less than 1", "1", "2", "More than 2"))
attrition <- attrition %>% mutate_if(is.numeric, funs(cut_number(., n = 5)))
summary(attrition)
##       Age      Attrition            BusinessTravel
##  [18,29]:326   No :1233   Non-Travel       : 150  
##  (29,34]:325   Yes: 237   Travel_Frequently: 277  
##  (34,38]:255              Travel_Rarely    :1043  
##  (38,45]:291
##  (45,60]:273
## 
## 
##               DailyRate                  Department  DistanceFromHome
##  [102,392]         :294   Human_Resources     : 63   [1,2]  :419     
##  (392,656]         :294   Research_Development:961   (2,5]  :213     
##  (656,942]         :294   Sales               :446   (5,9]  :308     
##  (942,1.22e+03]    :294                              (9,17] :253     
##  (1.22e+03,1.5e+03]:294                              (17,29]:277     
## 
## 
##          Education            EducationField EnvironmentSatisfaction
##  Below_College:170   Human_Resources : 27    Low      :284          
##  College      :282   Life_Sciences   :606    Medium   :287          
##  Bachelor     :572   Marketing       :159    High     :453          
##  Master       :398   Medical         :464    Very_High:446          
##  Doctor       : 48   Other           : 82
##                      Technical_Degree:132
## 
##     Gender       HourlyRate    JobInvolvement JobLevel
##  Female:588   [30,45] :306   Low      : 83    1:543   
##  Male  :882   (45,59] :298   Medium   :375    2:534   
##               (59,73] :280   High     :868    3:218   
##               (73,87] :312   Very_High:144    4:106   
##               (87,100]:274                    5: 69   
## 
## 
##                       JobRole     JobSatisfaction  MaritalStatus
##  Sales_Executive          :326   Low      :289    Divorced:327  
##  Research_Scientist       :292   Medium   :280    Married :673  
##  Laboratory_Technician    :259   High     :442    Single  :470  
##  Manufacturing_Director   :145   Very_High:459
##  Healthcare_Representative:131
##  Manager                  :102
##  (Other)                  :215
##              MonthlyIncome              MonthlyRate  NumCompaniesWorked
##  [1.01e+03,2.7e+03] :294   [2.09e+03,6.89e+03]:294   1      :521       
##  (2.7e+03,4.23e+03] :294   (6.89e+03,1.18e+04]:294   0      :197       
##  (4.23e+03,5.74e+03]:294   (1.18e+04,1.67e+04]:294   3      :159       
##  (5.74e+03,9.86e+03]:294   (1.67e+04,2.17e+04]:294   2      :146       
##  (9.86e+03,2e+04]   :294   (2.17e+04,2.7e+04] :294   4      :139       
##                                                      7      : 74       
##                                                      (Other):234       
##  OverTime   PercentSalaryHike   PerformanceRating RelationshipSatisfaction
##  No :1054   [11,12]:408       Low        :   0    Low      :276           
##  Yes: 416   (12,13]:209       Good       :   0    Medium   :303           
##             (13,15]:302       Excellent  :1244    High     :459           
##             (15,19]:325       Outstanding: 226    Very_High:432           
##             (19,25]:226
## 
## 
##  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
##  0:631            [0,5]  :316       0: 54                 Bad   : 80     
##  1:596            (5,8]  :309       1: 71                 Good  :344     
##  2:158            (8,10] :298       2:547                 Better:893     
##  3: 85            (10,17]:261       3:491                 Best  :153     
##                   (17,40]:286       4:123
##                                     5:119
##                                     6: 65
##  YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
##  [0,2]  :342    [0,1] :301         Less than 1:581        
##  (2,5]  :434    (1,2] :372         1          :357        
##  (5,7]  :166    (2,4] :239         2          :159        
##  (7,10] :282    (4,7] :295         More than 2:373        
##  (10,40]:246    (7,18]:263
## 
## 
##  YearsWithCurrManager
##  [0,1] :339          
##  (1,2] :344          
##  (2,4] :240          
##  (4,7] :276          
##  (7,17]:271          
## 
newattrit <- attrition %>% select_if(is.factor)
dim(newattrit)
## [1] 1470   31

Now we have newattrit with all 30 predictor variables. We will simply repeat the process we used earlier to develop 4 new models.

# Repeat to produce models 5-8
chaidattrit5 <- chaid(Attrition ~ ., data = newattrit)
print(chaidattrit5)
## 
## Model formula:
## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] YearsAtCompany in [0,2]
## |   |   |   [4] Age in [18,29], (29,34]
## |   |   |   |   [5] StockOptionLevel in 0
## |   |   |   |   |   [6] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 56, err = 41.1%)
## |   |   |   |   |   [7] BusinessTravel in Travel_Frequently: Yes (n = 10, err = 10.0%)
## |   |   |   |   [8] StockOptionLevel in 1, 2, 3: No (n = 63, err = 15.9%)
## |   |   |   [9] Age in (34,38], (38,45], (45,60]
## |   |   |   |   [10] WorkLifeBalance in Bad: No (n = 4, err = 50.0%)
## |   |   |   |   [11] WorkLifeBalance in Good, Better, Best
## |   |   |   |   |   [12] EducationField in Human_Resources, Life_Sciences, Marketing, Medical: No (n = 92, err = 2.2%)
## |   |   |   |   |   [13] EducationField in Other, Technical_Degree: No (n = 13, err = 23.1%)
## |   |   [14] YearsAtCompany in (2,5], (5,7], (7,10], (10,40]
## |   |   |   [15] WorkLifeBalance in Bad: No (n = 45, err = 22.2%)
## |   |   |   [16] WorkLifeBalance in Good, Better, Best
## |   |   |   |   [17] JobSatisfaction in Low
## |   |   |   |   |   [18] StockOptionLevel in 0
## |   |   |   |   |   |   [19] RelationshipSatisfaction in Low: Yes (n = 11, err = 45.5%)
## |   |   |   |   |   |   [20] RelationshipSatisfaction in Medium: No (n = 12, err = 8.3%)
## |   |   |   |   |   |   [21] RelationshipSatisfaction in High: No (n = 17, err = 47.1%)
## |   |   |   |   |   |   [22] RelationshipSatisfaction in Very_High: No (n = 20, err = 0.0%)
## |   |   |   |   |   [23] StockOptionLevel in 1, 2, 3: No (n = 93, err = 4.3%)
## |   |   |   |   [24] JobSatisfaction in Medium, High, Very_High
## |   |   |   |   |   [25] Age in [18,29], (29,34], (34,38], (38,45]
## |   |   |   |   |   |   [26] BusinessTravel in Non-Travel, Travel_Rarely
## |   |   |   |   |   |   |   [27] JobInvolvement in Low: No (n = 25, err = 12.0%)
## |   |   |   |   |   |   |   [28] JobInvolvement in Medium, High, Very_High
## |   |   |   |   |   |   |   |   [29] RelationshipSatisfaction in Low: No (n = 81, err = 3.7%)
## |   |   |   |   |   |   |   |   [30] RelationshipSatisfaction in Medium, High: No (n = 198, err = 0.0%)
## |   |   |   |   |   |   |   |   [31] RelationshipSatisfaction in Very_High
## |   |   |   |   |   |   |   |   |   [32] DistanceFromHome in [1,2], (2,5], (5,9], (17,29]: No (n = 92, err = 2.2%)
## |   |   |   |   |   |   |   |   |   [33] DistanceFromHome in (9,17]: No (n = 13, err = 23.1%)
## |   |   |   |   |   |   [34] BusinessTravel in Travel_Frequently: No (n = 95, err = 8.4%)
## |   |   |   |   |   [35] Age in (45,60]
## |   |   |   |   |   |   [36] JobSatisfaction in Low, Medium, High
## |   |   |   |   |   |   |   [37] TotalWorkingYears in [0,5], (5,8], (8,10], (17,40]: No (n = 57, err = 0.0%)
## |   |   |   |   |   |   |   [38] TotalWorkingYears in (10,17]: No (n = 14, err = 28.6%)
## |   |   |   |   |   |   [39] JobSatisfaction in Very_High: No (n = 43, err = 20.9%)
## |   [40] OverTime in Yes
## |   |   [41] JobLevel in 1
## |   |   |   [42] StockOptionLevel in 0, 3
## |   |   |   |   [43] DistanceFromHome in [1,2], (2,5]
## |   |   |   |   |   [44] EnvironmentSatisfaction in Low: Yes (n = 12, err = 16.7%)
## |   |   |   |   |   [45] EnvironmentSatisfaction in Medium, High, Very_High: No (n = 33, err = 36.4%)
## |   |   |   |   [46] DistanceFromHome in (5,9], (9,17], (17,29]: Yes (n = 44, err = 18.2%)
## |   |   |   [47] StockOptionLevel in 1, 2
## |   |   |   |   [48] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 50, err = 26.0%)
## |   |   |   |   [49] BusinessTravel in Travel_Frequently: Yes (n = 17, err = 35.3%)
## |   |   [50] JobLevel in 2, 3, 4, 5
## |   |   |   [51] MaritalStatus in Divorced, Married
## |   |   |   |   [52] EnvironmentSatisfaction in Low, Medium: No (n = 60, err = 20.0%)
## |   |   |   |   [53] EnvironmentSatisfaction in High, Very_High
## |   |   |   |   |   [54] TrainingTimesLastYear in 0, 6: No (n = 10, err = 40.0%)
## |   |   |   |   |   [55] TrainingTimesLastYear in 1, 2, 3, 4, 5
## |   |   |   |   |   |   [56] YearsInCurrentRole in [0,1], (1,2]: No (n = 36, err = 11.1%)
## |   |   |   |   |   |   [57] YearsInCurrentRole in (2,4], (4,7], (7,18]: No (n = 82, err = 0.0%)
## |   |   |   [58] MaritalStatus in Single
## |   |   |   |   [59] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%)
## |   |   |   |   [60] Department in Sales: Yes (n = 35, err = 40.0%)
## 
## Number of inner nodes:    28
## Number of terminal nodes: 32
plot(chaidattrit5, main = "Default control sliced numerics", gp = gpar(col = "blue", lty = "solid", lwd = 3, fontsize = 8))

ctrl <- chaid_control(minsplit = 200, minprob = 0.05)
chaidattrit6 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit6)
## 
## Model formula:
## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] YearsAtCompany in [0,2]
## |   |   |   [4] Age in [18,29], (29,34]: No (n = 129, err = 32.6%)
## |   |   |   [5] Age in (34,38], (38,45], (45,60]: No (n = 109, err = 6.4%)
## |   |   [6] YearsAtCompany in (2,5], (5,7], (7,10], (10,40]
## |   |   |   [7] WorkLifeBalance in Bad: No (n = 45, err = 22.2%)
## |   |   |   [8] WorkLifeBalance in Good, Better, Best
## |   |   |   |   [9] JobSatisfaction in Low: No (n = 153, err = 12.4%)
## |   |   |   |   [10] JobSatisfaction in Medium, High, Very_High
## |   |   |   |   |   [11] Age in [18,29], (29,34], (34,38], (38,45]
## |   |   |   |   |   |   [12] BusinessTravel in Non-Travel, Travel_Rarely
## |   |   |   |   |   |   |   [13] JobInvolvement in Low: No (n = 25, err = 12.0%)
## |   |   |   |   |   |   |   [14] JobInvolvement in Medium, High, Very_High
## |   |   |   |   |   |   |   |   [15] RelationshipSatisfaction in Low: No (n = 81, err = 3.7%)
## |   |   |   |   |   |   |   |   [16] RelationshipSatisfaction in Medium, High: No (n = 198, err = 0.0%)
## |   |   |   |   |   |   |   |   [17] RelationshipSatisfaction in Very_High: No (n = 105, err = 4.8%)
## |   |   |   |   |   |   [18] BusinessTravel in Travel_Frequently: No (n = 95, err = 8.4%)
## |   |   |   |   |   [19] Age in (45,60]: No (n = 114, err = 11.4%)
## |   [20] OverTime in Yes
## |   |   [21] JobLevel in 1: Yes (n = 156, err = 47.4%)
## |   |   [22] JobLevel in 2, 3, 4, 5
## |   |   |   [23] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [24] MaritalStatus in Single: No (n = 72, err = 34.7%)
## 
## Number of inner nodes:    11
## Number of terminal nodes: 13
plot(chaidattrit6,
     main = "minsplit = 200, minprob = 0.05",
     gp = gpar(col = "blue", lty = "solid", lwd = 3, fontsize = 8))

ctrl <- chaid_control(maxheight = 3)
chaidattrit7 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit7)
## 
## Model formula:
## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] YearsAtCompany in [0,2]
## |   |   |   [4] Age in [18,29], (29,34]: No (n = 129, err = 32.6%)
## |   |   |   [5] Age in (34,38], (38,45], (45,60]: No (n = 109, err = 6.4%)
## |   |   [6] YearsAtCompany in (2,5], (5,7], (7,10], (10,40]
## |   |   |   [7] WorkLifeBalance in Bad: No (n = 45, err = 22.2%)
## |   |   |   [8] WorkLifeBalance in Good, Better, Best: No (n = 771, err = 6.6%)
## |   [9] OverTime in Yes
## |   |   [10] JobLevel in 1
## |   |   |   [11] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%)
## |   |   |   [12] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%)
## |   |   [13] JobLevel in 2, 3, 4, 5
## |   |   |   [14] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [15] MaritalStatus in Single: No (n = 72, err = 34.7%)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8
plot(chaidattrit7,
     main = "maxheight = 3",
     gp = gpar(col = "blue", lty = "solid", lwd = 3, fontsize = 8))

ctrl <- chaid_control(alpha2 = .01, alpha4 = .01)
chaidattrit8 <- chaid(Attrition ~ ., data = newattrit, control = ctrl)
print(chaidattrit8)
## 
## Model formula:
## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] YearsAtCompany in [0,2]
## |   |   |   [4] Age in [18,29], (29,34]
## |   |   |   |   [5] StockOptionLevel in 0: No (n = 66, err = 48.5%)
## |   |   |   |   [6] StockOptionLevel in 1, 2, 3: No (n = 63, err = 15.9%)
## |   |   |   [7] Age in (34,38], (38,45], (45,60]
## |   |   |   |   [8] WorkLifeBalance in Bad: No (n = 4, err = 50.0%)
## |   |   |   |   [9] WorkLifeBalance in Good, Better, Best: No (n = 105, err = 4.8%)
## |   |   [10] YearsAtCompany in (2,5], (5,7], (7,10], (10,40]
## |   |   |   [11] WorkLifeBalance in Bad: No (n = 45, err = 22.2%)
## |   |   |   [12] WorkLifeBalance in Good, Better, Best
## |   |   |   |   [13] JobSatisfaction in Low
## |   |   |   |   |   [14] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Manager, Manufacturing_Director, Research_Director, Research_Scientist, Sales_Executive
## |   |   |   |   |   |   [15] StockOptionLevel in 0: No (n = 58, err = 22.4%)
## |   |   |   |   |   |   [16] StockOptionLevel in 1, 2, 3: No (n = 92, err = 3.3%)
## |   |   |   |   |   [17] JobRole in Sales_Representative: Yes (n = 3, err = 0.0%)
## |   |   |   |   [18] JobSatisfaction in Medium, High, Very_High: No (n = 618, err = 5.2%)
## |   [19] OverTime in Yes
## |   |   [20] JobLevel in 1
## |   |   |   [21] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%)
## |   |   |   [22] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%)
## |   |   [23] JobLevel in 2, 3, 4, 5
## |   |   |   [24] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%)
## |   |   |   [25] MaritalStatus in Single
## |   |   |   |   [26] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%)
## |   |   |   |   [27] Department in Sales: Yes (n = 35, err = 40.0%)
## 
## Number of inner nodes:    13
## Number of terminal nodes: 14
plot(chaidattrit8,
     main = "alpha2 = .01, alpha4 = .01",
     gp = gpar(col = "blue", lty = "solid", lwd = 3, fontsize = 8))

As we did earlier, we'll repeat the steps necessary to build a table of results.

modellist <- list(Model1 = chaidattrit1, Model2 = chaidattrit2,
                  Model3 = chaidattrit3, Model4 = chaidattrit4,
                  Model5 = chaidattrit5, Model6 = chaidattrit6,
                  Model7 = chaidattrit7, Model8 = chaidattrit8)
CHAIDResults <- map(modellist, ~predict(.x)) %>%
  map(~confusionMatrix(newattrit$Attrition, .x)) %>%
  map_dfr(~cbind(as.data.frame(t(.x$overall)),
                 as.data.frame(t(.x$byClass))),
          .id = "ModelNumb")
kable(CHAIDResults, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                font_size = 10)

| ModelNumb | Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|-----------|----------|-------|---------------|---------------|--------------|----------------|---------------|-------------|-------------|----------------|----------------|-----------|--------|----|------------|----------------|----------------------|-------------------|
| Model1 | 0.8707483 | 0.4191632 | 0.8525159 | 0.8874842 | 0.9095238 | 0.9999996 | 0.0e+00 | 0.8900524 | 0.6766917 | 0.9651257 | 0.3797468 | 0.9651257 | 0.8900524 | 0.9260700 | 0.9095238 | 0.8095238 | 0.8387755 | 0.7833720 |
| Model2 | 0.8455782 | 0.3529603 | 0.8260781 | 0.8636860 | 0.8857143 | 0.9999985 | 6.4e-06 | 0.8863287 | 0.5297619 | 0.9359286 | 0.3755274 | 0.9359286 | 0.8863287 | 0.9104536 | 0.8857143 | 0.7850340 | 0.8387755 | 0.7080453 |
| Model3 | 0.8571429 | 0.2936476 | 0.8382017 | 0.8746440 | 0.9394558 | 1.0000000 | 0.0e+00 | 0.8703838 | 0.6516854 | 0.9748581 | 0.2447257 | 0.9748581 | 0.8703838 | 0.9196634 | 0.9394558 | 0.8176871 | 0.8387755 | 0.7610346 |
| Model4 | 0.8619048 | 0.3676334 | 0.8432050 | 0.8791447 | 0.9156463 | 1.0000000 | 0.0e+00 | 0.8826152 | 0.6370968 | 0.9635036 | 0.3333333 | 0.9635036 | 0.8826152 | 0.9212873 | 0.9156463 | 0.8081633 | 0.8387755 | 0.7598560 |
| Model5 | 0.8775510 | 0.4451365 | 0.8596959 | 0.8938814 | 0.9122449 | 0.9999968 | 0.0e+00 | 0.8926174 | 0.7209302 | 0.9708029 | 0.3924051 | 0.9708029 | 0.8926174 | 0.9300699 | 0.9122449 | 0.8142857 | 0.8387755 | 0.8067738 |
| Model6 | 0.8442177 | 0.3317731 | 0.8246542 | 0.8623944 | 0.8938776 | 1.0000000 | 1.0e-07 | 0.8820396 | 0.5256410 | 0.9399838 | 0.3459916 | 0.9399838 | 0.8820396 | 0.9100903 | 0.8938776 | 0.7884354 | 0.8387755 | 0.7038403 |
| Model7 | 0.8571429 | 0.2936476 | 0.8382017 | 0.8746440 | 0.9394558 | 1.0000000 | 0.0e+00 | 0.8703838 | 0.6516854 | 0.9748581 | 0.2447257 | 0.9748581 | 0.8703838 | 0.9196634 | 0.9394558 | 0.8176871 | 0.8387755 | 0.7610346 |
| Model8 | 0.8639456 | 0.3808988 | 0.8453515 | 0.8810715 | 0.9136054 | 1.0000000 | 0.0e+00 | 0.8845867 | 0.6456693 | 0.9635036 | 0.3459916 | 0.9635036 | 0.8845867 | 0.9223602 | 0.9136054 | 0.8081633 | 0.8387755 | 0.7651280 |

You can clearly see that OverTime remains the first split in our tree structure, but now other variables have started to influence the model as well, such as how long employees have worked for us (YearsAtCompany) and their Age. The table shows that model #5 is now the most accurate, not by a huge amount, but the numeric variables we ignored on the first pass do matter at least to some degree.
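If you'd rather not eyeball the full table, you can rank the models programmatically. This is a minimal sketch assuming the `CHAIDResults` data frame from the chunk above is still in memory (it uses `dplyr`, which the `tidyverse` we loaded earlier provides):

```r
# Rank the candidate models by overall accuracy, highest first,
# keeping just a few headline metrics for readability.
library(dplyr)

CHAIDResults %>%
  select(ModelNumb, Accuracy, Kappa, Sensitivity, Specificity) %>%
  arrange(desc(Accuracy))
```

With the results above, Model5 should sort to the top, followed closely by Model1.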

Not done yet

I'm not going to dwell on the current results too much; they are simply an example. In my next post I'd like to spend some time on over-fitting and cross-validation.
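As a teaser for that post: everything so far was fit and scored on the same rows, which flatters accuracy. A minimal sketch of the kind of hold-out check we'll need, assuming `newattrit` from earlier; the 70/30 split and the seed are my illustrative choices, not part of the analysis above:

```r
# Score the tree on rows it never saw during fitting.
library(CHAID)
library(caret)

set.seed(2018)
in_train <- createDataPartition(newattrit$Attrition, p = 0.7, list = FALSE)
training <- newattrit[in_train, ]
testing  <- newattrit[-in_train, ]

fit <- chaid(Attrition ~ ., data = training)
confusionMatrix(predict(fit, newdata = testing), testing$Attrition)
```

Expect the test-set accuracy to come in somewhat below the in-sample numbers in the table above; how much lower is exactly what cross-validation will tell us.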

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.



