
DEADLINE EXTENDED: Last call for Boston EARL abstracts


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Are you solving problems and innovating with R? Are you working with R in a commercial setting? Do you enjoy sharing your knowledge?

If you said yes to any of the above, we want your abstract!

Share your commercial R stories with your peers at EARL Boston this November.

EARL isn’t about knowing the most or being the best in your field – it’s about taking what you’ve learnt and sharing it with others, so they can learn from your wins (and sometimes your failures, because we all have them!).

As long as your proposed presentation is focused on the commercial use of R, any topic from any industry is welcome!

The abstract submission deadline has been extended to Sunday 3 September.

Join David Robinson, Mara Averick and Tareef Kawaf on 1-3 November 2017 at The Charles Hotel in Cambridge.

To submit your abstract on the EARL website, click here.

See you in Boston!



Probability functions intermediate


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we are going to explore some of the probability functions in R through practical applications. Basic probability knowledge is required. In case you are not familiar with the function apply, check the R documentation.

Note: We are going to use random number functions and random process functions in R such as runif. A problem with these functions is that every time you run them, you will obtain a different value. To make your results reproducible, you can specify the seed using set.seed() with any fixed number before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use set.seed(1). Don’t forget to specify it before every exercise that includes random numbers.
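To see what the seed does in practice, here is a quick illustration (not one of the exercises); resetting the same seed reproduces exactly the same draws:

set.seed(1)
runif(3)  # 0.2655087 0.3721239 0.5728534 with R's default generator
set.seed(1)
runif(3)  # the same three values again, because the seed was reset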

 

 

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating dice rolls: set your seed to 1 and generate 30 random numbers between 0 and 6 using runif. Save them in an object called random_numbers, then use the ceiling function to round the values up. The resulting values represent dice rolls.
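One possible sketch of the mechanics (your own answer may of course differ):

set.seed(1)
random_numbers <- runif(30, 0, 6)   # uniform draws between 0 and 6
dice <- ceiling(random_numbers)     # ceiling maps them to the faces 1-6
dice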

Exercise 2

Simulate one die roll using the function rmultinom with n = 1, and save it in an object called die_result. The matrix die_result contains one 1 and five 0s, with the 1 indicating which face was obtained. Use the function which to create an output that shows only the value obtained after the die is rolled.
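A possible sketch, with the probabilities of a fair six-sided die written out explicitly:

set.seed(1)
die_result <- rmultinom(n = 1, size = 1, prob = rep(1/6, 6))
which(die_result == 1)   # the row index is the face that came up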

Exercise 3

Using rmultinom, simulate 30 dice rolls. Save them in a variable called dice_result and use apply to transform the matrix into a vector with the result of each die.

Exercise 4

Some gambling games use two dice, and the values shown are summed after each roll. Simulate throwing two dice 30 times and record the sum of the values of each pair: use rmultinom to simulate throwing two dice 30 times, and use the function apply to record the sum of the values in each experiment.
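A possible sketch: each column of the rmultinom output counts how often each face appeared among the two dice, so weighting the counts by the face values 1:6 gives the sum of the pair.

set.seed(1)
two_dice <- rmultinom(n = 30, size = 2, prob = rep(1/6, 6))
sums <- apply(two_dice, 2, function(throw) sum(throw * 1:6))
sums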

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 5

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, use the function summary.
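A possible sketch:

set.seed(1)
heights <- rnorm(100, mean = 1.70, sd = 0.1)
summary(heights)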

Exercise 6

90% of the population is shorter than ____________?

Exercise 7

What percentage of the population is taller than 1.60 m?

Exercise 8

Run the following lines of code before this exercise. They will load a library required for the exercise.

if (!('MASS' %in% installed.packages())) install.packages('MASS')
library(MASS)

Simulate 1000 people with height and weight using the function mvrnorm, with mu = c(1.70, 60) and Sigma = matrix(c(.1, 3.1, 3.1, 100), nrow = 2).
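A possible sketch (the object name population is only an illustration, it is not prescribed by the exercise):

library(MASS)
set.seed(1)
population <- mvrnorm(1000,
                      mu = c(1.70, 60),
                      Sigma = matrix(c(.1, 3.1, 3.1, 100), nrow = 2))
head(population)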

Exercise 9

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

Exercise 10

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?


Pulling Data Out of Census Spreadsheets Using R


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr‘ is replaced by the two digits of the year for a given year, for example, ’04’ or ’11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA,NA,NA)

Second, the library ‘gdata’ is opened. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from an URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.

for (i in 1:11)
{
fn = paste("http://www2.census.gov/govs/local/",years[i], "slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The argument ‘header’ is set to ‘F’ so that all of the rows are input. Also, since all of the columns contain some character data, all of the data is forced to be character by setting ‘stringsAsFactors’ to ‘F’.  The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’.

ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame – giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that the asterisks in the pattern below are not wild cards in the glob sense – ‘grepl’ uses regular expressions and matches anywhere in a string, so the pattern ‘Hospital’ on its own would work just as well. For a character vector, ‘grepl’ returns ‘T’ if an element of the vector matches the pattern and ‘F’ otherwise.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. The object ‘ssi4’ contains the rows containing ‘Hospital’, since ‘ex[,ssi1]’ is a character vector instead of a data frame.   For all of the eleven years, the second incidence of ‘Hospital’ in the ‘Hospital’ column contains hospital expenditures.

ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital",ex[,ssi1], ignore.case=T))[2]

Next, the data are extracted, and the temporary objects are removed. If the column index of ‘Iowa’ is less than 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’, which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work. Then the temporary objects are removed and the loop ends.

if (ssi2 < 80) {
  ssi5 = ssi2 + 0:2
} else {
  ssi5 = ssi2 + c(0, 2, 3)
}
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1)=c(NA,NA,NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.

hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",","",hospital.capital.expend)),ncol=3)
hospital.capital.expend = cbind(2004:2014,hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!

    Related Post

    1. Extracting Tables from PDFs in R using the Tabulizer Package
    2. Extract Twitter Data Automatically using Scheduler R package
    3. An Introduction to Time Series with JSON Data
    4. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT
    5. Exporting Data from R to TXT, CSV, SPSS or Stata

    Multiplicative Congruential Generators in R


    (This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

    Part 2 of 2 in the series Random Number Generation

Multiplicative congruential generators, also known as Lehmer random number generators, are a type of linear congruential generator for generating pseudorandom numbers in U(0, 1). The multiplicative congruential generator, often abbreviated as MLCG or MCG, is defined as a recurrence relation similar to the LCG with c = 0.

X_{i+1} = a X_i \bmod m

Unlike the LCG, the parameters a and m for multiplicative congruential generators are more restricted, and the initial seed X_0 must be relatively prime to the modulus m (the greatest common divisor of X_0 and m is 1). The parameters in common use are m = 2^{31} - 1 = 2,147,483,647 and a = 7^5 = 16,807. However, in a correspondence in the Communications of the ACM, Park, Miller and Stockmeyer changed the value of the parameter a, stating:

    The minimal standard Lehmer generator we advocated had a modulus of m = 2^31 – 1 and a multiplier of a = 16807. Relative to this particular choice of multiplier, we wrote “… if this paper were to be written again in a few years it is quite possible that we would advocate a different multiplier ….” We are now prepared to do so. That is, we now advocate a = 48271 and, indeed, have done so “officially” since July 1990. This new advocacy is consistent with the discussion on page 1198 of [10]. There is nothing wrong with 16807; we now believe, however, that 48271 is a little better (with q = 44488, r = 3399).

    Multiplicative Congruential Generators with Schrage’s Method

When using a large prime modulus m such as 2^{31} - 1, the multiplicative congruential generator can overflow. Schrage’s method was invented to overcome the possibility of overflow. We can check that the parameters in use satisfy the condition a(m \bmod a) < m:

a <- 48271
m <- 2 ** 31 - 1
a * (m %% a) < m
## [1] TRUE

Schrage’s method restates the modulus m as the decomposition m = aq + r, where r = m \bmod a and q = \lfloor m / a \rfloor.
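The q and r quoted in the correspondence above can be recovered directly from this decomposition; a quick check in R for the advocated a = 48271:

m <- 2147483647
a <- 48271
q <- m %/% a   # 44488
r <- m %% a    # 3399
m == a * q + r
## [1] TRUE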

ax \bmod m = \begin{cases} a(x \bmod q) - r\lfloor x/q \rfloor & \text{if } a(x \bmod q) - r\lfloor x/q \rfloor \geq 0 \\ a(x \bmod q) - r\lfloor x/q \rfloor + m & \text{otherwise} \end{cases}

Multiplicative Congruential Generator in R

    We can implement a Lehmer random number generator in R using the parameters mentioned earlier.

lehmer.rng <- function(n = 10) {
  rng <- vector(length = n)
  m <- 2147483647
  a <- 48271
  q <- 44488
  r <- 3399
  # Set the seed using the current system time in microseconds.
  # The initial seed value must be coprime to the modulus m,
  # which we are not really concerning ourselves with for this example.
  d <- as.numeric(Sys.time())
  for (i in 1:n) {
    # Schrage's decomposition of a * d mod m
    h <- d %/% q
    l <- d %% q
    t <- a * l - r * h
    # As in the formula above, add m when the intermediate result is negative
    if (t < 0) {
      d <- t + m
    } else {
      d <- t
    }
    rng[i] <- d / m
  }
  return(rng)
}

# Print the first 10 randomly generated numbers
lehmer.rng()
##  [1] 0.68635675 0.12657390 0.84869106 0.16614698 0.08108171 0.89533896
##  [7] 0.90708773 0.03195725 0.60847522 0.70736551

Plotting our multiplicative congruential generator in three dimensions allows us to visualize the apparent ‘randomness’ of the generator. As before, we generate three random vectors x, y, z with our Lehmer RNG function and plot the points. The plot3D package is used to create the scatterplots and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)
saveGIF({
  for (i in 1:length(n)) {
    x <- lehmer.rng(n[i])
    y <- lehmer.rng(n[i])
    z <- lehmer.rng(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lehmer.gif')

    Multiplicative Congruential Generators at several values of n

    The generator appears to be generating suitably random numbers demonstrated by the increasing swarm of points as n increases.

    References

    Anne Gille-Genest (March 1, 2012). Implementation of the Pseudo-Random Number Generators and the Low Discrepancy Sequences.

    Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

Stephen K. Park; Keith W. Miller; Paul K. Stockmeyer (1993). “Technical Correspondence”. Communications of the ACM. 36 (7): 105–110.

    The post Multiplicative Congruential Generators in R appeared first on Aaron Schlegel.


    The Proof-Calculation Ping Pong


    Abstract

    The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results.

Daffodils

    Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

    Introduction

    If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the statistical proof-calculation ping-pong process.

    Case Story

    Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01        1       A         23000        10003       5
02        1       B         12200         7200       1
03        1      NA         19500         7609       2
04        1       A         17500        13000       3
05        1       B        119500        90000      NA
05        2       B        119500        95691      19
06        1       B            NA        19990       6
07        1       A         19123        20100       8
08        1       D         25000          100      NA
09        1       D         45860        32555       9
10        1       E         33020        25010       7

    Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

    • the overall number of fitness centres
    • total sales volume
    • total staff cost
    • total number of people employed in fitness centres

    However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

    Ada

    Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>%
  na.omit() %>%
  group_by(Region, Id) %>%
  top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
ada
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         1          45860         32555      9
## 4      E         1          33020         25010      7

    Bob

Bob has a dark past as a database developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach:

library(RSQLite)
## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
## Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite = TRUE, row.names = FALSE)
## Query using SQL
bob <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE [Sales Volume] > 17500
    GROUP BY Region")
bob
##   Region NoOfUnits Sales Volume Staff Costs People
## 1            1        19500        7609      2
## 2      A         2        42123       30103     13
## 3      B         2       239000      185691     19
## 4      D         2        70860       32655      9
## 5      E         1        33020       25010      7

    The Ping-Pong phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! conference in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++            1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9
## 5          E         1          33020         25010      7

After Ada’s and Bob’s serve, the two realize that their results agree for only one region (‘E’). Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
print(diff)
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++            1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9

    The two decide to first focus on agreeing on the number of units per region.

##    @@ Region NoOfUnits
## 1 +++            1
## 2  ->      A      3->2
## 3  ->      B      1->2
## 4  ->      D      1->2

    One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

    Pong Bob

bob2 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
    GROUP BY Region")
diff2 <- diff_data(ada, bob2, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff2 %>% print()
##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 ->      A      3->2   59623->42123  43103->30103 16->13
## 2 ->      B      1->2 119500->239000 95691->185691     19
## 3 ->      D      1->2   45860->70860  32555->32655      9

    Ping Bob

Better. Now the NA region is gone, but quite a few differences remain. Note: You may at this point want to stop reading and try to fix the analysis yourself – the data and code are available from the github repository.

    Pong Bob

    Now Bob notices that he forgot to handle the duplicate records and massages the SQL query as follows:

bob3 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM
      (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
       FROM FITNESS GROUP BY Id)
    WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
    GROUP BY Region")
diff3 <- diff_data(ada, bob3, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff3 %>% print()
##    @@ Region NoOfUnits Sales Volume  Staff Costs People
## 1 ...    ...       ...          ...          ...    ...
## 2  ->      D      1->2 45860->70860 32555->32655      9

    Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr‘s group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the not-so-important columns. However, they discuss whether one really wants to include partial records or not, because the summations in the different columns are then over a different number of units. After consulting the standard operating procedure (SOP) for these kinds of surveys, they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region, Id) %>%
  top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
(diff_final <- diff_data(ada2, bob3)) %>% print()
##    @@ Region NoOfUnits ... Staff Costs People
## 1 ...    ...       ... ...         ...    ...
## 2  ->      D         2 ...       32655  NA->9

    Pong Ada

    Oops, forgot to take care of the NA in the summation:

ada3 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region, Id) %>%
  top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People, na.rm = TRUE))
diff_final <- diff_data(ada3, bob3)
length(diff_final$get_data()) == 0
    ## [1] TRUE

    Conclusion

Finally, their results agree, they move on to production, and the results are published in a nice report.

    Question 1: Do you agree with their results?

    ada3
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         2          70860         32655      9
## 4      E         1          33020         25010      7

As shown, the ping-pong game is quite manual and particularly annoying if, at some point, someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap.” However, the two now-aligned analysis scripts and the corresponding daff overlay could be put into a script that is triggered every time the data change. In case new discrepancies emerge, i.e. length(diff$get_data()) > 0, the two could then be automatically informed.
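As a rough sketch of such a trigger (the wrapper functions ada_analysis and bob_analysis are hypothetical stand-ins for the two aligned scripts above):

check_agreement <- function(fitness) {
  ada <- ada_analysis(fitness)            # Ada's dplyr pipeline wrapped as a function
  bob <- bob_analysis(fitness)            # Bob's SQL query wrapped as a function
  diff <- daff::diff_data(ada, bob)
  if (length(diff$get_data()) > 0) {
    warning("Ada and Bob disagree - inspect the daff overlay")
    print(diff)
  }
  invisible(diff)
}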

    Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below.


    Mapping to a ‘t'(map)


    (This article was first published on HighlandR, and kindly contributed to R-bloggers)

tmap <- Easy & Interactive

    More maps of the Highlands?

    Yep, same as last time, but no need to install dev versions of anything, we can get awesome maps courtesy of the tmap package.

    Get the shapefile from the last post

library(tmap)
library(tmaptools)
library(viridis)

scot <- read_shape("SG_SIMD_2016.shp", as.sf = TRUE)
highland <- (scot[scot$LAName == "Highland", ])

# replicate plot from previous blog post:
quint <- tm_shape(highland) +
  tm_fill(col = "Quintile",
          palette = viridis(n = 5, direction = -1, option = "C"),
          fill.title = "Quintile",
          title = "SIMD 2016 - Highland Council Area by Quintile")
quint  # plot

ttm()  # switch between static and interactive - this will use interactive
quint  # or use last_map()
# in R Studio you will find the leaflet map in your Viewer tab
ttm()  # return to plotting

    The results:

    tmap-Highland-SIMD-Quintile.png

    One really nice thing is that because the polygons don’t have outlines, the DataZones that are really densely packed still render nicely – so no ‘missing’ areas.

    A static image of the leaflet map:

    leaflet-snapshot.png

Here I take the rank in each SIMD domain for all the Highland data zones, along with the overall SIMD rank, and create a small multiple:

small_mult <- tm_shape(highland) +
  tm_fill(col = c("IncRank", "EmpRank", "HlthRank", "EduRank",
                  "GAccRank", "CrimeRank", "HouseRank", "Rank"),
          palette = viridis(n = 5, direction = -1, option = "C"),
          title = c("Income Rank", "Employment Rank", "Health Rank",
                    "Education Rank", "General Access Rank", "Crime Rank",
                    "Housing Rank", "Overall Rank"))
small_mult

    2017-08-31-tmap-Highland-SIMD-All-Domains-Ranked.png

    Let’s take a look at Scotland as a whole, as I assume everyone’s pretty bored of the Highlands by now:

# try a map of Scotland
scotplot <- tm_shape(scot) +
  tm_fill(col = "Rank",
          palette = viridis(n = 5, direction = -1, option = "C"),
          fill.title = "Overall Rank",
          title = "Overall-Rank")
scotplot  # bit of a monster

    2017-08-31ScotSIMD.png

    With the interactive plot, we can really appreciate the density of these datazones in the Central belt.

    2017-08-31-scotland-leaflet.PNG

    Here, I’ve zoomed in a bit on the region around Glasgow, and then zoomed in some more:

    2017-08-31-scotland-leaflet-zoom-in1.PNG

    2017-08-31-scotland-leaflet-zoom.PNG

    I couldn’t figure out how to host the leaflet map within the page (Jekyll / Github / Leaflet experts please feel free to educate me on that 🙂 ) but, given the size of the file, I very much doubt I could have uploaded it to Github anyway.

    Thanks to Roger Bivand (@RogerBivand) for getting in touch and pointing me towards the tmap package! It’s really good fun and an easy way to get interactive maps up and running.


    OpenML Workshop 2017


    (This article was first published on mlr-org, and kindly contributed to R-bloggers)

    What is OpenML?

The field of Machine Learning has grown tremendously over the last years and is a key component of data-driven science. Data analysis algorithms are being invented and used every day, but their results and experiments are published almost exclusively in journals or separate repositories. However, data by itself has no value. It’s the ever-changing ecosystem surrounding data that gives it meaning.

OpenML is a networked science platform that aims to connect and organize all this knowledge online, linking data, algorithms, results and people into a coherent whole so that scientists and practitioners can easily build on prior work and collaborate in real time online.

    OpenML has an online interface on openml.org, and is integrated in the most popular machine learning tools and statistical environments such as R, Python, WEKA, MOA and RapidMiner. This allows researchers and students to easily import and export data from these tools and share them with others online, fully integrated into the context of the state of the art. On OpenML, researchers can connect to each other, start projects, and build on the results of others. It automatically keeps track of how often shared work is reused so that researchers can follow the wider impact of their work and become more visible.
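For R users the entry point is the OpenML package; a minimal sketch of what working against the platform looks like (dataset id 61 is used purely as an example, and some operations may require registering an API key first):

library(OpenML)
datasets <- listOMLDataSets()             # browse the available datasets
iris_oml <- getOMLDataSet(data.id = 61)   # download one of them
str(iris_oml$data)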

    When and Where?

    The OpenML workshop is organized as a hackathon, an event where participants from many scientific domains present their goals and ideas, and then work on them in small teams for many hours or days at a time. Participants bring their laptops, learn how to use OpenML in tutorials, and build upon that to create something great to push their research forward. The complete OpenML development team will be available to get them started, answer questions, and implement new features on the fly.

The next OpenML Workshop will run from 9 October to 13 October 2017; see also here.


    RcppAnnoy 0.0.9


    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version, 0.0.9, of RcppAnnoy, our Rcpp-based R integration of the nifty Annoy library by Erik, is now on CRAN. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.

    This release corrects an issue for Windows users discovered by GitHub user ‘khoran’ who later also suggested the fix of binary mode. It upgrades to Annoy release 1.9.1 and brings its new Manhattan distance to RcppAnnoy. A number of unit tests were added as well, and we updated some packaging internals such as symbol registration.
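As a rough illustration of the new Manhattan distance (assuming the AnnoyManhattan class exposed by this release, used in the same way as the existing AnnoyEuclidean and AnnoyAngular classes):

library(RcppAnnoy)
d <- 16
a <- new(AnnoyManhattan, d)                    # index for 16-dimensional vectors
for (i in 1:100) a$addItem(i - 1, runif(d))    # items are indexed from 0
a$build(10)                                    # build 10 trees
a$getNNsByItem(0, 5)                           # 5 approximate nearest neighbours of item 0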

    And I presume I had a good streak emailing with Uwe’s robots as the package made it onto CRAN rather smoothly within ten minutes of submission:

RcppAnnoy to CRAN

    Changes in this version are summarized here:

    Changes in version 0.0.9 (2017-08-31)

    • Synchronized with Annoy upstream version 1.9.1

    • Minor updates in calls and tests as required by annoy 1.9.1

    • New Manhattan distance modules along with unit test code

    • Additional unit tests from upstream test code carried over

    • Binary mode is used for save (as suggested by @khoran in #21)

    • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

    • Symbol registration is enabled in useDynLib

    Courtesy of CRANberries, there is also a diffstat report for this release.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    How rOpenSci uses Code Review to Promote Reproducible Science


    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    This is cross-posted from the NumFOCUS blog. NumFOCUS is a nonprofit that supports and promotes world-class, innovative, open source scientific computing, including rOpenSci.


    At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be a community effort. The variety of scientific data and workflows could only be tackled by drawing on contributions of scientists with field-specific expertise.

    With the community approach came challenges. How could we ensure the quality of code written by scientists without formal training in software development practices? How could we drive adoption of best practices among our contributors? How could we create a community that would support each other in this work?

We have had great success addressing these challenges via peer review. We draw elements from a process familiar to our target community – academic peer review – and a practice from the software development world – production code review – to create a system that fosters software quality, ongoing education, and community development.

    An Open Review Process

    Production software review occurs within software development teams, open source or not. Contributions to a software project are reviewed by one or more other team members before incorporation into project source code. Contributions are typically small patches, and review serves as a check on quality, as well as an opportunity for training in team standards.

    In academic peer review, external reviewers critique a complete product – usually a manuscript – with a very broad mandate to address any areas they see as deficient. Academic review is often anonymous and passing through it gives the product, and the author, a public mark of validation.

    We blend these approaches. In our process, authors submit complete R packages to rOpenSci. Editors check that packages fit into our project’s scope, run a series of automated tests to ensure a baseline of code quality and completeness, and then assign two independent reviewers. Reviewers comment on usability, quality, and style of software code as well as documentation. Authors make changes in response, and once reviewers are satisfied with the updates, the package receives a badge of approval and joins our suite.

    This process is quite iterative. After reviewers post a first round of extensive reviews, authors and reviewers chat in an informal back-and-forth, only lightly moderated by an editor. This lets both reviewers and authors pose questions of each other and explain differences of opinion. It can proceed at a much faster pace than typical academic review correspondence. We use the GitHub issues system for this conversation, and responses take minutes to days, rather than weeks to months.

    The exchange is also open and public. Authors, reviewers, and editors all know each other’s identities. The broader community can view or even participate in the conversation as it happens. This provides an incentive to be thorough and provide non-adversarial, constructive reviews. Both authors and reviewers report that they enjoy and learn more from this open and direct exchange. It also has the benefit of building community. Participants have the opportunity to meaningfully network with new peers, and new collaborations have emerged via ideas spawned during the review process.

    We are aware that open systems can have drawbacks. For instance, in traditional academic review, double-blind peer review can increase representation of female authors, suggesting bias in non-blind reviews. It is also possible reviewers are less critical in open review. However, we posit that the openness of the review conversation provides a check on review quality and bias; it’s harder to inject unsupported or subjective comments in public and without the cover of anonymity. Ultimately, we believe the ability of authors and reviewers to have direct but public communication improves quality and fairness.

    Guidance and Standards

    rOpenSci provides guidance on reviewing. This falls into two main categories: high-level best practices and low-level standards. High-level best practices are general and broadly applicable across languages and applications. These are practices such as “Write re-usable functions rather than repeating the same code,” “Test edge cases,” or “Write documentation for all of your functions.” Because of their general nature, these can be drawn from other sources and not developed from scratch. Our best practices are based on guidance originally developed by Mozilla Science Lab.

    Low-level standards are specific to a language (in our case, R), applications (data interfaces) and user base (researchers). These include specific items such as naming conventions for functions, best choices of dependencies for certain tasks, and adherence to a code style guide. We have an extensive set of standards for our reviewers to check. These change over time as the R software ecosystem evolves, best practices change, and tooling and educational resources make new methods available to developers.

Our standards also change based on feedback from reviewers. We adopt into our standards suggestions that emerge from multiple reviewers across different packages. Many of these, we’ve found, have to do with the ease-of-use and consistency of software APIs, and the type and location of information in documentation that make it easiest to find. This highlights one of the advantages of external reviewers – they can provide a fresh perspective on usability, as well as test software under different use-cases than imagined by the author.

    As our standards have become more extensive, we have come to rely more on automated tools. The R ecosystem, like most languages, has a suite of tools for code linting, function testing, static code analysis and continuous integration. We require authors to use these, and editors run submissions through a suite of tests prior to sending them for review. This frees reviewers from the burden of low-level tasks to focus on high-level critiques where they can add the most value.
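As an illustration only (not rOpenSci’s actual submission tooling), the kind of pre-review checks an editor might run locally on a hypothetical package directory looks like this:

devtools::check("mypackage")           # R CMD check: build, tests, documentation
covr::package_coverage("mypackage")    # how much of the code the tests exercise
goodpractice::gp("mypackage")          # linting, static analysis and common pitfalls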

    The Reviewer Community

    One of the core challenges and rewards of our work has been developing a community of reviewers.

    Reviewing is a high-skill activity. Reviewers need expertise in the programming methods used in a software package and also the scientific field of its application. (“Find me someone who knows sensory ecology and sparse data structures!”) They need good communications skills and the time and willingness to volunteer. Thankfully, the open-science and open-source worlds are filled with generous, expert people. We have been able to expand our reviewer pool as the pace of submissions and the domains of their applications have grown.

    Developing the reviewer pool requires constant recruitment. Our editors actively and broadly engage with developer and research communities to find new reviewers. We recruit from authors of previous submissions, co-workers and colleagues, at conferences, through our other collaborative work and on social media. In the open-source software ecosystem, one can often identify people with particular expertise by looking at their published software or contribution to other projects, and we often will cold-email potential reviewers whose published work suggests they would be a good match for a submission.

    We cultivate our reviewer pool as well as expand it. We bring back reviewers so that they may develop reviewing as a skill, but not so often as to overburden them. We provide guidance and feedback to new recruits. When assigning reviewers to a submission, we aim to pair experienced reviewers with new ones, or reviewers with expertise on a package’s programming methods with those experienced in its field of application. These reviewers learn from each other, and diversity in perspectives is an advantage; less experienced developers often provide insight that more experienced ones do not on software usability, API design, and documentation. More experienced developers will more often identify inefficiencies in code, pitfalls due to edge-cases, or suggest alternate implementation approaches.

    Expanding Peer Review for Code

    Code review has been one of rOpenSci’s best initiatives. We build software, build skills, and build community, and the peer review process has been a major catalyst for all three. It has made both the software we develop internally and that we accept from outside contributors more reliable, usable, and maintainable. So we are working to promote open peer review of code by more organizations working with scientific software. We helped launch The Journal of Open Source Software, which uses a version of our review process to provide a developer-friendly publication venue. JOSS’s success has led to a spin-off, the Journal of Open Source Education, which uses an open, code-review-like processes to provide feedback on curricula and educational materials. We are also developing a pilot program where software papers submitted to a partner academic journal receive a badge for going through rOpenSci review. We are encouraged by other review initiatives like ReScience and The Programming Historian. BioConductor’s code reviews, which predate ours by several years, recently switched to an open model.

    If your organization is developing or curating scientific code, we believe code review, implemented well, can be a great benefit. It can take considerable effort to begin, but here are some of the key lessons we’ve learned that can help:

    • Develop standards and guidelines for your authors and reviewers to use. Borrow these freely from other projects (feel free to use ours), and modify them iteratively as you go.
    • Use automated tools such as code linters, test suites, and test coverage measures to reduce burden on authors, reviewers, and editors as much as possible.
    • Have a clear scope. Spell out to yourselves and potential contributors what your project will accept, and why. This will save a lot of confusion and effort in the future.
    • Build a community through incentives of networking, opportunities to learn, and kindness.

    rOpenSci is eager to work with other groups interested in developing similar review processes, especially if you are interested in reviewing and curating scientific software in languages other than R or beyond our scope of supporting the data life cycle. Software that implements statistical algorithms, for instance, is an area ripe for open peer review of code. Please get in touch if you have questions or wish to co-pilot a review system for your project.

    And of course, if you want to review, we’re always looking for volunteers. Sign up at https://ropensci.org/onboarding.


    You can support rOpenSci by becoming a NumFOCUS member or making a donation to the project.


    Practical Data Science for Stats


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    PeerJ Preprints has recently published a collection of articles that focus on the practical side of statistical analysis: Practical Data Science for Stats. While the articles are not peer-reviewed, they have been selected and edited by Jennifer Bryan and Hadley Wickham, both well-respected members of the R community. And while the articles provide great advice for any data scientist, the content does heavily feature the use of R, so it's particularly useful to R users.


    There are 16 articles in the collection (with possibly more to come?). Here are just a few examples that caught my eye:

    There's lots more to explore in the collection as well, including case studies on using R at the likes of AirBnB and the New York Mets. Check out the entire collection at the link below.

    PeerJ Collections: Practical Data Science for Stats (via Jenny Bryan)


    Summarizing dataset to apply machine learning – exercises


    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    INTRODUCTION

    Dear reader,

    If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

    This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

Before proceeding, please follow our short tutorial. Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers, and finally check the solutions to verify your answers.

    Exercise 1

    Create a list of 80% of the rows in the original dataset to use for training. HINT: Use createDataPartition().

    Exercise 2

    Select 20% of the data for validation.

    Exercise 3

    Use the remaining 80% of data to train and test the models.
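One way Exercises 1-3 could look in practice, assuming the iris data used later in this set and the caret package (whose createDataPartition() the hint in Exercise 1 refers to):

library(caret)
set.seed(1)   # for reproducibility; the seed value itself is an arbitrary choice
in_train   <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
validation <- iris[-in_train, ]   # 20% held back for validation
training   <- iris[ in_train, ]   # remaining 80% used to train and test the models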

    Exercise 4

    Find the dimensions of the “iris” dataset. HINT: Use dim().

    Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

    • Create a machine learning algorithm from a beginner point of view
    • Quickly dive into more advanced methods in an accessible pace and with more explanations
    • And much more

    This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.

    Exercise 5

    Find the type of each attribute in your dataset. HINT: Use sapply().

    Exercise 6

    Take a look at the first 5 rows of your dataset. HINT: Use head().

    Exercise 7

    Find the levels of the variable “Species.” HINT: Use levels().

    Exercise 8

    Find the percentages of rows that belong to the labels you found in Exercise 7. HINT: Use prop.table() and table().

    Exercise 9

    Display the absolute count of instances for each class as well as its percentage. HINT: Use cbind().

    Exercise 10

    Display the summary of the “iris” dataset. HINT: Use summary().


    Pair Programming Statistical Analyses


    (This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

    Abstract

    Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results and illustrate its use by a case study with two statisticians ping-ponging an analysis using dplyr and SQL, respectively.

    Clip art is based on work by Gan Khoon Lay available under a CC BY 3.0 US license.

    Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

    Introduction

    If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the control calculation ping-pong process.

    The Case Story

    Ada and Bob have to calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. The present exposition will for the sake of convenience ignore the complex survey part of the problem and just pretend that the sample corresponds to the population (complete inventory count).

    The Data

    After the questionnaire phase, the following data are available to Ada and Bob.

    Id   Version   Region   Sales Volume   Staff Costs   People
    01         1        A          23000         10003        5
    02         1        B          12200          7200        1
    03         1       NA          19500          7609        2
    04         1        A          17500         13000        3
    05         1        B         119500         90000       NA
    05         2        B         119500         95691       19
    06         1        B             NA         19990        6
    07         1        A          19123         20100        8
    08         1        D          25000           100       NA
    09         1        D          45860         32555        9
    10         1        E          33020         25010        7

    Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

    • the overall number of fitness centres
    • total sales volume
    • total staff cost
    • total number of people employed in fitness centres

    However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.
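    For readers who want a third, independent reference while following along, the protocol can also be written down in base R. The sketch below assumes the questionnaire data sit in a data frame called fitness with the columns shown above; it keeps the latest questionnaire version per center, applies the €17,500 limit, drops records without a region and ignores missing values in the summations.

    keep_latest <- function(d) d[which.max(d$Version), ]
    latest <- do.call(rbind, lapply(split(fitness, fitness$Id), keep_latest))
    ok <- !is.na(latest$Region) &
          !is.na(latest$`Sales Volume`) &
          latest$`Sales Volume` >= 17500
    latest <- latest[ok, ]
    # Number of units per region
    table(latest$Region)
    # Region-wise totals, ignoring remaining NAs in the summations
    aggregate(cbind(`Sales Volume`, `Staff Costs`, People) ~ Region,
              data = latest, FUN = sum, na.rm = TRUE, na.action = na.pass)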

    Ada

    Ada loves the tidyverse and in particular dplyr. This is her solution:

    ada <- fitness %>%
      na.omit() %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People))
    ada
    ## # A tibble: 4 x 5
    ##   Region NoOfUnits `Sales Volume` `Staff Costs` People
    ## 1      A         3          59623         43103     16
    ## 2      B         1         119500         95691     19
    ## 3      D         1          45860         32555      9
    ## 4      E         1          33020         25010      7

    Bob

    Bob has a dark past as a relational database management system (RDBMS) developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite within R approach. The hope is that in big data situation this might be a little more speedy than base R:

    library(RSQLite)
    ## Create ad-hoc database
    db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
    ## Move fitness data into the ad-hoc DB
    dbWriteTable(conn = db, name = "fitness", fitness, overwrite = TRUE, row.names = FALSE)
    ## Query using SQL
    bob <- dbGetQuery(db, "
        SELECT Region,
               COUNT(*) AS NoOfUnits,
               SUM([Sales Volume]) AS [Sales Volume],
               SUM([Staff Costs]) AS [Staff Costs],
               SUM(People) AS People
        FROM fitness
        WHERE [Sales Volume] > 17500
        GROUP BY Region")
    bob
    ##   Region NoOfUnits Sales Volume Staff Costs People
    ## 1            1        19500        7609      2
    ## 2      A         2        42123       30103     13
    ## 3      B         2       239000      185691     19
    ## 4      D         2        70860       32655      9
    ## 5      E         1        33020       25010      7

    Update: An alternative approach with less syntactic overhead would have been the sqldf package, which has a standard SQLite backend and automagically handles the import of the data.frame into the DB using the RSQLite pkg.

    ## Load package
    suppressPackageStartupMessages(library(sqldf))
    ## Ensure SQLite backend
    options(sqldf.driver = "SQLite")
    ## Same query as before
    bob <- sqldf("
        SELECT Region,
               COUNT(*) AS NoOfUnits,
               SUM([Sales Volume]) AS [Sales Volume],
               SUM([Staff Costs]) AS [Staff Costs],
               SUM(People) AS People
        FROM fitness
        WHERE [Sales Volume] > 17500
        GROUP BY Region")

    Even shorter is the direct use of SQL chunks in knitr using the variable db as connection and using the chunk argument output.var=bob:

    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM fitness
    WHERE [Sales Volume] > 17500
    GROUP BY Region
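    If you write the analysis in R Markdown, such a chunk would look roughly like this (a sketch of the chunk header the text refers to; db is the SQLite connection created above):

    ```{sql, connection=db, output.var="bob"}
    SELECT Region, COUNT(*) AS NoOfUnits, SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People
    FROM fitness
    WHERE [Sales Volume] > 17500
    GROUP BY Region
    ```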

    The Ping-Pong Phase

    After Ada and Bob each have a result, they compare their resulting data.frame using the daff package, which was recently presented by @edwindjonge at the useR in Brussels.

    library(daff)
    diff <- diff_data(ada, bob)
    diff$get_data()
    ##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 +++            1          19500          7609      2
    ## 2  ->      A      3->2   59623->42123  43103->30103 16->13
    ## 3  ->      B      1->2 119500->239000 95691->185691     19
    ## 4  ->      D      1->2   45860->70860  32555->32655      9
    ## 5          E         1          33020         25010      7

    After Ada’s and Bob’s serve, the two realize that their results just agree for the region E.

    Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

    ## Small helper function for better printing
    print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
    diff %>% print()
    ##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 +++            1          19500          7609      2
    ## 2  ->      A      3->2   59623->42123  43103->30103 16->13
    ## 3  ->      B      1->2 119500->239000 95691->185691     19
    ## 4  ->      D      1->2   45860->70860  32555->32655      9

    The two decide to first agree on the number of units per region.

    diff$get_data() %>%
      filter(`@@` != "") %>%
      select(`@@`, Region, NoOfUnits)
    ##    @@ Region NoOfUnits
    ## 1 +++            1
    ## 2  ->      A      3->2
    ## 3  ->      B      1->2
    ## 4  ->      D      1->2

    One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

    Pong Bob

    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM fitness
    WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
    GROUP BY Region
    ##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 ->      A      3->2   59623->42123  43103->30103 16->13
    ## 2 ->      B      1->2 119500->239000 95691->185691     19
    ## 3 ->      D      1->2   45860->70860  32555->32655      9

    Ping Bob

    Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis – the data and code are available from the github repository.

    Pong Bob

    Now Bob notices that he forgot to handle the duplicate records and apparently misunderstood the exact definition of the €17,500 exclusion limit. His massaged SQL query looks as follows:

    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
          FROM fitness GROUP BY Id)
    WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
    GROUP BY Region
    ##    @@ Region NoOfUnits Sales Volume  Staff Costs People
    ## 1 ...    ...       ...          ...          ...    ...
    ## 2  ->      D      1->2 45860->70860 32555->32655      9

    Ping Ada

    Comparing with Ada, Bob is sort of envious that she was able to just use dplyr's group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada's code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the not-so-important columns. However, they discuss whether one really wants to include partial records or not, because the summations in the different columns are then over different numbers of units. After consulting the standard operating procedure (SOP) for this kind of survey they decide to include the observation where possible. Here is Ada's modified code:

    ada2 <- fitness %>%
      filter(!is.na(Region)) %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People))
    (diff_final <- diff_data(ada2, bob3)) %>% print()
    ##    @@ Region NoOfUnits ... Staff Costs People
    ## 1 ...    ...       ... ...         ...    ...
    ## 2  ->      D         2 ...       32655  NA->9

    Pong Ada

    Oops, Ada forgot to take care of the NA in the summation:

    ada3 <- fitness %>%
      filter(!is.na(Region)) %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People, na.rm = TRUE))
    diff_final <- diff_data(ada3, bob3)
    ## Check if the results really are the same
    length(diff_final$get_data()) == 0
    ## [1] TRUE

    Finally, their results agree, the analysis moves to production, and the results are published in a nice report.

    Conclusion

    As shown, the ping-pong game is quite manual and particularly annoying, if at some point someone steps into the office with a statement like Btw, I found some extra questionnaires, which need to be added to the analysis asap. However, the two now aligned analysis scripts and the corresponding daff-overlay could be put into a script, which is triggered every time the data change. In case new discrepancies emerge as length(diff$get_data()) > 0, the two could then be automatically informed.
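    Such a trigger could be as small as the following sketch, where run_ada() and run_bob() are hypothetical wrappers around the two aligned analysis scripts and the notification step is only indicated in a comment:

    check_pingpong <- function(fitness) {
      res_ada <- run_ada(fitness)   # hypothetical wrapper around Ada's dplyr script
      res_bob <- run_bob(fitness)   # hypothetical wrapper around Bob's SQL script
      diff <- daff::diff_data(res_ada, res_bob)
      if (length(diff$get_data()) > 0) {
        # e.g. save daff::render_diff(diff) to a file and e-mail it to Ada and Bob
        warning("Ada and Bob disagree again.")
      }
      invisible(diff)
    }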

    Question 1: Now the two get the same results, do you agree with them?

    ## # A tibble: 4 x 5
    ##   Region NoOfUnits `Sales Volume` `Staff Costs` People
    ## 1      A         3          59623         43103     16
    ## 2      B         1         119500         95691     19
    ## 3      D         2          70860         32655      9
    ## 4      E         1          33020         25010      7

    Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below. Control calculations appear quite common, but little structured code support appears to be available for such processes.

    Daffodils

    Photo is copyright Johnathan J. Stegeman under the GNU Free Documentation License, version 1.2.


    To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....


    Permutation Theory In Action


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

    Providing good solutions to gaps like this is one of the things Win-Vector LLC does both in our consulting and training practices.

    Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).

    set.seed(326346)
    symbols <-letters[1:7]
    
    o1 <-sample(symbols, length(symbols), replace =FALSE)
    o1
    ## [1] "e" "a" "b" "f" "d" "c" "g"
    o2 <-sample(symbols, length(symbols), replace =FALSE)
    o2
    ## [1] "d" "g" "f" "e" "b" "c" "a"

    To translate Spark results into R results we need a permutation that takes o1 to o2. The idea is: if we had a permutation that takes o1 to o2 we could use it to re-map predictions that are in o1 order to be predictions in o2 order.

    To solve this we crack open our article on the algebra of permutations.

    We are going to use the fact that the R command base::order(x) builds a permutation p such that x[p] is in order.
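    A tiny illustration of that fact (values chosen arbitrarily):

    x <- c(30, 10, 20)
    p <- order(x)
    p
    ## [1] 2 3 1
    x[p]
    ## [1] 10 20 30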

    Given this the solution is: we find permutations p1 and p2 such that o1[p1] is ordered and o2[p2] is ordered. Then build a permutation perm such that o1[perm] = (o1[p1])[inverse_permutation(p2)]. I.e., to get from o1 to o2 move o1 to sorted order and then move from the sorted order to o2‘s order (by using the reverse of the process that sorts o2). Again, the tools to solve this are in our article on the relation between permutations and indexing.

    Below is the complete solution (including combining the two steps into a single permutation):

    p1 <-order(o1)
    p2 <-order(o2)
    
    # invert p2# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
    p2inv <-seq_len(length(p2))
    p2inv[p2] <-seq_len(length(p2))
    
    (o1[p1])[p2inv]
    ## [1] "d" "g" "f" "e" "b" "c" "a"
    # composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]]# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
    perm <-p1[p2inv]
    o1[perm]
    ## [1] "d" "g" "f" "e" "b" "c" "a"

    The equivalence "(o1[p1])[p2inv] == o1[p1[p2inv]]" is frankly magic (though it also quickly follows "by definition"), and studying it is the topic of our original article on permutations.

    The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    alphavantager: An R interface to the Free Alpha Vantage Financial Data API


    We're excited to announce the alphavantager package, a lightweight R interface to the Alpha Vantage API! Alpha Vantage is a FREE API for retrieving real-time and historical financial data. It's very easy to use, and, with the recent glitch with the Yahoo Finance API, Alpha Vantage is a solid alternative for retrieving financial data for FREE! It's definitely worth checking out if you are interested in financial analysis. We'll go through the alphavantager R interface in this post to show you how easy it is to get real-time and historical financial data. In the near future, we have plans to incorporate alphavantager into tidyquant to enable scaling from one equity to many.

    If you like what you read, please follow us on social media to stay up on the latest Business Science news, events and information! As always, we are interested in both expanding our network of data scientists and seeking new clients interested in applying data science to business and finance. If interested, contact us.

    Alpha Vantage

    Alpha Vantage is a free service that enables users to get real-time and historical equity data. New users will need to visit Alpha Vantage and obtain an API key.


    R Interface: Getting Started

    The alphavantager package provides a convenient and lightweight interface to the Alpha Vantage API.

    To get started, install the package from CRAN or from GitHub:

    install.packages("alphavantager")# Ordevtools::install_github("business-science/alphavantager")

    Load the package.

    library(alphavantager)

    Set your API key (get one from Alpha Vantage if you don’t already have one… it’s free).

    av_api_key("YOUR_API_KEY")print(av_api_key())
    ## [1] "YOUR_API_KEY"

    Getting Financial Data from Alpha Vantage

    Now, you're ready to get financial data via av_get(), which accepts the same arguments as the API documentation parameters. The function is set up with two primary arguments, symbol and av_fun, which accept an equity and one of the API "function" parameters. You can pass additional API parameters via ....

    # Function is streamlined and user adds additional parameters via ...
    args(av_get)
    ## function (symbol = NULL, av_fun, ...)
    ## NULL

    Let’s take av_get() for a test spin!

    Time Series Data

    We can get real-time intraday stock data by setting av_fun = "TIME_SERIES_INTRADAY" and the interval to one of the available resolutions (“1min”, “5min”, “15min”, “30min” or “60min”). We can also get daily, daily adjusted, weekly and monthly time series. Note that only 100 rows are returned by default, add the parameter outputsize = "full" to get the full number of rows.

    av_get(symbol="MSFT",av_fun="TIME_SERIES_INTRADAY",interval="15min",outputsize="compact")
    ## # A tibble: 100 x 6##              timestamp   open    high    low  close volume##                            ##  1 2017-08-29 11:30:00 72.800 72.8601 72.710 72.725 250459##  2 2017-08-29 11:45:00 72.730 72.8600 72.680 72.850 354759##  3 2017-08-29 12:00:00 72.850 72.9400 72.835 72.870 328545##  4 2017-08-29 12:15:00 72.870 72.9300 72.870 72.890 157598##  5 2017-08-29 12:30:00 72.900 72.9300 72.860 72.890 193864##  6 2017-08-29 12:45:00 72.890 72.9300 72.860 72.920 140217##  7 2017-08-29 13:00:00 72.923 73.0000 72.900 72.910 163803##  8 2017-08-29 13:15:00 72.910 72.9600 72.880 72.890 132402##  9 2017-08-29 13:30:00 72.890 73.0100 72.860 72.985 470568## 10 2017-08-29 13:45:00 72.985 73.0199 72.920 72.980 292979## # ... with 90 more rows

    Technical Indicators

    A full suite of real-time and historical technical indicators are available. The “SMA” (Simple Moving Average) is shown below.

    av_get(symbol="MSFT",av_fun="SMA",interval="monthly",time_period=60,series_type="close")
    ## # A tibble: 153 x 2##              timestamp     sma##                    ##  1 2005-01-31 05:00:00 47.5243##  2 2005-02-28 05:00:00 46.4542##  3 2005-03-31 05:00:00 45.0862##  4 2005-04-29 04:00:00 44.3453##  5 2005-05-31 04:00:00 43.7327##  6 2005-06-30 04:00:00 42.8133##  7 2005-07-29 04:00:00 42.0767##  8 2005-08-31 04:00:00 41.3695##  9 2005-09-30 04:00:00 40.7932## 10 2005-10-31 05:00:00 40.0737## # ... with 143 more rows

    Sector Performances

    Various real-time and historical sector performances are available.

    av_get(av_fun="SECTOR")
    ## # A tibble: 106 x 3##                       rank.group                     sector value##                                                   ##  1 Rank A: Real-Time Performance                     Energy  0.81##  2 Rank A: Real-Time Performance                  Materials  0.63##  3 Rank A: Real-Time Performance     Consumer Discretionary  0.46##  4 Rank A: Real-Time Performance                 Financials  0.44##  5 Rank A: Real-Time Performance           Consumer Staples  0.42##  6 Rank A: Real-Time Performance                Industrials  0.15##  7 Rank A: Real-Time Performance Telecommunication Services  0.04##  8 Rank A: Real-Time Performance     Information Technology -0.05##  9 Rank A: Real-Time Performance                Real Estate -0.08## 10 Rank A: Real-Time Performance                Health Care -0.08## # ... with 96 more rows

    Important Notes: av_get()

    For the most part, the av_get() function works the same way as the Alpha Vantage API parameters. However, users will want to understand a few important aspects of the R interface (a short usage sketch follows this list):

    1. The av_fun argument replaces the API parameter “function” because function is a reserved name in R. All other arguments match the Alpha Vantage API parameters.

    2. There is no need to specify the apikey parameter as an argument to av_get(). The required method is to set the API key using av_api_key("YOUR_API_KEY").

    3. There is no need to specify the datatype parameter as an argument to av_get(). The function will return a tibble data frame.

    4. Some data sets only return 100 rows by default. Change the parameter outputsize = "full" to get the full dataset.
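    Putting these notes together, a typical call looks like the following sketch (substitute your own key; "TIME_SERIES_DAILY" is one of the "function" values listed in the Alpha Vantage documentation):

    av_api_key("YOUR_API_KEY")
    av_get(symbol = "MSFT",
           av_fun = "TIME_SERIES_DAILY",
           outputsize = "full")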

    Next Steps

    We have plans to integrate alphavantager into tidyquant, which will enable scaling from one equity to many! The change will be incorporated into tq_get(), the one-stop shop for getting financial data. It’s coming soon!!!

    About Business Science

    We have a full suite of data science services to supercharge your organization's financial and business performance! For example, our experienced data scientists reduced a manufacturer's sales forecasting error by 50%, which led to improved personnel planning, material purchasing and inventory management.

    How do we do it? With team-based data science: Using our network of data science consultants with expertise in Marketing, Forecasting, Finance, Human Resources and more, we pull together the right team to get custom projects done on time, within budget, and of the highest quality. Learn about our data science services or contact us!

    We are growing! Let us know if you are interested in joining our network of data scientist consultants. If you have expertise in Marketing Analytics, Data Science for Business, Financial Analytics, Forecasting or data science in general, we’d love to talk. Contact us!

    Follow Business Science on Social Media


    Unifying packages to help users access R repositories


    (This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers)

    With over 11,000 packages in the main CRAN (Comprehensive R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, a lot of packages have similar or overlapping features. Some are, to be kind, not very well written, use methods which are outdated, or are simply inappropriate to a given situation.

    In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.

    There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:

    • unification of packages that have similar goals or features (of which more in this posting)
    • guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
    • search tools for R packages (pending posting by Spencer Graves)

    There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!

    Approaches to unification of packages

    In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.

    My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:

    • developing wrappers that provide a common syntax or interface to multiple packages
    • finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
    • promoting collaboration between developers of similar packages with the aim of consolidating those packages
    • encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.

    None of these approaches can be considered “easy”.

    Wrappers

    My own efforts to unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to allow a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx to optimr to allow for easier maintenance of the package. Note that because these packages call other packages, failures in the called functions may give rise to error messages in the unifying package. To avoid too many CRAN alerts, optimrx can be found on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/ and can call more solvers than optimr. I believe that optimrx makes a worthwhile step towards simplifying the user’s task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers has been written by Hans Werner Borchers. gloptim is so far only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort. They are never truly finished because new solvers become available from time to time.
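    As a small illustration of the wrapper idea, the CRAN version of optimx lets you run several solvers on the same problem in one call and compare the results side by side (a sketch using the classic Rosenbrock test function; the solver names are passed straight through to the underlying methods):

    library(optimx)
    # Rosenbrock banana function, a standard optimization test problem
    fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
    optimx(par = c(-1.2, 1), fn = fr, method = c("Nelder-Mead", "BFGS"))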

    A sample of some other packages which unify a number of methods:

    • mlr: Machine Learning in R
    • caret: Classification And REgression Training for developing predictive models
    • Matrix: Matrix methods for dense and sparse matrices (What would we do without them?)

    We could also consider the nls() function in the stats package that is distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look in packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the last package to allow for analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with linear structure with some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don’t yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
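    To make the nonlinear least squares point concrete, here is a hedged sketch contrasting the default Gauss-Newton code in nls() with the Levenberg-Marquardt implementation in the minpack.lm package (the data, model and starting values are invented for illustration):

    library(minpack.lm)
    set.seed(42)
    d <- data.frame(x = 1:25)
    d$y <- 2 * exp(-0.3 * d$x) + rnorm(25, sd = 0.05)
    # Default Gauss-Newton in base R
    fit_gn <- nls(y ~ a * exp(b * x), data = d, start = list(a = 1, b = -0.1))
    # Marquardt-stabilized fit with the same formula interface
    fit_lm <- nlsLM(y ~ a * exp(b * x), data = d, start = list(a = 1, b = -0.1))

    Both fits converge on this easy problem; the Marquardt stabilization mainly pays off when the starting values are poor or the problem is ill-conditioned.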

    Archiving

    During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear that they felt they could not voice their opinions openly because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. In this, the fact that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia was noted. We are relatively protected from backlash to honest commentary. Here I genuinely mean honest and not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.

    Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.

    Filtered presentation

    The CRAN repository is, at present, offered to users as a list alphabetically or by date on the different mirrors, (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, either novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it will be helpful if automated tools are developed that do the lion’s share of the work for this.

    Another approach to filtering is to present R tools within a different framework. https://www.jamovi.org/ uses a spreadsheet paradigm. https://www.tidyverse.org/ collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.

    Directions

    Unification of R packages with the aim to reduce user costs in finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method, and getting a paper published about it. On the other hand, making R infrastructure more accessible to users is key to the continued vitality of the overall resource and community.

    John C. Nash    2017-9-3


    To leave a comment for the author, please follow the link and comment on their blog: Blog: John C. Nash.



    Combining and comparing models using Model Confidence Set


    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Gabriel Vasconcelos

    In many cases, especially if you are dealing with forecasting models, it is natural to use a large set of models to forecast the same variable and then select the best model using some error measure. For example, you can break the sample into a training sample (in-sample) and a test sample (out-of-sample), estimate all models in the training sample and see how they perform in the test sample. You could compare the models using the root mean squared error (RMSE) or the mean absolute error (MAE).

    If you want to make things a little more formal, then you can use some statistical test such as Diebold-Mariano to compare the models pairwise. This test will tell you if model A is better than model B. If you have n models to compare you will need to perform n(n-1)/2 tests. If  n is large you will have a lot of test results to look at and reporting the results will become a problem (imagine that you have 100 models, you will have 4950 p-values to report).
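    As a quick sanity check of that count, the number of unordered pairs among 100 models is

    choose(100, 2)
    ## [1] 4950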

    Fortunately, Hansen, Lunde and Nason proposed (here) a suitable solution in an article in 2011. Their solution is called Model Confidence Set (MCS). The MCS has an interpretation similar to a confidence interval, i. e. the MCS_{1-\alpha} contains the best model with 100(1-\alpha)\% confidence. The number of models in the MCS increases as we decrease \alpha just like the size of a confidence interval. You only need to perform the MCS procedure once to compare all models and you have to report only the p-value for each model, which is an output of the test.

    Application

    Before going to the application, let us define a function that calculates the RMSE for several models at the same time, which we will be using a lot.

    rmse = function(real, ...){
      forecasts = cbind(...)
      sqrt(colMeans((real - forecasts)^2))
    }

    To the application!! We are going to generate some data with the following design:

    • 300 observations (t=1,\dots,300) and 8 variables (n=1,\dots,8),
    • All variables are generated from y_{i,t}=\phi y_{i,t-1}+\lambda_i f_t +\varepsilon_{i,t} with \phi=0.6, where \lambda_i is sampled from a uniform(-1,1) distribution and \varepsilon_{i,t} is Normal with mean 0 and variance 4,
    • f_t is a common factor generated from f_{t}=\rho f_{t-1}+\epsilon_t with \rho=0.75, where \epsilon_t is Normal with mean 0 and variance 1,
    • I generated two independent factors: four variables are generated from factor 1 and the remaining four variables are generated from factor 2.
    • We assume that we do not know which factor generates each variable.
    library(reshape2)
    library(grid)
    library(gridExtra)
    library(ggplot2)
    library(boot)

    T = 300 # = number of observations = #
    n = 8 # = Number of Variables = #
    rho = 0.75 # = factors AR coefficient = #
    phi = 0.6 # = Variables AR coefficient = #
    set.seed(1) # = seed for replication = #

    # = Generate two autoregressive factors = #
    f = matrix(0, T, 2)
    for(i in 2:T){
      f[i, ] = rho * f[i-1, ] + rnorm(2)
    }

    lambda = runif(n, -1, 1) # = Factor loadings = #

    # = Generate variables = #
    Y = matrix(0, T, n)
    for(i in 2:T){
      # = Based on the first factor = #
      Y[i, 1:4] = phi * Y[i-1, 1:4] + lambda[1:4] * f[i, 1] + rnorm(4, 0, 2)
      # = Based on the second factor = #
      Y[i, 5:8] = phi * Y[i-1, 5:8] + lambda[5:8] * f[i, 2] + rnorm(4, 0, 2)
    }

    # = Organize variable lags = #
    Y = as.data.frame(embed(Y, 2)[ , c(1, 9:16)])
    # = y is the first (dependent) variable = #
    colnames(Y) = c("y", paste("x", 1:n, sep = ""))

    Now that we have our data, we are going to estimate all possible combinations of linear regressions using our 8 variables. The first variable will be the dependent variable, defined as y. The explanatory variables are the first lags of all eight variables (including the first), defined as x_i, i=1,\dots,8. This design results in 255 models to be estimated and compared.

    # = First we will generate formulas for all the 255 equations = #
    variables = colnames(Y)[-1]
    combinations = lapply(1:n, function(x) combn(1:n, x))
    formulas = lapply(combinations, function(y){
      apply(y, 2, function(x){
        paste("~", paste(variables[x], collapse = " + "), sep = " ")
      })
    })
    formulas = unlist(formulas)

    # = Now we break the sample into in-sample and out-of-sample = #
    in_sample = Y[1:100, ]
    out_of_sample = Y[-c(1:100), ]

    # = Estimate the models = #
    models = lapply(formulas, function(x) lm(paste("y", x), data = in_sample))

    # = Compute forecasts = #
    forecasts = Reduce(cbind, lapply(models, function(x) predict(x, out_of_sample)))
    colnames(forecasts) = paste("model", 1:ncol(forecasts), sep = "")

    # RMSEs (Mean of all models, model with all variables)
    rmse(out_of_sample$y, rowMeans(forecasts), forecasts[, ncol(forecasts)])
    ## [1] 2.225175 2.322508

    As you can see, the average forecast is more accurate than the model using all 8 variables. Now we are going to compute the MCS using the function below. If you want more details on the algorithm, just look at the original article here.

    mcs = function(Loss, R, l){
      LbarB = tsboot(Loss, colMeans, R = R, sim = "fixed", l = l)$t
      Lbar = colMeans(Loss)
      zeta.Bi = t(t(LbarB) - Lbar)
      save.res = c()

      for(j in 1:(ncol(Loss) - 1)){
        Lbardot = mean(Lbar)
        zetadot = rowMeans(zeta.Bi)
        vard = colMeans((zeta.Bi - zetadot)^2)
        t.stat = (Lbar - Lbardot)/sqrt(vard)
        t.max = max(t.stat)
        model.t.max = which(t.stat == t.max)
        t.stat.b = t(t(zeta.Bi - zetadot)/sqrt(vard))
        t.max.b = apply(t.stat.b, 1, max)
        p = length(which(t.max < t.max.b))/R
        save.res = c(save.res, p)
        names(save.res)[j] = names(model.t.max)
        Lbar = Lbar[-model.t.max]
        zeta.Bi = zeta.Bi[, -model.t.max]
      }

      save.res = c(save.res, 1)
      names(save.res)[j + 1] = names(Lbar)
      save.p = save.res
      for(i in 2:(length(save.res) - 1)){
        save.p[i] = max(save.res[i - 1], save.res[i])
      }
      aux = match(colnames(Loss), names(save.p))
      save.p = save.p[aux]
      save.res = save.res[aux]
      return(list(test = save.p, individual = save.res))
    }

    The MCS is computed in a single line of code, but first we are going to break the out-of-sample into two sub-samples. We are going to compute the MCS in the first sub-sample and compare the models in the second sub-sample. The MCS function returns the p-value needed to include each model in the set. For example, if model1 has a p-value=0.1, it will be included in the MCS of 90\% confidence.

    # = generate 2 subsamples = #
    out1 = out_of_sample[1:100, ]
    out2 = out_of_sample[-c(1:100), ]
    f1 = forecasts[1:100, ]
    f2 = forecasts[-c(1:100), ]

    # = Compute MCS = #
    set.seed(123) # = Seed for replication = #
    # = R block bootstrap samples with block length 3 = #
    MCS = mcs(Loss = (f1 - out1$y)^2, R = 1000, l = 3)$test
    # = If you want to see the p-values write print(sort(MCS)) = #

    # = Selected models in the 50% MCS = #
    selected = which(MCS > 0.5)

    # RMSE (Mean of all models, model with all variables, MCS models)
    rmse(out2$y, rowMeans(f2), f2[, ncol(f2)], rowMeans(f2[, selected]))
    ## [1] 2.142932 2.240110 2.013547

    As you can see, if we combine only the models in the MCS the RMSE becomes smaller. This is not always the case, but it happens a lot. The combined results are better when the forecasts do not have a very high positive correlation. The MCS is not useful only to combine forecasts. Even if the combinations are not better than individual models, you still have a confidence set that includes the best model with the desired significance. If some model is excluded from the confidence set you have strong evidence that this model is worse than the ones in the set (depending on the confidence level, of course). Naturally, if you increase the confidence level, the number of models in the set also increases, just like in confidence intervals.

    Finally, we can find some interesting relationships in the plots below. The first plot shows the forecasting errors as a function of the confidence level. The optimal confidence level is around 55\%. If we increase it further, the MCS starts to select a lot of bad models and the forecast becomes less accurate. The second plot shows the number of models as a function of the confidence level and the third plot shows the errors as a function of the number of models.

    alpha = seq(0, 1, 0.01)
    nmodels = sapply(alpha, function(x) length(which(MCS >= x)))
    errors = sapply(alpha, function(x) rmse(out2$y, rowMeans(as.matrix(f2[ , which(MCS >= x)]))))
    df = data.frame(confidence = 1 - alpha, errors = errors, nmodels = nmodels)
    dfmelt = melt(df, id.vars = "errors")
    g1 = ggplot(data = df) + geom_point(aes(x = confidence, y = errors))
    g2 = ggplot(data = df) + geom_point(aes(x = confidence, y = nmodels))
    g3 = ggplot(data = df) + geom_point(aes(x = nmodels, y = errors))
    grid.arrange(g1, g2, g3, ncol = 2)

    [Figure: three scatterplots showing errors vs. confidence level, number of selected models vs. confidence level, and errors vs. number of models]

    References

    Hansen, Peter R., Asger Lunde, and James M. Nason. “The model confidence set.” Econometrica 79.2 (2011): 453-497.


    To leave a comment for the author, please follow the link and comment on their blog: R – insightR.


    Variable Selection with Elastic Net


    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

    LASSO has been a popular algorithm for the variable selection and extremely effective with high-dimension data. However, it often tends to “over-regularize” a model that might be overly compact and therefore under-predictive.

    The Elastic Net addresses the aforementioned "over-regularization" by balancing between the LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, controls the mix of the two penalties so that the model becomes a LASSO when Alpha = 1 and a ridge regression when Alpha = 0. In practice, Alpha can be tuned easily by cross-validation. Below is a demonstration of the Elastic Net with the R glmnet package and a comparison with LASSO and ridge models.
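    For reference, and following the notation used in the glmnet documentation, the penalized objective being minimized is

    \min_{\beta_0,\beta} \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i, \beta_0 + x_i^{T}\beta) + \lambda\left[(1-\alpha)\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],

    so that \alpha = 1 recovers the pure LASSO penalty and \alpha = 0 the pure ridge penalty.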

    pkgs <- list("glmnet", "doParallel", "foreach", "pROC")
    lapply(pkgs, require, character.only = T)
    registerDoParallel(cores = 4)

    df1 <- read.csv("Downloads/credit_count.txt")
    df2 <- df1[df1$CARDHLDR == 1, ]
    set.seed(2017)
    n <- nrow(df2)
    sample <- sample(seq(n), size = n * 0.5, replace = FALSE)
    train <- df2[sample, -1]
    test <- df2[-sample, -1]
    mdlY <- as.factor(as.matrix(train["DEFAULT"]))
    mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])
    newY <- as.factor(as.matrix(test["DEFAULT"]))
    newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

    First of all, we estimate a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of the 12 in total.

    # LASSO WITH ALPHA = 1
    cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", parallel = TRUE, alpha = 1)
    md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1)
    coef(md1)
    #(Intercept) -1.963030e+00
    #AGE          .
    #ACADMOS      .
    #ADEPCNT      .
    #MAJORDRG     .
    #MINORDRG     .
    #OWNRENT      .
    #INCOME      -5.845981e-05
    #SELFEMPL     .
    #INCPER       .
    #EXP_INC      .
    #SPENDING     .
    #LOGSPEND    -4.015902e-02

    roc(newY, as.numeric(predict(md1, newX, type = "response")))
    #Area under the curve: 0.636

    We next estimate a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched for by cross-validation. Since the ridge penalty only regularizes the magnitude of each coefficient, we end up with a "full" model containing all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

    # RIDGE WITH ALPHA = 0
    cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", parallel = TRUE, alpha = 0)
    md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0)
    coef(md2)
    #(Intercept) -2.221016e+00
    #AGE         -4.184422e-04
    #ACADMOS     -3.085096e-05
    #ADEPCNT      1.485114e-04
    #MAJORDRG     6.684849e-03
    #MINORDRG     1.006660e-03
    #OWNRENT     -9.082750e-03
    #INCOME      -6.960253e-06
    #SELFEMPL     3.610381e-03
    #INCPER      -3.881890e-07
    #EXP_INC     -1.416971e-02
    #SPENDING    -1.638184e-05
    #LOGSPEND    -6.213884e-03

    roc(newY, as.numeric(predict(md2, newX, type = "response")))
    #Area under the curve: 0.6435

    At last, we use the Elastic Net by tuning the value of Alpha through a parallel line search. In this particular case, Alpha = 0.3 is chosen through cross-validation. As shown below, 6 variables are used in the model, which even performs better than the ridge model with all 12 attributes.

    # ELASTIC NET WITH 0 < ALPHA < 1
    a <- seq(0.1, 0.9, 0.05)
    search <- foreach(i = a, .combine = rbind) %dopar% {
      cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", parallel = TRUE, alpha = i)
      data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i)
    }
    cv3 <- search[search$cvm == min(search$cvm), ]
    md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha)
    coef(md3)
    #(Intercept) -1.434700e+00
    #AGE         -8.426525e-04
    #ACADMOS      .
    #ADEPCNT      .
    #MAJORDRG     6.276924e-02
    #MINORDRG     .
    #OWNRENT     -2.780958e-02
    #INCOME      -1.305118e-04
    #SELFEMPL     .
    #INCPER      -2.085349e-06
    #EXP_INC      .
    #SPENDING     .
    #LOGSPEND    -9.992808e-02

    roc(newY, as.numeric(predict(md3, newX, type = "response")))
    #Area under the curve: 0.6449


    To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


    It is Needlessly Difficult to Count Rows Using dplyr


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    • Question: how hard is it to count rows using the R package dplyr?
    • Answer: surprisingly difficult.

    When trying to count rows using dplyr or dplyr-controlled data structures (remote tbls such as Sparklyr or dbplyr structures), one is sailing between Scylla and Charybdis: the task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").

    [Image: Johann Heinrich Füssli painting (Wikimedia file "Johann Heinrich Füssli 054")]

    Let’s take an example from sparklyr issue 973:

    suppressPackageStartupMessages(library("dplyr"))packageVersion("dplyr")
    ## [1] '0.7.2.9000'
    library("sparklyr")packageVersion("sparklyr")
    ## [1] '0.6.2'
    sc <-spark_connect(master ="local")
    ## * Using Spark: 2.1.0
    db_drop_table(sc, 'extab', force =TRUE)
    ## [1] 0
    DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
    DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
    DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")

    dRemote <- tbl(sc, "extab")
    print(dRemote)
    ## # Source:   table [?? x 1]
    ## # Database: spark_connection
    ##       n
    ## 1    01
    ## 2    02
    ## 3    03

    dLocal <- data.frame(n = as.raw(1:3))
    print(dLocal)
    ##    n
    ## 1 01
    ## 2 02
    ## 3 03

    Many Apache Spark big data projects use the TINYINT type to save space. TINYINT behaves as a numeric type on the Spark side (you can run it through SparkML machine learning models correctly), and the translation of this type to R's raw type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have R raw columns in it, so we should expect our tools to work properly with such columns present.

    Now let’s try to count the rows of this table:

    nrow(dRemote)
    ## [1] NA

    That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

    There are a number of common legitimate uses of nrow() in user code and package code including:

    • Checking if a table is empty.
    • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
    • Confirming data size is the same as reported in other sources (Spark, database, and so on).
    • Reporting amount of work performed or rows-per-second processed.

    The obvious generic dplyr idiom would then be dplyr::tally() (our code won’t know to call the new sparklyr::sdf_nrow() function, without writing code to check we are in fact looking at a Sparklyr reference structure):

    tally(dRemote)
    ## # Source:   lazy query [?? x 1]
    ## # Database: spark_connection
    ##      nn
    ## 1     3

    That works for Spark, but not for local tables:

    dLocal %>% tally
    ## Using `n` as weighting variable
    ## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.

    The above issue is filed as dplyr issue 3070. The above code usually either errors-out (if the column is raw) or creates a new total column called nn with the sum of the n column instead of the count.

    data.frame(n = 100) %>% tally
    ## Using `n` as weighting variable
    ##    nn
    ## 1 100

    We could try adding a column and summing that:

    dLocal %>% transmute(constant = 1.0) %>% summarize(n = sum(constant))
    ## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector

    That fails due to dplyr issue 3069: local mutate() fails if there are any raw columns present (even if they are not the columns you are attempting to work with).

    We can try removing the dangerous column prior to other steps:

    dLocal %>% select(-n) %>% tally
    ## data frame with 0 columns and 3 rows

    That does not work on local tables, as tally fails to count 0-column objects (dplyr issue 3071; probably the same issue exists for many dplyr verbs, as we saw a related issue for dplyr::distinct).

    And the method does not work on remote tables either (Spark, or database tables) as many of them do not appear to support 0-column results:

    dRemote %>% select(-n) %>% tally
    ## Error: Query contains no columns

    In fact we start to feel trapped here. For a data-object whose only column is of type raw we can’t remove all the raw columns as we would then form a zero-column result (which does not seem to always be legal), but we can not add columns as that is a current bug for local frames. We could try some other transforms (such as joins, but we don’t have safe columns to join on).

    At best we can try something like this:

    nrow2 <- function(d) {
      n <- nrow(d)
      if(!is.na(n)) {
        return(n)
      }
      d %>%
        ungroup() %>%
        transmute(constant = 1.0) %>%
        summarize(tot = sum(constant)) %>%
        pull()
    }

    dRemote %>% nrow2()
    ## [1] 3
    dLocal %>% nrow2()
    ## [1] 3

    We are still experimenting with work-arounds in the replyr package (but it is necessarily ugly code).

    spark_disconnect(sc)

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Migrating from GitHub to GitLab with RStudio (Tutorial)


    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    GitHub vs. GitLab

    Git is a distributed implementation of version control. Many people have written very eloquently about why it is a good idea to use version control, not only if you collaborate in a team but also if you work on your own; one example is this article from RStudio’s Support pages.

    In short, its main feature is that version control allows you to keep track of the changes you make to your code. It will also keep a history of all the changes you have made in the past and allows you to go back to specific versions if you made a major mistake. And Git makes collaborating on the same code very easy.

    Most R packages are also hosted on GitHub. You can check out their R code in the repositories if you want to get a deeper understanding of the functions, you can install the latest development versions of packages or install packages that are not on CRAN. The issue tracker function of GitHub also makes it easy to report and respond to issues/problems with your code.

    Why would you want to leave GitHub?

    Public repositories are free on GitHub but you need to pay for private repos (if you are a student or work in academia, you get private repos for free). Since I switched from academia to industry lately and no longer fulfil these criteria, all my private repos would have to be switched to public in the future. Here, GitLab is a great alternative!

    GitLab offers very similar functionalities as GitHub. There are many pros and cons for using GitHub versus GitLab but for me, the selling point was that GitLab offers unlimited private projects and collaborators in its free plan.

    Tutorial

    Migrating from GitHub to GitLab with RStudio is very easy! Here, I will show how I migrated my GitHub repositories of R projects, that I work with from within RStudio, to GitLab.

    Beware, that ALL code snippets below show Terminal code (they are NOT from the R console)!

    Migrating existing repositories

    You first need to set up your GitLab account (you can log in with your GitHub account) and connect your old GitHub account. Under Settings & Account, you will find "Social sign-in"; here click on "Connect" next to the GitHub symbol (if you signed in with your GitHub account, it will already be connected).

    Once you have done this, you can import all your GitHub repositories to GitLab. To do this, you first need to create a new project. Click on the drop-down arrow next to the plus sign in the top-right corner and select “New project”. This will open the following window:

    Here, choose “Import project from GitHub” and choose the repositories you want to import.

    If you go into one of your repositories, GitLab will show you a message at the top of the site that tells you that you need to add an SSH key. The SSH key is used for secure communication between the GitLab server and your computer when you want to share information, like push/pull commits.

    If you already work with GitHub on your computer, you will have an SSH key set up and you can copy your public SSH key to GitLab. Follow the instructions here.

    Here is how you do it on a Mac:

    1. Look for your public key and copy it to the clipboard

    cat ~/.ssh/id_rsa.pub
    pbcopy < ~/.ssh/id_rsa.pub

    Then paste it into the respective field here.

    The next step is to change the remote URL for pushing/pulling your project from RStudio. In your Git window (tab next to “Environment” and “History” for me), click on Settings and “Shell”.

    Then write in the shell window that opened:

    git remote set-url origin git@gitlab.com:<username>/<repository>.git

    You can copy the link in the navigation bar of your repo on GitLab.

    Check that you now have the correct new gitlab path by going to “Tools”, “Project Options” and “Git/SVN”.

    Also check your SSH key configuration with:

ssh -T git@gitlab.com

    If you get the following message

The authenticity of host 'gitlab.com (52.167.219.168)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

    type “yes” (and enter passphrase if prompted).

    If everything is okay, you now get a message saying Welcome to GitLab!

    Now, you can commit, push and pull from within RStudio just as you have done before!

    In case of problems with pushing/pulling

In my case, I migrated both my private and my company's GitHub repos to GitLab. While my private repos could be migrated without a hitch, migrating my company's repos was a bit more tricky (because they had additional security settings, I assume).

    Here is how I solved this problem with my company’s repos:

    I have protected my SSH key with a passphrase. When pushing or pulling commits via the shell with git pull and git push origin master, I am prompted to enter my passphrase and everything works fine. Pushing/pulling from within RStudio, however, threw an error:

ssh_askpass: exec(/usr/X11R6/bin/ssh-askpass): No such file or directory
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

    I am using a MacBook Pro with MacOS Sierra version 10.12.6, so this might not be an issue with another operating system.

    The following solution worked for me:

    1. Add your SSH key

    ssh-add ~/.ssh/id_rsa
2. Restart RStudio

    Now I could commit, push and pull from within RStudio just as before!


    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.


    Thou shalt not compare numeric values (except when it works)


    (This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers)

    This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers.

    I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using dplyr for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly.

## if you haven't already done so, load dplyr
suppressPackageStartupMessages(library(dplyr))

My example involved joining together two tibbles containing text values. Nothing too surprising. I wondered, though: do numbers behave the way I expect?

    Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally.

    Thou shalt not compare numbers.

    Thou shalt not compare numbers.
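A classic illustration of the sort of thing I mean (plain base R, no joins involved yet):

0.1 + 0.2 == 0.3
## [1] FALSE
print(0.1 + 0.2, digits = 20)
## [1] 0.30000000000000004441

## all.equal() compares with a small tolerance instead, which is usually what you want
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE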

    If I had a tibble where the column I would use to join had integers

dataA <- tribble(
    ~X, ~Y,
    0L, 100L,
    1L, 101L,
    2L, 102L,
    3L, 103L
)
dataA
## # A tibble: 4 x 2
##       X     Y
##   <int> <int>
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103

    and another tibble with numeric in that column

dataB <- tribble(
    ~X, ~Z,
    0, 1000L,
    1, 1001L,
    2, 1002L,
    3, 1003L
)
dataB
## # A tibble: 4 x 2
##       X     Z
##   <dbl> <int>
## 1     0  1000
## 2     1  1001
## 3     2  1002
## 4     3  1003

    would they still join?

    full_join(dataA, dataB)
    ## Joining, by = "X"
## # A tibble: 4 x 3
##       X     Y     Z
##   <dbl> <int> <int>
## 1     0   100  1000
## 2     1   101  1001
## 3     2   102  1002
## 4     3   103  1003

    Okay, sure. R treats these as close enough to join. I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal

    identical(0L, 0)
    ## [1] FALSE
    identical(2L, 2)
    ## [1] FALSE

    though sometimes it does

    0L == 0
    ## [1] TRUE
    2L == 2
    ## [1] TRUE

(== coerces types before comparing). Well, what if one of these just 'looks like' the other value (i.e. can be coerced to the same)?

dataC <- tribble(
    ~X, ~Z,
    "0", 100L,
    "1", 101L,
    "2", 102L,
    "3", 103L
)
dataC
## # A tibble: 4 x 2
##       X     Z
##   <chr> <int>
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103
    full_join(dataA, dataC) 
    ## Joining, by = "X"
    ## Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'X' x 'X' because of incompatible types (character / integer)

    That’s probably wise. Of course, R is perfectly happy with things like

    "2":"5"
    ## [1] 2 3 4 5

    and == thinks that’s fine

    "0" == 0L
    ## [1] TRUE
    "2" == 2L
    ## [1] TRUE

    but who am I to argue?
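(A quick sketch of what I think == is doing there: with mixed types it coerces both sides to a common type, character in this case, before comparing, so the comparison is really between strings.)

as.character(2L)
## [1] "2"
"2" == 2L   ## both sides end up as the string "2"
## [1] TRUE
"02" == 2L  ## "02" is not the string "2", so looking numerically equal doesn't help
## [1] FALSE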

Anyway, how far apart can those integers and numerics be before they aren't able to be joined? What if we shift the 'numeric in name only' values away from the integers just a teensy bit? .Machine$double.eps is the built-in value for 'the tiniest number you can produce' (more precisely, the smallest positive double x for which 1 + x is distinguishable from 1). On this system it's 2.220446e-16.
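You can check the value on your own machine:

.Machine$double.eps
## [1] 2.220446e-16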

dataBeps <- tribble(
    ~X, ~Z,
    0 + .Machine$double.eps, 1000L,
    1 + .Machine$double.eps, 1001L,
    2 + .Machine$double.eps, 1002L,
    3 + .Machine$double.eps, 1003L
)
dataBeps
## # A tibble: 4 x 2
##              X     Z
##          <dbl> <int>
## 1 2.220446e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003
    full_join(dataA, dataBeps) 
    ## Joining, by = "X"
## # A tibble: 6 x 3
##              X     Y     Z
##          <dbl> <int> <int>
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102  1002
## 4 3.000000e+00   103  1003
## 5 2.220446e-16    NA  1000
## 6 1.000000e+00    NA  1001

Well, that's… weirder. The values offset from 2 and 3 joined fine, but 0 and 1 each ended up with two rows, since R thinks the offset values are different from the integers. What if we offset a little further?

dataB2eps <- tribble(
    ~X, ~Z,
    0 + 2*.Machine$double.eps, 1000L,
    1 + 2*.Machine$double.eps, 1001L,
    2 + 2*.Machine$double.eps, 1002L,
    3 + 2*.Machine$double.eps, 1003L
)
dataB2eps
## # A tibble: 4 x 2
##              X     Z
##          <dbl> <int>
## 1 4.440892e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003
    full_join(dataA, dataB2eps)
    ## Joining, by = "X"
## # A tibble: 8 x 3
##              X     Y     Z
##          <dbl> <int> <int>
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102    NA
## 4 3.000000e+00   103    NA
## 5 4.440892e-16    NA  1000
## 6 1.000000e+00    NA  1001
## 7 2.000000e+00    NA  1002
## 8 3.000000e+00    NA  1003

That's what I'd expect. So, what's going on? Why does R think those numbers are the same? Let's check with a minimal example: for each of the values 0:4, let's compare that integer with the same value offset by .Machine$double.eps

suppressPackageStartupMessages(library(purrr)) ## for the 'thou shalt not for-loop' crowd
map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps)
    ## [1] FALSE FALSE  TRUE  TRUE  TRUE

    And there we have it. Some sort of relative difference tolerance maybe? In any case, the general rule to live by is to never compare floats. Add this to the list of reasons why.
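Thinking about it a bit more, it's probably not a tolerance at all but plain floating-point rounding: the gap between neighbouring double values doubles at every power of two, so adding a single eps to 2 or 3 rounds straight back to the integer you started with, while 0 + eps and 1 + eps really are new values. A quick check:

1 + .Machine$double.eps == 1
## [1] FALSE
2 + .Machine$double.eps == 2      ## the gap near 2 is 2 * eps, so this rounds back to exactly 2
## [1] TRUE
2 + 2 * .Machine$double.eps == 2  ## a full gap away is a genuinely different double
## [1] FALSE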

    For what it’s worth, I’m sure this is hardly a surprising detail to the dplyr team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.


    To leave a comment for the author, please follow the link and comment on their blog: R – Irregularly Scheduled Programming.



