
Performing Principal Components Regression (PCR) in R


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

This article was originally posted on Quantide blog – see here.

Principal components regression (PCR) is a regression technique based on principal component analysis (PCA).

The basic idea behind PCR is to calculate the principal components and then use some of these components as predictors in a linear regression model fitted using the typical least squares procedure.

As you can easily notice, the core idea of PCR is closely related to the one underlying PCA, and the “trick” is very similar. In some cases a small number of principal components is enough to explain the vast majority of the variability in the data. For instance, say you have a dataset of 50 variables that you would like to use to predict a single variable. By using PCR you might find that 4 or 5 principal components are enough to explain 90% of the variance of your data. In that case, you might be better off running PCR with these 5 components instead of running a linear model on all 50 variables. This is a rough example, but I hope it gets the point across.
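To make this concrete, here is a minimal sketch of “manual” PCR using only base R. The data frame df, its response column y, and the choice of 5 components are illustrative assumptions, not part of the original article:

# df is assumed to be a data frame with a numeric response y and many numeric predictors
X <- scale(df[, setdiff(names(df), "y")])   # standardize the predictors
pca <- prcomp(X)                            # principal component analysis
summary(pca)                                # cumulative proportion of variance explained

# keep, say, the first 5 components and fit an ordinary least squares model on them
scores <- as.data.frame(pca$x[, 1:5])
scores$y <- df$y
pcr_fit <- lm(y ~ ., data = scores)
summary(pcr_fit)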

A core assumption of PCR is that the directions in which the predictors show the most variation are exactly the directions associated with the response variable. This assumption is not guaranteed to hold, but even when it is not completely true it can be a good approximation and yield interesting results.

Some of the most notable advantages of performing PCR are the following:

  • Dimensionality reduction
  • Avoidance of multicollinearity between predictors
  • Overfitting mitigation

Let’s briefly walk through each one of them:

Dimensionality reduction

By using PCR you can easily perform dimensionality reduction on a high-dimensional dataset and then fit a linear regression model to a smaller set of variables, while at the same time keeping most of the variability of the original predictors. Since using only some of the principal components reduces the number of variables in the model, it also reduces model complexity, which is always a plus. If you need a lot of principal components to explain most of the variability in your data (say, roughly as many components as there are variables in your dataset), then PCR might not perform well in that scenario; it might even be worse than plain vanilla linear regression.

PCR tends to perform well when the first principal components are enough to explain most of the variation in the predictors.

Avoiding multicollinearity

A significant benefit of PCR is that it sidesteps multicollinearity: even if there is some degree of multicollinearity between the variables in your dataset, performing PCA on the raw data produces linear combinations of the predictors that are uncorrelated, so the components used as regressors do not suffer from this problem.
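You can check this property directly: the principal component scores returned by prcomp() are numerically uncorrelated even when the original predictors are strongly collinear. A small self-contained sketch on toy data, not from the original post:

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)    # almost perfectly collinear with x1
x3 <- rnorm(100)
cor(cbind(x1, x2, x3))              # shows the strong correlation between x1 and x2

pcs <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
round(cor(pcs$x), 10)               # essentially the identity matrix: the scores are uncorrelated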

Overfitting mitigation

If all the assumptions underlying PCR hold, then fitting a least squares model to the principal components will lead to better results than fitting a least squares model to the original data, since most of the variation and information related to the dependent variable is condensed in the first principal components, and by estimating fewer coefficients you reduce the risk of overfitting.

Potential drawbacks and warnings

As always, with potential benefits come potential risks and drawbacks.

For instance, a typical mistake is to consider PCR a feature selection method. PCR is not a feature selection method because each of the calculated principal components is a linear combination of the original variables. Using principal components instead of the actual features can make it harder to explain what is affecting what.

Another major drawback of PCR is that the directions that best represent each predictor are obtained in an unsupervised way. The dependent variable is not used to identify each principal component direction. This essentially means that it is not certain that the directions found will be the optimal directions to use when making predictions on the dependent variable.

Performing PCR on a test dataset

There are a bunch of packages that perform PCR, but in my opinion the pls package offers the easiest option. It is very user friendly, and it can also standardize the data for you. Let's run a test.

Before performing PCR, it is preferable to standardize your data. This step is not strictly necessary but is strongly suggested, since PCA is not scale invariant. You might ask why it is important that each predictor is on the same scale as the others. Scaling prevents the algorithm from being skewed towards predictors that dominate in absolute scale but are perhaps no more relevant than the others. In other words, variables with higher variance would have more influence on the calculation of the principal components and, overall, a larger effect on the final results of the algorithm. Personally, I would standardize the data most of the time.
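A quick way to see why scaling matters is to compare the share of variance captured by the first component with and without standardization when one variable lives on a much larger scale. This is a toy illustration, not part of the original post:

set.seed(1)
small <- rnorm(100)              # variance around 1
large <- rnorm(100, sd = 100)    # variance around 10,000
toy <- cbind(small, large)

summary(prcomp(toy))                  # the first PC is dominated by the large-variance column
summary(prcomp(toy, scale. = TRUE))   # after scaling, both variables contribute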

Another thing to assess before running PCR is missing data: you should either remove all the observations containing missing values, or impute the missing values with some technique, before calling the pcr function.
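For the missing-data step, the bluntest option is to keep only complete rows before calling pcr(); imputation is a more careful alternative. A one-line sketch, assuming your data frame is called my_data:

my_data_complete <- my_data[complete.cases(my_data), ]   # drop rows with any missing value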

For this toy example, I am using the evergreen iris dataset.

require(pls)
set.seed(1000)
pcr_model <- pcr(Sepal.Length~., data = iris, scale = TRUE, validation = "CV")

By setting the parameter scale equal to TRUE, the data is standardized before running the pcr algorithm on it. You can also perform validation by setting the argument validation. In this case I chose 10-fold cross-validation and therefore set the validation argument to “CV”. There are other validation methods available; just type ?pcr in the R console to get more information on the parameters of the pcr function.

In order to print the results, simply use the summary function as below.

summary(pcr_model)

## Data:    X dimension: 150 5 
##  Y dimension: 150 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV          0.8308   0.5141   0.5098   0.3947   0.3309   0.3164
## adjCV       0.8308   0.5136   0.5092   0.3941   0.3303   0.3156
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps
## X               56.20    88.62    99.07    99.73   100.00
## Sepal.Length    62.71    63.58    78.44    84.95    86.73

As you can see, two main results are printed, namely the validation error and the cumulative percentage of variance explained using n components.

The cross validation results are computed for each number of components used so that you can easily check the score with a particular number of components without trying each combination on your own.
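If you prefer to extract the cross-validation errors programmatically instead of reading them off the summary, something along these lines should work with the pls package (treat the exact structure of the returned object as an assumption and check ?RMSEP):

cv <- RMSEP(pcr_model)              # cross-validated RMSEP for 0, 1, ..., 5 components
plot(cv)                            # same information as validationplot(pcr_model)
which.min(cv$val["CV", 1, ]) - 1    # number of components with the lowest CV error (0 = intercept only)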

The pls package also provides a set of methods to plot the results of PCR. For example, you can plot the results of cross-validation using the validationplot function.

By default, the pcr function computes the root mean squared error of prediction and the validationplot function plots this statistic; however, you can choose to plot the usual mean squared error or the R2 instead by setting the val.type argument to “MSEP” or “R2” respectively.

# Plot the root mean squared error
validationplot(pcr_model)


# Plot the cross validation MSE
validationplot(pcr_model, val.type = "MSEP")


# Plot the R2
validationplot(pcr_model, val.type = "R2")


What you would like to see is a low cross-validation error with a number of components that is lower than the number of variables in your dataset. If this is not the case, or if the smallest cross-validation error occurs with a number of components close to the number of variables in the original data, then no real dimensionality reduction occurs. In the example above, it looks like 3 components are enough to explain more than 90% of the variability in the data, although the CV score is a little higher than with 4 or 5 components. Finally, note that 5 components explain all the variability, as expected.

You can plot the predicted vs measured values using the predplot function as below

predplot(pcr_model)


while the regression coefficients can be plotted using the coefplot function

coefplot(pcr_model)


Now you can try to use PCR on a training-test split and evaluate its performance, for example, using only 3 components.

# Train-test split
train <- iris[1:120, ]
y_test <- iris[120:150, 1]
test <- iris[120:150, 2:5]

pcr_model <- pcr(Sepal.Length~., data = train, scale = TRUE, validation = "CV")
pcr_pred <- predict(pcr_model, test, ncomp = 3)
mean((pcr_pred - y_test)^2)

## [1] 0.213731

With the iris dataset there is probably no need to use PCR; in fact, it may even be worse to use it here. However, I hope this toy example was useful to introduce the model.

Thank you for reading this article, please feel free to leave a comment if you have any questions or suggestions and share the post with others if you find it useful.

The post Performing Principal Components Regression (PCR) in R appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.



Does sentiment analysis work? A tidy analysis of Yelp reviews


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

This year Julia Silge and I released the tidytext package for text mining using tidy tools such as dplyr, tidyr, ggplot2 and broom. One of the canonical examples of tidy text mining this package makes possible is sentiment analysis.

Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets about several brands to compare customer satisfaction). One of the simplest and most common sentiment analysis methods is to classify words as “positive” or “negative”, then to average the values of each word to categorize the entire document. (See this vignette and Julia’s post for examples of a tidy application of sentiment analysis). But does this method actually work? Can you predict the positivity or negativity of someone’s writing by counting words?

To answer this, let’s try sentiment analysis on a text dataset where we know the “right answer”- one where each customer also quantified their opinion. In particular, we’ll use the Yelp Dataset: a wonderful collection of millions of restaurant reviews, each accompanied by a 1-5 star rating. We’ll try out a specific sentiment analysis method, and see the extent to which we can predict a customer’s rating based on their written opinion. In the process we’ll get a sense of the strengths and weaknesses of sentiment analysis, and explore another example of tidy text mining with tidytext, dplyr, and ggplot2.

Setup

I’ve downloaded the yelp_dataset_challenge_academic_dataset folder from here.1 First I read and process them into a data frame:

library(stringr)
library(jsonlite)
library(dplyr)

# review_lines is assumed to hold the raw lines of the reviews JSON file, e.g.
# review_lines <- readLines("yelp_academic_dataset_review.json")

# Each line is a JSON object - the fastest way to process is to combine into a
# single JSON string and use fromJSON and flatten
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")
reviews <- fromJSON(reviews_combined) %>%
  flatten() %>%
  tbl_df()

We now have a data frame with one row per review:

reviews
## # A tibble: 200,000 x 10
##                   user_id              review_id stars       date
##                     <chr>                  <chr> <int>      <chr>
## 1  PUFPaY9KxDAcGqfsorJp3Q Ya85v4eqdd6k9Od8HbQjyA     4 2012-08-01
## 2  Iu6AxdBYGR4A0wspR9BYHA KPvLNJ21_4wbYNctrOwWdQ     5 2014-02-13
## 3  auESFwWvW42h6alXgFxAXQ fFSoGV46Yxuwbr3fHNuZig     5 2015-10-31
## 4  uK8tzraOp4M5u3uYrqIBXg Di3exaUCFNw1V4kSNW5pgA     5 2013-11-08
## 5  I_47G-R2_egp7ME5u_ltew 0Lua2-PbqEQMjD9r89-asw     3 2014-03-29
## 6  PP_xoMSYlGr2pb67BbqBdA 7N9j5YbBHBW6qguE5DAeyA     1 2014-10-29
## 7  JPPhyFE-UE453zA6K0TVgw mjCJR33jvUNt41iJCxDU_g     4 2014-11-28
## 8  2d5HeDvZTDUNVog_WuUpSg Ieh3kfZ-5J9pLju4JiQDvQ     5 2014-02-27
## 9  BShxMIUwaJS378xcrz4Nmg PU28OoBSHpZLkYGCmNxlmg     5 2015-06-16
## 10 fhNxoMwwTipzjO8A9LFe8Q XsA6AojkWjOHA4FmuAb8XQ     3 2012-08-19
## # ... with 199,990 more rows, and 6 more variables: text <chr>,
## #   type <chr>, business_id <chr>, votes.funny <int>, votes.useful <int>,
## #   votes.cool <int>

Notice the stars column with the star rating the user gave, as well as the text column (too large to display) with the actual text of the review. For now, we’ll focus on whether we can predict the star rating based on the text.

Tidy sentiment analysis

Right now, there is one row for each review. To analyze in the tidy text framework, we need to use the unnest_tokens function and turn this into one-row-per-term-per-document:

library(tidytext)

review_words <- reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

review_words
## # A tibble: 7,688,667 x 4
##                 review_id            business_id stars        word
##                     <chr>                  <chr> <int>       <chr>
## 1  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4      hoagie
## 2  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4 institution
## 3  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4     walking
## 4  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4   throwback
## 5  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4         ago
## 6  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4   fashioned
## 7  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4        menu
## 8  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4       board
## 9  Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4      booths
## 10 Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw     4   selection
## # ... with 7,688,657 more rows

Notice that there is now one row per term per document: the tidy text form. In this cleaning process we've also removed “stopwords” (such as “I”, “the”, “and”, etc.) and dropped tokens that are pure formatting (e.g. “—-”) rather than words.

Now let's perform sentiment analysis on each review. We'll use the AFINN lexicon, which provides a positivity score for each word, from -5 (most negative) to 5 (most positive). This, along with several other lexicons, is stored in the sentiments table that comes with tidytext. (I've tried some other lexicons on this dataset and the results are pretty similar.)

AFINN <- sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)

AFINN
## # A tibble: 2,476 x 2
##          word afinn_score
##         <chr>       <int>
## 1     abandon          -2
## 2   abandoned          -2
## 3    abandons          -2
## 4    abducted          -2
## 5   abduction          -2
## 6  abductions          -2
## 7       abhor          -3
## 8    abhorred          -3
## 9   abhorrent          -3
## 10     abhors          -3
## # ... with 2,466 more rows

Now as described in Julia’s post, our sentiment analysis is just an inner-join operation followed by a summary:

reviews_sentiment <- review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(afinn_score))

reviews_sentiment
## Source: local data frame [187,688 x 3]
## Groups: review_id [?]
## 
##                 review_id stars sentiment
##                     (chr) (int)     (dbl)
## 1  __-r0eC3hZlaejvuliC8zQ     5 4.0000000
## 2  __1yzxN39QzdeJqicAg99A     3 1.3333333
## 3  __3Vy9VLHV5jKjgFDRWCiQ     2 1.3333333
## 4  __56FUEaW57kZEm56OZk7w     5 0.8333333
## 5  __5webDfFxADKz_3k5YipA     5 2.2222222
## 6  __6QkPtePef4_oW6A_tbOg     4 2.0000000
## 7  __6tOxx2VcvGR02d2ILkuw     5 1.7500000
## 8  __77nP3Nf1wsGz5HPs2hdw     5 1.6000000
## 9  __7MkcofSZYHj9v5KuLVvQ     4 1.8333333
## 10 __7RBFUZgxef8gZ8guaVhg     5 2.4000000
## ..                    ...   ...       ...

We now have an average sentiment alongside the star ratings. If we’re right and sentiment analysis can predict a review’s opinion towards a restaurant, we should expect the sentiment score to correlate with the star rating.

Did it work?

library(ggplot2)
theme_set(theme_bw())

ggplot(reviews_sentiment, aes(stars, sentiment, group = stars)) +
  geom_boxplot() +
  ylab("Average sentiment score")


Well, it's a very good start! Our sentiment scores are certainly correlated with positivity ratings. But we do see that there's a large amount of prediction error: some 5-star reviews have a highly negative sentiment score, and vice versa.

Which words are positive or negative?

Our algorithm works at the word level, so if we want to improve our approach we should start there. Which words are suggestive of positive reviews, and which are negative?

To examine this, let’s create a per-word summary, and see which words tend to appear in positive or negative reviews. This takes more grouping and summarizing:

review_words_counted <- review_words %>%
  count(review_id, business_id, stars, word) %>%
  ungroup()

review_words_counted
## # A tibble: 6,566,367 x 5
##                 review_id            business_id stars      word     n
##                     <chr>                  <chr> <int>     <chr> <int>
## 1  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5    batter     1
## 2  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5     chips     3
## 3  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5  compares     1
## 4  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5 fashioned     1
## 5  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5  filleted     1
## 6  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5      fish     4
## 7  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5     fries     1
## 8  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5    frozen     1
## 9  ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5 greenlake     1
## 10 ___XYEos-RIkPsQwplRYyw YxMnfznT3eYya0YV37tE8w     5      hand     1
## # ... with 6,566,357 more rows

word_summaries <- review_words_counted %>%
  group_by(word) %>%
  summarize(businesses = n_distinct(business_id),
            reviews = n(),
            uses = sum(n),
            average_stars = mean(stars)) %>%
  ungroup()

word_summaries
## # A tibble: 100,177 x 5
##          word businesses reviews  uses average_stars
##         <chr>      <int>   <int> <int>         <dbl>
## 1   a'boiling          1       1     1           4.0
## 2      a'fare          1       1     1           4.0
## 3      a'hole          1       1     1           5.0
## 4      a'ight          6       6     6           2.5
## 5        a'la          2       2     2           4.5
## 6        a'll          1       1     1           1.0
## 7      a'lyce          1       1     2           5.0
## 8      a'more          1       2     2           5.0
## 9    a'orange          1       1     1           5.0
## 10 a'prowling          1       1     1           3.0
## # ... with 100,167 more rows

We can start by looking only at words that appear in at least 200 (out of 200000) reviews. This makes sense both because rare words will have a noisier measurement (a few good or bad reviews could shift the balance), and because they’re less likely to be useful in classifying future reviews or text. I also filter for ones that appear in at least 10 businesses (others are likely to be specific to a particular restaurant).

word_summaries_filtered <- word_summaries %>%
  filter(reviews >= 200, businesses >= 10)

word_summaries_filtered
## # A tibble: 4,328 x 5
##          word businesses reviews  uses average_stars
##         <chr>      <int>   <int> <int>         <dbl>
## 1     ability        374     402   410      3.465174
## 2    absolute        808    1150  1183      3.710435
## 3  absolutely       2728    6158  6538      3.757389
## 4          ac        378     646   919      3.191950
## 5      accent        171     203   214      3.285714
## 6      accept        557     720   772      2.929167
## 7  acceptable        500     587   608      2.505963
## 8    accepted        293     321   332      2.968847
## 9      access        544     840   925      3.505952
## 10 accessible        220     272   282      3.816176
## # ... with 4,318 more rows

What were the most positive and negative words?

word_summaries_filtered %>%
  arrange(desc(average_stars))
## # A tibble: 4,328 x 5
##             word businesses reviews  uses average_stars
##            <chr>      <int>   <int> <int>         <dbl>
## 1  compassionate        193     298   312      4.677852
## 2        listens        177     215   218      4.632558
## 3       exceeded        286     320   321      4.596875
## 4       painless        224     290   294      4.568966
## 5   knowledgable        607     775   786      4.549677
## 6            gem        874    1703  1733      4.537874
## 7     impeccable        278     475   477      4.520000
## 8        happier        545     638   654      4.495298
## 9  knowledgeable       1550    2747  2807      4.493629
## 10   compliments        333     418   428      4.488038
## # ... with 4,318 more rows

Looks plausible to me! What about negative?

word_summaries_filtered %>%
  arrange(average_stars)
## # A tibble: 4,328 x 5
##              word businesses reviews  uses average_stars
##             <chr>      <int>   <int> <int>         <dbl>
## 1            scam        211     263   297      1.368821
## 2     incompetent        275     317   337      1.378549
## 3  unprofessional        748     921   988      1.380022
## 4       disgusted        251     283   292      1.381625
## 5          rudely        349     391   418      1.493606
## 6            lied        281     332   372      1.496988
## 7          refund        717     930  1229      1.545161
## 8    unacceptable        387     441   449      1.569161
## 9           worst       2574    5107  5597      1.569219
## 10        refused        803     983  1096      1.579858
## # ... with 4,318 more rows

Also makes a lot of sense. We can also plot positivity by frequency:

ggplot(word_summaries_filtered, aes(reviews, average_stars)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_x_log10() +
  geom_hline(yintercept = mean(reviews$stars), color = "red", lty = 2) +
  xlab("# of reviews") +
  ylab("Average Stars")


Note that some of the most common words (e.g. “food”) are pretty neutral. There are some common words that are pretty positive (e.g. “amazing”, “awesome”) and others that are pretty negative (“bad”, “told”).

Comparing to sentiment analysis

When we perform sentiment analysis, we’re typically comparing to a pre-existing lexicon, one that may have been developed for a particular purpose. That means that on our new dataset (Yelp reviews), some words may have different implications.

We can combine and compare the two datasets with inner_join.

words_afinn <- word_summaries_filtered %>%
  inner_join(AFINN)

words_afinn
## # A tibble: 505 x 6
##            word businesses reviews  uses average_stars afinn_score
##           <chr>      <int>   <int> <int>         <dbl>       <int>
## 1       ability        374     402   410      3.465174           2
## 2        accept        557     720   772      2.929167           1
## 3      accepted        293     321   332      2.968847           1
## 4      accident        369     447   501      3.536913          -2
## 5  accidentally        279     305   307      3.252459          -2
## 6        active        177     215   238      3.744186           1
## 7      adequate        420     502   527      3.203187           1
## 8         admit        942    1316  1348      3.620821          -1
## 9      admitted        196     248   271      2.157258          -1
## 10     adorable        305     416   431      4.281250           3
## # ... with 495 more rows

ggplot(words_afinn, aes(afinn_score, average_stars, group = afinn_score)) +
  geom_boxplot() +
  xlab("AFINN score of word") +
  ylab("Average stars of reviews with this word")


Just like in our per-review predictions, there’s a very clear trend. AFINN sentiment analysis works, at least a little bit!

But we may want to see some of those details. Which positive/negative words were most successful in predicting a positive/negative review, and which broke the trend?

(Plot: individual words positioned by their AFINN score and the average star rating of the reviews containing them.)

For example, we can see that most profanity has an AFINN score of -4, and that while some words, like “wtf”, successfully predict a negative review, others, like “damn”, are often positive (e.g. “the roast beef was damn good!”). Some of the words that AFINN most underestimated included “die” (“the pork chops are to die for!”), and one of the words it most overestimated was “joke” (“the service is a complete joke!”).

One other way we could look at misclassifications is to add AFINN sentiments to our frequency vs average stars plot:
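The code for that figure is not shown in this excerpt; a sketch along the following lines, reusing words_afinn from above, should reproduce the idea (the exact aesthetics are my guess, not the original):

ggplot(words_afinn, aes(reviews, average_stars, colour = afinn_score)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_x_log10() +
  scale_colour_gradient2(low = "red", mid = "grey70", high = "blue", midpoint = 0) +
  geom_hline(yintercept = mean(reviews$stars), color = "red", lty = 2) +
  xlab("# of reviews") +
  ylab("Average Stars")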

(Plot: word frequency versus average stars, as above, with words coloured by their AFINN sentiment.)

One thing I like about the tidy text mining framework is that it lets us explore the successes and failures of our model at this granular level, using tools (ggplot2, dplyr) that we’re already familiar with.

Next time: Machine learning

In this post I’ve focused on basic exploration of the Yelp review dataset, and an evaluation of one sentiment analysis method for predicting review positivity. (Our conclusion: it’s good, but far from perfect!) But what if we want to create our own prediction method based on these reviews?

In my next post on this topic, I’ll show how to train LASSO regression (with the glmnet package) on this dataset to create a predictive model. This will serve as an introduction to machine learning methods in text classification. It will also let us create our own new “lexicon” of positive and negative words, one that may be more appropriate to our context of restaurant reviews.

  1. I encourage you to download this dataset and follow along- but note that if you do, you are bound by their Terms of Use.

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.


Bulk Downloading Adobe Analytics Data


(This article was first published on R – randyzwitch.com, and kindly contributed to R-bloggers)

This blog post also serves as release notes for RSiteCatalyst v1.4.9, as only one feature was added (batch report request and download). But it’s a feature big enough for its own post!

Recently, I was asked how I would approach replicating the market basket analysis blog post I wrote for 33 Sticks, but using a lot more data. Like, months and months of order-level data. While you might be able to submit multiple months worth of data in a single RSiteCatalyst call, it’s a lot more elegant to request data from the Adobe Analytics API in several calls. With the new batch-submit and batch-receive functionality in RSiteCatalyst, this process can be a LOT faster.

Non-Batched Method

Prior to version 1.4.9 of RSiteCatalyst, API calls could only be made in a serial fashion:
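The code gist embedded at this point in the original post hasn't survived in this excerpt; a minimal sketch of the serial pattern might look like the following, where the credentials, report suite ID, dates, metrics and elements are all placeholders rather than values from the post:

library(RSiteCatalyst)
SCAuth("api.username:company", "api.secret")   # placeholder credentials

months <- c("2016-01", "2016-02", "2016-03")
serial_results <- list()

# Serial approach: each QueueRanked() call waits until the Adobe Analytics API
# has finished calculating that report before the next request is submitted.
for (m in months) {
  serial_results[[m]] <- QueueRanked("report-suite-id",
                                     date.from = paste0(m, "-01"),
                                     date.to   = paste0(m, "-28"),
                                     metrics   = "orders",
                                     elements  = "product")
}
serial_df <- dplyr::bind_rows(serial_results)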

The underlying assumption from a package development standpoint was that the user would be working interactively: submit a report request, wait to get the answer back. There's nothing inherently wrong with this code from an R standpoint that makes it slow; you just had to wait until one report was calculated by the Adobe Analytics API before the next one could be submitted.

Batch Method

Of course, most APIs can process multiple calls simultaneously, and the Adobe Analytics API is no exception. Thanks to user shashispace, it’s now possible to submit all of your report calls at once, then retrieve the results:

This code is nearly identical to the serial snippet above, except for 1) the addition of the enqueueOnly = TRUE keyword argument and 2) lowering the interval.seconds keyword argument to 1 second instead of 60. When you use the enqueueOnly keyword, instead of returning the report results, a Queue* function returns the report.id; by accumulating these report.id values in a list, we can then retrieve the reports and bind them together using dplyr.
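Again, the embedded gist is missing from this excerpt. Under the same placeholder assumptions as the serial sketch above, the batch pattern described here would look roughly like this; note that the retrieval function name GetReport() is my assumption about the 1.4.9 API, so check the package documentation:

# Step 1: submit every report at once. With enqueueOnly = TRUE a Queue* function
# returns the report.id immediately instead of waiting for the report to finish.
report_ids <- list()
for (m in months) {
  report_ids[[m]] <- QueueRanked("report-suite-id",
                                 date.from = paste0(m, "-01"),
                                 date.to   = paste0(m, "-28"),
                                 metrics   = "orders",
                                 elements  = "product",
                                 enqueueOnly = TRUE,
                                 interval.seconds = 1)
}

# Step 2: retrieve the finished reports and bind them together with dplyr.
batch_df <- dplyr::bind_rows(lapply(report_ids, GetReport))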

Performance gain: 4x speed-up

Although the code snippets are nearly identical, it is way faster to submit the reports all at once and then retrieve the results. When you submit the requests all at once, the API processes numerous calls simultaneously, and while you are retrieving the results of one call the others continue to process in the background.

I wouldn’t have thought this would make such a difference, but retrieving one month of daily order-level data went from taking 2420 seconds to 560 seconds! If you were to retrieve the same amount of daily data, but for an entire year, that would mean saving 6 hours in processing time.

Keep The Pull Requests Coming!

The last several RSiteCatalyst releases have been driven by contributions from the community and I couldn’t be happier! Given that I don’t spend much time in my professional life now using Adobe Analytics, having improvements driven by a community of users using the library daily is just so rewarding.

So please, if you have a comment for improvement (and especially if you find a bug), please submit an issue on GitHub. Submitting questions and issues to GitHub is the easiest way for me to provide support, while also giving other users the possibility to answer your question before I might. It will also provide a means for others to determine if they are experiencing a new or previously-known problem.

To leave a comment for the author, please follow the link and comment on their blog: R – randyzwitch.com.


Introducing the Microsoft Data Science Summit, Sep 26-27


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft has a brand-new conference, exclusively for data scientists, big data engineers, and machine learning practitioners. The Microsoft Data Science Summit, to be held in Atlanta GA, September 26-27, will feature talks and lab sessions from Microsoft engineers and thought leaders on using data science techniques and Microsoft technology, applied to real-world problems.

Included in the agenda are several topics of direct interest to R users, including:

Other topics of interest include building with bot frameworks, deep learning, Internet of Things applications, and in-depth Data Science topics.

To register for the conference, follow the link below. Discounted day passes to Microsoft Ignite on September 28-29 are also available to Microsoft Data Summit registrants.

Microsoft Events: Microsoft Data Science Summit, September 26-27

 

 

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


New: Traditional Chinese Translation of Intro to R


(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

New Free Course: Introduction to R in Traditional Chinese

The DataCamp team is thrilled to announce that our Introduction to R course has been generously translated by our friend and DataCamp user Tony Yao-Jen Kuo to Traditional Chinese! Tony holds an M.B.A. from the National Taiwan University where he lectures on Data Science and R. The course is part of our open course offering making it free for everyone! By completing in-browser coding challenges, you will experiment with the different aspects of the R language in real time, and you will receive instant and personalized feedback that guides you to the solution. We greatly appreciate Tony’s efforts translating the course and making educational material for R more accessible to people around the world.

Introduction to R in Chinese

What you’ll learn – in Chinese

This free introduction to R tutorial will help you master the basics of R. In six sections, you will cover its basic syntax, preparing you to undertake your own first data analysis using R. Starting from variables and basic operations, you will learn how to handle data structures such as vectors, matrices, lists and data frames. No prior knowledge in programming or data science is required. In general, the focus is on actively understanding how to code your way through interesting data science tasks.

Intro to R in Chinese Start Course

Create your own course

Want to create your own translation of Introduction to R? With DataCamp Teach, you can easily create your own interactive courses for free. Use the same system DataCamp course creators use to develop their courses, and share your R or Python  knowledge with the rest of the world. You just write your interactive exercises in simple markdown files, and DataCamp uploads the content to your course for you. This makes creating a DataCamp course hassle-free.

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.


A budget of classifier evaluation measures


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”

My concrete advice is:

  • Read Nina Zumel’s excellent series on scoring classifiers.
  • Keep notes.
  • Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
  • When working on practical problems work with your business partners to find out which of precision/recall, or sensitivity/specificity most match their business needs. If you have time show them and explain the ROC plot and invite them to price and pick points along the ROC curve that most fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.

That being said, it always seems like there is a bit of gamesmanship, in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Sometimes the choice of measure is signaling pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.

Mathmanship

Stanley Wyatt illustration from “Mathmanship” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless evening Gown, Robert A. Baker, Prentice-Hall, 1963

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.

Definitions

Let's define our terms. We are going to work with “binary classification” problems. These are problems where we have example instances (also called rows) that are either “in the class” (we will call these instances “true”) or not (and we will call these instances “false”). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of “positive”/“negative” (indicating the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.

Decision or Hard Classifiers

For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.

For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts:

  • The True Positive count, this is the number of items that are in the true class that the classifier declares to be positive.
  • The True Negative count, this is the number of items that are in the false class that the classifier declares to be negative.
  • The False Positive count, this is the number of items that are not in the true class that the classifier declares to be positive.
  • The False Negative count, this is the number of items in the true class that the classifier declares to be negative.

Notice true and false are being used to indicate if the classifier is correct (and not the actual category of each item) in these terms. This is traditional nomenclature. The first two quantities are where the classifier is correct (positive corresponding to true and negative corresponding to false) and the second two quantities count instances where the classifier is incorrect.

It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:

library('ggplot2')
library('caret')
## Loading required package: lattice
library('rSymPy')
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')

(Note all code shared here.)

Then the caret R package defines the confusion matrix as follows (see help("confusionMatrix")):

            Reference
Predicted    Event   No Event
  Event        A        B
  No Event     C        D

Reference is “ground truth” or the actual outcome. We will call examples that have true ground truth “true examples” (again, please don't confuse this with “TrueNegatives”, which are “false examples” that are correctly scored as being false). We would prefer to have the classifier indicate columns instead of rows, but we will use the caret notation for consistency.

We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix") then gives us definitions of a number of common classifier scores:

# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2

We can (from our notes) also define some more common metrics:

TPR = A/(A+C)     # True Positive Rate
FPR = B/(B+D)     # False Positive Rate
FNR = C/(A+C)     # False Negative Rate
TNR = D/(B+D)     # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)

By writing everything down it becomes obvious that Sensitivity == TPR == Recall. That won't stop somebody from complaining if you say “recall” when they prefer “sensitivity”, but that is how things are.

By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.

# Confirm TPR == 1 - FNR
sympy(paste("simplify(", TPR-(1-FNR), ")"))
## [1] "0"
# Confirm Recall == Sensitivity
sympy(paste("simplify(", Recall-Sensitivity, ")"))
## [1] "0"
# Confirm PPV == Precision
sympy(paste("simplify(", PPV-Precision, ")"))
## [1] "0"

We can also confirm non-identity by simplifying and checking an instance:

# Confirm Precision != Specificity
expr <- sympy(paste("simplify(", Precision-Specificity, ")"))
print(expr)
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
sub <- function(expr,
                TruePositives, FalsePositives, FalseNegatives, TrueNegatives) {
  eval(expr)
}

sub(parse(text=expr),
    TruePositives=0, FalsePositives=1, FalseNegatives=0, TrueNegatives=1)
## [1] -0.5

More difficult checks

Balanced Accuracy

We can denote the probability of a true (in-class) instance scoring higher than a false (not in-class) instance, with 1/2 point for ties, as Prob[score(true)>score(false)]. We can confirm that Prob[score(true)>score(false)] (with half point on ties) == BalancedAccuracy for hard or decision classifiers by assigning points to each (true instance, false instance) pair as follows:

A D : True Positive and True Negative: correct order, 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: wrong order, 0 points

Then ScoreTrueGTFalse == Prob[score(true)>score(false)] is:

ScoreTrueGTFalse = (1*A*D  + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))

Which we can confirm is equal to balanced accuracy.

sympy(paste("simplify(",ScoreTrueGTFalse-BalancedAccuracy,")"))
## [1] "0"

AUC

We can also confirm Prob[score(true)>score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn curve) of the above confusion matrix by referring to the following diagram.

(Diagram: computing the area under the ROC curve of a single hard classifier from its TPR and FPR.)

Then we can check for general equality:

AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
sympy(paste("simplify(", ScoreTrueGTFalse-AUC, ")"))
## [1] "0"

This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).

F1

We can show F1 is different than Balanced Accuracy by plotting results they differ on:

# Wikipedia https://en.wikipedia.org/wiki/F1_score
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(", F1, ")"))
print(F1)
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
print(BalancedAccuracy)
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
for(TotTrue in 1:5) {
  for(TotFalse in 1:5) {
    for(TruePositives in 0:TotTrue) {
      for(TrueNegatives in 0:TotFalse) {
        FalsePositives = TotFalse-TrueNegatives
        FalseNegatives = TotTrue-TruePositives
        F1a <- sub(F1formula,
                   TruePositives=TruePositives, FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives, TrueNegatives=TrueNegatives)
        BAa <- sub(BAformula,
                   TruePositives=TruePositives, FalsePositives=FalsePositives,
                   FalseNegatives=FalseNegatives, TrueNegatives=TrueNegatives)
        if((F1a<=0) && (BAa>0.5)) {
          stop()
        }
        fi = data.frame(
          TotTrue=TotTrue,
          TotFalse=TotFalse,
          TruePositives=TruePositives, FalsePositives=FalsePositives,
          FalseNegatives=FalseNegatives, TrueNegatives=TrueNegatives,
          F1=F1a, BalancedAccuracy=BAa,
          stringsAsFactors = FALSE)
        frm = rbind(frm, fi) # bad n^2 accumulation
      }
    }
  }
}

ggplot(data=frm, aes(x=F1, y=BalancedAccuracy)) +
  geom_point() +
  ggtitle("F1 versus balancedAccuarcy/AUC")


F1 versus BalancedAccuracy/AUC

Baroque measures

In various sciences, over 20 measures of “scoring correspondence” have been introduced over the years, driven by games of publication priority, symmetry, and incorporating significance (“chance adjustments”) directly into the measure.

Each measure presumably exists because it avoids flaws of all of the others. However the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:

If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.

(Augustus De Morgan, “A Budget of Paradoxes” 1872)

There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).

Two excellent guides to this rat hole include:

  • Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.”" Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.

  • Warrens, M. (2008). “On similarity coefficients for 2× 2 tables and correction for chance.” Psychometrika, 73(3), 487–502.

The point is: you not only can get a publication trying to sort this mess, you can actually do truly interesting work trying to relate these measures.

Further directions

One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.

Conclusion

It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


RcppCCTZ 0.0.5


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Version 0.0.5 of RcppCCTZ arrived on CRAN a couple of days ago. It reflects an upstream fix made a few weeks ago. CRAN tests revealed that g++-6 was tripping over one missing #define; this was added upstream and I subsequently synchronized with upstream. At the same time the set of examples was extended (see below).

Somehow useR! 2016 got in the way and while working on the then-incomplete examples during the traveling I forgot to release this until CRAN reminded me that their tests still failed. I promptly prepared the 0.0.5 release but somehow failed to update NEWS files etc. They are correct in the repo but not in the shipped package. Oh well.

CCTZ is a C++ library for translating between absolute and civil times using the rules of a time zone. In fact, it is two libraries: one for dealing with civil time (human-readable dates and times) and one for converting between absolute and civil times via time zones. It requires only a proper C++11 compiler and the standard IANA time zone database, which standard Unix, Linux, OS X, … computers tend to have in /usr/share/zoneinfo. RcppCCTZ connects this library to R by relying on Rcpp.

Two good examples are now included, and shown here. The first one tabulates the time difference between New York and London (at a weekly level for compactness):

R> example(tzDiff)

tzDiffR> # simple call: difference now
tzDiffR> tzDiff("America/New_York", "Europe/London", Sys.time())
[1] 5

tzDiffR> # tabulate difference for every week of the year
tzDiffR> table(sapply(0:52, function(d) tzDiff("America/New_York", "Europe/London",
tzDiff+    as.POSIXct(as.Date("2016-01-01") + d*7))))

 4  5 
 3 50 
R>

Because the two continents happen to spring forward and fall backwards between regular and daylight savings times, there are, respectively, two and one week periods where the difference is one hour less than usual.

A second example shifts the time to a different time zone:

R> example(toTz)

toTzR> toTz(Sys.time(), "America/New_York", "Europe/London")
[1] "2016-07-14 10:28:39.91740 CDT"
R>

Note that because we return a POSIXct object, it is printed by R with the default (local) TZ attribute (for "America/Chicago" in my case). A more direct example asks what time it is in my time zone when it is midnight in Tokyo:

R> toTz(ISOdatetime(2016,7,15,0,0,0), "Japan", "America/Chicago")
[1] "2016-07-14 15:00:00 CDT"
R>

More changes will come in 0.0.6 as soon as I find time to translate the nice time_tool (command-line) example into an R function.

Changes in this version are summarized here:

Changes in version 0.0.5 (2016-07-09)

  • New utility example functions toTz() and tzDiff

  • Synchronized with small upstream change for additional #ifdef for compiler differentiation

We also have a diff to the previous version thanks to CRANberries. More details are at the RcppCCTZ page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Fundamental and Technical Analysis of Shares Exercises


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises we shall explore the possibilities for fundamental and technical analysis of stocks offered by the quantmod package. If you don't have the package installed already, install it using the following code:

install.packages("quantmod")

and load it into the session using the following code:

library("quantmod")

before proceeding.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load FB (Facebook) market data from Yahoo and assign it to an xts object fb.p.

Exercise 2

Display monthly closing prices of Facebook in 2015.

Exercise 3

Plot weekly returns of FB in 2016.

Exercise 4

Plot a candlestick chart of FB in 2016.

Exercise 5

Plot a line chart of FB in 2016, and add Bollinger Bands and a Relative Strength Index (RSI) to the chart.

Exercise 6

Get yesterday’s EUR/USD rate.

Exercise 7

Get financial data for FB and display it.

Exercise 8

Calculate the current ratio for FB for the years 2013, 2014 and 2015. (Tip: You calculate the current ratio by dividing current assets by current liabilities, both taken from the balance sheet.)

Exercise 9

Based on the last closing price and the income statement for the 12 months ending on December 31st, 2015, calculate the PE ratio for FB. (Tip: PE stands for Price/Earnings ratio. You calculate it as the stock price divided by the diluted normalized EPS read from the income statement.)

Exercise 10

Write a function getROA(symbol, year) which will calculate the return on assets for a given stock symbol and year. What is the ROA for FB in 2014? (Tip: ROA stands for Return on Assets. You calculate it as net income divided by total assets.)

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.



forcats 0.1.0 🐈🐈🐈🐈


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m excited to announce forcats, a new package for categorical variables, or factors. Factors have a bad rap in R because they often turn up when you don’t want them. That’s because historically, factors were more convenient than character vectors, as discussed in stringsAsFactors: An unauthorized biography by Roger Peng, and stringsAsFactors = <sigh> by Thomas Lumley.

If you use packages from the tidyverse (like tibble and readr) you don’t need to worry about getting factors when you don’t want them. But factors are a useful data structure in their own right, particularly for modelling and visualisation, because they allow you to control the order of the levels. Working with factors in base R can be a little frustrating because of a handful of missing tools. The goal of forcats is to fill in those missing pieces so you can access the power of factors with a minimum of pain.

Install forcats with:

install.packages("forcats")

forcats provides two main types of tools to change either the values or the order of the levels. I'll call out some of the most important functions below, using the included gss_cat dataset, which contains a selection of categorical variables from the General Social Survey.

library(dplyr)
library(ggplot2)
library(forcats)

gss_cat
#> # A tibble: 21,483 × 9
#>    year       marital   age   race        rincome            partyid
#>   <int>        <fctr> <int> <fctr>         <fctr>             <fctr>
#> 1  2000 Never married    26  White  $8000 to 9999       Ind,near rep
#> 2  2000      Divorced    48  White  $8000 to 9999 Not str republican
#> 3  2000       Widowed    67  White Not applicable        Independent
#> 4  2000 Never married    39  White Not applicable       Ind,near rep
#> 5  2000      Divorced    25  White Not applicable   Not str democrat
#> 6  2000       Married    25  White $20000 - 24999    Strong democrat
#> # ... with 2.148e+04 more rows, and 3 more variables: relig <fctr>,
#> #   denom <fctr>, tvhours <int>

Change level values

You can recode specified factor levels with fct_recode():

gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>              partyid     n
#>               <fctr> <int>
#> 1          No answer   154
#> 2         Don't know     1
#> 3        Other party   393
#> 4  Strong republican  2314
#> 5 Not str republican  3032
#> 6       Ind,near rep  1791
#> # ... with 4 more rows

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1             No answer   154
#> 2            Don't know     1
#> 3           Other party   393
#> 4    Republican, strong  2314
#> 5      Republican, weak  3032
#> 6 Independent, near rep  1791
#> # ... with 4 more rows

Note that unmentioned levels are left as is, and the order of the levels is preserved.

fct_lump() allows you to lump the rarest (or most common) levels into a new “other” level. The default behaviour is to collapse the smallest levels into other, while ensuring that it is still the smallest level. For the religion variable, that tells us that Protestants outnumber all other religions combined, which is interesting, but we probably want more levels.

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other 10637
#> 2 Protestant 10846

Alternatively, you can supply the number of levels to keep, n, or the minimum proportion for inclusion, prop. If you use negative values, fct_lump() will change direction, combining the most common values while preserving the rarest.

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 5)) %>%
  count(relig)
#> # A tibble: 6 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other   913
#> 2  Christian   689
#> 3       None  3523
#> 4     Jewish   388
#> 5   Catholic  5124
#> 6 Protestant 10846

gss_cat %>%
  mutate(relig = fct_lump(relig, prop = -0.10)) %>%
  count(relig)
#> # A tibble: 12 × 2
#>                     relig     n
#>                    <fctr> <int>
#> 1               No answer    93
#> 2              Don't know    15
#> 3 Inter-nondenominational   109
#> 4         Native american    23
#> 5               Christian   689
#> 6      Orthodox-christian    95
#> # ... with 6 more rows

Change level order

There are four simple helpers for common operations (a small sketch on a toy factor follows the list):

  • fct_relevel() is similar to stats::relevel() but allows you to move any number of levels to the front.
  • fct_inorder() orders according to the first appearance of each level.
  • fct_infreq() orders from most common to rarest.
  • fct_rev() reverses the order of levels.
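Here is a minimal sketch (an addition, not from the original post) of these four helpers applied to a toy factor:

f <- factor(c("b", "b", "a", "c", "c", "c"))

fct_relevel(f, "c")   # move "c" to the front of the levels
fct_inorder(f)        # levels in order of first appearance: b, a, c
fct_infreq(f)         # levels from most common to rarest: c, b, a
fct_rev(f)            # reverse the current level order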

fct_reorder() and fct_reorder2() are useful for visualisations. fct_reorder() reorders the factor levels by another variable. This is useful when you map a categorical variable to position, as shown in the following example which shows the average number of hours spent watching television across religions.

relig <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig, aes(tvhours, relig)) +
  geom_point()

ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()


fct_reorder2() extends the same idea to plots where a factor is mapped to another aesthetic, like colour. The defaults are designed to make legends easier to read for line plots, as shown in the following example looking at marital status by age.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = marital))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = fct_reorder2(marital, age, prop))) +
  labs(colour = "marital")

Learning more

You can learn more about forcats in R for data science, and on the forcats website.

Please let me know if you have more factor problems that forcats doesn’t help with!


Mapping Traffic Fatalities


(This article was first published on lucaspuente.github.io/, and kindly contributed to R-bloggers)

On Monday, August 29, DJ Patil, the Chief Data Scientist in the White House Office of Science and Technology Policy, and Mark Rosekind, the Administrator of the National Highway Traffic Safety Administration (NHTSA), announced the release of a data set documenting all traffic fatalities occurring in the United States in 2015. As part of their release, they issued a “call to action” for data scientists and analysts to “jump in and analyze it.” This post does exactly that by plotting these fatalities and providing the code for others to reproduce and extend the analysis.

Step 1: Download and Clean the Data

The NHTSA made downloading this data set very easy. Simply visit ftp://ftp.nhtsa.dot.gov/fars/2015/National/ and download the FARS2015NationalDBF.zip file, unzip it, and load into R.

library(foreign)
accidents <- read.dbf("FARS2015NationalDBF/accident.dbf")

Since the goal here is to map the traffic fatalities, I also recommend subsetting the data to only include rows that have valid coordinates:

accidents <- subset(accidents,
                    LONGITUD != 999.99990 &
                    LONGITUD != 888.88880 &
                    LONGITUD != 777.77770)

Also, the map we’ll be producing will only include the lower 48 states, so we want to further subset the data to exclude Alaska and Hawaii:

cont_us_accidents <- subset(accidents, STATE != 2 & STATE != 15)

We also need to load in data on state and county borders to make our map more interpretable – without this, there would be no borders on display. Fortunately, the map_data function that’s part of the ggplot2 package makes this step very easy:

library(ggplot2)
county_map_data <- map_data("county")
state_map <- map_data("state")

Step 2: Plot the Data

Plotting the data using ggplot is also not particularly complicated. The most important thing is to use layers. We’ll first add a polygon layer to a blank ggplot object to map the county borders in light grey and then subsequently add polygons to map the state borders. Then, we’ll add points to show exactly where in the (lower 48) United States traffic fatalities occurred in 2015, plotting these in red, but with a high level of transparency (alpha=0.05) to help prevent points from obscuring one another.

map <- ggplot() +
  # Add county borders:
  geom_polygon(data = county_map_data,
               aes(x = long, y = lat, group = group),
               colour = alpha("grey", 1/4), size = 0.2, fill = NA) +
  # Add state borders:
  geom_polygon(data = state_map,
               aes(x = long, y = lat, group = group),
               colour = "grey", fill = NA) +
  # Add points (one per fatality):
  geom_point(data = cont_us_accidents,
             aes(x = LONGITUD, y = LATITUDE),
             alpha = 0.05, size = 0.5, col = "red") +
  # Adjust the map projection:
  coord_map("albers", lat0 = 39, lat1 = 45) +
  # Add a title:
  ggtitle("Traffic Fatalities in 2015") +
  # Adjust the theme:
  theme_classic() +
  theme(panel.border = element_blank(),
        axis.text = element_blank(),
        line = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(size = 40, face = "bold", family = "Avenir Next"))
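As an optional extra (not part of the original post), the finished map can also be written straight to a file with ggsave; the file name and dimensions below are just examples:

ggsave("traffic_fatalities_2015.png", plot = map, width = 16, height = 10, dpi = 300)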

Step 3: View the Finished Product

With this relatively simple code, we produce a map that clearly displays the location of 2015’s traffic fatalities:

Hopefully with this post you’ll be well on the way to making maps of your own and can start exploring this data set and others like it. If you have any questions, please reach out on twitter. I’m available @lucaspuente.


Variables can synergize, even in a linear model


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Introduction

Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware that you are always competing against such); hence the interest in mere heuristics.

The pruning heuristics

Let L(y;m;v1,...,vk) denote the estimated loss (or badness of performance, so smaller is better) of a model for y fit using modeling method m and the variables v1,...,vk. Let d(a;L(y;m;),L(y;m;a)) denote the portion of L(y;m;) - L(y;m;a) credited to the variable a. This could be the change in loss, something like effectsize(a), or -log(significance(a)); in all cases larger is considered better.

For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.

  • L(y;m;) is monotone decreasing in the variable set: we expect L(y;m;v1,...,vk,a) to be no larger than L(y;m;v1,...,vk). Note this may be achievable “in sample” (or on training data), but is often false if L(y;m;) accounts for model complexity or is estimated on out of sample data (itself a good practice).
  • If L(y;m;v1,...,vk,a) is significantly lower than L(y;m;v1,...,vk) then we will be lucky enough to have d(a;L(y;m;),L(y;m;a)) not too small.
  • If d(a;L(y;m;),L(y;m;a)) is not too small then we will be lucky enough to have d(a;L(y;lm;),L(y;lm;a)) be non-negligible (where the modeling method lm is one of linear regression or logistic regression).

Intuitively we are hoping variable utility has a roughly diminishing-returns structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this cannot be true in general (interactions in decision trees being a well known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models).

However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables { a | d(a;L(y;lm;),L(y;lm;a)) is non-negligible }. This is a (user controllable) heuristic built into our vtreat R package and proves to be quite useful in practice.

I’ll repeat: we feel that in real world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field: methods that happen to work will dominate (even if they do not apply in all cases).
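To make the heuristic concrete, here is a small toy sketch (an addition of ours, not the vtreat implementation) of screening variables by the significance of their single-variable linear fits against y; the data, variable names and 0.05 threshold are purely illustrative:

# Toy single-variable screening: keep variables whose one-variable linear fit
# against y looks non-negligible (threshold chosen arbitrarily for the example).
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + 0.5 * dat$x2 + rnorm(n)   # x3 is pure noise by construction

screen_p <- sapply(c("x1", "x2", "x3"), function(v) {
  fit <- lm(reformulate(v, response = "y"), data = dat)
  summary(fit)$coefficients[v, "Pr(>|t|)"]      # per-variable significance
})
keep <- names(screen_p)[screen_p < 0.05]
keep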

Counter-examples

For every heuristic you should crisply know if it is true (and is in fact a theorem) or it is false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.

We are going to show that per-variable coefficient significances and effect sizes are not monotone in that adding more variables can in fact improve them.

First example

First (using R) we build a data frame where y = a xor b. This is a classic example of y being a function of two variables but not a linear function of them (at least over the real numbers; it is a linear relation over the field GF(2)).

d <- data.frame(a=c(0,0,1,1),b=c(0,1,0,1))
d$y <- as.numeric(d$a == d$b)

We look at the (real) linear relations between y and a, b.

summary(lm(y~a+b,data=d))
## 
## Call:
## lm(formula = y ~ a + b, data = d)
## 
## Residuals:
##    1    2    3    4 
##  0.5 -0.5 -0.5  0.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.500      0.866   0.577    0.667
## a              0.000      1.000   0.000    1.000
## b              0.000      1.000   0.000    1.000
## 
## Residual standard error: 1 on 1 degrees of freedom
## Multiple R-squared:  3.698e-32,  Adjusted R-squared:     -2 
## F-statistic: 1.849e-32 on 2 and 1 DF,  p-value: 1
anova(lm(y~a+b,data=d))
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## a          1      0       0       0      1
## b          1      0       0       0      1
## Residuals  1      1       1

As expected, linear methods fail to find any evidence of a relation between y and a, b. This clearly violates our hoped-for heuristics.

For details on reading these summaries we strongly recommend Practical Regression and Anova using R, Julian J. Faraway, 2002.

In this example the linear model fails to recognize a and b as useful variables (even though y is a function of a and b). From the linear model’s point of view the variables are not improving each other (so that at least looks monotone), but that is largely because the linear model cannot see the relation unless we add an interaction of a and b (denoted a:b).
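As a quick aside (not in the original post), adding that interaction lets the linear model represent the xor relation exactly, since y = 1 - a - b + 2ab on these four rows:

# With the interaction term the fit reproduces y exactly. Note that with four
# rows and four coefficients the model is saturated, so significance tests
# are not meaningful here.
fit_int <- lm(y ~ a * b, data = d)
coef(fit_int)                             # (Intercept) 1, a -1, b -1, a:b 2
all.equal(unname(fitted(fit_int)), d$y)   # TRUE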

Second example

Let us develop this example a bit more to get a more interesting counterexample.

Introduce new variables u = a and b, v = a or b. By the rules of logic we have y == 1+u-v, so there is a linear relation.

d$u <- as.numeric(d$a & d$b)
d$v <- as.numeric(d$a | d$b)
print(d)
##   a b y u v
## 1 0 0 1 0 0
## 2 0 1 0 0 1
## 3 1 0 0 0 1
## 4 1 1 1 1 1
print(all.equal(d$y,1+d$u-d$v))
## [1] TRUE

We can now see the counter-example effect: together the variables work better than they did alone.

summary(lm(y~u,data=d))
## 
## Call:
## lm(formula = y ~ u, data = d)
## 
## Residuals:
##          1          2          3          4 
##  6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3333     0.3333       1    0.423
## u             0.6667     0.6667       1    0.423
## 
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  5.551e-16 
## F-statistic:     1 on 1 and 2 DF,  p-value: 0.4226
anova(lm(y~u,data=d))
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value Pr(>F)
## u          1 0.33333 0.33333       1 0.4226
## Residuals  2 0.66667 0.33333
summary(lm(y~v,data=d))
## 
## Call:
## lm(formula = y ~ v, data = d)
## 
## Residuals:
##          1          2          3          4 
##  5.551e-17 -3.333e-01 -3.333e-01  6.667e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.5774   1.732    0.225
## v            -0.6667     0.6667  -1.000    0.423
## 
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:      0 
## F-statistic:     1 on 1 and 2 DF,  p-value: 0.4226
anova(lm(y~v,data=d))
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value Pr(>F)
## v          1 0.33333 0.33333       1 0.4226
## Residuals  2 0.66667 0.33333
summary(lm(y~u+v,data=d))
## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
## summary may be unreliable
## 
## Call:
## lm(formula = y ~ u + v, data = d)
## 
## Residuals:
##          1          2          3          4 
## -1.849e-32  7.850e-17 -7.850e-17  1.849e-32 
## 
## Coefficients:
##              Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.00e+00   1.11e-16  9.007e+15   <2e-16 ***
## u            1.00e+00   1.36e-16  7.354e+15   <2e-16 ***
## v           -1.00e+00   1.36e-16 -7.354e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.056e+31 on 2 and 1 DF,  p-value: < 2.2e-16
anova(lm(y~u+v,data=d))
## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
## essentially perfect fit are unreliable
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq    F value    Pr(>F)    
## u          1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
## v          1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
## Residuals  1 0.00000 0.00000                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this example we see synergy instead of diminishing returns. Each variable becomes better in the presence of the other. This is on its own good, but it indicates that variable pruning is harder than one might expect, even for a linear model.

Third example

We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We can even draw rows from this frame to show the effect on a “more row independent looking” data frame.

d0 <- d
d0$y <- 0
d1 <- d
d1$y <- 1
dG <- rbind(d,d,d,d,d0,d1)
set.seed(23235)
dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]

summary(lm(y~u,data=dR))
## 
## Call:
## lm(formula = y ~ u, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8148 -0.3425 -0.3425  0.3033  0.6575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.34247    0.05355   6.396 5.47e-09 ***
## u            0.47235    0.10305   4.584 1.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4575 on 98 degrees of freedom
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1681 
## F-statistic: 21.01 on 1 and 98 DF,  p-value: 1.349e-05
anova(lm(y~u,data=dR))
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## u          1  4.3976  4.3976   21.01 1.349e-05 ***
## Residuals 98 20.5124  0.2093                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(y~v,data=dR))
## 
## Call:
## lm(formula = y ~ v, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7619 -0.3924 -0.3924  0.6076  0.6076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7619     0.1049   7.263 9.12e-11 ***
## v            -0.3695     0.1180  -3.131   0.0023 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4807 on 98 degrees of freedom
## Multiple R-squared:  0.09093,    Adjusted R-squared:  0.08165 
## F-statistic: 9.802 on 1 and 98 DF,  p-value: 0.002297
anova(lm(y~v,data=dR))
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## v          1  2.265 2.26503  9.8023 0.002297 **
## Residuals 98 22.645 0.23107                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(y~u+v,data=dR))
## 
## Call:
## lm(formula = y ~ u + v, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8148 -0.1731 -0.1731  0.1984  0.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.76190    0.08674   8.784 5.65e-14 ***
## u            0.64174    0.09429   6.806 8.34e-10 ***
## v           -0.58883    0.10277  -5.729 1.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3975 on 97 degrees of freedom
## Multiple R-squared:  0.3847, Adjusted R-squared:  0.3721 
## F-statistic: 30.33 on 2 and 97 DF,  p-value: 5.875e-11
anova(lm(y~u+v,data=dR))
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## u          1  4.3976  4.3976  27.833 8.047e-07 ***
## v          1  5.1865  5.1865  32.826 1.133e-07 ***
## Residuals 97 15.3259  0.1580                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

Consider the above counter-examples as exceptio probat regulam in casibus non exceptis (“the exception confirms the rule in cases not excepted”), or as roughly outlining the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.

In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).


Updated R Markdown thesis template


(This article was first published on R – Chester's R blog, and kindly contributed to R-bloggers)

In October of 2015, I released an R Markdown senior thesis template R package and discussed it in the blogpost here. It was well-received by students and faculty who worked with it, and this past summer I worked on updating it to make it even nicer for students. The big addition is the ability for students to export their senior thesis to a webpage (example here) and also to label and cross-reference figures and tables more easily. These additions and future revisions will be in the new thesisdown package, in the spirit of the bookdown package developed and released by RStudio in summer 2016.

I encourage you to look over my blog post from last year to get an idea of why R Markdown is such a friendly environment to work in. Markdown specifically keeps the typesetting of the finished document transparent inside the source document itself. Down the road, it is my hope that students will be able to write a single set of R Markdown source files that will then export into many formats. These currently include the LaTeX format to produce a PDF following Reed's senior thesis guidelines and the HTML version in gitbook style. Eventually, this will include a Word document following Reed's guidelines and also an ePub (electronic book) version. These last two are available at the moment but are not fully functional.
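For illustration only (an addition, not part of the original post): rendering to these formats typically goes through bookdown, and the thesisdown output-format names below are assumptions on my part, so check the template's index.Rmd for the exact names it ships with.

# A hedged sketch of rendering the thesis; the format names are assumed.
library(bookdown)
render_book("index.Rmd", output_format = "thesisdown::thesis_pdf")      # PDF via LaTeX
render_book("index.Rmd", output_format = "thesisdown::thesis_gitbook")  # HTML, gitbook style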

By allowing senior theses in a variety of formats, seniors will be more easily able to display their work to potential employers, other students, faculty members, and potential graduate schools. This will allow them to get the word out about their studies and research while still encouraging reproducibility in their computations and in their analyses.

Install the template generating package

To check out the package yourself, make sure you have RStudio and LaTeX installed and then direct your browser to the GitHub page for the template: http://github.com/ismayc/thesisdown. The README.md file near the bottom of the page below the files gives directions on installing the template package and getting the template running. As you see there, you'll want to install the thesisdown package via the following commands in the RStudio console:

install.packages("devtools")devtools::install_github("ismayc/thesisdown")

If you have any questions, feedback, or would like to report any issues, please email me.

(The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)


An introduction to R, RStudio, and R Markdown with GIFs!


(This article was first published on R – Chester's R blog, and kindly contributed to R-bloggers)

The development of the bookdown package from RStudio in the summer of 2016 has greatly facilitated the ability of educators to create open-source materials for their students to use. It extends beyond academic settings, though, and encourages the sharing of resources and knowledge in a free and reproducible way.

As more and more students and faculty begin to use R in their courses and their research, I wanted to create a resource for the complete beginner to programming and statistics to more easily learn how to work with R. Specifically, the book includes GIF screen recordings that show the reader what specific panes do in RStudio and also the formatting of an R Markdown document and the resulting HTML file.

Folks who have used a programming language for a while often forget about all the troubles they had when they initially got started with it. To further support this, I’ll be working on updating the book (specifically Chapter 6) with examples of common R errors, what they mean, and how to remedy them.

The book is entitled “Getting Used to R, RStudio, and R Markdown” and can be found at http://ismayc.github.io/rbasics-book. All of the source code for the book is available at http://github.com/ismayc/rbasics. You can also request edits to the book by clicking on the Edit button near the top of the page. You’ll also find a PDF version of the book there with links to the GIFs (since PDFs can’t have embedded GIFs like webpages can).

Chapter 5 of the book walks through some of the basics of R in working with a data set corresponding to the elements of the periodic table. To expand on this book and on using R in an introductory statistics setting, I’ve also embarked on creating a textbook using bookdown focused on data visualization and resampling techniques to build inferential concepts. The book uses dplyr and ggplot2 packages and focuses on two main data sets in the nycflights13 and ggplot2movies packages. Chapters 8 and 9 are in development, but the plan is for an introduction to the broom package to also be given there. Lastly, there will be expanded Learning Checks throughout the book and Review Questions at the end of each chapter to help the reader better check their understanding of the material. This book is available at http://ismayc.github.io/moderndiver-book with source code available here.

Feel free to email me or tweet to me on Twitter @old_man_chester.


The X-Factors: Where 0 means 1


(This article was first published on eKonometrics, and kindly contributed to R-bloggers)

Hadley Wickham in a recent blog post mentioned that “Factors have a bad rap in R because they often turn up when you don’t want them.” I believe Factors are an even bigger concern. They not only turn up where you don’t want them, but they also turn things around when you don’t want them to.

Consider the following example where I present a data set with two variables: x and y. I represent age in years as ‘y‘ and gender as a binary (0/1) variable ‘x‘, where 1 represents males.

I compute the means for the two variables as follows:

The average age is 43.6 years, and 0.454 suggests that 45.4% of the sample comprises males. So far so good. 
Now let’s see what happens when I convert x into a factor variable using the following syntax:
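The code in the original post appears only as a screenshot; a reconstruction consistent with the description (the data frame name df is an assumption) would be:

# Reconstruction, not the author's exact code: label 0/1 as female/male
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))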

The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.

I compute the average age for males and females as follows:

See what happens when I try to compute the mean for the variable ‘male‘.

Once you convert a variable into a factor, you can’t directly compute statistics such as the mean or standard deviation on it. To do so, you need to convert the factor variable back to numeric. I create a new variable gender that converts the male variable to a numeric one.
I recompute the means below. 

Note that the average for males is 1.45 and not 0.45. Why? When we created the factor variable, it turned zeros into ones and ones into twos. Let’s look at the data set below:
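Since the post's data and output appear only as screenshots, here is a minimal self-contained sketch (with made-up data, so the exact means differ from the author's 43.6, 0.454 and 1.45) that reproduces the behaviour described:

# Made-up data standing in for the author's: x is gender (1 = male), y is age
df <- data.frame(x = c(0, 1, 0, 1, 1, 0), y = c(40, 35, 52, 47, 38, 50))

df$male   <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)   # factor levels are stored internally as 1 and 2, not 0 and 1

head(df)
#   x  y   male gender
# 1 0 40 female      1
# 2 1 35   male      2
# ... (remaining rows omitted)

mean(df$x)        # proportion of males on the original 0/1 coding
mean(df$gender)   # shifted up by one, because female/male became 1/2

# One way to recover a 0/1 coding from the factor:
df$gender01 <- as.numeric(df$male) - 1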
Several algorithms in R expect a binary variable to be coded as 0/1. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with gender (now coded 1/2) as the dependent variable and x as the explanatory variable, R generates an error, because a binomial response must be coded 0/1 (or supplied as a factor or a proportion).

Factor or no factor, I would prefer my zeros to stay as zeros!


The importance of Data Visualization


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Before we perform any analysis and come up with any assumptions about the distributions of and relationships between the variables in our datasets, it is always a good idea to visualize our data in order to understand their properties and identify appropriate analytics techniques. In this post, let’s see the dramatic differences in conclusions that we can make based on (1) simple statistics only, and (2) data visualization.

The four data sets

The Anscombe dataset, which is found in the base R datasets package, is handy for showing the importance of data visualization in data analysis. It consists of four datasets, and each dataset consists of eleven (x,y) points.

anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

Let’s do some massaging to make the data more convenient for analysis and plotting.

Create four groups: setA, setB, setC and setD.

library(ggplot2)
library(dplyr)
library(reshape2)

setA = select(anscombe, x = x1, y = y1)
setB = select(anscombe, x = x2, y = y2)
setC = select(anscombe, x = x3, y = y3)
setD = select(anscombe, x = x4, y = y4)

Add a third column which can help us to identify the four groups.

setA$group = 'SetA'
setB$group = 'SetB'
setC$group = 'SetC'
setD$group = 'SetD'

head(setA, 4)  # showing sample data points from setA
   x    y group
1 10 8.04  SetA
2  8 6.95  SetA
3 13 7.58  SetA
4  9 8.81  SetA

Now, let’s merge the four datasets.

all_data = rbind(setA, setB, setC, setD)  # merging all the four data sets
all_data[c(1, 13, 23, 43), ]  # showing sample
    x    y group
1  10 8.04  SetA
13  8 8.14  SetB
23 10 7.46  SetC
43  8 7.91  SetD

Compare their summary statistics

summary_stats = all_data %>%
  group_by(group) %>%
  summarize("mean x" = mean(x),
            "Sample variance x" = var(x),
            "mean y" = round(mean(y), 2),
            "Sample variance y" = round(var(y), 1),
            'Correlation between x and y ' = round(cor(x, y), 2))

models = all_data %>%
  group_by(group) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  do(data.frame(var = names(coef(.$mod)),
                coef = round(coef(.$mod), 2),
                group = .$group)) %>%
  dcast(., group ~ var, value.var = "coef")

summary_stats_and_linear_fit = cbind(summary_stats,
  data_frame("Linear regression" =
    paste0("y = ", models$"(Intercept)", " + ", models$x, "x")))

summary_stats_and_linear_fit
  group mean x Sample variance x mean y Sample variance y Correlation between x and y 
1  SetA      9                11    7.5               4.1                         0.82
2  SetB      9                11    7.5               4.1                         0.82
3  SetC      9                11    7.5               4.1                         0.82
4  SetD      9                11    7.5               4.1                         0.82
  Linear regression
1      y = 3 + 0.5x
2      y = 3 + 0.5x
3      y = 3 + 0.5x
4      y = 3 + 0.5x

If we look only at the simple summary statistics shown above, we would conclude that these four data sets are identical.

What if we plot the four data sets?

ggplot(all_data, aes(x = x, y = y)) +
  geom_point(shape = 21, colour = "red", fill = "orange", size = 3) +
  ggtitle("Anscombe's data sets") +
  geom_smooth(method = "lm", se = FALSE, color = 'blue') +
  facet_wrap(~group, scales = "free")

As we can see from the figures above, the datasets are very different from each other. Anscombe's quartet is a good example showing that we have to visualize the relationships, distributions and outliers of our data, and that we should not rely only on simple statistics.

Summary

We should look at the data graphically before we start an analysis. Further, we should understand that basic summary statistics can often fail to capture real-world complexities (such as outliers, relationships and complex distributions), since they do not capture all of the structure in the data.




    Chaotic Galaxies


    (This article was first published on Ripples, and kindly contributed to R-bloggers)

    Tell me, which side of the earth does this nose come from? Ha! (ALF)

    Reading about strange attractors I came across this book, where I discovered a way to generate two-dimensional chaotic maps. The generic equation is pretty simple:

    x_{n+1} = a_1 + a_2 x_n + a_3 x_n^2 + a_4 x_n y_n + a_5 y_n + a_6 y_n^2
    y_{n+1} = a_7 + a_8 x_n + a_9 x_n^2 + a_{10} x_n y_n + a_{11} y_n + a_{12} y_n^2

    I used it to generate these chaotic galaxies:

    [plots: galaxy1, galaxy2, galaxy3, galaxy4]

    Changing the vector of parameters you can obtain other galaxies. Do you want to try?

    library(ggplot2)
    library(dplyr)

    # Generic function
    attractor = function(x, y, z) {
      c(z[1] + z[2]*x + z[3]*x^2 + z[4]*x*y + z[5]*y + z[6]*y^2,
        z[7] + z[8]*x + z[9]*x^2 + z[10]*x*y + z[11]*y + z[12]*y^2)
    }

    # Function to iterate the generic function over the initial point c(0,0)
    galaxy = function(iter, z) {
      df = data.frame(x = 0, y = 0)
      for (i in 2:iter) df[i, ] = attractor(df[i - 1, 1], df[i - 1, 2], z)
      df %>% rbind(data.frame(x = runif(iter/10, min(df$x), max(df$x)),
                              y = runif(iter/10, min(df$y), max(df$y)))) -> df
      return(df)
    }

    opt = theme(legend.position = "none",
                panel.background = element_rect(fill = "#00000c"),
                plot.background = element_rect(fill = "#00000c"),
                panel.grid = element_blank(),
                axis.ticks = element_blank(),
                axis.title = element_blank(),
                axis.text = element_blank(),
                plot.margin = unit(c(-0.1, -0.1, -0.1, -0.1), "cm"))

    # First galaxy
    z1 = c(1.0, -0.1, -0.2,  1.0,  0.3,  0.6,  0.0,  0.2, -0.6, -0.4, -0.6,  0.6)
    galaxy1 = galaxy(iter = 2400, z = z1) %>% ggplot(aes(x, y)) +
      geom_point(shape =  8, size = jitter(12, factor = 4), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 16, size = jitter( 4, factor = 2), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 46, size = 0, color = "#ffff00") + opt

    # Second galaxy
    z2 = c(-1.1, -1.0,  0.4, -1.2, -0.7,  0.0, -0.7,  0.9,  0.3,  1.1, -0.2,  0.4)
    galaxy2 = galaxy(iter = 2400, z = z2) %>% ggplot(aes(x, y)) +
      geom_point(shape =  8, size = jitter(12, factor = 4), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 16, size = jitter( 4, factor = 2), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 46, size = 0, color = "#ffff00") + opt

    # Third galaxy
    z3 = c(-0.3,  0.7,  0.7,  0.6,  0.0, -1.1,  0.2, -0.6, -0.1, -0.1,  0.4, -0.7)
    galaxy3 = galaxy(iter = 2400, z = z3) %>% ggplot(aes(x, y)) +
      geom_point(shape =  8, size = jitter(12, factor = 4), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 16, size = jitter( 4, factor = 2), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 46, size = 0, color = "#ffff00") + opt

    # Fourth galaxy
    z4 = c(-1.2, -0.6, -0.5,  0.1, -0.7,  0.2, -0.9,  0.9,  0.1, -0.3, -0.9,  0.3)
    galaxy4 = galaxy(iter = 2400, z = z4) %>% ggplot(aes(x, y)) +
      geom_point(shape =  8, size = jitter(12, factor = 4), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 16, size = jitter( 4, factor = 2), color = "#ffff99", alpha = jitter(.05, factor = 2)) +
      geom_point(shape = 46, size = 0, color = "#ffff00") + opt


    Sharing thoughts on satRdays R Conference, Budapest 2016 #satRdays


    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)


    The first satRday event, held in Budapest on September 3, 2016, is completed. This one-day, community-driven regional event (with very affordable prices, good for networking and catching up on the latest from the R community) is over. And it was a blast! Great time, nice atmosphere, lots of interesting people, and where there is good energy, there is a will to learn new things. And that's what we did!


    The September 3rd, 2016 satRday event took place at the MTA TTK building in Budapest. When the morning workshops were over, the event took off with a keynote by Gabor Csardi, sharing his experiences with R, CRAN and the internals of his packages, followed by the speakers, lightning talks and, at the end, the visualization competition.

    Speakers presenting were:

    With an absolutely outstanding schedule:


    Sessions were outstanding and people were great, with talks covering practical, technical, visualization and package-related topics, and I was thrilled to be part of it.



    [photo] Left to right: @tomaz_tsql, @benceArato, @daroczig

     

    Closing the day with pizza and visualization talks.


    Since it was an R event, some statistics: 192 registrations -> 170 showed up (an 11% drop-off rate) and speakers from 19 countries. Circa 180 lunches served, lots of coffee (I had 4) 🙂 and tea drunk, and the highest density of R package authors in Budapest on September 3, 2016.

     

    And some twitter statistics as well:

    490 tweets with hashtag #satRdays (from 01SEP2016 – 04SEP2016)

    Top 10 most active twitter handles (order desc): @tomaz_tsql, @romain_francois, @BenceArato, @SteffLocke, @daroczig, @torokagoston, @thinkR_fr, @InvernessRug, @Emaasit, @matlabulous  and many many others…

    Most retweeted tweet by Kate Ross-Smith @LaSystemistaria


    The most adorable tweets (I think we can all agree) were the ones where Romain proposed to his long-time girlfriend Cecile. The tweet scored the highest number of favourites (112 at the time of writing this post)!


    Word associations with the official hashtag #satRdays were:

    $satrdays
          lunch       kboda     putting     romance        buns         pic   andypryke       where
           0.49        0.43        0.43        0.43        0.37        0.37        0.35        0.35
           back       break       enjoy   firstever          gt       proud     sponsor       super
           0.32        0.27        0.27        0.25        0.25        0.25        0.25        0.25
      tomorrows          we satrdaysorg    proposal
           0.25        0.24        0.22        0.20
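    The post shows only the output; below is a minimal sketch of how word associations like these are typically computed with the tm package. The tweets vector is a stand-in, since the author's collection code isn't shown.

    # Sketch only: a tiny stand-in corpus instead of the real #satRdays tweets
    library(tm)
    tweets <- c("great lunch before the #satRdays keynote",
                "#satRdays #satRdays proud sponsor of the lunch break",
                "enjoying the R talks in Budapest")
    corpus <- VCorpus(VectorSource(tweets))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    tdm <- TermDocumentMatrix(corpus)
    findAssocs(tdm, "satrdays", corlimit = 0.2)   # terms correlated with "satrdays"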

    Love correlates with satRdays very well these days! 🙂 Not to mention food! Overall, 55% of tweets had positive and 45% neutral sentiment.

    To conclude: just great event! If you missed it, well… don’t do it again!🙂

    Thanks again to organizers (Thank you Gergely, Steff and the crew) and to all the speakers, volunteers, sponsors and R Consortium.

    And just a couple of words on my presentation, Microsoft R Server with SQL Server 2016, where I showed what a great job Microsoft did and how well the RevoScaleR package performs. It was very well received, and I had a couple of interesting questions and people coming to me to learn more.

     


    Examining Data Exercises


    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    One of the first steps of data analysis is descriptive analysis; it helps us understand how the data are distributed and provides important information for further steps. This set of exercises covers functions useful for one-variable descriptive analysis, including graphs. Before proceeding, it might be helpful to look over the help pages for the length, range, median, IQR, hist, quantile, boxplot, and stem functions.

    For this set of exercises you will use a dataset called islands, an R dataset that contains the areas of the world's major landmasses expressed in thousands of square miles. To load the dataset run the following instruction: data(islands).
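    As a quick warm-up (an addition, not part of the exercises), here are the functions mentioned above applied to a different built-in vector, rivers, so the islands answers stay unspoiled:

    data(rivers)
    length(rivers)                                      # number of observations
    range(rivers)                                       # smallest and largest values
    median(rivers)
    IQR(rivers)                                         # interquartile range
    quantile(rivers, probs = c(0, 0.25, 0.5, 0.75, 1))
    hist(rivers)                                        # frequency histogram
    boxplot(rivers)                                     # box-plot, outliers shown by default
    stem(rivers)                                        # stem-and-leaf plot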

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1

    Load the islands dataset and obtain the total number of observations.

    Exercise 2

    Measures of central tendency. Obtain the following statistics of islands

    a) Mean b) Median

    Exercise 3

    Using the function range, obtain the following values:

    a) Size of the biggest island b) Size of the smallest island

    Exercise 4

    Measures of dispersion. Find the following values for islands:

    a) Standard deviation b) The range of the island sizes, using the function range.

    Exercise 5

    Quantiles. Using the function quantile obtain a vector including the following quantiles:

    a) 0%, 25%, 50%, 75%, 100% b) 5%, 95%

    Exercise 6

    Interquartile range. Find the interquartile range of islands.

    Exercise 7

    Create a histogram of islands with the following properties:

    a) Showing the frequency of each group b) Showing the proportion of each group

    Exercise 8

    Create box-plots with the following conditions:

    a) Including outliers b) Without outliers

    Exercise 9

    Using the function boxplot, find the outliers of islands. Hint: use the argument plot=FALSE and inspect the returned $out component.

    Exercise 10

    Create a stem-and-leaf plot of islands.


    Tidying computational biology models with biobroom: a case study in tidy analysis


    (This article was first published on Variance Explained, and kindly contributed to R-bloggers)

    Previously in this series:

    In previous posts, I’ve examined the benefits of the tidy data framework in cleaning, visualizing, and modeling in exploratory data analysis on a molecular biology experiment. We’re using Brauer et al 2008 as our case study, which explores the effects of nutrient starvation on gene expression in yeast.

    From the posts so far, one might get the impression that I think data must be tidy at every stage of an analysis. Not true! That would be an absurd and unnecessary constraint. Lots of mathematical operations are faster on matrices, such as singular value decomposition or linear regression. Jeff Leek rightfully points this out as an issue with my previous modeling gene expression post, where he remarks that the limma package is both faster and takes more statistical considerations (pooling variances among genes) into account.

    Isn’t it contradictory to do these kinds of operations in a tidy analysis? Not at all. My general recommendation is laid out as follows:

    As long as you’re in that “Models” cloud, you can store your data in whatever way is computationally and practically easiest. However:

    • Before you model, you should use tidy tools to clean, process and wrangle your data (as shown in previous posts)
    • After you’ve performed your modeling, you should turn the model into a tidy output for interpretation, visualization, and communication

    This requires a new and important tool in our series on tidy bioinformatics analysis: the biobroom package, written and maintained by my former colleagues, particularly Andy Bass and John Storey. In this post I’ll show how to use the limma and biobroom packages in combination to continue a tidy analysis, and consider when and how to use non-tidy data in an analysis.

    Setup

    Here’s the code to catch up with our previous posts:

    library(readr)
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    url <- "http://varianceexplained.org/files/Brauer2008_DataSet1.tds"

    nutrient_names <- c(G = "Glucose", L = "Leucine", P = "Phosphate",
                        S = "Sulfate", N = "Ammonia", U = "Uracil")

    cleaned_data <- read_delim(url, delim = "\t") %>%
      separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %>%
      mutate_each(funs(trimws), name:systematic_name) %>%
      select(-number, -GID, -YORF, -GWEIGHT) %>%
      gather(sample, expression, G0.05:U0.3) %>%
      separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE) %>%
      mutate(nutrient = plyr::revalue(nutrient, nutrient_names)) %>%
      filter(!is.na(expression), systematic_name != "") %>%
      group_by(systematic_name, nutrient) %>%
      filter(n() == 6) %>%
      ungroup()

    In our last analysis, we performed linear models using broom and the built-in lm function:

    library(broom)

    linear_models <- cleaned_data %>%
      group_by(name, systematic_name, nutrient) %>%
      do(tidy(lm(expression ~ rate, .)))

    The above approach is useful and flexible. But as Jeff notes, it’s not how a computational biologist would typically run a gene expression analysis, for two reasons.

    • Performing thousands of linear regressions with separate lm calls is slow. It takes about a minute on my computer. There are computational shortcuts we can take when all of our data is in the form of a gene-by-sample matrix.
    • We’re not taking statistical advantage of the shared information. Modern bioinformatics approaches often “share power” across genes, by pooling variance estimates. The approach in the limma package is one notable example for microarray data, and RNA-Seq tools like edgeR and DESeq2 take a similar approach in their negative binomial models.

    We’d like to take advantage of the sophisticated biological modeling tools in Bioconductor. We’re thus going to convert our data into a non-tidy format (a gene by sample matrix), and run it through limma to create a linear model for each gene. Then when we want to visualize, compare, or otherwise manipulate our models, we’ll tidy the model output using biobroom.

    Limma

    Most gene expression packages in Bioconductor expect data to be in a matrix with one row per gene and one column per sample. In the last post we fit one model for each gene and nutrient combination. So let’s set it up that way using reshape2’s acast().

    library(reshape2)

    exprs <- acast(cleaned_data, systematic_name + nutrient ~ rate,
                   value.var = "expression")

    head(exprs)
    ##                  0.05   0.1  0.15   0.2  0.25  0.3
    ## Q0017_Ammonia    0.18  0.73  0.05 -0.14 -0.06 0.24
    ## Q0017_Glucose   -0.21 -0.26 -0.17  0.00  0.10 0.10
    ## Q0017_Leucine    0.46  0.05  0.42  0.35  0.36 0.09
    ## Q0017_Phosphate -0.24 -0.35 -0.45 -0.24 -0.35 0.11
    ## Q0017_Sulfate    2.44  0.64  0.21  0.40 -0.07 0.31
    ## Q0017_Uracil     1.88  0.03  0.15  0.04  0.14 0.32

    We then need to extract the experiment design, which in this case is just the growth rate:

    rate <- as.numeric(colnames(exprs))
    rate
    ## [1] 0.05 0.10 0.15 0.20 0.25 0.30

    limma (“linear modeling of microarrays”) is one of the most popular Bioconductor packages for performing linear-model based differential expression analyses on microarray data. With the data in this matrix form, we’re ready to use it:

    library(limma)

    fit <- lmFit(exprs, model.matrix(~rate))
    eb <- eBayes(fit)

    This performs a linear regression for each gene. This operation is both faster and more statistically sophisticated than our original use of lm for each gene.

    So now we’ve performed our regression. What output do we get?

    class(eb)
    ## [1] "MArrayLM"
    ## attr(,"package")
    ## [1] "limma"
    summary(eb)
    ##                  Length Class  Mode     
    ## coefficients     65074  -none- numeric  
    ## rank                 1  -none- numeric  
    ## assign               2  -none- numeric  
    ## qr                   5  qr     list     
    ## df.residual      32537  -none- numeric  
    ## sigma            32537  -none- numeric  
    ## cov.coefficients     4  -none- numeric  
    ## stdev.unscaled   65074  -none- numeric  
    ## pivot                2  -none- numeric  
    ## Amean            32537  -none- numeric  
    ## method               1  -none- character
    ## design              12  -none- numeric  
    ## df.prior             1  -none- numeric  
    ## s2.prior             1  -none- numeric  
    ## var.prior            2  -none- numeric  
    ## proportion           1  -none- numeric  
    ## s2.post          32537  -none- numeric  
    ## t                65074  -none- numeric  
    ## df.total         32537  -none- numeric  
    ## p.value          65074  -none- numeric  
    ## lods             65074  -none- numeric  
    ## F                32537  -none- numeric  
    ## F.p.value        32537  -none- numeric

    That’s a lot of outputs, and many of them are matrices of varying shapes. If you want to work with this using tidy tools (and if you’ve been listening, you hopefully do), we need to tidy it:

    library(biobroom)
    tidy(eb, intercept = TRUE)
    ## # A tibble: 65,074 x 6
    ##               gene        term   estimate statistic     p.value        lod
    ##              <chr>       <chr>      <dbl>     <dbl>       <dbl>      <dbl>
    ## 1    Q0017_Ammonia (Intercept)  0.3926667  1.591294 0.155459427 -6.3498508
    ## 2    Q0017_Glucose (Intercept) -0.3533333 -3.172498 0.015599597 -4.0295194
    ## 3    Q0017_Leucine (Intercept)  0.3873333  2.339366 0.051800996 -5.2780422
    ## 4  Q0017_Phosphate (Intercept) -0.4493333 -2.732529 0.029158254 -4.6871409
    ## 5    Q0017_Sulfate (Intercept)  1.9140000  3.937061 0.005595522 -2.9298317
    ## 6     Q0017_Uracil (Intercept)  1.1846667  2.476906 0.042315927 -5.0718352
    ## 7    Q0045_Ammonia (Intercept) -1.5060000 -5.842285 0.000629533 -0.5396668
    ## 8    Q0045_Glucose (Intercept) -0.8513333 -5.235045 0.001195959 -1.2455443
    ## 9    Q0045_Leucine (Intercept) -0.4440000 -1.807071 0.113591371 -6.0543398
    ## 10 Q0045_Phosphate (Intercept) -0.7546667 -4.488406 0.002819261 -2.1846537
    ## # ... with 65,064 more rows

    Notice that this is now in one-row-per-coefficient-per-gene form, much like the output of broom’s tidy() on linear models.

    Like broom, biobroom always returns a table without rownames that we can feed into standard tidy tools like dplyr and ggplot2. (Note that unlike broom, biobroom requires an intercept = TRUE argument to leave the intercept term, simply because in many genomic datasets- though not ours- the intercept term is almost meaningless). biobroom can also tidy model objects from other tools like edgeR or DESeq2, always giving a consistent format similar to this one.

    Now all we’ve got to do split the systematic name and nutrient back up. tidyr’s separate() can do this:

    td <- tidy(eb, intercept = TRUE) %>%
      separate(gene, c("systematic_name", "nutrient"), sep = "_")

    td
    ## # A tibble: 65,074 x 7
    ##    systematic_name  nutrient        term   estimate statistic     p.value
    ## *            <chr>     <chr>       <chr>      <dbl>     <dbl>       <dbl>
    ## 1            Q0017   Ammonia (Intercept)  0.3926667  1.591294 0.155459427
    ## 2            Q0017   Glucose (Intercept) -0.3533333 -3.172498 0.015599597
    ## 3            Q0017   Leucine (Intercept)  0.3873333  2.339366 0.051800996
    ## 4            Q0017 Phosphate (Intercept) -0.4493333 -2.732529 0.029158254
    ## 5            Q0017   Sulfate (Intercept)  1.9140000  3.937061 0.005595522
    ## 6            Q0017    Uracil (Intercept)  1.1846667  2.476906 0.042315927
    ## 7            Q0045   Ammonia (Intercept) -1.5060000 -5.842285 0.000629533
    ## 8            Q0045   Glucose (Intercept) -0.8513333 -5.235045 0.001195959
    ## 9            Q0045   Leucine (Intercept) -0.4440000 -1.807071 0.113591371
    ## 10           Q0045 Phosphate (Intercept) -0.7546667 -4.488406 0.002819261
    ## # ... with 65,064 more rows, and 1 more variables: lod <dbl>

    Analyzing a tidied model

    We can now use the tidy approaches to visualization and interpretation that were explored in previous posts. We could create a p-value histogram

    ggplot(td, aes(p.value)) +
      geom_histogram() +
      facet_grid(term ~ nutrient, scales = "free_y")


    Or make a volcano plot, comparing statistical significance to effect size (here let’s say just on the slope terms):

    td %>%
      filter(term == "rate") %>%
      ggplot(aes(estimate, p.value)) +
      geom_point() +
      facet_wrap(~nutrient, scales = "free") +
      scale_y_log10()


    We could easily correct for multiple hypothesis testing within each group, and filter for significant (say, FDR < 1%) changes:

    td_filtered <- td %>%
      group_by(term, nutrient) %>%
      mutate(fdr = p.adjust(p.value, method = "fdr")) %>%
      ungroup() %>%
      filter(fdr < .01)

    Or finding the top few significant changes in each group using dplyr’s top_n:

    top_3 <- td_filtered %>%
      filter(term == "rate") %>%
      group_by(nutrient) %>%
      top_n(3, abs(estimate))

    top_3
    ## Source: local data frame [18 x 8]
    ## Groups: nutrient [6]
    ## 
    ##    systematic_name  nutrient  term  estimate  statistic      p.value
    ##              <chr>     <chr> <chr>     <dbl>      <dbl>        <dbl>
    ## 1          YAL061W   Sulfate  rate -13.40000 -15.374369 1.159000e-06
    ## 2          YBR054W Phosphate  rate -23.80571 -12.675600 4.306351e-06
    ## 3          YBR072W    Uracil  rate -27.01143 -11.870314 6.702423e-06
    ## 4          YBR116C    Uracil  rate -25.12000  -9.931703 2.199911e-05
    ## 5          YBR294W Phosphate  rate  24.56571   9.953214 2.168703e-05
    ## 6          YFL014W   Leucine  rate -18.92000  -6.884237 2.318928e-04
    ## 7          YHR096C   Ammonia  rate -23.71429 -20.700901 1.496359e-07
    ## 8          YHR096C   Glucose  rate -20.31429  -7.890506 9.816693e-05
    ## 9          YHR096C   Leucine  rate -23.18286  -6.324890 3.904599e-04
    ## 10         YHR096C Phosphate  rate -26.12000 -25.405139 3.616210e-08
    ## 11         YHR137W   Glucose  rate -21.45143  -9.798727 2.404524e-05
    ## 12         YIL160C   Sulfate  rate -13.50286 -16.838241 6.214232e-07
    ## 13         YIL169C   Ammonia  rate -24.99429  -8.439007 6.375875e-05
    ## 14         YLR327C   Leucine  rate -18.43429 -17.547502 4.679809e-07
    ## 15         YLR327C    Uracil  rate -28.13714  -7.434840 1.432175e-04
    ## 16         YMR303C   Glucose  rate -22.60000  -6.101140 4.856028e-04
    ## 17         YOL155C   Ammonia  rate -28.13714 -11.530132 8.147212e-06
    ## 18         YPL223C   Sulfate  rate -18.42857 -12.883494 3.857832e-06
    ## # ... with 2 more variables: lod <dbl>, fdr <dbl>

    We could join this with our original data, which would let us visualize the trends for only the most significant genes:

    top_3 %>%
      rename(significant_nutrient = nutrient) %>%
      inner_join(cleaned_data, by = "systematic_name") %>%
      mutate(highlight = nutrient == significant_nutrient) %>%
      ggplot(aes(rate, expression, color = nutrient)) +
      geom_point() +
      geom_smooth(aes(lty = !highlight), method = "lm", se = FALSE, show.legend = FALSE) +
      facet_wrap(significant_nutrient ~ systematic_name, ncol = 3, scales = "free_y")


    In short, you can once again use the suite of “tidy tools” that we’ve found powerful in genomic analyses.

    Conclusion: Data is wrangled for you, not you for the data

    There’s a classic proverb of computer science from Abelson & Sussman: “Programs must be written for people to read, and only incidentally for machines to execute.” I’d say this is even more true for data than it is for code. Data scientists need to be very comfortable engaging with their data, not fighting with the representation.

    I agree with a lot in Jeff’s “Non-tidy data” post, but there’s one specific statement I disagree with:

    …you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth.

    I’d counter that switching is a small cost, because switching can be automated. Note that in the above analysis, reshaping the data required only two lines of code and two functions (acast() and tidy()). In contrast, there’s no way to automate critical thinking. Any challenge in plotting, filtering, or merging your data will get directly in the way of answering scientific questions.

    @jtleek @mark_scheuerell @hadleywickham solution is put energy into tidying first. Otherwise, you will pay off energy 10x in plot/manip

    — David Robinson (@drob) February 12, 2016

    It’s thus the job of tool developers to make these “switches” as easy as possible. broom and biobroom play a role in this, as do reshape2 and tidyr. Jeff lists natural language as a type of data that’s best left un-tidy, but since that post Julia Silge and I have developed the tidytext package, and we’ve found it useful for performing text analyses using ggplot2 and dplyr (see here for examples).

    Other examples of operations that are better performed on matrices include correlations and distance calculations, and for those purposes I’m currently working on the widyr package, which wraps these operations to allow a tidy input and tidy output (for example, see this application of the pairwise_cor function).

    Next time: gene set enrichment analysis

    (Programming note: this was originally my plan for this post, but I decided to preempt it for biobroom!)

    These per-gene models can still be difficult to interpret biologically if you’re not familiar with the functions of specific genes. What we really want is a way to summarize the results into “genes involved in this biological process changed their expression.” This is where annotations of gene sets become useful.

    gene_sets <- distinct(cleaned_data, systematic_name, BP, MF)

    td %>%
        inner_join(gene_sets) %>%
        filter(BP == "leucine biosynthesis", term == "(Intercept)") %>%
        mutate(nutrient = reorder(nutrient, estimate, median)) %>%
        ggplot(aes(nutrient, estimate)) +
        geom_boxplot() +
        geom_point() +
        geom_text(aes(label = systematic_name), vjust = 1, hjust = 1) +
        xlab("Limiting nutrient") +
        ylab("Intercept (expression at low concentration)") +
        ggtitle("Genes involved in leucine biosynthesis")

    [Plot: intercept estimates for leucine biosynthesis genes, by limiting nutrient]

    Notice how clear it is that these genes respond to leucine starvation in particular. This can be applied to gene sets containing dozens or even hundreds of genes while still making the general trend apparent. Furthermore, we could use these summaries to look at many gene sets at once, and even use statistical tests to discover new gene sets that respond to starvation.

    Thus, in my next post in this series, we’ll apply our “tidy modeling” approach to a new problem. Instead of testing whether each gene responds to starvation in an interesting way, we’ll test functional groups of genes (a general method called gene set analysis) in order to develop higher-level biological insights. And we’ll do it with this same set of tidy tools we’ve been using so far.

    1. We could have used tidyr’s spread function, but acast actually saves us a few steps by giving us a matrix with rownames, rather than a data frame, right away.

    2. Note that by default biobroom returns a tbl_df rather than a data.frame. This is because tidy genomic output is usually many thousands of rows, so printing it is usually not practical. The class it returns can be set (to be a tbl_df, data.frame, or data.table) through the biobroom.return global option.

    To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.


    Classification in Spark 2.0: “Input validation failed” and other wondrous tales


    (This article was first published on R – Nodalpoint, and kindly contributed to R-bloggers)

    Spark 2.0 has been out since last July but, despite the numerous improvements and new features, several annoyances still remain and can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as we’ll see, the issue remains in Spark 2.0).

    We’ll first try a simple binary classification problem in PySpark using Spark MLlib, but, before doing so, let’s have a look at the current status of the machine learning APIs in Spark 2.0.

    Spark MLlib was the older machine learning API for Spark, intended to be gradually replaced by the newer Spark ML library; in Spark 2.0 this terminology has changed (enough, in my opinion, to cause unnecessary confusion): now the whole machine learning functionality is termed “MLlib”, with the old MLlib being the so-called “RDD-based API”, and the (old) Spark ML library now termed the “MLlib DataFrame-based API”. The old RDD-based API has now entered maintenance mode, heading for gradual deprecation.

    Truth is, whatever Databricks and the Spark architects may like to believe, there is some essential machine learning functionality which is still only available in the old MLlib RDD-based API, good examples being multinomial logistic regression and SVM models.

    Now that we have hopefully justified our interest in the “old” RDD-based API, let us proceed to our first example.

    (In the code snippets below, pyspark.mllib corresponds to the old, RDD-based API, while pyspark.ml corresponds to the new DataFrame-based API.)

    The following code is slightly adapted from the documentation example of logistic regression in PySpark (can you spot the difference?):

    >>> print spark.version
    2.0.0
    >>> from pyspark.mllib.classification import LogisticRegressionModel, LogisticRegressionWithSGD
    >>> from pyspark.mllib.regression import LabeledPoint
    >>> data = [
    ...     LabeledPoint(2.0, [0.0, 1.0]),
    ...     LabeledPoint(1.0, [1.0, 0.0])]
    >>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10)
    [...]
    : org.apache.spark.SparkException: Input validation failed.
    

    As we can see, this simple code snippet gives the super-informative error “Input validation failed“. And the situation is the same with other classifiers – here is an SVM with the same data:

    >>> from pyspark.mllib.classification import SVMModel, SVMWithSGD
    >>> svm = SVMWithSGD.train(sc.parallelize(data), iterations=10)
    [...]
    : org.apache.spark.SparkException: Input validation failed.
    

    No matter how long you stare at the relevant PySpark documentation, you will not find the error cause, simply because it is not documented – to get a hint, you will have to dig into the Scala (!) source code of LogisticRegressionWithSGD, where it is mentioned:

     * NOTE: Labels used in Logistic Regression should be {0, 1}
    

    So, in binary classification, your labels cannot be whatever you like, but they need to be {0, 1} or {0.0, 1.0}…

    >>> data = [
    ...     LabeledPoint(0.0, [0.0, 1.0]), # changed the label here from 2.0 to 0.0
    ...     LabeledPoint(1.0, [1.0, 0.0])]
    >>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10) 
    >>> lrm.predict([0.2, 0.5])
    0
    >>> svm = SVMWithSGD.train(sc.parallelize(data), iterations=10)
    >>> svm.predict([0.2, 0.5])
    0
    

    A similar (again, undocumented) constraint applies for multi-class classification, where for k classes you must use the labels {0, 1, …, k-1}.
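
    By way of illustration, here is a minimal sketch (mine, not from the original post) of remapping arbitrary labels to the required 0-based range before training; the three classes and the label_map name are made up, and I switch to LogisticRegressionWithLBFGS because, unlike the SGD variant, it accepts a numClasses argument for multinomial problems:

    >>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    >>> from pyspark.mllib.regression import LabeledPoint
    >>> raw = [(2.0, [0.0, 1.0]),   # made-up labels {2.0, 5.0, 9.0},
    ...        (5.0, [1.0, 0.0]),   # i.e. NOT in the required {0, 1, 2} range
    ...        (9.0, [1.0, 1.0])]
    >>> # map each original label to a 0-based double: 2.0 -> 0.0, 5.0 -> 1.0, 9.0 -> 2.0
    >>> label_map = {lab: float(i) for i, lab in enumerate(sorted(set(l for l, _ in raw)))}
    >>> data = [LabeledPoint(label_map[l], features) for l, features in raw]
    >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data), iterations=10, numClasses=3)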

    Interestingly enough, SparkR does not suffer from such a limitation; try the following code interactively from RStudio (where we use the iris dataset with one class removed, since spark.glm does not support multi-class classification):

    library(SparkR, lib.loc = "/home/ctsats/spark-2.0.0-bin-hadoop2.6/R/lib") # change the path accordingly here
    
    sparkR.session(sparkHome = "/home/ctsats/spark-2.0.0-bin-hadoop2.6")      # and here
    
    df <- as.DataFrame(iris[1:100,]) # keep 2 classes only
    head(df)
    
    model <- spark.glm(df, Species ~ .,  family="binomial")
    summary(model)
    
    pred <- predict(model, df)
    showDF(pred, 100, truncate = FALSE)
    
    sparkR.session.stop()
    

    Since we are interested in the (very) basic functionality only, we don’t bother with splitting into training and test sets – we just run the predict function on the dataset we used for training; here is the partial output from showDF:

    +------------+-----------+------------+-----------+----------+-----+----------------------+
    |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species   |label|prediction            |
    +------------+-----------+------------+-----------+----------+-----+----------------------+
    |5.1         |3.5        |1.4         |0.2        |setosa    |1.0  |0.9999999999999999    |
    |4.9         |3.0        |1.4         |0.2        |setosa    |1.0  |0.999999999999992     |
    |4.7         |3.2        |1.3         |0.2        |setosa    |1.0  |0.999999999999998     |
    |4.6         |3.1        |1.5         |0.2        |setosa    |1.0  |0.9999999999994831    |
    [...]
    |5.7         |2.9        |4.2         |1.3        |versicolor|0.0  |1.0E-16               |
    |6.2         |2.9        |4.3         |1.3        |versicolor|0.0  |1.0E-16               |
    |5.1         |2.5        |3.0         |1.1        |versicolor|0.0  |1.9333693386853254E-10|
    |5.7         |2.8        |4.1         |1.3        |versicolor|0.0  |1.0E-16               |
    +------------+-----------+------------+-----------+----------+-----+----------------------+
    

    from which it is apparent that SparkR has performed internally the label mapping setosa -> 1.0 and versicolor -> 0.0 for us.

    There is an explanation for this difference in behavior: under the hood, and unlike the PySpark examples shown above, SparkR uses the newer DataFrame-based API for the machine learning functionality; so, let’s have a quick look at this API from a PySpark point of view as well.

    We recreate our first code snippet above, with the data now as a DataFrame instead of a LabeledPoint:

    >>> print spark.version
    2.0.0
    >>> from pyspark.ml.classification import LogisticRegression
    >>> from pyspark.ml.linalg import Vectors
    >>> df = sqlContext.createDataFrame([
    ...     (2.0, Vectors.dense(0.0, 1.0)),
    ...     (1.0, Vectors.dense(1.0, 0.0))], 
    ...     ["label", "features"])
    >>> df.show()
    +-----+---------+
    |label| features|
    +-----+---------+
    |  2.0|[0.0,1.0]|
    |  1.0|[1.0,0.0]|
    +-----+---------+
    >>> lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
    >>> model = lr.fit(df)
    [...]
    : org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset.
    

    Well… as you can see, our classifier complains that it has found 3 classes in the data, despite that evidently not being the case…

    Changing the labels above from {1.0, 2.0} to {0.0, 1.0} resolves this issue (not shown); again, this requirement is nowhere documented in PySpark, and the error message does little to help locate the actual issue.
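
    One possible workaround (a sketch of my own, not something from the original post) is to let StringIndexer build a compliant label column, much like SparkR apparently does under the hood; the label_idx column name is just an arbitrary choice, and keep in mind that StringIndexer assigns indices by label frequency, so which original label ends up as 0.0 is not guaranteed:

    >>> from pyspark.ml.feature import StringIndexer
    >>> from pyspark.ml.classification import LogisticRegression
    >>> # df is the DataFrame with labels {1.0, 2.0} from the snippet above
    >>> indexed = StringIndexer(inputCol="label", outputCol="label_idx").fit(df).transform(df)
    >>> lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label_idx")
    >>> model = lr.fit(indexed)   # should now fit, since label_idx is in {0.0, 1.0}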

    And here is our last take on weird, counter-intuitive, and undocumented features of the new DataFrame-based machine learning API…

    One could easily argue that encoding class labels (i.e. what is actually a factor) with floating-point numbers is unnatural; and the old, RDD-based API sure permits the more natural choice of encoding the class labels as integers, instead:

    >>> print spark.version
    2.0.0
    >>> from pyspark.mllib.classification import LogisticRegressionModel, LogisticRegressionWithSGD
    >>> from pyspark.mllib.regression import LabeledPoint
    >>> data = [
    ...     LabeledPoint(0, [0.0, 1.0]), # integer labels instead
    ...     LabeledPoint(1, [1.0, 0.0])] # of float
    >>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10) 
    >>> lrm.predict([0.2, 0.5])
    0
    

    What’s more, the binary LogisticRegression classifier from the new, DataFrame-based API also allows for integer labels:

    >>> print spark.version
    2.0.0
    >>> from pyspark.ml.classification import LogisticRegression
    >>> from pyspark.ml.linalg import Vectors
    >>> df = sqlContext.createDataFrame([
    ...     (0, Vectors.dense(0.0, 1.0)),  # integer labels instead
    ...     (1, Vectors.dense(1.0, 0.0))], # of float
    ...     ["label", "features"])
    >>> df.show()
    +-----+---------+
    |label| features|
    +-----+---------+
    |    0|[0.0,1.0]|
    |    1|[1.0,0.0]|
    +-----+---------+
    >>> lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
    >>> model = lr.fit(df) # works OK
    

    But here is what happens when we try a DecisionTreeClassifier from the very same module (namely pyspark.ml.classification):

    >>> print spark.version
    2.0.0
    >>> from pyspark.ml.classification import DecisionTreeClassifier
    >>> from pyspark.ml.linalg import Vectors
    >>> df = sqlContext.createDataFrame([
    ...     (0, Vectors.dense(0.0, 1.0)),  # integer labels instead
    ...     (1, Vectors.dense(1.0, 0.0))], # of float
    ...     ["label", "features"])
    >>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="label")
    >>> model = dt.fit(df) 
    [...]
    : java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
    

    As you might have guessed by now, changing the labels back to floating-point numbers resolves the issue – but if you expect to find a reference in the relevant PySpark documentation, or at least a hint in the general Spark Machine Learning Library Guide, well, good luck…
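
    For completeness, here is the fix as a minimal sketch (mine, not from the post): cast the integer label column to double before fitting, rather than rewriting the data by hand.

    >>> from pyspark.sql.functions import col
    >>> df_double = df.withColumn("label", col("label").cast("double"))
    >>> model = dt.fit(df_double)   # the ClassCastException should no longer appear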

    * * *

    I will argue that

    • Such unexpected and counter-intuitive behavior in Spark abounds
    • The documentation, especially for the Python API (PySpark), is hopelessly uninformative on such issues
    • This can cause considerable strain and frustration to novice and seasoned data scientists alike, especially since such users are naturally expected to rely on PySpark, rather than the Scala or Java APIs

    Consider the following issue:

    >>> print spark.version
    2.0.0
    >>> from pyspark.ml.linalg import Vectors
    >>> x = Vectors.dense([0.0, 1.0])
    >>> x
    DenseVector([0.0, 1.0])
    >>> -x
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: func() takes exactly 2 arguments (1 given)
    

    I have written in the past about this, but back then it concerned the “old” pyspark.mllib.linalg module; the reaction from the Spark community clearly implied something along the lines of “It is well-known that […]“; as I counter-argued (after doing my research), since it seems to be not so well-known, we might even consider adding it to the documentation – so I opened a documentation issue in Spark JIRA. Not only does it remain unresolved, but, as I have just shown above, the same behavior has been inherited by the newer pyspark.ml.linalg module, again without any relevant mention in the documentation.
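
    Until this behavior changes (or at least gets documented), a workaround sketch (my own, not an official recommendation) is to drop down to NumPy, which does support unary minus, and rebuild the vector:

    >>> from pyspark.ml.linalg import Vectors
    >>> x = Vectors.dense([0.0, 1.0])
    >>> neg_x = Vectors.dense(-x.toArray())   # toArray() returns a NumPy array, which we negate and re-wrap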

    Wondrous tales indeed…

    The post Classification in Spark 2.0: “Input validation failed” and other wondrous tales appeared first on Nodalpoint.

    To leave a comment for the author, please follow the link and comment on their blog: R – Nodalpoint.

