
Notice: Changes to the site


(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

I wanted to give everyone a heads up about a major rebrand and some probable downtime happening over the weekend.

I’m going to be consolidating my Locke Data consulting company materials, the blog, the talks, and my package documentation into a single site. The central URL itsalocke.com won’t be changing but there will be a ton of changes happening.

I think I’ve got the redirects and the RSS feeds pretty much sorted, but I’ve converted more than 300 pages of content to new systems – I’ve likely gotten things wrong. You might notice some issues when clicking through from R-Bloggers, on twitter, or from other people’s sites.

I hope you’ll like the changes I’ve made, but I’m going to have a “bug bounty” in place. If you find a broken link, a blog post that isn’t rendered correctly, or some other bug, then report it. Filling in the form is easy, and if you provide your name and address, I’ll send you a sticker as thanks!

If you want to get a preview of the site, check it out on its temporary home lockelife.com.

The post Notice: Changes to the site appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.



Practical Guide to Principal Component Methods in R


(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

Introduction

Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.

This book provides solid practical guidance on summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.

Where to find the book:

The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.

Principal component methods

There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the results are presented differently depending on the package used.

To help in the interpretation and in the visualization of multivariate analysis – such as cluster analysis and principal component methods – we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).

No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract the analysis results from the different packages mentioned above in a human-readable format. factoextra also provides convenient functions for creating beautiful ggplot2-based graphs.

The methods whose outputs can be visualized using the factoextra package are shown in the figure below:

Principal component methods and clustering methods supported by the factoextra R package

In this book, we’ll use mainly:

  • the FactoMineR package to compute principal component methods;
  • and the factoextra package for extracting, visualizing and interpreting the results.

The other packages – ade4, ExPosition, etc. – will also be presented briefly.
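
As a taste of how these two packages fit together, here is a minimal sketch (not taken from the book) using the built-in iris data:

library(FactoMineR)
library(factoextra)

# compute PCA on the four numeric iris columns with FactoMineR
res.pca <- PCA(iris[, -5], graph = FALSE)

# extract and visualize the results with factoextra
get_eigenvalue(res.pca)   # eigenvalues / percentage of variance retained
fviz_eig(res.pca)         # scree plot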

How this book is organized

This book contains 4 parts.

Principal Component Methods book structure

Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.

Key features of FactoMineR and factoextra for multivariate analysis

Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:

  • Principal Component Analysis (PCA, for continuous variables),
  • Simple correspondence analysis (CA, for large contingency tables formed by two categorical variables)
  • Multiple correspondence analysis (MCA, for a data set with more than 2 categorical variables).

In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:

  • Factor Analysis of Mixed Data (FAMD) and,
  • Multiple Factor Analysis (MFA).

Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables or with a mixed data set of categorical and continuous variables.

Key features of this book

This book presents the basic principles of the different methods and provides many examples in R. It offers solid guidance in data mining for students and researchers.

Key features:

  • Covers principal component methods and implementation in R
  • Highlights the most important information in your data set using ggplot2-based elegant visualization
  • Short, self-contained chapters with tested examples that allow for flexibility in designing a course and for easy reference

At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.

Examples of plots

Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.

  1. Eigenvalues/variances of principal components: the proportion of information retained by each principal component.

  2. PCA – Graph of variables:
  • Control variable colors using their contributions to the principal components.

  • Highlight the variables contributing most to each principal dimension.

  3. PCA – Graph of individuals:
  • Automatically control the color of individuals using the cos2 (the quality of the individuals on the factor map).

  • Change the point size according to the cos2 of the corresponding individuals.

  4. PCA – Biplot of individuals and variables.

  5. Correspondence analysis: association between categorical variables.

  6. FAMD/MFA – Analyzing mixed and structured data.

  7. Clustering on principal components.
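
As a hedged sketch (not the book's own code), the plot types listed above can be produced from the res.pca object computed earlier; the specific factoextra arguments shown are assumptions based on the package documentation:

fviz_eig(res.pca, addlabels = TRUE)                           # 1. eigenvalues / variances
fviz_pca_var(res.pca, col.var = "contrib")                    # 2. variables, colored by contribution
fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2")   # 3. individuals, colored and sized by cos2
fviz_pca_biplot(res.pca)                                      # 4. biplot of individuals and variables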

Book preview

Download the preview of the book at: Principal Component Methods in R (Book preview)

Order now

About the author

Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).

He has experience in statistical and computational methods for identifying prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.

He created GenomicScape (www.genomicscape.com), an easy-to-use web tool for gene expression data analysis and visualization.

He also developed a training website on data science, named STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.

He is the author of many popular R packages for:

Recently, he published three books on data analysis and visualization:

  1. Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
  2. Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
  3. Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).


New R Course: Sentiment Analysis in R – The Tidy Way


(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hello, R users! This week we’re continuing to bridge the gap between computers and human language with the launch of Sentiment Analysis in R: The Tidy Way by Julia Silge!

Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
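
To give a flavour of the tidy workflow the course teaches, here is a minimal sketch (not course material) using the tidytext package on a small made-up text:

library(dplyr)
library(tidytext)

text_df <- tibble(line = 1:2,
                  text = c("I love this beautiful sunny day",
                           "The traffic was terrible and I felt miserable"))

text_df %>%
  unnest_tokens(word, text) %>%                         # one token per row: the "tidy" format
  inner_join(get_sentiments("bing"), by = "word") %>%   # attach positive/negative labels
  count(line, sentiment)                                # sentiment counts per line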

Take me to chapter 1!

Sentiment Analysis in R: The Tidy Way features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in sentiment analysis!

What you’ll learn:

Chapter 1: Tweets across the United States

In this chapter, you will implement sentiment analysis with tidy data principles, applied to geocoded Twitter data.

Chapter 2: Shakespeare gets Sentimental

Your next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.

Chapter 3: Analyzing TV News

Text analysis using tidy principles can be applied to diverse kinds of text, and in this chapter, you will explore a dataset of closed captioning from television news. You will apply the skills you have learned so far to explore how different stations report on a topic with different words, and how sentiment changes with time.

Chapter 4: Singing a Happy Song (or Sad?!)

In this final chapter on sentiment analysis using tidy principles, you will explore pop song lyrics that have topped the charts from the 1960s to today. You will apply all the techniques we have explored together so far, and use linear modeling to find what the sentiment of song lyrics can predict.

Learn all there is to know about Sentiment Analysis in R, the Tidy Way!


Introducing ‘powerlmm’ an R package for power calculations for longitudinal multilevel models


(This article was first published on R Psychologist - R, and kindly contributed to R-bloggers)

Over the years I’ve produced quite a lot of code for power calculations and simulations of different longitudinal linear mixed models. Over the summer I bundled together these calculations for the designs I most typically encounter into an R package. The purpose of powerlmm is to help design longitudinal treatment studies, with or without higher-level clustering (e.g. by therapists, groups, or physicians), and missing data. Currently, powerlmm supports two-level models, nested three-level models, and partially nested models. Additionally, unbalanced designs and missing data can be accounted for in the calculations. Power is calculated analytically, but simulation methods are also provided in order to evaluate bias, type 1 error, and the consequences of model misspecification. For novice R users, the basic functionality is also provided as a Shiny web application.

The package can be installed from GitHub here: github.com/rpsychologist/powerlmm. Currently, the package includes three vignettes that show how to set up your studies and calculate power.
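
For completeness, here is a hedged sketch of that GitHub install route, assuming the devtools package is available:

# install.packages("devtools")   # if devtools is not yet installed
devtools::install_github("rpsychologist/powerlmm")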

A basic example

library(powerlmm)

# dropout per treatment group
d <- per_treatment(control = dropout_weibull(0.3, 2),
                   treatment = dropout_weibull(0.2, 2))

# Setup design
p <- study_parameters(n1 = 11,            # time points
                      n2 = 10,            # subjects per cluster
                      n3 = 5,             # clusters per treatment arm
                      icc_pre_subject = 0.5,
                      icc_pre_cluster = 0,
                      icc_slope = 0.05,
                      var_ratio = 0.02,
                      dropout = d,
                      cohend = -0.8)

# Power
get_power(p)
## 
##      Power calculation for longitudinal linear-mixed model (three-level)
##                            with missing data and unbalanced designs 
## 
##               n1 = 11
##               n2 = 10  (treatment)
##                    10  (control)
##               n3 = 5   (treatment)
##                    5   (control)
##                    10  (total)
##          total_n = 50  (treatment)
##                    50  (control)
##                    100 (total)
##          dropout =  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10 (time)
##                     0,  0,  1,  3,  6,  9, 12, 16, 20, 25, 30 (%, control)
##                     0,  0,  1,  2,  4,  5,  8, 10, 13, 17, 20 (%, treatment)
## icc_pre_subjects = 0.5
## icc_pre_clusters = 0
##        icc_slope = 0.05
##        var_ratio = 0.02
##           cohend = -0.8
##            power = 0.68

Feedback

I appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues or via my contact info here.


Big Data analytics with RevoScaleR Exercises


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, you will explore how to handle big data with the RevoScaleR package from Microsoft R (previously Revolution Analytics). It comes with the Microsoft R Client, which you can get from here. Get the credit card fraud data set from Revolution Analytics and let's get started. Answers to the exercises are available here. Please check the documentation before starting this exercise set.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
The heart of RevoScaleR is the xdf file format. Convert the credit card fraud data set into xdf format.

Exercise 2

Use the newly created xdf file to get information about the variables and print 10 rows to check the data.
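
As a hedged sketch (not the official solutions), the first two exercises could look roughly like this, assuming the credit card fraud CSV has been saved as "ccFraud.csv" in the working directory:

library(RevoScaleR)

# Exercise 1: convert the CSV file to the xdf format
rxImport(inData = "ccFraud.csv", outFile = "ccFraud.xdf", overwrite = TRUE)

# Exercise 2: variable information plus the first 10 rows
rxGetInfo("ccFraud.xdf", getVarInfo = TRUE, numRows = 10)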

Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to

  • work with different data import techniques,
  • know how to import data and transform it for a specific modelling or analysis goal,
  • and much more.

Exercise 3
Use rxSummary to get the summary of the variables gender, balance and cardholder where numTrans is greater than 10.

Exercise 4
Use rxDataStep to create a variable avgbalpertran, defined as balance / (numTrans + numIntlTrans). Use rxGetInfo to check whether your changes are reflected in the xdf data.

Exercise 5
Use rxCor to find the correlation between the newly created variable and fraudRisk.

Exercise 6
Use rxLinMod to construct the linear regression of fraudRisk on gender, balance and cardholder. Don't forget to check the summary of the model.

Exercise 7
Find the contingency table of fraudRisk and gender using rxCrossTabs. Hint: figure out how to include factors in the formula.

Exercise 8
Use rxCube to find the mean balance for each of the two genders.

Exercise 9
Create a relative frequency histogram of balance from the xdf file.

Exercise 10
Create a two-panel histogram with gender and fraudRisk as explanatory variables to show the relative frequency of fraudRisk for the two genders.


Reticulating Readability


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.

Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that uses a similar methodology, but I’m not aware of any R ports.

What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n\n\n \n \n \n \n \n \n\n \n\n \n \n\n \n \n \n \n \n \n \n try{Typekit.load();}catch(e){}\n \n hljs.initHighlightingOnLoad();\n\n Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module to just get the clean, article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)

That text is now in good enough shape to tidy.

Here’s the same version with clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).


Calculating a fuzzy kmeans membership matrix with R and Rcpp


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Błażej Moska, computer science student and data science intern 

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course, it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (like fanny, for example), but since that is a slightly different implementation the results could also be different, and for some reasons we don’t want them to change. Knowing the equation, we can construct this matrix on our own after using the kmeans function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_{k=1}^{c} \left( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| } \right)^{ \frac{2}{m-1} } }$$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. The total number of rows of this matrix therefore equals the number of observations, and the total number of columns equals the number of clusters. \(m\) is a fuzzifier parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), and distances are measured with the Euclidean distance.

To make computations faster I also used the Rcpp package, then I compared speed of execution of function written in R with this written in C++.

In both implementations, for loops were used. Although this is not a commonly used approach in R (see this blog post for more information and alternatives), in this case I find it more convenient.

Rcpp (C++ version)

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data is a matrix with observations (rows) and variables,
     centers is a matrix with cluster center coordinates,
     m is the parameter of the equation, c is the number of clusters */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols();    /* number of columns equals the number of variables,
                                the same as in the centers matrix */
  double tempDist = 0;       /* tempDist and dist are variables storing temporary
                                Euclidean distances */
  double dist = 0;
  double denominator = 0;    // denominator of the "main" equation

  NumericMatrix result(rows, c);    // declaration of the matrix of results

  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < c; j++) {
      for (int k = 0; k < c; k++) {
        for (int p = 0; p < cols; p++) {
          // in the innermost loop, squared Euclidean distances are accumulated;
          // tempDist is the numerator inside the sum operator of the equation,
          // dist is the denominator inside the sum operator of the equation
          tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
          dist = dist + pow(centers(k, p) - data(i, p), 2);
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        denominator = denominator + pow(tempDist / dist, 2.0 / (m - 1));  // 2.0 avoids integer division for m > 2
        tempDist = 0;
        dist = 0;
      }
      result(i, j) = 1 / denominator;    // numerator/denominator of the main equation
      denominator = 0;
    }
  }
  return result;
}

We can save this in a file with .cpp extension. To compile it from R we can write:

sourceCpp("path_to_cpp_file")

If everything goes right our function fuzzyClustering will then be available from R.

R version

fuzzyClustering <- function(data, centers, m) {

  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)   # defining the membership matrix
  denominator <- 0

  for (i in 1:rows) {
    for (j in 1:c) {
      tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))   # Euclidean distance, numerator inside the sum operator
      for (k in 1:c) {
        Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))     # Euclidean distance, denominator inside the sum operator
        denominator <- denominator + ((tempDist / Dist)^(2 / (m - 1)))   # denominator of the equation
      }
      result[i, j] <- 1 / denominator   # inserting the value into the membership matrix
      denominator <- 0
    }
  }
  return(result)
}
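
As a hedged usage sketch (not part of the original post), either implementation can be combined with base R's kmeans, assuming the observations are stored in a numeric matrix X:

set.seed(42)
X <- matrix(rnorm(300 * 3), ncol = 3)   # simulated data: 300 observations, 3 variables
km <- kmeans(X, centers = 10)

membership <- fuzzyClustering(X, km$centers, m = 2)
round(head(membership), 3)              # each row sums to (approximately) 1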

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created) and rows are our objects (observations). Values were rounded to the third decimal place, so the row sums can differ slightly from 1:

          1     2     3     4     5     6     7     8     9    10  
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188  
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051  
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062  
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031  
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045  
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048  
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046  
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062  
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055 
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

Performance comparison

Shown below is the output of system.time for the C++ and R versions, running against a simulated matrix with 30,000 observations, 3 variables and 10 clusters. The hardware I used was a low-cost Asus R556L notebook with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

C++ version:

  user    system  elapsed
  0.32    0.00    0.33 

R version:

  user    system  elapsed
  15.75    0.02     15.94  

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.


Linear Congruential Generator in R


(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 1 in the series Random Number Generation

A Linear congruential generator (LCG) is a class of pseudorandom number generator (PRNG) algorithms used for generating sequences of random-like numbers. The generation of random numbers plays a large role in many applications ranging from cryptography to Monte Carlo methods. Linear congruential generators are one of the oldest and most well-known methods for generating random numbers primarily due to their comparative ease of implementation and speed and their need for little memory. Other methods such as the Mersenne Twister are much more common in practical use today.

Linear congruential generators are defined by a recurrence relation:

 \large{X_{i+1} = (aX_i + c) \space \text{mod} \space m}

There are many choices for the parameters m, the modulus, a, the multiplier, and c the increment. Wikipedia has a seemingly comprehensive list of the parameters currently in use in common programs.

Aside: ‘Pseudorandom’ and Selecting a Seed Number

Random number generators such as LCGs are known as ‘pseudorandom’ as they require a seed number to generate the random sequence. Due to this requirement, random number generators today are not truly ‘random.’ The theory and optimal selection of a seed number are beyond the scope of this post; however, a common choice suitable for our application is to take the current system time in microseconds.

A Linear Congruential Generator Implementation in R

The parameters we will use for our implementation of the linear congruential generator are the same as the ANSI C implementation (Saucier, 2000.).

 \large{m = 2^{32} \qquad a = 1103515245 \qquad c = 12345}

The following function is an implementation of a linear congruential generator with the given parameters above.

lcg.rand <- function(n = 10) {

  rng <- vector(length = n)

  m <- 2 ** 32
  a <- 1103515245
  c <- 12345

  # Set the seed using the current system time in microseconds
  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }

  return(rng)
}

We can use the function to generate random numbers U(0, 1).

# Print 10 random numbers on the half-open interval [0, 1)
lcg.rand()
##  [1] 0.4605103 0.6643705 0.6922703 0.4603930 0.1842995 0.6804419 0.8561535
##  [8] 0.2435846 0.8236771 0.9643965

We can also demonstrate how apparently ‘random’ the LCG is by plotting a sample generation in 3 dimensions. To do this, we generate three random vectors x, y, z using our LCG above and plot them. The plot3D package is used to create the scatterplot, and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.rand(n[i])
    y <- lcg.rand(n[i])
    z <- lcg.rand(n[i])

    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20, main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg.gif')

Linear Congruential Generator plot of generated points

As n increases, the LCG appears to be ‘random’ enough as demonstrated by the cloud of points.

Linear Congruential Generators with Poor Parameters

The values chosen for the parameters m, a and c are very important in determining how ‘random’ the values generated by the linear congruential generator are; this is why there are so many different parameter sets in use today, as there has not yet been a clear consensus on the ‘best’ parameters to use.

We can demonstrate how choosing poor parameters for our LCG leads to not so random generated values by creating a new LCG function.

lcg.poor <- function(n = 10) {

  rng <- vector(length = n)

  # Parameters taken from https://www.mimuw.edu.pl/~apalczew/CFP_lecture3.pdf
  m <- 2048
  a <- 1229
  c <- 1
  d <- as.numeric(Sys.time()) * 1000

  for (i in 1:n) {
    d <- (a * d + c) %% m
    rng[i] <- d / m
  }

  return(rng)
}

Generating successively longer vectors using the ‘poor’ LCG and plotting as we did previously, we see the generated points are very sequentially correlated, and there doesn’t appear to be any ‘randomness’ at all as n increases.

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lcg.poor(n[i])
    y <- lcg.poor(n[i])
    z <- lcg.poor(n[i])

    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20, main = paste('n = ', n[i]))
  }
}, movie.name = 'lcg_poor.gif')

Linear congruential generator with poor parameters

References

Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

The post Linear Congruential Generator in R appeared first on Aaron Schlegel.



Gradient boosting in R


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Boosting is another famous ensemble learning technique. Unlike bagging, where the aim is to reduce the high variance of learners by averaging lots of models fitted on bootstrapped data samples generated with replacement from the training data (so as to avoid overfitting), boosting is not primarily concerned with reducing variance.

Another major difference between the two techniques is that in bagging the various models are generated independently of each other and carry equal weight, whereas boosting is a sequential process in which each new model is added so as to improve a bit on the previous one. Simply put, each model added to the mix tries to improve on the performance of the previous collection of models. In boosting we do weighted averaging.

Both ensemble techniques have in common that they generate lots of models on the training data and use their combined power to increase the accuracy of the final model formed by combining them.

Boosting, however, is geared more towards bias, i.e. simple learners, or more specifically weak learners. A weak learner is a learner that always learns something, i.e. does better than chance, and has an error rate of less than 50%. The best example of a weak learner is a decision tree, which is why we generally use ensemble techniques on decision trees to improve their accuracy and performance.

In boosting, each tree or model is grown or trained using the hard examples, by which I mean all the training examples \( (x_i, y_i) \) for which a previous model produced an incorrect output. Boosting boosts the performance of a simple base learner by iteratively shifting the focus towards problematic training observations that are difficult to predict. Information from the previous model is fed to the next model, and every new tree added to the mix will do better than the previous one because it learns from the mistakes of the previous models and tries not to repeat them. By this technique, boosting eventually converts a weak learner into a strong learner that generalizes more accurately to unseen test examples.

An important thing to remember in boosting is that the base learner being boosted should not be a complex learner with high variance, e.g. a neural network with lots of nodes and large weight values. For such learners, boosting can have the opposite effect.

So I will explain boosting with respect to decision trees in this tutorial, because they can be regarded as weak learners most of the time. We will generate a gradient boosting model.

Gradient boosting generates learners using the same general boosting process. It first builds a learner to predict the values/labels of the samples and calculates the loss (the difference between the outcome of the first learner and the real value). It then builds a second learner to predict the loss after the first step, and continues with a third, a fourth, and so on until a certain threshold is reached. Gradient boosting identifies hard examples by calculating the large residuals \( (y_{actual} - y_{pred}) \) computed in the previous iterations.
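
Before turning to gbm, here is a small hand-rolled sketch of the residual-fitting idea described above; it is only an illustration using rpart stumps on simulated data, not what gbm does internally:

library(rpart)

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)

shrinkage <- 0.1
n_trees <- 100
pred <- rep(mean(dat$y), nrow(dat))   # start from a constant prediction
trees <- vector("list", n_trees)

for (b in 1:n_trees) {
  dat$resid <- dat$y - pred                                   # current residuals: the "hard" part of the data
  trees[[b]] <- rpart(resid ~ x, data = dat, maxdepth = 2)    # weak learner fitted to the residuals
  pred <- pred + shrinkage * predict(trees[[b]], dat)         # shrunken update of the ensemble prediction
}

mean((dat$y - pred)^2)   # training MSE after boosting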

Implementing Gradient Boosting

Let’s use gbm package in R to fit gradient boosting model.

require(gbm)
require(MASS)   # package with the Boston housing dataset

# separating training and test data
train <- sample(1:506, size = 374)

We will use the Boston housing data to predict the median value of the houses.

Boston.boost <- gbm(medv ~ . , data = Boston[train, ], distribution = "gaussian",
                    n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
Boston.boost
summary(Boston.boost)   # Summary gives a table of Variable Importance and a plot of Variable Importance

gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[-train, ],
    n.trees = 10000, interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
10000 iterations were performed.
There were 13 predictors of which 13 had non-zero influence.

> summary(Boston.boost)
            var     rel.inf
rm           rm 36.96963915
lstat     lstat 24.40113288
dis         dis 10.67520770
crim       crim  8.61298346
age         age  4.86776735
black     black  4.23048222
nox         nox  4.06930868
ptratio ptratio  2.21423811
tax         tax  1.73154882
rad         rad  1.04400159
indus     indus  0.80564216
chas       chas  0.28507720
zn           zn  0.09297068

The above model is a gradient boosted model which generates 10,000 trees, with shrinkage parameter \(\lambda = 0.01\), which is also a sort of learning rate. The next parameter is the interaction depth, which is the number of splits we want to perform in each tree; here each tree is a small tree with only 4 splits. The summary of the model gives a feature importance table and plot: the most important variable is at the top of the list and the least important at the bottom. The 2 most important features, which explain the maximum variance in the data set, are lstat, i.e. the lower status of the population (percent), and rm, the average number of rooms per dwelling.

Variable importance plot

Plotting the Partial Dependence Plot

The partial Dependence Plots will tell us the relationship and dependence of the variables \(X_i\) with the Response variable \(Y\).

# Plot of Response variable with lstat variable
plot(Boston.boost, i = "lstat")   # Inverse relation with lstat variable
plot(Boston.boost, i = "rm")      # as the average number of rooms increases, the price increases

The above plot simply shows the relation between the variables on the x-axis and the mapping function \(f(x)\) on the y-axis. The first plot shows that lstat is negatively correlated with the response medv, whereas the second one shows that rm is somewhat directly related to medv.

cor(Boston$lstat, Boston$medv)   # negative correlation coefficient r
cor(Boston$rm, Boston$medv)      # positive correlation coefficient r

> cor(Boston$lstat, Boston$medv)
[1] -0.7376627
> cor(Boston$rm, Boston$medv)
[1] 0.6953599

Prediction on Test Set

We will compute the test error as a function of the number of trees.

n.trees <- seq(from = 100, to = 10000, by = 100)   # number of trees - a vector of 100 values

# Generating a Prediction matrix for each Tree
predmatrix <- predict(Boston.boost, Boston[-train, ], n.trees = n.trees)
dim(predmatrix)    # dimensions of the Prediction Matrix

# Calculating The Mean squared Test Error
test.error <- with(Boston[-train, ], apply((predmatrix - medv)^2, 2, mean))
head(test.error)   # contains the Mean squared test error for each of the 100 trees averaged

# Plotting the test error vs number of trees
plot(n.trees, test.error, pch = 19, col = "blue", xlab = "Number of Trees",
     ylab = "Test Error", main = "Performance of Boosting on Test Set")

# adding the Random Forests Minimum Error line trained on same data and similar parameters
abline(h = min(test.err), col = "red")   # test.err is the test error of a Random forest fitted on same data
legend("topright", c("Minimum Test error Line for Random Forests"), col = "red", lty = 1, lwd = 1)

dim(predmatrix)
[1] 206 100
head(test.error)
      100       200       300       400       500       600 
26.428346 14.938232 11.232557  9.221813  7.873472  6.911313 

In the above plot, the red line represents the minimum error obtained from training a random forest with the same data, the same parameters and the same number of trees. Boosting outperforms random forests on the same test dataset, with a lower mean squared test error.

Conclusion

From the above plot we can see that if boosting is done properly, by selecting appropriate tuning parameters such as the shrinkage parameter \(\lambda\), the number of splits we want and the number of trees \(n\), then it can generalize really well and convert a weak learner into a strong learner. Ensemble techniques work really well and tend to outperform a single learner, which is prone to either overfitting or underfitting: we generate hundreds or thousands of learners and then combine them to produce a better and stronger model.

Hope you guys liked the article, make sure to like and share. Cheers!!.

    Related Post

    1. Radial kernel Support Vector Classifier
    2. Random Forests in R
    3. Network analysis of Game of Thrones
    4. Structural Changes in Global Warming
    5. Deep Learning with R

    Newer to R? rstudio::conf 2018 is for you! Early bird pricing ends August 31.


    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    Immersion is among the most effective ways to learn any language. Immersing where new and advanced users come together to improve their use of the R language is a rare opportunity. rstudio::conf 2018 is that time and place!

    REGISTER TODAY

    Be an Early Bird! Discounts for early conference registration expire August 31. Immerse as a team! Ask us about group discounts for 5 or more from the same organization.

    rstudio::conf 2018 is a two-day conference with optional two-day workshops. One of the conference tracks will focus on topics for newer R users. Newer R users will learn about the best ways to use R, to avoid common pitfalls and accelerate proficiency. Several workshops are also designed specifically for those newer to R.

    Intro to R & RStudio

    • Are you new to R & RStudio and do you learn best in person? You will learn the basics of R and data science, and practice using the RStudio IDE (integrated development environment) and R Notebooks. We will have a team of TAs on hand to show you the ropes and help you out when you get stuck.

      This course is taught by well-known R educator and friend of RStudio, Amelia McNamara, a Smith College Visiting Assistant Professor of Statistical and Data Sciences & Mass Mutual Faculty Fellow.

    Data Science in the Tidyverse

    • Are you ready to begin applying the book, R for Data Science? Learn how to achieve your data analysis goals the “tidy” way. You will visualize, transform, and model data in R and work with date-times, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats.

      This course is taught by friend of RStudio, Charlotte Wickham, a professor and award winning teacher and data analyst at Oregon State University.

    Intro to Shiny & R Markdown

    • Do you want to share your data analysis with others in effective ways? For people who know their way around the RStudio IDE and R at least a little, this workshop will help you become proficient in Shiny application development and using R Markdown to communicate insights from data analysis to others.

      This course is taught by Mine Çetinkaya-Rundel, Duke professor and RStudio professional educator. Mine is well known for her open education efforts, and her popular data science MOOCs.

    Whether you are new to the R language or as advanced as many of our speakers and educators, rstudio::conf 2018 is the place and time to focus on all things R & RStudio.

    We hope to see you in San Diego!


    BH 1.65.0-1


    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    The BH package on CRAN was updated today to version 1.65.0. BH provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R, possibly with Rcpp as well as other packages.
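
    As a hedged illustration (not part of this release announcement) of what that means in practice, a Boost header shipped by BH can be used from R via Rcpp; the common_factor.hpp header and boost::math::gcd are assumed to be among the shipped headers:

    library(Rcpp)

    # compile a tiny function against the BH headers; no system-wide Boost needed
    cppFunction(depends = "BH",
                includes = "#include <boost/math/common_factor.hpp>",
                code = "int boostGcd(int a, int b) { return boost::math::gcd(a, b); }")

    boostGcd(48, 36)   # 12

    The headers come from the BH package itself, which is the whole point of shipping Boost this way.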

    This release upgrades the version of Boost to the rather new upstream version Boost 1.65.0 released earlier this week, and adds two new libraries: align and sort.

    I had started the upgrade process a few days ago under release 1.64.0. Rigorous checking of reverse dependencies showed that mvnfast needed a small change (which was trivial: just seeding the RNG prior to running tests), which Matteo did in no time with a fresh CRAN upload. rstan needs a bit more work but should be ready real soon now and we are awaiting a new version. And once I switched to the just-released Boost 1.65.0 it became apparent that Cyclops no longer needs its embedded copy of Boost iterator—and Marc already made that change with yet another fresh CRAN upload. It is a true pleasure to work in such a responsive and collaborative community.

    Changes in version 1.65.0-1 (2017-08-24)

    • Upgraded to Boost 1.64 and then 1.65 installed directly from upstream source with several minor tweaks (as before)

    • Fourth tweak corrects a misplaced curly brace (see the Boost ublas GitHub repo and its issue #40)

    • Added Boost align (as requested in #32)

    • Added Boost sort (as requested in #35)

    • Added Boost multiprecision by fixing a script typo (as requested in #42)

    • Updated Travis CI support via newer run.sh

    Via CRANberries, there is a diffstat report relative to the previous release.

    Comments and suggestions are welcome via the mailing list or the issue tracker at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    How to prepare and apply machine learning to your dataset


    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    INTRODUCTION

    Dear reader,

    If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

    This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

    In this step-by-step tutorial you will:

    1. Use one of the most popular machine learning packages in R.
    2. Explore a dataset by using statistical summaries and data visualization.
    3. Build 5 machine-learning models, pick the best, and build confidence that the accuracy is reliable.

    The process of a machine learning project may not be exactly the same, but there are certain standard and necessary steps:

    1. Define Problem.
    2. Prepare Data.
    3. Evaluate Algorithms.
    4. Improve Results.
    5. Present Results.

    1. PACKAGE INSTALLATION & DATA SET

    The first thing you have to do is install and load the "caret" package with:

    install.packages("caret")
    library(caret)

    Moreover, we need a dataset to work with. The dataset we chose in our case is "iris," which contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species. To attach it to the environment, use:

    data(iris)

    1.1 Create a Validation Dataset

    First of all, we need a way to know that the models we create are any good. Later, we will use statistical methods to estimate the accuracy of the models on unseen data. To be sure about the accuracy of the best model on unseen data, we will evaluate it on actual unseen data. To do this, we will hold back some data that the algorithms will not see, and use this data later to get a second and independent idea of how accurate the best model really is.

    We will split the loaded dataset into two: 80% of it we will use to train our models and 20% we will hold back as a validation dataset. Look at the example below:

    # the tutorial works on a copy of iris named "dataset"
    dataset <- iris
    # create a list of 80% of the rows in the original dataset to use for training
    validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
    # select 20% of the data for validation
    validation <- dataset[-validation_index,]
    # use the remaining 80% of the data for training and testing the models
    dataset <- dataset[validation_index,]

    You now have training data in the dataset variable and a validation set that will be used later in the validation variable.

    Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

    • Create a machine learning algorithm from a beginner point of view
    • Quickly dive into more advanced methods in an accessible pace and with more explanations
    • And much more

    This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.

    2. DATASET SUMMARY

    In this step, we are going to explore our data set. More specifically, we need to know certain features of our dataset, like:

1. Dimensions of the dataset.
2. Types of the attributes.
3. Details of the data.
4. Levels of the class attribute.
5. Analysis of the instances in each class.
6. Statistical summary of all attributes.

    2.1 Dimensions of Dataset

We can see how many instances (rows) and how many attributes (columns) the data contains with the dim() function. Look at the example below:

dim(dataset)

    2.2 Types of Attributes

Knowing the types is important, as it can help you summarize the data you have and identify the transformations you might need to apply before modeling. The attributes could be doubles, integers, strings, factors or other types. You can find out with:

sapply(dataset, class)

    2.3 Details of the Data

You can take a look at the first few rows of the data (head() shows six by default) with:

head(dataset)

    2.4 Levels of the Class

The class variable is a factor that has multiple class labels or levels. Let’s look at the levels:

levels(dataset$Species)

There are two types of classification problems: multinomial, like this one with three levels, and binary, where there are only two levels.

    2.5 Class Distribution

Let’s now take a look at the number of instances that belong to each class. We can view this as an absolute count and as a percentage with:

percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)

    2.6 Statistical Summary

This includes the mean, the min and max values, as well as some percentiles. Look at the example below:

summary(dataset)

    3. DATASET VISUALIZATION

    We now have a basic idea about the data. We need to extend that with some visualizations, and for that reason we are going to use two types of plots:

1. Univariate plots to understand each attribute.
2. Multivariate plots to understand the relationships between attributes.

    3.1 Univariate Plots

We can look at the input attributes and the output attribute separately. Let’s set that up, calling the input attributes x and the output attribute y:

x <- dataset[,1:4]
y <- dataset[,5]

Since the input variables are numeric, we can create box and whisker plots of each one with:

par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(iris)[i])
}

We can also create a barplot of the Species class variable to graphically display the class distribution.

plot(y)

    3.2 Multivariate Plots

First, we create scatterplots of all pairs of attributes and color the points by class. Then, we can draw ellipses around them to make them more easily separated. You have to install and load the “ellipse” package to do this.

install.packages("ellipse")
library(ellipse)
featurePlot(x=x, y=y, plot="ellipse")

We can also create box and whisker plots of each input variable, but this time they are broken down into separate plots for each class.

featurePlot(x=x, y=y, plot="box")

Next, we can get an idea of the distribution of each attribute. We will use some probability density plots to give smooth lines for each distribution.

scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)

    4. ALGORITHMS EVALUATION

    Now it is time to create some models of the data and estimate their accuracy on unseen data.

1. Set up the test harness to use 10-fold cross-validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.

    4.1 Test Harness

This will split our dataset into 10 parts: train on 9, test on 1, and repeat for all combinations of train-test splits.

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

We are using the “Accuracy” metric to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage.
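
As a quick illustration (not part of the original tutorial), accuracy can be computed by hand from a vector of predictions and a vector of true labels; the values below are made up for the example:

# hypothetical predictions and true labels, for illustration only
truth <- factor(c("setosa", "setosa", "versicolor", "virginica"))
preds <- factor(c("setosa", "versicolor", "versicolor", "virginica"))
# proportion of correct predictions, expressed as a percentage
accuracy <- sum(preds == truth) / length(truth) * 100
accuracy  # 75, since 3 of the 4 predictions are correct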

    4.2 Build Models

    We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that we created earlier.

We will evaluate the following algorithms:

1. Linear Discriminant Analysis (LDA).
2. Classification and Regression Trees (CART).
3. k-Nearest Neighbors (kNN).
4. Support Vector Machines (SVM) with a radial kernel (as used in the code below).
5. Random Forest (RF).

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits, which makes the results directly comparable.

    NOTE: To proceed, first install and load the following packages: “rpart”, “kernlab”, “e1071” and “randomForest”.

Let’s build our five models:

# a) linear algorithms
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)

    4.3 Select the Best Model

    We now have 5 models and accuracy estimations for each so we have to compare them.

It is a good idea to create a list of the created models and use the summary function.

results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)

Moreover, we can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times.

dotplot(results)

You can summarize the results for just the LDA model, which seems to be the most accurate.

print(fit.lda)

5. MAKE PREDICTIONS

    The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.

predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)


    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


    Tips and tricks on using R to query data in Power BI


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In Power BI, the dashboarding and reporting tool, you can use R to filter, transform, or restructure data via the Query Editor. For example, you could use the mice package to impute missing values, or use the tidytext package to assign sentiment scores to text inputs. As Imke Feldmann explains, there are lots of useful tricks you can accomplish using R in the query editor, including:

    • Passing multiple Power BI tables into the R script
    • Parameterizing R scripts
    • Modifying file names and output data types for use in Power BI
    • Generating multiple table outputs from an R script

    For the details on these tips, follow the link below. You can also find tips on using R in the Query Editor here.
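
As a rough illustration of what such a step can look like, here is a minimal sketch of an R script for the Query Editor that imputes missing values with the mice package. It assumes the conventions described in the linked posts: the incoming table is exposed to the script as a data frame named dataset, and any data frame left in the environment becomes an output table; this is a sketch, not code from the post.

# minimal sketch of a "Run R script" step (assumptions noted above)
library(mice)

# 'dataset' is assumed to be the table handed in by Power BI
imputed <- mice(dataset, m = 1, maxit = 5, printFlag = FALSE)

# complete() fills in the missing values from the imputation model;
# the resulting data frame can be picked up as the step's output
output <- complete(imputed, 1)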

    The Biccountant: Tips and Tricks for R scripts in the query editor in Power BI


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Stan Weekly Roundup, 25 August 2017


    (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

    This week, the entire Columbia portion of the Stan team is out of the office and we didn’t have an in-person/online meeting this Thursday. Mitzi and I are on vacation, and everyone else is either teaching, TA-ing, or attending the Stan course. Luckily for this report, there’s been some great activity out of the meeting even if I don’t have a report of what everyone around Columbia has been up to. If a picture’s really worth a thousand words, this is the longest report yet.

    • Ari Hartikainen has produced some absolutely beautiful parallel coordinate plots of HMC divergences* for multiple parameters. The divergent transitions are shown in green and the lines connect a single draw. The top plot is unnormalized, whereas the bottom scales all parameters to a [0, 1] range.

      Ari's divergence plot

      You can follow the ongoing discussion on the forum thread. There are some further plots for larger models and some comparisons with the pairs plots that Michael Betancourt has been recommending for the same purpose (the problem with pairs is that it’s very very slow, at least in RStan, because it has to draw quadratically many plots).

    • Sebastian Weber has a complete working prototype of the MPI (multi-core parallelization) in place and has some beautiful results to report. The first graph is the speedup he achieved on a 20-core server (all in one box with shared memory):

Sebastian's MPI speedup plot

      The second graph shows what happens when the problem size grows (those bottom numbers on the x-axis are the number of ODE systems being solved, whereas the top number remains the number of cores used).

      Sebastian's weak scaling plot

As with Ari’s plots, you can follow the ongoing discussion on the forum thread. And if you know something about MPI, you can even help out. Sebastian’s been asking if anyone who knows MPI would like to check his work—he’s learning it as he goes (and doing a bang-up job of it, I might add!).

     

    These lists are incomplete

    After doing a handful of these reports, I’m sorry to say you’re only seeing a very biased selection of activity around Stan. For the full story, I’d encourage you to jump onto our forums or GitHub (warning: very high traffic, even if you focus).


     * Divergences in Stan arise when the Hamiltonian, which should be conserved across a trajectory, diverges—it’s basically a numerical simulation problem—if we could perfectly follow the Hamiltonian through complex geometries, there wouldn’t be any divergences. This is a great diagnostic mechanism to signal something’s going wrong and resulting estimates might be biased. It may seem to make HMC more fragile, but the problem is that Gibbs and Metropolis will fail silently in a lot of these situations (though BUGS will often help you out of numerical issues by crashing).
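
For readers who want to check this diagnostic on their own fits, here is a minimal sketch in RStan (my own illustration, not from the roundup); it assumes a stanfit object named fit:

library(rstan)

# one matrix of sampler diagnostics per chain, excluding warmup
sp <- get_sampler_params(fit, inc_warmup = FALSE)

# 'divergent__' is 1 for iterations that ended in a divergent transition
n_divergent <- sum(sapply(sp, function(chain) sum(chain[, "divergent__"])))
n_divergent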

    The post Stan Weekly Roundup, 25 August 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


    To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.


    Tidy evaluation, most common actions


    (This article was first published on That’s so Random, and kindly contributed to R-bloggers)

    Tidy evaluation is a bit challenging to get your head around. Even after reading programming with dplyr several times, I still struggle when creating functions from time to time. I made a small summary of the most common actions I perform, so I don’t have to dig in the vignettes and on stackoverflow over and over. Each is accompanied with a minimal example on how to implement it. I thought others might find this useful too, so here it is in a blog post. This list is meant to be a living thing so additions and improvements are most welcome. Please do a PR on this file or send an email.

    library(tidyverse)

    bare to quosure: quo

bare_to_quo <- function(x, var) {
  x %>% select(!!var) %>% head(1)
}
bare_to_quo(mtcars, quo(cyl))
    ##           cyl
    ## Mazda RX4   6
    

    bare to quosure in function: enquo

bare_to_quo_in_func <- function(x, var) {
  var_enq <- enquo(var)
  x %>% select(!!var_enq) %>% head(1)
}
bare_to_quo_in_func(mtcars, mpg)
    ##           mpg
    ## Mazda RX4  21
    

    quosure to a name: quo_name

bare_to_name <- function(x, nm) {
  nm_name <- quo_name(nm)
  x %>% mutate(!!nm_name := 42) %>% head(1) %>% select(!!nm)
}
bare_to_name(mtcars, quo(this_is_42))
    ##   this_is_42
    ## 1         42
    

    quosure to text: quo_text

quo_to_text <- function(x, var) {
  var_enq <- enquo(var)
  ggplot(x, aes_string(rlang::quo_text(var_enq))) + geom_density()
}
plt <- quo_to_text(mtcars, cyl)

    Note that tidy evaluation is not yet implemented in ggplot2, but this will be in future versions. This is a workaround for the meantime, when combining dplyr and ggplot2.

    character to quosure: sym

char_to_quo <- function(x, var) {
  var_enq <- rlang::sym(var)
  x %>% select(!!var_enq) %>% head(1)
}
char_to_quo(mtcars, "vs")
    ##           vs
    ## Mazda RX4  0
    

    multiple bares to quosure: quos

bare_to_quo_mult <- function(x, ...) {
  grouping <- quos(...)
  x %>% group_by(!!!grouping) %>% summarise(nr = n())
}
bare_to_quo_mult(mtcars, vs, cyl)
    ## # A tibble: 5 x 3
    ## # Groups:   vs [?]
    ##      vs   cyl    nr
##   <dbl> <dbl> <int>
    ## 1     0     4     1
    ## 2     0     6     3
    ## 3     0     8    14
    ## 4     1     4    10
    ## 5     1     6     4
    

    multiple characters to quosure: syms

bare_to_quo_mult_chars <- function(x, ...) {
  grouping <- rlang::syms(...)
  x %>% group_by(!!!grouping) %>% summarise(nr = n())
}
bare_to_quo_mult_chars(mtcars, list("vs", "cyl"))
    ## # A tibble: 5 x 3
    ## # Groups:   vs [?]
    ##      vs   cyl    nr
##   <dbl> <dbl> <int>
    ## 1     0     4     1
    ## 2     0     6     3
    ## 3     0     8    14
    ## 4     1     4    10
    ## 5     1     6     4
    

    quoting full expressions

Although quoting column names is the most common use, it is by no means the only option. We can use the above to quote full expressions.

filter_func <- function(x, filter_exp) {
  filter_exp_enq <- enquo(filter_exp)
  x %>% filter(!!filter_exp_enq)
}
filter_func(mtcars, hp == 93)
    ##    mpg cyl disp hp drat   wt  qsec vs am gear carb
    ## 1 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1
    

    To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.



    Unbottling “.msg” Files in R


    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format, which helped me throw together a small R package, msgxtractr, that can read in these ‘.msg’ files and produce a list as a result.

I had previously created a quick version of this by wrapping a Python module, but that’s a path fraught with peril and it did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces from the C to provide a single function, read_msg(), that smashes open bottled-up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).

    Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.

    Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.

devtools::install_github("hrbrmstr/msgxtractr")
library(msgxtractr)
print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))
## List of 7
##  $ headers         :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  18 variables:
##   ..$ Return-path               : chr ""
##   ..$ Received                  :List of 1
##   .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Original-recipient        : chr "rfc822;brianzhou@me.com"
##   ..$ Received-SPF              : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
##   ..$ DKIM-Signature            : chr "v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
##   ..$ MIME-version              : chr "1.0"
##   ..$ X-Received                : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Date                      : chr "Mon, 18 Nov 2013 10:26:24 +0200"
##   ..$ Message-id                : chr ""
##   ..$ Subject                   : chr "Test for TIF files"
##   ..$ From                      : chr "Brian Zhou "
##   ..$ To                        : chr "brianzhou@me.com"
##   ..$ Cc                        : chr "Brian Zhou "
##   ..$ Content-type              : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
##   ..$ Authentication-results    : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__
##   ..$ x-icloud-spam-score       : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
##   ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
##   ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
##  $ sender          :List of 2
##   ..$ sender_email: chr "brizhou@gmail.com"
##   ..$ sender_name : chr "Brian Zhou"
##  $ recipients      :List of 2
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brianzhou@me.com"
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brizhou@gmail.com"
##  $ subject         : chr "Test for TIF files"
##  $ body            : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
##  $ attachments     :List of 2
##   ..$ :List of 4
##   .. ..$ filename     : chr "importOl.tif"
##   .. ..$ long_filename: chr "import OleFileIO.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:969674] 49 49 2a 00 ...
##   ..$ :List of 4
##   .. ..$ filename     : chr "raisedva.tif"
##   .. ..$ long_filename: chr "raised value error.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:1033142] 49 49 2a 00 ...
##  $ display_envelope:List of 2
##   ..$ display_cc: chr "Brian Zhou"
##   ..$ display_to: chr "brianzhou@me.com"
## NULL

NOTE: Don’t try to read those TIFF images with magick or even the tiff package. They seem to have some strange tags. But saving one (use writeBin()) and opening it with Preview (or your favorite image viewer) should work (it did for me; the original post shows the resulting image converted to PNG).
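
For example (a sketch based on the structure shown above, not code from the original post), the raw attachment bytes can be written out with writeBin() and then opened in an external viewer:

library(msgxtractr)

msg <- read_msg(system.file("extdata/unicode.msg", package = "msgxtractr"))

# write the first attachment's raw bytes to disk under its long file name
att <- msg$attachments[[1]]
writeBin(att$content, att$long_filename)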


    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


    wrapr: R Code Sweeteners


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    wrapr is an R package that supplies powerful tools for writing and debugging R code.


    Primary wrapr services include:

    • let()
    • %.>% (dot arrow pipe)
    • := (named map builder)
    • λ() (anonymous function builder)
    • DebugFnW()

    let()

    let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()).

    The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".

    library("wrapr")a <-7let(  c(VAR ='a'),    VAR +VAR) #  [1] 14

    This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.

We are trying to make let() self-teaching and self-documenting (to the extent that makes sense). For example, try the argument eval=FALSE to prevent execution and see what would have been executed, or debugPrint=TRUE to have the replaced code printed in addition to being executed:

let(
  c(VAR = 'a'),
  eval = FALSE,
  {
    VAR + VAR
  })
#  {
#      a + a
#  }

let(
  c(VAR = 'a'),
  debugPrint = TRUE,
  {
    VAR + VAR
  })
#  {
#      a + a
#  }
#  [1] 14

    Please see vignette('let', package='wrapr') for more examples. For working with dplyr 0.7.* we suggest also taking a look at an alternate approach called seplyr.

    %.>% (dot arrow pipe)

    %.>% dot arrow pipe is a strict pipe with intended semantics:

    "a %.>% b" is to be treated as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative, and .-side effects removed.

    That is: %.>% does not alter any function arguments that are not explicitly named. It is not defined as a %.% b ~ b(a) (roughly dplyr‘s original pipe) or as the large set of differing cases constituting magrittr::%>%. %.>% is designed to be explicit and simple.

The effect is shown below.

    The following two expressions should be equivalent:

cos(exp(sin(4)))
#  [1] 0.8919465

4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
#  [1] 0.8919465

    The notation is quite powerful as it treats pipe stages as expression parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

1:4 %.>% .^2
#  [1]  1  4  9 16

The notation is also very regular, in that expressions have the same interpretation whether surrounded by parentheses, braces, or left as-is:

1:4 %.>% { .^2 }
#  [1]  1  4  9 16

1:4 %.>% ( .^2 )
#  [1]  1  4  9 16

    Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details.

    :=

    := is the "named map builder". It allows code such as the following:

'a' := 'x'
#    a 
#  "x"

The important property of the named map builder is that it accepts values on the left-hand side, allowing the following:

name <- 'variableNameFromElseWhere'
name := 'newBinding'
#  variableNameFromElseWhere 
#               "newBinding"

A nice property is that := commutes (in the sense of algebra or category theory) with R’s concatenation function c(). That is, the following two statements are equivalent:

c('a', 'b') := c('x', 'y')
#    a   b 
#  "x" "y"

c('a' := 'x', 'b' := 'y')
#    a   b 
#  "x" "y"

    λ()

    λ() is a concise abstract function creator. It is a placeholder that allows the use of the λ-character for very concise function abstraction.

    Example:

# Make sure the lambda function builder is in our environment.
if(exists('defineLambda',
          envir = as.environment('package:wrapr'),
          mode = 'function')) {
  # Note: prior to version 0.4.2 wrapr
  # loaded a lambda-definition during
  # package load.  The following explicit
  # wrapr::defineLambda() is more polite.
  wrapr::defineLambda()
}

# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
#  [1]  1  4  9 16

    DebugFnW()

    DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even re-run with a step-debugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
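
A minimal sketch of the intended workflow (my own illustration, not from the post; see the vignette for the authoritative version), assuming the capture is written to a global variable named lastError:

library("wrapr")

f <- function(i) { (1:10)[[i]] }           # fails for i > 10

fd <- DebugFnW(as.name('lastError'), f)    # wrapped copy of f

tryCatch(fd(20), error = function(e) print(e))

# the captured call (function plus arguments) is now stored in lastError
# and can be inspected, or re-run under a step debugger
str(lastError)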


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    How much will that Texas rain be


    (This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

    PUTTING 35 INCHES OF RAIN IN PERSPECTIVE

We are always interested in putting numbers into perspective, so we were interested in this article, in which Hurricane Harvey’s rain is put into perspective.

    They’re predicting 30-40 inches of rain in a few days in Texas. They asked an expert to put that into perspective and he said:

    Let’s put it in context. Much of the Northeast Corridor — Washington to New York and Boston — maybe receives maybe between 40 and 45 inches of rain a year. Think of all the rain you get in July through Christmas and put that in a couple days. It’s a lot of rain.

    It’s easy for us to think in terms of New York City, so we looked up some weather data. See the table at the top of this post (all figures are in inches).

The first thing we notice is that the expert understated things, for New York at least. Thirty-five inches would be equivalent to all the rain in NYC from April (not July) to December, inclusive.

    But we agree that it’s a lot of rain.

    We’ve always had trouble putting rain forecasts into perspective, so here are some rules of thumb we figured out from the data that we’re going to memorize. If you live in the corridor from DC to Boston, you may find these useful.

    • The average amount of rain per rainy day in NYC is .38 inches, which conveniently is about 1 cm.
    • When you hear it’s going to rain 1 cm or 3/8 inch, you can think “no big deal, that’s a typical NYC rainy day”.
    • If you hear it’s going to rain an inch, you can think “oh darn, that’s like three rainy days worth”.

    Here’s R-markdown code if you want to play around:
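
(The embedded code is not reproduced in this archive; below is a minimal sketch of the kind of calculation described above, assuming a hypothetical CSV nyc_rain.csv with columns month, rain_inches and rainy_days.)

library(dplyr)

# hypothetical file: monthly rainfall normals for NYC and counts of rainy days
rain <- read.csv("nyc_rain.csv")

rain %>%
  summarise(
    annual_inches        = sum(rain_inches),
    inches_per_rainy_day = sum(rain_inches) / sum(rainy_days)  # roughly 0.38 in the post
  )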

    The post How much will that Texas rain be appeared first on Decision Science News.


    To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.


    Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-7)


    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistics problems by writing simple R queries and how to think in probabilistic terms.

Until now, we used random variable simulation and bootstrapping to test hypotheses and compute statistics of a single sample. In today’s set, we’ll learn how to use permutations to test hypotheses about two different samples and how to adapt bootstrapping to this situation.

    Answers to the exercises are available here.

    Exercise 1

    1. Generate 500 points from a beta distribution of parameter a=2 and b=1.5, then store the result in a vector named beta1.
    2. Generate 500 points from the same distribution and store those points in a vector named beta2.
    3. Concatenate both vectors to create a vector called beta.data.
    4. Plot the ecdf of beta1 and beta2.
    5. Sample 500 points from beta.data and plot the ecdf of this sample. Repeat this process 5 times.
6. Do all those samples share the same distribution and, if so, what is that distribution?

Exercise 2 When we test a hypothesis, we suppose that the hypothesis is true, we simulate what would happen if that were the case, and if our initial observation would occur less than α percent of the time we reject the hypothesis. Now, from the first exercise, we know that if two samples share the same distribution, we can assume that any sample drawn from their pooled observations will follow the same distribution. In particular, if we pool the observations from a sample of size n1 and those of a sample of size n2, shuffle them and draw two new samples of size n1 and n2, they should all have a similar CDF. We can use this fact to test the hypothesis that two samples have the same distribution. This process is called a permutation test.
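
As an illustration of the idea (not the answer to the exercises below, and using made-up toy samples), pooling two samples, shuffling and re-splitting looks like this:

set.seed(42)
sample1 <- rnorm(30)
sample2 <- rnorm(40)

pooled   <- c(sample1, sample2)
shuffled <- sample(pooled)                       # permute the pooled observations

perm1 <- shuffled[1:length(sample1)]             # same size as the first sample
perm2 <- shuffled[(length(sample1) + 1):length(pooled)]

# under the null hypothesis, the permuted samples should look like the originals
plot(ecdf(perm1))
plot(ecdf(perm2), add = TRUE, col = "red")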

Load this dataset, where each column represents a variable, and we want to know whether they are identically distributed. Each exercise below follows a step of a permutation test.

1. What are the null and alternative hypotheses for this test?
2. Concatenate both samples into a new vector called data.ex.2.
3. Write a function that takes data.ex.2 and the sizes of both samples as arguments, creates a temporary vector by permuting data.ex.2 and returns two new samples. The first sample has the same number of observations as the first column of the dataset; the second is made from the rest of the observations. Name this function permutation.sample (we will use it in the next exercise). Why do we want the function to return samples of those sizes?
4. Plot the ECDF of both initial variables in black.
5. Use the function permutation.sample 100 times to generate permuted samples, then compute the ECDF of those samples and add the plots of those curves to the previous plot. Use the color red for the first batch of samples and green for the second batch.
6. By looking at the plot, can you tell whether the null hypothesis is true?

Exercise 3 A business analyst thinks that the daily returns of the Apple stock follow a normal distribution with a mean of 0 and a standard deviation of 0.1. Use this dataset of the daily returns of that stock for the last 10 years to test this hypothesis.

    Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

• Work with different binomial and logistic regression techniques
    • Know how to compare regression models and choose the right fit
    • And much more

Exercise 4 A permutation test can help us verify whether two samples come from the same distribution, and if this is true, we can conclude that both samples share the same statistics. As a consequence, a permutation test can also be used to test whether a statistic of two samples is the same. One really useful application of this is to test whether two means are the same or significantly different (as you have probably realized by now, statisticians are obsessed with means and love to spend time studying them!). In this situation, the question is to determine whether the difference of means between two samples is random or a consequence of a difference in distribution.

You should be quite familiar with tests by now, so how would you proceed to do a permutation test to verify whether two means are equal? Use that process to test the equality of the means of both samples in this dataset.

Exercise 5 Looking at the average annual wages of the United States and Switzerland, both countries have roughly the same level of wealth, since those statistics are 60,154 and 60,124 US dollars respectively. In this dataset, you will find simulated annual wages of citizens of both countries. Test the hypothesis that Americans and the Swiss have the same average annual wage, based on those samples, at a level of 5%.

Exercise 6 To test whether two samples from different distributions have the same statistic, we cannot use the permutation test; we instead use bootstrapping. To test whether two samples have the same mean, for example, you should follow these steps (a minimal sketch follows the list):

    1. Formulate a null and an alternative hypothesis.
    2. Set a significance level.
3. Compute the difference of the means of both samples. This will be the reference value we will use to compute the p-value.
    4. Concatenate both samples and compute the mean of this new dataset.
    5. Shift both samples so that they share the mean of the concatenated dataset.
    6. Use bootstrap to generate an estimate of the mean of both shifted samples.
    7. Compute the difference of both means.
    8. Repeat the last two steps at least 1000 times.
    9. Compute the p-value and draw a conclusion.
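
A minimal sketch of these steps for two generic numeric vectors x and y (my own illustration, not the exercise solution):

boot_mean_diff_test <- function(x, y, n_rep = 1000) {
  obs_diff  <- mean(x) - mean(y)                 # step 3: reference value
  pooled_mu <- mean(c(x, y))                     # step 4
  x_shift   <- x - mean(x) + pooled_mu           # step 5: impose a common mean
  y_shift   <- y - mean(y) + pooled_mu
  diffs <- replicate(n_rep, {                    # steps 6 to 8: bootstrap replicates
    mean(sample(x_shift, length(x), replace = TRUE)) -
      mean(sample(y_shift, length(y), replace = TRUE))
  })
  mean(abs(diffs) >= abs(obs_diff))              # step 9: two-sided p-value
}
# usage: boot_mean_diff_test(sample_a, sample_b) for any two numeric vectors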

Use the dataset from the last exercise to see whether the USA and Switzerland have the same average wage at a level of 5%.

    Exercise 7 Test the hypothesis that both samples in this dataset have the same mean.

Exercise 8 R has functions that use analytic methods to test whether two samples have equal means.

    1. Use the t.test() function to test the equality of the mean of the samples of the last exercise.
2. Use this function to test the hypothesis that the average wage in the US is bigger than in Switzerland.

Exercise 9 The globular cluster luminosity dataset lists measurements of the luminosity of clusters of stars in different regions of the Milky Way galaxy and the Andromeda galaxy. Test the hypothesis that the average luminosities of the two galaxies differ by 24.78.

Exercise 10 A company that molds aluminum for auto parts has bought a smaller company to increase the number of parts it can produce each year. In its factory, the smaller company used the standard equipment, but used a different factory layout, had a different supply line and managed its employees’ work schedules in a completely different manner than its new parent company. Before changing the company culture, the engineers in the parent company are interested to know which approach is more effective. To do so, they measured the time it took to make an auto part in each factory 150 times and created this dataset, where the first column represents the sample from the smaller factory.

1. Is the average time it takes to make a part the same in both factories?
2. Does the production time follow the same distribution in both factories?
3. If the engineers want to minimize the percentage of parts that take more than one hour to make, which setup should they implement in both factories: the one of the parent company or the one of the smaller company?

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


    NYC Citi Bike Visualization – A Hint of Future Transportation


    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Introduction

Like other sharing systems, such as Airbnb for housing and Uber for cars, Citi Bike is a network of bicycle rental stations intended for point-to-point transportation.

Citi Bike is New York City’s largest bike sharing system. It’s a convenient solution for trips that are too far to walk but too short for a taxi or the subway. For commuters, the bike sharing system combines with all the other transportation methods available in the area.

Currently, there are about a million trips per month on average by Citi Bike riders. The system has 10,000 bicycles and 610 stations. By the end of 2017, the total size of the Citi Bike system will be 12,000 bikes and 750 stations. The grey area is the current service area; the yellow and blue areas represent the sections to be covered by the end of 2017.

     

     

    The Optimization Questions

Any Citi Bike client has come up against two frustrating scenarios: the empty dock at the start and the full dock at the end of the trip. Researchers call this the “rebalancing” problem, part of the broader set of “fleet optimization” questions. The problem has attracted the attention of data scientists, who have developed complex methodologies to optimize the available bikes and open docks.

In what follows, I attempt to use the Shiny visualization app to provide a hint for three questions:

    1. Fleet Routing Pattern Detection: what are the most popular routes during peak hours and off-peak? What is the direction of the flow?
2. Station Balance Prediction: what is the average volume of imbalance in the distributed system? What are the station-level inflow and outflow? Are they sensitive to the time of day? What do they look like as a time series?
    3. Reducing rebalancing demand: What are the riders’ activities like? Is it possible to rebalance through pricing schemes?

    The visualization app is intended to provide a way to explore different comparative measures at the route, station and system levels with spatial attributes and time series.

     

    The Data

    Citi published Citi Bike Trip Histories – downloadable files of Citi Bike trip data. I used the Citi Bike data for the month of March 2017 (approximately 1 million observations). The data includes:

    • Trip Duration (in seconds)
    • Start Time and Date
    • Stop Time and Date
    • Start Station Name
    • End Station Name
    • Station ID
    • Station Lat/Long
    • Bike ID
    • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
    • Gender (Zero=unknown; 1=male; 2=female)
    • Year of Birth

    Before moving ahead with building the app, I was interested in exploring the data and identifying patterns of rebalancing.
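
As a sketch of that kind of exploration (my own illustration, not code from the original post; the file name and column names below are hypothetical and should be adjusted to the actual trip-history export):

library(dplyr)

trips <- read.csv("citibike_201703.csv")

# count departures and arrivals per station
departures <- trips %>% group_by(station = start.station.name) %>% summarise(out = n())
arrivals   <- trips %>% group_by(station = end.station.name)   %>% summarise(inn = n())

# positive net_change means docks tend to fill up; negative means bikes drain away
balance <- full_join(departures, arrivals, by = "station") %>%
  mutate(net_change = coalesce(inn, 0L) - coalesce(out, 0L)) %>%
  arrange(net_change)

head(balance, 10)  # stations that lose the most bikes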

Bar Chart 1 – Time-wise imbalance (Peak/Off-peak)

Bar Chart 2 – Location-wise imbalance (Top 10 popular stations)

     

    Insights

On the interactive map, each dot represents a station. The visualization also provides options to identify popular routes by selecting a date and hour range. The most popular routes are marked in orange as lines between the spatial points. The direction of a route runs from the redder dot towards the greener dot.

    Citi Bike Daily Migration

    The Patterns

Interesting patterns are observed. The most popular routes on the west side run through Central Park and Chelsea Piers. Routes centered on Grand Central/Penn Station are also on the hottest-route list. Outside Manhattan, there are centers in Queens and Brooklyn initiating lots of popular routes. Riders bike more along the east-west streets than along the north-south avenues. That makes sense in light of the fact that there are more uptown and downtown subways than crosstown ones, and riders do use Citi Bike as an alternative transportation option.

While there are not enough bikes available at hot pick-up stations, docks are lacking at hot drop-off stations. The red dots are where the outflow of bikes exceeds the inflow; the green dots are where the inflow exceeds the outflow. In other words, the green dots are the hot spots to pick up a bike (more inbound bikes) and the red dots (more empty docks) the places to drop one off. The more extreme the color of a dot, the higher the percentage change of flows at that station. The size and transparency of a dot represent the combined volume of inflow and outflow, so the more visible the dot, the hotter the station.

What causes the balancing problem? The map-based interactive app provides insight for predicting demand. The information displayed is the accumulated hourly variables for the selected dates. Detailed statistics are also available for each station by zooming in.

New York has a classic peak commuter flow pattern. Most commuters ride bikes towards the center of town from its edges in the morning. At the end of the day, they ride the reverse way back to the edges where they live, especially in sections with fewer public transportation options.

     

    The Rider’s Activities

What about the riders’ activities? Is there any pattern involved? The app provides insights into riders’ performance that can help reduce rebalancing demand. By studying riders’ activities, it suggests potential solutions.

Below, each bubble represents an age and gender group; the age is shown as the number on each bubble. A negative correlation is observed between age and speed: the younger the rider, the faster he or she rides. Similarly, the groups in their thirties show similar miles per trip. The performance of female and male groups also differs: the male groups, in blue, ride at a higher speed than the female groups, in red.

     

    The Balancing Solutions

Are there rebalancing solutions that cut costs and improve efficiency, instead of manually moving bikes from full stations to empty stations via trucks, bike-drawn trailers and sprinter vans? Moving bikes that way takes crews, traveling in pairs, 45 minutes just to load a truck.

Citi Bike sought a way to get the riders to move the bikes themselves. In May it started the Bike Angels pilot. The reverse commuter would be the perfect target member for the Angel program. What is so appealing about the program is that, with the proper incentives, the bike sharing system could self-distribute its fleet. A member can easily earn a $10 Amazon gift card with a few reverse trips. As a result, the demand for manually moving bikes around would decrease.

     

    The Conclusions

The visualization app provides the real-time status of the fleet: popular routes, inbound/outbound flows, net change, time series per station, hot spot analysis and riders’ activities. It supports a self-distributing fleet by establishing a baseline for identifying “healthy” rebalancing within the bike share system. It provides a hint of future transportation solutions.

     

    The App

The interactive app is available on Shiny.io.

    The post NYC Citi Bike Visualization – A Hint of Future Transportation appeared first on NYC Data Science Academy Blog.


    To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.



