
The Cycling Accident Map of Madrid City


(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Far away, this ship has taken me far away (Starlight, Muse)

Madrid City has an Open Data platform where around 300 data sets on a variety of topics can be found. One of these sets is the one I used for this experiment. It contains information about cycling accidents that happened in the city from January to July 2017. I have made a map to locate where the accidents took place. This experiment shows how R makes it very easy to create professional maps with Leaflet (in this case I use Carto basemaps).

To locate the accidents, the data set only contains the address where they happened, so the first thing I did was to obtain their geographical coordinates using the geocode function from the ggmap package. There were 431 accidents during the first 7 months of 2017 (such a big number!) and I got coordinates for 407 of them, so I can locate 94% of the accidents.

Obviously, the number of accidents in a given place depends on how many cyclists ride there as well as on its infrastructure. Neither of these things can be seen in the map: it only shows the number of accidents.

The categorization of accidents is:

  • Double collision (Colisión doble): Traffic accident between two moving vehicles.
  • Multiple collision (Colisión múltiple): Traffic accident between more than two moving vehicles.
  • Fixed object collision (Choque con objeto fijo): Accident between a moving vehicle with a driver and an immovable object occupying the road or a separated area of it, whether a parked vehicle, tree, street lamp, etc.
  • Run-over (Atropello): Accident between a vehicle and a pedestrian occupying the road or travelling on sidewalks, refuges, walkways or zones of the public road not intended for vehicle traffic.
  • Overturn (Vuelco): Accident suffered by a vehicle with more than two wheels which, owing to some circumstance, loses contact with the road and ends up resting on one side or on its roof.
  • Motorcycle fall (Caída motocicleta): Accident suffered by a motorcycle which at some moment loses its balance, because of the driver or due to the conditions of the road.
  • Moped fall (Caída ciclomotor): Accident suffered by a moped which at some moment loses its balance, because of the driver or due to the conditions of the road.
  • Bicycle fall (Caída bicicleta): Accident suffered by a bicycle which at some moment loses its balance, because of the driver or due to the conditions of the road.

These categories are redundant (e.g. Double and Multiple collision), difficult to understand (e.g. Overturn), or both at the same time (e.g. Motorcycle fall and Moped fall). This categorization also ignores the human injuries caused by the accident.

With all this in mind, this is the map:

Here is a full-screen version of the map.

My suggestions to the city council of Madrid are:

  1. Add geographical coordinates to data (I guess many of the analysis will need them)
  2. Rethink the categorization to make it clearer and more informative
  3. Add more cycling data sets to the platform (detail of bikeways, traffic …) to understand accidents better
  4. Judging only by the number of accidents, put the focus around Parque del Retiro, especially on its west surroundings, from Plaza de Cibeles to Plaza de Carlos V: more warning signs, more (or better) bikeways …

I add the code below to update the map (if someone asks me to, I can do it regularly myself):

library(dplyr)
library(stringr)
library(ggmap)
library(leaflet)
# First, getting the data
# Name of the data set file in the Madrid open data catalogue (assumed; the
# original post did not show where 'file' was defined)
file <- "300110-0-accidentes-bicicleta.csv"
download.file(paste0("http://datos.madrid.es/egob/catalogo/", file), 
              destfile="300110-0-accidentes-bicicleta.csv")

data=read.csv("300110-0-accidentes-bicicleta.csv", sep=";", skip=1)

# Prepare data for geolocation
data %>% 
  mutate(direccion=paste(str_trim(Lugar), str_trim(Numero), "MADRID, SPAIN", sep=", ") %>% 
           str_replace("NA, ", "") %>% 
           str_replace(" - ", " CON ")) -> data

# Geolocation (takes some time ...)
coords=c()
for (i in 1:nrow(data)) 
{
  coords %>% rbind(geocode(data[i,"direccion"])) -> coords
  Sys.sleep(0.5)
}
  
# Save data, just in case
data %>% cbind(coords) %>% saveRDS(file="bicicletas.RDS")

data=readRDS(file="bicicletas.RDS")

# Remove unsuccessful geolocations
data %>% 
  filter(!is.na(lon)) %>% 
  droplevels()-> data

# Build the date-time field and the popup text for each accident
data %>% mutate(Fecha=paste0(as.Date(data$Fecha, "%d/%m/%Y"), " ", TRAMO.HORARIO),
                popup=paste0("Dónde:",
                             direccion,
                             "Cuándo:",
                             Fecha,
                             "Qué pasó:",
                             Tipo.Accidente)) -> data

# Do the map
data %>% split(data$Tipo.Accidente) -> data.df

l <- leaflet() %>% addProviderTiles(providers$CartoDB.Positron)

names(data.df) %>%
  purrr::walk( function(df) {
    l <<- l %>%
      addCircleMarkers(data=data.df[[df]],
                 lng=~lon, lat=~lat,
                 popup=~popup,
                 color="red",
                 stroke=FALSE,
                 fillOpacity = 0.8,
                 group = df,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F))
  })

l %>%
  addLayersControl(
    overlayGroups = names(data.df),
    options = layersControlOptions(collapsed = FALSE)
  )
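
A full-screen, standalone version of the map like the one linked above can be exported as a self-contained HTML file. This is just one possible way to do it (not necessarily what the author used), based on the map object l built above:

# Save the leaflet map as a standalone HTML page
library(htmlwidgets)

l %>%
  addLayersControl(
    overlayGroups = names(data.df),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  saveWidget(file = "cycling_accidents_madrid.html", selfcontained = TRUE)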


Working with air quality and meteorological data Exercises (Part-2)


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to directly import data from air quality measurement networks across the UK, as well as tools for analysis and for producing reports.

In the previous exercise set we used data from the MY1 station to see how to import data and extract basic statistical information from it. In this exercise set we will use some basic and useful functions that are available in the openair package to analyze and visualize the MY1 data.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag openair

Please load the package openair before starting the exercises.

Exercise 1 Use the summaryPlot function to plot time series and histograms for pm10 and o3.

Exercise 2 Use the windRose function to plot monthly wind roses.
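
As a hedged starting point (not the official solution), importing the MY1 data with openair's AURN import function and calling the two functions above might look like this; the site code and year are assumptions based on the previous exercise set:

library(openair)

# Import Marylebone Road (MY1) data from the UK AURN network (assumed site/year)
my1 <- importAURN(site = "my1", year = 2017)

# Exercise 1: time series and histograms for selected pollutants
summaryPlot(subset(my1, select = c(date, pm10, o3)))

# Exercise 2: wind roses split by month
windRose(my1, type = "month")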

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course [Intermediate] Spatial Data Analysis with R, QGIS & More. In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 3 Use the pollutionRose function to plot monthly pollution roses for a. pm10 b. pm2.5 c. nox d. no e. o3

Exercise 4 Use pollutionRose to plot seasonal pollution roses for a. pm10 b. pm2.5 c. nox d. no e. o3

Exercise 5 Use the percentileRose function to plot monthly percentile roses for a. pm10 b. pm2.5 c. nox d. no e. o3

Exercise 6 Use the polarCluster function to plot cluster rose plots for a. pm10 b. pm2.5 c. nox d. no e. o3


3-D animations with R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is often used to visualize and animate 2-dimensional data. (Here are just a few examples.) But did you know you can create 3-dimensional animations as well?

As Thomas Lin Pedersen explains in a recent blog post, the trick is in using the persp function to translate points in 3-D space into a 2-D projection. This function is normally used to render a 3-D scatterplot or wireframe plot, but if you instead capture its return value, you get a transformation matrix. You can then use the trans3d function with this matrix to transform points in 3-D space. Thomas demonstrates how you can pass the transformed 2-D coordinates to plot a 3-D cube, and even animate it from two slightly different perspectives to create a 3-D stereo pair:
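
As a minimal illustration of the idea (this is not Thomas's code; the surface and points are made up), persp() invisibly returns the viewing transformation matrix, and trans3d() uses it to project arbitrary 3-D points onto the 2-D device:

# Draw a wireframe surface and capture the perspective transformation matrix
x <- y <- seq(-1, 1, length.out = 20)
z <- outer(x, y, function(x, y) x^2 - y^2)
pmat <- persp(x, y, z, theta = 30, phi = 25, expand = 0.6)

# Project three 3-D points with the same transformation and overlay them in 2-D
pts <- trans3d(x = c(-1, 0, 1), y = c(-1, 0, 1), z = c(0, 0.5, 0), pmat = pmat)
points(pts, col = "red", pch = 19)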

Stereo-cube

Rendering 3-D images isn't just for fun: there's plenty of 3-D data to analyze and visualize, too. Giora Simchoni used R to visualize data from the Carnegie-Mellon Graphics Lab Motion Capture Database. This data repository provides the output of human figures in motion-capture suits performing actions like walking, jumping, and even dancing. Since the motion-capture suits include multiple sensors measured over a time-period, the data structures are quite complex. To make things simpler, Giora created the mocap package (available on Github) to read these motion data files and generate 3-D animations to visualize them. For example, here's the output from two people performing the Charleston together:

Charleston

You can find complete details behind both animations, including the associated R code, at the links below.

Data Imaginist: I made a 3D movie with ggplot2 once – here's how I did it Giora Simchoni: Lambada! (The mocap Package)


New CRAN Package Announcement: splashr


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’m pleased to announce that splashr is now on CRAN.

(That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")).

The package is an R interface to the Splash javascript rendering service. It works in a similar fashion to Selenium but is far more geared to web scraping and has quite a bit of power under the hood.
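
For a flavour of the API, here is a minimal, hedged sketch (assuming a Splash instance is already running locally, for example via the package's Docker helpers); the page and CSS selector are only examples:

library(splashr)
library(rvest)

# Render a javascript-heavy page through Splash and get back parsed HTML
pg <- render_html(url = "https://cran.r-project.org/web/packages/splashr/")

# From there, standard rvest selectors apply
pg %>% html_nodes("td") %>% html_text() %>% head()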

I’ve blogged about splashr before:

and the package comes with three vignettes that (hopefully) provide a solid introduction to using the web scraping framework.

More features — including additional DSL functions — will be added in the coming months, but please kick the tyres and file an issue with problems or suggestions.

Many thanks to all who took it for a spin and provided suggestions and even more thanks to the CRAN team for a speedy onboarding.


IMDB Genre Classification using Deep Learning


(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

The Internet Movie Database (IMDb) is a great source of information about movies. Keras provides access to part of the cleaned dataset (e.g. for sentiment classification). While sentiment classification is an interesting topic, I wanted to see if it is possible to identify a movie’s genre from its description. The image below illustrates the task:

Illustration: predicting a movie's genre from its description

To see if that is possible, I downloaded the raw data from an FU-Berlin FTP server. Most movies have multiple genres assigned (e.g. Action and Sci-Fi). I chose to randomly pick one genre in case of multiple assignments.

So the task at hand is to use a lengthy description to infer a (noisy) label. Hence, the task is similar to the Reuters news categorization task, and I used that code as a guideline for the model. However, looking at the code, it becomes clear that the data preprocessing part is skipped. In order to make it easy for practitioners to create their own applications, I will try to detail the necessary preprocessing. The texts are represented as vectors of integers (indexes), so basically one builds a dictionary in which each index refers to a particular word.

require(caret)
require(keras)

max_words <- 1500

### create a balanced dataset with equal numbers of observations for each class
down_train <- caret::downSample(x = mm, y = mm$GenreFact)

### preprocessing ---
tokenizer <- keras::text_tokenizer(num_words = max_words)
keras::fit_text_tokenizer(tokenizer, mm$descr)
sequences <- tokenizer$texts_to_sequences(mm$descr)

## split in training and test set
train   <- sample(1:length(sequences), size = 0.95 * length(sequences), replace = F)
x_test  <- sequences[-train]
x_train <- sequences[train]

### labels!
y_train <- mm[train, ]$GenreFact
y_test  <- mm[-train, ]$GenreFact

########## how many classes do we have?
num_classes <- length(unique(y_train)) + 1
cat(num_classes, '\n')

# Vectorizing sequence data to a matrix which can be used as an input matrix
x_train <- sequences_to_matrix(tokenizer, x_train, mode = 'binary')
x_test  <- sequences_to_matrix(tokenizer, x_test, mode = 'binary')
cat('x_train shape:', dim(x_train), '\n')
cat('x_test shape:', dim(x_test), '\n')

# Convert class vector to binary class matrix
# (for use with categorical_crossentropy)
y_train <- to_categorical(y_train, num_classes)
y_test  <- to_categorical(y_test, num_classes)

In order to get trainable data, we first balance the dataset such that all classes have the same frequency. Then we preprocess the raw text descriptions into this index-based representation. As always, we split the dataset into training and test sets (95% training in the code above). Finally, we transform the index-based representation into a matrix representation and one-hot-encode the classes.

After setting up the data, we can define the model. I tried different combinations (depth, dropouts, regularizers and input units) and the following layout seems to work the best:

batch_size <- 64
epochs <- 200

model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, input_shape = c(max_words), activation = "relu") %>%
  layer_dropout(rate = 0.6) %>%
  layer_dense(units = 64, activation = 'relu', regularizer_l1(l = 0.15)) %>%
  layer_dropout(rate = 0.8) %>%
  layer_dense(units = num_classes, activation = 'softmax')

summary(model)

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = 'adam',
                  metrics = c('accuracy'))

hist <- model %>% fit(x_train, y_train,
                      batch_size = batch_size,
                      epochs = 200,
                      verbose = 1,
                      validation_split = 0.1)

## using the holdout dataset!
score <- model %>% evaluate(x_test, y_test, batch_size = batch_size, verbose = 1)
cat('Test score:', score[[1]], '\n')
cat('Test accuracy', score[[2]], '\n')

Finally, we plot the training progress and conclude that it is possible to train a classifier without too much effort.

Training progress plot

I hope this short tutorial illustrated how to preprocess text in order to build a text-based deep-learning classifier. I am pretty sure there are better parameters to tune the model. If you want to implement such a model in a production environment, I would recommend playing with the text-preprocessing parameters; the text_tokenizer and texts_to_sequences functions hold a lot of untapped value.

Good luck!


RStudio 1.1 Preview – I Only Work in Black


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Today, we’re continuing our blog series on new features in RStudio 1.1. If you’d like to try these features out for yourself, you can download a preview release of RStudio 1.1.

I Only Work in Black

For those of us that like to work in black or very very dark grey, the dark theme can be enabled from the ‘Global Options’ menu, selecting the ‘Appearance’ tab and choosing an ‘Editor theme’ that is dark.

Icons are now high-DPI, and ‘Modern’ and ‘Sky’ themes have also been added; read more about them under Using RStudio Themes.

All panels support themes: Code editor, Console, Terminal, Environment, History, Files, Connections, Packages, Help, Build and VCS. Other features like Notebooks, Debugging, Profiling, Menus and the Object Explorer support this theme as well.

However, the Plots and Viewer panes render with the default colors of your content and therefore require additional packages to switch to dark themes. For instance, shinythemes provides the darkly theme for Shiny, and ggthemes provides support for light = FALSE under ggplot. If you are a package author, consider using rstudioapi::getThemeInfo() when generating output to these panes.
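
As a rough sketch of how a package author might use this (the color choices below are only an example, not an official recommendation):

# Adapt base-graphics colors to the active RStudio theme, falling back to
# defaults when not running inside RStudio
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  info <- rstudioapi::getThemeInfo()
  fg <- if (isTRUE(info$dark)) "white" else "black"
  bg <- if (isTRUE(info$dark)) "black" else "white"
  par(bg = bg, fg = fg, col.axis = fg, col.lab = fg, col.main = fg)
}
plot(1:10, main = "Theme-aware plot")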

Enjoy!


Tidy Time Series Analysis, Part 4: Lags and Autocorrelation


(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

In the fourth part in a series on Tidy Time Series Analysis, we’ll investigate lags and autocorrelation, which are useful in understanding seasonality and form the basis for autoregressive forecast models such as AR, ARMA, ARIMA, SARIMA (basically any forecast model with “AR” in the acronym). We’ll use the tidyquant package along with our tidyverse downloads data obtained from cranlogs. The focus of this post is using lag.xts(), a function capable of returning multiple lags from an xts object, to investigate autocorrelation in lags among the daily tidyverse package downloads. When using lag.xts() with tq_mutate() we can scale to multiple groups (different tidyverse packages in our case). If you like what you read, please follow us on social media to stay up on the latest Business Science news, events and information! As always, we are interested in both expanding our network of data scientists and seeking new clients interested in applying data science to business and finance. If interested, contact us.

If you haven’t checked out the previous tidy time series posts, you may want to review them to get up to speed.

Here’s an example of the autocorrelation plot we investigate as part of this post:

tidyquant correlation over time

Libraries Needed

We’ll need to load several libraries today.

library(tidyquant)  # Loads tidyverse, tidyquant, financial pkgs, xts/zoo
library(cranlogs)   # For inspecting package downloads over time
library(timetk)     # For consistent time series coercion functions
library(stringr)    # Working with strings
library(forcats)    # Working with factors/categorical data

CRAN tidyverse Downloads

We’ll be using the same “tidyverse” dataset as the last several posts. The script below gets the package downloads for the first half of 2017.

# tidyverse packages (see my laptop stickers from first post) ;)
pkgs <- c("tidyr", "lubridate", "dplyr", "broom", "tidyquant",
          "ggplot2", "purrr", "stringr", "knitr")

# Get the downloads for the individual packages
tidyverse_downloads <- cran_downloads(packages = pkgs,
                                      from = "2017-01-01",
                                      to   = "2017-06-30") %>%
  tibble::as_tibble() %>%
  group_by(package)

tidyverse_downloads
## # A tibble: 1,629 x 3
## # Groups:   package [9]
##          date count package
##  1 2017-01-01   873   tidyr
##  2 2017-01-02  1840   tidyr
##  3 2017-01-03  2495   tidyr
##  4 2017-01-04  2906   tidyr
##  5 2017-01-05  2847   tidyr
##  6 2017-01-06  2756   tidyr
##  7 2017-01-07  1439   tidyr
##  8 2017-01-08  1556   tidyr
##  9 2017-01-09  3678   tidyr
## 10 2017-01-10  7086   tidyr
## # ... with 1,619 more rows

We can visualize daily downloads, but detecting the trend is quite difficult due to noise in the data.

# Visualize the package downloads
tidyverse_downloads %>%
  ggplot(aes(x = date, y = count, color = package)) +
  # Data
  geom_point(alpha = 0.5) +
  facet_wrap(~ package, ncol = 3, scale = "free_y") +
  # Aesthetics
  labs(title = "tidyverse packages: Daily downloads", x = "",
       subtitle = "2017-01-01 through 2017-06-30",
       caption = "Downloads data courtesy of cranlogs package") +
  scale_color_tq() +
  theme_tq() +
  theme(legend.position = "none")

Plot: tidyverse packages, daily downloads (2017-01-01 through 2017-06-30)

Lags (Lag Operator)

The lag operator (also known as backshift operator) is a function that shifts (offsets) a time series such that the “lagged” values are aligned with the actual time series. The lags can be shifted any number of units, which simply controls the length of the backshift. The picture below illustrates the lag operation for lags 1 and 2.

Lag Example
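
For a concrete toy illustration of the backshift idea (independent of the xts machinery used later in this post; the series values are made up):

# Lag operator on a toy series: each lag column is the series shifted back k periods
# (uses dplyr/tibble, already loaded above via tidyquant)
tibble(
  t     = 1:5,
  x     = c(5, 7, 9, 11, 13),
  lag_1 = lag(x, 1),   # value one period earlier
  lag_2 = lag(x, 2)    # value two periods earlier
)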

Lags are very useful in time series analysis because of a phenomenon called autocorrelation, which is a tendency for the values within a time series to be correlated with previous copies of itself. One benefit to autocorrelation is that we can identify patterns within the time series, which helps in determining seasonality, the tendency for patterns to repeat at periodic frequencies. Understanding how to calculate lags and analyze autocorrelation will be the focus of this post.

Finally, lags and autocorrelation are central to numerous forecasting models that incorporate autoregression, regressing a time series using previous values of itself. Autoregression is the basis for one of the most widely used forecasting techniques, the autoregressive integrated moving average model or ARIMA for short. Possibly the most widely used tool for forecasting, the forecast package by Rob Hyndman, implements ARIMA (and a number of other forecast modeling techniques). We’ll save autoregression and ARIMA for another day as the subject is truly fascinating and deserves its own focus.

Background on Functions Used

The xts, zoo, and TTR packages have some great functions that enable working with time series. The tidyquant package enables a “tidy” implementation of these functions. You can see which functions are integrated into tidyquant package using tq_mutate_fun_options(). We use glimpse() to shorten the output.

# tidyquant Integrated functions
tq_mutate_fun_options() %>%
  glimpse()
## List of 5
##  $ zoo                 : chr [1:14] "rollapply" "rollapplyr" "rollmax" "rollmax.default" ...
##  $ xts                 : chr [1:27] "apply.daily" "apply.monthly" "apply.quarterly" "apply.weekly" ...
##  $ quantmod            : chr [1:25] "allReturns" "annualReturn" "ClCl" "dailyReturn" ...
##  $ TTR                 : chr [1:62] "adjRatios" "ADX" "ALMA" "aroon" ...
##  $ PerformanceAnalytics: chr [1:7] "Return.annualized" "Return.annualized.excess" "Return.clean" "Return.cumulative" ...

lag.xts()

Today, we’ll take a look at the lag.xts() function from the xts package, which is a really great function for getting multiple lags. Before we dive into an analysis, let’s see how the function works. Say we have a time series of ten values beginning in 2017.

set.seed(1)

my_time_series_tbl <- tibble(
  date  = seq.Date(ymd("2017-01-01"), length.out = 10, by = "day"),
  value = 1:10 + rnorm(10)
)

my_time_series_tbl
## # A tibble: 10 x 2
##          date     value
##  1 2017-01-01 0.3735462
##  2 2017-01-02 2.1836433
##  3 2017-01-03 2.1643714
##  4 2017-01-04 5.5952808
##  5 2017-01-05 5.3295078
##  6 2017-01-06 5.1795316
##  7 2017-01-07 7.4874291
##  8 2017-01-08 8.7383247
##  9 2017-01-09 9.5757814
## 10 2017-01-10 9.6946116

The lag.xts() function generates a sequence of lags (t-1, t-2, t-3, …, t-k) using the argument k. However, it only works on xts objects (or other matrix, vector-based objects). In other words, it fails on our “tidy” tibble. We get an “unsupported type” error.

# Bummer, man!
my_time_series_tbl %>%
  lag.xts(k = 1:5)
## 

Now, watch what happens when the tibble is converted to an xts object. We’ll use tk_xts() from the timetk package to coerce from a time-based tibble (a tibble with a date or time component) to an xts object.

The timetk package is a toolkit for working with time series. It has functions that simplify and make consistent the process of coercion (converting to and from different time series classes). In addition, it has functions to aid the process of time series machine learning and data mining. Visit the docs to learn more.

# Success! Got our lags 1 through 5. One problem: no original values
my_time_series_tbl %>%
  tk_xts(silent = TRUE) %>%
  lag.xts(k = 1:5)
##                value   value.1   value.2   value.3   value.4
## 2017-01-01        NA        NA        NA        NA        NA
## 2017-01-02 0.3735462        NA        NA        NA        NA
## 2017-01-03 2.1836433 0.3735462        NA        NA        NA
## 2017-01-04 2.1643714 2.1836433 0.3735462        NA        NA
## 2017-01-05 5.5952808 2.1643714 2.1836433 0.3735462        NA
## 2017-01-06 5.3295078 5.5952808 2.1643714 2.1836433 0.3735462
## 2017-01-07 5.1795316 5.3295078 5.5952808 2.1643714 2.1836433
## 2017-01-08 7.4874291 5.1795316 5.3295078 5.5952808 2.1643714
## 2017-01-09 8.7383247 7.4874291 5.1795316 5.3295078 5.5952808
## 2017-01-10 9.5757814 8.7383247 7.4874291 5.1795316 5.3295078

We get our lags! However, we still have one problem: We need our original values so we can analyze the counts against the lags. If we want to get the original values too, we can do something like this.

# Convert to xts
my_time_series_xts <- my_time_series_tbl %>%
  tk_xts(silent = TRUE)

# Get original values and lags in xts
my_lagged_time_series_xts <- merge.xts(
  my_time_series_xts,
  lag.xts(my_time_series_xts, k = 1:5)
)

# Convert back to tbl
my_lagged_time_series_xts %>%
  tk_tbl()
## # A tibble: 10 x 7
##         index     value   value.5   value.1   value.2   value.3
##  1 2017-01-01 0.3735462        NA        NA        NA        NA
##  2 2017-01-02 2.1836433 0.3735462        NA        NA        NA
##  3 2017-01-03 2.1643714 2.1836433 0.3735462        NA        NA
##  4 2017-01-04 5.5952808 2.1643714 2.1836433 0.3735462        NA
##  5 2017-01-05 5.3295078 5.5952808 2.1643714 2.1836433 0.3735462
##  6 2017-01-06 5.1795316 5.3295078 5.5952808 2.1643714 2.1836433
##  7 2017-01-07 7.4874291 5.1795316 5.3295078 5.5952808 2.1643714
##  8 2017-01-08 8.7383247 7.4874291 5.1795316 5.3295078 5.5952808
##  9 2017-01-09 9.5757814 8.7383247 7.4874291 5.1795316 5.3295078
## 10 2017-01-10 9.6946116 9.5757814 8.7383247 7.4874291 5.1795316
## # ... with 1 more variables: value.4

That’s a lot of work for a simple operation. Fortunately we have tq_mutate() to the rescue!

tq_mutate()

The tq_mutate() function from tidyquant enables “tidy” application of the xts-based functions. The tq_mutate() function works similarly to mutate() from dplyr in the sense that it adds columns to the data frame.

The tidyquant package enables a “tidy” implementation of the xts-based functions from packages such as xts, zoo, quantmod, TTR and PerformanceAnalytics. Visit the docs to learn more.

Here’s a quick example. We use select = value to send the “value” column to the mutation function. In this case our mutate_fun = lag.xts. We supply k = 1:5 as an additional argument.

# This is nice, we didn't need to coerce to xts and it merged for us
my_time_series_tbl %>%
  tq_mutate(select = value, mutate_fun = lag.xts, k = 1:5)
## # A tibble: 10 x 7
##          date     value   value.1   value.2   value.3   value.4
##  1 2017-01-01 0.3735462        NA        NA        NA        NA
##  2 2017-01-02 2.1836433 0.3735462        NA        NA        NA
##  3 2017-01-03 2.1643714 2.1836433 0.3735462        NA        NA
##  4 2017-01-04 5.5952808 2.1643714 2.1836433 0.3735462        NA
##  5 2017-01-05 5.3295078 5.5952808 2.1643714 2.1836433 0.3735462
##  6 2017-01-06 5.1795316 5.3295078 5.5952808 2.1643714 2.1836433
##  7 2017-01-07 7.4874291 5.1795316 5.3295078 5.5952808 2.1643714
##  8 2017-01-08 8.7383247 7.4874291 5.1795316 5.3295078 5.5952808
##  9 2017-01-09 9.5757814 8.7383247 7.4874291 5.1795316 5.3295078
## 10 2017-01-10 9.6946116 9.5757814 8.7383247 7.4874291 5.1795316
## # ... with 1 more variables: value.5

That’s much easier. We get the value column returned in addition to the lags, which is the benefit of using tq_mutate(). If you use tq_transmute() instead, the result would be the lags only, which is what lag.xts() returns.

Analyzing tidyverse Downloads: Lag and Autocorrelation Analysis

Now that we understand a little more about lags and the lag.xts() and tq_mutate() functions, let’s put this information to use with a lag and autocorrelation analysis of the tidyverse package downloads. We’ll analyze all tidyverse packages together, showing off the scalability of tq_mutate().

Scaling the Lag and Autocorrelation Calculation

First, let’s get lags 1 through 28 (4 weeks of lags). The process is quite simple: we take the tidyverse_downloads data frame, which is grouped by package, and apply tq_mutate() using the lag.xts function. We can provide column names for the new columns by prefixing “lag_” to the lag numbers, k, which is the sequence from 1 to 28. The output is all of the lags for each package.

# Use tq_mutate() to get lags 1:28 using lag.xts()
k <- 1:28
col_names <- paste0("lag_", k)

tidyverse_lags <- tidyverse_downloads %>%
  tq_mutate(select = count, mutate_fun = lag.xts, k = 1:28, col_rename = col_names)

tidyverse_lags
## # A tibble: 1,629 x 31
## # Groups:   package [9]
##    package       date count lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7
##  1   tidyr 2017-01-01   873    NA    NA    NA    NA    NA    NA    NA
##  2   tidyr 2017-01-02  1840   873    NA    NA    NA    NA    NA    NA
##  3   tidyr 2017-01-03  2495  1840   873    NA    NA    NA    NA    NA
##  4   tidyr 2017-01-04  2906  2495  1840   873    NA    NA    NA    NA
##  5   tidyr 2017-01-05  2847  2906  2495  1840   873    NA    NA    NA
##  6   tidyr 2017-01-06  2756  2847  2906  2495  1840   873    NA    NA
##  7   tidyr 2017-01-07  1439  2756  2847  2906  2495  1840   873    NA
##  8   tidyr 2017-01-08  1556  1439  2756  2847  2906  2495  1840   873
##  9   tidyr 2017-01-09  3678  1556  1439  2756  2847  2906  2495  1840
## 10   tidyr 2017-01-10  7086  3678  1556  1439  2756  2847  2906  2495
## # ... with 1,619 more rows, and 21 more variables: lag_8,
## #   lag_9, lag_10, lag_11, lag_12, lag_13, lag_14, lag_15, lag_16,
## #   lag_17, lag_18, lag_19, lag_20, lag_21, lag_22, lag_23, lag_24,
## #   lag_25, lag_26, lag_27, lag_28

Next, we need to correlate each of the lags to the “count” column. This involves a few steps that can be strung together in a dplyr pipe (%>%):

  1. The goal is to get count and each lag side-by-side so we can do a correlation. To do this we use gather() to pivot each of the lagged columns into a “tidy” (long format) data frame, and we exclude “package”, “date”, and “count” columns from the pivot.

  2. Next, we convert the new “lag” column from a character string (e.g. “lag_1”) to numeric (e.g. 1) using mutate(), which will make ordering the lags much easier.

  3. Next, we group the long data frame by package and lag. This allows us to calculate using subsets of package and lag.

  4. Finally, we apply the correlation to each group of lags. The summarize() function can be used to implement cor(), which takes x = count and y = lag_value. Make sure to pass use = "pairwise.complete.obs", which is almost always desired. Additionally, the 95% upper and lower cutoff can be approximated by:

$$cutoff = \pm \frac{2}{N^{0.5}}$$

Where:

  • N = number of observations.

Putting it all together:

# Calculate the autocorrelations and 95% cutoffs
tidyverse_count_autocorrelations <- tidyverse_lags %>%
  gather(key = "lag", value = "lag_value", -c(package, date, count)) %>%
  mutate(lag = str_sub(lag, start = 5) %>% as.numeric) %>%
  group_by(package, lag) %>%
  summarize(cor = cor(x = count, y = lag_value, use = "pairwise.complete.obs"),
            cutoff_upper = 2 / (n())^0.5,
            cutoff_lower = -2 / (n())^0.5)

tidyverse_count_autocorrelations
## # A tibble: 252 x 5
## # Groups:   package [?]
##    package   lag        cor cutoff_upper cutoff_lower
##  1   broom     1 0.65709555    0.1486588   -0.1486588
##  2   broom     2 0.29065629    0.1486588   -0.1486588
##  3   broom     3 0.18617353    0.1486588   -0.1486588
##  4   broom     4 0.17266972    0.1486588   -0.1486588
##  5   broom     5 0.26686998    0.1486588   -0.1486588
##  6   broom     6 0.55222426    0.1486588   -0.1486588
##  7   broom     7 0.74755610    0.1486588   -0.1486588
##  8   broom     8 0.51461062    0.1486588   -0.1486588
##  9   broom     9 0.19069218    0.1486588   -0.1486588
## 10   broom    10 0.08473241    0.1486588   -0.1486588
## # ... with 242 more rows

Visualizing Autocorrelation: ACF Plot

Now that we have the correlations calculated by package and lag number in a nice “tidy” format, we can visualize the autocorrelations with ggplot to check for patterns. The plot shown below is known as an ACF plot, which is simply the autocorrelations at various lags. Initial examination of the ACF plots indicates a weekly frequency.

# Visualize the autocorrelations
tidyverse_count_autocorrelations %>%
  ggplot(aes(x = lag, y = cor, color = package, group = package)) +
  # Add horizontal line at y = 0
  geom_hline(yintercept = 0) +
  # Plot autocorrelations
  geom_point(size = 2) +
  geom_segment(aes(xend = lag, yend = 0), size = 1) +
  # Add cutoffs
  geom_line(aes(y = cutoff_upper), color = "blue", linetype = 2) +
  geom_line(aes(y = cutoff_lower), color = "blue", linetype = 2) +
  # Add facets
  facet_wrap(~ package, ncol = 3) +
  # Aesthetics
  expand_limits(y = c(-1, 1)) +
  scale_color_tq() +
  theme_tq() +
  labs(title = paste0("Tidyverse ACF Plot: Lags ", rlang::expr_text(k)),
       subtitle = "Appears to be a weekly pattern",
       x = "Lags") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Plot: Tidyverse ACF plot, lags 1 through 28 (a weekly pattern appears)

Which Lags Consistently Stand Out?

We see that there appears to be a weekly pattern, but we want to be sure. We can verify the weekly pattern assessment by reviewing the absolute value of the correlations independent of package. We take the absolute autocorrelation because we use the magnitude as a proxy for how much explanatory value the lag provides. We’ll use dplyr functions to manipulate the data for visualization:

  1. We drop the package group constraint using ungroup().

  2. We calculate the absolute correlation using mutate(). We also convert the lag to a factor, which helps with reordering the plot later.

  3. We select() only the “lag” and “cor_abs” columns.

  4. We group by “lag” to lump all of the lags together. This enables us to determine the trend independent of package.

# Get the absolute autocorrelations
tidyverse_absolute_autocorrelations <- tidyverse_count_autocorrelations %>%
  ungroup() %>%
  mutate(lag = as_factor(as.character(lag)),
         cor_abs = abs(cor)) %>%
  select(lag, cor_abs) %>%
  group_by(lag)

tidyverse_absolute_autocorrelations
## # A tibble: 252 x 2
## # Groups:   lag [28]
##       lag    cor_abs
##  1      1 0.65709555
##  2      2 0.29065629
##  3      3 0.18617353
##  4      4 0.17266972
##  5      5 0.26686998
##  6      6 0.55222426
##  7      7 0.74755610
##  8      8 0.51461062
##  9      9 0.19069218
## 10     10 0.08473241
## # ... with 242 more rows

We can now visualize the absolute correlations using a box plot that lumps each of the lags together. We can add a line to indicate the presence of outliers at values above 1.5 × IQR. If the values are consistently above this limit, the lag can be considered an outlier. Note that we use the fct_reorder() function from forcats to organize the boxplot in order of descending magnitude.

# Visualize boxplot of absolute autocorrelations
break_point <- 1.5 * IQR(tidyverse_absolute_autocorrelations$cor_abs) %>% signif(3)

tidyverse_absolute_autocorrelations %>%
  ggplot(aes(x = fct_reorder(lag, cor_abs, .desc = TRUE), y = cor_abs)) +
  # Add boxplot
  geom_boxplot(color = palette_light()[[1]]) +
  # Add horizontal line at outlier break point
  geom_hline(yintercept = break_point, color = "red") +
  annotate("text", label = paste0("Outlier Break Point = ", break_point),
           x = 24.5, y = break_point + .03, color = "red") +
  # Aesthetics
  expand_limits(y = c(0, 1)) +
  theme_tq() +
  labs(title = paste0("Absolute Autocorrelations: Lags ", rlang::expr_text(k)),
       subtitle = "Weekly pattern is consistently above outlier break point",
       x = "Lags") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Plot: boxplot of absolute autocorrelations, lags 1 through 28

Lags in multiples of seven have the highest autocorrelation and are consistently above the outlier break point indicating the presence of a strong weekly pattern. The autocorrelation with the seven-day lag is the highest, with a median of approximately 0.75. Lags 14, 21, and 28 are also outliers with median autocorrelations in excess of our outlier break point of 0.471.

Note that the median of Lag 1 is essentially at the break point indicating that half of the packages have a presence of “abnormal” autocorrelation. However, this is not part of a seasonal pattern since a periodic frequency is not present.

Conclusions

Lag and autocorrelation analysis is a good way to detect seasonality. We used the autocorrelation of the lagged values to detect “abnormal” seasonal patterns. In this case, the tidyverse packages exhibit a strong weekly pattern. We saw how the tq_mutate() function was used to apply lag.xts() to the daily download counts to efficiently get lags 1 through 28. Once the lags were retrieved, we used other dplyr functions such as gather() to pivot the data and summarize() to calculate the autocorrelations. Finally, we saw the power of visual analysis of the autocorrelations. We created an ACF plot that showed a visual trend. Then we used a boxplot to detect which lags had consistent outliers. Ultimately a weekly pattern was confirmed.

About Business Science

We have a full suite of data science services to supercharge your organization’s financial and business performance! For example, our experienced data scientists reduced a manufacturer’s sales forecasting error by 50%, which led to improved personnel planning, material purchasing and inventory management.

How do we do it? With team-based data science: Using our network of data science consultants with expertise in Marketing, Forecasting, Finance, Human Resources and more, we pull together the right team to get custom projects done on time, within budget, and of the highest quality. Learn about our data science services or contact us!

We are growing! Let us know if you are interested in joining our network of data scientist consultants. If you have expertise in Marketing Analytics, Data Science for Business, Financial Analytics, Forecasting or data science in general, we’d love to talk. Contact us!

Follow Business Science on Social Media


RcppArmadillo 0.7.960.1.2


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

armadillo image

A second fix-up release is needed following on the recent bi-monthly RcppArmadillo release as well as the initial follow-up as it turns out that OS X / macOS is so darn special that it needs an entire separate treatment for OpenMP. Namely to turn it off entirely…

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 384 other packages on CRAN, an increase of 54 since the CRAN release in June!

Changes in RcppArmadillo version 0.7.960.1.2 (2017-08-29)

  • On macOS, OpenMP support is now turned off (#170).

  • The package is now compiling under the C++11 standard (#170).

  • The vignette dependency is correctly set (James and Dirk in #168 and #169)

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Layered Data Visualizations Using R, Plotly, and Displayr


(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

If you have tried to communicate research results and data visualizations using R, there is a good chance you will have come across one of its great limitations. R is painful when you need to create visualizations by layering multiple visual elements on top of each other. In other words, R can be painful if you want to assemble many visual elements, such as charts, images, headings, and backgrounds, into one visualization.


The good: R can create awesome charts 

R is great for creating charts. It gives you a lot of control and makes it easy to update charts with revised data. As an example, the chart below was created in R using the plotly package. It has quite a few nifty features that cannot be achieved in, say, Excel or Tableau.

The data visualization below measures blood sugar, exercise intensity, and diet. Each dot represents a blood glucose (BG) measurement for a patient over the course of a day. Note that the blood sugar measurements are not collected at regular intervals so there are gaps between some of the dots. In addition, the y-axis label spacings are irregular because this chart needs to emphasize the critical point of a BG of 8.9. The dots also get larger the further they are from a BG of 6 and color is used to emphasize extreme values. Finally, green shading is used to indicate the intensity of the patient’s physical activity, and readings from a food diary have been automatically added to this chart.

While this R visualization is awesome, it can be made even more interesting by overlaying visual elements such as images and headings.

You can look at this R visualization live, and you can hover your mouse over points to see the dates and times of individual readings. 
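
To make the kind of features described above concrete, here is a small, hedged plotly sketch with made-up readings (it is not the author's code and not the actual patient data); point size and color encode distance from a target reading, and hover text shows the timestamp:

library(plotly)

# Made-up blood glucose readings over one day (illustrative only)
bg <- data.frame(
  time  = seq(as.POSIXct("2017-08-01 07:00"), by = "2 hours", length.out = 8),
  value = c(5.2, 6.8, 9.4, 7.1, 5.9, 9.1, 6.2, 10.3)
)

plot_ly(
  bg, x = ~time, y = ~value, type = "scatter", mode = "markers",
  marker = list(
    size  = 8 + abs(bg$value - 6) * 4,                        # bigger when far from 6
    color = ifelse(bg$value > 8.9, "firebrick", "steelblue")  # flag readings above 8.9
  ),
  text = ~paste0("BG: ", value, " at ", format(time, "%H:%M")),
  hoverinfo = "text"
) %>%
  layout(yaxis = list(title = "Blood glucose"))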

 


The bad: It is very painful to create visual confections in R

In his book, Visual Explanations, Edward Tufte coins the term visual confections to describe visualizations that are created by overlaying multiple visual elements (e.g., combining charts with images or joining multiple visualizations into one). The document below is an example of a visual confection.

The chart created in R above has been incorporated into the visualization below, along with another chart, images, background colors, headings and more – this is a visual confection.

In addition to all information contained in the original chart, the patient’s insulin dose for each day is shown in a syringe and images of meals have also been added. The background has been colored, and headings and sub-headings included. While all of this can be done in R, it cannot be done easily.

Even if you know all the relevant functions to programmatically insert images, resize them, deal with transparency, and control their order, you still have to go through a painful trial and error process of guesstimating the coordinates where things need to appear. That is, R is not WYSIWYG, and you really feel this when creating visual confections. Whenever I have done such things, I end up having to print the images, use a ruler, and create a simple model to estimate the coordinates!

 

Good-looking complex dashboard


The solution: How to assemble many visual layers into one data visualization

The standard way that most people create visual confections is using PowerPoint. However, PowerPoint and R are not great friends, as resizing R charts in PowerPoint causes problems, and PowerPoint cannot support any of the cool hover effects or interactivity in HTMLwidgets like plotly.

My solution was to build Displayr, which is a bit like a PowerPoint for the modern age, except that charts can be created in the app using R. The app is also online and can have its data updated automatically.

Click here to create your own layered visualization (just sign into Displayr first). Here you can access and edit the document that I used to create the visual confection example used in this post. This document contains all the raw data and the R code (as a function) used to automatically create the charts in this post. You can see the published layered visualization as a web page here.

 


The one function call you need to know as a data scientist: h2o.automl



Introduction

Two things that recently came to my attention were AutoML (Automatic Machine Learning) by h2o.ai and the Fashion MNIST dataset by Zalando Research. So as a test, I ran AutoML on the Fashion MNIST data set.

H2o AutoML

As you all know, a large part of the work in predictive modeling is in preparing the data. But once you have done that, ideally you don’t want to spend too much time trying many different machine learning models. That’s where AutoML from h2o.ai comes in. With one function call you automate the process of training a large, diverse selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, GLMs, Gradient Boosting Machines (GBMs) and Neural Nets. Then, as a “bonus”, it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is h2o.automl. (There is also a Python interface.)

FashionMNIST_Benchmark = h2o.automl(
  x = 1:784,
  y = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmninst_test
)

So the first 784 columns in the data set are used as inputs and column 785 is the column with the labels. There are more input arguments that you can use, for example the maximum running time, the maximum number of models to train, or a stopping metric; a sketch follows below.
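
As a hedged illustration of those optional arguments (names as documented for h2o.automl; the values are arbitrary and the data frames are the ones used above):

# Same call as above, with optional limits on runtime and number of models
FashionMNIST_Benchmark = h2o.automl(
  x = 1:784,
  y = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmninst_test,
  max_runtime_secs = 3600,       # stop the AutoML run after one hour
  max_models       = 20,         # or after 20 base models
  stopping_metric  = "logloss"   # early-stopping criterion for individual models
)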

It can take some time to run all these models, so I have spun up a so-called high CPU droplet on Digital Ocean: 32 dedicated cores ($0.92 /h).


h2o utilizing all 32 cores to create models

The output in R is an object containing the models and a ‘leaderboard’ ranking the different models. I got the following accuracies on the Fashion MNIST test set.

  1. Gradient Boosting (0.90)
  2. Deep learning (0.89)
  3. Random forests (0.89)
  4. Extremely randomized forests (0.88)
  5. GLM (0.86)

There is no ensemble model, because it is not yet supported for multi-class classifiers. The deep learning models in h2o are networks of fully connected hidden layers; for this specific Zalando images data set, you’re better off pursuing fancier convolutional neural networks. As a comparison, I ran a simple 2-layer CNN with keras, resulting in a test accuracy of 0.92. It outperforms all the models here!
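
For reference, a small CNN of the kind mentioned could look roughly like the sketch below. This is not the author's exact architecture; it assumes the keras R package and its built-in Fashion MNIST loader:

library(keras)

# Load the built-in Fashion MNIST data and scale pixel values to [0, 1]
fashion <- dataset_fashion_mnist()
x_train <- array_reshape(fashion$train$x / 255, c(nrow(fashion$train$x), 28, 28, 1))
y_train <- to_categorical(fashion$train$y, 10)

# A small convolutional network (illustrative only)
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "adam",
                  metrics = "accuracy")

model %>% fit(x_train, y_train, epochs = 5, batch_size = 128, validation_split = 0.1)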

Conclusion

If you have prepared your modeling data set, the first thing you can always do now is to run h2o.automl.

Cheers, Longhow.


Data wrangling : Cleansing – Regular expressions (2/3)


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s data wrangling skills. This is the fourth part of the series and it aims to cover the cleansing of the data used. In previous parts we learned how to import, reshape and transform data. The rest of the series will be dedicated to the data cleansing process. In this post we will go through regular expressions: sequences of characters that define a search pattern, mainly for use in pattern matching with text strings. In particular, we will cover the foundations of regular expression syntax.

Before proceeding, it might be helpful to look over the help pages for grep and gsub.

Moreover, please run the following command to create the strings that we will work on:

bio <- c('24 year old', 'data scientist', '1992', 'A.I. enthusiast',
         'R version 3.4.0 (2017-04-21)', 'r-exercises author', 'R is cool', 'RR')
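
If regular expressions are new to you, here is a quick, generic illustration of the two functions on this vector (deliberately not one of the exercise answers):

# value = TRUE returns the matching strings instead of their indices
grep("^R", bio, value = TRUE)   # elements that start with a capital R

# gsub() replaces every match of the pattern
gsub("[0-9]", "#", bio)         # mask every digit with '#'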

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the strings with Numeric values between 3 and 6.

Exercise 2

Find the strings with the character ‘A’ or ‘y’.

Exercise 3

Find any strings that have non-alphanumeric characters.

Exercise 4

Remove lower case letters.

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 5

Remove space or tabs.

Exercise 6

Remove punctuation and replace it with white space.

Exercise 7

Remove alphanumeric characters.

Exercise 8

Match sentences that contain ‘M’.

Exercise 9

Match states with two ‘o’.

Exercise 10

Match cars with one or two ‘e’.


Web Scraping Influenster: Find a Popular Hair Care Product for You


(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Are you a person who likes to try new products? Are you curious about which hair products are popular and trendy? If you’re excited about getting your hair glossy and eager to find a suitable shampoo, conditioner or hair oil, using my ‘Shiny (Hair) App’ could help you find what you seek in less time. My code is available on GitHub.

 

Research Questions

What are popular hair care brands?

What is the user behavior on Influenster.com?

What kinds of factors may have a critical influence on customer satisfaction?

Is it possible to create a search engine which takes search phrases and returns related products?

 

Data Collection

To obtain the most up-to-date hair care information, I decided to web scrape Influenster, a product discovery and review platform. It has over 14 million reviews and over 2 million products for users to choose from.

In order to narrow down my research scope, I focused on 3 categories: shampoo, hair conditioner, and hair oil. I gathered the 54 top choices for each one. For the product datasets, I scraped brand name, product name, overall product rating, rank and reviews. In addition, the scraped review dataset includes author name, author location, content, rating score, and hair profile.

 

Results

Top Brands Graph

First, the “other” category represents brands which have only one or two popular products. Judging from the popular brands’ pie chart, we can see that most of the popular products belong to major brands.

 

Rating Map

To check users’ behavior on Influenster across the United States, I decided to make two maps to see whether there are any interesting results linked to location. Since I scraped the top 54 products for each category, the overall rating score is high across the country. As a result, it is difficult to see regional differences.

 

Reviews Map

However, if we look at the number of hair care product reviews on Influenster.com across the nation, we see that there are 4740, 3898, 3787 and 2818 reviews in California, Florida, Texas and New York respectively.
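As a rough sketch of how such a state-level map can be drawn with ggplot2 (only the four state totals reported above are filled in; the rest are left missing, and this is not the app's actual plotting code):

library(ggplot2)
library(dplyr)

# State totals from the post; all other states left as NA
review_counts <- data.frame(
  region  = c("california", "florida", "texas", "new york"),
  reviews = c(4740, 3898, 3787, 2818)
)

states_map <- map_data("state")   # requires the maps package to be installed
plot_data  <- left_join(states_map, review_counts, by = "region")

ggplot(plot_data, aes(long, lat, group = group, fill = reviews)) +
  geom_polygon(colour = "grey80") +
  coord_quickmap() +
  labs(fill = "Reviews")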

 

Analysis of Rating and Number of Reviews

There is a negative relationship between rating and number of reviews. As you can see, Pureology receives the highest score, 4.77 out of 5, but it only has 514 reviews. On the other hand, OGX is rated 4.4 out of 5 yet has over 5167 reviews.
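A quick way to check that relationship is to plot each product's rating against its review count and look at the correlation. The data frame and column names below are placeholders for the scraped product table, not the exact objects used in the app:

library(ggplot2)

# 'products' is assumed to have one row per product with columns
# 'rating' and 'n_reviews' (placeholder names)
cor(products$rating, products$n_reviews, method = "spearman")

ggplot(products, aes(x = n_reviews, y = rating)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  labs(x = "Number of reviews (log scale)", y = "Average rating")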

 

Wordcloud & Comparison Cloud

As we may be interested in what factors customers care about most and what contributes to their satisfaction with a product, I decided to inspect the most frequently mentioned words in those 77 thousand reviews. For a first try, I created word clouds for each category and for the overall reviews. However, there was no significant difference among the four graphs, so I created a comparison cloud to collate the most common words popping up in each category's reviews.

From the comparison cloud, we can infer that customers regard the functionality and fragrance of products as most important. In addition, the word “recommend” shows up as a commonly used word in the reviews dataset. Consequently, in my view, word of mouth is a great marketing strategy for brands to focus on.
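A comparison cloud can be produced with the tm and wordcloud packages. The sketch below assumes one character vector of review text per category (placeholder names), so it shows the general approach rather than the exact code behind the figures here:

library(tm)
library(wordcloud)

# shampoo_text, conditioner_text, hair_oil_text: character vectors of reviews
# (placeholder names), collapsed into one document per category
docs <- c(shampoo     = paste(shampoo_text, collapse = " "),
          conditioner = paste(conditioner_text, collapse = " "),
          hair_oil    = paste(hair_oil_text, collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- names(docs)
comparison.cloud(tdm, max.words = 100)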

 

Search Engine built in my Shiny App (NLP: TF-IDF, cosine similarity)

TF-IDF

TF-IDF, which stands for “Term Frequency–Inverse Document Frequency,” is an NLP technique: a numerical statistic intended to reflect how important a word is to a document in a corpus.

For my search engine, I use the “tm” package and employ the weightSMART “nnn” weighting schema for term frequency. The “nnn” schema is a natural (raw count) weighting: it simply counts how many times each individual word appears in each document in the dataset. If you would like more details or want to check other weighting schemas, please take a look at the tm documentation.

Cosine Similarity

With the TF-IDF measurements in place, products are recommended according to their cosine similarity score with the query. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them. In information retrieval settings like a search engine, the cosine similarity of two documents ranges from 0 to 1, because term frequencies (TF-IDF weights) cannot be negative; in other words, the angle between two term-frequency vectors cannot be greater than 90 degrees. The closer the cosine value is to 1, the more similar the two vectors (products) are. The cosine similarity formula is shown below.
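The formula, and a minimal way of scoring products against a query, are sketched below. The object names (dtm, query_vec) are placeholders rather than the app's actual variables:

# Cosine similarity: cos(theta) = (A . B) / (||A|| * ||B||)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# 'dtm' is assumed to be a term-by-product weight matrix and 'query_vec'
# the query expressed over the same terms (both placeholders)
scores <- apply(dtm, 2, cosine_sim, b = query_vec)
head(sort(scores, decreasing = TRUE), 5)  # the five most similar products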

 

 

Insights

Most of the products belong to household brands.

The most active users of the site are from California, Florida, Texas and New York.

There is a negative relationship between the number of reviews and rating score.

The functionality and scent of hair care products are of great importance to customers.

Even though “recommend” is a commonly used word, in this project it is difficult to tell whether it reflects positive or negative feedback. Sentiment analysis could resolve this in future work.

The self-developed search engine, built on TF-IDF and cosine similarity, would work even better if product descriptions were included. With descriptions added, user queries could match not only the product name but also the description, so users could retrieve more related merchandise and discover new product features.

The post Web Scraping Influenster: Find a Popular Hair Care Product for You appeared first on NYC Data Science Academy Blog.


R in the Data Science Stack at ODSC


(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Register now for ODSC West in San Francisco, November 2-4 and save 60% with code RB60 until September 1st.

R continues to hold its own in the data science landscape thanks in no small part to its flexibility.  That flexibility allows R to integrate with some of the most popular data science tools available.

Given R’s memory bounds, it’s no surprise that deep learning tools like TensorFlow are on that list. Comprehensive, intuitive, and well documented, TensorFlow has quickly become one of the most popular deep learning platforms, and RStudio has released a package to integrate with the TensorFlow API. Not to be outdone, MXNet, another popular and powerful deep learning framework, has native support for R through an API interface.

It doesn’t stop with deep learning. Data science is moving toward real-time, and the streaming analytics platform Apache Kafka is rapidly gaining traction with the community. The kafka package allows one to use the Kafka messaging queue via R. Spark is now one of the dominant machine learning platforms, and we see multiple R integrations in the form of the sparklyr package and the SparkR package. The list continues to grow, with package integrations released for H2O.ai, Druid and others, and more on the way.
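To give a flavour of what these integrations look like from the R side, here is a minimal sparklyr session against a local Spark instance (made-up data, just a sketch rather than conference material):

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy a small data set over
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed on the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)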

At the Open Data Science Conference, R has long been one of the most popular data science languages, and ODSC West 2017 is no exception. We have a strong lineup this year that includes:

  • R Tools for Data Science
  • Modeling Big Data with R, sparklyr, and Apache Spark
  • Machine Learning with R
  • Introduction to Data Science with R
  • Modern Time-Series with Prophet
  • R4ML: A Scalable and Distributed framework in R for Machine Learning
  • Databases Using R
  • Geo-Spatial Data Visualization using R

From an R user perspective, one of the most exciting things about ODSC West 2017 is that it offers an excellent opportunity to do a deep dive into some of the most popular data science tools you can now leverage with R. Talks and workshops on the conference schedule include:

  • Deep learning from Scratch WIth Tensorflow
  • Apache Kafka for Real-time analytics
  • Deep learning with MXNet
  • Effective TensorFlow
  • Building an Open Source Analytics Solution with Kafka and Druid
  • Deep Neural Networks with Keras
  • Robust Data Pipelines with Apache Airflow
  • Apache Superset – A Modern, Enterprise-Ready Business Intelligence Web Application

Over 3 packed days, ODSC West 2017 also offers a great opportunity to brush up on your modeling skills, including predictive analytics, time series, NLP, machine learning, image recognition, deep learning, autonomous vehicles, and AI chatbot assistants. Here are just a few of the data science workshops and talks scheduled:

  • Feature Selection from High Dimensions
  • Interpreting Predictions from Complex Models
  • Deep Learning for Recommender Systems
  • Natural Language Processing in Practice – Do’s and Don’ts
  • Machine Imaging recognition
  • Training a Prosocial Chatbot
  • Anomaly Detection Using Deep Learning
  • Myths of Data Science: Practical Issues You Can and Can Not Ignore.
  • Playing Detective with CNNs
  • Recommendation System Architecture and Algorithms
  • Driver and Occupants Monitoring AI for Autonomous Vehicles
  • Solving Impossible Problems by Collaborating with an AI
  • Dynamic Risk Networks: Mapping Risk in the Financial System

With over 20 full training sessions, 50 workshops and 100 speakers, ODSC West 2017 is ideal for beginners to experts looking to understand the latest in R tools and topics in data science and AI.

Register now and save 60% with code RB60 until September 1st.

Sheamus McGovern, CEO of ODSC


Finding distinct rows of a tibble


(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

I’ve been using R or its predecessors for about 30 years, which means I know a lot about R, and also that I don’t necessarily know how to use modern R tools. Lately, I’ve been trying to unlearn some old approaches, and to re-learn them using the tidyverse approach to data analysis. I agree that it is much better, but old dogs and new tricks… Recently, I was teaching a class where I needed to extract some rows of a data set.
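The tidyverse tool for that job is dplyr::distinct(); the snippet below is a generic illustration on a built-in data set, not the example from his class:

library(dplyr)

# Distinct rows across all columns
mtcars %>% as_tibble() %>% distinct()

# Distinct combinations of selected columns, keeping the remaining columns
mtcars %>% as_tibble() %>% distinct(cyl, gear, .keep_all = TRUE)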


Probably more likely than probable


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

What kind of probability are people talking about when they say something is "highly likely" or has "almost no chance"? The chart below, created by Reddit user zonination, visualizes the responses of 46 other Reddit users to "What probability would you assign to the phrase: <phrase>" for various statements of probability. Each set of responses has been converted to a kernel density estimate and presented as a joyplot using R.

[Chart: perceptions of probability — joyplot of the Reddit responses]
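A joyplot like this can be drawn with the ggridges package. The sketch below assumes a long-format data frame of responses with placeholder column names, so it illustrates the plot type rather than reproducing zonination's exact code:

library(ggplot2)
library(ggridges)

# 'responses' is assumed to have one row per answer, with columns
# 'phrase' (e.g. "highly likely") and 'prob' (assigned probability, 0-100)
ggplot(responses, aes(x = prob, y = phrase)) +
  geom_density_ridges() +
  labs(x = "Assigned probability (%)", y = NULL)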

Somewhat surprisingly, the results from the Redditors hew quite closely to a similar study of 23 NATO intelligence officers in 2007. In that study, the officers — who were accustomed to reading intelligence reports with assertions of likelihood — were given a similar task with the same descriptions of probability. The results, here presented as a dotplot, are quite similar.

[Chart: NATO intelligence officers' probability assessments — dotplot]

For details on the analysis of the Redditors, including the data and R code behind the joyplot chart, check out the Github repository linked below.

Github (zonination): Perceptions of Probability and Numbers



Pacific Island Hopping using R and iGraph


(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean, visiting as many island nations as possible. The Pacific Ocean is a massive body of water between Asia and the Americas which covers almost half the surface of the earth. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

[Map: The Pacific Ocean.]

Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publicly available data set of international flights so, unfortunately, I created the list manually (if you do know of such a data set, then please leave a comment).

My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
f <- "Geography/airports.csv"
if (file.exists(f)) {
    airports <- read.csv(f)
} else airports <- data.frame(airport = NA, lat = NA, lon = NA)

# Lookup coordinates for new airports
all_airports <- unique(c(flights$From, flights$To))
new_airports <- all_airports[!(all_airports %in% airports$airport)]
if (length(new_airports) != 0) {
    coords <- geocode(new_airports)
    new_airports <- data.frame(airport = new_airports, coords)
    airports <- rbind(airports, new_airports)
    airports <- subset(airports, !is.na(airport))
    write.csv(airports, "Geography/airports.csv", row.names = FALSE)
}

# Add coordinates to flight list
flights <- merge(flights, airports, by.x = "From", by.y = "airport")
flights <- merge(flights, airports, by.x = "To", by.y = "airport")

Create the map

To create a map, I modified the code to create flight maps I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antemeridian are positive, while longitudes east are negative.

The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

# Plot flight routes
worldmap <- borders("world2", colour = "#efede1", fill = "#efede1")
ggplot() + worldmap +
    geom_point(data = airports, aes(x = lon, y = lat), col = "#970027") +
    geom_text_repel(data = airports, aes(x = lon, y = lat, label = airport),
                    col = "black", size = 2, segment.color = NA) +
    geom_curve(data = flights, aes(x = lon.x, y = lat.x, xend = lon.y,
                                   yend = lat.y, col = Airline),
               size = .4, curvature = .2) +
    theme(panel.background = element_rect(fill = "white"),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    xlim(100, 300) + ylim(-40, 40)

[Map: Pacific island hopping flight routes]

This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_paths function can then be used to plan routes. If I wanted to travel from Auckland to Saipan in the Northern Mariana Islands, I would have to go through Port Vila, Honiara, Port Moresby, Chuuk and Guam, and then on to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)
g <- graph_from_edgelist(as.matrix(flights[, 1:2]), directed = FALSE)
par(mar = rep(0, 4))
plot(g, layout = layout.fruchterman.reingold, vertex.size = 0)
V(g)
shortest_paths(g, "Auckland", "Saipan")

View the latest version of this code on GitHub.

[Graph: Pacific flight network]

The post Pacific Island Hopping using R and iGraph appeared first on The Devil is in the Data.


Create and Update PowerPoint Reports using R


(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In my sordid past, I was a data science consultant. One thing about data science that they don’t teach you at school is that senior managers in most large companies require reports to be in PowerPoint. Yet, I like to do my more complex data science in R – PowerPoint and R are not natural allies. As a result, creating and updating PowerPoint reports using R can be painful.

In this post, I discuss how to make R and PowerPoint work efficiently together. The underlying assumption is that R is your computational engine and that you are trying to get outputs into PowerPoint. I compare and contrast three tools for creating and updating PowerPoint reports using R: the free ReporteRs package and two commercial products, Displayr and Q.


Option 1: ReporteRs

The first approach to getting R and PowerPoint to work together is to use David Gohel’s ReporteRs. To my mind, this is the most “pure” of the approaches from an R perspective. If you are an experienced R user, this approach works in pretty much the way that you will expect it to work.

The code below creates 250 crosstabs, conducts significance tests, and, for each crosstab whose p-value is less than 0.05, adds a slide containing it. And, yes, I know this is p-hacking, but this post is about how to use PowerPoint and R, not how to do statistics…

library(devtools)
devtools::install_github('davidgohel/ReporteRsjars')
devtools::install_github('davidgohel/ReporteRs')
install.packages(c('ReporteRs', 'haven', 'vcd', 'ggplot2', 'reshape2'))
library(ReporteRs)
library(haven)
library(vcd)
library(ggplot2)
library(reshape2)

dat = read_spss("http://wiki.q-researchsoftware.com/images/9/94/GSSforDIYsegmentation.sav")
filename = "c://delete//Significant crosstabs.pptx" # the document to produce
document = pptx(title = "My significant crosstabs!")
alpha = 0.05 # The level at which the statistical testing is to be done.
dependent.variable.names = c("wrkstat", "marital", "sibs", "age", "educ")
all.names = names(dat)[6:55] # The first 50 variables in the file.
counter = 0
for (nm in all.names)
    for (dp in dependent.variable.names)
    {
        if (nm != dp)
        {
            v1 = dat[[nm]]
            if (is.labelled(v1))
                v1 = as_factor(v1)
            v2 = dat[[dp]]
            l1 = attr(v1, "label")
            l2 = attr(v2, "label")
            if (is.labelled(v2))
                v2 = as_factor(v2)
            if (length(unique(v1)) <= 10 && length(unique(v2)) <= 10) # Only performing tests if 10 or fewer rows and columns.
            {
                x = xtabs(~v1 + v2)
                x = x[rowSums(x) > 0, colSums(x) > 0]
                ch = chisq.test(x)
                p = ch$p.value
                if (!is.na(p) && p <= alpha)
                {
                    counter = counter + 1
                    # Creating the outputs.
                    crosstab = prop.table(x, 2) * 100
                    melted = melt(crosstab)
                    melted$position = 100 - as.numeric(apply(crosstab, 2, cumsum) - 0.5 * crosstab)
                    p = ggplot(melted, aes(x = v2, y = value, fill = v1)) + geom_bar(stat = 'identity')
                    p = p + geom_text(data = melted, aes(x = v2, y = position, label = paste0(round(value, 0), "%")), size = 4)
                    p = p + labs(x = l2, y = l1)
                    colnames(crosstab) = paste0(colnames(crosstab), "%")
                    # bar = ggplot() + geom_bar(aes(y = v1, x = v2), data = data.frame(v1, v2), stat = "identity")
                    # Writing them to the PowerPoint document.
                    document = addSlide(document, slide.layout = "Title and Content")
                    document = addTitle(document, paste0("Standardized residuals and chart: ", l1, " by ", l2))
                    document = addPlot(doc = document, fun = print, x = p, offx = 3, offy = 1, width = 6, height = 5)
                    document = addFlexTable(doc = document, FlexTable(round(ch$stdres, 1), add.rownames = TRUE), offx = 8, offy = 2, width = 4.5, height = 3)
                }
            }
        }
    }
writeDoc(document, file = filename)
cat(paste0(counter, " tables and charts exported to ", filename, "."))

Below we see one of the admittedly ugly slides created using this code. With more time and expertise, I am sure I could have done something prettier. A cool aspect of the ReporteRs package is that you can then edit the file in PowerPoint. You can then get R to update any charts and other outputs originally created in R.

 

Chart created in PowerPoint using R

 


Option 2: Displayr

A completely different approach is to author the report in Displayr, and then export the resulting report from Displayr to PowerPoint.

This has advantages and disadvantages relative to using ReporteRs. First, I will start with the big disadvantage, in the hope of persuading you of my objectivity (disclaimer: I have no objectivity, I work at Displayr).

Each page of a Displayr report is created interactively, using a mouse and clicking and dragging things. In my earlier example using ReporteRs, I only created pages where there was a statistically significant association. Currently, there is no way of doing such a thing in Displayr.

The flipside of using a graphical user interface like Displayr’s is that it is a lot easier to create attractive visualizations. As a result, the user has much greater control over the look and feel of the report. For example, the screenshot below shows a PowerPoint document created by Displayr. All but one of the charts has been created using R, and the first two are based on a moderately complicated statistical model (latent class rank-ordered logit model).

You can access the document used to create the PowerPoint report with R here (just sign in to Displayr first) – you can poke around and see how it all works.

A benefit of authoring a report using Displayr is that the user can access the report online, interact with it (e.g., filter the data), and then export precisely what they want. You can see this document as it is viewed by a user of the online report here.

 

Demographics dashboard example from Displayr

 


Option 3: Q

A third approach for authoring and updating PowerPoint reports using R is to use Q, which is a Windows program designed for survey reporting (same disclaimer as with Displayr). It works by exporting and updating results to a PowerPoint document. Q has two different mechanisms for exporting R analyses to PowerPoint. First, you can export R outputs, including HTMLwidgets, created in Q directly to PowerPoint as images. Second, you can create tables using R and then have these exported as native PowerPoint objects, such as Excel charts and PowerPoint tables.


In Q, a Report contains a series of analyses. Analyses can either be created using R, or, using Q’s own internal calculation engine, which is designed for producing tables from survey data.

The map above (in the Displayr report) is an HTMLwidget created using the plotly R package. It draws data from a table called Region, which would also be shown in the report. (The same R code in the Displayr example can be used in an R object within Q). So when exported into PowerPoint, it creates a page, using the PowerPoint template, where the title is Responses by region and the map appears in the middle of the page.

The screenshot below shows another R chart created in PowerPoint. The data was extracted from Google Trends using the gtrendsR package. The chart itself, however, is a standard Excel chart attached to a spreadsheet containing the data. These slides can then be customized using all the normal PowerPoint tools and automatically updated when the data is revised.
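For reference, pulling a series from Google Trends with gtrendsR looks roughly like the sketch below; the search term and time window are arbitrary examples, not necessarily the ones behind the chart:

library(gtrendsR)

# Arbitrary example query; adjust keyword, geo and time as needed
trends <- gtrends(keyword = "data science", time = "today 12-m")
head(trends$interest_over_time)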

 


Explore the Displayr example

You can access the Displayr document used to create and update the PowerPoint report with R here (just sign in to Displayr first). Here, you can poke around and see how it all works or create your own document.


Community Call – rOpenSci Software Review and Onboarding


(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Are you thinking about submitting a package to rOpenSci's open peer software review? Considering volunteering to review for the first time? Maybe you're an experienced package author or reviewer and have ideas about how we can improve.

Join our Community Call on Wednesday, September 13th. We want to get your feedback and we'd love to answer your questions!

Agenda

  1. Welcome (Stefanie Butland, rOpenSci Community Manager, 5 min)
  2. guest: Noam Ross, editor (15 min) Noam will give an overview of the rOpenSci software review and onboarding, highlighting the role editors play and how decisions are made about policies and changes to the process.
  3. guest: Andee Kaplan, reviewer (15 min) Andee will give her perspective as a package reviewer, sharing specifics about her workflow and her motivation for doing this.
  4. Q & A (25 min, moderated by Noam Ross)

Speaker bios

Andee Kaplan is a Postdoctoral Fellow at Duke University. She is a recent PhD graduate from the Iowa State University Department of Statistics, where she learned a lot about R and reproducibility by developing a class on data stewardship for Agronomists. Andee has reviewed multiple (two!) packages for rOpenSci, iheatmapr and getlandsat, and hopes to one day be on the receiving end of the review process.

Andee on GitHub, Twitter

Noam Ross is one of rOpenSci's four editors for software peer review. Noam is a Senior Research Scientist at EcoHealth Alliance in New York, specializing in mathematical modeling of disease outbreaks, as well as training and standards for data science and reproducibility. Noam earned his Ph.D. in Ecology from the University of California-Davis, where he founded the Davis R Users' Group.

Noam on GitHub, Twitter

Resources


Why to use the replyr R package


Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).

[Image: Tron — “fights for the users.”]

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

[Screenshot: the GitHub issue discussion of nrow() returning NA]

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious, convenient “no, please really calculate the number of rows for me” option (other than “d %>% tally %>% pull”).
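For completeness, that workaround looks like this on a remote table — a generic dplyr pipeline rather than anything replyr-specific:

library(dplyr)

# 'd' is a remote (dbplyr or sparklyr) table; this forces the row count to be
# computed on the backend and returned as a plain number
d %>% tally() %>% pull()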

I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'
replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2
spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work arounds of design choices far away from our influence.


Text featurization with the Microsoft ML package


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here's one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms ("useful", "indispensable", "highly recommended"), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
    # skipLength=1 : "computer" and "compuuter" is the same
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)

# train using transforms!
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression"  # not binary (numeric regression)
)

We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring       -7.647399
waste        -7.537471
not          -6.355953
nothing      -6.149342
money        -5.386262
bad          -5.377981
no           -5.210301
worst        -5.051558
poorly       -4.962763
disappointed -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will      3.073104
the|best  3.265797
love      3.290348
life      3.562267
wonderful 3.652950
,|and     3.762862
you       3.889580
excellent 3.902497
my        4.454115
great     4.552569

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function. 

sentimentScores <- rxFeaturize(data = data,
                               mlTransforms = getSentiment(vars =
                                   list(SentimentScore = "REVIEW_TEXT")))

As we expect, a negative sentiment (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5-1.0) is associated with the 4- and 5-star reviews.

[Chart: sentiment score boxplots by star rating]

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)

