
10 PRINT mazes with ggplot2


(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

There is a celebrated Commodore 64 program that randomly prints out / and \ characters and fills the screen with neat-looking maze designs. It is just one line of code, but there is a whole book written about it.

10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Screenshots of the 10 PRINT program in action. Images taken from the 10 PRINT book.

The basic idea, from my reading of the code, is that CHR$(205) is \, CHR$(206) is /, and the program randomly selects between the two by adding a random number to 205.5. Endlessly looping over this command fills the screen with that pleasing maze pattern.
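
As a quick sanity check of that reading, the selection step can be mimicked in R (a sketch, not part of the original program; it assumes the C64 truncates the fractional part of CHR$'s argument):

# Mimic 205.5 + RND(1): floor() stands in for the truncation CHR$ applies,
# so each draw lands on 205 or 206 with roughly equal probability.
codes <- floor(205.5 + runif(10000))
round(prop.table(table(codes)), 2)
#> codes
#>  205  206
#> 0.50 0.50   (approximately; the split varies slightly from run to run)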

In R, we could replicate this functionality by randomly sampling the slashes:

sample_n_slashes <- function(n) {
  sample(c("/", "\\"), size = n, replace = TRUE)
}

withr::with_options(
  list(width = 40),
  cat(sample_n_slashes(800), sep = "", fill = TRUE)
)
#> //////\/\\/\/\/\/\\//\\\\//\//\/\//\///\
#> \\\/\//\\/\/////////////\\/\//\\\/\//\\\
#> /\\/\/\//\/\/\\/\\/\/\//\\\\//\/\\\/\///
#> \\//\/\\\/\///\\/\/\/////\//\///\/\/\//\
#> /\/\//\///\//\//\//\/\//\\/\\\\\/\\\//\/
#> //\\\\\//////\//\//\/\//\\////\/\\\/\/\/
#> \\////\/\\/\////\//////\\/\\/////\/////\
#> /\/\/\/\/\\///\/\\\\/\/\\//\/\\/\\\\\\//
#> //\\\/\///\///\/\\\//\//\\\\//\\/\\/\/\/
#> /\\/\//\\//\/\\\\\/\\\/\\\/\/\/\/\////\/
#> /\//\\//\////\\\///\//\/\\\/\\///\//\\\\
#> /\//\\////\/\\\//\\\//\/\///\\\\\/\/\\//
#> \\////\\/\\\\\\///\///\//\\\/\\\\//\////
#> \\\///\/\//\//\/\//\/\/\\\/\/\\///\/////
#> \\/\\//\\\\\\///\\\\/\\/\\//\//\\\/\\/\/
#> ////\\\////\/\\\\\/////\/\\\////\\///\\\
#> \//\\\//\///\/\\\\//\\\////\\//\\/\/\//\
#> \/\//\\//\\///\\\\\\//\///\/\\\\\\\\/\\/
#> ///\/\//\\/\/\\\\\\\\/\///\//\\///\\//\\
#> /////\\///\/\\/\/\\//\\//\/\\/\//\//\\\\

where withr::with_options() lets us temporarily change the print width and cat() concatenates the slashes and prints out the characters as text.

We can also make this much prettier by drawing the patterns using ggplot2.

Drawing line segments with ggplot2

Instead of writing out slashes, we will draw a grid of diagonal line segments, some of which will be flipped at random. To draw a segment, we need a starting xy coordinate and an ending xy coordinate. geom_segment() will connect the two coordinates with a line. Here’s a small example where we draw four “slashes”.

library(ggplot2)
library(dplyr)

data <- tibble::tribble(
  ~row, ~col, ~x_start, ~x_end, ~y_start, ~y_end,
     1,    1,        0,      1,        0,      1,
     1,    2,        1,      2,        1,      0,  # flipped
     2,    1,        0,      1,        1,      2,
     2,    2,        1,      2,        1,      2
)

ggplot(data) +
  aes(x = x_start, xend = x_end, y = y_start, yend = y_end) +
  geom_segment()

A simple demo of geom_segment()

The programming task now is to make a giant grid of these slashes. Let’s start with an observation: To draw two slashes, we needed three x values: 0, 1, 2. The first two served as segment starts and the last two as segment ends. The same idea applies to the y values. We can generate a bunch of starts and ends by taking a sequence of steps and removing the first and last elements.

# We want a 20 by 20 grid
rows <- 20
cols <- 20

x_points <- seq(0, 1, length.out = cols + 1)
x_starts <- head(x_points, -1)
x_ends <- tail(x_points, -1)

y_points <- seq(0, 1, length.out = rows + 1)
y_starts <- head(y_points, -1)
y_ends <- tail(y_points, -1)

Each x_starts and x_ends pair is a column in the grid, and each y_starts and y_ends pair is a row in the grid. To make a slash at each row–column combination, we have to map out all the combinations of the rows and columns. We can do this with crossing(), which creates all crossed combinations of values. (If it helps, you might think of crossing like crossed experiments or the Cartesian product of sets.)

grid <- tidyr::crossing(
  # columns
  data_frame(x_start = x_starts, x_end = x_ends),
  # rows
  data_frame(y_start = y_starts, y_end = y_ends)
) %>%
  # So values move left to right, bottom to top
  arrange(y_start, y_end)

# 400 rows because 20 rows x 20 columns
grid
#> # A tibble: 400 x 4
#>    x_start x_end y_start y_end
#>      <dbl> <dbl>   <dbl> <dbl>
#>  1    0     0.05       0  0.05
#>  2    0.05  0.1        0  0.05
#>  3    0.1   0.15       0  0.05
#>  4    0.15  0.2        0  0.05
#>  5    0.2   0.25       0  0.05
#>  6    0.25  0.3        0  0.05
#>  7    0.3   0.35       0  0.05
#>  8    0.35  0.4        0  0.05
#>  9    0.4   0.45       0  0.05
#> 10    0.45  0.5        0  0.05
#> # ... with 390 more rows

We can confirm that the segments in the grid fill out a plot. (I randomly color the line segments to make individual ones visible.)

ggplot(grid) +
  aes(x = x_start, y = y_start,
      xend = x_end, yend = y_end,
      color = runif(400)) +
  geom_segment(size = 1) +
  guides(color = FALSE)

A grid full of line segments, all pointing the same direction

Finally, we need to flip slashes at random. A segment becomes flipped if the y_start and y_end are swapped. In the code below, we flip the slash in each row if a randomly drawn number between 0 and 1 is less than .5. For style, we also use theme_void() to strip away the plotting theme, leaving us with just the maze design.

p_flip <- .5

grid <- grid %>%
  arrange(y_start, y_end) %>%
  mutate(
    p_flip = p_flip,
    flip = runif(length(y_end)) <= p_flip,
    y_temp1 = y_start,
    y_temp2 = y_end,
    y_start = ifelse(flip, y_temp2, y_temp1),
    y_end = ifelse(flip, y_temp1, y_temp2)
  ) %>%
  select(x_start:y_end, p_flip)

ggplot(grid) +
  aes(x = x_start, y = y_start, xend = x_end, yend = y_end) +
  geom_segment(size = 1, color = "grey20")

last_plot() + theme_void()

A maze with 50% flipping probability

Now, we wrap all these steps together into a pair of functions.

make_10_print_data <- function(rows = 20, cols = 20, p_flip = .5) {
  x_points <- seq(0, 1, length.out = cols + 1)
  x_starts <- head(x_points, -1)
  x_ends <- tail(x_points, -1)

  y_points <- seq(0, 1, length.out = rows + 1)
  y_starts <- head(y_points, -1)
  y_ends <- tail(y_points, -1)

  grid <- tidyr::crossing(
    data.frame(x_start = x_starts, x_end = x_ends),
    data.frame(y_start = y_starts, y_end = y_ends)
  )

  grid %>%
    arrange(y_start, y_end) %>%
    mutate(
      p_flip = p_flip,
      flip = runif(length(y_end)) <= p_flip,
      y_temp1 = y_start,
      y_temp2 = y_end,
      y_start = ifelse(flip, y_temp2, y_temp1),
      y_end = ifelse(flip, y_temp1, y_temp2)
    ) %>%
    select(x_start:y_end, p_flip)
}

draw_10_print <- function(rows = 20, cols = 20, p_flip = .5) {
  grid <- make_10_print_data(rows = rows, cols = cols, p_flip = p_flip)

  ggplot(grid) +
    aes(x = x_start, y = y_start, xend = x_end, yend = y_end) +
    geom_segment(size = 1, color = "grey20")
}

Now the fun part: custom flipping probabilities

We can vary the probability of flipping the slashes. For example, we can use the density of a normal distribution to make flipping more likely for central values and less likely for more extreme values.

xs <- seq(0, 1, length.out = 40)
p_flip <- dnorm(seq(-4, 4, length.out = 40))

ggplot(data.frame(x = xs, y = p_flip)) +
  aes(x, y) +
  geom_line() +
  labs(x = "x position", y = "p(flipping)", title = "normal density")

# We repeat p_flip for each row of the grid
draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) +
  theme_void()

Density of the normal distribution
A maze where flipping probability is based on the density of a normal curve

We can use the cumulative density of the normal distribution so that flipping becomes more likely as x increases.

xs <- seq(0, 1, length.out = 40)
p_flip <- pnorm(seq(-4, 4, length.out = 40))

ggplot(data.frame(x = xs, y = p_flip)) +
  aes(x, y) +
  geom_line() +
  labs(x = "x position", y = "p(flipping)", title = "cumulative normal")

draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) +
  theme_void()

Cumulative density of the normal distribution
A maze where flipping probability is based on the cumulative density of a normal curve

The Cauchy distribution is said to have “thicker” tails than the normal distribution, so here it shows more flips on the left and right extremes.

xs <- seq(0, 1, length.out = 40)
p_flip <- dcauchy(seq(-4, 4, length.out = 40))

ggplot(data.frame(x = xs, y = p_flip)) +
  aes(x, y) +
  geom_line() +
  labs(x = "x position", y = "p(flipping)", title = "Cauchy density")

draw_10_print(rows = 40, cols = 40, p_flip = rep(p_flip, 40)) +
  theme_void()

Density of the Cauchy distribution
A maze where flipping probability is based on the density of a Cauchy curve

The exponential distribution is a spike that quickly peters out. We can make a probability “bowl” by splicing an exponential and a reversed exponential together.

# Use flipped exponential densities as probabilities
p_flip <- c(dexp(seq(0, 4, length.out = 20)),
            dexp(seq(4, 0, length.out = 20)))

ggplot(data.frame(x = xs, y = p_flip)) +
  aes(x, y) +
  geom_line() +
  labs(x = "x position", y = "p(flipping)", title = "exponential + flipped exponential")

draw_10_print(rows = 40, cols = 40, p = rep(p_flip, 40)) +
  theme_void()

Densities of two exponential curves spliced to form a bowl
A maze where flipping probability is based on an exponential curve and a reversed exponential curve

We might have the probabilities increase by 10% from row to row. In the code below, I use a simple loop to boost some random probability values by 10% from row to row. This gives us nice streaks in the grid as a column starts to flip for every row.

boost_probs <- function(p_flip, nrows, factor = 1.1) {
  output <- p_flip
  for (i in seq_len(nrows - 1)) {
    p_flip <- p_flip * factor
    output <- c(output, p_flip)
  }
  output
}

draw_10_print(cols = 40, rows = 40, p = boost_probs(runif(40), 40, 1.1)) +
  theme_void()

A maze where flipping probability increases by 10% from row to row

The probabilities can be anything we like. Here I compute the frequency of English alphabet letters as they appear in Pride and Prejudice and base the flipping probability on those values.

char_counts <- janeaustenr::prideprejudice %>%
  tolower() %>%
  stringr::str_split("") %>%
  unlist() %>%
  table()

letter_counts <- char_counts[letters] %>% as.vector()
p_letter <- letter_counts / sum(letter_counts)

ggplot(data.frame(x = letters, y = p_letter)) +
  aes(x, y, label = x) +
  geom_text() +
  labs(x = NULL, y = "p(letter)",
       title = "letter frequencies in Pride and Prejudice")

Letter frequencies in Pride and Prejudice

draw_10_print(cols = 26, rows = 80, p = rep(p_letter, 80)) +
  theme_void()

Maze where flipping frequency is based on letter frequencies in Pride and Prejudice



Flow charts in R


(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

Flow charts are an important part of a clinical trial report. Making them can be a pain though. One good way to do it seems to be with the grid and Gmisc packages in R. X and Y coordinates can be designated based on the center of the boxes in normalized device coordinates (proportions of the device space – 0.5 is the middle), which saves a lot of messing around with corners of boxes and arrows.

A very basic flow chart, based very roughly on the CONSORT version, can be generated as follows…

library(grid)
library(Gmisc)

grid.newpage()

# set some parameters to use repeatedly
leftx <- .25
midx <- .5
rightx <- .75
width <- .4
gp <- gpar(fill = "lightgrey")

# create boxes
(total <- boxGrob("Total\n N = NNN",
                  x = midx, y = .9, box_gp = gp, width = width))
(rando <- boxGrob("Randomized\n N = NNN",
                  x = midx, y = .75, box_gp = gp, width = width))
# connect boxes like this
connectGrob(total, rando, "v")

(inel <- boxGrob("Ineligible\n N = NNN",
                 x = rightx, y = .825, box_gp = gp, width = .25, height = .05))
connectGrob(total, inel, "-")

(g1 <- boxGrob("Allocated to Group 1\n N = NNN",
               x = leftx, y = .5, box_gp = gp, width = width))
(g2 <- boxGrob("Allocated to Group 2\n N = NNN",
               x = rightx, y = .5, box_gp = gp, width = width))
connectGrob(rando, g1, "N")
connectGrob(rando, g2, "N")

(g11 <- boxGrob("Followed up\n N = NNN",
                x = leftx, y = .3, box_gp = gp, width = width))
(g21 <- boxGrob("Followed up\n N = NNN",
                x = rightx, y = .3, box_gp = gp, width = width))
connectGrob(g1, g11, "N")
connectGrob(g2, g21, "N")

(g12 <- boxGrob("Completed\n N = NNN",
                x = leftx, y = .1, box_gp = gp, width = width))
(g22 <- boxGrob("Completed\n N = NNN",
                x = rightx, y = .1, box_gp = gp, width = width))
connectGrob(g11, g12, "N")
connectGrob(g21, g22, "N")

Sections of code to make the boxes are wrapped in brackets to print them immediately. The code creates something like the following figure:

(Flow chart figure generated by the code above.)
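
As a side note on the syntax: wrapping an assignment in parentheses is what makes each box print (and therefore draw) as soon as it is created. A minimal illustration of the idiom:

# an assignment normally prints nothing...
x <- 1:3
# ...but wrapping it in parentheses assigns and prints in one step
(x <- 1:3)
#> [1] 1 2 3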

For detailed info, see the Gmisc vignette. This code is also on github.

 

 


In case you missed it: April 2018 roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from April of particular interest to R users.

Microsoft R Open 3.4.4, based on R 3.4.4, is now available.

An R script by Ryan Timpe converts a photo into instructions for rendering it as LEGO bricks.

R functions to build a random maze in Minecraft, and have your avatar solve the maze automatically.

A dive into some of the internal changes bringing performance improvements to the new R 3.5.0.

AI, Machine Learning and Data Science Roundup, April 2018

An analysis with R shows that Uber has overtaken taxis for trips in New York City.

News from the R Consortium: new projects, results from a survey on licenses, and R-Ladies is promoted to a top-level project.

A talk, aimed at Artificial Intelligence developers, making the case for using R.

Bob Rudis analyzes data from the R-bloggers.com website, and lists the top 10 authors.

An R-based implementation of Silicon Valley's "Not Hotdog" application.

An R package for creating interesting designs with generative algorithms.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


The social weather of rOpenSci onboarding system


(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Our onboarding process ensures that packages contributed by the community undergo a transparent, constructive, non-adversarial and open review process. Before even submitting my first R package to the rOpenSci onboarding system in December 2015, I spent a fair amount of time reading through previous issue threads in order to assess whether onboarding was a friendly place for me: a newbie, very motivated to learn more, but a newbie nonetheless. I soon got the feeling that yes, onboarding would help me make my package better without ever making me feel inadequate.

More recently, I read Anne Gentle’s essay in Open Advice where she mentions the concept of the social weather of an online community. By listening before I even posted, I was simply trying to get a good idea of the social weather of onboarding – as a side-note, how wonderful is it that onboarding’s being open makes it possible to listen?!

In the meantime, I’ve not only submitted and reviewed a few packages but also become an associate editor. In general, when one of us editors talks about onboarding, we like to use examples illustrating the system in a good light: often excerpts from review processes, or quotes of tweets by authors or reviewers. Would there be a more quantitative way for us to support our promotion efforts? In this blog post, I shall explore how a tidytext analysis of review processes (via their GitHub threads) might help us characterize the social weather of onboarding.

Preparing the data

A note about text mining

In this blog post, I’ll leverage the tidytext package, with the help of its accompanying book “Tidy text mining”. The authors, Julia Silge and David Robinson, actually met and started working on this project at the rOpenSci unconf in 2016!

The “tidy text” of the analysis

I’ve described in the first post of this series how I got all onboarding issue threads. Now, I’ll explain how I cleaned their text. Why does it need to be cleaned? Well, this text contains elements that I wished to remove: code chunks, as well as parts from our submission and review templates.

My biggest worry was the removal of templates from issues. I was already picturing myself spending hours writing regular expressions to remove these lines… and then I realized that the word “lines” was the key! I could split all issue comments into lines, which is called tokenization in proper text mining vocabulary, and then remove duplicates! This way, I didn’t even have to worry about the templates having changed a bit over time, since each version was used at least twice. A tricky part that remained was the removal of code chunks, since I only wanted to keep human conversation. In theory, it was easy: code chunks are lines located between two lines containing “```”… I’m still not sure I solved this in the easiest possible way.

library("magrittr")threads <- readr::read_csv("data/clean_data.csv")
# to remove code lines between ```
range <- function(x1, x2){
  x1:x2
}

# I need the indices of lines between ```
split_in_indices <- function(x){
  lengthx <- length(x)
  if(length(x) == 0){
    return(0)
  }else{
    if(lengthx > 2){
      limits1 <- x[seq(from = 1, to = (lengthx - 1), by = 2)]
      limits2 <- x[seq(from = 2, to = lengthx, by = 2)]
      purrr::map2(limits1, limits2, range) %>%
        unlist() %>%
        c()
    }else{
      x[1]:x[2]
    }
  }
}

# tokenize by line
threads <- tidytext::unnest_tokens(threads, line, body, token = "lines")
# remove white space
threads <- dplyr::mutate(threads, line = trimws(line))
# remove citations lines
threads <- dplyr::filter(threads, !stringr::str_detect(line, "^\\>"))
# remove the line from the template that has ``` that used to bother me
threads <- dplyr::filter(threads, !stringr::str_detect(line, "bounded by ```"))
# correct one line
threads <- dplyr::mutate(threads, line = stringr::str_replace_all(line, "`` `", "```"))
# group by comment
threads <- dplyr::group_by(threads, title, created_at, user, issue)
# get indices
threads <- dplyr::mutate(threads, index = 1:n())
# get lines limiting chunks
threads <- dplyr::mutate(threads,
                         chunk_limit = stringr::str_detect(line, "```") & stringr::str_count(line, "`") %in% c(3, 4))
# special treatment
threads <- dplyr::mutate(threads,
                         chunk_limit = ifelse(user == "MarkEdmondson1234" & issue == 127 & index == 33,
                                              FALSE, chunk_limit))
threads <- dplyr::mutate(threads, which_limit = list(split_in_indices(which(chunk_limit))))
# weird code probably to get indices of code lines
threads <- dplyr::mutate(threads, code = index %in% which_limit[[1]])
threads <- dplyr::ungroup(threads)
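
As an aside, the chunk-flagging step could also be sketched more compactly with a running count of fence lines. This is a hypothetical alternative (it assumes fences always come in opening/closing pairs), not the code used above:

# hypothetical alternative: flag every line that sits between ``` fences
flag_code_lines <- function(lines) {
  fence <- stringr::str_detect(lines, "```")
  inside <- cumsum(fence) %% 2 == 1  # TRUE from an opening fence until its closing fence
  inside | fence                     # also flag the fence lines themselves
}

flag_code_lines(c("text", "``` r", "x <- 1", "```", "more text"))
#> [1] FALSE  TRUE  TRUE  TRUE FALSE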

Let’s look at what this does in practice, with comments from gutenbergr submission as example. I chose this submission because the package author, David Robinson, is one of the two tidytext creators, and because I was the reviewer, so it’s all very meta, isn’t it?

In the excerpt below, we see the most important variable, the binary code indicating whether the line is a code line. This excerpt also shows the variables created to help compute code: index is the index of the line within a comment, chunk_limit indicates whether the line is a chunk limit, and which_limit indicates which indices in the comment correspond to lines of code.

dplyr::filter(threads, package == "gutenbergr",               user == "sckott",               !stringr::str_detect(line, "ropensci..footer")) %>%  dplyr::mutate(created_at = as.character(created_at)) %>%  dplyr::select(created_at, line, code, index, chunk_limit, which_limit) %>%  knitr::kable()
| created_at | line | code | index | chunk_limit | which_limit |
|:---|:---|:---|---:|:---|:---|
| 2016-05-02 17:04:56 | thanks for your submission @dgrtwo – seeking reviewers now | FALSE | 1 | FALSE | 0 |
| 2016-05-04 06:09:19 | reviewers: @masalmon | FALSE | 1 | FALSE | 0 |
| 2016-05-04 06:09:19 | due date: 2016-05-24 | FALSE | 2 | FALSE | 0 |
| 2016-05-12 16:16:38 | having a quick look over this now… | FALSE | 1 | FALSE | 0 |
| 2016-05-12 16:45:59 | @dgrtwo looks great. just a minor thing: | FALSE | 1 | FALSE | 3:7 |
| 2016-05-12 16:45:59 | gutenberg_get_mirror() throws a warning due to xml2, at this line https://github.com/dgrtwo/gutenbergr/blob/master/r/gutenberg_download.r#l213 | FALSE | 2 | FALSE | 3:7 |
| 2016-05-12 16:45:59 | ``` r | TRUE | 3 | TRUE | 3:7 |
| 2016-05-12 16:45:59 | warning message: | TRUE | 4 | FALSE | 3:7 |
| 2016-05-12 16:45:59 | in node_find_one(x*nod*e, *x*doc, xpath = xpath, nsmap = ns) : | TRUE | 5 | FALSE | 3:7 |
| 2016-05-12 16:45:59 | 101 matches for .//a: using first | TRUE | 6 | FALSE | 3:7 |
| 2016-05-12 16:45:59 | ``` | TRUE | 7 | TRUE | 3:7 |
| 2016-05-12 16:45:59 | wonder if it’s worth a suppresswarnings() there? | FALSE | 8 | FALSE | 3:7 |
| 2016-05-12 20:42:53 | great! | FALSE | 1 | FALSE | 3:5 |
| 2016-05-12 20:42:53 | – add the footer to your readme: | FALSE | 2 | FALSE | 3:5 |
| 2016-05-12 20:42:53 | ``` | TRUE | 3 | TRUE | 3:5 |
| 2016-05-12 20:42:53 | ``` | TRUE | 5 | TRUE | 3:5 |
| 2016-05-12 20:42:53 | – could you add url and bugreports entries to description, so people know where to get sources and report bugs/issues | FALSE | 6 | FALSE | 3:5 |
| 2016-05-12 20:42:53 | – update installation of dev versions to ropenscilabs/gutenbergr and any urls for the github repo to ropenscilabs instead of dgrtwo | FALSE | 7 | FALSE | 3:5 |
| 2016-05-12 20:42:53 | – go to the repo settings –> transfer ownership and transfer to ropenscilabs – note that all our newer pkgs go to ropenscilabs first, then when more mature we’ll move to ropensci | FALSE | 8 | FALSE | 3:5 |
| 2016-05-13 01:22:22 | nice, builds on at travis https://travis-ci.org/ropenscilabs/gutenbergr/ – you can keep appveyor builds under your acct, or i can start on mine, let me know | FALSE | 1 | FALSE | 0 |
| 2016-05-13 16:06:31 | updated badge link, started an appveyor account with ropenscilabs as account name – sent pr – though the build is failing, something about getting the current gutenberg url https://ci.appveyor.com/project/sckott/gutenbergr/build/1.0.1#l650 | FALSE | 1 | FALSE | 0 |

So, as you can see, getting rid of chunks is now straightforward: the lines with code == TRUE simply have to be deleted.

# remove them and get rid of now useless columns
threads <- dplyr::filter(threads, !code)
threads <- dplyr::select(threads, -code, -which_limit, -index, -chunk_limit)

Now on to removing template parts… I noticed that removing duplicates was a bit too drastic, because sometimes duplicates were poorly formatted citations, e.g. an author answering a reviewer’s question by copy-pasting it without Markdown blockquotes, in which case we definitely want to keep the first occurrence. Besides, duplicates were sometimes very short sentences such as “great!” that are not templates and that we should therefore keep. Therefore, for each line, I counted how many times it occurred overall (no_total_occ), and in how many issues it occurred (no_issues).

Let’s look at Joseph Stachelek’s review of rrricanes for instance.

dplyr::filter(threads, user == "jsta", is_review) %>%  dplyr::select(line) %>%  head() %>%  knitr::kable()
line
## package review
– [x] as the reviewer i confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
#### documentation
the package includes all the following forms of documentation:
– [x] a statement of need clearly stating problems the software is designed to solve and its target audience in readme
– [x] installation instructions: for the development version of package and any non-standard dependencies in readme

Now if we clean up a bit…

threads <- dplyr::group_by(threads, line)
threads <- dplyr::mutate(threads, no_total_occ = n(),
                         no_issues = length(unique(issue)),
                         size = stringr::str_length(line))
threads <- dplyr::ungroup(threads)
threads <- dplyr::group_by(threads, issue, line)
threads <- dplyr::arrange(threads, created_at)
threads <- dplyr::filter(threads, no_total_occ <= 2,
                         # for repetitions in total keep the short ones
                         # bc they are stuff like "thanks" so not template
                         # yes 10 is arbitrary
                         no_issues <= 1 | size < 10)

# when there's a duplicate in one issue
# it's probably citation
# so keep the first occurrence
get_first <- function(x){
  x[1]
}

threads <- dplyr::group_by(threads, issue, line)
threads <- dplyr::summarise_all(threads, get_first)
threads <- dplyr::select(threads, -no_total_occ, -size, -no_issues)
threads <- dplyr::mutate(threads,
                         # let code words now be real words
                         line = stringr::str_replace_all(line, "`", ""),
                         # only keep text from links, not the links themselves
                         line = stringr::str_replace_all(line, "\\]\\(.*\\)", ""),
                         line = stringr::str_replace_all(line, "\\[", ""),
                         line = stringr::str_replace_all(line, "blob\\/master", ""),
                         # ’
                         line = stringr::str_replace_all(line, "’", ""),
                         # remove some other links
                         line = stringr::str_replace_all(line, "https\\:\\/\\/github\\.com\\/", ""))
threads <- dplyr::filter(threads, !stringr::str_detect(line, "estimated hours spent reviewing"))
threads <- dplyr::filter(threads, !stringr::str_detect(line, "notifications@github\\.com"))
threads <- dplyr::filter(threads, !stringr::str_detect(line, "reply to this email directly, view it on"))
threads <- dplyr::ungroup(threads)

Here is what we get from the same review.

dplyr::filter(threads, user == "jsta", is_review)  %>%  dplyr::select(line) %>%  head() %>%  knitr::kable()
line
* also, you might consider using the skip_on_cran function for lines that call an external download as recommended by the ropensci packaging guide.
* i am having timeout issues with building the getting_started vignette. i wonder if there is a particular year with very few hurricanes that would solve the timeout problem.
* i cannot build the data.world vignette. probably because i don’t have an api key set up. you may want to consider setting the code chunks to eval=false.
* i really like the tidy_ functions. i wonder if it would make it easier on the end-user to have the get_ functions return tidy results by default with an optional argument to return “messy” results.
* missing a maintainer field in the description
* there are no examples for knots_to_mph, mb_to_in, status_abbr_to_str, get_discus, get_fstadv, tidy_fstadv, tidy_wr, tidy_fcst. maybe some can be changed to non-exported?

So now we have mostly kept the interesting, original human language.

This got me “tidy enough” text. Let’s not mention this package author who found a way to poorly format their submission right under a guideline explaining how to copy the DESCRIPTION… Yep, that’s younger me. Oh well.

Computing sentiment

Another data preparation part was to compute the sentiment score of each line via the sentimentr package by Tyler Rinker, which computes a score for sentences, not for single words.

sentiments <- all %>%
  dplyr::group_by(line, created_at, user, role, issue) %>%
  dplyr::mutate(sentiment = median(sentimentr::sentiment(line)$sentiment)) %>%
  dplyr::ungroup() %>%
  dplyr::select(line, created_at, user, role, issue, sentiment)

This dataset of sentiment will be used later in the post.
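
To get a feel for the raw scores, here is an illustrative call on a made-up sentence (illustration only, not data from the threads):

# sentiment() scores each sentence of its input and returns a data.table
# with element_id, sentence_id, word_count and sentiment columns
sentimentr::sentiment("this package is great and the review was very helpful!")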

Tidy text analysis of onboarding social weather

What do reviews talk about?

To find out what reviews deal with as if I didn’t know about our guidelines, I’ll compute the frequency of words and bigrams, and the pairwise correlation of words within issue comments.

My using lollipops below was inspired by this fascinating blog post of Tony ElHabr’s about his Google search history.

library("ggplot2")library("ggalt")library("hrbrthemes")stopwords <- rcorpora::corpora("words/stopwords/en")$stopWordsword_counts <- threads %>%  tidytext::unnest_tokens(word, line) %>%  dplyr::filter(!word %in% stopwords) %>%  dplyr::count(word, sort = TRUE) %>%  dplyr::mutate(word = reorder(word, n))   ggplot(word_counts[1:15,]) +  geom_lollipop(aes(word, n),                size = 2, col = "salmon") +  hrbrthemes::theme_ipsum(base_size = 16,                          axis_title_size = 16) +  coord_flip()

Most common words in onboarding review threads

bigrams_counts <- threads %>%
  tidytext::unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ",
                  remove = FALSE) %>%
  dplyr::filter(!word1 %in% stopwords) %>%
  dplyr::filter(!word2 %in% stopwords) %>%
  dplyr::count(bigram, sort = TRUE) %>%
  dplyr::mutate(bigram = reorder(bigram, n))

ggplot(bigrams_counts[2:15,]) +
  geom_lollipop(aes(bigram, n),
                size = 2, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16) +
  coord_flip()

Most common bigrams in onboarding review threads

I’m not showing the first bigram that basically shows I’ve an encoding issue to solve with a variation of “´”. In any case, both figures show what we care about, like our guidelines that are mentioned often, and documentation. I think words absent from the figures such as performance or speed also highlight what we care less about, following Jeff Leek’s philosophy.

Now, let’s move on to a bit more complex visualization of pairwise correlations between words within lines. First, let’s prepare the table of words in lines. Compared with the book tutorial, we add a condition for eliminating words mentioned in only one submission, often function names.

users <- unique(threads$user)

onboarding_line_words <- threads %>%
  dplyr::group_by(user, issue, created_at, package, line) %>%
  dplyr::mutate(line_id = paste(package, user, created_at, line)) %>%
  dplyr::ungroup() %>%
  tidytext::unnest_tokens(word, line) %>%
  dplyr::filter(word != package, !word %in% users,
                is.na(as.numeric(word)),
                word != "ldecicco",
                word != "usgs") %>%
  dplyr::group_by(word) %>%
  dplyr::filter(length(unique(issue)) > 1) %>%
  dplyr::select(line_id, word)

onboarding_line_words %>%
  head() %>%
  knitr::kable()
| line_id | word |
|:---|:---|
| rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. | add |
| rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. | a |
| rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. | ropensci |
| rrlite karthik 2015-04-12 20:56:04 – ] add a ropensci footer. | footer |
| rrlite karthik 2015-04-12 20:56:04 – ] add an appropriate entry into ropensci.org/packages/index.html | add |
| rrlite karthik 2015-04-12 20:56:04 – ] add an appropriate entry into ropensci.org/packages/index.html | an |

Then, we can compute the correlation.

word_cors <- onboarding_line_words %>%
  dplyr::group_by(word) %>%
  dplyr::filter(!word %in% stopwords) %>%
  dplyr::filter(n() >= 20) %>%
  widyr::pairwise_cor(word, line_id, sort = TRUE)

For instance, what often goes in the same line as vignette?

dplyr::filter(word_cors, item1 == "vignette")
## # A tibble: 853 x 3
##    item1    item2     correlation
##    <chr>    <chr>           <dbl>
##  1 vignette readme         0.176 
##  2 vignette vignettes      0.174 
##  3 vignette chunk          0.145 
##  4 vignette eval           0.120 
##  5 vignette examples       0.108 
##  6 vignette overview       0.0933
##  7 vignette building       0.0914
##  8 vignette link           0.0863
##  9 vignette maps           0.0840
## 10 vignette package        0.0831
## # ... with 843 more rows

Now let’s plot the network of these relationships between words, using the igraph package by Gábor Csárdi and Tamás Nepusz and the ggraph package by Thomas Lin Pedersen.

library("igraph")library("ggraph")set.seed(2016)word_cors %>%  dplyr::filter(correlation > .35) %>%  graph_from_data_frame() %>%  ggraph(layout = "fr") +  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +  geom_node_point(color = "lightblue", size = 5) +  geom_node_text(aes(label = name), repel = TRUE) +  theme_void()

This figure gives a good sample of things discussed in reviews. Despite our efforts at filtering out words specific to single issues, some very specific ones remain, such as country/city/location, which are very frequent in the ropenaq review.

How positive is onboarding?

Using sentiment analysis, we can look at how positive comments are.

sentiments %>%  dplyr::group_by(role) %>%  skimr::skim(sentiment)
## Skim summary statistics
##  n obs: 11553 
##  n variables: 6 
##  group variables: role 
## 
## Variable type: numeric 
##               role  variable missing complete    n  mean   sd   min p25
##             author sentiment       0     4823 4823 0.07  0.21 -1.2    0
##  community_manager sentiment       0       97   97 0.13  0.21 -0.41   0
##             editor sentiment       0     1521 1521 0.13  0.22 -1.63   0
##              other sentiment       0      344  344 0.073 0.2  -0.6    0
##           reviewer sentiment       0     4768 4768 0.073 0.21 -1      0
##  median  p75  max
##   0     0.17 1.84
##   0.071 0.23 1   
##   0.075 0.25 1.13
##   0     0.2  0.81
##   0     0.17 1.73
summary(sentiments$sentiment)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.63200  0.00000  0.00000  0.07961  0.18353  1.84223
sentiments %>%
  dplyr::filter(!role %in% c("other", "community_manager")) %>%
  ggplot(aes(role, sentiment)) +
  geom_boxplot(fill = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16,
                          strip_text_size = 16)

Sentiment of onboarding review threads by line

These boxplots seem to indicate that lines are generally positive (positive mean, zero 25th-quantile), although it’d be better to be able to compare them with text from traditional review processes of scientific manuscripts in order to get a better feeling for the meaning of these numbers.

On these boxplots we also see that we do get lines with a negative sentiment value… about what? Here are the most common words in negative lines.

sentiments %>%
  dplyr::filter(sentiment < 0) %>%
  tidytext::unnest_tokens(word, line) %>%
  dplyr::filter(!word %in% stopwords) %>%
  dplyr::count(word, sort = TRUE) %>%
  dplyr::mutate(word = reorder(word, n)) %>%
  dplyr::filter(n > 100) %>%
  ggplot() +
  geom_lollipop(aes(word, n),
                size = 2, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 16,
                          axis_title_size = 16) +
  coord_flip()

Most common words in negative lines

And looking at a sample…

sentiments %>%
  dplyr::arrange(sentiment) %>%
  dplyr::select(line, sentiment) %>%
  head(n = 15) %>%
  knitr::kable()
| line | sentiment |
|:---|---:|
| @ultinomics no more things, although do make sure to add more examples – perhaps open an issue ropenscilabs/gtfsr/issues to remind yourself to do that, | -1.6320000 |
| not sure what you mean, but i’ll use different object names to avoid any confusion (ropenscilabs/mregions#24) | -1.2029767 |
| error in .local(.object, …) : | -1.0000000 |
| error: | -1.0000000 |
| #### miscellaneous | -1.0000000 |
| error: command failed (1) | -0.8660254 |
| – get_plate_size_from_number_of_columns: maybe throwing an error makes more sense than returning a string indicating an error | -0.7855844 |
| this code returns an error, which is good, but it would be useful to return a more clear error. filtering on a non-existant species results in a 0 “length” onekp object (ok), but then the download_* functions return a curl error due to a misspecified url. | -0.7437258 |
| 0 errors \| 0 warnings \| 0 notes | -0.7216878 |
| once i get to use this package more, i’m sure i’ll have more comments/issues but for the moment i just want to get this review done so it isn’t a blocker. | -0.7212489 |
| – i now realize i’ve pasted the spelling mistakes without thinking too much about us vs. uk english, sorry. | -0.7071068 |
| minor issues: | -0.7071068 |
| ## minor issues | -0.7071068 |
| replicates issue | -0.7071068 |
| visualization issue | -0.7071068 |

It seems that negative lines are mostly people discussing bugs, problems in code, and GitHub issues, and trying to solve them. These are the kind of negative lines we’re happy to see in our process, since once solved, they mean the software got more robust!

Last but not least, I mentioned our using particular cases as examples of how happy everyone seems to be in the process. To find such examples, we rely on memory, but what about picking heart-warming lines using their sentiment score?

sentiments %>%
  dplyr::arrange(-sentiment) %>%
  dplyr::select(line, sentiment) %>%
  head(n = 15) %>%
  knitr::kable()
| line | sentiment |
|:---|---:|
| absolutely – it’s really important to ensure it really has been solved! | 1.842234 |
| overall, really easy to use and really nicely done. | 1.733333 |
| this package is a great and lightweight addition to working with rdf and linked data in r. coming after my review of the codemetar package which introduced me to linked data, i found this a great learning experience into a topic i’ve become really interested in but am still quite novice in so i hope my feedback helps to appreciate that particular pov. | 1.463226 |
| i am very grateful for your approval and i very much look forward to collaborating with you and the ropensci community. | 1.256935 |
| thank you very much for the constructive thoughts. | 1.237437 |
| thanks for the approval, all in all a very helpful and educational process! | 1.217567 |
| – really good use of helper functions | 1.139013 |
| – i believe the utf note is handled correctly and this is just a snafu in goodpractice, but i will seek a reviewer with related expertise in ensuring that all unicode is handled properly. | 1.132201 |
| seem more unified and consistent. | 1.126978 |
| very much appreciated! | 1.125833 |
| – well organized, readable code | 1.100000 |
| – wow very extensive testing! well done, very thorough | 1.100000 |
| – i’m delighted that you find my work interesting and i’m very keen to help, contribute and collaborate in any capacity. | 1.084493 |
| thank you very much for your thorough and thoughtful review, @batpigandme ! this is great feedback, and i think that visdat will be much improved because of these reviews. | 1.083653 |
| great, thank you very much for accepting this package. i am very grateful about the reviews, which were very helpful to improve this package! | 1.074281 |

As you can imagine, these sentences make the whole team very happy! And we hope they’ll encourage you to contribute to rOpenSci onboarding.

Outlook

This first try at text analysis of onboarding issue threads is quite promising: we were able to retrieve text and to use natural language processing to extract most common words and bigrams, and sentiment. This allowed us to describe the social weather of onboarding: we could see that this system is about software, and that negative sentiment was often due to bugs being discussed and solved; and we could extract the most positive lines where volunteers praised the review system or the piece of software under review.

One could expand this analysis with a study of emoji use, in the text using an emoji dictionary as in this blog post, and around the text (so-called emoji reactions, present in the API and used in e.g. ghrecipes::get_issues_thumbs). Another aspect of social weather is the timeliness that’s expected or achieved at the different stages of the process, but that would be covered by other data, such as the labelling history of the issues, which could be extracted from the GitHub V4 API as well.

This is the final post of the “Our package review system in review” series. The first post presented data collection from GitHub, the second post aimed at quantifying the work represented by onboarding. The posts motivated us to keep using data to illustrate and assess the system. Brace yourself for more onboarding data analyses in the future!


Format and Interpret Linear Mixed Models


(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

You find it time-consuming to manually format, copy and paste output values to your report or manuscript? That time is over: the psycho package is here for you!

The data

Let’s take the example dataset included in the psycho package.

library(psycho)
library(tidyverse)

df <- psycho::emotion %>%
  select(Participant_ID, Emotion_Condition, Subjective_Valence, Autobiographical_Link)

summary(df)
 Participant_ID Emotion_Condition Subjective_Valence Autobiographical_Link
 10S    : 48    Negative:456      Min.   :-100.000   Min.   :  0.00       
 11S    : 48    Neutral :456      1st Qu.: -65.104   1st Qu.:  0.00       
 12S    : 48                      Median :  -2.604   Median : 16.15       
 13S    : 48                      Mean   : -18.900   Mean   : 28.99       
 14S    : 48                      3rd Qu.:   7.000   3rd Qu.: 59.90       
 15S    : 48                      Max.   : 100.000   Max.   :100.00       
 (Other):624                                         NA's   :1            

Our dataframe (called df) contains data from several participants, exposed to neutral and negative pictures (the Emotion_Condition column). Each row corresponds to a single trial. During each trial, the participant had to rate the emotional valence (Subjective_Valence: positive – negative) they experienced during the picture presentation and the amount of personal memories associated with the picture (Autobiographical_Link).

Our dataframe contains, for each of the 48 trials, 4 variables: the name of the participant (Participant_ID), the emotion condition (Emotion_Condition), the valence rating (Subjective_Valence) and the Autobiographical Link (Autobiographical_Link).

Fit the model

Let’s fit a linear mixed model to predict the autobiographical link with the condition and the subjective valence.

library(lmerTest)

fit <- lmer(Autobiographical_Link ~ Emotion_Condition * Subjective_Valence +
              (1 | Participant_ID),
            data = df)

summary(fit)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Autobiographical_Link ~ Emotion_Condition * Subjective_Valence +  
    (1 | Participant_ID)
   Data: df

REML criterion at convergence: 8555.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.2682 -0.6696 -0.2371  0.7052  3.2187 

Random effects:
 Groups         Name        Variance Std.Dev.
 Participant_ID (Intercept) 243.1    15.59   
 Residual                   661.4    25.72   
Number of obs: 911, groups:  Participant_ID, 19

Fixed effects:
                                             Estimate Std. Error        df
(Intercept)                                  25.52248    4.23991  31.49944
Emotion_ConditionNeutral                      6.13715    2.66993 895.13045
Subjective_Valence                            0.05772    0.03430 898.46616
Emotion_ConditionNeutral:Subjective_Valence   0.16140    0.05020 896.26695
                                            t value Pr(>|t|)    
(Intercept)                                   6.020 1.09e-06 ***
Emotion_ConditionNeutral                      2.299  0.02176 *  
Subjective_Valence                            1.683  0.09280 .  
Emotion_ConditionNeutral:Subjective_Valence   3.215  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) Emt_CN Sbjc_V
Emtn_CndtnN -0.459              
Sbjctv_Vlnc  0.455 -0.726       
Emtn_CN:S_V -0.308  0.301 -0.676

The analyze function

The analyze function, available in the psycho package, transforms a model fit object into user-friendly outputs.

results <- analyze(fit, CI = 95)

Summary

Summarizing an analyzed object returns a dataframe that can be easily saved and included in reports (see the sketch after the table below). It also includes standardized coefficients, as well as bootstrapped confidence intervals (CI) and effect sizes.

summary(results) %>% mutate(p = psycho::format_p(p))
| Variable | Coef | SE | t | df | Coef.std | SE.std | p | Effect_Size | CI_lower | CI_higher |
|:---|---:|---:|---:|---:|---:|---:|:---|:---|---:|---:|
| (Intercept) | 25.52 | 4.24 | 6.02 | 31.50 | 0.00 | 0.00 | < .001*** | Very Small | 17.16 | 33.93 |
| Emotion_ConditionNeutral | 6.14 | 2.67 | 2.30 | 895.13 | 0.10 | 0.04 | < .05* | Very Small | 0.91 | 11.37 |
| Subjective_Valence | 0.06 | 0.03 | 1.68 | 898.47 | 0.09 | 0.06 | = 0.09° | Very Small | -0.01 | 0.12 |
| Emotion_ConditionNeutral:Subjective_Valence | 0.16 | 0.05 | 3.22 | 896.27 | 0.13 | 0.04 | < .01** | Very Small | 0.06 | 0.26 |
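
Since the summary is a regular data frame, it can go straight to disk for use in a report. A minimal sketch (the file name is arbitrary):

# save the formatted summary table; the file name is just an example
write.csv(summary(results), "lmm_results.csv", row.names = FALSE)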

Print

Moreover, the print method returns a nicely formatted output that can be almost directly pasted into the manuscript.

print(results)
The overall model predicting Autobiographical_Link (formula = Autobiographical_Link ~ Emotion_Condition * Subjective_Valence + (1 | Participant_ID)) successfully converged and explained 32.48% of the variance of the endogen (the conditional R2). The variance explained by the fixed effects was of 7.66% (the marginal R2) and the one explained by the random effects of 24.82%. The model's intercept is at 25.52 (SE = 4.24, 95% CI [17.16, 33.93]). Within this model:
   - The effect of Emotion_ConditionNeutral is significant (beta = 6.14, SE = 2.67, 95% CI [0.91, 11.37], t(895.13) = 2.30, p < .05*) and can be considered as very small (std. beta = 0.098, std. SE = 0.043).
   - The effect of Subjective_Valence is significant (beta = 0.058, SE = 0.034, 95% CI [-0.0097, 0.12], t(898.47) = 1.68, p = 0.09°) and can be considered as very small (std. beta = 0.095, std. SE = 0.056).
   - The effect of Emotion_ConditionNeutral:Subjective_Valence is significant (beta = 0.16, SE = 0.050, 95% CI [0.063, 0.26], t(896.27) = 3.22, p < .01**) and can be considered as very small (std. beta = 0.13, std. SE = 0.041).

The intercept (the baseline level) corresponds, here, to the negative condition with subjective valence at 0 (the average, as the data is standardized). Compared to that, changing the condition from negative to neutral does not induce any significant change to the outcome. However, in the negative condition, there is a trending linear relationship between valence and autobiographical memories: the more an item is positive the more it is related to memories. Finally, the interaction is significant: the relationship between valence and autobiographical memories is stronger (more positive) in the neutral condition.

Credits

This package helped you? You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

Rolling Fama French


(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post, we reviewed how to import the Fama French 3-Factor data, wrangle that data, and then regress our portfolio returns on the factors. Please have a look at that previous post, as the following work builds upon it. For more background on Fama French, see the original article published in The Journal of Financial Economics, Common risk factors in the returns on stocks and bonds.

Today, we will explore the rolling Fama French model and the explanatory power of the 3 factors in different time periods. In the financial world, we often look at rolling means, standard deviations and models to make sure we haven’t missed anything unusual, risky, or concerning during different market or economic regimes. Our portfolio returns history covers the years 2013 through 2017, which is rather a short history, but there still might be a 24-month period where the Fama French factors were particularly strong, weak, or meaningless. We would like to unearth such periods and hypothesize about what explains them or how likely they are to recur.
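
As a reminder of what “rolling” means here, a 24-month rolling mean and standard deviation of monthly returns could be computed along these lines (a sketch using the zoo package and the portfolio_returns_tq_rebalanced_monthly object built below):

# a sketch of 24-month rolling summary statistics for monthly returns
library(zoo)

returns   <- portfolio_returns_tq_rebalanced_monthly$returns
roll_mean <- zoo::rollmean(returns, k = 24, align = "right")
roll_sd   <- zoo::rollapply(returns, width = 24, FUN = sd, align = "right")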

We will be working with our usual portfolio consisting of:

+ SPY (S&P500 fund) weighted 25%+ EFA (a non-US equities fund) weighted 25%+ IJS (a small-cap value fund) weighted 20%+ EEM (an emerging-mkts fund) weighted 20%+ AGG (a bond fund) weighted 10%

Before we can run a Fama French model for that portfolio, we need to find portfolio monthly returns, which was covered in this post. I won’t go through the logic again but the code is here:

library(tidyquant)
library(tidyverse)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

prices <- 
  getSymbols(symbols, src = 'yahoo', 
             from = "2012-12-31",
             to = "2017-12-31",
             auto.assign = TRUE, warnings = FALSE) %>% 
  map(~Ad(get(.))) %>%
  reduce(merge) %>% 
  `colnames<-`(symbols)

w <- c(0.25, 0.25, 0.20, 0.20, 0.10)

asset_returns_long <-  
  prices %>% 
  to.monthly(indexAt = "lastof", OHLC = FALSE) %>% 
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>% 
  group_by(asset) %>%  
  mutate(returns = (log(returns) - log(lag(returns)))) %>% 
  na.omit()

portfolio_returns_tq_rebalanced_monthly <- 
  asset_returns_long %>%
  tq_portfolio(assets_col  = asset, 
               returns_col = returns,
               weights     = w,
               col_rename  = "returns",
               rebalance_on = "months")

We also need to import the Fama French factors and combine them into one object with our portfolio returns. We painstakingly covered this in the previous post and the code for doing so is here:

# glue() builds the download url; lubridate supplies ymd(), parse_date_time() and rollback()
library(glue)
library(lubridate)

temp <- tempfile()

base <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
factor <- "Global_3_Factors"
format <- "_CSV.zip"
full_url <- glue(base, factor, format, sep = "")

download.file(full_url, temp, quiet = TRUE)

Global_3_Factors <- 
  read_csv(unz(temp, "Global_3_Factors.csv"), 
           skip = 6) %>% 
  rename(date = X1) %>% 
  mutate_at(vars(-date), as.numeric) %>% 
  mutate(date = 
           rollback(ymd(parse_date_time(date, "%Y%m") + months(1)))) %>% 
  filter(date >= first(portfolio_returns_tq_rebalanced_monthly$date) &
           date <= last(portfolio_returns_tq_rebalanced_monthly$date))

ff_portfolio_returns <- 
  portfolio_returns_tq_rebalanced_monthly %>% 
  left_join(Global_3_Factors, by = "date") %>% 
  mutate(MKT_RF = Global_3_Factors$`Mkt-RF`/100,
         SMB = Global_3_Factors$SMB/100,
         HML = Global_3_Factors$HML/100,
         RF = Global_3_Factors$RF/100,
         R_excess = round(returns - RF, 4))

We now have one data frame ff_portfolio_returns that holds our Fama French factors and portfolio returns. Let’s get to the rolling analysis.

We first define a rolling model with the rollify() function from tibbletime. However, instead of wrapping an existing function, such as kurtosis() or skewness(), we will pass in our linear Fama French model.

# Choose a 24-month rolling window
window <- 24

library(tibbletime)

# define a rolling ff model with tibbletime
rolling_lm <- 
  rollify(.f = function(R_excess, MKT_RF, SMB, HML) {
    lm(R_excess ~ MKT_RF + SMB + HML)
  }, window = window, unlist = FALSE)

Next, we pass columns from ff_portfolio_returns to the rolling model function.

rolling_ff_betas <-
  ff_portfolio_returns %>% 
  mutate(rolling_ff = 
           rolling_lm(R_excess, 
                      MKT_RF, 
                      SMB, 
                      HML)) %>% 
  slice(-1:-23) %>% 
  select(date, rolling_ff)

head(rolling_ff_betas, 3)
# A tibble: 3 x 2
  date       rolling_ff
  <date>     <list>    
1 2014-12-31 <S3: lm>  
2 2015-01-31 <S3: lm>  
3 2015-02-28 <S3: lm>  

We now have a new data frame called rolling_ff_betas, in which the column rolling_ff holds an S3 object of our model results. We can tidy() that column with map(rolling_ff, tidy) and then unnest() the results, very similar to our CAPM work, except we have more than one independent variable.

# broom supplies tidy() and, later, glance()
library(broom)

rolling_ff_betas <-
  ff_portfolio_returns %>% 
  mutate(rolling_ff = 
           rolling_lm(R_excess, 
                      MKT_RF, 
                      SMB, 
                      HML)) %>% 
  mutate(tidied = map(rolling_ff, 
                      tidy, 
                      conf.int = T)) %>% 
  unnest(tidied) %>% 
  slice(-1:-23) %>% 
  select(date, term, estimate, conf.low, conf.high) %>% 
  filter(term != "(Intercept)") %>% 
  rename(beta = estimate, factor = term) %>% 
  group_by(factor)

head(rolling_ff_betas, 3)
# A tibble: 3 x 5
# Groups:   factor [3]
  date       factor    beta conf.low conf.high
  <date>     <chr>    <dbl>    <dbl>     <dbl>
1 2014-12-31 MKT_RF  0.931     0.784     1.08 
2 2014-12-31 SMB    -0.0130   -0.278     0.252
3 2014-12-31 HML    -0.160    -0.459     0.139

We now have rolling betas and confidence intervals for each of our 3 factors. Let’s apply the same code logic and extract the rolling R-squared for our model. The only difference is we call glance() instead of tidy().

rolling_ff_rsquared <-
  ff_portfolio_returns %>% 
  mutate(rolling_ff = 
           rolling_lm(R_excess, 
                      MKT_RF, 
                      SMB, 
                      HML)) %>%
  slice(-1:-23) %>%
  mutate(glanced = map(rolling_ff, 
                       glance)) %>% 
  unnest(glanced) %>% 
  select(date, r.squared, adj.r.squared, p.value)

head(rolling_ff_rsquared, 3)
# A tibble: 3 x 4
  date       r.squared adj.r.squared  p.value
  <date>         <dbl>         <dbl>    <dbl>
1 2014-12-31     0.898         0.883 4.22e-10
2 2015-01-31     0.914         0.901 8.22e-11
3 2015-02-28     0.919         0.907 4.19e-11

We have extracted rolling factor betas and the rolling model R-squared, now let’s visualize.

Visualizing Rolling Fama French

We start by charting the rolling factor betas with ggplot(). This gives us a view into how the explanatory power of each factor has changed over time.

rolling_ff_betas %>% 
  ggplot(aes(x = date, 
             y = beta, 
             color = factor)) + 
  geom_line() +
  labs(title = "24-Month Rolling FF Factor Betas",
       x = "rolling betas") +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) 

The rolling factor beta chart reveals some interesting trends. Both SMB and HML have hovered around zero, while the MKT factor has hovered around 1. That’s consistent with our plot of betas with confidence intervals from last time.

Next, let’s visualize the rolling R-squared with highcharter.

We first convert rolling_ff_rsquared to xts, using the tk_xts() function.

rolling_ff_rsquared_xts <- 
  rolling_ff_rsquared %>%
  tk_xts(date_var = date, silent = TRUE)

Then pass the xts object to a highchart(type = "stock") code flow, adding the rolling R-squared time series with hc_add_series(rolling_ff_rsquared_xts$r.squared...).

highchart(type = "stock") %>%   hc_add_series(rolling_ff_rsquared_xts$r.squared,                color = "cornflowerblue",                name = "r-squared") %>%   hc_title(text = "Rolling FF 3-Factor R-Squared") %>%  hc_add_theme(hc_theme_flat()) %>%  hc_navigator(enabled = FALSE) %>%   hc_scrollbar(enabled = FALSE) %>%   hc_exporting(enabled = TRUE)

{"x":{"hc_opts":{"title":{"text":"Rolling FF 3-Factor R-Squared"},"yAxis":{"title":{"text":null}},"credits":{"enabled":false},"exporting":{"enabled":true},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1419984000000,0.898193874076082],[1422662400000,0.913613669331354],[1425081600000,0.919257460405196],[1427760000000,0.919053841151536],[1430352000000,0.920192764383095],[1433030400000,0.920671410263961],[1435622400000,0.919801526746092],[1438300800000,0.908279914990913],[1440979200000,0.92707755328544],[1443571200000,0.916184001290169],[1446249600000,0.926114696965536],[1448841600000,0.925254318145437],[1451520000000,0.923190960293236],[1454198400000,0.929828565005735],[1456704000000,0.922988355712527],[1459382400000,0.938348757232282],[1461974400000,0.937402062865212],[1464652800000,0.936268502221349],[1467244800000,0.912164668775577],[1469923200000,0.916567108467131],[1472601600000,0.918047866998973],[1475193600000,0.917334780359066],[1477872000000,0.938055499197554],[1480464000000,0.939201416001601],[1483142400000,0.938828531422516],[1485820800000,0.939091105917217],[1488240000000,0.934435224922171],[1490918400000,0.934421254157578],[1493510400000,0.934758653807486],[1496188800000,0.936641187924107],[1498780800000,0.934909707340084],[1501459200000,0.943913578310692],[1504137600000,0.932829172819464],[1506729600000,0.926037292248958],[1509408000000,0.911850534203785],[1512000000000,0.910661077924351],[1514678400000,0.916587636494134]],"color":"cornflowerblue","name":"r-squared"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"colors":["#f1c40f","#2ecc71","#9b59b6","#e74c3c","#34495e","#3498db","#1abc9c","#f39c12","#d35400"],"chart":{"backgroundColor":"#ECF0F1"},"xAxis":{"gridLineDashStyle":"Dash","gridLineWidth":1,"gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"yAxis":{"gridLineDashStyle":"Dash","gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"legendBackgroundColor":"rgba(0, 0, 0, 0.5)","background2":"#505053","dataLabelsColor":"#B0B0B3","textColor":"#34495e","contrastTextColor":"#F0F0F3","maskColor":"rgba(255,255,255,0.3)"},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

That chart looks choppy, but the R-squared never really left the range between .9 and .95. We can tweak the minimum and maximum y-axis values for some perspective.

highchart(type = "stock") %>% 
  hc_add_series(rolling_ff_rsquared_xts$r.squared,
                color = "cornflowerblue",
                name = "r-squared") %>% 
  hc_title(text = "Rolling FF 3-Factor R-Squared") %>%
  hc_yAxis(max = 2, min = 0) %>% 
  hc_add_theme(hc_theme_flat()) %>%
  hc_navigator(enabled = FALSE) %>% 
  hc_scrollbar(enabled = FALSE) %>% 
  hc_exporting(enabled = TRUE)

{"x":{"hc_opts":{"title":{"text":"Rolling FF 3-Factor R-Squared"},"yAxis":{"title":{"text":null},"max":2,"min":0},"credits":{"enabled":false},"exporting":{"enabled":true},"plotOptions":{"series":{"turboThreshold":0},"treemap":{"layoutAlgorithm":"squarified"},"bubble":{"minSize":5,"maxSize":25}},"annotationsOptions":{"enabledButtons":false},"tooltip":{"delayForDisplay":10},"series":[{"data":[[1419984000000,0.898193874076082],[1422662400000,0.913613669331354],[1425081600000,0.919257460405196],[1427760000000,0.919053841151536],[1430352000000,0.920192764383095],[1433030400000,0.920671410263961],[1435622400000,0.919801526746092],[1438300800000,0.908279914990913],[1440979200000,0.92707755328544],[1443571200000,0.916184001290169],[1446249600000,0.926114696965536],[1448841600000,0.925254318145437],[1451520000000,0.923190960293236],[1454198400000,0.929828565005735],[1456704000000,0.922988355712527],[1459382400000,0.938348757232282],[1461974400000,0.937402062865212],[1464652800000,0.936268502221349],[1467244800000,0.912164668775577],[1469923200000,0.916567108467131],[1472601600000,0.918047866998973],[1475193600000,0.917334780359066],[1477872000000,0.938055499197554],[1480464000000,0.939201416001601],[1483142400000,0.938828531422516],[1485820800000,0.939091105917217],[1488240000000,0.934435224922171],[1490918400000,0.934421254157578],[1493510400000,0.934758653807486],[1496188800000,0.936641187924107],[1498780800000,0.934909707340084],[1501459200000,0.943913578310692],[1504137600000,0.932829172819464],[1506729600000,0.926037292248958],[1509408000000,0.911850534203785],[1512000000000,0.910661077924351],[1514678400000,0.916587636494134]],"color":"cornflowerblue","name":"r-squared"}],"navigator":{"enabled":false},"scrollbar":{"enabled":false}},"theme":{"colors":["#f1c40f","#2ecc71","#9b59b6","#e74c3c","#34495e","#3498db","#1abc9c","#f39c12","#d35400"],"chart":{"backgroundColor":"#ECF0F1"},"xAxis":{"gridLineDashStyle":"Dash","gridLineWidth":1,"gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"yAxis":{"gridLineDashStyle":"Dash","gridLineColor":"#BDC3C7","lineColor":"#BDC3C7","minorGridLineColor":"#BDC3C7","tickColor":"#BDC3C7","tickWidth":1},"legendBackgroundColor":"rgba(0, 0, 0, 0.5)","background2":"#505053","dataLabelsColor":"#B0B0B3","textColor":"#34495e","contrastTextColor":"#F0F0F3","maskColor":"rgba(255,255,255,0.3)"},"conf_opts":{"global":{"Date":null,"VMLRadialGradientURL":"http =//code.highcharts.com/list(version)/gfx/vml-radial-gradient.png","canvasToolsURL":"http =//code.highcharts.com/list(version)/modules/canvas-tools.js","getTimezoneOffset":null,"timezoneOffset":0,"useUTC":true},"lang":{"contextButtonTitle":"Chart context menu","decimalPoint":".","downloadJPEG":"Download JPEG image","downloadPDF":"Download PDF document","downloadPNG":"Download PNG image","downloadSVG":"Download SVG vector image","drillUpText":"Back to {series.name}","invalidDate":null,"loading":"Loading...","months":["January","February","March","April","May","June","July","August","September","October","November","December"],"noData":"No data to display","numericSymbols":["k","M","G","T","P","E"],"printChart":"Print chart","resetZoom":"Reset zoom","resetZoomTitle":"Reset zoom level 1:1","shortMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"thousandsSep":" ","weekdays":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]}},"type":"stock","fonts":[],"debug":false},"evals":[],"jsHooks":[]}

Ah, when the y-axis is zoomed out a bit, our R-squared looks consistently near 1 for the life of the portfolio.

That’s all for today. Thanks and see you next time!



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Mimicking SQLDF with MonetDBLite


(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

Like many useRs, I am also a big fan of the sqldf package developed by Grothendieck, which uses SQL statements for data frame manipulation, with the embedded SQLite database as the default back-end.
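As a quick reminder of that idiom, sqldf() lets you run a SQL query directly against a data frame in your workspace; the tiny data frame below is purely illustrative.

# Illustrative only: query an in-memory data frame with SQL via sqldf.
library(sqldf)

flights <- data.frame(origin   = c("JFK", "LGA", "JFK", "EWR"),
                      distance = c(2475, 733, 1389, 719))

sqldf("select origin, avg(distance) as avg_dist from flights group by origin")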

In the examples below, I drafted a couple of R utility functions that mimic the sqldf function but use the MonetDBLite back-end. The benchmark comparison shows several interesting observations:

  • The data import for csv files is more efficient with MonetDBLite than with the generic read.csv function or the read.csv.sql function in the sqldf package.
  • The data manipulation for a single data frame, such as selection, aggregation, and subqueries, is also significantly faster with MonetDBLite than with the sqldf function.
  • However, the sqldf function is extremely efficient in joining two data frames, e.g. the inner join in the example below.

# IMPORT
monet.read.csv <- function(file) {
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")
  suppressMessages(MonetDBLite::monetdb.read.csv(monet.con, file, "file", sep = ","))
  result <- DBI::dbReadTable(monet.con, "file")
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)  
}

microbenchmark::microbenchmark(monet = {df <- monet.read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 528.5378 532.5463 539.2877 539.0902 542.4301 559.1191    10

microbenchmark::microbenchmark(read.csv = {df <- read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: seconds
#     expr      min       lq     mean   median       uq      max neval
# read.csv 2.310238 2.338134 2.360688 2.343313 2.373913 2.444814    10

# SELECTION AND AGGREGATION
monet.sql <- function(df, sql) {
  df_str <- deparse(substitute(df))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")  
  suppressMessages(DBI::dbWriteTable(monet.con, df_str, df, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

microbenchmark::microbenchmark(monet = {monet.sql(df, "select * from df sample 3")}, times = 10)
#Unit: milliseconds
#  expr     min      lq     mean   median       uq     max neval
# monet 422.761 429.428 439.0438 438.3503 447.3286 453.104    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from df order by RANDOM() limit 3")}, times = 10)
#Unit: milliseconds
#  expr      min      lq     mean   median       uq      max neval
# sqldf 903.9982 908.256 925.4255 920.2692 930.0934 963.6983    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 450.7862 456.9589 458.6389 458.9634 460.4402 465.2253    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq      max neval
# sqldf 833.1494 836.6816 841.952 843.5569 846.8117 851.0771    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "with df1 as (select dest, avg(distance) as dist from df group by dest), df2 as (select dest, count(*) as cnts from df group by dest) select * from df1 inner join df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq     max neval
# monet 426.0248 431.2086 437.634 438.4718 442.8799 451.275    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from (select dest, avg(distance) as dist from df group by dest) df1 inner join (select dest, count(*) as cnts from df group by dest) df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
# sqldf 1.013116 1.017248 1.024117 1.021555 1.025668 1.048133    10

# MERGE 
monet.sql2 <- function(df1, df2, sql) {
  df1_str <- deparse(substitute(df1))
  df2_str <- deparse(substitute(df2))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")  
  suppressMessages(DBI::dbWriteTable(monet.con, df1_str, df1, overwrite = T))
  suppressMessages(DBI::dbWriteTable(monet.con, df2_str, df2, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

tbl1 <- monet.sql(df, "select dest, avg(distance) as dist from df group by dest")
tbl2 <- monet.sql(df, "select dest, count(*) as cnts from df group by dest")

microbenchmark::microbenchmark(monet = {monet.sql2(tbl1, tbl2, "select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean  median       uq      max neval
# monet 93.94973 174.2211 170.7771 178.487 182.4724 187.3155    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean median       uq      max neval
# sqldf 19.49334 19.60981 20.29535 20.001 20.93383 21.51837    10


To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Scientific debt


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

A very useful concept in software engineering is technical debt.

Technical debt occurs when engineers choose a quick but suboptimal solution to a problem, or don’t spend time to build sustainable infrastructure. Maybe they’re using an approach that doesn’t scale well as the team and codebase expand (such as hardcoding “magic numbers”), or using a tool for reasons of convenience rather than appropriateness (“we’ll write the DevOps infrastructure in PHP since that’s what our team already knows”). Either way, it’s something that seems like it’s working at first but causes real challenges in the long-term, in the form of delayed feature launches and hard-to-fix bugs.

In my new job as Chief Data Scientist at DataCamp, I’ve been thinking about the role of data science within a business, and discussing this with other professionals in the field. On a panel earlier this year, I realized that data scientists have a rough equivalent to this concept: “scientific debt.”

Scientific debt is when a team takes shortcuts in data analysis, experimental practices, and monitoring that could have long-term negative consequences. When you hear a statement like:

  • “We don’t have enough time to run a randomized test, let’s launch it”
  • “To a first approximation this effect is probably linear”
  • “This could be a confounding factor, but we’ll look into that later”
  • “It’s directionally accurate at least”

you’re hearing a little scientific debt being “borrowed”.

Example: WidgetCorp

Most engineers have a sense of what it’s like for a company to struggle with technical debt. What would a company struggling with scientific debt look like?

Imagine a small startup WidgetCorp is developing a B2B product, and deciding on their sales strategy. One year they decide to start focusing their sales efforts on larger corporate clients. They notice that as they take on this new strategy, their monthly revenue increases. They’re encouraged by this, and in the following years hire half a dozen salespeople with experience working with large clients, and spend marketing and design effort building that as part of their brand.

Years later, the strategy doesn’t seem to be paying off: their revenue is struggling and the early successes aren’t repeating themselves. They hire an analyst who looks at their sales data, and finds that in fact, it had never been the case that they’d had a higher return-on-investment selling to large companies. In that early year, their revenue had been rising because of a seasonal effect (the demand for widgets goes up in the fall and winter), which was compounded with some random noise and anecdotes (e.g. “SmallCompany.com was a waste of our time, but we just closed a huge deal with Megabiz!”)

WidgetCorp took on too much scientific debt.

 

Some ways this might have happened:

They made irreversible decisions based on flawed analyses. It’s reasonable to take a quick look at metrics and be happy that they’re going in the right direction. But once the company made product, sales and marketing changes, it became difficult to reverse them. Before making a major shift in business, it’s worth making sure that the data supports it: that they’ve accounted for seasonal effects and applied proper statistical tests.

Lack of monitoring. Early on, there may not have been enough data to tell whether larger clients really offered a better return on investment. But as more data was collected, it would be worth continually testing this assumption, in the form of a dashboard or a quarterly report. If this isn’t tracked, no one will notice that the hypothesis was falsified even once they have the data.

Lack of data infrastructure: Maybe early in the company the leads were locked in a sales CRM, while accounting data was stored in Excel spreadsheets that were emailed around. Even if there were a dedicated analyst within the company, they may not have easy access to the relevant data (linking together sales success and company size). Even if it were theoretically possible to combine the datasets with some effort, schlep blindness might have made everyone avoid the analysis entirely. This is an area where technical debt and scientific debt often appear together, since it takes engineering effort to make scientific problems easy to solve.

Spreading inaccurate lore. Suppose that the WidgetCorp CEO had given a series of company-wide talks and public blog posts with the message “The future of WidgetCorp is serving big companies!” Product teams got into the habit of prioritizing features in this direction, and every failure got blamed on “I guess we weren’t focused enough on big clients”. This kind of “cultural inertia” can be very difficult to reverse, even if the executive team is willing to publicly admit their mistake (which isn’t guaranteed!).

Just about every experienced data scientist has at least a few of these stories, even from otherwise successful companies. They are to scientific debt what the Daily WTF is to technical debt.

Is scientific debt always bad?

Not at all!

 

I often take shortcuts in my own analyses. Running a randomized experiment for a feature launch is sometimes too expensive, especially if the number of users is fairly small or the change pretty uncontroversial (you wouldn’t A/B test a typo fix). And while correlation doesn’t imply causation, it’s usually better than nothing when making business decisions.

The comparison to technical debt is useful here: a small engineering team’s first goal is typically to build a minimum viable product quickly, rather than overengineer a system that they think will be robust in the distant future. (The equivalent in scientific debt is typically called overthinking, e.g. “Yes I suppose we could control for weather when we examine what sales deals succeed, but I’m pretty sure you’re overthinking this”). And the comparison to financial debt is meaningful too: companies typically take on debt (or, similarly, give up equity) while they’re growing. Just like you can’t build a company without borrowing money, you can’t build a company while being certain every decision is thoroughly supported by data.

What’s important in both technical and scientific debt is to keep the long-term cost in mind.

It isn't technical debt if you aren't…

1) Leveraging it to get something valuable up front
2) Paying interest on it regularly
3) Treating it as a liability that you may eventually need to pay in full

Code that doesn't meet this criteria isn't debt, it's just low quality work.

— Practicing Developer (@practicingdev) February 26, 2018

Wrong decisions are expensive, and not paying attention to data is a risk. We can do a cost-benefit analysis of whether the risk is worth it, but we shouldn’t write it off as “data scientists always find something to complain about”.

Why even call it “debt”?

To a data scientist or analyst, this post might sound pretty obvious. Of course there are downsides to ignoring statistical rigor, so why bother giving it a “buzzword-y” name? Because it puts the concept in terms executives and managers can easily understand.

Again, let’s go back to technical debt. There are lots of reasons individual engineers may want to write “clean code”: they appreciate its elegance, they want to impress their peers, or they’re perfectionists procrastinating on other work. These reasons don’t generally matter to non-technical employees, who care about product features and reliability. The framing of technical debt helps emphasize what the company loses by not investing in architecture: the idea that even if a product looks like it’s working, the flaws have a long-term cost in actual dollars and time.


Engineer: It bothers me that different internal projects use different naming conventions.

CTO: Sorry it annoys you, but code is code, I don’t see why you should waste time on this.


Engineer: Our inconsistent naming conventions are technical debt: they make it harder for new developers to learn the system.

CTO: I’ve been looking for ways to reduce our onboarding time! Great idea, let me know what you need to fix it.


Similarly, scientists, especially from an academic background, often have a particular interest in discovering truths about reality. So the idea of “I’d like to analyze whether X is a confounding factor here” can sound like an indulgence rather than an immediate business need. Statisticians in particular are often excited by finding flaws in mathematical methods. So when a data scientist says something like “We can’t use that method, Jones et al 2012 proved that it is asymptotically inconsistent,” non-technical colleagues might assume they’re overthinking it or even showing off. Framing it in terms of what we’re actually risking helps communicate why it’s worth spending time on.

How can we manage scientific debt well?

  • Let data scientists “pay interest” on it. Just as not every engineering project will lead to a new feature, not every analysis will lead to an exciting discovery or novel algorithm. Some time needs to be spent confirming or invalidating existing assumptions. Jonathan Nolis has a great article about prioritizing data science work, where he describes this quadrant as “providing proof”.

  • Build data engineering processes: As described earlier, one reason a company might fall into scientific debt is that analysts may not have easy access to the data they need. It could be locked away in a platform that hasn’t been ingested, or in Google sheets that are edited by hand. Ingesting relevant data into a data warehouse or a data lake makes it more likely data scientists can make relevant discoveries.

  • Revisit old analyses: One common reason early-stage companies go into scientific debt is that they don’t yet have enough data to draw robust conclusions. Even if you don’t have enough data yet, that doesn’t mean you should forget about the problem. Sometimes I put time on my calendar to run an analysis once I expect enough data to be available, even if it’s a few months away. This can also help confirm an important analysis is still relevant: just like you’d keep track of a KPI over time, you want to keep track of whether a conclusion remains true.

  • Have data expertise spread throughout the company. Just as someone who can’t program may not recognize technical debt, someone who doesn’t have experience analyzing and understanding data may not recognize scientific debt. This is yet another reason to democratize data science within your company, as we do at DataCamp.


To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.



Exploratory Factor Analysis in R


(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Changing Your Viewpoint for Factors

In real life, data tends to follow patterns, but the reasons are not apparent right from the start of the analysis. Taking the common example of a demographics-based survey, many people will answer questions in a particular ‘way’. For example, all married men will have higher expenses than single men but lower than married men with children. In this case, the driving factor which makes them answer following a pattern is their economic status, but the answers may also depend on other factors such as level of education, salary, and locality or area. It becomes complicated to attribute answers to multiple factors. One option is to map each variable or answer to one of the factors. This process has a lot of drawbacks, such as the requirement to ‘guess’ the number of factors, heuristic-based or biased manual assignment, and ignoring the influence of the other factors on a variable. We have variables and answers in the data defined in a way such that we can understand and interpret them. What if we changed our viewing lens to an alternate representation, in which all variables are automatically mapped into new categories, each with customized weights based on their influence on that category? This is the idea behind factor analysis.

Creating Meaningful Factors

Factor analysis starts with the assumption of hidden latent variables which cannot be observed directly but are reflected in the answers or variables of the data. It also assumes that there are as many factors as there are variables. We then transform our current set of variables into an equal number of new variables, each of which is a weighted combination of the current ones. Hence, we are not essentially adding or removing information in this step but only transforming it. A typical way to make this transformation is to use eigenvalues and eigenvectors. We transform our data in the direction of each eigenvector and represent all the new variables or ‘factors’ using the eigenvalues. An eigenvalue greater than 1 means that the new factor explains more variance than one original variable. We then sort the factors in decreasing order of the variance they explain. Thus, the first factor will be the most influential factor, followed by the second factor, and so on. This also opens the door to variable reduction by dropping the last few factors. Typically we keep the factors which collectively explain 90-99% of the variance, depending upon the dataset and the problem. Thus, there is no need to guess the number of factors required for the data. However, we do need to read into each factor and the combination of original features it is made of to understand what it represents.
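As a small illustration of that idea (using R’s built-in mtcars data rather than a survey), the eigenvalues of a correlation matrix give the variance explained by each candidate factor:

# Eigen-decomposition of a correlation matrix: variance explained per factor.
R <- cor(mtcars)                  # correlation matrix of the variables
ev <- eigen(R)$values             # one eigenvalue per candidate factor
prop_var <- ev / sum(ev)          # proportion of variance explained by each
round(cumsum(prop_var), 2)        # cumulative variance explained
sum(ev > 1)                       # factors explaining more than one variable's worth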

Loading Factors With Factor Loadings

As I mentioned, even after the transformation we retain the weights of the original features that were combined to obtain each factor. These weights are known as ‘factor loadings’ and help us understand what the factors represent. Let’s take a made-up example of factor loadings for an airline survey. Let’s say the table looks like this:

(Illustrative table of factor loadings for the 10 airline-survey features across the first three factors; image credit: Perceptive Analytics.)

I took 10 features originally, so the analysis should generate 10 factors. Let’s say our first three factors are as shown in the table. Looking at the values, the first factor may represent the customer experience after on-boarding. The second factor reflects the airline booking experience and related perks. The third factor shows the competitive advantage of the airline’s flights compared to its competition. We also have a 10th feature which loads negatively on the second factor (which seems to make sense, as a loyal customer will book flights even if they are no longer economical or the frequent flyer program is no longer as great). Thus, we can now understand that the most important factors for customers filling in the survey are customer experience, booking experience, and competitive advantage. However, this understanding needs to be developed manually by looking at the loadings. It is the factor loadings and their interpretation that make factor analysis so valuable, together with the ability to scale down to a few factors without losing much information.

Exploration – How Many Factors?

Factor analysis can be driven by different motivations. One objective of factor analysis can be verifying, with the data, what you already know and think about it. This requires a prior intuition about the number of important factors, after which the loadings will be low overall, as well as an idea of the loadings of each original variable in those factors. This is the confirmatory way of factor analysis, where the process is run to confirm your understanding of the data. A more common approach is to use factor analysis to understand the data in the first place. In this case, you perform factor analysis first and then develop a general idea of what you get out of the results. How do we stop at a specific number of factors when we are exploring? We use the scree plot, which maps the factors against their eigenvalues; a cut-off point is determined wherever there is a sudden change in the slope of the line.
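A minimal sketch of such a scree plot, reusing the eigenvalue idea from above on illustrative data; the psych package used in the next section also provides scree() and fa.parallel() for this purpose.

# Scree plot sketch: look for the "elbow" where the slope changes suddenly.
ev <- eigen(cor(mtcars))$values
plot(ev, type = "b", xlab = "Factor", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)            # eigenvalue-greater-than-1 reference line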

Going Practical – The BFI Dataset in R

Let’s start with a practical demonstration of factor analysis. We will use the psych package in R, which is a package for personality, psychometric, and psychological research. It includes the bfi dataset, which contains 25 personality items plus 3 demographic variables for 2800 respondents. The columns are already grouped into 5 factors, so their names start with the letters A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism), and O (Openness). Let’s perform factor analysis to see if we really recover the same association of the variables with each factor.

#Installing the Psych package and loading it
install.packages("psych")
library(psych)

#Loading the dataset
bfi_data=bfi

We have NA values in our data which need to be handled, but we will not perform much processing on the data. To keep things simple, we will only take those data points where there are no missing values.

#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]

This leaves us with 2236 data points, down from 2800, which means a reduction of 564 data points. Since 2236 is still a reasonable number of data points, we can proceed. We will use the fa() function for factor analysis, which needs the correlation matrix of the data as input.

#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)

The fa() function takes the correlation matrix as r and the number of factors as nfactors. The default is 1 factor, which is not what we want, so we specify 6 factors for this exercise.

#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)

#Getting the factor loadings and model analysis
factors_data

Factor Analysis using method =  minres
Call: fa(r = bfi_cor, nfactors = 6)

Standardized loadings (pattern matrix) based upon correlation matrix

            MR2   MR3   MR1   MR5   MR4   MR6    h2   u2 com
A1         0.11  0.07 -0.07 -0.56 -0.01  0.35 0.379 0.62 1.8
A2         0.03  0.09 -0.08  0.64  0.01 -0.06 0.467 0.53 1.1
A3        -0.04  0.04 -0.10  0.60  0.07  0.16 0.506 0.49 1.3
A4        -0.07  0.19 -0.07  0.41 -0.13  0.13 0.294 0.71 2.0
A5        -0.17  0.01 -0.16  0.47  0.10  0.22 0.470 0.53 2.1
C1         0.05  0.54  0.08 -0.02  0.19  0.05 0.344 0.66 1.3
C2         0.09  0.66  0.17  0.06  0.08  0.16 0.475 0.53 1.4
C3         0.00  0.56  0.07  0.07 -0.04  0.05 0.317 0.68 1.1
C4         0.07 -0.67  0.10 -0.01  0.02  0.25 0.555 0.45 1.3
C5         0.15 -0.56  0.17  0.02  0.10  0.01 0.433 0.57 1.4
E1        -0.14  0.09  0.61 -0.14 -0.08  0.09 0.414 0.59 1.3
E2         0.06 -0.03  0.68 -0.07 -0.08 -0.01 0.559 0.44 1.1
E3         0.02  0.01 -0.32  0.17  0.38  0.28 0.507 0.49 3.3
E4        -0.07  0.03 -0.49  0.25  0.00  0.31 0.565 0.44 2.3
E5         0.16  0.27 -0.39  0.07  0.24  0.04 0.410 0.59 3.0
N1         0.82 -0.01 -0.09 -0.09 -0.03  0.02 0.666 0.33 1.1
N2         0.83  0.02 -0.07 -0.07  0.01 -0.07 0.654 0.35 1.0
N3         0.69 -0.03  0.13  0.09  0.02  0.06 0.549 0.45 1.1
N4         0.44 -0.14  0.43  0.09  0.10  0.01 0.506 0.49 2.4
N5         0.47 -0.01  0.21  0.21 -0.17  0.09 0.376 0.62 2.2
O1        -0.05  0.07 -0.01 -0.04  0.57  0.09 0.357 0.64 1.1
O2         0.12 -0.09  0.01  0.12 -0.43  0.28 0.295 0.70 2.2
O3         0.01  0.00 -0.10  0.05  0.65  0.04 0.485 0.52 1.1
O4         0.10 -0.05  0.34  0.15  0.37 -0.04 0.241 0.76 2.6
O5         0.04 -0.04 -0.02 -0.01 -0.50  0.30 0.330 0.67 1.7
gender     0.20  0.09 -0.12  0.33 -0.21 -0.15 0.184 0.82 3.6
education -0.03  0.01  0.05  0.11  0.12 -0.22 0.072 0.93 2.2
age       -0.06  0.07 -0.02  0.16  0.03 -0.26 0.098 0.90 2.0

                       MR2  MR3  MR1  MR5  MR4  MR6
SS loadings           2.55 2.13 2.14 2.03 1.79 0.87
Proportion Var        0.09 0.08 0.08 0.07 0.06 0.03
Cumulative Var        0.09 0.17 0.24 0.32 0.38 0.41
Proportion Explained  0.22 0.18 0.19 0.18 0.16 0.08
Cumulative Proportion 0.22 0.41 0.59 0.77 0.92 1.00

 With factor correlations of 
      MR2   MR3   MR1   MR5   MR4   MR6
MR2  1.00 -0.18  0.24 -0.05 -0.01  0.10
MR3 -0.18  1.00 -0.23  0.16  0.19  0.04
MR1  0.24 -0.23  1.00 -0.28 -0.19 -0.15
MR5 -0.05  0.16 -0.28  1.00  0.18  0.17
MR4 -0.01  0.19 -0.19  0.18  1.00  0.05
MR6  0.10  0.04 -0.15  0.17  0.05  1.00

Mean item complexity =  1.8
Test of the hypothesis that 6 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  7.79
The degrees of freedom for the model are 225  and the objective function was  0.57 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                  MR2  MR3  MR1  MR5  MR4  MR6
Correlation of (regression) scores with factors  0.93 0.89 0.89 0.88 0.86 0.77
Multiple R square of scores with factors         0.86 0.79 0.79 0.77 0.74 0.59
Minimum correlation of possible factor scores    0.72 0.57 0.58 0.54 0.49 0.18

The factor loadings show that the first factor is dominated by the N (Neuroticism) items, followed by factors dominated by the C, E, A, and O items, so Neuroticism accounts for the largest share of variance in these data. We also notice that the first five factors adequately represent the five personality categories the data was designed around.
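If you want to read the loadings more comfortably, the psych results can also be printed with a cut-off and drawn as a diagram; this is an optional sketch using the objects created above.

# Show only loadings above 0.3 and draw the factor structure.
print(factors_data$loadings, cutoff = 0.3)
fa.diagram(factors_data)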

Conclusion: A Deeper Insight

As apparent from the bfi survey example, factor analysis is helpful in classifying our current features into factors which represent hidden features that are not measured directly. It has the additional advantage of helping reduce our data to a smaller set of features without losing much information. There are a few things to keep in mind before putting factor analysis into action. The first is about the values of the factor loadings. We may have datasets where the factor loadings for all factors are low – lower than 0.5 or 0.3. A factor loading lower than 0.3 for every feature means that you are using too many factors and need to re-run the analysis with fewer factors. Loadings around 0.5 are satisfactory but indicate weak predictive ability; you should set a threshold and discard factors whose loadings are below it for all features. Factor analysis on dynamic data can also be helpful in tracking changes in the nature of the data: if the data changes significantly, the number of factors in exploratory factor analysis will also change and indicate that you should look into the data and check what has changed. The final point of importance is the interpretability of the factors. If you are unable to understand or explain the factor loadings, you are either using a very granular or a very generalized set of factors. In this case, you need to find the right number of factors and obtain loadings which are both interpretable and beneficial for analysis. There can be a variety of other situations with factor analysis, and they are all subject to interpretation.

Keep learning and here is the entire code used in this article.

#Installing the Psych package and loading it
install.packages("psych")
library(psych)

#Loading the dataset
bfi_data=bfi

#Remove rows with missing values and keep only complete cases
bfi_data=bfi_data[complete.cases(bfi_data),]

#Create the correlation matrix from bfi_data
bfi_cor <- cor(bfi_data)

#Factor analysis of the data
factors_data <- fa(r = bfi_cor, nfactors = 6)

#Getting the factor loadings and model analysis
factors_data

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Prudhvi Sai Ram, Saneesh Veetil and Chaitanya Sagar contributed to this article.

Perceptive Analytics provides Tableau Consulting, data analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


rquery: SQL from R


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

My BARUG rquery talk went very well; thank you very much to the attendees for being an attentive and generous audience.


(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at jmount@win-vector.com, I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


the riddle of the stands


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The simple riddle of last week on The Riddler, about the minimum number of urinals needed for n men to pee if the occupation rule is to stay as far as possible from anyone already there and never to stand next to another man, is quickly solved by a short R function:

ocupee=function(M){
  ok=rep(0,M)
  ok[1]=ok[M]=1
  ok[trunc((1+M/2))]=1
  while (max(diff((1:M)[ok!=0])>2)){
    i=order(-diff((1:M)[ok!=0]))[1]
    ok[(1:M)[ok!=0][i]+trunc((diff((1:M)[ok!=0])[i]/2))]=1
  }
  return(sum(ok>0))
}

with maximal occupation illustrated by the graph below:
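One way to draw such an occupancy curve with the function above; the choice of plotting men accommodated against the number of urinals is an assumption on how the original graph was built.

# Sketch only: how many men the sequential rule accommodates for M urinals.
M <- 5:100
plot(M, sapply(M, ocupee), type = "s",
     xlab = "number of urinals M", ylab = "men accommodated")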

This means that the sequential positioning scheme is not optimal: it requires $N+2^{\lceil \log_2(N-1) \rceil}$ urinals, rather than the one-out-of-two packing, which requires only $2N-1$ urinals. What is most fun about this simple exercise is the connection, pointed out in the Riddler, with an xkcd blag post written a few years ago about the topic.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


2018 R Conferences


(This article was first published on R Views, and kindly contributed to R-bloggers)

rstudio::conf 2018 and the New York R Conference are both behind us, but we are rushing headlong into the season for conferences focused on the R Language and its applications.

The European R Users Meeting (eRum) begins this coming Monday, May 14th, in Budapest with three days of workshops and talks. Headlined by R Core member Martin Mächler and fellow keynote speakers Achim Zeileis, Nathalie Villa-Vialaneix, Stefano Maria Iacus, and Roger Bivand, the program features an outstanding array of accomplished speakers including RStudio’s own Barbara Borges Ribeiro, Andrie de Vries, and Lionel Henry.

Second only to useR! in longevity, the tenth consecutive R / Finance conference will be held in Chicago on June 1st and 2nd. Keynote speakers Norm Matloff, J.J. Allaire, and Li Deng head a strong program. Produced by the same committed crew of Chicago quants with the unwavering support of UIC, R / Finance is the epitome of a small, tightly focused, single-track R conference. If you are interested in the quantitative side of Finance, there is no better place to network.

The relatively new CascadiaRConf will feature keynote speakers Alison Hill and Kara Woo in a one-day event on June 2nd in Portland, OR that promises to be a good time with several hands-on workshops.

A SatRday mini-conference will be held in Cardiff on June 23rd. Stephanie Locke, Heather Turner, and Maelle Salmon will be leading the event. The recent conference in Cape Town appears to have been a great day for working with R, and a lot of fun. I expect that Cardiff will also be a blast.

Why R? July 2nd through 5th in Wrocław, Poland is an ambitious undertaking with five keynote speakers (Bernd Bischl, Thomas Petzoldt, Leon Eyrich Jessen, Tomasz Niedzielski, and Maciej Eder), a hackathon, and several “pre-meetings” spread across Poland, Germany, and Denmark. I expect this to be a top-tier series of events.

R in Montreal will be held from July 4th through 6th. Plenary speakers Julie Josse, Arun Srinivasan, and Daniel Stubbs will headline the program.

The 14th useR! conference, the first to happen in the Southern Hemisphere, will be held in Brisbane, Australia from July 10th through the 13th. The mother of all R conferences, useR! attracts R aficionados from around the globe and provides a window into what is au courant in the R universe. Keynote speakers Jenny Bryan, Steph De Silva, Heike Hofmann, Thomas Lin Pedersen, Roger Peng, and Bill Venables head the program. The tutorials, always a major attraction at useR! conferences, will take place over two days.

Insurance Data Science, the direct successor to the R in Insurance conference series, will be held in London on July 16th. Although renamed, and presumably refocused, the program for the conference still indicates quite a bit of R content. Garth Peters and Eric Novic will deliver the keynotes.

BioC 2018, the flagship conference for the BioConductor project and a major event in the computational genomics world, will be held from July 25 through the 27th at Victoria University, Toronto. The program is still coming together, but confirmed speakers include Brenda Andrews, Benjamine Haibe-Kains, Elana Fertig, Charlotte Sonneson, Michael Hoffman, and Tim Hughes.

The Latin American R/BioConductor Developers Workshop will be held between July 30th and August 3rd at the center for Genomic Sciences in Cuernavaca, Mexico. Invited speakers include Martin Morgan and Heather Turner. The workshop is aimed at students and researchers, with a goal of teaching participants the principles of reproducible data science through the development of R/Bioconductor packages.

Two brand-new conferences directly modeled on the R / Finance experience will make their debuts this year. R / Pharma, a conference devoted to the use of R for reproducible research, regulatory compliance and validation, safety monitoring, clinical trials, drug discovery, R&D, PK/PD/pharmacometrics, genomics, and diagnostics in the pharmaceutical industry will be held on August 15th and 16th at Harvard University. This will be a small, collegial gathering limited to 150 attendees; it will undoubtedly sell out soon after registration opens.

R / Medicine, which will focus on the use of R in medical research and clinical practice, with talks addressing Phase I clinical trial design; the analysis and visualization of clinical trial data, patient records, and genetic data; personalized medicine; and reproducible research, will take place in New Haven, CT on September 7th and 8th. This will also be a small gathering that is likely to sell out soon after registration opens.

LatinR, which will focus on the use of R in R&D, will be held at the University of Palermo in Buenos Aires on September 4th and 5th. Keynote speakers Jenny Bryan and Walter Sosa Escudero will head the program.

The last R conference on my radar for the 2018 season, the enterprise-focused EARL (Enterprise Applications of the R Language) Conference, will take place in London from September 11th through the 13th. Edwin Dunn and Garrett Grolemund will deliver the keynotes, and the list of speakers comprises an impressive roster of industrial-strength R users. This is clearly the event for data scientists looking to put R into production.



To leave a comment for the author, please follow the link and comment on their blog: R Views.


Basic R Automation


(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Last Wednesday, a small presentation was given at the RBelgium meetup in Brussels on Basic R Automation. For those of you who could not attend, here are the slides of that presentation which showed the use of the cronR and taskscheduleR R packages for automating basic R scripts.
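To give a flavour of what the slides cover, here is a minimal cronR sketch; the script path, time, and job id below are placeholders, and taskscheduleR provides the equivalent on Windows.

# Schedule an R script to run every day at 7AM via cron (Linux/Mac).
library(cronR)
cmd <- cron_rscript("/path/to/my_script.R")
cron_add(cmd, frequency = "daily", at = "7AM", id = "my_daily_job")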

If you are interested in setting up a project for more advanced ways on how to automate your R processes for your specific environment, get in touch.

(Embedded slides: Basic_R_Automation.pdf)


To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.


goodpractice 1.0.2 on CRAN


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Hannah Frick, Data Scientist

We are excited to announce that the goodpractice package is now available on CRAN. The package gives advice about good practices when building R packages. Advice includes functions and syntax to avoid, package structure, code complexity, code formatting, etc.

You can install the CRAN version via

install.packages("goodpractice")

Building R packages

Building an R package is a great way of encapsulating code, documentation and data in a single testable and easily distributable unit.

For a package to be distributed via CRAN, it needs to pass a set of checks implemented in R CMD check, such as: Is there minimal documentation, e.g., are all arguments of exported functions documented? Are all dependencies declared?
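Those checks can be run with R CMD check from the command line, or from within R via the rcmdcheck package; a minimal sketch, where the package path is a placeholder:

# Run the R CMD check suite programmatically and inspect the results.
chk <- rcmdcheck::rcmdcheck("path/to/yourpackage", args = "--as-cran")
chk$errors
chk$warnings
chk$notes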

These checks are helpful in developing a solid R package, but they don’t check for several other good practices. For example, a package does not need to contain any tests, but it is good practice to include some. Following a coding standard helps readability. Avoiding overly complex functions reduces the risk of bugs. Including a URL for bug reports lets people more easily report bugs if they find any.

What the goodpractice package does

Tools for automatically checking several of the above-mentioned aspects already exist, and the goodpractice package bundles the checks from rcmdcheck with code coverage through the covr package, source code linting via the lintr package, and cyclomatic complexity via the cyclocomp package, and augments them with some further checks on good practice for R package development, such as avoiding T and F in favour of TRUE and FALSE. It provides advice on which practices to follow and which to avoid.

You can use goodpractice checks as a reminder for you and your colleagues – and if you have custom checks to run, you can make goodpractice run those as well!

How to use goodpractice

The main function goodpractice() (and its alias gp()) takes the path to the source code of a package as its first argument. The goodpractice package contains the source for a simple package which violates some good practices. We’ll use this for the examples.

library(goodpractice)

# get path to example package
pkg_path <- system.file("bad1", package = "goodpractice")

# run gp() on it
g <- gp(pkg_path)
#> Preparing: covr
#> Warning in MYPREPS[[prep]](state, quiet = quiet): Prep step for test
#> coverage failed.
#> Preparing: cyclocomp
#> Skipping 2 packages ahead of CRAN: callr, remotes
#> Installing 1 packages: stringr
#> 
#>   There is a binary version available but the source version is
#>   later:
#>         binary source needs_compilation
#> stringr  1.3.0  1.3.1             FALSE
#> installing the source package 'stringr'
#> Preparing: description
#> Preparing: lintr
#> Preparing: namespace
#> Preparing: rcmdcheck

# show the result
g
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> 
#> It is good practice to
#> 
#>   ✖ not use "Depends" in DESCRIPTION, as it can cause name
#>     clashes, and poor interaction with other packages. Use
#>     "Imports" instead.
#>   ✖ omit "Date" in DESCRIPTION. It is not required and it gets
#>     invalid quite often. A build date will be added to the package
#>     when you perform `R CMD build` on it.
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#>   ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug
#>     tracker. Many online code hosting services provide bug
#>     trackers for free, https://github.com, https://gitlab.com,
#>     etc.
#>   ✖ omit trailing semicolons from code lines. They are not needed
#>     and most R coding standards forbid them
#> 
#>     R/semicolons.R:4:30
#>     R/semicolons.R:5:29
#>     R/semicolons.R:9:38
#> 
#>   ✖ not import packages as a whole, as this can cause name clashes
#>     between the imported packages. Instead, import only the
#>     specific functions you need.
#>   ✖ fix this R CMD check ERROR: VignetteBuilder package not
#>     declared: ‘knitr’ See section ‘The DESCRIPTION file’ in the
#>     ‘Writing R Extensions’ manual.
#>   ✖ avoid 'T' and 'F', as they are just variables which are set to
#>     the logicals 'TRUE' and 'FALSE' by default, but are not
#>     reserved words and hence can be overwritten by the user.
#>     Hence, one should always use 'TRUE' and 'FALSE' for the
#>     logicals.
#> 
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     ... and 4 more lines
#> 
#> ───────────────────────────────────────────────────────────────────────────

So with this package, we’ve done a few things in the DESCRIPTION file for which there are reasons not to do them, have unnecessary trailing semicolons in the code and used T and F for TRUE and FALSE. The output of gp() will tell you what isn’t considered good practice out of what you have already written. If that is in the R code itself, it will also point you to the location of your faux-pas. In general, the messages are supposed to not only point out to you what you might want to avoid but also why.

Custom checks

The above example tries to run all 230 checks available; to see the full list, use all_checks(). You can customise the set of checks run by selecting only those default checks you are interested in and by adding your own checks.

If you only want to run a subset of the checks, e.g., just the check on the URL field in the DESCRIPTION, you can specify the checks by name:

# what is the name of the check?
grep("url", all_checks(), value = TRUE)
#> [1] "description_url"

# run only this check
gp(pkg_path, checks = "description_url")
#> Preparing: description
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> 
#> It is good practice to
#> 
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#> ───────────────────────────────────────────────────────────────────────────

Additional checks can be used in gp() via the extra_checks argument. This should be a named list of check objects as returned by the make_check() function.

# make a simple version of the T/F check
check_simple_tf <- make_check(
  description = "TRUE and FALSE is used, not T and F",
  gp = "avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead.",
  check = function(state) {
    length(tools::checkTnF(dir = state$path)) == 0
  }
)

gp(pkg_path, checks = c("description_url", "simple_tf"),
   extra_checks = list(simple_tf = check_simple_tf))
#> Preparing: description
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> 
#> It is good practice to
#> 
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#>   ✖ avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead.
#> ───────────────────────────────────────────────────────────────────────────

For more details on creating custom checks, please see the vignette Custom Checks.

Acknowledgements

This package was written by Gábor Csárdi with contributions by Noam Ross, Neal Fultz, Douglas Ashton, Marcel Ramos, Joseph Stachelek, and myself. Special thanks for the input and feedback to the rOpenSci leadership team and community as well as everybody who opened issues!

Feedback

If you have any feedback, please consider opening an issue on GitHub.


Custom R charts coming to Excel

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

This week at the BUILD conference, Microsoft announced that Power BI custom visuals will soon be available as charts in Excel. You'll be able to choose a range of data within an Excel workbook and pass those data to one of the built-in Power BI custom visuals, or to one you've created yourself using the API.

Excel custom visuals

Since you can create Power BI custom visuals using R, that means you'll be able to design a custom R-based chart and make it available to people using Excel, even if they don't know how to use R themselves. There are also many pre-defined custom visuals available, including some familiar R charts like decision trees, calendar heatmaps, and hexbin scatterplots.

For more details on how you'll be able to use custom R visuals in Excel, check out the blog post linked below.

PowerBI Blog: Excel announces new data visualization capabilities with Power BI custom visuals



Monte Carlo-based prediction intervals for nonlinear regression

(This article was first published on Rmazing, and kindly contributed to R-bloggers)

Calculation of the propagated uncertainty $\sigma_y$ using $\nabla \Sigma \nabla^T$ (1), where $\nabla$ is the gradient and $\Sigma$ the covariance matrix of the coefficients $\beta_i$, is called the "Delta Method" and is widely applied in nonlinear least-squares (NLS) fitting. However, this method is based on first-order Taylor expansion and thus assumes linearity around $\hat{y} = f(x, \hat{\beta})$. The second-order approach can partially correct for this restriction by using a second-order polynomial around $\hat{y}$, which is $\nabla \Sigma \nabla^T + \frac{1}{2} \mathrm{tr}(H \Sigma H \Sigma)$ (2), where $\mathrm{tr}(\cdot)$ is the matrix trace and $H$ is the Hessian.

Confidence and prediction intervals for NLS models are calculated using $t(1 - \frac{\alpha}{2}, \nu) \cdot \sigma_y$ (3) or $t(1 - \frac{\alpha}{2}, \nu) \cdot \sqrt{\sigma_y^2 + \sigma_r^2}$ (4), respectively, where the residual variance is $\sigma_r^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\nu}$ (5).

Now, how can we employ the matrix notation of error propagation for creating Taylor expansion- and Monte Carlo-based prediction intervals? The inclusion of $\sigma_r^2$ in the prediction interval can be implemented as an extended gradient and "augmented" covariance matrix. So instead of using $\hat{y} = f(x, \hat{\beta})$ (6), we take $\hat{y} = f(x, \hat{\beta}) + \sigma_r^2$ (7) as the expression and augment the $i \times i$ covariance matrix $\Sigma$ to an $(i+1) \times (i+1)$ covariance matrix, where $\Sigma_{i+1, i+1} = \sigma_r^2$. Partial differentiation and matrix multiplication will then yield, for example with two coefficients $\beta_1$ and $\beta_2$ and their corresponding covariance matrix $\Sigma$:

$$
\left[\frac{\partial f}{\partial \beta_1} \;\; \frac{\partial f}{\partial \beta_2} \;\; 1\right]
\left[\begin{array}{ccc} \sigma_1^2 & \sigma_1\sigma_2 & 0 \\ \sigma_2\sigma_1 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_r^2 \end{array}\right]
\left[\begin{array}{c} \frac{\partial f}{\partial \beta_1} \\ \frac{\partial f}{\partial \beta_2} \\ 1 \end{array}\right] \quad (8)
$$

$$
= \left(\frac{\partial f}{\partial \beta_1}\right)^2\sigma_1^2 + 2 \frac{\partial f}{\partial \beta_1} \frac{\partial f}{\partial \beta_2} \sigma_1 \sigma_2 + \left(\frac{\partial f}{\partial \beta_2}\right)^2\sigma_2^2 + \sigma_r^2 \quad (9) \;\equiv\; \sigma_y^2 + \sigma_r^2,
$$

which then goes into (4).

The advantage of the augmented covariance matrix is that it can be exploited for creating Monte Carlo-based prediction intervals. This is new from propagate version 1.0-6 and is based on the paradigm that we add another dimension by employing the augmented covariance matrix of (8) in the multivariate t-distribution random number generator (in our case rtmvt), with $\mu = 0$. All $n$ simulations are then evaluated with (7), and the usual $[1 - \frac{\alpha}{2}, \frac{\alpha}{2}]$ quantiles are calculated for the prediction interval. Using the original covariance matrix with (6) will deliver the MC-based confidence interval.

Application of second-order Taylor expansion or the MC-based approach demonstrates nicely that, for the majority of nonlinear models, the confidence/prediction intervals around $\hat{y}$ are quite asymmetric, which the classical Delta Method does not capture:

library(propagate)

DNase1 <- subset(DNase, Run == 1)

fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
                 data = DNase1,
                 start = list(Asym = 3, xmid = 0, scal = 1))

## first-order prediction interval
set.seed(123)
PROP1 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2),
                    nsim = 1000000, second.order = FALSE,
                    interval = "prediction")
t(PROP1$summary)

Prop.Mean.1   0.74804722
Prop.sd.1     0.02081131
Prop.2.5      0.70308712
Prop.97.5     0.79300731

## second-order prediction interval and MC
set.seed(123)
PROP2 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2),
                    nsim = 1000000, second.order = TRUE,
                    interval = "prediction")
t(PROP2$summary)

Prop.Mean.1   0.74804722
Prop.Mean.2   0.74815136
Prop.sd.1     0.02081131
Prop.sd.2     0.02081520
Prop.2.5      0.70318286
Prop.97.5     0.79311987
Sim.Mean      0.74815598
Sim.sd        0.02261884
Sim.2.5       0.70317629
Sim.97.5      0.79309874

What we see here is that i) the first-order prediction interval [0.70308712; 0.79300731] is symmetric and slightly down-biased compared to the second-order one [0.70318286; 0.79311987], and ii) the second-order prediction interval tallies nicely up to the 4th decimal with the new MC-based interval (0.70318286 and 0.70317629; 0.79311987 and 0.79309874). I believe this clearly demonstrates the usefulness of the MC-based approach for NLS prediction interval estimation…
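As a side note, equation (8) can be verified "by hand" for the fit above. The following is only a minimal sketch of the augmented-matrix idea, not how predictNLS works internally, and it assumes the numDeriv package is available for the numerical gradient:

library(numDeriv)

# model function with the coefficients as a plain vector: c(Asym, xmid, scal)
pred_fun <- function(beta, conc) {
  beta[1] / (1 + exp((beta[2] - log(conc)) / beta[3]))
}

beta  <- coef(fm3DNase1)
Sigma <- vcov(fm3DNase1)              # 3 x 3 covariance of the coefficients
s2_r  <- summary(fm3DNase1)$sigma^2   # residual variance

grad_f <- grad(pred_fun, beta, conc = 2)   # df/dbeta at conc = 2
g_aug  <- c(grad_f, 1)                     # extended gradient [df/dbeta, 1]
S_aug  <- rbind(cbind(Sigma, 0),           # augmented covariance matrix,
                c(0, 0, 0, s2_r))          # with sigma_r^2 in the corner

drop(t(g_aug) %*% S_aug %*% g_aug)         # first-order sigma_y^2 + sigma_r^2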


Visualizing graphs with overlapping node groups

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

I recently came across some data about multilateral agreements, which needed to be visualized as network plots. This data had some peculiarities that made it more difficult to create a plot that was easy to understand. First, the nodes in the graph were organized in groups but each node could belong to multiple groups or to no group at all. Second, there was one “super node” that was connected to all other nodes (while “normal” nodes were only connected within their group). This made it difficult to find the right layout that showed the connections between the nodes as well as the group memberships. However, digging a little deeper into the R packages igraph and ggraph it is possible to get satisfying results in such a scenario.

Example data, nodes & edges

We will use the following packages that we need to load at first:

library(dplyr)
library(purrr)
library(igraph)
library(ggplot2)
library(ggraph)
library(RColorBrewer)

Let's create some exemplary data. Let's say we have 4 groups a, b, c, d and 40 nodes with the node IDs 1 to 40. Each node can belong to several groups, but it does not have to belong to any group at all. An example would be the following data:

group_a <- 1:5            # nodes 1 to 5 in group a
group_b <- 1:10           # nodes 1 to 10 in group b
group_c <- c(1:3, 7:18)   # nodes 1 to 3 and 7 to 18 in c
group_d <- c(1:4, 15:25)  # nodes 1 to 4 and 15 to 25 in d

members <- data_frame(id = c(group_a, group_b, group_c, group_d, 26:40),
                      group = c(rep('a', length(group_a)),
                                rep('b', length(group_b)),
                                rep('c', length(group_c)),
                                rep('d', length(group_d)),
                                rep(NA, 15)))   # nodes 26 to 40 do not
                                                # belong to any group

An excerpt of the data:

> members
   id group
    1 a
    2 a
  [...]
    5 a
    1 b
    2 b
  [...]
   38 NA
   39 NA
   40 NA

Now we can create the edges of the graph, i.e. the connections between the nodes. All nodes within a group are connected to each other. Additionally, all nodes are connected with one “super node” (as mentioned in the introduction). In our example data, we pick node ID 1 to be this special node. Let’s start to create our edges by connecting all nodes to node 1:

edges <- data_frame(from = 1, to = 2:max(members$id), group = NA)

We also note here that these edges are not part of any group membership. We'll handle the group memberships now:

within_group_edges <- members %>%
  split(.$group) %>%
  map_dfr(function (grp) {
    id2id <- combn(grp$id, 2)
    data_frame(from = id2id[1,],
               to = id2id[2,],
               group = unique(grp$group))
  })

edges <- bind_rows(edges, within_group_edges)

At first, we split the members data by their group which produces a list of data frames. We then use map_dfr from the purrr package to handle each of these data frames that are passed as grp argument. grp$id contains the node IDs of the members of this group and we use combn to create the pair-wise combinations of these IDs. This will create a matrix id2id, where the columns represent the node ID pairs. We return a data frame with the from-to ID pairs and a group column that denotes the group to which these edges belong. These “within-group edges” are appended to the already created edges using bind_rows.
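If combn is unfamiliar, here is a quick illustration of the pair-wise combinations it produces; each column is one from–to pair:

combn(c(3, 7, 9), 2)
#>      [,1] [,2] [,3]
#> [1,]    3    3    7
#> [2,]    7    9    9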

> edges
   from    to group
      1     2 NA
      1     3 NA
      1     4 NA
  [...]
     23    24 d
     23    25 d
     24    25 d

Plotting with ggraph

We have our edges, so now we can create the graph with igraph and plot it using the ggraph package:

g <- graph_from_data_frame(edges, directed = FALSE)

ggraph(g) +
  geom_edge_link(aes(color = group), alpha = 0.5) +  # different edge color per group
  geom_node_point(size = 7, shape = 21, stroke = 1,
                  fill = 'white', color = 'black') +
  geom_node_text(aes(label = name)) +                # "name" is automatically generated from the node IDs in the edges
  theme_void()

Not bad for the first try, but the layout is a bit unfortunate, giving too much space to nodes that don’t belong to any group.

We can tell igraph’s layout algorithm to tighten the non-group connections (the gray lines in the above figure) by giving them a higher weight than the within-group edges:

# give weight 10 to non-group edges
edges <- data_frame(from = 1, to = 2:40,
                    weight = 10, group = NA)

within_group_edges <- members %>%
  split(.$group) %>%
  map_dfr(function (grp) {
    id2id <- combn(grp$id, 2)
    # weight 1 for within-group edges
    data_frame(from = id2id[1,],
               to = id2id[2,],
               weight = 1,
               group = unique(grp$group))
  })

We reconstruct the graph g and plot it using the same commands as before and get the following:

The nodes within groups are now much less cluttered and the layout is more balanced.

Plotting with igraph

A problem with this type of plot is that connections within smaller groups are sometimes hardly visible (for example group a in the above figure). The plotting functions of igraph allow an additional method of highlighting groups in graphs: Using the parameter mark.groups will construct convex hulls around nodes that belong to a group. These hulls can then be highlighted with respective colors.

At first, we need to create a list that maps each group to a vector of the node IDs that belong to that group:

group_ids <- lapply(members %>% split(.$group), function(grp) { grp$id })
> group_ids
$a
[1] 1 2 3 4 5

$b
 [1]  1  2  3  4  5  6  7  8  9 10

[...]

Now we can create a color for each group using RColorBrewer:

group_color <- brewer.pal(length(group_ids), 'Set1')

# the fill gets an additional alpha value for transparency:
group_color_fill <- paste0(group_color, '20')
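The fill trick simply appends a two-digit hex alpha value to each hex color, giving a mostly transparent version of the same hue. For instance, for the first color (assuming the usual Set1 palette):

group_color[1]
#> [1] "#E41A1C"
group_color_fill[1]
#> [1] "#E41A1C20"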

We plot it by using the graph object g that was generated before with graph_from_data_frame:

par(mar = rep(0.1, 4))   # reduce margins

plot(g, vertex.color = 'white', vertex.size = 9,
     edge.color = rgb(0.5, 0.5, 0.5, 0.2),
     mark.groups = group_ids,
     mark.col = group_color_fill,
     mark.border = group_color)

legend('topright', legend = names(group_ids),
       col = group_color,
       pch = 15, bty = "n", pt.cex = 1.5, cex = 0.8,
       text.col = "black", horiz = FALSE)

This option usually works well when you have groups that are more or less well separated, i.e. do not overlap too much. However, in our case there is quite some overlap and we can see that the shapes that encompass the groups also sometimes include nodes that do not actually belong to that group (for example node 8 in the above figure that is encompassed by group a although it does not belong to that group).

We can use a trick that leads the layout algorithm to bundle the groups more closely in a different manner: For each group, we introduce a “virtual node” (which will not be drawn during plotting) to which all the normal nodes in the group are tied with more weight than to each other. Nodes that only belong to a single group will be placed farther away from the center than those that belong to several groups, which will reduce clutter and wrongly overlapping group hulls. Furthermore, a virtual group node for nodes that do not belong to any group will make sure that these nodes will be placed more closely to each other.

We start by generating IDs for the virtual nodes:

# 4 groups plus one "NA-group"
virt_group_nodes <- max(members$id) + 1:5
names(virt_group_nodes) <- c(letters[1:4], NA)

This will give us the following IDs:

> virt_group_nodes
   a    b    c    d   NA
  41   42   43   44   45

We start to create the edges again by connecting all nodes to the “super node” with ID 1:

edges_virt <- data_frame(from = 1, to = 2:40, weight = 5, group = NA)

Then, the edges within the groups will be generated again, but this time we add additional edges to each group’s virtual node:

within_virt <- members %>%
  split(.$group) %>%
  map_dfr(function (grp) {
    group_name <- unique(grp$group)
    virt_from <- rep(virt_group_nodes[group_name], length(grp$id))
    id2id <- combn(grp$id, 2)
    data_frame(
      from = c(id2id[1,], virt_from),
      to = c(id2id[2,], grp$id),            # also connects from the virtual group node to each group node
      weight = c(rep(0.1, ncol(id2id)),     # weight between group nodes
                 rep(50, length(grp$id))),  # weight that 'ties together' the group (via the virtual group node)
      group = group_name
    )
  })

edges_virt <- bind_rows(edges_virt, within_virt)

We add edges from all nodes that don’t belong to a group to another virtual node:

virt_group_na <- virt_group_nodes[is.na(names(virt_group_nodes))]
non_group_nodes <- (members %>% filter(is.na(group)))$id

edges_na_group_virt <- data_frame(from = non_group_nodes,
                                  to = rep(virt_group_na,
                                           length(non_group_nodes)),
                                  weight = 10,
                                  group = NA)

edges_virt <- bind_rows(edges_virt, edges_na_group_virt)

This time, we also create a data frame for the nodes, because we want to add an additional property is_virt to each node that denotes if that node is virtual:

nodes_virt <- data_frame(id = 1:max(virt_group_nodes),
                         is_virt = c(rep(FALSE, max(members$id)),
                                     rep(TRUE, length(virt_group_nodes))))

We’re ready to create the graph now:

g_virt <- graph_from_data_frame(edges_virt, directed = FALSE,
                                vertices = nodes_virt)

To illustrate the effect of the virtual nodes, we can plot the graph directly and get a figure like this (virtual nodes highlighted in turquois):

We now want to plot the graph without the virtual nodes, but the layout should nevertheless be calculated with the virtual nodes. We can achieve that by running the layout algorithm first and then removing the virtual nodes from both the graph and the generated layout matrix:

# use "auto layout"
lay <- layout_nicely(g_virt)

# remove virtual group nodes from graph
g_virt <- g_virt - vertices(virt_group_nodes)

# remove virtual group nodes' positions from the layout matrix
lay <- lay[-virt_group_nodes, ]

It’s important to pass the layout matrix now with the layout parameter to produce the final figure:

plot(g_virt, layout = lay, vertex.color = 'white', vertex.size = 9,
     edge.color = rgb(0.5, 0.5, 0.5, 0.2),
     mark.groups = group_ids, mark.col = group_color_fill,
     mark.border = group_color)

legend('topright', legend = names(group_ids), col = group_color,
       pch = 15, bty = "n", pt.cex = 1.5, cex = 0.8,
       text.col = "black", horiz = FALSE)

We can see that the output is less cluttered and that nodes belonging to the same groups are bundled nicely, while nodes that do not share the same groups are well separated. Note that the respective edge weights were found empirically and you will probably need to adjust them to achieve a good graph layout for your data.


Enterprise Advocate

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We are looking for our next Enterprise Advocate to join the RStudio team. See what Pete Knast, Global Director of New Business, has to say about working at RStudio and the Enterprise Advocate role.

null

When did you join RStudio and what made you interested in working here?

I joined in early 2014. I was excited by RStudio because I love helping people, and as an open source company RStudio seemed like a great way to reach a lot of people and assist with numerous interesting use cases.

What types of projects do you work on?

I get to work on the front lines, corresponding directly with our customers. Since my focus is new business, this means I am helping open source users take the next step in their use of R. Sometimes this means helping IT organizations understand how RStudio can integrate with corporate security/authentication protocols, and sometimes it involves showing off various Shiny applications. The types of projects always vary, which keeps me on my toes.

What do you enjoy about working at RStudio?

One would be that my colleagues are not only extremely smart but very genuine, so there are always fun conversations in and outside of work. Another big reason would be the various use cases I get to play a part in. Since each customer has a different application area or industry, I never get bored when I learn about how RStudio offerings are being applied.

What types of qualities do you look for when hiring an Enterprise Advocate?

Two words that come to mind are humble and smart. Also, since RStudio is so popular but our company is small, we have a high volume of customers to connect with, so high energy is also a must. If you enjoy solving problems, not only will you find the role a good fit, but you will have the chance to help bring solutions to large corporations and assist in resolving issues that can even cure diseases.

What are the goals for someone new to this role?

To establish themselves in the R and data science community as a trusted consultant. When you wake up from a dream that involves R, you have made it. 🙂

If you think you or someone you know might be a good fit for this role and want to know more, check it out here.


RStudio:addins part 2 – roxygen documentation formatting made easy

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

Code documentation is extremely important if you want to share the code with anyone else, future you included. In this second post in the RStudio:addins series we will pay a part of our technical debt from the previous article and document our R functions conveniently using a new addin we will build for this purpose.

The addin we will create in this article will let us create well formatted roxygen documentation easily by using keyboard shortcuts to add useful tags such as \code{} or \link{} around selected text in RStudio.

Quick intro to documentation with roxygen2

1. Documenting your first function

To help us generate documentation easily we will be using the roxygen2 package. You can install it using install.packages("roxygen2"). Roxygen2 works with in-code tags and will generate R’s documentation format .Rd files, create a NAMESPACE, and manage the Collate field in DESCRIPTION (not relevant to us at this point) automatically for our package.

Documenting a function works in 2 simple steps:

Documenting a function

  1. Inserting a skeleton – Do this by placing your cursor anywhere in the function you want to document and click Code Tools -> Insert Roxygen Skeleton (default keyboard shortcut Ctrl+Shift+Alt+R).
  2. Populating the skeleton with relevant information. A few important tags are:
  • #' @param – describing the arguments of the function
  • #' @return – describing what the function returns
  • #' @importFrom package function – in case your function uses a function from a different package; roxygen2 will automatically add it to the NAMESPACE
  • #' @export – in case you want the function to be exported (mainly for use by other packages)
  • #' @examples – showing how to use the function in practice

2. Generating and viewing the documentation

Generating and viewing the documentation

  1. We generate the documentation files using roxygen2::roxygenise() or devtools::document() (default keyboard shortcut Ctrl+Shift+D)
  2. Re-installing the package (default keyboard shortcut Ctrl+Shift+B)
  3. Viewing the documentation for a function using ?functionname, e.g. ?mean, or placing the cursor on a function name and pressing F1 in RStudio – this will open the Help pane with the help for that function

3. A real-life example

Let us now document runCurrentRscript a little bit:

#' runCurrentRscript
#' @description Wrapper around executeCmd with default arguments for easy use as an RStudio addin
#' @param path character(1) string, specifying the path of the file to be used as Rscript argument (ideally a path to an R script)
#' @param outputFile character(1) string, specifying the name of the file, into which the output produced by running the Rscript will be written
#' @param suffix character(1) string, specifying additional suffix to pass to the command
#' @importFrom rstudioapi getActiveDocumentContext
#' @importFrom rstudioapi navigateToFile
#' @seealso executeCmd
#' @return side-effects
runCurrentRscript <- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]]),
  outputFile = "output.txt",
  suffix = "2>&1") {
  cmd <- makeCmd(path, outputFile = outputFile, suffix = suffix)
  executeCmd(cmd)
  if (!is.null(outputFile) && file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}

As we can see by looking at ?runCurrentRscript versus ?mean, our documentation does not quite look up to par with the documentation for other functions:

Comparing the documentation view for base::mean and runCurrentRscript

Setting aside the richness of the content, what is missing is the use of markup commands (tags) for formatting and linking our documentation. Some very useful tags are, for example:

  • \code{}, \strong{}, \emph{} for font style
  • \link{}, \href{}, \url{} for linking to other parts of the documentation or external resources
  • \enumerate{}, \itemize{}, \tabular{} for using lists and tables
  • \eqn{}, \deqn{} for mathematical expressions such as equations etc.

For the full list of options regarding text formatting, linking and more, see the Rd format chapter of Writing R Extensions.

Our addins to make documenting a breeze

As you can imagine, typing the markup commands in full all the time is quite tedious. The goal of our new addin will therefore be to make this process efficient using keyboard shortcuts: just select some text and the addin will place the desired tags around it. For now, we will be satisfied with simple one-line tags.

1. Add a selected tag around a character string

roxyfy <- function(str, tag = NULL, splitLines = TRUE) {
  if (is.null(tag)) {
    return(str)
  }
  if (!isTRUE(splitLines)) {
    return(paste0("\\", tag, "{", str, "}"))
  }
  str <- unlist(strsplit(str, "\n"))
  str <- paste0("\\", tag, "{", str, "}")
  paste(str, collapse = "\n")
}
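A quick check of what roxyfy produces for a single-line selection (the doubled backslash is just R's console representation of one backslash):

roxyfy("runCurrentRscript", tag = "code")
#> [1] "\\code{runCurrentRscript}"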

2. Apply the tag on a selection in an active document in RStudio

We will make the functionality available for multi-selections as well by lapply-ing over the selection elements retrieved from the active document in RStudio.

addRoxytag <- function(tag = NULL) {
  context <- rstudioapi::getActiveDocumentContext()
  lapply(X = context[["selection"]]
       , FUN = function(thisSel, contextid) {
           rstudioapi::modifyRange(location = thisSel[["range"]]
                                 , roxyfy(thisSel[["text"]], tag)
                                 , id = contextid)
         }
       , contextid = context[["id"]]
       )
  return(invisible(NULL))
}

3. Wrappers around addRoxytag to be used as addin for some useful tags

addRoxytagCode <- function() {
  addRoxytag(tag = "code")
}

addRoxytagLink <- function() {
  addRoxytag(tag = "link")
}

addRoxytagEqn <- function() {
  addRoxytag(tag = "eqn")
}

4. Add the addin bindings into addins.dcf and assign keyboard shortcuts

As the final step, we need to add the bindings for our new addins to the inst/rstudio/addins.dcf file and re-install the package.

Name: addRoxytagCode
Description: Adds roxygen tag code to current selections in the active RStudio document
Binding: addRoxytagCode
Interactive: false

Name: addRoxytagLink
Description: Adds roxygen tag link to current selections in the active RStudio document
Binding: addRoxytagLink
Interactive: false

Name: addRoxytagEqn
Description: Adds roxygen tag eqn to current selections in the active RStudio document
Binding: addRoxytagEqn
Interactive: false
Assigning keyboard shortcuts to the addins

The addins in action

And now, let’s just select the text we want to format and watch our addins do the work for us! Then document the package, re-install it and view the improved help for our functions:

The addins in action

What is next – even more automated documentation

Next time we will try to enrich our addins for generating documentation by adding the following functionalities:

  • automatic generation of @importFrom tags by inspecting the function code
  • allowing for more complex tags such as itemize

TL;DR – Just give me the package


Investing for the Long Run

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I often get asked about how to invest in the stock market. Not surprisingly, this has been a common topic in my classes. Brazil is experiencing a big change in its financial scenario. Historically, fixed income instruments paid a large premium over the stock market and that is no longer the case. Interest rates are low, without the pressure from inflation. This means a more sustainable scenario for low-interest rates in the future. Without the premium in the fixed income market, people turn to the stock market.

We can separate investors according to their horizon. Traders try to profit in the short term, usually within a day, and long-term investors buy a stock without the intent to sell it in the near future. This type of investment strategy is called BH (buy and hold). At the extreme, you buy a stock and hold it forever. The most famous spokesperson of BH is Warren Buffet, among many others.

Investing in the long run works for me because it doesn’t require much of my time. You just need to keep up with the quarterly and yearly financial reports of companies. You can easily do it as a side activity, parallel to your main job. You don’t need a lot of brain power to do it either, but it does require knowledge of accounting practices to understand all the material that is released by the company.

I read many books before starting to invest, and one of the most interesting tables I've found portrays the relationship between investment horizon and profitability. The idea is that the more time you hold a stock (or index), the higher the chance of a profit. The table, originally from Taleb's Fooled by Randomness, is as follows.

My problem with the table is that it seems pretty off. My experience tells me that a 67% chance of a positive return every month seems exaggerated. If that were the case, making money in the stock market would be easy. Digging deeper, I found out that the data behind the table is simulated and, therefore, doesn't really give a good estimate of the improvement in the probability of profits as a function of the investment horizon.

As you probably suspect, I decided to tackle the problem using real data and R. I wrote a simple function that will grab data, simulate investments of different horizons many times and plot the results. Let’s try it for the SP500 index:

source('fct_invest_horizon.R')

my.ticker <- '^GSPC'       # ticker from yahoo finance
max.horizon = 255*50       # 50 years
first.date <- '1950-01-01'
last.date <- Sys.Date()
n.points <- 50             # number of points in figure
rf.year <- 0               # risk free return (or inflation)

l.out <- get.figs.invest.horizon(ticker.in = my.ticker,
                                 first.date = first.date,
                                 last.date = last.date,
                                 max.horizon = max.horizon,
                                 n.points = n.points,
                                 rf.year = rf.year)
## 
## Running BatchGetSymbols for:
##    tickers = ^GSPC
##    Downloading data for benchmark ticker | Found cache file
## ^GSPC | yahoo (1|1) | Found cache file - Looking good!

print(l.out$p1)

print(l.out$p2)

As we can see, the data doesn't lie. As the investment horizon increases, the chances of a positive return increase. This result suggests that, if you invest for more than 13 years, it is very unlikely that you'll see a negative return. When looking at the distribution of total returns by horizon, we find that it increases significantly with time. Someone who invested for 50 years is likely to have received a 2500% return on the investment.
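The sourced helper fct_invest_horizon.R is not shown in this post, but the core of such a simulation is simple. Here is a minimal sketch, not the author's function, assuming sp500_prices is a (hypothetically named) numeric vector of daily closing prices: for a given horizon, sample many entry dates and count how often the total return beats a target.

# Estimate P(total return > target) for a horizon given in trading days,
# by sampling random entry dates
prob_beats_target <- function(prices, horizon, n_sim = 10000, target = 0) {
  max_start <- length(prices) - horizon
  starts <- sample.int(max_start, n_sim, replace = TRUE)
  total_ret <- prices[starts + horizon] / prices[starts] - 1
  mean(total_ret > target)
}

# e.g. compare a 1-year and a 10-year holding period:
# prob_beats_target(sp500_prices, horizon = 255)
# prob_beats_target(sp500_prices, horizon = 255 * 10)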

With the input rf.year we can also set a desired rate of return. Let's try it with a 5% return per year, which is pretty standard for financial markets.

my.ticker <- '^GSPC'       # ticker from yahoo finance
max.horizon = 255*50       # 50 years
first.date <- '1950-01-01'
last.date <- Sys.Date()
n.points <- 50             # number of points in figure
rf.year <- 0.05            # risk free return (or inflation) - yearly

l.out <- get.figs.invest.horizon(ticker.in = my.ticker,
                                 first.date = first.date,
                                 last.date = last.date,
                                 max.horizon = max.horizon,
                                 n.points = n.points,
                                 rf.year = rf.year)
## 
## Running BatchGetSymbols for:
##    tickers = ^GSPC
##    Downloading data for benchmark ticker | Found cache file
## ^GSPC | yahoo (1|1) | Found cache file - Got it!

print(l.out$p1)

As expected, the curve of probabilities has a lower slope, meaning that you need more time investing in the SP500 index to guarantee a return of more than 5% a year.

Now, let’s try the same setup for Berkshire stock (BRK-A). This is Buffet’s company and looking at its share price we can have a good understanding of how successful Buffet has been as a BH (buy and hold) investor.

my.ticker <- 'BRK-A'       # ticker from yahoo finance
max.horizon = 255*25       # 25 years
first.date <- '1980-01-01'
last.date <- Sys.Date()
n.points <- 50             # number of points in figure
rf.year <- 0.05            # risk free return (or inflation) - yearly

l.out <- get.figs.invest.horizon(ticker.in = my.ticker,
                                 first.date = first.date,
                                 last.date = last.date,
                                 max.horizon = max.horizon,
                                 n.points = n.points,
                                 rf.year = rf.year)
## 
## Running BatchGetSymbols for:
##    tickers = BRK-A
##    Downloading data for benchmark ticker | Found cache file
## BRK-A | yahoo (1|1) | Found cache file - OK!

print(l.out$p1)

print(l.out$p2)

Well, needless to say that, historically, Buffet has done very well in his investments! If you bought the stock and kept it for more than 1 year, there is a 70% chance that you got a profit on your investment.

I hope this post convinced you to start investing. The message from the results is straightforward: the earlier you start, the better.

