Our onboarding reviews, which ensure that packages contributed by the community undergo a transparent, constructive, non-adversarial and open review process, take place in the issue tracker of a GitHub repository. Development of the packages we onboard also happens in the open, most often in GitHub repositories.
Therefore, when I wanted to give a data-driven overview of our onboarding system, my mission was to extract data from GitHub and git repositories and to put it into nice rectangles (as defined by Jenny Bryan) ready for analysis. You might call that the first step of a “tidy git analysis”, to use the term coined by Simon Jackson.
So, how did I collect data?
A side-note about GitHub
In the following, I’ll mention repositories. All of them are git repositories, which means they’re folders under version control, where, roughly speaking, all changes are saved via commits, each with a message (more or less) describing what’s been changed. On top of that, these repositories live on GitHub, which means they get to enjoy some infrastructure such as issue trackers, milestones, starring by admirers, etc. If that ecosystem is brand new to you, I recommend reading this book, especially its big picture chapter.
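To make commits concrete, here is a minimal sketch using the same rOpenSci git2r package that appears later in this post, reading the summary message of each commit of a repository; “path/to/repo” is a hypothetical placeholder for any local clone.
# minimal sketch: each commit pairs a unique identifier (sha) with a message;
# "path/to/repo" is a placeholder for any local git repository
repo <- git2r::repository("path/to/repo")
purrr::map_chr(git2r::commits(repo), function(commit) commit@summary)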
Package review processes: weaving the threads
Each package submission is an issue thread in our onboarding repository; see an example here. The first comment in that issue is the submission itself, followed by many comments by the editor, reviewers and authors. On top of all the data saved there, mostly text data, we maintain a private Airtable workspace with a table of reviewers and their reviews, including direct links to the issue comments that are reviews.
Getting issue threads
Unsurprisingly, the first step here was to “get issue threads”. What do I mean? I wanted a table of all issue threads, one line per comment, with columns indicating the time at which something was written, and columns digesting the data from the issue itself, e.g. guessing the role of the commenter from other information: the first user of the issue is the “author”.
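To make that concrete, here is a hypothetical miniature of the target rectangle, one row per comment (the values are borrowed from the final table shown later in this post):
# hypothetical miniature of the target rectangle: one row per comment
tibble::tribble(
  ~issue, ~user,      ~created_at,           ~role,
  6,      "richfitz", "2015-03-31 00:25:14", "author",
  6,      "sckott",   "2015-04-01 17:30:51", "editor"
)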
I used to use GitHub API V3 and then heard about GitHub API V4, which blew my mind. As if I weren’t impressed enough by the mere existence of this API and its advantages, I discovered that the rOpenSci ghql package allows one to interact with such an API, and that its docs actually use GitHub API V4 as an example!
Carl Boettiger told me about his way to rectangle JSON data, using jq, a language for processing JSON, via a dedicated rOpenSci package, jqr.
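For a quick flavor of jq, here is a toy example of my own (not Carl’s) extracting nested values with jqr:
# toy JSON: extract every nested "number" field with a jq program
jqr::jq('{"edges": [{"node": {"number": 1}}, {"node": {"number": 2}}]}',
        '.edges[].node.number')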
I have nothing against GitHub API V3 and gh and purrr workflows, but I was curious and really enjoyed learning these new tools and writing this code. I had written gh/purrr code for getting the same information and it felt clumsier, but that might just be because I wasn’t perfectionist enough when writing it! I managed to write the correct GitHub V4 API query to get just what I needed by using its online explorer. I then succeeded in transforming the JSON output into a rectangle by reading Carl’s post, but also by taking advantage of another online explorer, jq play, where I pasted my output via writeClipboard. That’s nearly always the way I learn about query tools: using some sort of explorer and then pasting the code into a script. Once I am more experienced, I can skip the explorer part.
The first function I wrote was one for getting the issue number of the last onboarding issue, so that I could then loop/map over all issues.
library("ghql")
library("httr")
library("magrittr")
# function to get number of last issue
get_last_issue <- function(){
query = '{
repository(owner: "ropensci", name: "onboarding") {
issues(last: 1) {
edges{
node{
number
}
}
}
}
}'
token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
cli <- GraphqlClient$new(
url = "https://api.github.com/graphql",
headers = add_headers(Authorization = paste0("Bearer ", token))
)
## define query
### creat a query class first
qry <- Query$new()
qry$query('issues', query)
last_issue <-cli$exec(qry$queries$issues)
last_issue %>%
jqr::jq('.data.repository.issues.edges[].node.number') %>%
as.numeric()
}
get_last_issue()
## [1] 201
Then I wrote a function for getting all the precious info I needed from an issue thread. At the time it lived on its own in an R script; now it’s included in my ghrecipes package as get_issue_thread, so you can check out the code there, along with other useful recipes for analyzing GitHub data.
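If you are curious about the shape of such a function, here is a hypothetical sketch (not the actual ghrecipes code) of the kind of V4 query and jq rectangling it involves, reusing the GraphQL client defined above; the field names follow the GitHub V4 schema, the rest is illustrative.
# hypothetical sketch of fetching one issue thread (not the ghrecipes code)
query <- '{
  repository(owner: "ropensci", name: "onboarding") {
    issue(number: 6) {
      title
      comments(first: 100) {
        nodes {
          author { login }
          createdAt
          body
        }
      }
    }
  }
}'
qry <- Query$new()
qry$query('thread', query)
cli$exec(qry$queries$thread) %>%
  jqr::jq('.data.repository.issue.comments.nodes[] |
           {user: .author.login, created_at: .createdAt}')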
Then I launched this code to get all data! It was very satisfying.
# get all threads
issues <- purrr::map_df(1:get_last_issue(), get_issue_thread)

# for the one(s) with 101 rows, i.e. hitting the 100-comment page limit,
# get the 100 last comments instead
long_issues <- issues %>%
  dplyr::count(issue) %>%
  dplyr::filter(n == 101) %>%
  dplyr::pull(issue)
issues2 <- purrr::map_df(long_issues, get_issue_thread, first = FALSE)

all_issues <- dplyr::bind_rows(issues, issues2)
all_issues <- unique(all_issues)
readr::write_csv(all_issues, "data/all_threads_v4.csv")
Digesting the threads and complementing them with Airtable data
In the previous step we got a rectangle of all threads, with information
from the first issue comment (such as labels) distributed to all the
comments of the threads.
issues <- readr::read_csv("data/all_threads_v4.csv")
issues <- janitor::clean_names(issues)
issues <- dplyr::rename(issues, user = author)
issues <- dplyr::select(issues, - dplyr::contains("topic"))
issues %>%
  head() %>%
  dplyr::select(- body) %>%
  knitr::kable()
Now we need a few more steps:
- transforming NA into FALSE for variables corresponding to labels,
- getting the package name from Airtable, since the titles of issues are not uniformly formatted,
- knowing which comment is a review,
- deducing the role of the user writing the comment (author/editor/reviewer/community manager/other).
Below, binary variables are transformed and only rows corresponding to approved packages are kept.
# labels
replace_1 <- function(x){
  !is.na(x[1])
}

# binary variables
ncol_issues <- ncol(issues)
issues <- dplyr::group_by(issues, issue) %>%
  dplyr::arrange(created_at) %>%
  dplyr::mutate_at(9:(ncol_issues - 1), replace_1) %>%
  dplyr::ungroup()

# keep only issues that are finished
issues <- dplyr::filter(issues, package, !x0_presubmission,
                        !out_of_scope, !legacy,
                        !x1_editor_checks, x6_approved)
issues <- dplyr::select(issues, - dplyr::starts_with("x"),
                        - package, - out_of_scope, - legacy,
                        - meta, - holding, - pulled, - question)
Then, thanks to the airtabler package, we can add the name of the package and identify review comments.
# airtable data
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()
airtable <- dplyr::mutate(airtable,
                          issue = as.numeric(stringr::str_replace(onboarding_url,
                                                                  ".*issues\\/", "")))

# we get the name of the package
# and we know which comments are reviews
reviews <- dplyr::select(airtable, review_url, issue, package) %>%
  dplyr::mutate(is_review = TRUE)
issues <- dplyr::left_join(issues, reviews,
                           by = c("issue", "comment_url" = "review_url"))
issues <- dplyr::mutate(issues, is_review = !is.na(is_review))
Finally, the inelegant code below attributes a role to each user (commenter is its more precise version, differentiating reviewer 1 from reviewer 2). I could have used dplyr case_when instead; see the sketch after the code.
# inelegant code to guess each user's role
issues <- dplyr::group_by(issues, issue)
issues <- dplyr::arrange(issues, created_at)
issues <- dplyr::mutate(issues, author = user[1])
issues <- dplyr::mutate(issues, package = unique(package[!is.na(package)]))
issues <- dplyr::mutate(issues, assignee = assignee[1])
issues <- dplyr::mutate(issues, reviewer1 = ifelse(!is.na(user[is_review][1]), user[is_review][1], ""))
issues <- dplyr::mutate(issues, reviewer2 = ifelse(!is.na(user[is_review][2]), user[is_review][2], ""))
issues <- dplyr::mutate(issues, reviewer3 = ifelse(!is.na(user[is_review][3]), user[is_review][3], ""))
issues <- dplyr::ungroup(issues)
issues <- dplyr::group_by(issues, issue, created_at, user)
# regexp because in at least 1 case assignee = 2 names glued together
issues <- dplyr::mutate(issues, commenter = ifelse(stringr::str_detect(assignee, user), "editor", "other"))
issues <- dplyr::mutate(issues, commenter = ifelse(user == author, "author", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer1, "reviewer1", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer2, "reviewer2", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer3, "reviewer3", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == "stefaniebutland", "community_manager", commenter))
issues <- dplyr::ungroup(issues)
issues <- dplyr::mutate(issues, role = commenter,
role = ifelse(stringr::str_detect(role, "reviewer"),
"reviewer", role))
issues <- dplyr::select(issues, - author, - reviewer1, - reviewer2, - reviewer3, - assignee,
- author_association, - comment_url)
readr::write_csv(issues, "data/clean_data.csv")
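For the record, here is a hypothetical case_when version of the same role guessing. Since case_when returns the first matching condition, while the chain of ifelse calls above lets the last assignment win, the conditions are listed from highest to lowest priority.
# hypothetical case_when() equivalent of the ifelse() chain above:
# first match wins, so conditions go from highest to lowest priority
issues <- dplyr::mutate(issues,
                        commenter = dplyr::case_when(
                          user == "stefaniebutland" ~ "community_manager",
                          user == reviewer3 ~ "reviewer3",
                          user == reviewer2 ~ "reviewer2",
                          user == reviewer1 ~ "reviewer1",
                          user == author ~ "author",
                          stringr::str_detect(assignee, user) ~ "editor",
                          TRUE ~ "other"
                        ))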
The role “other” corresponds to anyone chiming in, while the community manager role involves planning blog posts with the package authors. We indeed have a series of guest blog posts from package authors that illustrate the review process as well as their onboarded packages.
Here is the final table. I unselect “body” because formatting in the
text could break the output here, but I do have the text corresponding
to each comment.
issues %>%
  dplyr::select(- body) %>%
  head() %>%
  knitr::kable()
| title  | created_at          | closed_at           | user        | issue | package | is_review | commenter | role     |
|--------|---------------------|---------------------|-------------|-------|---------|-----------|-----------|----------|
| rrlite | 2015-03-31 00:25:14 | 2015-04-13 23:26:38 | richfitz    | 6     | rrlite  | FALSE     | author    | author   |
| rrlite | 2015-04-01 17:30:51 | 2015-04-13 23:26:38 | sckott      | 6     | rrlite  | FALSE     | editor    | editor   |
| rrlite | 2015-04-01 17:36:03 | 2015-04-13 23:26:38 | karthik     | 6     | rrlite  | FALSE     | other     | other    |
| rrlite | 2015-04-02 03:36:09 | 2015-04-13 23:26:38 | jeroen      | 6     | rrlite  | FALSE     | reviewer2 | reviewer |
| rrlite | 2015-04-02 03:50:43 | 2015-04-13 23:26:38 | gaborcsardi | 6     | rrlite  | FALSE     | other     | other    |
| rrlite | 2015-04-02 03:53:57 | 2015-04-13 23:26:38 | richfitz    | 6     | rrlite  | FALSE     | author    | author   |
There are 2521 comments, corresponding to 70 onboarded packages.
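Those counts can be checked directly on the cleaned table:
# check the counts quoted above
nrow(issues)                       # number of comments
dplyr::n_distinct(issues$package)  # number of onboarded packages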
Submitted repositories: down to a few metrics
As mentioned earlier, onboarded packages are most often developed on GitHub. After onboarding they live in the ropensci GitHub organization; previously some of them were onboarded into ropenscilabs, but they should all be transferred soon. In any case, their being on GitHub means it’s possible to get their history and have a glimpse of the work represented by onboarding!
Getting all onboarded repositories
Using the rOpenSci git2r package, I cloned all onboarded repositories into a “repos” folder. Since I didn’t know which packages were in ropensci and which in ropenscilabs, I tried both.
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()

safe_clone <- purrr::safely(git2r::clone)

# github link either ropensci or ropenscilabs
clone_repo <- function(package_name){
  print(package_name)
  url <- paste0("https://github.com/ropensci/", package_name, ".git")
  local_path <- paste0(getwd(), "/repos/", package_name)
  clone_from_ropensci <- safe_clone(url = url, local_path = local_path,
                                    progress = FALSE)
  if(is.null(clone_from_ropensci$result)){
    url <- paste0("https://github.com/ropenscilabs/", package_name, ".git")
    clone_from_ropenscilabs <- safe_clone(url = url, local_path = local_path,
                                          progress = FALSE)
    if(is.null(clone_from_ropenscilabs$result)){
      message("OUILLE")
    }
  }
}

pkgs <- unique(airtable$package)
# skip packages already cloned into repos/
pkgs <- pkgs[!pkgs %in% basename(fs::dir_ls("repos"))]
pkgs <- pkgs[pkgs != "rrricanes"]
purrr::walk(pkgs, clone_repo)
I didn’t clone “rrricanes” because it was too big!
Getting commit reports
I then got the commit logs of each repo for two reasons:
- the commits themselves show how much code and documentation editing was done during review;
- I wanted to be able to git reset --hard each repo to its state at submission, for which I needed the commit logs.
I used the gitsum package to get commit logs because its dedicated high-level functions made it easier than with git2r.
library("magrittr")
get_report <- function(package_name){
message(package_name)
local_path <- paste0(getwd(), "/repos/", package_name)
if(length(fs::dir_ls(local_path)) != 0){
gitsum::init_gitsum(local_path, over_write = TRUE)
report <- gitsum::parse_log_detailed(local_path)
report <- dplyr::select(report, - nested)
report$package <- package_name
if(!"datetime" %in% names(report)){
report <- dplyr::mutate(report,
hour = as.numeric(stringr::str_sub(timezone, 1, 3)),
minute = as.numeric(stringr::str_sub(timezone, 4, 5)),
datetime = date + lubridate::hours(-1 * hour) + lubridate::minutes(-1 * minute))
report <- dplyr::select(report, - hour, - minute, - timezone)
}
report <- dplyr::select(report, - date)
return(report)
}else{
return(NULL)
}
}
packages <- fs::dir_ls("repos")
packages <- stringr::str_replace_all(packages, "repos\\/", "")
purrr::map_df(packages, get_report) %>%
readr::write_csv("output/gitsum_reports.csv")
Getting repositories as they were at submission
Crossing information from the issue threads and from the commit logs, I could find the latest commit before submission and create a copy of each repo before resetting it to that state. This is the closest to a Time-Turner that I have!
library("magrittr")
# get issues opening datetime
issues <- readr::read_csv("data/clean_data.csv")
issues <- dplyr::group_by(issues, package)
issues <- dplyr::summarise(issues, opened = min(created_at))
# now for each package keep only commits before that
commits <- readr::read_csv("output/gitsum_reports.csv")
commits <- dplyr::left_join(commits, issues, by = "package")
commits <- dplyr::group_by(commits, package)
commits <- dplyr::filter(commits, datetime <= opened)
# and from them keep the latest one,
# that's the latest commit before submission!
commits <- dplyr::filter(commits, datetime == max(datetime), !is_merge)
commits <- dplyr::summarize(commits, hash = hash[1])
# small helper function
get_sha <- function(commit){
commit@sha
}
set_archive <- function(package_name, commit){
message(package_name)
# copy the entire repo to another location
local_path <- paste0(getwd(), "/repos/", package_name)
local_path_archive <- paste0(getwd(), "/repos_at_submission/", package_name)
fs::dir_copy(local_path, local_path_archive)
# get all commits -- it's fast which is why I don't use gitsum report here
commits <- git2r::commits(git2r::repository(local_path_archive))
# get their sha
sha <- purrr::map_chr(commits, get_sha)
# all of this to extract the commit with the sha of the latest commit before submission
# in other words the latest commit before submission
commit <- commits[sha == commit][[1]]
# do a hard reset at that commit
git2r::reset(commit, reset_type = "hard")
}
purrr::walk2(commits$package, commits$hash, set_archive)
Outlook: getting even more data? Or analyzing this dataset
There’s more data to be collected or prepared! From GitHub issues, using GitHub archive, one could get the labelling history: when did an issue go from “editor-checks” to “seeking-reviewers”, for instance? That would help characterize the usual speed of the process. One could also investigate the formal and less formal links between the onboarded repository and the review: did commits and issues mention the onboarding review (in words), or even actually link to it? Are actors in the process barely or very active on GitHub otherwise, e.g. could we see that some reviewers create or revive their GitHub account especially for reviewing?
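As a hypothetical sketch of the first idea: the GitHub Archive data can be queried from BigQuery, e.g. with the bigrquery package, assuming labelling actions appear in the archived IssuesEvent payloads; the project name below is a placeholder for a real billing project.
# hypothetical sketch: labelling events for one archived day;
# "my-gcloud-project" is a placeholder for a real billing project
query <- "SELECT created_at,
       JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
       JSON_EXTRACT_SCALAR(payload, '$.label.name') AS label,
       JSON_EXTRACT_SCALAR(payload, '$.issue.number') AS issue
FROM `githubarchive.day.20180301`
WHERE type = 'IssuesEvent' AND repo.name = 'ropensci/onboarding'"
labelling <- bigrquery::bq_table_download(
  bigrquery::bq_project_query("my-gcloud-project", query)
)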
Rather than enlarging my current dataset, I’ll present its analysis in two further blog posts answering the questions “How much work is rOpenSci onboarding?” and “How to characterize the social weather of rOpenSci onboarding?”. In case you’re too impatient, in the meantime you can dive into this blog post by Augustina Ragwitz about measuring open-source influence beyond commits, and this one by rOpenSci co-founder Scott Chamberlain about exploring git commits with git2r.