Our onboarding reviews, which ensure that packages contributed by the community undergo a transparent, constructive, non-adversarial and open review process, take place in the issue tracker of a GitHub repository. Development of the packages we onboard also happens in the open, most often in GitHub repositories.
Therefore, when I wanted to give a data-driven overview of our onboarding system, my mission was to extract data from GitHub and git repositories and to put it into nice rectangles (as defined by Jenny Bryan) ready for analysis. You might call that the first step of a “tidy git analysis”, to use the term coined by Simon Jackson.
So, how did I collect data?
A side-note about GitHub
In the following, I’ll mention repositories. All of them are git repositories, which means they’re folders under version control, where, roughly speaking, all changes are saved via commits, each with a message (more or less) describing what’s been changed. On top of that, these repositories live on GitHub, which means they get to enjoy some infrastructure such as issue trackers, milestones, starring by admirers, etc. If that ecosystem is brand new to you, I recommend reading this book, especially its big picture chapter.
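To make commits concrete, here is a minimal sketch using the same rOpenSci git2r package that appears later in this post, reading the summary message of each commit of a repository; “path/to/repo” is a hypothetical placeholder for any local clone.
# minimal sketch: each commit pairs a unique identifier (sha) with a message;
# "path/to/repo" is a placeholder for any local git repository
repo <- git2r::repository("path/to/repo")
purrr::map_chr(git2r::commits(repo), function(commit) commit@summary)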
Package review processes: weaving the threads
Each package submission is an issue thread in our onboarding repository; see an example here. The first comment in that issue is the submission itself, followed by many comments by the editor, reviewers and authors. On top of all the data saved there, mostly text data, we maintain a private Airtable workspace with a table of reviewers and their reviews, including direct links to the issue comments that are reviews.
Getting issue threads
Unsurprisingly, the first step here was to “get issue threads”. What do I mean? I wanted a table of all issue threads, one line per comment, with columns indicating the time at which something was written, and columns digesting the data from the issue itself, e.g. guessing the role of the commenter from other information: the first user of the issue is the “author”.
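To make that concrete, here is a hypothetical miniature of the target rectangle, one row per comment (the values are borrowed from the final table shown later in this post):
# hypothetical miniature of the target rectangle: one row per comment
tibble::tribble(
  ~issue, ~user,      ~created_at,           ~role,
  6,      "richfitz", "2015-03-31 00:25:14", "author",
  6,      "sckott",   "2015-04-01 17:30:51", "editor"
)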
I used to use GitHub API V3 and then heard about GitHub API V4, which blew my mind. As if I weren’t impressed enough by the mere existence of this API and its advantages, I discovered that the rOpenSci ghql package allows one to interact with such an API, and that its docs actually use GitHub API V4 as an example!
Carl Boettiger told me about his way to rectangle JSON data, using jq, a language for processing JSON, via a dedicated rOpenSci package, jqr.
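For a quick flavor of jq, here is a toy example of my own (not Carl’s) extracting nested values with jqr:
# toy JSON: extract every nested "number" field with a jq program
jqr::jq('{"edges": [{"node": {"number": 1}}, {"node": {"number": 2}}]}',
        '.edges[].node.number')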
I have nothing against GitHub API V3 and gh and purrr workflows, but I was curious and really enjoyed learning these new tools and writing this code. I had written gh/purrr code for getting the same information and it felt clumsier, but that might just be because I wasn’t perfectionist enough when writing it! I managed to write the correct GitHub V4 API query to get just what I needed by using its online explorer. I then succeeded in transforming the JSON output into a rectangle by reading Carl’s post, but also by taking advantage of another online explorer, jq play, where I pasted my output via writeClipboard. That’s nearly always the way I learn about query tools: using some sort of explorer and then pasting the code into a script. Once I am more experienced, I can skip the explorer part.
The first function I wrote was one for getting the issue number of the last onboarding issue, so that I could then loop/map over all issues.
library("ghql")
library("httr")
library("magrittr")
# function to get number of last issue
get_last_issue <- function(){
query = '{
repository(owner: "ropensci", name: "onboarding") {
issues(last: 1) {
edges{
node{
number
}
}
}
}
}'
token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
cli <- GraphqlClient$new(
url = "https://api.github.com/graphql",
headers = add_headers(Authorization = paste0("Bearer ", token))
)
## define query
### creat a query class first
qry <- Query$new()
qry$query('issues', query)
last_issue <-cli$exec(qry$queries$issues)
last_issue %>%
jqr::jq('.data.repository.issues.edges[].node.number') %>%
as.numeric()
}
get_last_issue()
## [1] 201
Then I wrote a function for getting all the precious info I needed from an issue thread. At the time it lived on its own in an R script; now it’s included in my ghrecipes package as get_issue_thread, so you can check out the code there, along with other useful recipes for analyzing GitHub data.
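If you are curious about the shape of such a function, here is a hypothetical sketch (not the actual ghrecipes code) of the kind of V4 query and jq rectangling it involves, reusing the GraphQL client defined above; the field names follow the GitHub V4 schema, the rest is illustrative.
# hypothetical sketch of fetching one issue thread (not the ghrecipes code)
query <- '{
  repository(owner: "ropensci", name: "onboarding") {
    issue(number: 6) {
      title
      comments(first: 100) {
        nodes {
          author { login }
          createdAt
          body
        }
      }
    }
  }
}'
qry <- Query$new()
qry$query('thread', query)
cli$exec(qry$queries$thread) %>%
  jqr::jq('.data.repository.issue.comments.nodes[] |
           {user: .author.login, created_at: .createdAt}')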
Then I launched this code to get all data! It was very satisfying.
# get all threads
issues <- purrr::map_df(1:get_last_issue(), get_issue_thread)

# for the one(s) with 101 rows, i.e. hitting the 100-comment page limit,
# get the 100 last comments instead
long_issues <- issues %>%
  dplyr::count(issue) %>%
  dplyr::filter(n == 101) %>%
  dplyr::pull(issue)
issues2 <- purrr::map_df(long_issues, get_issue_thread, first = FALSE)

all_issues <- dplyr::bind_rows(issues, issues2)
all_issues <- unique(all_issues)
readr::write_csv(all_issues, "data/all_threads_v4.csv")
Digesting the threads and complementing them with Airtable data
In the previous step we got a rectangle of all threads, with information
from the first issue comment (such as labels) distributed to all the
comments of the threads.
issues <- readr::read_csv("data/all_threads_v4.csv")
issues <- janitor::clean_names(issues)
issues <- dplyr::rename(issues, user = author)
issues <- dplyr::select(issues, - dplyr::contains("topic"))
issues %>%
  head() %>%
  dplyr::select(- body) %>%
  knitr::kable()
Now we need a few more steps:
- transforming NA into FALSE for variables corresponding to labels,
- getting the package name from Airtable, since the titles of issues are not uniformly formatted,
- knowing which comment is a review,
- deducing the role of the user writing the comment (author/editor/reviewer/community manager/other).
Below, binary variables are transformed and only rows corresponding to approved packages are kept.
# labels
replace_1 <- function(x){
  !is.na(x[1])
}

# binary variables
ncol_issues <- ncol(issues)
issues <- dplyr::group_by(issues, issue) %>%
  dplyr::arrange(created_at) %>%
  dplyr::mutate_at(9:(ncol_issues - 1), replace_1) %>%
  dplyr::ungroup()

# keep only issues that are finished
issues <- dplyr::filter(issues, package, !x0_presubmission,
                        !out_of_scope, !legacy,
                        !x1_editor_checks, x6_approved)
issues <- dplyr::select(issues, - dplyr::starts_with("x"),
                        - package, - out_of_scope, - legacy,
                        - meta, - holding, - pulled, - question)
Then, thanks to the airtabler package, we can add the name of the package and identify review comments.
# airtable data
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()
airtable <- dplyr::mutate(airtable,
                          issue = as.numeric(stringr::str_replace(onboarding_url,
                                                                  ".*issues\\/", "")))

# we get the name of the package
# and we know which comments are reviews
reviews <- dplyr::select(airtable, review_url, issue, package) %>%
  dplyr::mutate(is_review = TRUE)
issues <- dplyr::left_join(issues, reviews,
                           by = c("issue", "comment_url" = "review_url"))
issues <- dplyr::mutate(issues, is_review = !is.na(is_review))
Finally, the inelegant code below attributes a role to each user (commenter is its more precise version, differentiating reviewer 1 from reviewer 2). I could have used dplyr case_when instead; see the sketch after the code.
# inelegant code to guess each user's role
issues <- dplyr::group_by(issues, issue)
issues <- dplyr::arrange(issues, created_at)
issues <- dplyr::mutate(issues, author = user[1])
issues <- dplyr::mutate(issues, package = unique(package[!is.na(package)]))
issues <- dplyr::mutate(issues, assignee = assignee[1])
issues <- dplyr::mutate(issues, reviewer1 = ifelse(!is.na(user[is_review][1]), user[is_review][1], ""))
issues <- dplyr::mutate(issues, reviewer2 = ifelse(!is.na(user[is_review][2]), user[is_review][2], ""))
issues <- dplyr::mutate(issues, reviewer3 = ifelse(!is.na(user[is_review][3]), user[is_review][3], ""))
issues <- dplyr::ungroup(issues)
issues <- dplyr::group_by(issues, issue, created_at, user)
# regexp because in at least 1 case assignee = 2 names glued together
issues <- dplyr::mutate(issues, commenter = ifelse(stringr::str_detect(assignee, user), "editor", "other"))
issues <- dplyr::mutate(issues, commenter = ifelse(user == author, "author", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer1, "reviewer1", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer2, "reviewer2", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer3, "reviewer3", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == "stefaniebutland", "community_manager", commenter))
issues <- dplyr::ungroup(issues)
issues <- dplyr::mutate(issues, role = commenter,
role = ifelse(stringr::str_detect(role, "reviewer"),
"reviewer", role))
issues <- dplyr::select(issues, - author, - reviewer1, - reviewer2, - reviewer3, - assignee,
- author_association, - comment_url)
readr::write_csv(issues, "data/clean_data.csv")
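For the record, here is a hypothetical case_when version of the same role guessing. Since case_when returns the first matching condition, while the chain of ifelse calls above lets the last assignment win, the conditions are listed from highest to lowest priority.
# hypothetical case_when() equivalent of the ifelse() chain above:
# first match wins, so conditions go from highest to lowest priority
issues <- dplyr::mutate(issues,
                        commenter = dplyr::case_when(
                          user == "stefaniebutland" ~ "community_manager",
                          user == reviewer3 ~ "reviewer3",
                          user == reviewer2 ~ "reviewer2",
                          user == reviewer1 ~ "reviewer1",
                          user == author ~ "author",
                          stringr::str_detect(assignee, user) ~ "editor",
                          TRUE ~ "other"
                        ))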
The role “other” corresponds to anyone chiming in, while the community manager role involves planning blog posts with the package authors. We indeed have a series of guest blog posts from package authors that illustrate the review process as well as their onboarded packages.
Here is the final table. I unselect “body” because formatting in the
text could break the output here, but I do have the text corresponding
to each comment.
issues %>%
  dplyr::select(- body) %>%
  head() %>%
  knitr::kable()
| title  | created_at          | closed_at           | user        | issue | package | is_review | commenter | role     |
|--------|---------------------|---------------------|-------------|-------|---------|-----------|-----------|----------|
| rrlite | 2015-03-31 00:25:14 | 2015-04-13 23:26:38 | richfitz    | 6     | rrlite  | FALSE     | author    | author   |
| rrlite | 2015-04-01 17:30:51 | 2015-04-13 23:26:38 | sckott      | 6     | rrlite  | FALSE     | editor    | editor   |
| rrlite | 2015-04-01 17:36:03 | 2015-04-13 23:26:38 | karthik     | 6     | rrlite  | FALSE     | other     | other    |
| rrlite | 2015-04-02 03:36:09 | 2015-04-13 23:26:38 | jeroen      | 6     | rrlite  | FALSE     | reviewer2 | reviewer |
| rrlite | 2015-04-02 03:50:43 | 2015-04-13 23:26:38 | gaborcsardi | 6     | rrlite  | FALSE     | other     | other    |
| rrlite | 2015-04-02 03:53:57 | 2015-04-13 23:26:38 | richfitz    | 6     | rrlite  | FALSE     | author    | author   |
There are 2521 comments, corresponding to 70 onboarded packages.
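Those counts can be checked directly on the cleaned table:
# check the counts quoted above
nrow(issues)                       # number of comments
dplyr::n_distinct(issues$package)  # number of onboarded packages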
Submitted repositories: down to a few metrics
As mentioned earlier, onboarded packages are most often developed on GitHub. After onboarding they live in the ropensci GitHub organization; previously some of them were onboarded into ropenscilabs, but they should all be transferred soon. In any case, their being on GitHub means it’s possible to get their history and have a glimpse of the work represented by onboarding!
Getting all onboarded repositories
Using the rOpenSci git2r package, I cloned all onboarded repositories into a “repos” folder. Since I didn’t know which packages were in ropensci and which in ropenscilabs, I tried both.
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()

safe_clone <- purrr::safely(git2r::clone)

# github link either ropensci or ropenscilabs
clone_repo <- function(package_name){
  print(package_name)
  url <- paste0("https://github.com/ropensci/", package_name, ".git")
  local_path <- paste0(getwd(), "/repos/", package_name)
  clone_from_ropensci <- safe_clone(url = url, local_path = local_path,
                                    progress = FALSE)
  if(is.null(clone_from_ropensci$result)){
    url <- paste0("https://github.com/ropenscilabs/", package_name, ".git")
    clone_from_ropenscilabs <- safe_clone(url = url, local_path = local_path,
                                          progress = FALSE)
    if(is.null(clone_from_ropenscilabs$result)){
      message("OUILLE")
    }
  }
}

pkgs <- unique(airtable$package)
# skip packages already cloned into repos/
pkgs <- pkgs[!pkgs %in% basename(fs::dir_ls("repos"))]
pkgs <- pkgs[pkgs != "rrricanes"]
purrr::walk(pkgs, clone_repo)
I didn’t clone “rrricanes” because it was too big!
Getting commit reports
I then got the commit logs of each repo for two reasons:
- the commits themselves show how much code and documentation editing was done during review;
- I wanted to be able to git reset --hard each repo to its state at submission, for which I needed the commit logs.
I used the gitsum package to get commit logs because its dedicated high-level functions made it easier than with git2r.
library("magrittr")
get_report <- function(package_name){
message(package_name)
local_path <- paste0(getwd(), "/repos/", package_name)
if(length(fs::dir_ls(local_path)) != 0){
gitsum::init_gitsum(local_path, over_write = TRUE)
report <- gitsum::parse_log_detailed(local_path)
report <- dplyr::select(report, - nested)
report$package <- package_name
if(!"datetime" %in% names(report)){
report <- dplyr::mutate(report,
hour = as.numeric(stringr::str_sub(timezone, 1, 3)),
minute = as.numeric(stringr::str_sub(timezone, 4, 5)),
datetime = date + lubridate::hours(-1 * hour) + lubridate::minutes(-1 * minute))
report <- dplyr::select(report, - hour, - minute, - timezone)
}
report <- dplyr::select(report, - date)
return(report)
}else{
return(NULL)
}
}
packages <- fs::dir_ls("repos")
packages <- stringr::str_replace_all(packages, "repos\\/", "")
purrr::map_df(packages, get_report) %>%
readr::write_csv("output/gitsum_reports.csv")
Getting repositories as they were at submission
Crossing information from the issue threads and from the commit logs, I could find the latest commit before submission and create a copy of each repo before resetting it to that state. This is the closest to a Time-Turner that I have!
library("magrittr")
# get issues opening datetime
issues <- readr::read_csv("data/clean_data.csv")
issues <- dplyr::group_by(issues, package)
issues <- dplyr::summarise(issues, opened = min(created_at))
# now for each package keep only commits before that
commits <- readr::read_csv("output/gitsum_reports.csv")
commits <- dplyr::left_join(commits, issues, by = "package")
commits <- dplyr::group_by(commits, package)
commits <- dplyr::filter(commits, datetime <= opened)
# and from them keep the latest one,
# that's the latest commit before submission!
commits <- dplyr::filter(commits, datetime == max(datetime), !is_merge)
commits <- dplyr::summarize(commits, hash = hash[1])
# small helper function
get_sha <- function(commit){
commit@sha
}
set_archive <- function(package_name, commit){
message(package_name)
# copy the entire repo to another location
local_path <- paste0(getwd(), "/repos/", package_name)
local_path_archive <- paste0(getwd(), "/repos_at_submission/", package_name)
fs::dir_copy(local_path, local_path_archive)
# get all commits -- it's fast which is why I don't use gitsum report here
commits <- git2r::commits(git2r::repository(local_path_archive))
# get their sha
sha <- purrr::map_chr(commits, get_sha)
# all of this to extract the commit with the sha of the latest commit before submission
# in other words the latest commit before submission
commit <- commits[sha == commit][[1]]
# do a hard reset at that commit
git2r::reset(commit, reset_type = "hard")
}
purrr::walk2(commits$package, commits$hash, set_archive)
Outlook: getting even more data? Or analyzing this dataset
There’s more data to be collected or prepared! From GitHub issues, using GitHub archive, one could get the labelling history: when did an issue go from “editor-checks” to “seeking-reviewers”, for instance? That would help characterize the usual speed of the process. One could also investigate the formal and less formal links between the onboarded repository and the review: did commits and issues mention the onboarding review (in words), or even actually link to it? Are actors in the process barely or very active on GitHub otherwise, e.g. could we see that some reviewers create or revive their GitHub account especially for reviewing?
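As a hypothetical sketch of the first idea: the GitHub Archive data can be queried from BigQuery, e.g. with the bigrquery package, assuming labelling actions appear in the archived IssuesEvent payloads; the project name below is a placeholder for a real billing project.
# hypothetical sketch: labelling events for one archived day;
# "my-gcloud-project" is a placeholder for a real billing project
query <- "SELECT created_at,
       JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
       JSON_EXTRACT_SCALAR(payload, '$.label.name') AS label,
       JSON_EXTRACT_SCALAR(payload, '$.issue.number') AS issue
FROM `githubarchive.day.20180301`
WHERE type = 'IssuesEvent' AND repo.name = 'ropensci/onboarding'"
labelling <- bigrquery::bq_table_download(
  bigrquery::bq_project_query("my-gcloud-project", query)
)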
Rather than enlarging my current dataset, I’ll present its analysis in two further blog posts answering the questions “How much work is rOpenSci onboarding?” and “How to characterize the social weather of rOpenSci onboarding?”. In case you’re too impatient, in the meantime you can dive into this blog post by Augustina Ragwitz about measuring open-source influence beyond commits, and this one by rOpenSci co-founder Scott Chamberlain about exploring git commits with git2r.