
2020-04 Catching up with R Graphics

[This article was first published on R – Stat Tech, and kindly contributed to R-bloggers].

This document describes an expansion of the R graphics engine to support a number of new graphical features: gradients, patterns, masks, and clipping paths.

Paul Murrell

Download
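The full report is available via the Download link above. As a small taste of one of the features it describes, here is a minimal gradient sketch (not code from the report; it assumes a version of R recent enough to expose the new graphics-engine features through the grid package, i.e. R >= 4.1 in released versions):

# Minimal gradient sketch (assumes R >= 4.1, where grid gained linearGradient())
library(grid)
grid.newpage()
grid.rect(width = 0.6, height = 0.6,
          gp = gpar(fill = linearGradient(c("white", "steelblue"))))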


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.



Doing Maths Symbolically: R as a Computer Algebra System (CAS)

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].

When I first saw the Computer Algebra System Mathematica in the nineties I was instantly fascinated by it: you could not just calculate things with it but also solve equations, simplify, differentiate and integrate expressions, and even solve simple differential equations… not just numerically but symbolically! It helped me a lot during my studies at university back then. Normally you cannot do this kind of stuff with R but fear not, there is, of course, a package for that!

There are many so-called Computer Algebra Systems (CAS) out there, both commercial and open source. One very mature one is called YACAS (for Yet Another Computer Algebra System). You can find the documentation here: Yacas Documentation (many of the following examples are taken from there).

You can use its full power in R by installing the Ryacas package from CRAN. You can use Yacas commands directly by invoking the yac_str function; the as_r function converts the output to R. Let us first simplify a mathematical expression:

library(Ryacas)
## 
## Attaching package: 'Ryacas'
## The following object is masked from 'package:stats':
## 
##     deriv
## The following objects are masked from 'package:base':
## 
##     %*%, determinant, diag, diag<-, I, lower.tri, upper.tri

# simplify expressions
as_r(yac_str("Simplify(a*b*a^2/b-a^3)"))
## [1] 0

Or solve an equation:

as_r(yac_str("Solve(a+x*y==z,x)"))
## [1] "x==-(a-z)/y"

And you can do all kinds of tedious stuff that is quite error-prone when done by hand, e.g. expanding expressions like (x-2)^{20} by using the binomial theorem:

as_r(yac_str("Expand((x-2)^20)"))
## expression(x^20 - 40 * x^19 + 760 * x^18 - 9120 * x^17 + 77520 * 
##     x^16 - 496128 * x^15 + 2480640 * x^14 - 9922560 * x^13 + 
##     32248320 * x^12 - 85995520 * x^11 + 189190144 * x^10 - 343982080 * 
##     x^9 + 515973120 * x^8 - 635043840 * x^7 + 635043840 * x^6 - 
##     508035072 * x^5 + 317521920 * x^4 - 149422080 * x^3 + 49807360 * 
##     x^2 - 10485760 * x + 1048576)

To demonstrate how easily the results can be integrated into R, let us do some curve sketching on a function. First, we define two helper functions: one for converting an expression into a function (which can then be used to plot it) and one for determining the derivative of order n of some function (we redefine the D function for that):

as_function <- function(expr) {
  as.function(alist(x =, eval(parse(text = expr))))
}

# redefine D function
D <- function(eq, order = 1) {
  yac_str(paste("D(x,", order, ")", eq))
}

Now, we define the function (in this case a simple polynomial 2x^3 - 3x^2 + 4x - 5), determine the first and second derivatives symbolically and plot everything:

xmin <- -5
xmax <- 5
eq <- "2*x^3 - 3*x^2 + 4*x - 5"
eq_f <- as_function(eq)
curve(eq_f, xmin, xmax, ylab = "y(x)")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)

D_eq <- D(eq)
D_eq
## [1] "6*x^2-6*x+4"
D_eq_f <- as_function(D_eq)
curve(D_eq_f, xmin, xmax, add = TRUE, col = "blue")

D2_eq <- D(eq, 2)
D2_eq
## [1] "12*x-6"
D2_eq_f <- as_function(D2_eq)
curve(D2_eq_f, xmin, xmax, add = TRUE, col = "green")

Impressive, isn’t it? Yacas can also determine limits and integrals:

# determine limits
yac_str("Limit(x,0) 1/x")
## [1] "Undefined"
yac_str("Limit(x,0,Left) 1/x")
## [1] "-Infinity"
yac_str("Limit(x,0,Right) 1/x")
## [1] "Infinity"

# integration
yac_str("Integrate(x) Cos(x)")
## [1] "Sin(x)"
yac_str("Integrate(x,a,b) Cos(x)")
## [1] "Sin(b)-Sin(a)"

As an example, we can prove in no time that the famous approximation \pi \approx \frac{22}{7} is actually too big (more details can be found here: Proof that 22/7 exceeds π):

    \[0 < \int_0^1 \frac{x^4\left(1-x\right)^4}{1+x^2} \, dx = \frac{22}{7} - \pi\]

yac_str("Integrate(x,0,1) x^4*(1-x)^4/(1+x^2)")
## [1] "22/7-Pi"

And, as the grand finale of this post, Yacas is even able to solve ordinary differential equations symbolically! Let us first take the simplest of them all:

as_r(yac_str("OdeSolve(y' == y)"))
## expression(C115 * exp(x))

It correctly returns the exponential function (the C term is just an arbitrary constant). And finally a more complex, higher-order one:

as_r(yac_str("OdeSolve(y'' - 4*y == 0)"))
## expression(C154 * exp(2 * x) + C158 * exp(-2 * x))

I still find CAS amazing and extremely useful… and an especially powerful one can be used from within R!


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


RTutor: Quantifying Social Spillovers in Movie Ticket Sales

[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers].

Probably many of us would be more inclined to watch a particular movie in the cinema if some friends or colleagues have already seen it and talk about it (at least if it is a decent movie).

Duncan Sheppard Gilchrist and Emily Glassberg Sands quantify such social spillover effects in their great article Something to Talk About: Social Spillovers in Movie Consumption (2016, Journal of Political Economy). The key idea is to use random weather fluctuations during a movie’s premiere to causally identify the social spillover effects.

Consider the following simplified causal model of factors that determine the number of views during a movie’s premiere:

We want to instrument the number of views at a movie’s premiere with weather characteristics that are uncorrelated with movie characteristics. There are two catches, however: First, the premiere date is likely correlated with both the weather and movie characteristics (e.g. commercially very promising movies may be timed to premiere around Christmas). Second, since there is a huge number of possible relevant weather conditions, the authors use a LASSO approach to select relevant weather variables, i.e. we cannot just run a standard IV regression with date variables as exogenous controls.
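To make the instrumenting idea concrete, here is a rough sketch of selecting weather instruments with a LASSO and then running a two-stage least squares regression. This is not the paper’s or the problem set’s exact procedure, and the data frame movies and all of its columns (log_sales_later, log_sales_opening, the weather columns w1, w2, …, and the date controls month and year) are hypothetical:

library(glmnet)  # LASSO
library(AER)     # ivreg() for two-stage least squares

# Step 1: LASSO selects the weather variables that predict opening-weekend sales
W  <- as.matrix(movies[, grep("^w[0-9]+$", names(movies))])
cv <- cv.glmnet(W, movies$log_sales_opening, alpha = 1)
sel <- colnames(W)[as.vector(coef(cv, s = "lambda.min"))[-1] != 0]

# Step 2: instrument opening sales with the selected weather variables,
# keeping the premiere-date controls as exogenous regressors on both sides
f <- as.formula(paste(
  "log_sales_later ~ log_sales_opening + factor(month) + factor(year) |",
  paste(c(sel, "factor(month)", "factor(year)"), collapse = " + ")))
summary(ivreg(f, data = movies))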

I think it is a very nice econometric quiz to determine how one can causally estimate the social spillover effects using only a sequence of basic OLS regressions and one LASSO regression. (This paper was also one of the motivators for writing my previous blog post on regression anatomy.)

As part of her Bachelor thesis at Ulm University, Lara Santak has created a very nice RTutor problem set that allows you to replicate the analysis in an interactive fashion in R. Like in previous RTutor problem sets, you can enter free R code in a web-based shiny app. The code will be automatically checked and you can get hints on how to proceed. In addition, you are challenged by multiple choice quizzes. While the causal identification strategy is explained more rigorously in the original paper, the RTutor problem set nicely allows you to dive into the very interesting data set.

You can test the problem set online on shinyapps.io:

https://lara-santak.shinyapps.io/RTutorSomethingToTalkAbout

The free shinyapps.io account is capped at 25 hours of total usage time per month, so the app may be greyed out when you click on it.

To locally install the problem set, follow the installation guide at the problem set’s Github repository: https://github.com/larasantak/RTutorSomethingToTalkAbout

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at

https://skranz.github.io/RTutor/

or at the Github page

https://github.com/skranz/RTutor


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


R Package Integration with Modern Reusable C++ Code Using Rcpp – Part 2

[This article was first published on R Views, and kindly contributed to R-bloggers].

Daniel Hanson is a full-time lecturer in the Computational Finance & Risk Management program within the Department of Applied Mathematics at the University of Washington. His appointment followed over 25 years of experience in private sector quantitative development in finance and data science.

In the first post in this series, we looked at configuring a Windows 10 environment for using the Rcpp package. However, what follows below, and going forward, is applicable to an up-to-date R, RStudio, and Rcpp configuration on any operating system.

Today, we will examine design considerations in integrating standard and portable C++ code in an R package, using Rcpp at the interface level alone. This will ensure no R-related dependencies are introduced into the C++ code base. In general, of course, best programming practices say we should strive to keep interface and implementation separate.

Design Considerations

For this discussion, we will assume the package developer has access to a repository of standard C++ code that is intended for use with other mathematical or scientific applications and interfaces. The goal is to integrate this code into an R package, and then export functions to R that will use this existing C++ code. The end users need not be concerned that they are using C++ code; they will only see the exported functions, which can be used and called like any other R function.

The package developer, at this stage, has two components that cannot communicate with each other, at least not yet:

R and C++ Components

Establishing Communication

This is where Rcpp comes in. We will create an interface layer that utilizes functions and objects in the Rcpp C++ namespace that facilitate communication between R and C++. This interface will ensure that no dependence on R or Rcpp is introduced into our reusable code base.

The Rcpp interface connects R and C++

The Rcpp namespace contains a treasure trove of functions and objects that abstract away the terse underlying C interface provided by R, making our job far less painful. However, at this initial stage, to keep the discussion focused on a basic interface example, we will limit our use of Rcpp functions to facilitate the transfer of numeric vector data in R to the workhorse STL container std::vector, which is of course ubiquitous in quantitative C++ code.

Tags to Indicate Interface Functions

C++ interface functions are indicated by a tag that needs to be placed just above the function name and signature. It is written as

// [[Rcpp::export]]

This tag instructs the package build process to export a function of the exact same name to R. As an interface function, it will take arguments from an R session, route them in a call to a function or class in the C++ code base, and then take the results that are returned and pass them back to the calling function in R.

Conversion between R Vectors and C++ Vectors

The Rcpp::NumericVector class, as its name suggests, stores data taken from an R numeric vector, but what makes Rcpp even more powerful here is its inclusion of the C++ template function Rcpp::as<T>(.). This function safely and efficiently copies the contents of an Rcpp::NumericVector to a std::vector object, as demonstrated in Figure 3, below.

Remark: Rcpp also has the function Rcpp::wrap(.), which copies values in an STL vector back into an Rcpp::NumericVector object, so that the results can then be returned to R; this function will be covered in our next article in this series.

C++ interface function to be exported to R

A C++ Interface Function

Figure 3 shows a mythical Rcpp interface function at the top, called fcn(.), and a function in our reusable standard C++ code base called doSomething(.). Note first the tag that appears just above the interface function signature. It must be exactly one line above, and there must be a single space only between the second forward slash and the first left square bracket.

This interface function will be exported to R, where it can be called by the same function name, fcn(.), taking in an R numeric vector input. The Rcpp::NumericVector object takes this data in as the input to C++. The contents are then transferred to a C++ std::vector<double>, using the Rcpp template function Rcpp::as<std::vector<double>>(.).

The data can then be passed to the doSomething(.) function in the standard C++ code base, as it is expecting a std::vector input. This function returns the C++ double variable ret to the variable y in the interface function. This requires no special conversion and can be passed back to R as a C++ double type.
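Since Figure 3 itself is not reproduced here, the following self-contained sketch shows the same pattern, run from R via Rcpp::sourceCpp(). The function bodies are illustrative stand-ins (doSomething() simply sums the vector), not the article’s exact code:

library(Rcpp)

sourceCpp(code = '
#include <Rcpp.h>
#include <vector>
#include <numeric>

// "Reusable" C++ with no dependence on R or Rcpp
double doSomething(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// [[Rcpp::export]]
double fcn(Rcpp::NumericVector x_in) {
    std::vector<double> x = Rcpp::as<std::vector<double>>(x_in);  // R -> std::vector
    double ret = doSomething(x);   // call into the reusable code base
    return ret;                    // a double needs no special conversion back to R
}
')

fcn(c(1, 2, 3))  # called from R like any other function
## [1] 6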

Putting the High-Level Design Together

With the C++ interface in place, an R user can call an R function that has been exported from C++. When the results are returned, they can be used in other R functions, but where we get an extraordinarily complementary benefit is with R’s powerful data visualization capabilities. Unlike languages such as Python, Java, or VB.NET, C++ does not have a standard GUI, but we can use cutting-edge R packages such as ggplot2, plotly, shiny, and xts, among many others, to generate a massive variety of plots and visualizations that are simply not available in other general purpose languages.

R and C++ Components

Next Steps

This concludes our discussion of high-level design considerations. Coming next, we will look at examples of writing actual interface functions to simple, but real, standard C++ functions and classes.


To leave a comment for the author, please follow the link and comment on their blog: R Views.


Maintaining an R Package – Community Call Summary

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

In March we held a Community Call discussing the maintenance of R packages. This call included a starting presentation by Julia Silge followed by a discussion featuring panelists with a wide variety of backgrounds: Elin Waring, Erin Grand, Leonardo Collado-Torres and Scott Chamberlain.

Headshots of the moderator and four panelists

The rOpenSci Package Maintenance Community Call was my (Janani’s) first Community Call and I loved it. For R/software-dev newbies, learning the right terminology/lingo/vocabulary is everything. It can take a few dozen blog posts and many hours of reading before beginners get to the ‘aha’ moment of ‘oh, these are the terms I need to use to search for what I’m thinking!’. As there are many online resources out there, the default expectation is that one can search for and learn almost everything provided one knows the right keywords. There’s nothing like hearing a lively technical banter of experts to pick up the vernacular that one can easily build upon. The first-hand tips and tricks, do’s and don’ts, personal anecdotes of what worked beautifully and what crashed terribly, offered by years of experience are yet unmatched in bringing newbies into speaking the community’s language. That is the precise gap filled in by the timely and helpful rOpenSci Package Maintenance Community Call!

During rOpenSci Community Calls, there is a shared document that allows anyone to add notes or questions during the discussion. I always take notes for my own use but quickly realized the benefit for all by taking collaborative notes. This live shared document helps everyone, including newbies, think through and formulate what they would like to say. It also gives people the option to participate without having to interrupt and speak up on the call (thus reducing the barrier for people, especially newcomers, to ask questions). The document also gives an opportunity for anyone in the community to share their expertise and add their perspective.

We felt that the rich content in the video and collaborative notes document from this call warranted a post that points readers to the material to ensure more people benefit from it. Here we’ve collated the questions and links to the various answers to facilitate look up.

Summarizing questions and answers

Moderator questions

1. What does it mean to “maintain an R package”? (video | document)

2. How do you manage issues / feature requests? (document) What workflows do you use to do this? (document) (video)

3. What is a path for new contributors to R packages? How can healthy norms be passed on? (video | document)

(Related: What should someone do if they want to start helping maintain one of your packages? First step?)

4. What considerations go into decisions around dependency management? (video) APIs that change? (video) (document)

5. What does the process of changing maintainers look like? What sets up this transition for success? (video | document)

6. We’ve talked about a lot of different facets of package management. Which are the same vs. different for internal packages? (video)

7. How do you decide to submit to a centralized repository like CRAN or Bioconductor? Peer review like JOSS or rOpenSci? Stay only on GitHub? (video | document)

8. What does someone need to know, or what skills do they need, to start maintaining a package? (video | document)

Audience questions

Daniel Sjoberg What is the best practice for ensuring continued support for older R versions when dependencies of dependencies of dependencies keep upping the minimum version of R required? (document)

Athanasia Monika Mowinckel Do any of you have a package that depends on software that is not from R, for instance another command line tool. I.e. your R package wraps system calls to the command line software and uses output from that in R. How do you improve user experience when the program needs environment variables R does not pick up with Sys.getenv()? (document)

Janani Ravi I would also like to know how to incorporate a significant part of non-R scripts within the R package workflow – functions that use system/system2 calls, for instance! Are there any systematic ways and good examples that could point to this? (document)

Lennert Schepers If I want to start with fixing an issue/contributing to a well-documented package, what are essential parts that I should know before fixing? Should I learn package documentation and testing/continuous integration and all other aspects of R packages… or are there parts that are necessary and others that are less urgent? Are there things that should be absolutely avoided? (things that can break the whole package)? (document)

Eric Scott What are some best practices for making big changes to a package? For example, changing the output of a function from a list to a tibble. Should this be a new function? Should there be an argument to get the old behavior back? Warnings to users to update their code? (document)

Resources

You can also check out the list of resources related to this community call.

Excited for more?

We hope this has inspired you to join in with future community calls! What would YOU like to hear about on a future rOpenSci Community Call? Add your votes and ideas for topics and speakers by 1) browsing the issues in this public repository; 2) giving a thumbs-up to or commenting on an issue; 3) opening a new issue if your idea doesn’t fit in any others.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


RStudio Connect 1.8.4

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

A place for Python applications

For data science teams looking to promote data-driven decision making within an organization, interactive applications built in R and Python are often the gold standard. These interactive tools, built by data science teams, are powerful when put in the hands of the right people, but they don’t do much good at all if they only run locally. R users who build Shiny applications hit this roadblock, and so do Python users working with tools like Dash, Bokeh, and Streamlit. IT departments want to help but often aren’t sure how. RStudio Connect solves this problem, for both R and Python applications.

RStudio Connect 1.8.4 focuses on helping Python users by including support for a full suite of interactive application types. Support for publishing Dash applications is now generally available, and this release introduces new Beta offerings for Bokeh and Streamlit application deployment.

See RStudio Connect in Action

Interactive Python Applications

Get started with the RStudio Connect Jump Start

For a hands-on approach to learning about Python content in RStudio Connect, try exploring the Jump Start Examples. This resource contains lightweight data science project examples built with various R and Python frameworks. The Jump Start Examples appear in the RStudio Connect dashboard when you first log in. You can download each project, run it locally, and follow the provided instructions to publish it back to the RStudio Connect server; or you could simply browse the examples and deployment steps to get a sense for how you might publish your own project.

Python Jump Start Screenshot

Develop and deploy Python applications from your favorite Python IDE

New users often ask, Do I have to develop Python applications in the RStudio IDE in order to publish them in RStudio Connect? The answer is no! You do not need to touch the RStudio IDE for Python content development or publishing.

Publishing Python applications to RStudio Connect requires the rsconnect-python package. This package is available to install with pip from PyPI and enables a command-line interface that can be used to publish from any Python IDE including PyCharm, VS Code, JupyterLab, Spyder, and others.

Once you have the rsconnect-python package, the only additional information you need to supply is the RStudio Connect server address and a publisher API key.

The application shown here is the Stock Pricing Dashboard built with Dash and available for download from the Jump Start Examples available in RStudio Connect. The example comes packaged with everything needed to run it locally in the Python IDE of your choice. When you’re ready to try publishing, the Jump Start will guide you through that process, including all the required commands from rsconnect-python.

GIF of the Dash Jump Start example in RStudio Connect

Streamlit and Bokeh Applications

Data scientists who develop Streamlit or Bokeh applications can also use the rsconnect-python package to publish to RStudio Connect. If you’ve previously used the rsconnect-python package for other types of Python content deployment, make sure you upgrade to the latest version before attempting to use the Beta features with RStudio Connect 1.8.4.

This release ships with an example for Streamlit, located in the Jump Start Examples Python tab. For Bokeh, we recommend starting with examples from the Bokeh App Gallery. Source code for the Bokeh gallery applications is available from the Bokeh GitHub repository.

Visit the User Guide to learn more about our beta support for Streamlit and Bokeh:

What does “Beta” Mean?

Bokeh and Streamlit app deployment are beta features. This means they are still undergoing final testing before official release. Should you encounter any bugs, glitches, lack of functionality or other problems, please let us know so we can improve before public release.

Learn how data science teams use RStudio products: visit R & Python – A Love Story.

New & Notable

Scheduling Across Time Zones

A new time zone option for scheduled reports can be used to prevent schedules from breaking during daylight savings time changes. Publishers can now configure a report to run in a specific time zone by modifying the settings available in the Schedule panel.

Content Usage Tracking

In previous versions of RStudio Connect, content usage was only available for static and rendered content, as well as Shiny applications. With this release, Python content and Plumber API usage data is available via the Instrumentation API. Learn more about tracking content usage on RStudio Connect in the Server API Cookbook.

Screenshot of the Content Usage Info Panel in RStudio Connect

Authentication Changes

OpenID Connect

RStudio Connect 1.8.4 introduces OpenID Connect as an authentication provider for single sign-on (SSO). This new functionality is built on top of the existing support for OAuth2, which was previously limited to Google authentication. For backwards compatibility, Google is the default configuration, so no action is necessary for existing installations. See the OAuth2 section of the Admin Guide for details.

Automatic User Role Mapping

RStudio Connect now supports assigning user roles from authentication systems that support remote groups. Roles can be assigned in a custom attribute or automatically mapped from an attribute or group name. See Automatic User Role Mapping for more details.

Custom Login & Logout for Proxied Authentication

Proxied authentication now supports more customizable login and logout flows with the settings ProxyAuth.LoginURL and ProxyAuth.LogoutURL. See the Proxied Authentication section of the Admin Guide for details.

Try the free 45 day evaluation of RStudio Connect 1.8.4

Deprecations & Breaking Changes

  • Breaking Change SSLv3 is no longer supported, since it is considered cryptographically broken.
  • Deprecation The setting Python.LibraryCheckIsFatal has been deprecated. Python library version checks are now non-fatal and result in a warning in the RStudio Connect log at startup.

Please review the full release notes.

Upgrade Planning

For RStudio Connect installations that make use of Python, note that the latest version of the virtualenv package (version 20) is now supported. This is a reversal of the previous RStudio Connect 1.8.2 requirement on virtualenv. This release also provides support for Ubuntu 20.04 LTS.

To perform an upgrade, download and run the installation script. The script installs a new version of RStudio Connect on top of the earlier one, and existing configuration settings are respected.

# Download the installation script
curl -Lo rsc-installer.sh https://cdn.rstudio.com/connect/installer/installer-v1.1.0.sh

# Run the installation script
sudo bash ./rsc-installer.sh 1.8.4-11

Click through to learn more about RStudio Connect


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Setting up VS Code for Python Development like RStudio

[This article was first published on stevenmortimer.com, and kindly contributed to R-bloggers].

In this article I will highlight the features of VS Code that match RStudio exactly, such as the “interactive notebook window” (called the Console in R) or the “variable explorer” (like running View() on a data frame in RStudio). At the bottom of this post I will provide two JSON files (settings.json and keybindings.json) and a block of code to install from the command line a list of extensions that I recommend. By using these files as a guide you can configure your VS Code installation to do a pretty good job at mimicking RStudio.

First, why try to write Python like you write R code in RStudio?

RStudio is a great all-around IDE for data analysis. A few years ago I was transitioning from writing a lot of R code to more Python code at work. I initially chose PyCharm as my Python IDE for a variety of reasons outlined in another blog post of mine: An R User Chooses a Python IDE. However, as of last summer (June 2019), I switched to VS Code as a Python IDE and never looked back. In hindsight, PyCharm just seems too clunky, with an over-engineered GUI of buttons to click without really being sure what’s going on. VS Code is making great strides towards becoming an IDE that works well for the REPL (read–eval–print loop) style of coding that RStudio excels at supporting. I love how lightweight VS Code feels and how the configurations are portable via JSON files, making it easier to share a common config with team members. I’ll keep writing R code in RStudio, but VS Code is quickly becoming a second home for me to write Python code.

How to Configure VS Code to run like RStudio

Aesthetics (Textmate theme and margin column ruler)

You may know that the default theme in RStudio is “Textmate”. This is found in Tools -> Global Options -> Appearance. To get the color scheme in VS Code you can install this with the Textmate theme extension.

I tend to not change much beyond the default RStudio installation just so that I don’t need to maintain a unique and unfamiliar setup compared to other colleagues who are also using RStudio. The only customization I recommend is adding a vertical line at 80 characters. This is a must-have feature to keep your code readable and you can do the same with VS Code by specifying "editor.rulers" in your user settings. More detail on how to configure that is in the section of this post entitled Settings JSON File.

Running Code

As far as running code in RStudio, it is fairly common to write code in the “Source” pane (normally above the console), then send the code to the “Console” pane to run using CMD+ENTER (CTRL+ENTER if on Windows – please assume anywhere I refer to CMD in this article it is CTRL if you use Windows). In VS Code you can think of the Editor pane as having the exact same purpose (writing a script), but instead of sourcing lines to the “Console” you use the same command (CMD+ENTER) to run the code in the Python Interactive Window. More specifically, the keyboard shortcut you need to set in VS Code is for the command "python.datascience.execSelectionInteractive".

This is a game changer when writing Python code for analysis because you no longer need to code in a Jupyter Notebook to execute your analysis. Simply write your code in a .py file and press CMD+ENTER to execute line-by-line in the Python Interactive Window. Repeat this process as you run code, explore, and build out your analysis. Note that you can also type Python directly into the Interactive Window just like you can type directly in R’s Console as well to execute code.

Variable Explorer (“Environment” tab in R)

Notebooks always seem clunky in terms of executing single lines of code and reviewing variables in the global environment, but VS Code has a variable explorer just like the “Environment” tab in R.

Data Viewer (Like running View() in R)

Once you’ve created a pandas DataFrame you can explore a dynamic view of the data frame with basic filtering and sorting. I have found this to be exactly like the View() function in RStudio. The only difference is it does seem to struggle a bit when viewing data frames over 1 million rows. It is laggy or crashes VS Code, but I don’t find that too much of a problem because you can always save a sample of your dataset into a variable or aggregate your data prior to viewing.

Plot Viewer (“Plots” tab in R)

You can flip through prior plots, save them, resize, and interact with plots the same way as you would in the “Plots” pane of RStudio. Nothing different here.

Version Control (“Git” tab in R)

Just like RStudio you can stage files, write a commit message, and push to a git repository. Instead of ticking a checkbox to stage files, you will have to press the + sign, which is about the only difference. Everything else is similar as far as clicking on the files to see “diffs”, writing the commit message, and pushing code to a remote repository.

All of these git features can be made even better with Eric Amodio’s GitLens extension, which I highly recommend. It makes it easier to navigate branches, see when files were changed and by whom.

What’s missing or different in VS Code?

R’s Document Outline (aka Minimap in VS Code)

RStudio creates a nice outline of your scripts based on your functions and comments that start with:

# A new section ----
code here...

This table of contents style view is helpful. As a somewhat suitable alternative I have become accustomed to annotating Python code blocks with region folding syntax which creates collapsible code sections in the VS Code editor.

# region ----
code here...
# endregion

Debugging

RStudio has a nice feature that will ask you if you want to run a piece of code with debug mode on if it initially errors. VS Code’s Python Interactive Window also prints out error messages, but will not let you debug with the click of a button. You can however set breakpoints and run your script in debug mode, which is a familiar experience, just not as seamless as RStudio.

There are obviously many other features, like having a built-in terminal, remote connections, app development via Shiny, etc. that make R/RStudio and Python/VS Code different tools, but if you’re just running analysis the two can provide very similar workflows, right down to the keyboard shortcuts. The only challenge you may have left is figuring out how to make your pandas code as legible and well organized as dplyr code… but that is for another day. (Hint: It’s possible with a strong commitment to method chaining).

Setup Files and Scripts

Settings JSON File

In order to update your settings.json file, open the Command Palette with CMD+SHIFT+P and select "Preferences: Open Settings (JSON)" to edit the JSON file where your settings are held. You can copy/paste the entire block of JSON below or just individual lines.

Note: These settings files were automatically generated from my VS Code installation using Shan Khan’s Settings Sync extension.

settings.json (User Settings, as opposed to workspace/project-specific settings)

{
    "window.zoomLevel": 1,
    "explorer.confirmDelete": false,
    "explorer.confirmDragAndDrop": false,
    "files.associations": {},
    "files.autoSave": "off",
    "files.exclude": {
        "**/.git": true,
        "**/.svn": true,
        "**/.hg": true,
        "**/CVS": true,
        "**/.DS_Store": true,
        "**/.history/**": true,
        "**/History_**": true
    },
    "workbench.colorTheme": "textmate",
    "workbench.editor.enablePreview": false,
    "workbench.startupEditor": "welcomePageInEmptyWorkbench",
    "terminal.integrated.fontSize": 14,
    "terminal.integrated.cursorStyle": "line",
    "terminal.integrated.copyOnSelection": true,
    "terminal.integrated.confirmOnExit": false,
    "editor.largeFileOptimizations": false,
    "editor.suggest.shareSuggestSelections": true,
    "editor.suggestSelection": "first",
    "editor.minimap.enabled": false,
    "editor.foldingStrategy": "indentation",
    "editor.showFoldingControls": "always",
    "editor.rulers": [
        80
    ],
    "editor.formatOnSave": true,
    "python.linting.enabled": true,
    "python.linting.flake8Enabled": true,
    "python.linting.flake8Args": [
        "--ignore=E203, E266, E501, W503",
        "--max-line-length=79",
        "--select=B,C,E,F,W,T4,B9",
        "--max-complexity=18"
    ],
    "python.formatting.provider": "autopep8",
    "python.formatting.autopep8Args": [
        "--ignore=E501,E402"
    ],
    "sonarlint.rules": {
        "python:S3776": {
            "level": "off"
        }
    },
    "python.languageServer": "Microsoft",
    "python.dataScience.askForKernelRestart": false,
    "python.dataScience.runStartupCommands": "%load_ext autoreload\\n%autoreload 2",
    "python.dataScience.sendSelectionToInteractiveWindow": true,
    "python.dataScience.useNotebookEditor": false,
    "git.autofetch": true,
    "git.confirmSync": false,
    "gitlens.views.repositories.location": "gitlens",
    "gitlens.views.repositories.files.layout": "list",
    "gitlens.views.fileHistory.location": "gitlens",
    "gitlens.views.lineHistory.enabled": false,
    "gitlens.views.compare.location": "gitlens",
    "gitlens.views.search.location": "gitlens",
    "gitlens.mode.statusBar.enabled": true,
    "gitlens.mode.statusBar.alignment": "left",
    "gitlens.currentLine.enabled": false,
    "gitlens.hovers.enabled": false,
    "vsintellicode.modify.editor.suggestSelection": "automaticallyOverrodeDefaultValue",
    "tabnine.experimentalAutoImports": true,
    "indentRainbow.errorColor": "rgba(255,255,255,0.0)",
    "todo-tree.tree.flat": true,
    "todo-tree.highlights.defaultHighlight": {
        "type": "text-and-comment",
        "foreground": "grey"
    },
    "better-comments.tags": [
        {
            "tag": "?",
            "color": "#8f5785",
            "strikethrough": false,
            "backgroundColor": "transparent"
        },
        {
            "tag": "*",
            "color": "#69a800",
            "strikethrough": false,
            "backgroundColor": "transparent"
        }
    ],
    "bracket-pair-colorizer-2.colors": [
        "#992e24",
        "#ffb997",
        "#263859"
    ],
    "cSpell.userWords": [
        "YYYYMMDD",
        "groupby",
        "idxmax",
        "inplace",
        "itertools",
        "multindex",
        "rfind",
        "strptime"
    ]
}

Keyboard Shortcuts JSON File

Updating the keyboard shortcuts JSON file is similar to editing settings.json. Open the Command Palette with CMD+SHIFT+P and select "Preferences: Open Keyboard Shortcuts (JSON)" to edit the JSON file where your settings are held. You can copy/paste the entire block of JSON below or just individual lines.

In addition to setting the keyboard shortcut CMD+ENTER to execute lines in the Interactive Window, I have set two other keyboard shortcuts to work exactly like RStudio. For example, clearing the console and restarting the R session:

  • CTRL+L – “Clear Console” (in RStudio) => Clear Cells (in Python Interactive Window)
  • CMD+SHIFT+F10 – “Restart R” (in RStudio) => Restart Kernel (in Python Interactive Window)

Of course there are other shortcuts that you can configure like R’s block comment command (Code -> Comment/Uncomment Lines) (CMD+SHIFT+C). Simply set the VS Code command "editor.action.commentLine" to that shortcut.

Another example is R’s command (Code -> Reflow Comment) which hard wraps code to 80 characters using CMD+SHIFT+/. VS Code can do the same after installing the extension called “rewrap” and then giving its command "rewrap.rewrapComment" the same keyboard shortcut. It is really up to you to configure the shortcuts that you use most often.

keybindings.json

// Place your key bindings in this file to override the defaults
[
    {
        "key": "cmd+enter",
        "command": "python.datascience.execSelectionInteractive",
        "when": "editorTextFocus && python.datascience.featureenabled && python.datascience.ownsSelection && !findInputFocussed && !notebookEditorFocused && !replaceInputFocussed && editorLangId == 'python'"
    },
    {
        "key": "shift+enter",
        "command": "-python.datascience.execSelectionInteractive",
        "when": "editorTextFocus && python.datascience.featureenabled && python.datascience.ownsSelection && !findInputFocussed && !notebookEditorFocused && !replaceInputFocussed && editorLangId == 'python'"
    },
    {
        "key": "cmd+enter",
        "command": "python.datascience.runcurrentcell",
        "when": "editorTextFocus && python.datascience.featureenabled && python.datascience.hascodecells && !editorHasSelection && !notebookEditorFocused"
    },
    {
        "key": "ctrl+enter",
        "command": "-python.datascience.runcurrentcell",
        "when": "editorTextFocus && python.datascience.featureenabled && python.datascience.hascodecells && !editorHasSelection && !notebookEditorFocused"
    },
    {
        "key": "cmd+enter",
        "command": "notebook.cell.executeAndSelectBelow",
        "when": "notebookEditorFocused && activeEditor == 'workbench.editor.notebook'"
    },
    {
        "key": "shift+enter",
        "command": "-notebook.cell.executeAndSelectBelow",
        "when": "notebookEditorFocused && activeEditor == 'workbench.editor.notebook'"
    },
    {
        "key": "ctrl+l",
        "command": "python.datascience.removeallcells",
        "when": "python.datascience.featureenabled && !terminalFocus"
    },
    {
        "key": "ctrl+l",
        "command": "workbench.action.terminal.clear",
        "when": "terminalFocus"
    },
    {
        "key": "cmd+shift+f10",
        "command": "python.datascience.restartkernel"
    },
    {
        "key": "cmd+shift+c",
        "command": "editor.action.commentLine",
        "when": "editorTextFocus && !editorReadonly"
    },
    {
        "key": "cmd+/",
        "command": "-editor.action.commentLine",
        "when": "editorTextFocus && !editorReadonly"
    },
    {
        "key": "cmd+i",
        "command": "python.datascience.showhistorypane"
    },
    {
        "key": "ctrl+shift+/",
        "command": "rewrap.rewrapComment",
        "when": "editorTextFocus"
    },
    {
        "key": "alt+q",
        "command": "-rewrap.rewrapComment",
        "when": "editorTextFocus"
    },
    {
        "key": "ctrl+shift+s",
        "command": "extension.updateSettings"
    },
    {
        "key": "shift+alt+u",
        "command": "-extension.updateSettings"
    }
]

Extensions Installation Script

You can run each of the lines below in your Terminal or other Command Prompt after installing VS Code in order to install the extensions. Or you can search for them in the VS Code Marketplace (more detail is available here: https://code.visualstudio.com/docs/editor/extension-gallery#_browse-for-extensions)

# editor visual aids
code --install-extension CoenraadS.bracket-pair-colorizer-2
code --install-extension oderwat.indent-rainbow
code --install-extension gerane.Theme-textmate

# development aids
code --install-extension eamodio.gitlens
code --install-extension aaron-bond.better-comments
code --install-extension Gruntfuggly.todo-tree
code --install-extension mikestead.dotenv
code --install-extension streetsidesoftware.code-spell-checker

# code formatting
code --install-extension ms-python.python
code --install-extension VisualStudioExptTeam.vscodeintellicode
code --install-extension TabNine.tabnine-vscode
code --install-extension christian-kohler.path-intellisense
code --install-extension stkb.rewrap
code --install-extension SonarSource.sonarlint-vscode

# file viewer tools
code --install-extension DotJoshJohnson.xml
code --install-extension jithurjacob.nbpreviewer
code --install-extension GrapeCity.gc-excelviewer

# remote development tools
code --install-extension ms-azuretools.vscode-docker
code --install-extension ms-vscode-remote.remote-containers
code --install-extension ms-vscode-remote.remote-ssh
code --install-extension ms-vscode-remote.remote-ssh-edit
code --install-extension ms-vscode-remote.remote-wsl
code --install-extension ms-vscode-remote.vscode-remote-extensionpack

# tool to save your settings as a gist
code --install-extension Shan.code-settings-sync

To leave a comment for the author, please follow the link and comment on their blog: stevenmortimer.com.


Interoperability: Getting the Most Out of Your Analytic Investments

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

Sparklers at night

_Photo by Federico Beccari on Unsplash_

The Challenges of Complexity and Underutilization

Organizations typically have multiple different environments and frameworks to support their analytic work, with each tool providing specialized capabilities or serving different audiences. These usually include:

  • Spreadsheets created in Excel or Google Sheets,
  • Data science tools including R, Python, SPSS, SAS, and many others,
  • BI tools such as Tableau or PowerBI,
  • Data storage and management frameworks including databases and Spark clusters,
  • Job management clusters such as Kubernetes and Slurm.

For example, in our most recent R Community Survey, we asked what tools and languages respondents used besides R. The results shown in Figure 1 illustrate the wide variety of tools that may be present in an organization.

Tools Chart

Figure 1: Respondents we surveyed use a wide variety of tools in addition to R.

These tools and frameworks provide flexibility and power but can also have two unpleasant, unintended consequences: productivity-undermining complexity for various stakeholders and underutilization of expensive analytic frameworks.

The stakeholders in the organization experience these consequences because:

  • Data scientists require multiple environments to get their work done. If data scientists have to leave their native tools to access other things they need such as a Spark cluster or a database, they have to switch contexts and remember how to use systems they might only rarely touch. Often, this means they won’t fully exploit the data and other resources available, or they waste time learning and relearning various systems, APIs, languages, and interfaces.
  • Data science leaders worry about productivity. When their teams struggle in this way, these leaders worry that their teams aren’t delivering the full value that they could. This inefficiency can make it more difficult to defend budgets and hire additional team members when needed. These leaders may also face criticism from other departments demanding to know why the data science team isn’t fully utilizing expensive BI deployments or powerful computing resources.
  • IT spends time and money supporting underutilized resources. Analytic infrastructures such as Spark or Kubernetes require considerable resources to set up and maintain. If these resources are being underutilized, IT will question their lack of ROI and whether they should continue to maintain them. These questions can lead to uncomfortable internal friction between departments, particularly depending on who requested the investments in the first place and what expectations were used to justify them.

Platforms diagram

Figure 2: Interoperability is a key strength of the R ecosystem.

Teams Need Interoperable Tools

Interoperable systems that give a data scientist direct access to different platforms from their native tools can help address these challenges. Everyone benefits from this approach because:

  • Data scientists keep working in their preferred environment. Rather than constantly switching between different tools and IDEs and interrupting their flow, data scientists can continue to work in the tools and languages they prefer. This makes the data scientist more productive and reduces the need to keep switching contexts.
  • Data science leaders get more productivity from their teams. When teams are more productive, they deliver more value to their organization. Delivered value helps them justify more training, tools, and team members. Easier collaboration and reuse of each other’s work further increases productivity. For example, if a data scientist who prefers R can easily call the Python script developed by a colleague from their preferred language, they avoid reimplementing the same work twice.
  • Teams make better use of IT resources. Since it is easier for data scientists to use the frameworks and other infrastructure IT has put in place, they use those resources more consistently. This higher utilization helps the organization achieve the expected ROI from these analytic investments.

Encouraging Interoperability

Interoperability is a mindset more than technology. You can encourage interoperability throughout your data science team with four initiatives:

  1. Embrace open source software. One of the advantages of open source software is the wide community providing specialized packages to connect to data sources, modeling frameworks, and other resources. If you need to connect to something, there is an excellent chance someone in the community has already built a solution. For example, as shown in Figure 2, the R ecosystem already provides interoperability with many different environments.
  2. Make the data natively accessible. Good data science needs access to good up-to-date data. Direct access to data in the data scientist’s preferred tool, instead of requiring the data scientist to use specialized software, helps the data scientist be more productive and makes it easier to automate a data pipeline as part of a data product. Extensive resources exist to help, whether your data is in databases, Spark clusters, or elsewhere.
  3. Provide connections to other data science or ML tools. Every data scientist has a preferred language or tool, and every data science tool has its unique strengths. By providing easy connections to other tools, you expand the reach of your team and make it easier to collaborate and benefit from the work of others. For example, the reticulate package allows an R user to call Python in a variety of ways (see the short sketch after this list), and the tensorflow package provides an interface to large-scale TensorFlow machine learning applications.
  4. Make your compute environments natively accessible. Most data scientists aren’t familiar with job management clusters such as Kubernetes and Slurm and often struggle to use them. By making these environments available directly from their native tools, your data scientists are far more likely to use them. For example, RStudio Server Pro allows a data scientist to run a script on a Kubernetes or Slurm cluster directly from within their familiar IDE.
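As a minimal illustration of point 3 above (assuming the reticulate package and a Python installation with NumPy are available), calling a Python library directly from R looks like this:

library(reticulate)
np <- import("numpy")     # load a Python module from R
np$mean(c(1, 2, 3, 4))    # hand an R vector to Python; the result comes back to R
## [1] 2.5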

Eric Nantz, a Research Scientist at Eli Lilly and Company, spoke at rstudio::conf 2020 about the importance of interoperability in R.

Learn more about Interoperability

In future posts, we will expand on this idea of Interoperability, with a particular focus on teams using R and Python, and how open source data science can complement BI tools.

If you’d like to learn more about Interoperability, we recommend these resources:

  • In this recent blog post, we introduced the idea of interoperability with an example of calling multiple different languages from the RStudio IDE.
  • In this recent customer spotlight, Paul Ditterline, Manager Data Science at Brown-Forman, describes how RStudio products helped their data science team “turn into application developers and data engineers without learning any new languages or computer science skills.”
  • This article describes how RStudio products integrate with many different frameworks, including databases, Spark, Kubernetes, Slurm, Git, etc.
  • R and Python, a Love Story shows how RStudio products helped bilingual data science teams collaborate more productively, and have a greater impact on their organization.
  • At rstudio::conf 2020, George Kastrinakis from Financial Times presented a case study on building a new data science pipeline, using R and RStudio Connect.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.



Community Captioning of rOpenSci Community Calls

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

Webinars and community calls are a great way to gather many people to discuss a specific topic, without the logistic hurdles of in-person events. But whether online or in-person, to reach the broadest audience, all events should work towards greater accessibility. In particular, it is difficult for people who are deaf or hard of hearing to follow the conversation because of low quality video hindering lip reading, or for non-native speakers because of low quality sound.

When the calls are recorded, as is the case for rOpenSci Community Calls, it is possible to rewind and replay, which may help but is not always sufficient. A better solution, as suggested on the GitHub issue tracker for community calls, would be to provide subtitles. In this blog post we want to provide feedback on our experience subtitling one community call on R package maintenance. We present the tools we used, the lessons we learned, and lay out a possible workflow for future video captioning. We here refer to closed captioning rather than subtitling. Both consist of adding text to a video, but they differ in their end goal. Subtitles only transcribe dialogue, while closed captioning also includes the transcription of sound effects, musical cues, and other conversational cues such as the speakers' names.

Screenshot of the video with closed captions

The Community Call video with captions enabled on Vimeo

One of us already had some experience captioning short videos on YouTube so we volunteered to try and add subtitles to the next community call. Because we think it’s important to work as much as possible with free and open source tools, our choice landed on Amara. Amara is a popular platform for community captioning used by other prolific video producers, such as TED talks.

Some things were easier than expected but some were harder

The good thing about modern technology

If you’ve never tried captioning a video, you may think that the hardest parts are writing the transcript and syncing the video and audio. Fortunately, we have pretty good tools for this nowadays. To ease the subtitling process, we didn’t start from a blank slate: thanks to the fact that all rOpenSci community calls are recorded, Stefanie Butland provided us with the raw transcript (in VTT format) automatically generated from the Zoom call recording.

Content of a VTT file

Screenshot of a sample VTT file. It contains both text and corresponding timestamps

That VTT file not only contains a transcript of all the audio, but also some timestamps to synchronize the transcript with the video. As expected, technical and field specific terms were often wrong (the most common issue being ‘R’ transcribed as ‘our’, and rOpenSci as ‘Our open sigh’). This seems like an unavoidable issue that happens no matter the tools you use. For example, YouTube gets even more creative with R slang:

In Youtube subtitles CRAN is crime 😂

— Maëlle Salmon (@ma_salmon) May 10, 2020

Unexpected tasks

But even after downloading the raw transcript, there is still a huge amount of work left. You have to balance subtitles, making sure each line is short enough to display even on small screens and that each frame doesn't have more than two lines. Additionally, and as we detail later in this blog post, it can be helpful to remove excessive discourse markers, such as 'uhm', 'like', 'so' or stutters.

The Community Call was a roundtable involving 5 participants in a lively unscripted discussion that lasted 55 minutes. This was very different from our previous experiences, since until now we had only dealt with very short, 100% scripted YouTube documentaries, such as the Kurzgesagt channel.

A diverse community of speakers and captioners

Different speakers

Subtitling is a very interesting exercise because it forces you to focus very hard on what people say. And very quickly, you notice that different people have different speech styles. Of course, you also notice everybody’s verbal tics. Even though it might be good to know in order to correct it, we don’t necessarily recommend you try it on your own videos where you’re speaking because it can be unnerving, especially if you’re already self-conscious. One difficulty was to remove some orality markers in the subtitles while consistently respecting the styles of different speakers. We didn’t want to go overboard and end up with a transcript that differed too much from the spontaneous discussion that actually took place.

Another interesting difference between speakers is that different people pause at different moments. In this kind of informal discussion, you have to take some time to think, and some people pause to think mid-sentence while others pause in between sentences. This was probably made worse by the online nature of the discussion, since silence in video discussions can be very awkward for both the speaker and the listeners, as explained in the amazing RStudio webinar by Greg Wilson about 'Teaching Online at Short Notice'. Speakers may tend to 'fix' these silences by adding more discourse markers or by rushing to start a new sentence.

Different captioners

Unexpectedly, this difference between speakers uncovered differences in the way we chose to break the captions. Matthias chose to break the captions based on the auditory context: add breaks where the speaker makes a pause, which sometimes resulted in caption breaks mid-sentence as explained above. Alternatively, Hugo chose to break the captions based on the grammatical context: add breaks where the pause should be, which sometimes resulted in mismatches with the audio when speakers made pauses to think.

It was also a good reminder that even though we speak English quite fluently, we sometimes don’t understand everything but we usually don’t even notice it. We attended this community call live and managed to follow everything with ease but when looking at it one word at a time, we realised we missed some words here and there, especially when several participants interacted quickly.

The difficulty of collaborative captioning

We’re used to working together and have already collaborated on multiple projects. We even wrote a post on this very blog about a package we submitted to rOpenSci software review. But this didn’t help us find an efficient collaborative workflow for captioning. As mentioned earlier, one issue is that we had different captioning styles, and only realised it late in the project.

Even if Amara’s subtitle editor is an amazing tool that allows you to easily pause, rewind, and advance the video while captioning, it does not support simultaneous editing of the subtitles. Compared to taking collaborative notes through EtherPad, this slows down the process quite a lot. Amara however allow for several people to edit the captions successively. Other communities, such as TED videos captioning community, disable entirely collaborative captioning. When someone starts working on a video, it disappears from the list of available tasks and they work on it alone.

From our experience, working as a team to caption a video cuts down the work needed from each individual member. Simultaneous collaborative captioning would be helpful to edit different parts of the captions. It would allow you to focus on different independent tasks when captioning: roughly place captions in time, correct for vocabulary/transcription issues, add speakers' names when necessary, etc. One important lesson for successful collaborative captioning would be to define beforehand the captioning style used throughout the captions, as well as to define a list of tasks that can be split among the different contributors.

About Amara

Nice features

Screenshot showing the interface of Amara.org

The interface of Amara is divided in different sections, each focusing on distinct aspects of captioning. The video is on the top. The middle part shows a timeline with the duration of each caption. The bottom part shows the text and the caption editor as well as a conversation window on the bottom right.

Amara.org has a set of useful features that increase your efficiency a bit. The most notable ones in our opinion are:

  • keyboard shortcuts such as ‘Tab’ to play/pause the video, ‘Shift+Tab’ to go back in the video, ‘Ctrl’ to create a new subtitle, and ‘Shift+Ctrl’ to add a line break in a subtitle;

  • automatic deletion of leading whitespaces;

  • warnings regarding caption length, number of lines used, and caption reading speed (longer captions have to stay on screen for longer time so that viewers can read them!);

Warning window from Amara.org on captions

If a caption does not follow what are considered best practices in captioning, it gets flagged with an exclamation mark. Here the first line of the caption exceeds the recommendation of 42 characters per line.

  • full versioning of the subtitles, we can still go back through all the subtitle versions we edited.

Missing features

There are other features we could not find (they might exist but we missed them?):

  • an option to quickly merge subtitles. Very often, the subtitles are not split at the right place and you want to merge them to cut at the right time. A keyboard shortcut would be super useful here. It’s even more annoying because when you try to select one subtitle to copy/paste it, Amara removes your selection.

  • by default, subtitles start/stop exactly at the same time as the audio. Maybe it’s not the recommended practice but we found it more comfortable to add some buffer (even a split second) to give smoother flow, and leave a bit more time for the reader.

Proposed optimised workflow

In the long term, it is always worth it to provide captions, as it increases the accessibility of the videos. However, in the short term, the amount of time required to caption even a single video puts too much burden on volunteers. Perhaps with more collaborative tools and an optimized workflow, it would be possible to caption videos more quickly. One idea to increase captioning of many videos would be to organize "Captioning Sprints" where volunteers gather (online or in person) and split the captioning work into small workable chunks.

Nonetheless, in case someone would like to try it next, we propose an optimised workflow, that may help reduce the time you spend:

  1. Use the transcript provided by Zoom/YouTube.

  2. Correct the initial raw transcript for typos and mistakes. In particular, pay attention to technical and field-specific terms (R, package names, URLs). A minimal example of such batch corrections is sketched after this list.

  3. Make sure the breaks happen at the right place. If you opt for breaks based on the grammatical context, you don't need access to the video for this.

    These two steps can be done in your favorite text editor, thereby unlocking a much more efficient workflow. For example, if you use Vim (or RStudio with vim keybindings), then the ‘merge line’ operation that was difficult in Amara is just a simple keystroke (J). You can also add a visual hint to make sure you respect the character width limit (follow these steps for RStudio for example).

  4. Upload the corrected transcript on Amara.

  5. Sync the subtitles with the audio. Because the breaks are already placed at the right time, this should be quick using Amara and its keyboard shortcuts, and drag and drop feature.

  6. Download the final transcript VTT file from Amara, upload it to the video on Vimeo, and enable Closed Captions to make them visible.
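As an illustration of step 2, here is a minimal R sketch for batch-correcting recurring mis-transcriptions in a raw VTT file. The file name and the replacement pairs are made-up examples; you would adapt them to the terms that come up in your own call, and review every change.

# Sketch: fix recurring mis-transcriptions in a raw transcript
# "call.vtt" and the replacement pairs are hypothetical examples
lines <- readLines("call.vtt", encoding = "UTF-8")

fixes <- c(
  "Our open sigh" = "rOpenSci",
  "\\bour\\b"     = "R"   # crude: review each replacement before keeping it
)

for (pattern in names(fixes)) {
  lines <- gsub(pattern, fixes[[pattern]], lines, ignore.case = TRUE)
}

writeLines(lines, "call_corrected.vtt")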

A time-consuming but gratifying experience

Given the complexity of the task (almost our first subtitling experience, our first use of Amara, a community call with many different speakers), producing good enough subtitles took us quite some time. We (Stefanie, Matthias, and Hugo) spent a total of around 20 hours of work to edit the subtitles for this 1-hour community call. It may seem like a lot, but it was our first time doing this and our first time using these tools, on a complex video with many different speakers, several interruptions, and quite open-ended discussion. Now that we are more familiar with the process and the tools, we should be able to work faster. And with a simpler video, such as a regular community call where speakers spend more time presenting their own work, without talking at the same time, it should take less time.

We learned a lot during the process and enjoyed the roundtable even more (in all its details), learning about subtitling best practices and asking questions about the best way to transcribe some oral expressions. It was a challenging but interesting exercise. Even though we may need a better workflow, we think it is worth our collective time to provide subtitles to our videos, both for accessibility and to broaden and diversify the R community audience. We could also think about accessibility issues at a more global scale and spark discussion through the R Consortium Diversity & Inclusion Working Group. And why not think about live-captioning for your future event to make it more accessible?

Resources

When working on the subtitles we stumbled upon many useful resources, ranging from quick tips and best practices to full guides describing the process of same-language subtitling for videos. They helped us ease our process, and we hope they will help you as well if you want to join the subtitling adventure:



The Monty Hall Problem


[This article was first published on R – The Research Kitchen, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I just finished the book "The Monty Hall Problem" by Jason Rosenhouse, which is an exploration of one of the most counterintuitive puzzles in probability. The entire book is devoted to the basic form of the problem and a number of variations of increasing complexity.

The basic outline of the problem is as follows. You are on a game show and are presented with three closed doors. Behind one of the doors is a prize (say a car) and behind the other two doors is nothing. You are asked to select a door. Once you have picked one of the three doors, the game show host (Monty) opens one of the other doors to reveal that it is empty. Your choice is then to either stick with the door you have initially selected or to change your selection to the other unopened door.

So for example, say you select door 1 below and Monty opens door 2. He then offers you the choice to switch to door 3 or stay with your original selection door 1. You have a choice of two unopened doors. Does it make any difference if you switch or stay?

Most people, myself included, will intuitively say no – there are two doors, and the prize has an equal probability of being behind either of the two remaining doors, so the probability is $\frac{1}{2}$ – switching or sticking makes no difference to the odds of winning the prize. However, this is not the case. Switching to the remaining unopened door will result in a win with a probability of $\frac{2}{3}$. This is a very surprising result and in order to understand why we need to look at the mechanics of the game.

We can prove this very quickly using a simulated approach. If we define the rules of the game as follows:

  • The prize can be behind any door with equal initial probability;
  • Monty will never open the door containing the prize;
  • When Monty has a choice of 2 (empty) doors to open, he will randomly choose between them.

Here is a simple R function that we can use to execute a single run of the game. It simulates the simple game given the rules above, and returns a string describing the winning strategy for that game. For example it will return “Stick” if the winning strategy for that game is to stay with the door you initially selected.

classic_monty <- function() {
  # Assign the prize
  prize <- sample(1:3, 1)
  # Pick a door
  choice <- sample(1:3, 1)
  # Monty opens one of the remaining empty doors; note that sample() on a
  # length-one vector would sample from 1:x, so guard against that case
  remaining <- (1:3)[-c(choice, prize)]
  monty <- if (length(remaining) == 1) remaining else sample(remaining, 1)
  return(ifelse(prize != choice, "Switch", "Stick"))
}

We can run the game a few times and see the result:

> classic_monty()
[1] "Switch"
> classic_monty()
[1] "Stick"
> classic_monty()
[1] "Switch"

To see the asymptotic win probability of switch or stick, we can replicate the experiment a number of times and record the outcome:

library(tidyverse)  # for %>% and add_row()

n <- 2^(1:16)
runs <- data.frame(n = numeric(), switch = numeric())
for (trials in n) {
  run <- table(replicate(trials, classic_monty()))
  runs <- runs %>% add_row(n = trials, switch = sum(run["Switch"]) / trials)
}
# Handle zero-occurrence trials
runs[is.na(runs)] <- 0

If we run this, then we can examine the win probability using the switch strategy as we increase the number of trials:

> runs
       n    switch
1      2 0.0000000
2      4 0.2500000
3      8 0.6250000
4     16 0.5000000
5     32 0.5937500
6     64 0.7031250
7    128 0.6640625
8    256 0.6835938
9    512 0.6484375
10  1024 0.6640625
11  2048 0.6748047
12  4096 0.6645508
13  8192 0.6702881
14 16384 0.6721191
15 32768 0.6636963
16 65536 0.6641541

There is a clear convergence to a $\frac{2}{3}$ win probability using the switch strategy. This is something I had to simulate to initially accept, as my intuitive expectation was a win probability of $\frac{1}{2}$.

In order to understand the mechanics behind this result a little more, we can generate the full space of possible outcomes $\left( C, M, P\right)$ for each trial, where $C$ is the door we choose, $M$ is the door Monty opens, and $P$ is the door that hides the prize. For example the tuple $\left( 1, 2, 3\right)$ would be the scenario where we pick door 1, Monty reveals door 2, and the prize sits behind door 3.

The code to generate the sample space is below:

# Generate sample space of tuples for Monty-Hall
space <- data.frame(choice = numeric(), monty = numeric(), prize = numeric())
for (prize in 1:3) {
  for (choice in 1:3) {
    for (monty in c(1:3)[-c(prize, choice)]) {
      space <- space %>% add_row(choice = choice, monty = monty, prize = prize)
    }
  }
}
space <- space %>% arrange(choice)

Which will generate a table as follows:

> head(space)
  choice monty prize
1      1     2     1
2      1     3     1
3      1     3     2
4      1     2     3
5      2     3     1
6      2     1     2

Let’s look at a sample play. Say we choose door 1. The sample space for possible plays where we choose door 1 is:

1      1     2     1
2      1     3     1
3      1     3     2
4      1     2     3

So if we stick with door 1 it looks like we have 2 outcomes out of 4, or a 50% chance of winning. However this is not the case, as the first two rows in the sample space above combined must have the same probability as each of the last two rows – door 1 has a fixed probability of $\frac{1}{3}$ of hiding the prize and that cannot change. So the first two outcomes above have a combined probability of $\frac{1}{3}$, which means that the win probability when switching is $\frac{2}{3}$.

Once we see the sample space enumerated like this, the reason for the switch probability seems obvious – due to the rules of the game outlined above, we have constrained the sample space. If the host follows the rules above and randomly chooses between two unopened doors when we have selected the door with the prize, but in all other cases will only open the door that is empty, we can see that the choice of door opened by Monty contains information – which is why his choice of door is important.

The Bayesian Approach

As an aside, the book contains a short conditional probability-based approach to the simple scenario above that I think is worth showing. If $C_n$ denotes that the prize is behind door $n$ and $M_m$ denotes that Monty opens door $m$, then assuming we choose door 1 as above and Monty then opens door 2 – we can ask the question what is the probability now that the prize is behind door 3 – i.e. $P(C_3|M_2)$?

This is $P(C_3|M_2) = \frac{P(M_2|C_3)P(C_3)}{P(M_2)}$

$=\frac{P(M_2|C_3)P(C_3)}{P(M_2|C_1)P(C_1)+P(M_2|C_2)P(C_2)+P(M_2|C_3)P(C_3)}$

Since each door is initially equally likely to contain the prize, we can infer that

$P(C_1)=P(C_2)=P(C_3)=\frac{1}{3}$

And as Monty will not open the door hiding the prize $P(M_2|C_2)=0$, and will be forced to open the only remaining door if we have chosen door 1 and the prize is behind door 3, then $P(M_2|C_3)=1$

This then simplifies to

$P(C_3|M_2) = \frac{1}{P(M_2|C_1)+1}$

We can figure out $P(M_2|C_1)$ using the simple rules of the game – if we have selected door 1 and the prize is behind door 1, Monty must choose randomly between the other two doors. So $P(M_2|C_1)=P(M_3|C_1)=\frac{1}{2}$.

Thus

$P(C_3|M_2) = \frac{2}{3}$
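To connect this back to the simulation, here is a small sketch (not from the book) that estimates $P(C_3|M_2)$ directly, by conditioning on the games where we pick door 1 and Monty opens door 2; the estimate should be close to $\frac{2}{3}$.

# Sketch: estimate P(C3 | M2) by simulation, with our choice fixed at door 1
set.seed(42)
n <- 1e5
prize  <- sample(1:3, n, replace = TRUE)
choice <- rep(1, n)

# Monty opens an empty, unchosen door (choosing at random when two are available)
monty <- mapply(function(p, c) {
  doors <- setdiff(1:3, c(p, c))
  if (length(doors) == 1) doors else sample(doors, 1)
}, prize, choice)

mean(prize[monty == 2] == 3)  # approximately 0.667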

Another nice approach. This is a nice problem to illustrate just how deceptive simple probability puzzles can be!



Why data visualization is important


[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Data visualization is not only important to communicate results but also a powerful technique for exploratory data analysis. Each plot type like scatter plots, line graphs, bar charts and histograms has its own purpose and can be leveraged in a powerful way using the ggplot2 package.

  • Understand the different roles of data visualization
  • Understand the different plot types available
  • Get an overview of the ggplot2 package.

Introduction to data visualization

A picture is worth a thousand words.

Data visualization is the quickest and most powerful technique to understand new and existing information. During an initial exploration phase data scientists try to reveal the underlying features of a dataset like different distributions, correlations or other visible patterns. This process is also called exploratory data analysis (EDA) and marks the starting point of each data science project.

The graphs produced during the EDA show the data scientist the directions of the journey ahead. Revealed patterns can inspire hypotheses about the underlying processes, features of the dataset to be extracted or modelling techniques to be tested. Last but not least, visualizations uncover outliers and data errors which the data scientist needs to take care of.

The biggest role for data visualization is the communication of data science findings to colleagues and customers through presentations, reports or dashboards. Effort used for EDA and visualizations is time well spent since results can be directly used to communicate findings.

Quiz: Visualization Phase

For which phases is data visualization important in the data science workflow?

  • Explorative Data Analysis (EDA).
  • Detection of outliers.
  • Communication of Results.


Available Plot Types

There are many plot types available which help to understand different features and relationships in the dataset.

During the exploratory data analysis phase we typically want to detect the most obvious patterns by looking at each variable in isolation or by detecting relationships of variables against others. The plot type used is also determined by the data type of the input variables, such as numeric or categorical.

Scatter Plots

Scatter plots are used to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x and y-axis.

Line Graphs

Line graphs are used to visualize the trajectory of one numeric variable against another, with the points connected through lines. They are well suited when values change continuously, like temperature over time.

Bar Charts and Histograms

Bar charts visualize numeric values grouped by categories. Each category is represented by one bar whose height is defined by its numeric value. Histograms are specific bar charts that summarize the number of occurrences of numeric values over a set of value ranges (or bins). They are typically used to determine the distribution of numeric values.
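As a quick illustration (not part of the original course text), the base R calls below draw a bar chart and a histogram from built-in datasets; the ggplot2 approach is introduced further down.

# Bar chart: counts of cars by number of cylinders
barplot(table(mtcars$cyl),
        xlab = "Cylinders", ylab = "Count",
        main = "Cars by number of cylinders")

# Histogram: distribution of Old Faithful eruption durations
hist(faithful$eruptions,
     breaks = 20,
     xlab = "Eruption duration (minutes)",
     main = "Distribution of eruption durations")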

Others

Other frequently used plot types in data science include:

  • Box plots: Show distributional information of numeric values grouped in categories as boxes. Great to quickly compare multiple distributions.
  • Violin plots: Same as box plots but show distributions as violins.
  • Heat Maps: Show interactions of variables – typically correlations – as rastered image highlighting areas of high interaction.
  • Network Graphs: Show connections between nodes

Quiz: Distribution Comparison Plots

Which plot types are typically used to compare distributions of numeric variables?

  • Box plots
  • Network graphs
  • Violin plots
  • Line Graphs


Introducing: ggplot2

Due to the importance of visualization for data science and statistics, R offers a rich set of tools and packages. The core R language already provides many plotting functions and plot types. These plotting functions require users to specify how to plot each element on the canvas step by step. By contrast, the ggplot2 package allows the specification of plots through a set of plotting layers, leaving it to the package to figure out the steps required to produce the graph.

Through its pre-defined set of geometric layers, facets and themes, ggplot2 enables users to create beautiful graphs in a very short time. ggplot2 is also the most widely adopted plotting library in the R community.
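To make the layered, declarative style concrete, here is a minimal sketch using the built-in mtcars dataset (an illustration, not an example from the course itself):

library(ggplot2)

# A plot is assembled from layers: data + aesthetic mapping + geoms + labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +    # scatter layer, colored by cylinders
  geom_smooth(method = "lm", se = FALSE) +  # add a linear trend layer
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")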

Quiz: ggplot2 Facts

Which statements about data visualization and ggplot2 are correct?

  • ggplot2 is the only way to create plots in R.
  • ggplot2 facilitates the creation of good looking graphs quickly.
  • ggplot2 requires users to specify the plotting commands in a step-by-step fashion.
  • ggplot2 enables users to specify plots in a declarative way.


Why data visualization is important is an excerpt from the course Introduction to R, which is available for free at quantargo.com




Introducing Tidygeocoder 1.0.0


[This article was first published on Jesse Cambon-R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Tidygeocoder v1.0.0 is now live on CRAN. There are numerous new features and improvements such as batch geocoding (submitting multiple addresses per query), returning full results from geocoder services (not just latitude and longitude), address component arguments (city, country, etc.), query customization, and reduced package dependencies.

For a full list of new features and improvements refer to the release page on Github. For usage examples you can reference the Getting Started vignette.

To demonstrate a few of the new capabilities of this package, I decided to make a map of the stadiums for the UEFA Champions League Round of 16 clubs. To start, I looked up the addresses for the stadiums and put them in a dataframe.

library(dplyr)
library(tidygeocoder)
library(ggplot2)
require(maps)
library(ggrepel)

# https://www.uefa.com/uefachampionsleague/clubs/
stadiums <- tibble::tribble(
  ~Club,                ~Street,                      ~City,        ~Country,
  "Barcelona",          "Camp Nou",                   "Barcelona",  "Spain",
  "Bayern Munich",      "Allianz Arena",              "Munich",     "Germany",
  "Chelsea",            "Stamford Bridge",            "London",     "UK",
  "Borussia Dortmund",  "Signal Iduna Park",          "Dortmund",   "Germany",
  "Juventus",           "Allianz Stadium",            "Turin",      "Italy",
  "Liverpool",          "Anfield",                    "Liverpool",  "UK",
  "Olympique Lyonnais", "Groupama Stadium",           "Lyon",       "France",
  "Man. City",          "Etihad Stadium",             "Manchester", "UK",
  "Napoli",             "San Paolo Stadium",          "Naples",     "Italy",
  "Real Madrid",        "Santiago Bernabéu Stadium",  "Madrid",     "Spain",
  "Tottenham",          "Tottenham Hotspur Stadium",  "London",     "UK",
  "Valencia",           "Av. de Suècia, s/n, 46010",  "Valencia",   "Spain",
  "Atalanta",           "Gewiss Stadium",             "Bergamo",    "Italy",
  "Atlético Madrid",    "Estadio Metropolitano",      "Madrid",     "Spain",
  "RB Leipzig",         "Red Bull Arena",             "Leipzig",    "Germany",
  "PSG",                "Le Parc des Princes",        "Paris",      "France"
)

To geocode these addresses, you can use the geocode function as shown below. New in v1.0.0, the street, city, and country arguments specify the address. The Nominatim (OSM) geocoder is selected with the method argument. Additionally, the full_results and custom_query arguments (also new in v1.0.0) are used to return the full geocoder results and set Nominatim’s “extratags” parameter which returns extra columns.

stadium_locations <- stadiums %>%
  geocode(
    street = Street, city = City, country = Country,
    method = 'osm', full_results = TRUE,
    custom_query = list(extratags = 1)
  )

This returns 40 columns including the longitude and latitude. A few of the columns returned due to the extratags argument are shown below.

stadium_locations %>%
  select(
    Club, City, Country, extratags.sport, extratags.capacity,
    extratags.operator, extratags.wikipedia
  ) %>%
  rename_with(~ gsub('extratags.', '', .)) %>%
  knitr::kable()
Club               | City       | Country | sport                    | capacity | operator                      | wikipedia
------------------ | ---------- | ------- | ------------------------ | -------- | ----------------------------- | ---------------------------------
Barcelona          | Barcelona  | Spain   | soccer                   | NA       | NA                            | en:Camp Nou
Bayern Munich      | Munich     | Germany | soccer                   | 75021    | NA                            | de:Allianz Arena
Chelsea            | London     | UK      | soccer                   | 41837    | Chelsea Football Club         | en:Stamford Bridge (stadium)
Borussia Dortmund  | Dortmund   | Germany | soccer                   | NA       | NA                            | de:Signal Iduna Park
Juventus           | Turin      | Italy   | soccer                   | NA       | NA                            | it:Allianz Stadium (Torino)
Liverpool          | Liverpool  | UK      | soccer                   | 54074    | Liverpool Football Club       | en:Anfield
Olympique Lyonnais | Lyon       | France  | soccer                   | 58000    | Olympique Lyonnais            | fr:Parc Olympique lyonnais
Man. City          | Manchester | UK      | soccer                   | NA       | Manchester City Football Club | en:City of Manchester Stadium
Napoli             | Naples     | Italy   | soccer                   | NA       | NA                            | en:Stadio San Paolo
Real Madrid        | Madrid     | Spain   | soccer                   | 85454    | NA                            | es:Estadio Santiago Bernabéu
Tottenham          | London     | UK      | soccer;american_football | 62062    | Tottenham Hotspur             | en:Tottenham Hotspur Stadium
Valencia           | Valencia   | Spain   | NA                       | NA       | NA                            | NA
Atalanta           | Bergamo    | Italy   | soccer                   | NA       | NA                            | NA
Atlético Madrid    | Madrid     | Spain   | soccer                   | NA       | NA                            | es:Estadio Metropolitano (Madrid)
RB Leipzig         | Leipzig    | Germany | NA                       | NA       | NA                            | de:Red Bull Arena (Leipzig)
PSG                | Paris      | France  | soccer                   | 48527    | Paris Saint-Germain           | fr:Parc des Princes

Below, the stadium locations are plotted on a map of Europe using the longitude and latitude coordinates and ggplot.

# reference: https://www.datanovia.com/en/blog/how-to-create-a-map-using-ggplot2/
# EU Countries
some.eu.countries <- c(
  "Portugal", "Spain", "France", "Switzerland", "Germany",
  "Austria", "Belgium", "UK", "Netherlands", "Denmark",
  "Poland", "Italy", "Croatia", "Slovenia", "Hungary",
  "Slovakia", "Czech republic"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Plot
ggplot(stadium_locations, aes(x = long, y = lat)) +
  borders('world', xlim = c(-10, 10), ylim = c(40, 55)) +
  geom_label_repel(aes(label = Club), force = 2, segment.alpha = 0) +
  geom_point() +
  theme_void()

Another great mapping option is the leaflet package, which was originally what I intended to use for the map above, but getting it to render on a Jekyll blog proved to be a bit involved.
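For reference, a minimal leaflet version of the same map might look like the sketch below (not part of the original post and not rendered here); it reuses the lat/long columns returned by geocode().

library(leaflet)

# Interactive map of the geocoded stadium locations
leaflet(stadium_locations) %>%
  addTiles() %>%
  addMarkers(lng = ~long, lat = ~lat, popup = ~Club)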

If you find any issues with the package or have ideas on how to improve it, feel free to file an issue on Github. For reference, the RMarkdown file that generated this blog post can be found here.



{hexmake} is one of the 5 Grand Prizes of the 2020 Shiny Contest


[This article was first published on Colin Fay, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Hey y’all!

I’m very happy to announce that my {hexmake} application has won one of the 5 Grand Prizes of the 2020 shiny contest, organized by RStudio!

{hexmake} is a pretty simple application when it comes to its idea: building hex stickers. But I wanted this simple idea to become a playground for showcasing more advanced {shiny} features: namely manipulating images with live display, importing and exporting data with a specific file format (here .hex), and importing and exporting to an external database (here a MongoDB).
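As a rough illustration of that last point, reading from and writing to a MongoDB collection from R can be done with the mongolite package. The sketch below reuses the connection details from the Docker example further down and is not the actual {hexmake} code.

library(mongolite)

# Connect to the collection used in the examples below
con <- mongo(
  collection = "make", db = "hex",
  url = "mongodb://myuser:mypassword@127.0.0.1:12334"
)

con$insert(data.frame(name = "my_sticker", created = Sys.time()))
con$find('{"name": "my_sticker"}')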

You can give {hexmake} a try at connect.thinkr.fr/hexmake, and read more about the Shiny Contest results on the RStudio blog. The source code of the application is at github.com/ColinFay/hexmake.

You can also play with {hexmake} on your machine by installing it with:

# Install
remotes::install_github("ColinFay/hexmake")

# Run
hexmake::run_app(with_mongo = FALSE)

If you want to use a mongodb as a back-end, you can launch one using docker with:

docker run \
  -v /mongo/data/db:/data/db \
  -v /mongo/data/dump:/dump \
  -p 12334:27017 \
  -d --name mongohexmake \
  -e MONGO_INITDB_ROOT_USERNAME=myuser \
  -e MONGO_INITDB_ROOT_PASSWORD=mypassword \
  mongo:3.4

Then launch the app using:

# Change these env variables to suit your mongo configuration
Sys.setenv("MONGOPORT" = 12334)
Sys.setenv("MONGOURL" = "127.0.0.1")
Sys.setenv("MONGODB" = "hex")
Sys.setenv("MONGOCOLLECTION" = "make")
Sys.setenv("MONGOUSER" = "myuser")
Sys.setenv("MONGOPASS" = "mypassword")

hexmake::run_app(with_mongo = TRUE)


Bagging with tidymodels and #TidyTuesday astronaut missions


[This article was first published on rstats | Julia Silge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on bagging using this week’s #TidyTuesday dataset on astronaut missions. 👨‍🚀

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal is to use bagging (bootstrap aggregation) to model the duration of astronaut missions from this week’s #TidyTuesday dataset.

Let’s start by reading in the data and check out what the top spacecraft used in orbit have been.

library(tidyverse)  # needed for read_csv() and the dplyr/ggplot2/forcats functions used below

astronauts <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-14/astronauts.csv")

astronauts %>%
  count(in_orbit, sort = TRUE)
## # A tibble: 289 x 2
##    in_orbit      n
##    <chr>     <int>
##  1 ISS         174
##  2 Mir          71
##  3 Salyut 6     24
##  4 Salyut 7     24
##  5 STS-42        8
##  6 explosion     7
##  7 STS-103       7
##  8 STS-107       7
##  9 STS-109       7
## 10 STS-110       7
## # … with 279 more rows

How has the duration of missions changed over time?

astronauts %>%
  mutate(
    year_of_mission = 10 * (year_of_mission %/% 10),
    year_of_mission = factor(year_of_mission)
  ) %>%
  ggplot(aes(year_of_mission, hours_mission,
    fill = year_of_mission, color = year_of_mission
  )) +
  geom_boxplot(alpha = 0.2, size = 1.5, show.legend = FALSE) +
  scale_y_log10() +
  labs(x = NULL, y = "Duration of mission in hours")

This duration is what we want to build a model to predict, using the other information in this per-astronaut-per-mission dataset. Let’s get ready for modeling next, by bucketing some of the spacecraft together (such as all the space shuttle missions) and taking the logarithm of the mission length.

astronauts_df <- astronauts %>%
  select(
    name, mission_title, hours_mission,
    military_civilian, occupation, year_of_mission, in_orbit
  ) %>%
  mutate(in_orbit = case_when(
    str_detect(in_orbit, "^Salyut") ~ "Salyut",
    str_detect(in_orbit, "^STS") ~ "STS",
    TRUE ~ in_orbit
  )) %>%
  filter(hours_mission > 0) %>%
  mutate(hours_mission = log(hours_mission)) %>%
  na.omit()

It may make more sense to perform transformations like taking the logarithm of the outcome during data cleaning, before feature engineering and using any tidymodels packages like recipes. This kind of transformation is deterministic and can cause problems for tuning and resampling.

Build a model

We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.

library(tidymodels)

set.seed(123)
astro_split <- initial_split(astronauts_df, strata = hours_mission)
astro_train <- training(astro_split)
astro_test <- testing(astro_split)

Next, let’s preprocess our data to get it ready for modeling.

astro_recipe <- recipe(hours_mission ~ ., data = astro_train) %>%
  update_role(name, mission_title, new_role = "id") %>%
  step_other(occupation, in_orbit,
    threshold = 0.005, other = "Other"
  ) %>%
  step_dummy(all_nominal(), -has_role("id"))

Let’s walk through the steps in this recipe.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what data we are using.
  • Next, update the role for the two columns that are not predictors or outcome. This way, we can keep them in the data for identification later.
  • There are a lot of different occupations and spacecraft in this dataset, so let’s collapse some of the less frequently occurring levels into an “Other” category, for each predictor.
  • Finally, we can create indicator variables.

We’re going to use this recipe in a workflow() so we don’t need to stress about whether to prep() or not.

astro_wf <- workflow() %>%
  add_recipe(astro_recipe)

astro_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: None
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_other()
## ● step_dummy()

For this analysis, we are going to build a bagging, i.e. bootstrap aggregating, model. This is an ensembling and model averaging method that:

  • improves accuracy and stability
  • reduces overfitting and variance

In tidymodels, you can create bagging ensemble models with baguette, a parsnip-adjacent package. The baguette functions create new bootstrap training sets by sampling with replacement and then fit a model to each new training set. These models are combined by averaging the predictions for the regression case, like what we have here (by voting, for classification).

Let’s make two bagged models, one with decision trees and one with MARS models.

library(baguette)

tree_spec <- bag_tree() %>%
  set_engine("rpart", times = 25) %>%
  set_mode("regression")

tree_spec
## Bagged Decision Tree Model Specification (regression)
## 
## Main Arguments:
##   cost_complexity = 0
##   min_n = 2
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: rpart
mars_spec <- bag_mars() %>%
  set_engine("earth", times = 25) %>%
  set_mode("regression")

mars_spec
## Bagged MARS Model Specification (regression)
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: earth

Let’s fit these models to the training data.

tree_rs <- astro_wf %>%
  add_model(tree_spec) %>%
  fit(astro_train)

tree_rs
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: bag_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_other()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────
## Bagged CART (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 11 x 4
##    term                       value std.error  used
##    <chr>                      <dbl>     <dbl> <int>
##  1 year_of_mission            890.      18.5     25
##  2 in_orbit_Other             689.      55.6     25
##  3 in_orbit_STS               386.      19.4     25
##  4 occupation_flight.engineer 190.      14.9     25
##  5 occupation_pilot           189.      20.4     25
##  6 in_orbit_Mir               124.      20.7     25
##  7 in_orbit_Salyut            100.       9.61    25
##  8 occupation_MSP              96.3      9.89    25
##  9 occupation_Other            54.7      4.09    25
## 10 military_civilian_military  39.8      4.77    25
## 11 occupation_PSP              34.4      6.24    25
mars_rs <- astro_wf %>%
  add_model(mars_spec) %>%
  fit(astro_train)

mars_rs
## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: bag_mars()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## ● step_other()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────
## Bagged MARS (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 10 x 4
##    term                         value std.error  used
##    <chr>                        <dbl>     <dbl> <int>
##  1 in_orbit_STS               100         0        25
##  2 in_orbit_Other              91.7       1.78     25
##  3 year_of_mission             62.6       4.46     25
##  4 in_orbit_Salyut             31.7       2.41     25
##  5 in_orbit_Mir                 1.08      0.914     4
##  6 military_civilian_military   0.699     1.43      2
##  7 occupation_Other             0.698     0.186     3
##  8 occupation_PSP               0.542     0.924     2
##  9 occupation_pilot             0.436     0.710     2
## 10 occupation_flight.engineer   0.215     0         1

The models return aggregated variable importance scores, and we can see that the spacecraft and year are important in both models.

Evaluate model

Let’s evaluate how well these two models did by evaluating performance on the test data.

test_rs <- astro_test %>%
  bind_cols(predict(tree_rs, astro_test)) %>%
  rename(.pred_tree = .pred) %>%
  bind_cols(predict(mars_rs, astro_test)) %>%
  rename(.pred_mars = .pred)

test_rs
## # A tibble: 316 x 9
##    name  mission_title hours_mission military_civili… occupation year_of_mission
##    <chr> <chr>                 <dbl> <chr>            <chr>                <dbl>
##  1 Carp… Mercury-Atla…          1.61 military         Pilot                 1962
##  2 Schi… Mercury-Atla…          2.22 military         pilot                 1962
##  3 Tere… Vostok 6               4.26 military         pilot                 1963
##  4 Koma… Voskhod 1              3.19 military         commander             1964
##  5 Feok… Voskhod 1              3.19 civilian         MSP                   1964
##  6 Youn… Gemini 10              4.26 military         pilot                 1966
##  7 Youn… Apollo 16              5.58 military         commander             1972
##  8 Youn… STS-9                  5.48 military         commander             1983
##  9 McDi… Gemini 4               4.57 military         commander             1965
## 10 Whit… Gemini 4               4.58 military         pilot                 1965
## # … with 306 more rows, and 3 more variables: in_orbit <chr>, .pred_tree <dbl>,
## #   .pred_mars <dbl>

We can use the metrics() function from yardstick for both sets of predictions.

test_rs %>%
  metrics(hours_mission, .pred_tree)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.640
## 2 rsq     standard       0.798
## 3 mae     standard       0.357
test_rs %>%
  metrics(hours_mission, .pred_mars)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.640
## 2 rsq     standard       0.795
## 3 mae     standard       0.351

Both models performed pretty similarly.

Let’s make some “new” astronauts to understand the kinds of predictions our bagged tree model is making.

new_astronauts <- crossing(
  in_orbit = fct_inorder(c("ISS", "STS", "Mir", "Other")),
  military_civilian = "civilian",
  occupation = "Other",
  year_of_mission = seq(1960, 2020, by = 10),
  name = "id", mission_title = "id"
) %>%
  filter(
    !(in_orbit == "ISS" & year_of_mission < 2000),
    !(in_orbit == "Mir" & year_of_mission < 1990),
    !(in_orbit == "STS" & year_of_mission > 2010),
    !(in_orbit == "STS" & year_of_mission < 1980)
  )

new_astronauts
## # A tibble: 18 x 6
##    in_orbit military_civilian occupation year_of_mission name  mission_title
##    <fct>    <chr>             <chr>                <dbl> <chr> <chr>
##  1 ISS      civilian          Other                 2000 id    id
##  2 ISS      civilian          Other                 2010 id    id
##  3 ISS      civilian          Other                 2020 id    id
##  4 STS      civilian          Other                 1980 id    id
##  5 STS      civilian          Other                 1990 id    id
##  6 STS      civilian          Other                 2000 id    id
##  7 STS      civilian          Other                 2010 id    id
##  8 Mir      civilian          Other                 1990 id    id
##  9 Mir      civilian          Other                 2000 id    id
## 10 Mir      civilian          Other                 2010 id    id
## 11 Mir      civilian          Other                 2020 id    id
## 12 Other    civilian          Other                 1960 id    id
## 13 Other    civilian          Other                 1970 id    id
## 14 Other    civilian          Other                 1980 id    id
## 15 Other    civilian          Other                 1990 id    id
## 16 Other    civilian          Other                 2000 id    id
## 17 Other    civilian          Other                 2010 id    id
## 18 Other    civilian          Other                 2020 id    id

Let’s start with the decision tree model.

new_astronauts %>%
  bind_cols(predict(tree_rs, new_astronauts)) %>%
  ggplot(aes(year_of_mission, .pred, color = in_orbit)) +
  geom_line(size = 1.5, alpha = 0.7) +
  geom_point(size = 2) +
  labs(
    x = NULL, y = "Duration of mission in hours (predicted, on log scale)",
    color = NULL, title = "How did the duration of astronauts' missions change over time?",
    subtitle = "Predicted using bagged decision tree model"
  )

What about the MARS model?

new_astronauts %>%
  bind_cols(predict(mars_rs, new_astronauts)) %>%
  ggplot(aes(year_of_mission, .pred, color = in_orbit)) +
  geom_line(size = 1.5, alpha = 0.7) +
  geom_point(size = 2) +
  labs(
    x = NULL, y = "Duration of mission in hours (predicted, on log scale)",
    color = NULL, title = "How did the duration of astronauts' missions change over time?",
    subtitle = "Predicted using bagged MARS model"
  )

You can really get a sense of how these two kinds of models work from the differences in these plots (tree vs. splines with knots), but from both, we can see that missions to space stations are longer, and missions in that “Other” category change characteristics over time pretty dramatically.



Shiny Video Game Wins RStudio Shiny Contest Grand Prize


[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R Shiny Game

Shiny Decisions Grand Prize Winner

Yes, You Can Win a Shiny Contest by Building a Video Game

RStudio has released the results of the 2nd Annual Shiny Contest, and we are pleased to announce that Appsilon engineer Pedro Silva has been named a Grand Prize Winner. Pedro’s entry is a video game created in Shiny called Shiny Decisions. Here’s how RStudio described Pedro’s app:

“A game about making the best of terrible choices. In Shiny Decisions your goal is to last as long as possible while making decisions that affect the wealth, population and environment quality in the world. The app is quite complex, and hard to describe with words. We strongly recommend giving the game a try to get a sense of it! The code for the app is equally complex, but very well organised.”

Mine Çetinkaya-Rundel, RStudio

R Shiny Game Result

Pedro intended Shiny Decisions as a Shiny experiment, mostly to see if it was possible to achieve his swiping card video game vision with R Shiny:

“Honestly, no one should actually use Shiny for video game production. Shiny Decisions is more of a funky way of presenting different concepts and ideas about what’s possible with R Shiny. I do think it speaks to the flexibly of Shiny as a framework. It was easy to reuse existing JS and CSS libraries, as well as iterate though different ideas efficiently. Overall, the project came together surprisingly quickly given the speed of Shiny as a development tool.”

Pedro Silva, Appsilon

If you're curious, you can play Shiny Decisions here. Learn how Pedro used Shiny, CSS, JavaScript, and R6 classes to create Shiny Decisions here.

You can see the other Grand Prize Winners of the 2nd Annual Shiny Contest here. Congratulations to all of the winners and runners up!

Learn More

Appsilon is hiring!

We’re searching for a Senior Sales Executive, a Content Manager/Writer, and multiple technical roles. See Appsilon’s open roles on our Careers page.

Article Shiny Video Game Wins RStudio Shiny Contest Grand Prize comes from Appsilon Data Science | End to End Data Science Solutions.




Probabilities for action and resistance in Blades in the Dark


[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Later this week, I’m going to be GM-ing my first session of Blades in the Dark, a role-playing game designed by John Harper. We’ve already assembled a crew of scoundrels in Session 0 and set the first score. Unlike most of the other games I’ve run, I’ve never played Blades in the Dark, I’ve only seen it on YouTube (my fave so far is Jared Logan’s Steam of Blood x Glass Cannon play Blades in the Dark!).

Action roll

In Blades, when a player attempts an action, they roll a number of six-sided dice and take the highest result. The number of dice rolled is equal to their action rating (a number between 0 and 4 inclusive) plus modifiers (0 to 2 dice). The details aren’t important for the probability calculations. If the total of the action rating and modifiers is 0 dice, the player rolls two dice and takes the worst. This is sort of like disadvantage and (super-)advantage in Dungeons & Dragons 5e.

A result of 1-3 is a failure with a consequence, a result of 4-5 is a success with a consequence, and a result of 6 is an unmitigated success without a consequence. If there is more than one 6 in the result, it's a success with a benefit (aka a "critical" success).

The GM doesn’t roll. In a combat situation, you can think of the player roll encapsulating a turn of the player attacking and the opponent(s) counter-attacking. On a result of 4-6, the player hits, on a roll of 1-5, the opponent hits back or the situation becomes more desperate in some other way like the character being disarmed or losing their footing. On a critical result (two or more 6s in the roll), the player succeeds with a benefit, perhaps cornering the opponent away from their flunkies.

Resistance roll

When a player suffers a consequence, they can resist it. To do so, they gather a pool of dice for the resistance roll and spend an amount of stress equal to six minus the highest result. Again, unless they have zero dice in the pool, in which case they can roll two dice and take the worst. If the player rolls a 6, the character takes no stress. If they roll a 1, the character takes 5 stress (which would very likely take them out of the action). If the player has multiple dice and rolls two or more 6s, they actually reduce 1 stress.

For resistance rolls, the value between 1 and 6 matters, not just whether it’s in 1-3, in 4-5, equal to 6, or if there are two 6s.

Probabilities

Resistance rolls are rank statistics for pools of six-sided dice. Action rolls just group those. Plus a little sugar on top for criticals. We could do this the hard way (combinatorics) or we could do this the easy way. That decision was easy.
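A minimal sketch of such a simulation for the action roll might look like this (the helper function below is illustrative, not the exact code behind the figures):

# Simulate a single action roll for a given pool size and classify the outcome
action_roll <- function(n_dice) {
  if (n_dice == 0) {
    highest <- min(sample(1:6, 2, replace = TRUE))  # roll two, take the worst
    crit <- FALSE                                   # no criticals on a 0d roll
  } else {
    roll <- sample(1:6, n_dice, replace = TRUE)
    highest <- max(roll)
    crit <- sum(roll == 6) >= 2
  }
  if (crit) "66" else if (highest == 6) "6" else if (highest >= 4) "4-5" else "1-3"
}

# Estimated outcome probabilities for a pool of 3 dice
prop.table(table(replicate(1e5, action_roll(3))))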

Here’s a plot of the results for action rolls, with dice pool size on the x-axis and line plots of results 1-3 (fail plus a complication), 4-5 (succeed with complication), 6 (succeed) and 66 (critical success with benefit). This is based on 10m simulations.

You can find a similar plot from Jasper Flick on AnyDice, in the short note Blades in the Dark.

I find the graph pretty hard to scan, so here’s a table in ASCII format, which also includes the resistance roll probabilities. The 66 result (at least two 6 rolls in the dice pool) is a possibility for both a resistance roll and an action roll. Both decimal places should be correct given the 10M simulations.

DICE   RESISTANCE                      ACTION           BOTH
DICE    1    2    3    4    5    6     1-3  4-5    6      66
----  ----------------------------     -------------    ----
 0d   .36  .25  .19  .14  .08  .03     .75  .22  .03     .00
 1d   .17  .17  .17  .17  .17  .17     .50  .33  .17     .00
 2d   .03  .08  .14  .19  .25  .28     .25  .44  .28     .03
 3d   .01  .03  .09  .17  .29  .35     .13  .45  .35     .07
 4d   .00  .01  .05  .14  .29  .39     .06  .42  .39     .13
 5d   .00  .00  .03  .10  .27  .40     .03  .37  .40     .20
 6d   .00  .00  .01  .07  .25  .40     .02  .32  .40     .26
 7d   .00  .00  .01  .05  .22  .39     .01  .27  .39     .33
 8d   .00  .00  .00  .03  .19  .38     .00  .23  .38     .39

One could go for more precision with more simulations, or resort to working them all out combinatorially.

The hard way

The hard way is a bunch of combinatorics. These aren’t too bad because of the way the dice are organized. When throwing N dice, the probability that the highest value is less than or equal to k is the probability that a single die is less than or equal to k, raised to the N-th power. It’s just that there are a lot of cells in the table. And then the differences would be required. Too error prone for me. Criticals can be handled Sherlock Holmes style by subtracting the probability of a non-critical from one. A non-critical either has no sixes (5^N possibilities with N dice) or exactly one six ((N choose 1) * 5^(N - 1) possibilities). That’s not so bad. But there are a lot of entries in the table. So let’s just simulate.
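As a rough illustration, here is a minimal sketch of such a simulation in R. This is not the code used for the table above, and the helper name is made up, but it reproduces the same probabilities up to simulation error:

# Estimate action-roll probabilities for a given dice pool by brute force.
# "6" means the highest die is a six without a critical; "66" means two or
# more sixes (not possible for a 0d pool, which rolls two dice and keeps the worst).
action_probs <- function(n_dice, n_sims = 1e5) {
  one_roll <- function() {
    if (n_dice == 0) {
      high <- min(sample(1:6, 2, replace = TRUE))
      n_six <- 0
    } else {
      dice <- sample(1:6, n_dice, replace = TRUE)
      high <- max(dice)
      n_six <- sum(dice == 6)
    }
    if (n_six >= 2) {
      "66"
    } else if (high == 6) {
      "6"
    } else if (high >= 4) {
      "4-5"
    } else {
      "1-3"
    }
  }
  results <- replicate(n_sims, one_roll())
  table(factor(results, levels = c("1-3", "4-5", "6", "66"))) / n_sims
}

action_probs(2)  # roughly .25, .44, .28, .03 -- the 2d row of the table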


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.


How to write your own R package and publish it on CRAN


[This article was first published on R on Methods Bites, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R is a great resource for data management, statistics, analysis, and visualization — and it becomes better every day. This is in large part because of the active community that continuously creates and builds extensions for the R world. If you want to contribute to this community, writing a package can be one way. That is exactly what we intended with our package overviewR. While there are many great resources for learning how to write a package in R, we found it difficult to find one all-encompassing guide that is also easily accessible for beginners. This tutorial seeks to close this gap: we will provide you with a step-by-step guide — seasoned with new and helpful packages that were also inspired by presentations at the recent virtual European R Users Meeting e-Rum 2020.

In the following sections, we will use a simplified version of one function (overview_tab) from our overviewR package as a minimal working example.

Why you should write a package

Writing a package has two main advantages. First, it helps you to approach your problems in a functional way, e.g., by turning your everyday tasks into little functions and bundling them together. Second, it is easy to share your code and new functions with others and thereby contribute to the engaged and vivid R community.

When it comes to our package, we wanted to add an automated way to get an overview — hence the name — of the data you are working with and present it in a neat and accessible way. In particular, our main motivation came from the need to get an overview of the time and scope conditions (i.e., the units of observations and the time span that occur in the data) as this is a recurring issue both in academic articles and real-world situations. While there are ways to semi-automatically extract this information, we were missing an all-integrated function to do this. This is why we started working on overviewR.

To make your package easily accessible for everyone, there are two basic strategies. You can either publish your package on GitHub (which, in terms of transparency, is always a good idea) or you can submit it to the Comprehensive R Archive Network (CRAN). Both offer the ability for others to use your package but differ in several important aspects. Releasing on CRAN offers additional tests that ensure that your package is stable across multiple operating systems and is easily installable with the function utils::install.packages() in R. If you have your package only on GitHub, there is also a function that allows users to install it directly – devtools::install_github() from the devtools package – but most users are more likely to prefer the framework and stability that they can expect from a package that is on CRAN.
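To make this concrete, the two installation routes for overviewR look like this (the GitHub route assumes you have devtools installed):

# Stable release from CRAN
install.packages("overviewR")

# Development version from GitHub
devtools::install_github("cosimameyer/overviewR")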

We will walk you through both options and start with how to make your package accessible on GitHub before discussing what needs to be done and considered when submitting it to CRAN. To set up your package in RStudio, you need to load the following packages:

library(roxygen2) # In-Line Documentation for R
library(devtools) # Tools to Make Developing R Packages Easier
library(testthat) # Unit Testing for R
library(usethis)  # Automate Package and Project Setup

When preparing this post, we came across this incredibly helpful cheat sheet that gives a detailed overview of what the devtools package can do to help you build your own package.

Where to start

Idea

All good things have to start somewhere and this is most often when you realize that the world is lacking something that is necessary and that you believe others will also benefit from. R packages come in various shapes — from entire universes such as the tidyverse package family (if you are looking for some Stata-like feedback when using the tidyverse and additions to these universes, tidylog is your best friend!), packages for specific statistical models and their validation (icr, MNLpred or oolong), to packages such as polite that offers a netiquette when scraping the web, snakecase that converts names to snake case format, rwhatsapp for scraping WhatsApp, or meme, a package that allows you to make customized memes. As you can tell, the world – and your fantasy – is your oyster.[^1]

Name

Let us assume you have a great idea for a new package; the next step is to find and pick a proper name for it. As a general rule, package names can only contain letters and numbers and must start with a letter. The package available helps you — both with getting inspiration for a name and with checking whether your name is available. This is exactly what we did in our case:

library(available) # Check if the Title of a Package is Available,
                   # Appropriate and Interesting

# Check for potential names
available::suggest("Easily extract information about your sample")
## easilyr

suggest takes a string with words that can be a description of your package and suggests a name based on this string. As you can tell, we did not go with the suggestion but opted for overviewR instead. We then checked with available whether the name is still available and valid across different platforms. Since our package is already published, it is not available on GitHub, CRAN, or Bioconductor (hence, the “x”).

# Check whether it's available
available::available("overviewR", browse = FALSE)
## -- overviewR ------------------------------------------------------------
## Name valid:
## Available on CRAN:
## Available on Bioconductor:
## Available on GitHub:
## Abbreviations: http://www.abbreviations.com/overview
## Wikipedia: https://en.wikipedia.org/wiki/overview
## Wiktionary: https://en.wiktionary.org/wiki/overview
## Urban Dictionary:
##   a general [summary] of a subject "the [treasurer] gave [a brief] overview of the financial consequences"
##   http://overview.urbanup.com/3904264
## Sentiment:???

Let your creativity spark and learn from fantastic package names such as GeneTonic or charlatan.

Set up your package with RStudio and GitHub

When setting up your package, there are various possible ways. Ours was to use RStudio and GitHub. RStudio already has a template that comes with the main documents that are necessary to build your package. To access the template, just click on File > New Project... > New Directory > R Package. Note that you need to check the box "Create a git repository" to set up a local git repository.

Hooray, you have started your own package! Let us take a look at the different files that were created.

  • .gitignore and .Rbuildignore contain documents that should be ignored when either building in git or R
  • DESCRIPTION gives the user all the core information about your package – we will talk more about this below.
  • man contains all manuals for your functions. You do not need to touch the .Rd files in there as they will be generated automatically once we populate our package with functions and run devtools::document().
  • NAMESPACE will later contain information on exported and imported functions. This file will not be modified by hand but we will show you how to do it automatically. This might seem counter-intuitive in the workflow, but we need to delete the NAMESPACE file here. We do this because we want NAMESPACE to be generated and to be accessible with the devtools universe. We will generate it automatically again later using the command devtools::document().
  • R contains all the functions that you create. We will address this folder and its files in the next step.
  • The overviewR.Rproj file is the usual R project file that you can read more about here.

However, your package is not yet linked with your GitHub. We will do this in the next step:

  1. Log in to your GitHub account.
  2. Create a new repository with “+New Repository”. We named it “overviewR” (as our package). You can set it to private or public – whatever is best for you.
  3. Do not check the box “Initialize this Repository with a README”
  4. Once you created the repository, execute the following commands in your RStudio terminal:
git remote add origin https://github.com/YOUR_USERNAME/REPOSITORY_NAME.git
git add .
git commit -m "initial commit"
git push -u origin master

If you now refresh your GitHub repository, you will see that your R package is perfectly synchronized with GitHub.

GitHub will now also ask you whether you want to create a README – just click on it and you are ready to go. To get the README in your project, pull it from GitHub either using the Pull button in the Git tab in RStudio or execute the following command line in the RStudio terminal:

git pull

Fill your package with life

We will showcase a typical workflow for creating a package using one example function (overview_tab) from our overviewR package. In practice, you can add as many functions as you want to your package.

Add functions

The folder R contains all your functions and each function is saved in a new R file where the function name and the file name are the same. As you can see, the template comes with the preset function hello that returns "Hello, world!" when executed. (The file hello.R showcases the function and can later be deleted.) To now include our own function, we open a new R file and insert a basic version of our function.

Since we program our function using the tidyverse, we have to take care of tidy evaluation and use enquo() for all inputs that we later modify. Going into detail on how to program in the tidyverse and how and when we need to use enquo is beyond the scope of this blog post. For a detailed overview, take a look at this post.

In the preamble of this file, we can add information on the function. An example is shown below:

#' @title overview_tab
#'
#' @description Provides an overview table for the time and scope conditions of
#'     a data set
#'
#' @param dat A data set object
#' @param id Scope (e.g., country codes or individual IDs)
#' @param time Time (e.g., time periods are given by years, months, ...)
#'
#' @return A data frame object that contains a summary of a sample that
#'     can later be converted to a TeX output using \code{overview_print}
#' @examples
#' data(toydata)
#' output_table <- overview_tab(dat = toydata, id = ccode, time = year)
#' @export
#' @importFrom dplyr "%>%"
  • @title takes the name of your function
  • @description contains a short description
  • @param takes all your arguments that are in the input of the function with a short description. Our function has three arguments (dat (the data set), id (the scope), and time (the time period)).
  • @return gives the user information about the output of your function
  • @examples provides a minimal working example for the user to see what s/he needs to include. You can also wrap \dontrun{} around your examples if they should not be executed (e.g., if additional software or an API key is missing). If this is not the case, it is not recommended to wrap them in \dontrun{} as it will cause a warning for the user. If your example runs longer than 5 seconds, you can wrap \donttest{} around it.
  • @export – if this is a new package, it is always recommended to export your functions. It automatically adds these functions to the NAMESPACE file.
  • @importFrom dplyr "%>%" pre-defines required functions for your function. It automatically adds these functions to the NAMESPACE file.

Once you have included the preamble, you can now add your function below.
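To give you an idea of what goes below the preamble, here is a hypothetical, heavily simplified sketch of the function body (the real overview_tab() in overviewR does considerably more, e.g., it aggregates duplicate id-time units and condenses consecutive time periods into ranges; the pipe is available thanks to the @importFrom tag above):

overview_tab <- function(dat, id, time) {
  # Capture the unquoted column names for tidy evaluation
  id <- dplyr::enquo(id)
  time <- dplyr::enquo(time)

  dat %>%
    dplyr::group_by(!!id) %>%
    dplyr::summarise(
      time_frame = paste(sort(unique(!!time)), collapse = ", ")
    ) %>%
    dplyr::ungroup()
}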

Write a help file

When you execute devtools::document(), R automatically generates the respective help file in man as well as the new NAMESPACE file. If you click on it, you see that it is read-only and all edits should be done in the main R function file in R/.

Now you can call the function with ? overview_tab and get the nice package help that you know from other functions as well.

Write DESCRIPTION

The DESCRIPTION file is pre-generated when you set up the package and contains all the necessary information about your package. We will walk you through the most essential parts:

Type: Package
Package: overviewR
Title: Easily Extracting Information About Your Data
Version: 0.0.2
Authors@R: c(
    person("Cosima", "Meyer", email = "XX@XX.com", role = c("cre", "aut")),
    person("Dennis", "Hammerschmidt", email = "XX@XX.com", role = "aut"))
Description: Makes it easy to display descriptive information on
    a data set.  Getting an easy overview of a data set by displaying and
    visualizing sample information in different tables (e.g., time and
    scope conditions).  The package also provides publishable TeX code to
    present the sample information.
License: GPL-3
URL: https://github.com/cosimameyer/overviewR
BugReports: https://github.com/cosimameyer/overviewR/issues
Depends:
    R (>= 3.5.0)
Imports:
    dplyr (>= 1.0.0)
Suggests:
    covr,
    knitr,
    rmarkdown,
    spelling,
    testthat
VignetteBuilder:
    knitr
Encoding: UTF-8
Language: en-US
LazyData: true
RoxygenNote: 7.1.0
  • Type: Package should remain unchanged
  • Package has your package’s name
  • Title is a really short description of your package
  • Version has the version number (you will most likely start with 0.0.1; if you want to know more about versioning, here is an excellent reference).
  • Authors@R contains the authors’ names and their roles. [cre] stands for the creator and this person is also the maintainer while [aut] is the author. There are also options to indicate a contributor ([ctb]) or translator ([trl]). If you need more, here’s a great overview or you can simply check for additional roles using ? person. At this point, you also need to give your e-mail address. If you want to submit your package to CRAN (but also in any other case), make sure that your e-mail address is correct and accessible!
  • Description provides a longer description of what your package does. If you want to indent, use four blank spaces.
  • License shows others what they can do with your package. This is an important part and probably a tough decision. Here and here or here are excellent overviews of different licenses and a starting guide on how to pick the best one for you.
  • URL indicates where the package is currently hosted
  • BugReports show where users should address their reports to (if linked with GitHub, this will automatically refer the user to the issues section)
  • Depends shows the R version your package works with (you always need to indicate a version number!)
  • Imports shows the packages that are required to run your package (here you always need to indicate a version number so that potential conflicts with previous versions can be avoided!)
  • Suggests lists all the packages that you suggest but that are not necessarily required for the functionality of your package
  • LazyData: true ensures that internal data sets are automatically loaded when loading the package

Add internal data set

Inspired by this excellent overview, we decided to include an internal data set to test the functionality of our package easily. How you do this is straightforward: You have a pre-generated data set at hand (or generate it yourself), and save it in data/.
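How exactly you generate the data is up to you. As a rough sketch (our actual toydata was created differently and also contains a month column), the workflow could look like the following, with usethis::use_data() storing the object as data/toydata.rda:

set.seed(123)

# A hypothetical way to build a small toy data set
toydata <- data.frame(
  ccode      = rep(c("AGO", "BEN", "FRA", "GBR", "RWA"), each = 10),
  year       = rep(1990:1999, times = 5),
  gpd        = round(runif(50, 100, 1000), 2),
  population = round(runif(50, 1e6, 5e7))
)

# Store the object as data/toydata.rda so it ships with the package
usethis::use_data(toydata, overwrite = TRUE)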

As you know, every good data set, even if it is only a toy data set, comes with a description. For your package, just set up an .R file with the name of your data set (toydata.R in our case) and save it in the R folder. The file should contain the following information:

  • Starts with a title for the data set
  • Then you have some lines for a short concise description
  • @docType defines the type of document (data)
  • @usage describes how the data set should be loaded.
  • @format gives information on the object’s format
  • \describe{} then allows you to give the user a specific description of your variables included in the data set
  • @references is essential for indicating the source if you do not use artificially generated data
  • @keywords allows you to indicate keywords (we used dataset here)
  • @examples finally gives you some room to showcase your data

What we included in our toydata.R file (you can simply copy and paste the code and adjust it to your needs):

#' Cross-sectional data for countries
#'
#' Small, artificially generated toy data set that comes in a cross-sectional
#' format where the unit of analysis is either country-year or
#' country-year-month. It provides artificial information for five countries
#' (Angola, Benin, France, Rwanda, and the UK) for a time span from 1990 to 1999 to
#' illustrate the use of the package.
#'
#' @docType data
#'
#' @usage data(toydata)
#'
#' @format An object of class \code{"data.frame"}
#' \describe{
#'  \item{ccode}{ISO3 country code (as character) for the countries in the
#'      sample (Angola, Benin, France, Rwanda, and UK)}
#'  \item{year}{A value between 1990 and 1999}
#'  \item{month}{An abbreviation (MMM) for month (character)}
#'  \item{gpd}{A fake value for GDP (randomly generated)}
#'  \item{population}{A fake value for population (randomly generated)}
#' }
#' @references This data set was artificially created for the overviewR package.
#' @keywords datasets
#' @examples
#'
#' data(toydata)
#' head(toydata)
#'
"toydata"
Write the NEWS.md

You can automatically generate a NEWS.md file using R with usethis::use_news_md(). Our news file looks like this:

# overviewR 0.0.2

- Bug fixes in overview_tab that affected overview_crosstab

---

# overviewR 0.0.1

The newest release always comes first and --- dividers separate the versions. To inform users, use bullet points to describe changes that came with the new version. As a plus, if you plan to generate a website with pkgdown (we will explain later how you can do this), this file is automatically integrated into the news section of the site.

Write the vignette

A vignette can come in handy and allows you to present the functions of your package in a more elaborate way that is easily accessible for the user. Similar to the news section, your vignette will also be automatically integrated into your website if you use pkgdown. You can think of a vignette as something like a blog post that outlines specific use cases or more detailed descriptions of your package.

Here, usethis offers an excellent service and allows you to create your first vignette automatically with the command usethis::use_vignette("NAME-OF-VIGNETTE"). This command does three different things:

  1. It generates your vignettes/ folder,
  2. Adds essential specifications to the DESCRIPTION, and
  3. It also stores a draft vignette “NAME-OF-VIGNETTE.Rmd” in the vignettes folder that you can now access and edit. This draft vignette already contains a nice template that offers you all the information and prerequisites you need to generate a good-looking vignette. You can adjust this as needed to show what your package does and how it can be used best.
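For example, creating a vignette that is simply named after the package boils down to a single call:

usethis::use_vignette("overviewR")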

Check your package

The following steps are either recommended or required when submitting your package to CRAN. We, however, recommend following all of them. We summarized what we believe is helpful when testing your package.

Write tests

Writing tests felt like the most difficult part of building the package. Essentially, you have to come up with tests for every part of your function to make sure that everything – not only the final output of your function – runs smoothly. A piece of good advice that we read multiple times is that whenever you encounter a bug, you should write a test for it to check for future occurrences. To set up the test environment, we used a combination of the great testthat package and covr, which allows you to visually see how good your test coverage is and which parts of the package still need to be tested.

  1. Generate the test environment with usethis::use_testthat(). This generates a tests/ folder with another folder called testthat/ that later contains your tests, as well as an R file testthat.R. We will only add tests to the tests/testthat/ folder and do not touch the R file.
  2. Add test(s) as .R files. The filename does not matter, just choose whatever you find reasonable.
  3. Run the tests using devtools::test(). To get an estimation of your test coverage, you can use devtools::test_coverage().

We attach the code that we used to test our overview_tab() function below and hope this sparks some inspiration when testing your functions.

Code for function testing

context("check-output")  # Our file is called "test-check_output.R"library(testthat)        # load testthat packagelibrary(overviewR)       # load our package# Test whether the output is a data frametest_that("overview_tab() returns a data frame", {  output_table <- overview_tab(dat = toydata, id = ccode, time = year)  expect_is(output_table, "data.frame")})# In reality, our function is more complex and aggregates your input if you have duplicates in your id-time units -- this is why the following two tests were essential for us## Test whether the output contains the right number of rowstest_that("overview_tab() returns a dataframe with correct number of rows", {  output_table <- overview_tab(dat = toydata, id = ccode, time = year)  expect_equal(nrow(output_table), length(unique(toydata$ccode)))})## Test whether the function works on a data frame that has no duplicates in id-timetest_that("overview_tab() works on a dataframe that is already in the correct          format",          {            df_com <- data.frame(              # Countries              ccode  = c(                rep("RWA", 4),                rep("AGO", 8),                rep("BEN", 2),                rep("GBR", 5),                rep("FRA", 3)              ),              # Time frame              year =                c(                  seq(1990, 1995),                  seq(1990, 1992),                  seq(1995, 1999),                  seq(1991, 1999, by = 2),                  seq(1993, 1999, by = 3)                )            )            output_table <-              overview_tab(dat = df_com, id = ccode, time = year)            expect_equal(nrow(output_table), 5)          })
codecov

Once you are done with your tests, you can also automatically link your results with codecov.io in your GitHub repository. This allows codecov to automatically check your tests after each push to the repository. As a bonus, you will also get a nice badge that you can include in your GitHub README to show the test coverage of your package.

To link codecov and GitHub, simply follow these steps:

  1. Log in on codecov.io with your GitHub account
  2. Give codecov access to your repository with your package
  3. This will prompt a page where you can copy your token from
  4. Now go back to your RStudio console and execute:
library(covr) # Test Coverage for Packages
covr::codecov(token = "INCLUDE_YOUR_CODECOV_TOKEN_HERE")
  5. This will then link your GitHub repository with codecov and generate the badge.
Check whether it works on various operating systems with devtools and rhub

To check whether our package works on various operating systems, we relied on a combination of the rhub and devtools packages. These packages also help you to set up continuous integration with Travis CI that automatically checks your package on Ubuntu (more on this below).

We used the following lines of code sequentially to check our package:

# The following function runs a local R CMD check
devtools::check()

This command can take some time and produces an output in the console where you get specific feedback on potential errors, warnings, or notes.

# Check for CRAN specific requirements
rhub::check_for_cran()

This command checks for standard requirements as specified by CRAN and, if saved in an object, lets you generate your cran-comments.md file based on it. We will go into further detail about this in the next section. If you use rhub for the first time, you need to validate your e-mail address with rhub::validate_email(). You can then execute the command. Once the command has run, you will receive three different e-mails that give you detailed feedback on how well the tests performed on three different operating systems. At the time of writing, this function checked our package on Windows Server 2008 R2 SP1, R-devel, 32/64 bit; Ubuntu Linux 16.04 LTS, R-release, GCC; and Fedora Linux, R-devel, clang, gfortran. From our experience, the checks on Windows were extremely fast but we had to wait a bit until we got the results for Ubuntu and Fedora.

We then also checked the package on the development version of R as suggested with the following function:

# Check for win-builder
devtools::check_win_devel()
Generate cran-comments.md file

If you plan to submit your package to CRAN, you should save your test results in a cran-comments.md file. rhub and usethis allow us to create this file almost automatically using the following lines of code:

# Check for CRAN specific requirements using rhub and save it in the
# results object
results <- rhub::check_for_cran()

# Get the summary of your results
results$cran_summary()

We received the following output when running the results$cran_summary() command.

For a CRAN submission we recommend that you fix all NOTEs, WARNINGs and ERRORs.

## Test environments
- R-hub windows-x86_64-devel (r-devel)
- R-hub ubuntu-gcc-release (r-release)
- R-hub fedora-clang-devel (r-devel)

## R CMD check results
> On windows-x86_64-devel (r-devel), ubuntu-gcc-release (r-release), fedora-clang-devel (r-devel)
  checking CRAN incoming feasibility ... NOTE
    New submission
  Maintainer: 'Cosima Meyer '

0 errors ✓ | 0 warnings ✓ | 1 note x

Your package must not cause any errors or warnings when you submit it to CRAN. Even notes need to be well explained. In our case, we received one note saying that this is a new submission. This note occurs every time you submit a new package and can briefly be explained in the cran-comments.md file when submitting your package to CRAN.

We then generated our cran-comments.md file with the following command and copy-pasted this output with minor adjustments.

# Generate your cran-comments.md, then copy-paste the output from the
# function above
usethis::use_cran_comments()
Continuous integration with Travis CI

Continuous integration (CI) is incredibly helpful to ensure the smooth working of your package every time you update even small parts. Using the command usethis::use_travis() you can easily link Travis CI with your GitHub repository. Travis CI then checks your package after each push to your repository on Ubuntu. Explaining CI in further detail would require another blog post or book itself. Luckily, Julia Silge wrote an excellent overview that can be found here. In essence, CI checks after every commit and push to your repository on GitHub that the entire code/package works and sends you an e-mail if any errors occur.

Checking for good practice I: goodpractice

The package goodpractice is incredibly helpful and provides all the information you need when it comes to polishing your package concerning syntax, package structure, code complexity, formatting, and much more. And the best thing: it provides easily understandable feedback that points you exactly to the lines of code where changes are recommended.

library(goodpractice)
goodpractice::gp()

As a general tip for improving the style of your code, the package styler provides an easy solution by formatting your entire source code in adherence to the tidyverse style (similar to RStudio’s built-in hotkey combination with Cmd + Shift + A (Mac) or Ctrl + Shift + A (Windows)).
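If you want to apply this to the whole package at once, one call is enough (note that this rewrites your source files in place, so it is a good idea to commit beforehand):

library(styler)

# Restyle all R source files of the package according to the tidyverse style guide
style_pkg()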

While all these packages refer to the tidyverse style guide, you are generally free to choose which (programming) style you like best.

Checking for good practice II: inteRgrate

A package that was presented at e-Rum 2020 and is still in an experimental life cycle, yet already incredibly helpful, is the inteRgrate package. The underlying idea behind this package is that it tests more strictly than other packages and with clear standards. By this, it aims to ensure that you are definitely on the safe side when submitting your package to CRAN. A good starting point is the list of commands under “Functions”; we particularly highlight the following functions (a short example run follows the list):

  • check_pkg() installs package dependencies, builds, and installs the package before running a package check (this check is rather strict and, by default, any note or warning raises an error)
  • check_lintr() runs lintr on the package, README, and the vignette. lintr checks whether your code adheres to certain standards and that you avoid syntax errors and semantic issues.
  • check_tidy_description() makes sure that your DESCRIPTION file is tidy. If it is not, you can use usethis::use_tidy_description() to follow the tidyverse conventions for formatting.
  • check_r_filenames() checks that all file extensions are .R and all names are lower case.
  • check_gitignore() checks whether .gitignore contains standard files.
  • check_version() ensures that you update your package version (might be good to run as the last step)
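A typical run-through of these checks (using only the functions listed above) could then look like this:

library(inteRgrate)

check_pkg()               # build, install, and check the package (strict by default)
check_lintr()             # lint the package code, README, and vignette
check_tidy_description()  # is the DESCRIPTION file tidy?
check_r_filenames()       # are all file extensions .R and names lower case?
check_gitignore()         # does .gitignore contain the standard files?
check_version()           # has the package version been updated?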

Submit to CRAN

Submitting a package to CRAN is substantially more work than making it available on GitHub. It does, however, force you to test your package on various operating systems and ensures that it is stable across all these systems. In the end, your package will become more user-friendly and accessible for a larger share of users. After going through the entire process, we believe that it is worth the effort – just for these simple reasons alone. When testing our package for CRAN, we mainly followed this blog post and collected the essential steps for you below, extended with what we think is also helpful to get published on CRAN. In the checklist below, Needed marks what is asked for when running devtools::release(), while Recommended covers additional neat checks that we found helpful.

  • Update your R, RStudio, and all dependent R packages (R and RStudio have to be updated manually, devtools::install_deps() updates the dependencies for you) [Recommended]
  • Write tests and check if your own tests work (devtools::test() and devtools::test_coverage() to see how much of your package is covered by your tests) [Recommended]
  • Check your examples in your manuals (devtools::run_examples(); unless you set your examples to \dontrun{} or \donttest{}) [Recommended]
  • Local R CMD check (devtools::check()) [Needed, Recommended]
  • Use devtools and rhub to check for CRAN specific requirements (rhub::check_for_cran() and/or devtools::check_rhub() – remember, you can store the output of these functions and generate your cran-comments.md automatically) [Needed, Recommended]
  • Check win-builder (devtools::check_win_devel()) [Needed, Recommended]
  • Check with Travis CI (usethis::use_travis()) [Recommended]
  • Update your manuals (devtools::document()) [Needed, Recommended]
  • Update your NEWS file [Needed, Recommended]
  • Update DESCRIPTION (e.g. version number) [Needed, Recommended]
  • Spell check (devtools::spell_check()) [Needed, Recommended]
  • Run goodpractice check (goodpractice::gp()) [Recommended]
  • Check package dependencies (inteRgrate::check_pkg()) [Recommended]
  • Check if code adheres to standards (inteRgrate::check_lintr()) [Recommended]
  • Check if your description is tidy (inteRgrate::check_tidy_description() – if your description is not tidy, it will produce an error and ask you to run usethis::use_tidy_description() to make your DESCRIPTION tidy) [Recommended]
  • Check if file names are correct (inteRgrate::check_r_filenames()) [Recommended]
  • Check if .gitignore contains standard files (inteRgrate::check_gitignore()) [Recommended]
  • Update cran-comments.md [Needed, Recommended]
  • Run devtools::check() one last time [Recommended]

CRAN also offers a detailed policy for package submissions as well as a checklist for submitting your package. We definitely recommend checking them in addition to our list above.

As already mentioned above, it is of vital importance that your package does not cause any errors or warnings when you submit it to CRAN. Even notes need to be well explained. If you submit a new package, there is not much you can do about the new-submission note: it will always appear.

The function devtools::release() allows you to easily submit your package to CRAN – it works like a charm. Once you feel you are ready, make sure to push your changes to GitHub and then just type the command in your console. It runs a couple of yes-no questions before the submission. The following questions are those asked in the devtools::release() function at the date of writing this post.

  • Have you checked for spelling errors (with spell_check())?
  • Have you run R CMD check locally?
  • Were devtool’s checks successful?
  • Have you checked on R-hub (with check_rhub())?
  • Have you checked on win-builder (with check_win_devel())?
  • Have you updated NEWS.md file?
  • Have you updated DESCRIPTION?
  • Have you updated cran-comments.md?

Once submitted, you will receive an e-mail that requires you to confirm your submission – and then you will have to wait. If it is a new package, CRAN also runs a couple of additional tests and it might take longer than submitting an updated version of your package.

For us, it took about four days until we heard back from CRAN. We read that CRAN is curated by volunteers who can receive an incredible amount of submissions per day. Our experience was extremely positive and supportive, which we truly enjoyed.

Once CRAN gets back to you, they will tell you about potential problems that you have to address before resubmitting your package – or you are lucky and your package gets accepted immediately.

Before resubmitting your package, go through all the steps presented in “Submit to CRAN” once again to make sure that your updated version still adheres to the standards of CRAN.

Common things that we have learned (and that others might find helpful) while going through the CRAN submission process are:

  1. Do not modify (save or delete) outputs on the user’s home filespace. Use tempdir() and/or tempfile() instead when running examples/vignettes/tests.
  2. Make sure that the user can set the directory and the file name when saving outputs. Simply add a file/path argument to your function(s).
  3. Write package names, software names, and API names in single quotes in your DESCRIPTION. If you use for example LaTeX in your DESCRIPTION, put it in single quotes. This issue is apparently not discovered by goodpractice::gp() or one of the inteRgrate functions.

Once your package has been accepted by CRAN, it is recommended to wait another 48 hours before celebrating because CRAN will still run some background checks. Afterwards, go to your GitHub repository, click on “Create a new release”, enter the version number of your package (vX.X.X) and copy-paste the release notes from your NEWS file into the release description.

When submitting your package to CRAN via the devtools::release() function, a CRAN-RELEASE file was generated to remind you to tag your release on GitHub. This file can now safely be deleted.

Add-ons

This section on add-ons can be considered a bonus. It is not essential to guarantee that your package works smoothly or gets published on CRAN — but the extensions make your package look nicer, more professional, and might help to get discovered by other users.

Create your own hexagon sticker

Hex(agon) stickers are the small hexagon-shaped icons that a large number of packages have and that people seem to love. So why not come up with your own sticker for your very own package? The package hexSticker makes it incredibly easy to customize and build a beautiful sticker. To get a sticker for your package, just add the following arguments to the function hexSticker::sticker(): package (the name of your package), subplot (an image – we have drawn our lamp ourselves, saved it as a .png and included it in our sticker without any problems), and h_fill (if you want to change the background color). You can then adjust the sticker by defining the position of the text, the subplot, the font size, or even add a spotlight as we did. This works with virtually any text and image combination – also with the Methods Bites logo.

Code for the overviewR sticker

library(hexSticker) # Create Hexagon Sticker in R
library(showtext)   # Using Fonts More Easily in R Graphs

## Loading Google fonts (http://www.google.com/fonts)
font_add_google("Inconsolata", "incon")

sticker(
  # Subplot (image)
  subplot = "logo-image.png",       # Image name
  s_y = 1,                          # Position of the sub plot (y)
  s_x = 1.05,                       # Position of the sub plot (x)
  s_width = 1.15,                   # Width of the sub plot
  s_height = 0.01,                  # Height of the sub plot
  # Font
  package = "overviewR",            # Package name (will be printed on the sticker)
  p_size = 6,                       # Font size of the text
  p_y = 0.8,                        # Position of the font (y)
  p_x = 0.75,                       # Position of the font (x)
  p_family = "incon",               # Defines font
  # Spotlight
  spotlight = TRUE,                 # Enables spotlight
  l_y = 0.8,                        # Position of spotlight (y)
  l_x = 0.7,                        # Position of spotlight (x)
  # Sticker colors
  h_fill = "#5d8aa6",               # Color for background
  h_color = "#2A5773",              # Color for border
  # Resolution
  dpi = 1200,                       # Sets DPI
  # Save
  filename = "logo.png"             # Sets file name and location where to store the sticker
)

Figures such as your logo are usually stored in man/figures/.

Add badges

Badges in your GitHub repository are a bit like stickers but they also serve an informative purpose. We included five different badges in our README in our GitHub repository: a Travis CI status, a codecov status, as well as the repo status. We then also added a badge that signals that the package is ready to use and another one that tells the user that the package was built with R (… and love!).

If you want to learn more about available badges for your package, here and here are nice overviews. You can also use the package badgecreatr to check for badges and to include them.

Create your own manual

If you want to create your own PDF manual for your package, devtools::build_manual() does this for you.

A preview of our manual

Build your website for your package

As the last part, to advertise your package and to provide a more detailed insight into how your package works, you can set up a whole, stand-alone website for it! The pkgdown package makes this as easy as writing one line of code — literally! All you have to do is to install and load pkgdown and then — provided that you have taken all the steps above and have an R-package-structure in your GitHub repository — run pkgdown::build_site(). This automatically renders your package into a website that follows the structure of your package with a landing page based on the README file, a “get started” part of your vignette, as well as sections for function references based on the content of your man/ folder, and a dedicated page for your NEWS.md. It even includes a sidebar with links to the GitHub repository, the name(s) of the author(s), and, of course, all your badges. Amazing, right?

Naturally, pkgdown allows for further modifications of your websites’ appearance such as different themes (based on bootswatch themes), modified landing pages, different outlines of your navigation bar, etc. This post provides a good overview of things that you can do in addition to using the default website builder from pkgdown.

By default, your website is hosted on GitHub pages with the following URL: https://GITHUB_USERNAME.github.io/PACKAGENAME. To ensure that every time you update your package, the website gets updated as well, you can modify the .travis.yml file in the root of your package and add the following parts to it:

after_success:
  - Rscript -e 'pkgdown::build_site()'
deploy:
  provider: pages
  skip-cleanup: true
  github-token: $GITHUB_PAT
  keep-history: true
  local-dir: docs
  on:
    branch: master

This makes sure that every time you push your updates to GitHub, Travis CI will not only check the build of your package but will also update your website automatically. For more detailed information on the deployment and the continuous integration process of your website, take a look at this blogpost.

A preview of our pkgdown website

       

Package references

       

About the authors

Cosima Meyer is a doctoral researcher and lecturer at the University of Mannheim and one of the organizers of the MZES Social Science Data Lab. Motivated by the continuing recurrence of conflicts in the world, her research interest on conflict studies became increasingly focused on post-civil war stability. In her dissertation, she analyzes leadership survival – in particular in post-conflict settings. Using a wide range of quantitative methods, she further explores questions on conflict elections, women’s representation as well as autocratic cooperation.

Dennis Hammerschmidt is a doctoral researcher and lecturer at the University of Mannheim. He works at the intersection of quantitative methods and international relations to explore the underlying structure of state relations in the international system. Using information from political speeches and cooperation networks, his dissertation work analyzes the interaction of states at the United Nations and estimates alignment structures of states in the international arena.

[^1] As of writing this post, there are more than 16,000 packages on CRAN, and around 85,000 packages on GitHub. This can be overwhelming, especially when trying to find the package for your specific problem or when trying to learn what others are doing. There are some really useful and insightful Twitter accounts or blog posts that provide an excellent overview of hot topics and rising stars in the R package world.


To leave a comment for the author, please follow the link and comment on their blog: R on Methods Bites.


CRAN Checks API News: Documentation, Notifications, and More


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In October last year we wrote about the CRAN Checks API (https://cranchecks.info). Since then there have been four new major items introduced: documentation, notifications, search, and a new version of the cchecks R package. First, an introduction to the API for those not familiar.

CRAN Checks API

The CRAN Checks API was born because, whilst being crucial to a package’s fate on CRAN, CRAN checks results were not available in a machine-readable format, contrary to checks from continuous integration services. Indeed, CRAN checks results are only distributed as HTML pages. Therefore the CRAN Checks API’s goal is to provide data from CRAN checks in a format that’s easier to work with from code.

CRAN checks are presented for packages in HTML meant for browser interaction, in a combination of tables, lists and text. On our server we scrape checks data for each package, manipulate the data into a format that can be easily stored and searched, and then make it machine readable.

The main thing this API is used for is badges like the one below that indicate status of the CRAN checks for a package. Many package maintainers have these in the README of their code repository.

Example CRAN Checks API status badge, giving an OK status. Colors are green and black.

Documentation

APIs are not very useful without good documentation. Maëlle re-organized existing docs into a website made with Hugo: https://docs.cranchecks.info/

The documentation includes explanation of all the API routes, and includes examples in Shell/command line and for use in R.

There’s also detailed explanation of notifications, see below.

For those interested in details about the Hugo website, here are a few. You can find the website source on GitHub.

The theme is an edited version of bep’s docuapi theme, in order to allow for more flexibility of code tabs. Note that bep is the Hugo maintainer, so that was a good theme to build upon. The docuapi theme uses Go modules. The way languages are divided into tabs relies on using data attributes that a JavaScript script can then access.

The website source includes some knitr hooks to deal with chunks of various languages. It might be a bit overcomplicated and could be simplified in the future, but it works for now.

Using GitHub Actions, an R script is run every week to update the documentation.

Apart from data attributes, another cool HTML thing we learnt about for this website is the Markdown extension for creating a description list.

parameter
: its definition

is rendered to

parameter
its definition

Regarding the website styling, we didn’t tweak it much. We added an rOpenSci logo, and use a dark theme for code highlighting (a tweaked version of the Chroma fruity style, to add some contrast).

cchecks R package

The cchecks package has been around for a while, but has received a lot of work recently, and is up to date with the current CRAN checks API. It is not on CRAN right now. To get started see the docs for the package at https://docs.ropensci.org/cchecks, as well as the API docs at https://docs.cranchecks.info/. You can install it like:

remotes::install_github("ropenscilabs/cchecks")

Below we talk about using cchecks for notifications and searching check results, so we’ll give a brief example of some of the other functions here.

In our October 2019 blog post we discussed accessing “historical” data, that is, data older than 30 days from the present day. Data is stored in an Amazon S3 bucket, with a separate gzipped JSON file for each day. In October ‘19 we had the API route, but now you can access the data easily within R:

library(cchecks)
cch_history(date = "2020-04-01")

The cch_history() function calls our API, which returns a link to the file in the S3 bucket. We then download the file and jsonlite reads the JSON data into a data.frame. Using this function you can quickly get historical checks data if you need to do some archeological work.

To get checks data for specific packages up to 30 days old, we can use the cch_pkgs_history() function:

cch_pkgs_history("MASS")cch_pkgs_history(c("crul","leaflet","MASS"))

So when you need historical CRAN checks data, you’ll use either cch_pkgs_history() (data for specific packages, up to 30 days in the past) or cch_history() (data for all packages, one day by function call, back to 2018-12-18), or a combination of both, depending on your needs. The resulting data is in both cases a data.frame so you can use your favorite R data munging tools.

Notifications

Good technical solutions are often born from scratching one’s own itch. The first author has many packages on CRAN and would like to avoid getting emails from the CRAN maintainers with a deadline to fix a problem. If I could only know about a problem with a CRAN check and fix it quickly, we’re all better off: users get fixes quickly, and the CRAN maintainers’ email burden is that much less.

We’re announcing here the availability of CRAN checks notifications. These notifications are emails; there could be other forms (e.g., Twitter, etc.), but emails probably meet most people’s needs. To get started see the docs at https://docs.cranchecks.info/#notifications.

Notifications work via a rule that you set. A rule is made up of one or more of four categories:

  • status: match against check status. one of: ok, note, warn, error, or fail
  • time: days in a row the match occurs. an integer. can only go 30 days back (history cleaned up after 30 days)
  • platforms: platform the status occurs on, including negation (e.g., “-solaris”). options: solaris, osx, linux, windows, devel, release, patched, oldrel
  • regex: a regex to match against the text of an error in check_details.output

A user can set as many rules as they like. A package can have more than one rule. Users can delete only their own rules (as one would expect). Rules are used indefinitely, until deleted by the user.

Once a rule is triggered an email is sent to the user. We won’t send the same email for 5 days. After 5 days have passed, if the rule is matched, we’ll send the same email; and repeat.

Feel free to ignore these emails, or act on them as you see fit.

The email is structured as follows:

An example email users get from the CRAN checks API notifications service, including the rule triggered and link to check results.
  • Triggered rule: with a list of each of the four categories and their values.
  • Date of the check result (matches closely the date on the CRAN check results that CRAN maintainers created)
  • Your check results: link to the CRAN checks API JSON output for your package
  • The rest is information, report bugs, ask for help, etc. Note the “Unsubscribe” link doesn’t actually work yet!

Managing notifications

There is no web interface to notifications at this time. The only official interface is the cchecks R package – you are welcome to interact with the API itself via curl or any other tool.

First, you’ll need to register for a token (aka key). Using cchn_register(), you run the function with or without an email address. If no email address is given we look for an email address in various places, and ask you which one you’d like to use, or you can supply one at the prompt.

cchn_register()

Running cchn_register() caches the token in a file locally on your computer.

After registering, you can manage rules for your packages. There are two ways to manage rules: a) across packages, or b) in a single package context.

Functions prefixed with cchn_pkg_ operate within a package directory. That is, your current working directory is an R package, and is the package for which you want to handle CRAN checks notifications. These functions make sure that you are inside of an R package, and use the email address and package name based on the directory you’re in.

Functions prefixed with just cchn_ do not operate within a package. These functions do not guess package name at all, but require the user to supply a package name (for those functions that require a package name); and instead of guessing an email address from your package, we guess email from the cached email/token file.

If you don’t use the package specific functions, you can add a rule like:

cchn_rule_add(package="foobar",status="warn",platform=2)

Using package specific functions, you can add a rule like:

cchn_pkg_rule_add(status = "warn", platform = 2)

See cchn_rules_add() for adding many rules at once.

What the first author does currently is a rule for each of his packages checking for a status of ERROR for at least 2 days. This would look like the below for an example set of three packages.

pkgs<-c("charlatan","randgeo","rgbif")rules<-lapply(pkgs,function(z)list(package=z,status="error",time=2))cchn_rules_add(rules,"myemail@gmail.com")

You could take this approach as well for your packages. We are thinking about ways to build in some sensible default rules, as well as making it easier to work across all of your packages by looking them up for you by your maintainer email.

Search

A benefit of having a proper database (aka SQL) of anything is that you can search it. We did not have search until May 2020, so it’s relatively new. Search allows users to do full-text search of the check_details field of the 30-day historical data across all packages. There are a few parameters users can toggle, including one_each (boolean) to only return a single result per matching package (rather than results for all days matching). The equivalent function in the R package is cchecks::cch_pkgs_search().

Here, we search for the term memory:

cchecks::cch_pkgs_search(q="memory")
#> $error
#> NULL
#>
#> $count
#> [1] 1309
#>
#> $returned
#> [1] 30
#>
#> $data
#> # A tibble: 30 x 5
#>    package date_updated summary$any   $ok $note $warn $error $fail checks
#>  1 openCR  2020-06-14T… TRUE            0    11     0      1     0
#>  2 allan   2020-06-14T… TRUE            0     9     0      3     0
#>  3 openCR  2020-06-15T… TRUE            0    11     0      1     0
#>  4 allan   2020-06-15T… TRUE            0     9     0      3     0
#>  5 allan   2020-06-16T… TRUE            0     9     0      3     0
#>  6 allan   2020-06-17T… TRUE            0     9     0      3     0
#>  7 allan   2020-06-18T… TRUE            0     9     0      3     0
#>  8 allan   2020-06-19T… TRUE            0     9     0      3     0
#>  9 allan   2020-06-20T… TRUE            0     9     0      3     0
#> 10 allan   2020-06-21T… TRUE            0     9     0      3     0
#> # … with 20 more rows, and 2 more variables: check_details$details ,
#> #   $additional_issues

The result is a list with number of results found, returned, and a data.frame of matches.

If you want to return only one result for package, use the one_each parameter:

cchecks::cch_pkgs_search(q="memory",one_each=TRUE)

Wrap up

Please try out the various items discussed above, and give us feedback. Whether it’s about the documentation, the API itself, the notifications service, or the cchecks package – it’s all useful!

We’re particularly interested in your feedback on the email notifications service. It’s still early days for the service, so we’re very keen to get all rough edges smoothed out to make for a good user experience.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


Announcing the Swimming + Data Science High School States Swimming State-Off!


[This article was first published on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The spring and summer of 2020 were going to be exciting times here at Swimming + Data Science, with NCAA championships, Olympic trials, and then the big one, the Olympics on the docket. That’s all been canceled or postponed though, and with good reason. It has also left me with a few holes in my schedule, so, without those meets I’ve decided to make my own virtual meet – from lemons, lemonade!

I’m going to take the high school state championship results from each of the 8 most populous states, boys and girls, and score them against each other in a single elimination tournament. The winner will be grand champion of the inaugural Swimming + Data Science High School Swimming State-Off!

The State-Off, and the resulting series of posts, will serve a few purposes. First and foremost, to keep me entertained because I can’t actually swim and/or watch swimming. Second, to settle a disagreement between some old college teammates of mine as to whose home state would “dominate” the others in high school swimming. Third, to show off what my SwimmeR package can do and celebrate the launch of version 0.3.1. And last, to inform, educate, and, you know, whatever, Swimming + Data Science style.

This kind of thing must of course be handled seriously, and everyone knows nothing makes a sporting event more serious than a bracket. To begin filling out a bracket we need to know the teams and seeds. There will be 8 states (teams) in the State-Off, but which 8, and with what seeding?

The United States Census Bureau keeps population data for all US States and territories, including estimates for non-census years like 2019. That data can be collected directly from census.gov. Let’s load some packages and get going!

library(readr)
library(dplyr)
library(flextable)

pop_data <- read_csv("http://www2.census.gov/programs-surveys/popest/datasets/2010-2019/national/totals/nst-est2019-alldata.csv?#")

Now that we’ve got our data, via readr (for data reading) plus dplyr (for general excellence) and flextable (for displaying the results), let’s determine the top 8 US states by population in 2019 and print up a nice table. I’ll be seeding the meet in order of population, with the most populous state getting the top seed.

seeds <- pop_data %>%
  mutate(STATE = as.numeric(STATE)) %>%
  filter(STATE >= 1) %>%
  select(NAME, POPESTIMATE2019) %>%
  arrange(desc(POPESTIMATE2019)) %>%
  top_n(8) %>%
  mutate(Seed = 1:n(),
         POPESTIMATE2019 = round(POPESTIMATE2019 / 1000000, 2)) %>%
  select(Seed, "State" = NAME, "Population (mil)" = POPESTIMATE2019)

seeds %>%
  flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header")

Seed  State          Population (mil)
   1  California                39.51
   2  Texas                     29.00
   3  Florida                   21.48
   4  New York                  19.45
   5  Pennsylvania              12.80
   6  Illinois                  12.67
   7  Ohio                      11.69
   8  Georgia                   10.62

Okay, so California, Texas, Florida, New York, Pennsylvania, Illinois, Ohio, and Georgia in that order. Now we just need to actually make a bracket. SwimmeR now has the draw_bracket function, which can produce brackets for anywhere from 5 to 64 teams. It’s just the ticket. Let’s see our match-ups!

library(SwimmeR)

draw_bracket(teams = seeds$State,
             title = "Swimming + Data Science High School Swimming State-Off",
             text_size = 0.7)

The next several posts will cover each match-up in depth, dealing with getting and cleaning the meet results, scoring out the meet for boys, girls, and combined, and discussing the process, until a State-Off champion is crowned. Additionally, I’ll do a wrap-up post where I run the meet as one giant invitational, name swimmers of the meet, and comment on anything interesting I find.

Since New York vs. Pennsylvania is one of the most hotly contested match-ups amongst my old teammates, it’ll be the next post. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Swimming + Data Science.


stringdist 0.9.6 on CRAN: new features


[This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

stringdist version 0.9.6 arrived on CRAN on 16 July 2020.

This release brings a few new features.

Fuzzy text search

Search text for approximate matches of a search string using any stringdist distance. There are several functions that allow you to

  • detect whether there is a match within a certain maximum distance
  • return the position of the first best match
  • return the best match.

There are several interfaces for this. Functions grab and grabl work like base grep and grepl. The function extract has output similar to stringr::str_extract. The workhorse function is called afind (approximate find), which returns all results for multiple search patterns.
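To make that concrete, here is a minimal sketch of how these functions might be used on a small character vector. The toy strings are my own, and I'm assuming that grab(), grabl(), and extract() accept a maxDist tolerance in the spirit of the package's existing amatch() interface; check ?afind for the released argument names.

library(stringdist)

notes <- c("Approximate text matching", "fuzzy searching in stringdist", "nothing to see here")

# grepl-style: logical per string, is any window within distance 1 of the pattern?
grabl(notes, "matchng", maxDist = 1)

# grep-style: indices of the strings that match
grab(notes, "matchng", maxDist = 1)

# str_extract-style: the best-matching window from each string
extract(notes, "matchng", maxDist = 1)

# the workhorse: location, distance, and matched window for every pattern/string pair
afind(notes, c("matchng", "serching"))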

There is also a new implementation of the popular ‘cosine’ distance that I developed especially for this purpose. It is called ‘running_cosine’ and it avoids double work otherwise done by the standard ‘cosine’ method. The result is a much faster implementation (up to about 100 times faster).
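The snippet below is only an illustrative sketch of how you might compare the two on a longer text; it assumes ‘running_cosine’ is selected through afind()'s method argument with a q-gram size q, mirroring the regular ‘cosine’ method, and the actual speedup will depend on the size of the text being searched.

library(stringdist)

# a long synthetic string to search in (my own toy data)
set.seed(1)
haystack <- paste(sample(letters, 5e4, replace = TRUE), collapse = "")

system.time(afind(haystack, "needle", method = "cosine", q = 3))
system.time(afind(haystack, "needle", method = "running_cosine", q = 3))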

String similarity matrices

Thanks to a PR by Johannes Gruber, stringdist now has a function to compute string similarity matrices: stringsimmatrix.
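A small sketch of what that could look like, assuming stringsimmatrix() mirrors the interface of the existing stringdistmatrix() and stringsim() functions:

library(stringdist)

# pairwise similarities (1 = identical, 0 = maximally different) between a few strings
stringsimmatrix(c("foo", "boo", "bar", "baz"), method = "osa")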


To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

