
A case against pipes in R and what to do instead


This article was first published on Bluecology blog, and kindly contributed to R-bloggers.

Pipes (%>%) are great for improving the readability of lengthy data processing scripts, but I'm beginning to learn they have some weaknesses when it comes to large and complex data processing.

We are running a number of projects at the moment that require managing and wrangling large and complex datasets. We have numerous scripts we use to document our workflow and the data wrangling steps. This has turned out to be very helpful, because when we identify bugs in the end product, we can go back and fix them.

But I'm starting to see a pattern. Most of the really insidious bugs occur in sections of code that use dplyr tools and pipes. These are the worst kind of bugs: they don't throw an error, so you still get a result, it just turns out to be wrong. That makes them hard to detect and fix.

So we are now moving away from using pipes in complex scripts. For simple scripts I intend to keep using them, because they are so fast and easy. Here's what we're trying instead.

The problem with pipes

So here’s some made up data that mimics the kind of fish survey data we often have:

sites <- data.frame(site = letters[1:5],
                    temp = rnorm(5, 25, 2), stringsAsFactors = FALSE)
dat <- expand.grid(site = letters[1:5],
                   transect = 1:4)
dat$abundance <- rpois(20, 11)

So we have site-level data with a covariate (temp) and transect-level data with fish counts.

Now say we have a data entry error and one of our sites is in capitals instead of lower case, so let's introduce that bug:

sites$site[1] <- "A"

Now if I join and summarize them, I will lose one of the sites:

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

dat %>%
  inner_join(sites) %>%
  group_by(site) %>%
  summarize(mean(abundance))

## Joining, by = "site"

## Warning: Column `site` joining factor and character vector, coercing into
## character vector

## # A tibble: 4 x 2
##   site  `mean(abundance)`
##   <chr>             <dbl>
## 1 b                   9  
## 2 c                  12  
## 3 d                  12.5
## 4 e                  12.8

Obvious enough here, but issues like that are much harder to detect in very large datasets.

Unit testing

The solution of course is to code in 'unit tests' to make sure each operation is doing what you expect. For small data you can just look, but for big datasets it's not so easy.

For long pipes with multiple steps we'd usually do this debugging and testing interactively. So I'd write the first line (the join), save the output to a new variable, check it worked OK, then move on to write the next step of the pipe.

Now here's the catch. In complex projects it's common to change the data that goes into your pipe (in this case the dat or sites dataframes). For instance, in our current project new data comes in all the time.

New data presents new issues. So a pipe that worked the first time may no longer work the second time.

This is why it is crucial to have unit tests built into your code.

There are lots of sophisticated R packages for unit testing, including ones that work with pipes. But given many of us are just learning tools like dplyr, it's not wise to add extra tools. So here I'll show some simple unit tests with base R.

Unit testing an example

Joins often cause problems due to mis-matching (e.g. if site names are spelt differently in different datasets, which is a very common human data entry error!).

So it's wise to check the join has worked. Here are some examples:

dat2 <- inner_join(dat, sites)

## Joining, by = "site"

## Warning: Column `site` joining factor and character vector, coercing into
## character vector

Now compare number of rows:

nrow(dat2)

## [1] 16

nrow(dat)

## [1] 20

Obviously the join has lost data in this case.

In a complex script, though, we can do better. We'd like to get an error if the data length changes. We can do this:

nrow(dat2) == nrow(dat)

## [1] FALSE

Which tells us TRUE/FALSE whether the condition is met. To get an error, use stopifnot:

stopifnot(nrow(dat2) == nrow(dat))
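If you find yourself repeating this check after every join, you could wrap it in a tiny helper function. This is only a sketch (check_rows is a made-up name, not from any package):

check_rows <- function(joined, original) {
  # throw an error if the join changed the number of rows
  stopifnot(nrow(joined) == nrow(original))
  invisible(joined)
}

check_rows(dat2, dat) # errors here, which is exactly what we want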

Common unit tests for data wrangling

Off the top of my head, here are a few of my most commonly used unit tests. To check the number of sites has stayed the same, use length(unique(...)) to get the number of unique cases:

length(unique(dat$site))

## [1] 5

length(unique(dat2$site))

## [1] 4

length(unique(dat$site)) == length(unique(dat2$site))

## [1] FALSE

Or if we wanted to compare the sites and dat dataframes:

unique(sites$site) %in% unique(dat$site)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE

The %in% just asks whether the site names in sites match the site names in dat. (We can use stopifnot here too, with multiple TRUE/FALSE values.)
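For example, a sketch of that check using the objects above:

stopifnot(all(unique(sites$site) %in% unique(dat$site)))

This fails here because "A" is not among the site names in dat.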

How many don’t match?

sum(!(unique(sites$site) %in% unique(dat$site)))

## [1] 1

The ! is a logical ‘not’ (not FALSE = TRUE, so we are counting non-matches).

Which one doesn’t match?

sites$site[!unique(sites$site) %in% unique(dat$site)]

## [1] "A"
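Once you know which value is the culprit, the fix is often a one-liner. For a case mismatch like this one, a sketch of the repair (assuming all site codes should be lower case):

sites$site <- tolower(sites$site)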

Here’s another insidious bug caused by joins, when our covariate dataframe has duplicate site entries:

sites <- data.frame(site = c(letters[1:5], "a"),
                    temp = c(rnorm(5, 25, 2), 11), stringsAsFactors = FALSE)
sites

##   site     temp
## 1    a 19.30061
## 2    b 23.76530
## 3    c 24.89018
## 4    d 25.16386
## 5    e 23.83092
## 6    a 11.00000

Now we have two sites called a with different values of temp. Check out the join:

dat2 <- inner_join(dat, sites)

## Joining, by = "site"

## Warning: Column `site` joining factor and character vector, coercing into
## character vector

nrow(dat)

## [1] 20

nrow(dat2)

## [1] 24

So it's added rows, i.e. made up data we didn't have. Why? Well, the join duplicated all the site a values for both values of temp:

filter(dat2, site == "a")

##   site transect abundance     temp
## 1    a        1        10 19.30061
## 2    a        1        10 11.00000
## 3    a        2        11 19.30061
## 4    a        2        11 11.00000
## 5    a        3        12 19.30061
## 6    a        3        12 11.00000
## 7    a        4         9 19.30061
## 8    a        4         9 11.00000

Now watch this: we can really go wrong when we summarize:

dat2 %>%
  group_by(site) %>%
  summarize(sum(abundance))

## # A tibble: 5 x 2
##   site  `sum(abundance)`
##   <chr>            <int>
## 1 a                   84
## 2 b                   36
## 3 c                   48
## 4 d                   50
## 5 e                   51

It looks like site a has twice as many fish as it really does (84, when it should be 42). So imagine you had a sites dataframe you were happy with, then your collaborator sent you a new one to use, but it had duplicate rows. If you didn't have the unit test to check your join in place, you might never know about this doubling of the data.
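A direct unit test for this failure mode is to check the covariate table for duplicated keys before joining (a sketch):

stopifnot(!any(duplicated(sites$site))) # fails here, because site "a" appears twice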

Another way to catch this is to count the number of transects per site, e.g.:

dat_ntrans <- dat2 %>% group_by(site) %>% summarize(n = n())
dat_ntrans

## # A tibble: 5 x 2
##   site      n
##   <chr> <int>
## 1 a         8
## 2 b         4
## 3 c         4
## 4 d         4
## 5 e         4

dat_ntrans$n != 4

## [1]  TRUE FALSE FALSE FALSE FALSE

(Yes I used a pipe this time, but a simple one).
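And as before, the check can be turned into a hard stop with stopifnot (a sketch, assuming every site should have exactly four transects):

stopifnot(all(dat_ntrans$n == 4))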

Going forward

So I plan to keep using pipes, but just for simple things. For complex scripts we'll break the pipes into intermediate steps and do more unit testing along the way. It'll save a lot of pain down the road.



Make your Amazon purchases with R!


This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers.



Background

Anyone who's bought groceries online recently has seen the huge increase in demand due to the COVID-19 outbreak and quarantines. In this post, you'll learn how to buy groceries on Amazon using R! To do that, we'll be using the RSelenium package. In case you're not familiar, Selenium is a browser automation tool. It works like a normal browser, except that you write code to perform operations, such as navigating to websites, filling in forms online, clicking links and buttons, etc. In this way, it's similar to writing a macro in Excel, except for a web browser.

Several different languages, including Python and R, have packages that allow you to use Selenium by writing code in their language. R’s package for this, as mentioned above, is RSelenium.

Getting started

To get started, we need to install RSelenium, which we can do using the standard install.packages command. I also recommend using a newer version of R, as some of RSelenium‘s dependencies won’t work on older R versions.

install.packages("RSelenium")

Once installed, we can start coding!

Getting the URL of each grocery item

The first step we need to do is to create a driver object, which we’ll do using the rsDriver function. Running the first line below will launch a browser window. RSelenium supports several browsers, including Firefox, Chrome, and Internet Explorer.

Our code will mostly revolve around using driver$client to navigate around webpages. To go to a specific webpage, like Amazon’s, we can use the navigate method.

# create driver object
driver <- rsDriver(port = 2000L, browser = "firefox")

# navigate to Amazon's homepage
driver$client$navigate("https://www.amazon.com")

Next, let’s define a vector of general items we want to search for.

items <- c("flour","butter","cereal", "eggs", "milk", "apples", "sugar")

Now, let’s break down what we need to do. For each item, we will:

  • 1) Type in the name of the item in Amazon's search box and submit the search
  • 2) Find the URLs of all non-sponsored products on the results page
  • 3) Click on the links for the top search results for each item
  • 4) Check if each product is in stock
  • 5) Click "Add to cart" for each in-stock product
  • 6) Check out

Translating the first three points into code gives the block below. Let's cover a couple of key points about the code. First, to search for elements on the webpage, we use the findElement method, passing the HTML attributes of the element we want. For example, the item search box has the id "twotabsearchtextbox", which we can see in the code below. We can figure that out by looking at the source code behind the webpage, or by right-clicking the search box and going to "inspect element".

[Screenshots: the Amazon search bar and its HTML, showing the id "twotabsearchtextbox"]

# create empty list to hold URLs for the top products for each item
all_urls <- list()

# loop through each item in the items vector
for(item in items) {
    # Find the search bar box
    item_box <- driver$client$findElement(using = "id", value = "twotabsearchtextbox")
    # Clear the search box
    item_box$clearElement()
    # Type in the item name (note: text must be inside of a list)
    item_box$sendKeysToElement(list(item))
    # Submit the search
    item_box$submitElement()

    # Wait for the results page to come up
    Sys.sleep(5)

    # Get the links within the "rush-component" span tags
    spans <- driver$client$findElements(using = "class", value = "rush-component")
    links <- lapply(spans, function(span) try(span$findChildElement(using = "class", value = "a-link-normal"), silent = TRUE))

    # Filter out errors i.e. the result of some span tags above not having links
    links <- links[sapply(links, class) == "webElement"]

    # Get URLs from link objects
    urls <- unlist(sapply(links, function(link) link$getElementAttribute("href")))

    # Filter out links we don't need ("sponsored" products)
    urls <- unique(urls[!grepl("/gp/", urls)])

    # Add URLs to list
    all_urls[[item]] <- urls[1:5]
}

    RSelenium returns the links as they show up on the webpage i.e. links closer to the top of the search results will show up earlier in returned links. This means if we examine the first 5 URLs for an item (as we pull above), it should correspond to the first 5 products in the search results for that item.

    Now, for our purposes, we’re only going to get one product per item, but if the first product we check is not in stock, then we want to be able to check if other products for the item are available. Again, we’ll limit our search to the first 5 products i.e. first 5 URLs associated with each item. If you’re doing this on your own, this is something you could adjust.

    Here’s the next section of code. In this section, we go to each product’s webpage, check if it’s in stock, and then add it to the cart if it’s currently available. If a product is in stock, then we skip the rest of the products for that item. We’ll use the clickElement method to click the “add to cart” button.

for(urls in all_urls) {
    text <- ""

    for(url in urls) {
        # Navigate to url, wait for the page to fully load
        driver$client$navigate(url)
        Sys.sleep(5)

        # Look for div tag stating if item is in stock or not
        div <- try(driver$client$findElement(using = "id", value = "availability"))

        # If page doesn't have this tag, assume it's not in stock
        if(class(div) == "try-error")
            next
        else {
            # Scrape text from div tag
            text <- div$getElementText()[[1]]
            break
        }
    }

    if(text == "In Stock.") {
        add_to_cart <- driver$client$findElement(using = "class", value = "a-button-input")
        add_to_cart$clickElement()
    }

    Sys.sleep(5)
}

    In the code above we check if the page specifically states the product is in stock. If it says it’s in stock at a later date, then this will not add that product to the cart. However, you can modify this by changing the equality operator to a regular expression, like this:

if(grepl("in stock", text, ignore.case = TRUE)) {...}

    Next, now that we have added each available product to our shopping cart, we can check out! Below, we’ll go to the checkout page, and login with a username and password.

# Navigate to shopping cart
driver$client$navigate("https://www.amazon.com/gp/cart/view.html?ref_=nav_cart")

# Find "Proceed to checkout" button
checkout_button <- driver$client$findElement(using = "name", value = "proceedToRetailCheckout")

# Click checkout button
checkout_button$clickElement()

# Find username box
username_input <- driver$client$findElement(using = "id", value = "ap_email")

# Enter username info
username_input$sendKeysToElement(list("TOP SECRET USERNAME"))

# Submit username
username_input$submitElement()

# Find password box
password_input <- driver$client$findElement(using = "id", value = "ap_password")

# Enter password
password_input$sendKeysToElement(list("TOP SECRET PASSWORD"))

# Submit password
password_input$submitElement()

# Wait for page to load
Sys.sleep(5)

    One note – it’s not a good idea to store credentials in a script – ever. You can avoid this by using the keyring package. Learn more about that by clicking here.
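As a rough sketch of what that could look like (the service names "amazon-email" and "amazon-password" are made up here, and assume you have already stored the values with keyring::key_set()):

library(keyring)

# retrieve credentials previously stored with key_set("amazon-email") / key_set("amazon-password")
username_input$sendKeysToElement(list(key_get("amazon-email")))
password_input$sendKeysToElement(list(key_get("amazon-password")))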

    Next, we can place our order.

# Find "place order" button
place_order <- driver$client$findElement(using = "name", value = "placeYourOrder1")

# Submit "place order" button
place_order$submitElement()

    However, before you place your order, you might need to update your address or payment info. For example, you can start the “change address” process by using the code below. You’ll just need to add a few lines to select a different address or fill in a new one. Similarly, you could do the same for payment information.

# Find "change address" link
change_address <- driver$client$findElement(using = "id", value = "addressChangeLinkId")

# Click "change address" link
change_address$clickElement()
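Finally, once you are completely done with the session, it is good practice to close the browser and stop the Selenium server. This cleanup step is not part of the original walkthrough, but a minimal sketch looks like this:

# close the browser window and shut down the Selenium server started by rsDriver()
driver$client$close()
driver$server$stop()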

    Conclusion

That's it for this post! There are several pieces of this that you can adjust on your own, like which items you're looking for, or whether to get items that will be in stock soon, etc. Check out this link to follow my blog on Twitter!

    Want to learn more about web scraping? Of course you do! Check out a new online course I co-authored with 365 Data Science here! You’ll learn all about web scraping, APIs (a must), and how to deal with modern challenges like JavaScript, logging into websites, and more! Also, check out their full platform of courses on R, Python, and data science here.


    COVID-19: Analyze Mobility Trends with R


This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.

    The global lockdown has slowed down mobility considerably. This can be seen in the data produced by our ubiquitous mobile phones.

Apple is kind enough to make those anonymized and aggregated data available to the public. If you want to learn how to get a handle on those data and analyze trends with R, read on!

To download the current data set, go to the Apple Maps: Mobility Trends Reports website, click on "All Data CSV", and move the file to your data folder.

    Apple explains:

    The CSV file and charts on this site show a relative volume of directions requests per country/region or city compared to a baseline volume on January 13th, 2020.

    We define our day as midnight-to-midnight, Pacific time. Cities represent usage in greater metropolitan areas and are stably defined during this period. In many countries/regions and cities, the relative volume has increased since January 13th, consistent with normal, seasonal usage of Apple Maps. Day of week effects are important to normalize as you use this data.

    Data that is sent from users’ devices to the Maps service is associated with random, rotating identifiers so Apple doesn’t have a profile of your movements and searches. Apple Maps has no demographic information about our users, so we can’t make any statements about the representativeness of our usage against the overall population.

    To get an overview we first load the data into R and print the available regions (data for countries and many cities are available) and transportation types (“driving”, “transit” and “walking”):

mobility <- read.csv("data/applemobilitytrends-2020-04-19.csv") # change path and file name accordingly

levels(mobility$region)
##   [1] "Albania"                  "Amsterdam"
##   [3] "Argentina"                "Athens"
##   [5] "Atlanta"                  "Auckland"
##   [7] "Australia"                "Austria"
##   [9] "Baltimore"                "Bangkok"
##  [11] "Barcelona"                "Belgium"
##  [13] "Berlin"                   "Birmingham - UK"
##  [15] "Bochum - Dortmund"        "Boston"
##  [17] "Brazil"                   "Brisbane"
##  [19] "Brussels"                 "Buenos Aires"
##  [21] "Bulgaria"                 "Cairo"
##  [23] "Calgary"                  "Cambodia"
##  [25] "Canada"                   "Cape Town"
##  [27] "Chicago"                  "Chile"
##  [29] "Cologne"                  "Colombia"
##  [31] "Copenhagen"               "Croatia"
##  [33] "Czech Republic"           "Dallas"
##  [35] "Delhi"                    "Denmark"
##  [37] "Denver"                   "Detroit"
##  [39] "Dubai"                    "Dublin"
##  [41] "Dusseldorf"               "Edmonton"
##  [43] "Egypt"                    "Estonia"
##  [45] "Finland"                  "France"
##  [47] "Frankfurt"                "Fukuoka"
##  [49] "Germany"                  "Greece"
##  [51] "Guadalajara"              "Halifax"
##  [53] "Hamburg"                  "Helsinki"
##  [55] "Hong Kong"                "Houston"
##  [57] "Hsin-chu"                 "Hungary"
##  [59] "Iceland"                  "India"
##  [61] "Indonesia"                "Ireland"
##  [63] "Israel"                   "Istanbul"
##  [65] "Italy"                    "Jakarta"
##  [67] "Japan"                    "Johannesburg"
##  [69] "Kuala Lumpur"             "Latvia"
##  [71] "Leeds"                    "Lille"
##  [73] "Lithuania"                "London"
##  [75] "Los Angeles"              "Luxembourg"
##  [77] "Lyon"                     "Macao"
##  [79] "Madrid"                   "Malaysia"
##  [81] "Manchester"               "Manila"
##  [83] "Melbourne"                "Mexico"
##  [85] "Mexico City"              "Miami"
##  [87] "Milan"                    "Montreal"
##  [89] "Morocco"                  "Moscow"
##  [91] "Mumbai"                   "Munich"
##  [93] "Nagoya"                   "Netherlands"
##  [95] "New York City"            "New Zealand"
##  [97] "Norway"                   "Osaka"
##  [99] "Oslo"                     "Ottawa"
## [101] "Paris"                    "Perth"
## [103] "Philadelphia"             "Philippines"
## [105] "Poland"                   "Portugal"
## [107] "Republic of Korea"        "Rio de Janeiro"
## [109] "Riyadh"                   "Romania"
## [111] "Rome"                     "Rotterdam"
## [113] "Russia"                   "Saint Petersburg"
## [115] "San Francisco - Bay Area" "Santiago"
## [117] "Sao Paulo"                "Saudi Arabia"
## [119] "Seattle"                  "Seoul"
## [121] "Serbia"                   "Singapore"
## [123] "Slovakia"                 "Slovenia"
## [125] "South Africa"             "Spain"
## [127] "Stockholm"                "Stuttgart"
## [129] "Sweden"                   "Switzerland"
## [131] "Sydney"                   "Taichung"
## [133] "Taipei"                   "Taiwan"
## [135] "Tel Aviv"                 "Thailand"
## [137] "Tijuana"                  "Tokyo"
## [139] "Toronto"                  "Toulouse"
## [141] "Turkey"                   "UK"
## [143] "Ukraine"                  "United Arab Emirates"
## [145] "United States"            "Uruguay"
## [147] "Utrecht"                  "Vancouver"
## [149] "Vienna"                   "Vietnam"
## [151] "Washington DC"            "Zurich"

levels(mobility$transportation_type)
## [1] "driving" "transit" "walking"

We now create a function mobi_trends to return the data in a well-structured format. With the default plot = TRUE it plots the data; plot = FALSE returns a named vector with the raw data for further investigation:

mobi_trends <- function(reg = "United States", trans = "driving", plot = TRUE, addsmooth = TRUE) {
  data <- subset(mobility, region == reg & transportation_type == trans)[4:ncol(mobility)]
  dates <- as.Date(sapply(names(data), function(x) substr(x, start = 2, stop = 11)), "%Y.%m.%d")
  values <- as.numeric(data)
  series <- setNames(values, dates)
  if (plot) {
    plot(dates, values, main = paste("Mobility Trends", reg, trans), xlab = "", ylab = "", type = "l", col = "blue", lwd = 3)
    if (addsmooth) {
      lines(dates, values, col = "lightblue", lwd = 3)
      lines(supsmu(dates, values), col = "blue", lwd = 2)
    }
    abline(h = 100)
    abline(h = c(0, 20, 40, 60, 80, 120, 140, 160, 180, 200), lty = 3)
    invisible(series)
  } else series
}

mobi_trends()

    The drop is quite dramatic… by 60%! Even more dramatic, of course, is the situation in Italy:

    mobi_trends(reg = "Italy")

    A drop by 80%! The same plot for Frankfurt:

    mobi_trends(reg = "Frankfurt")

Obviously, people in Germany have been taking those measures less seriously lately; there seems to be a clear upward trend. This can also be seen in the German "walking" data:

    mobi_trends(reg = "Germany", trans = "walking")

What is interesting is that "transit" mobility seems to have accelerated just before the lockdown, before plunging:

    mobi_trends(reg = "Germany", trans = "transit")

    You can also plot the raw numbers only, without an added smoother (option addsmooth = FALSE):

    mobi_trends(reg = "London", trans = "walking", addsmooth = FALSE)

    And as I said, you can conduct your own analyses on the formatted vector of the time series (option plot = FALSE)…

mobi_trends(reg = "London", trans = "walking", plot = FALSE)
## 2020-01-13 2020-01-14 2020-01-15 2020-01-16 2020-01-17 2020-01-18
##     100.00     108.89     116.84     118.82     132.18     160.29
## 2020-01-19 2020-01-20 2020-01-21 2020-01-22 2020-01-23 2020-01-24
##     105.12     108.02     120.52     124.81     127.01     137.38
## 2020-01-25 2020-01-26 2020-01-27 2020-01-28 2020-01-29 2020-01-30
##     162.41      97.16     100.01     113.27     122.75     124.96
## 2020-01-31 2020-02-01 2020-02-02 2020-02-03 2020-02-04 2020-02-05
##     144.13     161.17     103.93     105.67     115.03     125.42
## 2020-02-06 2020-02-07 2020-02-08 2020-02-09 2020-02-10 2020-02-11
##     128.43     140.65     167.80      76.79     100.51     115.26
## 2020-02-12 2020-02-13 2020-02-14 2020-02-15 2020-02-16 2020-02-17
##     125.35     124.69     150.77     149.35      96.03     131.20
## 2020-02-18 2020-02-19 2020-02-20 2020-02-21 2020-02-22 2020-02-23
##     131.72     137.59     136.05     153.95     170.22     104.41
## 2020-02-24 2020-02-25 2020-02-26 2020-02-27 2020-02-28 2020-02-29
##     104.32     119.88     125.12     123.88     133.76     153.92
## 2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06
##     109.26     103.64     114.68     114.25     106.50     142.09
## 2020-03-07 2020-03-08 2020-03-09 2020-03-10 2020-03-11 2020-03-12
##     167.10      96.86      97.50     105.54     106.91      98.87
## 2020-03-13 2020-03-14 2020-03-15 2020-03-16 2020-03-17 2020-03-18
##     104.19     117.44      64.28      64.53      48.95      43.31
## 2020-03-19 2020-03-20 2020-03-21 2020-03-22 2020-03-23 2020-03-24
##      38.76      37.49      37.36      30.76      31.25      24.63
## 2020-03-25 2020-03-26 2020-03-27 2020-03-28 2020-03-29 2020-03-30
##      24.09      22.89      23.40      23.40      17.83      19.72
## 2020-03-31 2020-04-01 2020-04-02 2020-04-03 2020-04-04 2020-04-05
##      22.29      22.19      22.76      24.34      28.49      26.06
## 2020-04-06 2020-04-07 2020-04-08 2020-04-09 2020-04-10 2020-04-11
##      21.63      24.64      23.87      26.13      28.59      28.58
## 2020-04-12 2020-04-13 2020-04-14 2020-04-15 2020-04-16 2020-04-17
##      22.86      22.80      25.66      27.44      26.40      23.27
## 2020-04-18 2020-04-19
##      26.36      30.40
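Apple's note above about day-of-week effects suggests one easy follow-up. As a minimal sketch (assuming the mobi_trends() function defined above and the same data), a 7-day moving average removes most of the weekly cycle:

london <- mobi_trends(reg = "London", trans = "walking", plot = FALSE)

# centred 7-day moving average to smooth out day-of-week effects
london_ma7 <- stats::filter(london, rep(1/7, 7), sides = 2)

plot(as.Date(names(london)), london_ma7, type = "l",
     xlab = "", ylab = "Directions requests (7-day average)")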

We have only scratched the surface of the many possibilities here. There are many interesting analyses left, like feeding the data into epidemiological models or simply calculating correlations with new infections/deaths: please share your findings in the comments below!


    Why R? Webinar – Recent changes in R spatial and how to be ready for them


This article was first published on http://r-addict.com, and kindly contributed to R-bloggers.

April 23rd (8:00pm GMT+2) is the next time we'll be hosting a webinar on the Why R? Foundation YouTube channel – https://youtube.com/c/WhyRFoundation. This time we will have a joint presentation by Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R). Jakub is an assistant professor in the Institute of Geoecology and Geoinformation at the Adam Mickiewicz University, Poznan, Poland. He is a computational geographer working at the intersection between geocomputation and the environmental sciences. Robin is an associate professor at the Institute for Transport Studies (ITS) and Leeds Institute for Data Analytics (LIDA), University of Leeds, UK. His research focuses on geocomputation and reproducible data science for evidence-based policy-making.

    Full abstract and biograms are below.

    See you on the Webinar!

    Details

    Abstract

    Currently, hundreds of R packages are related to spatial data analysis. They range from ecology and earth observation, through hydrology and soil science, to transportation and demography. These packages support various stages of analysis, including data preparation, visualization, modeling, or communicating the results. One common feature of most R spatial packages is that they are built upon some of the main representations of spatial data in R, available in the sf, sp, raster, terra, or stars package. Those packages are also not entirely independent. They are using external libraries, namely GEOS for spatial data operations, GDAL for reading and writing spatial data, and PROJ for conversions of spatial coordinates.

    Therefore, R spatial packages are interwoven with each other and depend partially on external software developments. This has several positives, including the ability to use cutting-edge features and algorithms. On the other hand, it also makes R spatial packages vulnerable to changes in the upstream packages and libraries.

In the first part of the talk, we will showcase several recent advances in R packages, including the largest recent change, related to developments in the PROJ library. We will explain why the changes happened and how they impact R users. The second part will focus on how to prepare for the changes, including computer set-up and running R spatial packages using Docker. We will outline important considerations when setting up operating systems for geographic R packages. To reduce set-up times you can run geographic R packages in Docker, a flexible and scalable containerization technology. Docker can run on modern computers and in your browser via services such as Binder, greatly reducing set-up times. Discussing these set-up options, and questions of compatibility between geographic R packages and paradigms such as the tidyverse and data.table, will ensure that after the talk everyone can empower themselves with open source software for geographic data analysis in a powerful and flexible statistical programming environment.

    Biograms

    Jakub is an assistant professor in the Institute of Geoecology and Geoinformation at the Adam Mickiewicz University, Poznan, Poland. He is a computational geographer working at the intersection between geocomputation and the environmental sciences. His research has focused on developing and applying spatial methods to broaden our understanding of processes and patterns in the environment. Another vital part of his work is to create, collaborate on, and improve geocomputational software. He is a co-author of the Geocomputation with R book and many R packages, including landscapemetrics, sabre, and climate.

    Robin is associate professor at the Institute for Transport Studies (ITS) and Leeds Institute for Data Analytics (LIDA), University of Leeds, UK. His research focuses on geocomputation and reproducible data science for evidence-based policy-making. Decarbonising the global economy while improving health and environmental outcomes is a major problem solving challenge. Robin’s research supports solutions by generating evidence and tools enabling evidence-based investment in efficient and healthy modes of transport at local, city and national scales. Robin is the Lead Developer of the award-winning Propensity to Cycle Tool (see www.pct.bike ), convenor of the Transport Data Science module and workshop series, and co-author of popular packages, papers, and books, including Geocomputation with R.

    Previous talks

Heidi Seibold, Department of Statistics (collaboration with LMU Open Science Center), University of Munich – Teaching Machine Learning online. Video

Olgun Aydin, PwC Poland – Introduction to shinyMobile. Video

Achim Zeileis, Universität Innsbruck – R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video



    Use R & GitHub as a Workout planner


This article was first published on Colin Fay, and kindly contributed to R-bloggers.

Over the years, I've been trying a bunch of different applications and methods to stay motivated to work out. But every time it's the same: at some point the application is great but limited if you do not pay, or the workouts are repetitive, or I simply can't get into the habit of using the application.

What if there were a solution that is free, where exercises can be generated at random so that it's not too repetitive, and that I can fit into my "natural workflow" when it comes to creating and keeping track of tasks?

So here's a new idea: what if I used R to generate random workouts, and then used GitHub issues as a tracker and checklist?

Note: obviously I'm neither a professional athlete (in case you haven't met me before) nor a professional trainer, so in case this needs to be said, please use these for recreational workouts, and with care (and of course at your own risk).

    Random workouts

    First of all, I need a workout database, with the following elements:

    • type of workout (bodyweight/kettlebells/abs/dumbbells/resistance band…)
    • the description
    • a value to track the timing or number of repetitions
    • the unit of this value
    • a complexity factor so that I can plan harder workouts when I feel like it
    • something to say if it’s “outside friendly”: for example I can’t do jumping rope inside, nor some kettlebell movements

For now I will just be doing this manually; maybe at some point I will scrape an online resource, but that's an idea for another blogpost. The idea is to come up with this kind of table:

tibble::tribble(
  ~type,          ~description,   ~val, ~unit,     ~complexity, ~inside,
  "Jumping rope", "Jumping Rope", 100,  "jump",    1,           FALSE,
  "Bodyweight",   "Plank",        45,   "seconds", 1,           TRUE
)
# A tibble: 2 x 6
  type         description    val unit    complexity inside
  <chr>        <chr>        <dbl> <chr>        <dbl> <lgl>
1 Jumping rope Jumping Rope   100 jump             1 FALSE
2 Bodyweight   Plank           45 seconds          1 TRUE

Given that I want to add manual exercises (the ones I know how to do, and like to do), I'll do this manually in a CSV file.

    worrkout::wk
# A tibble: 27 x 7
   type    description      val unit    complexity inside pics
   <chr>   <chr>          <dbl> <chr>        <dbl> <lgl>  <chr>
 1 Jumpin… "Jumping Rope"   100 jump             1 FALSE  https://mir-s3-cdn-cf…
 2 Bodywe… "Plank"           45 seconds          1 TRUE   https://thumbs.gfycat…
 3 Bodywe… "Squat"           16 x                1 TRUE   https://thumbs.gfycat…
 4 Bodywe… "Lunge"           16 x / ea…          1 TRUE   https://media0.giphy.…
 5 Bodywe… "Single-Leg B…    16 x / ea…          1 TRUE   https://thumbs.gfycat…
 6 Bodywe… "Glute Bridge"    16 x / ea…          1 TRUE   https://thumbs.gfycat…
 7 Bodywe… "Plyo Lunge"      16 x / ea…          1 TRUE   https://media.giphy.c…
 8 Bodywe… "Mountain Cli…    16 x / ea…          1 TRUE   https://media1.tenor.…
 9 Kettle… "Russian Twis…    16 x                1 TRUE   https://thumbs.gfycat…
10 Bodywe… "Leg Lift"        16 x                1 TRUE   https://thumbs.gfycat…
# … with 17 more rows

Then, I'll need a function to sample some of the exercises, repeat them twice (I like to do two cycles of the same series of exercises), and then create the text for the issue.

    This text for the issue should look like this:

    + [ ] KettleBell - Russian Twist (16 x)

That way, I can check off the to-dos, and also see a gif of the movement (well, some gifs are more for fun 😉).

generate_workout <- function(n_workout, complexity = 1, outside = FALSE) {
  # Read the dataset, and remove its NA
  wk <- worrkout::wk
  wk <- tidyr::drop_na(wk, 1:6)
  # We want `n_workout`, which is 2 cycles of exercises
  wk <- dplyr::sample_n(wk, floor(n_workout / 2))
  wk <- dplyr::bind_rows(wk, wk)
  # If we want to get more difficulty
  wk$complexity <- wk$complexity * complexity
  # Generate the output; the ![](%s) markdown embeds the gif from the pics column
  body <- sprintf(
    "+ [ ] %s - %s (%s %s) \n\n ![](%s)",
    wk$type, wk$description, wk$val, wk$unit, wk$pics
  )
  paste0(body, collapse = "\n\n")
}

cat(generate_workout(4))
    + [ ] KettleBell - KettleBell clean and press (16 x / each side)  
    + [ ] Bodyweight - Bicycle Crunch (16 x / each side)
    + [ ] KettleBell - KettleBell clean and press (16 x / each side)
    + [ ] Bodyweight - Bicycle Crunch (16 x / each side)

    Let’s now push this to Github:

res <- gh::gh(
  "POST /repos/:owner/:repo/issues",
  owner = "ColinFay",
  repo = "workouts",
  title = sprintf("Workout - %s", Sys.Date()),
  assignee = "ColinFay",
  body = generate_workout(16)
)
browseURL(res$html_url)

    And, tadaaa 🎉.
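One practical note: for the gh::gh() call above to work, you need a GitHub personal access token with permission to create issues. The gh package picks it up from the GITHUB_PAT (or GITHUB_TOKEN) environment variable, typically set in your ~/.Renviron; a quick sanity check (no real token shown here):

# TRUE if a token is available for gh to use
Sys.getenv("GITHUB_PAT") != ""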

    Please reuse these functions just for fun and recreational workouts.

    Find them at github.com/ColinFay/worrkout


    Launch of New Course Platform


This article was first published on Quantargo Blog, and kindly contributed to R-bloggers.

    After months of hard work we are really excited to launch our brand-new course platform to learn and apply data science. Together with the new platform we also developed the first new course Introduction to R which is available for free now!

    This release is a big milestone for us on our path to provide people with the best knowledge and tools available to apply data science. We think that programming—and data science in particular—should be taught interactively by seeing and writing real code. Our online course platform is built on an exhaustive amount of interactive coding exercises and quizzes.

    But the technical achievement is not the only novelty. We diverged from a traditional course outline and completely changed how we structure our course content. Each chapter features a so-called recipe which learners can collect by finishing exercises. Recipes depend on each other and together form a knowledge graph. In the future, learners will be able to create their own learning paths based on the dependency structure of the graph and their progress. Collected recipes are available in your cookbook which gives you an overview of your progress.

    Another cool feature is the achievement system. Code recipes can be collected in a cookbook so that learners can review their achievements. Once all recipes from a topic have been collected users get a badge. The course is finally finished when all the badges have been collected.

    New Course Available for Free: Introduction to R

    The first course module Introduction to R is perfect for newcomers who want to get started with data science. The course teaches the programming language R and covers the language basics so that you can transform data and make professional looking graphs and charts with little effort.


We would love to hear your feedback – either through the feedback buttons on each page (visible for logged-in users) or via e-mail.

    Cheers,

    Your Quantargo Team


    Automatic Code Cleaning in R with Rclean


This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.

    Leave the code cleaner than you found it.

    – R.C. Martin in Clean Code

    The R language has become very popular among scientists and analysts because it enables the rapid development of software and empowers scientific investigation. However, regardless of the language used, data analysis is usually complicated. Because of various project complexities and time constraints, analytical software often reflects these challenges. “What did I measure? What analyses are relevant to the study? Do I need to transform the data? What’s the function for the analysis I want to run?” Although many researchers see the value in learning to write software, the time investment for learning a programming language alone is still exceedingly high for many, let alone also learning software best practices. The downside to the rapid spread of data science is that learning to create good software takes a back-seat to just writing code that will get the job “done” leading to issues with transparency and software that is highly unstable (i.e. buggy and not reproducible).

The goal of the R package Rclean is to provide an automated tool to help scientists more easily write better code. Specifically, Rclean has been designed to facilitate the isolation of the code needed to produce one or more results, because more often than not, when someone is writing an R script, the ultimate goal is analytical results for inference, such as a set of statistical analyses and associated figures and tables. As the investigative process is inherently iterative, this set of results is nearly always a subset of a much larger set of possible ways to explore a dataset. There are many statistical tests, visualizations and other representations that can be employed in a myriad of ways depending on how the data are processed. This commonly leads to lengthy, complicated scripts from which researchers manually subset results, but which are likely never to be refactored because of the difficulty in disentangling the code.

    The Rclean package uses a technique based on data provenance and network algorithms to isolate code for a desired result automatically. The intent is to ease refactoring for scientists that use R but do not have formal training in software design and specifically with the “art” of creating clean code, which in part is the development of an intuitive sense of how code is and/or should be organized. However, manually culling code is tedious and potentially leads to errors regardless of the level of expertise a coder might have; therefore, we see Rclean as a useful tool for programming in R at any level of ability. Here, we’ll cover details on the implementation and design of the package, a general example of how it can be used and thoughts on its future development.

How to use Rclean to write cleaner code

    Installation

    Through the helpful feedback from the rOpenSci community, the package has recently passed software review and a supporting article was recently published in the Journal of Open Source Software 1, in which you can find more details about the package. The package is hosted through the rOpenSci organization on GitHub, and the package can be installed using the devtools package 2 directly from the repository (https://github.com/ROpenSci/Rclean).

library(devtools)
install_github("ROpenSci/Rclean")

If you do not already have Rgraphviz, you will need to install it using the following code before installing Rclean:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Rgraphviz")

    Isolating code for a set of results

    Analytical scripts that have not been refactored are often both long and complicated. However, a script doesn’t need to be long to be complicated. The following example script presents some challenges such that even though it’s not a long script, picking through it to get a result would likely prove to be frustrating.

library(stats)
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
### This is a note that I made for myself.
### Next time, make sure to use a different analysis.
### Also, check with someone about how to run some other analysis.
x <- do.call(cbind, x)
### Now I'm going to create a different variable.
### This is the best variable the world has ever seen.
x2 <- sample(10:1000, 100)
x2 <- lapply(x2, rnorm)
### Wait, now I had another thought about x that I want to work through.
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
  set.seed(17)
  x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
### Ok. Now I can get back to x2.
### Now I just need to check out a bunch of stuff with it.
lapply(x2, length)[1]
max(unlist(lapply(x2, length)))
range(unlist(lapply(x2, length)))
head(x2[[1]])
tail(x2[[1]])
## Now, based on that stuff, I need to subset x2.
x2 <- lapply(x2, function(x) x[1:10])
## And turn it into a matrix.
x2 <- do.call(rbind, x2)
## Now, based on x2, I need to create x3.
x3 <- x2[, 1:2]
x3 <- apply(x3, 2, round, digits = 3)
## Oh wait! Another thought about x.
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
## Now, I want to run an analysis on two variables in x2 and x3.
fit.23 <- lm(x2 ~ x3, data = data.frame(x2[, 1], x3[, 1]))
summary(fit.23)
## And while I'm at it, I should do an analysis on x.
x <- data.frame(x)
fit.xx <- lm(A ~ B, data = x)
summary(fit.xx)
shapiro.test(residuals(fit.xx))
## Ah, it looks like I should probably transform A.
## Let's try that.
fit_sqrt_A <- lm(I(sqrt(A)) ~ B, data = x)
summary(fit_sqrt_A)
shapiro.test(residuals(fit_sqrt_A))
## Looks good!
## After that. I came back and ran another analysis with
## x2 and a new variable.
z <- c(rep("A", nrow(x2) / 2), rep("B", nrow(x2) / 2))
fit_anova <- aov(x2 ~ z, data = data.frame(x2 = x2[, 1], z))
summary(fit_anova)

So, let's say we've come to our script wanting to extract the code that produces one of the results, fit_sqrt_A, which is the fitted model object for one of the analyses. We might want to double check the results, and we also might need to use the code again for another purpose, such as creating a plot of the patterns supported by the test. Manually tracing through our code for all the variables used in the test, and finding all of the code used to prepare them for the analysis, would be difficult, especially given the fact that we have used "x" as a prefix for multiple unrelated objects in the script. However, Rclean can do this easily via the clean() function.

library(Rclean)
script <- system.file("example", "long_script.R", package = "Rclean")
clean(script, "fit_sqrt_A")
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
x <- do.call(cbind, x)
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
  set.seed(17)
  x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
x <- data.frame(x)
fit_sqrt_A <- lm(I(sqrt(A)) ~ B, data = x)

    The output is the code that Rclean has picked out from the tangled bits of code, which in this case is an example script included with the package. Here’s a view of this isolated code highlighted in the original script.

[The full original script from above, shown again with the lines isolated by clean() highlighted.]

    The isolated code can now be visually inspected to adapt the original code or ported to a new, refactored script using keep().

fitSA <- clean(script, "fit_sqrt_A")
keep(fitSA)

    This will pass the code to the clipboard for pasting into another document. To write directly to a new file, a file path can be specified.

fitSA <- clean(script, "fit_sqrt_A")
keep(fitSA, file = "fit_SA.R")

    To explore more possible variables to extract, the get_vars() function can be used to produce a list of the variables (aka. objects) that are created in the script.

    get_vars(script)
[1] "x"          "x2"         "i"          "x3"         "fit.23"
[6] "fit.xx"     "fit_sqrt_A" "z"          "fit_anova"

Especially when the code for different variables is entangled, it can be useful to visualize the code in order to devise an approach to cleaning. The code_graph() function can also give us a visual of the code and the objects it produces.

    code_graph(script)

Figure 1 (code_graph() example): Example of the plot produced by the code_graph() function, showing which functions produce which variables and which variables are used as inputs to other functions.

    After examining the output from get_vars() and code_graph(), it is possible that more than one object needs to be isolated. This can be done by adding additional objects to the list of vars.

clean(script, vars = c("fit_sqrt_A", "fit_anova"))
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
x <- do.call(cbind, x)
x2 <- sample(10:1000, 100)
x2 <- lapply(x2, rnorm)
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
  set.seed(17)
  x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
x2 <- lapply(x2, function(x) x[1:10])
x2 <- do.call(rbind, x2)
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
x <- data.frame(x)
fit_sqrt_A <- lm(I(sqrt(A)) ~ B, data = x)
z <- c(rep("A", nrow(x2) / 2), rep("B", nrow(x2) / 2))
fit_anova <- aov(x2 ~ z, data = data.frame(x2 = x2[, 1], z))

    Currently, libraries can not be isolated directly during the cleaning process. So, the get_libs() function provides a way to detect the libraries for a given script. We just need to supply a file path and get_libs() will return the libraries that are called by that script.

    get_libs(script)
    [1] "stats"

    The provenance engine under the hood

    The clean() function provides an effective way to remove code that is unwanted; however, many researchers are wary about doing this exact thing for at least a few reasons. Perhaps the top reason is that the main goal of an analysis is the results and taking time to craft transparent, dependable software is not the priority. As such, taking time to go back through a script and remove code is time wasted. Relatedly, for most researchers the best way to keep track of the various analyses that they have explored is to keep them in the script, as they do not use a rigorous version control system but instead rely on file backups and informal versioning. Although we can’t give researchers more hours in the day, providing an easier and more reliable means to remove unused code will lower the barrier to creating better, cleaner code. Combined with the increasing use of version control systems and digital notebooks, the practice of “saving” analytical ideas in a script will become less common and code quality will increase.

The process that Rclean uses relies on the generation of data provenance. The term provenance means information about the origins of some object. Data provenance is a formal representation of the execution of a computational process, used to rigorously determine the unique computational pathway from inputs to results 3. To avoid confusion, note that “data” in this context is used in a broad sense to include all of the information generated during computation, not just the data collected in a research project and used as input to an analysis. Having the formalized, mathematically rigorous representation that data provenance provides guarantees that analyses conducted by Rclean are theoretically sound. Most importantly, because the relationships defined by the provenance can be represented as a graph, it is possible to apply network search algorithms to determine the minimum and sufficient code needed to generate the chosen result in the clean() function.
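To make the graph-search idea a little more concrete, here is a toy sketch (my own illustration, not Rclean’s actual implementation) using the igraph package: objects are nodes, an edge points from an object to the objects computed from it, and the code needed for a chosen result corresponds to that result’s ancestors in the graph. The node names are loosely borrowed from the example script above, and the edges are written by hand rather than parsed from code.

library(igraph)

# Toy dependency graph: an edge a --+ b means "b is computed from a".
deps <- graph_from_literal(
  x --+ x2, x --+ fit_xx, x --+ fit_sqrt_A,
  x2 --+ x3, x2 --+ fit_anova,
  x2 --+ fit_23, x3 --+ fit_23
)

# All ancestors of fit_sqrt_A (plus the node itself): the minimal set of
# objects, and hence lines of code, needed to reproduce that result.
as_ids(subcomponent(deps, "fit_sqrt_A", mode = "in"))

clean() does something conceptually similar on the real provenance graph, where the nodes and edges come from analysing the script rather than being written out by hand.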

    There are multiple approaches to collecting data provenance, but Rclean uses “prospective” provenance, which analyzes code and uses language-specific information to predict the relationship among processes and data objects. Rclean relies on an R package called CodeDepends to gather the prospective provenance for each script. For more information on the mechanics of the CodeDepends package, see 4. To get an idea of what data provenance is, we can use the code_graph() function to get a graphical representation of the prospective provenance generated for Rclean.


Figure 2 provenance graph: Network diagram of the prospective data provenance generated for an example script. Arrows indicate which functions (numbers) produced (outgoing arrow) or used (incoming arrow) which objects (names).

    All of this work with the provenance is to get the network representation of relationships among functions and objects. The provenance network is very powerful because we can now apply algorithms to analyze the R script with respect to our results. This is what empowers the clean() function, which takes the provenance and applies a network search algorithm to determine the pathways leading from inputs to outputs. In the process any objects or functions that do not fall along that pathway are by definition not necessary to produce the desired set of results and can therefore be removed. As demonstrated in the example, this property of the provenance network is what facilitates the robust isolation of the minimal code necessary to generate the output we want.

One important topic to discuss is that Rclean does not keep comments present in code. This is the result of a limitation of the data provenance collection, which currently does not assign comments a relationship in the provenance network. Detecting the relationship between comments and code is a general problem. For example, comments at the end of lines are typically relevant to the line they are on, but this is not a linguistic requirement. Likewise, comments occupying their own lines usually refer to the following lines, but this is not necessarily the case either. In fact, comments can refer to any or none of the code relative to their position in the script, the latter commonly being the case when code is removed from a script but the comments referring to it are not. The inferred and explicit meanings of comments are cultural, not linguistic, conventions.

    That being said, although Rclean cannot operate automatically on comments, comments in the original code remain untouched and can be used to inform the reduced code. Also, as the clean() function is oriented toward isolating code based on a specific result, the resulting code tends to naturally support the generation of new comments that are higher level (e.g. “The following produces a plot of the mean response of each treatment group."), and lower level comments are not necessary because the code is simpler and clearer. This process of commenting is an important part of writing better code. Lastly, although comments can serve an important role in coding, it is worth reflecting on the statement in R.C. Martin’s book Clean Code: A Handbook of Agile Software Craftsmanship where he writes that, “Comments do not compensate for bad code.”

    Concluding remarks and future work

    Rclean provides a simple, easy to use tool for scientists who would like help refactoring code. Using Rclean, the code necessary to produce a specified result (e.g., an object stored in memory or a table or figure written to disk) can be easily and reliably isolated even when tangled with code for other results. Tools, such as this, that make it easier to produce transparent, accessible code will be an important aid for improving scientific reproducibility 5.

Although the current implementation of Rclean for minimizing code is useful on its own, we see promise in connecting it with other reproducibility tools. One example is the reprex package, which provides a simple API for sharing reproducible examples 6. Rclean could provide a reliable way to extract parts of a larger script to be piped into a simplified reproducible example. Another possibility is to help transition scripts to functions, packages and workflows via refactoring toolboxes like drake 7. Since Rclean can isolate the code from inputs to one or more outputs, it could be used to extract all of the components needed to write one or more functions that would become part of a package or workflow, as is the goal of drake.

In the future, it would also be useful to extend the existing framework to support other provenance methods. One possibility is retrospective provenance, which tracks a computational process as it is executing. Through this active, concurrent monitoring, retrospective provenance can gather information that static prospective provenance can’t. Greater detail about the computational process would enable other features that could address current challenges, such as detecting the libraries that are actually used by the code, processing comments (as discussed above), parsing control statements and replicating random processes. Using retrospective provenance comes at a cost, however. In order to gather it, the script needs to be executed. When scripts are computationally intensive or contain bugs that stop execution, retrospective provenance cannot be obtained for part or all of the code. Although such costs may present challenges, combining prospective and retrospective provenance methods could provide a powerful and flexible solution. Some work has already been done in the direction of implementing retrospective provenance for code cleaning in R (see http://end-to-end-provenance.github.io); however, there doesn’t appear to be a tool that synthesizes these two approaches to provenance.

    We hope that Rclean makes writing scientific software easier for the R community. The package has already been significantly improved via the rOpenSci review process, thanks to the efforts of editor Anna Krystalli and reviewers, Will Landau and Clemens Schmid. We look forward to the future progress of the package and other “code cleaning” tools. As an open-source project, we would like to encourage feedback and help with extending the package. We invite people to use the package and get involved by reporting bugs and suggesting or (hopefully) contributing features. For more information please visit the project page on GitHub.

    Acknowledgements

    Thanks to Steffi Lazerte, Stefanie Butland and Maëlle Salmon for editorial work and technical help. Special word of thanks to Maëlle Salmon for the addition of the code highlighting figure in the clean() example.


1. Lau M, Pasquier TFJ, Seltzer M (2020). “Rclean: A Tool for Writing Cleaner, More Transparent Code.” Journal of Open Source Software, 5(46), 1312. https://doi.org/10.21105/joss.01312. ↩

    2. Wickham H, Hester J, Chang W (2019). devtools: Tools to Make Developing R Packages Easier. https://CRAN.R-project.org/package=devtools. ↩

    3. Carata L, Akoush S, Balakrishnan N, Bytheway T, Sohan R, Seltzer M, Hopper A (2014). “A Primer on Provenance.” Queue, 12(3), 10-23. ISSN 15427730, http://dl.acm.org/citation.cfm?doid=2602649.2602651 , https://doi.org/10.1145/2602649.2602651. ↩

    4. Temple Lang D, Peng R, Nolan D, Becker G (2020). CodeDepends: Analysis of R Code for Reproducible Research and Code Comprehension. https://github.com/duncantl/CodeDepends. ↩

    5. Pasquier T, Lau MK, Trisovic A, Boose ER, Couturier B, Crosas M, Ellison AM, Gibson V, Jones CR, Seltzer M (2017). “If these data could talk.” Scientific Data, 4, 170114. ISSN 2052-4463, http://www.nature.com/articles/sdata2017114 , https://doi.org/10.1038/sdata.2017.114. ↩

    6. Bryan J, Hester J, Robinson D, Wickham H (2019). reprex: Prepare Reproducible Example Code via the Clipboard. https://CRAN.R-project.org/package=reprex. ↩

    7. Landau WM (2020). drake: A Pipeline Toolkit for Reproducible Computation at Scale. https://CRAN.R-project.org/package=drake. ↩


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


    How to showcase CSS+JS+HTML snippets with Hugo?


    [This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    I’ve recently found myself having to write a bit of CSS or JS for websites made with Hugo. Note for usual readers: it is a topic not directly related to R, but you might have played with either or both CSS and JS for your R blog or Shiny app. On a scale from Peter Griffin programming CSS window blinds to making art with CSS, I’m sadly much closer to the former; my JS knowledge is not better. I often scour forums for answers to my numerous and poorly formulated questions, and often end up at code playgrounds like Codepen where folks showcase snippets of HTML on its own or with either or both CSS and JS, together with the resulting HTML document. You can even edit the code and run it again. Quite neat!

    Now, as I was listening to an episode from the Ladybug podcast about blogging, one of the hosts, Ali Spittel, mentioned integrating Codepens into blog posts1, which sounds useful indeed, and I started wondering: how could one showcase CSS+JS+HTML snippets on a Hugo website? In this post I shall go over three solutions, with and without Codepen, in all cases based on custom shortcodes.

    The easiest way: Embed a Codepen

    As reported in a 2018 blog post by Ryan Campbell and a 2019 blog post by Jeremy Kinson, Jorin Vogel created and shared a Hugo shortcode for Codepen.

Save it under layouts/shortcodes/pen.html and voilà, you can call it! Below I’ll embed a cool CSS art snippet by Sarah L. Fossheim.

    {{< pen user="fossheim” id="oNjxrZa” >}}

    Gives

    For a blog post showcasing code of yours, it might get a bit tiring to create and keep track of Codepens. Moreover, you might want more ownership of your code.

    The DIY way: load HTML, CSS, and JS code into an iFrame

I was completely stuck trying to find out how to create and embed my own iframe, and then luckily found a perfect post by Josh Pullen, “How to Load HTML, CSS, and JS Code into an iFrame”, with a perfect definition of the problem: “If you’ve ever used JSFiddle, Codepen, or others, this problem will be familiar to you: The goal is to take some HTML, CSS, and JS (stored as strings) and create an iframe with the code loaded inside.” 🙌 Good stuff!

    Based on the code in the post, I created a shortcode called “snippet.html”. It is a paired shortcode: there is input between two markers, and one option passed via the first marker. It expects input like

{{< snippet an-id-unique-for-the-post >}}
```css
// css code
```
```html
// HTML code
```
```js
// js code
```
{{< /snippet >}}

    The shortcode itself is shown below.

<h4>My snippet {{ .Get 0 }}</h4>
{{ $content := .Inner }}
{{ $content := replaceRE "```html" "\n **HTML code:** \n```html" $content }}
{{ $content := replaceRE "```css" "\n **CSS code:** \n```css" $content }}
{{ $content := replaceRE "```js" "\n **JS code:** \n```js" $content }}
{{ $content | markdownify }}

{{ $css := replaceRE "```html(.|\n)*?```" "$1" .Inner }}
{{ $css := replaceRE "```js(.|\n)*?```" "$1" $css }}
{{ $css := replaceRE "```css" "$1" $css }}
{{ $css := replaceRE "```" "$1" $css }}

{{ $js := replaceRE "```html(.|\n)*?```" "$1" .Inner }}
{{ $js := replaceRE "```css(.|\n)*?```" "$1" $js }}
{{ $js := replaceRE "```js" "$1" $js }}
{{ $js := replaceRE "```" "$1" $js }}

{{ $html := replaceRE "```css(.|\n)*?```" "$1" .Inner }}
{{ $html := replaceRE "```js(.|\n)*?```" "$1" $html }}
{{ $html := replaceRE "```html" "$1" $html }}
{{ $html := replaceRE "```" "$1" $html }}

<b>Result:</b><br>
<iframe id="{{ .Get 0 }}" allowfullscreen style="width:100%;height:100%;"></iframe>

<script src="/js/blob.js" type="text/javascript"></script>
<script type="text/javascript">
document.addEventListener('DOMContentLoaded', function() {
   mySnippet('{{ $html }}', '{{ $js }}', '{{ $css }}', '{{ .Get 0 }}');
}, false);
</script>

The first lines create and markdownify the three highlighted blocks, with a note on the language before each. Using markdownify means the code will be highlighted using Chroma without my having to make any further efforts.

Then there is some inelegant string manipulation going on to extract the CSS, JS and HTML.

    After that I use the code from Josh Pullen’s post to create blob URLs and an iframe.

    • I create an iframe with the ID given as argument of the shortcode.

• Once the page is loaded, I call a function defined below and saved under /js/blob.js, again recycling code from Josh Pullen’s post, with the HTML, CSS and JS code, as well as the frame ID, as arguments. The code creates blob URLs with the JS and CSS, then a blob URL with the HTML calling the JS and CSS, and finally assigns that last blob URL to the iframe. ✨

function mySnippet(html, js, css, id) {
  const getGeneratedPageURL = ({ html, css, js }) => {
    const getBlobURL = (code, type) => {
      const blob = new Blob([code], { type });
      const url = URL.createObjectURL(blob);
      return url;
    };

    const cssURL = getBlobURL(css, 'text/css')
    const jsURL = getBlobURL(js, 'text/javascript')

    const source = `
      <html>
        <head>
          ${css && `<link rel="stylesheet" type="text/css" href="${cssURL}" />`}
          ${js && `<script src="${jsURL}" type="text/javascript"></script>`}
        </head>
        <body>
          ${html || ''}
        </body>
      </html>
    `;

    return getBlobURL(source, 'text/html');
  };

  const url = getGeneratedPageURL({
    html: html,
    css: css,
    js: js
  });

  const getid = "#" + id;
  const iframe = document.querySelector(getid);
  iframe.src = url;
};

    Demos

    In this demo below I use w3schools.com tutorial about buttons.

{{< snippet number42 >}}
```css
p {
  color: red;
}
button {
  background-color: pink;
}
```
```html
<p> Some text </p>
<button onclick="getTime()">What time is it??</button>
<p id="demo"></p>
```
```js
function getTime() {
  document.getElementById('demo').innerHTML = Date();
}
```
{{< /snippet >}}

    Gives

    My snippet number42

    CSS code:

p {
  color: red;
}
button {
  background-color: pink;
}

    HTML code:

<p> Some text </p>
<button onclick="getTime()">What time is it??</button>
<p id="demo"></p>

    JS code:

function getTime() {
  document.getElementById('demo').innerHTML = Date();
}

Result: (the interactive iframe renders here)

    Below is another, simpler, one.

{{< snippet number66 >}}
```css
p {
  color: blue;
}
```
```html
<p> There is no JS in this snippet. </p>
```
{{< /snippet >}}

    Gives

    My snippet number66

    CSS code:

p {
  color: blue;
}

    HTML code:

<p> There is no JS in this snippet. </p>

Result: (the interactive iframe renders here)

    To-dos

    Clearly, my custom shortcode could do with… styling, which is sort of ironic, but this is left as an exercise to the reader. 😉

    The mix: own your code, present it through Codepen

A page of Codepen docs caught my attention: “Prefill Embeds”. “CodePen Prefill Embeds allow you to enhance code that you are already displaying on your own website and transform it into an interactive environment.”

Using them makes you rely on Codepen, of course, but in exchange you can use all of Codepen’s fixings (even preprocessing!).

I created another shortcode as a proof-of-concept, not encompassing all features. In this case I was able to use nested shortcodes. In the previous solution I didn’t find a way to do that, given that I needed to use the content of each block both on its own and combined in the iframe.

    The shortcode expects input like

    {{< prefillembed"A title for the pen">}}  {{< pcodecss>}}  // CSS code  {{< /pcode>}}    {{< pcodehtml>}}  // HTML code  {{< /pcode>}}    {{< pcodejs>}}  // JS code  {{< /pcode>}}  {{< /prefillembed>}}

    The main shortcode code is quite simple:

    <divclass="codepen"data-prefill='{    "title": "{{ .Get 0 }}"}'data-height="400"data-theme-id="1"data-default-tab="html,result">  {{ .Inner }}div><scriptasyncsrc="https://static.codepen.io/assets/embed/ei.js">script>

The sub-shortcodes are not much more complicated. An important aspect is the escaping of HTML, which the Codepen docs warn about. I felt quite proud knowing about htmlEscape, but it was not enough: I had to pipe the output into safeHTML, so I was no longer so full of myself after that. 😸

<pre data-lang="{{ .Get 0 }}">
  {{ if eq (.Get 0) "html" }}
    {{ .Inner | htmlEscape | safeHTML }}
  {{ else }}
    {{ .Inner }}
  {{ end }}
</pre>

    Demo

    {{< prefillembed "My Pen" >}}  {{< pcode css >}}  p {    color: red;  }  button {    background-color: pink;  }  {{< /pcode >}}    {{< pcode html >}}  

    Some text

    What time is it??

    {{< /pcode >}} {{< pcode js >}} function getTime() { document.getElementById('demo').innerHTML = Date(); } {{< /pcode >}} {{< /prefillembed >}}

    gives

(the CodePen prefill embed renders here, showing the CSS, HTML and JS panels and the live result)

    To-dos

    This shortcode could do with more parameterization to allow using all features of Codepen’s prefill embeds.

    Conclusion

    In this post I went over three ways to showcase CSS+JS+HTML snippets with Hugo: adding a custom shortcode for embedding Codepen; creating a custom shortcode thanks to which the code is displayed in highlighted code blocks but also loaded into an iframe; creating a custom shortcode that uses Codepen prefill embeds. Each approach has its pros and cons depending on whether or not you want to rely on Codepen. Please don’t hesitate to share your alternative approaches or your extensions of my shortcodes!

    Taking a step back, such shortcodes, if much improved, could maybe be shared in a Hugo theme as a developer toolbelt2? Even if copy-pasting shortcodes from someone else’s repo, with attribution, works well too. 🙂 It could contain shortcodes for developer websites that use OEmbed (so not Stack Overflow, not GitHub), and unfurling workarounds for others. Quite a lot to explore!


    1. I looked up the episode transcript to find out which of the hosts said that because I can’t recognize their voices (yet?). 😁↩

    2. I am using this term because of Steph Locke’s Hugo utility belt. ↩


    To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.



    The Case for tidymodels


    [This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    If you are a data scientist with a built-out set of modeling tools that you know well, and which are almost always adequate for getting your work done, it is probably difficult for you to imagine what would induce you to give them up. Changing out what works is a task that rarely generates much enthusiasm. Nevertheless, in this post, I would like to point out a few features of tidymodels that could help even experienced data scientists make the case to give tidymodels a try.

So what are we talking about? tidymodels are an integrated, modular, extensible set of packages that implement a framework that facilitates creating predictive stochastic models. tidymodels are first class members of the tidyverse. They adhere to tidyverse syntax and design principles that promote consistency and well-designed human interfaces over speed of code execution. Nevertheless, they automatically build in parallel execution for tasks such as resampling, cross validation and parameter tuning. Moreover, they don’t just work through the steps of the basic modeling workflow, they implement conceptual structures that make complex iterative workflows possible and reproducible.

If you are an R user and you have been building predictive models, then there is a good chance that you are familiar with the caret package. One straightforward path to investigating tidymodels is to follow the thread that leads from caret to parsnip. caret, the result of a monumental fifteen-plus-year effort, incorporates two hundred thirty-eight predictive models into a common framework. For example, any one of the included models can be substituted for lm in the following expression.

lmFit <- train(Y ~ X1 + X2, data = training,
               method = "lm",
               trControl = fitControl)

    By itself this is a pretty big deal. parsnip refines this idea by creating a specification structure that identifies a class of models that allows users to easily change algorithms and also permits the models to run on different “engines”.

spec_lin_reg <- linear_reg() %>%   # a linear model specification
  set_engine("lm")                 # set the model to use lm

# fit the model
lm_fit <- fit(spec_lin_reg, Y ~ X1 + X2, data = my_data)

    This same specification can be modified to run a Bayesian model using Stan, or any number of other linear model backends such as glmnet, keras or spark.

spec_stan <-
  spec_lin_reg %>%
  set_engine("stan", chains = 4, iter = 1000) # set engine specific arguments

fit_stan <- fit(spec_stan, Y ~ X1 + X2, data = my_data)

On its own, parsnip provides a time saving framework for exploring multiple models. It is really nice not to have to worry about the idiosyncratic syntax developed for different model algorithms. But the real power of tidymodels is baked into the recipes package. Recipes are structures that bind a sequence of preprocessing steps to a training data set. They define the roles that the variables are to play in the design matrix, specify what data cleaning needs to take place, and what feature engineering needs to happen.

To see how all of this comes together, let’s look at the recipe used in the tidymodels recipes tutorial, which uses the New York City flights data set, nycflights13. We assume that all of the data wrangling code in the tutorial has been executed, and we pick up with the code to define the recipe:

flights_rec <-
  recipe(arr_delay ~ ., data = train_data) %>%
  update_role(flight, time_hour, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_rm(date) %>%
  step_dummy(all_nominal(), -all_outcomes())

    The first line identifies the variable arr_delay as the variable to be predicted and the other variables in the data set train_data to be predictors. The second line amends that by updating the roles of the variables flight and time_hour to be identifiers and not predictors. The third and fourth lines continue with the feature engineering by creating a new date variable and removing the old one. The last line explicitly converts all categorical or factor variables into binary dummy variables.

    The recipe is ready to be evaluated, but if a modeler thought that she might want to keep track of this workflow for the future, she might bind the recipe and model together in a workflow() that saves everything as a reproducible unit with a command something like this.

lr_mod <- logistic_reg() %>%
  set_engine("glm")

flights_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(flights_rec)

Then, fitting the model is just a matter of calling fit() with the workflow as a parameter.

    flights_fit <- fit(flights_wflow, data = train_data)

    At this point, everything is in place to complete a statistical analysis. A modeler can extract coefficients, p-values etc., calculate performance statistics, make statistical inferences and easily save the workflow in a reproducible markdown document. However the real gains from tidymodels become apparent when the modeler goes on to build predictive models.
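For example, with the tidymodels packages loaded, the underlying parsnip fit can be pulled out of the fitted workflow and tidied into a table of coefficients and p-values. This is a small sketch based on the tidymodels getting-started material; the helper name may differ across workflows versions.

flights_fit %>%           # the fitted workflow from above
  pull_workflow_fit() %>% # extract the underlying parsnip model object
  tidy()                  # coefficients, standard errors, p-values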

    The following diagram from Kuhn and Johnson (2019) illustrates a typical predictive modeling workflow.

It indicates that before going on to predict model performance on new data (the test set), a modeler will want to make use of cross validation or some other resampling technique to first evaluate the performance of multiple candidate models, and then tune the selected model. This is where the great power of the recipe() and workflow() constructs becomes apparent. In addition to encouraging experiments with multiple models by rationalizing algorithm syntax, providing interchangeable model constructs, and enabling modelers to grow chains of recipe steps with the pipe operator, recipes helps to enforce good statistical practice.

    For example, although it is common practice to split the available data between training and test sets before preprocessing the training data set, it is also very common to see pipelines where data preparation is applied to the entire training set at one go. It is not common to see data cleaning and preparation processes individually applied to each fold of a ten-fold cross validation effort. But, that is exactly the right thing to do to mitigate the deleterious effects of data imputation, centering and scaling and numerous other preparation steps that contribute to bias and limit the predictive value of a model. This is the whole point of resampling, but it is not easy to do in a way that saves necessary intermediate artifacts, and provides a reproducible set of instructions for others on the modeling team.

Because recipes are not evaluated until the model is fit, tidymodels workflows make an otherwise laborious and error prone process very straightforward. This is a game changer!

    The next two lines of code set up and execute ten-fold cross-validation for our example.

set.seed(123)
folds <- vfold_cv(train_data, v = 10)

flights_fit_rs <- fit_resamples(flights_wflow, folds)

    And then, another line of code collects the metrics over the folds and prints out the statistics for accuracy and area under the ROC curve.

    collect_metrics(flights_fit_rs)

So, here we are with a mediocre model, and I’ll stop now, having shown you only a small portion of what tidymodels can do, but enough, I hope, to motivate you to take a closer look. tidymodels.org is a superbly crafted website with multiple layers of documentation. There are sections on packages, getting started guides, detailed tutorials, help pages and a section on making contributions.

    Happy modeling!



    To leave a comment for the author, please follow the link and comment on their blog: R Views.


    Easy ggplot2 Theme customization with {ggeasy}


    [This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we’ll learn about {ggeasy}, an R package by Jonathan Carroll. The goal of {ggeasy} is to help R programmers make ggplot2 theme customizations with simple, easy English functions (much easier than playing with theme()). We use a dataset generated by {fakir} for this tutorial.

    Youtube: https://youtu.be/iAH1GJoBZmI

    Video Tutorial

    Code

library(fakir)
library(tidyverse)
library(ggeasy)

# generate dataset
clients <- fakir::fake_ticket_client(100)

# rotate x axis labels
clients %>%
  count(state) %>%
  ggplot() + geom_col(aes(state, n)) +
  easy_rotate_x_labels()

# color the text and increase text size
clients %>%
  count(state) %>%
  ggplot() + geom_col(aes(n, state), fill = "orange") +
  easy_text_color("orange") +
  easy_text_size(25, teach = TRUE)

# move legend position
clients %>%
  count(state, source_call) %>%
  # View()
  ggplot() + geom_col(aes(n, state, fill = source_call)) +
  # easy_move_legend("bottom", teach = TRUE)
  theme(legend.position = "bottom")


    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.


    R is for read_


    [This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    The tidyverse is full of functions for reading data, beginning with “read_”. The read_csv I’ve used to access my reads2019 data is one example, falling under the read_delim functions. read_tsv allows you to quickly read in tab-delimited files. And you can also read in files with other delimiters, using read_delim and specifying the delimiter used. You can also tell R if the file contains column names and whether those should be read in too, using col_names = TRUE.
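For example, the three delimited-file readers look like this (the file names here are hypothetical):

library(readr)

reads2019 <- read_csv("reads2019.csv")              # comma-separated
scores    <- read_tsv("scores.tsv")                 # tab-separated
survey    <- read_delim("survey.txt", delim = "|",  # any other delimiter
                        col_names = TRUE)           # first row holds column names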

    But there are many more read_ functions you can use:

    • read_clip: Data from the clipboard
• read_ + dta, sas, or spss: Data from other statistical programs
    • read_json: JSON data
    • read_fwf: Fixed-width files
    • read_lines: Line from a file
    • read_excel: Excel files – you’ll also need to include the worksheet name or number

    All of these functions are included as part of the tidyverse packages, though for some, you may need to load the single package if it doesn’t automatically load with library(tidyverse) – this includes haven (for dta, sas, and spss) and readxl (for read_excel).
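A quick sketch of those extra packages in action (the file, sheet, and object names are made up for illustration):

library(haven)
library(readxl)

spss_dat  <- read_spss("survey.sav")                  # SPSS
stata_dat <- read_dta("panel.dta")                    # Stata
excel_dat <- read_excel("books.xlsx", sheet = "2019") # worksheet by name or number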

    You can find out more about a particular function by typing ?[functionname] into the R console. Or use ?? before to search all of R help for a particular string, such as ??read_.

    Tomorrow, let’s talk about summarizing data!


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


    Recreating a Shiny App with Flask


    [This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

So RStudio Connect has embraced Python and now runs Flask applications! At Jumping Rivers we make a lot of use of R, shiny, and Python for creating visual tools for our clients. Shiny has a lot of nice features, in particular it is very fast for prototyping web applications. Over our morning meeting we discussed the fact that flask will soon be coming to RStudio products and wondered how easy it would be to recreate one of the simple shiny examples as a flask application. As I suggested it, I got the job of playing about with flask for an hour to recreate the faithful eruptions histogram shiny demo – the finished result is hosted on our Connect server. For this post it is not required that you know shiny, but I will make reference to it in various places.

    Spoiler

    Shiny and flask are different tools with different strengths. If you have a simple shiny application, shiny is by far the quicker tool to get the job done and requires less knowledge about web based technologies. However, flask gives us much greater flexibility than could easily be achieved in shiny.

    I’m hoping that this will turn into a series of blog posts about flask for data science applications.

    With that in mind, lets treat this as an exercise in creating a simple flask application for visualisation of some data with a little interactivity. For reference, the shiny application I am referring to can be viewed alongside the tutorial on how to build it.

    What is Flask?

    Flask is a micro web framework written in Python with a wealth of extensions for things like authentication, form validation, ORMs. It provides the tools to build web based applications. Whilst it is very light, combined with the extensions it is very powerful. Meaning that your web application might be a simple API to a machine learning model that you have created, or a full web application. Pinterest and LinkedIn use flask for example.

    Set up your app

    Create a directory structure in which to house your project, we can quickly create a project root and other necessary directories. We will explain the utility of each of the directories shortly.

    mkdir -p app_example/{static,templates,data} && cd app_example

    I highly recommend for all Python projects to set up a virtual environment. There are a number of tools for managing virtual environments in Python, but I tend to use virtualenv. We can create a new virtual environment for this project in the current directory and activate it.

    In 2020 it should also go without saying that we are using Python 3 for this activity, specifically 3.7.3.

virtualenv venv
source venv/bin/activate

    Other directories

    data:

    For our project, we will also want somewhere to house the data. Since we are talking about a very small tabular dataset of 2 variables and 272 cases, any sort of database would be overkill. So we will just read from a csv on disk. We can create this data set via

    Rscript -e "readr::write_csv(faithful, 'data/faithful.csv')"

    templates:

    The visual elements of the flask application will be web pages rendered from html templates. The name templates is chosen specifically here as it is the default directory that your flask app will look for when trying to render web pages.

    static:

    This will be our place to store any static assets, like CSS style sheets and JavaScript code.

    Packages

    For this project we will need some python packages

    • flask (obviously)
• pandas – useful for data manipulation, but in this case just used to read data from disk
• numpy – we will use for calculating the histogram bins and frequencies
• plotly – my preferred graphics library in Python at the moment and well suited to web based applications
    pip install flask pandas plotly numpy

    Choosing your editor

    My editor of choice for anything that is not R related is VScode, which I find particularly suitable for applications that are created using a mixture of different languages. There are lots of plugins for Python, HTML, CSS and JavaScript for the purposes of code completion, snippets, linting and terminal execution which means I can write, test and run all the parts of my application from the comfort of one place.

    Hello Flask

    With everything set up we can start upon our Flask application. One of the things that I really like about flask is the simple syntax for adding URL endpoints to our site. We can create a “hello world” style example with the following python code (saved in this case in app.py)

# required imports
from flask import Flask

# instantiate the application object
app = Flask(__name__)

# create an endpoint with a decorator
@app.route("/")
def hello():
  return "Hello World"

if __name__ == "__main__":
  app.run()

    Back in the terminal we could run this app with

    python app.py

    and view at the default URL of localhost:5000. I think the interesting part of the above code snippet is the route decorator

    @app.route("/)

Routes refer to the URL patterns of our web application. The "/" is effectively the root route, i.e. what you would see at a web address like “mycoolpage.com”. The decorator here allows us to specify a Python function (or handler) that should run when a user navigates to a particular URL within our domain name.

    What our app needs

    We are creating an application here that allows users to choose an input via a slider, which causes a histogram to redraw. For this, our app will need two routes

    • A route for generating the histograms
    • A html page that the user will see

    Creating a histogram

    We could write a function which will draw a histogram using plotly fairly easily.

# imports
from pandas import read_csv
import plotly.express as px
import numpy as np  # needed for the histogram bins below

# read data from disk
faithful = read_csv('./data/faithful.csv')

def hist(bins = 30):
  # calculate the bins
  x = faithful['waiting']
  counts, bins = np.histogram(x, bins=np.linspace(np.min(x), np.max(x), bins+1))
  bins = 0.5 * (bins[:-1] + bins[1:])
  p = px.bar(
    x=bins, y=counts,
    title='Histogram of waiting times',
    labels={
      'x': 'Waiting time to next eruption (in mins)',
      'y': 'Frequency'
    },
    template='simple_white'
  )
  return p

    This is the sort of thing we might create outside of the web application context for visualising this data. If you want to see the plot you might do something like

plot = hist()
plot.show()

    However we want to make some modifications for use in our web application.

    1. We want to turn our work into a flask application. We can start by adding the required imports and structure to our app.py with the hist function in it.

from flask import Flask
# other imports

# instantiate app
app = Flask(__name__)

...

# At the end of our script
if __name__ == '__main__':
  app.run()
2. We want to take the number of bins from a request to our webserver. We could achieve this by taking the number of bins from an argument in the request from the client, instead of from the argument to our function. When a Flask application handles a request, it creates a Request object which can be accessed via the request proxy. Arguments can then be obtained from this context

from flask import request

def hist():
  bins = int(request.args['bins'])
  ...
    3. We want the function to be available at a route. The request context only really makes sense within a request from a client. Since the client is going to ask our application for the histogram to be updated dependent on their input we decorate our function with a route decorator

@app.route('/graph')
def hist():
  ...
    4. Return JSON to send to the client. Instead of returning a figure object, we return some JSON that we can process with JavaScript on the client side

import json
from plotly.utils import PlotlyJSONEncoder

@app.route('/graph')
def hist():
  ...
  return json.dumps(p, cls=PlotlyJSONEncoder)

      If you were to rerun your Flask server now and navigate your browser to localhost:5000/graph?bins=30 you would see the fruit of your labour. Although not a very tasty fruit at the moment, as all you will see is all of the JSON output for your graph. So let’s put the user interface together.

    Creating the user interface

We will want to grab a few front end dependencies. For brevity they are included here by linking to the CDN. The shiny app we are mimicking uses bootstrap for its styling, which we will use too. Similarly the sliderInput() function in shiny uses the ion-rangeslider JS package, so we will too. We will also take the Plotly JS library (for which the plotly python package is a wrapper). We will not need to know how to create these plots in JavaScript, but will use it to take the plot returned from our flask server and render it client side in the browser.

    The head of our HTML file in templates/index.html then looks like

(The head of templates/index.html sets the page title to “Hello Flask”, pulls in the Bootstrap CSS, the ion-rangeslider assets and the Plotly JS library from their CDNs, and references the local stylesheet via url_for.)

The {{}} notation at the bottom here is jinja syntax. Flask makes use of jinja templating for creating web pages that your users will consume. url_for is a function that is automatically available when you render a template using flask, used to generate URLs to views instead of having to write them out manually. Jinja templating is a really neat way to blend raw markup with your python variables and functions and some logic statements like for loops. We haven’t written any style yet, but we will create the file ready for later; we will also create somewhere to contain our JavaScript

    mkdir -p static/{css,js} && touch static/css/app.css static/js/hist.js

With all of the dependencies in place it is relatively easy to create a simple layout. We have a 1/3 to 2/3 layout of two columns for controls and main content respectively, somewhere to contain our input elements and an empty container for our histogram to begin with. The body of our index.html then is:

(The body lays out a Bootstrap row with a sidebar column holding the “Bins” label and the range-slider input, and a main column with the “Hello Flask!” title and an empty container that will hold the histogram.)

    We will go back to our app.py and add the route for this view

@app.route('/')
def home():
  return render_template('index.html')

    Our full app.py is then

# app.py
from flask import Flask, render_template, request
from pandas import read_csv
import plotly.express as px
from plotly.utils import PlotlyJSONEncoder
import json
import numpy as np

faithful = read_csv('./data/faithful.csv')

app = Flask(__name__)

@app.route('/graph', methods=['GET'])
def hist():
  # calculate the bins
  x = faithful['waiting']
  counts, bins = np.histogram(x, bins=np.linspace(np.min(x), np.max(x), int(request.args['bins'])+1))
  bins = 0.5 * (bins[:-1] + bins[1:])
  p = px.bar(
    x=bins, y=counts,
    title='Histogram of waiting times',
    labels={
      'x': 'Waiting time to next eruption (in mins)',
      'y': 'Frequency'
    },
    template='simple_white'
  )
  return json.dumps(p, cls=PlotlyJSONEncoder)

@app.route('/')
def home():
  return render_template('index.html')

if __name__ == '__main__':
  app.run()

Running the server and viewing our work still won’t look very impressive, but we are almost there. At the end of our page we include a JavaScript file; it initialises our ion range slider and uses it to send the chosen value from client to server to ask for the updated plot.

    We can use AJAX (Asynchronous JavaScript and XML) to send the data from the slider to our /graph URL route, and on response, draw a new plotly plot into the div element with the histogram id. We want this function to run when we first load the page and every time a user moves the slider

// hist.js
const updatePlot = (data) => {
  $.ajax({
    url: 'graph',
    type: 'GET',
    contentType: 'application/json;charset=UTF-8',
    data: {
      'bins': data.from
    },
    dataType: 'json',
    success: function(data){
      Plotly.newPlot('histogram', data)
    }
  });
}

$('.js-range-slider').ionRangeSlider({
  type: 'single',
  skin: 'big',
  min: 1,
  max: 50,
  step: 1,
  from: 30,
  grid: true,
  onStart: updatePlot,
  onFinish: updatePlot
});

    Now we are getting somewhere. Run your app and navigate to localhost:5000 to see the control and the output plot. As you drag the slider, the plot will redraw.

    To finish up we will add a little styling, just to get us closer to our shiny example target. In our app.css file under static/css we add the styling around the input controls and make the title stand out a little more.

.title {
  font-size: 2rem;
}
.well {
  background-color: #f5f5f5;
  padding: 20px;
  border: 1px solid #e3e3e3;
  border-radius: 4px;
}

    Rerun our application with

    python app.py

    And voila, at localhost:5000 we have something that fairly closely matches our target. I really like flask as a tool for creating web applications and APIs to data and models. There is an awful lot of power and flexibility available in what can be created using the toolset explored here.

    See the finished result at our Connect server.

    Watch this space for more flask posts where we can start to explore some more interesting applications.


    Jumping Rivers are full service, RStudio certified partners. Part of our role is to offer support in RStudio Pro products. If you use any RStudio Pro products, feel free to contact us (info@jumpingrivers.com). We may be able to offer free support.

    The post Recreating a Shiny App with Flask appeared first on Jumping Rivers.


    To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.


    The performance of small value stocks in bear markets


    [This article was first published on Data based investing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

If you have ever seen comparisons of investment returns with and without reinvested dividends, you know that the difference gets huge as the investment horizon increases. Wouldn’t it be great if you could achieve a similar increase in returns by altering your investment style?
Small cap value stocks have historically achieved much higher returns than typical stocks. Using data from French and Shiller, we can calculate that the average yearly (CAGR) total return with dividends has been 14.3 percent for US small cap value stocks (doubling every 5.2 years) and 10.1 percent for the S&P 500 index (doubling every 7.2 years). This means that it has taken a bit over 18 years for the investment in the small cap value stocks to have become twice as large as the same investment in the index. For comparison, without reinvested dividends, small cap value stocks have returned 10.8 percent (doubling every 6.8 years) and the S&P 500 has returned 6.1 percent (doubling every 11.7 years) in the same time period. The outperformance gap has therefore been a bit over four percent per year historically, which is about as big as the gap between the investment in the index with and without reinvested dividends.
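The doubling times quoted above follow directly from the growth rates; here is a quick sanity check in R (my own calculation, not part of the original analysis):

doubling_time <- function(cagr) log(2) / log(1 + cagr)

doubling_time(0.143)  # small cap value, total return: about 5.2 years
doubling_time(0.101)  # S&P 500, total return: about 7.2 years

# years until the small cap value investment is twice the index investment
log(2) / log(1.143 / 1.101)  # a bit over 18 years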
The outperformance of small cap value investing doesn’t, however, come without costs. Value investing has been underperforming for the longest time in history (pdf). Long periods of underperformance are not unusual since the underperformance tends to happen for years at a time, especially during bull markets. In addition to value stocks, the strategy relies on small cap stocks, which have also underperformed large cap stocks during the past ten years in the US (source). Now in the latest bear market, the small cap value strategy has also so far performed worse than the index, of course for a good reason, but it raises the question of what the future performance might look like.
    The recent underperformance does not mean that the performance will necessarily continue to be weak in this market, as in this post I’ll demonstrate that the strategy has outperformed the index not only during but also after the bottoms of bear markets. Bear markets are defined as markets where the index has fallen more than twenty percent.
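As a rough sketch of that definition (my own illustration; the author’s actual code is linked at the end of the post), a bear market can be flagged from a monthly total return index by tracking the drawdown from the running peak:

# index: a numeric vector of monthly index levels (e.g. cumulative total return)
flag_bear <- function(index, threshold = 0.20) {
  drawdown <- 1 - index / cummax(index)  # fall from the running peak
  drawdown >= threshold                  # TRUE while down 20 percent or more
}

set.seed(1)
fake_index <- cumprod(1 + rnorm(120, 0.008, 0.05))  # made-up monthly returns
which(flag_bear(fake_index))                        # months spent in bear territory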

    The data of small cap value stock returns is from Kenneth French’s data library. In the data, the market is split into six parts: small and large companies and low, medium and high book-to-market companies. We’ll use the “old” definition of value since the strategy selects companies with the largest book-to-market values, which is the same as selecting companies with the smallest price-to-book values (P/B) excluding also companies with negative book values. We’ll use nominal i.e. non-inflation-adjusted returns, and we also take reinvested dividends into account. The returns we’ll use are monthly returns, which may not capture the shortest bear markets in the data.

Let’s first take a look at the returns beginning from the peak preceding each bear market and continuing for the next ten years after the peak. The green line represents the returns of the small cap value stocks, and the black line represents the returns of the S&P 500, while the black horizontal line represents the boundary for bear markets.
Since the data is monthly and begins in 1926, it captures seven different bear markets. Notice that the month and year of the peak are in the title of each of the seven plots. We can see that the small cap value strategy has outperformed in all of the cases except the Great Depression, and usually by a wide margin. The average return of the index during the ten years was 79.7 percent, or 6.0 percent annually, while the average return of the small cap value strategy was over double that at 217 percent, or 12.2 percent annually. The returns were negative only after two out of the seven bear markets for the index, and only once for the small cap value strategy.
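As a quick check of the annualised figures here and later in the post (my own arithmetic), a ten-year cumulative return converts to a CAGR as follows:

annualise <- function(total_return, years = 10) (1 + total_return)^(1 / years) - 1

annualise(0.797)  # index over the ten years: about 6.0 percent per year
annualise(2.17)   # small cap value over the ten years: about 12.2 percent per year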
    Since small cap value has underperformed the index so far in the latest bear market, we should look at how the strategy has performed from the bottoms of the bear markets instead of looking at data beginning at the previous peaks. Below is the same plot but with the bottom of the bear market as the starting point of each plot. This time the title represents the month and the year the market bottomed in instead of when it peaked.

    We can of course not know in advance when the bottom will be, but the intention is rather to show whether the performance is weaker after the bottom than during the decline before the bottom. All of the returns were positive from the bottoms of the bear markets, which denotes that none of the bear markets were followed by another bear market that would have exceeded the previous decline. The return for the index for the ten years following the bottom was 242 percent, or 13.1 percent annually, and the return for the small cap value strategy was an astonishing 592 percent, or 21.3 percent annually. This means that the investment would have doubled every 3.6 years! The only time the small cap value strategy underperformed compared to the index was the bear market that bottomed in 2009, but the underperformance wasn’t too substantial.
As can be seen from the plots, small cap value is a quite volatile strategy, which in addition tends to underperform for years at a time. It has, however, been one of the most profitable strategies, especially right after a bear market has bottomed. The outperformance has been slightly stronger during the downturns (in relative terms) than from the bottom, achieving a return of almost double that of the index. Since the strategy has outperformed every time except for once from the bottoms of the last seven bear markets, it has a good chance of outperforming the index this time, too. This, together with the fact that the returns of the strategy tend to mean revert, i.e. periods of outperformance tend to follow periods of underperformance, gives the strategy one of the best starting points for the following decade.
    Be sure to follow me on Twitter for updates about new blog posts like this!
    The R code used in the analysis can be found here.

    Notice that the S&P 500 index has been reconstructed by Shiller for the years it did not exist yet.
    The file which includes the small cap value returns in the French data library is called “6 Portfolios Formed on Size and Book-to-Market (2 x 3)”.
    Notice that the choice of using total return data changes the definition of a bear market and its bottom a bit, but the same seven bear markets would be found also with the price return data.

    To leave a comment for the author, please follow the link and comment on their blog: Data based investing.


    Getting to the Right Question


    [This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    The Root Problem: We Don’t All Speak the Same Language

    Organizations across the modern business world recognize the critical importance of Data Science for competitive advantage. That recognition has driven Glassdoor to rate Data Scientist as one of the 25 top paying jobs in America in 2020.

    However, many organizations struggle to put these data scientists’ knowledge to work in their businesses where they can actually have an impact on success. We hear data scientists say, “The business can’t really tell us what they want, so they waste a lot of our time.” And in return, business people often say, “Our data scientists are really smart, but the applications they build too often fall short of what we’re looking for.”

    The problem here is that data scientists and business people speak very different languages. Specifically, they struggle to understand each other around:

    • data. When a business person thinks about data associated with the business, they often are thinking about data that they can see in Web pages or spreadsheets. Data scientists, on the other hand, are usually looking for data that they can access using an Application Programming Interface or API.
    • process. When business people think about the process for analysis, they tend to think in people-centric terms along the lines of “Becky takes the order, and then transmits that data to George.” Data scientists, on the other hand, usually think of process as a series of automated programs that works without people.
    • results. Data scientists think of a result as an analysis running and producing correct output. Business people see a result as something that has an effect on the organization’s (usually financial) metrics. These are rarely the same thing, at least in the first version of a data science project.

    Both points of view are valid – they just aren’t the same, which creates a communications gap.

    Iterative Development Can Overcome the Communications Gap

    “An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.”–John Tukey

    Astellas’ Aymen Waqar discusses the analytics communications gap in a video embedded in the original post.

    These communications gaps are part of a larger challenge of defining (and refining) the problem. While your business stakeholder might believe they have a clear definition of the problem they are trying to solve, they may not understand whether the data is available, how complex the modeling might be or how long building a model on large data might take, or what adjacent problems might be potentially more valuable and/or far simpler to solve. So, before starting the development process, the data scientist and the business stakeholder must explore and discuss the problem in enough detail to create a realistic development plan. And while data scientists and business people may struggle to understand each other’s words, they usually can agree if they can just see a working model. The difficult part is getting to that working model.

    A Commonly Used Data Exploration Process Can Help

    One way to get to agreement is to break down the project into simpler pieces and get agreement on each piece before moving on to the next. Garrett Grolemund and Hadley Wickham propose the following process below in their book R for Data Science. This process isn’t specific to any technology such as R or Python. Rather it’s a way to get your data scientist and business sponsor to come to consensus on what question they are attacking.

    A visualization of the data science process

    The four steps are

    1. Import. Identify the data you plan to use, and focus first on importing that data so you can work with it.
    2. Tidy. Now that you have the data in hand, reshape and manipulate the data into a form that your analysis tools can easily work with.
    3. Understand. This step is where your data scientists should be interacting most with sponsors by turning the data into visuals and models, and getting feedback about whether they satisfy the business needs.
    4. Communicate. Once you have consensus on what you’re building, this is where you simplify and polish the result so that everyone will understand the result.
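
    To make these four steps concrete, here is a minimal, hypothetical sketch in R of what a first pass might look like. The file name, column names, and model are invented for illustration and would differ in any real project.

    library(tidyverse)

    # Import: read the raw data (hypothetical file)
    orders <- read_csv("orders.csv")

    # Tidy: reshape so each row is one region-month observation
    orders_tidy <- orders %>%
      pivot_longer(starts_with("month_"), names_to = "month", values_to = "sales")

    # Understand: visualize and model, then review the output with the business sponsor
    ggplot(orders_tidy, aes(month, sales, group = region, color = region)) +
      geom_line()
    fit <- lm(sales ~ month + region, data = orders_tidy)

    # Communicate: polish a short summary that everyone can read
    summary(fit)
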

    Four Recommendations For Applying This Process

    Many data scientists (or at least those who have read R for Data Science) use this type of process for doing analysis. However, fewer think of using it as a communications tool to ensure they are answering the proper business questions. You can help your data scientists apply this approach; encourage them to:

    • Schedule check-ins at each step. Before you begin, set up regular check-ins with your business sponsors. Ideally, these should roughly correspond with the development phases listed above to ensure that everyone is in sync before moving on to the next phase.
    • Use rapid prototyping tools and languages. R and Python are the tools of choice for most data scientists because they are well-suited to the type of iterative development process being described here. Both languages speed development and have excellent visualization tools which will help drive consensus.
    • Document progress using public documents. Use a single Google Docs file to record each meeting and to record decisions. Don’t start a new document with each meeting, but simply prepend the date and the most recent meeting notes at top. By the time the project is done, you’ll have a record of the entire process from beginning to end which will help plan future projects.
    • Defer performance concerns until you have an agreed result. Too many projects get bogged down designing for full-scale deployment before they actually know what they are building. Instead, develop a prototype that everyone agrees is the right idea, and then revise it to scale up when you decide to put it into production. This approach simplifies early decision-making and doesn’t waste precious project time on premature optimizations.

    Once the application satisfies both your data scientists and business stakeholders, you’ll want to share the finished application with the wider business community. One of the easiest ways to do this is through RStudio Connect, which can help you rapidly refine your content during the prototyping phase, and share it widely and consistently in the production phase. We will talk more about that in our next blog post. Meanwhile, to learn more about how Connect can add push-button publishing, scheduled execution of reports, and flexible security policies to your team’s data science work, please visit the RStudio Connect product page.


    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


    S is for summarise


    [This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    Today, we’ll finally talk about summarise! It’s very similar to mutate, but instead of adding or altering a variable in a dataset, it aggregates your data, creating a new tibble with the columns containing your requested summary data. The number of rows will be equal to the number of groups from group_by (if you don’t specify any groups, your tibble will have one row that summarizes your entire dataset).

    These days, when I want descriptive statistics from a dataset, I generally use summarise, because I can specify the exact statistics I want in the exact order I want (for easy pasting of tables into a report or presentation).

    Also, if you’re not a fan of the UK spelling, summarize works exactly the same. The same is true of other R/tidyverse functions, like color versus colour.

    Let’s load the reads2019 dataset and start summarizing!

    library(tidyverse)
    ## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- 
    ## ggplot2 3.2.1     purrr   0.3.3
    ## tibble  2.1.3     dplyr   0.8.3
    ## tidyr   1.0.0     stringr 1.4.0
    ## readr   1.3.1     forcats 0.4.0
    ## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()
    reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                          col_names = TRUE)

    ## Parsed with column specification:
    ## cols(
    ##   Title = col_character(),
    ##   Pages = col_double(),
    ##   date_started = col_character(),
    ##   date_read = col_character(),
    ##   Book.ID = col_double(),
    ##   Author = col_character(),
    ##   AdditionalAuthors = col_character(),
    ##   AverageRating = col_double(),
    ##   OriginalPublicationYear = col_double(),
    ##   read_time = col_double(),
    ##   MyRating = col_double(),
    ##   Gender = col_double(),
    ##   Fiction = col_double(),
    ##   Childrens = col_double(),
    ##   Fantasy = col_double(),
    ##   SciFi = col_double(),
    ##   Mystery = col_double(),
    ##   SelfHelp = col_double()
    ## )

    First, we could use summarise to give us some basic descriptives of the whole dataset. If we want to save the results to a tibble, we would give it a new name, or we could just have it display those results and not save them. Here’s what happens when I request a summary without saving a new tibble.

    reads2019 %>%
      summarise(AllPages = sum(Pages),
                AvgLength = mean(Pages),
                AvgRating = mean(MyRating),
                AvgReadTime = mean(read_time),
                ShortRT = min(read_time),
                LongRT = max(read_time),
                TotalAuthors = n_distinct(Author))

    ## # A tibble: 1 x 7
    ##   AllPages AvgLength AvgRating AvgReadTime ShortRT LongRT TotalAuthors
    ## 1    29696      341.      4.14        3.92       0     25           42

    Now, let’s create a summary where we do save it as a tibble. And let’s have it create some groups for us. In the dataset, I coded author gender, with female authors coded as 1, so I can find out how many women writers are represented in a group by summing that variable. I also want to fill in a few missing publication dates, which seems to happen for Kindle version of books or books by small publishers. This will let me find out my newest and oldest books in each group; I just arrange by publication year, then request last and first, respectively. Two books were published in 2019, so I’ll replace the others based on title, then have R give the remaining NAs a year of 2019.

    reads2019 %>%
      filter(is.na(OriginalPublicationYear)) %>%
      select(Title)

    ## # A tibble: 5 x 1
    ##   Title
    ##   <chr>
    ## 1 Empath: A Complete Guide for Developing Your Gift and Finding Your Sense of S…
    ## 2 Perilous Pottery (Cozy Corgi Mysteries, #11)
    ## 3 Precarious Pasta (Cozy Corgi Mysteries, #14)
    ## 4 Summerdale
    ## 5 Swarm Theory

    reads2019 <- reads2019 %>%
      mutate(OriginalPublicationYear = replace(OriginalPublicationYear,
                                               Title == "Empath: A Complete Guide for Developing Your Gift and Finding Your Sense of Self",
                                               2017),
             OriginalPublicationYear = replace(OriginalPublicationYear,
                                               Title == "Summerdale", 2018),
             OriginalPublicationYear = replace(OriginalPublicationYear,
                                               Title == "Swarm Theory", 2016),
             OriginalPublicationYear = replace_na(OriginalPublicationYear, 2019))

    genrestats <- reads2019 %>%
      filter(Fiction == 1) %>%
      arrange(OriginalPublicationYear) %>%
      group_by(Childrens, Fantasy, SciFi, Mystery) %>%
      summarise(Books = n(),
                WomenAuthors = sum(Gender),
                AvgLength = mean(Pages),
                AvgRating = mean(MyRating),
                NewestBook = last(OriginalPublicationYear),
                OldestBook = first(OriginalPublicationYear))

    Now let’s turn this summary into a nicer, labeled table.

    genrestats <- genrestats %>%
      bind_cols(Genre = c("General Fiction", "Mystery", "Science Fiction",
                          "Fantasy", "Fantasy SciFi", "Children's Fiction",
                          "Children's Fantasy")) %>%
      ungroup() %>%
      select(Genre, everything(), -Childrens, -Fantasy, -SciFi, -Mystery)

    library(expss)

    ##
    ## Attaching package: 'expss'

    ## The following objects are masked from 'package:stringr':
    ##
    ##     fixed, regex

    ## The following objects are masked from 'package:dplyr':
    ##
    ##     between, compute, contains, first, last, na_if, recode, vars

    ## The following objects are masked from 'package:purrr':
    ##
    ##     keep, modify, modify_if, transpose

    ## The following objects are masked from 'package:tidyr':
    ##
    ##     contains, nest

    ## The following object is masked from 'package:ggplot2':
    ##
    ##     vars

    as.etable(genrestats, rownames_as_row_labels = NULL)
    Genre               Books  WomenAuthors  AvgLength  AvgRating  NewestBook  OldestBook
    General Fiction        15            10      320.1        4.1        2019        1941
    Mystery                 9             8      316.3        3.8        2019        1950
    Science Fiction        19             4      361.4        4.4        2019        1959
    Fantasy                19             3      426.3        4.2        2019        1981
    Fantasy SciFi           2             0      687.0        4.5        2009        2006
    Children's Fiction      1             0      181.0        4.0        2016        2016
    Children's Fantasy     16             1      250.6        4.2        2008        1900

    I could have used other base R functions in my summary as well – such as sd, median, min, max, and so on. You can also summarize a dataset and create a plot of that summary in the same code.

    library(ggthemes)
    ## Warning: package 'ggthemes' was built under R version 3.6.3 
    reads2019 %>%
      mutate(Gender = factor(Gender, levels = c(0, 1),
                             labels = c("Male", "Female")),
             Fiction = factor(Fiction, levels = c(0, 1),
                              labels = c("Non-Fiction", "Fiction"),
                              ordered = TRUE)) %>%
      group_by(Gender, Fiction) %>%
      summarise(Books = n()) %>%
      ggplot(aes(Fiction, Books)) +
      geom_col(aes(fill = reorder(Gender, desc(Gender)))) +
      scale_fill_economist() +
      xlab("Genre") +
      labs(fill = "Author Gender")
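
    To illustrate the earlier point about other base R summary functions, a sketch like this (reusing the Pages column from reads2019 above) drops sd, median, min, and max straight into summarise:

    reads2019 %>%
      summarise(MedianLength = median(Pages),
                SDLength = sd(Pages),
                Shortest = min(Pages),
                Longest = max(Pages))
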

    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.



    Using Python to Cheat at Scrabble


    [This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    My New Year’s Resolution was to learn Python. After taking a few online courses, I became comfortable enough with the language to tackle a small side project. Side projects are great for learning a language because they let you “own” a project from start to finish as well as solve a problem that is of genuine interest to you. Although I had been interested in having a Python side project for some time, it took me a while to find one that genuinely interested me.

    This all changed during the COVID-19 lockdowns. In order to pass the time my mother (a retired English teacher) became obsessed with Scrabble and insisted on playing game after game with me. The problem is that I hate the game, am not good at it, and kept on losing. Eventually I realized that it would be straightforward to write a program in Python that looked at my rack of letters and listed the highest-scoring word I could create. Voila – my first Python side project was born!

    I just wrapped up this project and decided to share it because it might help others who are interested in Python. Most people read my blog because of Choroplethr (my suite of R packages for mapping open datasets) or my various R trainings. However, over time I’ve learned that many of my readers are also interested in Python. Additionally, most data-related jobs in Industry (as opposed to Academia) use Python rather than R.

    You can view the “Scrabble Cheat” project on github here. The key function is get_all_words, which takes a string that represents a set of tiles. It returns a list of tuples that represent valid words you can form from those letters, along with their score in Scrabble. The list is ordered so that the highest-scoring word appears first:

    > get_all_words('ilzsiwl')
    [('zills', 16),
    ('swiz', 16),
    ('zill', 15),
    ('wiz', 15),
    ('liz', 13),
    ('isz', 12),
    ('zs', 11),
    ('wills', 10),
    ('swill', 10),
    ('willi', 10),
    ...
    ]

    This post will help you make sense of this output (i.e. “what is a list of tuples, and why is the data structured this way?”). But first, it’s useful to do a compare-and-contrast between Base R and Python Built-ins.

    Base R vs. Python Built-ins

    One of the central concepts in R is the distinction between “Base R” and “Packages you choose to install”. Base R, while itself a package, cannot be uninstalled, and contains core language elements like data.frame and vector. “Base R” also colloquially refers to “all the packages that ship with R and are available when you load it” such as utils, graphics and datasets.

    One of the more confusing things about R is that people are increasingly moving away from Base R to 3rd party libraries for routine tasks. For example, the utils package has a function read.csv for reading CSV files. But the read_csv function from the package readr is actually faster and does not automatically convert strings to factors, which is often desirable. Similarly, the graphics package has a plot function for making graphs, but the ggplot function in the ggplot2 package is much more popular.
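
    For readers who haven’t seen this contrast, a small sketch shows the difference between the two readers; the file name here is hypothetical:

    # Base R reader: in R 3.x, strings become factors by default
    books_base <- read.csv("books.csv")

    # readr reader: faster, keeps strings as character, and returns a tibble
    books_tidy <- readr::read_csv("books.csv")
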

    This split between “functionality that ships with R” and “how people ‘in the know’ actually use R” is inherently confusing. Python’s equivalent of “Base R” is called “Built-ins”. (You can see the full list of Python’s Built-ins here). But unlike R, it appears that people are generally happy with Python’s Built-ins, and do not recreate that functionality in other packages. In fact, when talking to my friends who teach Python, they emphasized that expertise in Python often comes down to having fluency with the Built-ins.

    Python’s Built-in Data Structures

    The main Built-in Data Structures that I used in this project are Dictionaries, Lists and Tuples.

    Dictionaries

    Dictionaries (often just called Dicts) define a key-value relationship. For example, each Scrabble letter can be viewed as a key, and its numeric score can be viewed as its value. We can store this information in a Python Dict like this:

    > letter_scores = {'a': 1,  'b': 4,  'c': 4, 'd': 2,
                     'e': 1,  'f': 4,  'g': 3, 'h': 3,
                     'i': 1,  'j': 10, 'k': 5, 'l': 2,
                     'm': 4,  'n': 2,  'o': 1, 'p': 4,
                     'q': 10, 'r': 1,  's': 1, 't': 1,
                     'u': 2,  'v': 5,  'w': 4, 'x': 8,
                     'y': 3,  'z': 10}
    
    > letter_scores['a']
    1
    > letter_scores['z']
    10

    The Dict itself is defined by curly braces. Each key-value pair within the Dict is defined by a colon, and each element of the dict is separated by a comma.

    The page on Built-ins says that Dicts are created with the keyword dict. However, they can also be created with the symbol { }. As a rule of thumb, Python programmers prefer to define data structures with symbols instead of keywords.

    Note that R does not really have an equivalent data structure. In the accepted answer to this question on Stack Overflow people say that a List with Names is as close as you can get. However, there are still significant differences between the two data structures:

    • In a Python Dict, Keys must be unique. In R, List Names do not have to be unique.
    • In a Python Dict, each Key can be of a different type (e.g. int or string). In R, all List Names must be of the same type.
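
    For comparison, here is roughly what the closest R analogue looks like, reusing a few of the letter scores from the Dict above:

    # The closest R analogue to a Python Dict: a named list
    letter_scores <- list(a = 1, b = 4, z = 10)

    letter_scores[["a"]]   # 1
    letter_scores[["z"]]   # 10

    # Unlike Dict keys, names need not be unique; lookup returns the first match
    dup <- list(a = 1, a = 99)
    dup[["a"]]             # 1
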

    Lists

    Lists are probably the most common type in Python. They are similar to Vectors in R, in that they are meant to store multiple elements of the same type. However, R strictly enforces this requirement, while Python does not.

    Scrabble Cheat uses a List to store the contents of a file that contains a dictionary of English words. We then iterate over this list to see which words can be spelled with the user’s tiles. Here is code to read in the dictionary from a file:

    all_words = open('words_alpha.txt').read().split()
    all_words
    >>> ['a',
    'aa',
    'aaa',
    'aah',
    'aahed',
    'aahing',
    'aahs',
    ... ]

    Here we open the file with open and read it in as a string with read. The split function breaks the string into a list of smaller strings, using whitespace as the delimiter. This type of function chaining is very common in Python.
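
    For comparison, reading the same word list into R might look like this; scan() splits on whitespace much as Python’s read().split() does:

    all_words <- scan("words_alpha.txt", what = character(), quiet = TRUE)
    head(all_words)
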

    Tuples

    Tuples are used to store data that has multiple components. For example, a location on a map has two components: longitude and latitude. Tuples are also immutable, which means that you cannot change their values after creation.

    Scrabble Cheat tells you each word that your tiles can make, along with the Scrabble score of that word. Each (word, score) pair is stored as a Tuple. Because each set of tiles can normally make multiple words, the return value of get_all_words is actually a List of Tuples:

    get_all_words('ttsedue')
    [('etudes', 8),
    ('dustee', 8),
    ('detest', 7),
    ('stude', 7),
    ('tested', 7),
    ('tutees', 7),
    ('suede', 7),
    ('etude', 7),
    ('duets', 7),
    ... ]
    

    In addition to being created with parentheses, Tuples can also be created with the tuple keyword.

    List Comprehensions

    Many languages have functionality for creating a new list as a function from another list. Python provides a way to do this that I have not encountered before. It is called a List Comprehension and has the following template:

    [ object_in_new_list
    for element in old_list
    if condition_is_met ]

    Scrabble Cheat uses a List Comprehension to iterate over a list of English words and pluck out the words which can be spelled with the user’s tiles. If the word can be spelled, then it is put into a Tuple along with its score. The actual code looks like this:

    [(one_word, get_word_score(one_word))
    for one_word in load_words()
    if can_spell_word(one_word, tiles)]

    (The actual code is a bit more complex, and you can see it here.)

    While I had not encountered List Comprehensions before (and they are certainly not a feature in R), they do appear in a number of other programming languages (see 1, 2).
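
    For R readers, a rough equivalent of that list comprehension can be built with Filter() and lapply(). Note that can_spell_word(), get_word_score(), load_words(), and tiles here are stand-ins for the Python helpers in the project, not real R functions:

    # keep only the words the tiles can spell, then pair each with its score
    playable <- Filter(function(w) can_spell_word(w, tiles), load_words())
    scored   <- lapply(playable, function(w) list(word = w, score = get_word_score(w)))
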

    Wrapping Up

    This was a fun project that helped solidify the book knowledge that I had recently gained about Python. It gave me valuable experience with Python’s Built-ins, and the write-up helped me to solidify my understanding of some key differences between R and Python.

    A small confession: the actual game I am playing with my mom is Zynga’s Words with Friends (WWF) not Hasbro’s Scrabble. I consider WWF to be a knock-off of Scrabble, and it is also a bit more clunky to type, so I just refer to it as Scrabble in this post. Also, the dictionary my app uses is much larger than the official WWF dictionary, so many of the words the app recommends you cannot actually use.

    If this post winds up becoming popular, then I can do another one as I continue to learn Python. (I am currently looking for a side project that will give me some experience with Pandas, Matplotlib and/or Seaborn).

    Interested in Learning Python?

    The best resources I found for learning Python came from my friends Reuven Lerner and Trey Hunner. Both are professional Python trainers who (a) specialize in doing live corporate trainings and (b) have recently launched consumer products for individuals. Reuven’s Introductory Python course was especially helpful in getting me quickly up to speed with the basics. Trey’s Python Morsels, which sends you one problem a week, was helpful in forcing me to continue to practice Python every week. (I am not being paid to recommend these courses – I am simply passing along that they helped me).

     

    The post Using Python to Cheat at Scrabble appeared first on AriLamstein.com.


    To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.


    Apple’s COVID Mobility Data


    [This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    Apple recently released a batch of mobility data in connection with the COVID-19 pandemic. The data is aggregated from requests for directions in Apple Maps and is provided at the level of whole countries and also for a selection of large cities around the world. I folded the dataset into the covdata package for R that I’ve been updating, as I plan to use it this Fall in a course I’ll be teaching. Here I’ll take a quick look at some of the data. Along the way—as it turns out—I end up reminding myself of a lesson I’ve learned before about making sure you understand your measure before you think you understand what it is showing.

    Apple released time series data for countries and cities for each of three modes of getting around: driving, public transit, and walking. The series begins on January 13th and, at the time of writing, continues down to April 20th. The mobility measures for every country or city are indexed to 100 at the beginning of the series, so trends are relative to that baseline. We don’t know anything about the absolute volume of usage of the Maps service.
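
    We don’t get the raw request volumes, but conceptually the index is just each day’s volume rescaled so that January 13th equals 100. A tiny sketch, with invented raw numbers, shows the idea:

    reindex <- function(x) 100 * x / x[1]
    reindex(c(1200, 1144, 1212, 1166))   # 100.0  95.3 101.0  97.2
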

    Here’s what the data look like:

    > apple_mobility
    # A tibble: 39,500 x 5
       geo_type       region  transportation_type date       index
       <chr>          <chr>   <chr>               <date>     <dbl>
     1 country/region Albania driving             2020-01-13 100
     2 country/region Albania driving             2020-01-14  95.3
     3 country/region Albania driving             2020-01-15 101.
     4 country/region Albania driving             2020-01-16  97.2
     5 country/region Albania driving             2020-01-17 104.
     6 country/region Albania driving             2020-01-18 113.
     7 country/region Albania driving             2020-01-19 105.
     8 country/region Albania driving             2020-01-20  94.4
     9 country/region Albania driving             2020-01-21  94.1
    10 country/region Albania driving             2020-01-22  93.5
    # … with 39,490 more rows

    The index is the measured outcome, tracking relative usage of directions for each mode of transportation. Let’s take a look at the data for New York.

    raw_ny <- apple_mobility %>%
      filter(region == "New York City") %>%
      select(region:index) %>%
      rename(mode = transportation_type) %>%
      mutate(mode = tools::toTitleCase(mode),
             weekend = isWeekend(date),
             holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
      mutate(max_day = ifelse(is_max(index), date, NA),
             max_day = as_date(max_day))

    p_raw_ny <- ggplot(raw_ny, mapping = aes(x = date, y = index, group = mode, color = mode)) +
      geom_vline(data = subset(raw_ny, holiday == TRUE),
                 mapping = aes(xintercept = date),
                 color = my.colors("bly")[5], size = 2.9, alpha = 0.1) +
      geom_hline(yintercept = 100, color = "gray40") +
      geom_line() +
      geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                      size = rel(2), nudge_x = 1, show.legend = FALSE) +
      scale_color_manual(values = my.colors("bly")) +
      labs(x = "Date", y = "Relative Mobility", color = "Mode",
           title = "New York City's relative trends in activity. Baseline data with no correction for weekly seasonality",
           subtitle = "Data are indexed to 100 for usage on January 13th 2020. Weekends shown as vertical bars. Date with highest relative activity index labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
           caption = "Data: Apple. Graph: @kjhealy") +
      theme(legend.position = "top")

    p_raw_ny
    Relative Mobility in New York City.

    As you can see, we have three series. The weekly pulse of activity is immediately visible as people do more or less walking, driving, and taking the subway depending on what day it is. Remember that the data is based on requests for directions. So on the one hand, taxis and Ubers might be making that sort of request every trip. But people living in New York do not require turn-by-turn or step-by-step directions in order to get to work. They already know how to get to work. Even if overall activity is down at the weekends, requests for directions go up as people figure out how to get to restaurants, social events, or other destinations. On the graph here I’ve marked the highest relative value of requests for directions, which is for foot-traffic on February 22nd. I’m not interested in that particular date for New York, but when we look at more than one city it might be useful to see how the maximum values vary.

    The big COVID-related drop-off in mobility clearly comes in mid-March. We might want to see just that trend, removing the “noise” of daily variation. When looking at time series, we often want to decompose the series into components, in order to see some underlying trend. There are many ways to do this, and many decisions to be made if we’re going to be making any strong inferences from the data. Here I’ll just keep it straightforward and use some of the very handy tools provided by the tidyverts (sic) packages for time-series analysis. We’ll use an STL decomposition to decompose the series into trend, seasonal, and remainder components. In this case the “season” is a week rather than a month or a calendar quarter. The trend is a locally-weighted regression fitted to the data, net of seasonality. The remainder is the residual left over on any given day once the underlying trend and “normal” daily fluctuations have been accounted for. Here’s the trend for New York.

    resids_ny <- apple_mobility %>%
      filter(region == "New York City") %>%
      select(region:index) %>%
      rename(mode = transportation_type) %>%
      mutate(mode = tools::toTitleCase(mode)) %>%
      as_tsibble(key = c(region, mode)) %>%
      model(STL(index)) %>%
      components() %>%
      mutate(weekend = isWeekend(date),
             holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
      as_tibble() %>%
      mutate(max_day = ifelse(is_max(remainder), date, NA),
             max_day = as_date(max_day))

    p_resid_ny <- ggplot(resids_ny, aes(x = date, y = remainder, group = mode, color = mode)) +
      geom_vline(data = subset(resids_ny, holiday == TRUE),
                 mapping = aes(xintercept = date),
                 color = my.colors("bly")[5], size = 2.9, alpha = 0.1) +
      geom_line(size = 0.5) +
      geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                      size = rel(2), nudge_x = 1, show.legend = FALSE) +
      scale_color_manual(values = my.colors("bly")) +
      labs(x = "Date", y = "Remainder", color = "Mode",
           title = "New York City, Remainder component for activity data",
           subtitle = "Weekends shown as vertical bars. Date with highest remainder component labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
           caption = "Data: Apple. Graph: @kjhealy") +
      theme(legend.position = "top")

    p_resid_ny
    Trend component of the New York series.

    We can make a small-multiple plot showing the raw data (or the components, as we please) for all the cities in the dataset if we like:

    p_base_all <- apple_mobility %>%
      filter(geo_type == "city") %>%
      select(region:index) %>%
      rename(mode = transportation_type) %>%
      ggplot(aes(x = date, y = index, group = mode, color = mode)) +
      geom_line(size = 0.5) +
      scale_color_manual(values = my.colors("bly")) +
      facet_wrap(~ region, ncol = 8) +
      labs(x = "Date", y = "Trend", color = "Mode",
           title = "All Modes, All Cities, Base Data",
           caption = "Data: Apple. Graph: @kjhealy") +
      theme(legend.position = "top")

    p_base_all
    Data for all cities.

    This isn’t the sort of graph that’s going to look great on your phone, but it’s useful for getting some overall sense of the trends. Beyond the sharp declines everywhere—with slightly different timings, something that’d be worth looking at separately—a few other things pop out. There’s a fair amount of variation across cities by mode of transport and also by the intensity of the seasonal component. No-one is walking anywhere in Dubai. Some sharp spikes are evident, too, not always on the same day or by the same mode of transport. We can take a closer look at some of the cities of interest on this front.

    focus_on <- c("Rio de Janeiro", "Lyon", "Bochum - Dortmund", "Dusseldorf",
                  "Barcelona", "Detroit", "Toulouse", "Stuttgart",
                  "Cologne", "Hamburg", "Cairo", "Lille")

    raw_ts <- apple_mobility %>%
      filter(geo_type == "city") %>%
      select(region:index) %>%
      rename(mode = transportation_type) %>%
      mutate(mode = tools::toTitleCase(mode),
             weekend = isWeekend(date),
             holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
      filter(region %in% focus_on) %>%
      group_by(region) %>%
      mutate(max_day = ifelse(is_max(index), date, NA),
             max_day = as_date(max_day))

    ggplot(raw_ts, mapping = aes(x = date, y = index, group = mode, color = mode)) +
      geom_vline(data = subset(raw_ts, holiday == TRUE),
                 mapping = aes(xintercept = date),
                 color = my.colors("bly")[5], size = 1.5, alpha = 0.1) +
      geom_hline(yintercept = 100, color = "gray40") +
      geom_line() +
      geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                      size = rel(2), nudge_x = 1, show.legend = FALSE) +
      scale_color_manual(values = my.colors("bly")) +
      facet_wrap(~ region, ncol = 2) +
      labs(x = "Date", y = "Relative Mobility", color = "Mode",
           title = "Relative trends in activity, selected cities. No seasonal correction.",
           subtitle = "Data are indexed to 100 for each city's usage on January 13th 2020. Weekends shown as vertical bars.\nDate with highest relative activity index labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
           caption = "Data: Apple. Graph: @kjhealy") +
      theme(legend.position = "top")
    Selected cities only.

    Look at all those transit peaks on February 17th. What’s going on here? At this point, it might be useful to take a look at the residual or remainder component of the series rather than the raw data, so we can see if something interesting is happening.

    resids <- apple_mobility %>%
      filter(geo_type == "city") %>%
      select(region:index) %>%
      rename(mode = transportation_type) %>%
      mutate(mode = tools::toTitleCase(mode)) %>%
      filter(region %in% focus_on) %>%
      as_tsibble(key = c(region, mode)) %>%
      model(STL(index)) %>%
      components() %>%
      mutate(weekend = isWeekend(date),
             holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
      as_tibble() %>%
      group_by(region) %>%
      mutate(max_day = ifelse(is_max(remainder), date, NA),
             max_day = as_date(max_day))

    ggplot(resids, aes(x = date, y = remainder, group = mode, color = mode)) +
      geom_vline(data = subset(resids, holiday == TRUE),
                 mapping = aes(xintercept = date),
                 color = my.colors("bly")[5], size = 1.5, alpha = 0.1) +
      geom_line(size = 0.5) +
      geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                      size = rel(2), nudge_x = 1, show.legend = FALSE) +
      scale_color_manual(values = my.colors("bly")) +
      facet_wrap(~ region, ncol = 2) +
      labs(x = "Date", y = "Remainder", color = "Mode",
           title = "Remainder component for activity data (after trend and weekly components removed)",
           subtitle = "Weekends shown as vertical bars. Date with highest remainder component labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
           caption = "Data: Apple. Graph: @kjhealy") +
      theme(legend.position = "top")
    Remainder components only.

    We can see that there’s a fair amount of correspondence between the spikes in activity, but it’s not clear what the explanation is. For some cities things seem straightforward. Rio de Janeiro’s huge spike in foot traffic corresponds to the Carnival parade around the week of Mardi Gras. As it turns out—thanks to some local informants for this—the same is true of Cologne, where Carnival season (Fasching) is also a big thing. But that doesn’t explain the spikes that repeatedly show up for February 17th in a number of German and French provincial cities. It’s a week too early. And why specifically in transit requests? What’s going on there? Initially I speculated that it might be connected to events like football matches or something like that, but that didn’t seem very convincing, because those happen week-in week-out, and if it were an unusual event (like a final) we wouldn’t see it across so many cities. A second possibility was some widely-shared calendar event that would cause a lot of people to start riding public transit. The beginning or end of school holidays, for example, seemed like a plausible candidate. But if that were the case why didn’t we see it in other, larger cities in these countries? And are France and Germany on the same school calendars? This isn’t around Easter, so it seems unlikely.

    After wondering aloud about this on Twitter, the best candidate for an explanation came from Sebastian Geukes. He pointed out that the February 17th spikes coincide with Apple rolling out expanded coverage of many European cities in the Maps app. That Monday marks the beginning of public transit directions becoming available to iPhone users in these cities. And so, unsurprisingly, the result is a surge in people using Maps for that purpose, in comparison to when it wasn’t a feature. I say “unsurprisingly”, but of course it took a little while to figure this out! And I didn’t figure it out myself, either. It’s an excellent illustration of a rule of thumb I wrote about a while ago in a similar context.

    As a rule, when you see a sharp change in a long-running time-series, you should always check to see if some aspect of the data-generating process changed—such as the measurement device or the criteria for inclusion in the dataset—before coming up with any substantive stories about what happened and why. This is especially the case for something susceptible to change over time, but not to extremely rapid fluctuations. … As Tom Smith, the director of the General Social Survey, likes to say, if you want to measure change, you can’t change the measure.

    In this case, there’s a further wrinkle. I probably would have been quicker to twig what was going on had I looked a little harder at the raw data rather than moving straight to the remainder component of the time series decomposition. Having had my eye caught by Rio’s big Carnival spike, I went to look at the remainder component for all these cities and so ended up focusing on that. But if you look again at the raw city trends you can see that the transit data series (the blue line) spikes up on February 17th but then sticks around afterwards, settling into a regular presence at quite a high relative level in comparison to its previous non-existence. And this of course is because people have begun to use this new feature regularly. If we’d had raw data on the absolute levels of usage of transit directions, this would likely have been clear more quickly.

    The tendency to launch right into what social scientists call the “Storytime!” phase of data analysis when looking at some graph or table of results is really strong. We already know from other COVID-related analysis how tricky and indeed dangerous it can be to mistakenly infer too much from what you think you see in the data. (Here’s a recent example.) Taking care to understand what your measurement instrument is doing really does matter. In this case, I think, it’s all the more important because with data of the sort that Apple (and also Google) have released, it’s fun to just jump into it and start speculating. That’s because we don’t often get to play with even highly aggregated data from sources like this. I wonder if, in the next year or so, someone doing an ecological, city-level analysis of social response to COVID-19 will inadvertently get caught out by the change in the measure lurking in this dataset.


    To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


    10 Commands to Get Started with Git


    [This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.

    Git and its online extensions like GitHub, Bitbucket, and GitLab are essential tools for data science. While the emphasis is often on collaboration, Git can also be very useful to the solo practitioner.

    The RStudio IDE offers Git functionality via a convenient web-based interface (see the “Git” tab), as well as interaction with git via the command-line (via the “Console” tab, or via the “Git” tab’s “More”->“Shell” menu option).

    Resources such as HappyGitWithR and try.github.io provide detailed setup instructions and interactive tutorials. Once you have set up Git, here are 10 commands and their RStudio GUI interface equivalents to get started using Git.

    I encourage you to practice with both the command-line and the RStudio Git interface. The following commands should eventually become so habitual that they take only seconds to complete.

    git config

    Console command: git config

    Before getting started with git, you should identify yourself with:

    git config --global user.name "Your Name"
    git config --global user.email "you@example.com"

    These values will later show up in all commit and collaboration messages, so make sure they are appropriate for public visibility.

    git init

    Console command: git init

    There are many ways to treat your RStudio project as a git repository, or repo for short. If you create a new project, RStudio will give you the option to use Git with the project:

    To use git with an existing project, click on Tools -> Project Options -> Git/SVN:

    git add

    Console command: git add <filename>

    Git is used to keep track of how files change. The changes to files in your project can be in one of two states:

    • unstaged: changes that won’t be included in the next commit
    • staged: changes that will be included in the next commit

    We use git add to add changes to the staging area. Common changes include:

    • add a new file
    • add new changes to a previously-committed file

    In RStudio we can easily git add both types of changes to the staging area by clicking on the checkbox in the “Staged” column next to the filename:

    In the above example, we have staged two files. The green “A” icon is for Added (as opposed to a blue “M” for Modified, or a yellow “?” for an un-tracked file).

    git reset

    Console command: git reset -- <filename>

    Unstage a file with git reset, or in RStudio, just uncheck the checkbox. It’s that simple.

    git rm

    Console command: git rm <filename>

    Staging the removal of a tracked file looks simple, but be warned:

    you cannot simply undo a git rm <filename> removal by running git reset -- <filename> as we did with git add. git rm <filename> stages the removal of the file so that the next commit knows the file was removed, but it also removes the actual file. This frequently causes panic in new git users when they don’t see the deleted file restored after running git reset -- <filename>. Restoring the file requires two steps:

    • git reset -- <filename>
    • git checkout <filename>

    In RStudio, removal, and undo’ing the removal, are simpler. If you delete a tracked file, you will see a red “D” icon in the “Status” column of the “Git” tab indicating that it has been Deleted. Clicking on the checkbox in the “Staged” column will stage the removal to be included in the next commit. Clicking on the file and then clicking “More”->“Revert” will undo the deletion.

    git status

    Console command: git status

    To see what files are staged, unstaged, as well as what files have been modified, use git status. In RStudio, status is always visible by looking at the Status column of the “Git” tab.

    git commit

    Console command: git commit

    Changes can be saved to the repo’s log in the form of a commit, a checkpoint that includes all information about how the files were changed at that point in time. Adding a concise commit message is valuable in that it allows you to quickly look through the log and understand what changes were made.

    Do not use generic messages like “Updated files”: this greatly reduces the value of the commit log, as well as most other collaborative features. Other recommendations and opinions exist about how best to author a commit message, but the starting point is to include a concise (~50-character) description of what changed.

    Using git commit via the console will open up a console-based editor that allows you to author the commit message.

    In RStudio, click on the “Commit” icon in the “Git” tab. This will open up the “Review Changes” window, which allows you to see what has changed, and stage and unstage files before adding a commit message and clicking on “Commit” to finalize the commit:

    When the commit is complete, you’ll see a message like the following:

    git diff

    Console command: git diff

    git diff will show you what has changed in a file that has already been committed to the repo. The “diff”erence is calculated between what the file looked like at the last commit and what it currently looks like in the working directory. The console git diff command will not show you what new files exist in the unstaged area.

    In RStudio, just click on the “Diff” button in the “Git” tab, to see something like the following:

    Red/green shaded lines indicate lines that were removed/added to the file. In RStudio, new files appearing in the working directory will be entirely green.

    Looking at git diff can be very useful when you want to know “what did I do since I last committed a change to this file?”. A common pattern after completing a segment of work is to use git diff to see what was changed overall, and then split the changes into logical commits that describe what was done.

    For example, if four different files were modified to complete one logical change, add them all in the same commit. If half the files changed were related to one topic and the other half related to another topic, add-and-commit the two sets of files with separate commit messages describing the two distinct topics addressed.

    git log

    Console command: git log

    To see a log of all commits and commit messages, use git log. In the RStudio interface, click on the “History” icon in the “Git” tab. This will pop up a window that shows the commit history:

    As shown above, it will also allow you to easily view what was changed in each commit with a display of the commit’s “diff”.

    git checkout

    When running git status, if there are files in the working directory that have changes, Git will provide the helpful message:

    (use "git checkout -- <file>..." to discard changes in working directory)

    git checkout -- <filename> will discard changes in the working directory, so be very careful about using git checkout: you will lose all changes you have made to a file.

    git checkout is the basis for Git’s very powerful time-traveling features, allowing you to see what your code looks like in another commit that lies backwards, forwards, or even sideways in time. git checkout allows you to create “branches”, which we will discuss in a future post.

    For now, just remember that checking-out a file that has been changed in your working directory will destroy those changes.

    In RStudio, there is no interface to git checkout, probably for good reason. It exposes a lot of functionality and can quickly lead a novice down a delicate path of learning opportunities.

    Next steps

    Now that you have git set up, create a project and play with the above commands. Create files, add them, commit the changes, diff the changes, remove them. See what happens when you attempt to git reset -- <filename> after a git rm <filename>, or see what git checkout <filename> does when you have made changes to that file (hint: it destroys them!).
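
    If you would rather practice without leaving the R console, one option (not part of the workflow described above) is to shell out to git with system2(); the file name and message here are hypothetical:

    system2("git", c("status"))
    system2("git", c("add", "analysis.R"))
    system2("git", c("commit", "-m", shQuote("Add initial analysis script")))
    system2("git", c("log", "--oneline"))
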

    Also look at how RStudio handles these operations, likely intentionally keeping the novice safely within a subset of functionality, while also providing convenient GUI visualizations of diffs, histories, staged-state, and status.

    Committing changes should become a regular part of your workflow, and understanding these essential commands will lay the foundations for the more complex workflows we’ll discuss in a future post.



    To leave a comment for the author, please follow the link and comment on their blog: R Views.


    Miscellaneous Wisdom about R Markdown & Hugo Gained from Work on our Website


    [This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    Whilst working on the blog guide, Stefanie Butland and I consolidated knowledge we had already gained, but it was also the opportunity to up our Rmd/Hugo technical game. Our website uses Hugo but not blogdown1 to render posts: every post is based on an .md file that is either written directly or knit from an .Rmd file. We wanted to provide clear guidance for both options, and to stick to the well-documented Hugo way of e.g. inserting figures. We also wanted to provide post contributors with an as smooth as possible workflow to create a new post. Working on this mission, unsurprisingly we learned a lot, and why not share our newly acquired technical know-how? In this post I shall go through four things we learned about Rmd/Hugo, while trying to provide context about the why of our using them.

    knitr hooks

    Problem: Hugo has a nice figure shortcode with options such as width. How do we make R Markdown output these shortcodes from a code chunk, instead of the usual figure syntax? I.e. how to get2

    {{< figure src="chunkname-1.png" alt="alternative text please make it informative" caption="this is what this image shows, write it here or in the paragraph after the image as you prefer" width="300" >}}

    not

     

    ![alternative text please make it informative](chunkname-1.png)

    to appear – as the result of a chunk producing a figure – in the .md after knitting the .Rmd?

    I asked this question in the friendly French-speaking R Slack workspace3 and got an answer from both Romain Lesur and Christophe Dervieux: the solution was to use a knitr plot hook!

    Person holding a purple crochet hook and white yarn

    Castorly Stock on Pexels.

    knitr hooks are “Customizable functions to run before / after a code chunk, tweak the output, and manipulate chunk options”.

    In the setup chunk4 of the .Rmd file, there should be this code

    knitr::knit_hooks$set(plot = function(x, options) {
      hugoopts <- options$hugoopts
      paste0(
        "{", "{< figure src=", # the original code is simpler
        # but here I need to escape the shortcode!
        '"', x, '" ',
        if (!is.null(hugoopts)) {
          glue::glue_collapse(
            glue::glue('{names(hugoopts)}="{hugoopts}"'),
            sep = " "
          )
        },
        ">}}\n"
      )
    })

    that reads options from the chunk, and uses options from the hugoopts named list if it exists.

    The chunk

    ```{r chunkname, hugoopts=list(alt="alternative text please make it informative", caption="this is what this image shows, write it here or in the paragraph after the image as you prefer", width=300)}
    plot(1:10)
    ```

    produces

    {{< figure src="chunkname-1.png" alt="alternative text please make it informative" caption="this is what this image shows, write it here or in the paragraph after the image as you prefer" width="300" >}}

    in the .md file which is what we wanted.

    Now, a bit later in our website journey, I had a quite similar question: Hugo has nice highlighting options for code fences. How to make R Markdown show R source code with such options? This time there was no need to ask anyone, searching the internet for the name of the right knitr hook was enough: our .Rmd has to feature a knitr source hook. More on that highlighting chapter another time, but in the meantime refer to our standard .Rmd.
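
    To give a flavour of what that looks like, here is a minimal sketch of a source hook that wraps chunk code in Hugo’s built-in highlight shortcode; the hook we actually use in the standard .Rmd linked above is more elaborate:

    # minimal sketch: wrap R source in Hugo's highlight shortcode
    knitr::knit_hooks$set(source = function(x, options) {
      paste0(
        "{", "{< highlight r >}}\n",   # braces split to escape the shortcode, as above
        paste(x, collapse = "\n"),
        "\n{", "{< /highlight >}}\n"
      )
    })
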

    Note that when writing a .md post instead of knitting an .Rmd file, authors can use Hugo syntax directly. And when adding figures in an .Rmd file that come from say a stock photos website rather than a code chunk, authors can also use Hugo syntax, granted they write the shortcode between html_preserve markers, see below the lines I used to add the crochet hook picture in the .Rmd producing this post.

    {{< figure src = "person-holding-purple-crochet-hook-and-white-yarn-3945638.jpg" alt = "Person holding a purple crochet hook and white yarn" link = "https://www.pexels.com/photo/person-holding-purple-crochet-hook-and-white-yarn-3945638/" caption = "Castorly Stock on Pexels" width = "300" class = "center" >}}

    “One post = one folder”: Hugo leaf bundles

    Problem: Our advice to contributors including ourselves was to add their post under content/ but its images under themes/ropensci/static/img/ which is… not smooth. How do we change that?

    Thanks to Alison Hill’s blog post about page bundles I learned you can use a folder to add both a post and its related images to a Hugo website. What an improvement over adding the post in one place, the images in another place like we used to! It’s much smoother to explain to new contributors5. In Hugo speak, each post source is a leaf bundle.

    Figure: Laptop keyboard with a tree leaf beside it (Engin Akyurt on Pexels).

    We did not have to “convert” old posts since both systems can peacefully coexist.

    Below is the directory tree of this very tech note, added as a leaf bundle.

    /home/maelle/Documents/ropensci/roweb2/content/technotes/2020-04-23-rmd-learnings
    ├── blogdownaddin.png
    ├── index.Rmd
    ├── index.md
    ├── orange-mug-near-macbook-3219546.jpg
    ├── person-holding-purple-crochet-hook-and-white-yarn-3945638.jpg
    └── richard-dykes-SPuHHjbSso8-unsplash.jpg

    In R Markdown, in the setup chunk, the option fig.path needs to be set to "" via knitr::opts_chunk$set(fig.path = "").
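
    For reference, here is a minimal sketch of what that line of the setup chunk could look like (the hooks discussed above would sit in the same chunk):

    # keep figure paths relative so images land next to index.md in the leaf bundle
    knitr::opts_chunk$set(fig.path = "")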

    Hugo archetypes and blogdown New Post Addin

    Problem: We link to our two templates from the blog guide, and explain where under content the folder corresponding to a post should be created, but that leaves a lot of work to contributors. How do we distribute our .Rmd and .md templates? How do we help authors create a new post without too much clicking around and copy-pasting?6

    After a bit of digging in Hugo docs and blogdown GitHub, reading a blog post of Alison Hill’s, contributing to blogdown in spontaneous collaboration with Garrick Aden-Buie, getting useful feedback from blogdown maintainer Yihui Xie… here’s the place I got to: storing templates as Hugo directory-based archetypes, and recommending the use of a handy blogdown RStudio addin to create new posts.

    • First of all, an important clarification: in Hugo speak, a template for Markdown content is called an archetype. A template refers to html layout. Not confusing at all, right?

    • We have stored the archetypes for both Markdown and R Markdown posts as directory-based archetypes. The Rmd archetype is stored as index.md, otherwise Hugo doesn’t recognize it (thanks Leonardo Collado-Torres). Posts created using that template should however be saved with the usual .Rmd extension.

    • After reading the explanations around options in the blogdown book, I added an .Rprofile to our website repository.

    if (file.exists('~/.Rprofile')) {
      base::sys.source('~/.Rprofile', envir = environment())
    }
    # All options below apply to posts created via the New Post Addin.
    # to enforce leaf bundles:
    options(blogdown.new_bundle = TRUE)
    # to make blog the subdirectory for new posts by default:
    options(blogdown.subdir = "blog")
    # to help enforce our strict & pedantic style guide 😉
    options(blogdown.title_case = TRUE)

    Figure: Screenshot of the blogdown New Post Addin.

    The todo list for a contributor is long but less tedious than creating folders by hand:

    • Enter a title, no need to worry about title case at this stage.

    • Enter your name if whoami wasn’t able to guess it.

    • Choose the correct date.

    • Enter a new slug if the default one is too long.

    • Choose “blog” or “technotes” as a Subdirectory from the drop-down menu.

    • Choose an Archetype, Rmd or md, from the drop-down menu.

    • Also choose the correct Format: .Rmd if Rmd, Markdown (.md) if md. Never choose .RMarkdown.

    • Ignore Categories.

    • Select any relevant tag and/or create new ones.

    • Click on “Done”; your post draft will be created and opened.

    The addin can also be used outside of RStudio. If you don’t use that nice little tool, then unless you use hugo new, you’d have to copy-paste from a template file and create a correctly named folder yourself.
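
    The addin is essentially a graphical front-end to blogdown::new_post(), so a rough sketch of the non-addin, non-hugo-new route from the R console could be (the title, slug and archetype name below are made up for illustration):

    blogdown::new_post(
      title = "My Shiny New Post",   # illustrative title
      kind = "Rmd",                  # which archetype to copy
      subdir = "blog",               # or "technotes"
      slug = "shiny-new-post",       # illustrative slug
      ext = ".Rmd"                   # matching the archetype's format
    )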

    ignoreFiles field in the Hugo config

    Problem: How do we make Hugo ignore useless html from our knitting process without deleting said html by hand?

    At the moment the output type in our .Rmd template archetype is

    output:
      html_document:
        keep_md: yes

    that produces a .md like we need. I hope we’ll find a better way one day, and welcome any cleaner suggestion for the output field (I tested a few Markdown variants without success!). I have the perhaps naive impression that blogdown actually follows a similar process when knitting .RMarkdown to .markdown?

    So anyway, we currently have a useless html file produced when knitting a post from an .Rmd file; how do we make Hugo ignore it?

    I knew about .gitignore where we have the lines

    content/blog/*/index.html
    content/technotes/*/index.html

    so that such a garbage index.html doesn’t get added to pull requests, but it still made Hugo error locally.

    Figure: Paper thrown on the ground (Richard Dykes on Unsplash).

    Luckily Hugo recognizes a config field called ignoreFiles. Ours now mentions index.html.

    ignoreFiles = ["\\.Rmd$", "_files$", "_cache$", "index\\.html"]

    Conclusion

    In this post I reported on a few things our website work taught us about R Markdown (knitr hooks), Hugo (ignoreFiles, leaf bundles, archetypes) and blogdown (New Post Addin). We’re still learning important concepts and tricks thanks to new questions from blog authors and to updates in the ecosystem8, so we shall keep publishing such posts; stay tuned if that’s your jam!


    1. But as you’ll see later we actually take advantage of that cool package: we recommend using blogdown’s New Post Addin; and we also mention blogdown::install_hugo() in the blog guide. ↩

    2. Yes you can escape Hugo shortcodes in your site content! ↩

    3. If you want to join us, follow the invitation link. À bientôt !↩

    4. I recently asked and received references to define the setup chunk. ↩

    5. I can also credit this smoother workflow for making me like adding more images, hence the stock pictures in this post! ↩

    6. Thanks to Steffi LaZerte for encouraging work in this direction! ↩

    7. At the time of writing blogdown 1.18.1 has to be installed from GitHub via remotes::install_github("rstudio/blogdown"). ↩

    8. I actually wrote a post around maintaining Hugo websites in my personal blog. ↩


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


    T is for Themes


    [This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

    One of the easiest ways to make a beautiful ggplot is by using a theme. ggplot2 comes with a variety of pre-existing themes. I’ll use the genre statistics summary table I created in yesterday’s post, and create the same chart with different themes.

    library(tidyverse)
    ## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- 
    ##  ggplot2 3.2.1      purrr   0.3.3
    ##  tibble  2.1.3      dplyr   0.8.3
    ##  tidyr   1.0.0      stringr 1.4.0
    ##  readr   1.3.1      forcats 0.4.0
    ## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()
    reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allrated.csv",
                          col_names = TRUE)
    ## Parsed with column specification:
    ## cols(
    ##   Title = col_character(),
    ##   Pages = col_double(),
    ##   date_started = col_character(),
    ##   date_read = col_character(),
    ##   Book.ID = col_double(),
    ##   Author = col_character(),
    ##   AdditionalAuthors = col_character(),
    ##   AverageRating = col_double(),
    ##   OriginalPublicationYear = col_double(),
    ##   read_time = col_double(),
    ##   MyRating = col_double(),
    ##   Gender = col_double(),
    ##   Fiction = col_double(),
    ##   Childrens = col_double(),
    ##   Fantasy = col_double(),
    ##   SciFi = col_double(),
    ##   Mystery = col_double(),
    ##   SelfHelp = col_double()
    ## )
    genrestats <- reads2019 %>%
      filter(Fiction == 1) %>%
      arrange(OriginalPublicationYear) %>%
      group_by(Childrens, Fantasy, SciFi, Mystery) %>%
      summarise(Books = n(),
                WomenAuthors = sum(Gender),
                AvgLength = mean(Pages),
                AvgRating = mean(MyRating))

    genrestats <- genrestats %>%
      bind_cols(Genre = c("General Fiction", "Mystery", "Science Fiction",
                          "Fantasy", "Fantasy SciFi", "Children's Fiction",
                          "Children's Fantasy")) %>%
      ungroup() %>%
      select(Genre, everything(), -Childrens, -Fantasy, -SciFi, -Mystery)

    genre <- genrestats %>%
      ggplot(aes(Genre, Books)) +
      geom_col() +
      scale_y_continuous(breaks = seq(0, 20, 1))

    Since I’ve created a new object for my figure, I can add a theme by typing genre + [theme]. Here’s a handful of the ggplot2 themes.
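
    For instance, a few of the built-in themes can be applied to that object like this (the selection below is mine, for illustration):

    # a handful of ggplot2's built-in themes, applied to the same plot object
    genre + theme_bw()        # white background with grey gridlines
    genre + theme_minimal()   # no background annotations
    genre + theme_classic()   # axis lines, no gridlines
    genre + theme_dark()      # dark panel background
    genre + theme_void()      # strips axes and gridlines entirely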

    You can also get more themes with additional packages. My new favorite is ggthemes. I’ve been loving their Economist themes (particularly economist_white), which I’ve been using for most of the plots I create at work. Here are some of my favorites.
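
    For example, after loading ggthemes, its themes are applied the same way (again, my picks for illustration):

    library(ggthemes)
    genre + theme_economist_white()   # The Economist's white variant
    genre + theme_fivethirtyeight()   # FiveThirtyEight style
    genre + theme_tufte()             # minimal, Tufte-inspired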

    You can also customize different elements of the plot with theme(). For instance, theme(plot.title = element_text(hjust = 0.5)) centers your plot title, and theme(legend.position = "none") removes the legend. You could do both of these at once within the same theme() call by separating them with commas. This is a great way to tweak tiny elements of your plot, or to create your own custom theme.
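
    For instance, combining those two tweaks might look like this (the plot title is made up for illustration):

    genre +
      ggtitle("Books Read in 2019 by Genre") +       # illustrative title
      theme(plot.title = element_text(hjust = 0.5),  # center the title
            legend.position = "none")                # drop the legend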

    library(ggthemes)
    ## Warning: package 'ggthemes' was built under R version 3.6.3 
    genre +
      theme_economist_white() +
      theme(plot.background = element_rect(fill = "lightblue"))

    These themes also have color schemes you can add to your plot. We’ll talk about that soon!


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.




    <script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>