
X is for scale_x


[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]

These next two posts will deal with formatting scales in ggplot2 – x-axis, y-axis – so I’ll try to limit the amount of overlap and repetition.

Let’s say I wanted to plot my reading over time, specifically as a cumulative sum of pages across the year. My x-axis will be a date. Since my reads2019 file initially formats my dates as character, I’ll need to use my mutate code to turn them into dates, plus compute my cumulative sum of pages read.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## ggplot2 3.2.1     purrr   0.3.3
## tibble  2.1.3     dplyr   0.8.3
## tidyr   1.0.0     stringr 1.4.0
## readr   1.3.1     forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

reads2019 <- read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",
                      col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )

reads2019 <- reads2019 %>%
  mutate(date_started = as.Date(reads2019$date_started, format = '%m/%d/%Y'),
         date_read = as.Date(date_read, format = '%m/%d/%Y'),
         PagesRead = order_by(date_read, cumsum(Pages)))

This gives me the variables I need to plot my pages read over time.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point()

ggplot2 did a fine job of creating this plot using default settings. Since my date_read variable is a date, the plot automatically ordered date_read, formatted as “Month Year”, and used quarters as breaks. But we can still use the scale_x functions to make this plot look even better.

One option is to format years as 2 digits instead of 4. We could also use monthly breaks instead of quarterly ones.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%b %y", date_breaks = "1 month")

Of course, we could drop year completely and just show month, since all of this data is for 2019. We could then note that in the title instead.

reads2019 %>%
  ggplot(aes(date_read, PagesRead)) +
  geom_point() +
  scale_x_date(date_labels = "%B", date_breaks = "1 month") +
  labs(title = "Cumulative Pages Read Over 2019") +
  theme(plot.title = element_text(hjust = 0.5))
Tomorrow, I’ll show some tricks for how we can format the y-axis of this plot. But let’s see what else we can do to the x-axis. Let’s create a bar graph with my genre data. I’ll use the genre names I created for my summarized data last week.
genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction", "General Fiction", "Mystery", "Science Fiction",
                      "Fantasy", "Fantasy Sci-Fi", "Children's Fiction", "Children's Fantasy"))

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col()

Unfortunately, my new genre names are a bit long, and overlap each other unless I make my plot really wide. There are a few ways I can deal with that. First, I could ask ggplot2 to abbreviate the names.

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels = abbreviate)

These abbreviations were generated automatically by R, and I’m not a huge fan. A better way might be to add line breaks to any two-word genres. This Stack Overflow post gave me a function I can add to my scale_x_discrete to do just that.

genres %>%
  ggplot(aes(Genre, Books)) +
  geom_col() +
  scale_x_discrete(labels = function(x) {sub("\\s", "\n", x)})

MUCH better! As you can see, the scale_x function you use depends on the type of data you’re working with. For dates, scale_x_date; for categories, scale_x_discrete. Tomorrow, we’ll show some ways to format continuous data, since that’s often what you see on the y-axis. See you then!

By the way, this is my 1000th post on my blog!



Avoid Irrelevancy and Fire Drills in Data Science Teams


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

A visualization of a computer thrown away and a fire

Balancing the twin threats of Data Science Development

Data science leaders naturally want to maximize the value their teams deliver to their organization, and that often means helping them navigate between two possible extremes. On the one hand, a team can easily become an expensive R&D department, detached from actual business decisions, slowly chipping away only to end up answering stale questions. On the other hand, teams can be overwhelmed with requests, spending all of their time on labor-intensive, manual fire drills, always creating one more "Just in Time" PowerPoint slide.

How do you avoid these threats, of either irrelevancy or constant fire drills? As we touched on in a recent blog post, Getting to the Right Question, it turns out the answer is pretty straightforward: use iterative, code-based development to share your content early and often, to help overcome the communications gap with your stakeholders.

Data science ecosystems can be complex and full of jargon, so before we dive into specifics let’s consider a similar balancing act. Imagine you are forming a band that wants to share new music with the world. To do so, it is critical to get music out to your fans quickly, to iterate on ideas rapidly. You don’t want to get bogged down in the details of a recording studio on day 1. At the same time, you want to be able to capture and repeat what works – perhaps as sheet music, perhaps as a video, or even as a simple recording.

Share your Data Science work early and often

For data scientists, the key is creating the right types of outputs so that decision makers can iterate with you on questions and understand your results. Luckily, like a musician, the modern data science team has many ways to share their initial vision:

  • They can quickly create notebooks, through tools like R Markdown or Jupyter, that are driven by reproducible code and can be shared, scheduled, and viewed without your audience needing to understand code.
  • They can build interactive web applications using tools like Shiny, Flask, or Dash to help non-coders test questions and explore data.
  • Sometimes, data science teams even create APIs, which act as a realistic preview of their final work with a much lower cost of creation.

Sharing early and often enables data science teams to solve impactful problems. For example, perhaps a data scientist is tasked with forecasting sales by county. They might share their initial exploratory analysis with sales leadership and tap into their domain expertise to help explain outlier counties. Or imagine a data scientist working to support biologists doing drug discovery research. Instead of responding to hundreds of requests for statistical analysis, the data scientist could build an interactive application to allow biologists to run their own analysis on different inputs and experiments. By sharing the application early and often, the biologist and data scientist can empower each other to complete far more experiments.

These types of outputs all share a few characteristics:

  1. The outputs are easy to create. The sign that your team has the right set of tools is if a data scientist can create and share an output from scratch in days, not months. They shouldn’t have to learn a new framework or technology stack.

  2. The outputs are reproducible. It can be tempting, in a desire to move quickly, to take shortcuts. However, these shortcuts can undermine your work almost immediately. Data scientists are responsible for informing critical decisions with data. This responsibility is serious, and it means results cannot exist only on one person's laptop, or require manual tweaking to recreate. A lack of reproducibility can undermine credibility in the minds of your stakeholders, which may lead them to dismiss or ignore your analyses if the answer conflicts with their intuition.

  3. Finally, and most importantly: the outputs must be shared. All of these examples: notebooks, interactive apps and dashboards, and even APIs, are geared towards interacting with decision makers as quickly as possible to be sure the right questions are being answered.

Benefits of sharing for Data Science Teams

Luckily, tools exist to ensure data science teams can create artifacts that share these three characteristics. At RStudio, we've built RStudio Team with all three of these goals in mind.

Great data science teams talk about the happy result of this approach. For example:

“RStudio Connect is critical, the way you can deploy flexdashboards, R Markdown… I use web apps as a way to convey a model in a very succinct fashion… because I don’t know what the user will do, I can create an app where the user’s interactions with the model can imply it, I don’t have to come up with all the finite outcomes ahead of time”– Moody Hadi at S&P

“One of the key focuses for us was the method of delivery … actually taking your insights and getting business impact. How are non analytic people digesting your work.”– Aymen Waqar at Astellas (check out our last blog post, Getting to the Right Question, to see Aymen discussing the analytics communication gap)

It’s Not Just About Production

We often see data science teams make a common mistake that prevents them from achieving this delicate balancing act. A tempting trap is to focus exclusively on complex tooling oriented towards putting models in production. Because data science teams are trying to strike a balance between repeatability, robustness, and speed, and because they are working with code, they often turn to their software engineering counterparts for guidance on adopting “agile” processes. Unfortunately, many teams end up focusing on the wrong parts of the agile playbook. Instead of copying the concept – rapid iterations towards a useful goal – teams get caught up in the technologies, introducing complex workflows instead of focusing on results. This mistake leads to a different version of the expensive R&D department – the band stuck in a recording studio with the wrong song.

Eduardo Arina de la Rubio, head of a large data team at Facebook, lays out an important reminder in his recent talk at rstudio::conf 2020. Data science teams are not machine learning engineers. While the growth of the two is related, ML models will ultimately become commoditized, mastered by engineers and available in off-the-shelf offerings. Data scientists, on the other hand, have a broader mandate: to enable critical business decisions. Often, in the teams we work with at RStudio, many projects are resolved and decisions made based on the rapid iteration of an app or a notebook. Only on occasion does the result need to be codified into a model at scale – and usually engineers are involved at that stage.

To wrap up, at RStudio we get to interact with hundreds of data science teams of all shapes and sizes from all types of industries. The best of these teams have all mastered the same balancing act: they use powerful tools to help them share results quickly, earning them a fanbase among their business stakeholders and helping their companies make great decisions.

We developed RStudio Team with this balancing act in mind, and to make it easy for data science teams to create, reproduce and share their work. To learn more, please visit the RStudio Team page.


Tips before migrating to a newer R version


[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers.]


This post is based on real events.

Several times, after installing the latest version of R and then reinstalling all the packages I had in the previous version, I have run into problems. The same applies when updating packages after a long while.

I decided to make this post after seeing the community reception to a quick post I made.


This post (also available in Spanish here) is not meant to discourage installing the newest R; on the contrary, it aims to warn about the "dark side" of the migration and to help keep our projects stable over time.

Luckily, functions usually change for the better, or even much better, as is the case with the tidyverse suite.


🗞 (A little announcement for those who speak Spanish 🇪🇸) Three weeks ago I created the data school EscuelaDeDatosVivos.AI, where you can find a free introductory R course for data science (which covers the tidyverse and funModeling, among others) 👉 Desembarcando en R


Projects that are not frequently executed

For example, after migrating, when re-running the build of the Data Science Live Book (written 100% in R Markdown), I have seen function deprecation messages appear as warnings. Naturally, I have to remove those calls or switch to the new functions.

I also ran into a case where some of the R Markdown parameters changed.

Another case 🎬

Imagine the following flow: R 4.0.0 is installed, then the latest version of all packages. Taking ggplot2 as an example, we go from 2.8.1 to 3.5.1.

Version 3.5.1 no longer has some function because it was deprecated, so the old script fails. Or a function changed (examples from the tidyverse: mutate_at, mutate_if): what changes is the signature of the function, e.g. the .vars parameter.

Package installation


Well, if we migrate and don't reinstall everything we had before, sooner or later we'll run an old script and hit this problem.

Some recommend listing all the packages we have installed, and generating a script to install them.
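A minimal sketch of that approach in base R (the file path is just an example):

# Before upgrading: record the names of the currently installed packages.
pkgs <- rownames(installed.packages())
saveRDS(pkgs, "~/installed_packages.rds")

# After installing the new R version: reinstall whatever is missing.
pkgs <- readRDS("~/installed_packages.rds")
missing <- setdiff(pkgs, rownames(installed.packages()))
install.packages(missing)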

Another solution is to manually copy the packages from the library folder of the old R version into the new one; each package is just a folder inside R's library directory.

R on servers

Another case: R is installed on a server with processes running every day, the migration happens, and some of the functions change their signature. That is, the type of data a function expects may change.

This shouldn't happen often if you update package versions regularly. The normal flow for removing a function from an R package is to first announce it with a warning, the famous "deprecated" stage: Mark a function as deprecated in customised R package.

If the announcement happens in version N+1 and we jump from version N straight to N+2, we may miss the message entirely, and the function is simply gone.
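For reference, a minimal sketch of what that deprecation stage looks like inside a package (old_fun() and new_fun() are made-up names):

old_fun <- function(x) {
  .Deprecated("new_fun")  # warns on every call and points users to the replacement
  new_fun(x)
}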

So, is it not advisable to upgrade R and its packages?

As I said at the beginning, of course I encourage the migration.

We must be alert and test the projects we already have running.

Otherwise, we would miss out on many of the facilities that today's languages give us through their communities. And this is not even specific to R.


📝 Now that tidymodels is out, here's another post that might interest you: How to use recipes package from tidymodels for one hot encoding 🛠


Some advice: Environments


Python has a very useful concept, the virtual environment: it is quick to create, and it means each library is installed inside the project folder.

Then you run pip freeze > requirements.txt and all the libraries, with their versions, end up in a text file from which the development environment can be quickly recreated. See: Why and How to make a Requirements.txt.

This is not as easy in R. There is packrat, but it has its complexities, for example when some packages come from GitHub repos.

Augusto Hassel just told me about the renv library (also from RStudio! 👏). I quote the page:

"The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors."

You can see the slides from renv: Project Environments for R, by Kevin Ushey.
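For what it's worth, the typical renv workflow looks roughly like this:

# install.packages("renv")
renv::init()      # create a project-local library and a lockfile
# ...install or update packages as usual, then record the exact versions:
renv::snapshot()  # writes renv.lock
# On another machine, or after upgrading R, restore the recorded versions:
renv::restore()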

Docker


Augusto also told me about Docker as a solution:

"Using Docker we can encapsulate the environment needed to run our code through an instruction file called Dockerfile. This way, we’ll always be running the same image, wherever we pick up the environment."

Here’s a post by him (in Spanish): My First Docker Repository

Conclusions

✅ If you have R in production, have a testing environment and a production environment.

✅ Install R, your libraries, and then check that everything is running as usual.

✅ Have unit tests to automatically check that the data flow is not broken. In R, check out testthat.

✅ Update all libraries every X months, don’t let too much time go by.

The moral: this, too, is part of being a data scientist, solving version, installation, and environment problems.


Moss! What did you think of the post?


Happy update!

📬 Find me at: Linkedin& Twitter.


R package numbr 0.11.3 posted


[This article was first published on RStats – Tom Hopper, and kindly contributed to R-bloggers.]

My simple package of useful numeric functions, numbr, has been updated to version 0.11.3 and posted to GitHub. You can install it with the command

devtools::install_github("tomhopper/numbr")

0.11.3 adds the %==% operator. %==% is built on the base R function all.equal(), comparing two numeric or integer vectors for near-equality. Unlike all.equal(), %==% returns a logical value for each pair of corresponding elements of the left-hand and right-hand sides. This makes it useful in functions like which() or dplyr::filter().
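To illustrate the idea (this is only a sketch of an element-wise near-equality comparison built on all.equal(), not the numbr implementation):

`%near%` <- function(lhs, rhs) {
  mapply(function(a, b) isTRUE(all.equal(a, b)), lhs, rhs)
}

x <- c(0.1 + 0.2, 1)
y <- c(0.3, 1.1)
x == y      # FALSE FALSE: exact comparison trips over floating point error
x %near% y  # TRUE FALSE: near-equality catches the first case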

I use %==% when making comparisons between columns of a data frame that have been calculated and may have differences due to machine accuracy errors.

%==% currently uses the defaults for other parameters to all.equal (most notably tolerance), and no method is implemented to alter those parameters in this version.

Full documentation of the functions in numbr can be found at tomhopper/numbr.


Analysis essentials: Using the help page for a function in R


[This article was first published on Very statisticious, and kindly contributed to R-bloggers.]

Since I tend to work with relatively new R users, I think a lot about what folks need to know when they are getting started. Learning how to get help tops my list of essential skills. Some of this involves learning about useful help forums like Stack Overflow and the RStudio Community. Some of this is about learning good search terms (this is a hard one!). And some of this is learning how to use the R documentation help pages.

While there are still exceptions, most often the help pages in R contain a bunch of useful information. Here I talk a little about what is generally in a help page for a function and what I focus on in each section.

R help pages

Every time I use a function for a first time or reuse a function after some time has passed (like, 5 minutes in some cases 😜), I spend time looking at the R help page for that function. You can get to a help page in R by typing ?functionname into your Console and pressing Enter, where functionname is some R function you are using.

For example, if I wanted to take an average of some numbers with the mean() function, I would type ?mean at the > in the R Console and then press Enter. The help page opens up; if using RStudio this will default to open in the Help pane.
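For instance, either of these lines opens the same page:

?mean
help("mean")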

Help page structure

A help page for an R function always has the same basic set-up. Here’s what the first half of the help page for mean() looks like.

At the very top you’ll see the function name, followed by the package the function is in surrounded by curly braces. You can see that mean() is part of the base package.

This is followed by a function title and basic Description of the function. Sometimes this description can be fairly in-depth and useful, but often, like here, it's not and I quickly skim over it.

Usage

The Usage section is usually my first stop in a help page. This is where I can see the arguments available in the function along with any default values. The function arguments are labels for the inputs you can give to a function. A default value means that is the value the function will use if you don’t input something else.

For example, for mean() you can see that the first argument is x (no default value), followed by trim that defaults to a value of 0, and then na.rm with a default of FALSE.

Arguments

The arguments the function takes and a description of those arguments is given in the Arguments section. This is a section I often spend a lot of time in and go back to regularly, figuring out what arguments do and the options available for each argument.

In the mean() example I’m using, this section tells me that the trim argument can take numeric values between 0 and 0.5 in order to trim the dataset prior to calculating the mean. I know from Usage it defaults to 0 but note in this case the default is not explicitly listed in the argument description.

The na.rm argument takes a logical value (i.e., TRUE or FALSE) and controls whether or not NA values are stripped before the function calculates the means. Since it defaults to FALSE, the NA values are not stripped prior to calculation unless I change this.
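A quick illustration of that default behavior:

mean(c(1, 2, NA))                # returns NA because of the missing value
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5 once the NA is stripped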

Examples

If you scroll to the very bottom of a help page you will find the Examples section. This gives examples of how the function works. You can practice using the function by copying and pasting the example and running the code. In RStudio you can also highlight the code and run it directly from the Help pane with Ctrl+Enter (MacOS Cmd+Enter).

After looking at Usage and Arguments I often scroll right down to the Examples section to see an example of the code in use. The Examples section for mean() is pretty sparse, but you’ll find that this section is quite extensive for some functions.

Other sections

Depending on the function, there can be a variety of different and important information after Arguments and before Examples. You may see mathematical notation that shows what the function does (in Details), a description of what the function returns (in Value), references in support of what the function does (in References), etc. This can be extremely valuable information, but I often don’t read it until I run into trouble using the function or need more information to understand exactly what the function does.

I have a couple examples of useful information I’ve found in these other sections for various functions.

First up is rbind() for stacking datasets. It turns out that rbind() stacks columns based on matching column names and not column positions. This is mentioned in the function documentation, but you have to dive deep into the very long Details section of the help file at ?rbind to find the information.

Second, functions for distributions will give information about the density function used in the Details section. Since the help pages for distributions almost always describe multiple functions at once, you can see what each of the functions return in Value. Here’s an example from ?rnorm.

Using argument order instead of labels

You will see plenty of examples in R where the argument labels are not written out explicitly. This is because we can take advantage of the argument order when writing code in R.

You can see this in the mean() Examples section, for example. You can pass a vector to the first argument of mean() without explicitly writing the x argument label.

vals = c(0:10, 50)
mean(vals)
# [1] 8.75

In fact, you can pass in values to all the arguments without labels as long as you input them in the order the arguments come into the function. This relies heavily on you remembering the order of the arguments, as listed in Usage.

vals = c(0:10, 50, NA)
mean(vals, 0.1, TRUE)
# [1] 5.5

You will definitely catch me leaving argument labels off in my own code. These days, though, I try to be more careful and primarily leave off only the first label. One reason for this is that it turns out my future self needs the arguments written out to better understand the code. I'm much more likely to figure out what the mean() code is doing if I put the argument labels back in. I think the code above, without the labels for trim and na.rm, is hard to understand.

Here’s the same code, this time with the argument labels written out. Note the argument order doesn’t matter if the argument labels are used.

vals = c(0:10, 50, NA)
mean(vals, na.rm = TRUE, trim = 0.1)
# [1] 5.5

Another reason I try to use argument labels is that new R users can get stung leaving off argument labels when they don’t realize how/why it works. 🐝 I worked with an R newbie recently who was getting weird results from a GLM with an offset. It turns out they weren’t using argument labels and so had passed the offset to weights instead of offset. Whoops! Luckily they saw something was weird and I could help get them on the right path. And now they know more about why it can be useful to write out argument labels. 😄
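A hypothetical sketch of that mix-up (simulated data, not the analysis described above): glm()'s first arguments are formula, family, data, weights, so an unlabeled fourth argument is silently matched to weights rather than offset.

set.seed(1)
dat <- data.frame(x = rnorm(50), exposure = runif(50, 1, 10))
dat$y <- rpois(50, lambda = dat$exposure * exp(0.5 * dat$x))

fit_oops <- glm(y ~ x, poisson, dat, log(exposure))           # matched to weights!
fit_ok   <- glm(y ~ x, poisson, dat, offset = log(exposure))  # what was intended
coef(fit_oops)
coef(fit_ok)  # the two fits differ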

I talk about this issue here because I don't often see a lot of explicit discussion on why and when argument labels can be left off, even though there are a lot of code examples out there that do this. This reminds me of when I was a new beekeeper and I made the mistake of going into a hive in the evening. (Do not try this at home, folks!) It turns out "everyone" who is an expert beekeeper knows what happens if you do this, but it wasn't mentioned in any of my beginner books and classes. I don't think beginners should have to learn this sort of thing the hard way.


Figure 1: No worries, this is a daytime hive inspection.


New Orleans and Normalization


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

My post about Apple’s mobility data from a few days ago has been doing the rounds. (People have been very kind.) Unsurprisingly, one of the most thoughtful responses came from Dr. Drang, who wrote up a great discussion about the importance of choosing the right baseline if you’re going to be indexing change with respect to some point in time. His discussion of Small Multiples and Normalization is really worth your while.

Dr. Drang’s eye was caught by the case of Seattle, where the transit series was odd in a way that was related to Apple’s arbitrary choice of April 13th as the baseline for its series:

One effect of this normalization choice is to make the recent walking and driving requests in Seattle look higher than they should. Apple’s scores suggest that they are currently averaging 50–65% of what they were pre-COVID, but those are artificially high numbers because the norm was set artificially low.

A better way to normalize the data would be to take a week’s average, or a few weeks’ average, before social distancing and scale all the data with that set to 100.

I’ve been continuing to update my covdata package for R as Apple, Google, and other sources release more data. This week, Apple substantially expanded the number of cities and regions it is providing data for. The number of cities in the dataset went up from about 90 to about 150, for example. As I was looking at that data this afternoon, I saw that one of the new cities was New Orleans. Like Seattle, it’s an important city in the story of COVID-19 transmission within its region. And, as it turns out, even more so than Seattle, its series in this particular dataset is warped by the choice of start date. Here are three views of the New Orleans data: the raw series for each mode, the trend component of an STL time series decomposition, and the remainder component of the decomposition. (The methods and code are the same as previously shown.)

Figure: The New Orleans series as provided by Apple.

Figure: The trend component of the New Orleans series.

Figure: The remainder component of the New Orleans series.

Two things are evident right away. First, New Orleans has a huge spike in foot-traffic (and other movement around town) the weekend before Mardi Gras, and on Shrove Tuesday itself. The spike is likely accentuated by the tourist traffic. As I noted before, because Apple’s data is derived from the use of Maps for directions, the movements of people who know their way around town aren’t going to show up.

The second thing that jumps out about the series is that for most of January and February, the city is way, way below its notional baseline. How can weekday foot traffic, in particular, routinely be 75 percentage points below the January starting point?

The answer is that on January 13th, Clemson played LSU in the NCAA National Football Championship at the New Orleans Superdome. (LSU won 42-25.) This presumably brought a big influx of visitors to town, many of whom were using their iPhones to direct themselves around the city. Because Apple chose January 13th as its baseline day, this unusually busy Monday was marked as the “100” mark against which subsequent activity was indexed. Again, as with the strange case of European urban transit, a naive analysis, or even a “sophisticated” one where the researcher did not bother to look at the data first, might easily be led up the garden path.

Dr. Drang has already said most of what I’d say at this point about the value of checking the sanity of one’s starting point (and unlike me, he says it in Python) so I won’t belabor the point. You can see, though, just how huge Mardi Gras is in New Orleans. Were the data properly normalized, the Fat Tuesday spike would be far, far higher than most of the rest of the dataset.
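As a rough sketch of that fix (not the code from the covdata posts; it assumes a data frame with date and index columns like Apple's series), one could re-index each series to the average of a pre-distancing baseline window rather than a single arbitrary day:

library(dplyr)

# Rebase so the mean of the chosen baseline window, not a single day, equals 100.
rebase <- function(df, baseline_start, baseline_end) {
  base_level <- df %>%
    filter(date >= baseline_start, date <= baseline_end) %>%
    summarise(level = mean(index, na.rm = TRUE)) %>%
    pull(level)
  df %>% mutate(index_rebased = 100 * index / base_level)
}

# e.g., with a placeholder series: rebase(nola_walking, as.Date("2020-01-20"), as.Date("2020-02-09"))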


Beta release of data analysis chapters: Evidence-based software engineering


[This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers.]

When I started my evidence-based software engineering book, nobody had written a data analysis book for software developers, so I had to write one (in fact, a book on this topic has still to be written). When I say “I had to write one”, what I mean is that the 200 pages in the second half of my evidence-based software engineering book contains a concentrated form of such a book.

These 200 pages are now on beta release (186 pages, if the bibliography is excluded): chapters 8 to 15 of the draft pdf. Originally I was going to wait until all the material was ready before making a beta release; the Coronavirus changed my plans.

Here is your chance to learn a new skill during the lockdown (yes, these are starting to end; my schedule has not changed, I’m just moving with the times).

All the code+data is available for you to try out any ideas you might have.

The software engineering material, the first half of the book, is also part of the current draft pdf, and the polished form should be available on beta release in about 6 weeks.

If you have a comment or find a problem, either email me or raise an issue on the book’s Github page.

Yes, a few figures and tables still bump into each other. I’m loath to do much fine-tuning because things will shuffle around a bit with minor changes to the words.

I’m thinking of running some online sessions around each chapter. Watch this space for information.


lmSubsets: Exact variable-subset selection in linear regression


[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers.]

The R package lmSubsets for flexible and fast exact variable-subset selection is introduced and illustrated in a weather forecasting case study.

Citation

Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). “lmSubsets: Exact Variable-Subset Selection in Linear Regression for R.” Journal of Statistical Software, 93(3), 1-21. doi:10.18637/jss.v093.i03

Abstract

An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.

Software

https://CRAN.R-project.org/package=lmSubsets

Illustration: Variable selection in weather forecasting

Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades. Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today and hence, typically, the NWP outputs are post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations. So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.

Here, the lmSubsets package is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System. The data frame IbkTemperature contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2-meter above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others.

First, package and data are loaded and the few missing values are omitted for simplicity.

library("lmSubsets")
data("IbkTemperature", package = "lmSubsets")
IbkTemperature <- na.omit(IbkTemperature)

A simple output model for the observed temperature (temp) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (t2m), a linear trend component (time), as well as seasonal components with annual (sin, cos) and bi-annual (sin2, cos2) harmonic patterns.

MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,  data = IbkTemperature)

When looking at summary(MOS0) or the coefficient table below, it can be observed that despite the inclusion of the NWP variable t2m, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.

              | MOS0                   | MOS1                   | MOS2
(Intercept)   | -345.252 **  (109.212) | -666.584 *** (95.349)  | -661.700 *** (95.225)
t2m           |    0.318 *** (0.016)   |    0.055     (0.029)   |
time          |    0.132 *   (0.054)   |    0.149 **  (0.047)   |    0.147 **  (0.047)
sin           |   -1.234 *** (0.126)   |    0.522 *** (0.147)   |    0.811 *** (0.120)
cos           |   -6.329 *** (0.164)   |   -0.812 **  (0.273)   |
sin2          |    0.240 *   (0.110)   |   -0.794 *** (0.119)   |   -0.870 *** (0.118)
cos2          |   -0.332 **  (0.109)   |   -1.067 *** (0.101)   |   -1.128 *** (0.097)
sshnf         |                        |    0.016 *** (0.004)   |    0.018 *** (0.004)
vsmc          |                        |   20.200 *** (3.115)   |   20.181 *** (3.106)
tmax2m        |                        |    0.145 *** (0.037)   |    0.181 *** (0.023)
st            |                        |    1.077 *** (0.051)   |    1.142 *** (0.043)
wr            |                        |    0.450 *** (0.109)   |    0.505 *** (0.103)
t2pvu         |                        |    0.064 *** (0.011)   |    0.149 *** (0.028)
mslp          |                        |                        |   -0.000 *** (0.000)
p2pvu         |                        |                        |   -0.000 **  (0.000)
AIC           | 9493.602               | 8954.907               | 8948.182
BIC           | 9537.650               | 9031.992               | 9025.267
RSS           | 19506.469              | 14411.122              | 14357.943
Sigma         | 3.281                  | 2.825                  | 2.820
R-squared     | 0.803                  | 0.854                  | 0.855

*** p < 0.001; ** p < 0.01; * p < 0.05.

Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.

MOS1_best <- lmSelect(temp ~ ., data = IbkTemperature,
  include = c("t2m", "time", "sin", "cos", "sin2", "cos2"),
  penalty = "BIC", nbest = 20)
MOS1 <- refit(MOS1_best)

Best-subset regression with respect to the BIC criterion is employed to determine pertinent variables in addition to the regressors already used in MOS0. The 20 best submodels are computed and the selected variables can be visualized by image(MOS1_best, hilite = 1) (see below), while the corresponding BIC values can be visualized by plot(MOS1_best). All in all, these 20 best models are very similar, with only a few variables switching between being included and excluded. Using the refit() method, the best submodel can be extracted and fitted via lm(). Summary statistics are shown in the table above. Overall, the model MOS1 improves the model fit considerably compared to the basic MOS0 model.

image(MOS1_best)

Finally, an all-subsets regression is conducted instead of the cheaper best-subsets regression. It considers all 41 variables without any restrictions to determine what is the best model in terms of BIC that could be found for this data set.

MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))

Again, the best model is refitted with lm() to facilitate further inspections, see above for the summary table.

The best-BIC models MOS1 and MOS2 both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in MOS2 even though they are not forced into the model (as in MOS1). In addition, MOS1 and MOS2 share six NWP outputs relating to temperature (tmax2m, st, t2pvu), pressure (mslp, p2pvu), hydrology (vsmc, wr), and heat flux (sshnf). However, and most remarkably, MOS1 does not include the direct 2-meter temperature output from the NWP model (t2m). In fact, t2m is not included by any of the 20 submodels (sizes 8 to 27) shown by image(MOS2_all, size = 8:27, hilite = 1, hilite_penalty = "BIC") whereas the temperature quantities tmax2m, st, t2pvu are included by all. (Additionally, plot(MOS2_all) would show the associated BIC and residual sum of squares across the different model sizes.) The summary statistics reveal that both MOS1 and MOS2 significantly improve over the simple reference model MOS0, with MOS2 being only slightly better than MOS1.

image(MOS2_all)



4 for 4.0.0 – Four Useful New Features in R 4.0.0


[This article was first published on R – Detroit Data Lab, and kindly contributed to R-bloggers.]

With the release of R 4.0.0 upon us, let’s take a moment to understand a few parts of the release that are worth knowing about.

list2DF Function

This is a new utility that will convert your lists to a data frame. It’s very friendly, in the sense that it attempts to avoid errors by making assumptions on how to fill in gaps. That could lead to heartache when those don’t align with what you’re expecting, so be careful!

> one <- list(1:3, letters[5:10], rep(10, 7))
> list2DF(one)
1 1 e 10
2 2 f 10
3 3 g 10
4 1 h 10
5 2 i 10
6 3 j 10
7 1 e 10

sort.list for non-atomic objects

The change in R 4.0.0 is that sort.list() now also works for non-atomic objects (such as lists); atomic vectors and matrices were already supported. If you've chosen to employ sort.list() within your functions, you won't have to error-handle the outcomes of non-atomic objects.

> mtx.test <- matrix(1:9, nrow = 3, byrow = TRUE)
> sort.list(mtx.test)
[1] 1 4 7 2 5 8 3 6 9

 

New Color Palettes!

You can check out the new palettes using the palette.pals function. R has always been strong in visualizations, and while Detroit Data Lab won’t endorse Tableau color schemes, we’re excited to see the better accessibility offered by some color palettes. See the new R4 palette below using a simple code snippet.

> palette("R4")
> palette()
[1] "black"   "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710" "gray62"
> show_col(palette())  # show_col() comes from the scales package

R version 4.0.0 color palette

stringsAsFactors = FALSE

You knew it was coming, and this list wouldn’t be complete without it. By default, stringsAsFactors is now set to FALSE. Many programmers have explicitly stated this out of habit, but it’s worth checking your code bases to ensure you know what you’re reading and how you’re handling it.
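A quick check of the new default:

> df <- data.frame(fruit = c("apple", "banana"))
> class(df$fruit)
[1] "character"
> df2 <- data.frame(fruit = c("apple", "banana"), stringsAsFactors = TRUE)
> class(df2$fruit)
[1] "factor"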


Workflow automation tools for package developers


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers.]

As a package developer, there are quite a few things you need to stay on top of. Following bug reports and feature requests of course, but also regularly testing your package, ensuring the docs are up-to-date and typo-free… Some of these things you might do without even pausing to think, whereas for others you might need more reminders. Some of these things you might be ok doing locally, whereas for others you might want to use some cloud services. In this blog post, we shall explore a few tools making your workflow smoother.

What automatic tools?

I’ve actually covered this topic a while back on my personal blog but here is an updated list.

Tools for assessing

  • R CMD check, or devtools::check(), will check your package for adherence to some standards (e.g. which folders there can be) and run the tests and examples. It’s a useful command to run even if your package isn’t intended to go on CRAN.

  • For Bioconductor developers, there is BiocCheck that “encapsulates Bioconductor package guidelines and best practices, analyzing packages “.

  • goodpractice, and lintr both provide you with useful static analyses of your package.

  • covr::package_coverage() calculates test coverage for your package. Having a good coverage also means R CMD check is more informative, since it means it’s testing your code. 😉 covr::package_coverage() can also provide you with the code coverage of the vignettes and examples!

  • devtools::spell_check(), wrapping spelling::spell_check_package(), runs a spell check on your package and lets you store white-listed words. (A quick sketch of running these tools locally follows this list.)
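A minimal sketch of running the assessment tools above from the R console (all of these are exported functions from the packages just mentioned):

devtools::check()             # run R CMD check, tests and examples
goodpractice::gp()            # static analysis and advice
lintr::lint_package()         # linting
covr::package_coverage()      # test coverage
devtools::spell_check()       # spelling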

Tools for improving

That’s all good, but now, do we really have to improve the package by hand based on all these metrics and flags? Partly yes, partly no.

  • styler can help you re-style your code. Of course, you should check the changes before putting them in your production codebase. It’s better paired with version control.

  • Using roxygen2 is generally handy, starting with your no longer needing to edit the NAMESPACE by hand. If your package doesn’t use roxygen2 yet, you could use Rd2roxygen to convert the documentation.

  • One could argue that using pkgdown is a way to improve your R package documentation for very little effort. If you only tweak one thing, please introduce grouping in the reference page.

  • Even when having to write some things by hand, like inventing new tests, usethis provides useful functions to help, e.g. create test files (usethis::use_test()). (Again, a short sketch follows this list.)
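Similarly, a quick sketch of the improvement helpers mentioned above (the test file name is just an example):

styler::style_pkg()         # restyle the package code; review the diff before committing
usethis::use_test("scale")  # create tests/testthat/test-scale.R
pkgdown::build_site()       # build the documentation website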

When and where to use the tools?

Now, knowing about useful tools for assessing and improving your package is good, but when and where do you use them?

Continuous integration

How about learning to tame some online services to run commands on your R package at your own will and without too much effort? Apart from the last subsection, this section assumes you are using Git.

Run something every time you make a change

Knowing your package still passes R CMD check, and what the latest value of test coverage is, is important enough to justify running commands after every change you commit to your codebase. That is the idea behind continuous integration (CI), which has been very well explained by Julia Silge with the particular example of Travis CI.

“The idea behind continuous integration is that CI will automatically run R CMD check (along with your tests, etc.) every time you push a commit to GitHub. You don’t have to remember to do this; CI automatically checks the code after every commit.” Julia Silge

Travis CI used to be very popular with the R crowd but this might be changing, as exemplified by the latest usethis release. There are different CI providers with different strengths and weaknesses, and different levels of lock-in to GitHub. 😉

Not only does CI allow to run R CMD check without remembering, it can also help you run R CMD check on operating systems that you don’t have locally!

You might also find the tic package interesting: it defines “CI agnostic workflows”.

LIFE HACK: My go-to strategy for getting Travis builds to work is snooping on *other* people's .travis.yml files. Shoutout today to the tidyr .travis.yml for solving my problem! #rstats🙌

— Julia Silge (@juliasilge) December 12, 2019

The life-hack above by Julia Silge, “LIFE HACK: My go-to strategy for getting Travis builds to work is snooping on other people’s .travis.yml files.", applies to other CI providers too!

What you can run on continuous integration, beside R CMD check and covr, includes deploying your pkgdown website.

Run R CMD check regularly

Even in the absence of your changing anything to your codebase, things might break due to changes upstream (in the packages your package depends on, in the online services it wraps…). Therefore it might make sense to schedule a regular run of your CI checking workflow. Many CI services provide that option, see e.g. the docs of Travis CI and GitHub Actions.

As a side note, remember that CRAN packages are checked regularly on several platforms.

Be lazy with continuous integration: PR commands

You can also make the most of services “on the cloud” for not having to run small things locally. An interesting trick is e.g. the definition of “PR commands” via GitHub Actions. Say someone sends a PR to your repo fixing a typo in the roxygen2 part of an R script, but doesn’t run devtools::document(), or someone quickly edits README.Rmd without knitting it. You could fetch the PR locally and run devtools::document() or rmarkdown::render() yourself, or you could make the GitHub Actions bot do that for you! 💃

Refer to the workflows in e.g. the ggplot2 repo, triggered by writing a comment such as “/document”, and their variant in the pksearch repo, where the trigger is labeling the PR. Both approaches have their pros and cons. I like labeling because not having to type the command means you can’t make a typo. 😁 Furthermore, it doesn’t clutter the PR conversation; on the other hand, you can hide comments later on whereas you cannot hide the labeling event from the PR history, so really, to each their own.

Screenshot of a GitHub Action workflow

This example is specific to GitHub and GitHub Action but you could think of similar ideas for other services.

Run something before you make a change

Let’s build on a meme to explain the idea in this subsection:

  • 💤 Tired: Always remember to do things well
  • 🔌 Wired: Use continuous integration to notice wrong stuff
  • ✨ Inspired: Use precommit to not even commit wrong stuff

Git allows you to define “pre-commit hooks” for not letting you e.g. commit README.Rmd without knitting it. You might know this if you use usethis::use_readme_rmd() that adds such a hook to your project.

To take things further, the precommit R package provides two sets of utilities around the precommit framework: hooks that are useful for R packages or projects, and usethis-like functionalities to set them up.

Examples of available hooks, some of them possibly editing files, others only assessing them: checking your R code is still parsable, spell check, checking dependencies are listed in DESCRIPTION… Quite useful, if you’re up for adding such checks!

Check things before show time

Remembering to run automatic assessment tools is key, and both continuous integration and pre-commit hooks can help with that. Now a less regular but very important occurrence is the perfect occasion to run tons of checks: preparing a CRAN release! You know, the very nice moment when you use rhub::check_for_cran() among other things… What other things by the way?

CRAN has a submission checklist, and you could either roll your own or rely on usethis::use_release_issue() creating a GitHub issue with important items. If you don’t develop your package on GitHub you could still have a look at the items for inspiration. The devtools::release() function will ask you whether you ran a spell check.
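As a rough sketch, that pre-release ritual might look like this from the console (the exact order is a matter of taste):

rhub::check_for_cran()         # check on CRAN-like platforms
devtools::spell_check()        # one last spell check
usethis::use_release_issue()   # open a GitHub issue with a release checklist
devtools::release()            # interactive questions, then submission to CRAN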

Conclusion

In this blog post we went over tooling that makes your package maintainer life easier: R CMD check, lintr, goodpractice, covr, spelling, styler, roxygen2, usethis, pkgdown… and ways to integrate them into your workflow without having to remember them: continuous integration services, pre-commit hooks, using a checklist before a release. Tools for improving your R package will often be quite specific to R in their implementation (but not in their principles), whereas tools for integrating them into your practice are more general: continuous integration services are used for all sorts of software projects, and pre-commit is originally a Python project. Therefore, there will be tons of resources about all that out there, some of them under the umbrella of DevOps. While introducing some automagic into your workflow might save you time and energy, there is some balance to be found in order not to spend too much time on “meta work”. 🕰

Furthermore, there are other aspects of package development we have not mentioned for which there might be interesting technical hacks: e.g. how do you follow the bug tracker of your package? How do you subscribe to the changelogs of its upstream dependencies?

Do you have any special trick? Please share in the comments below!


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


shinyFeedback 0.2.0 CRAN Release


[This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I am excited to announce that shinyFeedback 0.2.0 is on its way to CRAN (it may take a day or 2 for it to be available on your CRAN mirror). shinyFeedback is an R package that allows you to easily display user feedback in Shiny apps. shinyFeedback's primary feedback messages are displayed alongside Shiny inputs like this:

shinyFeedback 0.2.0 underwent a significant rewrite of the JavaScript that controls when the feedback messages are displayed. I had been meaning to clean up this JavaScript for a while, but, as is often the case, I had not been able to find the time.

I was reinvigorated to work on shinyFeedback when, late last year, Hadley Wickham mentioned shinyFeedback and added a shinyFeedback example to the “User Feedback” chapter of his upcoming Mastering Shiny book. I am thrilled shinyFeedback is getting a mention in Hadley’s upcoming book, but I also knew shinyFeedback needed improvements before appearing in a book for Shiny masters! I feel that shinyFeedback got (at least the most pressing) of these needed improvements with this 0.2.0 release. Please let me know if you have recommendations for further enhancements!

In addition to the underlying JavaScript refactor, there are significant new features in shinyFeedback 0.2.0:

  • new input feedback for shinyWidgets::pickerInput()
  • new input feedback for shinyWidgets::airDatepickerInput() and shiny::dateRangeInput(). @pcogis submitted a flawless PR to add support for these 2 inputs. Thanks @pcogis!
  • a new loadingButton() input. When a button click triggers a long-running process and no feedback is given after the click, the user will understandably think nothing happened and click the button again. Double clicking an action button can cause a bunch of issues if not guarded against (e.g. long-running calculations can run multiple times or, worse, duplicate writes can be made to the database). Of course, it is fairly simple to add custom styles and logic to implement a loading button from scratch, but it is nice to have an out-of-the-box solution as well (see the sketch after this list).
  • a toast notification. The toast notifications will not seem new to many Shiny developers, as there is already an R package (shinytoastr) that wraps the same JavaScript library (toastr). We just wrapped the library slightly differently. Please see the loadingButton and showToast vignette for additional detail.
  • there are now new functions showFeedback() and hideFeedback() which can be used as an alternative to feedback(). See the intro vignette for additional detail.
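To give an idea of how the new pieces fit together, here is a hedged sketch of a tiny app combining input feedback and loadingButton(); argument names are from my reading of the documentation, so please check the vignettes before copying:

library(shiny)
library(shinyFeedback)

ui <- fluidPage(
  useShinyFeedback(),                    # required once in the UI
  numericInput("n", "Rows to simulate", value = 100),
  loadingButton("go", "Run simulation")  # shows a spinner once clicked
)

server <- function(input, output, session) {
  observeEvent(input$n, {
    if (input$n <= 0) {
      showFeedbackWarning("n", "Please pick a positive number")
    } else {
      hideFeedback("n")
    }
  })
  observeEvent(input$go, {
    Sys.sleep(2)               # stand-in for a long-running computation
    resetLoadingButton("go")   # re-enable the button when the work is done
  })
}

shinyApp(ui, server)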

shinyFeedback also now has a new pkgdown website here!

I want to thank Patrick Howard for his excellent work on this release. Patrick is a new coauthor of shinyFeedback.

Please open an issue on GitHub or leave a comment below if you have any problems with shinyFeedback.


To leave a comment for the author, please follow the link and comment on their blog: Posts on Tychobra.


Y is for scale_y


[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Yesterday, I talked about scale_x. Today, I’ll continue on that topic, focusing on the y-axis.

The key to using any of the scale_ functions is to know what sort of data you’re working with (e.g., date, continuous, discrete). Yesterday, I talked about scale_x_date and scale_x_discrete. We often put these types of data on the x-axis, while the y-axis is frequently used for counts. When displaying counts, we want to think about the major breaks that make sense, as well as any additional formatting to make them easier to read.

If I go back to my pages over time plot, you’ll notice the major breaks are in the tens of thousands. We’re generally used to seeing those values with a comma separating the thousands from the hundreds. I could add those to my plot like this (with a little help from the scales package).

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- 
##  ggplot2 3.2.1      purrr   0.3.3 ##  tibble  2.1.3      dplyr   0.8.3 ##  tidyr   1.0.0      stringr 1.4.0 ##  readr   1.3.1      forcats 0.4.0 
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag()    masks stats::lag() 
reads2019<-read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",col_names=TRUE)
## Parsed with column specification: ## cols( ##   Title = col_character(), ##   Pages = col_double(), ##   date_started = col_character(), ##   date_read = col_character(), ##   Book.ID = col_double(), ##   Author = col_character(), ##   AdditionalAuthors = col_character(), ##   AverageRating = col_double(), ##   OriginalPublicationYear = col_double(), ##   read_time = col_double(), ##   MyRating = col_double(), ##   Gender = col_double(), ##   Fiction = col_double(), ##   Childrens = col_double(), ##   Fantasy = col_double(), ##   SciFi = col_double(), ##   Mystery = col_double(), ##   SelfHelp = col_double() ## ) 
reads2019<-reads2019%>%mutate(date_started=as.Date(reads2019$date_started,format='%m/%d/%Y'),date_read=as.Date(date_read,format='%m/%d/%Y'),PagesRead=order_by(date_read,cumsum(Pages)))library(scales)
##  ## Attaching package: 'scales' 
## The following object is masked from 'package:purrr': ##  ##     discard 
## The following object is masked from 'package:readr': ##  ##     col_factor 
reads2019%>%ggplot(aes(date_read, PagesRead))+geom_point()+scale_x_date(date_labels="%B",date_breaks="1 month")+scale_y_continuous(labels= comma)+labs(title="Cumulative Pages Read Over 2019")+theme(plot.title=element_text(hjust=0.5))

I could also add more major breaks.

reads2019%>%ggplot(aes(date_read, PagesRead))+geom_point()+scale_x_date(date_labels="%B",date_breaks="1 month")+scale_y_continuous(labels= comma,breaks=seq(0,30000,5000))+labs(title="Cumulative Pages Read Over 2019")+theme(plot.title=element_text(hjust=0.5))

The scales package offers other ways to format data besides the 3 I’ve shown in this series (log transformation, percent, and now continuous with comma). It also lets you format data with currency, bytes, ranks, and scientific notation.
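For example, here is a quick sketch of a few of those formatters on made-up values (the exact output can differ slightly between scales versions):

library(scales)
dollar(25000)      # "$25,000"
percent(0.256)     # "25.6%"
scientific(25000)  # "2.5e+04"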

Last post tomorrow!


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


Causal Inference cheat sheet for data scientists


[This article was first published on R [english] – NC233, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Being able to make causal claims is a key business value for any data science team, no matter their size. Quick analytics (in other words, descriptive statistics) are the bread and butter of any good data analyst working on quick cycles with their product team to understand their users. But sometimes some important questions arise that need more precise answers. Business value sometimes means distinguishing true insights from incidental noise – insights that will hold up versus temporary marketing material. In other terms, causation.

When answering these questions, absolute rigour is required. Failing to understand key mechanisms could mean missing out on important findings, rolling out the wrong version of a product, and eventually costing your business millions of dollars, or crucial opportunities. Ron Kohavi, former director of the experimentation team at Microsoft, has a famous example: changing the place where credit card offers were displayed on amazon.com generated millions in revenue for the company.

The tech industry has picked up on this trend in the last 6 years, making Causal Inference a hot topic in data science. Netflix, Microsoft and Google all have entire teams built around some variations of causal methods. Causal analysis is also (finally!) gaining a lot of traction in pure AI fields. Having an idea of what causal inference methods can do for you and for your business is thus becoming more and more important.

The causal inference levels of evidence ladder

Hence the causal inference ladder cheat sheet! Beyond the value for data scientists themselves, I’ve also had success in the past showing this slide to internal clients to explain how we were processing the data and making conclusions.

The “ladder” classification explains the level of proof each method will give you. The higher, the easier it will be to make sure the results from your methods are true results and reproducible – the downside is that the set-up for the experiment will be more complex. For example, setting up an A/B test typically requires a dedicated framework and engineering resources. Methods further down the ladder will require less effort on the set-up (think: observational data), but more effort on the rigour of the analysis. Making sure your analysis has true findings and is not just commenting some noise (or worse, is plain wrong) is a process called robustness checks. It’s arguably the most important part of any causal analysis method. The further down on the ladder your method is, the more robustness checks I’ll require if I’m your reviewer 🙂

I also want to stress that methods on lower rungs are not less valuable – it’s almost the contrary! They are brilliant methods that allow use of observational data to make conclusions, and I would not be surprised if people like Susan Athey and Guido Imbens, who have made significant contributions to these methods in the last 10 years, were awarded the Nobel prize one of these days!

The causal inference levels of evidence ladder – click on the image to enlarge it

Rung 1 – Scientific experiments

On the first rung of the ladder sit typical scientific experiments. The kind you were probably taught in middle or even elementary school. To explain how a scientific experiment should be conducted, my biology teacher had us take seeds from a box, divide them into two groups and plant them in two jars. The teacher insisted that we made the conditions in the two jars completely identical: same number of seeds, same moistening of the ground, etc. The goal was to measure the effect of light on plant growth, so we put one of our jars near a window and locked the other one in a closet. Two weeks later, all our jars close to the window had nice little buds, while the ones we left in the closet barely had grown at all. The exposure to light being the only difference between the two jars, the teacher explained, we were allowed to conclude that light deprivation caused plants to not grow.

Sounds simple enough? Well, this is basically the most rigorous you can be when you want to attribute cause. The bad news is that this methodology only applies when you have a certain level of control on both your treatment group (the one who receives light) and your control group (the one in the cupboard). Enough control at least that all conditions are strictly identical but the one parameter you’re experimenting with (light in this case). Obviously, this doesn’t apply in social sciences nor in data science.

Then why do I include it in this article you might ask? Well, basically because this is the reference method. All causal inference methods are in a way hacks designed to reproduce this simple methodology in conditions where you shouldn’t be able to make conclusions if you followed strictly the rules explained by your middle school teacher.

Rung 2 – Statistical Experiments (aka A/B tests)

Probably the most well-known causal inference method in tech: A/B tests, a.k.a. Randomized Controlled Trials for our biostatistics friends. The idea behind statistical experiments is to rely on randomness and sample size to mitigate the inability to put your treatment and control groups in the exact same conditions. Fundamental statistical results like the law of large numbers, the central limit theorem or Bayesian inference give guarantees that this will work, and a way to deduce estimates and their precision from the data you collect.
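As a minimal sketch (with made-up numbers), analysing a finished A/B test can be as simple as a two-sample test for equality of proportions in base R:

# 120 conversions out of 2,000 users in control, 150 out of 2,000 in treatment
conversions <- c(control = 120, treatment = 150)
exposed     <- c(control = 2000, treatment = 2000)
prop.test(conversions, exposed)  # estimates both proportions and a confidence interval for their difference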

Arguably, an Experiments platform should be one of the first projects any Data Science team should invest in (once all the foundational levels are in place, of course). The impact of setting up an experiments culture in tech companies has been very well documented and has earned companies like Google, Amazon, Microsoft, etc. billions of dollars.

Of course, despite being pretty reliable on paper, A/B tests come with their own sets of caveats. This white paper by Ron Kohavi and other founding members of the Experiments Platform at Microsoft is very useful.

Rung 3 – Quasi-Experiments

As awesome as A/B tests (or RCTs) can be, in some situations they just can’t be performed. This might happen because of lack of tooling (a common case in tech is when a specific framework lacks the proper tools to set up an experiment super quickly and the test becomes counter-productive), ethical concerns, or just simply because you want to study some data ex-post. Fortunately for you if you’re in one of those situations, some methods exist to still be able to get causal estimates of a factor. In rung 3 we talk about the fascinating world of quasi-experiments (also called natural experiments).

A quasi-experiment is the situation when your treatment and control group are divided by a natural process that is not truly random but can be considered close enough to compute estimates. In practice, this means that you will have different methods that will correspond to different assumptions about how “close” you are to the A/B test situation. Among famous examples of natural experiments: using the Vietnam war draft lottery to estimate the impact of being a veteran on your earnings, or the border between New Jersey and Pennsylvania to study the effect of minimum wages on the economy.

Now let me give you a fair warning: when you start looking for quasi-experiments, you can quickly become obsessed with them and start thinking about clever data collection in improbable places… Now you can't say you haven't been warned 😜 I have more than a few friends who were lured into a career in econometrics by the sheer love of natural experiments.

The most popular methods in the world of quasi-experiments are: differences-in-differences (the most common one, according to Scott Cunningham, author of the Causal Inference Mixtape), regression discontinuity design, matching, or instrumental variables (which is an absolutely brilliant construct, but rarely useful in practice). If you're able to observe (i.e. gather data on) all factors that explain how treatment and control are separated, then a simple linear regression including all factors will give good results.
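To make the differences-in-differences idea concrete, here is a hedged sketch on simulated data; the coefficient on the treated:post interaction is the diff-in-diff estimate:

set.seed(42)
df <- data.frame(
  treated = rep(c(0, 1), each = 200),   # 1 = unit belongs to the treated group
  post    = rep(c(0, 1), times = 200)   # 1 = observation is after the intervention
)
df$y <- 1 + 0.5 * df$treated + 0.3 * df$post + 0.8 * df$treated * df$post + rnorm(400)
summary(lm(y ~ treated * post, data = df))  # the interaction term recovers the simulated 0.8 effect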

Rung 4 – The world of counterfactuals

Finally, you will sometimes want to try to detect causal factors from data that is purely observational. A classic example in tech is estimating the effect of a new feature when no A/B test was done and you don’t have any kind of group that isn’t receiving the feature that you could use as a control:

Slightly adapted from CausalImpact‘s documentation

Maybe right now you're thinking: wait… are you saying we can simply look at the data before and after and be allowed to make conclusions? Well, the trick is that often it isn't that simple to make a rigorous analysis or even compute an estimate. The idea here is to create a model that will allow you to compute a counterfactual control group. Counterfactual means "what would have happened had this feature not existed". If you have a model of your number of users that you have enough confidence in to make some robust predictions, then you basically have everything you need.

There is a catch though. When using counterfactual methods, the quality of your prediction is key. Without getting too much into the technical details, this means that your model not only has to be accurate enough, but also needs to “understand” what underlying factors are driving what you currently observe. If a confounding factor that is independent from your newest rollout varies (economic climate for example), you do not want to attribute this change to your feature. Your model needs to understand this as well if you want to be able to make causal claims.

This is why robustness checks are so important when using counterfactuals. Some cool Causal Inference libraries like Microsoft’s doWhy do these checks automagically for you 😲 Sensitivity methods like the one implemented in the R package tipr can be also very useful to check some assumptions. Finally, how could I write a full article on causal inference without mentioning DAGs? They are a widely used tool to state your assumptions, especially in the case of rung 4 methods.

(Quick side note: right now with the unprecedented Covid-19 crisis, it’s likely that most prediction models used in various applications are way off. Obviously, those cannot be used for counterfactual causal analysis)

Technically speaking, rung 4 methods look really much like methods from rung 3, with some small tweaks. For example, synthetic diff-in-diff is a combination of diff-in-diff and matching. For time series data, CausalImpact is a very cool and well-known R package. causalTree is another interesting approach worth looking at. More generally, models carefully crafted with domain expertise and rigorously tested are the best tools to do Causal Inference with only counterfactual control groups.
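As an illustration, here is a sketch very close to the CausalImpact documentation example, on simulated data with a fake intervention after period 70:

library(CausalImpact)
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.8), n = 100)  # a control series not affected by the change
y <- 1.2 * x + rnorm(100)                              # the response series
y[71:100] <- y[71:100] + 10                            # the (fake) intervention effect
impact <- CausalImpact(cbind(y, x), pre.period = c(1, 70), post.period = c(71, 100))
summary(impact)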

Hope this cheat sheet will help you find the right method for your causal analyses and be impactful for your business! Let us know about your best #causalwins on our Twitter, or in the comments!


To leave a comment for the author, please follow the link and comment on their blog: R [english] – NC233.


Movie Recommendation With Recommenderlab


[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Because You Are Interested In Data Science, You Are Interested In This Blog Post

If you love streaming movies and tv series online as much as we do here at STATWORX, you’ve probably stumbled upon recommendations like „Customers who viewed this item also viewed…“ or „Because you have seen …, you like …“. Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior – But how do these companies know what their customers like? The answer is collaborative filtering.

In this blog post, I will first explain how collaborative filtering works. Secondly, I’m going to show you how to develop your own small movie recommender with the R package recommenderlab and provide it in a shiny application.

Different Approaches

There are several approaches to giving a recommendation. In user-based collaborative filtering (UBCF), the users are the focus of the recommendation system. For a new proposal, the similarities between new and existing users are first calculated. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are formed via these users and, if necessary, weighted according to their similarity. Then, the x highest rated products are displayed to the new user as a suggestion.

For item-based collaborative filtering (IBCF), however, the focus is on the products. For every two products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested.

Those and other collaborative filtering methods are implemented in the recommenderlab package (you can also list them from the package's registry, as sketched after this list):

  • ALS_realRatingMatrix: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
  • ALS_implicit_realRatingMatrix: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
  • IBCF_realRatingMatrix: Recommender based on item-based collaborative filtering.
  • LIBMF_realRatingMatrix: Matrix factorization with LIBMF via package recosystem.
  • POPULAR_realRatingMatrix: Recommender based on item popularity.
  • RANDOM_realRatingMatrix: Produce random recommendations (real ratings).
  • RERECOMMEND_realRatingMatrix: Re-recommends highly-rated items (real ratings).
  • SVD_realRatingMatrix: Recommender based on SVD approximation with column-mean imputation.
  • SVDF_realRatingMatrix: Recommender based on Funk SVD with gradient descend.
  • UBCF_realRatingMatrix: Recommender based on user-based collaborative filtering.
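As sketched here, the registry entries (and their default parameters) can be listed directly from R:

library(recommenderlab)
# list all methods registered for real-valued rating matrices
recommenderRegistry$get_entries(dataType = "realRatingMatrix")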

Developing your own Movie Recommender

Dataset

To create our recommender, we use the data from movielens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9000 films from more than 600 users. The movieId is a unique mapping variable to merge the different datasets.

head(movie_data)
  movieId                              title                                      genres
1       1                   Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2       2                     Jumanji (1995)                  Adventure|Children|Fantasy
3       3            Grumpier Old Men (1995)                              Comedy|Romance
4       4           Waiting to Exhale (1995)                        Comedy|Drama|Romance
5       5 Father of the Bride Part II (1995)                                      Comedy
6       6                        Heat (1995)                       Action|Crime|Thriller
head(ratings_data)
  userId movieId rating timestamp
1      1       1      4 964982703
2      1       3      4 964981247
3      1       6      4 964982224
4      1      47      5 964983815
5      1      50      5 964982931
6      1      70      3 964982400

To better understand the film ratings, we display the number of ratings at each level and the average rating per film. We see that in most cases there is no evaluation by a user. Furthermore, the average ratings contain a lot of "smooth" values: these are movies that only have a few individual ratings, and therefore the average score is determined by individual users.

# rating_vector
      0     0.5       1     1.5       2     2.5       3     3.5       4     4.5       5
5830804    1370    2811    1791    7551    5550   20047   13136   26818    8551   13211
Average Movie Ratings

In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings.
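The post does not show this step, but with a recommenderlab realRatingMatrix the two filters (this one and the user filter described below) could look roughly like this; ratings_matrix is an assumed name for the raw rating matrix built from ratings_data:

ratings_movies <- ratings_matrix[, colCounts(ratings_matrix) >= 50]  # keep films with at least 50 ratings
ratings_movies <- ratings_movies[rowCounts(ratings_movies) >= 50, ]  # keep users with at least 50 ratings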

Average Movie Ratings - filtered
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.208   3.444   3.748   3.665   3.944   4.429 

Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user:

Average Movie Ratings - relevant

You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.

ratings_movies_norm <- normalize(ratings_movies)

Model Training and Evaluation

To train our recommender and subsequently evaluate it, we carry out a 10-fold cross-validation. Also, we train both an IBCF and a UBCF recommender, which in turn calculate the similarity measure via cosine similarity and Pearson correlation. A random recommendation is used as a benchmark. To evaluate how many recommendations can be given, different numbers are tested via the vector n_recommendations.

eval_sets <- evaluationScheme(data = ratings_movies_norm,
                              method = "cross-validation",
                              k = 10,
                              given = 5,
                              goodRating = 0)

models_to_evaluate <- list(
  `IBCF Cosinus` = list(name = "IBCF", param = list(method = "cosine")),
  `IBCF Pearson` = list(name = "IBCF", param = list(method = "pearson")),
  `UBCF Cosinus` = list(name = "UBCF", param = list(method = "cosine")),
  `UBCF Pearson` = list(name = "UBCF", param = list(method = "pearson")),
  `Zufälliger Vorschlag` = list(name = "RANDOM", param = NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))

list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)

We then have the results displayed graphically for analysis.

Different models

We see that the best performing model is built by using UBCF and the Pearson correlation as a similarity measure. The model consistently achieves the highest true positive rate for the various false-positive rates and thus delivers the most relevant recommendations. Furthermore, we want to maximize the recall, which is also guaranteed at every level by the UBCF Pearson model. Since the n most similar users (parameter nn) are used to calculate the recommendations, we will examine the results of the model for different numbers of users.

vector_nn <- c(5, 10, 20, 30, 40)

models_to_evaluate <- lapply(vector_nn, function(nn){
  list(name = "UBCF",
       param = list(method = "pearson", nn = nn))
})
names(models_to_evaluate) <- paste0("UBCF mit ", vector_nn, " Nutzern")

list_results <- evaluate(x = eval_sets,
                         method = models_to_evaluate,
                         n = n_recommendations)
Different users

Conclusion

Our user-based collaborative filtering model with the Pearson correlation as a similarity measure and 40 users as nearest neighbours delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny App.
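For reference, training that final model and producing top-10 suggestions could be sketched as follows (object names follow the ones used above; treat this as an outline rather than the exact app code):

final_model <- Recommender(getData(eval_sets, "train"),
                           method = "UBCF",
                           param = list(method = "pearson", nn = 40))
top_10 <- predict(final_model, getData(eval_sets, "known"), n = 10)
as(top_10, "list")[1:3]  # recommended movie ids for the first three users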

However, there is no guarantee that the suggested movies really meet the individual taste. Not only is the underlying data set relatively small and can still be distorted by user ratings, but the tech giants also use other data such as age, gender, user behavior, etc. for their models.

But what I can say is: Data Scientists who read this blog post also read the other blog posts by STATWORX.

Shiny-App

Here you can find the Shiny App. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good) and press the run button. Please note that the app is hosted on a free account of shinyapps.io, which makes it available for 25 hours per month. If the 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio.

About the Author

Andreas Vogl

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.

Sign Up Now!

The post Movie Recommendation With Recommenderlab first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX.


Nina and John Speaking at Why R? Webinar Thursday, May 7, 2020


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Nina Zumel and John Mount will be speaking on advanced data preparation for supervised machine learning at the Why R? Webinar Thursday, May 7, 2020.


This is at 8 pm in a GMT+2 timezone, which for us is 11 AM Pacific Time. Hope to see you there!


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.



Testing for Covid-19 in the U.S.


[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

For almost a month, on a daily basis, we have been working with colleagues (Romuald, Chi and Mathieu) on modeling the dynamics of the recent pandemic. I learn a lot of things discussing with them, but we keep struggling with the tests. Paul, in Montréal, helped me a little bit, but I think we still have more to do to get a better understanding. To be honest, we struggle with two very simple questions

  • how many people are tested on a daily basis?

Recently, I discovered Modelling COVID-19 exit strategies for policy makers in the United Kingdom, which is very close to what we try to do… and in the document two interesting scenarios are discussed: for the first one, "1 million 'reliable' daily tests are deployed" (in the U.K.), and for the second, "5 million 'useless' daily tests are deployed". There are about 65 million inhabitants in the U.K., so we are talking about 1.5% of the people tested on a daily basis, or 7.69%! It could make sense, but our question was, at some point, is that realistic? Where are we today with testing? In the U.S., https://covidtracking.com/ collects interesting data, on a daily basis, per state.

url = "https://raw.githubusercontent.com/COVID19Tracking/covid-tracking-data/master/data/states_daily_4pm_et.csv"
download.file(url, destfile = "covid.csv")
base = read.csv("covid.csv")

Unfortunately, there is no information about the population. That we can find on Wikipedia. But in that table, the state is given by its full name (and by its symbol in the previous dataset). So we also need to match the two datasets properly,

url = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population"
download.file(url, destfile = "popUS.html")
# not contaminated 2/3, R=3
library(XML)
tables = readHTMLTable("popUS.html")
T = tables[[1]][3:54, c("V3", "V4")]
names(T) = c("state", "pop")
url = "https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations"
download.file(url, destfile = "nameUS.html")
tables = readHTMLTable("nameUS.html")
T2 = tables[[1]][13:63, c(1, 4)]
names(T2) = c("state", "symbol")
T = merge(T, T2)
T$population = as.numeric(gsub(",", "", T$pop, fixed = TRUE))
names(base)[2] = "symbol"
base = merge(base, T[, c("symbol", "population")])

Now our dataset is fine… and we can get a function to plot the number of people tested in the U.S. (cumulated). Here, we distinguish between the positive and the negative,

drawing = function(st = "NY"){
  sbase = base[base$symbol == st, c("date", "positive", "negative", "population")]
  sbase$DATE = as.Date(as.character(sbase$date), "%Y%m%d")
  sbase = sbase[order(sbase$DATE), ]
  par(mfrow = c(1, 2))
  plot(sbase$DATE, (sbase$positive + sbase$negative) / sbase$population,
       ylab = "Proportion Test (/population of state)",
       type = "l", xlab = "", col = "blue", lwd = 3)
  lines(sbase$DATE, sbase$positive / sbase$population, col = "red", lwd = 2)
  legend("topleft", c("negative", "positive"), lwd = 2, col = c("blue", "red"), bty = "n")
  title(st)
  plot(sbase$DATE, sbase$positive / (sbase$positive + sbase$negative),
       ylab = "Ratio of positive tests", ylim = c(0, 1),
       type = "l", xlab = "", col = "black", lwd = 3)
  title(st)
}

Let us start with New York

drawing("NY")

As of now, 4% of the entire population got tested… over 6 weeks… The graph on the right is the proportion of people who tested positive… I won't get back to that one here today, I keep it for our work. In New Jersey, about 2.5% of the entire population got tested, overall,

drawing("NJ")

Let us try a last one, Florida

drawing("FL")

As of today, it is 1.5% of the population, over 6 weeks. Overall, in the U.S., less than 0.1% of people are tested on a daily basis. Which is far from the 1.5% in the U.K. scenarios. Now, here comes the second question,

  • what are we actually testing for?

On that one, my experience in biology is… very limited, and Paul helped me. He mentioned this morning a nice report, from a lab in UC Berkeley

One of my questions was for instance, if you test positive, and you do it again, can you test negative? Or, in the context of our data, do we test different people? Are some people tested on a regular basis (perhaps every week)? For instance, with virological tests (Reverse Transcription Quantitative Polymerase Chain Reaction, RT-qPCR – also called molecular or PCR tests) we test whether someone is currently infected, while with antibody tests (serological immunoassays that detect virus-specific antibodies — Immunoglobulin M (IgM) and G (IgG) — also called serology tests), we test for immunity. Which is rather different…

I have no idea what we have in our database, to be honest… and for the past six weeks, I have seen a lot of databases, and most of the time, I don't know how to interpret them, I don't know what is measured… and it is scary. So, so far, we try to do some maths, to test dynamics by tuning parameters "the best we can" (and not estimate them). But if anyone has good references on testing, in the context of Covid-19 (for instance on the specificity and sensitivity of all those tests), I would love to hear about it!


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.


Highlights of Hugo Code Highlighting


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Thanks to a quite overdue update of Hugo on our build system1, our website can now harness the full power of Hugo code highlighting for Markdown-based content. What’s code highlighting apart from the reason behind a tongue-twister in this post title? In this post we shall explain how Hugo’s code highlighter, Chroma, helps you prettify your code (i.e. syntax highlighting), and accentuate parts of your code (i.e. line highlighting).

Make your code look pretty

If you notice and appreciate the difference between

a <- c(1:7, NA)mean(a, na.rm = TRUE)

and

a<-c(1:7,NA)mean(a,na.rm=TRUE)

you might agree with Mara Averick’s opinion,

Syntax highlighting! 🖍 Just do it. Life is better when things are colourful.

Syntax highlighting means some elements of code blocks, like functions, operators, comments, etc. get styled differently: they could be colored or in italic.

Now, how do the colors of the second block appear?

First of all, it’s a code block with language information, in this case R (note the r after the backticks),

```ra<-c(1:7,NA)mean(a,na.rm=TRUE)```

as opposed to

```a <- c(1:7, NA)mean(a, na.rm = TRUE)```

without language information, that won’t get highlighted – although some syntax highlighting tools, not Hugo Chroma, do some guessing.

There are in general two ways in which colors are added to code blocks, client-side syntax highlighting and server-side syntax highlighting. The latter is what Hugo supports nowadays but let’s dive into both for the sake of completeness (or because I’m proud I now get it2).

Client-side syntax highlighting

In this sub-section I’ll mostly refer to highlight.js but principles probably apply to other client-side syntax highlighting tools. The “client-side” part of this phrase is that the html that is served by your website host does not have styling for the code. In highlight.js case, styling appears after a JS script is loaded and applied.

If we look at a post of Mara Averick’s at the time of writing, the html of a block is just

<pre class="r"><code>pal_a <- extract_colours("https://i.imgur.com/FyEALqr.jpg", num_col = 8)
par(mfrow = c(1,2))
pie(rep(1, 8), col = pal_a, main = "Palette based on Archer Poster")
hist(Nile, breaks = 8, col = pal_a, main = "Palette based on Archer Poster")</code></pre>

Now, using Firefox Developer Console,

Screenshot of blog post with Firefox Developer Console open

we see colors come from CSS classes starting with “hljs”.

And in the head of that page (examined via “View source”), there’s

<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>

which is the part loading and applying highlight.js to the page. Now, how does it know what’s for instance a string in R? If we look at highlight.js highlighter for the R language, authored by Joe Cheng in 2012, it’s a bunch of regular expressions, see for instance the definition of a string.

className:'string',contains:[hljs.BACKSLASH_ESCAPE],variants:[{begin:'"',end:'"'},{begin:"'",end:"'"}]

When using highlight.js on your website, you might need to specify R as a supplementary language in your config, since some languages are bundled by default whilst others are not. You could also whip up some code to conditionally load supplementary highlight.js languages.

A big downside of client-side syntax highlighting is loading time: it appears quite fast if your internet connection isn’t poor, but you might have noticed code blocks changing aspect when loading a web page (first not styled, then styled). Moreover, Hugo now supports, and uses by default, an alternative that we’ll describe in the following subsection and take advantage of in this post’s second section.

Server-side syntax highlighting

In server-side syntax highlighting, with say Pygments or Chroma (Hugo default), your website html as served already has styling information.

With Chroma, that styling information is either:

  • hard-coded inline in the html of your page3 (Hugo's default), as in the example below;

Screenshot of blog post with Firefox Developer Console open

The html source for one of the blocks of the page screenshot above is

<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">df <span style="color:#666">%>%</span>
  <span style="color:#00f">group_by</span>(g1, g2) <span style="color:#666">%>%</span>
  <span style="color:#00f">summarise</span>(a <span style="color:#666">=</span> <span style="color:#00f">mean</span>(a), b <span style="color:#666">=</span> <span style="color:#00f">mean</span>(b), c <span style="color:#666">=</span> <span style="color:#00f">mean</span>(c), d <span style="color:#666">=</span> <span style="color:#00f">mean</span>(c))</code></pre></div>

The style used is indicated in the website config and picked from Chroma style gallery.

  • via the use of CSS classes also indicated in html, as is the case of this website.

Screenshot of blog post with Firefox Developer Console open

The html of the block seen above is

<div class="highlight"><pre class="chroma"><code class="language-r" data-lang="r"><span class="nf">install.packages</span><span class="p">(</span><span class="s">"parzer"</span><span class="p">,</span> <span class="n">repos</span> <span class="o">=</span> <span class="s">"https://dev.ropensci.org/"</span><span class="p">)</span></code></pre></div>

and it goes hand in hand with having styling for different “.chroma” classes in our website CSS.

.chroma .s { color: #a3be8c }

To have this behaviour, in our website config there’s

pygmentsUseClasses=true

which confusingly enough uses the name “Pygments”, not Chroma, for historical reasons. You’d use CSS like we do if none of Chroma default styles suited you, if you wanted to make sure the style colors respect WCAG color contrast guidelines (see last section), or if you want to add a button switching the CSS applied to the classes, which we did for this note using a dev.to post by Alberto Montalesi.4 Click the button below! It will also let you switch back to light mode.

Switch to dark mode


To generate a stylesheet for a given style, use Hugo's hugo gen chromastyles --style=monokai > syntax.css command. You can then use the stylesheet as is, or tweak it.

How does Chroma know what part of the code is, for instance, of the string class? Once again, regular expressions help, in this case in what is called a lexer. Chroma is inspired by Pygments, and in the Pygments docs it is explained that "A lexer splits the source into tokens, fragments of the source that have a token type that determines what the text represents semantically (e.g., keyword, string, or comment)." In the R lexer, ported from Pygments to Chroma by Chroma maintainer Alec Thomas, for strings we e.g. see

{`\'`, LiteralString, Push("string_squote")},
{`\"`, LiteralString, Push("string_dquote")},
// ... code
"string_squote": {
    {`([^\'\\]|\\.)*\'`, LiteralString, Pop(1)},
},
"string_dquote": {
    {`([^"\\]|\\.)*"`, LiteralString, Pop(1)},
},

Chroma works on Markdown content, so if you use blogdown to generate pages as html, you can only use client-side highlighting, like this tidyverse.org page whose source is html. By default nowadays Hugo does server-side syntax highlighting but you could choose to turn it off via codeFences = false.

We have now seen how Hugo websites have syntax highlighting, which for Yihui Xie “is only for cosmetic purposes”. Well, Chroma actually also offers one thing more: line numbering and line highlighting!

Emphasize parts of your code

With Chroma, you can apply special options to code blocks defined with fences, i.e. starting with three backticks and language info, and ending with three backticks5.

On Chroma options for line highlighting

See how

```r {hl_lines=[1,"4-5"]}
library("dplyr")
df %>%
  mutate(date = lubridate::ymd(date_string)) %>%
  select(- date_string)
str(df)
nrow(df)
```

is rendered below: lines 1 and 4 to 5 are highlighted.

library("dplyr")df%>%mutate(date=lubridate::ymd(date_string))%>%select(-date_string)str(df)nrow(df)

There are also options related to line numbering.

```r {hl_lines=[1,"4-5"],linenos=table,linenostart=3}
library("dplyr")
df %>%
  mutate(date = lubridate::ymd(date_string)) %>%
  select(- date_string)
str(df)
nrow(df)
```

gives a code block with line numbered as table (easier for copy-pasting the code without line numbers), starting from number 3.

3  library("dplyr")
4  df %>%
5    mutate(date = lubridate::ymd(date_string)) %>%
6    select(- date_string)
7  str(df)
8  nrow(df)

You can also configure line numbering for your whole website.

The real magic to me is that if you write your code from R Markdown you can

  • apply the options to the source chunk using a knitr hook like the one defined in our archetype;

  • use R code to programmatically produce code block between fences, e.g. choosing which lines to highlight.

knitr hook to highlight lines of source code

Our hook is

# knitr hook to use Hugo highlighting options
knitr::knit_hooks$set(
  source = function(x, options) {
    hlopts <- options$hlopts
    paste0(
      "```", "r ",
      if (!is.null(hlopts)) {
        paste0("{",
               glue::glue_collapse(
                 glue::glue('{names(hlopts)}={hlopts}'),
                 sep = ","
               ),
               "}")
      },
      "\n",
      glue::glue_collapse(x, sep = "\n"),
      "\n```\n"
    )
  }
)

The chunk6

```{r name-your-chunks, hlopts=list(linenos="table")}
a <- 1 + 1
b <- 1 + 2
c <- 1 + 3
a + b + c
```

is rendered as

1  a <- 1 + 1
2  b <- 1 + 2
3  c <- 1 + 3
4  a + b + c
[1] 9

PSA! Note that if you’re after line highlighting, or function highlighting, for R Markdown documents in general, you should check out Kelly Bodwin’s flair package!

Produce line-highlighted code blocks with glue/paste0

What Chroma highlights are code blocks with code fences, which you might as well generate from R Markdown using some string manipulation and knitr results="asis" chunk option. E.g.

```{r, results="asis"}
script <- c("a<-1", "b<-2", "c<-3", "a+b+c")
cool_lines <- sample(1:4, 2)
cool_lines <- stringr::str_remove(toString(cool_lines), " ")
fences_start <- paste0('```', 'r {hl_lines=[', cool_lines, ']}')
glue::glue_collapse(c(fences_start, script, "```"), sep = "\n")
```

will be knit to produce

a<-1
b<-2
c<-3
a+b+c

This is a rather uninteresting toy example since we used randomly drawn line numbers to be highlighted, but you might find use cases for this. We used such an approach in the recent blog post about Rclean, actually!

Accessibility

Since highlighting syntax and lines changes the color of things, it might make it harder for some people to read your content, so the choice color is a bit more than about cosmetics.

Disclaimer: I am not an accessibility expert. Our efforts were focused on contrast only, not differences between say green and red, since these do not endanger legibility of code.

We referred to the contrast criterion of the Web Content Accessibility Guidelines of the World Wide Web Consortium, which states: "The intent of this Success Criterion is to provide enough contrast between text and its background so that it can be read by people with moderately low vision (who do not use contrast-enhancing assistive technology)."

For instance, comments could be lighter or darker than code, but it is crucial to pay attention to the contrast between comments and the code background! Like Max Chadwick, we darkened the colors of a default Chroma style, friendly, until it passed an online contrast checker. Interestingly, this online tool can only work with a stylesheet: for a website with colors written inline (Hugo's default of pygmentsUseClasses=false), it won't pick up color contrast problems. We chose friendly as a basis because its background can stand out a bit against white without being a dark theme, which might be bad on a mobile device in direct sunlight. Comments are moreover in italic, which helps distinguish them from other code parts.

Our approach is less good than having an actual designer pick colors, like what Codepen recently did, but it will do for now. Apart from Max Chadwick's efforts on 10 Pygments styles, we only know of Eric Bailey's a11y dark and light themes as highlighting themes that are advertised as accessible.

A further aspect of contrast when using Chroma is that when highlighting a line, its background will have a different color than normal code. This color also needs to not endanger the contrast between code and code background, so if your code highlighting is “dark mode”, yellow highlighting is probably a bad idea: in this post, for the dark mode, we used the “fruity” Chroma style but with #301934 as background color for the highlighted lines. It would also be a bad idea to only rely on line highlighting, as opposed to commenting code blocks, since some readers might not be able to differentiate highlighted lines. Commenting code blocks is probably a good practice in general anyway, explaining what it does instead of just sharing the code like you’d share a gist.

For further reading on accessibility of R Markdown documents, we recommend “Accessible R Markdown Documents” by A. Jonathan R. Godfrey.

Conclusion

In this post we’ve explained some concepts around code highlighting: both client-side and server-side syntax highlighting; and line highlighting with Chroma. We’ve even included a button for switching to dark mode and back as a proof-of-concept. Being able to properly decorate code might make your content more attractive to your readers, or motivate you to write more documentation, which is great. Now, how much time to fiddle with code appearance is probably a question of taste.


  1. Our website is deployed via Netlify. ↩

  2. Support for striking text, with ~~blablabla~~ is also quite new in Hugo, thanks to its new Markdown handler Goldmark! ↩

  3. In this case colors are also hard-coded in RSS feeds which means the posts will look better in feed readers. ↩

  4. With color not hard-coded in the html, but as classes, you could imagine folks developing browser extensions to override your highlighting style. ↩

  5. There is also a highlight shortcode which to me is less natural to use in R Markdown or in Markdown as someone used to Markdown. ↩

  6. I never remember how to show code chunks without their being evaluated so I always need to look at the source of Garrick Aden-Buie’s blog post about Rmd fragments. ↩


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


Z is for Additional Axes


[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Here we are at the last post in Blogging A to Z! Today, I want to talk about adding additional axes to your ggplot, using the options for fill or color. While these aren’t true z-axes in the geometric sense, I think of them as a third, z, axis.

Some of you may be surprised to learn that fill and color are different, and that you could use one or both in a given plot.

Color refers to the outline of the object (bar, piechart wedge, etc.), while fill refers to the inside of the object. For scatterplots, the default shape doesn’t have a fill, so you’d just use color to change the appearance of those points.

Let’s recreate the pages read over 2019 chart, but this time, I’ll just use fiction books and separate them as either fantasy or other fiction; this divides that dataset pretty evenly in half. Here’s how I’d generate the pages read over time separately by those two genre categories.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 -- 
##  ggplot2 3.2.1      purrr   0.3.3 ##  tibble  2.1.3      dplyr   0.8.3 ##  tidyr   1.0.0      stringr 1.4.0 ##  readr   1.3.1      forcats 0.4.0 
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag()    masks stats::lag() 
reads2019<-read_csv("~/Downloads/Blogging A to Z/SaraReads2019_allchanges.csv",col_names=TRUE)
## Parsed with column specification: ## cols( ##   Title = col_character(), ##   Pages = col_double(), ##   date_started = col_character(), ##   date_read = col_character(), ##   Book.ID = col_double(), ##   Author = col_character(), ##   AdditionalAuthors = col_character(), ##   AverageRating = col_double(), ##   OriginalPublicationYear = col_double(), ##   read_time = col_double(), ##   MyRating = col_double(), ##   Gender = col_double(), ##   Fiction = col_double(), ##   Childrens = col_double(), ##   Fantasy = col_double(), ##   SciFi = col_double(), ##   Mystery = col_double(), ##   SelfHelp = col_double() ## ) 
fantasy<-reads2019%>%filter(Fiction==1)%>%mutate(date_read=as.Date(date_read,format='%m/%d/%Y'),Fantasy=factor(Fantasy,levels=c(0,1),labels=c("Other Fiction","Fantasy")))%>%group_by(Fantasy)%>%mutate(GenreRead=order_by(date_read,cumsum(Pages)))%>%ungroup()

Now I’d just plug that information into my ggplot code, but add a third variable in the aesthetics (aes) for ggplot – color = Fantasy.

library(scales)
##  ## Attaching package: 'scales' 
## The following object is masked from 'package:purrr': ##  ##     discard 
## The following object is masked from 'package:readr': ##  ##     col_factor 
myplot <- fantasy %>%
  ggplot(aes(date_read, GenreRead, color = Fantasy)) +
  geom_point() +
  xlab("Date") +
  ylab("Pages") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month") +
  scale_y_continuous(labels = comma, breaks = seq(0, 30000, 5000)) +
  labs(color = "Genre of Fiction")
myplot

This plot uses the default ggplot2 color scheme. I could change those colors, either by using an existing color scheme or by defining my own. Let’s make a FiveThirtyEight-style figure, using their theme for the overall plot and their color scheme for the genre variable.

library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.3 
myplot + scale_color_fivethirtyeight() + theme_fivethirtyeight()

I can also specify my own colors.

myplot + scale_color_manual(values = c("#4b0082", "#ffd700")) + theme_minimal()

geom_point offers many point shapes; shapes 21 through 25 let you specify both color and fill, while the rest respond only to color.

library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.6.3 
## Loading required package: magrittr 
##  ## Attaching package: 'magrittr' 
## The following object is masked from 'package:purrr': ##  ##     set_names 
## The following object is masked from 'package:tidyr': ##  ##     extract 
ggpubr::show_point_shapes()
## Scale for 'y' is already present. Adding another scale for 'y', which will ## replace the existing scale. 
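
For instance, here’s a minimal sketch reusing the fantasy data built above (the shape and size values are arbitrary): with shape 21, genre can be mapped to fill while color draws a constant black outline.

fantasy %>%
  ggplot(aes(date_read, GenreRead, fill = Fantasy)) +
  geom_point(shape = 21, color = "black", size = 2) +
  labs(fill = "Genre of Fiction")

With any of the other shapes, the fill argument is simply ignored.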

Of course, you may have plots where changing fill is best, such as on a bar plot. In my summarize example, I created a stacked bar chart of fiction versus non-fiction with author gender as the fill.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0, 1), labels = c("Male", "Female")),
         Fiction = factor(Fiction, levels = c(0, 1),
                          labels = c("Non-Fiction", "Fiction"), ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col() +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")

Stacking is the default, but with position = "dodge" I could also place the bars next to each other.

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0, 1), labels = c("Male", "Female")),
         Fiction = factor(Fiction, levels = c(0, 1),
                          labels = c("Non-Fiction", "Fiction"), ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col(position = "dodge") +
  scale_fill_economist() +
  xlab("Genre") +
  labs(fill = "Author Gender")
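
As a related sketch, the same position argument also accepts "fill", which stretches every bar to the same height so the chart shows the proportion of books by author gender rather than raw counts:

reads2019 %>%
  mutate(Gender = factor(Gender, levels = c(0, 1), labels = c("Male", "Female")),
         Fiction = factor(Fiction, levels = c(0, 1),
                          labels = c("Non-Fiction", "Fiction"), ordered = TRUE)) %>%
  group_by(Gender, Fiction) %>%
  summarise(Books = n()) %>%
  ggplot(aes(Fiction, Books, fill = reorder(Gender, desc(Gender)))) +
  geom_col(position = "fill") +
  scale_fill_economist() +
  xlab("Genre") +
  ylab("Proportion of Books") +
  labs(fill = "Author Gender")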

You can also use fill (or color) with the same variable you mapped to x or y; instead of acting as a third scale, it just adds some color and separation to help distinguish the categories already on that axis. This is especially helpful when you have many categories being plotted, because it breaks up the wall of bars. If you do this, I’d recommend choosing a palette of complementary colors rather than highly contrasting ones, and you’ll probably want to drop the legend, since the axis is already labeled.

genres <- reads2019 %>%
  group_by(Fiction, Childrens, Fantasy, SciFi, Mystery) %>%
  summarise(Books = n())

genres <- genres %>%
  bind_cols(Genre = c("Non-Fiction", "General Fiction", "Mystery",
                      "Science Fiction", "Fantasy", "Fantasy Sci-Fi",
                      "Children's Fiction", "Children's Fantasy"))

genres %>%
  filter(Genre != "Non-Fiction") %>%
  ggplot(aes(reorder(Genre, -Books), Books, fill = Genre)) +
  geom_col() +
  xlab("Genre") +
  scale_x_discrete(labels = function(x) { sub("\\s", "\n", x) }) +
  scale_fill_economist() +
  theme(legend.position = "none")

If you only have a couple of categories and want to draw a contrast, that’s when contrasting shades work well: for instance, at work, when I plot performance on an item, I use red for incorrect and blue for correct to maximize the contrast between the two performance levels, as in the sketch below.
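
Here’s a hedged sketch of that kind of two-color contrast; the item_results data frame and the specific hex values are made up purely for illustration.

# hypothetical item-level performance data, just for illustration
item_results <- tibble(
  Response = c("Incorrect", "Correct"),
  Count = c(35, 65)
)

item_results %>%
  ggplot(aes(Response, Count, fill = Response)) +
  geom_col() +
  scale_fill_manual(values = c("Incorrect" = "#d62728", "Correct" = "#1f77b4")) +
  theme(legend.position = "none")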

I hope you enjoyed this series! There’s so much more you can do with the tidyverse than what I covered this month. Hopefully this has given you enough to get started and has sparked your interest to learn more. Once again, I highly recommend checking out R for Data Science.


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


Which Technology Should I Learn?


[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Knowing where to start can be challenging, but we’re here to help. Read on to learn more about where to begin on your data science and analytics journey.

Data science and analytics languages

If you (or your organization) are new to data science and analytics, you’ll need to pick a language for analyzing your data, and you’ll want a thoughtful way to make that decision. Read our blog post and tutorial to learn how to choose between the two most popular languages for data science, Python and R, or read on for a brief summary.

Python

Python is one of the world’s most popular programming languages. It is production-ready, meaning it can serve as a single tool that integrates with every part of your workflow. So whether you want to build a web application or a machine learning model, Python can get you there!

  • General-purpose programming language (can be used to make anything)
  • Widely considered one of the most accessible programming languages to read and learn
  • The language of choice for cutting edge machine learning and AI applications
  • Commonly used for putting models "in production"
  • Has high ease of deployment and reproducibility

R

R has been used primarily in academia and research, but in recent years, enterprise usage has rapidly expanded. Built specifically for working with data, R provides an intuitive interface to the most advanced statistical methods available today.

  • Built specifically for data analysis and visualization
  • Traditionally used by statisticians and academic researchers
  • The language of choice for cutting edge statistics
  • A vast collection of community-contributed packages
  • Rapid prototyping of data-driven apps and dashboards

SQL

Much of the world’s raw data lives in organized collections of tables called relational databases. Data analysts and data scientists must know how to wrangle and extract data from these databases using SQL.

  • Useful for every organization that stores information in databases
  • One of the most in-demand skills in business
  • Used to access, query, and extract structured data that has been organized into a formatted repository, e.g., a database
  • Its scope includes data query, data manipulation, data definition, and data access control
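
As a small sketch of what querying a database with SQL can look like from R, assuming the DBI and RSQLite packages are installed; the books table and the query here are hypothetical, purely for illustration:

library(DBI)

# A throwaway in-memory SQLite database, just to demonstrate a SQL query
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "books", data.frame(title = c("A", "B"), pages = c(300, 250)))

# A basic SQL query: pull every book longer than 275 pages
dbGetQuery(con, "SELECT title, pages FROM books WHERE pages > 275")

dbDisconnect(con)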

Databases

Data scientists, analysts, and engineers must constantly interact with databases, which can store a vast amount of information in tables without slowing down performance. You can use SQL to query data from databases and model different phenomena in your data and the relationships between them. Find out the differences between the most popular databases in our blog post or read on for a summary.

Microsoft SQL Server

  • Commercial relational database management system (RDBMS), built and maintained by Microsoft
  • Available on Windows and Linux operating systems

PostgreSQL

  • Free and open-source RDBMS, maintained by PostgreSQL Global Development Group and its community
  • Beginner-friendly

Oracle Database

  • The most popular RDBMS, used by 97% of Fortune 100 companies
  • Requires knowledge of PL/SQL, an extension of SQL, to access and query data

Spreadsheets

Spreadsheets are used across the business world to transform mountains of raw data into clear insights by organizing, analyzing, and storing data in tables. Microsoft Excel and Google Sheets are the most popular spreadsheet software, with a flexible structure that allows data to be entered in cells of a table.

Google Sheets

  • Free for users
  • Allows collaboration between users via link sharing and permissions
  • Statistical analysis and visualization must be done manually

Microsoft Excel

  • Requires a paid license
  • Not as favorable as Google Sheets for collaboration
  • Contains built-in functions for statistical analysis and visualization

Business intelligence tools

Business intelligence (BI) tools make data discovery accessible for all skill levels—not just advanced analytics professionals. They are one of the simplest ways to work with data, providing the tools to collect data in one place, gain insight into what will move the needle, forecast outcomes, and much more.

Tableau

Tableau is data visualization software that works like a supercharged Microsoft Excel. Its user-friendly drag-and-drop functionality makes it simple for anyone to access and analyze data and to create highly impactful visualizations.

  • A widely used business intelligence (BI) and analytics software trusted by companies like Amazon, Experian, and Unilever
  • User-friendly drag-and-drop functionality
  • Supports multiple data sources including Microsoft Excel, Oracle, Microsoft SQL, Google Analytics, and SalesForce

Microsoft Power BI

Microsoft Power BI allows users to connect to and transform raw data, add calculated columns and measures, create simple visualizations, and combine them into interactive reports.

  • Web-based tool that provides real-time data access
  • User-friendly drag-and-drop functionality
  • Leverages existing Microsoft systems like Azure, SQL, and Excel

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.


Expert opinion (again)


[This article was first published on R | Gianluca Baio, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This is the second video I mentioned here; it took a while to get out, but it’s available now. I think you need to register here, and then you can see our panel discussion. Like I said earlier, it was good fun, and I think the actual session we did at ISPOR last year was very well received. It’s a shame that we can’t build on the momentum at the next R-HTA (which, I think, we’re going to have to postpone, given the COVID-19 emergency…).


To leave a comment for the author, please follow the link and comment on their blog: R | Gianluca Baio.

