Channel: R-bloggers

How California Uses Shiny in Production to Fight COVID-19

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

Short term forecast from the California COVID Assessment Tool (CalCAT)

“Things move along so rapidly nowadays that people saying: ‘It can’t be done,’ are always being interrupted by somebody doing it.” – Puck magazine, 1903.

As we at RStudio have talked about the topic of serious data science, we often field questions about the suitability of R for use in large-scale, production environments. Those questions typically coalesce around:

  1. Speed: Is R fast enough to run production workloads?
  2. Scalability: Can R be used for large scale production?
  3. Infrastructure: What kind of R infrastructure do administrators need to run production applications?

Rather than debate these questions in theory, this post turns to an organization that is not just talking about deploying Shiny dashboards in large-scale production, but is actually “doing it”.

Many definitions exist for what constitutes an application being in large-scale production. For the purposes of this article, we’ll define large-scale production as:

Applications serving thousands of users on a daily basis.

One application that fits this definition nicely is the California COVID Assessment Tool (CalCAT) which serves 32 million Californian citizens. CalCAT is a Shiny dashboard written in R by a group of data scientists within the California Department of Public Health (CDPH) and is hosted on an array of commercial RStudio Team servers.

RStudio recently talked with members of the team who deployed this dashboard to understand how this public, large-scale Shiny app came to be. The following sections present some of our takeaways from those discussions.

CDPH’s First Shiny Dashboard Tracked Opioid Use

CDPH’s Opioid Overdose Surveillance application

The CalCAT dashboard project was born out of CDPH’s experience fielding a prior public-facing Shiny dashboard in 2016, namely the CDPH Opioid Overdose Surveillance application. That application evolved largely from:

  • A need to get data out quickly. CDPH didn’t really have an enterprise-level dashboarding solution secured in 2016. When the opioid crisis arrived, the department realized it needed to get data out quickly and update it as needed as the epidemic gripped the state.
  • The ability to deploy a dashboard using free software and cloud resources. When looking for a dashboarding solution, one of the developers evaluated Shiny, realized it was free and open source, and that RStudio offered shinyapps.io for a very low cost way for CDPH to deploy it. Without the need for a capital investment to get started, they created some basic visualizations, shopped them to leadership including the director of the department, and got the full go-ahead to develop and deploy shortly thereafter. This allowed them to get their opioid dashboard out in 3 or 4 months, which was unheard of at the time.
  • A positive reception by users. California was one of the first states in the country that had a public opioid overdose dashboard. This positive experience with Shiny and shinyapps.io generated interest in R and encouraged the building of more internal infrastructure for hosting and deploying these apps.

COVID-19’s Arrival Made Sharing Data Mission Critical

When COVID-19 arrived in the United States in early 2020, many organizations, both inside and outside of the California Department of Public Health, suddenly found themselves wanting data to respond to the pandemic. That demand led to:

  • The formation of the CalCAT development team. CalCAT evolved out of some early work with Johns Hopkins University regarding scenario-based models. Initially, CalCAT just wanted to develop a quick lightweight app to explore the simulations that Johns Hopkins was providing and to share it using an RStudio Connect server with other CDPH staff.
  • Creation of an extranet-hosted Shiny dashboard for COVID-19. Based on their experience with the Opioid Dashboard, the team developed an internal Shiny app to provide visualizations of what was going on throughout the state. As the dashboard evolved, CDPH moved it to the state government extranet for others to access.
  • Expanding the dashboard to serve other departments with data. While the app began as an effort to share data with county health officers and local epidemiologists, people from other departments started asking, “How did you get this number? We can’t replicate it.” That led the team to expand the app to allow users to download the code and data behind the visualizations and do their own analyses.

Once other departments gained access to the data, the app quickly became a vital source of COVID information throughout the state because it:

  • Allowed authenticated access to internal confidential data. Because the COVID dashboard authenticated county health officers to gain access to the Shiny app, it could include aggregated confidential data beyond what would normally be available to the general public.
  • Supported county-based dashboards. County health jurisdictions found that they could download their county’s data and republish it on their own dashboards, thereby giving their users visibility into their local situation.
  • Drove county-level pandemic actions. California established hard metrics such as case and infection rates to guide which businesses were allowed to open. The data published by this extranet dashboard ensured everyone was working from a consistent set of measurements and actions that were authorized by the state.

Responding to the Emergency: Creating A Public Dashboard for California Citizens

The CalCAT public dashboard

The extranet site helped CDPH and the county health officers understand both the depth and breadth of pandemic infections within California. However, on March 4, 2020, the following announcement spurred the department to build a public site.

“As part of the state’s response to address the global COVID-19 outbreak, Governor Gavin Newsom today declared a State of Emergency to make additional resources available, formalize emergency actions already underway across multiple state agencies and departments, and help the state prepare for broader spread of COVID-19. The proclamation comes as the number of positive California cases rises and following one official COVID-19 death.” – Gavin Newsom, Governor of California, March 4, 2020

In response to the Governor’s mandate, the team:

  • Deployed the public COVID dashboard app you see today. Based on their work with their internal county-based dashboard and with advice from DJ Patil, the Chief Data Scientist of the United States in the Obama administration, the team modified and upgraded the internal county-based app into what you currently see today. This dashboard allows people to explore both the California models and an ensemble of estimates from other organizations to provide a single picture for the state and its counties. The team used R to do some statistical work in the background while also creating interactive visualizations to share those results.
  • Made their code open source. The CalCAT team made the source code for the site public on GitHub so anybody in the world could access and improve on it. In addition to the website, they also created an open data portal for the state that includes additional aggregated data.

CDPH’s R Infrastructure Evolved to Support the Pandemic Efforts

As CalCAT gained popularity and the team gained experience, the infrastructure supporting the team evolved to meet the new demands by adding:

  • Multiple hosting environments. The CalCAT environment now features both a public-facing environment and an extranet environment that requires authentication with partners and staff. CDPH now also has internal testing platforms on which they run apps before they go out to the public-facing and extranet servers.
  • Professional products. While the project started off with open source Shiny Servers and shinyapps.io for the Opioid Dashboard, the team later moved to RStudio Server Pro for development and then added RStudio Connect and RStudio Package Manager for publishing. They now run multiple instances of those products to spread the workload out and accommodate the millions of users who access the public site.
  • Collaborative workflows. Once the team grew beyond just one or two developers, it created a GitHub repository where it could collaboratively work on code, push changes, and adopt changes from others. While this workflow required scientists within the department to learn basic devops software development techniques, the team decided the benefits from collaboration were worth climbing that learning curve.

CalCAT’s Success Has Encouraged R Use Within CDPH

The project team noted how much the Opioid dashboard changed CDPH’s thinking about how R could be used to deliver data to the public by:

  • Providing examples of what was possible. The Opioid dashboard expanded the scope of what could be done with CDPH data. The CalCAT dashboard proved that, with the help of their infrastructure and IT team, such applications could be scaled up to provide service to the public. Collaborating with IT also introduced the CalCAT scientists to software tools they wouldn’t have discovered themselves.
  • Rapidly deploying new apps. After the COVID dashboard was up, other groups started asking for new apps that could tackle other aspects of the crisis. One such application was a very simple program to create unique IDs for COVID tests, which was mandated and published within a week. The ability to respond quickly to department needs burnished R’s reputation within CDPH.
  • Creating an internal R community. The team is already seeing real expansion in personnel with R skills, especially in hiring. Their job descriptions now ask for R skills, and people are being recruited from other disciplines. Increasingly, the personnel within the department are coming in with R experience.
  • Embracing a code-based approach. One developer noted that writing code to do data science instead of using a point-and-click tool was analogous to a team doing rock climbing. Working code creates a path and anchors for others to use, and new developers then can use those anchors to follow in their footsteps.

Takeaways

The CalCAT experience shows that, despite claims to the contrary, R can be used for large-scale production applications. When we re-examine the three categories of concern about R with which we started the piece, we discover that:

  • Speed of development was the key to success. This was an application that had to be deployed quickly in response to a national emergency. Using R and Shiny allowed the team to deploy an interactive app that provided access to COVID data in weeks, not months.
  • Scaling up production use was an evolutionary process. The team took advantage of its prior experience with the Opioid Dashboard to deploy both the extranet and public versions of the COVID-19 application. The team had already deployed public apps on shinyapps.io and had deployed server infrastructure in house as part of their extranet application. When the time came to go public with the public CalCAT dashboard, scaling up became mostly a matter of replicating servers they already had experience with.
  • Infrastructure to support this application was available off the shelf. Instead of having to roll their own deployment process, the group was able to use RStudio’s server product suite to do the app development as well as the large-scale deployment on an array of RStudio Connect servers.

By using a code-based approach, the California Department of Public Health has built a repository of human and intellectual capital around building public health dashboards. This small team’s work and open source code can now be passed on to others both within and outside of California government. Their efforts will likely spawn new projects that will better inform citizens and continue to help them stay safe throughout this unprecedented pandemic.

To Learn More

You can learn about each of RStudio’s commercial products by following the links below.

  • RStudio Server Pro delivers fully integrated development environments for R and Python accessible via a browser.
  • RStudio Connect connects data scientists with decision makers with a one-button publishing solution from the RStudio IDE.
  • RStudio Package Manager controls package distribution for reproducible data science.
  • RStudio Team bundles RStudio Server Pro, RStudio Connect, and RStudio Package Manager products to ease purchasing and administration.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

The post How California Uses Shiny in Production to Fight COVID-19 first appeared on R-bloggers.


Why R Webinar – Satellite imagery analysis in R

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers].

Today at Why R? Webinars we will host Ewa Grabska, who will present Satellite imagery analysis in R. Her bio and the link to YouTube are below. Go to the video and set a reminder!

Thursday November 19th. 7:00pm UTC

Satellite imagery, such as the freely available data from the Sentinel-2 mission, enables us to monitor the Earth’s surface frequently (every 5 days) and at high spatial resolution (10-20 meters). Furthermore, the Sentinel-2 sensors, with 13 spectral bands in the visible and infrared wavelengths, provide very valuable information that can be used to automatically perform tasks such as classifying crop types, assessing forest changes, or monitoring the development of built-up areas. This is particularly important now, in an era of rapid environmental change related to climate change. R offers plenty of tools and packages for pre-processing, analyzing, and visualizing satellite images in a simple and efficient way. A variety of methods, such as machine learning algorithms, are also available in R and can be applied to the analysis of satellite imagery. I would like to show a framework for acquiring, pre-processing, and performing a preliminary analysis of Sentinel-2 time series in R. It includes the calculation of spectral indices, the use of machine learning algorithms to classify land cover, and the analysis of image time series, i.e. determining changes in the environment based on the spectral trajectories of pixels.
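As a rough illustration of what a spectral index calculation looks like, here is a minimal base R sketch of the Normalized Difference Vegetation Index (NDVI), computed as (NIR - Red) / (NIR + Red). It is not taken from the webinar: the `ndvi()` helper and the toy matrices are assumptions made for illustration, and a real workflow would read the Sentinel-2 red (band 4) and near-infrared (band 8) rasters with a spatial package instead.

```r
# Hypothetical helper: NDVI = (NIR - Red) / (NIR + Red), element-wise.
ndvi <- function(nir, red) {
  denom <- nir + red
  out <- (nir - red) / denom
  # Pixels where both bands are zero (e.g. no data) have no defined index
  out[denom == 0] <- NA_real_
  out
}

# Toy 2 x 2 reflectance matrices standing in for real band rasters
red <- matrix(c(0.10, 0.20, 0.05, 0.00), nrow = 2)
nir <- matrix(c(0.50, 0.20, 0.45, 0.00), nrow = 2)

res <- ndvi(nir, red)
print(res)
```

Values near 1 suggest dense green vegetation, values near 0 suggest bare or built-up surfaces, and the zero-reflectance pixel is returned as `NA` rather than as a division-by-zero artifact.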

Ewa is a Ph.D. candidate in geography at the Jagiellonian University, and a research assistant at the Faculty of Forestry, the University of Agriculture in Kraków. In her research, she focuses on the use of Sentinel-2 satellite imagery in determining different forest characteristics, such as tree species composition or detection of forest disturbances. She is also a big R enthusiast and mostly uses it in her analysis!

Moving away from Travis CI

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

At rOpenSci, we encourage R package developers to take advantage of Continuous Integration services to automatically check the package on different platforms, with different versions of R. The rOpenSci dev guide dedicates chapter 2 to the topic of Continuous Integration Best Practices, and recommends a few common CI vendors, including Travis CI.

Travis CI has been a pioneer in free public CI services, and made the concept popular in the open-source community. The service started to get wide adoption in 2012, and native support for R was added by Craig Citro in 2015, with more contributions from current maintainers Jim Hester and later Jeroen. In 2016 we wrote a blog post about using a build-matrix in order to check your packages on multiple versions of Linux and macOS, which is super powerful for R package development.

A Change of Management

Sadly, the good times came to an end. In 2019 the company was sold to a private equity firm, and soon after the acquisition, a large portion of the senior engineering staff was laid off. Under the new management, open-source users started suffering from significant outages and backlogs, while being pushed towards the new enterprise product travis-ci.com (the original Travis service was hosted on travis-ci.org).

The big blow came earlier this month, when Travis announced a new pricing model that no longer has a generous free tier for open-source projects, and that it will fully shut down the old travis-ci.org product by December 31. It is still unclear what exactly the new product will look like, and perhaps it can still be useful, but given the direction the company is heading, we recommend exploring other options.

GitHub Actions

Fortunately, there are many other companies offering free CI for open-source projects these days. Some popular vendors include AppVeyor, CircleCI, Azure Pipelines, and GitLab CI, but the biggest new player is GitHub Actions: the native CI/CD system from GitHub.

The GitHub Actions (GHA) platform was introduced only recently, but has quickly taken over the open-source world. The system is extremely flexible, allowing you to run any combination of scripts, containers, and imported actions from other users. The native integration with GitHub takes away the annoying authentication dance that is required for third-party services, making the setup completely seamless. Very generous free build resources are provided for open-source projects, but if you need something else, GHA also allows you to plug in self-hosted runners, giving you complete control of your hardware and build environment.

To get started with GHA for R, the r-lib/actions repo has a number of preconfigured actions and example configurations written by Jim Hester (again), for installing R, running checks, etc. Simply copy the check-standard.yaml file into the .github/workflows/ folder of your R package (see footnote 1), then push, and see the magic happen. Note that Jim’s scripts are only one example: GHA will let you run any script in the OS or container of your choice, allowing you to fully customize what happens on each new push, pull request, opened issue, etc.
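The copy step can also be scripted in base R. The sketch below is only an illustration under stated assumptions: `install_workflow()` is a hypothetical helper, not part of r-lib/actions or usethis, and the demo uses a placeholder file in a temporary directory rather than the real check-standard.yaml.

```r
# Hypothetical helper: copy a workflow template into <pkg>/.github/workflows/
install_workflow <- function(template, pkg_dir = ".") {
  stopifnot(file.exists(template))
  dest_dir <- file.path(pkg_dir, ".github", "workflows")
  # Create the nested folders if they do not exist yet
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, basename(template))
  file.copy(template, dest, overwrite = TRUE)
  invisible(dest)
}

# Demo in a throwaway directory; the placeholder file stands in for a
# check-standard.yaml obtained from the r-lib/actions repository
pkg <- file.path(tempdir(), "mypkg")
dir.create(pkg, showWarnings = FALSE)
template <- file.path(tempdir(), "check-standard.yaml")
writeLines("# placeholder for the real workflow file", template)

dest <- install_workflow(template, pkg)
print(file.exists(dest))
```

After copying the real file this way, you would commit and push as usual. You will probably also want to list the .github folder in .Rbuildignore so that it is not bundled into the built package.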

Thank you Travis!

We will likely only start seeing more of GitHub Actions, as it becomes the default CI for open source. But for now I would like to thank the Travis team, especially the initial founders and engineers, for bringing massive free CI to the open-source ecosystem.

I remember the first time installing Travis on a project, and seeing commits and pull requests automatically get built and checked, without having to do anything. I was blown away, it really brought the concept of pull requests to life. In no time, CI became an integral part of development, providing an efficient workflow to test new features and review pull requests, and green badges started appearing everywhere. I think it is fair to say that we could never maintain the number of projects we do today, if we had to test all those pull requests manually!

The R community certainly is not alone in this. A user in the Hacker News discussion comments:

Free CI really did provide a massive boost to collaborative open-source projects. As a user, it did a lot to increase software quality: not just catching inadvertent bugs, but also ensuring that there was at least some reproducible way to get the code working, that didn’t depend on some implicit configuration of the authors system. As a maintainer, accepting simple pull requests becomes much easier when you can quickly look over the code and check the CI status, and not have to try it out locally yourself. It was certainly critical to the “social coding” idea behind GitHub.

Of course the open-source community is sad to see Travis become an enterprise-first product. But in all honesty, the system is no longer state-of-the-art, and probably won’t be able to compete with the new GitHub/Microsoft products.

Also we are very aware that it is difficult to find a sustainable business model around open-source, and hope the company will survive by refocusing on specialized enterprise CI needs. Nevertheless, as we transition towards a new generation of CI systems, we won’t forget the pioneering role that Travis played in taking open-source collaboration to the next level. Thank you!

Migrating your Projects

If you are still using Travis CI, there are several options. First, you should convert your travis-ci.org account into a travis-ci.com account following instructions, if you have not done so already. This will make sure your CI keeps working after December 31, at least until you run out of credit 🙂

As mentioned, we recommend exploring alternatives, in particular GitHub Actions. As things are evolving rapidly, you can keep an eye on the r-lib/actions repository, our dev guide, and of course have a look at what other package authors are doing.

Fortunately the vendor lock-in with CI is pretty limited. For standard configurations you can just replace the .travis file with an equivalent template from another CI service. For customized configuration, you may have to spend a few minutes translating that to another format, which is the price we pay for using these cloud services. Feel free to reach out to the rOpenSci community if you could use some help!


  1. Or automatically with usethis by running: usethis::use_github_action_check_standard()

October 2020: “Top 40” New CRAN Packages

[This article was first published on R Views, and kindly contributed to R-bloggers].

One hundred fifty-six new packages made it to CRAN in October. Here are my “Top 40” selections in eight categories: Computational Methods, Data, Epidemiology, Mathematics, Machine Learning, Statistics, Utilities, and Visualization.

Computational Methods

mcmcensemble v2.0: Provides ensemble samplers for affine-invariant Markov chain Monte Carlo algorithms which allow a faster convergence for badly scaled estimation problems. Two samplers are included: the ‘differential.evolution’ sampler from ter Braak and Vrugt (2008) and the ‘stretch’ sampler from Goodman and Weare (2010). Look here for examples.

psqn v0.1.3: Provides quasi-Newton methods to minimize partially separable functions. The methods are described by Nocedal and Wright (2006). There is an Introduction and a vignette on the Partially Separable Quasi-Newton method.

Data

AirSensor v1.0.2: Allows processing and displaying data from air quality sensors. Initial focus is on PM2.5 measurements from sensors produced by PurpleAir.

fflr v0.3.12: Provides functions to format the raw data from the ESPN fantasy football API into tidy tables. See the vignette.

pmlbr v0.2.0: Provides access to more than 150 classification and regression data sets in the University of Pennsylvania’s PMLB repository. See README to get started.

podr v0.0.5: Allows users to connect to, access, and review over 250 datasets in the Pharmaceutical User Software Exchange (PHUSE) open data repository (PODR). See the vignette for details.

starwarsdb v0.1.2: Provides the data from the Star Wars API as a set of relational tables or a DuckDB database. Look here for an example.

Epidemiology

anovir v0.1.0: Implements maximum likelihood techniques to estimate virulence in population dynamics models. See the pre-print and the eleven vignettes including Introduction, Confidence intervals and Worked examples I and II.

epifitter v0.1.0: Provides functions for fitting two-parameter population dynamics models to proportion data for single or multiple epidemics using either linear or non-linear regression. See Madden et al. (2007) for background and the vignettes on fitting models and simulating disease progress.

epitweetr v0.1.24: Allows users to automatically monitor trends of tweets by time, place and topic aiming at detecting public health threats early through the detection of signals (e.g. an unusual increase in the number of tweets). It was designed to focus on infectious diseases, but can be adapted to other applications by modifying the topics and keywords. See the vignette.

i2extras v0.0.2: Provides functions to work with ‘incidence2’ objects, including a simplified interface for trend fitting and peak estimation. This package is part of the RECON toolkit for outbreak analysis. There are vignettes on Fitting epicurves and Peak Estimation.

IBMPopSim v0.3.1: Provides functions to simulate the random evolution of structured population dynamics, called stochastic Individual Based Models (IBMs). Users can simulate the random evolution of a population in which individuals are characterized by their date of birth, a set of attributes, and their potential date of death. See Ferrière and Tran (2009) and Bansaye and Méléard (2015) for background. There is a package overview and vignettes on C++ essentials, Human populations, Human populations with changing characteristics, Insurance portfolio, and populations with genetically variable traits.

MGDrivE2 v1.0.1: Provides a simulation modeling framework which significantly extends capabilities of the MGDrivE package with a new mathematical and computational framework based on stochastic Petri nets. To get started with this package see the vignettes on SEIR dynamics, Meta population dynamics, One node dynamics, Inhomogeneous stochastic processes, Life-cycle dynamics, One node lifecycle dynamics, data analysis, and advanced topics.

msce v1.0.1: Provides functions to calculate hazard and survival functions for multi-stage clonal expansion models used in cancer epidemiology. There is a vignette on fitting incidence data.

Machine Learning

acumos v0.4-1: Provides access to the Linux Foundation Acumos open source framework intended to make it easy to build, share, and deploy AI apps. Look here for information on the Acumos R Client.

bigSurvSGD v0.0.1: Provides a function to fit Cox models via stochastic gradient descent, which avoids the computational instability of the standard Cox model. The functions scale to large datasets and accommodate datasets that do not fit in memory. See Aliasghar et al. (2020) for details.

deepredeff v0.1.0: Implements a tool that contains trained deep learning models for predicting effector proteins using a set of known experimentally validated effectors from either bacteria, fungi, or oomycetes. See Kristianingsih and MacLean (2020) for background, and the overview and the vignette on prediction.

MKclass v0.3: Implements performance measures and scores for statistical classification including accuracy, sensitivity, specificity, recall, similarity coefficients, AUC, GINI index, Brier score and more. It calculates optimal cut-offs and decision stumps according to Iba and Langley (1991), and follows Lemeshow and Hosmer and Hosmer et al. (1997) for goodness of fit tests and Porta (2014) for epidemiological risk measures. See the vignette to get started.

mlr3hyperband v0.1.0: Implements the bandit-based hyperparameter optimization method of Li et al. (2016). Look here for an example.

Mathematics

tdaunif: Provides functions to randomly sample from simple manifolds. See Arvo (1995) and Diaconis, Holmes, and Shahshahani (2013) for the theory, and the vignette to get started.

Science

SoundShape v1.0: Implements the eigensound method of MacLeod et al. (2013) to compare stereotyped sound from different species. The vignette will make you want to start comparing sound waves.

Statistics

anscombiser v1.0.0: Provides functions to create data sets with identical summary statistics (i.e. identical marginal sample means and sample variances, sample correlation, least squares regression coefficients and coefficient of determination) that look amusingly different. See the vignette for examples.

gglm v0.1.0: Extends ggplot2 for creating common diagnostic plots associated with linear models. Look here for examples.

microcluster v0.1.0: Implements the method in Betancourt et al. (2020) to perform microclustering models for categorical data. The vignette offers an example.

PressPurt v1.0.2: Provides functions to identify the most sensitive interactions within a network which can be described by differential equations, in order to produce qualitatively robust predictions to a press perturbation. See Koslicki & Novak (2017) for background. There is a tutorial and a vignette for set up.

quantdr v1.0.0: Provides functions to perform dimension reduction for conditional quantiles by determining the directions that span the central quantile subspace (CQS). The vignette contains the details.

singcar v0.1.1: Implements frequentist and Bayesian methods to compare single cases to control populations. See Crawford and Howell (1998) and Crawford and Garthwaite (2005) for background, and the vignette for an introduction.

statgenGxE v1.0.3: Provides functions to facilitate the analysis of multi-environment data from plant breeding experiments following the analyses described in Malosetti et al. (2013). See the vignette for examples.

survivalMPLdc v0.1.1: Implements functions to fit Cox proportional hazard models under dependent right censoring using copula and maximum penalized likelihood methods. See the vignette for examples.
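The idea behind the anscombiser entry above is easiest to see with the classic Anscombe quartet, which ships with base R as `anscombe`. The sketch below only checks the summary statistics of that built-in data set; it does not call anscombiser itself.

```r
# The Anscombe quartet: four x/y pairs stored as x1..x4 and y1..y4
xs <- anscombe[, c("x1", "x2", "x3", "x4")]
ys <- anscombe[, c("y1", "y2", "y3", "y4")]

summary_stats <- data.frame(
  mean_x = sapply(xs, mean),
  var_x  = sapply(xs, var),
  mean_y = round(sapply(ys, mean), 2),
  var_y  = round(sapply(ys, var), 2),
  cor_xy = round(mapply(cor, xs, ys), 2)
)
print(summary_stats)

# Plotting the four pairs shows how different they look:
# op <- par(mfrow = c(2, 2))
# for (i in 1:4) plot(xs[[i]], ys[[i]], main = paste("Set", i))
# par(op)
```

The means and variances of x agree exactly across the four sets, and the remaining statistics agree to about two decimal places, even though the scatter plots look nothing alike.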

Time Series

FoReco v0.1.1: Provides bottom-up, optimal and heuristic combination forecast reconciliation procedures for cross-sectional, temporal, and cross-temporal linearly constrained time series. There is an introduction and another vignette on average relative accuracy indices.

modeltime.ensemble v0.3.0: Implements time series ensemble forecasting methods including model averaging, weighted averaging, and stacking. See Pavlyshenko (2019) for the theory and the vignette for an introduction.
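For readers new to ensemble forecasting, here is a minimal base-R sketch of the two simplest methods named in the modeltime.ensemble entry, model averaging and weighted averaging. It does not use the modeltime.ensemble API; the forecasts, model labels and weights are invented for illustration.

```r
# Hypothetical point forecasts from three models for the same 4 periods
forecasts <- cbind(
  arima   = c(100, 102, 104, 106),
  ets     = c( 98, 101, 105, 108),
  prophet = c(102, 103, 103, 104)
)

# Simple model averaging: every model gets the same weight
avg_ensemble <- rowMeans(forecasts)

# Weighted averaging: the weights are assumptions, e.g. chosen from
# validation accuracy; they are normalised to sum to one
w <- c(arima = 0.5, ets = 0.3, prophet = 0.2)
wtd_ensemble <- as.vector(forecasts %*% (w / sum(w)))

print(cbind(avg_ensemble, wtd_ensemble))
```

Stacking, the third method mentioned, would instead fit a model to learn the weights from out-of-sample predictions.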

Utilities

findInFiles v0.1.2: Enables users to search for a pattern in a folder and display the results in the RStudio viewer pane or in an R Markdown file or Shiny app.

groundhog v1.0.0: Assists with reproducibility by providing functions for version specific package loading. See the vignette.

monaco v0.1.0: Implements an HTML widget rendering of the Monaco editor which is useful for JavaScript. Look here for examples.

oskeyring v0.1.1: Aims to support all features of the system credential store, including non-portable ones. It supports Keychain on macOS, and Credential Manager on Windows. See the keyring package if you need a portable API. The README describes how to get started.

PDE v1.1.1: Enables extracting information and tables from pdf files based on search words. The vignette has several examples.

representr v0.1.1: Implements the record linkage methods of Kaplan et al. (2020) to create representative records for use in downstream tasks after entity resolution is performed. See the vignette.
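To illustrate what the findInFiles entry above does conceptually, here is a small base-R sketch that searches the files in a folder for a pattern and returns the matching lines. It is not the findInFiles API, and it does not render anything in the viewer pane; the helper `find_in_files` and its arguments are hypothetical.

```r
# Search all files under `path` whose names match `file_regex` for
# lines matching the regular expression `pattern`.
find_in_files <- function(path, pattern, file_regex = "\\.R$") {
  files <- list.files(path, pattern = file_regex, recursive = TRUE,
                      full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    if (length(idx) == 0) return(NULL)
    data.frame(file = f, line = idx, text = lines[idx],
               stringsAsFactors = FALSE)
  })
  # Drop files without matches and stack the rest
  do.call(rbind, Filter(Negate(is.null), hits))
}

# Usage on a throwaway directory
tmp <- file.path(tempdir(), "find_in_files_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("x <- 1", "library(zoo)", "y <- 2"), file.path(tmp, "a.R"))
writeLines("z <- 3", file.path(tmp, "b.R"))
res <- find_in_files(tmp, "library\\(")
print(res)
```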

Visualizations

imbibe v0.1.0: Provides a set of fast, chainable image-processing operations which are applicable to images of two, three or four dimensions, particularly medical images. Look here for an example.

r3dmol v0.1.0: Provides functions to create and manipulate rich and fully interactive 3D visualizations of molecular data that can be included in Shiny apps and R markdown documents. See the vignette for an introduction.

visualpred v0.1.0: Provides 2D point and contour plots for visualizing binary classification models. There is an introduction and vignettes on plotting outliers, comparing algorithms, and advanced settings.

To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The post October 2020: "Top 40" New CRAN Packages first appeared on R-bloggers.

Lake Erie Pileup

[This article was first published on AdventuresInData, and kindly contributed to R-bloggers].

Strong winds caused substantial water surface elevation differences in Lake Erie around November 15, 2020. Here’s an animated plot of elevations measured by NOAA. The plot was developed using R and Windows Live Movie Maker.

To leave a comment for the author, please follow the link and comment on their blog: AdventuresInData.

The post Lake Erie Pileup first appeared on R-bloggers.

Bio7 3.2 Released

[This article was first published on R – Bio7, and kindly contributed to R-bloggers].

20.11.2020

A new version of Bio7 is available. This update comes with a plethora of new features, improvements and bugfixes.

Bio7 3.2 Dark theme enabled using the spatstat package examples and plots.

For those who don’t know Bio7: it is an integrated development environment for ecological modeling, scientific image analysis and statistical analysis.

It also contains a feature-complete development environment for R with an advanced R editor, R developer tools and interfaces to perform scientific image analysis with R and the embedded ImageJ application.

New and Noteworthy

General:

  • Bio7 3.2 RCP (Rich Client Platform) built upon Eclipse 4.17
  • Bundled with AdoptOpenJDK 14.0.2 and JavaFX 15
  • Bundled with R 4.0.3 (Windows only!)
  • Added an image classification plugin for supervised and unsupervised image classification (using R and ImageJ – see R-Shell view context menu “Image Classification”). For an overview and details, see: https://github.com/Bio7/Bio7_Classification

Image Classification plugin (Dark Theme enabled)

  • Menus and scrollbars are now dark on Windows, too
  • Improved the dynamic script menus. Nested folders can be hidden for complex Java plugins (see image classification plugin).
  • Updated several Java libraries (Groovy, POI, etc.)
  • Improved several view layouts (showing scrollbars if necessary)
  • Changed all SWT ExpandBars to CTabFolders to improve the display when the dark theme is selected
  • Added more default fast wizards actions to the toolbar menu (to create Bio7 projects with files in one action)

Opened fast wizard menu (Dark Theme enabled)

  • Added new API methods (e.g., an R script job interface)
  • Enabled the recognition of Eclipse supported ASCII control characters in the console (see preferences)
  • Improved the compilation of pure LaTeX files without the necessity to run Rserve
  • Improved the dark theme on Windows, Linux and MacOSX

ImageJ

  • Updated the ImageJ plugin to version 1.53g34 (see ImageJ release notes)
  • Improved the visual interface for debugging
  • The main view menu can now be extended dynamically from plugins
  • ImageJ macros updated
  • Toolbar menus improved
  • Converted the ImageJ toolbar menus to SWT to display the dark theme
  • Converted several ImageJ context menus to SWT (recognizable on HighDPI, Linux GTK)
  • For all changes since the last release, see: https://github.com/Bio7/EclipseImageJ1Plugin

R

  • Updated the embedded R application on Windows
  • The R plugin can now be updated individually
  • Added a new ‘Load Packages’ table to display installed R packages and whether updates are available (selected packages can be updated in this view, too)

Load Packages tab (Dark Theme)

  • Added a grammar rule for raw string literals to the R editor (thanks to Bart Kiers)
  • Added a database preference to store the XML database profile file in a different (secret) location.
  • Fixed some minor R-Shell bugs

R + ImageJ

  • Added an option to transfer ImageJ ROI groups or special ROI names as class signatures to R (ROI Manager transfer actions in the Image-Methods view)
  • Improved the ROI stack transfer for virtual stacks (to load and transfer disk resident image stacks)
  • Improved several functions for the new image classification plugin

Download and Installation:

Windows:

Just download the *.zip distribution file from https://bio7.org and unzip it in your preferred location. Bio7 comes bundled with a Java Runtime Environment, R and Rserve distribution and works out of the box.

Linux:

Download and extract the installation file from https://bio7.org. For Linux you have to install R and Rserve (see Rserve installation below!).

MacOSX:

Download and extract the installation file from https://bio7.org.

When you start Bio7, a warning or error can occur because of changes in how Apple treats signatures. To allow Bio7 to start, see these instructions for Yosemite, Sierra, Mojave and Big Sur:

First, try to open the app with the context menu to allow the execution. If that doesn’t work, try the following:

Yosemite: Open an app from an unidentified developer

Sierra: Open an app from an unidentified developer

Mojave and High Sierra: How to fix “Application” is damaged and can’t be opened error in macOS Mojave and High Sierra.

In addition, on MacOSX you have to install R and Rserve (see below).

Linux and MacOSX Rserve (compiled for cooperative mode) installation:

To install Rserve, open the ‘Native R’ console in the ‘Console’ view and then execute the view menu action “Options -> Install Rserve (coop. mode) for R …” for different R versions (SSL 1.1 version for Linux Ubuntu > 19.10). This will download and install Rserve in your default R library location; see the video below. Please make sure that your default Linux R library install location has write permissions.

How to install Rserve for Linux and MacOSX: https://youtu.be/tF7HbRBRIF

In cooperative mode, only one connection at a time is allowed (which is what we want for this desktop application) and all subsequent connections share the same namespace (this is the default on Windows).

Bio7 Documentation

For more information about Bio7, please consult the Bio7 User Guide, which will be updated soon.

A plethora of Bio7 video tutorials for an introduction can be found on YouTube.

To leave a comment for the author, please follow the link and comment on their blog: R – Bio7.

The post Bio7 3.2 Released first appeared on R-bloggers.

Deloitte Names Appsilon a Rising Star in the 2020 Fast 50 List

[This article was first published on r – Appsilon | End-to-End Data Science Solutions, and kindly contributed to R-bloggers].

Appsilon is Named a Rising Star

2020 wasn’t easy on anyone, and it was a devastating year for many. Businesses of all types had to readjust the way they work and pivot their overall strategies in this new environment. The IT sector was one of the least affected by the COVID-19 crisis, but difficulties were still inevitable for many tech companies. At Appsilon, the crisis made us pull together (figuratively) and work harder than ever at delivering innovative solutions for our clients. The results speak for themselves.

Want to see some impressive Shiny dashboards? Visit Appsilon’s Shiny App Demo Gallery.

In this year alone, we have increased our team size by 50%. We’ve made a huge effort to hire kind, talented, and motivated people. A larger team has led to more quality projects, and these projects have landed us among the Rising Stars in this year’s Deloitte Fast 50 CEE ranking list with an overall growth of 277%.

Previous years’ finalists include companies such as Tooploox, RTB House, Warsaw Genomics, and Kiwi. It is a tremendous honor to be listed alongside these companies. Usually, there would be a gala event to announce the finalists, but this year it was held online. If you would like to watch the show, refer to this link.

About Deloitte Technology Fast 50

Deloitte Technology Fast 50 in Central Europe is a program that recognizes and profiles fast-growing technology companies in the region. The program, which is now in its 21st year, ranks the 50 fastest growing public or private technology companies.

For more than 20 years, Deloitte has been honoring the fastest-growing technology companies in Central Europe. The Deloitte Technology Fast 50 Programme reveals the depth and scope of innovation across our region, driven by many of the dynamic, inspiring young companies that in time will form our economic bedrock. Being recognized as a Technology Fast 50 winner provides increased visibility, brand recognition, and growth opportunities to these fast-growing companies.

Learn more:

Join our growing team! We are primarily seeking senior-level developers with management experience. See our Careers page for all new openings, including openings for a Project Manager and Community Manager.

Article Deloitte Names Appsilon a Rising Star in the 2020 Fast 50 List comes from Appsilon | End-to-End Data Science Solutions.

To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon | End-to-End Data Science Solutions.

The post Deloitte Names Appsilon a Rising Star in the 2020 Fast 50 List first appeared on R-bloggers.

The Mathematics and Statistics of Infectious Disease Outbreaks

[This article was first published on Theory meets practice..., and kindly contributed to R-bloggers].

Abstract:

Slides, R code and video lectures of our 2020 The Mathematics and Statistics of Infectious Disease Outbreaks summer course at Stockholm University are made available to a wider audience.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available from GitHub under the GNU General Public License (GPL v3).

Introduction

During summer 2020, Tom Britton and I gave a course on The Mathematics and Statistics of Infectious Disease Outbreaks at the Department of Mathematics, Stockholm University, Sweden. Prerequisites for the course were undergraduate knowledge of mathematics (e.g. differential equations, optimization) and statistics (e.g. random variables, distributions, maximum likelihood inference) as well as some programming skills in a language with a data science component (Python, R, Julia, MATLAB, …).

Now that the course is done, we have decided to share all our course material, consisting of slides, R code and video lectures. The main page for navigating the material is on GitHub:

https://github.com/hoehleatsu/mt3002-summer2020

which, e.g., links to the Youtube playlist containing the videos.

Course content

Discussion

We hope the material can be of value for those interested in the field, e.g., new Ph.D. students in epidemic modelling, infectious disease epidemiologists with a liking for the quantitative side of matters, and those who just want to improve their armchair epidemiology skills.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

The post The Mathematics and Statistics of Infectious Disease Outbreaks first appeared on R-bloggers.


Global Lockdown Effects on Social Distancing: A Graphical Primer

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers].

The idea

OK. We are at home. Again. Given that large parts of Europe and the U.S. are currently experiencing a second large wave of Covid-19 cases and that most European jurisdictions have reacted with more or less rigorous lockdown regulations, one wonders how the effects of these regulations on social distancing compare to those of the March/April lockdowns. In a recent TRR 266 workshop on data visualization, we (Astrid and Joachim) used this setting to discuss a workflow on how to let data speak graphically. This blog post, which is co-authored with Astrid van Kimmenade from Paderborn University, is based on the class’ main assignment.

About the workflow

As you see, we are using R for preparing the data and making the plots. This is by no means required. Any statistical software that is able to output vector graphics (e.g., PDF or SVG) will do. However, data visualization is one of the areas where R really shines. Many leading experts in data visualization use it as their statistical analysis tool of choice. See the wonderful textbook ‘Data Visualization: A Practical Introduction’ by Kieran Healy for a hands-on introduction on how to use R for data visualization.

When you prepare a data visualization for publication, you will most likely at some point want to move from statistical software to layout-centered tools. Illustrator from Adobe is the professional tool of choice here. Because it is commercially licensed and costly, you might want to look for open-source alternatives. Feel free to use the comment function below if you happen to have any recommendations.

The data

We are using data provided by the {tidycovid19} package. While there are many R packages floating around that can be used to obtain Covid-19 related data, the advantage of this package for our purpose is that it provides case data from the world-wide JHU CSSE Covid-19 repository as well as data on social movements as provided by Apple and Google and from various other data sources. Yes, and it is Joachim’s first “lockdown baby” as part of the TRR 266 Open Science Data Center activities related to Covid-19, so there is that.

Prepping data

The first step of data analysis is always data preparation, including data cleaning and exploratory data analysis (EDA). It also entails carefully selecting and defining the variables that you want to display. If your data has a panel structure, you also might want to decide on its frequency.

In many cases, you will be preparing different variables for your conceptual measures. This is also the case here. We prep two country-daily measures for assessing the magnitude of the Covid-19 spread and two country-daily measures that aim to assess the extent to which people in a given country socially distance. For the former, we use the daily number of new cases and the number of new cases over the number of tests performed (percentage of positive tests). For the latter, we use measures provided by Apple and Google. Apple reports the search usage of their map service for various categories. Google reports actual position data classified by the various location types. We average the measures over the location types that are positively associated with social distancing. All measures are smoothed over seven days to smooth out day-of-week fluctuations.
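To see what seven-day smoothing does to day-of-week fluctuations, here is a small base-R sketch on made-up data. The post's own code uses rollmean() and rollsum() from the zoo package with align = "right"; the sketch below uses stats::filter() as a stand-in so that it runs without extra packages, and the case counts are hypothetical.

```r
# Hypothetical daily new-case counts with a weekly reporting pattern
# (fewer cases reported on the last two days of each week)
new_cases <- rep(c(10, 12, 11, 13, 12, 4, 3), times = 3)

# Right-aligned 7-day rolling mean: the value on day t averages
# days t-6 to t. sides = 1 uses current and past values only.
roll_mean_7 <- as.numeric(
  stats::filter(new_cases, rep(1 / 7, 7), sides = 1)
)

print(round(roll_mean_7, 2))
```

The first six values are NA because a full week of history is not yet available. After that the weekly pattern disappears completely, since every seven-day window contains each weekday exactly once.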

To focus the analysis on a meaningful and digestible number of countries, we apply two cleaning steps. First, we limit the analysis to countries with a population of more than 10 million that suffered at least 100 Covid-19-related deaths per million inhabitants. This ensures that we limit the analysis to reasonably large countries that were significantly affected by Covid-19. Second, to focus on countries that experienced both the March/April wave and the current wave, we restrict the sample to the countries that had a peak number of weekly new infections higher than 30 per 100,000 inhabitants prior to April 2020.

library(tidyverse)
library(zoo)
library(tidycovid19)

df <- download_merged_data(cached = TRUE)

ctries <- df %>%
  group_by(iso3c) %>%
  filter(!is.na(deaths), population >= 10e6) %>%
  filter(date == max(date)) %>%
  summarise(deaths_per_mio_pop = deaths * 1e6/population) %>%
  filter(deaths_per_mio_pop > 100) %>%
  pull(iso3c)

ave_measures <- df %>%
  arrange(iso3c, date) %>%
  group_by(iso3c) %>%
  filter(iso3c %in% ctries) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    total_tests = na.approx(total_tests, na.rm = FALSE),
    new_tests = total_tests - lag(total_tests),
    ave_pos_test_rate = rollsum(
      (confirmed - lag(confirmed))/new_tests,
      7, na.pad=TRUE, align="right"
    ),
    ave_new_cases_wk_per_100e5 = rollsum(
      new_cases*1e5/population, 7, na.pad=TRUE, align="right"
    ),
    ave_soc_dist_google = rollmean(
      (gcmr_retail_recreation +  gcmr_transit_stations +
        gcmr_workplaces)/3, 7, na.pad=TRUE, align="right"
    ),
    ave_soc_dist_apple = rollmean(
      (apple_mtr_driving + apple_mtr_walking + apple_mtr_transit)/3,
      7, na.pad=TRUE, align="right"
    )
  ) %>%
  filter(
    max(
      (date < lubridate::ymd("2020-04-01")) *  ave_new_cases_wk_per_100e5,
      na.rm = TRUE
    ) > 30
  ) %>%
  select(
    iso3c, country, date, population, ave_new_cases_wk_per_100e5,
    ave_pos_test_rate, ave_soc_dist_apple, ave_soc_dist_google
  )

smp_countries <- unique(ave_measures$country)

Some exploratory plots

The first step of data visualization (and any data analysis, for that matter) is a quick’n’dirty set of exploratory visuals. You can also use the C02 ExPanD app, or the package that was used to generate it, for this. Here, we try to assess the development of our constructs of interest (Covid-19 spread and extent of social distancing) over time. Also, we eyeball the correlation of our two social distancing measures and the correlation of our two constructs.

ggplot(ave_measures, aes(x = date, color = iso3c)) +
  geom_line(aes(y = ave_new_cases_wk_per_100e5)) +
  theme_minimal()

ggplot(ave_measures, aes(x = date, color = iso3c)) +
  geom_line(aes(y = ave_pos_test_rate)) +
  theme_minimal()

ggplot(ave_measures, aes(x = date, color = iso3c)) +
  geom_line(aes(y = ave_soc_dist_apple)) +
  theme_minimal()

ggplot(ave_measures, aes(x = date, color = iso3c)) +
  geom_line(aes(y = ave_soc_dist_google)) +
  theme_minimal()

ggplot(
  ave_measures,
  aes(x = ave_soc_dist_apple, y = ave_soc_dist_google, color = iso3c)) +
  geom_point(alpha = 0.2) +
  theme_minimal()

ggplot(
  ave_measures,
  aes(x = ave_new_cases_wk_per_100e5, y = ave_soc_dist_google, color = iso3c)) +
  scale_x_continuous(trans = "log10") +
  geom_point(alpha = 0.2) +
  theme_minimal()

Based on this, we make some decisions. First, we choose our variables: the rolling 7-day sum of new cases per 100,000 inhabitants will be used to assess the magnitude of the Covid-19 spread, and Google data will be used for measuring social distancing. The main reasons for our choices are that the testing data, while highly relevant, are too sparse to produce a reliable measure, and that the Apple data are conceptually of lower quality than the Google data and also have a two-week gap where Apple changed its methodology. Unfortunately, the Google data have a lag of roughly one week, meaning that they currently contain data up until November 15. Selecting data always involves trade-offs. We will need to communicate these trade-offs to the reader.

Graphical story idea: Country drill down

The exploratory plots make two things obvious:

  • The social distancing effect of the second wave is smaller than that of the first wave.
  • The effects vary significantly across countries.

We decide to tell this story graphically. In principle, we could start by throwing country-level time-series data directly at people. However, we believe that this would be a little bit hard to digest. Instead, we decide to prepare a series of graphs that allow the reader to “zoom in” on the problem:

  • The first graph will be a “global level” time-series line graph contrasting the waves with the social distancing effect. It should allow the reader to compare the social distancing effects of the two waves at the global level.
  • The second graph will display this trend by country. To ease the display and to avoid the Spaghetti Graph effect, we will use a faceted version of a country-level line graph for this purpose. It will enable the reader to assess country-level differences.
  • The third graph will display the country-level differences more pronouncedly by plotting the social distancing effect of each wave over its magnitude at the country-wave level. This scatter plot will show a common pattern in how countries reacted to the waves in terms of social distancing, and also a country that looks like an outlier (stay tuned).

Graph 1: global time series

The first visual focuses on the main takeaway that we feel deserves to be communicated: social distancing during the first wave kicked in more quickly and also seems to be more pronounced compared to the second wave. While it is still a little bit too early to tell, we believe that this is the main point that we should be making with our graphs. To make this point at the global level, we compute a population-weighted average of the data across our sample countries.

library(grid)
library(gridExtra)
library(RColorBrewer)

ave_measures %>%
  group_by(date) %>%
  filter(
    !is.na(ave_new_cases_wk_per_100e5),
    !is.na(ave_soc_dist_google)
  ) %>%
  filter(n() == length(smp_countries)) %>%
  summarise(
    cases = weighted.mean(ave_new_cases_wk_per_100e5, population, na.rm = TRUE),
    soc_dist = weighted.mean(ave_soc_dist_google, population, na.rm = TRUE)/100,
    .groups = "drop"
  ) -> wwide

caption_text <- paste0(str_wrap(paste0(
  "Contains social distancing data up to ",
  format(
    max(ave_measures$date[!is.na(ave_measures$ave_soc_dist_google)]),
    "%b %d"
  ),
  " and is based on countries that experienced a significant first ",
  "Covid-19 wave in March/April ",
  "(", paste(smp_countries, collapse = ", "), "). ",
  "Data and code: https://github.com/joachim-gassen/tidycovid19."
), 80))

my_palette <- c(brewer.pal(8, "Set1"), "lightblue")

p_cases <- ggplot(wwide, aes(
  x = date, y = cases)) +
  geom_line(color = my_palette[1]) +
  theme_minimal() +
  ylim(0, NA) +
  annotate(
    x = lubridate::ymd("2020-08-15"),
    y = 200,
    geom = "text",
    color = my_palette[1],
    label = "New weekly Covid19 infections\nby 100,000 inhabitants"
  ) +
  labs(x = "", y = "") +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size = 0.5)
  )

p_soc_dist <- ggplot(wwide, aes(x = date, y = soc_dist)) +
  geom_line(color = my_palette[2]) +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  annotate(
    x = lubridate::ymd("2020-08-15"),
    y = -.4,
    geom = "text",
    color = my_palette[2],
    label = "% of reduction in social interaction\n(assessed by Google Mobility Reports)"
  ) +
  labs(
    x = "",
    y = "",
    caption = caption_text
  ) +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(), axis.text.x = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size = 0.5)
  )

grid.newpage()
grid.draw(rbind(ggplotGrob(p_cases), ggplotGrob(p_soc_dist), size = "last"))
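The population weighting in the code above can matter quite a bit. The following toy example, with invented countries and numbers, contrasts a simple mean with the population-weighted mean computed by base R's weighted.mean().

```r
# Hypothetical country-level values for a single day
cases_per_100k <- c(A = 50, B = 200, C = 120)
population     <- c(A = 80e6, B = 10e6, C = 60e6)

# Every country counts the same
simple_mean <- mean(cases_per_100k)

# Every inhabitant counts the same
weighted_mean <- weighted.mean(cases_per_100k, population)

print(c(simple = simple_mean, weighted = weighted_mean))
```

Here the small but heavily affected country B pulls the simple mean up, while the weighted mean reflects the situation of the average inhabitant across the three countries.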

Graph 2: spaghetti facet plot

As a next step, to allow the reader to explore country-level differences, we simply facet the time-series plots, highlighting the countries one by one to get rid of the spaghetti problem. This is a very informative display for an educated audience but maybe a little bit too data-rich for a general audience.

library(gghighlight)

p_cases <- ggplot(ave_measures) +
  geom_line(aes(date, ave_new_cases_wk_per_100e5, colour = iso3c)) +
  gghighlight() +
  labs(
    x = "",
    y = "New weekly Covid19 infections\nby 100,000 inhabitants"
  ) +
  theme_minimal() +
  scale_color_manual(values = my_palette) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size = 0.5)
  ) +
  facet_wrap(~ country)

p_soc_dist <- ggplot(ave_measures) +
  geom_line(aes(date, ave_soc_dist_google, colour = iso3c)) +
  gghighlight() +
  theme_minimal() +
  scale_color_manual(values = my_palette) +
  labs(
    x = "",
    y = "% of reduction in social interaction\n(assessed by Google Mobility Reports)",
    caption = caption_text
  ) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(size = 0.5)
  ) +
  facet_wrap(~ country)

grid.newpage()
grid.draw(rbind(ggplotGrob(p_cases), ggplotGrob(p_soc_dist), size = "last"))

Graph 3: country-level correlation scatter

Finally, here is our last graph, which focuses on the association. A grouped scatter plot is the classical display for that. The frequency reduction is very effective (compare it to the exploratory scatter plot above). To the experienced eye, the two starkly different associations become immediately apparent. Also, the U.S. seems to be a negative outlier in the first and in particular the second wave, for all too obvious and soon to be gone reasons. We thought about including separate regression lines to document the different associations for both waves but felt that this cluttered the graph by stating the obvious.

library(ggrepel)

ave_measures %>%
  filter(date < lubridate::ymd("2020-06-01")) %>%
  summarise(
    cases = max(ave_new_cases_wk_per_100e5, na.rm = TRUE),
    soc_dist = -min(ave_soc_dist_google, na.rm = TRUE)/100,
    .groups = "drop"
  ) %>%
  mutate(wave = "Spring") %>%
  select(iso3c, wave, cases, soc_dist) -> spring_wave

ave_measures %>%
  filter(date > lubridate::ymd("2020-09-01")) %>%
  summarise(
    cases = max(ave_new_cases_wk_per_100e5, na.rm = TRUE),
    soc_dist = -min(ave_soc_dist_google, na.rm = TRUE)/100,
    .groups = "drop"
  ) %>%
  mutate(wave = "Fall") %>%
  select(iso3c, wave, cases, soc_dist) -> fall_wave

soc_dist_by_wave <- rbind(spring_wave, fall_wave)
soc_dist_by_wave$wave <- factor(soc_dist_by_wave$wave, c("Spring", "Fall"))

ggplot(soc_dist_by_wave, aes(
  x = cases, y = soc_dist, color = wave,  label = iso3c)) +
  geom_point() +
  geom_text_repel(color = "black") +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  theme_minimal() +
  scale_color_manual(values = my_palette) +
  labs(
    x = "Peak Weekly New Cases by 100,000 inhabitants\n(log scale)",
    y = "Reduction in Social Interaction\n(percentage to baseline)",
    caption = caption_text
  )

Designing graphs to be more visually appealing

While the graphs above communicate our story (we think), they still have this “statistical program look and feel”. To take them to production level, some final touches are needed. There is always a trade-off between doing this within R or in your layout software. For academic publications, we tend to do the whole design in R. For general audience publications, moving to layout software sooner rather than later might be advisable.

When communicating to general audiences it is important to make sure the visual is clear and can stand by itself. In this example we applied several tweaks (using Adobe Illustrator) to the output from R:

  • Remove chart junk: always get rid of unnecessary clutter, clean up your charts. In this case, the charts were surprisingly clean already. Nonetheless, a few simple tweaks make them easier on the eyes. We specifically added (visual) hierarchy: we made the axes less prominent (grey and thinner lines), removed unnecessary lines, and made the numbers on the axes grey and smaller.
  • Check your axes: it is preferable to use “natural” increments, meaning 2-4-6-8 is preferred over 3-6-9, and 0-50-100% over 30-60-90%. In the case of our charts, we want to show a general trend, meaning no detail is required. For the reduction in social interaction 0-50-100% seems to suffice, no need to add more detail there.
  • Add context: we added the text describing the data as an introduction to the visual. This way the visual can be a stand-alone, and if it is copied or shared by others, the context is not lost (as long as the image is not cropped ;-))
  • Annotate: this is something we rarely see in scientific publications, but annotating a graph is a very user-friendly approach, especially in this case, where we offer an explanatory visual (vs. an exploratory visual). We want to communicate that “the effect of social distancing in the second wave is smaller compared to the first”, and we want to bring out this story most clearly. In the case of the Global Time Series visual, small annotations that compare the second and first wave help us tell our story. Of course, annotations should be used sparingly to avoid chart junk.
  • No type at an angle: don’t set type at an angle; nobody wants to twist their neck reading a chart.
  • Direct labeling: as you can see there is no legend with “spring=green” and “fall=yellow” for the scatterplot. We guess you didn’t miss it either? When possible, use direct labeling. Don’t make your audience do a color-coding exercise.
  • Add chart junk: didn’t we just argue to keep chart junk to a minimum…? We sure did, but if you want your graph to stand out, a little chart junk can be very effective. A little detail that supports your story, like the little virus icons, can give it that touch that could make your graph stand out from the crowd. As always, it’s a matter of taste (and corporate design and branding). Optimizing your visual is a step that should not be skipped: a few simple tweaks can make a big difference.
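
Some of the tweaks listed above can also be approximated within R before moving to layout software. The following is a minimal sketch, assuming {ggplot2} is installed and using toy data with made-up values; the annotation text and object names (`toy`, `ends`, `p_clean`) are illustrative and not taken from the finalized graphs.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for two time series
toy <- data.frame(
  week     = rep(1:10, times = 2),
  wave     = rep(c("Spring", "Fall"), each = 10),
  soc_dist = c(seq(0, 0.7, length.out = 10), seq(0, 0.3, length.out = 10))
)

# Last point of each series, used for direct labeling instead of a legend
ends <- toy[toy$week == max(toy$week), ]

p_clean <- ggplot(toy, aes(week, soc_dist, colour = wave)) +
  geom_line() +
  # Direct labeling: put the series name next to the line end
  geom_text(data = ends, aes(label = wave), hjust = -0.1) +
  # Annotate: one short, horizontal note that states the story
  annotate("text", x = 3, y = 0.9, hjust = 0, colour = "grey30",
           label = "Smaller reduction in the fall wave") +
  # Check your axes: "natural" increments only
  scale_x_continuous(breaks = seq(2, 10, by = 2), limits = c(1, 11.5)) +
  scale_y_continuous(breaks = c(0, 0.5, 1), limits = c(0, 1),
                     labels = scales::percent) +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  # Remove chart junk: no legend, fewer grid lines, muted axis text
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    axis.text = element_text(colour = "grey50", size = 8)
  )
```

Finer typographic control, icons, and introductory text are usually easier to handle in layout software, as described above.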

So, without further ado, this is the outcome of this last processing step.

Finalized Graph 1


Finalized Graph 2


Finalized Graph 3


Nice, huh?

An interactive display

As a final goodie, we also played around with an interactive display that combines graph 2 and graph 3, allowing readers to dive from country-wave level data into the time series. This graphic is not neatly designed but gives you an intuition of how you can use interactive features to showcase complex data.

library(ggiraph)
library(grid)
library(gridExtra)

# Render the two time series panels for one country into a PNG and
# return an HTML snippet that can serve as a tooltip
produce_country_wave_html <- function(ic) {
  tfile <- tempfile(fileext = ".png")
  png(tfile, width = 300, height = 250)

  p_cases <- ggplot(ave_measures) +
    geom_line(aes(date, ave_new_cases_wk_per_100e5, colour = iso3c)) +
    gghighlight(iso3c == ic) +
    labs(
      title = unique(ave_measures$country[ave_measures$iso3c == ic]),
      subtitle = "Weekly new cases",
      x = "",
      y = ""
    ) +
    theme_minimal() +
    scale_color_manual(values = my_palette) +
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.major.y = element_line(size = 0.5),
      plot.title.position = "plot"
    )

  p_soc_dist <- ggplot(ave_measures) +
    geom_line(aes(date, ave_soc_dist_google, colour = iso3c)) +
    gghighlight(iso3c == ic) +
    theme_minimal() +
    scale_color_manual(values = my_palette) +
    labs(
      x = "",
      y = "",
      subtitle =  "% Reduction in social interaction"
    ) +
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.major.y = element_line(size = 0.5),
      plot.title.position = "plot"
    )

  grid.arrange(rbind(ggplotGrob(p_cases), ggplotGrob(p_soc_dist), size = "last"))
  dev.off()

  # Encode the PNG file as base64 text
  txt <- RCurl::base64Encode(
    readBin(tfile, "raw", file.info(tfile)[1, "size"]),
    "txt"
  )

  # Embed the encoded PNG in an <img> tag as a data URI
  html_snippet <- htmltools::HTML(sprintf('<img src="data:image/png;base64,%s">', txt))
  return(html_snippet)
}

# One tooltip snippet per country
html_lup <- tibble(
  iso3c = unique(ave_measures$iso3c),
  html_code = sapply(iso3c, produce_country_wave_html, USE.NAMES = FALSE)
)

df <- soc_dist_by_wave %>%
  left_join(html_lup, by = "iso3c")

p <- ggplot(df, aes(
  x = cases, y = soc_dist, color = wave,
  label = iso3c
)) +
  geom_point_interactive(aes(
    tooltip = html_code,
    data_id = iso3c
  )) +
  geom_text_repel(color = "black") +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  theme_minimal() +
  scale_color_manual(values = my_palette) +
  labs(
    x = "Peak Weekly New Cases by 100,000 inhabitants\n(log scale)",
    y = "Reduction in Social Interaction\n(percentage to baseline)",
    caption = caption_text,
    color = "Covid-19 Wave"
  ) +
  theme(
    legend.position = c(0.8, 0.8)
  )

tooltip_css <- "background-color:transparent;"

girafe(
  ggobj = p,
  options = list(
    opts_tooltip(
      css = tooltip_css
    )
  )
)

[Interactive display: scatter plot of “Peak Weekly New Cases by 100,000 inhabitants (log scale)” against “Reduction in Social Interaction (percentage to baseline)”, colored by Covid-19 wave (Spring, Fall) and labeled by country code (BEL, DEU, ESP, FRA, GBR, ITA, NLD, PRT, USA); hovering over a point shows that country's time series. Caption: Contains social distancing data up to Nov 15 and is based on countries that experienced a significant first Covid-19 wave in March/April (Belgium, Germany, Spain, France, United Kingdom, Italy, Netherlands, Portugal, United States). Data and code: https://github.com/joachim-gassen/tidycovid19.]
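
The tooltip approach appears to rest on one small trick: a PNG file is turned into base64 text and placed inside an `<img>` tag as a data URI, so the image travels with the HTML. Here is a minimal standalone sketch of that step. It assumes the {base64enc} package is installed (the code above uses `RCurl::base64Encode()` instead), and `png_to_img_tag()` is a hypothetical helper name.

```r
# Hypothetical helper: read a PNG file and wrap it in an <img> data URI.
# Assumes {base64enc} is installed; RCurl::base64Encode() would work as well.
png_to_img_tag <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  sprintf('<img src="data:image/png;base64,%s">',
          base64enc::base64encode(bytes))
}

# Demo: draw a small plot into a temporary PNG and encode it
if (capabilities("png")) {
  tfile <- tempfile(fileext = ".png")
  png(tfile, width = 300, height = 250)
  plot(1:10, main = "demo")
  dev.off()

  tag <- png_to_img_tag(tfile)
  # Show only the start of the tag; the full string is long
  print(substr(tag, 1, 40))
}
```

Since every tooltip carries a full image, the resulting widget grows with the number of points, which is worth keeping in mind for larger country samples.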

Conclusion

These are our ideas for a graphical story that compares the social distancing effects of the two large Covid-19 waves. What are yours? Let us know!


To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

The post Global Lockdown Effects on Social Distancing: A Graphical Primer first appeared on R-bloggers.

Updated Apache Drill R JDBC Interface Package {sergeant.caffeinated} With {dbplyr} 2.x Compatibility


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

While the future of the Apache Drill ecosystem is somewhat in play (MapR, a major sponsoring org for the project, is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some great improvements, including JSON resultset streaming for the REST API. Alas, tweaking {sergeant} (my REST API R package) to handle that is not on the TODO for the foreseeable future, so I’ve been using {sergeant.caffeinated} (https://github.com/hrbrmstr/sergeant-caffeinated), an RJDBC wrapper for the Drill JDBC interface, for quite a while since it handles large resultsets quite nicely.

I broke out the RJDBC functionality from {sergeant} into this separate package since, despite the fact that it’s 2019/2020, many folks still have/had problems getting {rJava} to work (FWIW it’s a seamless install for me on Windows, Ubuntu, or macOS, even Apple Silicon macOS). The surgery to separate it was fairly hack-ish (one reason it’s not on CRAN) and it finally broke with the recent {dbplyr} 2.x release. I assumed fixing the caffeinated version was easier/quicker than the REST API version, so I dug in and am cautiously tossing it out for wider poking.

An All New Way To Use 💂☕

Gone are the days of src_drill_jdbc(); enter the new era of more standardized {DBI} and {d[b]plyr} access to Apache Drill. To install this version you can do:

remotes::install_github("hrbrmstr/sergeant-caffeinated")

(more install options using safer and saner social coding sites coming soon).

Let’s load up the package(s) and perform some operations.

library(sergeant.caffeinated)

test_host <- Sys.getenv("DRILL_TEST_HOST", "localhost")

be_quiet()

(con <- dbConnect(drv = DrillJDBC(), sprintf("jdbc:drill:zk=%s", test_host)))
##

The DRILL_TEST_HOST environment variable contains the hostname or IP address of my/your Drill server, defaulting to localhost if none is found.
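
The fallback behaviour is plain base R: `Sys.getenv()` returns its second argument (`unset`) when the variable is not defined. Here is a small sketch that needs no Drill server; `drill_jdbc_url()` is a hypothetical helper and `drill.example.internal` is a made-up hostname.

```r
# Hypothetical helper: build the JDBC URL from DRILL_TEST_HOST,
# falling back to "localhost" when the variable is not set
drill_jdbc_url <- function(host = Sys.getenv("DRILL_TEST_HOST", unset = "localhost")) {
  sprintf("jdbc:drill:zk=%s", host)
}

Sys.unsetenv("DRILL_TEST_HOST")
print(drill_jdbc_url())
# [1] "jdbc:drill:zk=localhost"

Sys.setenv(DRILL_TEST_HOST = "drill.example.internal")  # made-up hostname
print(drill_jdbc_url())
# [1] "jdbc:drill:zk=drill.example.internal"

Sys.unsetenv("DRILL_TEST_HOST")
```

Note that a variable set to an empty string counts as set, so in that case the default would not kick in.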

The be_quiet() function stops the Java engine from yelling at you with “illegal reflective access” warnings. If you see this in other rJava-powered packages it means code in some classes in some Java archive files is doing some sketchy old-school things that newer JVMs aren’t happy about. At some point, these warnings become full-on errors which will break many things. Unfortunately, Drill is still fairly tied to Java 8.x and has tons of introspecting code. The warnings are ugly, so if you want to get rid of them, just call this function before doing anything with Drill. (You’ll also notice log4j errors are finally gone!)

Now that we have a Drill JDBC connection, we can do something with it. All the DBI-ish operations work, but it’s 2020 and {d[b]plyr} is the bee’s knees, so we’ll just dive right in with that:

(db <- tbl(con, "cp.`employee.json`"))
## # Source:   table [?? x 16]
## # Database: DrillJDBCConnection
##    employee_id full_name first_name last_name position_id position_title store_id
##  1           1 Sheri No… Sheri      Nowmer              1 President             0
##  2           2 Derrick … Derrick    Whelply             2 VP Country Ma…        0
##  3           4 Michael … Michael    Spence              2 VP Country Ma…        0
##  4           5 Maya Gut… Maya       Gutierrez           2 VP Country Ma…        0
##  5           6 Roberta … Roberta    Damstra             3 VP Informatio…        0
##  6           7 Rebecca … Rebecca    Kanagaki            4 VP Human Reso…        0
##  7           8 Kim Brun… Kim        Brunner            11 Store Manager         9
##  8           9 Brenda B… Brenda     Blumberg           11 Store Manager        21
##  9          10 Darren S… Darren     Stanz               5 VP Finance            0
## 10          11 Jonathan… Jonathan   Murraiin           11 Store Manager         1
## # … with more rows, and 9 more variables: department_id, birth_date,
## #   hire_date, salary, supervisor_id, education_level,
## #   marital_status, gender, management_role

Basically, that’s it: it “just works”.

FIN

If you’ve been a user of {sergeant.caffeinated} and really need src_drill_jdbc() back, drop an issue on GH or a note in the comments, and be sure to file issues if I’ve missed anything as you kick the tyres.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

The post Updated Apache Drill R JDBC Interface Package {sergeant.caffeinated} With {dbplyr} 2.x Compatibility first appeared on R-bloggers.

Little useless-useful R functions – Making scatter plot from an image

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].

This function takes a JPG image, converts it to a raster, pixelizes it, and plots the result as a scatter plot.

Sounds like another useless R function: one that produces a scatter plot in the shape of a logo, with a smooth curve on top.

Example with Amazon logo

So the function is using the image manipulation part:

library(magick)  # image functions; magick also re-exports the %>% pipe

img <- magick::image_read("image/amazonLogo.jpg")
img <- img %>%
  image_quantize(max = 2, colorspace = 'gray', dither = TRUE) %>%
  image_scale(geometry = geometry_size_pixels(width = 25, height = 20, preserve_aspect = FALSE))

# Image manipulation: dark pixels (value <= 180) become 1, light pixels become 0
mat <- t(1L - 1L * (img[[1]][1,,] > 180))
mat_df <- data.frame(mat)
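To see what the thresholding expression does, it can be tried on a toy matrix. This is only a sketch: the gray values below are made up, and a plain numeric matrix stands in for the raw bitmap that magick returns.

```r
# Made-up gray values (0 = black, 255 = white) standing in for img[[1]][1,,]
gray <- matrix(c(250,  30, 200,
                  10, 255,  90), nrow = 2, byrow = TRUE)

# TRUE where the pixel is light; 1L * TRUE is 1, so 1L - ... flips it:
# dark pixels become 1, light pixels become 0. t() transposes the result.
mat <- t(1L - 1L * (gray > 180))
mat
```

Only the cells equal to 1 (the dark pixels) end up as points in the scatter plot.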

The second part transforms the data into a data frame:

# Melt data: collect the (row, column) position of every cell equal to 1
dff <- data.frame(x = NULL, y = NULL)
for (i in 1:nrow(mat_df)) {
  for (j in 1:ncol(mat_df)) {
    if (mat_df[i, j] == 1) {
      d <- data.frame(x = i, y = j)
      dff <- rbind(dff, d)
    }
  }
}
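The nested loop grows dff one row at a time, which gets slow for larger images. A possible shortcut, sketched here on a small made-up 0/1 matrix standing in for the pixel matrix (an assumption for illustration, not the author's code), is base R's which() with arr.ind = TRUE:

```r
# Made-up 0/1 matrix standing in for the pixel matrix built above
mat <- matrix(c(1, 0, 0,
                0, 1, 1), nrow = 2, byrow = TRUE)

# which(..., arr.ind = TRUE) returns the (row, col) index of every matching cell
idx <- which(mat == 1, arr.ind = TRUE)
dff <- data.frame(x = idx[, "row"], y = idx[, "col"])
dff
```

The points come back in column-major order rather than the row-by-row order of the loop, which makes no difference for a scatter plot.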

And the last part is a simple ggplot scatter plot:

library(ggplot2)

# draw scatter
g <- ggplot(dff, aes(x = x, y = y)) +
  geom_point() +
  scale_x_reverse() +
  coord_flip()
g + theme(panel.background = element_rect(fill = "white", colour = "grey"))

# draw scatter with jitter
g <- ggplot(dff, aes(x = x, y = y)) +
  geom_point() +
  geom_jitter() +
  scale_x_reverse() +
  coord_flip()
g + theme(panel.background = element_rect(fill = "white", colour = "grey"))

# draw scatter with smooth and CI
g <- ggplot(dff, aes(x = x, y = y)) +
  geom_point() +
  scale_x_reverse() +
  coord_flip() +
  geom_smooth()
g + theme(panel.background = element_rect(fill = "white", colour = "grey"))

As always, complete code is available at Github.

And some examples with famous logos:

And you can see the pattern 🙂

Happy R-ing! And stay healthy.

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

The post Little useless-useful R functions – Making scatter plot from an image first appeared on R-bloggers.

Graphical User Interfaces were a mistake but you can still make things right

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers].

Some weeks ago I tweeted this:

GUIs were a mistake

— Bruno Rodrigues (@brodriguesco) October 9, 2020

You might think that I tweeted this as an unfunny joke, but I didn’t. GUIs were one of the worst things to happen to statisticians. Clickable interfaces for data analysis are probably one of the greatest sources of mistakes and errors in data processing, very likely costing companies worldwide many millions, and are a source of constant embarrassment when mistakes happen that cost institutions or people their reputation and money.

Remember the infamous Excel mistake by Reinhart and Rogoff? If you don’t know what I’m talking about, you can get up to speed by reading this. I think the most interesting sentence is this:

The most serious was that, in their Excel spreadsheet, Reinhart and Rogoff had not selected the entire row when averaging growth figures: they omitted data from Australia, Austria, Belgium, Canada and Denmark.

This is a typical mistake that happens when a mouse is used to select data in a GUI, instead of typing whatever you need in a scripting language. Many other mistakes like that happen, and they remain hidden, potentially for years, or go unreported.

Recently there was another Excel-related problem in England where positive Covid tests got lost. For some obscure reason, the raw data, which was encoded in a CSV file, got converted into an Excel spreadsheet, most likely for further analysis. The problem is that the format that was used was the now obsolete XLS format, which is limited to 65,536 rows per sheet, instead of the newer XLSX format, which can handle just over a million. Because the data was converted to the XLS format, up to 15841 cases were lost. You can get all the details from this BBC article. Again, not entirely Excel’s fault, as it was misused. The problem is that when all you have is a hammer, everything looks like a nail, and Excel is that data analytics hammer. So to the uncultured, everything looks like an Excel problem.
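To make the row limits concrete: a worksheet holds at most 2^16 rows in XLS and 2^20 rows in XLSX. The check below is only a sketch, and the helper name fits_in_sheet() is hypothetical (it is not part of any package):

```r
# Maximum number of rows per worksheet in each Excel format
max_rows <- c(xls = 2^16, xlsx = 2^20)  # 65,536 and 1,048,576

# Hypothetical helper: TRUE if n_rows data rows plus one header row
# fit on a single sheet of the given format
fits_in_sheet <- function(n_rows, format = c("xlsx", "xls")) {
  format <- match.arg(format)
  n_rows + 1 <= max_rows[[format]]
}

fits_in_sheet(100000, "xls")   # FALSE
fits_in_sheet(100000, "xlsx")  # TRUE
```

A check like this, run in a script before any export, fails loudly instead of silently dropping rows.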

Now don’t misunderstand me; I’m not blaming Excel specifically, or any other specific GUI application for this. In many cases, the problem lies between the keyboard and the chair. But GUI applications bear part of the responsibility, as they allow users to implement GUI-based workflows. I think that complex GUI-based workflows were an unintended consequence of developing GUIs. Who could have expected, 40 years ago, that office jobs would evolve so much and that they would require such complex workflows to generate an output?

Consider the life-cycle of a shared Excel file in your typical run-of-the-mill financial advisory firm. In many cases, it starts with an already existing file that was made for another client and that is now used as a starting point. The first thing to do is to assign a poor junior to update the file and adapt it for the current assignment. He or she will spend hours trying to reverse engineer this Excel file and then update it. This file will at some point go to more senior members who will continue working on it, until it gets sent off for review to a manager. This manager, already overworked and with little time between meetings to review the file correctly, just gives it a cursory glance and might find some mistakes here and there. As a review method, colours and comments will be used. The file goes back for a round of updates and reviews. As time goes by, and as the file gets more and more complex, it starts to become impossible to manage and review properly. It eventually gets used to give advice to a client, which might be totally wrong, because just as in the case of Reinhart and Rogoff, someone, at some point, somewhere, did not select the right cells for the right formula. Good luck ever finding this mistake, and who made it.

During my consulting years, I have been involved with very, very big clients that were completely overwhelmed because all their workflows were GUI-based. They had been working like that for years, and kept recruiting highly educated people en masse just to manage Excel and Word files. They were looking for a magic, AI-based solution, because in their minds, if AIs could drive fricking cars, they should also be able to edit and send Excel files around for review. Well, we’re not quite there yet, so we told them, after our review of their processes and data sources (which in many cases were Excel AND Word files), that what they needed was for their company to go through an in-depth optimisation process “journey”. They weren’t interested so they kept hiring very intelligent people to be office drones. I don’t think that business model can remain sustainable.

Now how much are situations like that the fault of Excel and how much personal responsibility do the people involved have? I don’t know, but my point is that if, by magic, GUIs were made to disappear, problems like that would also not exist. The reason is that if you’re forced to write code to reach the results you want, you avoid a lot of the pitfalls I just described. Working with scripts and the command line forces a discipline upon you; you cannot be lazy and click around. For example, reverse engineering a source code file is much easier than reverse engineering a finished Excel spreadsheet. Even poorly written and undocumented code is always much better than an Excel spreadsheet. If you throw a version control system in the mix, you have the whole history of the file and the ability to know exactly what happened and when. Add unit tests to the pile, and you start to get something that is very robust, transparent, and much easier to audit.

“But Bruno, not everyone is a programmer!” I hear you scream at your monitor.

My point, again, is that if GUIs did not exist, people would have enough knowledge of these tools to be able to work. What other choice would they have?

Of course, GUIs have been invented, and they’re not going anywhere. So what can you do?

When it comes to statistics and data analysis/processing, you can at least not be part of the problem and avoid using Excel altogether. If we go back to our previous scenario from the financial advisory firm, the first step, which consisted of reverse engineering an Excel file, can be done using {tidyxl}. Let’s take a quick look; the spreadsheet I used as the header image for this blog post comes from the Enron corpus, which is mostly known for being a database of over 600000 emails from the US company Enron. But it also contains spreadsheets, which are delightful. You can download the one from the picture here (8mb xlsx warning). Opening it in your usual spreadsheet application will probably cause your heart rate to increase to dangerous levels, so avoid that. Instead, let’s take a look at what {tidyxl} does with it:

library(tidyxl)
## Warning: package 'tidyxl' was built under R version 4.0.3
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.2     ✔ purrr   0.3.4
## ✔ tibble  3.0.1     ✔ dplyr   1.0.0
## ✔ tidyr   1.1.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

dutch_quigley_9378 <- xlsx_cells("~/six_to/spreadsheets/dutch_quigley__9378__modeldutch.xlsx")

head(dutch_quigley_9378)
## # A tibble: 6 x 21
##   sheet address   row   col is_blank data_type error logical numeric
## 1 Swap… A1          1     1 FALSE    character   NA           NA
## 2 Swap… D2          2     4 FALSE    character   NA           NA
## 3 Swap… E2          2     5 FALSE    character   NA           NA
## 4 Swap… F2          2     6 FALSE    character   NA           NA
## 5 Swap… G2          2     7 FALSE    character   NA           NA
## 6 Swap… D3          3     4 FALSE    character   NA           NA
## # … with 12 more variables: date, character,
## #   character_formatted, formula, is_array,
## #   formula_ref, formula_group, comment, height,
## #   width, style_format, local_format_id

That whole Excel workbook is inside a neat data frame. Imagine that you want to quickly know where all the formulas are:

dutch_quigley_9378 %>%
  filter(!is.na(formula)) %>%
  count(sheet, address)
## # A tibble: 18,776 x 3
##    sheet address     n
##  1 Front B22         1
##  2 Front C13         1
##  3 Front C2          1
##  4 Front C22         1
##  5 Front C25         1
##  6 Front C26         1
##  7 Front C27         1
##  8 Front C28         1
##  9 Front C30         1
## 10 Front C31         1
## # … with 18,766 more rows

With the code above, you can quickly find, for each sheet, where the formulas are. This workbook contains 18776 formulas. If Hell is a real place, it’s probably an office building full of cubicles where you’ll sit for eternity looking at these spreadsheets and trying to make sense of them.
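The same idea gives a formula count per sheet. Here is a sketch in base R, assuming a tiny hand-made data frame (all values are made up) that has the same sheet, address and formula columns as the output of xlsx_cells():

```r
# Made-up stand-in for a few rows of xlsx_cells() output
cells <- data.frame(
  sheet   = c("Front", "Front", "Swap", "Swap", "Swap"),
  address = c("B22", "A1", "F1", "G1", "A2"),
  formula = c("SUM(B1:B21)", NA, "DAY(EOMONTH(G1,0))", "A11", NA),
  stringsAsFactors = FALSE
)

# Keep only the cells that contain a formula, then count them per sheet
has_formula <- !is.na(cells$formula)
formulas_per_sheet <- table(cells$sheet[has_formula])
formulas_per_sheet
```

On the real workbook, the equivalent would be to count by sheet only instead of by sheet and address.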

Now imagine that you’d like to know what these formulas are, let’s say, for the Swap sheet:

dutch_quigley_9378 %>%
  filter(sheet == "Swap", !is.na(formula)) %>%
  select(address, formula)
## # A tibble: 6,773 x 2
##    address formula
##  1 F1      DAY(EOMONTH(G1,0))
##  2 G1      A11
##  3 E2      BE9
##  4 A3      BQ5
##  5 E3      BF9
##  6 F3      SUM(G3:K3)
##  7 H3      $F$1*H2
##  8 I3      $F$1*I2
##  9 J3      $F$1*J2
## 10 K3      $F$1*K2
## # … with 6,763 more rows

Brilliant! Maybe you’re interested to find all the "SUM" formulas? Easy!

dutch_quigley_9378 %>%
  filter(sheet == "Swap", !is.na(formula)) %>%
  filter(grepl("SUM", formula)) %>%
  select(address, formula)
## # A tibble: 31 x 2
##    address formula
##  1 F3      SUM(G3:K3)
##  2 E4      SUM(D11:D309)
##  3 F5      SUM(G5:K5)
##  4 E6      SUM(F6:H6)
##  5 BF8     SUM(BF11:BF242)
##  6 B9      SUM(B47:B294)
##  7 AB9     SUM(AB11:AB253)
##  8 AC9     SUM(AC11:AC253)
##  9 AD9     SUM(AD11:AD253)
## 10 AE9     SUM(AE11:AE253)
## # … with 21 more rows

You get the idea. There are many more things that you can extract, such as the formatting, the contents of the cells, the comments (and where to find them) and much, much more. This makes understanding a complex Excel file a breeze.

The other thing that you can do, once you’re done understanding this monster Excel file, is not to perform the analysis inside Excel. Don’t fall into the temptation of continuing this bad habit. As one of the data experts in your team/company, you have a responsibility to bring the light to your colleagues. Be their Prometheus and decouple the data from the code. Let the data be in Excel, but write all the required code to create whatever is expected from you inside R. You can then export your finalized results back to Excel if needed. If management tells you to do it in Excel, tell them that you’re the professional statistician/data scientist, and that they shouldn’t tell you how to do your job. Granted, this is not always possible, but you should plead your case as much as possible. In general, a good manager will be all ears if you explain that not using GUIs like Excel makes it easier to spot and correct mistakes, with the added benefit of being much more easily audited and with huge time savings in the long run. This is of course easier for completely new projects, and if you have an open-minded manager. If you’re the manager, then you should ask your IT department to uninstall Excel from your team members’ computers.

Be brave, and ditch the GUI.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

The post Graphical User Interfaces were a mistake but you can still make things right first appeared on R-bloggers.

Boosting nonlinear penalized least squares

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers].

For reasons I couldn’t foresee, there’s been no blog post here on November 13 and November 20. So, here is the post about LSBoost announced here a few weeks ago.

First things first, what is LSBoost? Gradient boosted nonlinear penalized least squares. More precisely in LSBoost, the ensembles’ base learners are penalized, randomized neural networks.

These previous posts, with several Python and R examples, constitute a good introduction to LSBoost:

More recently, I’ve also written a more formal, short introduction to LSBoost:

The paper’s code – and more insights on LSBoost – can be found in the following Jupyter notebook:

Comments, suggestions are welcome as usual.

To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

The post Boosting nonlinear penalized least squares first appeared on R-bloggers.

Gold-Mining Week 11 (2020)

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers].

Week 11 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 11 (2020) appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

The post Gold-Mining Week 11 (2020) first appeared on R-bloggers.

The Purpose of our Data Science Chalk Talk Series

[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers].

I’d like to share an introduction to my data science chalk talk series.

(video link, series link)

To leave a comment for the author, please follow the link and comment on their blog: R – Win Vector LLC.

The post The Purpose of our Data Science Chalk Talk Series first appeared on R-bloggers.


Online Tests for TestVision with R/exams

[This article was first published on R/exams, and kindly contributed to R-bloggers].

Generating, importing, and customizing online tests for TestVision with R/exams.

Motivation

This tutorial illustrates how to use R/exams for creating online exams for TestVision. TestVision is a Dutch commercial online testing company which administers online exams for many universities in the Netherlands. Until now the import of external material from R/exams into TestVision was rather limited, and therefore a new function was necessary.

Here, two things are illustrated:

  1. Steps required in R/exams to create tests in TestVision format.
  2. Steps required in TestVision online to import these tests into the TestVision system.

An accompanying video guide is available on YouTube at https://www.youtube.com/watch?v=rrpudw2aKVc.

exams2testvision

Steps in R

Install the package

At the moment (November 2020) the exams2testvision() function is not yet part of the CRAN version of R/exams. For the time being one should use the development version from R-forge. To obtain it, use:

install.packages("exams", repos = "http://R-Forge.R-project.org")

Note that this line of code only needs to be run once, and that after that R/exams is permanently installed on your machine.

Run example

After loading the exams package, we create an exam called myexam which is a list of exercises. It consists of, respectively, a num, schoice, mchoice, string, and cloze item.

library("exams")

myexam <- list(
  "calcmean.Rmd",
  "tstat2.Rmd",
  "relfreq.Rmd",
  "essayreg.Rmd",
  "boxhist.Rmd"
)

The first item in the list of exercises, calcmean.Rmd (or alternatively calcmean.Rnw in R/LaTeX format), is a question that is not part of the package. It was added to show (a) how a table may be generated, and (b) what this table looks like in TestVision. For things to work, this exercise should be saved in the working directory. The remaining exercises do not need to be copied as they are shipped within the package. For more information on these exercises, see: tstat2, relfreq, essayreg, boxhist.

As a first quick check that all exercises work correctly and can be rendered well in a browser, we use the following code (setting a seed for exact reproducibility):

set.seed(127)
exams2html(myexam, converter = "pandoc-mathjax")

testvision.html

Subsequently, we create the import file for an exam in TestVision.

set.seed(127)
exams2testvision(myexam)

In the working directory a zip file called testvision.zip is created, which includes (a) a collection of XML files containing the exercises in TestVision format (based on QTI 2.1) and (b) a directory containing supplementary material, such as images and data files.

Steps in TestVision

Importing R/exams output

To import the exams created using R/exams into TestVision the following steps should be performed:

  1. Log in to your institution’s TestVision site.
  2. Select ‘Vragen’ (‘Questions’).
  3. Select ‘Import’ in the upper left corner. A pop-up screen called ‘Vragen importeren’ (‘Import questions’) appears.
  4. Select the zip file ‘testvision.zip’ on your computer.
  5. Check the option ‘Minder strikte import controle’ (‘less strict import evaluation’).
  6. Click ‘OK’. Uploading may take a while!

Comment: Here ‘Minder strikte import controle’ was required because the first exercise contains a table. TestVision has very strict rules for the HTML structure of tables, and when this option is not chosen uploading fails.

Inspecting the exam

Once imported you can take a closer look at the content using ‘Preview’. To permanently import the questions they should be moved to a directory.

  1. Under ‘…’ select ‘Geïmporteerde vragen verplaatsen’ (‘Move imported questions’).
  2. Select an appropriate directory and click ‘OK’. In this directory the questions can be edited and inspected more closely.
  3. Select ‘Bewerken’ (‘Edit’) to learn more about the content and settings and/or edit the exercises.

Note that formula content is displayed using a relatively large font size. This is a TestVision issue. Hopefully it will disappear in future versions of the system. For now, the font size can be adjusted manually.

To employ the collection of items in an online exam, (a) their status should be changed (from ‘Concept’) into ‘Approved (Aan)’, (b) a new test should be created by selecting ‘Toets’ (‘Test’) in the main menu, and (c) the new items should be included in the new test. For more information see the help function in TestVision.

Funding

The work on the exams2testvision() function, the video tutorial, and this blog are part of the ShareStats project, and are financially supported by the Dutch Ministry of Education, Culture and Science (Project code OL20-06), and the University of Amsterdam.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.

The post Online Tests for TestVision with R/exams first appeared on R-bloggers.

COVID-19 dashboard page now up

[This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers].

I aggregated the visualizations from my previous two posts on COVID-19 cases in Arkansas and the U.S. on a separate landing page. I wanted a quick place to look a few times a week at how my home state is handling the coronavirus pandemic (spoiler alert: terrible at present). You can find the new page here.

Thanks for reading!

To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney.

The post COVID-19 dashboard page now up first appeared on R-bloggers.

Plotting the excessive number of deaths in 2020 by age (with the eurostat package)

[This article was first published on Stories by Przemyslaw Biecek on Medium, and kindly contributed to R-bloggers].

Number of deaths in consecutive weeks. See the second plot for the whole story.

Recently there have been several blog entries showing the excessive number of deaths in different countries. I discovered that in the eurostat database (1) one can find current data on the number of deaths, (2) this number is broken down by age, gender and geographical area, and (3) one can use the ‘eurostat’ package to easily read and plot these data.

It turns out that the difference in the number of deaths by age leads to interesting observations.

Read the data from the demo_r_mwk_10 table in eurostat.

    library(eurostat)
    mdata <- get_eurostat("demo_r_mwk_10")
    mdata2010 <- mdata[as.character(mdata$time) >= "2010",]

Do some cleaning in order to select only interesting age groups, interesting countries and genders (here T stands for Total).

    age_group <- c("Y_LT10" = "<10", "Y10-19" = "10-19", "Y20-29" = "20-29",
                   "Y30-39" = "30-39", "Y40-49" = "40-49", "Y50-59" = "50-59",
                   "Y60-69" = "60-69", "Y70-79" = "70-79", "Y_GE80" = "> 80")
    mdata2010 <- mdata2010[mdata2010$age %in% names(age_group), ]
    mdata2010$age <- factor(mdata2010$age, levels = names(age_group), labels = age_group)
    geo_group <- c("SE" = "Sweden", "BE" = "Belgium", "ES" = "Spain",
                   "UK" = "United Kingdom", "FR" = "France", "PL" = "Poland",
                   "DE" = "Germany", "IT" = "Italy")
    mdata2010 <- mdata2010[mdata2010$geo %in% names(geo_group), ]
    mdata2010$geo <- factor(mdata2010$geo, levels = names(geo_group), labels = geo_group)
    mdata2010 <- mdata2010[mdata2010$sex %in% c("T"), ]
    mdata2010$year <- substr(mdata2010$time, 1, 4)
    mdata2010$week <- as.numeric(substr(mdata2010$time, 6, 7))
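The last two lines of the cleaning step only work if the week identifiers are fixed-width strings. Here is a minimal sketch of that parsing, assuming the time column holds values shaped like "2020W45". The sample values are made up for illustration and are not taken from the eurostat table.

```r
# Hypothetical week identifiers, assumed to follow a "YYYYWww" pattern
time_example <- c("2010W01", "2019W52", "2020W45")

# Characters 1-4 hold the year, characters 6-7 hold the week number
year <- substr(time_example, 1, 4)
week <- as.numeric(substr(time_example, 6, 7))

print(data.frame(time = time_example, year = year, week = week))
```

If the column is stored in another format, the substr() offsets would need to change accordingly.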

And plot it with ggplot2. Note that we force 0 to appear in the plot (geom_hline), show a smoothed average for the years 2010–2019 (geom_smooth), draw this year as a step function because we have weekly aggregates (geom_step), and split everything into small panels with facet_wrap().

    library(ggplot2)

    ggplot(mdata2010, aes(week, values, group=paste(sex, year))) +
      geom_line(data = mdata2010[mdata2010$year!="2020",], alpha = 0.1) +
      geom_smooth(data = mdata2010[mdata2010$year!="2020",], se = FALSE, group = 1, color = "black", size=0.6) +
      geom_hline(yintercept = 0, color="grey", size=0.5) +
      geom_step(data = mdata2010[mdata2010$year=="2020",], color = "red3") +
      facet_wrap(geo~age, scales = "free_y", ncol = 9) + xlim(0,52) + ylab("Number of deaths (eurostat)") +
      geom_vline(xintercept = seq(0,50,10), color="grey", lty=3)+
      DALEX::theme_ema() + ggtitle("Excessive deaths in 2020 by age\nRed - data for 2020, grey - data for 2010-2019, black - average for 2010-2019")

The results are below. A quick observation: mortality among young people is low and we do not observe excessive deaths below age 50. Note that for most countries we have data up to weeks 44–45.

With better resolution: https://raw.githubusercontent.com/MOCOS-COVID19/mortality/master/excessive_deaths/ed_eurostat.png

You can find the code and images presented below on the MOCOS (MOdeling COronavirus Spread) GitHub.

The difference between the average number of deaths per week in 2020 and the average number of deaths per week in 2010–2019.
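The difference described in this caption can be sketched in base R. The numbers below are hypothetical toy counts for a single country and age group, not eurostat data, and the column names simply mirror the ones used earlier in the post. This is one plausible way to compute the difference, not necessarily how the plotted figure was produced.

```r
# Hypothetical toy data: weekly death counts for one country and age group
deaths <- data.frame(
  year   = rep(c(2018, 2019, 2020), each = 3),
  week   = rep(1:3, times = 3),
  values = c(100, 110, 120,   # 2018
             104, 106, 124,   # 2019
             130, 150, 125)   # 2020
)

# Baseline: mean count per week over the reference years (everything before 2020)
baseline <- aggregate(values ~ week, data = deaths[deaths$year != 2020, ], FUN = mean)
names(baseline)[2] <- "baseline"

# Excess: the 2020 count minus the baseline for the same week
current <- deaths[deaths$year == 2020, c("week", "values")]
excess <- merge(current, baseline, by = "week")
excess$excess <- excess$values - excess$baseline

print(excess)
```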

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.

In order to see more R related content visit https://www.r-bloggers.com


Plotting the excessive number of deaths in 2020 by age (with the eurostat package) was originally published in ResponsibleML on Medium, where people are continuing the conversation by highlighting and responding to this story.


To leave a comment for the author, please follow the link and comment on their blog: Stories by Przemyslaw Biecek on Medium.


The post Plotting the excessive number of deaths in 2020 by age (with the eurostat package) first appeared on R-bloggers.

Analyzing the Harmonic Structure of Music: Modes, Keys and Clustering Musical Genres


[This article was first published on Method Matters Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we will examine the harmonic properties of songs in my music collection. We will focus on two primary aspects of the music: the mode (e.g. whether the songs are played in major or minor keys), and the musical key itself (e.g. C major, D minor, etc. – “tonal home” for the songs). Finally, we’ll explore differences across genres in the modes and keys that the music is played in, and use this information to simultaneously cluster the musical keys and genres.

The Data

The data for this blog post come from the digital music (.mp3) files on my computer. I have most of the music I’ve listened to over the past 10 years in a digital format, and I extracted the artist, album, and musical genre information from ID3 tags included in the files (using code adapted from a previous blog post).

I then used the artist and album information to get the song mode and key for each album track from the Spotify API, which has catalogued this information for a huge number of albums. I queried the Spotify API using Python and the excellent Spotipy package. In total, I was able to retrieve the mode and key information for about 80% of the albums in my digital collection (obscure or niche recordings are not always available on Spotify).

The data and code for this analysis are available on Github here.

The head of the raw data looks like this:

artist_name | album_name | track_name                     | genre_clean | key_clean | mode_clean | master_key
Ed Sheeran  | ÷ (Deluxe) | Eraser                         | Pop         | Ab        | min        | Ab min
Ed Sheeran  | ÷ (Deluxe) | Castle on the Hill             | Pop         | D         | maj        | D maj
Ed Sheeran  | ÷ (Deluxe) | Dive                           | Pop         | E         | maj        | E maj
Ed Sheeran  | ÷ (Deluxe) | Shape of You                   | Pop         | Db        | min        | Db min
Ed Sheeran  | ÷ (Deluxe) | Perfect                        | Pop         | Ab        | maj        | Ab maj
Ed Sheeran  | ÷ (Deluxe) | Galway Girl                    | Pop         | A         | maj        | A maj
Ed Sheeran  | ÷ (Deluxe) | Happier                        | Pop         | C         | maj        | C maj
Ed Sheeran  | ÷ (Deluxe) | New Man                        | Pop         | G         | maj        | G maj
Ed Sheeran  | ÷ (Deluxe) | Hearts Don’t Break Around Here | Pop         | G         | maj        | G maj
Ed Sheeran  | ÷ (Deluxe) | What Do I Know?                | Pop         | Db        | min        | Db min

For each album, we have the album name and genre, artist, as well as the names of each song. For each song, we have the mode and the key as determined by Spotify. I’ve concatenated the mode and the key to create a variable called master_key, which contains the complete song key information. There are 8,503 songs in the cleaned dataset.
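The cleaning code is not shown in this post, so the following is only a sketch of how master_key could be derived. Spotify's audio features report the key as a pitch-class integer (0 = C, 1 = C♯/D♭, and so on) and the mode as 1 for major and 0 for minor. The pitch_names spelling below is an assumption chosen to match the labels in the table above, and the input integers are hypothetical.

```r
# Pitch-class spelling assumed to match the labels used in this post
pitch_names <- c("C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B")

# Hypothetical raw values in Spotify's encoding
spotify_key  <- c(8, 2, 1)   # pitch classes, 0-based
spotify_mode <- c(0, 1, 0)   # 1 = major, 0 = minor

# Decode, then concatenate key and mode into one label
key_clean  <- pitch_names[spotify_key + 1]   # R vectors are 1-based
mode_clean <- ifelse(spotify_mode == 1, "maj", "min")
master_key <- paste(key_clean, mode_clean)

print(master_key)
```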

Number of Songs Per Genre

In this blog post, we are interested in the musical properties of the songs in my music collection. We will look at the overall properties of the songs across all of our data, and we will also see how these musical qualities differ across genres.

As a first step in this process, let’s take a look at the frequency of the genres in our data set:

    # load the libraries we'll need
    library(plyr); library(dplyr)
    library(ggplot2)
    library(tidyverse)
    library(gplots)
    library(RColorBrewer)
    library(kableExtra)

    # barplot of song counts per genre
    raw_data %>%
      group_by(genre_clean) %>%
      summarise(num_songs=n()) %>%
      ggplot(aes(x = reorder(genre_clean, num_songs),
                 y = num_songs, fill = genre_clean)) +
      geom_bar(stat = 'identity') +
      geom_text(aes(label = num_songs),
                size = 4, hjust = -0.15) +
      coord_flip(ylim = c(0,3500)) +
      labs(x = "Genre", y = "Number of Songs",
           title = 'Number of Songs Per Genre' ) +
      theme(legend.position = "none")

Which yields the following plot:

[Figure: counts per genre]

The top three genres are rock (3,426 songs), rap (1,411 songs) and jazz (1,141 songs). This matches my intuition – it’s definitely the type of music that I listen to. It must be noted that “rock” is somewhat of a catch-all genre, encompassing many different sub-categories. What most of these songs have in common is that they are primarily guitar-driven.

Mode Analysis Across All Songs

Let’s first take a look at the mode of the songs. The mode is a property that describes the tonal base of a song. There’s lots to say about major and minor modes, and if you’re interested in learning more this Wikipedia page is a good place to start. A simple heuristic we can use for the present discussion is that major modes sound happy and upbeat, whereas minor modes sound sad and dark.

In this analysis, we will include all of the 8,503 songs across all of the genres. We can make a barplot of the distribution of major and minor modes like so:

    # barplot of mode across songs
    raw_data %>%
      select(genre_clean, mode_clean) %>%
      group_by(mode_clean) %>%
      # counts of songs per mode
      summarise(Percentage=n())  %>%
      # calculate the % of songs per mode
      mutate(Percentage=Percentage/sum(Percentage)*100,
             mode_clean = recode(mode_clean, 'maj' = 'Major',
                                 'min' = 'Minor'))  %>%
      # pass to ggplot
      ggplot(aes(x = reorder(mode_clean, Percentage) , y = Percentage, fill = mode_clean)) +
      geom_bar(stat = 'identity') +
      # specify the colors
      scale_fill_manual(name = "Mode", values = c('maroon', 'darkgrey')) +
      # add the value labels above the bars
      geom_text(aes(label = paste(round(Percentage, 0), "%", sep = '')), hjust = -0.1) +
      # flip the axes
      coord_flip(ylim = c(0,71)) +
      # add the titles
      labs(x = "Mode", y = "Percentage",
           title = 'Song Modes Across Music Collection (8503 Songs)' )

Which yields the following plot:

[Figure: mode overall]

Across all of the songs in my music collection, nearly 70% of them are in major modes. I was expecting that the majority of songs would be performed in major modes, but was somewhat surprised by the size of the difference.

Mode Analysis By Genre

Now let’s look at the distribution of modes across genres. In the analysis below, I select only genres with more than 200 songs, and I exclude rap music. The logic is that we’re focused here on musicians playing instruments, whereas rap music is often built around samples (which borrow from existing recordings, although there are definitely exceptions!). [1]

Let’s look at the modes across the different genres:

    # song mode by genre
    raw_data %>%
      group_by(genre_clean) %>%
      # count the number of songs per genre
      # and include that in our genre text
      mutate(num_per_genre = n(),
             master_genre = paste(genre_clean, " (N = ", num_per_genre, ")", sep = '')) %>%
      # select genres with 200+ songs and remove rap songs
      filter(num_per_genre > 200 & genre_clean != "Rap") %>%
      select(master_genre, mode_clean) %>%
      # group by genre and mode
      group_by(master_genre, mode_clean) %>%
      # calculate the number per mode per genre
      summarise(Percentage=n()) %>%
      # group by genre
      group_by(master_genre) %>%
      # and calculate the % per mode per genre
      # order the factor for the plot
      # (ordered by % major mode)
      mutate(Percentage=Percentage/sum(Percentage)*100,
             genre_clean_factor = factor(master_genre,
                                         levels = c("Country (N = 551)",
                                                    "Pop (N = 607)",
                                                    "Rock (N = 3426)",
                                                    "World (N = 319)",
                                                    "Jazz (N = 1141)",
                                                    "Soul / R&B (N = 205)")),
             mode_clean = recode(mode_clean, 'maj' = 'Major',
                                 'min' = 'Minor')) %>%
      # pass to ggplot
      ggplot(aes(x = mode_clean, y = Percentage, fill = mode_clean)) +
      # we want a bar plot
      geom_bar(stat = 'identity')  +
      # add the value labels to the bars
      geom_text(aes(label = paste(round(Percentage, 0), "%", sep = '')),
                hjust = .5, vjust =-.3, size = 3)  +
      # add the labels
      labs(x = "Mode", y = "Percentage",
           title = 'Song Mode Distributions By Music Genre' ) +
      # facet per genre
      facet_grid(. ~ genre_clean_factor) +
      # specify the colors
      scale_fill_manual(name = "Mode", values = c('maroon', 'darkgrey')) +
      theme(strip.text.x = element_text(size = 8))

[Figure: mode by genre]

There are definitely differences across genres. The genres with the most songs in “major” modes are country (at 83%!), followed by pop and rock (with 76% each). World, jazz and soul/r&b all have fewer, with jazz and soul/r&b having just under 60% of the songs in major modes. This matches my experience with listening to music in these genres: country, pop and rock are definitely more consistently happy and upbeat, as opposed to world, jazz and soul music.
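The per-genre percentages can be cross-checked in base R with a two-way table normalised by row. This is a sketch on hypothetical toy data (eleven invented songs), not on the collection analysed here. With margin = 1, prop.table() makes each genre row sum to 1, which mirrors the group_by(master_genre) step in the dplyr pipeline.

```r
# Hypothetical toy data: one row per song
toy <- data.frame(
  genre = c(rep("Country", 6), rep("Jazz", 5)),
  mode  = c("maj", "maj", "maj", "maj", "maj", "min",
            "maj", "maj", "maj", "min", "min")
)

# Count songs per genre and mode, then convert each genre row to percentages
counts <- table(toy$genre, toy$mode)
pct <- round(100 * prop.table(counts, margin = 1))

print(pct)
```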

Key Analysis Across All Songs

Now let’s take a look at the keys that the songs are played in. The key refers to the “group of pitches, or scale, that forms the basis of a music composition.” I won’t get into the details of musical keys here (see this Wikipedia page to learn more), but for the purpose of this analysis it’s enough to know that there are 12 pitches (C, C#, D, Eb, etc.), each of which can be paired with a major or minor mode to produce a total of 24 different possible keys (e.g. C major, A minor, etc.).
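The count of 24 keys can be spelled out by pairing every pitch with every mode. This is a small illustrative sketch. The pitch spelling is an assumption that follows the labels used elsewhere in this post.

```r
# Twelve pitches, spelled as in the tables of this post
pitches <- c("C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B")
modes   <- c("maj", "min")

# outer() applies paste() to every pitch/mode combination
all_keys <- as.vector(outer(pitches, modes, paste))

print(length(all_keys))
print(all_keys)
```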

We can plot the distribution of keys across all of the songs in my music collection with the following code:

    # percentage of keys across all songs
    raw_data %>%
      select(genre_clean, master_key, mode_clean) %>%
      group_by(master_key) %>%
      # calculate the number of songs for each key
      # hang on to the mode info - we'll use that
      # in our plot
      summarise(Percentage=n(),
                mode_clean = unique(mode_clean))  %>%
      # calculate the percentage of songs per key
      # recode the mode variable to make it clean
      # for the plot
      mutate(Percentage=Percentage/sum(Percentage)*100,
             mode_clean = recode(mode_clean, 'maj' = 'Major',
                                 'min' = 'Minor'))  %>%
      # pass the data on to ggplot
      ggplot(aes(x = reorder(master_key, Percentage) , y = Percentage, fill = mode_clean)) +
      # we want a bar plot
      geom_bar(stat = 'identity') +
      # specify the colors
      scale_fill_manual(name = "Mode", values = c('maroon', 'darkgrey')) +
      # add the value labels
      geom_text(aes(label = paste(round(Percentage, 1), "%", sep = '')), hjust = -0.1, size = 3.5) +
      # flip the chart
      coord_flip(ylim = c(0, 11.5)) +
      # add the labels
      labs(x = "Key", y = "Percentage",
           title = 'Song Keys Across Music Collection (8503 Songs)' )

Which returns this plot:

[Figure: key overall]

As we saw in our analysis above, the most popular keys are all in major modes. Furthermore, G, C and D major are the most popular keys overall, while B minor is the most popular minor key.

Key Analysis By Genre

Now let’s separate our analysis of key distribution by musical genre – do the patterns above differ across the genres in our data?

    # percentage of keys, separate per genre
    raw_data %>%
      group_by(genre_clean) %>%
      mutate(num_per_genre = n(),
             master_genre = paste(genre_clean, " \n(N = ", num_per_genre, ")", sep = '')) %>%
      filter(num_per_genre > 200 & genre_clean != "Rap") %>%
      select(master_genre, master_key) %>%
      group_by(master_genre, master_key) %>%
      summarise(Percentage=n()) %>%
      group_by(master_genre) %>%
      mutate(Percentage=Percentage/sum(Percentage)*100) %>%
      ggplot(aes(x = master_key, y = Percentage, fill = master_key)) +
      geom_bar(stat = 'identity') +
      # add the value labels above the bars
      geom_text(aes(label = paste(round(Percentage, 0), "%", sep = '')),
                hjust = .5, vjust =-.3, size = 2.5) +
      # rotate the x axis labels 90 degrees so they're vertical
      # and hide the legend
      theme(axis.text.x = element_text(angle = 90, vjust = .3, hjust=1),
            legend.position = "none" , strip.text.x = element_text(size = 100)) +
      labs(x = "Key", y = "Percentage",
           title = 'Song Key Distributions By Music Genre' ) +
      coord_cartesian(ylim = c(0,16)) +
      facet_grid(master_genre ~ .) +
      theme(strip.text.y = element_text(size = 9, angle = 0))

[Figure: key by genre]

The above graph is complete but somewhat overwhelming. We see the relative percentage within each genre for each of the 24 different keys, with a separate facet for each genre. Some keys appear to be universally popular (e.g. G major has a share of 8-14% across the genres), whereas other keys are much more frequent in some genres as compared to others (e.g. A major is relatively popular in country, rock, and pop, but much less so in jazz, soul/r&b and world music).

It is possible to eyeball every one of the 24 keys and compare differences across the genres, but we can leverage the variation in these data to cluster the keys and genres into groups. Below, we will make a simultaneous clustering of both the keys and the genres to distill the differences we see above into a single analysis and heatmap visualization that will make the underlying structure clearer.

Cluster Analysis + Heatmap

Preparing the Data

In order to make our heatmap, we need to extract the data we plotted above into a standalone dataset, which I do with the following code:

    # make the cluster data
    # use tidyverse here - column to rownames
    cluster_data <- raw_data %>%
      group_by(genre_clean) %>%
      mutate(num_per_genre = n()) %>%
      filter(num_per_genre > 200 & genre_clean != "Rap") %>%
      select(genre_clean, master_key) %>%
      group_by(genre_clean, master_key) %>%
      summarise(Percentage=n()) %>%
      group_by(genre_clean) %>%
      mutate(Percentage=Percentage/sum(Percentage)*100) %>%
      spread(master_key, Percentage) %>%
      replace(is.na(.), 0) %>%
      column_to_rownames(var = "genre_clean")

Our data set contains one row per genre, with the key row percentages contained in the columns:

    head(cluster_data, 10) %>%
      mutate_if(is.numeric, round, 2)%>%
      kable("html", align= 'c')
           |  A maj |  A min | Ab maj | Ab min |  B maj |  B min | Bb maj | Bb min |  C maj |  C min |  D maj |  D min | Db maj | Db min |  E maj |  E min | Eb maj | Eb min |  F maj |  F min | F# maj | F# min |  G maj |  G min
Country    |   9.26 |   1.63 |   2.90 |   0.91 |   5.63 |   1.63 |   4.72 |   0.73 |  12.16 |   1.81 |  11.07 |   1.09 |   4.17 |   1.63 |   7.80 |   2.54 |   2.54 |   0.54 |   6.90 |   2.54 |   2.36 |   0.91 |  13.61 |   0.91
Jazz       |   3.59 |   4.12 |   8.33 |   0.61 |   1.05 |   2.81 |   3.94 |   4.29 |   8.85 |   4.65 |   4.47 |   2.37 |   6.75 |   1.23 |   2.80 |   4.12 |   1.93 |   1.05 |   7.89 |   8.59 |   1.31 |   1.31 |   7.89 |   6.05
Pop        |   8.40 |   4.12 |   4.45 |   0.66 |   3.29 |   3.29 |   3.13 |   0.66 |  11.70 |   2.14 |  10.71 |   1.81 |   4.12 |   1.81 |   4.78 |   2.14 |   2.97 |   0.66 |   7.25 |   1.98 |   1.81 |   1.15 |  13.34 |   3.62
Rock       |  10.04 |   3.82 |   2.77 |   0.64 |   3.06 |   4.70 |   2.77 |   1.23 |  11.97 |   1.55 |  13.46 |   1.78 |   2.89 |   1.69 |   6.01 |   4.03 |   1.84 |   0.26 |   6.10 |   1.63 |   2.22 |   1.49 |  12.58 |   1.46
Soul / R&B |   4.39 |   3.42 |   3.42 |   2.44 |   6.83 |   4.39 |   2.44 |   3.90 |   7.80 |   4.88 |   6.83 |   1.95 |   5.85 |   4.39 |   1.46 |   3.90 |   1.95 |   1.46 |   2.93 |   2.93 |   3.41 |   2.44 |  10.24 |   6.34
World      |   5.02 |   5.64 |   6.27 |   0.31 |   3.76 |   6.27 |   0.94 |   2.19 |   6.58 |   4.08 |  10.97 |   2.19 |   7.21 |   0.63 |   4.39 |   3.45 |   4.39 |   0.00 |   7.84 |   3.13 |   2.51 |   2.51 |   8.46 |   1.25
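The long-to-wide step that produces a genre-by-key table can also be sketched in base R. The snippet uses xtabs() as a stand-in for the spread() and replace() calls, which is a swapped-in technique rather than the author's code. The three percentages are copied from the table above. The World / Eb min row is left out on purpose, to show that a missing combination comes back as 0.

```r
# Long format: one row per genre/key combination (values from the table above)
long <- data.frame(
  genre_clean = c("Country", "Country", "World"),
  master_key  = c("G maj", "Eb min", "G maj"),
  Percentage  = c(13.61, 0.54, 8.46)
)

# xtabs() sums Percentage within each genre/key cell; absent cells become 0
wide <- xtabs(Percentage ~ genre_clean + master_key, data = long)

# Convert to a plain numeric matrix with genres as row names
wide_matrix <- matrix(wide, nrow = nrow(wide), dimnames = dimnames(wide))

print(wide_matrix)
```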

The data above are expressed in percentages. For our cluster analysis, we need to scale the data so that each column has a mean of zero and a standard deviation of one.

We scale our data and display the resulting data set with the following code:

    # scale the data
    cluster_data_scaled <- scale(cluster_data)
    # what does it look like?
    round(cluster_data_scaled,2) %>%
      kable("html", align= 'c')
           |  A maj |  A min | Ab maj | Ab min |  B maj |  B min | Bb maj | Bb min |  C maj |  C min |  D maj |  D min | Db maj | Db min |  E maj |  E min | Eb maj | Eb min |  F maj |  F min | F# maj | F# min |  G maj |  G min
Country    |   0.89 |  -1.66 |  -0.81 |  -0.03 |   0.83 |  -1.36 |   1.33 |  -0.90 |   0.96 |  -0.90 |   0.45 |  -1.75 |  -0.58 |  -0.20 |   1.45 |  -0.99 |  -0.06 |  -0.22 |   0.22 |  -0.36 |   0.12 |  -1.07 |   1.03 |  -0.96
Jazz       |  -1.15 |   0.25 |   1.65 |  -0.41 |  -1.42 |  -0.64 |   0.73 |   1.33 |  -0.41 |   0.97 |  -1.55 |   1.13 |   0.93 |  -0.52 |  -0.77 |   0.91 |  -0.69 |   0.73 |   0.75 |   1.99 |  -1.35 |  -0.47 |  -1.25 |   1.13
Pop        |   0.58 |   0.25 |  -0.11 |  -0.35 |  -0.32 |  -0.34 |   0.11 |  -0.94 |   0.77 |  -0.69 |   0.34 |  -0.12 |  -0.62 |  -0.07 |   0.10 |  -1.46 |   0.37 |  -0.01 |   0.41 |  -0.58 |  -0.65 |  -0.71 |   0.93 |   0.14
Rock       |   1.18 |   0.02 |  -0.87 |  -0.38 |  -0.43 |   0.52 |  -0.17 |  -0.59 |   0.88 |  -1.08 |   1.18 |  -0.19 |  -1.34 |  -0.16 |   0.65 |   0.80 |  -0.78 |  -0.76 |  -0.21 |  -0.71 |  -0.07 |  -0.22 |   0.62 |  -0.74
Soul / R&B |  -0.86 |  -0.29 |  -0.58 |   1.98 |   1.42 |   0.33 |  -0.42 |   1.09 |  -0.85 |   1.12 |  -0.84 |   0.19 |   0.41 |   1.92 |  -1.37 |   0.65 |  -0.67 |   1.51 |  -1.91 |  -0.21 |   1.62 |   1.19 |  -0.31 |   1.25
World      |  -0.64 |   1.42 |   0.72 |  -0.81 |  -0.09 |   1.49 |  -1.58 |   0.02 |  -1.35 |   0.59 |   0.42 |   0.74 |   1.20 |  -0.98 |  -0.07 |   0.10 |   1.83 |  -1.25 |   0.73 |  -0.13 |   0.33 |   1.29 |  -1.02 |  -0.82
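The scaled values can be reproduced by hand for a single column. This sketch takes the six "A maj" percentages from the unscaled table and applies (x - mean) / sd, which is what scale() does to each column by default. The variable names are chosen here for illustration.

```r
# "A maj" percentages per genre, copied from the unscaled table
a_maj <- c(Country = 9.26, Jazz = 3.59, Pop = 8.40,
           Rock = 10.04, `Soul / R&B` = 4.39, World = 5.02)

# Standardise by hand: subtract the column mean, divide by the column sd
z_manual <- (a_maj - mean(a_maj)) / sd(a_maj)

# scale() performs the same centering and scaling on each column
z_scale <- scale(a_maj)[, 1]

print(round(z_manual, 2))
```

The rounded result matches the "A maj" column of the scaled table (0.89 for country, -1.15 for jazz, and so on).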

Making a Heatmap

We are finally ready to make our heatmap. Heatmaps allow one to visualize clusters of samples and features. This method presents the results of hierarchical clustering of the rows (musical genre in our case) and columns (keys in our case) of a matrix, ordering the rows and columns according to the cluster solution. This makes it easy to see groupings present in both axes (clusterings of genres and clusterings of keys in our case). The underlying data values (percentages of songs in a given key for each genre, scaled per key) are represented with colors in the cluster solution.
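With its default settings, heatmap.2 builds its dendrograms using dist() (Euclidean distance) and hclust() (complete linkage). The row clustering can therefore be sketched without drawing anything. To keep the example short, it uses only three of the 24 scaled columns, copied from the table above. This subset is an assumption made for illustration, so the finer sub-cluster structure need not match the full heatmap.

```r
# Three scaled columns (A maj, Db maj, F min) for the six genres
scaled_subset <- rbind(
  Country      = c( 0.89, -0.58, -0.36),
  Jazz         = c(-1.15,  0.93,  1.99),
  Pop          = c( 0.58, -0.62, -0.58),
  Rock         = c( 1.18, -1.34, -0.71),
  `Soul / R&B` = c(-0.86,  0.41, -0.21),
  World        = c(-0.64,  1.20, -0.13)
)
colnames(scaled_subset) <- c("A maj", "Db maj", "F min")

# Pairwise Euclidean distances between genres, then complete-linkage clustering
genre_tree <- hclust(dist(scaled_subset))

# Cut the tree into two groups of genres
groups <- cutree(genre_tree, k = 2)
print(split(names(groups), groups))
```

Clustering the columns works the same way on the transposed matrix, t(scaled_subset).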

Let’s use the gplots package to produce our heatmap:

    # red-blue color palette
    # red is high, blue is low
    hmcol = rev(colorRampPalette(brewer.pal(9, "RdBu"))(10))
    heatmap.2(cluster_data_scaled,
              # we've already scaled the data above
              # so we turn off scaling here
              scale = c("none"),
              # show histogram on color key
              density.info=c("histogram"),
              # turn off tracing in the plot
              trace=c("none"),
              # specify our color palette
              # (defined above)
              col = hmcol,
              # set the font size for
              # row labels
              cexRow=1.3,
              # set the margins so we see
              # all axis labels
              margins=c(5, 7),
              # set the plot title
              main = 'Clustering Genres by Song Key')

Which returns the following plot:

[Figure: key genre heatmap]

The plot shows a simultaneous clustering of the genres (the rows of our input matrix) and of the keys (the columns of our input matrix). We have passed standardized scores to the clustering algorithm, and the legend in the upper-left hand corner of the plot shows how the color-coding links to the values of these scores. Specifically, higher values are colored in red, while lower values are colored in blue. For each color, darker shades indicate values farther from zero: darker red is higher, and darker blue is lower.

Clusters of Genres

We see two main genre clusters. The cluster on top groups together soul/r&b, world, and jazz music (within this cluster, world and jazz are in their own sub-cluster). The second cluster of music genres groups country, rock and pop music together (within this cluster, rock and pop are in their own sub-cluster).

Clusters of Keys

The clustering of keys is a little more complicated, as there are 24 of them. The right-most cluster groups together a number of major keys: F, Eb, E, D, A, G, and C. We see several sub-clusters here, including a grouping of E, D, A, G, and C, which we’ll discuss further below.

The left-most cluster includes 10 keys, 8 of which are minor. The left-most sub-cluster includes Db, C minor, Bb minor, Ab major and F minor.

Genre / Key combinations

How are the genres separated by their use of different keys?

For the soul/r&b, world and jazz cluster, the keys colored in red at the upper-left hand side of the plot are most unique to this cluster. Specifically, songs in these genres are more likely to be in Db, C minor, Bb minor, and to some extent Ab and its relative minor F minor (though jazz is much more represented in these last two). Interestingly, these keys all have a lot of “flats.”

For the country, rock and pop cluster, the keys colored in red at the lower-right hand side of the plot are most unique to this cluster. Specifically, it looks like songs in these genres are more likely to be in C, G, A, D, and E. Interestingly, with one exception (C), these keys all have one or more “sharps”.

Interpretation

What factors influence the key a song is played in?

In my experience, there are at least 3 things that can influence the key a song is played in:

  1. Vocal range of the singer (not applicable for instrumental songs). Simply put, the requirements of the song (e.g. how wide a distance there is between the highest and lowest notes in the vocal part) must match the natural range of the singer (e.g. which notes can they sing comfortably and with conviction, without straining their voice). Selecting the key that best matches the singer’s vocal range allows for the best possible performance of a song.
    • Although the vocal ranges of the singers in my music collection surely influence some of the keys that the songs are played in, there are too many different vocalists across the albums and the genres for us to see a systematic push towards a given key across the space of the data.
  2. “Easy” vs. “Hard” Keys. When first learning to play music, particularly if learning how to read scores, one tends to start with the “easy” keys first, i.e. those with fewer accidentals (sharps and flats). This makes it easier to read first pieces of music, because it is not necessary to remember which notes are sharp or flat when reading them in the score. “Easier keys” therefore have fewer sharps and flats, such as C (no sharps or flats), G and F (one sharp/flat, respectively), D and Bb (two sharps/flats, respectively), and A and Eb (three sharps/flats, respectively).
    • We do not see a systematic over-representation of the “easy keys” (e.g. those with fewer sharps or flats) in any specific musical genre. We do see some over-representation of keys with relatively few sharps among country, rock and pop music, however. Specifically, G major, D major, A major, and E major are all more common in these musical genres. Interestingly, the corresponding “easy keys” with flats are not used commonly in country, rock, and pop music.
    • It appears that soul/r&b, world and jazz music are played in harder keys with more flats. Specifically, these genres all tend to have more songs in Db (5 flats), C minor (relative minor of Eb; 3 flats), Bb minor (relative minor of Db; 5 flats), and Ab (4 flats). Jazz in particular dominates in terms of Ab and its relative minor F minor (4 flats).
  3. The different instruments that are playing on a given song. Different instruments have specific qualities that can impact the key that a song is played in. In particular, when playing music with different instruments, practical considerations tied to the instruments in the mix can impact the choice of musical key. I see the potential impact of two such considerations in the data presented above:
    • Not all instruments play in the same keys. Piano, guitar, trombone, flute, among others, all play in what’s called “Concert C”, where the “C” that is played on the instruments matches the pitch that corresponds to the note C. Other instruments, such as brass (e.g. trumpets) and reed (e.g. saxophones, clarinets) instruments, play in different pitches (e.g. Concert Bb or Eb), which means that when a “C” is played on those instruments, the pitch does not correspond to Concert C.
      • As we saw above, soul/r&b, world and jazz music (genres which are more likely to feature horns or reed instruments) dominate in keys with a lot of flats. This is no doubt done in part to accommodate the wind instruments, most of which play in different keys than standard rhythm-section instruments (e.g. bass, guitar, and piano, which all play in Concert C). If we consider some of the most common keys in our data for jazz songs (Ab and its relative minor F minor; 4 flats), trumpets and tenor saxophones (Bb instruments) play in a written key of Bb (2 flats) and alto saxophones (Eb instruments) play in a written key of F (1 flat). By choosing a somewhat more “complicated” concert key, the brass and the reeds get an “easier” key. We can see this balancing act play out in our data, with soul/r&b, world and jazz music played in keys with more flats, which ends up giving the wind instruments a slightly easier key for a given song.
    • Open chords on the guitar. Open chords are chords that include one or more “open” strings on the guitar (meaning it is not necessary to hold a string down with one’s finger in order to play a note that fits in the chord). In essence, open chords are easier to play for beginning musicians, and are among the first chords that one learns when starting to play the guitar. Examples of keys that include many open chords on the guitar include: C, G, D, A and E.
      • These are precisely the keys that dominate in our country, rock and pop music cluster! Not surprisingly, these genres are all very guitar-driven, especially in comparison with soul/r&b, world and jazz music.
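The "easy key" and transposition arguments in the list above can be checked with a little arithmetic. This sketch is not from the original analysis, and the function names are hypothetical. It counts accidentals for a major key by its position on the circle of fifths, and it transposes a concert key for Bb instruments (written a major second, 2 semitones, above concert pitch) and Eb alto saxophones (written a major sixth, 9 semitones, above), ignoring octaves.

```r
# Pitch classes 0-11, spelled as in the tables of this post
pitch_names <- c("C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B")

# Signed accidental count for a major key: positive = sharps, negative = flats.
# 7 * pitch class (mod 12) is the number of fifths above C on the circle of fifths.
accidentals <- function(key) {
  pitch_class <- match(key, pitch_names) - 1
  fifths <- (7 * pitch_class) %% 12
  ifelse(fifths <= 6, fifths, fifths - 12)
}

# Written key for a transposing instrument, given the concert key
written_key <- function(concert_key, semitones_up) {
  pitch_class <- match(concert_key, pitch_names) - 1
  pitch_names[(pitch_class + semitones_up) %% 12 + 1]
}

concert <- "Ab"
bb_key <- written_key(concert, 2)   # trumpet, tenor saxophone
eb_key <- written_key(concert, 9)   # alto saxophone

print(c(concert = accidentals(concert),
        bb_instrument = accidentals(bb_key),
        eb_instrument = accidentals(eb_key)))
```

For a concert key of Ab (4 flats), the Bb instruments read Bb (2 flats) and the alto saxophone reads F (1 flat), which is the trade-off described in the list.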

Implications for Musicians

What does this analysis teach us about playing music in different genres? I think there are 3 takeaways for the practicing musician:

  1. Focus on the major modes. Across all of the songs, just about 70% were in major modes, with even higher percentages in country, pop and rock. If you want to play world, soul/r&b or jazz, focus a bit more on the minor modes. Nevertheless, across genres you’ll be playing much more in major (vs. minor) modes.

  2. If you want to play country, rock, and pop, you can pick a handful of relatively easy major keys (most with sharps and open chords on the guitar) and spend your time getting comfortable in them. For example, if you were very comfortable in C, G, D, A and E, you would cover the keys of roughly half of the songs in the current data for country, rock, and pop. If you add F to the mix, you’re at around 60%. The comparable five-key figures for jazz, world, and soul/r&b are only around 30% to 35%. Which leads to the final implication:

  3. If you want to play jazz, soul/r&b, or world music, it’s a good idea to be comfortable with a lot of keys, both major and minor, as these genres’ songs are more spread out across the different keys. Given the relatively high frequency of songs with many flats (vs. the country, rock and pop cluster), it’s not a bad idea to get comfortable playing in keys with flats.
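The coverage figures in the second takeaway can be recomputed from the percentage table shown earlier. This sketch copies the six relevant columns from that table and sums them per genre. The object names are chosen here for illustration.

```r
# Percent of songs per genre in six major keys, copied from the table above
shares <- rbind(
  Country      = c(12.16, 13.61, 11.07,  9.26, 7.80, 6.90),
  Jazz         = c( 8.85,  7.89,  4.47,  3.59, 2.80, 7.89),
  Pop          = c(11.70, 13.34, 10.71,  8.40, 4.78, 7.25),
  Rock         = c(11.97, 12.58, 13.46, 10.04, 6.01, 6.10),
  `Soul / R&B` = c( 7.80, 10.24,  6.83,  4.39, 1.46, 2.93),
  World        = c( 6.58,  8.46, 10.97,  5.02, 4.39, 7.84)
)
colnames(shares) <- c("C maj", "G maj", "D maj", "A maj", "E maj", "F maj")

# Coverage with C, G, D, A and E, then with F added
five_keys <- rowSums(shares[, c("C maj", "G maj", "D maj", "A maj", "E maj")])
six_keys  <- rowSums(shares)

print(round(cbind(five_keys, six_keys), 1))
```

Country and rock come out near 54% with five keys and about 60% with F added, pop a little lower, while jazz, soul/r&b and world sit between roughly 28% and 35% with five keys.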

Caveats and Limitations

We should keep in mind that we are not examining a representative sample of songs; at the end of the day, this is just my music collection. Nevertheless, the patterns examined here match my experience as a musician playing songs in different genres with different bands across the years.

Summary and Conclusion

In this blog post, we examined the musical properties of songs in my digital music collection.

We first examined modes across all songs and saw that around 70% of the songs were in major modes, whose music (in comparison with minor modes) is upbeat and happy. However, the ratio of major to minor modes was not identical across the different musical genres. Country and pop contained the greatest percentage of major modes, whereas jazz and soul/r&b contained the smallest percentage of major modes.

We then examined the distribution of musical keys. Looking across my entire music collection, G, C and D major are the most popular keys overall, while B minor is the most popular minor key. We examined the distribution of keys across genres, and saw that some keys were more or less common in certain genres as compared to others.

We made a heatmap to better understand the relationship between musical genres and keys. This analysis showed two clusters of musical genres: one containing soul/r&b, world and jazz music, and the other containing country, rock, and pop music. The soul/r&b, world, and jazz cluster had greater proportions of keys with a lot of flats, perhaps due to the fact that these genres typically include reed and brass instruments, which play in “easier” keys when the concert key has flats. The country, rock and pop cluster had greater proportions of easy keys with sharps, and these keys contain many “open chords,” which are easier to play on the guitar.

Finally, we looked at a couple of takeaway messages for the practicing musician. In sum: focus on the major modes, and if you want to play country, pop or rock, you can focus on a handful of relatively easy keys with sharps. If you want to play jazz, world or soul/r&b, it’s a good idea to focus your attention on many different keys, and in particular to be comfortable in keys with many flats!

Coming up next

In the next blog post, we’ll examine how to extract, clean, and visualize data from the Mi-Band 5 fitness tracker.

Stay tuned!


  1. Don’t get me wrong – I love rap music and have written about it extensively on this blog.

To leave a comment for the author, please follow the link and comment on their blog: Method Matters Blog.

The post Analyzing the Harmonic Structure of Music: Modes, Keys and Clustering Musical Genres first appeared on R-bloggers.

The ‘circular random walk’ puzzle: tidy simulation of stochastic processes in R

[This article was first published on Variance Explained, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Previously in this series

I love 538’s Riddler column, and I’ve enjoyed solving the November 20th puzzle. I’ll quote:

To celebrate Thanksgiving, you and 19 of your family members are seated at a circular table (socially distanced, of course). Everyone at the table would like a helping of cranberry sauce, which happens to be in front of you at the moment.

Instead of passing the sauce around in a circle, you pass it randomly to the person seated directly to your left or to your right. They then do the same, passing it randomly either to the person to their left or right. This continues until everyone has, at some point, received the cranberry sauce.

Of the 20 people in the circle, who has the greatest chance of being the last to receive the cranberry sauce?

I’ll solve this with tidy simulation in R, as usual using one of my favorite functions, tidyr’s crossing(), and getting another chance to explore a variation of a “random walk”.

Simulating a table of 20

What’s fun about this problem is that it’s an example of a random walk: a stochastic process made up of a sequence of random steps (in this case, left or right). What makes this a fun variation is that it’s a random walk in a circle: passing 5 to the left is the same as passing 15 to the right. I wasn’t previously familiar with a random walk in a circle, so I approached it through simulation to learn about its properties.

Simulating a random walk in a circle

The classic way to simulate a random walk in R is with cumsum() and sample(). sample(c(1, -1)) picks a random direction for each step, and cumsum() takes the cumulative sum of those steps:

cumsum(sample(c(1, -1), 30, replace = TRUE))
##  [1] 1 2 3 4 5 6 7 6 5 6 5 6 5 6 7 6 5 6 7 6 5 6 5 4 5 4 3 4 5 4

Right now this is a random walk on integers (…, -2, -1, 0, 1, 2, …). To make this a random walk at a circular table of 20, you can use the modulo operator, %% 20.

cumsum(sample(c(1, -1), 30, replace = TRUE)) %% 20
##  [1]  1  0 19 18 17 16 17 18 19 18 19 18 19 18 17 18 19  0 19  0 19  0  1  2  1  2  3
## [28]  4  5  4

Notice that the cranberry sauce can now be passed from 0 to 19 and then back.
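As a small illustration of why the modulo trick works (the variable names below are made up for this sketch), R’s %% always returns a value between 0 and n - 1 for a positive n, even when the left-hand side is negative. That is what lets a negative position wrap around to a valid seat:

```r
n <- 20

five_left     <- (0 - 5) %% n    # five passes in the negative direction
fifteen_right <- (0 + 15) %% n   # fifteen passes in the positive direction
wrap_down     <- (0 - 1) %% n    # one step below seat 0 wraps to seat 19
wrap_up       <- (19 + 1) %% n   # one step above seat 19 wraps to seat 0

c(five_left, fifteen_right, wrap_down, wrap_up)
```

The first two values are both 15, which is the “5 to the left is the same as 15 to the right” equivalence described earlier.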

Simulating many walks with crossing()

We’ll simulate 50,000 trials (feel free to increase or decrease that number, depending on how accurate you need your simulation to be). For each, we’ll try 1000 steps.

library(tidyverse)

# For the sake of efficiency, perform only the cumulative sum
# as a grouped operation
sim_steps <- crossing(trial = 1:50000,
                      step = 1:1000) %>%
  mutate(direction = sample(c(1, -1), n(), replace = TRUE)) %>%
  group_by(trial) %>%
  mutate(position = cumsum(direction)) %>%
  ungroup() %>%
  mutate(seat = position %% 20)

sim_steps
## # A tibble: 50,000,000 x 5
##    trial  step direction position  seat
##    <int> <int>     <dbl>    <dbl> <dbl>
##  1     1     1         1        1     1
##  2     1     2        -1        0     0
##  3     1     3        -1       -1    19
##  4     1     4         1        0     0
##  5     1     5        -1       -1    19
##  6     1     6         1        0     0
##  7     1     7        -1       -1    19
##  8     1     8         1        0     0
##  9     1     9         1        1     1
## 10     1    10         1        2     2
## # … with 49,999,990 more rows

We end up with 50 million steps (50,000 trials of 1,000 steps each), each representing the position of the cranberry sauce at one point in time.
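To get a rough feel for how the number of trials affects accuracy, here is a back-of-the-envelope sketch. It assumes the quantity being estimated is a proportion near 1/19 and uses the usual binomial standard error, sqrt(p * (1 - p) / n). Neither the formula nor the variable names come from the original post.

```r
# Approximate Monte Carlo standard error for a proportion near 1/19
# (an assumption for illustration, not part of the original analysis)
p <- 1 / 19
n_trials <- c(5000, 50000, 500000)
std_error <- sqrt(p * (1 - p) / n_trials)

data.frame(n_trials, std_error)
```

With 50,000 trials the standard error is about 0.001, so under these assumptions a simulated proportion should typically land within a few tenths of a percentage point of its true value.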

How long does it take for the cranberry sauce to reach each seat for the first time? We can use distinct() with .keep_all = TRUE to answer that: we keep the first time each seat appears in each trial. (We also filter out seat 0, which is you, because you start with the cranberry sauce on step 0).

sim <- sim_steps %>%
  distinct(trial, seat, .keep_all = TRUE) %>%
  filter(seat != 0)

sim
## # A tibble: 949,995 x 5
##    trial  step direction position  seat
##    <int> <int>     <dbl>    <dbl> <dbl>
##  1     1     1         1        1     1
##  2     1     3        -1       -1    19
##  3     1    10         1        2     2
##  4     1    14        -1       -2    18
##  5     1    43        -1       -3    17
##  6     1    50        -1       -4    16
##  7     1    63        -1       -5    15
##  8     1    64        -1       -6    14
##  9     1    65        -1       -7    13
## 10     1    66        -1       -8    12
## # … with 949,985 more rows
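To see what distinct() with .keep_all = TRUE is doing here, it can help to run it on a tiny hand-made walk. The data frame below (toy_walk) is hypothetical and exists only for this sketch:

```r
library(dplyr)

# A made-up six-step walk for a single trial
toy_walk <- data.frame(
  trial = 1,
  step  = 1:6,
  seat  = c(1, 0, 19, 0, 1, 2)
)

# Keep the first row in which each (trial, seat) pair appears,
# then drop seat 0, which holds the sauce at the start
first_visits <- toy_walk %>%
  distinct(trial, seat, .keep_all = TRUE) %>%
  filter(seat != 0)

first_visits
```

Seats 1, 19 and 2 are first reached on steps 1, 3 and 6. The repeat visits to seats 0 and 1 are dropped because distinct() keeps only the first row for each combination.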

Summarizing and visualizing

Now that we have our simulation with one row for each seat in each trial, we can learn stats about each seat. Which is the best seat to be in? Which is most likely to be last?

by_seat <- sim %>%
  group_by(trial) %>%
  mutate(is_last = row_number() == 19) %>%
  group_by(seat) %>%
  summarize(avg_step = mean(step),
            pct_last = mean(is_last),
            avg_length_last = mean(step[is_last]))

This isn’t the Riddler’s question, but it’s the first one I was interested in: how long does it take for each seat, on average, to get the cranberry sauce?

by_seat %>%
  ggplot(aes(seat, avg_step)) +
  geom_line() +
  expand_limits(y = 0) +
  labs(x = "Seat",
       y = "Average # of steps to reach this seat")

It looks like a parabola. The best seat to be in is either #1 or #19, immediately to the right or left of the starting position: on average they’re waiting about 19 steps to get it. The worst seat to be in is #10, directly across the table. On average they’re waiting about 100 steps to get it. Overall, this makes intuitive sense: the closer you are to the original sauce, the more likely you can get it right away.

What I love about tidy simulation is that I can visualize some more details about the distribution, such as with a histogram on a log scale.

sim %>%
  ggplot(aes(step)) +
  geom_histogram() +
  scale_x_log10() +
  facet_wrap(~ seat, scales = "free_y") +
  labs(x = "Step on which this seat gets the cranberry sauce",
       y = "")

The seats immediately to your left and right, 1 and 19, have a mode of 1 step (which makes sense: they have a 50% chance of getting it on the first step). For seats that are roughly across the table (like 7-13), the number of steps looks roughly log-normally distributed.

Now let’s answer the Riddler’s question: how likely is each seat to be the last person to get the cranberry sauce?

by_seat %>%
  ggplot(aes(seat, pct_last)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent) +
  geom_hline(yintercept = 1 / 19, lty = 2, color = "red") +
  expand_limits(y = 0) +
  labs(x = "Seat",
       y = "% this is the last seat to get cranberry sauce")

That’s a very different story! Other than a little random noise, the 19 people at the table all have the same probability of being the last to receive the cranberry sauce. (The probability is therefore 1/19, shown by the dashed red line).

This wasn’t what I originally expected, but upon consideration it makes sense. Consider the person at seat 10 (directly across the table from you). We know it will take longer for them to get the sauce, but consider what it would take to be last. Imagine the moment that the sauce first reaches either person 9 or person 11 (one of which has to happen first). At that moment, the situation is analogous to that of the person seated immediately to the original left or right: the sauce would have to make a full circle of the table before going just one step. The same would apply to any seat \(s\), by breaking it down into the situation where the sauce reaches either \(s-1\) or \(s+1\).
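As a cross-check on this argument, here is a minimal base-R sketch that takes a different route from the fixed-length simulation above: it keeps passing the sauce until every seat has been visited and records which seat was reached last. The function name last_seat() and the choice of 2,000 trials are assumptions made for this illustration.

```r
# Pass the sauce at random until every seat has had it;
# return the seat that received it last
last_seat <- function(n = 20) {
  visited <- logical(n)   # visited[k + 1] tracks seat k
  visited[1] <- TRUE      # seat 0 starts with the sauce
  seat <- 0
  while (!all(visited)) {
    seat <- (seat + sample(c(1, -1), 1)) %% n
    visited[seat + 1] <- TRUE
  }
  seat
}

set.seed(2020)
last_seats <- replicate(2000, last_seat(20))

# Share of trials in which each seat was last; compare with 1 / 19
last_share <- prop.table(table(factor(last_seats, levels = 1:19)))
round(last_share, 3)
```

With this few trials the shares are noisy, but each should sit reasonably close to 1/19 (about 0.053).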

Larger table sizes

Something I love about crossing() for simulation is that you can keep adding complexity to the question you’re asking. What if there weren’t 20 people at the table, but some arbitrary \(n\)? We’ll try 20, 30, and 40, doing 20K simulations each.

# Repeat all of the above, but with an extra crossing() step
sim_size <- crossing(trial = 1:20000,
                     step = 1:2000) %>%
  mutate(direction = sample(c(1, -1), n(), replace = TRUE)) %>%
  group_by(trial) %>%
  mutate(position = cumsum(direction)) %>%
  ungroup() %>%
  crossing(table_size = c(20, 30, 40)) %>%
  mutate(seat = position %% table_size) %>%
  distinct(table_size, trial, seat, .keep_all = TRUE) %>%
  filter(seat != 0)

# Group by table_size as well
by_seat_size <- sim_size %>%
  group_by(table_size, trial) %>%
  mutate(is_last = row_number() == table_size - 1) %>%
  group_by(table_size, seat) %>%
  summarize(avg_step = mean(step),
            pct_last = mean(is_last),
            avg_length_last = mean(step[is_last]))

We can confirm that the “all seats are equally likely to get the sauce last” result holds true for every table size.

But we can also take a closer look at that parabola for the average # of steps to reach each position. Can we figure out a closed form solution for it?

A few patterns we notice, where \(n\) is the table size.

  • The peak for each is at seat \(n / 2\), which takes on average \((n / 2)^2\) steps.
  • The average for seats 1 and \(n-1\) (the people seated immediately to your left/right) is \(n-1\).

We can get a bit more precise by fitting a parabola to the results for each table size, using the broom package to combine the linear models.

by_seat_size %>%
  mutate(linear = seat,
         quadratic = seat ^ 2) %>%
  group_by(table_size) %>%
  summarize(mod = list(lm(avg_step ~ quadratic + linear)),
            td = map(mod, broom::tidy, conf.int = TRUE)) %>%
  unnest(td) %>%
  ggplot(aes(table_size, estimate)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
  facet_wrap(~ term)

The intercept is indistinguishable from 0, the linear term is equal to the table size, and the quadratic term stays at -1. This suggests that the average number of steps to reach a seat \(s\) is \(-s^2 + ns\), or equivalently \(s(n - s)\). (This matches our results for the maximum and the \(s=1\) point above).
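One way to check the conjectured formula without simulation noise is a small numerical sketch. It is not part of the original analysis and it is not a proof. The idea is to unroll the circle onto the integers, so that reaching seat \(s\) means hitting either position \(s\) or position \(s - n\). Each position’s expected number of remaining steps then equals one plus the average of its two neighbours’, which gives a linear system. The function name expected_steps() is made up for this sketch.

```r
# Expected number of steps for a walk starting at 0 to first hit
# position s or position s - n (i.e. seat s on a table of n)
expected_steps <- function(s, n) {
  m <- n - 1                      # interior positions: s - n + 1, ..., s - 1
  A <- diag(m)
  for (i in seq_len(m - 1)) {
    A[i, i + 1] <- -0.5           # contribution of the right-hand neighbour
    A[i + 1, i] <- -0.5           # contribution of the left-hand neighbour
  }
  h <- solve(A, rep(1, m))        # h[i]: expected steps from interior position i
  h[n - s]                        # position 0 is the (n - s)-th interior position
}

seats <- 1:19
numeric_answer <- sapply(seats, expected_steps, n = 20)
conjecture <- seats * (20 - seats)

all.equal(numeric_answer, conjecture)
```

For a table of 20 the numerical solution agrees with \(s(n - s)\) at every seat, including 19 steps for seats 1 and 19 and 100 steps for seat 10.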

As I often note in these posts, I love how doing some simulation can lead us to an exact solution (even if we can’t yet prove it).

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

The post The 'circular random walk' puzzle: tidy simulation of stochastic processes in R first appeared on R-bloggers.
