Package GetDFPData

December 6, 2017, 12:00 am

≫ Next: Mapping lithium production using R

≪ Previous: The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Downloading Annual Financial Reports and Corporate Events from B3 (formerly Bovespa) –

Financial statements of companies traded at B3 (formerly Bovespa), the Brazilian stock exchange, are available in its website. Accessing the data for a single company is straightforward. In the website one can find a simple interface for accessing this dataset. An example is given here. However, gathering and organizing the data for a large scale research, with many companies and many dates, is painful. Financial reports must be downloaded or copied individually and later aggregated. Changes in the accounting format thoughout time can make this process slow, unreliable and irreproducible.

Package GetDFPData provides a R interface to all annual financial statements available in the website and more. It not only downloads the data but also organizes it in a tabular format and allows the use of inflation indexes. Users can select companies and a time period to download all available data. Several information about current companies, such as sector and available quarters are also at reach. The main purpose of the package is to make it easy to access financial statements in large scale research, facilitating the reproducibility of corporate finance studies with B3 data.

The positive aspects of GetDFDData are:

Easy and simple R and web interface
Changes in accounting format are internally handled by the software
Access to corporate events in the FRE system such as dividend payments, changes in stock holder composition, changes in governance listings, board composition and compensation, debt composition, and a lot more!
The output data is automatically organized using tidy data principles (long format)
A cache system is employed for fast data acquisition
Completely free and open source!

Installation

The package is (not yet) available in CRAN (release version) and in Github (development version). You can install any of those with the following code:

# Release version in CRANinstall.packages('GetDFPData') # not in CRAN yet# Development version in Githubdevtools::install_github('msperlin/GetDFPData')

Shinny interface

The web interface of GetDFPData is available at http://www.msperlin.com/shiny/GetDFPData/.

How to use `GetDFPData`

The starting point of GetDFPData is to find the official names of companies in B3. Function gdfpd.search.company serves this purpose. Given a string (text), it will search for a partial matches in companies names. As an example, let’s find the official name of Petrobras, one of the largest companies in Brazil:

library(GetDFPData)library(tibble)gdfpd.search.company('petrobras',cache.folder = tempdir())## ## Reading info file from github## Found 43873 lines for 687 companies  [Actives =  521  Inactives =  167 ]## Last file update:  2017-10-19## Caching RDATA into tempdir()## ## Found 1 companies:## PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS | situation = ATIVO | first date = 1998-12-31 | last date - 2016-12-31## [1] "PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS"

Its official name in Bovespa records is PETRÓLEO BRASILEIRO S.A. - PETROBRAS. Data for quarterly and annual statements are available from 1998 to 2017. The situation of the company, active or canceled, is also given. This helps verifying the availability of data.

The content of all available financial statements can be accessed with function gdfpd.get.info.companies. It will read and parse a .csv file from my github repository. This will be periodically updated for new information. Let’s try it out:

df.info <- gdfpd.get.info.companies(type.data = 'companies', cache.folder = tempdir())## ## Reading info file from github## Found 43873 lines for 687 companies  [Actives =  521  Inactives =  167 ]## Last file update:  2017-10-19## Caching RDATA into tempdir()glimpse(df.info)## Observations: 689## Variables: 8## $ name.company     "521 PARTICIPAÇOES S.A. - EM LIQUIDAÇÃO EXTRAJ...## $ id.company       16330, 16284, 108, 20940, 21725, 19313, 18970,...## $ situation        "ATIVO", "ATIVO", "CANCELADA", "CANCELADA", "A...## $ listing.segment  NA, "None", "None", "None", "None", "None", "C...## $ main.sector      NA, "Financeiro e Outros", "Materiais Básicos"...## $ tickers          NA, "QVQP3B", NA, NA, NA, "AELP3", "TIET11;TIE...## $ first.date       1998-12-31, 2001-12-31, 2009-12-31, 2009-12-3...## $ last.date        2016-12-31, 2016-12-31, 2009-12-31, 2009-12-3...

This file includes several information that are gathered from Bovespa: names of companies, official numeric ids, listing segment, sectors, traded tickers and, most importantly, the available dates. The resulting dataframe can be used to filter and gather information for large scale research such as downloading financial data for a specific sector.

Downloading financial information for ONE company

All you need to download financial data with GetDFPData are the official names of companies, which can be found with gdfpd.search.company, the desired starting and ending dates and the type of financial information (individual or consolidated). Let’s try it for PETROBRAS:

name.companies <- 'PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS'first.date <- '2004-01-01'last.date  <- '2006-01-01'df.reports <- gdfpd.GetDFPData(name.companies = name.companies,                                first.date = first.date,                               last.date = last.date,                               cache.folder = tempdir())## Found cache file. Loading data..## ## Downloading data for 1 companies## First Date: 2004-01-01## Laste Date: 2006-01-01## Inflation index: dollar## ## Downloading inflation data##  Caching inflation RDATA into tempdir()  Done## ## ## WARNING: Cash flow statements are not available before 2009 ## ## Inputs looking good! Starting download of files:## ## PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS##  Available periods: 2005-12-31   2004-12-31## ## ## Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS##  Finding info from Bovespa | downloading and reading data | saving cache##  Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS | date 2005-12-31##      Acessing DFP data | downloading file | reading file | saving cache##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..##  Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS | date 2004-12-31##      Acessing DFP data | downloading file | reading file | saving cache##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..

The resulting object is a tibble, a data.frame type of object that allows for list columns. Let’s have a look in its content:

glimpse(df.reports)## Observations: 1## Variables: 33## $ company.name                   "PETRÓLEO BRASILEIRO  S.A.  - PE...## $ company.code                   9512## $ company.tickers                "PETR3;PETR4"## $ min.date                       2004-12-31## $ max.date                       2005-12-31## $ n.periods                      2## $ company.segment                "Tradicional"## $ current.stockholders           [ [ [ [ [ [ [ [ [ [ [ [NULL]## $ history.capital.issues         [NULL]## $ history.mkt.value              [NULL]## $ history.capital.increases      [NULL]## $ history.capital.reductions     [NULL]## $ history.stock.repurchases      [NULL]## $ history.other.stock.events     [NULL]## $ history.compensation           [NULL]## $ history.compensation.summary   [NULL]## $ history.transactions.related   [NULL]## $ history.debt.composition       [NULL]## $ history.governance.listings    [NULL]## $ history.board.composition      [NULL]## $ history.committee.composition  [NULL]## $ history.family.relations       [NULL]

Object df.reports only has one row since we only asked for data of one company. The number of rows increases with the number of companies, as we will soon learn with the next example. All financial statements for the different years are available within df.reports. For example, the assets statements for all desired years of PETROBRAS are:

df.income.long <- df.reports$fr.income[[1]]glimpse(df.income.long)## Observations: 48## Variables: 6## $ name.company        "PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS", "...## $ ref.date            2005-12-31, 2005-12-31, 2005-12-31, 2005-1...## $ acc.number          "3.01", "3.02", "3.03", "3.04", "3.05", "3....## $ acc.desc            "Receita Bruta de Vendas e/ou Serviços", "D...## $ acc.value           143665730, -37843204, 105822526, -57512113,...## $ acc.value.infl.adj  61398234.97, -16173000.56, 45225234.41, -24...

The resulting dataframe is in the long format, ready for processing. In the long format, financial statements of different years are stacked. In the wide format, we have the year as columns of the table.

If you want the wide format, which is the most common way that financial reports are presented, you can use function gdfpd.convert.to.wide. See an example next:

df.income.wide <- gdfpd.convert.to.wide(df.income.long)knitr::kable(df.income.wide )

acc.number	acc.desc	name.company	2004-12-31	2005-12-31
3.01	Receita Bruta de Vendas e/ou Serviços	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	120024727	143665730
3.02	Deduções da Receita Bruta	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-34450292	-37843204
3.03	Receita Líquida de Vendas e/ou Serviços	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	85574435	105822526
3.04	Custo de Bens e/ou Serviços Vendidos	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-48607576	-57512113
3.05	Resultado Bruto	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	36966859	48310413
3.06	Despesas/Receitas Operacionais	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-11110540	-14810467
3.06.01	Com Vendas	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-2858630	-4195157
3.06.02	Gerais e Administrativas	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-2599552	-3453753
3.06.03	Financeiras	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-1019901	126439
3.06.04	Outras Receitas Operacionais	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	0	0
3.06.05	Outras Despesas Operacionais	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-5982336	-9070019
3.06.06	Resultado da Equivalência Patrimonial	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	1349879	1782023
3.07	Resultado Operacional	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	25856319	33499946
3.08	Resultado Não Operacional	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-550694	-199982
3.08.01	Receitas	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	46611	1256194
3.08.02	Despesas	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-597305	-1456176
3.09	Resultado Antes Tributação/Participações	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	25305625	33299964
3.10	Provisão para IR e Contribuição Social	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-5199166	-8581490
3.11	IR Diferido	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-1692288	-422392
3.12	Participações/Contribuições Estatutárias	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-660000	-846000
3.12.01	Participações	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	-660000	-846000
3.12.02	Contribuições	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	0	0
3.13	Reversão dos Juros sobre Capital Próprio	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	0	0
3.15	Lucro/Prejuízo do Exercício	PETRÓLEO BRASILEIRO S.A. – PETROBRAS	17754171	23450082

Downloading financial information for SEVERAL companies

If you are doing serious research, it is likely that you need financial statements for more than one company. Package GetDFPData is specially designed for handling large scale download of data. Let’s build a case with two selected companies:

my.companies <- c('PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS',                  'BANCO DO ESTADO DO RIO GRANDE DO SUL SA')first.date <- '2005-01-01'last.date  <- '2007-01-01'type.statements <- 'individual'df.reports <- gdfpd.GetDFPData(name.companies = my.companies,                                first.date = first.date,                               last.date = last.date,                               cache.folder = tempdir())## Found cache file. Loading data..## ## Downloading data for 2 companies## First Date: 2005-01-01## Laste Date: 2007-01-01## Inflation index: dollar## ## Downloading inflation data##  Found cache file. Loading data..    Done## ## ## WARNING: Cash flow statements are not available before 2009 ## ## Inputs looking good! Starting download of files:## ## BANCO DO ESTADO DO RIO GRANDE DO SUL SA##  Available periods: 2006-12-31   2005-12-31## PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS##  Available periods: 2006-12-31   2005-12-31## ## ## Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA##  Finding info from Bovespa | downloading and reading data | saving cache##  Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2006-12-31##      Acessing DFP data | downloading file | reading file | saving cache##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..##  Processing 1210 - BANCO DO ESTADO DO RIO GRANDE DO SUL SA | date 2005-12-31##      Acessing DFP data | downloading file | reading file | saving cache##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..## Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS##  Finding info from Bovespa##      Found cache file /tmp/RtmpSpLsOP/9512_PETRÓLEO/GetDFPData_BOV_cache_9512_PETR.rds##  Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS | date 2006-12-31##      Acessing DFP data | downloading file | reading file | saving cache##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..##  Processing 9512 - PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS | date 2005-12-31##      Acessing DFP data | Found DFP cache file##      Acessing FRE data | No FRE file available..##      Acessing FCA data | No FCA file available..

And now we can check the resulting tibble:

glimpse(df.reports)## Observations: 2## Variables: 33## $ company.name                   "BANCO DO ESTADO DO RIO GRANDE D...## $ company.code                   1210, 9512## $ company.tickers                "BRSR3;BRSR5;BRSR6", "PETR3;PETR4"## $ min.date                       2005-12-31, 2005-12-31## $ max.date                       2006-12-31, 2006-12-31## $ n.periods                      2, 2## $ company.segment                "Corporate Governance - Level 1"...## $ current.stockholders           [ [ [ [ [ [ [ [ [ [ [ [NULL, NULL]## $ history.capital.issues         [NULL, NULL]## $ history.mkt.value              [NULL, NULL]## $ history.capital.increases      [NULL, NULL]## $ history.capital.reductions     [NULL, NULL]## $ history.stock.repurchases      [NULL, NULL]## $ history.other.stock.events     [NULL, NULL]## $ history.compensation           [NULL, NULL]## $ history.compensation.summary   [NULL, NULL]## $ history.transactions.related   [NULL, NULL]## $ history.debt.composition       [NULL, NULL]## $ history.governance.listings    [NULL, NULL]## $ history.board.composition      [NULL, NULL]## $ history.committee.composition  [NULL, NULL]## $ history.family.relations       [NULL, NULL]

Every row of df.reports will provide information for one company. Metadata about the corresponding dataframes such as min/max dates is available in the first columns. Keeping a tabular structure facilitates the organization and future processing of all financial data. We can use tibble df.reports for creating other dataframes in the long format containing data for all companies. See next, where we create dataframes with the assets and liabilities of all companies:

df.assets <- do.call(what = rbind, args = df.reports$fr.assets)df.liabilities <- do.call(what = rbind, args = df.reports$fr.liabilities)df.assets.liabilities <- rbind(df.assets, df.liabilities)

As an example, let’s use the resulting dataframe for calculating and analyzing a simple liquidity index of a company, the total of current (liquid) assets (Ativo circulante) divided by the total of current short term liabilities (Passivo Circulante), over time.

library(dplyr)## ## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':## ##     filter, lag## The following objects are masked from 'package:base':## ##     intersect, setdiff, setequal, unionmy.tab <- df.assets.liabilities %>%  group_by(name.company, ref.date) %>%  summarise(Liq.Index = acc.value[acc.number == '1.01']/ acc.value[acc.number == '2.01'])my.tab## # A tibble: 3 x 3## # Groups:   name.company [?]##                              name.company   ref.date Liq.Index##                                               ## 1 BANCO DO ESTADO DO RIO GRANDE DO SUL SA 2006-12-31 0.7251432## 2  PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS 2005-12-31 0.9370813## 3  PETRÓLEO BRASILEIRO  S.A.  - PETROBRAS 2006-12-31 0.9733600

Now we can visualize the information using ggplot2:

library(ggplot2)p <- ggplot(my.tab, aes(x = ref.date, y = Liq.Index, fill = name.company)) +  geom_col(position = 'dodge' )print(p)

Exporting financial data

The package includes function gdfpd.export.DFP.data for exporting the financial data to an Excel or zipped csv files. See next:

my.basename <- 'MyExcelData'my.format <- 'csv' # only supported so fargdfpd.export.DFP.data(df.reports = df.reports,                       base.file.name = my.basename,                      type.export = my.format)

The resulting Excel file contains all data available in df.reports.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) {var s = d.createElement(t); s.type = 'text/javascript'; s.async = true;s.src = '//cdn.viglink.com/api/vglnk.js';var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R and Finance.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

↧

Mapping lithium production using R

December 6, 2017, 3:15 am

≫ Next: How to Avoid the dplyr Dependency Driven Result Corruption

≪ Previous: Package GetDFPData

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

A great thing about data science – and data visualization in particular– is that you can use data science tools to get information about a subject very quickly.

This is particularly true when you know how to present the information. When you know the best technique for visualizing data, you can use those tools to present information in a way that can help you increase your understanding very quickly.

Here’s a case in point: I was thinking about energy, solar energy, and battery technology, and I got curious about sources of lithium (which is used in batteries and related tech).

Using a few tools from R, I created a map of lithium production within about 15 minutes. The maps that I created certainly don’t tell the full story, but they at least provide a baseline of knowledge.

If you’re fluent with the tools and techniques of data science, this becomes possible. Whether you’re just doing some personal research, or working for a multinational corporation, you can use the tools of data science to quickly identify and present insights about the world.

Let me walk you through this example and show you how I did it.

Tutorial: how to map lithium production data using R

First, we’ll load the packages that we’re going to use.

We’ll load tidyverse mostly for access to ggplot2 and dplyr, but we’ll also load rvest (for data scraping), stringr (to help us manipulate our variables and put the data into shape), and viridis (to help us modify the color of the final plot).

#--------------# LOAD PACKAGES#--------------library(tidyverse)library(rvest)library(stringr)library(viridis)

Ok. Now, we’re going to scrape this mineral production data from the wikipedia page.

Notice that we’re essentially using several rvest functions in series. We’re using several functions and combining them together using the “pipe” operator, %>%.

If you’re not using it yet, you should definitely familiarize yourself with this operator, and start using it. It’s beyond the scope of this blog post to talk extensively about pipe operator, but I will say that it’s one of the most useful tools in the R data scientist’s toolkit. If you learn to use it properly, it will make your code easier to read and easier to write. It will even train your mind to think about analysis in a more step by step way.

Concerning the code: what we’re doing here is designating the URL from which we’re going to scrape the data, then we’re specifying that we’ll be scraping one of the tables. Then, we specify that we’re going to scrape data from the 9th table, and finally we coerce the data into a tibble (instead of keeping it as a traditional data.frame).

#---------------------------# SCRAPE DATA FROM WIKIPEDIA#---------------------------df.lithium <- read_html("https://en.wikipedia.org/wiki/Lithium") %>%  html_nodes("table") %>%  .[[9]] %>%  html_table() %>%  as.tibble()# INSPECTdf.lithium

The resultant dataset, df.lithium, is relatively small (which makes it easy to inspect and easy to work with), but in its raw form, it’s a little messy. We’ll need to do a few things to clean up the data, like change the variable names, parse some data into numerics, etc.

So first, let’s change the column names.

There are a few ways we could do this, but the most straightforward is to simply pass a vector of manually-defined column names into the colnames() function.

#--------------------------------------------# CHANGE COLUMN NAMES# - the raw column names are capitalized and #   have some extra information# - we will just clean them up#--------------------------------------------colnames(df.lithium) <- c('country', 'production', 'reserves', 'resources')colnames(df.lithium)

Now, we’ll remove an extraneous row of data. The original data table on Wikipedia contained not only the individual records of lithium production for particular countries, but it also contained a “total” row at the bottom of the table. More often than not, these sorts of “total” rows are not appropriate for a data.frame in R; we’ll typically remove them, just as we will do here.

#-----------------------------------------------# REMOVE "World total"# - this is a total amount that was#   in the original data table# - we need to remove, because it's not a#   proper data record for a particular country#-----------------------------------------------df.lithium <- df.lithium %>% filter(country != 'World total')df.lithium

Next, we need to parse the numbers into actual numeric data. The reason is that when we scraped the data, it actually read in the numbers as character data, along with commas and some extra characters. We need to transform this character data into proper numeric data in R.

To do this, we’ll need to do a few things. First, we need to remove a few “notes” that were in the original data. This is why we’re using the code str_replace(production,”W\\[.*\\]”, “-“)). Essentially, we’re removing those notes from the data.

After that, we’re using parse_number(production, na = ‘-‘) to transform three variables – production, reserves, resources – into numerics.

Note once again how we’re structuring this code. We’re using a combination of functions from dplyr and stringr to achieve our objectives.

To a beginner, this might look complicated, but it’s really not that bad once you understand the individual pieces. If you don’t understand this code (our couldn’t write it yourself), I recommend that you learn the individual functions from dplyr and stringr first, and then come back to this once you’ve learned those pieces.

#---------------------------------------------------------# PARSE NUMBERS# - the original numeric quantities in the table#   were read-in as character data# - we need to "parse" this information ....#   & transform it from character into proper numeric data#---------------------------------------------------------# Strip out the 'notes' from the numeric data#str_replace(df.lithium$production,"W\\[.*\\]", "") #testdf.lithium <- df.lithium %>% mutate(production = str_replace(production,"W\\[.*\\]", "-"))# inspectdf.lithium# Parse character data into numbersdf.lithium <- df.lithium %>% mutate(production = parse_number(production, na = '-')                   ,reserves = parse_number(reserves, na = '-')                   ,resources = parse_number(resources, na = '-')                     )# Inspectdf.lithium

Now we’ll get data for a map of the world. To do this, we’ll just use map_data().

#--------------# GET WORLD MAP#--------------map.world <- map_data('world')

We’ll also get the names of the countries in this dataset.

The reason is because we’ll need to join this map data to the data from Wikipedia, and we’ll need the country names to be exactly the same. To make this work, we’ll need to examine the names in both datasets and modify any names that aren’t exactly the same.

Notice that once again, we’re using a combination of functions from dplyr, wiring them together using the pipe operator.

#----------------------------------------------------# Get country names# - we can use this list and cross-reference#   with the country names in the scraped data# - when we find names that are not the same between#   this map data and the scraped data, we can recode#   the values#----------------------------------------------------map_data('world') %>% group_by(region) %>% summarise() %>% print(n = Inf)

Ok. Now we’re going to recode some country names. Again, we’re going this so that the country names in df.lithium are the same as the corresponding country names in map.data.

#--------------------------------------------# RECODE COUNTRY NAMES# - some of the country names do not match#   the names we will use later in our map# - we will re-code so that the data matches#   the names in the world map#--------------------------------------------df.lithium <- df.lithium %>% mutate(country = if_else(country == "Canada (2010)", 'Canada'                                      ,if_else(country == "People's Republic of China", "China"                                      ,if_else(country == "United States", "USA"                                      ,if_else(country == "DR Congo","Democratic Republic of the Congo", country))))                                )# Inspectdf.lithium

Ok, now we’ll join the data using dplyr::left_join().

#-----------------------------------------# JOIN DATA# - join the map data and the scraped-data#-----------------------------------------df <- left_join(map.world, df.lithium, by = c('region' = 'country'))

Now we’ll plot.

We’ll start with just a basic plot (to make sure that the map plots correctly), and then we’ll proceed to plot separate maps where the fill color corresponds to reserves, production, and resources.

#-----------# PLOT DATA#-----------# BASIC MAPggplot(data = df, aes(x = long, y = lat, group = group)) +    geom_polygon()# LITHIUM RESERVESggplot(data = df, aes(x = long, y = lat, group = group)) +  geom_polygon(aes(fill = reserves))ggplot(data = df, aes(x = long, y = lat, group = group)) +  geom_polygon(aes(fill = reserves)) +  scale_fill_viridis(option = 'plasma')# LITHIUM PRODUCTIONggplot(data = df, aes(x = long, y = lat, group = group)) +  geom_polygon(aes(fill = production)) +  scale_fill_viridis(option = 'plasma')# LITHIUM RESOURCESggplot(data = df, aes(x = long, y = lat, group = group)) +  geom_polygon(aes(fill = resources)) +  scale_fill_viridis(option = 'plasma')

In the final three versions, notice as well that we’re modifying the color scales by using scale_fill_viridis().

There’s actually quite a bit more formatting that we could do on these, but as a first pass, these are pretty good.

I’ll leave it as an exercise for you to format these with titles, background colors, etc. If you choose to do this, leave your finalized code in the comments section below.

To master data science, you need a plan

At several points in this tutorial, I’ve mentioned a high level plan for mastering data science: master individual pieces of a programming language, and then learn to put them together into more complicated structures.

If you can do this, you will accelerate your progress … although, the devil is in the details.

That’s actually not the only learning hack that you can use to rapidly master data science. There are lots of other tricks and learning hacks that you can use to dramatically accelerate your progress.

Want to know them?

Here at Sharp Sight, we teach data science. But we also teach you how to learn and how to study data science, so you master the tools as quickly as possible.

By signing up for our email list, you’ll get weekly tutorials about data science, delivered directly to your inbox.

You’ll also get our Data Science Crash Course, for free.

SIGN UP NOW

The post Mapping lithium production using R appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

↧

How to Avoid the dplyr Dependency Driven Result Corruption

December 6, 2017, 7:27 am

≫ Next: 14 Jobs for R users from around the world (2017-12-06)

≪ Previous: Mapping lithium production using R

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our last article we pointed out a dangerous silent result corruption we have seen when using the Rdplyr package with databases.

To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.

We would also like to point out we are also distributing free tools to do this automatically, and a worked example of this solution.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

↧

14 Jobs for R users from around the world (2017-12-06)

December 6, 2017, 8:07 am

≫ Next: The British Ecological Society’s Guide to Reproducible Science

≪ Previous: How to Avoid the dplyr Dependency Driven Result Corruption

To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

Full-Time
Quantitative Market Research AnalystGradient Metrics – Posted by kyblock
Anywhere
6 Dec2017
Full-Time
Data Innovation Specialist /Technology Analyst- 20509jessicaxtan
Washington District of Columbia, United States
30 Nov2017
Full-Time
Data Scientist: Mobility ServicesParkbob – Posted by ivan.kasanicky
Wien Wien, Austria
29 Nov2017
Full-Time
Data Analyst: Travel BehaviorRSG – Posted by patricia.holland@rsginc.com
San Diego California, United States
22 Nov2017
Full-Time
Customer Success RepresentativeRStudio, Inc. – Posted by jclemens1
Anywhere
17 Nov2017
Full-Time
Data Science EngineerBonify – Posted by arianna@meritocracy
Berlin Berlin, Germany
17 Nov2017

More New R Jobs

Full-Time
Quantitative Market Research AnalystGradient Metrics – Posted by kyblock
Anywhere
6 Dec2017
Full-Time
Data Scientist in the Institute for Economics and Peace Institute for Economics and Peace – Posted by Institute for Economics and Peace
Sydney New South Wales, Australia
5 Dec2017
Full-Time
Data Innovation Specialist /Technology Analyst- 20509jessicaxtan
Washington District of Columbia, United States
30 Nov2017
Full-Time
Data Scientist: Mobility ServicesParkbob – Posted by ivan.kasanicky
Wien Wien, Austria
29 Nov2017
Freelance
Statistician/Econometrician – R Programmer for Academic Statistical Research Academic Research – Posted by empiricus
Anywhere
29 Nov2017
Full-Time
Senior Data Analyst @ Bangkok, ThailandAgoda – Posted by atrapassi
Bangkok Krung Thep Maha Nakhon, Thailand
28 Nov2017
Full-Time
R Shiny Dashboard Engineer in Health TechCastor EDC – Posted by Castor EDC
Amsterdam-Zuidoost Noord-Holland, Netherlands
24 Nov2017
Full-Time
Data Analyst: Travel BehaviorRSG – Posted by patricia.holland@rsginc.com
San Diego California, United States
22 Nov2017
Full-Time
R&D Database Developer @ Toronto, CanadaCrescendo Technology Ltd – Posted by Crescendo
Toronto Ontario, Canada
17 Nov2017
Full-Time
Customer Success RepresentativeRStudio, Inc. – Posted by jclemens1
Anywhere
17 Nov2017
Full-Time
Data Science EngineerBonify – Posted by arianna@meritocracy
Berlin Berlin, Germany
17 Nov2017
Part-Time
Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza DataWorld Health Organization – Posted by aspenh
Anywhere
6 Nov2017
Full-Time
Data Scientist for H Labs @ Chicago, Illinois, United StatesHeidrick & Struggles – Posted by Heidrick1
Chicago Illinois, United States
2 Nov2017
Full-Time
Business Data Analytics FacultyMaryville University – Posted by tlorden
St. Louis Missouri, United States
2 Nov2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts ).

↧

The British Ecological Society’s Guide to Reproducible Science

December 6, 2017, 2:26 pm

≫ Next: RcppArmadillo 0.8.300.1.0

≪ Previous: 14 Jobs for R users from around the world (2017-12-06)

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The British Ecological Society has published a new volume in their Guides to Better Science series: A Guide to Reproducible Code in Ecology and Evolution (pdf). The introduction, by , describes its scope:

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

The guide proposes a simple reproducible project workflow, and a guide to organizing projects for reproducibility. The Programming section provides concrete tips and traps to avoid (example: use relative, not absolute pathnames), and the Reproducible Reports section provides a step-by-step guide for generating reports with R Markdown.

While written for an ecology audience (and also including some gorgeous photography of animals), this guide would be useful for anyone in the science looking to implement a reproducible workflow. You can download the guide at the link below.

British Ecological Society: A Guide to Reproducible Code in Ecology and Evolution (via Laura Graham)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

↧

RcppArmadillo 0.8.300.1.0

December 6, 2017, 4:59 pm

≫ Next: jmv – one R package (not just) for the social sciences

≪ Previous: The British Ecological Society’s Guide to Reproducible Science

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

armadillo image

Another RcppArmadillo release hit CRAN today. Since our last 0.8.100.1.0 release in October, Conrad kept busy and produced Armadillo releases 8.200.0, 8.200.1, 8.300.0 and now 8.300.1. We tend to now package these (with proper reverse-dependency checks and all) first for the RcppCore drat repo from where you can install them "as usual" (see the repo page for details). But this actual release resumes within our normal bi-monthly CRAN release cycle.

These releases improve a few little nags on the recent switch to more extensive use of OpenMP, and round out a number of other corners. See below for a brief summary.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 405 other packages on CRAN.

A high-level summary of changes follows.

Changes in RcppArmadillo version 0.8.300.1.0 (2017-12-04)
Upgraded to Armadillo release 8.300.1 (Tropical Shenanigans)
faster handling of band matrices by solve()
faster handling of band matrices by chol()
faster randg() when using OpenMP
added normpdf()
expanded .save() to allow appending new datasets to existing HDF5 files
Includes changes made in several earlier GitHub-only releases (versions 0.8.300.0.0, 0.8.200.2.0 and 0.8.200.1.0).
Conversion from simple_triplet_matrix is now supported (Serguei Sokol in #192).
Updated configure code to check for g++ 5.4 or later to enable OpenMP.
Updated the skeleton package to current packaging standards
Suppress warnings from Armadillo about missing OpenMP support and -fopenmp flags by setting ARMA_DONT_PRINT_OPENMP_WARNING

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

↧

jmv – one R package (not just) for the social sciences

December 6, 2017, 3:00 pm

≫ Next: Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

≪ Previous: RcppArmadillo 0.8.300.1.0

(This article was first published on jamovi, and kindly contributed to R-bloggers)

tl;dr

many analyses in the social sciences require many R packages
jmv makes these common analyses available from one R package
jmv can be used from jamovi, a graphical statistical spreadsheet, making it super-accessible

introducing `jmv`

There are many tremendously useful R packages for the social sciences (and similar fields), such as car, afexvcd, etc. Although running basic analyses (such as t-tests or ANOVA) with these packages is very straight forward, it is typically necessary to perform a number of supplementary analyses to accompany them; post-hoc tests, effect-size calculations, bias-corrections, and assumption checks. These additional tests often require the use of many additional R packages, and can make reasonably standard analyses quite time-consuming to perform. For example, in the book Discovering Statistics Using R by Andy Field (a popular textbook in the social sciences), the chapter on ANOVA alone recommends the use of 7 packages.

jmv simplifies this whole process by bringing all of these packages together and makes doing the following analyses with their most common supplementary tests, corrections and assumption checks as easy as possible:

Descriptive statistics
T-Tests
ANOVA
ANCOVA
Repeated Measures ANOVA
Non-parametric ANOVAs
Correlation
Linear Regression
Contingency Tables
Proportion Tests
Factor Analysis

and coming soon:

Logistic Regression
Log-linear Regression

jmv aims to make all common statistical tests taught at an undergraduate level available from a single package.

An ANOVA

Let’s begin with a simple, familiar analysis – an ANOVA. In this example we use the ToothGrowth dataset from R, and explore whether different food supplements and their dosage affect how much a guinea pig’s teeth grow. We’ll specify len to be the dependent variable, and supp and dose to be the factors.

library('jmv')data('ToothGrowth')jmv::anova(ToothGrowth,dep='len',factors=c('supp','dose'))

####  ANOVA####  ANOVA                                                                   ##  ───────────────────────────────────────────────────────────────────────##                 Sum of Squares    df    Mean Square    F        p        ##  ───────────────────────────────────────────────────────────────────────##    supp                    205     1          205.4    15.57    < .001   ##    dose                   2426     2         1213.2    92.00    < .001   ##    supp:dose               108     2           54.2     4.11     0.022   ##    Residuals               712    54           13.2                      ##  ───────────────────────────────────────────────────────────────────────

This produces what should be a familiar ANOVA table. You have likely seen something like this in R before, though perhaps not as nicely formatted.

Where jmv really comes into its own, is with additional options. In the following example we will perform the same analysis, but additionally requesting effect-size, post-hoc tests, homogeneity of variances tests, descriptive statistics, and a descriptives plot:

library('jmv')data('ToothGrowth')jmv::anova(ToothGrowth,dep='len',factors=c('supp','dose'),effectSize='eta',postHoc=c('supp','dose'),plotHAxis='dose',plotSepLines='supp',descStats=TRUE,homo=TRUE)

####  ANOVA####  ANOVA                                                                            ##  ────────────────────────────────────────────────────────────────────────────────##                 Sum of Squares    df    Mean Square    F        p         η²      ##  ────────────────────────────────────────────────────────────────────────────────##    supp                    205     1          205.4    15.57    < .001    0.059   ##    dose                   2426     2         1213.2    92.00    < .001    0.703   ##    supp:dose               108     2           54.2     4.11     0.022    0.031   ##    Residuals               712    54           13.2                               ##  ────────────────────────────────────────────────────────────────────────────────######  ASSUMPTION CHECKS####  Test for Homogeneity of Variances (Levene's)##  ────────────────────────────────────────────##    F       df1    df2    p       ##  ────────────────────────────────────────────##    1.94      5     54    0.103   ##  ────────────────────────────────────────────######  POST HOC TESTS####  Post Hoc Comparisons - supp                                                  ##  ────────────────────────────────────────────────────────────────────────────##    supp         supp    Mean Difference    SE       df      t       p-tukey   ##  ────────────────────────────────────────────────────────────────────────────##    OJ      -    VC                 3.70    0.938    54.0    3.95    < .001   ##  ────────────────────────────────────────────────────────────────────────────######  Post Hoc Comparisons - dose                                                   ##  ─────────────────────────────────────────────────────────────────────────────##    dose         dose    Mean Difference    SE      df      t         p-tukey   ##  ─────────────────────────────────────────────────────────────────────────────##    0.5     -    1                 -9.13    1.15    54.0     -7.95    < .001   ##            -    2                -15.50    1.15    54.0    -13.49    < .001   ##    1       -    2                 -6.37    1.15    54.0     -5.54    < .001   ##  ─────────────────────────────────────────────────────────────────────────────######  Descriptives                            ##  ───────────────────────────────────────##    supp    dose    N     Mean     SD     ##  ───────────────────────────────────────##      OJ     0.5    10    13.23    4.46   ##      OJ       1    10    22.70    3.91   ##      OJ       2    10    26.06    2.66   ##      VC     0.5    10     7.98    2.75   ##      VC       1    10    16.77    2.52   ##      VC       2    10    26.14    4.80   ##  ───────────────────────────────────────

ToothGrowth

As can be seen, jmv can provide many additional tests and statistics relevant to the main tests, but with far less effort.

You can explore additional options for the jmv ANOVA here, and the other tests and their available options here.

jamovi integration

jmv is also useable from the jamovi statistical spreadsheet. jamovi makes a range of analyses accessible to a broader audience by making them available from a familiar, spreadsheet user-interface. jamovi can also make the underlying R code for each analysis available, making it easy for people to learn R, and transition to R scripting if they like.

Here is exactly the same analysis as above, having been performed in jamovi.

jamovi

summing up

jmv makes a whole suite of common analyses from the social sciences very easy to perform
jamovi makes these even easier to perform

jmv is available from CRAN

jamovi is available from www.jamovi.org

To leave a comment for the author, please follow the link and comment on their blog: jamovi.

↧

Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

December 7, 2017, 6:00 am

≫ Next: Clustering Music Genres with R

≪ Previous: jmv – one R package (not just) for the social sciences

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the beautiful gifts that R has got (that Python misses) is the package – Shiny. Shiny is an R package that makes it easy to build interactive web apps straight from R. Making Dashboard is an imminent wherever Data is available since Dashboards are good in helping Business make insights out of the existing data.

In this post, We will see how to leverage Shiny to build a simple Sales Revenue Dashboard.

Loading Packages

All the packages listed below can be directly installed from CRAN.

# load the required packageslibrary(shiny)require(shinydashboard)library(ggplot2)library(dplyr)

Sample Input File: Considering the fact that Dashboard needs an input data to visualise, we will use this sample recommendation.csv as input data to our dashboard but this can be modified to suit any organisational need like a Database connection or Data from remote location.

recommendation <- read.csv('recommendation.csv',stringsAsFactors = F,header=T)head(recommendation)       Account Product Region Revenue1    Axis Bank     FBB  North    20002         HSBC     FBB  South   300003          SBI     FBB   East    10004        ICICI     FBB   West    10005 Bandhan Bank     FBB   West     2006    Axis Bank    SIMO  North     200

Every shiny application has two main sections 1. ui and 2. server. ui is where the code for front-end like buttons, plot visuals, tabs and so on are present and server is where the code for back-end like Data Retrieval, Manipulation, Wrangling are present.

Image Courtesy: Slideplayer

Instead of simply using only shiny, Here we will couple it with shinydashboard. shinydashboard is an R package whose job is to make it easier (as the name suggests) to build dashboards with shiny. The ui part of a shiny app built with shinydashboard would have 3 basic elements wrapped in dashboardPage().

1. dashboardHeader(), 2. dashboardSidebar(), 3. dashboardBody()

Simplest Shiny app with shinydashboard:

## app.R ##library(shiny)library(shinydashboard)ui <- dashboardPage(  dashboardHeader(),  dashboardSidebar(),  dashboardBody())server <- function(input, output) { }shinyApp(ui, server)

Gives this app: Image Courtesy: rstudio

Aligning to our larger goal of making a Sales Revenue Dashboard, Let us look at the code of dashboardHeader() and dashboardSidebar().

#Dashboard header carrying the title of the dashboardheader <- dashboardHeader(title = "Basic Dashboard")  #Sidebar content of the dashboardsidebar <- dashboardSidebar(  sidebarMenu(    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),    menuItem("Visit-us", icon = icon("send",lib='glyphicon'),              href = "https://www.salesforce.com")  ))

To begin with dashboardPage(), We must decide what are the UI elements that we would like to show in our dashboard. Since it’s a Sales Revenue Dashboard, let us show 3 KPI boxes on the top that could represent quick summary and then 2 Graphical plots followed by them for a detailed view.

To align these elements one by one, we will define these elements inside fluidRow().

frow1 <- fluidRow(  valueBoxOutput("value1")  ,valueBoxOutput("value2")  ,valueBoxOutput("value3"))frow2 <- fluidRow(   box(    title = "Revenue per Account"    ,status = "primary"    ,solidHeader = TRUE     ,collapsible = TRUE     ,plotOutput("revenuebyPrd", height = "300px")  )  ,box(    title = "Revenue per Product"    ,status = "primary"    ,solidHeader = TRUE     ,collapsible = TRUE     ,plotOutput("revenuebyRegion", height = "300px")  ) )# combine the two fluid rows to make the bodybody <- dashboardBody(frow1, frow2)

It could be seen from the above code that valueBoxOutput() is used to display the KPI but what is displayed in this valueBoxOutput() will be written in the server part and the same applies for plotOutput() which is used in the ui part to display a plot. box() is a function provided by shinydashboard to enclose the plot inside a box with certain features like title, solidHeader and collapsible. Having defined two fluidRow() functions individually for the sake of modularity, we can combine both of them in dashbboardBody().

Thus we can complete the ui part comprising Header, Sidebar and Page with the below code:

#completing the ui part with dashboardPageui <- dashboardPage(title = 'This is my Page title', header, sidebar, body, skin='red')

Note that the value of title in dashboardPage() will serve as the title of the browser page/tab, while the title defined in the dashboardHeader() will be visible as the dashboard title.

With the ui part over, We will create the server part where the program and logic behind valueBoxOutput() and plotOutput() are added with renderValueBox() and renderPlot() respectively, enclosed inside server function with input,output as its paramaters. Values inside input contain anything that is received from ui (like textBox value, Slider value) and Values inside output contain anything that is sent to ui (like plotOutput, valueBoxOutput).

Below is the complete server code:

# create the server functions for the dashboard  server <- function(input, output) {   #some data manipulation to derive the values of KPI boxes  total.revenue <- sum(recommendation$Revenue)  sales.account <- recommendation %>% group_by(Account) %>% summarise(value = sum(Revenue)) %>% filter(value==max(value))  prof.prod <- recommendation %>% group_by(Product) %>% summarise(value = sum(Revenue)) %>% filter(value==max(value))#creating the valueBoxOutput content  output$value1 <- renderValueBox({    valueBox(      formatC(sales.account$value, format="d", big.mark=',')      ,paste('Top Account:',sales.account$Account)      ,icon = icon("stats",lib='glyphicon')      ,color = "purple")    })  output$value2 <- renderValueBox({     valueBox(      formatC(total.revenue, format="d", big.mark=',')      ,'Total Expected Revenue'      ,icon = icon("gbp",lib='glyphicon')      ,color = "green")    })output$value3 <- renderValueBox({    valueBox(      formatC(prof.prod$value, format="d", big.mark=',')      ,paste('Top Product:',prof.prod$Product)      ,icon = icon("menu-hamburger",lib='glyphicon')      ,color = "yellow")     })#creating the plotOutput content  output$revenuebyPrd <- renderPlot({    ggplot(data = recommendation,            aes(x=Product, y=Revenue, fill=factor(Region))) +       geom_bar(position = "dodge", stat = "identity") + ylab("Revenue (in Euros)") +       xlab("Product") + theme(legend.position="bottom"                               ,plot.title = element_text(size=15, face="bold")) +       ggtitle("Revenue by Product") + labs(fill = "Region")  })output$revenuebyRegion <- renderPlot({    ggplot(data = recommendation,            aes(x=Account, y=Revenue, fill=factor(Region))) +       geom_bar(position = "dodge", stat = "identity") + ylab("Revenue (in Euros)") +       xlab("Account") + theme(legend.position="bottom"                               ,plot.title = element_text(size=15, face="bold")) +       ggtitle("Revenue by Region") + labs(fill = "Region")  })}

So far, we have defined both the essential parts of a Shiny app – ui and server. And finally we have to call/run the shinyApp with ui and server as its paramters.

#run/call the shiny appshinyApp(ui, server)Listening on http://127.0.0.1:5101

The entire file has to be saved as app.R inside a folder before running the shiny app. Also remember to put the input data file (in our case, recommendation.csv inside the same folder where app.R is saved). While there is another valid way to structure the shiny app with two files ui.R and server.R (optionally, global.R), it has been ignored in this article for the sake of brevity since this is aimed at beginners.

Upon running the file, the shiny web app would open in your default browser and would look similar to the below screenshot:

Hopefully at this stage, You have got your shiny web app in the form of Sales Revenue Dashboard up and running. The code and plots used here are available on my Github.

References

The basic parts of a shiny app

Get Started – Shinydashboard

Shiny Files Structure

Input recommendation.csv file

recommendation.csv

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

↧

Clustering Music Genres with R

December 7, 2017, 9:28 am

≫ Next: In case you missed it: November 2017 roundup

≪ Previous: Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In a number of upcoming posts, I’ll be analyzing an interesting dataset I found on Kaggle. The dataset contains information on 18,393 music reviews from the Pitchfork website. The data cover reviews posted between January 1999 and January 2016. I downloaded the data and did an extensive data munging exercise to turn the data into a tidy dataset for analysis (not described here, but perhaps in a future blog post).

The goal of this post is to describe the similarities and differences among the music genres of the reviewed albums using cluster analysis.

The Data

After data munging, I was left with 18,389 reviews for analysis (there were 4 identical review id’s in the dataset; visual inspection showed that the reviews were identical so I removed the duplicates).

One of the pieces of information about each album is the music genre, with the following options available: electronic, experimental, folk/country, global, jazz, metal, pop/rnb, rap and rock. Each album can have 0, 1 or multiple genres attached to it. I represented this information in the wide format, with one column to represent each genre. The presence of a given genre for a given album is represented with a 1, and the absence of a given genre for a given album is represented with a 0.

The head of the dataset, called “categories,” looks like this:

table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }

title	artist	electronic	metal	rock
mezzanine	massive attack	1	0	0
prelapsarian	krallice	0	1	0
all of them naturals	uranium club	0	0	1
first songs	kleenex liliput	0	0	1
new start	taso	1	0	0
insecure (music from the hbo original series)	various artists	0	0	0

As we can see, each album has numeric values on all of our 9 genres. The album in the last row of the data shown above does not have a genre attached to it.

Let’s first answer some basic questions about the music genres. How often is each genre represented in our 18,389 reviews? We can make a simple bar plot using base R with the following code:

# histogram of genre frequencies par(mai=c(1,1.5,1,1)) barplot(sort(colSums(categories[,3:11]), decreasing=FALSE),          horiz=TRUE, cex.names=1, col='springgreen4',          main='Genre Frequency', las=1)

Which gives us the following plot:

Rock is by far the most frequently-occurring genre in the dataset!

Let’s not forget that albums can have more than one genre. How many albums have more than 1 genre attached to them? What is the maximum number of genres in these data? What number of albums have what number of genres? It’s possible to extract these figures with dplyr, but for basic summary statistics I find it quicker and easier to use base R:

# how many of the albums have more than 1 genre? table(rowSums(categories[,3:11])>1) # FALSE  TRUE # 14512  3877# what is the maximum number of genres? max(rowSums(categories[,3:11])) # [1] 4# how many albums have what number of genres? table(rowSums(categories[,3:11])) #   0     1     2     3     4 # 2365 12147  3500   345    32

It looks like 3,877 of the albums have more than 1 genre. The table shows that the vast majority of albums with more than 1 genre have 2 genres, with a much smaller number having 3 and 4 genres. There are 2,365 albums with no recorded genre.

Data Preparation

In order to cluster the music genres, we first must make a matrix which contains the co-occurrences of our dummy-coded genre variables with one another. We can use matrix multiplication to accomplish this, as explained in this excellent StackOverflow answer:

# make a co-occurrence matrix# select the relevant columns and convert to matrix format library(plyr); library(dplyr)  co_occur<-categories %>% select(electronic, experimental, folk_country,                global,jazz, metal, pop_rnb, rap, rock) %>% as.matrix()  # calculate the co-occurrences of genresout<- crossprod(co_occur) # make the diagonals of the matrix into zeros# (we won't count co-occurrences of a genre with itself) diag(out) <-0

The resulting co-occurrence matrix, called “out”, is a 9 by 9 matrix containing the counts of the genre co-occurrences together in the data:

	electronic	experimental	folk_country	global	jazz	metal	pop_rnb	rap	rock
electronic	0	198	26	49	127	40	228	81	1419
experimental	198	0	15	20	76	64	32	17	1121
folk_country	26	15	0	7	6	10	65	2	52
global	49	20	7	0	13	0	48	3	33
jazz	127	76	6	13	0	26	52	25	57
metal	40	64	10	0	26	0	12	27	449
pop_rnb	228	32	65	48	52	12	0	133	126
rap	81	17	2	3	25	27	133	0	68
rock	1419	1121	52	33	57	449	126	68	0

Cluster Analysis

We can now proceed with the cluster analysis. We will use hierarchical clustering, an algorithm which seeks to build a hierarchy of clusters in the data. This analysis will produce groupings (e.g. clusters) of music genres. We will visualize the results of this analysis via a dendrogram.

We first need to produce a distance matrix from our co-occurrence matrix. For each pair of music genres, we will calculate the Euclidean distance between them. To calculate the Euclidean distances, we first calculate the sum of the squared differences in co-occurrences for each pair of rows across the nine columns. We then take the square root of the sum of squared differences. If we consider the two bottom rows (rap and rock) in the co-occurrence matrix above, then the Euclidean distance between them is calculated as follows:

# calculate Euclidean distance manually between rock and rapsquared_diffs= (81-1419)^2+ (17-1121)^2+ (2-52)^2+ (3-33)^2+        (25-57)^2+ (27-449)^2+ (133-126)^2+ (0-68)^2+ (68-0)^2 sqrt(squared_diffs) # [1] 1789.096

This calculation makes clear our definition of genre similarity, which will define our clustering solution: two genres are similar to one another if they have similar patterns of co-occurrence with the other genres.

Let’s calculate all pairwise Euclidean distances with the dist() function, and check our manual calculation with the one produced by dist().

# first produce a distance matrix# from our co-occurrence matrixdist_matrix<- dist(out) # examine the distance matrix round(dist_matrix,3) # the result for rap+rock# is 1789.096# the same as we calculated above!

As noted in the R syntax above (and shown in the bottom-right corner of the distance matrix below), the distance between rap and rock is 1789.096, the same value that we obtained from our manual calculation above!

The distance matrix:

	electronic	experimental	folk_country	global	jazz	metal	pop_rnb	rap
experimental	462.453
folk_country	1397.729	1087.241
global	1418.07	1102.418	38.105
jazz	1392.189	1072.719	122.511	106.353
metal	1012.169	698.688	405.326	420.86	405.608
pop_rnb	1346.851	1006.476	275.647	259.559	197.16	397.776
rap	1375.999	1066.633	92.644	101.975	116.816	406.478	259.854
rock	2250.053	2041.001	1836.619	1818.557	1719.464	1854.881	1682.785	1789.096

To perform the clustering, we simply pass the distance matrix to our hierarchical clustering algorithm (specifying that we to use Ward’s method), and produce the dendrogram plot.*

# perform the hierarchical clusteringhc<- hclust(dist_matrix, method="ward.D") # plot the dendrogram plot(hc, hang=-1, xlab="", sub="")

The hierarchical clustering algorithm produces the organization of the music genres visible in the above dendrogram. But how many clusters are appropriate for these data?

There are many different ways of choosing the number of clusters when performing cluster analysis. With hierarchical clustering, we can simply examine the dendrogram and make a decision about where to cut the tree to determine the number clusters for our solution. One of the key intuitions behind the dendrogram is that observations that fuse together higher up the tree (e.g. at the top of the plot) are more different from one another, while observations that fuse together further down (e.g. at the bottom of the plot) are more similar to one another. Therefore, the higher up we split the tree, the more different the music genres will be among the clusters.

If we split the tree near the top (e.g., around a height of 2500), we end up with three clusters. Let’s make a dendrogram that represents each of these three clusters with a different color. The colored dendrogram is useful in the interpretation of the cluster solution.

There are many different ways to produce a colored dendrogram in R; a lot of them are pretty hacky and require a fair amount of code. Below, I use the wonderful dendextendpackage to produce the colored dendrogram in a straightforward and simple way (with a little help from this StackOverflow answer):

# make a colored dendrogram with the dendextend package library(dendextend) # hc is the result of our hclust call abovedend<-hc# specify we want to color the branches and labelsdend<- color_branches(dend, k=3) dend<- color_labels(dend, k=3) # plot the colored dendrogram. add a title and y-axis label plot(dend, main='Cluster Dendrogram', ylab='Height')

Which yields the following plot:

Interpretation of the Clusters

The Clusters

The three clusters identified by our chosen cut point are as follows. First, on the right-hand side of the plot, we see electronic and experimental music grouped together in a single cluster (colored in blue). Also on the right-hand of the plot, on the same branch but with a different leaf and colored in green, we find rock music which constitutes a cluster in and of itself. These two clusters (electronic + experimental and rock) are distinct, yet they share some similarities in terms of their co-occurrences with other music genres, as indicated by their fusion at the top right-hand side of the dendrogram.

On the left-hand side of the dendrogram, we see the third cluster (colored in pink) which encompasses metal, pop/r&b, folk/country, global, jazz and rap music. Our chosen cut point lumps all of these genres into a single cluster, but examination of the sub-divisions of this cluster reveals a more nuanced picture. Metal is most different from the other genres in this cluster. A further sub-grouping separates pop/r&b from folk/country, global, jazz and rap. Folk/country and global are fused at the lowest point in the dendrogram, followed by a similar grouping of jazz and rap. These two pairings occur very close to the bottom of the dendrogram, indicating strong similarity in genre co-occurrence between folk/country and global on the one hand, and between jazz and rap on the other hand. Substantive Interpretation: What have we learned about music genres?

Rock (Green Cluster) Rock music is the only music genre to get its own cluster, suggesting that it is distinctive in its patterns of co-occurrence with the other genres.

We must keep in mind the following statistical consideration, however. Rock music was the most frequent genre in the dataset. One of the reason rock’s Euclidean distances were so large is that, because this genre occurs so frequently in the data, the distances calculated between rock and the less-frequently occurring genres are naturally larger.

Electronic & Experimental (Blue Cluster) The blue cluster in the dendrogram above groups electronic and experimental music together. One reason for this might be because these genres are both relatively more modern (compared to, say, jazz or folk), and therefore share some sonic similarities (for example, using electronic or synthetically generated sounds) which leads them to be used in a similar way in conjunction with the other music genres.

Metal, Pop/R&B, Folk/Country, Global, Jazz and Rap (Pink Cluster) The remaining genres fall into a single cluster. Within this cluster, it seems natural that metal is most different from the other genres, and that pop/r&b separates itself from folk/country, global, jazz and rap.

Pop and R&B are different, in my vision, from folk/country, global, jazz and rap music in that the former feature slick production styles (e.g. a tremendously layered sound with many different tracks making up a song), electronic instrumentation (e.g. keyboard synthesizers and drum machines) and a contemporary musical aesthetic (e.g. auto-tuning to produce noticeably distorted vocals), whereas the latter feature more sparse arrangements and fewer electronically-produced sounds.

Folk/country and global music, meanwhile, share a similar musical palette in terms of their more traditional instrumentation. While there are exceptions, I feel that both folk/country and global music use more acoustic or “natural” instruments (guitars, violins, wind instruments, drums, etc.) and less of the obviously electronically-produced sounds mentioned above.

Finally, jazz and hip-hop, although quite different in many musical aspects, share a similar aesthetic in some ways. Hip-hop, for example, has long made use of samples from older records (including many jazz tunes) to create beats (although recent hip-hop sub-genres such as trap music make use of drum machines and synthesizers for a more mechanical sound). One testament to the complementary between jazz and rap are number of recent notable collaborations between artists in these two genres: Kendrick Lamar’s excellent 2015 album To Pimp a Butterfly contains a number of tracks featuring jazz musicians and jazz instrumentation, the jazz pianist Robert Glasper has made a several collaborative albums featuring rap artists, and the jazzy group BadBadNotGood made an excellent album with the always brilliant Ghostface Killah.

Conclusion

In this post, we clustered music genres from albums reviewed by Pitchfork. Our original dataset contained dummy-coded indicators for 9 different music genres across 18,389 different albums. We used matrix multiplication to create a co-occurrence matrix for the music genres, which we then turned into a distance matrix for hierarchical clustering. We cut the resulting dendrogram high up on the tree, obtaining three separate clusters of music genres: 1) rock 2) electronic and experimental and 3) metal, pop/r&b, folk/country, global, jazz and rap. This clustering solution seems to align with the production styles, sonic and musical qualities, and past and current cross-pollination between the various music genres.

Coming Up Next

In the next post, I’ll use R to produce a unique visualization of a very famous dataset. Stay tuned!

—-

* If you’re interested, you can play around with other distance metrics and clustering algorithms at home- just import the above co-occurrence matrix into R, and adapt the code however you like!

To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

↧

In case you missed it: November 2017 roundup

December 7, 2017, 1:54 pm

≫ Next: Writing Excel formatted csv using readr::write_excel_csv2

≪ Previous: Clustering Music Genres with R

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from November of particular interest to R users.

R 3.4.3 "Kite Eating Tree" has been released.

Several approaches for generating a "Secret Santa" list with R.

The "RevoScaleR" package from Microsoft R Server has now been ported to Python.

The call for papers for the R/Finance 2018 conference in Chicago is now open.

Give thanks to the volunteers behind R.

Advice for R user groups from the organizer of R-Ladies Chicago.

Use containers to build R clusters for parallel workloads in Azure with the doAzureParallel package.

A collection of R scripts for interesting visualizations that fit into a 280-character Tweet.

R is featured in a StackOverflow case study at the Microsoft Connect conference.

The City of Chicago uses R to forecast water quality and issue beach safety alerts.

A collection of best practices for sharing data in spreadsheets, from a paper by Karl Broman and Kara Woo.

The MRAN website has been updated with faster package search and other improvements.

The curl package has been updated to use the built-in winSSL library on Windows.

Beginner, intermediate and advanced on-line learning plans for developing AI applications on Azure.

A recap of the EARL conference (Effective Applications of the R Language) in Boston.

Giora Simchoni uses R to calculate the expected payout from a slot machine.

An introductory R tutorial by Jesse Sadler focuses on the analysis of historical documents.

A new RStudio cheat sheet: "Working with Strings".

An overview of generating distributions in R via simulated gaming dice.

An analysis of StackOverflow survey data ranks R and Python among the most-liked and least-disliked languages.

And some general interest stories (not necessarily related to R):

Siri transcribes a trombone player
A collection of short videos of interesting chemical reactions
An animation shows the impact of a rogue drone on Gatwick airport
An AI sythesizes novel images of furniture, animals, and celebritiesA

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

↧

Writing Excel formatted csv using readr::write_excel_csv2

December 7, 2017, 4:00 pm

≫ Next: .rprofile: Jenny Bryan

≪ Previous: In case you missed it: November 2017 roundup

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

Why this post?

Currently, my team and I are building a Shiny app that serves as an interface for a forecasting model. The app allows business users to interact with predictions. However, we keep getting feature requests, such as, “Can we please have this exported to Excel.”

Our client chose to see results exported to a csv file and wants to open them in Excel. App is already running on the Linux server and the csv that can be downloaded via app are utf-8 encoded.

If you are a Linux user you may not be aware that Windows Excel is not able to recognize utf-8 encoding automatically. It turns out that a few people faced this problem in the past.

Obviously, we cannot have a solution where our users are changing options in Excel or opening the file in any other way than double clicking.

We find having a Shiny App that allows for Excel export to be a good compromise between R/Shiny and Excel. It gives the user the power of interactivity and online access, while still preserving the possibility to work with the results in the environment they are most used to. This a great way to gradually accustom users with working in Shiny.

Current available solution in R

What we want is the following, write a csv file with utf-8 encoding and BOM sticky noteThe byte order mark (BOM) is a Unicode character which tells about the encoding of the document. . This has been addressed in R by RStudio in readr package.

library(readr)write_excel_csv(mtcars,"assets/data/readr/my_file.csv")

This is great and solves the problem with opening the file in Excel, but… supports only one type of locale.

Show me your locale

Depending on where you live you might have different locale. Locale is a set of parameters that defines the user’s language, region and any special variant preferences that the user wants to see in their user interface.

This means that number formatting can differ between different regions, for example in the USA . is used as a decimal separator, but on the other hand almost whole Europe uses ,. This article shows how countries around the world define their number formats.

This proves that there is a large need to extend the readr functionality and allow users to save Excel with European locale easily and quickly. This is not currently possible since write_excel_csv only allows one to write in the US locale.

New addition to readr

We proposed to add write_excel_csv2() to readr package that would allow the user to write a csv with , as a decimal separator and ; as column separator. To be consistent with naming convention in R for functions reading in (e.g. read.csv() and read.csv2()) or writing (e.g. write.csv() and write.csv2()) csv files with different delimiter we decided to simply add 2 to write_excel_csv().

tmp<-tempfile()on.exit(unlink(tmp))readr::write_excel_csv2(mtcars,tmp)

To prove that it works, let’s read the first two lines and inspect the output.

readr::read_lines(tmp,n_max=2)

## [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
## [2] "21,0;6;160,0;110;3,90;2,620;16,46;0;1;4;4"

write_excel_csv2() is already available for download from readr repository and should be available on CRAN with the next release.

devtools::install_github("tidyverse/readr")

We hope you and your business team will find this addition useful.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.

↧

.rprofile: Jenny Bryan

December 7, 2017, 4:00 pm

≫ Next: Downloading files from a webserver, and failing.

≪ Previous: Writing Excel formatted csv using readr::write_excel_csv2

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Jenny Bryan @JennyBryan is a Software Engineer at RStudio and is on leave from being an Associate Professor at the University of British Columbia. Jenny serves in leadership positions with rOpenSci and Forwards and as an Ordinary member of The R Foundation.

KO: What is your name, your title, and how many years have you worked in R?

JB: I’m Jenny Bryan, I am a software engineer at RStudio (still getting used to that title)., And I am on leave from being an Associate Professor at the University of British Columbia. I’ve been working with R or it’s predecessors since 1996. I switched to R from S in the early 2000s.

KO: Why did you make the switch to R from S?

JB: It just seemed like the community was switching over to R and I didn’t have a specific reason to do otherwise, I was just following the communal path of least resistance.

KO: You have a huge following from all the stuff you post about your course. Did you always want to be a teacher? How did you get into teaching?

JB: No, I wouldn’t say that I always wanted to be a teacher, but I think I’ve enjoyed that above average compared to other professors. But it was more that I realized several years ago that I could have a bigger impact on what people did by improving data analysis workflows, thinking, and tooling instead of trying to make incremental progress on statistical methodology. It is a reflection of where I have a comparative advantage with respect to interest and aptitude, so it’s not really a knock on statistical methodology. But I feel we could use more people working on this side of the field – working on knowledge translation.

I was also reacting to what I saw in my collaborative work. I would work with people in genomics and if I’m completely honest with myself, often my biggest contribution to the paper would be getting all the datasets and analyses organized. I didn’t necessarily do some highly sophisticated statistical analysis. It would often boil down to just doing millions of t-tests or something. But the reason I had an impact on the project would be that I got everything organized so that we could re-run it and have more confidence in our results. And I was like, I have a PhD in stats, why is this my main contribution? Why do the postdocs, grad students, and bioinformaticians not know how to do these things? So then I started to make that more and more the focus of my course, instead of squeezing in more statistical methods. Then the teaching sort of changed who I was and what I allowed myself to think about and spend time on. I used to not let myself spend time on those things. Or if I did, I would feel guilty about it because I thought, I can’t get any professional credit for this! It’s not statistically profound, but it seems to be what the world needs me to do, and needs other people to be doing.

You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things.

KO: Do you feel proud of what you’ve accomplished?

JB: I finally in some sense gave myself permission to start teaching what I thought people actually needed to know. And then after spending lots of time on it in the classroom, you realize what gaps there are, you become increasingly familiar with the tooling that you’re teaching and you’re like, hey I could actually improve that. Or no one really talks about how you get the output of this step to flow nicely as the input into the following step, i.e. how to create workflows. It really helped open my mind to different forms of work that are still valuable. You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things. However because we don’t have names for all of it, it can be difficult from a career point of view. But so many people see it, use it, and are grateful for it.

KO: Can you talk about your transition into working for RStudio and what that will look like on a day-to-day basis?

JB: In many ways it looks a lot like my life already did because I had, especially in the last two to three years, decided if I want to work on R packages or on exposition, I’m going to do that. That’s what I think tenure is for! So I had decided to stop worrying about how to sell myself in a framework set up to reward traditional work in statistical methodology. That freed up a lot of mental energy, to pursue these other activities, unapologetically. Which lead to other opportunities, such as RStudio. I was already working mostly from home. The Statistics department is by no means a negative environment for me, but the internet helped me find virtual colleagues around the globe who really share my interests. The physical comfort of home is very appealing. RStudio is also very light on meetings, which is a beautiful thing.

KO: What is your team like at RStudio? How many projects are you juggling at any given time? Do you have an idea of what you want to accomplish while you’re there?

JB: The person I interact with most is Hadley Wickham and he now has a team of five. There’s a fair amount of back and forth with other team members. I might seek their advice on, e.g., development practices, or just put questions out there for everyone. This team is pretty new and the formalization of the tidyverse is pretty new, so everyone has different packages that they’re working on, either from scratch or shifting some of the maintenance burden off of Hadley. There’s a concerted effort to figure out “what does it mean to be an ecosystem of packages that work together?“.

KO: Do you have a well defined road map at this point on the team?

JB: I’ve been on that team since January and before that we had queued up readxl as a good project for me. It was also overdue for maintenance! I was already a “Spreadsheet Lady”, very familiar with the underlying objects, and with the problem space. It was a good opportunity for me to write compiled code which I hadn’t done in a really long time. I had never written C++ so it was a way to kill at least three birds with one stone. So that was an easy selection for the first thing to work on. And even before that was done, it was clear that going back and doing another project in the Google arena made sense. We knew we would do some work with interns. Wrapping the Google Drive API was going to be useful (in general and for a future update of googlesheets) and I knew our intern Lucy McGowan would be a great person to work with on it.

So no, there’s not some detailed 18-month roadmap stretching ahead of me. I think it will cycle between doing something that’s mine or new and doing maintenance on something that already exists. I also continue to do a lot of exposition, training, and speaking.

It actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance … their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done.

KO: Day-to-day, do you have regular standups? How do you like your day to be structured?

JB: Oh there’s how I wish my day was structured and how it’s actually structured. I wish I could get up and just work because that’s when I feel by far my most productive. Unfortunately, this coincides with the morning chaos of a household with three kids, who, despite the fact that we’re trying to get them more independent with lunches and getting to school, you cannot completely tune out through this part of the day. So I do not really get up and just work, I sort of work until everyone gets out the door. Then I usually go exercise at that point, get that taken care of. I get more work done in the afternoon until the children all arrive home. I do a lot of work between 9 or 10 at night and 1 in the morning. Not because I love working at that time, but that’s just what I have.

Given that I have this platform, it actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance, though it is possible that I have a poor work-life balance, but I feel like it’s usually coming from people who don’t have the same constraints in their life. “You shouldn’t work on the weekends, You shouldn’t work in the evenings”. I’m like, when the heck else do you think I would work? I feel like sometimes people are – their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done. And I have a family and it means that I work when my children are asleep.

They’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: This topic is very interesting and personal to me. As I get older I’ve been thinking (nebulously) about starting a family, and I don’t know what that looks like. It’s scary to me, to not want to give up this lifestyle and this career that I’ve started for myself.

JB: My pivoting of thinking about myself as an applied statistician to more of a data scientist, coincided with me reemerging from having little kids. I had all of them pre-tenure and at some point we had “three under three”. I was trying to get tenure, just barely getting it all done and I was kind of in my own little world, just surviving. Then the tenure process completed successfully, the kids got older, they were all in school, and eventually they didn’t need any out of school care. So me being able to string multiple abstract thoughts together and carve out hours at a time to do thought work coincided with me also freeing myself to work on stuff that I found more interesting.

I don’t know how this all would have worked out if the conventional academic statistical work had suited me better. The time where I was most conflicted between doing a decent job parenting and doing decent work was also when I was doing work I wasn’t passionate about. I can’t tell if having more enthusiasm about the work would have made that period harder or easier! I really thought about ditching it all more than a few times.

The reinvigoration that coincided with switching emphasis also coincided with the reinvigoration that comes from the kids becoming more independent. It does eventually happen! There are some very tough years – they’re not dark years, they’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: What are your favorite tools for managing your workflow?

JB: In terms of working with R I’ve completely standardized on working with RStudio. Before that I was an Emacs-ESS zealot and I still have more accumulated years in that sphere. But once RStudio really existed and was viable, I started teaching with it. I hate doing R one way when I’m in front of students and another when I’m alone. It got very confusing and mixing up the keyboard shortcuts would create chaos. So now I’ve fully embraced RStudio and have never looked back.

I’m also a git evangelist. Everything I do is in git, everything is on Github and at this point, almost everything is public because I’ve gotten unselfconscious enough to put it up there. Plus there’s enough volume now that no one could be looking at any particular one thing. It’s so much easier for me to find it again later. I just put everything in a public place rather than trying to have this granular access control; it simplifies things greatly. Working in the open has simplified a lot of decisions, that’s nice.

Otherwise I feel like my workflow is very primitive. I have thousands of email in my inbox. I’ve completely given up on managing email and I’m mostly okay with that. It’s out of my control and I can’t commit to a system where I’m forced to get to inbox zero. I’ve just given up on it. And twitter and slack are important ways to feel connected when I’m sitting at home on my sofa.

KO: Do you have any online blogs, personalities or podcasts that you particularly enjoy? It doesn’t have to be R related.

JB: I do follow people on twitter and the rstats hashtag, so that often results in serendipitous one-off links that I enjoy. I don’t follow certain blogs regularly, but there are certain places that I end up at regularly. I like the Not So Standard Deviations podcast. In the end I always listen to every episode, but it’s what I do on an airplane or car drive.

KO: You build up a backlog?

JB: Exactly. Then the next time I need to drive to Seattle in traffic, I’ll power through four episodes.

KO: What are some of your favorite R packages – do you have some that you think are funny, or love?

JB: I live entirely in the tidyverse. I’m not doing primary data analysis on projects anymore. It’s weird that the more involved you become in honing the tools, the less time you spend wielding them. So I’m increasingly focused on the data prep, data wrangling, data input part of the cycle and not on modeling. I did a lot more of that when I was a statistician and now it’s not where my comparative interest and advantage seems to lie. There’s plenty to do on the other end. And also not that many people who like it. I actually do enjoy it. I don’t have to force myself to enjoy it – this is really important, and it pleases me. Given how important I think the work is, it’s a relatively uncrowded field. Whereas machine learning, it seems like everyone wants to make a contribution there. I’m like, you go for it – I’m going to be over here getting data out of Excel spreadsheets.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

↧

Downloading files from a webserver, and failing.

December 7, 2017, 4:00 pm

≫ Next: Bayesian Regression Modelling in R: Choosing informative priors in rstanarm #rstats

≪ Previous: .rprofile: Jenny Bryan

(This article was first published on Clean Code, and kindly contributed to R-bloggers)

Recently I wanted to download all the transcripts of a podcast (600+ episodes). The transcripts are simple txt files so in a way I am not even ‘web’-scraping but just reading in 600 or so text files which is not really a big deal. I thought.

This post shows you where I went wrong

Also here is a picture I found of scraping.

Scraping a plate

Webscraping general

For every download you ask the server for a file and it returns the file (this is also how you normally browse the web btw, your browser requests the pages).

In general it is nice if you ask permission (I did, on twitter and the author was really nice! I recommend it!) and don’t push the website to its limit. The servers where these files are hosted are quite beefy and I will probably not even make a dent in them, when I’m downloading these files. But still, be gentle.

No really, be a responsible scraper and tell the website owners you are scraping (in person or by identifying in the header) and check if it is allowed

I recently witnessed a demo where someone explained a lot of dirty tricks on how to get over those pesky servers denying them access and generally ignoring good practices and it made me sick…

Here are some general guides:

Downloading non-html files

There are multiple ways I could do this downloading: if I had used rvest to scrape a website I would have set a user-agent header^[a piece of information we snd with every request that describes who we are] and I would have used incremental backoff: when the server refuses a connection we would wait and retry again, if it still refuses we would wait twice as long and retry again etc.

However, since these are txt files I can just use read_lines^[This is the readr variant of readLines from base-R, it is much faster then the original] to read the txt file of a transcript and apply further work downstream.

A first, failing approach, tidy but wrong

This was my first approach:

all episodes are numbered and the transcript files are sequental too, so just a paste0 of “https://first-part-of-link” number”.txt” would work.
put all links as row into dataframe
apply some purrr magic by mapping every link to a read_lines function (what? use the read_lines() function on every link ).

latest_episode<-636system.time(df_sn<-data_frame(link=paste0("https:linktowebsite.com/firstpart-",formatC(1:latest_episode,width=3,flag=0),".txt"))%>%mutate(transcript=map(link,read_lines2)))

This failed.

Some episodes don’t exists or have no transcript (I didn’t know). Sometimes the internet connection didn’t want to work and just threw me out. Sometimes the server stopped my requests.

On every of those occasions the process would stop, give an informative error^[really, it did]. But the R-process would stop and I had no endresult.

Getting more information to my eyeballs and pausing in between requests

Also I didn’t know where it failed. So I created a new function that also sometimes waited (to not overwhelm the server)

## to see where we are this function wraps read_lines and prints the episodenumberread_lines2<-function(file){print(file)if(runif(1,0,1)>0.008)Sys.sleep(5)read_lines(file)}

This one also failed, but more informatively, I now knew if it failed on a certain episode.

But ultimately, downloading files from the internet is a somewhat unpredictable process. And it is much easier to just first download all the files and read them in afterwards.

A two step approach, download first, parse later.

Also I wanted to let the logs show that I was the one doing the scraping and how to reach me if I was overwhelming the service.

Enter curl. Curl is a library that helps you download stuff, it is used by the httr package and is a wrapper around the c++ package with the same name, wrapped by Jeroen ‘c-plus-plus’ Ooms.

Since I ran this function a few times I downloaded some of the files, and didn’t really want to download every file again, so I also added a check to see if the file wasn’t already downloaded^[I thought that was really clever, didn’t you?]. And I wanted it to print to the screen, because I like moving text over the screen when I’m debugging.

download_file<-function(file){filename<-basename(file)if(file.exists(paste0("data/",filename))){print(paste("file exists: ",filename))}else{print(paste0("downloading file:",file))h<-new_handle(failonerror=FALSE)h<-handle_setheaders(h,"User-Agent"="scraper by RM Hogervorst, @rmhoge, gh: rmhogervorst")curl_download(url=file,destfile=paste0("data/",filename),mode="wb",handle=h)Sys.sleep(sample(seq(0,2,0.5),1))# copied this from  Bob Rudis(@hrbrmstr)}}

I set the header (I think…) and I tell curl not to worry if it fails, we all need reassurance sometimes, but just to continue.

And the downloading begins:

# we choose walk here, because we don't expect output (we do get prints)# We specificaly do this for the side-effect: downloading to a folder.latest_episode<-636#downloadingwalk(paste0("https://first-part-of-link.com/episodenr-",formatC(1:latest_episode,width=3,flag=0),".txt"),download_file)

Conclusion

So in general, don’t be a dick, ask permission and take it easy.

The final download approach works great! And it doesn’t matter if you stop it halfway. In the future you can see why I wanted all of these files.

I thought this would be the easy step, would the rest be even harder? Tune in next time!

Cool things that I could have done:

use purrr::safely ? I think it will continue to work after a fail then?
use a trycatch in the download
first check if the file exists
Do something more with curl, honestly it has many many options that I just didn’t explore.
use some CLI spinners for every download, way cooler
write to a log, and not to the console.

Downloading files from a webserver, and failing. was originally published by at Clean Code on December 08, 2017.

To leave a comment for the author, please follow the link and comment on their blog: Clean Code.

↧

Bayesian Regression Modelling in R: Choosing informative priors in rstanarm #rstats

December 7, 2017, 11:27 pm

≫ Next: Some quirks with R and SQL Server by @ellis2013nz

≪ Previous: Downloading files from a webserver, and failing.

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Yesterday, at the last meeting of the Hamburg R User Group in this year, I had the pleasure to give a talk about Bayesian modelling and choosing (informative) priors in the rstanarm-package.

You can download the slides of my talk here.

Thanks to the Stan team and Tristan for proof reading my slides prior (<- hoho) to the talk. Disclaimer: Still, I'm fully responsible for the content of the slides, and I'm to blame for any false statements or errors in the code…

Tagged: Bayes, R, regression, rstats, Statistik

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.

↧

Some quirks with R and SQL Server by @ellis2013nz

December 8, 2017, 3:00 am

≫ Next: Live Earthquakes App

≪ Previous: Bayesian Regression Modelling in R: Choosing informative priors in rstanarm #rstats

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I’ve been writing on this blog less frequently in the past few months. Mostly this is because I’ve been working on a very intensive and engaging professional project that is much more hands-on (ie coding) than I’ve been at work for a while. So a lot of energy has been going into that. One interesting side effect for me has been diving deep into Structured Query Language (SQL), and the Microsoft-flavoured Transact SQL in particular. I’ve used SQL for a long time, but usually with the attitude of “get it out of the database as quickly as possible”. In a situation where this didn’t make sense, I’ve been pleasantly surprised at how powerful and flexible SQL is for the right sort of jobs.

The bulk of the work project is in SQL, but there is an R flavour with a Shiny front end and a bunch of testing and semi-automation of the build process going on in R (long story). Here are a couple of quirky and useful things relating to using R and SQL Server in combination.

Using R to schedule SQL scripts

Imagine a project that combines SQL and R. For example, SQL is used to do a bunch of heavy lifting data management and complex queries in the database; and R is used for statistical modelling and producing polished outputs. This is actually a very common scenario. Relational databases are very powerful tools with many decades of optimisation embedded in them. They aren’t going anywhere soon.

“Whether your first data analysis language is R, Python, Julia or SAS, your second language should be SQL”

Quote by me – something I just thought of.

It’s good practice to keep any non-trivial SQL queries in their own files with a .sql suffix, and develop them in a database-aware environment. With SQL Server, this will often mean SQL Server Management Studio. But when you come to doing the final stage of reproducible analysis, you don’t want to be flicking between two applications; certainly not in terms of actually running anything. Although SQL Server since 2016 can include R code in a stored procedure, if it’s basically a statistical project it’s still going to have the workflow of “database first, R second”, with the statistical and presentation stage probably developed in RStudio or similar. So it’s very useful to be able to run bunch of .sql scripts from R. This is commonly done by reading in the script with readLines() and executing it on the database via RODBC or other database connection software.

I developed an R function sql_execute() to make this process efficient. The version below is available in my pssmisc R package (only on GitHub at this point) which is a grab-bag of multi-use functionality associated with this blog. The original version had a few project-specific features as well as a few cut corners. It also had an accompanying simple function that uses sql_execute() to run all the SQL scripts in a given folder in order.

The sql_execute() function provides the following benefits:

combines the multiple steps of reading in SQL scripts and executing them in a single command
handles a common problem where files developed in Management Studio often aren’t saved in an encoding automatically recognised by R
allows the use of the GO batch separator, a Microsoft-specific addition to SQL that will cause problems if included in a ODBC query
lets you specify a search-and-replace for a string – very useful sometimes if you’re running lots of SQL scripts to be able to say something like “oh by the way, can you change all references to database X to database Y while you’re at it”
lets you specify if an error in one batch should be fatal, or whether to proceed with the rest of the batches from that script
logs execution and results in a table in the databse.

Dealing with GO was particularly important, because it’s common (and often essential) in T-SQL development. GO divides a single script into batches. The sql_execute() function below embraces this, by splitting the original file into separate queries based on the location of GO, and sending the individual batches one at a time to the server.

# Helper function to convert the output of Sys.time() into a character string
# without the time zone on it
# 
# @details Not exported.
# @keywords internal
# @param dt an object of class \code{POSIXCT}
# @examples
# datetime_ch(Sys.time())
datetime_ch<-function(dt){dt<-gsub(" [A-Z]*$","",as.character(dt))dt<-paste0("CAST ('",dt,"' AS DATETIME)")return(dt)}#' Execute SQL
#'
#' Execute T-SQL in a script, split into batches
#' 
#' @export
#' @importFrom RODBC sqlQuery
#' @importFrom stringr str_split str_length 
#' @details Reads a script of SQL, splits it into separate queries on the basis of any occurrences of \code{GO}
#' in the script, and passes it to the database server for execution.  While the initial use case was for SQL Server, there's no
#' reason why it wouldn't work with other ODBC connections.
#' 
#' The case of \code{GO} is ignored but it has to be the first non-space word on its line of code.
#' 
#' If any batch at any point returns rows of data (eg via a \code{SELECT} statement that does not \code{INSERT} the
#' results into another table or variable on the database), the rest of that batch is not executed.  
#' If that batch was the last batch of SQL
#' in the original file, the results are returned as a data.frame, otherwise it is discarded.
#' 
#' Example SQL code for creating a log suitable for this function:
#' \preformatted{
#' CREATE TABLE some_database.dbo.sql_executed_by_r_log
#' (
#'   log_event_code INT NOT NULL IDENTITY PRIMARY KEY, 
#'   start_time     DATETIME, 
#'   end_time       DATETIME,
#'   sub_text       NVARCHAR(200),
#'   script_name    NVARCHAR(1000),
#'   batch_number   INT,
#'   result         NCHAR(30),
#'   err_mess       VARCHAR(8000),
#'   duration       NUMERIC(18, 2)
#' );
#' }
#' 
#' @param channel connection handle as returned by RODBC::odbcConnect() of class RODBC
#' @param filename file name of an SQL script
#' @param sub_in character string that you want to be replaced with \code{sub_out}.  Useful if you want to do a bulk search
#' and replace.  This is useful if you have a bunch of scripts that you maybe want
#' to run on one schema sometimes, and on another schema other times - just automate the search and replace.  Use with caution.
#' @param sub_out character string that you want to replace \code{sub_in} with.
#' @param fixed logical.  If TRUE, \code{sub_in} is a string to be matched as is.  Otherwise it is treated as a regular expression 
#' (eg if fixed = FALSE, then . is a wild card)
#' @param error_action should you stop with an error if a batch gets an error message back from the database?  Any alternative
#' to "stop" means we just keep ploughing on, which may or may not be a bad idea.  Use "stop" unless you know that failure
#' in one part of a script isn't fatal.
#' @param log_table table in the database to record a log of what happened.  Set to NULL if no log table available.  The log_table
#' needs to have (at least) the following columns: event_time, sub_out, script_name, batch_number, result, err_mess and duration. 
#' See Details for example SQL to create such a log table.
#' @param verbose Logical, gives some control over messages
#' @param ... other arguments to be passed to \code{sqlQuery()}, such as \code{stringsAsFactors = FALSE}.
#' @examples
#' \dontrun{
#' ch <- odbcConnect("some_dsn")
#' sql_execute(ch, "some_file.sql", log_table = "some_database.dbo.sql_executed_by_r_log")
#' }
#' @author Peter Ellis
sql_execute<-function(channel,filename,sub_in=NULL,sub_out=NULL,fixed=TRUE,error_action="stop",log_table=NULL,verbose=TRUE,...){# we can't tell in advance what encoding the .sql files are in, so we read it in
# in two ways (one of which is certain to return gibberish) and choose the version that is recognised as a proper string:
# encoding method 1 (weird Windows encoding):
file_con<-file(filename,encoding="UCS-2LE")sql1<-paste(readLines(file_con,warn=FALSE),collapse="\n")close(file_con)# encoding method 2 (let R work it out - works in most cases):
file_con<-file(filename)sql2<-paste(readLines(file_con,warn=FALSE),collapse="\n")close(file_con)# choose between the two encodings, based on which one has a legitimate string length:
suppressWarnings({if(is.na(stringr::str_length(sql2))){sql<-sql1}else{sql<-sql2}})# do the find and replace that are needed
if(!is.null(sub_in)){sql<-gsub(sub_in,sub_out,sql,fixed=fixed)}# split the SQL into separate commands wherever there is a "GO" at the beginning of a line
# ("GO" is not ANSI SQL, only works for SQL Server - it indicates the lines above are a batch)
sql_split<-stringr::str_split(sql,"\\n *[Gg][Oo]",simplify=TRUE)base_log_entry<-data.frame(sub_out=ifelse(is.null(sub_out),"none",sub_out),script_name=filename,stringsAsFactors=FALSE)n_batches<-length(sql_split)# execute the various separate commands
for(iin1:n_batches){log_entry<-base_log_entrylog_entry$batch_number<-ilog_entry$result<-"no error"log_entry$err_mess<-""log_entry$start_time<-datetime_ch(Sys.time())if(verbose){message(paste("Executing batch",i,"of",n_batches))}duration<-system.time({res<-sqlQuery(channel,sql_split[[i]],...)})log_entry$duration<-duration[3]if(class(res)=="data.frame"){txt<-paste("Downloaded a data.frame with",nrow(res),"rows and",ncol(res),"columns in batch",i,". Any commands left in batch",i,"were not run.")if(verbose){message(txt)}log_entry$result<-"data.frame"}if(class(res)=="character"&length(res)>0){message("\n\nI got this error message:")cat(res)log_entry$result<-"error"log_entry$err_mess<-paste(gsub("'","",res),collapse="\n")message(paste0("\n\nSomething went wrong with the SQL execution of batch ",i," in ",filename,". \n\nError message from the database is shown above\n\n"))}log_entry$end_time<-datetime_ch(Sys.time())# Update the log in the database, if we have been given one:
if(!is.null(log_table)){# couldn't get sqlSave to append to a table even when append = TRUE... 
# see https://stackoverflow.com/questions/36913664/rodbc-error-sqlsave-unable-to-append-to-table
# so am writing the SQL to update the log by hand:
sql<-with(log_entry,paste0("INSERT INTO ",log_table,"(start_time, end_time, sub_out, script_name, batch_number, 
                                    result, err_mess, duration)"," VALUES (",start_time,", ",end_time,", '",sub_out,"', '",script_name,"', ",batch_number,", '",result,"', '",err_mess,"', ",duration,");"))log_res<-sqlQuery(channel,sql)}if(error_action=="stop"&&log_entry$result=="error"){stop(paste("Stopping due to an error in",filename))}if(class(res)=="data.frame"){if(i==n_batches){return(res)}else{warning("Downloaded a data frame from batch ",i," of SQL, which wasn't the \nlast batch in the file.  This data frame is not kept.")}}}}

One of the things to watch out for in this situation is how running a script via ODBC can get different results from hitting F5 in Management Studio. One key thing to trip up on is what happens if the SQL includes a SELECT statement that doesn’t INSERT the results into another table or variable, but returns them as a table. In this case, ODBC considers its work done and will not continue to execute anything else in the batch beyond that SELECT statement.

To clarify how this works, here is a potentially problematic SQL file:

/*
eg-sql.sql

for testing the sql_execute R function

*/-- this will return five rows but sql_execute discards themSELECTTOP5*FROMsome_tableGO-- this will return an errorsomenon-legitimateSQLherethatcausesanerrorgo-- next batch will only get as far as the first seven rowsSELECTTOP7*FROMsome_tableSELECTTOP10*FROMsome_tableGO

If I run that file via sql_execute(ch, "examples/eg-sql.sql"), it does the following:

executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
tries to execute the some non-legitimate SQL, gets an error and stops.

Alternatively, if I run it via sql_execute(ch, "examples/eg-sql.sql", error_action = "continue") it does the following

executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
tries to execute the some non-legitimate SQL, gets an error and prints it to the screen.
executes the SELECT TOP 7 statement, returns the results as a data frame, and stops. The SELECT TOP 10 statement isn’t returned.

An odd quirk with SQL loops cutting short with ODBC

A second quirk that had me puzzled for a while (and indeed I am still puzzled and can’t get a fully reproducible example) seems to relate to the use of SQL WHILE loops in a script executed on the database from R via ODBC. I found many such SQL programs would silently stop after about 20 iterations of the loop under ODBC, even if they worked perfectly in Management Studio. The examples all look like this:

DECLARE@iINT=1WHILE@i<=50BEGIN-- do something that needs loopingSET@i=@i+1ENDGO

BTW, SQL is one of those languages where you avoid loops if you can, and think instead in terms of joining, aggregating and filtering tables. But there are times when it is necessary (for example, performing an action on each table in a database, such as creating a carefully chosen random sample of it in another database – one of the things we had to do in the work project mentioned above).

The solution to this mysterious refusal to go beyond about 20 (it varied) iterations in some loops was to wrap the whole action in a user-defined stored procedure, then execute the procedure. This seems satisfyingly secure in all sorts of ways. The procedure can be kept permanently or blown away depending on what makes sense:

CREATEPROCEDUREdo_stuffASBEGINDECLARE@iINT=1WHILE@i<=50BEGIN-- do something that needs loopingSET@i=@i+1ENDENDGOEXECUTEdo_stuffDROPPROCEDUREdo_stuff

Worth noting – T-SQL distinguishes between its functions and stored procedures, whereas R lumps the two types of functionality together. Functions in SQL are true computer-science-defined functions, that take inputs and return outputs, with strictly no side effects. Stored procedures can have side effects (like creating or modifying tables). In R, functions can have side effects (and frequently do eg drawing plots), not just return outputs based on inputs.

No graphic today…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

↧

Live Earthquakes App

December 8, 2017, 9:44 am

≫ Next: New DataCamp Course: Working with Web Data in R

≪ Previous: Some quirks with R and SQL Server by @ellis2013nz

(This article was first published on analytics for fun, and kindly contributed to R-bloggers)

It’s awesome when you are asked to build a product demo and you end up building something you actually use yourself.

That is what happened to me with the Live Earthquake Shiny App. A few…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: analytics for fun.

↧

New DataCamp Course: Working with Web Data in R

December 8, 2017, 10:13 am

≫ Next: R 3.4.3 is released (a bug-fix release)

≪ Previous: Live Earthquakes App

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Hi there! We just launched Working with Web Data in R by Oliver Keyes and Charlotte Wickham, our latest R course!

Most of the useful data in the world, from economic data to news content to geographic information, lives somewhere on the internet – and this course will teach you how to access it. You’ll explore how to work with APIs (computer-readable interfaces to websites), access data from Wikipedia and other sources, and build your own simple API client. For those occasions where APIs are not available, you’ll find out how to use R to scrape information out of web pages. In the process, you’ll learn how to get data out of even the most stubborn website, and how to turn it into a format ready for further analysis. The packages you’ll use and learn your way around are rvest, httr, xml2 and jsonlite, along with particular API client packages like WikipediR and pageviews.

Take me to chapter 1!

Working with Web Data in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in getting information from the Internet!

What you’ll learn

1. Downloading Files and Using API Clients

Sometimes getting data off the internet is very, very simple – it’s stored in a format that R can handle and just lives on a server somewhere, or it’s in a more complex format and perhaps part of an API but there’s an R package designed to make using it a piece of cake. This chapter will explore how to download and read in static files, and how to use APIs when pre-existing clients are available.

2. Using httr to interact with APIs directly

If an API client doesn’t exist, it’s up to you to communicate directly with the API. But don’t worry, the package httr makes this really straightforward. In this chapter, you’ll learn how to make web requests from R, how to examine the responses you get back and some best practices for doing this in a responsible way.

3. Handling JSON and XML

Sometimes data is a TSV or nice plaintext output. Sometimes it’s XML and/or JSON. This chapter walks you through what JSON and XML are, how to convert them into R-like objects, and how to extract data from them. You’ll practice by examining the revision history for a Wikipedia article retrieved from the Wikipedia API using httr, xml2 and jsonlite.

4. Web scraping with XPATHs

Now that we’ve covered the low-hanging fruit (“it has an API, and a client”, “it has an API”) it’s time to talk about what to do when a website doesn’t have any access mechanisms at all – when you have to rely on web scraping. This chapter will introduce you to the rvest web-scraping package, and build on your previous knowledge of XML manipulation and XPATHs.

5. ECSS Web Scraping and Final Case Study

CSS path-based web scraping is a far-more-pleasant alternative to using XPATHs. You’ll start this chapter by learning about CSS, and how to leverage it for web scraping. Then, you’ll work through a final case study that combines everything you’ve learnt so far to write a function that queries an API, parses the response and returns data in a nice form.

Master web data in R with our course Working with Web Data in R!

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

↧

R 3.4.3 is released (a bug-fix release)

December 8, 2017, 10:32 am

≫ Next: Data science trivia from the Basel Data Science (BDS) meetup group

≪ Previous: New DataCamp Course: Working with Web Data in R

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.4.3 (codename “Kite-Eating Tree”) was released last week. You can get the latest binaries version from here. (or the .tar.gz source code from here).

As mentioned by David Smith, R 3.4.3 is primarily a bug-fix release:

It fixes an issue with incorrect time zones on MacOS High Sierra, and some issues with handling Unicode characters. (Incidentally, representing international and special characters is something that R takes great care in handling properly. It’s not an easy task: a 2003 essay by Joel Spolsky describes the minefield that is character representation, and not much has changed since then.)

The full list of bug fixes and new features is provided below.

Upgrading to R 3.4.3 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr")# install setInternet2(TRUE)# only for R versions older than 3.3.0installr::updateR()# updating R.# If you wish it to go faster, run: installr::updateR(T)

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually take around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

CHANGES IN R 3.4.3

INSTALLATION on a UNIX-ALIKE

A workaround has been added for the changes in location of time-zone files in macOS 10.13 ‘High Sierra’ and again in 10.13.1, so the default time zone is deduced correctly from the system setting when R is configured with –with-internal-tzcode (the default on macOS).
R CMD javareconf has been updated to recognize the use of a Java 9 SDK on macOS.

BUG FIXES

raw(0) & raw(0) and raw(0) | raw(0) again return raw(0) (rather than logical(0)).
intToUtf8() converts integers corresponding to surrogate code points to NA rather than invalid UTF-8, as well as values larger than the current Unicode maximum of 0x10FFFF. (This aligns with the current RFC3629.)
Fix calling of methods on S4 generics that dispatch on ... when the call contains ....
Following Unicode ‘Corrigendum 9’, the UTF-8 representations of U+FFFE and U+FFFF are now regarded as valid by utf8ToInt().
range(c(TRUE, NA), finite = TRUE) and similar no longer return NA. (Reported by Lukas Stadler.)
The self starting function attr(SSlogis, "initial") now also works when the y values have exact minimum zero and is slightly changed in general, behaving symmetrically in the y range.
The printing of named raw vectors is now formatted nicely as for other such atomic vectors, thanks to Lukas Stadler.

logo

To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog.

↧

Data science trivia from the Basel Data Science (BDS) meetup group

December 8, 2017, 4:00 pm

≫ Next: Brownian Motion GIF with R and ImageMagick

≪ Previous: R 3.4.3 is released (a bug-fix release)

(This article was first published on blog, and kindly contributed to R-bloggers)

A few weeks ago, we had our first meetup of the Basel Data Scientists (BDS) here in Basel, Switzerland. As it was our first meeting, and I wanted people to get to know each other and have some fun, I decided to have the members play a data science trivia game. I split the group into two teams of 5, and had each group answer 20 data science trivia questions I gathered from a mixture of classic statistical brain teasers, from both statistics and psychology, some statistical history (thank you Wikipedia!), and a few basic probability calculations. I had no idea if people would be into the game or not, but I was happy to see that after a few questions (and beers), people were engaged in some (at times heated!) debates over questions like the definition of a p-value, and how to best protect an airplane from enemy fire.

As I thought other people might have fun with the game. I am posting them here for other people to enjoy. As you’ll see, the 20 questions are broken down into four categories “Fun”, “Statistics”, “History”, and “Terminology”. Once you’ve given the questions a shot, you can find (my) answers to the questions at http://ndphillips.github.io/DataScienceTrivia_Answers.html. If you find errors, or have suggestions for better questions, don’t hesitate to write me at Nathaniel.D.Phillips.is@gmail.com. Have fun!

Data Science Trivia

Fun

Abraham is tasked with reviewing damaged planes coming back from sorties over Germany in the Second World War. He has to review the damage of the planes to see which areas must be protected even more. Abraham finds that the fuel system of returned planes are much more likely to be damaged by bullets than the engines. Which part of the plan should he recommend to receive additional protection, the fuel systems or the engines?
Paul the __ was an animal that became famous in 2010 for accurately predicting the outcomes of the 2010 world cup. What species was Paul?
Amy and Bob have two children, one of whom is female. What is the probability that their other child is female?
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Assuming that you do not want a goat, should you stick with door No 1. or should you switch to door No 2.?
Imagine the following coin flipping game. Before the game starts, the pot starts at $2. I then continually flip a coin, and each time a Head appears, the pot doubles. The first time tails appears, the game ends and you win whatever is in the pot. Thus a Tails comes on the first flip, the game is over and you get 2$. If the first Tails comes on the second flip, you get $4. Formally, you win $2^k$ dollars, where k is the number of flips. If you played this game infinitely times, how much money would you expect to earn on average? How much would you pay me for the opportunity to play this game?

Statistics

How many people do you need in a room for the probability to be greater than .50 that at least two people in the room have the same birthday?
If you flip a fair coin 4 times, what is the probability that it will have at least one head?

Imagine you are a physician presented with the following problem. A 50-year old woman Betty, with no symptoms, participants in routine mammogram screening. She tests positive and wants to know how likely it is that she actually has breast cancer given her positive test result. You know that about 1% of 50-year old women have breast cancer. If a woman does have breast cancer, the probability that she tests positive is 90%. If she does not have breast cancer, the probability that she nevertheless tests positive is 9%. Based on this information, how likely is it that Betty actually has breast cancer given her positive test result?

What is the definition of a p-value?

Imagine that I flipped a fair coin 5 times: which of the following two sequences is more likely to occur? A) “H, H, T, H, T”, B) “T, T, T, T, T”

History

The ___ ___ theorem, one of the most famous in all of statistics, states that, given enough data, the probability distribution of the sample mean will always be Normal, regardless of the probability distribution of the raw data.

The mathematician ___ developed the method of least squares in 1809.
In 1907, Francis Galton submitted a paper to Nature where he found that when 787 people guessed the weight of an ox at a county fair, the median estimate of the group was only off by 10 pounds. This is one of the most famous examples of the ___ __ ___.
The .05 significance threshold was introduced by ___ in 1925.
Python is a programming language created by Guido van Rossum and was first released in 1991. Where did the name for Python come from?

Terminology

A machine learning model that is so complex that no one, even at times its programmers, don’t know exactly why it works the way it does, is called a ___ ___ model.
When an algorithm has very high accuracy in fitting a training dataset, but poor accuracy in predicting a new dataset, then the model has ___ the training data.
In order to computationally estimate probability distributions, especially in Bayesian statistics, MCMC methods are often used, which stand for ___ ___ ___ ___ methods.
What does SPSS stand for?
Regression, decision trees, and random forests are known as ___ learning algorithms, while algorithms such as nearest neighbor and principle component analysis are known as ___ learning algorithms

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: blog.

↧

Brownian Motion GIF with R and ImageMagick

December 9, 2017, 10:18 am

≫ Next: #12: Know and Customize your OS and Work Environment

≪ Previous: Data science trivia from the Basel Data Science (BDS) meetup group

(This article was first published on long time ago..., and kindly contributed to R-bloggers)

p { margin-bottom: 0.25cm; line-height: 120%; }

Hi there!

Last Monday we celebrated a “Scientific Marathon” at Royal Botanic Garden in Madrid, a kind of mini-conference to talk about our research. I was talking about the relation between fungal spore size and environmental variables such as temperature and precipitation. To make my presentation more friendly, I created a GIF to explain the Brownian Motion model. In evolutionary biology, we can use this model to simulate the random variation of a continuous trait through time. Under this model, we can notice how closer species tend to maintain closer trait values due to shared evolutionary history. You have a lot of information about Brownian Motion models in evolutionary biology everywhere!

Here I will show you how I built a GIF to explain Brownian Motion in my talk using R and ImageMagick.

 # First, we simulate continuous trait evolution by adding in each iteration    # a random number from a normal distribution with mean equal to 0 and standard    # deviation equal to 1. We simulate a total of 4 processes, to obtain at first    # two species and a specieation event at the middle of the simulation, obtaining    # a total of 3 species at the end.    df1<- data.frame(0,0)    names(df1)<- c("Y","X")    y<-0    for (g in 1:750){    df1[g,2] <- g    df1[g,1] <- y    y <- y + rnorm(1,0,1)    }    #plot(df1$X,df1$Y, ylim=c(-100,100), xlim=c(0,1500), cex=0)    #lines(df1$X,df1$Y, col="red")    df2<- data.frame(0,0)    names(df2)<- c("Y","X")    y<-0    for (g in 1:1500){     df2[g,2] <- g     df2[g,1] <- y     y <- y + rnorm(1,0,1)    }    #lines(df2$X,df2$Y, col="blue")    df3<- data.frame(750,df1[750,1])    names(df3)<- c("Y","X")    y<-df1[750,1]    for (g in 750:1500){     df3[g-749,2] <- g     df3[g-749,1] <- y     y <- y + rnorm(1,0,1)    }    #lines(df3$X,df3$Y, col="green")    df4<- data.frame(750,df1[750,1])    names(df4)<- c("Y","X")    y<-df1[750,1]    for (g in 750:1500){     df4[g-749,2] <- g     df4[g-749,1] <- y     y <- y + rnorm(1,0,1)    }    #lines(df4$X,df4$Y, col="orange")

 # Now, we have to plot each simmulation lapse and store them in our computer.    # I added some code to make lighter the gif (plotting just odd generations) and     # to add a label at the speciation time. Note that, since Brownan Model is a     # stocasthic process, my simulation will be different from yours.    # You should adjust labels or repeat the simulation process if you don't     # like the shape of your plot.    parp<-rep(0:1, times=7, each= 15)    parp<- c(parp, rep(0, 600))    for (q in 1:750){     if ( q %% 2 == 1) {     id <- sprintf("%04d", q+749)     png(paste("bm",id,".png", sep=""), width=900, height=570, units="px",        pointsize=18)     par(omd = c(.05, 1, .05, 1))     plot(df1$X,df1$Y, ylim=c(-70,70), xlim=c(0,1500), cex=0,         main=paste("Brownian motion model \n generation=", 749 + q) ,         xlab="generations", ylab="trait value", font.lab=2, cex.lab=1.5 )    lines(df1$X,df1$Y, col="red", lwd=4)    lines(df2$X[1:(q+749)],df2$Y[1:(q+749)], col="blue", lwd=4)    lines(df3$X[1:q],df3$Y[1:q], col="green", lwd=4)    lines(df4$X[1:q],df4$Y[1:q], col="orange", lwd=4)    if (parp[q]==0)    text(750, 65,labels="speciation event", cex= 1.5, col="black", font=2)    if (parp[q]==0)    arrows(750, 60, 750, 35, length = 0.20, angle = 30, lwd = 3)    dev.off()    }    }

Now, you just have to use ImageMagick to put all the PNG files together in a GIF using a command like this in a terminal:

 convert -delay 10 *.png bm.gif

Et voilà!

To leave a comment for the author, please follow the link and comment on their blog: long time ago....

↧

Installation

Shinny interface

How to use GetDFPData

Downloading financial information for ONE company

Downloading financial information for SEVERAL companies

Exporting financial data

Tutorial: how to map lithium production data using R

To master data science, you need a plan

To post your R job on the next post

Current R jobs

Featured Jobs

More New R Jobs

Changes in RcppArmadillo version 0.8.300.1.0 (2017-12-04)

tl;dr

introducing jmv

An ANOVA

jamovi integration

summing up

Loading Packages

Why this post?

Current available solution in R

Show me your locale

New addition to readr

Webscraping general

Downloading non-html files

A first, failing approach, tidy but wrong

Getting more information to my eyeballs and pausing in between requests

A two step approach, download first, parse later.

Conclusion

Cool things that I could have done:

Using R to schedule SQL scripts

An odd quirk with SQL loops cutting short with ODBC

Upgrading to R 3.4.3 on Windows

CHANGES IN R 3.4.3

INSTALLATION on a UNIX-ALIKE

BUG FIXES

Data Science Trivia

Fun

Statistics

History

Terminology

How to use `GetDFPData`

introducing `jmv`