
Changing the variable inside an R formula


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset:

data(mtcars)
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s say I wanted to fit a linear model of mpg vs. hp and get the coefficients. This is easy:

lm(mpg ~ hp, data = mtcars)$coefficients
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 

But what if I wanted to fit a linear model of y vs. hp, where y is a response variable that I won’t know until runtime? Or what if I want to fit 3 linear models: each of mpg, disp, drat vs. hp? Or what if I want to fit 300 such models? There has to be a way to do this programmatically.

It turns out that there are at least 4 different ways to achieve this in R. For all these methods, let’s assume that the responses we want to fit models for are in a character vector:

response_list <- c("mpg", "disp", "drat")

Here are the 4 ways I know (in decreasing order of preference):

1. as.formula()

as.formula() converts a string to a formula object. Hence, we can programmatically create the formula we want as a string, then pass that string to as.formula():

for (y in response_list) {
    lmfit <- lm(as.formula(paste(y, "~ hp")), data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

2. Don’t specify the data option

Passing the data = mtcars option to lm() gives us more succinct and readable code. However, lm() also accepts the response and predictor vectors directly:

for (y in response_list) {
    lmfit <- lm(mtcars[[y]] ~ mtcars$hp) 
    print(lmfit$coefficients)
} 
#> (Intercept)   mtcars$hp 
#> 30.09886054 -0.06822828 
#> (Intercept)   mtcars$hp 
#>    20.99248     1.42977 
#> (Intercept)   mtcars$hp 
#>  4.10990867 -0.00349959 

3. get()

get() searches for an R object by name and returns that object if it exists.

for (y in response_list) {
    lmfit <- lm(get(y) ~ hp, data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

4. eval(parse())

This one is a little complicated. parse() returns the parsed but unevaluated expressions, while eval() evaluates those expressions (in a specified environment).

for (y in response_list) {
    lmfit <- lm(eval(parse(text = y)) ~ hp, data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

 

Of course, for any of these methods, we could replace the outer loop with apply() or purrr::map().
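For example, here is a minimal sketch (my own, not from the original post) of the as.formula() approach driven by lapply(); purrr::map() would look almost identical:

# Fit one linear model per response and collect the coefficients in a named list.
coef_list <- lapply(response_list, function(y) {
    lm(as.formula(paste(y, "~ hp")), data = mtcars)$coefficients
})
names(coef_list) <- response_list
coef_list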





Inferring a continuous distribution from binned data by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

Today’s post comes from an idea and some starting code by my colleague David Diviny from Nous Group.

A common real-world problem is trying to estimate an unknown continuous variable from data that has been published in lumped-together bins. Often this will have been done for confidentialisation reasons; or it might just be that it has been aggregated that way for reporting and the original data is unavailable; or the binning might have happened at the time of measurement (common in surveys, when respondents might be asked to pick from several categories for ‘how often do you…’ or a similar question).

This is related to the problem of data being suppressed in low-count cells, which I discussed in some previous posts. It exemplifies a more general issue: all of our data has been censored into rough left and right bounds, at the mercy of whichever analyst produced the original set of bins for publication.

Simulated data

I’m going to start with a made up but realistic example where respondents were asked “how long do you think it will take for X to happen”. Our simulated data from 10,000 responses (ok, perhaps that sample size isn’t very realistic for this sort of question :)) looks like this:

bin          freq
<10           474
10-50        1710
50-100       1731
100 - 1000   6025
>= 1000        60

Under the hood, I generated 10,000 actual continuous values from an exponential distribution, so I know the true un-binned values, the population mean and even the true data generating process. The true mean is 200.

A tempting but overly simplistic way of estimating the average time would be to allocate to each value the mid point of its bin; so we would have 474 values of 5, 1,710 values of 30, 1,731 of 75, and so on. This would give us an estimated population average of 358, far too high. The fact that I chose a very right-skewed distribution (exponential) to generate the data is the main reason this method doesn’t work, but this is not an unlikely scenario.
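To see where that 358 comes from, here is the crude calculation spelled out (my own illustration; like the full code further below, it assumes the open-ended top bin is treated as running from 1,000 to 2,000, i.e. a midpoint of 1,500):

# Crude mid-point estimate of the mean from the binned frequencies.
# The top bin has no upper bound, so 2000 is assumed (midpoint 1500),
# matching the replace_na(right, 2000) used in the full code below.
freq <- c(474, 1710, 1731, 6025, 60)
mids <- c(5, 30, 75, 550, 1500)
sum(freq * mids) / sum(freq)
#> [1] 358.7245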

A vastly better method of estimating the mean is to model the underlying distribution, choosing parameters that are most likely to have generated the binned results we see. I could cheat and fit an exponential distribution, but let’s be more realistic and allow our model the flexibility of a Gamma distribution (of which the exponential is a special case), reflecting the uncertainty we would have when encountering this data in the wild. Fitting this model to my binned data gives me a Gamma distribution with an estimated shape parameter of 1.02 (very close to the true data generating value of 1, meaning a pure exponential distribution), an estimated rate of 0.0051 and an inferred mean of 198.5 – very close to the true mean of 200 and much better than 358.

Here’s the results:

> # overall fit:
> summary(fitted_distribution_gamma)
Fitting of the distribution ' gamma ' By maximum likelihood on censored data 
Parameters
        estimate   Std. Error
shape 1.01844011 0.0144237671
rate  0.00512979 0.0001093316
Loglikelihood:  -10860.99   AIC:  21725.99   BIC:  21740.41 
Correlation matrix:
          shape      rate
shape 1.0000000 0.7996524
rate  0.7996524 1.0000000
> 
> # estimated mean
> fitted_distribution_gamma$estimate["shape"] / fitted_distribution_gamma$estimate["rate"]
   shape 
198.5345 

And here’s the code that generated it. Most of this is just simulating the data; as is often the case, the actual statistical modelling is a one-liner, using the fitdistcens() function from the fitdistrplus package: Marie Laure Delignette-Muller and Christophe Dutang (2015). fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software, 64(4), 1-34. URL http://www.jstatsoft.org/v64/i04/.

Post continues after R code

library(tidyverse)
library(multidplyr)
library(frs)
library(fitdistrplus)
library(knitr)
library(readxl)
library(kableExtra)
library(clipr)

#-----------------simulated data-------------
set.seed(123)
simulated_rate <- 0.005

volumes <- tibble(volume = rexp(n = 10000, rate = simulated_rate))

volumes <- volumes %>%
  mutate(bin = case_when(
           volume < 10   ~ "<10",
           volume < 50   ~ "10-50",
           volume < 100  ~ "50-100",
           volume < 1000 ~ "100 - 1000",
           TRUE          ~ ">= 1000"),
         left = case_when(
           volume < 10   ~ 0,
           volume < 50   ~ 10,
           volume < 100  ~ 50,
           volume < 1000 ~ 100,
           TRUE          ~ 1000),
         right = case_when(
           volume < 10   ~ 10,
           volume < 50   ~ 50,
           volume < 100  ~ 100,
           volume < 1000 ~ 1000,
           TRUE          ~ NA_real_),
         bin = factor(bin,
                      levels = c("<10", "10-50", "50-100", "100 - 1000", ">= 1000"),
                      ordered = TRUE))

# This is how it would look to the original user
volumes %>%
  group_by(bin) %>%
  summarise(freq = n())

# Create data frame with just "left" and "right" columns, one row per respondent,
# ready for fitdistcens
volumes_bin <- dplyr::select(volumes, left, right) %>%
  as.data.frame()

# Fit model
fitted_distribution_gamma <- fitdistcens(volumes_bin, "gamma")

# overall fit:
summary(fitted_distribution_gamma)

# estimated mean
fitted_distribution_gamma$estimate["shape"] / fitted_distribution_gamma$estimate["rate"]

ggplot(volumes) +
  geom_density(aes(x = volume)) +
  stat_function(fun = dgamma, args = fitted_distribution_gamma$estimate, colour = "blue") +
  annotate("text", x = 700, y = 0.0027,
           label = "Blue line shows modelled distribution; black is density of the actual data.")

# bin_based_mean (358 - very wrong)
volumes %>%
  mutate(mid = (left + replace_na(right, 2000)) / 2) %>%
  summarise(crude_mean = mean(mid)) %>%
  pull(crude_mean)

Here’s the fitted distribution compared to the actual density of the data:

Binned data in the wild

Motivation – estimating average firm size from binned data

That’s all very well for simulated data, with an impressive reduction in error (from > 75% down to <1%), but did I cook the example by using a right-skewed distribution? Let’s use an even more realistic example, this time with real data.

The Australian Bureau of Statistics series 8165.0 is the Count of Australian Businesses, including Entries and Exits. It includes a detailed breakdown of number of firms by primary state of the firm’s operation, number of employees (who may not all be in that state of course) and detailed industry classification. To preserve confidentiality the number of employees is reported in bins of 0, 1-19, 20-199 and 200+. The data looks like this:

For today’s purpose I am using the number of businesses operating at the start of the 2018 financial year, ie in July 2017.

If we want to use this data to estimate the mean number of employees per firm by industry and state we have a challenge that is almost identical to that with my simulated data, albeit we now have 4,473 instances of that challenge (one for each combination of industry and state or territory). So this is a good chance to see whether we can scale up this method to a larger and more complex dataset. Luckily, dplyr makes quick work of this sort of calculation, as we will see in a minute.

Data wrangling

The first job is to tidy the data into the structure we want. The four-digit industry classification is too detailed to be interesting for me just now, so I will roll it up to three digits, and also make it easy to summarise at even coarser levels for graphics. For some reason, the ANZSIC classification is surprisingly hard to get hold of in tidy format, so I’ve included it as the object anzsic_4_abs in my package of R miscellanea, frs (available via GitHub).

For fitting my distributions to censored data down the track, I need columns for the left and right bounds of each bin and the frequency of occurrences in each. After tidying, the data looks like this:

anzsic_group_code  anzsic_group                         state                         employees  left  right  division                           freq
011                Nursery and Floriculture Production  Australian Capital Territory  200-30000   200  30000  Agriculture, Forestry and Fishing     0
011                Nursery and Floriculture Production  Australian Capital Territory  20-199       20    199  Agriculture, Forestry and Fishing     0
011                Nursery and Floriculture Production  Australian Capital Territory  1-19          1     19  Agriculture, Forestry and Fishing     3
011                Nursery and Floriculture Production  Australian Capital Territory  0-0           0      0  Agriculture, Forestry and Fishing     6
011                Nursery and Floriculture Production  New South Wales               200-30000   200  30000  Agriculture, Forestry and Fishing     0
011                Nursery and Floriculture Production  New South Wales               20-199       20    199  Agriculture, Forestry and Fishing    12

Along the way we’ll do some exploratory analysis. Here’s some summary data showing the top five states and ten industry divisions:

We can see that small businesses with fewer than 20 employees form the overwhelming majority of businesses, and that perhaps half or more have no employees at all. Construction is the biggest industry (at the coarse one-digit level of classification), followed by professional services and real estate. The pattern is similar across states, although Queensland and South Australia have relatively fewer professional and financial services firms than the other states. All of this is useful context.

I first chose 250,000 as the upper bound for the top bin (firms with 200 to 250,000 employees) because Wikipedia informs us that the largest firm in Australia by employee size is retailing giant Wesfarmers with 220,000. (Wesfarmers began as a farmer’s cooperative in my home state of Western Australia in 1914, but has grown and diversified – it now owns Bunnings, Kmart, Target and Officeworks among other things). But after some reflection I realised that Wesfarmers is almost certainly one of the 40,000 Australian Business Numbers “associated with complex structures” in this helpful diagram from the ABS on how the Business Counts is scoped. They would be a profiled unit for which the ABS maintains a record of its unit structure by direct contact with the business, and its various units likely are recorded as separate units in the data.

So, unable to use that 220,000 maximum employee size, I struggled to find a better figure for the maximum size of an individual business unit, and decided on 30,000 more or less arbitrarily. It turns out this choice matters a lot if we try to calculate average firm size directly from the binned counts, but it doesn’t matter at all when fitting an explicit model the way I am here. Which is another argument in favour of doing it this way.

Here’s the entire data sourcing and wrangling needed for this operation (assuming frs is loaded with the ANZSIC lookup table):

Post continues after code excerpt

#------real data - business counts------------
download.file("https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&816502.xls&8165.0&Data%20Cubes&B164DBE8275CCE58CA2583A700121372&0&June%202014%20to%20June%202018&21.02.2019&Latest",
              destfile = "bus_counts.xls", mode = "wb")

bus_counts_orig <- readxl::read_excel("bus_counts.xls", sheet = "June 2018", range = "A7:G4976",
                                      col_types = c("text", "text", "text", rep("numeric", 4)))

names(bus_counts_orig) <- c("state", "code", "industry", "0-0", "1-19", "20-199", "200-30000")

bus_counts <- bus_counts_orig %>%
  mutate(code = str_pad(code, 4, "left", "0")) %>%
  filter(!grepl("^Total", industry) & code != "9999") %>%
  filter((`0-0` + `1-19` + `20-199` + `200-30000`) > 0) %>%
  gather(employees, freq, -state, -code, -industry) %>%
  separate(employees, by = "-", into = c("left", "right"), remove = FALSE) %>%
  mutate(left = as.numeric(left),
         right = as.numeric(right),
         employees = fct_reorder(employees, -left)) %>%
  left_join(dplyr::select(anzsic_4_abs, anzsic_class_code, anzsic_group_code, anzsic_group, division),
            by = c("code" = "anzsic_class_code")) %>%
  group_by(anzsic_group_code, anzsic_group, state, employees, left, right, division) %>%
  summarise(freq = sum(freq)) %>%
  arrange(anzsic_group_code) %>%
  ungroup()

# how data will look:
kable(head(bus_counts)) %>%
  kable_styling(bootstrap_options = "striped") %>%
  write_clip()

# demo plot:
bus_counts %>%
  mutate(division2 = fct_lump(division, 10, w = freq),
         division2 = fct_reorder(division2, freq, .fun = sum),
         state2 = fct_lump(state, 5, w = freq),
         division2 = fct_relevel(division2, "Other", after = 0)) %>%
  ggplot(aes(x = division2, fill = employees, weight = freq)) +
  geom_bar() +
  facet_wrap(~state2) +
  coord_flip() +
  scale_y_continuous(label = comma, expand = c(0, 0)) +
  labs(y = "Number of firms", x = "", fill = "Number of employees",
       caption = "Source: ABS Count of Australian Businesses, analysis by freerangestats.info",
       title = "Number of firms by industry division and number of employees") +
  theme(panel.grid.minor = element_blank(),
        panel.spacing = unit(5, "mm"),
        panel.border = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Spectral", guide = guide_legend(reverse = TRUE))

Modelling strategy and application

Whereas the simulated data in the first part of this post was estimates of time taken to perform a task, the underlying data here is a count of employees. So the Gamma distribution won’t be appropriate. The Poisson distribution is the usual starting point for modelling counts, but it doesn’t work in this case, so we relax it to a negative binomial distribution. One possible derivation of the negative binomial in this case would be to assume that the number of counts in any individual business has a Poisson distribution, but that even within state-industry combinations the mean of that distribution is itself a random variable. This feels pretty intuitive and plausible, and more importantly it seems to work.
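As a quick sanity check of that derivation (my own illustration, not part of the original analysis), simulating Poisson counts whose means are themselves Gamma-distributed gives essentially the same distribution as drawing directly from a negative binomial:

# A Poisson whose mean is itself Gamma-distributed is a negative binomial.
# Illustrative simulation only; the parameter values are arbitrary.
set.seed(42)
n <- 100000
size <- 2   # shape of the Gamma, and 'size' of the negative binomial
mu <- 10    # overall mean
lambda <- rgamma(n, shape = size, rate = size / mu)   # unit-level means
x_mixture <- rpois(n, lambda)                         # Poisson given those means
x_nbinom  <- rnbinom(n, size = size, mu = mu)         # direct negative binomial draws
c(mean(x_mixture), mean(x_nbinom))   # both close to 10
c(var(x_mixture), var(x_nbinom))     # both close to mu + mu^2 / size = 60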

To apply this with a minimum of code, I define a function to do the data reshaping needed for fitdistcens, estimate the parameters, and return the mean of the fitted distribution as a single number. This is suitable for use in dplyr’s summarise function after a group_by operation, which means we can fit our models with only a few lines of code. I encountered uninteresting problems doing this with the whole dataset, so I decided to fit these models only to the manufacturing industries in three of the larger states. Even this was a computationally intensive task, so I make use of Hadley Wickham’s experimental multidplyr package, which provides yet another way of performing parallel processing tasks in R. I think multidplyr looks promising because it integrates the familiar group_by approach from dplyr into partitioning of data into separate clusters for embarrassingly parallel tasks (like fitting this distribution to many combinations of industry and state). This fits nicely into my workflow. The example usage I found on the net seemed to be from an old version of multidplyr (it looks like you used to be able to wrap your group_by command into partition, but now they are sensibly separate), but it wasn’t hard to get things working.

Here’s the code for fitting all these models to every detailed manufacturing industry group in those three states. It takes less than a minute to run.

Post continues after code excerpt

bus_counts_small <- bus_counts %>%
  filter(division == "Manufacturing" &
           state %in% c("New South Wales", "Victoria", "Queensland"))

#' @param d a data frame or tibble with columns including left, right and freq
#' @param keepdata passed through to fitdistcens
#' @param ... extra variables to pass to fitdistcens eg starting values for the estimation process
avg_finder <- function(d, keepdata = FALSE, ...){
  d_bin <- as.data.frame(d[rep(1:nrow(d), d$freq), c("left", "right")])
  fit <- fitdistrplus::fitdistcens(d_bin, "nbinom", keepdata = keepdata, ...)
  return(fit$estimate[["mu"]])
}

# Meat manufacturing in NSW:
avg_finder(bus_counts_small[1:4, ])

# Meat manufacturing in Queensland:
avg_finder(bus_counts_small[5:8, ])

cluster <- new_cluster(7)
cluster_library(cluster, "fitdistrplus")
cluster_assign(cluster, avg_finder = avg_finder)

# calculate the simple ones using single processing dplyr for ease during dev:
bus_summary_simp <- bus_counts_small %>%
  mutate(middle = (left + right) / 2,
         just_above_left = (right - left) / 10 + left,
         tiny_above_left = left * 1.1) %>%
  group_by(anzsic_group, state) %>%
  summarise(number_businesses = sum(freq),
            crude_avg = sum(freq * middle) / sum(freq),
            less_crude_avg = sum(freq * just_above_left) / sum(freq),
            another_avg = sum(freq * tiny_above_left) / sum(freq))

# parallel processing for the negative binomial versions
bus_summary_nb <- bus_counts_small %>%
  group_by(anzsic_group, state) %>%
  partition(cluster = cluster) %>%
  do(nb_avg = avg_finder(.)) %>%
  collect() %>%
  mutate(nb_avg = unlist(nb_avg))

bus_summary <- bus_summary_simp %>%
  left_join(bus_summary_nb, by = c("anzsic_group", "state"))

Results

So what do we see?

In the three plots below I compare three crude ways of estimating average firm size (in number of employees), on the horizontal axis, with the superior statistical modelling method taking into account the nature of binned data on the vertical axis. We can see that the choice of a crude way to assign an average employee number within each bin makes a huge difference. For the first two plots, I did this by taking either the mid point of each bin, or the point 10% of the way between the left and right bounds of the bin. Both of these methods give appallingly high overestimates of average firm size. But for the third plot, where I said the average number of employees in each bin was just 10% more than the lowest bound of the bin, we get bad under-estimates.

The shape of these plots depends not only on the method used for crude averaging within bins, but also on the critical choice of the right-most bound of the highest bin (30,000 in my case). Different choices for this value lead to more or less sensible estimates of average firm size, but this just shows how hopeless any naive method of using these bins is for assigning averages.

These are interesting results which illustrate how difficult it is to estimate the means of real world, highly skewed data – particularly when the data has been clouded by binning for confidentialisation or convenience reasons. The mean has a breakdown point of zero, meaning that a single arbitrarily large value is enough to make the whole estimate meaningless (unlike, say, the median, where 50% of the data needs to be problematic for this to happen). Having an unknown upper bound for a bin makes this problem almost certain when you are trying to recover a mean from the underlying data.

Here’s the code for those comparative plots.

Post continues after code excerpt

p1 <- bus_summary %>%
  ggplot(aes(x = crude_avg, y = nb_avg, colour = anzsic_group)) +
  facet_wrap(~state) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Average firm size calculated based on mid point of each bin\nDiagonal line shows where points would be if both methods agreed.",
       y = "Average firm size calculated with negative binomial model",
       title = "Mean number of employees in manufacturing firms estimated from binned data",
       subtitle = "Inference based on mid point of bin delivers massive over-estimates")

p2 <- bus_summary %>%
  ggplot(aes(x = less_crude_avg, y = nb_avg, colour = anzsic_group)) +
  facet_wrap(~state) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Average firm size calculated based on 10% of way from left side of bin to right side\nDiagonal line shows where points would be if both methods agreed.",
       y = "Average firm size calculated with negative binomial model",
       title = "Mean number of employees in manufacturing firms estimated from binned data",
       subtitle = "Using a point 10% of the distance from the left to the right still over-estimates very materially")

p3 <- bus_summary %>%
  ggplot(aes(x = another_avg, y = nb_avg, colour = anzsic_group)) +
  facet_wrap(~state) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Average firm size calculated based on left side of bin x 1.1\nDiagonal line shows where points would be if both methods agreed.",
       y = "Average firm size calculated with negative binomial model",
       title = "Mean number of employees in manufacturing firms estimated from binned data",
       subtitle = "Using the (left-most point of the bin times 1.1) results in under-estimates")

Conclusion

Don’t be naive in translating binned counts into estimates of the average of an underlying distribution, whether that distribution is a discrete count (like our real life example) or truly continuous (as in the simulated data). It’s much more accurate, and not much harder, to fit a statistical model of the underlying distribution and calculate the estimates of interest directly from that. This applies whether you are interested in the average, the sum, percentiles, variance or other more complex statistics.




Visualizing the relationship between multiple variables


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical.

For all the code in this post in one file, click here.

The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package:

library(fueleconomy)
data(vehicles)
df <- vehicles[1:100, ]
str(df)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 100 obs. of  12 variables:
#  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
#  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
#  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
#  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
#  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
#  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
#  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
#  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
#  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
#  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
#  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
#  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

Let’s see how GGally::ggpairs() visualizes relationships between quantitative variables:

library(GGally)
quant_df <- df[, c("cyl", "hwy", "cty")]
ggpairs(quant_df)

  • Along the diagonal, we see a density plot for each of the variables.
  • Below the diagonal, we see scatterplots for each pair of variables.
  • Above the diagonal, we see the (Pearson) correlation between each pair of variables.

The visualization changes a little when we have a mix of quantitative and categorical variables. Below, fuel is a categorical variable while hwy is a quantitative variable.

mixed_df <- df[, c("fuel", "hwy")]
ggpairs(mixed_df)

  • For a categorical variable on the diagonal, we see a barplot depicting the number of times each category appears.
  • In one of the corners (top-right), for each categorical value we have a boxplot for the quantitative variable.
  • In one of the corners (bottom-left), for each categorical value we have a histogram for the quantitative variable.

The only behavior for GGally::ggpairs() we haven’t observed yet is for a pair of categorical variables. In the code fragment below, all 3 variables are categorical:

cat_df <- df[, c("fuel", "make", "drive")]
ggpairs(cat_df)

For each pair of categorical variables, we have a barplot depicting the number of times each pair of categorical values appears.

You may have noticed that the plots above the diagonal are essentially transposes of the plots below the diagonal, so they don’t really convey any additional information. What follows below is my attempt to make the plots above the diagonal more useful. Instead of plotting the transposed barplot, I will plot a heatmap showing the relative proportion of observations having each pair of categorical values.

First, the scaffold for the plot. I will use the gridExtra package to put several ggplot2 objects together. The code below puts the same barplots below the diagonal, variable names along the diagonal, and empty canvases above the diagonal. (Notice that I need some tricks to make the barplots with the variables as strings, namely the use of aes_string() and as.formula() within facet_grid().)

library(gridExtra)
library(tidyverse)

grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # empty canvas
            grobs[[idx]] <- ggplot() + theme_void()
        } else if (i == j) {
            # just the name of the variable
            label_df <- data.frame(x = -0, y = 0, label = x_feat)
            grobs[[idx]] <- ggplot(label_df, aes(x = x, y = y, label = label), 
                                   fontface = "bold", hjust = 0.5) +
                geom_text() +
                coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) +
                theme_void()
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(legend.position = "none", axis.title = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))

This is essentially showing the same information as GGally::ggpairs(). To add the frequency proportion heatmaps, replace the code in the (i < j) branch with the following:

# frequency proportion heatmap
# get frequency proportions
freq_df <- cat_df %>% 
    group_by_at(c(x_feat, y_feat)) %>%
    summarize(proportion = n() / nrow(cat_df)) %>% 
    ungroup()

# get all pairwise combinations of values
temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                       unique(cat_df[[y_feat]]))
names(temp_df) <- c(x_feat, y_feat)

# join to get frequency proportion
temp_df <- temp_df %>%
    left_join(freq_df, by = c(setNames(x_feat, x_feat),
                              setNames(y_feat, y_feat))) %>%
    replace_na(list(proportion = 0))

grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
    geom_tile(aes(fill = proportion)) +
    geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
    scale_fill_gradient(low = "white", high = "#007acc") +
    theme(legend.position = "none", axis.title = element_blank())

Notice that each heatmap has its own limits for the color scale. If you want to have the same color scale for all the plots, you can add limits = c(0, 1) to the scale_fill_gradient() layer of the plot.
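In other words, the heatmap block from the snippet above would become something like this:

grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
    geom_tile(aes(fill = proportion)) +
    geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
    # fixed limits so that every heatmap shares the same color scale
    scale_fill_gradient(low = "white", high = "#007acc", limits = c(0, 1)) +
    theme(legend.position = "none", axis.title = element_blank())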

The one thing we lose here over the GGally::ggpairs() version is the marginal barplot for each variable. This is easy to add but then we don’t really have a place to put the variable names. Replacing the code in the (i == j) branch with the following is one possible option.

# df for positioning the variable name
label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                       y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)

# marginal barplot with variable name on top
grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
    geom_bar() +
    geom_label(data = label_df, aes(x = x, y = y, label = label),
               size = 5)

In this final version, we clean up some of the axes so that more of the plot space can be devoted to the plot itself, not the axis labels:

theme_update(legend.position = "none", axis.title = element_blank())

grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # frequency proportion heatmap
            # get frequency proportions
            freq_df <- cat_df %>% 
                group_by_at(c(x_feat, y_feat)) %>%
                summarize(proportion = n() / nrow(cat_df)) %>% 
                ungroup()
            
            # get all pairwise combinations of values
            temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                                   unique(cat_df[[y_feat]]))
            names(temp_df) <- c(x_feat, y_feat)
            
            # join to get frequency proportion
            temp_df <- temp_df %>%
                left_join(freq_df, by = c(setNames(x_feat, x_feat),
                                          setNames(y_feat, y_feat))) %>%
                replace_na(list(proportion = 0))
            
            grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
                geom_tile(aes(fill = proportion)) +
                geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
                scale_fill_gradient(low = "white", high = "#007acc") +
                theme(axis.ticks = element_blank(), axis.text = element_blank())
        } else if (i == j) {
            # df for positioning the variable name
            label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                                   y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
            # marginal barplot with variable name on top
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
                geom_bar() +
                geom_label(data = label_df, aes(x = x, y = y, label = label),
                           size = 5)
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))




RcppExamples 0.1.9


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

A new version of the RcppExamples package is now on CRAN.

The RcppExamples package provides a handful of short examples detailing, via concrete working code, how to set up basic R data structures in C++. It also provides a simple example for packaging with Rcpp.

This release brings a number of small fixes, including two from contributed pull requests (extra thanks for those!), and updates the package in a few spots. The NEWS extract follows:

Changes in RcppExamples version 0.1.9 (2019-08-24)

  • Extended DateExample to use more new Rcpp features

  • Do not print DataFrame result twice (Xikun Han in #3)

  • Missing parenthesis added in man page (Chris Muir in #5)

  • Rewrote StringVectorExample slightly to not run afoul of the -Wnoexcept-type warning for C++17-related name mangling changes

  • Updated NAMESPACE and RcppExports.cpp to add registration

  • Removed the no-longer-needed #define for new Datetime vectors

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.




Dummy Is As Dummy Does


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

In the 1975 edition of “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences” by Jacob Cohen, an interesting approach to handling missing values in numeric variables was proposed, with the purpose of improving on the traditional single-value imputation, as described below:

– First of all, impute missing values with the mean or the median;
– then create a dummy variable to flag the imputed values.

In the setting of a regression model, both the imputed and the dummy variables would be included, and therefore the number of independent variables is doubled.
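For concreteness, here is a minimal sketch of the scheme on a built-in dataset (my own illustration with made-up variable choices, not the code used in the comparison below):

# Impute a numeric variable with its mean and add a dummy flagging imputed rows.
dummy_impute <- function(df, x) {
  miss <- is.na(df[[x]])
  df[[paste0(x, "_missing")]] <- as.integer(miss)   # dummy indicator
  df[[x]][miss] <- mean(df[[x]], na.rm = TRUE)      # single-value imputation
  df
}

# Both the imputed variable and its dummy enter the regression.
aq <- dummy_impute(airquality, "Ozone")
fit <- glm(I(Temp > 80) ~ Ozone + Ozone_missing, data = aq, family = binomial)
summary(fit)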

Although the aforementioned approach has long been criticized and eventually abandoned by Cohen himself in the recent edition of the book, I was told that this obsolete technique is still being actively used.

Out of my own curiosity, I applied this dummy imputation approach to the data used in https://statcompute.wordpress.com/2019/05/04/why-use-weight-of-evidence and then compared it with the WoE imputation in the context of Logistic Regression.

Below are my observations:

– Since the dummy approach turns each numeric variable with missing values into two variables, the final model tends to have more independent variables, which is not desirable in terms of model parsimony. For instance, there are 7 independent variables in the model with dummy imputation and only 5 in the model with the WoE approach.

– The model performance doesn’t seem to justify the use of more independent variables in the regression with dummy imputation. As shown in the output below, the ROC statistic from the model with the WoE approach is significantly better than the one with dummy imputation based on DeLong’s test, which is also consistent with the result of the Vuong test.




graphlayouts v0.5.0 released


[This article was first published on schochastics, and kindly contributed to R-bloggers.]

A new version of graphlayouts is now available on CRAN. This major update introduces several new layout algorithms and adds additional support for weighted networks. Here is a breakdown of all changes:

  • BREAKING CHANGE: removed qgraph(). Now part of ggraph.
  • POSSIBLE BREAKING CHANGE: layout_with_focus() now also returns the distance to the focus node
  • changed filenames (doesn’t have any effect on functionality)
  • added layout_as_dynamic() for longitudinal network data
  • removed gbp and scales dependency and moved oaqc to suggest
  • edge weights are now supported in layout_with_stress() and layout_with_focus()
  • added layout_with_pmds() (Pivot MDS for large graphs)
  • added layout_with_sparse_stress() (“stress for large graphs”)

All layout algorithms will be fully supported by the new version of ggraph. An updated tutorial for ggraph and graphlayouts can be found here.

library(igraph)
library(ggraph)
library(graphlayouts)
library(patchwork)

Large networks

Two layout algorithms for large graphs were added in this version. layout_with_pmds() is similar to layout_with_mds() from igraph, but performs the MDS only on a small set of pivots to reduce computation time. Below is an example using the US PowerGrid network (4941 nodes and 6594 edges). The runtime for the full MDS calculation was 114s.

Pivot MDS with 50 pivots computes a comparable layout in under 1 second.

# US PowerGrid network obtained from https://sparse.tamu.edu/
ggraph(power, layout = "pmds", pivots = 50)+
  geom_edge_link0(edge_colour = "grey66")+
  geom_node_point(size = 3)+
  theme_graph()

layout_with_sparse_stress() is built with a similar logic as pivot MDS. Instead of calculating the full stress model, only a small set of pivots is used.

# collaboration network (preprints in condensed matter archive)
# obtained from https://sparse.tamu.edu/
ggraph(power, layout = "sparse_stress", pivots = 50)+
  geom_edge_link0(edge_colour = "grey66")+
  geom_node_point(size = 3)+
  theme_graph()

For the above network, with ~12,000 nodes and ~70,000 edges, the algorithm ran in 22s (the full stress model did not terminate within 10 minutes).

The graphlayouts wiki includes several additional examples and a runtime comparison with other existing layout algorithms.

Dynamic networks

One algorithm was added to calculate layouts for longitudinal network data. The function layout_as_dynamic() is not yet supported by ggraph.

The algorithm starts by computing a reference layout based on the aggregated network of all networks. Afterwards, a layout is calculated for each network separately which is then combined with the reference layout in a linear combination. The parameter alpha controls the influence of the reference layout. For alpha=1, only the reference layout is used and all graphs have the same layout. For alpha=0, the stress layout of each individual graph is used. Values in-between interpolate between the two layouts.
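Conceptually, the combination works roughly like this (my own sketch of the description above, not the package internals):

# Interpolate between a reference layout and the layout of snapshot i.
alpha  <- 0.2
xy_ref <- matrix(rnorm(100), ncol = 2)                      # layout of the aggregated graph
xy_i   <- xy_ref + matrix(rnorm(100, sd = 0.5), ncol = 2)   # stress layout of snapshot i
xy_combined <- alpha * xy_ref + (1 - alpha) * xy_i          # alpha = 1: reference only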

Below is an example that shows how to use the function together with ggraph and patchwork to produce a series of network snapshots of longitudinal data.

# 3 Waves of 50 girls from 'Teenage Friends and Lifestyle Study' data
# https://www.stats.ox.ac.uk/~snijders/siena/
gList
## [[1]]
## IGRAPH c2863cd UN-- 50 74 -- 
## + attr: name (v/c), smoking (v/c)
## + edges from c2863cd (vertex names):
##  [1] V1 --V11 V1 --V14 V2 --V7  V2 --V11 V3 --V4  V3 --V9  V4 --V9 
##  [8] V5 --V32 V6 --V8  V7 --V12 V7 --V26 V7 --V42 V7 --V44 V10--V11
## [15] V10--V14 V10--V15 V10--V33 V11--V14 V11--V15 V11--V16 V11--V19
## [22] V11--V30 V12--V42 V12--V44 V15--V16 V17--V18 V17--V19 V17--V21
## [29] V17--V22 V17--V24 V18--V19 V18--V35 V19--V24 V19--V26 V19--V30
## [36] V21--V22 V21--V24 V21--V31 V21--V32 V22--V24 V22--V25 V22--V31
## [43] V22--V34 V22--V43 V23--V24 V25--V31 V25--V32 V26--V29 V26--V30
## [50] V26--V44 V27--V28 V27--V29 V27--V30 V29--V30 V29--V33 V30--V33
## + ... omitted several edges
## 
## [[2]]
## IGRAPH af42ccd UN-- 50 81 -- 
## + attr: name (v/c), smoking (v/c)
## + edges from af42ccd (vertex names):
##  [1] V1 --V10 V1 --V11 V1 --V14 V1 --V33 V2 --V26 V3 --V4  V3 --V9 
##  [8] V4 --V5  V4 --V17 V4 --V34 V5 --V17 V6 --V8  V6 --V35 V7 --V26
## [15] V7 --V44 V10--V11 V10--V14 V10--V33 V11--V14 V11--V19 V11--V26
## [22] V11--V30 V12--V15 V12--V26 V12--V42 V12--V44 V15--V16 V15--V36
## [29] V15--V42 V16--V26 V16--V42 V16--V44 V17--V22 V17--V24 V17--V27
## [36] V17--V32 V18--V35 V19--V21 V19--V23 V19--V30 V19--V36 V19--V41
## [43] V21--V31 V21--V37 V21--V40 V22--V24 V23--V50 V24--V25 V24--V28
## [50] V25--V27 V25--V28 V25--V32 V26--V42 V27--V28 V28--V35 V29--V30
## + ... omitted several edges
## 
## [[3]]
## IGRAPH f5b65f3 UN-- 50 77 -- 
## + attr: name (v/c), smoking (v/c)
## + edges from f5b65f3 (vertex names):
##  [1] V1 --V10 V1 --V11 V1 --V14 V1 --V41 V2 --V7  V2 --V23 V2 --V26
##  [8] V3 --V4  V3 --V9  V3 --V34 V4 --V32 V4 --V34 V5 --V17 V5 --V32
## [15] V6 --V24 V6 --V27 V6 --V28 V7 --V16 V7 --V26 V7 --V42 V7 --V44
## [22] V8 --V25 V10--V11 V10--V12 V10--V14 V10--V33 V11--V14 V11--V15
## [29] V11--V33 V12--V15 V12--V33 V14--V33 V15--V29 V15--V33 V15--V36
## [36] V16--V26 V16--V42 V16--V44 V17--V22 V17--V27 V19--V29 V19--V30
## [43] V19--V36 V21--V31 V21--V37 V21--V40 V21--V45 V24--V27 V24--V28
## [50] V25--V50 V26--V44 V27--V28 V29--V30 V29--V33 V30--V33 V30--V36
## + ... omitted several edges
xy <- layout_as_dynamic(gList, alpha = 0.2)

pList <- vector("list", length(gList))

for(i in 1:length(gList)){
  pList[[i]] <- ggraph(gList[[i]], layout = "manual", x = xy[[i]][, 1], y = xy[[i]][, 2])+
    geom_edge_link0(edge_width = 0.2, edge_colour = "grey25")+
    geom_node_point(shape = 21, aes(fill = smoking), size = 3)+
    geom_node_text(aes(label = 1:50), repel = T)+
    scale_fill_manual(values = c("forestgreen", "grey25", "firebrick"),
                      guide = ifelse(i != 2, FALSE, "legend"))+
    theme_graph()+
    theme(legend.position = "bottom")+
    labs(title = paste0("Wave ", i))
}

Reduce("+", pList)+
  plot_annotation(title = "Friendship network",
                  theme = theme(title = element_text(family = "Arial Narrow",
                                                     face = "bold", size = 16)))

If you want to play around with this function, there are some small data sets available from the Siena data repository.




Are you sure you’re precise? Measuring accuracy of point forecasts


[This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers.]

Two years ago I wrote the post “Naughty APEs and the quest for the holy grail“, where I discussed why percentage-based error measures (such as MPE, MAPE and sMAPE) are not good for the task of evaluating forecasting performance. However, it seems that I did not explain the topic to the full extent – time has shown that there are some other issues that need to be discussed in detail, so I have decided to write another post on the topic, possibly repeating myself a little bit. This time we won’t have imaginary forecasters; we will be more serious.

Introduction

We start from a fact well known in statistics: MSE is minimised by the mean value, while MAE is minimised by the median. There are a lot of nice examples, papers and posts explaining why and how this happens, so we don’t need to waste time on that. But there are two elements related to this that need to be discussed.

First, some people think that this property applies only to the estimation of models. For some reason, it is implied that forecast evaluation is a completely different activity, unrelated to estimation – something like: as soon as you estimate a model, you are done with the “MSE is minimised by mean” thingy, and the evaluation is not related to it. However, when we select the best performing model based on some error measure, we inevitably impose the properties of that error measure on the forecasts. So, if a method performs better than another in terms of MAE, it means that it produces forecasts closer to the median of the data than the others do.

For example, a zero forecast will always be one of the most accurate forecasts for intermittent demand in terms of MAE, especially when the number of zeroes in the data is greater than 50%. The reason for this is obvious: if you have so many zeroes, then saying that we won’t sell anything in the foreseeable future is a safe strategy. And this strategy works because MAE is minimised by the median. The usefulness of such a forecast is a completely different topic, but the thing to take away from this is that, in general, MAE-based error measures should not be used on intermittent demand.

Are you not convinced? Just to make this point clearer, let’s consider the following example in R. We generate data from a mixture of normal and Bernoulli distributions (with probability \(p=0.4\), which means that we should have around 60% of zeroes in the data):

x <- rnorm(150,30,10) * rbinom(150, 1, 0.4)

The series should look something like this:

plot.ts(x)

An example of artificial intermittent data

We use the first 100 observations for methods estimation and the last 50 for the evaluation of the forecasts. We use two forecasting methods: simple average of the in-sample part of the series and zero forecast (which accidentally corresponds to the median of the data). They look the following way:

plot.ts(x)
abline(h=mean(x[1:100]), col="blue", lwd=2)
abline(h=0, col="purple", lwd=2)
abline(v=100, col="red", lwd=2)

An example of artificial intermittent data and the forecasts. The blue line is the simple average, the purple line is the zero forecast. The red line splits the data into in-sample and the holdout

It is obvious that the average forecast is more reasonable than the zero forecast in this case – at least we can make some decisions based on that. In addition, it seems that it goes “through the data” in the holdout sample, which is what we usually want from the point forecasts. But what about the error measures?

errorMeasures <- matrix(c(mean(abs(x[101:150] - mean(x[1:100]))),
                          mean(abs(x[101:150] - 0)),
                          mean((x[101:150] - mean(x[1:100]))^2),
                          mean((x[101:150] - 0)^2)),
                        2, 2, dimnames=list(c("Average","Zero"), c("MAE","MSE")))
errorMeasures

            MAE      MSE
Average 15.4360 264.9922
Zero    12.3995 418.4934

We can see that MAE recommends zero forecast as the more appropriate here (it has the lower error of 12.3995 in contrast with 15.4360), while MSE prefers the Average (264.9922 vs 418.4934). This is a simple illustration of the point about the mean and median.

Second, some researchers think that if a model is optimised using, for example, MSE, then it should always be evaluated using an MSE-based error measure. This is not completely correct. Yes, the model will probably perform better if the loss function is aligned with the error measure used for the evaluation. However, this does not mean that we cannot use MAE-based error measures if the loss is not MAE-based. These are still slightly different tasks, and the selection of error measures should be motivated by the specific problem for which the forecast is needed, not by the loss function used. For example, in the case of inventory management, neither MAE nor MSE might be useful for the evaluation. One would probably need to see how models perform in terms of safety stock allocation, and this is a completely different problem, which does not necessarily align well with either MAE or MSE. As a final note, in some cases we are interested in estimating models via likelihood maximisation, and selecting an aligned error measure in those cases might be quite challenging.

So, as a minor conclusion, MSE-based measures should be used when we are interested in identifying the method that outperforms the others in terms of mean values, while MAE-based measures should be preferred for medians, irrespective of how we estimate our models.

As one can already see, there might be other losses, focusing, for example, on specific quantiles (such as the pinball loss). But the other question is what statistics minimise MAPE and sMAPE. The short answer is: “we don’t know”. However, Stephan Kolassa and Roland Martin (2011) showed on a simple example that, in the case of a strictly positive distribution, the MAPE prefers biased forecasts, and Stephan Kolassa (2016) noted that in the case of a log normal distribution the MAPE is minimised by the mode. So at least we have an idea of what to expect from MAPE. The sMAPE, however, is a complete mystery in this sense. We don’t know what it does, and this is yet another reason not to use it in forecast evaluation at all (see the other reasons in the previous post).

We are already familiar with some error measures from the previous post, so I will not rewrite all the formulae here. And we already know that the error measures can be in the original units (MAE, MSE, ME), percentage (MAPE, MPE), scaled or relative. Skipping the first two, we can discuss the latter two in more detail.

Scaled measures

Scaled measures can be quite informative and useful when comparing different forecasting methods. For example, sMAE and sMSE (from Petropoulos & Kourentzes, 2015): \begin{equation} \label{eq:sMAE} \text{sMAE} = \frac{\text{MAE}}{\bar{y}}, \end{equation} \begin{equation} \label{eq:sMSE} \text{sMSE} = \frac{\text{MSE}}{\bar{y}^2}, \end{equation} where \(\bar{y}\) is the in-sample mean of the data. These measures have a simple interpretation, close to that of MAPE: they show the mean error as a percentage of the mean of the series (rather than of each specific observation in the holdout). They don’t have the problems that APEs have, but they might not be applicable to non-stationary data, where the mean changes over time. To make the point, they might be okay for the series on the left-hand graph below, where the level of the series does not change substantially, but their value might change dramatically when new data is added, as in the trended series on the right-hand graph.

Two time series examples

MASE by Rob Hyndman and Anne Koehler (2006) does not have this issue, because it is scaled using the mean absolute value of the in-sample first differences of the data: \begin{equation} \label{eq:MASE} \text{MASE} = \frac{\text{MAE}}{\frac{1}{T-1}\sum_{t=2}^{T}|y_t -y_{t-1}|} . \end{equation} The motivation here is statistically solid: while the mean of a series can change over time, its first differences are usually much more stable. So the denominator of the formula becomes more or less fixed, which solves the problem mentioned above.

Unfortunately, MASE has a different issue – it is uninterpretable. If MASE is equal to 1.3, this does not really mean anything. Yes, the denominator can be interpreted as a mean absolute error of in-sample one-step-ahead forecasts of Naive, but this does not help with the overall interpretation. This measure can be used for research purposes, but I would not expect practitioners to understand and use it.

And let’s not forget about the dictum “MAE is minimised by median”, which implies that, in general, neither MASE nor sMAE should be used on intermittent demand.
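As an illustration, here is how the scaled measures above could be computed for a single series (a sketch with made-up numbers, not code from the post):

# Sketch of sMAE, sMSE and MASE for one series.
insample <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)  # in-sample actuals
holdout  <- c(104, 118, 115, 126, 141)                           # holdout actuals
forecast <- rep(mean(insample), length(holdout))                 # some point forecast

mae <- mean(abs(holdout - forecast))
mse <- mean((holdout - forecast)^2)

sMAE <- mae / mean(insample)
sMSE <- mse / mean(insample)^2
MASE <- mae / mean(abs(diff(insample)))
c(sMAE = sMAE, sMSE = sMSE, MASE = MASE)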

Relative measures

Finally, we have relative measures, for example, the relative MAE or relative RMSE. Note that Davydenko & Fildes (2013) called them “RelMAE” and “RelRMSE”, with the aggregated versions being “AvgRelMAE” and “AvgRelRMSE”. I personally find these names tedious, so I prefer to call them “rMAE”, “rRMSE”, “ArMAE” and “ArRMSE” respectively. They are calculated the following way: \begin{equation} \label{eq:rMAE} \text{rMAE} = \frac{\text{MAE}_a}{\text{MAE}_b}, \end{equation} \begin{equation} \label{eq:rRMSE} \text{rRMSE} = \frac{\text{RMSE}_a}{\text{RMSE}_b}, \end{equation} where the numerator contains the error measure of the method of interest, and the denominator contains the error measure of a benchmark method (for example, the Naive method in the case of continuous demand, or the forecast from a simple average in the intermittent case). Given that both are aligned and evaluated over the same part of the sample, we do not need to worry about the mean of the time series changing, which makes both measures easily interpretable. If the measure is greater than one, our method performed worse than the benchmark; if it is less than one, the method is doing better. Furthermore, as I mentioned in the previous post, both rMAE and rRMSE align very well with the idea of forecast value, developed by Mike Gilliland from SAS, which can be calculated as: \begin{equation} \label{eq:FV} \text{FV} = (1-\text{relative measure}) \cdot 100\%. \end{equation} So, for example, rMAE = 0.96 means that our method is doing 4% better than the benchmark in terms of MAE, i.e. we are adding value to the forecast that we produce.
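Sticking with the same assumptions (a vector of holdout actuals plus vectors of forecasts from the method of interest and from a benchmark), the relative measures and the forecast value can be sketched as:

rMAE <- function(holdout, forecast, benchmark){
    mean(abs(holdout - forecast)) / mean(abs(holdout - benchmark))
}
rRMSE <- function(holdout, forecast, benchmark){
    sqrt(mean((holdout - forecast)^2) / mean((holdout - benchmark)^2))
}
# forecast value in percent; e.g. rMAE = 0.96 gives FV = 4
FV <- function(relativeMeasure) (1 - relativeMeasure) * 100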

As mentioned in the previous post, if you want to aggregate relative error measures, it makes sense to use geometric means instead of arithmetic ones. This is because the distribution of relative measures is typically asymmetric, and the arithmetic mean would be influenced too much by the outliers (cases where the models performed very poorly). The geometric mean, on the other hand, is much more robust and in some cases aligns with the median of the distribution.
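In R this is a one-liner; for example, assuming rMAEs is a vector with one rMAE value per time series:

geomMean <- function(x) exp(mean(log(x)))
# ArMAE <- geomMean(rMAEs)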

The main limitation of relative measures is that they cannot be properly aggregated when the error (either MAE or MSE) is equal to zero in either the numerator or the denominator, because the geometric mean then becomes either zero or infinite. This does not stop us from analysing the distribution of the errors, but it might cause some inconvenience. To be honest, we do not face these situations very often in the real world, because a zero error implies that we have produced a perfect forecast for the whole holdout sample, several steps ahead, which can only happen if there is no forecasting problem at all. One example would be knowing for sure that we will sell 500 bottles of beer per day for the next week, because someone pre-ordered them from us. Another example would be an intermittent demand series with only zeroes in the holdout, where Naive would produce a zero forecast as well. But I would argue that Naive is not a good benchmark in that case; it makes sense to switch to something like the simple mean of the series. I struggle to come up with other meaningful real-world examples where the error (either MAE or MSE) would be equal to zero for the whole holdout. Having said that, if you notice that either rMAE or rRMSE becomes zero or infinite for some time series, it makes sense to investigate why that happened, and probably remove those series from the analysis.

It might sound as if I am advertising relative measures. To some extent I am, because I don't think there are better options for the comparison of point forecasts at the moment. I would be glad to recommend something else as soon as we have a better option. They are not ideal, but they do a pretty good job for forecast evaluation.

So, summarising, I would recommend using relative error measures, keeping in mind that MSE is minimised by the mean and MAE is minimised by the median. And in order to decide which of the two to use, you should probably ask yourself: what do we really need to measure? In some cases it might turn out that none of the above is needed. Maybe you should look at prediction intervals instead of point forecasts… This is something that we will discuss next time.

Examples in R

To make this post a bit closer to the application, we will consider a simple example with the smooth package v2.5.3 and several series from the M3 dataset.

Load the packages:

library(smooth)
library(Mcomp)

Take a subset of monthly demographic time series (it’s just 111 time series, which should suffice for our experiment):

M3Subset <- subset(M3, 12, "demographic")

Prepare the array for the two error measures, rMAE and rRMSE (these are calculated based on the measures() function from the greybox package). We will use three options for this example: CES, ETS with automatic model selection between the 30 models, and ETS with automatic selection skipping the multiplicative trends:

errorMeasures <- array(NA, c(length(M3Subset),2,3),
                       dimnames=list(NULL, c("rMAE","rRMSE"),
                                     c("CES","ETS(Z,Z,Z)","ETS(Z,X,Z)")))

Run the loop, applying the models to the data and extracting the error measures from the accuracy variable. By default, the benchmark for rMAE and rRMSE is the Naive method.

for(i in 1:length(M3Subset)){
    errorMeasures[i,,1] <- auto.ces(M3Subset[[i]])$accuracy[c("rMAE","rRMSE")]
    errorMeasures[i,,2] <- es(M3Subset[[i]])$accuracy[c("rMAE","rRMSE")]
    errorMeasures[i,,3] <- es(M3Subset[[i]],"ZXZ")$accuracy[c("rMAE","rRMSE")]
    cat(i); cat(", ")
}

Now we can analyse the results. We start with the ArMAE and ArRMSE:

exp(apply(log(errorMeasures),c(2,3),mean))

            CES ETS(Z,Z,Z) ETS(Z,X,Z)
rMAE  0.6339194  0.8798265  0.8540869
rRMSE 0.6430326  0.8843838  0.8584140

As we can see, all models did better than Naive: ETS is approximately 12 – 15% better, while CES is more than 35% better. Also, CES outperformed both ETS options in terms of rMAE and rRMSE. The difference is quite substantial, but in order to see it more clearly, we can reformulate our error measures, dividing the rMAE of each option by (for example) the rMAE of ETS(Z,Z,Z):

errorMeasuresZZZ <- errorMeasures
for(i in 1:3){
    errorMeasuresZZZ[,,i] <- errorMeasuresZZZ[,,i] / errorMeasures[,,"ETS(Z,Z,Z)"]
}
exp(apply(log(errorMeasuresZZZ),c(2,3),mean))

            CES ETS(Z,Z,Z) ETS(Z,X,Z)
rMAE  0.7205050          1  0.9707448
rRMSE 0.7270968          1  0.9706352

With these measures, we can say that CES is approximately 28% more accurate than ETS(Z,Z,Z) both in terms of MAE and RMSE. Also, the exclusion of the multiplicative trend in ETS leads to an improvement in accuracy of around 3% for both MAE and RMSE.

We can also analyse the distributions of the error measures, which can sometimes give additional information about the performance of the models. The simplest thing to do is to produce boxplots:

boxplot(errorMeasures[,1,])
abline(h=1, col="grey", lwd=2)
points(exp(apply(log(errorMeasures[,1,]),2,mean)),col="red",pch=16)

Boxplot of rMAE for a subset of time series from the M3

Given that the error measures have asymmetric distributions, it is difficult to analyse the results. But what we can spot is that the boxplot of CES is located lower than the boxplots of the other two models, which indicates that this model performs consistently better than the others. The grey horizontal line on the plot is the value for the benchmark, which is Naive in our case. Notice that in some cases the models applied to the data do not outperform Naive (there are values above the line), but on average (in terms of the geometric means, the red dots) they do better.

Producing boxplot in log scale might sometimes simplify the analysis:

boxplot(log(errorMeasures[,1,]))
abline(h=0, col="grey", lwd=2)
points(apply(log(errorMeasures[,1,]),2,mean),col="red",pch=16)

Boxplot of rMAE in logarithms for a subset of time series from the M3

The grey horizontal line on this plot still corresponds to Naive, which in the log scale is equal to zero (log(1)=0). In our case this plot does not add any information, but in some cases it might be easier to work in logarithms rather than in the original scale due to the potential magnitude of the positive errors. The only thing we can note is that CES was more accurate than ETS for the first, second and third quartiles, but it seems that there were some cases where it was less accurate than both ETS and Naive (the upper whisker and the outliers).

There are other things that we could do in order to analyse the distribution of error measures more thoroughly. For example, we could run statistical tests (such as the Nemenyi test) in order to see whether the difference between the models is statistically significant or due to randomness. But this is something we will leave for future posts.
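If you want to try this right away, a minimal sketch could look as follows, assuming the nemenyi() function from the tsutils package is installed and reusing the errorMeasures array from the example above (series in rows, models in columns):

library(tsutils)
# compare the models based on their rMAE ranks across the 111 series
nemenyi(errorMeasures[, "rMAE", ])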


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting.


New Version of regtools package


[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

An updated version of my regtools package, with tools for parametric and nonparametric regression, is now on CRAN:

https://cran.r-project.org/package=regtools

It has a number of new functions and datasets. Type vignette("regtools") for an overview.
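If you want to have a look yourself, the usual steps apply:

# install from CRAN, load the package and open the overview vignette
install.packages("regtools")
library(regtools)
vignette("regtools")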


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.



Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks


[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately I often press a different key instead of “I” and end up folding all the chunks, getting something like this:

It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer!! The shortcut I accidentally used was Cmd+Option+O, which folds up all chunks. To unfold all chunks, use Cmd+Shift+Option+O.

The full list of RStudio keyboard shortcuts can be found here.


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.


RSwitch 1.4.0 Released


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Swift 5 has been so much fun to hack on that there's a new update to RSwitch, the macOS R-focused menubar utility. Along with the app comes a new dedicated RSwitch landing page and a new user's guide, since it has enough features to warrant such documentation. Here's the new menu:

The core changes/additions include:

  • a reorganized menu (see above)
  • the use of notifications instead of alerts
  • disabling of download menu entries while download is in progress
  • the ability to start new R GUI or RStudio instances
  • the ability to switch to and make running R GUI or RStudio instances active
  • additional “bookmarks” in the reorganized web resources submenu
  • a built-in check for updates

To make RSwitch launch at startup, just add it as a login item to your user in the “Users & Groups” pane of “System Preferences”.

The guide has information on how all the existing and new features work, plus documentation on how to install the alternate R versions available at the R for macOS Developer's Page. There's also a slightly expanded set of information on how to contribute to RSwitch development.

FIN

As usual, kick the tyres, file feature requests or bug reports where you’re comfortable, & — if you’re macOS-dev-curious — join in the Swift 5 fun (it really is a pretty fun language).


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Click n click (GC606J1)


[This article was first published on geocacheR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Today's puzzle is a single part of a mystery that comprises several smaller puzzles, styled on the cult classic TV show The Crystal Maze. To get the northings, there is a "mental" test consisting of two easy puzzles that I won't go into here. The westings are the more interesting part, with a "skill" test.

The skill test is a whack-a-mole game, with 60 radio buttons. One radio button is highlighted at random and you need to click it. If you hit, you get a point, and if you miss, you lose a point. After 30 seconds, if your score is 30 or more, you are presented with the coordinates. Now, there’s more than one way to skin this particular cat, and one of them doesn’t require any fast fingers or making an R programme. But I’ll stay silent on that method and instead use the opportunity to demonstrate how sometimes R can be enhanced by invoking scripts in other languages.

R seems to lack a way to read information about pixels on the screen. It’s easy to do if you’re interested in pixels within an image, but in this case we’re not. I’d like to read the pixels on the screen to detect which radio button is lit, then direct the mouse to click there. So I cribbed a small python script from somewhere or other to plug the gap of detecting the colour:

def get_pixel_colour(i_x, i_y):
    import win32gui
    i_desktop_window_id = win32gui.GetDesktopWindow()
    i_desktop_window_dc = win32gui.GetWindowDC(i_desktop_window_id)
    long_colour = win32gui.GetPixel(i_desktop_window_dc, i_x, i_y)
    i_colour = int(long_colour)
    return (i_colour & 0xff), ((i_colour >> 8) & 0xff), ((i_colour >> 16) & 0xff)

The reticulate package allows us to take a python function and “source” it, loading it as though it is an R function. It’s really painless. KeyboardSimulator can send key presses and mouse clicks to the OS. And those two packages do most of the hard work. Here’s the code:

library(KeyboardSimulator)
library(reticulate)
library(magrittr)

## the source_python takes a few seconds to run, and we're against the clock in this game
## so the first run will stop after it's in memory, and the second run will ignore this block
if (exists("runBefore")==0) {
  runBefore <- 1
  source_python("C:/Users/alunh/OneDrive/Documents/Repos/geo2/getPixelColour.py")
  stop("python script loaded!")
}

## the position of the radio buttons on my screen
w <- 26
h <- 21
l <- 1324
t <- 290

## the target radio button is dark grey and the non-targets are pale.
## Choose r+g+b < 500 to identify the target
start <- Sys.time()
repeat {
  for (x in sample(10)) {
    for (y in sample(6)) {
      ## here's the python function!
      p <- get_pixel_colour(as.integer(l+w*(x-1)), as.integer(t+h*(y-1))) %>%
        unlist()
      if (sum(p) < 500) {
        mouse.move(l + w * (x-1), t+h * (y-1))
        mouse.click()
      }
    }
  }
  ## quit after 32 seconds
  if ((Sys.time() - start) > 32) break
}

So how did our R/python hybrid do? Remember, the target was 30 correct clicks within 30 seconds, without errors. Using a mouse, I managed 23, and using a touchscreen device I made it to 34.

(Final coordinates redacted)

51! Not too shabby! Doubtless something bigger and better could be done to improve this score, but I’m very satisfied with my first foray into sourcing python from R.

It’s not often that R has a gap in its functionality, and there might well be a way of doing this without resorting to a python script. Please let me know if you have a better way by using the comment section below.


To leave a comment for the author, please follow the link and comment on their blog: geocacheR.


Studying Politics on and with Wikipedia


[This article was first published on R on Methods Bites, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The online encyclopedia Wikipedia, together with its sibling, the collaboratively edited knowledge base Wikidata, provides incredibly rich yet largely untapped sources for political research. In this Methods Bites Tutorial, Denis Cohen and Nick Baumann offer a hands-on recap of Simon Munzert’s (Hertie School of Governance) workshop materials to show how these platforms can inform research on public attention dynamics, policies, political and other events, political elites, and parties, among other things.

After reading this blog post and engaging with the applied exercises, readers should:

  • be able to collect Wikipedia data and Wikidata items using R
  • be able to conduct explorative analyses of Wikipedia data using R
  • have a basic intuition of the potentials and limitations of using Wikipedia data in research projects

Note: This blog post provides a summary of Simon’s workshop in the MZES Social Science Data Lab with some adaptations. Simon’s original workshop materials, including slides and scripts, are available from our GitHub.

Wikipedia for Political Research

According to its website, “Wikipedia […] is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content”. As of July 2019, it comprises more than 48 million articles and is ranked sixth in the list of the most frequently visited websites.

Wikipedia harbors numerous types of data. These include article contents as well as meta information such as pageviews, clickstreams, links and backlinks, or edits and revision histories. Additionally, Wikipedia's sibling, the collaboratively edited document-oriented database Wikidata, provides access to over 58 million data items (as of July 2019). Given the broad collection of articles on politicians and institutions from all over the world, Wikipedia offers tremendous potential for (comparative) political research.

In what follows, we will introduce the functionalities of various R packages, including WikipediR, WikidataR, and pageviews. In doing so, we will showcase how to connect to Wikipedia and Wikidata APIs, how to efficiently access and parse content, and how to process the retrieved data in order to address various questions of substantive interest. We will also provide an overview of the legislatoR package, a fully relational individual-level data package that comprises political, sociodemographic, and Wikipedia-related data on elected politicians from various consolidated democracies.

Code: R packages used in this tutorial

## Packages
pkgs <- c(
  "devtools",
  "ggnetwork",
  "igraph",
  "intergraph",
  "tidyverse",
  "rvest",
  "devtools",
  "magrittr",
  "plotly",
  "RColorBrewer",
  "colorspace",
  "lubridate",
  "networkD3",
  "pageviews",
  "readr",
  "wikipediatrend",
  "WikipediR",
  "WikidataR")

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% installed.packages())], install.packages)

## Load all packages to library
lapply(pkgs, library, character.only = TRUE)

## legislatoR
devtools::install_github("saschagobel/legislatoR")
library(legislatoR)

Collecting and Analyzing Wikipedia Data

Application 1: Using Pageviews to Measure Public Attention

Pageviews measure the aggregate number of clicks for a given Wikipedia article. Data on pageviews can be collected from different sources. First, this interactive tool provides summary data which allows users to compare various search items’ popularity in a specified period. Secondly, Wikimedia Downloads, a collection of archived Wikimedia wikis, offers pageviews data through August 2016 as well as data using a new pageviews definition from May 2015 onward.

The code chunk below demonstrates how to collect and graphically display pageviews data using the pageviews package. We use the command article_pageviews(), where the argument project = "en.wikipedia" specifies that we want to collect pageviews of article = "Donald Trump" from the English Wikipedia. We can only restrict our query to a given language edition; it is not possible to limit queries to pageviews from a specific country. We also specify the argument user_type = "user", which ensures that we exclude pageviews generated by bots and spiders. Finally, start and end define the period on which we want to collect pageviews data: July 2015 to May 2017. We proceed analogously for article = "Hillary Clinton".

Code: Pageviews Data Collection

# get pageviews
trump_views <-
  article_pageviews(
    project = "en.wikipedia",
    article = "Donald Trump",
    user_type = "user",
    start = "2015070100",
    end = "2017050100"
  )
head(trump_views)

clinton_views <-
  article_pageviews(
    project = "en.wikipedia",
    article = "Hillary Clinton",
    user_type = "user",
    start = "2015070100",
    end = "2017050100"
  )

This query allows us to retrieve the pageviews for both Trump’s and Clinton’s Wikipedia articles by date. We can then plot the frequencies of pageviews over time to identify trends in search behaviour. As we can see, the data indicate that Trump attracted considerably more attention than Clinton throughout the 2016 election campaign.

Code: Plotting Pageviews

# Plot pageviews
plot(ymd(trump_views$date), trump_views$views, col = "red", type = "l",
     xlab="Time", ylab="Pageviews")
lines(ymd(clinton_views$date), clinton_views$views, col = "blue")
legend("topleft", legend=c("Trump","Clinton"), cex=.8, col=c("red","blue"), lty=1)

Application 2: Using Article Links to Create a Network Graph of German MPs

The WikipediR package is a wrapper for the MediaWiki API that can be used to retrieve page contents as well as metadata for articles and categories, e.g. information about users or page edit histories. The functionality of the package includes:

  • page_content(): Retrieve current article versions (HTML and wikitext as possible output formats)
  • revision_content(): Retrieve older versions of the article; this also includes metadata about the revision history
  • page_links(): Retrieve outgoing links from the page’s content (which Wikipedia articles does the page link to?)
  • page_backlinks(): Retrieve incoming links (which Wikipedia articles link to the page?)
  • page_external_links(): Retrieve outgoing links to external sites
  • page_info(): Page metadata
  • categories_in_page(): What categories is a given page in?
  • pages_in_category(): What pages are in a given category?

For our application, we use the page_links() function to extract mutual referrals between the articles on members of the 2017-2021 German Bundestag. We can then use this information to create a network graph of current German MPs. First, we use the legislatoR package to retrieve a list of all German MPs of the 2017-2021 German Bundestag, including information on their page IDs and page titles in the German Wikipedia. Using this information, we then extract all page_links() in every MP’s Wikipedia articles. The third step identifies the subset of links for every MP that link to the Wikipedia article of another current MP.

This allows us to finally plot an interactive network using the forceNetwork() command from the networkD3 package. We can save the interactive network graph as an HTML widget, which is included below.

Code: Creating an Interactive Network Graph Based on Article Links

## step 1: get info about legislators
dat <- semi_join(
  x = get_core(legislature = "deu"),
  y = filter(get_political(legislature = "deu"), session == 19),
  by = "pageid")

## step 2: get page links (max 500 links)
if (!file.exists("studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData")) {
  links_list <- list()
  for (i in 1:nrow(dat)) {
    links <-
      page_links(
        "de",
        "wikipedia",
        page = dat$wikititle[i],
        clean_response = TRUE,
        limit = 500,
        namespaces = 0
      )
    links_list[[i]] <- lapply(links[[1]]$links, "[", 2) %>% unlist
  }
  save(links_list, file = "studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData")
} else{
  load("studying-politics-wikipedia/data/wikipediR/mdb_links_list.RData")
}

## step 3: identify links between MPs
# loop preparation
connections <- data.frame(from = NULL, to = NULL)
# loop
for (i in seq_along(dat$wikititle)) {
  links_in_pslinks <-
    seq_along(dat$wikititle)[str_replace_all(dat$wikititle, "_", " ") %in%
                               links_list[[i]]]
  links_in_pslinks <- links_in_pslinks[links_in_pslinks != i]
  connections <-
    rbind(connections,
          data.frame(
            from = rep(i - 1, length(links_in_pslinks)), # -1 for zero-indexing
            to = links_in_pslinks - 1 # here too
            )
          )
}
# results
names(connections) <- c("from", "to")
# make symmetrical
connections <- rbind(connections,
                     data.frame(from = connections$to,
                                to = connections$from))
connections <- connections[!duplicated(connections), ]

## step 4: visualize connections
connections$value <- 1
nodesDF <- data.frame(name = dat$name, group = 1)
network_out <-
  forceNetwork(
    Links = connections,
    Nodes = nodesDF,
    Source = "from",
    Target = "to",
    Value = "value",
    NodeID = "name",
    Group = "group",
    zoom = TRUE,
    opacityNoHover = 3,
    height = 360,
    width = 636
  )


Using the underlying connections data set, we can also identify which members of the German parliament have the most connections to others. Perhaps unsurprisingly, we see the German chancellor Angela Merkel on top of the list, followed by a list of current and former federal ministers and (deputy) party leaders.

Code: Top 10 MPs by Connections Counts

nodesDF$id <- as.numeric(rownames(nodesDF)) - 1
connections_df <-
  merge(connections,
        nodesDF,
        by.x = "to",
        by.y = "id",
        all = TRUE)
to_count_df <- count(connections_df, name)
arrange(to_count_df, desc(n))
## # A tibble: 712 x 2
##    name                     n
##    <chr>                <int>
##  1 Angela Merkel           59
##  2 Andrea Nahles           40
##  3 Heiko Maas              38
##  4 Katarina Barley         38
##  5 Peter Altmaier          38
##  6 Wolfgang Schäuble       38
##  7 Wolfgang Kubicki        37
##  8 Hans-Peter Friedrich    34
##  9 Hermann Gröhe           34
## 10 Ursula von der Leyen    33
## # ... with 702 more rows
Application 3: Using Clickstream Data to Analyze Referral Patterns

Wikipedia articles usually “provide links designed to guide the user to related pages with additional information”. This allows us to collect clickstream data. Clickstreams yield information on the incoming and outgoing traffic of articles. They capture the articles that refer users to a given article as well as the links within a given article that users click to navigate to other articles. Clickstream data are inherently dyadic: Observations represent referral patterns for article-pairs (previous site → current site). Thus, our quantity of interest is the cumulated number of times this pattern was observed in a given period of time.

Clickstream data are offered as monthly aggregate counts for the major Wikipedia language editions. To obtain the data, we first have to download the raw clickstream data from this page, where they are offered as compressed files. After extracting the files, we can load them into R.

In the example below, we focus on two party groups of the 8th (2014-2019) European Parliament: the euroskeptic EFDD (Europe of Freedom and Direct Democracy) and the far right ENF (Europe of Nations and Freedom). In particular, we are interested in clickstreams between the two party groups, between the party groups and their member parties, and between the individual member parties.

Toward this end, we download clickstream data from the English Wikipedia for May 2019, the month of the 2019 European Parliament elections. We identify 19 articles of interest and store them in the object articles. Having retrieved and extracted the clickstream data from May 2019, we import the TSV file into R using read.table(). Lastly, we subset the data to observations that involve referrals between all available article-pairs of the 19 articles.

Code: Collecting and Processing Clickstream Data

# retrieve article titles of interest
enf <- "Europe_of_Nations_and_Freedom"
efdd <- "Europe_of_Freedom_and_Direct_Democracy"
enf_parties <- c(
  "Freedom_Party_of_Austria",
  "Vlaams_Belang",
  "National_Rally_(France)",
  "The_Blue_Party_(Germany)",
  "Lega_Nord",
  "Party_for_Freedom",
  "Congress_of_the_New_Right")
efdd_parties <- c(
  "Svobodní",
  "The_Patriots_(France)",
  "Debout_la_France",
  "Alternative_for_Germany",
  "Five_Star_Movement",
  "Order_and_Justice",
  "Liberty_(Poland)",
  "Brexit_Party",
  "Social_Democratic_Party_(UK,_1990–present)",
  "Libertarian_Party_(UK)")
articles <- c(enf, efdd, enf_parties, efdd_parties)

# import raw clickstream data
cs <-
  read.table(
    "clickstream-enwiki-2019-05.tsv",
    header = FALSE,
    col.names = c("prev", "curr", "type", "n"),
    fill = TRUE,
    stringsAsFactors = FALSE
  )
cs$n <- as.integer(cs$n)

# subset
cs <- subset(cs, prev %in% articles & curr %in% articles)

Next, we aim to analyze aggregate referral patterns. We first assign both previous (prev) and current (curr) articles to one of four categories: Articles on the EFDD and ENF parliamentary groups (one article each), articles on ENF member parties (7 articles), and articles on EFDD member parties (10 articles). We then summarize the data to obtain aggregate referral counts between all category pairs. Lastly, we display these in an interactive Sankey diagram using the plotly package.

Code: Analyzing and Plotting Clickstream Data

# assign categories
cs <- cs %>%
  mutate(
    curr_cat = ifelse(
      curr == enf,
      "ENF Group",
      ifelse(
        curr == efdd,
        "EFDD Group",
        ifelse(curr %in% enf_parties, "ENF Parties",
               "EFDD Parties")
      )
    ),
    prev_cat = ifelse(
      prev == enf,
      "ENF Group",
      ifelse(
        prev == efdd,
        "EFDD Group",
        ifelse(prev %in% enf_parties, "ENF Parties",
               "EFDD Parties")
      )
    )
  )

# summarize data
cs_sum <-
  cs %>%
  group_by(curr_cat, prev_cat) %>%
  summarize(n = sum(n)) %>%
  arrange(prev_cat)

# Sankey diagram using plotly
labels <- c(unique(cs_sum$prev_cat), unique(cs_sum$curr_cat))
colors <- ifelse(grepl("EFDD", labels), "#24B9B9", "#2B3856")

sankey_plot <- plot_ly(
  type = "sankey",
  orientation = "h",

  node = list(
    label = labels,
    color = colors,
    pad = 15,
    thickness = 15,
    line = list(color = "black",
                width = 0.5)
  ),

  link = list(
    source = as.numeric(as.factor(cs_sum$prev_cat)) - 1L,
    target = as.numeric(as.factor(cs_sum$curr_cat)) + 3L,
    value =  cs_sum$n
  ),

  height = 340,
  width = 600) %>%
  layout(font = list(size = 10))


The diagram shows that in our data, clickstream dyads involving the articles on the EFDD and ENF parliamentary groups are much more numerous than dyads involving the member parties. Much of this can be attributed to clickstreams between the two party groups, EFDD ↔ ENF. Whereas clickstreams between members of the same parliamentary group are also fairly frequent, clickstreams between the member of one group to a member of the respective other group are rare.

Moving beyond clickstreams between the four categories, we can also visualize the full network structure of all individual articles in our data. The code below starts with some preparatory data management and then uses the igraph package to create the network and to customize its graphical display.

In the final section of the code, we use the intergraph, ggnetwork and plotly packages to produce an interactive HTML5-compatible figure for this blog post. On your own machine, you may skip this section and simply use plot.igraph() on cs_net without transforming the object to a ggplot friendly format.

Code: Interactive Network Graph

# construct edges
cs_edge <-
  cs %>%
  group_by(prev, curr, prev_cat, curr_cat) %>%
  dplyr::summarise(weight = sum(n)) %>%
  arrange(curr)

# get list of unique articles to construct as nodes
cs_node <-
  gather(cs_edge,
         `prev`,
         `curr`,
         key = "where",
         value = "article") %>%
  ungroup() %>%
  select(article) %>%
  distinct(article)
names(cs_node) <- c("node")
cs_node$category <-
  ifelse(cs_node$node == enf,
         "ENF Group",
         ifelse(
           cs_node$node == efdd,
           "EFDD Group",
           ifelse(cs_node$node %in% enf_parties, "ENF Parties",
                  "EFDD Parties")
         )
  )

# generate graph
set.seed(3)
cs_net <-
  graph.data.frame(cs_edge,
                   vertices = cs_node,
                   directed = F)
cs_net <-
  igraph::simplify(cs_net, remove.multiple = T, remove.loops = T)

# generate colors based on category
V(cs_net)$color <-
  ifelse(grepl("EFDD", V(cs_net)$category), "#24B9B9", "#2B3856")

# compute node degrees (#links) and use that to set node size
deg <- igraph::degree(cs_net, mode = "all")
V(cs_net)$size <- deg / 10

# set labels
V(cs_net)$label <- NA
V(cs_net)$label.cex = 0.5
V(cs_net)$label = ifelse(igraph::degree(cs_net) > 5, V(cs_net)$label, NA)
cs_hc_labels <- as.vector(cs_node$node)

# set edge width based on weight
E(cs_net)$width <- log(E(cs_net)$weight) / 5
E(cs_net)$edge.color <- "gray80"

# transform the network to a ggplot friendly format
# (required to generate interactive graph embedded in blog post)
gg_cs_net <-
  ggnetwork(
    cs_net,
    layout = "fruchtermanreingold",
    weights = "weight",
    niter = 50000,
    arrow.gap = 0
  )

cs_plot <- ggplot(gg_cs_net, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(aes(color = edge.color), size = 0.4, alpha = 0.25) +
  geom_nodes(aes(color = color, size = size)) +
  geom_nodetext(aes(color = color, label = vertex.names, cex = 0.6)) +
  guides(size=FALSE) +
  theme_blank() +
  theme(legend.position = "none")


In the graph above, node diameters indicate the relative weight (total counts) of each article; node colors indicate whether an articles belongs to the EFDD or ENF. We see that members of the same party group tend to share more clickstreams. The Alternative for Germany (AfD), however, shares many connections with members of the ENF. This makes sense when we consider that the AfD has sought closer cooperation with numerous ENF member parties since 2016 with whom it eventually formed the new far right EP group, Identity and Democracy, in June 2019.

Lastly, a word of caution: one should keep in mind that clickstream counts heavily depend on how prominently (if at all) outgoing links are placed in a given Wikipedia article. Furthermore, raw counts from an isolated subset of clickstreams (as in the examples above) give no information on the importance of a given referral pattern relative to all outgoing referrals of a given article. Users should thus ensure that they use clickstream data in a way that adequately addresses their substantive inquiries.

Collecting Data via Wikidata Queries

Wikidata is a collaboratively edited knowledge base with over 58 million entries as of July 2019. It harbors various types of database items, including text, numerical quantities, coordinates, and images. There are no language editions, but individual entries can have values in different languages.

Wikidata allows users to submit queries using SPARQL, a query language for data stored in RDF (Resource Description Framework) format (see this link). Click here for a brief introduction to SPARQL. While basic queries can be used to answer mundane questions (e.g. “what is the capital city of every member of the European Union, and how many inhabitants live there?”), a targeted combination of related queries can be used for systematic data collection.

Instead of submitting explicit SPARQL queries, the example below uses the WikidataR package to combine various queries in order to collect data on the candidates in the 2019 leadership election of the UK Conservative Party. Suppose we want to retrieve the following information on each candidate:

  • name
  • sex
  • date of birth
  • political experience
  • education
  • official website URL
  • Twitter accout
  • Facebook account

In Wikidata, entries are stored as items with a unique item ID that starts with “Q”. For instance, the item 2019 Conservative Party (UK) leadership election is stored as “Q30325756”. Items are characterized by a number of statements or claims. Claims start with “P” and detail an item’s properties. For instance, the claim “candidate” is stored as “P726”. Claims have values, which may once again be items. For example, the values of claim “P726” (candidate) of item “Q30325756” (2019 UK Conservative Party leadership election) are 10 items: one entry for each of the 10 candidates running in the leadership election. Take, for example, winning candidate Boris Johnson, who is listed as a candidate under claim “P726”. In turn, the entry on Boris Johnson is stored as item “Q180589”. This item is characterized by numerous claims, including “P1559” (name in native language), “P21” (sex or gender), “P569” (date of birth), “P39” (positions held), “P69” (educated at), “P856” (official website), “P2002” (Twitter username), and “P2013” (Facebook ID).
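As a small illustration of this item/claim logic, a single claim for a single item can be pulled like this (a minimal sketch using the get_item() and extract_claims() functions that also appear in the code below):

library(WikidataR)
# item Q180589 is the entry on Boris Johnson; claim P569 is the date of birth
johnson <- get_item("Q180589", language = "en")
dob <- extract_claims(johnson, claims = "P569")
dob[[1]][[1]]$mainsnak$datavalue$value$time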

In order to collect the data for all 10 candidates in the 2019 Conservative Party leadership election, the code chunk below implements the following steps:

  1. We retrieve item "Q30325756", i.e., the entry for the 2019 Conservative Party (UK) leadership election
  2. We extract claims "P726" of the above item to retrieve the item IDs of all 10 candidates, which we store in the object candidates
  3. We save the IDs of the claims of interest, stored in the object claims
  4. We then use some nested sapply commands to do the following:
    • Retrieve the item (entry) for each candidate
    • Extract the eight claims from each candidate item
    • Process the informational value of each extracted claim, depending on whether the claim value is
      • an atomic object (such as web site URLs)
      • a textual object with auxiliary information (such as names, which come with language information)
      • a time/date (such as date of birth)
      • yet another item (such as previous positions, where each position has an own data base entry)

Code: Retrieving Items and Claims from Wikidata

# get item based on item id
uk_item <- get_item("Q30325756", language = "en")

# extract candidates
candidates <- extract_claims(uk_item, claims = "P726")
candidates <- candidates[[1]][[1]]$mainsnak$datavalue$value$id

# collect the following attributes ("claims") for each candidate
claims <- c("P1559", "P21", "P569", "P39", "P69", "P856", "P2002", "P2013")
names(claims) <- c("nam", "sex", "dob", "exp", "edu", "web", "twi", "fbk")
claims
##     nam     sex     dob     exp     edu     web     twi     fbk 
## "P1559"   "P21"  "P569"   "P39"   "P69"  "P856" "P2002" "P2013"
# retrieve data
uk_data <-
  sapply(candidates,
         function (item) {
           tmp_item <- get_item(item, language = "en")
           sapply(claims,
                  function(claim) {
                    tmp_claim <- extract_claims(tmp_item, claim)[[1]][[1]]
                    if (any(is.na(tmp_claim))) {
                      return(NA)
                    } else {
                      tmp_claim <- tmp_claim$mainsnak$datavalue$value
                      if (is.atomic(tmp_claim)) {
                        return(tmp_claim)
                      } else if ("text" %in% names(tmp_claim)) {
                        return(tmp_claim$text)
                      } else if ("time" %in% names(tmp_claim)) {
                        tmp_claim <- as.Date(substr(tmp_claim$time, 2, 11))
                        return(tmp_claim)
                      } else if ("id" %in% names(tmp_claim)) {
                        tmp_claim <- tmp_claim$id
                        tmp_claim <-
                          sapply(tmp_claim,
                                 get_item,
                                 language = "en",
                                 simplify = FALSE,
                                 USE.NAMES = TRUE)
                        tmp_claim <-
                          sapply(tmp_claim,
                                 function (x) {
                                   x[[1]]$labels$en$value
                                 })
                        return(tmp_claim)
                      }
                    }
                  },
                  simplify = FALSE,
                  USE.NAMES = TRUE)
         },
         simplify = FALSE,
         USE.NAMES = TRUE
  )

The retrieved data are stored in a nested list. At the upper level of the list, we have the ten candidates, named with their respective item IDs. Nested within each of the ten upper-level elements, we have the values of the eight claims, named with the labels we specified above. Claim values may either be atomic (such as date of birth) or vectors (such as “positions held”, which may have multiple entries). Below, we can see the retrieved data for the first candidate on the list, winning candidate Boris Johnson.

Output: Retrieved Data for Boris Johnson

## $nam
## [1] "Boris Johnson"
## 
## $sex
## Q6581097 
##   "male" 
## 
## $dob
## [1] "1964-06-19"
## 
## $exp
##                                                    Q38931 
##                                         "Mayor of London" 
##                                                  Q1371091 
## "Secretary of State for Foreign and Commonwealth Affairs" 
##                                                 Q28841847 
##       "Member of the Privy Council of the United Kingdom" 
##                                                 Q30524710 
##     "Member of the 57th Parliament of the United Kingdom" 
##                                                 Q30524718 
##     "Member of the 56th Parliament of the United Kingdom" 
##                                                 Q35647955 
##     "Member of the 54th Parliament of the United Kingdom" 
##                                                 Q35921591 
##     "Member of the 53rd Parliament of the United Kingdom" 
##                                                    Q14211 
##                    "Prime Minister of the United Kingdom" 
##                                                  Q3303456 
##                        "Leader of the Conservative Party" 
## 
## $edu
##                         Q192088                         Q805285 
##                  "Eton College"               "Balliol College" 
##                        Q4804780                          Q34433 
##          "Ashdown House School"          "University of Oxford" 
##                        Q5413121 
## "European School of Brussels I" 
## 
## $web
## [1] "http://www.boris-johnson.com"
## 
## $twi
## [1] "BorisJohnson"
## 
## $fbk
## [1] "borisjohnson"

legislatoR

legislatoR is a joint project of Sascha Göbel and Simon Munzert. It offers a comprehensive relational individual-level database that provides political, sociodemographic, and other Wikipedia-related data on members of various national parliaments, including all sessions of the Austrian Nationalrat, the German Bundestag, the Irish Dáil, the French Assemblée, and the United States Congress (House and Senate). It currently comprises data on 42,534 elected representatives and holds information for a wide variety of variables, including:

  • sociodemographics (Core)
  • basic political variables (Political)
  • records of individual Wikipedia data, including full revision histories (History)
  • daily user traffic on individual Wikipedia biographies (Traffic)
  • social media handles and website URLs (Social)
  • URLs to individual Wikipedia portraits (Portraits)
  • information on public offices held by MPs (Offices)
  • MPs’ occupations (Professions)
  • IDs that link politicians to other files, databases and websites (IDs)

The figure below, taken from Göbel and Munzert (2019), illustrates the data structure:

The package provides a relational database. This means that all data sets can be joined with the core data set via one of two keys: the Wikipedia page ID or the Wikidata ID, which uniquely identify individual politicians.
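As a minimal sketch of this relational structure (assuming legislatoR and dplyr are loaded), the Core and Traffic components for the German Bundestag can, for example, be merged on the page ID:

library(legislatoR)
library(dplyr)
# join daily page traffic to the core data via the Wikipedia page ID
deu <- left_join(
  x = get_core(legislature = "deu"),
  y = get_traffic(legislature = "deu"),
  by = "pageid"
)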

legislatoR serves the increasing demand for micro-level data on political elites among political scientists, political analysts, and journalists, and offers an accessible and rich collection of data on past and present politicians. The integration of Wikipedia and other web data allows for the inclusion of detailed information on politicians' biographies.

To install the current developmental version from GitHub, we use the devtools package. After installing and loading legislatoR, we can use the ls() command to explore the full functionality of the package.

Code: Installing legislatoR

## Install from GitHub
devtools::install_github("saschagobel/legislatoR")
library(legislatoR)

## View functionality
ls("package:legislatoR")
## [1] "get_core"       "get_history"    "get_ids"        "get_office"    ## [5] "get_political"  "get_portrait"   "get_profession" "get_social"    ## [9] "get_traffic"

Application 1: Social Media Adoption Rates

To retrieve legislatoR data, we first load the entire core data set of a given national parliament using the get_core() command. In the example below, we focus on the German Bundestag. We immediately right_join() the core data set with the political component using get_political(). We then filter() the data to retain the legislative session of interest (here, the most recent session of the Bundestag, 2017-2021).

In the next step, we left_join() this data set with the social component using get_social(). This gives us full information on the social media accounts of all MPs of the 2017-2021 Bundestag. Whenever MPs do not have an account, this is stored as missing information (NA). Using this information, we can calculate the social media adoption rates in the German parliament.

Code: Retrieving legislatoR Data and Calculating Social Media Adoption Rates

## Get social media adoption rates
# get data: Germany
dat_ger <- right_join(
  x = get_core(legislature = "deu"),
  y = filter(
    get_political(legislature = "deu"),
    as.numeric(session) == max(as.numeric(session))
  ),
  by = "pageid")
dat_ger <- left_join(x = dat_ger,
                     y = get_social(legislature = "deu"),
                     by = "wikidataid")
dat_ger$legislature <- "Germany"

dat_ger_sum <- dat_ger %>%
  dplyr::summarize(
    twitter = mean(not(is.na(twitter)), na.rm = TRUE),
    facebook = mean(not(is.na(facebook)), na.rm = TRUE),
    website = mean(not(is.na(website)), na.rm = TRUE),
    session_start = ymd(first(session_start)),
    session_end = ymd(first(session_end)),
    legislature = first(legislature)
  )
##     twitter  facebook  website session_start session_end legislature
## 1 0.7591036 0.6918768 0.640056    2017-10-24  2021-10-24     Germany

Given the inclusion of political variables in our data set, we could think of numerous feasible extensions. For instance, we could look at social media adoption rates by party. Alternatively, we could use the constituency identifiers to add external data on the rurality/urbanity of German electoral districts and analyze whether politicians competing in urban districts are more likely to maintain social media profiles.
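For instance, the by-party breakdown could be sketched as follows, reusing the dat_ger object and the column names from the code above:

# share of MPs with a Twitter account, by party
dat_ger %>%
  group_by(party) %>%
  summarize(twitter = mean(!is.na(twitter)), n = n()) %>%
  arrange(desc(twitter))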

Application 2: Public Attention to Members of the German Bundestag

In the second application, we use pageviews to identify peaks in public attention for MPs over time. This is particularly interesting in the context of politically significant events. For instance, we may want to know about public attention to parliamentarians following scandals, around elections, or during election campaigns. The code below illustrates this logic by averaging daily pageviews across all Wikipedia articles on members of the German Bundestag between July 2015 and December 2017.

Code: Plotting Average Daily Pageviews for German MPs

## Visualize average pageviews data of German MPs
# get data
ger_traffic <- right_join(
  x = get_traffic(legislature = "deu"),
  y = filter(
    get_political(legislature = "deu"),
    session_end >= as.Date("2015-07-01")
  ),
  by = "pageid")
ger_traffic <- left_join(x = ger_traffic,
                         y = get_core(legislature = "deu"),
                         by = "pageid")
ger_traffic <-
  dplyr::select(ger_traffic, pageid, date, traffic, session, party, name)

# aggregate data
ger_traffic$date <- ymd(ger_traffic$date)
ger_traffic_date <- group_by(ger_traffic, date)
ger_traffic_legislators <- group_by(ger_traffic, pageid)
ger_traffic_sum <-
  summarize(ger_traffic_date, mean = mean(traffic, na.rm = TRUE))
ger_traffic_sum <- mutate(
  ger_traffic_sum,
  mean_l1 = lag(mean, 1),
  mean_f1 = lead(mean, 1),
  peak = (mean >= 1.8 * mean_l1 &
            mean > 180))

# identify peaks
ger_traffic_peaks <- filter(ger_traffic_sum, peak == TRUE)
ger_traffic_peaks_df <-
  filter(ger_traffic, date %in% ger_traffic_peaks$date)
ger_traffic_peaks_group <-
  group_by(ger_traffic_peaks_df, date) %>%
  dplyr::arrange(desc(traffic)) %>%
  filter(row_number() == 1)
ger_traffic_peaks_group <- arrange(ger_traffic_peaks_group, date)
events_vec <-
  c(
    "deceased",
    "drug affair",
    "bullying affair",
    "candidacy for presidency",
    "deceased",
    "???",
    "chancellorship announcement",
    "elected president",
    "???",
    "policy success",
    "TV debate",
    "general election",
    "threat to resign",
    "elected speaker of parliament"
  )

# plot
par(oma = c(0, 0, 0, 0))
par(mar = c(0, 4, 0, .5))
par(yaxs = "i", xaxs = "i", bty = "n")
layout(matrix(c(1, 1, 3, 2, 2, 3), 2, 3, byrow = TRUE),
       heights = c(1, 2, 3),
       widths = c(5, 5, 1.8))

# names labels
plot(
  ymd(ger_traffic_sum$date),
  rep(0, length(ger_traffic_sum$date)),
  xlim = c(ymd("2015-07-01"), ymd("2018-01-01")),
  xaxt = "n",
  ylim = c(0, 8),
  yaxt = "n",
  xlab = "",
  ylab = "",
  cex = 0)
text(
  ger_traffic_peaks_group$date,
  0,
  ger_traffic_peaks_group$name,
  cex = .75,
  srt = 90,
  adj = c(0, 0))

# pageviews time series
par(mar = c(2, 4, 0, .5))
plot(
  ymd(ger_traffic_sum$date),
  ger_traffic_sum$mean,
  type = "l",
  ylim = c(0, 1.25 * max(ger_traffic_sum$mean)),
  xlim = c(ymd("2015-07-01"), ymd("2018-01-01")),
  xaxt = "n",
  yaxt = "n",
  xlab = "",
  ylab = "mean(pageviews)",
  col = "white")
abline(h = seq(0, 1.5 * max(ger_traffic_sum$mean), 250), col = "lightgrey")
lines(ymd(ger_traffic_sum$date), ger_traffic_sum$mean, lwd = .5)
dates <- seq(ymd("2015-07-01"), ymd("2018-01-01"), by = 1)
axis(1, dates[day(dates) == 1 &
                month(dates) %in% c(1, 4, 7, 10)], labels = FALSE)
axis(1,
     dates[day(dates) == 1 &
             month(dates) %in% c(1)],
     lwd = 0,
     lwd.ticks = 3,
     labels = FALSE)
axis(1,
     dates[day(dates) == 1 &
             month(dates) %in% c(7)],
     labels = as.character(year(dates[day(dates) == 15 &
                                        month(dates) %in% c(7)])),
     tick = F,
     lwd = 0)
axis(2, seq(0, 1.5 * max(ger_traffic_sum$mean), 250), las = 2)

# events labels in time series
for (i in seq_along(events_vec)) {
  text(ger_traffic_peaks_group$date[i],
       ger_traffic_sum$mean[ger_traffic_sum$peak == TRUE][i] + 80,
       i,
       cex = .8)
  points(
    ger_traffic_peaks_group$date[i],
    ger_traffic_sum$mean[ger_traffic_sum$peak == TRUE][i] + 80,
    pch = 1,
    cex = 2.2
  )
}

# election date
# events labels explained
par(mar = c(0, 0, 0, 0))
plot(
  0,
  0,
  xlim = c(0, 5),
  ylim = c(0, 10),
  xaxt = "n",
  yaxt = "n",
  xlab = "",
  ylab = "",
  cex = 0)
positions <-
  data.frame(
    events_xpos = 0.45,
    events_ypos = seq(6.5, (6.5 - .5 * length(events_vec)),-.5),
    text_xpos = .5
  )
text(0, 7, "Events", pos = 4, cex = .75)
for (i in seq_along(events_vec)) {
  text(positions$events_xpos[i], positions$events_ypos[i], i, cex = .8)
  points(positions$events_xpos[i],
         positions$events_ypos[i],
         pch = 1,
         cex = 2.2)
  text(
    positions$text_xpos[i],
    positions$events_ypos[i],
    events_vec[i],
    pos = 4,
    cex = .75
  )
}

The plot shows 14 notable spikes in daily pageviews. We identify which politician's article generated the most traffic during each of these events. Furthermore, we add a legend that lists salient political events that likely caused these spikes in attention to MPs' Wikipedia entries. For example, spike 14 marks the day on which Wolfgang Schäuble (CDU) was elected speaker of the parliament.

Conclusion

Collecting and analyzing Wikipedia data is relatively easy and entirely free. It enables researchers to use and analyze an enormous body of data that offers valuable information for research in political science and beyond. Tools that facilitate the collection, processing, and analysis of Wikipedia data advance rapidly, broadening the realm of possibilities for scientific research. Political science research is increasingly picking up on these developments, as is evident in recent contributions (Munzert 2015; Göbel and Munzert 2018; Shi et al. 2019) and software (such as legislatoR).

However, using Wikipedia data may also come with limitations and pitfalls. As entries can be read and edited by both humans and machines, the accuracy of contents and the validity of metadata are not guaranteed. With respect to the latter, Wikidata adds provenance information to all the data. These can be used to evaluate the validity of the data in question for applied research. Researchers should also keep in mind that Wikipedia data highly depends on user-driven creation, editing, and use of contents. This may not only lead to systematic selection bias due to data availability but also induce problems of equivalence of data points (e.g., articles on historical political figures likely receive fewer views and edits than articles on active politicians for reasons unrelated to their legislative activity or real-world importance).

These caveats are however all but exclusive to Wikipedia data. They merely underline that Wikipedia data is no exception when it comes to the general necessity of thoroughly scrutinizing and critically assessing the suitability of any given data for addressing substantive research questions.

About the Presenter

Simon Munzert is a lecturer in Political Data Science at the Hertie School of Governance in Berlin, Germany. A former member of the MZES Data and Methods Unit, Simon founded the Social Science Data Lab in 2016. His research focuses on public opinion, political representation, and the role of new media for political processes.

References

Göbel, Sascha, and Simon Munzert. 2018. “Political Advertising on the Wikipedia Marketplace of Information.” Social Science Computer Review 36 (2): 157–75. https://doi.org/10.1177/0894439317703579.

———. 2019. “legislatoR: Political, sociodemographic, and Wikipedia-related data on political elites.” https://github.com/saschagobel/legislatoR.

Munzert, Simon. 2015. “Using Wikipedia Article Traffic Volume to Measure Public Issue Attention.” Working Paper. https://github.com/simonmunzert/workingPapers/blob/master/wikipedia-salience-v3.pdf.

Shi, Feng, Misha Teplitskiy, Eamon Duede, and James A. Evans. 2019. “The wisdom of polarized crowds.” Nature Human Behaviour 3 (4): 329–36. https://doi.org/10.1038/s41562-019-0541-6.


To leave a comment for the author, please follow the link and comment on their blog: R on Methods Bites.


Maximum likelihood estimation from scratch


[This article was first published on R on Alejandro Morales' Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Maximum likelihood estimates of a distribution

Maximum likelihood estimation (MLE) is a method to estimate the parameters of a random population given a sample. I described what this population means and its relationship to the sample in a previous post.

Before we can look into MLE, we first need to understand the difference between probability and probability density for continuous variables. Probability density can be seen as a measure of relative probability, that is, values located in areas with higher probability will have higher probability density. More precisely, probability is the integral of probability density over a range. For example, the classic “bell-shaped” curve associated with the Normal distribution is a measure of probability density, whereas probability corresponds to the area under the curve for a given range of values:
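
The figure is not reproduced here, but the relationship is easy to check numerically. As a small sketch (my addition, not part of the original post), the area under the standard Normal density between -1 and 1 matches the probability obtained from the cumulative distribution function:

# probability as the integral of the density over a range
integrate(dnorm, lower = -1, upper = 1)$value
## [1] 0.6826895
pnorm(1) - pnorm(-1)
## [1] 0.6826895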

If we assign a statistical model to the random population, any particular value (let’s call it \(x_i\)) sampled from the population will have a probability density according to the model (let’s call it \(f(x_i)\)). If we then assume that all the values in our sample are statistically independent (i.e. the probability of sampling a particular value does not depend on the rest of the values already sampled), then the likelihood of observing the whole sample (let’s call it \(L(x)\)) is defined as the product of the probability densities of the individual values (i.e. \(L(x) = \prod_{i=1}^{i=n}f(x_i)\) where \(n\) is the size of the sample).

For example, if we assume that the data were sampled from a Normal distribution, the likelihood is defined as:

\[ L(x) = \prod_{i=1}^{i=n}\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{\left(x_i - \mu \right)^2}{2\sigma^2}} \]

Note that \(L(x)\) does not depend on \(x\) only, but also on \(\mu\) and \(\sigma\), that is, the parameters in the statistical model describing the random population. The idea behind MLE is to find the values of the parameters in the statistical model that maximize \(L(x)\). In other words, it calculates the random population that is most likely to generate the observed data, while being constrained to a particular type of distribution.

One complication of the MLE method is that, as probability densities are often smaller than 1, the value of \(L(x)\) can become very small as the sample size grows. For example the likelihood of 100 values sampled from a standard Normal distribution is very small:

set.seed(2019)
sample = rnorm(100)
prod(dnorm(sample))
## [1] 2.23626e-58

When the variance of the distribution is small it is also possible to have probability densities higher than one. In this case, the likelihood function will grow to very large values. For example, for a Normal distribution with standard deviation of 0.1 we get:

sample_large = rnorm(100, sd = 0.1)
prod(dnorm(sample_large, sd = 0.1))
## [1] 2.741535e+38

The reason why this is a problem is that computers have a limited capacity to store the digits of a number, so they cannot store very large or very small numbers. If you repeat the code above but using sample sizes of say 1000, you will get 0 or Inf instead of the actual values, because your computer will just give up. Although it is possible to increase the amount of digits to be stored per number, this does not really solve the problem, as it will eventually come back with larger samples. Furthermore, in most cases we will need to use numerical optimization algorithms (see below) which will make the problem even worse. Therefore, we cannot work directly with the likelihood function.
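
To see the failure mode concretely, here is a quick sketch (my addition, not from the original post) with samples of 1000 values; the product of densities underflows to 0 in the first case and overflows to Inf in the second:

set.seed(2019)
prod(dnorm(rnorm(1000)))
## [1] 0
prod(dnorm(rnorm(1000, sd = 0.1), sd = 0.1))
## [1] Inf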

One trick is to use the natural logarithm of the likelihood function instead (\(log(L(x))\)). A nice property is that the logarithm of a product of values is the sum of the logarithms of those values, that is:

\[ \text{log}(L(x)) = \sum_{i=1}^{i=n}\text{log}(f(x_i)) \]

Also, the values of the log-likelihood stay within a numerically manageable range, and its maximum occurs for the same parameter values as the maximum of the likelihood. For example, the likelihood of the first sample generated above, as a function of \(\mu\) (fixing \(\sigma\)) is:

whereas for the log-likelihood it becomes:

Although the shapes of the curves are different, the maximum occurs for the same value of \(\mu\). Note that there is nothing special about the natural logarithm: we could have taken the logarithm with base 10 or any other base. But it is customary to use the natural logarithm as some important probability density functions are exponential functions (e.g. the Normal distribution, see above), so taking the natural logarithm makes mathematical analyses easier.
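
The two curves discussed above can be reproduced with a simple grid search. The following sketch is my addition (it fixes \(\sigma\) at 1 and reuses the sample generated earlier):

# likelihood and log-likelihood of the sample over a grid of candidate mu values
mu_grid = seq(-0.5, 0.5, length.out = 200)
lik = sapply(mu_grid, function(m) prod(dnorm(sample, mean = m, sd = 1)))
loglik = sapply(mu_grid, function(m) sum(dnorm(sample, mean = m, sd = 1, log = TRUE)))
par(mfrow = c(1, 2))
plot(mu_grid, lik, type = "l", xlab = "mu", ylab = "Likelihood")
plot(mu_grid, loglik, type = "l", xlab = "mu", ylab = "Log-likelihood")
# both curves peak at the same mu, close to (but not exactly) 0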

You may have noticed that the optimal value of \(\mu\) was not exactly 0, even though the data was generated from a Normal distribution with \(\mu\) = 0. This is the reason why it is called a maximum likelihood estimator. The source of such deviation is that the sample is not a perfect representation of the population, precisely because of the randomness in the sampling procedure. A nice property of MLE is that, generally, the estimator will converge asymptotically to the true value in the population (i.e. as sample size grows, the difference between the estimate and the true value decreases).

The final technical detail you need to know is that, except for trivial models, the MLE method cannot be applied analytically. One option is to try a sequence of values and look for the one that yields maximum log-likelihood (this is known as grid approach as it is what I tried above). However, if there are many parameters to be estimated, this approach will be too inefficient. For example, if we only try 20 values per parameter and we have 5 parameters we will need to test 3.2 million combinations.

Instead, the MLE method is generally applied using algorithms known as non-linear optimizers. You can feed these algorithms any function that takes numbers as inputs and returns a number as output and they will calculate the input values that minimize or maximize the output. It really does not matter how complex or simple the function is, as they will treat it as a black box. By convention, non-linear optimizers will minimize the function and, in some cases, we do not have the option to tell them to maximize it. Therefore, the convention is to minimize the negative log-likelihood (NLL).

Enough with the theory. Let’s estimate the values of \(\mu\) and \(\sigma\) from the first sample we generated above. First, we need to create a function to calculate NLL. It is good practice to follow some template for generating these functions. An NLL function should take two inputs: (i) a vector of parameter values that the optimization algorithm wants to test (pars) and (ii) the data for which the NLL is calculated. For the problem of estimating \(\mu\) and \(\sigma\), the function looks like this:

NLL = function(pars, data) {
  # Extract parameters from the vector
  mu = pars[1]
  sigma = pars[2]
  # Calculate Negative Log-Likelihood
  -sum(dnorm(x = data, mean = mu, sd = sigma, log = TRUE))
}

The function dnorm returns the probability density of the data assuming a Normal distribution with given mean and standard deviation (mean and sd). The argument log = TRUE tells R to calculate the logarithm of the probability density. Then we just need to add up all these values (that yields the log-likelihood as shown before) and switch the sign to get the NLL.

We can now minimize the NLL using the function optim. This function needs the initial values for each parameter (par), the function calculating NLL (fn) and arguments that will be passed to the objective function (in our example, that will be data). We can also tune some settings with the control argument. I recommend setting parscale to the absolute value of the initial values (assuming none of the initial values are 0). This setting determines the scale of the values you expect for each parameter and it helps the algorithm find the right solution. The optim function will return an object that holds all the relevant information and, to extract the optimal values for the parameters, you need to access the field par:

mle = optim(par = c(mu = 0.2, sigma = 1.5), fn = NLL, data = sample,
            control = list(parscale = c(mu = 0.2, sigma = 1.5)))
mle$par
##          mu       sigma 
## -0.07332745  0.90086176

It turns out that this problem has an analytical solution, such that the MLE values for \(\mu\) and \(\sigma\) from the Normal distribution can also be calculated directly as:

c(mu = mean(sample), sigma = sd(sample))
##         mu      sigma 
## -0.0733340  0.9054535

There is always a bit of numerical error when using optim, but it did find values that were very close to the analytical ones. Take into account that many MLE problems (like the one in the section below) cannot be solved analytically, so in general you will need to use numerical optimization.

MLE applied to a scientific model

In this case, we have a scientific model describing a particular phenomenon and we want to estimate the parameters of this model from data using the MLE method. As an example, we will use a growth curve typical in plant ecology.

Let’s imagine that we have made a series of visits to a crop field during its growing season. At every visit, we record the days since the crop was sown and the fraction of ground area that is covered by the plants. This is known as ground cover (\(G\)) and it can vary from 0 (no plants present) to 1 (field completely covered by plants). An example of such data would be the following (data belongs to my colleague Ali El-Hakeem):

data = data.frame(t = c(0, 16, 22, 29, 36, 58),
                  G = c(0, 0.12, 0.32, 0.6, 0.79, 1))
plot(data, las = 1, xlab = "Days after sowing", ylab = "Ground cover")

Our first intuition would be to use the classic logistic growth function (see here) to describe this data. However, this function does not guarantee that \(G\) is 0 at \(t = 0\) . Therefore, we will use a modified version of the logistic function that guarantees \(G = 0\) at \(t = 0\) (I skip the derivation):

\[ G = \frac{\Delta G}{1 + e^{k \left(t - t_{h} \right)}} - G_{o} \]

where \(k\) is a parameter that determines the shape of the curve, \(t_{h}\) is the time at which \(G\) is equal to half of its maximum value and \(\Delta G\) and \(G_o\) are parameters that ensure \(G = 0\) at \(t = 0\) and that \(G\) reaches a maximum value of \(G_{max}\) asymptotically. The values of \(\Delta G\) and \(G_o\) can be calculated as:

\[ \begin{align} G_o &= \frac{\Delta G}{1 + e^{k \cdot t_{h}}} \\ \Delta G &= \frac{G_{max}}{1 - 1/\left(1 + e^{k \cdot t_h}\right)} \end{align} \]

Note that the new function still depends on only 3 parameters: \(G_{max}\), \(t_h\) and \(k\). The R implementation as a function is straightforward:

G = function(pars, t) {
  # Extract parameters of the model
  Gmax = pars[1]
  k = pars[2]
  th = pars[3]
  # Prediction of the model
  DG = Gmax/(1 - 1/(1 + exp(k*th)))
  Go = DG/(1 + exp(k*th))
  DG/(1 + exp(-k*(t - th))) - Go
}

Note that rather than passing the 3 parameters of the curve as separate arguments I packed them into a vector called pars. This follows the same template as for the NLL function described above.

Non-linear optimization algorithms always require some initial values for the parameters being optimized. For simple models such as this one we can just try out different values and plot them on top of the data. For this model, \(G_{max}\) is very easy as you can just see it from the data. \(t_h\) is a bit more difficult but you can eyeball it by checking where \(G\) is around half of \(G_{max}\). Finally, the \(k\) parameter has no intuitive interpretation, so you just need to try a couple of values until the curve looks reasonable. This is what I got after a couple of tries:

plot(data, las = 1, xlab = "Days after sowing", ylab = "Ground cover")
curve(G(c(Gmax = 1, k = 0.15, th = 30), x), 0, 60, add = TRUE)

If we want to estimate the values of \(G_{max}\), \(k\) and \(t_h\) according to the MLE method, we need to construct a function in R that calculates NLL given a statistical model and a choice of parameter values. This means that we need to decide on a distribution to represent deviations between the model and the data. The canonical way to do this is to assume a Normal distribution, where \(\mu\) is computed by the scientific model of interest, letting \(\sigma\) represent the degree of scatter of the data around the mean trend. To keep things simple, I will follow this approach now (but take a look at the final remarks at the end of the article). The NLL function looks similar to the one before, but now the mean is set to the predictions of the model:

NLL = function(pars, data) {
  # Values predicted by the model
  Gpred = G(pars, data$t)
  # Negative log-likelihood 
  -sum(dnorm(x = data$G, mean = Gpred, sd = pars[4], log = TRUE))
}

We can now calculate the optimal values using optim and the “eyeballed” initial values (of course, we also need to have an initial estimate for \(\sigma\)):

par0 = c(Gmax = 1.0, k = 0.15, th = 30, sd = 0.01)
fit = optim(par = par0, fn = NLL, data = data,
            control = list(parscale = abs(par0)), hessian = TRUE)
fit$par
##        Gmax           k          th          sd 
##  0.99926603  0.15879585 26.70700004  0.01482376
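
As an aside (my addition, not part of the original post), the hessian = TRUE argument asked optim to also return the matrix of second derivatives of the NLL at the optimum. Since this matrix approximates the observed information, inverting it gives an approximate covariance matrix of the estimates, so rough standard errors can be obtained as:

# approximate standard errors from the inverse of the Hessian of the NLL
se = sqrt(diag(solve(fit$hessian)))
round(se, 4)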

Notice that eyeballing the initial values already got us pretty close to the optimal solution. Of course, for complicated models your initial estimates will not be as good, but it always pays off to play around with the model before going into optimization. Finally, we can compare the predictions of the model with the data:

plot(data, las = 1, xlab = "Days after sowing", ylab = "Ground cover")
curve(G(fit$par, x), 0, 60, add = TRUE)

Final remarks

The model above could have been fitted using the method of ordinary least squares (OLS) with the R function nls. Actually, unless something went wrong in the optimization you should obtain the same results as with the method described here. The reason is that OLS is equivalent to MLE with a Normal distribution and constant standard deviation. However, I believe it is worthwhile to learn MLE because:

  • You do not have to restrict yourself to the Normal distribution. In some cases (e.g. when modelling count data) it does not make sense to assume a Normal distribution. Actually, in the ground cover model, since the values of \(G\) are constrained to be between 0 and 1, it would have been more correct to use another distribution, such as the Beta distribution (however, for this particular data, you will get very similar results so I decided to keep things simple and familiar).

  • You do not have to restrict yourself to modelling the mean of the distribution only. For example, if you have reason to believe that errors do not have a constant variance, you can also model the \(\sigma\) parameter of the Normal distribution. That is, you can model any parameter of any distribution.

  • If you understand MLE then it becomes much easier to understand more advanced methods such as penalized likelihood (aka regularized regression) and Bayesian approaches, as these are also based on the concept of likelihood.

  • You can combine the NLL of multiple datasets inside the NLL function, whereas in ordinary least squares, if you want to combine data from different experiments, you have to correct for differences in scales or units of measurement and for differences in the magnitude of errors your model makes for different datasets.

  • Many methods of model selection (so-called information criteria such as AIC) are based on MLE.

  • Using a function to compute NLL allows you to work with any model (as long as you can calculate a probability density) and dataset, but I am not sure this is possible or convenient with the formula interface of nls (e.g combining multiple datasets is not easy when using a formula interface).
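
Before moving on, here is a rough sketch (my addition, not from the original post) of the nls fit mentioned before the list above, as a check that OLS gives essentially the same estimates. The formula simply writes out the same growth curve and the start values are the ones eyeballed earlier; depending on your data, the start values may need tweaking for nls to converge:

# OLS fit of the same curve; estimates should be close to fit$par[1:3]
fit_nls = nls(G ~ (Gmax/(1 - 1/(1 + exp(k*th)))) *
                  (1/(1 + exp(-k*(t - th))) - 1/(1 + exp(k*th))),
              data = data,
              start = list(Gmax = 1, k = 0.15, th = 30))
coef(fit_nls)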

Of course, if none of the above applies to your case, you may just use nls. But at least now you understand what is happening behind the scenes. In future posts I discuss some of the special cases I gave in this list. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: R on Alejandro Morales' Blog.


Practice using lubridate… THEATRICALLY


[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I am so pleased to now be an RStudio-certified tidyverse trainer! 🎉 I have been teaching technical content for decades, whether in a university classroom, developing online courses, or leading workshops, but I still found this program valuable for my own professional development. I learned a lot that is going to make my teaching better, and I am happy to have been a participant. If you are looking for someone to lead trainings or workshops in your organization, you can check out this list of trainers to see who might be conveniently located for you!

Part of the certification process is delivering a demonstration lesson. I quite like the content of the demonstration lesson I built and I might not use it in an actual workshop anytime soon, so I decided to expand upon it and share it here as a blog post. My demonstration focused on handling dates using lubridate; dates and times are important in data analysis, but they can often be challenging. In this post, we will explore some wild caught date data from the London Stage Database 🎭 and explore how to handle these dates using the lubridate package.

Read in the London Stage Database

Learn more about the London Stage Database, including about the data provenance and code used to build the database. Briefly, it explores the theater scene in London from when playhouses were reopened in 1660 after the English civil wars to the end of the 18th century.


(H/T for this dataset to Data is Plural by Jeremy Singer-Vine, one of the most fun newsletters I subscribe to.)

To start, we are going to download, unzip, and open up the full London Stage Database.

Notes:

  • The chunk below downloads the dataset to the working directory.
  • This is a pretty sizeable dataset, so if you run this yourself, be patient while it opens up!

library(tidyverse)

json_path <- "https://londonstagedatabase.usu.edu/downloads/LondonStageJSON.zip"
download.file(json_path, "LondonStageJSON.zip")
unzip("LondonStageJSON.zip")
london_stage_raw <- jsonlite::fromJSON("LondonStageFull.json") %>%
    as_tibble()

Finding the dates

There are thirteen columns in this data. Let’s take a moment and look at the column names and content of the first few lines. Which of these columns contains the date information?

london_stage_raw
## # A tibble: 52,617 x 13
##    EventId EventDate TheatreCode Season Volume Hathi CommentC TheatreId
##                                
##  1 0       16591029  city        1659-… 1      ""    The … 63       
##  2 1       16591100  mt          1659-… 1      ""    On 23 N… 206      
##  3 2       16591218  none        1659-… 1      ""    Represe… 1        
##  4 3       16600200  mt          1659-… 1      ""    6 Feb. … 206      
##  5 4       16600204  cockpit     1659-… 1      ""    $Thomas… 73       
##  6 5       16600328  dh          1659-… 1      ""    At D… 90       
##  7 6       16600406  none        1659-… 1      ""    ""       1        
##  8 7       16600412  vh          1659-… 1      ""    Edition… 319      
##  9 8       16600413  fh          1659-… 1      ""    The … 116      
## 10 9       16600416  none        1659-… 1      ""    ""       1        
## # … with 52,607 more rows, and 5 more variables: Phase2 ,
## #   Phase1 , CommentCClean , BookPDF , Performances 

The EventDate column contains the date information, but notice that R does not think it’s a date!

class(london_stage_raw$EventDate)
## [1] "character"

R thinks this is a character (dates encoded like "16591029"), because of the details of the data and the type guessing used by the process of reading in this data. This is NOT HELPFUL for us, as we need to store this information as a date type 📆 in order to explore the dates of this London stage data. We will use a function ymd() from the lubridate package to convert it. (There are other similar functions in lubridate, like ymd_hms() if you have time information, mdy() if your information is arranged differently, etc.)
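
A few quick parsing examples (my addition, with made-up inputs; they assume the lubridate package has been loaded as in the chunk below) show the pattern:

ymd("16591029")
## [1] "1659-10-29"
mdy("10/29/1659")
## [1] "1659-10-29"
ymd_hms("2019-08-27 20:30:00")
## [1] "2019-08-27 20:30:00 UTC"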

library(lubridate)

london_stage <- london_stage_raw %>%
    mutate(EventDate = ymd(EventDate)) %>%
    filter(!is.na(EventDate))
## Warning: 378 failed to parse.

Notice that we had some failures here; there were a few hundred dates with a day of 00 that could not be parsed. In the filter() line here, I’ve filtered those out.

What happens now if I check the class of the EventDate column?

class(london_stage$EventDate)
## [1] "Date"

We now have a column of type Date🙌 which is just what we need. In this lesson we will explore what we can learn from this kind of date data.

Getting years and months

This dataset on the London stage spans more than a century. How can we look at the distribution of stage events over the years? The lubridate package contains functions like year() that let us get year components of a date.

year(today())
## [1] 2019

Let’s count up the stage events by year in this dataset.

london_stage %>%
    mutate(EventYear = year(EventDate)) %>%
    count(EventYear)
## # A tibble: 142 x 2
##    EventYear     n
##         
##  1      1659     2
##  2      1660    58
##  3      1661   138
##  4      1662    91
##  5      1663    68
##  6      1664    53
##  7      1665    20
##  8      1666    30
##  9      1667   149
## 10      1668   147
## # … with 132 more rows

Looks to me like there are some big differences year-to-year. It would be easier to see this if we made a visualization.

london_stage %>%
    count(EventYear = year(EventDate)) %>%
    ggplot(aes(EventYear, n)) +
    geom_area(fill = "midnightblue", alpha = 0.8) +
    labs(y = "Number of events",
         x = NULL)

There was a dramatic increase in theater events between about 1710 and 1730. After 1750, the yearly count looks pretty stable.

Do we see month-to-month changes? The lubridate package has a function very similar to year() but instead for finding the month of a date.

london_stage %>%
    ggplot(aes(month(EventDate))) +
    geom_bar(fill = "midnightblue", alpha = 0.8) +
    labs(y = "Number of events")

Wow, that is dramatic! There are dramatically fewer events during the summer months than the rest of the year. We can make this plot easier to read by making a change to how we call the month() function, with label = TRUE.

london_stage %>%
    ggplot(aes(month(EventDate, label = TRUE))) +
    geom_bar(fill = "midnightblue", alpha = 0.8) +
    labs(x = NULL,
         y = "Number of events")

When you use label = TRUE here, the information is being stored as an ordered factor.
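
A tiny check (my addition, using an arbitrary date) makes the point:

# label = TRUE returns an ordered factor, so months sort in calendar order
m <- month(ymd("1700-01-15"), label = TRUE)
class(m)
## [1] "ordered" "factor"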

In this dataset, London playhouses staged the most events in January.

OK, one more! Which day of the week has the most theater events? The lubridate package has a function wday() to get the day of the week for any date. This function also has a label = TRUE argument, like month().

london_stage %>%
    ggplot(aes(wday(EventDate, label = TRUE))) +
    geom_bar(fill = "midnightblue", alpha = 0.8) +
    labs(x = NULL,
         y = "Number of events")

London theaters did not stage events on Sunday or Wednesday. Who knew?!?

Time differences

One of the most challenging parts of handling dates is finding time intervals, and lubridate can help with that!

Let’s look at the individual theaters (tabulated in TheatreId) and see how long individual theaters tend to be in operation.

london_by_theater <- london_stage %>%
    filter(TheatreCode != "none") %>% 
    group_by(TheatreCode) %>%
    summarise(TotalEvents = n(),
              MinDate = min(EventDate),
              MaxDate = max(EventDate),
              TimeSpan = as.duration(MaxDate - MinDate)) %>%
    arrange(-TotalEvents)

london_by_theater
## # A tibble: 233 x 5
##    TheatreCode TotalEvents MinDate    MaxDate   
##                           
##  1 dl                18451 1674-03-26 1800-06-18
##  2 cg                12826 1662-05-09 1800-06-16
##  3 hay                5178 1720-12-29 1800-09-16
##  4 king's             4299 1714-10-23 1800-08-02
##  5 lif                4117 1661-06-28 1745-10-07
##  6 gf                 1832 1729-10-31 1772-10-23
##  7 queen's             884 1705-04-09 1714-06-23
##  8 marly               403 1750-08-16 1776-08-10
##  9 bf                  257 1661-08-22 1767-09-07
## 10 dg                  235 1671-06-26 1706-11-28
## # … with 223 more rows, and 1 more variable: TimeSpan 

We have created a new dataframe here, with one row for each theater. The columns tell us

  • how many theater events that theater had
  • the first date that theater had an event
  • the last date that theater had an event
  • the duration of the difference between those two

A duration is a special concept in lubridate of a time difference, but don’t get too bogged down in this. How did we calculate this duration? We only had to subtract the two dates, and then wrap it in the lubridate function as.duration().
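
For instance (my addition, using the dates from the first row above), subtracting two dates gives a difftime measured in days, and as.duration() converts it to a duration:

span <- ymd("1800-06-18") - ymd("1674-03-26")  # a difftime, in days
as.duration(span)                              # the same interval as a duration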

Look at the data type that was printed out at the top of the column for TimeSpan; it’s not numeric, or integer, or any of the normal data types in R. It says Duration.

What do you think will happen if we try to make a histogram of TimeSpan?

london_by_theater %>% 
    filter(TotalEvents > 100) %>%
    ggplot(aes(TimeSpan)) +
    geom_histogram(bins = 20)
## Error: Incompatible duration classes (Duration, numeric). Please coerce with `as.duration`.

We have an error! 🙀 This “duration” class is good for adding and subtracting dates, but less good once we want to go about plotting or doing math with other kinds of data (like, say, the number of total events). We need to coerce this to something more useful, now that we’re done subtracting the dates.

Data that is being stored as a duration can be coerced with as.numeric(), and you can send another argument to say what kind of time increment you want back. For example, what if we want the number of years that each of these theaters was in operation in this dataset?

london_by_theater %>%    mutate(TimeSpan = as.numeric(TimeSpan, "year"))
## # A tibble: 233 x 5
##    TheatreCode TotalEvents MinDate    MaxDate    TimeSpan
##                               
##  1 dl                18451 1674-03-26 1800-06-18   126.  
##  2 cg                12826 1662-05-09 1800-06-16   138.  
##  3 hay                5178 1720-12-29 1800-09-16    79.7 
##  4 king's             4299 1714-10-23 1800-08-02    85.8 
##  5 lif                4117 1661-06-28 1745-10-07    84.3 
##  6 gf                 1832 1729-10-31 1772-10-23    43.0 
##  7 queen's             884 1705-04-09 1714-06-23     9.20
##  8 marly               403 1750-08-16 1776-08-10    26.0 
##  9 bf                  257 1661-08-22 1767-09-07   106.  
## 10 dg                  235 1671-06-26 1706-11-28    35.4 
## # … with 223 more rows

A number of these theaters had events for over a century!

If we wanted to see the number of months that each theater had events, we would change the argument.

london_by_theater %>%    mutate(TimeSpan = as.numeric(TimeSpan, "month"))
## # A tibble: 233 x 5
##    TheatreCode TotalEvents MinDate    MaxDate    TimeSpan
##                               
##  1 dl                18451 1674-03-26 1800-06-18    1515.
##  2 cg                12826 1662-05-09 1800-06-16    1657.
##  3 hay                5178 1720-12-29 1800-09-16     957.
##  4 king's             4299 1714-10-23 1800-08-02    1029.
##  5 lif                4117 1661-06-28 1745-10-07    1011.
##  6 gf                 1832 1729-10-31 1772-10-23     516.
##  7 queen's             884 1705-04-09 1714-06-23     110.
##  8 marly               403 1750-08-16 1776-08-10     312.
##  9 bf                  257 1661-08-22 1767-09-07    1272.
## 10 dg                  235 1671-06-26 1706-11-28     425.
## # … with 223 more rows

We can use this kind of transformation to see the relationship between the number of events and length of time in operation. Convert the Duration object to a numeric value in months in order to make a plot.

library(ggrepel)

london_by_theater %>%
    mutate(TimeSpan = as.numeric(TimeSpan, "month")) %>%
    filter(TotalEvents > 10) %>%
    ggplot(aes(TimeSpan, TotalEvents, label = TheatreCode)) +
    geom_smooth(method = "lm") +
    geom_label_repel(family = "IBMPlexSans") +
    geom_point() +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Months that theater was in operation",
         y = "Total events staged by theater")

It makes sense that theaters open much longer had many more events, but we can also notice which theaters are particularly high or low in this chart. Theaters high in this chart hosted many events for how long they were in operation, and theaters low in this chart hosted few events for how long they were open.

This plot opens up many more possibilities for exploration, such as whether theaters were in constant operation or took breaks. Further date handling offers the ability to address such questions! Let me know if you have any questions. 📆


To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


RMarkdown Template that Manages Academic Affiliations – docx or PDF output


[This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Background

I like writing my academic papers in RMarkdown because it allows reproducible research. The cleanest way to submit a manuscript made in RMarkdown is using the LaTeX code that it generates using the YAML switch keep_tex = true. A minimalist YAML header would look like so:

---
title: The document title
author: 
  - Duke A Caboom, MD
  - Justin d'Ottawa, PhD:
output: 
  pdf_document:
    keep_tex: true
---

Introduction

However, when you want multiple author affiliations you discover that you can’t do as you would in LaTeX, because Pandoc does not know what to do with the affiliations, and you end up with a disheartening PDF that looks like the output shown in figure 1 below:


Figure 1: This is so sad.

The situation worsens if you want MS-Word output. As those of us in medical fields know, most journals (with some notable exceptions like the Clinical Mass Spectrometry Journal and other Elsevier journals like Clinical Biochemistry and Clinica Chimica Acta) require submission of a document in MS-Word format, which goes against all that Data Science and Reproducible Research stands for (he says, with hyperbole). Parenthetically, it is my hope that since AACC has indicated that they intend to make Data Science a strategic priority for Lab Medicine, they will soon accept submissions to Clinical Chemistry and Journal of Applied Laboratory Medicine written reproducibly in RMarkdown or LaTeX.

In the meantime, here are the workarounds for getting the affiliations to display correctly along with all the other stuff we want, namely, cross referencing of figures and tables and correct reference formatting and abbreviation of journal names. This allows you to avoid the horror of manually fixing your Word document after it is generated from RMarkdown. In any case, let’s start with MS-Word.

Dependencies for MS-Word and the Associated YAML

You will also need to install Pandoc which is the Swiss Army Knife of document conversion. It’s going to turn your code into a .docx file for you. Mac users can do this with Homebrew on the terminal command line:

brew install pandoc
brew install pandoc-citeproc
brew install pandoc-crossref

There are some extra installs required to help Pandoc do its job. Install the prebuilt binaries if you can.

Finally, you need to use some scripts written in the Lua scripting language which means you will need the language itself:

And you will need two Lua scripts:

These are in Pandoc github repository:

You want the files named scholarly-metadata.lua and author-info-blocks.lua.

You will need to choose a .csl file for your journal. This will tell Pandoc how to format the references. You can download the correct .csl file here. You will also need a journal abbreviations database. I have made one for you from the Web of Science list and you can download it here.

You will need to create a .bibtex database which is just your list of references. This can be exported from various reference managers or built by hand. Name the file mybibfile.bib.

Now follow the bouncing ball:

  1. Go to the directory containing your .Rmd file.
  2. Create a directory in it called “Extras”
  3. Put the two Lua scripts, the Bibtex database, the abbreviations database and the .csl file into the “Extras” folder.
  4. If you want to avoid Pandoc’s goofy default .docx formatting, then put this word document in the same folder.

OR

Download the contents of this folder from my github repo that has everything set up as I describe above.

For two authors, your YAML will need to look like this:

title: |
  RMarkdown Template for Managing
    Academic Affiliations 
subtitle: |
  Also Deals with Cross References and
    Reference Abbreviations for MS-Word Output
author:
  - Duke A Caboom, MD:
      email: duke.a.caboom@utuktoyaktuk.edu
      institute: [UofT]
      correspondence: true
  - Justin d'Ottawa, PhD:
      email: justin@neverready.ca
      institute: [UofO]
      correspondence: false
institute:
  - UofT: University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada
  - UofO: University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
abstract: |
  **Introduction**: There's a big scientific problem out there. I know how to fix it.

  **Methods**: My experiments are pure genius.

  **Results**: Now you have your proof.

  **Conclusion**: Give me more grant money.
journal: "An awesome journal"
date: ""
toc: false
output:
  bookdown::word_document2:
    pandoc_args:
      - --csl=Extras/clinical-biochemistry.csl
      - --citation-abbreviations=Extras/abbreviations.json
      - --filter=pandoc-crossref
      - --lua-filter=Extras/scholarly-metadata.lua
      - --lua-filter=Extras/author-info-blocks.lua
      - --reference-doc=Extras/Reference_Document.docx 
bibliography: "Extras/mybibfile.bib"
keywords: "CRAN, R, RMarkdown, RStudio, YAML"

Et voila! Figure 2 shows that we have something reasonable.


Figure 2: This is so great

Dependencies for LaTeX and the Associated YAML

It goes without saying that you need to install LaTeX. LaTeX markup language is available here: Mac, Windows. For Linux, just install from the command line with your package manager. Do a full install with all the glorious bloat of all LaTeX packages. This saves many headaches in the future.

You don’t need the lua scripts for LaTeX although you can use them. The issue with LaTeX is that the .tex template that Pandoc uses for generating LaTeX files does not support author affiliations as described in the Pandoc documentation. So what you need to do is modify the Pandoc LaTeX template. To get your current working copy of the Pandoc LaTeX template open up a terminal (Mac/Linux) and type:

pandoc -D latex > mytemplate.tex

This will push the contents to a file. Move the file to the “Extras” folder discussed above. If that seems difficult, you can also download it here. Now you have to edit it. Open it up in a text editor and find the section that reads:

$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$

Replace this with this code that will invoke the LaTeX authblk package.

$if(author)$
    \usepackage{authblk}
    $for(author)$
        $if(author.name)$
            $if(author.number)$
                \author[$author.number$]{$author.name$}
            $else$
                \author[]{$author.name$}
            $endif$
            $if(author.affiliation)$
                $if(author.email)$
                    \affil{$author.affiliation$ \thanks{$author.email$}}
                $else$
                    \affil{$author.affiliation$}
                $endif$
            $endif$
        $else$
            \author{$author$}
        $endif$
    $endfor$
$endif$

Then make your YAML header look like this:

---
title: |
  RMarkdown Template for Managing
    Academic Affiliations 
subtitle: |
  Also Deals with Cross References and
    Reference Abbreviations for PDF Output
author:
- name: Duke A Caboom, MD
  affiliation: University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada
  email: dtholmes@mail.ubc.ca
  number: 1
- name: Justin d'Ottawa, PhD
  affiliation: University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
  email: justin@neverready.ca
  number: 2
abstract: |
  **Introduction**: There's a big scientific problem out there. I know how to fix it.

  **Methods**: My experiments are pure genius.

  **Results**: Now you have your proof.

  **Conclusion**: Give me more grant money.
toc: false
output: 
  bookdown::pdf_document2:
    pandoc_args:
      - --filter=pandoc-crossref
      - --csl=Extras/clinical-biochemistry.csl
      - --citation-abbreviations=Extras/abbreviations.json
      - --template=Extras/mytemplate.tex
bibliography: "Extras/mybibfile.bib"
keep-latex: true

And as you can see in figure 3, you get a correctly formatted list of authors.


Figure 3: This is also great.

Cross Reference of a Table

Of course, tables can be cross referenced in the same manner as figures. Here is a cross reference to table 1 using the code \@ref(tab:mytable) .

Table 1: A short table

term         estimate  std.error  statistic  p.value
(Intercept)    36.908      2.191     16.847    0.000
hp             -0.019      0.015     -1.275    0.213
cyl            -2.265      0.576     -3.933    0.000
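
For context, a chunk along the following lines could produce a table like Table 1; the chunk label mytable is what \@ref(tab:mytable) points at. The use of broom::tidy() on an mtcars regression here is my assumption for illustration, not something stated in the original post:

```{r mytable, echo=FALSE}
library(broom)
knitr::kable(
  tidy(lm(mpg ~ hp + cyl, data = mtcars)),
  digits = 3,
  caption = "A short table"
)
```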

This Template also Takes Care of Reference Abbreviation.

As usual, you can make a citation with the code [@bibtexname], where bibtexname is the article’s abbreviated handle in your bibtex database. Here is a great resource on the bookdown package [1] and reproducible research [2], and here are references where the journal title is longer [3,4]. The references in your document (and shown below) will have appropriate abbreviations based on the .json abbreviations database I have provided. In this case, I have chosen the .csl file for Clinical Mass Spectrometry, 'cause MSACL.

Other Ways to Skin the YAML Cat

I came across some other ways to deal with this that I did not like as much but they are simpler. Here is one using a footnote.

title: The document title
author:
- [Duke A Caboom, MD]^(University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada)
- [Justin d'Ottawa, PhD]^(University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada)
output: pdf_document

And you can also misuse the date variable:

title: The document title
author:
- Duke A Caboom, MD [1]
- Justin d'Ottawa, PhD [2]
date: 1. University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada \newline 2. University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
output: pdf_document

Conclusion

This concludes my long personal struggle to get a completely reproducible .docx manuscript generated by RMarkdown and Pandoc. Here is the output for PDF and Word.

Parting Thought

Let us not become weary in doing good, for at the proper time we will reap a harvest if we do not give up.

Galatians 6:9

References

[1] Y. Xie, J.J. Allaire, G. Grolemund, R markdown: The definitive guide, Chapman; Hall/CRC, 2018. https://bookdown.org/yihui/bookdown.

[2] R.D. Peng, Reproducible research in computational science, Science. 334 (2011) 1226–1227.

[3] G. Eisenhofer, C. Durán, T. Chavakis, C.V. Cannistraci, Steroid metabolomics: Machine learning and multidimensional diagnostics for adrenal cortical tumors, hyperplasias, and related disorders, Curr. Opin. Endocr. Metab. Res. 8 (2019) 40–49. doi:https://doi.org/10.1016/j.coemr.2019.07.002.

[4] F.B. Vicente, D.C. Lin, S. Haymond, Automation of chromatographic peak review and order to result data transfer in a clinical mass spectrometry laboratory, Clin. Chim. Acta. 498 (2019) 84–89. doi:https://doi.org/10.1016/j.cca.2019.08.004.


To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian.



Combining the power of R and Python with reticulate


[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

R + Py

In the word of R vs Python fights, This is a simple (could be called, naive as well) attempt to show how we can combine the power of Python with R and create a new superpower.

Like this one, if you have watched The Incredibles before!

About this Dataset

This dataset contains a bunch of tweets that came with the tag #JustDoIt after Nike released the ad campaign with Colin Kaepernick that turned controversial.

Dataset source: https://www.kaggle.com/eliasdabbas/5000-justdoit-tweets-dataset

Superstar – Reticulate

The superstar who’s making this possible is the R package reticulate by RStudio.
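
Before the analysis itself, a minimal setup sketch (my addition, not shown in the original post): loading reticulate in an R Markdown document lets R and Python chunks share objects, with Python seeing R objects as r.<name> and R seeing Python objects as py$<name>, which is exactly the pattern used below.

library(reticulate)
# optionally point reticulate at a specific Python installation
# (the path below is just a placeholder)
# use_python("/usr/local/bin/python3")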

Let us start with the code!!

The R Code

# loading required R libraries 
library(tidyverse)
library(ggthemes)
library(knitr)

tweets <- read_csv("https://raw.githubusercontent.com/amrrs/python_plus_r_brug/master/justdoit_tweets_2018_09_07_2.csv")
text <- tweets$tweet_full_text
set.seed(123)
text_10 <- text[sample(1:nrow(tweets), 100)]

The Python Code

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')
doc = nlp(str(r.text_10))
pos_df = pd.DataFrame(columns = ["text","pos","lemma"])
for token in doc:
    df1 = pd.DataFrame({"text" : token.text, "pos" : token.pos_, "lemma" : token.lemma_}, index = [0])
    #print(token.text, token.pos_)
    #print(df1)
    pos_df = pd.concat([pos_df,df1])
#print(pos_df)

Now, Again The R Code

#data.frame(token = as.vector(py$tokens)) %>% count(token) %>% arrange(desc(n))
py$pos_df %>% 
  count(pos) %>% 
  ggplot() + geom_bar(aes(pos, n), stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "POS Tagging",
       subtitle = "NLP using Python space - Graphics using R ggplot2")

Now, Again The Python Code

ent_df = pd.DataFrame(columns = ["text","label"])
for ent in doc.ents:
    df1 = pd.DataFrame({"text" : ent.text, "label" : ent.label_}, index = [0])
    #print(token.text, token.pos_)
    #print(df1)
    ent_df = pd.concat([ent_df,df1])

One Final Time, The R Code

py$ent_df %>% 
  count(label) %>% 
  ggplot() + geom_bar(aes(label, n), stat = "identity") +
  coord_flip() +
  #theme_solarized() +
  theme_fivethirtyeight() +
  labs(title = "Entity Recognition",
       subtitle = "NLP using Python space - Graphics using R ggplot2")

Summary

Thus, in this post we learned how to combine the best of R and Python: R for data analysis and data visualization, and Python for natural language processing with spaCy.

If you liked this, Please subscribe to my Language-agnostic Data Science Newsletter and also share it with your friends!


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.


Behind the Scenes of an R Consortium Project


[This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Next Tuesday (September 3) I will be giving a talk at the Bay Area R User Group titled Behind the Scenes of an R Consortium Project. This will be my first time speaking about my work with the R Consortium, and I encourage you to attend!

Over the last few years the R Consortium has emerged as a major source of funding for R projects. However, many R users are not aware of the R Consortium. And many of those who are aware of it view applying for funding as something that “other people do”.

This talk will use a case study – my own R Consortium Project – to argue that even “average” members of the R community can and should apply to the R Consortium for funding for projects that personally interest them.

In 2018 I began working with the Census Bureau to create an online course on using R to analyze US Census Data. Shortly after I announced that project, Joseph Rickert from the R Consortium contacted me about applying for funding to further this work. In this talk I will describe what problem I wanted to solve, how it related to work that I was already doing and what the application process was like. I will argue that my project can serve as a template for other, similar projects, and will also provide a post-mortem on the project.

You can register for the Meetup for free here.

The post Behind the Scenes of an R Consortium Project appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.


Introducing data_algebra


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).

Introduction

Parts of the project are in early development (and not yet ready for production use), and other parts are mature and have been used in production.

The project intent is to realize a modern data processing language based on Codd’s relational operators that is easy to maintain, has helpful tooling, and has very similar realizations (or dialects) for:

  • SQL databases accessed from Python (in development here, not yet ready for production use).
  • Pandas DataFrame objects in Python (in development here, not yet ready for production use).
  • SQL databases access from R (implementation is here, and is mature and ready for production use).
  • data.table objects in R (implementation is here, and is mature and ready for production use).

The idea is the notation should look idiomatic in each language. Working in Python should feel like working in Python, and working in R should feel like working in R. The data semantics, however, are designed to be close to the SQL realizations (given the close connection of SQL to the relational algebra; in particular row numbering starts at 1 and row and column order is not preserved except at row-order steps or select-columns steps respectively). The intent is: it should be very easy to use the system in either Python or R (a boon to multi-language data science projects) and it is easy to port either code or experience from one system to another (a boon for porting projects, or for data scientists working with more than one code base or computer language).

Related work includes:

The data_algebra principles include:

  • Writing data transforms as a pipeline or method-chain of many simple transform steps.
  • Treating data transform pipelines or directed acyclic graphs (DAGs) as themselves being sharable data.
  • Being able to use the same transform specification many places (in memory, on databases, in R, in Python).

Example

Let’s start with an example in Python.

For our example we will assume we have a data set of how many points different subjects score in a psychological survey. The goal is to transform the data so that we see what fraction of each subject’s answers are in each category (subject to an exponential transform, as often used in logistic regression). We then treat the per-subject renormalized data as a probability or diagnosis.

The exact meaning of such a scoring method is not the topic of this note. It is a notional example to show a non-trivial data transformation need. In particular: having to normalize per-subject (divide some set of scores per-subject by a per-subject total) is a classic pain point in data-processing. In classic SQL this can only be done by joining against a summary table, or in more modern SQL with a “window function.” We want to show that by working in small enough steps this can be done simply.

Set up

Let’s start our Python example. First we import the packages we are going to use, and set a few options.

In [1]:
import io
from pprint import pprint
import psycopg2  # http://initd.org/psycopg/
import pandas  # https://pandas.pydata.org
import yaml  # https://pyyaml.org
import db_helpers  # https://github.com/WinVector/data_algebra/blob/master/Examples/LogisticExample/db_helpers.py

pandas.set_option('display.max_columns', None)
pandas.set_option('display.expand_frame_repr', False)
pandas.set_option('max_colwidth', -1)
Now let’s type in our example data. Notice this is an in-memory Pandas DataFrame.

In [2]:
d_local = pandas.DataFrame({
    'subjectID': [1, 1, 2, 2],
    'surveyCategory': ["withdrawal behavior", "positive re-framing",
                       "withdrawal behavior", "positive re-framing"],
    'assessmentTotal': [5, 2, 3, 4],
    'irrelevantCol1': ['irrel1']*4,
    'irrelevantCol2': ['irrel2']*4,
})
d_local
Out[2]:
   subjectID  surveyCategory       assessmentTotal  irrelevantCol1  irrelevantCol2
0  1          withdrawal behavior  5                irrel1          irrel2
1  1          positive re-framing  2                irrel1          irrel2
2  2          withdrawal behavior  3                irrel1          irrel2
3  2          positive re-framing  4                irrel1          irrel2
Let’s also copy this data to a PostgreSQL database. Normally big data is already in the system one wants to work with, so the copying over is just to simulate the data already being there.

In [3]:
conn = psycopg2.connect(
    database="johnmount",
    user="johnmount",
    host="localhost",
    password="")
conn.autocommit = True
In [4]:
db_helpers.insert_table(conn, d_local, 'd')
db_helpers.read_table(conn, 'd')
Out[4]:
   subjectid  surveycategory       assessmenttotal  irrelevantcol1  irrelevantcol2
0  1.0        withdrawal behavior  5.0              irrel1          irrel2
1  1.0        positive re-framing  2.0              irrel1          irrel2
2  2.0        withdrawal behavior  3.0              irrel1          irrel2
3  2.0        positive re-framing  4.0              irrel1          irrel2
Normally one does not read data back from a database, but instead materializes results in the database with SQL commands such as CREATE TABLE tablename AS SELECT .... Also note: case in column names is a bit of a nightmare. It is often best to lower-case them all.

Back to the data_algebra

Now we continue our example by importing the data_algebra components we need.

In [5]:
from data_algebra.data_ops import *  # https://github.com/WinVector/data_algebra
import data_algebra.env
import data_algebra.yaml
import data_algebra.PostgreSQL

# set some things in our environment
_, _1, _2, _get = [None, None, None, lambda x: x]  # don't look unbound
data_algebra.env.push_onto_namespace_stack(locals())

# ask YAML to write simpler structures
data_algebra.yaml.fix_ordered_dict_yaml_rep()

db_model = data_algebra.PostgreSQL.PostgreSQLModel()
Now we use the data_algebra to define our processing pipeline: ops. We are writing this pipeline using a method chaining notation where we have placed the Python method-dot at the end of lines using the .\ notation. This notation will look very much like a pipe to R/magrittr users.

In [6]:
scale = 0.237

ops = TableDescription('d',
                       ['subjectID', 'surveyCategory', 'assessmentTotal',
                        'irrelevantCol1', 'irrelevantCol2']).\
    extend({'probability': '(assessmentTotal * scale).exp()'}).\
    extend({'total': 'probability.sum()'}, partition_by='subjectID').\
    extend({'probability': 'probability/total'}).\
    extend({'row_number': '_row_number()'},
           partition_by=['subjectID'],
           order_by=['probability', 'surveyCategory'],
           reverse=['probability']).\
    select_rows('row_number==1').\
    select_columns(['subjectID', 'surveyCategory', 'probability']).\
    rename_columns({'diagnosis': 'surveyCategory'}).\
    order_rows(['subjectID'])
For a more pythonic way of writing the same pipeline, we can show how the code would have been formatted by black.

In [7]:
py_source = ops.to_python(pretty=True)
print(py_source)
TableDescription(
    table_name="d",
    column_names=[
        "subjectID",
        "surveyCategory",
        "assessmentTotal",
        "irrelevantCol1",
        "irrelevantCol2",
    ],
).extend({"probability": "(assessmentTotal * 0.237).exp()"}).extend(
    {"total": "probability.sum()"}, partition_by=["subjectID"]
).extend(
    {"probability": "probability / total"}
).extend(
    {"row_number": "_row_number()"},
    partition_by=["subjectID"],
    order_by=["probability", "surveyCategory"],
    reverse=["probability"],
).select_rows(
    "row_number == 1"
).select_columns(
    ["subjectID", "surveyCategory", "probability"]
).rename_columns(
    {"diagnosis": "surveyCategory"}
).order_rows(
    ["subjectID"]
)
In either case, the pipeline is read as a sequence of operations (top to bottom, and left to right). What it is saying is:

  • We start with a table named “d” that is known to have columns “subjectID”, “surveyCategory”, “assessmentTotal”, “irrelevantCol1”, and “irrelevantCol2”.
  • We produce a new table by transforming this table through a sequence of “extend” operations which add new columns.
    • The first extend computes probability = exp(scale*assessmentTotal), which is similar to the inverse-link step of a logistic regression. We assume that when writing this pipeline we were given this math as a requirement.
    • The next few extend steps total the probability per subject (this is controlled by the partition_by argument) and then rank the normalized probabilities per subject (grouping again specified by the partition_by argument, and order controlled by the order_by clause).
  • We then select the per-subject top-ranked rows by the select_rows step.
  • And finally we clean up the results for presentation with the select_columns, rename_columns, and order_rows steps. The names of these methods are intended to evoke what they do.

The point is: each step is deliberately so trivial one can reason about it. However the many steps in sequence do quite a lot.

SQL

Once we have the ops object we can do quite a lot with it. We have already exhibited the pretty-printing of the pipeline. Next we demonstrate translating the operator pipeline into SQL.

In [8]:
sql = ops.to_sql(db_model, pretty=True)
print(sql)
SELECT "probability",       "subjectid",       "diagnosis"FROM  (SELECT "probability",          "subjectid",          "surveycategory" AS "diagnosis"   FROM     (SELECT "probability",             "surveycategory",             "subjectid"      FROM        (SELECT "probability",                "surveycategory",                "subjectid"         FROM           (SELECT "probability",                   "surveycategory",                   "subjectid",                   ROW_NUMBER() OVER (PARTITION BY "subjectid"                                      ORDER BY "probability" DESC, "surveycategory") AS "row_number"            FROM              (SELECT "surveycategory",                      "subjectid",                      "probability" / "total" AS "probability"               FROM                 (SELECT "probability",                         "surveycategory",                         "subjectid",                         SUM("probability") OVER (PARTITION BY "subjectid") AS "total"                  FROM                    (SELECT "surveycategory",                            "subjectid",                            EXP(("assessmenttotal" * 0.237)) AS "probability"                     FROM                       (SELECT "assessmenttotal",                               "surveycategory",                               "subjectid"                        FROM "d") "sq_0") "sq_1") "sq_2") "sq_3") "sq_4"         WHERE "row_number" = 1 ) "sq_5") "sq_6") "sq_7"ORDER BY "subjectid"
The SQL can be hard to read, as SQL expresses composition by inner-nesting (inside SELECT statements happen first). The operator pipeline expresses composition by sequencing or method-chaining, which can be a lot more legible. However the huge advantage of the SQL is: we can send it to the database for execution, as we do now.

Also notice the generated SQL has applied query narrowing: columns not used in the outer queries are removed from the inner queries. The “irrelevant” columns are not carried into the calculation as they would be with a SELECT *. This early optimization comes in quite handy.
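In this note we read the query result back into Python just to display it; in production one would more likely materialize the result in the database, as noted earlier. A minimal sketch of doing that with the existing connection follows (the table name d_result and the direct psycopg2 cursor use are illustrative assumptions, not part of the original notebook):

# sketch: materialize the generated query as a new table instead of pulling rows back
with conn.cursor() as cur:
    cur.execute("CREATE TABLE d_result AS " + sql)
# conn.autocommit was set earlier, so the new table is committed immediately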

In [9]:
db_helpers.read_query(conn, sql)
Out[9]:
   probability  subjectid            diagnosis
0     0.670622        1.0  withdrawal behavior
1     0.558974        2.0  positive re-framing
What comes back is: one row per subject, with the highest per-subject diagnosis and the estimated probability. Again, the math of this is outside the scope of this note (think of that as something coming from a specification); the ability to write such a pipeline is our actual topic.

The hope is that the data_algebra pipeline is easier to read, write, and maintain than the SQL query. If we wanted to change the calculation we would just add a stage to the data_algebra pipeline and then regenerate the SQL query.
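As a rough sketch of that workflow (the extra column probability_pct is a made-up illustration, not part of the original analysis), adding a stage and regenerating the SQL might look something like:

# sketch: derive a new pipeline by appending one more extend stage
ops_pct = ops.extend({'probability_pct': 'probability * 100'})

# regenerate the SQL; the earlier query text is never edited by hand
print(ops_pct.to_sql(db_model, pretty=True))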

Pandas

An advantage of the pipeline is it can also be directly used on Pandas DataFrames. Let’s see how that is achieved.

In [10]:
ops.eval_pandas({'d': d_local})
Out[10]:
   subjectID            diagnosis  probability
0          1  withdrawal behavior     0.670622
1          2  positive re-framing     0.558974
eval_pandas takes a dictionary of Pandas DataFrames (names matching names specified in the pipeline) and returns the result of applying the pipeline to the data using Pandas commands. Currently our Pandas implementation only allows very simple window functions. This is why we didn’t write probability = probability/sum(probability), but instead broke the calculation into multiple steps by introducing the total column (the SQL realization does in fact support more complex window functions). This is a small issue with the grammar, but our feeling is that encouraging simple steps is in fact a good thing (it improves debuggability), and in SQL the query optimizers likely optimize the different query styles into very similar realizations anyway.

Export/Import

Because our operator pipeline is a Python object with no references to external objects (such as the database connection), it can be saved through standard methods such as “pickling.”
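For instance, a minimal pickling round trip might look like the following sketch (the file name pipeline.pkl is arbitrary):

import pickle

# serialize the operator pipeline; it holds no database connection, so this is safe
with open("pipeline.pkl", "wb") as f:
    pickle.dump(ops, f)

# restore it and confirm it still pretty-prints the same pipeline
with open("pipeline.pkl", "rb") as f:
    ops_restored = pickle.load(f)
print(ops_restored.to_python(pretty=True))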

However, data_algebra also supports exporting a pipeline to and from simple structures that are in turn optimized for conversion to YAML. The simple structure format is particularly useful for writing more data_algebra tools (such as pipeline analysis and presentation tools). And the YAML tooling makes moving a processing pipeline to another language (such as R) quite easy.

We will demonstrate this next.

In [11]:
# convert pipeline to simple objects
objs_R = ops.collect_representation(dialect='R')

# print these objects
pprint(objs_R)
[OrderedDict([('op', 'TableDescription'),
              ('table_name', 'd'),
              ('qualifiers', {}),
              ('column_names',
               ['subjectID',
                'surveyCategory',
                'assessmentTotal',
                'irrelevantCol1',
                'irrelevantCol2']),
              ('key', 'd')]),
 OrderedDict([('op', 'Extend'),
              ('ops', {'probability': 'exp(assessmentTotal * 0.237)'}),
              ('partition_by', []),
              ('order_by', []),
              ('reverse', [])]),
 OrderedDict([('op', 'Extend'),
              ('ops', {'total': 'sum(probability)'}),
              ('partition_by', ['subjectID']),
              ('order_by', []),
              ('reverse', [])]),
 OrderedDict([('op', 'Extend'),
              ('ops', {'probability': 'probability / total'}),
              ('partition_by', []),
              ('order_by', []),
              ('reverse', [])]),
 OrderedDict([('op', 'Extend'),
              ('ops', {'row_number': 'row_number()'}),
              ('partition_by', ['subjectID']),
              ('order_by', ['probability', 'surveyCategory']),
              ('reverse', ['probability'])]),
 OrderedDict([('op', 'SelectRows'), ('expr', 'row_number == 1')]),
 OrderedDict([('op', 'SelectColumns'),
              ('columns', ['subjectID', 'surveyCategory', 'probability'])]),
 OrderedDict([('op', 'Rename'),
              ('column_remapping', {'diagnosis': 'surveyCategory'})]),
 OrderedDict([('op', 'Order'),
              ('order_columns', ['subjectID']),
              ('reverse', []),
              ('limit', None)])]
In the above data structure the recursive operator steps have been linearized into a list, and simplified to just ordered dictionaries of a few defining and derived fields. In particular, the key field of the TableDescription nodes is the unique identifier for the tables: two TableDescriptions with the same key refer to the same table.
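For example, the defining fields of the first step can be read off directly from the structure printed above (a small illustrative sketch):

# sketch: the first entry of the collected representation describes the source table
print(objs_R[0]['op'])   # 'TableDescription'
print(objs_R[0]['key'])  # 'd', the unique table identifier discussed above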

We can then write this representation to YAML format.

In [12]:
# convert objects to a YAML string
dmp_R = yaml.dump(objs_R)

# write to file
with open("pipeline_yaml.txt", "wt") as f:
    print(dmp_R, file=f)

R

This pipeline can be loaded into R and used as follows.

In [13]:
%load_ext rpy2.ipython
In [14]:
%%R
library(yaml)
library(wrapr)
library(rquery)
library(rqdatatable)

source('R_fns.R')

r_yaml <- yaml.load_file("pipeline_yaml.txt")
r_ops <- convert_yaml_to_pipleline(r_yaml)
cat(format(r_ops))
table(d; 
  subjectID,
  surveyCategory,
  assessmentTotal,
  irrelevantCol1,
  irrelevantCol2) %.>%
 extend(.,
  probability := exp(assessmentTotal * 0.237)) %.>%
 extend(.,
  total := sum(probability),
  p= subjectID) %.>%
 extend(.,
  probability := probability / total) %.>%
 extend(.,
  row_number := row_number(),
  p= subjectID,
  o= "probability" DESC, "surveyCategory") %.>%
 select_rows(.,
   row_number == 1) %.>%
 select_columns(.,
   subjectID, surveyCategory, probability) %.>%
 rename(.,
  c('diagnosis' = 'surveyCategory')) %.>%
 orderby(., subjectID)
The above representation is nearly “R code” (the printed form itself is not executable, unlike the Python representation, but it is very similar to the actual rquery steps) written using wrapr dot-pipe notation. The underlying r_ops pipeline, however, can be executed in R.

In [15]:
%%R
d_local <- build_frame(
    "subjectID", "surveyCategory",      "assessmentTotal", "irrelevantCol1", "irrelevantCol2" |
    1L,          "withdrawal behavior", 5,                 "irrel1",         "irrel2"         |
    1L,          "positive re-framing", 2,                 "irrel1",         "irrel2"         |
    2L,          "withdrawal behavior", 3,                 "irrel1",         "irrel2"         |
    2L,          "positive re-framing", 4,                 "irrel1",         "irrel2"         )
print(d_local)
  subjectID      surveyCategory assessmentTotal irrelevantCol1 irrelevantCol2
1         1 withdrawal behavior               5         irrel1         irrel2
2         1 positive re-framing               2         irrel1         irrel2
3         2 withdrawal behavior               3         irrel1         irrel2
4         2 positive re-framing               4         irrel1         irrel2
We can use the R pipeline by piping data into the r_ops object.

In [16]:
%%R
d_local %.>%
  r_ops %.>%
  print(.)
   subjectID           diagnosis probability
1:         1 withdrawal behavior   0.6706221
2:         2 positive re-framing   0.5589742
And the R rquery package can also perform its own SQL translation (and even execution management).

In [17]:
%%R
sql <- to_sql(r_ops, rquery_default_db_info())
cat(sql)
SELECT * FROM (
 SELECT
  "subjectID" AS "subjectID",
  "surveyCategory" AS "diagnosis",
  "probability" AS "probability"
 FROM (
  SELECT
   "subjectID",
   "surveyCategory",
   "probability"
  FROM (
   SELECT * FROM (
    SELECT
     "subjectID",
     "surveyCategory",
     "probability",
     row_number ( ) OVER (  PARTITION BY "subjectID" ORDER BY "probability" DESC, "surveyCategory" ) AS "row_number"
    FROM (
     SELECT
      "subjectID",
      "surveyCategory",
      "probability" / "total"  AS "probability"
     FROM (
      SELECT
       "subjectID",
       "surveyCategory",
       "probability",
       sum ( "probability" ) OVER (  PARTITION BY "subjectID" ) AS "total"
      FROM (
       SELECT
        "subjectID",
        "surveyCategory",
        exp ( "assessmentTotal" * 0.237 )  AS "probability"
       FROM (
        SELECT
         "subjectID",
         "surveyCategory",
         "assessmentTotal"
        FROM
         "d"
        ) tsql_76525498125437036191_0000000000
       ) tsql_76525498125437036191_0000000001
      ) tsql_76525498125437036191_0000000002
     ) tsql_76525498125437036191_0000000003
   ) tsql_76525498125437036191_0000000004
   WHERE "row_number" = 1
  ) tsql_76525498125437036191_0000000005
 ) tsql_76525498125437036191_0000000006
) tsql_76525498125437036191_0000000007 ORDER BY "subjectID"
The R implementation is mature, and appropriate to use in production. The rquery grammar is designed to have minimal state and minimal annotations (no grouping or ordering annotations!). This makes the grammar, in my opinion, a good design choice. rquery has very good performance, often much faster than dplyr or base-R due to its query generation ideas and use of data.table via rqdatatable. rquery is a mature pure R package; here is the same example being worked directly in R, with no translation from Python.

The R implementation supports additional features such as converting a pipeline into a diagram (though that would also be easy to implement in Python on top of the collect_representation() objects).

More of the R example (including how the diagram was produced) can be found here.

Advantages of data_algebra

Multi-language data science is an important trend, so a cross-language query system that supports at least R and Python is going to be a useful tool or capability going forward. Obviously SQL itself is fairly cross-language, but data_algebra adds a few features we hope are real advantages.

In addition to the features shown above, a data_algebra operator pipeline carries around usable knowledge of the data transform. For example:

In [18]:
# report all tables used by the query, by name
ops.get_tables()
Out[18]:
{'d': TableDescription(table_name='d', column_names=['subjectID', 'surveyCategory', 'assessmentTotal', 'irrelevantCol1', 'irrelevantCol2'])}
In [19]:
# report all source table columns used by the query
ops.columns_used()
Out[19]:
{'d': {'assessmentTotal', 'subjectID', 'surveyCategory'}}
In [20]:
# what columns does this operation produce?
ops.column_names
Out[20]:
['subjectID', 'diagnosis', 'probability']

Conclusion

The data_algebra is part of a powerful cross-language and multi-implementation family of data manipulation tools. These tools can greatly reduce the development and maintenance cost of data science projects, while improving the documentation of project intent.

Win Vector LLC is looking for sponsors and partners to further the package. In particular, if your group is using both R and Python in big-data projects (where SQL is a need, including Apache Spark), or is porting a project from one of these languages to the other, please get in touch.

Appendix:

Demonstrate that we can round-trip a data_algebra pipeline through YAML and recover the code.

In [21]:
# land the pipeline as a file
objs_Python = ops.collect_representation()
dmp_Python = yaml.dump(objs_Python)
with open("pipeline_Python.txt", "wt") as f:
    print(dmp_Python, file=f)
In [22]:
# read back
with open("pipeline_Python.txt", "rt") as f:
    ops_text = f.read()
ops_back = data_algebra.yaml.to_pipeline(yaml.safe_load(ops_text))
print(ops_back.to_python(pretty=True))
TableDescription(
    table_name="d",
    column_names=[
        "subjectID",
        "surveyCategory",
        "assessmentTotal",
        "irrelevantCol1",
        "irrelevantCol2",
    ],
).extend({"probability": "(assessmentTotal * 0.237).exp()"}).extend(
    {"total": "probability.sum()"}, partition_by=["subjectID"]
).extend(
    {"probability": "probability / total"}
).extend(
    {"row_number": "_row_number()"},
    partition_by=["subjectID"],
    order_by=["probability", "surveyCategory"],
    reverse=["probability"],
).select_rows(
    "row_number == 1"
).select_columns(
    ["subjectID", "surveyCategory", "probability"]
).rename_columns(
    {"diagnosis": "surveyCategory"}
).order_rows(
    ["subjectID"]
)
In [23]:
# confirm we have a data_algebra.data_ops.ViewRepresentation,
# which is the class the data_algebra pipelines are derived from
isinstance(ops_back, data_algebra.data_ops.ViewRepresentation)
Out[23]:
True
In [24]:
# be neat
conn.close()

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


How to scrape Zomato Restaurants Data in R


[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Zomato is a popular restaurant listing website in India (similar to Yelp), and people are always interested in seeing how to download or scrape Zomato restaurant data for data science and visualizations.

In this post, we’ll learn how to scrape / download Zomato restaurant (buffet) data using R. We also hope this post serves as a basic web scraping framework / guide for any similar task of building a new dataset from the internet using web scraping.

Steps

  • Loading required packages
  • Getting web page content
  • Extract relevant attributes / data from the content
  • Building the final dataframe (to be written as csv) or for further analysis

Note: This post also assumes you’re familiar with Browser Devtools and CSS Selectors

Packages

We’ll use the R-packages rvest for web scraping and tidyverse for Data Analysis and Visualization

Loading the libraries

library(rvest)
library(tidyverse)
(Image: zomato web scraping)

Getting Web Content from Zomato

zom <- read_html("https://www.zomato.com/bangalore/restaurants?buffet=1")

Extracting relevant attributes

Considering it’s a restaurant listing, the columns that we can try to build are: Name of the restaurant, Place / City where it is located, and Average Price (or, as Zomato says, price for two).

Name of the Restaurant

This is how the html code for the name is placed:

Barbeque Nation

So, what we need is: for an a tag with class value result-title, the value of its title attribute.

zom %>% 
  html_nodes("a.result-title") %>%
  html_attr("title") %>%
  stringr::str_split(pattern = ',') -> listing

Luckily for us, Zomato’s website is designed in such a way that the name and place of the restaurant are within the same CSS selector a.result-title, so it’s a single scraping step. The two values are separated by a comma, so we can use str_split() to split them; the final output is saved into listing, which is a list.

Converting List to Dataframe

zom_df <- do.call(rbind.data.frame, listing)
names(zom_df) <- c("Name", "Place")

In the above two lines, we convert the listing list to a dataframe zom_df and then rename the columns to Name and Place.

Extracting Price and Adding a New Price Column

zom_df$Price <- zom %>% 
  html_nodes("div.res-cost > span.pl0") %>%
  html_text() %>%
  parse_number()

Since the Price field is actually a combination of the Indian currency symbol and a comma-separated number (which is ultimately a character), we’ll use the parse_number() function to remove the Indian currency unicode from the text and extract only the numeric price value.

Dataset

head(zom_df)
##                                Name             Place Price
## 1 abs absolute barbecues Restaurant      Marathahalli  1600
## 2            big pitcher Restaurant  Old Airport Road  1800
## 3                 pallet Restaurant        Whitefield  1600
## 4        barbeque nation Restaurant       Indiranagar  1600
## 5            black pearl Restaurant      Marathahalli  1500
## 6      empire restaurant Restaurant       Indiranagar   500

Price Graph

zom_df %>% 
  ggplot() + 
  geom_line(aes(Name, Price, group = 1)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top Zomato Buffet Restaurants",
       caption = "Data: Zomato.com")

Summary

Thus, we’ve learnt how to build a new dataset by scraping web content, in this case from Zomato, and used it to build a price graph.

Share this Story

If you liked this, share this article with your friends. Also, please subscribe to my Language-agnostic Data Science Newsletter and share it with them as well!


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers on Programming with R.


R Journal July Issue


[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As the current Editor-in-Chief of the R Journal, I must apologize for the delay in getting the July issue online, due to technical and other matters. In the meantime, though, please take a look at the many interesting articles slated for publication in this and upcoming issues.

Various improvements in technical documentation, as well as the pending hire of the journal’s first-ever editorial assistant, should shorten the review and publication processes in the future.

By the way, I’ve made a couple of tweaks to the Instructions for Authors. First, I note that the journal’s production software really does require following the instructions carefully. For instance, the \author field in your .tex file must start with “by” in order to work properly; it’s not merely a matter of, say, aesthetics. And your submission package must not have more than one .bib file or more than one .tex file other than RJwrapper.tex.

Finally, a reminder to those whom we ask to review article submissions: Your service would be greatly appreciated, a valuable contribution to R. If you must decline, though, please respond to our e-mail query, stating so, so that we may quickly search for other possible reviewers.


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.



