
WoE and IV Variable Screening with {Information} in R


(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

A short note on information-theoretic variable screening in R with {Information}.

Variable screening is an important step in contemporary EDA for predictive modeling: what can we tell about the nature of the relationships between a set of predictors and the dependent variable before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we start rolling them into a predictive model? In this blog post I will discuss two information-theoretic measures that are commonly used in variable screening for binary classification and regression models in the credit risk arena (a fact completely unrelated to the simple truth that they could do you some good in any other application of predictive modeling as well). I will first introduce the Weight of Evidence (WoE) and the Information Value (IV) of a variable with respect to a binary outcome. Then I will illustrate their computation (which is fairly easy, in fact) with the {Information} package in R.

Weight of Evidence

Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):

\displaystyle K = \frac{P(D|M_1)}{P(D|M_2)}

and assume your models M1, M2 of the world* are simply the two possible states of a binary variable Y, while the data are given by the discrete distribution of some predictor X (or X stands for a binned continuous distribution); for every category j in X, j = 1, 2, …, n, take the log:

\displaystyle WoE_j = \ln{\frac{P(x_j|M_1)}{P(x_j|M_2)}}

and you get a simple measure of evidence in favor of M1 against M2 that Good described as the Weight of Evidence (WoE). In theory, any monotonic transformation of the odds would do, but the logarithm brings the intuitive advantage of a negative WoE when the odds are less than one and a positive WoE when they are greater than one. In the setting under consideration, with a binary dependent Y and a discrete (or binned continuous) predictor X, we are simply inspecting the conditional distributions of X given Y:

\displaystyle WoE_j = \ln{\frac{f(x_j|Y=1)\,/\,f(Y=1)}{f(x_j|Y=0)\,/\,f(Y=0)}}

where f denotes counts.

Let's illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent variable:

dataSet <- read.table('bank-additional-full.csv',
                      header = TRUE,
                      strip.white = FALSE,
                      sep = ";")

str(dataSet)
table(dataSet$y)

dataSet$y <- recode(dataSet$y,
                    'yes' = 1,
                    'no' = 0)

and we want to compute the WoE for, say, the age variable. Here it goes:

# - compute WOE for: dataSet$age
bins <- 10

q <- quantile(dataSet$age,
              probs = c(1:(bins - 1)/bins),
              na.rm = TRUE,
              type = 3)

cuts <- unique(q)

aggAge <- table(findInterval(dataSet$age,
                             vec = cuts,
                             rightmost.closed = FALSE),
                dataSet$y)

aggAge <- as.data.frame.matrix(aggAge)
aggAge$N <- rowSums(aggAge)
aggAge$WOE <- log((aggAge$`1` * sum(aggAge$`0`)) / (aggAge$`0` * sum(aggAge$`1`)))

In the previous example I have used exactly the approach to binning X (age, in this case) that is used in the R package {Information}, whose application I want to illustrate next. The table() call provides the conditional distributions of the binned age given y. The computation of WoE is then straightforward, as exemplified in the last line. However, you will want to spare yourself from computing the WoE this way for many variables in a dataset, and that's where {Information} comes in handy; for the same dataset:

# - Information value: all variables

infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)

# - WOE table:
infoTables$Tables$age$WOE

with the respective data frames in infoTables$Tables standing for the variables in the dataset.
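For orientation, the two components of the returned list that are used in the rest of this post can be inspected directly:

# - a quick look at the result object
head(infoTables$Summary)      # one row per variable, with its IV
head(infoTables$Tables$age)   # the binned WoE table for age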

Information Value

A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:

\displaystyle IV = \sum_{j=1}^{n}{\left( P(x_j|Y=1) - P(x_j|Y=0) \right) \, WoE_j}

In effect, this means that we are summing the individual WoE values (one for each bin j of X), weighting them by the respective differences between P(x_j|Y=1) and P(x_j|Y=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have low predictive power.
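To connect this definition to the hand-rolled WoE table from above, here is a quick sketch that computes the IV of age directly from aggAge (using the `0` and `1` count columns and the WOE column created earlier):

# - IV of age, computed from the aggAge table built earlier
p1 <- aggAge$`1` / sum(aggAge$`1`)   # P(x_j | Y = 1)
p0 <- aggAge$`0` / sum(aggAge$`0`)   # P(x_j | Y = 0)
sum((p1 - p0) * aggAge$WOE)          # Information Value of age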

Using {Information} in R, for the dataset under consideration:

# - Information value: all variables

infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)

# - Plot IV
library(ggplot2)

plotFrame <- infoTables$Summary[order(-infoTables$Summary$IV), ]
plotFrame$Variable <- factor(plotFrame$Variable,
                             levels = plotFrame$Variable[order(-plotFrame$IV)])

ggplot(plotFrame, aes(x = Variable, y = IV)) +
  geom_bar(width = .35, stat = "identity", color = "darkblue", fill = "white") +
  ggtitle("Information Value") +
  theme_bw() +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.text.x = element_text(angle = 90))

You may have noted the use of parallel = TRUE in the create_infotables() call; by default, the {Information} package will try to use all available cores to speed up the computations. Besides the basic functionality that I have illustrated, the package provides a natural way of dealing with uplift models, where the computation of the IVs for control vs. treatment designs is nicely automated. Cross-validation procedures are also built in.

However, now that we know that we have a nice, working package for WoE and IV estimation in R, let's restrain ourselves from using it to perform automatic feature selection for models like binary logistic regression. While the information-theoretic measures discussed here truly assess the predictive power of a predictor in binary classification, building a model that encompasses multiple terms is another story. Do not get disappointed if you find that the AICs for the full models are still lower than those for the nested models obtained by feature selection based on the IV values; although they can provide useful guidelines, WoE and IV are not even meant to be used that way (I've tried… once with the dataset used in the previous examples, and then with the two {Information} built-in datasets; not too much of a success, as you may have guessed).

References

* For parametric models, you would need to integrate over the full parameter space, of course; taking the MLEs would result in obtaining the standard LR test.

** The dataset is considered in S. Moro, P. Cortez and P. Rita (2014), "A Data-Driven Approach to Predict the Success of Bank Telemarketing," Decision Support Systems, Elsevier, 62:22-31, June 2014. I obtained it from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (specifically https://archive.ics.uci.edu/ml/machine-learning-databases/00222/, file: bank-additional.zip); a nice description of the dataset is found at http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat



Marginal effects for negative binomial mixed effects models (glmer.nb and glmmTMB) #rstats


(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Here's a small preview of forthcoming features in the ggeffects package, which are already available in the GitHub version: for marginal effects from models fitted with glmmTMB(), glmer() or glmer.nb(), confidence intervals are now also computed.

If you want to test these features, simply install the package from GitHub:

library(devtools)
devtools::install_github("strengejacke/ggeffects")

Here are three examples:

library(glmmTMB)
library(lme4)
library(ggeffects)

data(Owls)

m1 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
              data = Owls, family = nbinom1)

m2 <- glmmTMB(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
              data = Owls, family = nbinom2)

m3 <- glmer.nb(SiblingNegotiation ~ SexParent + ArrivalTime + (1 | Nest),
               data = Owls)

m4 <- glmmTMB(
  SiblingNegotiation ~ FoodTreatment + ArrivalTime + SexParent + (1 | Nest),
  data = Owls,
  ziformula = ~ 1,
  family = list(family = "truncated_poisson", link = "log")
)

pr1 <- ggpredict(m1, c("ArrivalTime", "SexParent"))
plot(pr1)

pr2 <- ggpredict(m2, c("ArrivalTime", "SexParent"))
plot(pr2)

pr3 <- ggpredict(m3, c("ArrivalTime", "SexParent"))
plot(pr3)

pr4 <- ggpredict(m4,
                 c("FoodTreatment", "ArrivalTime [21,24,30]", "SexParent"))
plot(pr4)

The code to calculate confidence intervals is based on the FAQ provided by Ben Bolker (a rough sketch of the idea is shown at the end of this post). Here is another example that reproduces this plot (note: since age is numeric, ggpredict() produces a straight line, not points with error bars).

library(nlme)
data(Orthodont)
m5 <- lmer(distance ~ age * Sex + (age | Subject), data = Orthodont)
pr5 <- ggpredict(m5, c("age", "Sex"))
plot(pr5)
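As an aside, here is a rough sketch of what that FAQ-style interval construction looks like for the fixed-effects part of m3 (my own illustration of the general idea, not the actual ggeffects internals): build a model matrix for a prediction grid, compute the linear predictor and its standard error from the fixed-effects covariance matrix, and back-transform from the log scale.

# FAQ-style approximate confidence intervals for the fixed effects of m3 (glmer.nb)
newdat <- expand.grid(ArrivalTime = seq(21, 30, by = 0.5),
                      SexParent = levels(Owls$SexParent))
X <- model.matrix(~ SexParent + ArrivalTime, data = newdat)   # fixed-effects design matrix
eta <- as.numeric(X %*% fixef(m3))                            # linear predictor on the log scale
se <- sqrt(diag(X %*% vcov(m3) %*% t(X)))                     # SE, ignoring random-effect variance
newdat$predicted <- exp(eta)
newdat$conf.low <- exp(eta - 1.96 * se)
newdat$conf.high <- exp(eta + 1.96 * se)
head(newdat)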

Tagged: data visualization, ggplot, R, rstats, sjPlot


Why I find tidyeval useful


(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

First things first: maybe you shouldn't care about tidyeval. Maybe you don't need it. If you exclusively work interactively, I don't think that learning about tidyeval is important. I can only speak for myself and explain why I personally find tidyeval useful. I wanted to write this blog post after reading this twitter thread and specifically this question. Mara Averick then wrote this blog post linking to 6 other blog posts that give some tidyeval examples.
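To make this concrete, here is the kind of function that tidyeval makes easy to write: a wrapper around dplyr verbs that accepts bare column names from the caller. This is my own minimal illustration rather than an example from the post, assuming dplyr 0.7+ with rlang-style quasiquotation.

library(dplyr)
library(rlang)   # for enquo(); recent dplyr re-exports it

# a small wrapper: group by a user-supplied column, then average another column
mean_by <- function(data, group_col, value_col) {
  group_col <- enquo(group_col)   # capture the bare column names as quosures
  value_col <- enquo(value_col)
  data %>%
    group_by(!!group_col) %>%     # unquote them inside the dplyr verbs
    summarise(mean_value = mean(!!value_col, na.rm = TRUE))
}

mean_by(mtcars, cyl, mpg)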


Pricing Optimization: How to find the price that maximizes your profit


(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Yuri Fonseca

Basic idea

In this post we will briefly discuss pricing optimization. The main idea behind this problem is the following question: as the manager of a company or store, how much should I charge in order to maximize revenue or profit?

Obviously, the answer isn't "as high as possible". If you charge one hundred dollars for a candy, probably only one or two people will agree to buy it. Although the profit per unit would be very high, you probably wouldn't even cover your fixed costs. Charging a very small amount is also not the best call.

Before showing an example for this problem, let us build some simple formulas.

Imagine a monopolist selling a specific product with demand curve Q(p), where Q(p) is the quantity sold given a specific price p. To simplify things, let’s suppose that Q(p) is a linear function:

\displaystyle Q(p) = \alpha p + \beta

The total revenue will be given by:

\displaystyle R(p) = p Q(p) = \alpha p^2 + p\beta

And total profit will be given by:

\displaystyle L(p) = (p - c)Q(p) = \alpha p^2 - \alpha pc + \beta(p - c)

where c is the unit cost of the product. Adding fixed costs to the profit equation does not change the optimal pricing policy, so we will assume they are zero.

Next, we differentiate the equations for R(p) and L(p) to find the first-order conditions, which give the optimal policy under the hypothesis of a linear demand curve. Since \alpha is expected to be negative (demand decreases when prices increase), R and L are concave functions of p. As a consequence, the critical point will be a maximum. Therefore, the optimal price for revenue is given by:

\displaystyle P_{\textrm{max rev}}= \frac{-\beta}{2\alpha}

And for profit:

\displaystyle P_{\textrm{max prof}} = \frac{-\beta + \alpha c}{2\alpha}

Note that sometimes people write the linear demand curve as Q(p) = -\alpha p + \beta; in this case \alpha should be positive, and the signs in equations 2 and 3 must change. Moreover, it is interesting to note that the price that maximizes profit is always higher than the one that maximizes total revenue, because c is always positive.

If taxes are levied only on profit, the optimal pricing policy remains the same. However, countries like Brazil usually charge substantial taxes on total revenue. In this case, the policy for maximizing revenue doesn't change, but the policy for maximizing profit changes according to the following expression:

\displaystyle P_{\textrm{max prof}}^{(tax)} = \frac{-\beta(1-tax) + \alpha c}{2\alpha(1-tax)}
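To make the closed-form solutions concrete, here is a small helper of my own that evaluates equations 2 and 3 (including the revenue-tax case), using illustrative parameter values:

# optimal prices from the closed-form solutions above
# alpha < 0 is the demand slope, beta the intercept, c the unit cost, tax a tax rate on revenue
optimal.prices <- function(alpha, beta, c, tax = 0) {
  c(price.max.revenue = -beta / (2 * alpha),
    price.max.profit  = (-beta * (1 - tax) + alpha * c) / (2 * alpha * (1 - tax)))
}

optimal.prices(alpha = -40, beta = 500, c = 4)            # parameters used in the simulation below
optimal.prices(alpha = -40, beta = 500, c = 4, tax = 0.2) # with a 20% tax on revenue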

Example and implementations:

As an example of how to proceed with the estimation of the optimum price, let's generate a linear demand curve for the daily sales of a product:

library(ggplot2)

# example of linear demand curve (first equation)
demand = function(p, alpha = -40, beta = 500, sd = 10) {
  error = rnorm(length(p), sd = sd)
  q = p*alpha + beta + error
  return(q)
}

set.seed(100)
prices = seq(from = 5, to = 10, by = 0.1)
q = demand(prices)
data = data.frame('prices' = prices, 'quantity' = q)

ggplot(data, aes(prices, quantity)) +
  geom_point(shape = 1) +
  geom_smooth(method = 'lm') +
  ggtitle('Demand Curve')

plot of chunk demand

Imagine a company that has been selling a product that follows the demand curve above for a while (one year, changing prices daily), testing different prices over time. The following time series are what we should expect for the historical revenue, profit and cost of the company:

set.seed(10)

hist.prices = rnorm(252, mean = 6, sd = .5) # random prices defined by the company
hist.demand = demand(hist.prices) # demand curve defined in the chunk above
hist.revenue = hist.prices*hist.demand # from the revenue equation
unity.cost = 4 # production cost per unit
hist.cost = unity.cost*hist.demand
hist.profit = (hist.prices - unity.cost)*hist.demand # from the profit equation

data = data.frame('Period' = seq(1, 252), 'Daily.Prices' = hist.prices,
                  'Daily.Demand' = hist.demand, 'Daily.Revenue' = hist.revenue,
                  'Daily.Cost' = hist.cost, 'Daily.Profit' = hist.profit)

ggplot(data, aes(Period, Daily.Prices)) +
  geom_line(color = 4) +
  ggtitle('Historical Prices used for exploration')

plot of chunk historical series

ggplot(data, aes(Period, Daily.Revenue, colour = 'Revenue')) +
  geom_line() +
  geom_line(aes(Period, Daily.Profit, colour = 'Profit')) +
  geom_line(aes(Period, Daily.Cost, colour = 'Cost')) +
  labs(title = 'Historical Performance', colour = '')

plot of chunk historical series

We can recover the demand curve using the historical data (that is how it is done in the real world).

library(stargazer)
model.fit = lm(hist.demand ~ hist.prices) # linear model for demand
stargazer(model.fit, type = 'html', header = FALSE) # output
==============================================
                        Dependent variable:   
                        -------------------   
                            hist.demand       
----------------------------------------------
hist.prices                  -38.822***       
                              (1.429)         
Constant                     493.588***       
                              (8.542)         
----------------------------------------------
Observations                    252           
R2                             0.747          
Adjusted R2                    0.746          
Residual Std. Error       10.731 (df = 250)   
F Statistic           738.143*** (df = 1; 250)
==============================================
Note:              *p<0.1; **p<0.05; ***p<0.01

And now we need to apply equation 2 and equation 3.

# estimated parameters
beta = model.fit$coefficients[1]
alpha = model.fit$coefficients[2]

p.revenue = -beta/(2*alpha) # estimated price for revenue
p.profit = (alpha*unity.cost - beta)/(2*alpha) # estimated price for profit

The final plot with the estimated prices:

true.revenue = function(p) p*(-40*p + 500) # revenue with true parameters (chunk demand)
true.profit = function(p) (p - unity.cost)*(-40*p + 500) # profit with true parameters

# estimated curves
estimated.revenue = function(p) p*(model.fit$coefficients[2]*p + model.fit$coefficients[1])
estimated.profit = function(p) (p - unity.cost)*(model.fit$coefficients[2]*p + model.fit$coefficients[1])

opt.revenue = true.revenue(p.revenue) # revenue at the estimated optimum price
opt.profit = true.profit(p.profit) # profit at the estimated optimum price

# plot
df = data.frame(x1 = p.revenue, x2 = p.profit,
                y1 = opt.revenue, y2 = opt.profit, y3 = 0)

ggplot(data = data.frame(Price = 0)) +
  stat_function(fun = true.revenue, mapping = aes(x = Price, color = 'True Revenue')) +
  stat_function(fun = true.profit, mapping = aes(x = Price, color = 'True Profit')) +
  stat_function(fun = estimated.revenue, mapping = aes(x = Price, color = 'Estimated Revenue')) +
  stat_function(fun = estimated.profit, mapping = aes(x = Price, color = 'Estimated Profit')) +
  scale_x_continuous(limits = c(4, 11)) +
  labs(title = 'True curves without noise') +
  ylab('Results') +
  scale_color_manual(name = "", values = c("True Revenue" = 2, "True Profit" = 3,
                                           "Estimated Revenue" = 4, "Estimated Profit" = 6)) +
  geom_segment(aes(x = x1, y = y1, xend = x1, yend = y3), data = df) +
  geom_segment(aes(x = x2, y = y2, xend = x2, yend = y3), data = df)

plot of chunk final plot

Final observations

As you can see, the estimated Revenue and estimated Profit curves are quite similar to the true ones without noise, and the expected revenue for our estimated optimal policies looks very promising. Although the linear and monopolist assumptions look quite restrictive, this might not be the case; check Besbes and Zeevi (2015) and Cooper et al. (2015).

If one expects a large variance for \alpha, it might be useful to simulate \alpha and \beta, and then the distribution of the optimal price; by Jensen's inequality, plugging the expected parameters into the formula is not the same as taking the expectation of the optimal price over parameter draws.
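A possible sketch of that idea (my own illustration): draw (intercept, slope) pairs from the sampling distribution of the fitted regression and propagate each draw through the profit-maximizing price.

library(MASS)

# simulate parameter draws from the fitted demand model
set.seed(123)
draws <- mvrnorm(n = 10000, mu = coef(model.fit), Sigma = vcov(model.fit))
beta.sim  <- draws[, 1]   # intercepts
alpha.sim <- draws[, 2]   # slopes

# distribution of the profit-maximizing price across the draws
p.profit.sim <- (alpha.sim * unity.cost - beta.sim) / (2 * alpha.sim)
mean(p.profit.sim)
quantile(p.profit.sim, c(0.025, 0.975))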

References

Phillips, Robert Lewis. Pricing and revenue optimization. Stanford University Press, 2005.

Besbes, Omar, and Assaf Zeevi. “On the (surprising) sufficiency of linear models for dynamic pricing with demand learning.” Management Science 61.4 (2015): 723-739.

Cooper, William L., Tito Homem-de-Mello, and Anton J. Kleywegt. “Learning and pricing with models that do not explicitly incorporate competition.” Operations research 63.1 (2015): 86-103.

Talluri, Kalyan T., and Garrett J. Van Ryzin. The theory and practice of revenue management. Vol. 68. Springer Science & Business Media, 2006.


A hidden process behind binary or other categorical outcomes?


(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention's effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. – six groups in total. The question was how best to evaluate the effectiveness of the intervention.

Since the outcome was categorical and ordinal – that is, category 1 implied less meat consumption than category 2, category 2 implied less consumption than category 3, and so on – a model that estimates the cumulative probability of ordinal outcomes seemed like a possible way to proceed. Cumulative logit models estimate a number of parameters that represent the cumulative log-odds of an outcome; the parameters are the log-odds of categories 2 through 6 versus category 1, categories 3 through 6 versus 1 & 2, etc. Maybe not the most intuitive way to interpret the data, but it seems to plausibly fit the data generating process.

I was concerned about the proportionality assumption of the cumulative logit model, particularly when we started to consider adjusting for baseline characteristics (more on that in the next post). I looked more closely at the data generating assumptions of the cumulative logit model, which are quite frequently framed in the context of a continuous latent measure that follows a logistic distribution. I thought I’d describe that data generating process here to give an alternative view of discrete data models.

I know I have been describing a context that includes an outcome with multiple categories, but in this post I will focus on regular logistic regression with a binary outcome. This will hopefully allow me to establish the idea of a latent threshold. I think it will be useful to explain this simpler case first before moving on to the more involved case of an ordinal response variable, which I plan to tackle in the near future.

A latent continuous process underlies the observed binary process

For an event with a binary outcome (true or false, A or B, 0 or 1), the observed outcome may, at least in some cases, be conceived as the manifestation of an unseen, latent continuous outcome. In this conception, the observed (binary) outcome merely reflects whether or not the unseen continuous outcome has exceeded a specified threshold. Think of this threshold as a tipping point, above which the observable characteristic takes on one value (say false), below which it takes on a second value (say true).

The logistic distribution

Logistic regression models are used to estimate relationships of individual characteristics with categorical outcomes. The name of this regression model arises from the logistic distribution, which is a symmetrical continuous distribution. In a latent (or hidden) variable framework, the underlying, unobserved continuous measure is drawn from this logistic distribution. More specifically, the standard logistic distribution is typically assumed, with a location parameter of 0, and a scale parameter of 1. (The mean of this distribution is 0 and variance is approximately 3.29.)
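(As a quick side check of my own: the variance of the standard logistic distribution is \(\pi^2/3\), which matches the 3.29 figure and is easy to confirm by simulation.)

# variance of the standard logistic distribution
pi^2 / 3
## [1] 3.289868

# simulation check
set.seed(1)
var(rlogis(1e6, location = 0, scale = 1))
## should be close to 3.29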

Here is a plot of a logistic pdf, shown in relation to a standard normal pdf (with mean 0 and variance 1):

library(ggplot2)
library(data.table)

my_theme <- function() {
  theme(panel.background = element_rect(fill = "grey90"),
        panel.grid = element_blank(),
        axis.ticks = element_line(colour = "black"),
        panel.spacing = unit(0.25, "lines"),
        plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
        panel.border = element_rect(fill = NA, colour = "gray90"))
}

x <- seq(-6, 6, length = 1000)
yNorm <- dnorm(x, 0, 1)
yLogis <- dlogis(x, location = 0, scale = 1)

dt <- data.table(x, yNorm, yLogis)
dtm <- melt(dt, id.vars = "x", value.name = "Density")

ggplot(data = dtm) +
  geom_line(aes(x = x, y = Density, color = variable)) +
  geom_hline(yintercept = 0, color = "grey50") +
  my_theme() +
  scale_color_manual(values = c("red", "black"),
                     labels = c("Normal", "Logistic")) +
  theme(legend.position = c(0.8, 0.6),
        legend.title = element_blank())

The threshold defines the probability

Below, I have plotted the standardized logistic pdf with a threshold that defines a tipping point for a particular Group A. In this case the threshold is 1.5, so for everyone with an unseen value of \(X < 1.5\), the observed binary outcome \(Y\) will be 1. For those where \(X \geq 1.5\), the observed binary outcome \(Y\) will be 0:

xGrpA <- 1.5

ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) +
  geom_line() +
  geom_segment(x = xGrpA, y = 0, xend = xGrpA, yend = dlogis(xGrpA), lty = 2) +
  geom_area(mapping = aes(ifelse(x < xGrpA, x, xGrpA)), fill = "white") +
  geom_hline(yintercept = 0, color = "grey50") +
  ylim(0, 0.3) +
  my_theme() +
  scale_x_continuous(breaks = c(-6, -3, 0, xGrpA, 3, 6))

Since we have plotted a probability density function (pdf), the area under the entire curve is equal to 1. We are interested in the binary outcome \(Y\) defined by the threshold, so we can say that the area under the curve to the left of the threshold (filled in white) represents \(P(Y = 1|Group=A)\). The remaining area represents \(P(Y = 0|Group=A)\). The area to the left of the threshold can be calculated in R using the plogis function:

(p_A <- plogis(xGrpA))
## [1] 0.8175745

Here is the plot for a second group that has a threshold of 2.2:

The area under the curve to the left of the threshold is \(P(X < 2.2)\), which is also \(P(Y = 1 | Group=B)\):

xGrpB <- 2.2
(p_B <- plogis(xGrpB))
## [1] 0.9002495

Log-odds and probability

In logistic regression, we are actually estimating the log-odds of an outcome, which can be written as

\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right]\].

In the case of Group A, log-odds of Y being equal to 1 is

(logodds_A <- log(p_A / (1 - p_A) ))
## [1] 1.5

And for Group B,

(logodds_B <- log(p_B / (1 - p_B) ))
## [1] 2.2

As you may have noticed, we’ve recovered the thresholds that we used to define the probabilities for the two groups. The threshold is actually the log-odds for a particular group.

Logistic regression

The logistic regression model that estimates the log-odds for each group can be written as

\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right] = B_0 + B_1 * I(Grp = B) \quad ,\]

where \(B_0\) represents the threshold for Group A and \(B_1\) represents the shift in the threshold for Group B. In our example, the threshold for Group B is 0.7 units (2.2 – 1.5) to the right of the threshold for Group A. If we generate data for both groups, our estimates for \(B_0\) and \(B_1\) should be close to 1.5 and 0.7, respectively.

The process in action

To put this all together in a simulated data generating process, we can see the direct link with the logistic distribution, the binary outcomes, and an interpretation of estimates from a logistic model. The only stochastic part of this simulation is the generation of continuous outcomes from a logistic distribution. Everything else follows from the pre-defined group assignments and the group-specific thresholds:

n = 5000
set.seed(999)

# Stochastic step
xlatent <- rlogis(n, location = 0, scale = 1)

# Deterministic part
grp <- rep(c("A", "B"), each = n / 2)

dt <- data.table(id = 1:n, grp, xlatent, y = 0)
dt[grp == "A" & xlatent <= xGrpA, y := 1]
dt[grp == "B" & xlatent <= xGrpB, y := 1]

# Look at the data
dt
##         id grp    xlatent y
##    1:    1   A -0.4512173 1
##    2:    2   A  0.3353507 1
##    3:    3   A -2.2579527 1
##    4:    4   A  1.7553890 0
##    5:    5   A  1.3054260 1
##   ---                      
## 4996: 4996   B -0.2574943 1
## 4997: 4997   B -0.9928283 1
## 4998: 4998   B -0.7297179 1
## 4999: 4999   B -1.6430344 1
## 5000: 5000   B  3.1379593 0

The probability of a "successful" outcome (i.e., \(P(Y = 1)\)) for each group based on this data generating process is pretty much equal to the areas under the respective densities to the left of the thresholds used to define success:

dt[, round(mean(y), 2), keyby = grp]
##    grp   V1
## 1:   A 0.82
## 2:   B 0.90

Now let’s estimate a logistic regression model:

library(broom)
glmfit <- glm(y ~ grp, data = dt, family = "binomial")
tidy(glmfit, quick = TRUE)
##          term  estimate
## 1 (Intercept) 1.5217770
## 2        grpB 0.6888526

The estimates from the model recover the logistic distribution thresholds for each group. The Group A threshold is estimated to be 1.52 (the intercept) and the Group B threshold is estimated to be 2.21 (intercept + grpB parameter). These estimates can be interpreted as the log-odds of success for each group, but also as the thresholds of the underlying continuous data generating process that determines the binary outcome \(Y\). And we can interpret the parameter for grpB in the traditional way, as the log-odds ratio comparing the log-odds of success for Group B with the log-odds of success for Group A, or as the shift from the logistic threshold for Group A to the logistic threshold for Group B.
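(As a small check of my own, converting the estimated thresholds back to the probability scale with plogis() recovers the group proportions computed from the raw data above.)

# back-transform the estimated thresholds to probabilities
plogis(coef(glmfit)[1])                     # P(Y = 1 | Group = A), roughly 0.82
plogis(coef(glmfit)[1] + coef(glmfit)[2])   # P(Y = 1 | Group = B), roughly 0.90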

In the next week or so, I will extend this to a discussion of an ordinal categorical outcome. I think the idea of shifting the thresholds underscores the proportionality assumption I alluded to earlier …


Recording of my Talk at the ACS Data Users Conference


(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

I recently had the honor of speaking at the 2017 American Community Survey (ACS) Data Users Conference. My talk focused on choroplethr, the suite of R packages I've written for mapping open datasets in R. The ACS is one of the main surveys that the US Census Bureau runs, and I was honored to share my work there.

My talk was recorded, and you can view it below:

Attending the conference was part of my effort to bridge the gap between the world of free data analysis tools (i.e. the R community) and the world of government datasets (which I consider the most interesting datasets around).

The conference was great. I have been meaning to write a blog post summarizing what I learned, but I simply haven’t had the time. Instead, I will point you to a recording of all the conference talks.

 

The post Recording of my Talk at the ACS Data Users Conference appeared first on AriLamstein.com.


July 2017 New Package Picks


(This article was first published on R Views, and kindly contributed to R-bloggers)

Two hundred and twenty-four new packages were added to CRAN in July. Below are my picks for the “Top 40” packages, arranged in seven categories: Machine Learning, Numerical Methods, Science, Statistics, Time Series, Utilities and Visualization. Science and Numerical Methods are categories that I have not used before. The idea behind the Science category is to find a place for packages that appear to have been created with some particular scientific investigation or problem in mind. The Numerical Methods category is reserved for packages that, while they may be targeted to some general form of statistical analysis, emphasize numerical considerations and carefully constructed algorithms.

As always, my selections are heavily weighted by the availability of documentation beyond what is included in the package PDF. I rarely select packages that do not have a vignette or some other source of documentation about how the package can be used, for example, README files or a referenced URL. I almost never select “professional” packages, which I define as packages that are devoted to esoteric topics that either include no documentation beyond the PDF, or exclusively refer to papers that are protected by a paywall. While these packages usually comprise serious, valuable contributions to R, they also appear to have been written for very small audiences.

Finally, before listing this month’s Top 40, I would like to call attention to an awesome display of productivity by Kevin R. Coombes, who had fourteen packages on various topics accepted by CRAN in July: BimodalIndex, ClassComparison, ClassDiscovery, CrossValidate, GeneAlgo, IntegIRTy, Modeler, NameNeedle, oompaBase, oompaData, PreProcess, SIBERG, TailRank and Umpire.

The July 2017 Top 40

Machine Learning

autoBagging v0.1.0: Implements a framework for automated machine learning that focuses on the optimization of bagging workflows. See the paper by Pinto et al. for details.

grf v0.9.3: Provides methods for non-parametric least-squares regression, quantile regression, and treatment effect estimation (optionally using instrumental variables).

iRF v2.0.0: Provides functions to iteratively grow feature-weighted random forests and find high-order feature interactions in a stable fashion. Look here for the details.

keras v2.0.5: Implements an interface to Keras, a high-level neural networks API that runs on top of TensorFlow. There is an Overview of the Keras backend, and a number of vignettes including Keras Layers, Writing Custom Keras Layers, Keras Models, Using Pre-Trained Models, Sequential Models and more.

randomForestExplainer v0.9: Provides a set of tools to help explain which variables are most important in a random forest. The vignette provides examples of visualizing multiple performance measures.

sgmcmc v0.1.0: Provides functions to implement stochastic gradient Markov chain Monte Carlo (SGMCMC) methods for user-specified models. TensorFlow is used to calculate the gradients. There is a Getting Started Guide and vignettes for Simulating from a Gaussian Mixture, a Multivariate Gaussian Mixture and Logistic Regression.

Numerical Methods

mcMST v1.0.0: Provides algorithms to approximate the Pareto-front of multi-criteria minimum spanning tree problems, along with a toolbox for generating multi-objective benchmark graph problems. There is an Introduction and a vignette on benchmarking optimization problems.

mize v0.1.1: Provides optimization algorithms, including conjugate gradient (CG), Broyden-Fletcher-Goldfarb-Shanno (BFGS), and the limited memory BFGS (L-BFGS) methods. There is an introduction and vignettes on Convergence, Metric MDS, and Stateful Optimization.

SuperGauss v1.0: Provides a fast C++ based algorithm for the evaluation of Gaussian time series, along with efficient implementations of the score and Hessian functions. The vignette shows an example of inference for the Hurst parameter.

Science

GROAN v1.0.0: Is a workbench for testing genomic regression accuracy on noisy phenotypes. It contains a noise-generator function. There is a vignette on Genomic Regression in Noisy Scenarios.

noaastormevents v0.1.0: Allows users to explore and plot data from the National Oceanic and Atmospheric Administration (NOAA) Storm Events database for United States counties through R. There is an Overview and a vignette providing details.

swmmr v0.7.0: Provides functions to connect the Storm Water Management Model (SWMM) of the United States Environmental Protection Agency (US EPA).

Statistics

blandr v0.4.3: Contains functions to carry out Bland Altman analyses (also known as a Tukey mean-difference plot) as described by JM Bland and DG Altman. See the vignette for details.

cnbdistr v1.0.1: Provides the distribution functions for the Conditional Negative Binomial distribution. The vignette shows the math.

diffpriv v0.4.2: Provides an implementation of major general-purpose mechanisms for privatizing statistics, models, and machine learners, within the framework of differential privacy of Dwork et al. (2006). There is a vignette on The Bernstein Mechanism and an Introduction

fence v1.0: Implements a new class of model-selection strategies for mixed-model selection, which includes linear and generalized linear mixed models. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). The package points to several references in the literature including papers by Jiang et al. 2008, Jiang et al. 2010, Jiang et al. 2011, Nguyen et al. 2012, and Jiang 2014.

llogistic v1.0.0: Provides density, distribution, quantile and random generation functions for the L-Logistic distribution with parameters m (median) and b.

metaBMA v0.3.9: Provides functions to compute the posterior model probabilities of meta-analysis models assuming either fixed or random effects. See the paper by Gronau et al. and the vignette for details.

MFKnockoffs v0.9: Provides functions to create model-free knockoffs, a general procedure for controlling the false discovery rate (FDR) when performing variable selection. There are Basic and Advanced vignettes on using the Model-Free Knockoff Filter, and one on Using the Filter with a Fixed Design Matrix.

Modeler v3.4.2: Provides tools to define classes and methods to learn models and use them to predict binary outcomes. There is a vignette on Learning and Predicting.

msde v1.0: Implements an MCMC sampler for the posterior distribution of arbitrary, time-homogeneous, multivariate stochastic differential equation (SDE) models with possibly latent components. There is a vignette with Sample Models and another for Inference.

RBesT v1.2-3: Provides a tool set to support Bayesian evidence synthesis, including meta-analysis, prior derivation from historical data, operating characteristics, and analysis. There is an Introduction and vignettes on Customizing Plots, Normal Endpoints, and Robust Priors.

RcppTN v0.2-1: Provides R and C++ functions to generate random deviates from and calculate moments of a Truncated Normal distribution using the algorithm of Robert (1995). There is a vignette showing how to use the package, and one for Performance.

rsample v0.0.1: Provides classes and functions to create and summarize different kinds of resampling objects. There is a vignette on the Basics, and another for Working with rsets.

SMM v1.0: Provides functions to simulate and estimate Multi-State Discrete-Time Semi-Markov and Markov Models. The implementation details are described in two papers by Barbu and Limnios (one and two) and one paper by Trevezas and Limnios. The vignette also provides considerable detail.

treeDA v0.02: Provides functions to perform sparse discriminant analysis on a combination of node and leaf predictors, when the predictor variables are structured according to a tree. There is a vignette.

vennLasso v0.1: Provides variable selection and estimation routines for models stratified by binary factors. There is a vignette.

Time Series

sweep v0.2.0: Provides tools for bringing tidyverse organization to time series forecasting. There is an Introduction, as well as vignettes on Forecasting and Forecasting with Multiple Models.

timetk v0.1.0: Implements a toolkit for working with time series, including functions to interrogate time series objects and tibbles, and coerce between time-based tibbles (‘tbl’) and ‘xts’, ‘zoo’, and ‘ts’. There is an Introduction and vignettes on Working with time series index, Making a Future Index, and Forecasting.

Utilities

dataCompareR v0.1.0: Contains functions to compare two tabular data objects with the specific intent of showing differences in a way that should make it easier to understand the differences. The vignette shows how to use the package.

dataPreparation v0.2: Provides functions for data preparation that take advantage of data.table efficiencies. There is a tutorial.

datastructures v0.2.0: Provides implementations of advanced data structures such as hashmaps, heaps, and queues. There is a Tutorial.

docker v0.0.2: Provides access to the Docker SDK from R via python, using the reticulate package. There is a vignette to get you started.

seplyr v0.1.4: Supplies standard evaluation adapter methods for important common dplyr methods that currently have a non-standard programming interface. There is an Introduction, as well as vignettes for Using seplyr with dplyr, the operator named map builder, and the operator rename_se.

vetr v0.1.0: Provides a declarative template-based framework for verifying that objects meet structural requirements, and auto-composing error messages when they do not. There is a vignette on Alikeness and one on Trust, but Verify.

Visualization

ggjoy v0.3.0: Joyplots provide a convenient way of visualizing changes in distributions over time or space. ggjoy enables the creation of such plots in ggplot2. There is an Introduction and a Gallery of examples.
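As a tiny taste of what such a plot looks like (my own example, assuming the geom_joy() geom that the package exports):

library(ggplot2)
library(ggjoy)

# one ridge per species, showing the distribution of sepal length
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_joy()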

ggplotgui v1.0.0: Implements a shiny app for creating and exploring ggplot2 graphs that also generates the required R code. Look here for examples.

loon v1.1.0: Is an extensible toolkit for interactive data visualization and exploration. There are two vignettes containing examples: Visible minorities in Canadian cities and Smoothers and Bone Mineral Density

sugrrants v0.1.0: Provides ggplot2 graphics for analyzing time series data with the goal of fitting into the tidyverse and grammar of graphics framework. The vignette provides examples.

tidygraph v1.0.0: A graph, while not “tidy” in itself, can be thought of as two tidy data frames describing node and edge data respectively. tidygraph provides functions to manipulate these virtual data frames using the dplyr package. Look here for some details.

visdat v0.1.0: Provides functions to create preliminary exploratory data visualizations of an entire dataset using ggplot2. There is a vignette to get you started.


One-way ANOVA in R


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Suppose as a business manager you have the responsibility for testing and comparing the lifetimes of four brands (Apollo, Bridgestone, CEAT and Falken) of automobile tyres. The lifetimes of these sample observations are measured as mileage run in '000 miles. For each brand of automobile tyre, a sample of 15 observations has been collected. On the basis of this information, you have to take your decision regarding the four brands of automobile tyre. The data is provided in csv format (tyre.csv).

In order to test and compare the lifetimes of the four brands of tyre, you should apply the one-way ANOVA method, as there is only one factor or criterion (mileage run) to classify the sample observations. The null and alternative hypotheses to be used are given as:

H0: the mean mileage of the four tyre brands is the same (μ1 = μ2 = μ3 = μ4)
H1: the mean mileage of at least one brand differs from the others

Data Import and Outlier Checking

Let us first import the data into R and save it as object ‘tyre’. The R codes to do this:

tyre <- read.csv(file.choose(), header = TRUE, sep = ",")
attach(tyre)

Before doing anything else, you should check the variable type: in ANOVA, you need a categorical independent variable (here, the factor or treatment variable 'Brands'). In R, you can use the following code:

is.factor(Brands)
[1] TRUE

As the result is 'TRUE', the variable 'Brands' is a categorical variable, and we are all set to run the ANOVA model in R. As with other linear models, in ANOVA you should also check for the presence of outliers, which can be done with a boxplot. As there are four populations to study, you should use a separate boxplot for each population. In R, it is done as:

boxplot(Mileage ~ Brands,
        main = "Fig.-1: Boxplot of Mileage of Four Brands of Tyre",
        col = rainbow(4))

Gives this plot:

If you are using advanced graphics, you can use the ‘ggplot2’ package with the following code to get the above boxplot.

library(ggplot2)
ggplot(tyre, aes(Brands, Mileage)) +
  geom_boxplot(aes(col = Brands)) +
  labs(title = "Boxplot of Mileage of Four Brands of Tyre")

The above picture shows that there is one extreme observation in the CEAT brand. To find out the exact outliers or extreme observations, you can use the following command:

boxplot.stats(Mileage[Brands == "CEAT"])
$stats
[1] 30.42748 33.11079 34.78336 36.12533 36.97277

$n
[1] 15

$conf
[1] 33.55356 36.01316

$out
[1] 41.05

So, the outlier is the observation valued 41.05. The confidence interval is (33.55 – 36.01), and the minimum and maximum values of the sample coming from the population 'CEAT' are 30.43 and 41.05, respectively. Considering all these points, we ignore the outlier value 41.05 for the moment and carry out the analysis. If at a later stage we find that this outlier creates problems in the estimation, we will exclude it.
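(Should it become necessary to exclude the outlier later, a minimal sketch of my own of how that could be done is given below; the analysis that follows keeps the full data set.)

# drop the single extreme CEAT observation and re-fit, if it proves problematic
tyre.clean <- subset(tyre, !(Brands == "CEAT" & Mileage > 40))
model1.clean <- aov(Mileage ~ Brands, data = tyre.clean)
summary(model1.clean)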

Estimation of Model

Let us now fit the model by using the aov() function in R.

model1<- aov(Mileage~Brands)

To get the result of the one-way ANOVA:

summary(model1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Brands       3  256.3   85.43   17.94 2.78e-08 ***
Residuals   56  266.6    4.76                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant, as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean mileage run across all the tyre brands. In other words, the average mileages of the four tyre brands are not equal. Now you have to find out which pairs of brands differ. For this you may use Tukey's HSD test. In R, the code for applying Tukey's HSD test is:

TukeyHSD(model1, conf.level = 0.99)
  Tukey multiple comparisons of means
    99% family-wise confidence level

Fit: aov(formula = Mileage ~ Brands)

$Brands
                          diff        lwr        upr     p adj
Bridgestone-Apollo -3.01900000 -5.6155816 -0.4224184 0.0020527
CEAT-Apollo        -0.03792661 -2.6345082  2.5586550 0.9999608
Falken-Apollo       2.82553333  0.2289517  5.4221149 0.0043198
CEAT-Bridgestone    2.98107339  0.3844918  5.5776550 0.0023806
Falken-Bridgestone  5.84453333  3.2479517  8.4411149 0.0000000
Falken-CEAT         2.86345994  0.2668783  5.4600415 0.0037424

The TukeyHSD command shows the pair-wise differences in mileage of the four brands of tyre at the 1% level of significance. Here, the "diff" column provides the mean differences. The "lwr" and "upr" columns provide the lower and upper 99% confidence bounds, respectively. Finally, the "p adj" column provides the p-values adjusted for the number of comparisons made. As there are four brands, six pair-wise comparisons are obtained. The results show that, by Tukey's Honestly Significant Differences, all pairs differ significantly in mileage except the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone and Apollo is found to be -3.019, which means that Apollo has higher mileage than Bridgestone, and this difference is statistically significant. Similarly, you can make the other pair-wise comparisons. You can also plot the results of Tukey's HSD comparison by using the plot function as follows:

plot(TukeyHSD(model1, conf.level = 0.99),las=1, col = "red")

The diagram is shown below:

Another way to visualize the data is by plotting the mean mileage of the four brands of tyre with the help of the gplots package. By using the plotmeans() function in the gplots package, you can create mean plots for single factors, including confidence intervals.

library(gplots)
plotmeans(Mileage ~ Brands,
          main = "Fig.-3: Mean Plot with 95% Confidence Interval",
          ylab = "Mileage run ('000 miles)",
          xlab = "Brands of Tyre")

Gives this plot:

Diagnostic Checking

After estimating the ANOVA model and finding out the possible pairs of differences, it is now time to carry out the diagnostic checks of the model assumptions. A single call to the plot function generates four diagnostic plots (Fig.-4).

par(mfrow = c(2, 2))
plot(model1)

Gives this plot:

The Residuals vs. Fitted plot, shown in the upper left of Fig.-4, plots the fitted values against the model residuals. If the residuals follow any particular pattern, such as a diagonal line, there may be other predictors not yet in the model that could improve it. The flat lowess line looks very good, as the single predictor variable (regressor) sufficiently explains the dependent variable.

The Normal Q-Q plot, in the upper right of Fig.-4, shows the quantiles of the standardized residuals plotted against the quantiles you would expect if the data were normally distributed. Since these fall mostly on the straight line, the assumption of normally distributed residuals is met. Since there are only 15 observations for each individual brand of tyre, it is not wise to check the normality assumption group by group. Instead, the normality of the overall residuals can be checked by a statistical test such as the Shapiro-Wilk test; I will show this procedure shortly.

The Scale-Location plot, in the lower left of Fig.-4, shows the square root of the absolute standardized residuals plotted against the fitted, or predicted, values. Since the lowess line that fits this is fairly flat, the spread of the residuals is almost the same across the prediction line, so there is little indication that the assumption of homoscedasticity fails. This will be further verified by statistical tests; in the case of ANOVA, you can check the assumption of homogeneity of variances across the four brands of tyre.

Finally, the Residuals vs. Leverage plot, in the lower right corner, shows a measure of the influence of each point on the overall equation against the standardized residuals. Since no points stand out far from the pack, we can assume that there are no outliers having undue influence on the fit of the model.

Thus, the graphical diagnostics apparently show that the assumptions of the ANOVA model are fairly well fulfilled. However, the normality and homogeneity assumptions should be confirmed by the appropriate statistical tests.

Regarding the fulfillment of the normality assumption, it has already been discussed that when the number of observations per group is small, it is wise to test normality on the overall residuals of the model instead of checking each group separately. In R, the model residuals are saved as follows:

uhat<-resid(model1)

where the resid function extracts the model residuals, which are saved as the object 'uhat'.

Now you may apply the Shapiro-Wilk test for normality with the following hypothesis set-up:

H0: the residuals come from a normal population
H1: the residuals do not come from a normal population

The test code and results are shown below:

shapiro.test(uhat)

        Shapiro-Wilk normality test

data:  uhat
W = 0.9872, p-value = 0.7826

As the p-value is higher than the level of significance, you cannot reject the null hypothesis, which implies that the samples are taken from normal populations. Another assumption requirement is the homogeneity of variances across the groups, which can be statistically tested by the Bartlett test and the Levene test. In both tests, the null hypothesis is homogeneity of variance across the groups. The tests are conducted as follows:

bartlett.test(Mileage ~ Brands)

        Bartlett test of homogeneity of variances

data:  Mileage by Brands
Bartlett's K-squared = 2.1496, df = 3, p-value = 0.5419

library(car)
leveneTest(Mileage ~ Brands)   # called levene.test() in older versions of car

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  3  0.6946 0.5592
      56

Both tests confirm the homogeneity of variance across the four brands of tyre, as we cannot reject the null hypothesis of homogeneity of variances in either case.

The Conclusion

The mileages of the four brands of tyre are not all equal. The results show that, by Tukey's Honestly Significant Differences, all pairs of brands differ significantly in mileage except the pair CEAT-Apollo. More specifically, the pair-wise difference between Bridgestone and Apollo is found to be -3.019, which means that Apollo has higher mileage than Bridgestone, and this difference is statistically significant.

    Related Post

    1. Cubic and Smoothing Splines in R
    2. Chi-Squared Test – The Purpose, The Math, When and How to Implement?
    3. Missing Value Treatment
    4. R for Publication by Page Piccinini
    5. Assessing significance of slopes in regression models with interaction


    Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis


    (This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers)

    Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or "drop" method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches to distinguishing monoclonal proteins from the background gamma proteins.

    The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

    plot of chunk unnamed-chunk-1

    The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

    plot of chunk unnamed-chunk-2

    During the discussion it was pointed out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

    plot of chunk unnamed-chunk-3

    There was discussion that this procedure is challenging for a number of reasons. Further, it should be noted that there will only likely be any clinical value in a deconvolution approach when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

    Easy Peaks

    Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

    plot of chunk unnamed-chunk-4

    with a gamma region looking like this:

    plot of chunk unnamed-chunk-5

    can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

    plot of chunk unnamed-chunk-6

    and the area of the green monoclonal peak is found to be 5.3%.

    More Difficult Peaks

    What is more challenging is the problem of small monoclonals buried in normal \(\gamma\)-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

    plot of chunk unnamed-chunk-7

    Getting the Data

    EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.
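
    For illustration only, here is a minimal sketch of loading such extracted coordinates into R; this is not the author's actual extraction code, and the file name and column order are hypothetical. It simply produces the ep.data dataframe of (x, y) pairs used in the rest of the post.

    # Hypothetical file of (x, y) page coordinates extracted from the PDF
    ep.data <- read.csv("ep_scan_coordinates.csv", header = FALSE,
                        col.names = c("x", "y"))
    ep.data <- ep.data[order(ep.data$x), ]  # ensure the trace runs left to right
    str(ep.data)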

    Smoothing

    The data have 10284 pairs of (x, y) values. But if you zoom in and look carefully, you find that the trace is a series of staircases.

    plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

    plot of chunk unnamed-chunk-8

    It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible, though this probably does not matter too much.

    ep.rle <- rle(ep.data$y)
    stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)
    ep.data.sm <- ep.data[stair.midpoints,]

    plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)
    points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

    plot of chunk unnamed-chunk-9

    Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

    ep.data <- ep.data.sm

    Transformation

    The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.

    library(Bolstad) # A package containing a function for Simpson's Rule integration

    ep.data$x <- ep.data$x/max(ep.data$x)
    A.tot <- sintegral(ep.data$x, ep.data$y)$value
    ep.data$y <- ep.data$y/A.tot

    # sanity check
    sintegral(ep.data$x, ep.data$y)$value

    ## [1] 1

    plot(y~x, data = ep.data, type = "l")

    plot of chunk unnamed-chunk-11

    Find Extrema

    Using the findPeaks function from the quantmod package we can find the minima and maxima:

    library(quantmod)

    ep.max <- findPeaks(ep.data$y)
    plot(y~x, data = ep.data, type = "l", main = "Maxima")
    abline(v = ep.data$x[ep.max], col = "red", lty = 2)

    plot of chunk unnamed-chunk-12

    ep.min <- findValleys(ep.data$y)
    plot(y~x, data = ep.data, type = "l", main = "Minima")
    abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

    plot of chunk unnamed-chunk-12

    Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

    ep.max <- ep.max[-1]
    ep.min <- ep.min[-c(1,length(ep.min))]

    Fitting

    Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of \(\mu_i\) and \(\sigma_i\)) but there is no particular clinical value to fitting the albumin, \(\alpha_1\), \(\alpha_2\), \(\beta_1\) and \(\beta_2\) domains with Gaussians. What is of interest is separately quantifying the two peaks in \(\gamma\) with two separate Gaussians so let's isolate the \(\gamma\) region based on the location of the minimum between \(\beta_2\) and \(\gamma\).

    Isolate the \(\gamma\) Region

    gamma.ind <- max(ep.min):nrow(ep.data)
    gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])
    plot(y ~ x, gamma.data, type = "l")

    plot of chunk unnamed-chunk-14

    Attempt Something that Ultimately Does Not Work

    At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

    \[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

    where \(\mu_1\) is the \(x\) location of the first peak in \(\gamma\) and \(\mu_2\) is the \(x\) location of the second peak in \(\gamma\). The estimates of \(\sigma_1\) and \(\sigma_2\) can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to \(\sigma\) by

    \[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]

    I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the \(FWHM\). Because the peaks are poorly resolved, it also tries to get the smallest possible estimate, returning this as FWHM2.

    library(dplyr) # provides %>% and filter(), used here and below

    FWHM.finder <- function(ep.data, mu.index){
      peak.height <- ep.data$y[mu.index]
      fxn.for.roots <- ep.data$y - peak.height/2
      indices <- 1:nrow(ep.data)
      root.indices <- which(diff(sign(fxn.for.roots)) != 0)
      tmp <- c(root.indices, mu.index) %>% sort
      tmp2 <- which(tmp == mu.index)
      first.root <- root.indices[tmp2 - 1]
      second.root <- root.indices[tmp2]
      HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]
      HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]
      FWHM <- HWHM2 + HWHM1
      FWHM2 <- 2*min(c(HWHM1, HWHM2))
      return(list(HWHM1 = HWHM1, HWHM2 = HWHM2, FWHM = FWHM, FWHM2 = FWHM2))
    }

    The peak in the \(\gamma\) region was obtained previously:

    plot(y ~ x, gamma.data, type = "l")
    gamma.max <- findPeaks(gamma.data$y)
    abline(v = gamma.data$x[gamma.max])

    plot of chunk unnamed-chunk-16

    and from them \(\mu_1\) is determined to be 0.7. We have to guess where the second peak is, which is at about \(x=0.75\) and has an index of 252 in the gamma.data dataframe.

    gamma.data[252,]

    ##             x         y
    ## 252 0.7487757 0.6381026

    # append the second peak
    gamma.max <- c(gamma.max, 252)
    gamma.mu <- gamma.data$x[gamma.max]
    gamma.mu

    ## [1] 0.6983350 0.7487757

    plot(y ~ x, gamma.data, type = "l")
    abline(v = gamma.data$x[gamma.max])

    plot of chunk unnamed-chunk-17

    Now we can find the estimates of the standard deviations:

    # find the FWHM estimates of sigma_1 and sigma_2:
    FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)
    gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

    The estimates of \(\sigma_1\) and \(\sigma_2\) are now obtained. The estimates of \(C_1\) and \(C_2\) are just the peak heights.

    peak.heights <- gamma.data$y[gamma.max]

    We can now use nls() to determine the fit.

    fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                    C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
               data = gamma.data,
               start = list(mean1 = gamma.mu[1],
                            mean2 = gamma.mu[2],
                            sigma1 = gamma.sigma[1],
                            sigma2 = gamma.sigma[2],
                            C1 = peak.heights[1],
                            C2 = peak.heights[2]),
               algorithm = "port")

    Determining the fitted values of our unknown coefficients:

    dffit <- data.frame(x = seq(0, 1, 0.001))
    dffit$y <- predict(fit, newdata = dffit)

    fit.sum <- summary(fit)
    fit.sum # show the fitted coefficients

    ## 
    ## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - 
    ##     mean2)^2/(2 * sigma2^2)))
    ## 
    ## Parameters:
    ##         Estimate Std. Error t value Pr(>|t|)    
    ## mean1  0.7094793  0.0003312 2142.23   <2e-16 ***
    ## mean2  0.7813900  0.0007213 1083.24   <2e-16 ***
    ## sigma1 0.0731113  0.0002382  306.94   <2e-16 ***
    ## sigma2 0.0250850  0.0011115   22.57   <2e-16 ***
    ## C1     0.6983921  0.0018462  378.29   <2e-16 ***
    ## C2     0.0819704  0.0032625   25.12   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.01291 on 611 degrees of freedom
    ## 
    ## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

    coef.fit <- fit.sum$coefficients[,1]
    mu.fit <- coef.fit[1:2]
    sigma.fit <- coef.fit[3:4]
    C.fit <- coef.fit[5:6]

    And now we can plot the fitted results against the original results:

    # original
    plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage")

    # overall fit
    lines(y ~ x, data = dffit, col = "red", cex = 0.2)
    legend("topright", lty = c(1,1,1,1),
           col = c("black", "green", "blue", "red"),
           c("Scan", "Monoclonal", "Gamma", "Sum"))

    # components of the fit
    for(i in 1:2){
      x <- dffit$x
      y <- C.fit[i] * exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
      lines(x, y, col = i + 2)
    }

    plot of chunk unnamed-chunk-22

    And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the \(\gamma\) background, and the red curve is their sum, the overall fit. This is a horrible failure.

    Subsequently, I tried fixing the locations of \(\mu_1\) and \(\mu_2\) but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

    I thought maybe the difficulty in obtaining a reasonable fit was related to the sloping baseline, so I decided to try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

    Baseline Removal

    I will arbitrarily define the tail of the \(\gamma\) region to be those values having \(y \leq 0.02\). Then I will connect the first (x,y) point from the \(\gamma\) region and connect it to the tail.

    gamma.tail <- filter(gamma.data, y <= 0.02)
    baseline.data <- rbind(gamma.data[1,], gamma.tail)
    names(baseline.data) <- c("x","y")
    baseline.fun <- approxfun(baseline.data)

    plot(y~x, data = gamma.data, type = "l")
    lines(baseline.data$x, baseline.fun(baseline.data$x), col = "blue")

    plot of chunk unnamed-chunk-24

    Now we can define a new dataframe gamma.no.base that has the baseline removed:

    gamma.no.base <- data.frame(x = gamma.data$x,
                                y = gamma.data$y - baseline.fun(gamma.data$x))

    plot(y~x, data = gamma.data, type = "l")
    lines(y ~ x, data = gamma.no.base, lty = 2)

    gamma.max <- findPeaks(gamma.no.base$y)[1:2] # rejects a number of extraneous peaks
    abline(v = gamma.no.base$x[gamma.max])

    plot of chunk unnamed-chunk-25

    The black curve is the original \(\gamma\) and the dashed curve has the baseline removed. This becomes an easy fit.

    # Estimate the Ci
    peak.heights <- gamma.no.base$y[gamma.max]

    # Estimate the mu_i
    gamma.mu <- gamma.no.base$x[gamma.max] # the same values as before

    # Estimate the sigma_i from the FWHM
    FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)
    gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

    # Perform the fit
    fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                    C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
               data = gamma.no.base,
               start = list(mean1 = gamma.mu[1],
                            mean2 = gamma.mu[2],
                            sigma1 = gamma.sigma[1],
                            sigma2 = gamma.sigma[2],
                            C1 = peak.heights[1],
                            C2 = peak.heights[2]),
               algorithm = "port")

    # Plot the fit
    dffit <- data.frame(x = seq(0, 1, 0.001))
    dffit$y <- predict(fit, newdata = dffit)
    fit.sum <- summary(fit)
    coef.fit <- fit.sum$coefficients[,1]
    mu.fit <- coef.fit[1:2]
    sigma.fit <- coef.fit[3:4]
    C.fit <- coef.fit[5:6]

    plot(y ~ x, data = gamma.no.base, type = "l")
    legend("topright", lty = c(1,1,1,1),
           col = c("black", "green", "blue", "red"),
           c("Scan", "Monoclonal", "Gamma", "Sum"))
    lines(y ~ x, data = dffit, col = "red", cex = 0.2)
    for(i in 1:2){
      x <- dffit$x
      y <- C.fit[i] * exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
      lines(x, y, col = i + 2)
    }

    plot of chunk unnamed-chunk-26

    Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the \(\gamma\) background and the red is their sum, that is, the overall fit. A better fit could now be sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since

    \[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma C_1\]

    A.mono <- sqrt(2*pi)*sigma.fit[1]*C.fit[1] %>% unname()
    A.mono <- round(A.mono, 3)
    A.mono

    ## sigma1 
    ##  0.024

    So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal \(\gamma\)-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the \(\gamma\)-region, you'd have to set up your model accordingly.

    All about that Base(line)

    There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast \(\gamma\) dwarfed by normal \(\gamma\)-globulins. That is, like this:

    plot of chunk unnamed-chunk-28

    Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.
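
    As one illustration only (not part of the original analysis), a curvilinear baseline could be sketched by passing a smoothing spline through the low-lying points of the region instead of a straight line; the 0.02 threshold, the use of smooth.spline(), and the degrees of freedom chosen here are all assumptions for the sake of the example.

    # Hypothetical curvilinear baseline: anchor on the low-lying points of the
    # gamma region (y <= 0.02, an arbitrary cut-off) plus the first point, and
    # fit a smoothing spline through them.
    low.pts <- rbind(gamma.data[1, ], subset(gamma.data, y <= 0.02))
    base.spline <- smooth.spline(low.pts$x, low.pts$y, df = 4)  # df chosen by eye

    plot(y ~ x, data = gamma.data, type = "l")
    lines(predict(base.spline, gamma.data$x), col = "blue", lty = 3)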

    There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

    Conclusions

    • The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.

    • Integrating monoclonal peaks under the \(\gamma\) baseline (or \(\beta\)) is unlikely to be a one-size-fits all approach and may require application of a number of strategies to get meaningful results.

      • Baseline removal might be helpful at times.
    • Peak integration will require human adjudication.

    • While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.

    • MASSFIX may soon make this entire discussion irrelevant.

    Parting Thought

    On the matter of fitting

    In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

    Heb 2:10


    To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian.


    Neat New seplyr Feature: String Interpolation


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.

    Safety

    This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.

    The method is easiest to see with an example:

    library("seplyr")
    ## Loading required package: wrapr
    ratio <- 2
    compCol1 <- "Sepal.Width"
    expr <- expand_expr("Sepal.Length" >= ratio * compCol1)

    print(expr)
    ## [1] "Sepal.Length >= ratio * Sepal.Width"

    expand_expr works by capturing the user supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

    Notice in the above that one layer of quoting was removed from "Sepal.Length" and the name referred to by “compCol1” was substituted into the expression. “ratio” was left alone as it was not referring to a string (and hence cannot be a name; unbound or free variables are also left alone). So we see that the substitution performed does depend on what values are present in the environment.

    If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:

    expand_expr("Sepal.Length">= "ratio" *compCol1)
    ## [1] "Sepal.Length >= ratio * Sepal.Width"

    After the substitution the returned quoted expression is exactly in the form seplyr expects. For example:

    resCol1 <- "Sepal_Long"datasets::iris %.>%mutate_se(.,             resCol1 :=expr) %.>%head(.)
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long## 1          5.1         3.5          1.4         0.2  setosa      FALSE## 2          4.9         3.0          1.4         0.2  setosa      FALSE## 3          4.7         3.2          1.3         0.2  setosa      FALSE## 4          4.6         3.1          1.5         0.2  setosa      FALSE## 5          5.0         3.6          1.4         0.2  setosa      FALSE## 6          5.4         3.9          1.7         0.4  setosa      FALSE

    Details on %.>% (dot pipe) and := (named map builder) can be found here and here respectively. The idea is: seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width") should be equivalent to dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width).
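
    As a quick check of that claimed equivalence (not from the original post), one can run both forms and compare the results; this assumes dplyr is installed alongside seplyr and that ratio is still 2 from above.

    library("dplyr")

    res_se <- seplyr::mutate_se(datasets::iris,
                                "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width")
    res_dplyr <- dplyr::mutate(datasets::iris,
                               Sepal_Long = Sepal.Length >= ratio * Sepal.Width)

    all.equal(res_se, res_dplyr)  # should report TRUE if the two forms agree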

    seplyr also provides a number of seplyr::*_nse() convenience forms wrapping all of these steps into one operation. For example:

    datasets::iris %.>%
      mutate_nse(., resCol1 := "Sepal.Length" >= ratio * compCol1) %.>%
      head(.)

    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
    ## 1          5.1         3.5          1.4         0.2  setosa      FALSE
    ## 2          4.9         3.0          1.4         0.2  setosa      FALSE
    ## 3          4.7         3.2          1.3         0.2  setosa      FALSE
    ## 4          4.6         3.1          1.5         0.2  setosa      FALSE
    ## 5          5.0         3.6          1.4         0.2  setosa      FALSE
    ## 6          5.4         3.9          1.7         0.4  setosa      FALSE

    To use string literals you merely need one extra layer of quoting:

    "is_setosa" :=expand_expr(Species == "'setosa'")
    ##               is_setosa ## "Species == \"setosa\""
    datasets::iris %.>%transmute_nse(.,              "is_setosa" :=Species == "'setosa'") %.>%summary(.)
    ##  is_setosa      ##  Mode :logical  ##  FALSE:100      ##  TRUE :50

    The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:

    countMatches <- function(data, columnName, targetValue) {
      # extra quotes to say we are interested in value, not de-reference
      targetSym <- paste0('"', targetValue, '"')
      data %.>%
        transmute_nse(., "match" := columnName == targetSym) %.>%
        group_by_se(., "match") %.>%
        summarize_se(., "count" := "n()")
    }

    countMatches(datasets::iris, "Species", "setosa")

    ## # A tibble: 2 x 2
    ##   match count
    ##    
    ## 1 FALSE   100
    ## 2  TRUE    50

    The purpose of the seplyr string system is to pull off quotes and de-reference indirect variables. So, you need to remember to add enough extra quotation marks to prevent this where you do not want it.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Packages to simplify mapping in R


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Computerworld's Sharon Machlis has published a very useful tutorial on creating geographic data maps with R. (The tutorial was actually published back in March, but I only came across it recently.) While it's been possible to create maps in R for a long time, some recent packages and data APIs have made the process much simpler. The tutorial is based on the following R packages:

    • sf, a package that provides simple data structures for objects representing spatial geometries including map features like borders and rivers
    • tmap and tmaptools, packages for creating static and interactive "thematic maps" like choropleths and spatial point maps using a ggplot2-like syntax
    • tigris, a package that provides shapefiles you can use to map the USA and its states, counties, and census tracts (very useful for visualizing US census data, which you can easily obtain with the tidycensus package)
    • rio, a package to streamline the import of flat files from third-party data sources like the U.S. Bureau of Labor Statistics

    Tmap

    With those packages, as you'll learn in the tutorial linked below, you can easily create attractive maps to visualize geographic data, and even make those graphics interactive with scrolling, zooming, and data pop-ups thanks to the capabilities of the leaflet package. All of these packages are now available on CRAN, so there's no need to install from Github as the tutorial then suggested, unless you want access to even newer in-development capabilities.
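
    To make this concrete, here is a minimal sketch (not taken from the tutorial) of the tmap workflow described above; it assumes the example World dataset shipped with tmap and its life_exp column, so adjust the names to whatever your installed version provides.

    library(sf)
    library(tmap)

    data("World", package = "tmap")   # example world polygons bundled with tmap

    # static choropleth of life expectancy by country
    tm_shape(World) +
      tm_polygons("life_exp", title = "Life expectancy")

    # tmap_mode("view")  # switch to an interactive, leaflet-backed map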

    Computerworld: Mapping in R just got a whole lot easier


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Le Monde puzzle [#1018]


    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    An arithmetic Le Monde mathematical puzzle (that first did not seem to involve R programming because of the large number of digits in the quantity involved):

    An integer x with less than 100 digits is such that adding the digit 1 on both sides of x produces the integer 99x.  What are the last nine digits of x? And what are the possible numbers of digits of x?

    The integer x satisfies the identity

    \[10^{\omega+1}+10x+1=99x\]

    where ω is the number of digits of x. This amounts to

    10….01 = 89 x,

    where there are ω zeros. Working with long integers in R could bring an immediate solution, but I went for a pedestrian version, handling each digit at a time and starting from the final one which is necessarily 9:

    #multiply by 9
    rap=0;row=NULL
    for (i in length(x):1){
      prud=rap+x[i]*9
      row=c(prud%%10,row)
      rap=prud%/%10}
    row=c(rap,row)
    #multiply by 80
    rep=raw=0
    for (i in length(x):1){
      prud=rep+x[i]*8
      raw=c(prud%%10,raw)
      rep=prud%/%10}
    #find next digit
    y=(row[1]+raw[1]+(length(x)>1))%%10

    returning

    7 9 7 7 5 2 8 0 9

    as the (only) last digits of x. The same code can be exploited to check that the complete multiplication produces a number of the form 10….01, hence to deduce that the length of x is either 21 or 65, with solutions

    [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9
    [1] 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 8 9 8 8 7 6 4 0 4 4 9 4 3 8 2 0 2 2
    [39] 4 7 1 9 1 0 1 1 2 3 5 9 5 5 0 5 6 1 7 9 7 7 5 2 8 0 9

    The maths question behind is to figure out the powers k of 10 such that

    \[10^k\equiv -1 \text{ mod } (89)\]

    For instance, 10²≡11 mod (89) and 11¹¹≡88 mod (89) leads to the first solution ω=21. And then, since 10⁴⁴≡1 mod (89), ω=21+44=65 is another solution…
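
    A quick brute-force check of this congruence in R (not part of the original post) recovers the same two digit counts; since x has fewer than 100 digits, only exponents k = ω + 1 up to 100 are relevant.

    # find the powers k <= 100 with 10^k = -1 mod 89, i.e. remainder 88
    r <- 1
    ks <- integer(0)
    for (k in 1:100) {
      r <- (10 * r) %% 89
      if (r == 88) ks <- c(ks, k)
    }
    ks      # 22 66
    ks - 1  # 21 65, matching the two possible lengths of x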

    Filed under: Books, Kids, R Tagged: arithmetics, competition, Le Monde, long division, mathematical puzzle, R


    To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


    6 R Jobs for R users (2017-08-28) – from all over the world


    To post your R job on the next post

    Just visit this link and post a new R job to the R community.

    You can post a job for free (and there are also “featured job” options available for extra exposure).

    Current R jobs

    Job seekers: please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs

    1. Full-Time
      Technical Project Manager Virginia Tech Applied Research Corporation – Posted by miller703
      Arlington, Virginia, United States
      15 Jul 2017
    2. Full-Time
      Software Developer Virginia Tech Applied Research Corporation – Posted by miller703
      Arlington, Virginia, United States
      15 Jul 2017
    3. Full-Time
      Solutions Engineer RStudio – Posted by nwstephens
      Anywhere
      14 Jul 2017
    4. Full-Time
      Customer Success Rep RStudio – Posted by jclemens1
      Anywhere
      12 Jul 2017

     

    More New Jobs

    1. Freelance
      Optimization Expert IdeaConnection – Posted by Sherri Ann O’Gorman
      Anywhere
      22 Aug 2017
    2. Full-Time
      Data journalist for The Economist @London The Economist – Posted by cooberp
      London, England, United Kingdom
      18 Aug 2017
    3. Full-Time
      Technical Business Analyst Investec Asset Management – Posted by IAM
      London, England, United Kingdom
      14 Aug 2017
    4. Full-Time
      Senior Data Scientist @ Dallas, Texas, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
      Dallas, Texas, United States
      8 Aug 2017
    5. Full-Time
      Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
      Mesa, Arizona, United States
      31 Jul 2017
    6. Full-Time
      Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
      Philadelphia, Pennsylvania, United States
      31 Jul 2017

     

    In R-users.com you can see all the R jobs that are currently available.

    R-users Resumes

    R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

    (you may also look at previous R jobs posts).

    r_jobs


    Shiny Dev Center gets a shiny new update


    I am excited to announce the redesign and reorganization of shiny.rstudio.com, also known as the Shiny Dev Center. The Shiny Dev Center is the place to go to learn about all things Shiny and to keep up to date with it as it evolves.

    Shiny Dev Center

    The goal of this refresh is to provide a clear learning path for those who are just starting off with developing Shiny apps, as well as to make advanced Shiny topics easily accessible to those building large and complex apps. The articles overview, which we designed to help navigate the wealth of information on the Shiny Dev Center, aims to achieve this goal.

    Articles overview

    Other highlights of the refresh include:

    • A brand new look!
    • New articles
    • Updated articles with modern Shiny code examples
    • Explicit linking, where relevant, to other RStudio resources like webinars, support docs, etc.
    • A prominent link to our ever growing Shiny User Showcase
    • A guide for contributing to Shiny (inspired by the Tidyverse contribute guide)

    Stay tuned for more updates to the Shiny Dev Center in the near future!


    RStudio::Conf 2018


    (This article was first published on R Views, and kindly contributed to R-bloggers)

    It’s not even Labor Day, so it seems to be a bit early to start planning for next year’s R conferences. But, early-bird pricing for RStudio::Conf 2018 ends this Thursday.

    The conference, which will be held in San Diego between January 31st and February 3rd, promises to match and even surpass this year’s event. In addition to keynotes from Di Cook (Monash University and Iowa State University), J.J. Allaire (RStudio Founder, CEO & Principal Developer), Shiny creator Joe Cheng, and Chief Scientist Hadley Wickham, a number of knowledgeable (and entertaining) speakers have already committed including quant, long-time R user and twitter humorist JD Long (@CMastication), Stack Overflow’s David Robinson (@drob) and ProPublica editor Olga Pierce (@olgapierce).

    Making the deadline for early-bird pricing will get you a significant savings. The “full seat” price for the conference is $495 before midnight EST August 31, 2017 and $695 thereafter. You can register here.

    For a good idea of the kinds of things you can expect to learn at RStudio::Conf 2018 have a look at the videos from this year’s event.



    To leave a comment for the author, please follow the link and comment on their blog: R Views.



    RcppSMC 0.2.0


    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new version 0.2.0 of the RcppSMC package arrived on CRAN earlier today (as a very quick pretest-publish within minutes of submission).

    RcppSMC provides Rcpp-based bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen described in his JSS article.

    This release 0.2.0 is chiefly the work of Leah South, a Ph.D. student at Queensland University of Technology, who was during the last few months a Google Summer of Code student mentored by Adam and myself. It was a pleasure to work with Leah on this, and to see her progress. Our congratulations to Leah for a job well done!

    Changes in RcppSMC version 0.2.0 (2017-08-28)

    • Also use .registration=TRUE in useDynLib in NAMESPACE

    • Multiple Sequential Monte Carlo extensions (Leah South as part of Google Summer of Code 2017)

      • Switching to population level objects (#2 and #3).

      • Using Rcpp attributes (#2).

      • Using automatic RNGscope (#4 and #5).

      • Adding multiple normalising constant estimators (#7).

      • Static Bayesian model example: linear regression (#10 addressing #9).

      • Adding a PMMH example (#13 addressing #11).

      • Framework for additional algorithm parameters and adaptation (#19 addressing #16; also #24 addressing #23).

      • Common adaptation methods for static Bayesian models (#20 addressing #17).

      • Supporting MCMC repeated runs (#21).

      • Adding adaptation to linear regression example (#22 addressing #18).

    Courtesy of CRANberries, there is a diffstat report for this release.

    More information is on the RcppSMC page. Issues and bugreports should go to the GitHub issue tracker.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


    rtimicropem: Using an *R* package as platform for harmonized cleaning of data from RTI MicroPEM air quality sensors


    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    As you might remember from my blog post about ropenaq, I work as a data manager and statistician for an epidemiology project called CHAI for Cardio-vascular health effects of air pollution in Telangana, India. One of our interests in CHAI is determining exposure, and sources of exposure, to PM2.5 which are very small particles in the air that have diverse adverse health effects. You can find more details about CHAI in our recently published protocol paper. In this blog post that partly corresponds to the content of my useR! 2017 lightning talk, I'll present a package we wrote for dealing with the output of a scientific device, which might remind you of similar issues in your experimental work.

    Why write the rtimicropem package?

    Part of the CHAI project is a panel study involving about 40 people wearing several devices, as you see above. The devices include a GPS, an accelerometer, a wearable camera, and a PM2.5 monitor outputting time-resolved data (the grey box on the left). Basically, with this device, the RTI MicroPEM, we get one PM2.5 exposure value every 10 seconds. This is quite exciting, right? Except that we have two main issues with it…

    First of all, the output of the device, a file with a ".csv" extension corresponding to a session of measurements, in our case 24 hours of measurements, is not really a csv. The header contains information about settings of the device for that session, and then comes the actual table with measurements.
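
    To get a feel for the problem, here is a hand-rolled sketch of what the package automates; the file name and the "Date" marker used to locate the start of the measurement table are hypothetical, and this is not the package's actual parsing logic.

    raw_lines <- readLines("micropem_session.csv")       # hypothetical session file
    table_start <- grep("^Date", raw_lines)[1]           # assumed header row of the measurement table
    settings_part <- raw_lines[seq_len(table_start - 1)] # device settings for the session
    measurements <- read.csv("micropem_session.csv",
                             skip = table_start - 1,
                             stringsAsFactors = FALSE)
    str(measurements)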

    Second, since the RTI MicroPEMs are nice devices but also a work-in-progress, we had some problems with the data, such as negative relative humidity. Because of these issues, we decided to write an R package whose three goals were to:

    • Transform the output of the device into something more usable.

    • Allow the exploration of individual files after a day in the field.

    • Document our data cleaning process.

    We chose R because everything else in our project, well data processing, documentation and analysis, was to be implemented in R, and because we wanted other teams to be able to use our package.

    Features of rtimicropem: transform, explore and learn about data cleaning

    First things first, our package lives here and is on CRAN. It has a nice documentation website thanks to pkgdown.

    Transform and explore single files

    In rtimicropem, after using the convert_output function, one gets an object of the R6 micropem class. Its fields include the settings and the measurements as two data.frames, and it has methods such as summary and plot, for which you see the static output below (no unit on this exploratory plot).

    The plot method can also output an interactive graph thanks to rbokeh.
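
    A minimal usage sketch based on the description above; the file path is hypothetical and the exact interface may differ slightly across package versions.

    library("rtimicropem")

    pem <- convert_output("chai_participant_01.csv")  # hypothetical session file

    pem$settings      # data.frame of device settings from the file header
    pem$measurements  # data.frame of time-resolved measurements
    pem$summary()     # summary of the measurements
    pem$plot()        # exploratory plot of the session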

    While these methods can be quite helpful to explore single files as an R user, they don't help non R users a lot. Because we wanted members of our team working in the field to be able to explore and check files with no R knowledge, we created a Shiny app that allows users to upload individual files and then look at different tabs, including one with a plot, one with the summary of measurements, etc. This way, it was easy to spot a device failure for instance, and to plan a new measurement session with the corresponding participant.

    Transform a bunch of files

    At the end of the CHAI data collection, we had more than 250 MicroPEM files. In order to prepare them for further processing we wrote the batch_convert function that saves the content of any number of MicroPEM files as two (real!) csv, one with the measurements, one with the settings.

    Learn about data cleaning

    As mentioned previously, we experienced issues with MicroPEM data quality. Although we had heard other teams complain of similar problems, in the literature there were very few details about data cleaning. We decided to gather information from other teams and the manufacturer and to document our own decisions, e.g. remove entire files based on some criteria, in a vignette of the package. This is our transparent answer to the question "What was your experience with MicroPEMs?" which we get often enough from other scientists interested in PM2.5 exposure.

    Place of rtimicropem in the R package ecosystem

    When preparing the rtimicropem submission to rOpenSci, I started wondering whether one would like to have one R package for each scientific device out there. In our case, having the weird output to deal with, and the lack of a central place to document data issues, were enough of a motivation. But maybe one could hope that manufacturers of scientific devices would focus a bit more on making the output format analysis-friendly, and that the open documentation of data issues would be language-agnostic and managed by the manufacturers themselves. In the meantime, we're quite proud to have taken the time to create and share our experience with rtimicropem, and have already heard back from a few users, including one who found the package via googling "RTI MicroPEM data"! Another argument I personally have for writing R packages that deal with scientific data is that it might motivate people to learn R, but this is maybe a bit evil.

    What about the place of rtimicropem in the rOpenSci package collection? After very useful reviews by Lucy D'Agostino McGowan and Kara Woo our package got onboarded which we were really thankful for and happy about. Another package I can think off the top of my head to deal with the output of a scientific tool is plater. Let me switch roles from CHAI team member to rOpenSci onboarding co-editor here and do some advertisement… Such packages are unlikely to become the new ggplot2 but their specialization doesn't make them less useful and they fit very well in the "data extraction" of the onboarding categories. So if you have written such a package, please consider submitting it! It'll get better thanks to review and might get more publicity as part of a larger software ecosystem. For the rtimicropem submission we took advantage of the joint submission process of rOpenSci and the Journal of Open Source Software, JOSS, so now our piece of software has its JOSS paper with a DOI. And hopefully, having more submissions of packages for scientific hardware might inspire R users to package up the code they wrote to use the output of their scientific tools!


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


    Tidyverse practice: mapping large European cities


    (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

    As noted in several recent posts, when you’re learning R and R’s Tidyverse packages, it’s important to break everything down into small units that you can learn.

    What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent.

    But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual piece together, you solidify your knowledge of how they work individually but also begin to learn how you can combine small tools together to create novel effects.

    With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.

    As we do this, pay attention:

    • How many packages and functions do you really need?
    • Evaluate: how long would it really take to memorize each individual function? (Hint: it’s much, much less time than you think.)
    • Which functions have you seen before? Are some of the functions and techniques used more often than others (if you look across many different analyses)?

    Ok, with those questions in mind, let’s get after it.

    First we’ll just load a few packages.

    #==============
    # LOAD PACKAGES
    #==============

    library(rvest)
    library(tidyverse)
    library(ggmap)
    library(stringr)

    Next, we’re going to use the rvest package to scrape data from Wikipedia. The data that we are gathering is data about the largest cities in Europe. You can read more about the data on Wikipedia.

    #===========================
    # SCRAPE DATA FROM WIKIPEDIA
    #===========================

    html.population <- read_html('https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits')

    df.euro_cities <- html.population %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table()

    # inspect
    df.euro_cities %>% head()
    df.euro_cities %>% names()

    Here at Sharp Sight, we haven’t worked with the rvest package in too many examples, so you might not be familiar with it.

    Having said that, just take a close look. How many functions did we use from rvest? Could you memorize them? How long would it take?

    Ok. Now we’ll do a little data cleaning.

    First, we’re going to remove some of the variables using dplyr::select(). We are using the minus sign (‘-‘) in front of the names of the variables that we want to remove.

    #============================
    # REMOVE EXTRANEOUS VARIABLES
    #============================

    df.euro_cities <- select(df.euro_cities, -Date, -Image, -Location, -`Ref.`, -`2011 Eurostat\npopulation[3]`)

    # inspect
    df.euro_cities %>% names()

    After removing the variables that we don’t want, we only have four variables. These remaining raw variable names could be cleaned up a little.

    Ideally, we want names that are lower case (because they are easier to type). We also want variable names that are brief and descriptive.

    In this case, renaming these variables to be brief, descriptive, and lower-case is fairly straightforward. Here, we will use very simple variable names: like rank, city, country, and population.

    To add these new variable names, we can simply assign them by using the colnames() function.

    #===============
    # RENAME COLUMNS
    #===============

    colnames(df.euro_cities) <- c("rank", "city", "country", "population")

    # inspect
    df.euro_cities %>% names()
    df.euro_cities %>% head()

    Now that we have clean variable names, we will do a little modification of the data itself.

    When we scraped the data from Wikipedia, some extraneous characters appeared in the population variable. Essentially, there were some leading digits and special characters that appear to be useless artifacts of the scraping process. We want to remove these extraneous characters and parse the population data into a proper numeric.

    To do this, we will use a few functions from the stringr package.

    First, we use str_extract() to extract the population data. When we do this, we are extracting everything from the ‘♠’ character to the end of the string (note: to do this, we are using a regular expression in str_extract()).

    This is a quick way to get the numbers at the end of the string, but we actually don’t want to keep the ‘♠’ character. So, after we extract the population numbers (along with the ‘♠’), we then strip off the ‘♠’ character by using str_replace().

    #========================================================================
    # CLEAN UP VARIABLE: population
    # - when the data are scraped, there are some extraneous characters
    #   in the "population" variable.
    #   ... you can see leading numbers and some other items
    # - We will use stringr functions to extract the actual population data
    #   (and remove the stuff we don't want)
    # - We are executing this transformation inside dplyr::mutate() to
    #   modify the variable inside the dataframe
    #========================================================================

    df.euro_cities <- df.euro_cities %>%
      mutate(population = str_extract(population, "♠.*$") %>% str_replace("♠","") %>% parse_number())

    df.euro_cities %>% head()

    We will also do some quick data wrangling on the city names. Two of the city names on the Wikipedia page (Istanbul and Moscow) had footnotes. Because of this, those two city names had extra bracket characters when we read them in (e.g. “Istanbul[a]”).

    We want to strip off those footnotes. To do this we will once again use str_replace() to strip away the information that we don’t want.

    #==========================================================================
    # REMOVE "notes" FROM CITY NAMES
    # - two cities had extra characters for footnotes
    #   ... we will remove these using stringr::str_replace and dplyr::mutate()
    #==========================================================================

    df.euro_cities <- df.euro_cities %>% mutate(city = str_replace(city, "\\[.\\]",""))

    df.euro_cities %>% head()

    For the sake of making the data a little easier to explain, we’re going to filter the data to records where the population is over 1,000,000.

    Keep in mind: this is a straightforward use of dplyr::filter(); this is the sort of thing that you should be able to do with your eyes closed.

    #=========================
    # REMOVE CITIES UNDER 1 MM
    #=========================

    df.euro_cities <- filter(df.euro_cities, population >= 1000000)

    #=================
    # COERCE TO TIBBLE
    #=================

    df.euro_cities <- df.euro_cities %>% as_tibble()

    Before we map the cities on a map, we need to get geospatial information. That is, we need to geocode these records.

    To do this, we will use the geocode() function to get the longitude and latitude.

    After obtaining the geo data, we will join it back to the original data using cbind().

    #========================================================
    # GEOCODE
    # - here, we're just getting longitude and latitude data
    #   using ggmap::geocode()
    #========================================================

    data.geo <- geocode(df.euro_cities$city)

    df.euro_cities <- cbind(df.euro_cities, data.geo)

    # inspect
    df.euro_cities

    To map the data points, we also need a map that will sit in the background, underneath the points.

    We will use the function map_data() to get a world map.

    #==============
    # GET WORLD MAP
    #==============

    map.europe <- map_data("world")

    Now that the data are clean, and we have a world map, we will plot the data.

    #=================================
    # PLOT BASIC MAP
    # - this map is "just the basics"
    #=================================

    ggplot() +
      geom_polygon(data = map.europe, aes(x = long, y = lat, group = group)) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
      coord_cartesian(xlim = c(-9,45), ylim = c(32,70))

    This first plot is a “first iteration.” In this version, we haven’t done any serious formatting. It’s just a “first pass” to make sure that the data are in the right format. If we had found anything “out of line,” we would go back to an earlier part of the analysis and modify our code to correct any problems in the data.

    Based on this plot, it looks like the data are essentially correct.

    Now, we just want to “polish” the visualization by changing colors, fonts, sizes, etc.

    #====================================================
    # PLOT 'POLISHED' MAP
    # - this version is formatted and cleaned up a little
    #   just to make it look more aesthetically pleasing
    #====================================================

    #-------------
    # CREATE THEME
    #-------------

    theme.maptheme <-
      theme(text = element_text(family = "Gill Sans", color = "#444444")) +
      theme(plot.title = element_text(size = 32)) +
      theme(plot.subtitle = element_text(size = 16)) +
      theme(panel.grid = element_blank()) +
      theme(axis.text = element_blank()) +
      theme(axis.ticks = element_blank()) +
      theme(axis.title = element_blank()) +
      theme(legend.background = element_blank()) +
      theme(legend.key = element_blank()) +
      theme(legend.title = element_text(size = 18)) +
      theme(legend.text = element_text(size = 10)) +
      theme(panel.background = element_rect(fill = "#596673")) +
      theme(panel.grid = element_blank())

    #------
    # PLOT
    #------
    #fill = "#AAAAAA", colour = "#818181", size = .15)

    ggplot() +
      geom_polygon(data = map.europe, aes(x = long, y = lat, group = group), fill = "#DEDEDE", colour = "#818181", size = .15) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
      geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1) +
      coord_cartesian(xlim = c(-9,45), ylim = c(32,70)) +
      labs(title = "European Cities with Large Populations", subtitle = "Cities with over 1MM population, within city limits") +
      scale_size_continuous(range = c(.7,15), breaks = c(1100000, 4000000, 8000000, 12000000), name = "Population", labels = scales::comma_format()) +
      theme.maptheme

    Not too bad.

    Keep in mind that as a reader, you get to see the finished product: the finalized visualization and the finalized code.

    But as always, the process for creating a visualization like this is highly iterative. If you work on a similar project, expect to change your code dozens of times. You’ll change your data-wrangling code as you work with the data and identify new items you need to change or fix. You’ll also change your ggplot() visualization code multiple times as you try different colors, fonts, and settings.

    If you master the basics, the hard things never seem hard

    Creating this visualization is actually not terribly hard to do, but if you’re somewhat new to R, it might seem rather challenging.

    If you look at this, and it seems difficult then you need to understand: once you master the basics, the hard things never seem hard.

    What I mean by that, is that this visualization is nothing more than a careful application of a few dozen simple tools, arranged in a way to create something new.

    Once you master individual tools from ggplot2, dplyr, and the rest of the Tidyverse, projects like this become very easy to execute.

    Sign up now, and discover how to rapidly master data science

    To rapidly master data science, you need to master the essential tools.

    You need to know what tools are important, which tools are not important, and how to practice.

    Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

    Sign up now for our email list, and you’ll receive regular tutorials and lessons.

    You’ll learn:

    • What data science tools you should learn (and what not to learn)
    • How to practice those tools
    • How to put those tools together to execute analyses and machine learning projects
    • … and more

    If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

    SIGN UP NOW

    The post Tidyverse practice: mapping large European cities appeared first on SHARP SIGHT LABS.


    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.


    Rpad Domain Repurposed To Deliver Creepy (and potentially malicious) Content


    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    I was about to embark on setting up a background task to sift through R package PDFs for traces of functions that “omit NA values” as a surprise present for Colin Fay and Sir Tierney:

[Please RT] #RStats folks, @nj_tierney & I need your help for {naniar}! When does R silently drop/omit NA? https://t.co/V5elyGcG8Z pic.twitter.com/VScLXFCl2n

    — Colin Fay (@_ColinFay) August 29, 2017

When I got distracted by a PDF in the CRAN doc/contrib directory: Short-refcard.pdf. I’m not a big reference card user, but students really like them, and after seeing what it was I remembered having seen the document ages ago, though I had never associated it with CRAN before.

    I saw:

    by Tom Short, EPRI PEAC, tshort@epri-peac.com 2004-11-07 Granted to the public domain. See www. Rpad. org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission).

    at the top of the card. The link (which I’ve made unclickable for reasons you’ll see in a sec — don’t visit that URL) was clickable and I tapped it as I wanted to see if it had changed since 2004.

    You can open that image in a new tab to see the full, rendered site and take a moment to see if you can find the section that links to objectionable — and, potentially malicious — content. It’s easy to spot.

I made a likely correct assumption that Tom Short had nothing to do with this and wanted to dig into it a bit further. So, don your bestest deerstalker and follow along as we work out when this may have happened.

    Digging In Domain Land

    We’ll need some helpers to poke around this data in a safe manner:

library(wayback)       # devtools::install_github("hrbrmstr/wayback")
library(ggTimeSeries)  # devtools::install_github("AtherEnergy/ggTimeSeries")
library(splashr)       # devtools::install_github("hrbrmstr/splashr")
library(passivetotal)  # devtools::install_github("hrbrmstr/passivetotal")
library(cymruservices)
library(hrbrthemes)    # provides theme_ipsum_rc() used in the plots below
library(magick)
library(tidyverse)

    (You’ll need to get a RiskIQ PassiveTotal key to use those functions. Also, please donate to Archive.org if you use the wayback package.)
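(If you’re wondering where that key goes: a common pattern is to keep credentials out of scripts entirely, e.g. in ~/.Renviron, and read them at runtime with Sys.getenv(). The variable names in this sketch are purely hypothetical; check the passivetotal package documentation for the names it actually expects.)

# Hypothetical sketch: keep API credentials in ~/.Renviron, e.g.
#   PASSIVETOTAL_USER=you@example.com
#   PASSIVETOTAL_API_KEY=xxxxxxxxxxxx
# (variable names are placeholders), then read them at runtime:
pt_user <- Sys.getenv("PASSIVETOTAL_USER")
pt_key  <- Sys.getenv("PASSIVETOTAL_API_KEY")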

    Now, let’s see if the main Rpad content URL is in the wayback machine:

glimpse(archive_available("http://www.rpad.org/Rpad/"))
## Observations: 1
## Variables: 5
## $ url         "http://www.rpad.org/Rpad/"
## $ available   TRUE
## $ closet_url  "http://web.archive.org/web/20170813053454/http://ww...
## $ timestamp   2017-08-13
## $ status      "200"

    It is! Let’s see how many versions of it are in the archive:

x <- cdx_basic_query("http://www.rpad.org/Rpad/")

ts_range <- range(x$timestamp)

count(x, timestamp) %>%
  ggplot(aes(timestamp, n)) +
  geom_segment(aes(xend = timestamp, yend = 0)) +
  labs(x = NULL, y = "# changes in year",
       title = "rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid = "Y")

count(x, timestamp) %>%
  mutate(Year = lubridate::year(timestamp)) %>%
  complete(timestamp = seq(ts_range[1], ts_range[2], "1 day")) %>%
  filter(!is.na(timestamp), !is.na(Year)) %>%
  ggplot(aes(date = timestamp, fill = n)) +
  stat_calendar_heatmap() +
  viridis::scale_fill_viridis(na.value = "white", option = "magma") +
  facet_wrap(~Year, ncol = 1) +
  labs(x = NULL, y = NULL, title = "rpad.org Wayback Change Timeline") +
  theme_ipsum_rc(grid = "") +
  theme(axis.text = element_blank()) +
  theme(panel.spacing = grid::unit(0.5, "lines"))

    There’s a big span between 2008/9 and 2016/17. Let’s poke around there a bit. First 2016:

tm <- get_timemap("http://www.rpad.org/Rpad/")

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2016))
## # A tibble: 1 x 5
##       rel                                                                     link  type
## 1 memento  http://web.archive.org/web/20160629104907/http://www.rpad.org:80/Rpad/
## # ... with 2 more variables: from, datetime

(p2016 <- render_png(url = rurl$link))

    Hrm. Could be server or network errors.

    Let’s go back to 2009.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2009))
## # A tibble: 4 x 5
##       rel                                                                    link  type
## 1 memento     http://web.archive.org/web/20090219192601/http://rpad.org:80/Rpad
## 2 memento http://web.archive.org/web/20090322163146/http://www.rpad.org:80/Rpad
## 3 memento http://web.archive.org/web/20090422082321/http://www.rpad.org:80/Rpad
## 4 memento http://web.archive.org/web/20090524155658/http://www.rpad.org:80/Rpad
## # ... with 2 more variables: from, datetime

(p2009 <- render_png(url = rurl$link[4]))

    If you poke around that, it looks like the original Rpad content, so it was “safe” back then.

(rurl <- filter(tm, lubridate::year(anytime::anydate(datetime)) == 2017))
## # A tibble: 6 x 5
##       rel                                                                  link  type
## 1 memento  http://web.archive.org/web/20170323222705/http://www.rpad.org/Rpad
## 2 memento http://web.archive.org/web/20170331042213/http://www.rpad.org/Rpad/
## 3 memento http://web.archive.org/web/20170412070515/http://www.rpad.org/Rpad/
## 4 memento http://web.archive.org/web/20170518023345/http://www.rpad.org/Rpad/
## 5 memento http://web.archive.org/web/20170702130918/http://www.rpad.org/Rpad/
## 6 memento http://web.archive.org/web/20170813053454/http://www.rpad.org/Rpad/
## # ... with 2 more variables: from, datetime

(p2017 <- render_png(url = rurl$link[1]))

I won’t break your browser and add another giant image, but that one has the icky content. So it’s a relatively recent takeover, and it’s likely that whoever added the icky content links did so to try to ensure those domains and URLs have both good SEO and a positive reputation.

    Let’s see if they were dumb enough to make their info public:

rwho <- passive_whois("rpad.org")
str(rwho, 1)
## List of 18
##  $ registryUpdatedAt: chr "2016-10-05"
##  $ admin            :List of 10
##  $ domain           : chr "rpad.org"
##  $ registrant       :List of 10
##  $ telephone        : chr "5078365503"
##  $ organization     : chr "WhoisGuard, Inc."
##  $ billing          : Named list()
##  $ lastLoadedAt     : chr "2017-03-14"
##  $ nameServers      : chr [1:2] "ns-1147.awsdns-15.org" "ns-781.awsdns-33.net"
##  $ whoisServer      : chr "whois.publicinterestregistry.net"
##  $ registered       : chr "2004-06-15"
##  $ contactEmail     : chr "411233718f2a4cad96274be88d39e804.protect@whoisguard.com"
##  $ name             : chr "WhoisGuard Protected"
##  $ expiresAt        : chr "2018-06-15"
##  $ registrar        : chr "eNom, Inc."
##  $ compact          :List of 10
##  $ zone             : Named list()
##  $ tech             :List of 10

    Nope. #sigh

    Is this site considered “malicious”?

(rclass <- passive_classification("rpad.org"))
## $everCompromised
## NULL

    Nope. #sigh

    What’s the hosting history for the site?

rdns <- passive_dns("rpad.org")
rorig <- bulk_origin(rdns$results$resolve)

tbl_df(rdns$results) %>%
  type_convert() %>%
  select(firstSeen, resolve) %>%
  left_join(select(rorig, resolve = ip, as_name = as_name)) %>%
  arrange(firstSeen) %>%
  print(n = 100)
## # A tibble: 88 x 3
##              firstSeen        resolve                                              as_name
##  1 2009-12-18 11:15:20  144.58.240.79      EPRI-PA - Electric Power Research Institute, US
##  2 2016-06-19 00:00:00 208.91.197.132 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  3 2016-07-29 00:00:00  208.91.197.27 CONFLUENCE-NETWORK-INC - Confluence Networks Inc, VG
##  4 2016-08-12 20:46:15  54.230.14.253                     AMAZON-02 - Amazon.com, Inc., US
##  5 2016-08-16 14:21:17  54.230.94.206                     AMAZON-02 - Amazon.com, Inc., US
##  6 2016-08-19 20:57:04  54.230.95.249                     AMAZON-02 - Amazon.com, Inc., US
##  7 2016-08-26 20:54:02 54.192.197.200                     AMAZON-02 - Amazon.com, Inc., US
##  8 2016-09-12 10:35:41   52.84.40.164                     AMAZON-02 - Amazon.com, Inc., US
## ... (rows 9 through 85, on a steady churn of different IPs, all resolve to AMAZON-02 - Amazon.com, Inc., US) ...
## 86 2017-08-29 13:17:37 54.230.141.161                     AMAZON-02 - Amazon.com, Inc., US
## 87 2017-08-29 13:17:37  54.230.141.38                     AMAZON-02 - Amazon.com, Inc., US
## 88 2017-08-29 13:17:37 54.230.141.151                     AMAZON-02 - Amazon.com, Inc., US

    Unfortunately, I expected this. The owner keeps moving it around on AWS infrastructure.
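To make that pattern explicit, here’s a quick tally (just a sketch that reuses the rorig lookup from the chunk above) of how many distinct resolved IPs sit behind each autonomous system; the exact counts aren’t the point, the Amazon-heavy skew is:

# Rough sketch: count distinct resolved IPs per autonomous system,
# reusing `rorig` (the bulk_origin() lookup) from the previous chunk
distinct(rorig, ip, as_name) %>%
  count(as_name, sort = TRUE)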

    So What?

    This was an innocent link in a document on CRAN that went to a site that looked legit. A clever individual or organization found the dead domain and saw an opportunity to legitimize some fairly nasty stuff.

    Now, I realize nobody is likely using “Rpad” anymore, but this type of situation can happen to any registered domain. If this individual or organization were doing more than trying to make objectionable content legit, they likely could have succeeded, especially if they enticed you with a shiny new devtools::install_…() link with promises of statistically sound animated cat emoji gif creation tools. They did an eerily good job of making this particular site still seem legit.

There’s nothing most folks can do to “fix” that site or have it removed. I’m not sure CRAN should remove the helpful PDF, but given that it contains a clickable link to the hijacked domain, it might be worth suggesting.

    You’ll see that I used the splashr package (which has been submitted to CRAN but not there yet). It’s a good way to work with potentially malicious web content since you can “see” it and mine content from it without putting your own system at risk.
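As a concrete illustration of that “look but don’t touch” workflow, here is a sketch (not code from the original analysis) that assumes a local Splash instance is already running, e.g. via splashr::start_splash(), which needs Docker: pull the rendered DOM with render_html() and mine the outbound links with rvest instead of ever loading the page in your own browser.

library(splashr)
library(rvest)

# Assumes a Splash instance is up and reachable (see ?start_splash)
pg <- render_html(url = "http://www.rpad.org/Rpad/")

# Harvest the anchor hrefs from the rendered DOM instead of clicking around
pg %>%
  html_nodes("a") %>%
  html_attr("href")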

    After going through this, I’ll see what I can do to put some bows on some of the devel-only packages and get them into CRAN so there’s a bit more assurance around using them.

    I’m an army of one when it comes to fielding R-related security issues, but if you do come across suspicious items (like this or icky/malicious in other ways) don’t hesitate to drop me an @ or DM on Twitter.


    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


    Clean or shorten Column names while importing the data itself


    (This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)

When it comes to clumsy column headers, namely wide ones with spaces and special characters, I see many people panic and change the headers in the source file, which is an awkward option given the variety of alternatives that exist in R for handling them.
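One of those alternatives needs nothing beyond base R. Here is a hedged sketch (the file name is the same placeholder used below, and the example header is made up):

# Base-R only: read with original headers intact, then regularize them
df <- read.csv("yourcsvfilewithpath.csv", header = TRUE, check.names = FALSE)
names(df) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(df)))
# e.g. a header like "Customer ID#" becomes "customer_id_"; tweak the regex to taste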

An even easier route is library(janitor), which, as the name suggests, is built for cleaning and tidying. janitor has a function named clean_names() that can be applied while importing the data itself, as shown in the example below:

library(janitor)
library(dplyr)  # (or magrittr) for the %>% pipe

newdataobject <- read.csv("yourcsvfilewithpath.csv", header = TRUE) %>%
  clean_names()
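To see what clean_names() actually does to headers, here is a small hypothetical before-and-after (the column names are invented, and the exact output can vary a little between janitor versions):

library(janitor)

# Hypothetical messy headers; check.names = FALSE keeps them as-is
messy <- data.frame("First Name" = 1:3, "Average.Score" = 4:6, check.names = FALSE)

names(messy)
## [1] "First Name"    "Average.Score"
names(clean_names(messy))
## roughly: "first_name"    "average_score"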

The author has undertaken several projects, courses, and programs in data science over more than a decade; the views expressed here come from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details. Find out more about the author at http://in.linkedin.com/in/pradeepmavuluri


    To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views.

