
Naming things is hard


Prefixing R function names –

‘There are only two hard things in Computer Science: cache invalidation and naming things.’

The above quip by Phil Karlton is fairly well known and often quoted, sometimes with amusing extensions:

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

— Jeff Atwood (@codinghorror) August 31, 2014

There are only 2 hard things in computer science: 0. Cache invalidation 1. Naming things 7. Asynchronous callbacks 2. Off-by-one errors

— Paweł Zajączkowski (@gvaireth) September 18, 2017

These are funny, but they do also convey some truth: amid all the technicalities and abstractions we get caught up in when programming, it's surprising how often seemingly 'simple' tasks like naming things trip us up.

I was recently reminded of this when a difference of opinions about function names sparked some healthy debate in a pull request I was reviewing for one of my personal projects.

So the question I want to raise is this:

“When (if ever) is it a good idea to adopt a prefixing convention for the names of exported functions in an R package?”

Disclaimers

Before I dive into the details I feel it is important to state a few things upfront.

Firstly, I want to thank my friends and collaborators Katrin and Lorenz, who are strong proponents of open source software and for whom I have a lot of respect. On this occasion they both seem to disagree with me, but that is not a bad thing – discussion and debate are valuable, and that's not gonna happen when everyone agrees with each other all the time. I did also ask for their permission before publishing this post.

xkcd: Survivorship Bias

Secondly, my purpose in writing about this is less about trying to determine who is right, and more about attempting to convert this experience into insight. Often we learn more from our failures than from our successes, but it’s harder to share our mistakes than it is to share our triumphs. So this post is my way of being vulnerable about something that is a work in progress and the process of trying to improve on it. As an aside, if you haven’t encountered survivorship bias before, I highly recommend you read this.

Thirdly, I was wrong. This is especially important in light of the previous point. I was wrong to raise the issue of function names (for the package as a whole) in a pull request which was focused on something else. This is a valuable lesson. One should always aim to keep pull requests (and issues) narrow in scope because attempting to tackle multiple problems in one place (unless they are inherently linked or dependent on each other) is messy and complicates matters unnecessarily.

Lastly, what I share will be my own opinion, but it is just an opinion and I’m always open to learning from others with different views. My hope is that collectively we can share some worthwhile perspectives from both sides and possibly encourage more thinking and conversation around this or related issues.

Background Context

Ok, so having made those upfront disclaimers, I’ll begin by summarizing the back-story and context in which the discussion arose. If you’d like to refer to the pull request itself – it can be found here.

R track on Exercism

exercism.io is a learning platform that aims to provide code practice and mentorship for everyone. I got involved in developing the R track on Exercism and wrote about it earlier this year. Unlike most online learning platforms, with Exercism all the coding happens on your machine, in an environment you're familiar with. So Exercism provides a command line tool which leverages an API in order to facilitate the process of fetching and submitting exercises.

A few months ago, I had an idea to write an R package which wraps this API. The thinking was that the user experience (for R users) might be improved by facilitating interaction with exercism.io directly from R itself. This removes the need to switch repeatedly between R and a shell when fetching, iterating on and submitting exercises – although the addition of terminal tabs in RStudio 1.1 has already reduced this friction to a degree. In any case, there are additional opportunities for Exercism helper functions in the package which can be context aware and integrate with RStudio if it is being used. An example of this could be functions (or addins) which make use of the rstudioapi to detect which problem was last worked on, so that it doesn't need to be specified manually when submitting.

Katrin, who is a co-maintainer for the R track on exercism.io, has also been collaborating on this R package with me and has had some great ideas, like leveraging testthat's auto_test() to facilitate and encourage test driven development, as this is one of the implicit goals of Exercism. In the PR introducing this feature, the potential for function name confusion was soon evident when this new Exercism-specific version of testthat::auto_test() was (initially) given the name autotest(). This reminded me that I'd in fact been thinking for a while about renaming all the exported functions to adopt the prefixing convention ex_* (for a few different reasons which I'll get to later). So I figured this "name clash" was as good a catalyst as any, and made the suggestion to start adopting the new naming convention in the PR. Once again it's worth noting that this was a mistake – I should have instead opened a separate issue to discuss my proposed change in naming conventions.

Discussion & follow-up

The suggestion was met with some resistance, and after some further discussion it became clear to me that it was a thoughtfully considered resistance. So I asked my friend Lorenz to weigh in on the discussion too, given that he knows Katrin and I but is not involved in the project and thus has the benefit of a more neutral perspective. To my surprise, he did not agree with me either!

But I did still seem to have Jenny Bryan on my side (thanks Jenny!), and I figured Hadley (or Sir Lord General Professor Wickham, as Mara generally likes to refer to him) had to have thought it was a good idea at some point, given the str_ prefix for stringr and the fct_ prefix for forcats, among others. So after thinking on the problem for a while, out of curiosity I eventually tweeted out a poll to see if I could get any sense of where the #rstats community falls on this issue.

What is your take on prefixing conventions for #rstats function names? (e.g. stringr/stringi, forcats, googlesheets)

— Jon Calder (@jonmcalder) October 20, 2017

At a glance it looks like a reasonable proportion are actually in favour of prefixing conventions for function names (or at least not against the idea), but of course there are a number of disclaimers to make here:

  • Character limits (at the time of the poll) made it hard to communicate the question clearly or to include any additional context for the question, so that probably leaves a lot of room for interpretation
  • I don’t have much reach on Twitter, so there weren’t many responses (81 votes is not much to go on)
  • Even if there had been a good number of responses, Twitter polls need to be looked at skeptically given the potential for sampling bias
  • Speaking of sampling bias, most of the votes came in after Hadley tweeted a reply to the poll, so it makes sense that the results would be skewed towards his legions of followers (I'm one of them, and the degree of influence is clear: his packages, along with others like googlesheets, are what got me considering prefixing conventions in the first place)

Maëlle had two helpful follow-ups for me. Firstly, she encouraged me to blog about this (and I don’t think I would have done so otherwise so thanks Maëlle!). Secondly, she directed me to the ROpenSci review process for her package ropenaq, which provides access to air quality data via the OpenAQ API. In his review of the package, Andrew MacDonald suggested the following:

“I was thinking that the names of functions might be a bit similar to functions in other packages that also use geography. What do you think of prefixing every function with a package-specific string? Perhaps something like aq_ before all the user-facing functions (i.e. countries() becomes aq_countries()). This is similar to another rOpenSci package, geonames, which uses GN (as in GNcities()). This has the added benefit of playing very nicely with Rstudio, which will show all the package functions as completions when users type aq_ and hit tab.”

Interestingly, this suggestion (although the original inspiration may have come from elsewhere) was later incorporated into ROpenSci’s packaging guide:


Consider an object_verb() naming scheme for functions in your package that take a common data type or interact with a common API. object refers to the data/API and verb the primary action. This scheme helps avoid namespace conflicts with packages that may have similar verbs, and makes code readable and easy to auto-complete. For instance, in stringi, functions starting with stri_ manipulate strings (stri_join(), stri_sort()), and in googlesheets functions starting with gs_ are calls to the Google Sheets API (gs_auth(), gs_user(), gs_download()).

Though I hadn’t seen this recommendation from ROpenSci at the time, it aligns very strongly with my initial reasoning for wanting to change the function names in the exercism package. It is primarily an API package, and all functions either interact with the exercism.io API or act on some (local) Exercism data/code (exercises). A potential objection could be that in some cases the ex_* prefix may be interpreted either as exercism_* or as exercise_*, but I don’t think that’s a problem since either way the context is common and shared implicitly.
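To make the idea concrete, here is a purely hypothetical sketch (not code from the exercism package or any of the packages mentioned) of what an object_verb() style interface could look like, with a shared ex_ prefix grouping the user-facing functions and driving tab-completion:

# hypothetical, stubbed-out wrappers following an object_verb() scheme
ex_status <- function(track = "r") {
  message("would fetch the current status of the '", track, "' track from exercism.io")
}
ex_fetch <- function(problem, track = "r") {
  message("would fetch files for '", problem, "' into the local Exercism directory")
}
ex_submit <- function(path) {
  message("would submit the solution file '", path, "' to exercism.io")
}

# typing ex_ and hitting TAB then lists all of the package's helpers in one place
ex_fetch("hello-world")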

Having said that, I’m also aware that a prefixing convention is not suitable in the majority of cases and there are reasons to avoid it, otherwise it would already be far more common. I’ve not tried to summarize the arguments for and against it here since this post is already quite lengthy, but I believe Katrin and Lorenz both raised a number of good points over in the original PR thread, so I would encourage you to read through that to get some more insight into the potential pros and cons.

Below is an overview of the currently exported functions for exercism, along with a brief description of what they do and potential new names for each should we adopt a prefixing convention:

| Current Function | Description | New Name? |
|---|---|---|
| set_api_key() | Set an environment variable for the provided exercism.io API key, and store in .Renviron so that it can persist for future sessions. | ex_set_key() |
| set_exercism_path() | Set an environment variable for the provided exercism path, and store in .Renviron so that it can persist for future sessions. | ex_set_path() |
| track_status() | Fetches current track status from exercism.io | ex_status() |
| check_next_problem() | Returns the next problem for a language track | ex_check() |
| fetch_problem() | Fetches the files for a problem via the Exercism API and writes them into a new problem folder in the Exercism directory | ex_fetch() |
| fetch_next() | Checks for the next problem via the Exercism API, and writes the files into the folder in the Exercism directory | *special case of ex_fetch() |
| open_exercise() | Open files for an exercism.io problem | ex_open() |
| start_testing() | Exercism- and R-specific wrapper for testthat::auto_test() that starts testing your solution against the problem's test cases. | ex_auto_test() |
| submit() | Submits the specified solution to exercism.io | ex_submit() |
| skip_problem() | Marks a problem as 'skipped' via the Exercism API | ex_skip() |
| browse_exercise() | Navigate to an exercise description on exercism.io | ex_browse() |
| browse_solution() | Navigate to an exercise solution on exercism.io | *special case of ex_browse() |

So looking at the above, do you think this is a good use case for an object_verb() naming convention? How should one determine this? Please feel free to comment with your thoughts and suggestions below or ping me on Twitter.


When you use inverse probability weighting for estimation, what are the weights actually doing?


(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Towards the end of Part 1 of this short series on confounding, IPW, and (hopefully) marginal structural models, I talked a little bit about the fact that inverse probability weighting (IPW) can provide unbiased estimates of marginal causal effects in the context of confounding just as more traditional regression models like OLS can. I used an example based on a normally distributed outcome. Now, that example wasn’t super interesting, because in the case of a linear model with homogeneous treatment effects (i.e. no interaction), the marginal causal effect is the same as the conditional effect (that is, conditional on the confounders.) There was no real reason to use IPW in that example – I just wanted to illustrate that the estimates looked reasonable.

But in many cases, the conditional effect is different from the marginal effect. (And in other cases, there may not even be an obvious way to estimate the conditional effect – that will be the topic for the last post in this series.) When the outcome is binary, the notion that conditional effects are equal to marginal effects no longer holds. (I've touched on this before.) What this means is that we can recover the true conditional effects using logistic regression, but we cannot estimate the marginal effect. This is directly related to the fact that logistic regression is linear on the logit (or log-odds) scale, not on the probability scale. The issue here is collapsibility, or rather, non-collapsibility.

A simulation

Because binary outcomes are less amenable to visual illustration, I am going to stick with model estimation to see how this plays out:

library(simstudy)

# define the data
defB <- defData(varname = "L", formula = 0.27, dist = "binary")
defB <- defData(defB, varname = "Y0", formula = "-2.5 + 1.75*L", dist = "binary", link = "logit")
defB <- defData(defB, varname = "Y1", formula = "-1.5 + 1.75*L", dist = "binary", link = "logit")
defB <- defData(defB, varname = "A", formula = "0.315 + 0.352 * L", dist = "binary")
defB <- defData(defB, varname = "Y", formula = "Y0 + A * (Y1 - Y0)", dist = "nonrandom")

# generate the data
set.seed(2002)
dtB <- genData(200000, defB)
dtB[1:6]
##    id L Y0 Y1 A Y
## 1:  1 0  0  0 0 0
## 2:  2 0  0  0 0 0
## 3:  3 1  0  1 1 1
## 4:  4 0  1  1 1 1
## 5:  5 1  0  0 1 0
## 6:  6 1  0  0 0 0

We can look directly at the potential outcomes to see the true causal effect, measured as a log odds ratio (LOR):

odds <- function(p) {
  return(p/(1 - p))
}

# log odds ratio for entire sample (marginal LOR)
dtB[, log( odds( mean(Y1) ) / odds( mean(Y0) ) )]
## [1] 0.8651611

In the linear regression context, the conditional effect measured using observed data from the exposed and unexposed subgroups was in fact a good estimate of the marginal effect in the population. Not the case here, as the conditional causal effect (LOR) of A is 1.0, which is greater than the true marginal effect of 0.86:

library(broom)
tidy(glm(Y ~ A + L, data = dtB, family = "binomial"))
##          term   estimate  std.error  statistic p.value
## 1 (Intercept) -2.4895846 0.01053398 -236.33836       0
## 2           A  0.9947154 0.01268904   78.39167       0
## 3           L  1.7411358 0.01249180  139.38225       0

This regression estimate for the coefficient of \(A\) is a good estimate of the conditional effect in the population (based on the potential outcomes at each level of \(L\)):

dtB[, .(LOR = log( odds( mean(Y1) ) / odds( mean(Y0) ) ) ), keyby = L]
##    L       LOR
## 1: 0 0.9842565
## 2: 1 0.9865561

Of course, ignoring the confounder \(L\) is not very useful if we are interested in recovering the marginal effect. The estimate of 1.4 is biased for both the conditional effect and the marginal effect – it is not really useful for anything:

tidy(glm(Y ~ A , data = dtB, family="binomial"))
##          term  estimate   std.error statistic p.value
## 1 (Intercept) -2.049994 0.009164085 -223.6987       0
## 2           A  1.433094 0.011723767  122.2384       0

How weighting reshapes the data …

Here is a simple tree graph that shows the potential outcomes for 1000 individuals (based on the same distributions we've been using in our simulation). For 27% of the individuals, \(L=1\); for 73%, \(L=0\). Each individual has a potential outcome under each level of treatment \(A\). So, that is why there are 730 individuals with \(L=0\) both with and without treatment. Likewise, each treatment arm for those with \(L=1\) has 270 individuals. We are not double counting.

Both the marginal and conditional estimates that we estimated before using the simulated data can be calculated directly using information from this tree. The conditional effects on the log-odds scale can be calculated as …

\[LOR_{A=1 \textbf{ vs } A=0|L = 0} = log \left (\frac{0.182/0.818}{0.076/0.924} \right)=log(2.705) = 0.995\]

and

\[LOR_{A=1 \textbf{ vs } A=0|L = 1} = log \left (\frac{0.562/0.438}{0.324/0.676} \right)=log(2.677) = 0.984\]

The marginal effect on the log odds scale is estimated from the marginal probabilities \(P(Y=1|A=0)\) and \(P(Y=1|A=1)\). Again, we can take these right from the tree …

\[P(Y=1|A=0) = 0.73\times0.076 + 0.27\times0.324 = 0.143\] and

\[P(Y=1|A=1) = 0.73\times0.182 + 0.27\times0.562 = 0.285\]

Based on these average outcomes (probabilities) by exposure, the marginal log-odds for the sample is:

\[LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.285/0.715}{0.143/0.857} \right)=log(2.389) = 0.871\]
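The same arithmetic can be verified with a few lines of R (a small sketch that simply re-uses the probabilities read off the tree):

# marginal risks under each exposure, averaging over the distribution of L (73% / 27%)
p_Y1_A0 <- 0.73 * 0.076 + 0.27 * 0.324   # approximately 0.143
p_Y1_A1 <- 0.73 * 0.182 + 0.27 * 0.562   # approximately 0.285

# marginal log odds ratio implied by the tree
log((p_Y1_A1 / (1 - p_Y1_A1)) / (p_Y1_A0 / (1 - p_Y1_A0)))   # approximately 0.87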

Back in the real world of observed data, this is what the tree diagram looks like:

This tree tells us that the probability of exposure \(A=1\) is different depending upon the value of \(L\). For \(L=0\), \(P(A=1) = 230/730 = 0.315\), and for \(L=1\), \(P(A=1) = 180/270 = 0.667\). Because of this disparity, the crude estimate of the effect (ignoring \(L\)) is biased for the marginal causal effect:

\[P(Y=1|A=0) = \frac{500\times0.076 + 90\times0.324}{500+90}=0.114\]

and

\[P(Y=1|A=1) = \frac{230\times0.182 + 180\times0.562}{230+180}=0.349\]

The crude log odds ratio is

\[LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.349/0.651}{0.114/0.886} \right)=log(4.170) = 1.420\]

As mentioned in the prior post, the IPW is based on the probability of the actual exposure at each level of \(L\): \(P(A=a | L)\), where \(a\in(0,1)\) (and not on \(P(A=1|L)\), the propensity score). Here are the simple weights for each group:

If we apply the weights to each of the respective groups, you can see that the number of individuals in each treatment arm is adjusted to the total number of individuals in the sub-group defined by the level of \(L\). For example, if we apply the weight of 3.17 (730/230) to each person observed with \(L=0\) and \(A=1\), we end up with \(230\times3.17=730\). Applying each of the respective weights to the subgroups of \(L\) and \(A\) results in a new sample of individuals that looks exactly like the one we started out with in the potential outcomes world:

This all works only if we make these two assumptions: \[P(Y=1|A=0, L=l) = P(Y_0=1 | A=1, L=l)\] and \[P(Y=1|A=1, L=l) = P(Y_1=1 | A=0, L=l)\]

That is, we can make this claim only under the assumption of no unmeasured confounding. (This was discussed in the Part 1 post.)

Applying IPW to our data

We need to estimate the weights using logistic regression (though other, more flexible methods, can also be used). First, we estimate \(P(A=1|L)\)

exposureModel <- glm(A ~ L, data = dtB, family = "binomial")
dtB[, pA := predict(exposureModel, type = "response")]

Now we can derive an estimate for \(P(A=a|L=l)\) and get the weight itself…

# Define two new columns
defB2 <- defDataAdd(varname = "pA_actual",
                    formula = "(A * pA) + ((1 - A) * (1 - pA))",
                    dist = "nonrandom")
defB2 <- defDataAdd(defB2, varname = "IPW",
                    formula = "1/pA_actual",
                    dist = "nonrandom")

# Add weights
dtB <- addColumns(defB2, dtB)
dtB[1:6]
##    id L Y0 Y1 A Y        pA pA_actual      IPW
## 1:  1 0  0  0 0 0 0.3146009 0.6853991 1.459004
## 2:  2 0  0  0 0 0 0.3146009 0.6853991 1.459004
## 3:  3 1  0  1 1 1 0.6682351 0.6682351 1.496479
## 4:  4 0  1  1 1 1 0.3146009 0.3146009 3.178631
## 5:  5 1  0  0 1 0 0.6682351 0.6682351 1.496479
## 6:  6 1  0  0 0 0 0.6682351 0.3317649 3.014183
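As a quick sanity check on the reshaping idea described above (a small sketch using the dtB data.table we just created), the weighted number of individuals in each exposure arm should come out close to the total size of its \(L\) subgroup:

# weighted counts by confounder and exposure:
# each (L, A) cell scales up to roughly the size of its L subgroup
dtB[, .(n = .N, weighted_n = sum(IPW)), keyby = .(L, A)]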

To estimate the marginal effect on the log-odds scale, we use the function glm with weights specified by IPW. The true value of marginal effect (based on the population-wide potential outcomes) was 0.87 (as we estimated from the potential outcomes directly and from the tree graph). Our estimate here is spot on (but with such a large sample size, this is not so surprising):

tidy(glm(Y ~ A , data = dtB, family="binomial", weights = IPW)) 
##          term   estimate   std.error statistic p.value
## 1 (Intercept) -1.7879512 0.006381189 -280.1909       0
## 2           A  0.8743154 0.008074115  108.2862       0

It may not seem like a big deal to be able to estimate the marginal effect – we may actually not be interested in it. However, in the next post, I will touch on the issue of estimating causal effects when there are repeated exposures (for example, administering a drug over time) and time-dependent confounders that are affected by prior exposures and in turn affect both future exposures and the outcome. Under this scenario, it is very difficult if not impossible to control for these confounders – the best we might be able to do is estimate a marginal, population-wide causal effect. That is where weighting will be really useful.

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

RTutor: Water Pollution and Cancer


(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

One very important benefit of stronger environmental protection is to reduce the damaging effects of pollution on human health.

In his very interesting article "The Consequences of Industrialization: Evidence from Water Pollution and Digestive Cancers in China" (Review of Economics and Statistics, 2012), Avraham Ebenstein studies the impact of water pollution on the rate of digestive cancers for several Chinese river systems. He convincingly argues that there is a causal effect of substantial size, and a cost-benefit analysis shows how stricter environmental regulations would make it possible to statistically save a human life at relatively low cost.

As part of her master's thesis at Ulm University, Brigitte Peter has generated a very nice RTutor problem set that allows you to replicate the main insights of the article in an interactive fashion. You learn about R, econometrics and the identification of causal effects from field data, as well as the relationship between water pollution and digestive cancer.


Like in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked and you can get hints on how to proceed. In addition, you are challenged by many multiple choice quizzes.

To install the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/brigittepeter/RTutorWaterPollutionChina
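A minimal sketch of the installation (assuming both packages can be installed straight from GitHub with devtools; the READMEs linked above describe the full steps and any additional dependencies):

# install RTutor and the problem set package from GitHub
install.packages("devtools")                                          # if devtools is not yet installed
devtools::install_github("skranz/RTutor")                             # the RTutor framework
devtools::install_github("brigittepeter/RTutorWaterPollutionChina")   # the problem set package

The problem set itself is then started from R as described in the problem set package's README.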

There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 25 hours total usage time per month, so it may be greyed out when you click on it.)

https://brigittepeter.shinyapps.io/RTutorWaterPollutionChina/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

You can also install RTutor as a docker container: https://hub.docker.com/r/skranz/rtutor/


Outliers Detection and Intervention Analysis


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In my previous tutorial, Arima Models and Intervention Analysis, we took advantage of the strucchange package to identify and date time series level shifts (structural changes). Based on that, we were able to define ARIMA models with improved AIC metrics. Furthermore, the attentive analysis of the ACF/PACF plots highlighted the presence of seasonal patterns. In this tutorial we will investigate another flavour of intervention variable, the transient change. We will take advantage of two fundamental packages for the purpose:

• tsoutliers
• TSA

Specifically, we will compare the results obtained by modeling the transient change with each of the two packages.

Outliers Analysis

Outlier detection relates to intervention analysis, as the latter can be argued to be a special case of the former. A basic list of intervention variables is:

• step response intervention
• pulse response intervention

A basic list of outliers is:

• Additive Outliers
• Level Shifts
• Transient Change
• Innovation Outliers
• Seasonal Level Shifts

An Additive Outlier (AO) represents an isolated spike. A Level Shift (LS) represents an abrupt change in the mean level and it may be seasonal (Seasonal Level Shift, SLS) or not. A Transient Change (TC) represents a spike that takes a few periods to disappear. An Innovation Outlier (IO) represents a shock in the innovations of the model.

Pre-specified outliers are able to satisfactorily describe the dynamic pattern of untypical effects and can be captured by means of intervention variables.

Intervention Analysis – Common Models

A short introduction to the basic models and functions commonly used for intervention analysis follows.

Step function

The step function is useful to represent level shift outliers.

$$ \begin{equation} \begin{aligned} S_{T}(t) &=\left\{ \begin{array}{@{}ll@{}} 0 & \text{if}\ t < T \\ 1 & \text{otherwise}\ \end{array} \right. \end{aligned} \end{equation} $$

It can be thought of as a special case of the transient change intervention model with delta = 1 (see the transient change model further ahead). We can represent it with the help of the filter() function as follows.

tc <- rep(0, 50)
tc[20] <- 1
ls <- filter(tc, filter = 1, method = "recursive")
plot(ls, main = "Level Shift - TC delta = 1", type = "s")

By adding up two step functions at different lags, it is possible to represent additive outliers or transitory level shifts, as we will see soon.
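For instance, here is a minimal sketch (not from the original post) of a transitory level shift built as the difference of two step functions, reusing the pulse variable tc defined above:

# step up at t = 20 (from tc) minus a second step at t = 30: a level shift lasting 10 periods
tc2 <- rep(0, 50)
tc2[30] <- 1
tls <- filter(tc, filter = 1, method = "recursive") -
       filter(tc2, filter = 1, method = "recursive")
plot(tls, main = "Transitory Level Shift - difference of two steps", type = "s")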

Pulse function

The pulse function is useful to represent additive outliers.

$$ \begin{equation} \begin{aligned} P_{T}(t) = S_{T}(t)\ -\ S_{T}(t-1) \end{aligned} \end{equation} $$

It can be thought of as a special case of the transient change intervention model with delta = 0 (see the transient change model further ahead). We can graphically represent it with the help of the filter() function as shown here.

ao <- filter(tc, filter = 0, method = "recursive")
plot(ao, main = "Additive Outlier - TC delta = 0", type = "s")

Level Shift function

The level shift function is useful to represent level shift outliers. It can be modeled in terms of the step function with magnitude equal to the omega parameter.

$$ \begin{equation} \begin{aligned} m(t) = \omega S_{T}(t) \end{aligned} \end{equation} $$

The graphical representation is the same as that of the step function, with magnitude equal to the omega parameter of the formula above.

Transient change function

The transient change function is useful to represent transient change outliers.

$$ \begin{equation} \begin{aligned} C(t) = \dfrac{\omega B}{1 - \delta B} P_{T}(t) \end{aligned} \end{equation} $$

We can graphically represent it with the help of the filter() function as shown here. Two delta values are considered to show how the transient change varies accordingly.

tc_0_4 <- filter(tc, filter = 0.4, method = "recursive")
tc_0_8 <- filter(tc, filter = 0.8, method = "recursive")
plot(tc_0_4, main = "TC delta = 0.4")
plot(tc_0_8, main = "TC delta = 0.8")

Packages

suppressPackageStartupMessages(library(tsoutliers))
suppressPackageStartupMessages(library(TSA))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(astsa))

Analysis

In the following, I will analyse the sex ratio at birth based on the Arbuthnot dataset, which provides information on male and female births in London from 1639 to 1710. As done in ref. [1], a metric representing the fractional excess of boys' births over girls' is defined as:

$$ \begin{equation} \begin{aligned} \dfrac{(Boys - Girls)}{Girls} \end{aligned} \end{equation} $$

url <- "https://www.openintro.org/stat/data/arbuthnot.csv"
abhutondot <- read.csv(url, header = TRUE)
boys_ts <- ts(abhutondot$boys, frequency = 1, start = abhutondot$year[1])
girls_ts <- ts(abhutondot$girls, frequency = 1, start = abhutondot$year[1])
delta_ts <- boys_ts - girls_ts
excess_ts <- delta_ts/girls_ts
plot(excess_ts)

Gives this plot.

With the help of the tso() function within tsoutliers package, we identify if any outliers are present in our excess_ts time series.

outliers_excess_ts <- tso(excess_ts, types = c("TC", "AO", "LS", "IO", "SLS"))
outliers_excess_ts

Series: excess_ts 
Regression with ARIMA(0,0,0) errors 

Coefficients:
      intercept    TC31
         0.0665  0.1049
s.e.     0.0031  0.0199

sigma^2 estimated as 0.0007378:  log likelihood=180.34
AIC=-354.69   AICc=-354.38   BIC=-347.47

Outliers:
  type ind time coefhat tstat
1   TC  31 1659  0.1049 5.283

A transient change outlier occurring in year 1659 was identified. We can also inspect the results graphically.

plot(outliers_excess_ts)

Gives this plot.

We found an outlier of the Transient Change flavour occurring in year 1659. Specific details are shown below.

outliers_excess_ts$outliers
  type ind time   coefhat   tstat
1   TC  31 1659 0.1049228 5.28339

# time index where the outliers have been detected
(outliers_idx <- outliers_excess_ts$outliers$ind)
[1] 31

# calendar years where the outliers have been detected
outliers_excess_ts$outliers$time
[1] 1659

We now want to evaluate the effect of such transient change, comparing the time series under analysis pre and post intervention.

# length of our time series
n <- length(excess_ts)

# transient change outlier at the same time index as found for our time series
mo_tc <- outliers("TC", outliers_idx)

# transient change effect data is stored into a one-column matrix, tc
tc <- outliers.effects(mo_tc, n)

              TC31
 [1,] 0.000000e+00
 [2,] 0.000000e+00
 [3,] 0.000000e+00
 [4,] 0.000000e+00
 [5,] 0.000000e+00
 [6,] 0.000000e+00
 [7,] 0.000000e+00
 [8,] 0.000000e+00
 [9,] 0.000000e+00
[10,] 0.000000e+00
...

The “coefhat” column of the outliers data frame stores the coefficient used as a multiplier for our transient change tc data matrix.

# converting to a number
coefhat <- as.numeric(outliers_excess_ts$outliers["coefhat"])

# obtaining the transient change data with the same magnitude as determined by the tso() function
tc_effect <- coefhat*tc

# defining a time series for the transient change data
tc_effect_ts <- ts(tc_effect, frequency = frequency(excess_ts), start = start(excess_ts))

# subtracting the transient change intervention from the original time series, obtaining the pre-intervention time series
excess_wo_ts <- excess_ts - tc_effect_ts

# plot of the original, the pre-intervention and transient change time series
plot(cbind(excess_ts, excess_wo_ts, tc_effect_ts))

Gives this plot.

We can further highlight the difference between the original time series and the pre-intervention one.

plot(excess_ts, type = 'b', ylab = "excess birth ratio")
lines(excess_wo_ts, col = 'red', lty = 3, type = 'b')

Gives this plot.

A quick check on the residuals of the pre-intervention time series confirms the validity of the ARIMA(0,0,0) model for it.

sarima(excess_wo_ts, p=0, d=0, q=0)

Gives this plot.

Now, we implement a similar representation of the transient change outlier by taking advantage of the arimax() function within the TSA package. The arimax() function requires us to specify some ARMA parameters, which is done by capturing the seasonality as discussed in ref. [1]. Further, the transient change is specified by means of the xtransf and transfer input parameters. The xtransf parameter is a matrix with each column containing a covariate that affects the time series response in terms of an ARMA filter of order (p,q). For our scenario, it provides a value equal to 1 at the outlier's time index and zero elsewhere. The transfer parameter is a list consisting of the ARMA orders for each transfer covariate. For our scenario, we specify an AR order equal to 1.

arimax_model <- arimax(excess_ts,
                       order = c(0,0,0),
                       seasonal = list(order = c(1,0,0), period = 10),
                       xtransf = data.frame(I1 = (1*(seq(excess_ts) == outliers_idx))),
                       transfer = list(c(1,0)),
                       method = 'ML')
summary(arimax_model)

Call:
arimax(x = excess_ts, order = c(0, 0, 0), seasonal = list(order = c(1, 0, 0), 
    period = 10), method = "ML", xtransf = data.frame(I1 = (1 * (seq(excess_ts) == 
    outliers_idx))), transfer = list(c(1, 0)))

Coefficients:
        sar1  intercept  I1-AR1  I1-MA0
      0.2373     0.0668  0.7601  0.0794
s.e.  0.1199     0.0039  0.0896  0.0220

sigma^2 estimated as 0.0006825:  log likelihood = 182.24,  aic = -356.48

Training set error measures:
                        ME      RMSE        MAE       MPE     MAPE      MASE      ACF1
Training set -0.0001754497 0.0261243 0.02163487 -20.98443 42.09192 0.7459687 0.1429339

The significance of the coefficients is then verified.

coeftest(arimax_model)

z test of coefficients:

           Estimate Std. Error z value  Pr(>|z|)    
sar1      0.2372520  0.1199420  1.9781 0.0479224 *  
intercept 0.0667816  0.0038564 17.3173 < 2.2e-16 ***
I1-AR1    0.7600662  0.0895745  8.4853 < 2.2e-16 ***
I1-MA0    0.0794284  0.0219683  3.6156 0.0002997 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The model coefficients are all statistically significant. We remark that the resulting model shows a slightly improved AIC with respect to the previous model. Further, both models show improved AIC values with respect to the ARIMA models discussed in the previous tutorial.

A quick check on the residuals of the pre-intervention time series based on the ARIMAX fit (excess_wo_arimax_ts, i.e. the original series minus the ARIMAX transient change effect derived below) confirms the validity of the ARIMA(0,0,0)(1,0,0)[10] model for it.

sarima(excess_wo_arimax_ts, p=0, d=0, q=0, P=1, D=0, Q=0, S=10)

Gives this plot.

Let us plot the original time series against the fitted one.

plot(excess_ts)lines(fitted(arimax_model), col = 'blue')

Gives this plot.

Considering the transient change model formula, we can build a linear filter with the AR parameter as the filter coefficient, and multiply the filter() function result by the MA parameter.

# pulse intervention variable
int_var <- 1*(seq(excess_ts) == outliers_idx)

# transient change intervention variable obtained by filtering the pulse according to the
# definition of transient change and parameters obtained by the ARIMAX model
tc_effect_arimax <- filter(int_var, filter = coef(arimax_model)["I1-AR1"],
                           method = "rec", sides = 1) * coef(arimax_model)["I1-MA0"]

# defining the time series for the intervention effect
tc_effect_arimax_ts <- ts(tc_effect_arimax, frequency = frequency(excess_ts), start = start(excess_ts))

It is interesting to compare two transient change effects, as obtained by the arimax() and the tso() functions.

# comparing the transient change effect resulting from ARIMAX (red) with the tso() one (blue)
plot(tc_effect_arimax_ts, col = 'red', type = 'l', ylab = "transient change")
lines(tc_effect_ts, col = 'blue', lty = 3)

By eyeballing the plot above, they appear pretty close.
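For a quick numeric check (a small addition, not part of the original analysis), we can also look at the largest absolute difference between the two estimated effects:

# largest absolute difference between the tso() and ARIMAX transient change effects
max(abs(tc_effect_arimax_ts - tc_effect_ts))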

If you have any questions, please feel free to comment below.

References


(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    How to Show R Inline Code Blocks in R Markdown


    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Inline code with R Markdown

    R Markdown is a well-known tool for reproducible science in R. In this article, I will focus on a few tricks with R inline code.

    Some time ago, I was writing a vignette for my package WordR. I was using R Markdown. At one point I wanted to show `r expression` in the output, exactly as it is shown here, as an inline code block.

    In both R Markdown and Markdown, we can write `abc` to show abc (I will use the blue colour for code blocks showing code as it appears in Rmd file, whereas the default colour will be showing the rendered output). What is not obvious is that you can use double backticks to escape single backticks in the code block. So code like this: `` `abc` `` (mind the spaces!) produces this `abc`.

    Now as an exercise, you can guess how I produced the `` `abc` `` block above. Yes, indeed, I have ``` `` `abc` `` ``` in the Rmd source file. And we can go on like this ad infinitum (can we?).

    OK, but I wanted to produce `r expression`. Learning the lesson above, we can try `` `r expression` ``. But trying this, I was getting an error:

    processing file: codeBlocks.Rmd
    Quitting from lines 12-22 (codeBlocks.Rmd) 
    Error in vapply(x, format_sci_one, character(1L), ..., USE.NAMES = FALSE) : 
      values must be length 1,
     but FUN(X[[1]]) result is length 0
    Calls:  ... paste -> hook -> .inline.hook -> format_sci -> vapply
    Execution halted

Obviously, the R Markdown renderer is trying to evaluate the expression. So it seems that the R Markdown renderer does not know that it should (should it?) skip R inline code blocks which are enclosed by double backticks.

    Solution

    Making a long (and yes, I spent some time to find a solution) story short. The correct code block to produce `r expression` is `` `r "\u0060r expression\u0060"` ``.

A short explanation of how it works: \u0060 is the Unicode representation of the backtick (`). So first, the R Markdown renderer finds the R expression within the double backticks and evaluates it. Important here is the use of the Unicode escape for the backtick, since using a backtick within the expression would result in an error. (We are lucky that the R Markdown renderer does not run recursively, finding the R code block again and evaluating it again.) So once the R Markdown rendering is done, Markdown just sees `` `r expression` `` in the temporary .md file, and it renders correctly as `r expression` in the HTML output.

    If you want to see (much) more, just look at the source R Markdown file for this article here. Do you know a better, more elegant solution? If you do, please use the discussion below.


    Epilogue

Some time after I sent the draft of this blog to the RViews admin, I got a reply (thank you!) which pointed to the knitr FAQ page, specifically question number 7 (and a new post from the author of the knitr package explaining it a little further). It suggests a probably more elegant solution of using

    Some text before inline code `` `r
    expression` `` and some code after
    

(mind the newline!) that will produce Some text before inline code `r expression` and some code after, or use `` `r knitr::inline_expr("expression")` `` which similarly produces `r expression`.

    But, I believe this post (especially its source) might still help someone to understand how the R inline code is evaluated.



    Announcing the Plumber v0.4.4 Release


    (This article was first published on Trestle Technology, LLC - R, and kindly contributed to R-bloggers)

    Plumber is a package which allows you to create web APIs from your R code. If you’re new to Plumber, you can find out more at www.rplumber.io.

    We’re excited to announce the v0.4.4 release of Plumber! This release adds a handful of oft-requested features and cleans up a few issues that came out of the major refactor that took place in the 0.4.2 release. We’ve also continued to expand the official Plumber documentation. We’ll mention the highlights below, but you can see the full release notes for v0.4.4 here. The release is on CRAN now, so you can update using:

    install.packages("plumber")
    

    Plumber v0.4.4 Highlights

    1. Support customized image sizes on the @png and @jpeg annotations. More details here.
    2. Support expiration, HTTPOnly, and Secure flags on cookies as discussed here (but see the “Known Bugs” section below for an important note about cookie expiration in v0.4.4).
    3. Restore functionality of PlumberStatic routers (#156).
    4. Support arguments sent from clients to nested subrouters.
    5. For APIs deployed using DigitalOcean, set the working directory to the root of the API before starting.
    6. Restore functionality of setErrorHandler.
    7. Ignore capitalization when searching for plumber.r and entrypoint.r files when plumb()ing a directory.
    8. Support query strings with keys that appear more than once (#165)
    9. Fix the validation error warning at the bottom of deployed Swagger files which would have appeared any time your swagger.json file was hosted in such a way that a hosted validator service would not have been able to access it. For now we just suppress validation of swagger.json files. (#149)
    10. Support for floating IPs in DNS check that occurs in do_configure_https()
    11. Make adding swap file idempotent in do_provision() so you can now call that function on a single droplet multiple times.
    12. Support an exit hook which can define a function that will be evaluated when the API is interrupted, e.g. pr <- plumb("plumber.R"); pr$registerHook("exit", function(){ print("Bye bye!") }) (see the sketch after this list).
    13. Fixed bug in which a single function couldn’t support multiple paths for a single verb (#203).
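Expanding on item 12, here is a minimal runnable sketch of registering an exit hook (it assumes a plumber.R API definition exists in the working directory):

library(plumber)

# plumb the API definition and register a hook that runs when the API is interrupted
pr <- plumb("plumber.R")
pr$registerHook("exit", function() {
  print("Bye bye!")
})

# serve the API; stopping it triggers the exit hook
pr$run(port = 8000)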

    Known Bugs

    At the time of writing, one bug has already been fixed since the v0.4.4 release. Cookie expiration times were not properly being sent to clients which caused most clients to ignore these times altogether and revert back to using session cookies. If you wish to set an expiration time on a cookie, you will need to use the development release of Plumber which you can install using:

    devtools::install_github("trestletech/plumber")
    

    5 ways a SAS Health Check can simplify a transition to R


    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    Nic Crane, Data Scientist

At Mango, we're seeing more and more clients making the decision to modernise their analytics process, moving away from SAS and on to R, Python, and other technologies. There are a variety of reasons for this, including SAS license costs, the increase of recent graduates with R and Python skills, SAS becoming increasingly uncommon, and the need for flexible technologies which have the capability for advanced analytics and quality graphics output.

    While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process.

    So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation?

    1. Analyse procedure calls to inform technology choice

    Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job?

A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL – SAS procedures which reorder data and execute SQL commands used for interacting with databases or tables of data. As there are no statistics-related procs, we may decide, if we migrate this application away from SAS, to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES, and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R.
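The details of our analysis suite are beyond the scope of this post, but as a toy illustration (a hypothetical sketch, not Mango's tooling, and the "sas_code" directory name is made up), counting procedure calls across a directory of SAS scripts can be approximated with a few lines of R:

# count PROC calls across all .sas files under a (hypothetical) sas_code directory
sas_files <- list.files("sas_code", pattern = "\\.sas$", recursive = TRUE, full.names = TRUE)

proc_counts <- table(unlist(lapply(sas_files, function(f) {
  code <- toupper(paste(readLines(f, warn = FALSE), collapse = "\n"))
  regmatches(code, gregexpr("PROC\\s+[A-Z]+", code))[[1]]
})))

sort(proc_counts, decreasing = TRUE)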

    2. Use macro analysis to find the most and least important components of an application

    Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used?

    We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once.

    Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality.

    On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this to avoid wasting time and resource examining it.

    3. Looking at the interrelated components to understand process flow

    When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code?

    The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase, and is scaled by size. The nodes are coloured by complexity. Looking at the diagram above, we can instantly see that the create_metadata script is both large and complex, and so we might choose to assign this to a more experienced developer, or look to restructure it first.

    4. Examine code complexity to assess what needs redeveloping and redesigning

    Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively?

    Thankfully, we can use ‘cyclomatic complexity’ which will assess how complex the code is. The results of this analysis will determine: whether it needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers.

    5. Use the high level overview to get an informed perspective which ties into your strategic objectives

    Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives?

    The overall summary can be used to answer questions around the relative size and complexity of multiple applications; making it possible to estimate more accurately on the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions.

For example, if your key priority is to consolidate your ETL process, you might first focus on the apps which have a high number of calls to PROC SQL. Or you might have a goal of improving the quality of your graphics, and so you'll focus on the applications which produce a large number of plots. Either way, a high level summary like the one below collects all the information you need in one place and simplifies the decision-making process.

    SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money.

If you're thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with our team today via sales@mango-solutions.com or +44 (0)1249 705 450.


    AI School: Microsoft R and SQL Server ML Services


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you'd like to learn how to use R to develop AI applications, the Microsoft AI School now features a learning path focused on Microsoft R and SQL Server ML Services. This learning path includes eight modules, each comprising detailed tutorials and examples:

    All of the Microsoft AI School learning paths are free to access, and the content is hosted on Github (where feedback is welcome!). You can access this course and many others at the link below.

    Microsoft AI School: Microsoft R and SQL Server ML Services 



    Magick 1.6: clipping, geometries, fonts, fuzz, and a bit of history


    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


    This week magick 1.6 appeared on CRAN. This release is a big all-round maintenance update with lots of tweaks and improvements across the package.

    The NEWS file gives an overview of changes in this version. In this post we highlight some changes.

library(magick)
stopifnot(packageVersion('magick') >= 1.6)

    If you are new to magick, check out the vignette for a quick introduction.

    Perfect Graphics Rendering

I have fixed a few small rendering imperfections in the graphics device. The native magick graphics device image_graph() now renders images of identical or better quality than the R-base bitmap devices png, jpeg, etc.

    One issue was that sometimes magick graphics would show a 1px black border around the image. It turned out this is caused by rounding of clipping coordinates.

When R calculates the clipping area it often ends up at non-whole values. It is then up to the graphics device to decide what to do with a pixel that is partially clipped. Let's show clipping in action:

testplot <- function(title = ""){
  plot(1, main = title)
  abline(0, 1, col = "blue", lwd = 2, lty = "solid")
  abline(0.1, 1, col = "red", lwd = 3, lty = "dotted")
  abline(0.2, 1, col = "green", lwd = 4, lty = "twodash")
  abline(0.3, 1, col = "black", lwd = 5, lty = "dotdash")
  abline(0.4, 1, col = "purple", lwd = 6, lty = "dashed")
  abline(0.5, 1, col = "yellow", lwd = 7, lty = "longdash")
  abline(-0.1, 1, col = "blue", lwd = 10, lend = "round", lty = "dashed")
  abline(-0.2, 1, col = "blue", lwd = 10, lend = "butt", lty = "dashed")
  abline(-0.3, 1, col = "blue", lwd = 10, lend = "square", lty = "dashed")
}

    Now we run it with and without clipping:

    img2 <- magick::image_graph(clip = FALSE)
    testplot("Without clipping")
    dev.off()

    [Image: noclip.png]

    img1 <- magick::image_graph(clip = TRUE)
    testplot("With clipping")
    dev.off()

    [Image: clip.png]

    As we can see the latter image is now perfectly clipped. The colored lines are truncated exactly at the pixel where the axis starts. This is not always the case in base R 😉

    Font Families

    In magick there are two ways to render text on an image. You can either open the image or graphic in the magick graphics device and then use the base R text() function to print text. Alternatively there is image_annotate(), which is a simpler way to print some text on an image.
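    A minimal sketch of the second approach (the text, size, and position below are arbitrary example values; the argument names are the ones I understand from the image_annotate() help page):

    img <- image_read("https://jeroen.github.io/images/frink.png")
    image_annotate(img, "Hello magick", size = 30, color = "red",
                   gravity = "southwest", location = "+10+10")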

    Wherever text rendering is involved, two major headaches arise: encoding and fonts. The latter is tricky because different operating systems have different fonts with different names. In addition, a font can be specified as a name, a family name, or an alias.

    Below is a simple test that I use to quickly inspect if fonts are working on different systems:

    img <- image_graph(width = 800, height = 500, pointsize = 20, res = 96)
    graphics::plot.new()
    graphics::par(mar = c(0,0,3,0))
    graphics::plot.window(xlim = c(0, 20), ylim = c(-.5, 8))
    title(expression(Gamma %prop% sum(x[alpha], i==1, n) * sqrt(mu)), expression(hat(x)))
    # Standard families as supported by other devices
    text(0.95, 7, "abcdefg  - Helvetica", pos = 4, family = "helvetica")
    text(0.95, 6, "abcdefg  - Sans (Arial)", pos = 4, family = "sans")
    text(0.95, 5, "abcdefg - Serif (Times)", pos = 4, family = "serif")
    text(0.95, 4, "abcdefg - Monospace (Courier New)", pos = 4, family = "mono")
    text(0.95, 3, "abcdefg - Symbol Face", pos = 4, font = 5)
    text(0.95, 2, "abcdefg  - Comic Sans", pos = 4, family = "Comic Sans")
    text(0.95, 1, "abcdefg - Georgia Serif", pos = 4, family = "Georgia")
    text(0.95, 0, "abcdefg - Courier", pos = 4, family = "Courier")
    dev.off()
    img <- image_border(img, 'red', geometry = '2x2')

    [Image: families]

    R requires that a graphics device supports at least 4 font types: serif, sans, mono and symbol. The latter is a special 8-bit font with some Greek letters and other characters needed for rendering math. This set of fonts corresponds to the original 13 base fonts from the 1984 PostScript standard:

    • 4x Courier (Regular, Oblique, Bold, Bold Oblique)
    • 4x Helvetica (Regular, Oblique, Bold, Bold Oblique)
    • 4x Times (Roman, Italic, Bold, Bold Italic)
    • Symbol

    Below is a photo of the 1985 Apple LaserWriter, which was the first laser printer to use the PostScript language and support all these fonts! Not much later, PostScript graphics devices were adopted by R's predecessor "The New S" (The New S Language, 1988).

    [Image: printers]

    Geometry Helpers

    Another major improvement in this release is the introduction of helper functions for geometry and option strings. Many functions in magick require a special geometry syntax to specify a size, area, or point. For example to resize an image you need to specify a size:

    image_resize(img, "50%")
    image_resize(img, "300x300")
    image_resize(img, "300x300!")

    Or to crop you need to specify an area which consists of a size and offset:

    image_crop(img, "300x300+100+100")

    We added a few handy helper functions (documented under ?geometry) to generate proper geometry syntax.

    [Image: geometries]
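    As a rough sketch of what these helpers look like (assuming the geometry_size_percent(), geometry_size_pixels() and geometry_area() constructors documented under ?geometry), the string geometries above could be built programmatically:

    # Equivalent to the string forms above, but less error prone
    image_resize(img, geometry_size_percent(50))
    image_resize(img, geometry_size_pixels(300, 300))
    image_crop(img, geometry_area(width = 300, height = 300, x_off = 100, y_off = 100))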

    Magick Options

    A lot of the power in ImageMagick is contained in the hundreds of built-in filters, colorspaces, compose operators, disposal types, convolution kernels, noise types and so on. These are simply specified as a string in the function call.

    For example in our previous post about Image Convolution we discussed a few kernel types:

    # Gaussian Kernel
    img %>% image_convolve('Gaussian:0x5', scaling = '60,40%')

    # Sobel Kernel
    img %>% image_convolve('Sobel')

    # Difference of Gaussians
    img %>% image_convolve('DoG:0,0,2')

    Supported values for each option are described in the online ImageMagick documentation. We have now added functions in the magick package that list all values for each option. This should make it easier to see what is supported and harness the full power of the built-in ImageMagick algorithms.

    [Image: options]

    So we can now easily list e.g. supported kernel types:

    > kernel_types()
     [1] "Undefined"     "Unity"         "Gaussian"      "DoG"          
     [5] "LoG"           "Blur"          "Comet"         "Binomial"     
     [9] "Laplacian"     "Sobel"         "FreiChen"      "Roberts"      
    [13] "Prewitt"       "Compass"       "Kirsch"        "Diamond"      
    [17] "Square"        "Rectangle"     "Disk"          "Octagon"      
    [21] "Plus"          "Cross"         "Ring"          "Peaks"        
    [25] "Edges"         "Corners"       "Diagonals"     "ThinDiagonals"
    [29] "LineEnds"      "LineJunctions" "Ridges"        "ConvexHull"   
    [33] "ThinSe"        "Skeleton"      "Chebyshev"     "Manhattan"    
    [37] "Octagonal"     "Euclidean"     "User Defined" 

    That’s a lot of kernels.
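    The other option listers work the same way. As a small sketch (assuming noise_types() is one of these helpers, analogous to kernel_types() above, and that image_noise() accepts a noisetype argument), they can also be used to validate an option value before passing it on:

    # List the supported noise types and check a value before using it
    noise <- "Gaussian"
    stopifnot(noise %in% noise_types())
    image_noise(img, noisetype = noise)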

    Fuzz Scaling

    Finally, one more (breaking) change: several functions in magick use a fuzz parameter to specify the maximum distance between two colors for them to still be considered similar.

    For example, the flood fill algorithm (the paint-bucket button in MS Paint) changes the color of a given starting pixel, and then recursively all adjacent pixels that have the same color. However, sometimes neighboring pixels are not precisely the same color, but nearly so. The fuzz parameter allows the fill to continue when pixels are not the same but a similar color.

    # Paint the shirt orange
    frink <- image_read("https://jeroen.github.io/images/frink.png") %>%
      image_fill("orange", point = "+100+200", fuzz = 25)

    [Image: frink]

    What has changed in this version is that the fuzz parameter has been rescaled to a percentage. Hence you should always provide a value between 0 and 100. Previously it was the absolute distance between colors, but that depends on the type and color depth of the image at hand, which was very confusing.
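    As a small sketch reusing the frink example above (the fuzz values 0 and 30 are arbitrary illustrations), the new percentage scale makes the tolerance easier to reason about:

    frink <- image_read("https://jeroen.github.io/images/frink.png")
    # fuzz = 0 only fills pixels that exactly match the starting color
    strict <- image_fill(frink, "orange", point = "+100+200", fuzz = 0)
    # fuzz = 30 also fills pixels whose color is within 30% of it
    lenient <- image_fill(frink, "orange", point = "+100+200", fuzz = 30)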


    Exploratory Data Analysis of Ancient Texts with rperseus


    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Introduction

    When I was in grad school at Emory, I had a favorite desk in the library. The desk wasn’t particularly cozy or private, but what it lacked in comfort it made up for in real estate. My books and I needed room to operate. Students of the ancient world require many tools, and when jumping between commentaries, lexicons, and interlinears, additional clutter is additional “friction”, i.e., lapses in thought due to frustration. Technical solutions to this clutter exist, but the best ones are proprietary and expensive. Furthermore, they are somewhat inflexible, and you may have to shoehorn your thoughts into their framework. More friction.

    Interfacing with the Perseus Digital Library was a popular online alternative. The library includes a catalog of classical texts, a Greek and Latin lexicon, and a word study tool for appearances and references in other literature. If the university library’s reference copies of BDAG1 and Synopsis Quattuor Evangeliorum2 were unavailable, Perseus was our next best thing.

    Fast forward several years, and I’ve abandoned my quest to become a biblical scholar. Much to my father’s dismay, I’ve learned writing code is more fun than writing exegesis papers. Still, I enjoy dabbling with dead languages, and it was the desire to wed my two loves, biblical studies and R, that birthed my latest package, rperseus. The goal of this package is to furnish classicists with texts of the ancient world and a toolkit to unpack them.

    Exploratory Data Analysis in Biblical Studies

    Working with the Perseus Digital Library was already a trip down memory lane, but here’s an example of how I would have leveraged rperseus many years ago.

    My best papers often sprung from the outer margins of my Nestle-Aland Novum Testamentum Graece. Here the editors inserted cross references to parallel vocabulary, themes, and even grammatical constructions. Given the intertextuality of biblical literature, the margins are a rich source of questions: Where else does the author use similar vocabulary? How is the source material used differently? Does the literary context affect our interpretation of a particular word? This is exploratory data analysis in biblical studies.

    Unfortunately the excitement of your questions is incommensurate with the tedium of the process: EDA continues by flipping back and forth between books, dog-earing pages, and avoiding paper cuts. rperseus aims to streamline this process with two functions: get_perseus_text and perseus_parallel. The former returns a data frame containing the text from any work in the Perseus Digital Library, and the latter renders a parallel in ggplot2.

    Suppose I am writing a paper on different expressions of love in Paul’s letters. Naturally, I start in 1 Corinthians 13, the famed “Love Chapter” often heard at weddings and seen on bumper stickers. I finish the chapter and turn to the margins. In the image below, I see references to Colossians 1:4, 1 Thessalonians 1:3, 5:8, Hebrews 10:22-24, and Romans 8:35-39.

    [Image: 1 Corinthians 13 in Nestle-Aland Novum Testamentum Graece]

    Ignoring that some scholars exclude Colossians from the “authentic” letters, let’s see the references alongside each other:

    library(rperseus) # devtools::install_github("ropensci/rperseus")
    library(tidyverse)
    tribble(
      ~label, ~excerpt,
      "Colossians", "1.4",
      "1 Thessalonians", "1.3",
      "1 Thessalonians", "5.8",
      "Romans", "8.35-8.39"
      ) %>%
      left_join(perseus_catalog) %>%
      filter(language == "grc") %>%
      select(urn, excerpt) %>%
      pmap_df(get_perseus_text) %>%
      perseus_parallel(words_per_row = 4)

    A brief explanation: First, I specify the labels and excerpts within a tibble. Second, I join the lazily loaded perseus_catalog onto the data frame. Third, I filter for the Greek3 and select the columns containing the arguments required for get_perseus_text. Fourth, I map over each urn and excerpt, returning another data frame. Finally, I pipe the output into perseus_parallel.

    The key word shared by each passage is agape (“love”). Without going into detail, it might be fruitful to consider the references alongside each other, pondering how the semantic range of agape expands or contracts within the Pauline corpus. Paul had a penchant for appropriating and recasting old ideas–often in slippery and unexpected ways–and your Greek lexicon provides a mere approximation. In other words, how can we move from the dictionary definition of agape towards Paul’s unique vision?

    If your Greek is rusty, you can parse each word with parse_excerpt by locating the text’s urn within the perseus_catalog object.

    parse_excerpt(urn = "urn:cts:greekLit:tlg0031.tlg012.perseus-grc2", excerpt = "1.4")
    | word   | form       | verse | part_of_speech | person | number   | tense  | mood       | voice  | gender    | case       | degree |
    |--------|------------|-------|----------------|--------|----------|--------|------------|--------|-----------|------------|--------|
    | ἀκούω  | ἀκούσαντες | 1.4   | verb           | NA     | plural   | aorist | participle | active | masculine | nominative | NA     |
    |        | τὴν        | 1.4   | article        | NA     | singular | NA     | NA         | NA     | feminine  | accusative | NA     |
    | πίστις | πίστιν     | 1.4   | noun           | NA     | singular | NA     | NA         | NA     | feminine  | accusative | NA     |
    | ὑμός   | ὑμῶν       | 1.4   | pronoun        | NA     | plural   | NA     | NA         | NA     | masculine | genative   | NA     |

    If your Greek is really rusty, you can also flip the language filter to “eng” to view an older English translation.4 And if the margin references a text from the Old Testament, you can call the Septuagint as well as the original Hebrew.5

    tribble(
      ~label, ~excerpt,
      "Genesis", "32.31",
      "Genesis, pointed", "32.31",
      "Numeri", "12.8",
      "Numbers, pointed", "12.8"
      ) %>%
      left_join(perseus_catalog) %>%
      filter(language %in% c("grc", "hpt")) %>%
      select(urn, excerpt) %>%
      pmap_df(get_perseus_text) %>%
      perseus_parallel()

    Admittedly, there is some "friction" here in joining the perseus_catalog onto the initial tibble. There is a learning curve in getting acquainted with the idiosyncrasies of the catalog object. A later release will aim to streamline this workflow.

    Future Work

    Check the vignette for a more general overview of rperseus. In the meantime, I look forward to getting more intimately acquainted with the Perseus Digital Library. Tentative plans to extend rperseus include a Shiny interface to further reduce "friction" and a method of creating a "book" of custom parallels with bookdown.

    Acknowledgements

    I want to thank my two rOpenSci reviewers, Ildikó Czeller and François Michonneau, for coaching me through the review process. They were the first two individuals to ever scrutinize my code, and I was lucky to hear their feedback. rOpenSci onboarding is truly a wonderful process.


    1. Bauer, Walter. A Greek-English Lexicon of the New Testament and Other Early Christian Literature. Edited by Frederick W. Danker. 3rd ed. Chicago: University of Chicago Press, 2000.
    2. Aland, Kurt. Synopsis Quattuor Evangeliorum. Deutsche Bibelgesellschaft, 1997.
    3. The Greek text from the Perseus Digital Library is from 1885 standards. The advancement of textual criticism in the 20th century led to a more stable text you would find in current editions of the Greek New Testament.
    4. The English translation is from Rainbow Missions, Inc. World English Bible. Rainbow Missions, Inc.; revision of the American Standard Version of 1901. I've toyed with the idea of incorporating more modern translations, but that would require resources beyond the Perseus Digital Library.
    5. “hpt” is the pointed Hebrew text from Codex Leningradensis.

    Usage of ruler package


    (This article was first published on QuestionFlow , and kindly contributed to R-bloggers)

    Usage examples of ruler package: dplyr-style exploration and validation of data frame like objects.

    Prologue

    My previous post tells a story about the design of my ruler package, which presents tools for "… creating data validation pipelines and tidy reports". This package offers a framework for exploring and validating data frame like objects using the dplyr grammar of data manipulation.

    This post is intended to show some close-to-reality ruler usage examples. The described methods and approaches reflect the package design. Along the way you will learn why Yoda and Jabba the Hutt are "outliers" among core "Star Wars" characters.

    For more information see README (for relatively brief comprehensive introduction) or vignettes (for more thorough description of package capabilities).

    Beware of a lot of code.

    Overview

    suppressMessages(library(dplyr))
    suppressMessages(library(purrr))
    library(ruler)

    The general way of performing validation with ruler can be described with the following steps (a minimal end-to-end sketch follows the list):

    • Formulate a validation task. It is usually stated in the form of a yes-no question or true-false statement about some part (data unit) of an input data frame. A data unit can be one of: data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole], cell. For example: does every column contain elements with a sum of more than 100?
    • Create a dplyr-style validation function (rule pack) which checks desired data unit for obedience to [possibly] several rules:
      mtcars %>% summarise_all(funs(enough_sum = sum(.) > 100))
      • Use ruler’s function rules() instead of explicit or implicit usage of funs():
      mtcars %>% summarise_all(rules(enough_sum = sum(.) > 100))
      . %>% summarise_all(rules(enough_sum = sum(.) > 100))
      • Wrap with rule specification function to explicitly identify validated data unit and to name rule pack. In this case it is col_packs() for column data unit with “is_enough_sum” as rule pack name:
      col_packs(
        is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100))
      )
    • Expose data to rules to obtain validation result (exposure). Use ruler’s expose() function for that. It doesn’t modify contents of input data frame but creates/updates exposure attribute. Exposure is a list with information about used rule packs (packs_info) and tidy data validation report (report).
    • Act after exposure. It can be:
      • Observing validation results with get_exposure(), get_packs_info() or get_report().
      • Making assertions if specific rules are not followed in desired way.
      • Imputing input data frame based on report.
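
    Putting these steps together, here is a minimal end-to-end sketch on mtcars, reusing the column rule pack built in the list above:

    # Column rule pack: every column should have a sum of more than 100
    is_enough_sum <- col_packs(
      is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100))
    )

    # Expose the data to the rule pack and look at the tidy validation report
    mtcars %>%
      expose(is_enough_sum) %>%
      get_report()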

    In the examples we will use the starwars data from the dplyr package (to celebrate an upcoming new episode). It is a tibble in which every row describes one "Star Wars" character. Every example starts with a validation task stated in italics and performs validation from beginning to end.

    Create rule packs

    Data

    Does starwars have 1) number of rows 1a) more than 50; 1b) less than 60; 2) number of columns 2a) more than 10; 2b) less than 15?

    check_data_dims <- data_packs(
      check_dims = . %>% summarise(
        nrow_low = nrow(.) >= 50, nrow_up = nrow(.) <= 60,
        ncol_low = ncol(.) >= 10, ncol_up = ncol(.) <= 15
      )
    )
    
    starwars %>%
      expose(check_data_dims) %>%
      get_exposure()
    ##   Exposure
    ## 
    ## Packs info:
    ## # A tibble: 1 x 4
    ##         name      type             fun remove_obeyers
    ##                                 
    ## 1 check_dims data_pack            TRUE
    ## 
    ## Tidy data validation report:
    ## # A tibble: 1 x 5
    ##         pack    rule   var    id value
    ##              
    ## 1 check_dims nrow_up  .all     0 FALSE

    The result is interpreted as follows:

    • The data was exposed to one rule pack for the data as a whole (a data rule pack) named "check_dims". For it, all obeyers (data units which follow the specified rule) were removed from the validation report.
    • The combination of var equal to .all and id equal to 0 means that the data as a whole is validated.
    • The input data frame doesn't obey (because value is equal to FALSE) the rule nrow_up from the rule pack check_dims.

    Does starwars have enough rows for characters 1) with blond hair; 2) humans; 3) humans with blond hair?

    check_enough_rows <- data_packs(
      enough_blond = . %>% filter(hair_color == "blond") %>%
        summarise(is_enough = n() > 10),
      enough_humans = . %>% summarise(
        is_enough = sum(species == "Human", na.rm = TRUE) > 30
      ),
      ehough_blond_humans = . %>% filter(
        hair_color == "blond", species == "Human"
      ) %>%
        summarise(is_enough = n() > 5)
    )
    
    starwars %>%
      expose(check_enough_rows) %>%
      get_exposure()
    ##   Exposure
    ## 
    ## Packs info:
    ## # A tibble: 3 x 4
    ##                  name      type             fun remove_obeyers
    ##                                          
    ## 1        enough_blond data_pack            TRUE
    ## 2       enough_humans data_pack            TRUE
    ## 3 ehough_blond_humans data_pack            TRUE
    ## 
    ## Tidy data validation report:
    ## # A tibble: 2 x 5
    ##                  pack      rule   var    id value
    ##                         
    ## 1        enough_blond is_enough  .all     0 FALSE
    ## 2 ehough_blond_humans is_enough  .all     0 FALSE

    New information gained from this example:

    • Rule specification functions can be supplied with multiple rule packs, all of which will be used independently during exposing.

    Does starwars have enough numeric columns?

    check_enough_num_cols <- data_packs(
      enough_num_cols = . %>% summarise(
        is_enough = sum(map_lgl(., is.numeric)) > 1
      )
    )
    
    starwars %>%
      expose(check_enough_num_cols) %>%
      get_report()
    ## Tidy data validation report:
    ## # A tibble: 0 x 5
    ## # ... with 5 variables: pack , rule , var , id ,
    ## #   value 
    • If no breaker is found, get_report() returns a tibble with zero rows and the usual columns.

    Group

    Does group defined by hair color and gender have a member from Tatooine?

    has_hair_gender_tatooine <- group_packs(
      hair_gender_tatooine = . %>%
        group_by(hair_color, gender) %>%
        summarise(has_tatooine = any(homeworld == "Tatooine")),
      .group_vars = c("hair_color", "gender"),
      .group_sep = "__"
    )
    
    starwars %>%
      expose(has_hair_gender_tatooine) %>%
      get_report()
    ## Tidy data validation report:
    ## # A tibble: 12 x 5
    ##                   pack         rule                 var    id value
    ##                                           
    ## 1 hair_gender_tatooine has_tatooine      auburn__female     0 FALSE
    ## 2 hair_gender_tatooine has_tatooine  auburn, grey__male     0 FALSE
    ## 3 hair_gender_tatooine has_tatooine auburn, white__male     0 FALSE
    ## 4 hair_gender_tatooine has_tatooine      blonde__female     0 FALSE
    ## 5 hair_gender_tatooine has_tatooine          grey__male     0 FALSE
    ## # ... with 7 more rows
    • group_packs() needs grouping columns supplied via .group_vars.
    • The var column of the validation report contains the levels of the grouping columns that identify the group. By default they are pasted together with ".". To change that, supply the .group_sep argument.
    • 12 combinations of hair_color and gender don’t have a character from Tatooine. They are “auburn”-“female”, “auburn, grey”-“male” and so on.

    Column

    Does every list-column have 1) enough average length; 2) enough unique elements?

    check_list_cols <- col_packs(
      check_list_cols = . %>%
        summarise_if(
          is.list,
          rules(
            is_enough_mean = mean(map_int(., length)) >= 1,
            length(unique(unlist(.))) >= 10
          )
        )
    )
    
    starwars %>%
      expose(check_list_cols) %>%
      get_report()
    ## Tidy data validation report:
    ## # A tibble: 3 x 5
    ##              pack           rule       var    id value
    ##                              
    ## 1 check_list_cols is_enough_mean  vehicles     0 FALSE
    ## 2 check_list_cols is_enough_mean starships     0 FALSE
    ## 3 check_list_cols        rule..2     films     0 FALSE
    • To specify rule functions inside dplyr's scoped verbs use ruler::rules(). It powers correct output interpretation during the exposing process and imputes missing rule names based on the present rules in the current rule pack.
    • Columns vehicles and starships don’t have enough average length and column films doesn’t have enough unique elements.

    Are all values of column birth_year non-NA?

    starwars %>%
      expose(
        col_packs(
          . %>% summarise_at(
            vars(birth_year = "birth_year"),
            rules(all_present = all(!is.na(.)))
          )
        )
      ) %>%
      get_report()
    ## Tidy data validation report:
    ## # A tibble: 1 x 5
    ##          pack        rule        var    id value
    ##                        
    ## 1 col_pack..1 all_present birth_year     0 FALSE
    • To correctly validate one column with a scoped dplyr verb, it should be a named argument inside vars. This is needed for correct interpretation of the rule pack output.

    Row

    Has a character appeared in enough films? As a character is defined by a row, this is a row pack.

    has_enough_films <- row_packs(
      enough_films = . %>% transmute(is_enough = map_int(films, length) >= 3)
    )
    
    starwars %>%
      expose(has_enough_films) %>%
      get_report() %>%
      left_join(y = starwars %>% transmute(id = 1:n(), name),
                by = "id") %>%
      print(.validate = FALSE)
    ## Tidy data validation report:
    ## # A tibble: 64 x 6
    ##           pack      rule   var    id value              name
    ##                               
    ## 1 enough_films is_enough  .all     8 FALSE             R5-D4
    ## 2 enough_films is_enough  .all     9 FALSE Biggs Darklighter
    ## 3 enough_films is_enough  .all    12 FALSE    Wilhuff Tarkin
    ## 4 enough_films is_enough  .all    15 FALSE            Greedo
    ## 5 enough_films is_enough  .all    18 FALSE  Jek Tono Porkins
    ## # ... with 59 more rows
    • 64 characters haven’t appeared in 3 films or more. Those are characters described in starwars in rows 8, 9, etc. (counting based on input data).

    Is every character with height less than 100 a droid?

    is_short_droid <- row_packs(
      is_short_droid = . %>% filter(height < 100) %>%
        transmute(is_droid = species == "Droid")
    )
    
    starwars %>%
      expose(is_short_droid) %>%
      get_report() %>%
      left_join(y = starwars %>% transmute(id = 1:n(), name, height),
                by = "id") %>%
      print(.validate = FALSE)
    ## Tidy data validation report:
    ## # A tibble: 5 x 7
    ##             pack     rule   var    id value                  name height
    ##                                      
    ## 1 is_short_droid is_droid  .all    19 FALSE                  Yoda     66
    ## 2 is_short_droid is_droid  .all    29 FALSE Wicket Systri Warrick     88
    ## 3 is_short_droid is_droid  .all    45 FALSE              Dud Bolt     94
    ## 4 is_short_droid is_droid  .all    72 FALSE         Ratts Tyerell     79
    ## 5 is_short_droid is_droid  .all    73    NA                R4-P17     96
    • One can expose only a subset of rows by using filter or slice. The value of the id column in the result will reflect the row number in the original input data frame. This feature is powered by the keyholder package. In order to use it, the rule pack should be created using its supported functions.
    • value equal to NA is treated as rule breaker.
    • 5 “not tall” characters are not droids.

    Cell

    Is a non-NA numeric cell not an outlier based on its z-score? This is a bit tricky: to present outliers as rule breakers, one should ask whether a cell is not an outlier.

    z_score <- function(x, ...) {abs(x - mean(x, ...)) / sd(x, ...)}
    
    cell_isnt_outlier <- cell_packs(
      dbl_not_outlier = . %>%
        transmute_if(
          is.numeric,
          rules(isnt_out = z_score(., na.rm = TRUE) < 3 | is.na(.))
        )
    )
    
    starwars %>%
      expose(cell_isnt_outlier) %>%
      get_report() %>%
      left_join(y = starwars %>% transmute(id = 1:n(), name),
                by = "id") %>%
      print(.validate = FALSE)
    ## Tidy data validation report:
    ## # A tibble: 4 x 6
    ##              pack     rule        var    id value                  name
    ##                                          
    ## 1 dbl_not_outlier isnt_out     height    19 FALSE                  Yoda
    ## 2 dbl_not_outlier isnt_out       mass    16 FALSE Jabba Desilijic Tiure
    ## 3 dbl_not_outlier isnt_out birth_year    16 FALSE Jabba Desilijic Tiure
    ## 4 dbl_not_outlier isnt_out birth_year    19 FALSE                  Yoda
    • 4 non-NA numeric cells appear to be outliers within their columns.

    Expose data to rules

    Do the groups defined by species, gender and eye_color (3 different checks) have an appropriate size?

    starwars %>%
      expose(
        group_packs(. %>% group_by(species) %>% summarise(isnt_many = n() <= 5),
                    .group_vars = "species")
      ) %>%
      expose(
        group_packs(. %>% group_by(gender) %>% summarise(isnt_many = n() <= 60),
                    .group_vars = "gender"),
        .remove_obeyers = FALSE
      ) %>%
      expose(is_enough_eye_color = . %>% group_by(eye_color) %>%
               summarise(isnt_many = n() <= 20)) %>%
      get_exposure() %>%
      print(n_report = Inf)
    ##   Exposure
    ## 
    ## Packs info:
    ## # A tibble: 3 x 4
    ##                  name       type              fun remove_obeyers
    ##                                            
    ## 1       group_pack..1 group_pack            TRUE
    ## 2       group_pack..2 group_pack           FALSE
    ## 3 is_enough_eye_color group_pack            TRUE
    ## 
    ## Tidy data validation report:
    ## # A tibble: 7 x 5
    ##                  pack      rule           var    id value
    ##                                 
    ## 1       group_pack..1 isnt_many         Human     0 FALSE
    ## 2       group_pack..2 isnt_many        female     0  TRUE
    ## 3       group_pack..2 isnt_many hermaphrodite     0  TRUE
    ## 4       group_pack..2 isnt_many          male     0 FALSE
    ## 5       group_pack..2 isnt_many          none     0  TRUE
    ## 6       group_pack..2 isnt_many            NA     0  TRUE
    ## 7 is_enough_eye_color isnt_many         brown     0 FALSE
    • expose() can be applied sequentially, which results in updating the existing exposure with new information.
    • expose() imputes names of supplied unnamed rule packs based on the present rule packs for the same data unit type.
    • expose() by default removes obeyers (rows with data units that obey respective rules) from validation report. To stop doing that use .remove_obeyers = FALSE during expose() call.
    • expose() by default guesses the type of the supplied rule pack based only on its output. This has some annoying edge cases but is suitable for interactive usage. To turn this feature off use .guess = FALSE as an argument for expose(). Also, to avoid edge cases create rule packs with appropriate wrappers.

    Perform some previous checks with one expose().

    my_packs <- list(check_data_dims, is_short_droid, cell_isnt_outlier)
    
    str(my_packs)
    ## List of 3
    ##  $ :List of 1
    ##   ..$ check_dims:function (value)  
    ##   .. ..- attr(*, "class")= chr [1:4] "data_pack" "rule_pack" "fseq" "function"
    ##  $ :List of 1
    ##   ..$ is_short_droid:function (value)  
    ##   .. ..- attr(*, "class")= chr [1:4] "row_pack" "rule_pack" "fseq" "function"
    ##  $ :List of 1
    ##   ..$ dbl_not_outlier:function (value)  
    ##   .. ..- attr(*, "class")= chr [1:4] "cell_pack" "rule_pack" "fseq" "function"
    
    starwars_exposed_list <- starwars %>%
      expose(my_packs)
    
    starwars_exposed_arguments <- starwars %>%
      expose(check_data_dims, is_short_droid, cell_isnt_outlier)
    
    identical(starwars_exposed_list, starwars_exposed_arguments)
    ## [1] TRUE
    • The rule pack argument of expose() can be a list of lists [of lists, of lists, …] with functions at any depth. This enables creating a list of rule packs wrapped with *_packs() functions (which all return a list of functions).
    • expose() can have multiple rule packs as separate arguments.

    Act after exposure

    Throw an error if any non-NA value of mass is more than 1000.

    starwars %>%
      expose(
        col_packs(
          low_mass = . %>% summarise_at(
            vars(mass = "mass"),
            rules(is_small_mass = all(. <= 1000, na.rm = TRUE))
          )
        )
      ) %>%
      assert_any_breaker()
    ##   Breakers report
    ## Tidy data validation report:
    ## # A tibble: 1 x 5
    ##       pack          rule   var    id value
    ##                  
    ## 1 low_mass is_small_mass  mass     0 FALSE
    ## Error: assert_any_breaker: Some breakers found in exposure.
    • assert_any_breaker() is used to assert presence of at least one breaker in validation report.

    However, the offered solution via a column pack doesn't show the rows which break the rule. To do that one can use a cell pack:

    starwars %>%
      expose(
        cell_packs(
          low_mass = . %>% transmute_at(
            vars(mass = "mass"),
            rules(is_small_mass = (. <= 1000) | is.na(.))
          )
        )
      ) %>%
      assert_any_breaker()
    ##   Breakers report
    ## Tidy data validation report:
    ## # A tibble: 1 x 5
    ##       pack          rule   var    id value
    ##                  
    ## 1 low_mass is_small_mass  mass    16 FALSE
    ## Error: assert_any_breaker: Some breakers found in exposure.

    Remove numeric columns with a mean value below a certain threshold. To achieve that, one should formulate the rule as "column mean should be above the threshold", identify breakers and act upon this information.

    remove_bad_cols <- function(.tbl) {
      bad_cols <- .tbl %>%
        get_report() %>%
        pull(var) %>%
        unique()
      
      .tbl[, setdiff(colnames(.tbl), bad_cols)]
    }
    
    starwars %>%
      expose(
        col_packs(
          . %>% summarise_if(is.numeric, rules(mean(., na.rm = TRUE) >= 100))
        )
      ) %>%
      act_after_exposure(
        .trigger = any_breaker,
        .actor = remove_bad_cols
      ) %>%
      remove_exposure()
    ## # A tibble: 87 x 11
    ##             name height hair_color  skin_color eye_color gender homeworld
    ##                                       
    ## 1 Luke Skywalker    172      blond        fair      blue   male  Tatooine
    ## 2          C-3PO    167               gold    yellow     Tatooine
    ## 3          R2-D2     96        white, blue       red        Naboo
    ## 4    Darth Vader    202       none       white    yellow   male  Tatooine
    ## 5    Leia Organa    150      brown       light     brown female  Alderaan
    ## # ... with 82 more rows, and 4 more variables: species ,
    ## #   films , vehicles , starships 
    • act_after_exposure is a wrapper for performing actions after exposing. It takes a .trigger function to trigger the action and an .actor function to perform the action and return its result.
    • any_breaker is a function which returns TRUE if the tidy validation report attached to it has any breaker and FALSE otherwise.

    Conclusions

    • Yoda and Jabba the Hutt are outliers among other “Star Wars” characters: Yoda is by height and birth year, Jabba is by mass and also birth year.
    • There are fewer than 10 "Star Wars" films yet.
    • ruler offers flexible and extendable functionality for common validation tasks. Validation can be done for data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole] and cell. After exposing data frame of interest to rules and obtaining tidy validation report, one can perform any action based on this information: explore report, throw error, impute input data frame, etc.

    Names of b…..s badder than Taylor Swift, a class in women’s studies?


    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    Once again, a Twitter trend sent me to my R prompt… Here is a bit of context. My summary: Taylor Swift apparently plays the bad girl in her new album and a fan of hers asked a question…

    Name a bitch badder than Taylor Swift 😍😛😤pic.twitter.com/AkSyQBUIME

    — Nutella (@xnulz) November 10, 2017

    The tweet was then quoted by many people mentioning badass women, and I decided to have a look at these heroes!

    Name tweets badder than Nutella’s

    I was a bit lazy and asked Mike Kearney, rtweet maintainer, how to find tweets quoting a tweet, to which Bob Rudis answered. Now that I even had the code, it was no trouble at all getting the data. I added the filtering steps myself, see, I’m not that lazy. I also removed the link to the quoted tweet that was at the end of each tweet.

    question_tweet <- "928857792982781952"
    badass <- rtweet::search_tweets(question_tweet, n = 18000, include_rts = FALSE)
    badass <- dplyr::filter(badass, is_quote_status, quote_status_id == question_tweet)
    badass <- dplyr::mutate(badass, text = stringr::str_replace(text, "https://t\\.co/.*$", ""))
    badass <- dplyr::mutate(badass, text = trimws(text))
    readr::write_csv(badass, path = "data/2017-12-03-badderb_badass.csv")

    I obtained 15653 tweets. Not bad!

    library("magrittr")
    set.seed(20171015)
    indices <- sample.int(n = nrow(badass), size = 7)
    badass$text[indices]
    ## [1] "Carmina Barrios"
    ## [2] "Anyone"
    ## [3] "Shirley Temple"
    ## [4] "So a lot of people have shared some bad ass women in repsonse to the tweet below. I hope someone is compiling those responses for one kick ass book."
    ## [5] "Mary Bowers was a slave with a photographic memory who pretended to be \"slow-witted\" in order to spy on confederate soldiers in the house worked, and pass intel she gathered to Union forces during the Civil War."
    ## [6] "Ramona Quimby, age 8"
    ## [7] "Snow White. Sleeping Beauty. Cinderella, Elsa, Ariel bec the category example is obvs mediocre Disney caricatures allowed by the patriarchy, right?"

    Unnamed badder b…..s: your mother and grand-mother

    Out of the 15653 tweets, 570 contained the word "mother". I haven't looked for the word "mum" and haven't checked that the person mentioned is actually from the family of the tweet author. Here are a few of the personal stories (or not) identified with this quick and dirty method.

    set.seed(20171015)
    mothers <- dplyr::filter(badass, stringr::str_detect(text, "mother"))
    indices <- sample.int(n = 15)
    mothers$text[indices]
    ##  [1] "My grandmother struggled through poverty her entire life, in a family prone to depression, addiction, and suicide. She had so many Caesarian sections that when she reached menopause she didn�t have a belly button anymore."                                                       ##  [2] "Mom married my dad. He had been married 2X before with a total of 10 kids. His first wife left him and the kid.  2nd wife died after their divorce.  He had custody of all 10, judge asked my mother if she would take them in, she did. Raised his 10 and had 3 with dad. 13 in all"##  [3] "Each of my grandmothers raised 11 kids "                                                                                                                                                                                                                 ##  [4] "My mother"                                                                                                                                                                                                                                                                           ##  [5] "My grandmother."                                                                                                                                                                                                                                                                     ##  [6] "My mother in law, who isn�t a �bitch� in any way, shape, or form, and raised three girls on her own (and who are all bad asses in their own way) without the help of a deadbeat ex."                                                                                                 ##  [7] ".@RuPaul for starters. And my great grandmother who drove a car when she was 12. My dog, too. She is definitely a bad bitch."                                                                                                                                                        ##  [8] "My grandmother escaped from the Tsar with nothing but the clothes on her back."                                                                                                                                                                                                      
##  [9] "My mother"                                                                                                                                                                                                                                                                           ## [10] "Almost any woman alive today,@xnulz \n\nAnd - Here�s another one, for sure (PLUS she�s a good mother):\n\nTonya Harding; @ITonyaMovie \n\n"## [11] "all of the single mothers doing the most they can for their children"                                                                                                                                                                                                                ## [12] "My grandmother was married off at the age of 14 to an older man who had already been married once, dealt with an abusive marriage but stuck around for the kids and went back to school after having 5 children and worked for 20 years to support her family"                       ## [13] "my grandmother was an army nurse in WW2.\ntaught me how to tourniquet a leg and bandage it using only gauze"                                                                                                                                                                         ## [14] "My wife birthed a goddamn child. My mother and grandmother both birthed multiple. This is too easy."                                                                                                                                                                                 ## [15] "My grandmother worked for the OSS in London during WW2 as a code breaker."

    Can we talk about that belly button thing?! I’m also happy to see a diversity of things they were recognized for.

    Names of the badder b…..s

    Quite a few of the tweets from this trend contained the name of someone. In order to extract these names, I resorted to a language processing method called entity extraction, the entity here being a person. For that, I could have used an extractor module of the Monkeylearn platform via my own monkeylearn package.

    Instead, I chose to illustrate a different method: using the cleanNLP package that I know from the excellent R Journal paper presenting it. Among other things, it serves as an interface between R and the Python library spaCy and also as an interface between R and the coreNLP Java library. Installing these tools is the painful part of the setup, but 1) you only need to install one of them, 2) there are detailed instructions here, and 3) once your tool is installed, using the package is a breeze (and, well, independent of any rate limit, contrary to monkeylearn use). I am at that breeze stage, you can be jealous.

    There were a few tweets with infuriating encoding issues, BOM or something like that, and I decided to just ignore them by using purrr::possibly. I obviously did this to illustrate the use of this purrr function, not out of laziness.

    library("cleanNLP")
    init_spaCy()
    # we need to remove characters like "\u0098"
    badass <- dplyr::mutate(badass, text = enc2native(text))
    get_entities_with_text <- function(x){
      obj <- run_annotators(x, as_strings = TRUE)
      entities <- get_entity(obj)
      entities$text <- x
      entities
    }
    possibly_get_entities <- purrr::possibly(get_entities_with_text, otherwise = NULL)
    entities <- purrr::map_df(badass$text, possibly_get_entities)
    readr::write_csv(entities, path = "data/2017-12-03-badderb_entities.csv")

    I got at least one entity for 7504 out of 15653 tweets, and at least one person for 4664. I am very satisfied with this.

    So, who are you, badder b…..s?

    We get these kinds of entities: NORP, CARDINAL, LANGUAGE, GPE, DATE, ORG, PERSON, TIME, LOC, MONEY, WORK_OF_ART, EVENT, FAC, QUANTITY, LAW, PRODUCT, ORDINAL, PERCENT. I'm more interested in PERSON and no, I'm not shouting. I chose to look at the top 12 in order to get a top 10 excluding Taylor Swift herself.

    entities %>%
      dplyr::filter(entity_type == "PERSON") %>%
      dplyr::group_by(entity) %>%
      dplyr::summarise(n = n()) %>%
      dplyr::arrange(-n) %>%
      head(n = 12) %>%
      knitr::kable()
    |entity               |   n|
    |:--------------------|---:|
    |Taylor Swift         | 213|
    |Taylor               | 145|
    |Rosa Parks           | 140|
    |Harriet Tubman       | 109|
    |Dora                 |  90|
    |Rose West            |  85|
    |Lyudmila Pavlichenko |  77|
    |Joan                 |  71|
    |Marie Curie          |  57|
    |Myra Hindley         |  50|
    |Nancy Wake           |  45|
    |Hillary Clinton      |  41|

    At that point I did feel like bursting out laughing though. Dora! And I checked, we're talking about Dora the Explorer! Joan is Joan of Arc. Interestingly, in that top 10 we're mixing really bad persons, e.g. Myra Hindley was a serial killer, and really badass persons, like Rosa Parks. My husband will be happy to see Marie Curie in this list, since he's a big fan of hers, having even guided a few tours about her life in Paris.

    Looking at the most frequently mentioned women obviously makes us lose, well, wrongly written names and, most importantly, personal stories of badass mothers and the like, and of native women for instance, although I have the impression of having read about a few, probably because I follow Auriel Fournier.

    Writing history?

    I saw someone say they'd use the tweets as a basis for history lessons. In order to get a view of a person, one could concatenate the tweets about them. Take Marie Curie for instance.

    entities%>%dplyr::filter(entity_type=="PERSON",entity=="Marie Curie")%>%dplyr::summarise(text=toString(text))%>%.$text
    ## [1] "Rachel Carson, Marie Curie, Ruth Bader Ginsberg, Madeleine Albright, Diane Fossey, Helen Keller, Gloria Steinem, Madonna, Aretha Franklin, Margot Lee Shetterly, Malala Yousafzai, and a whole lot more., Rosa Parks. Harriet Tubman. Anne Frank. Malala Yousafazi, Susan B. Anthony, Sally Ride, Marie Curie. Margaret Thatcher, Indira Ghandi, Golda Meier., ...it�s not bitch. It�s woman.\n\nMarie Curie\n\nRosa Parks\n\nEleanor Roosevelt\n\nHedy Lamar\n\nSappho\n\nAbagail Adams\n\nFlorence Nightingale\n\nSally Ride\n\nMargaret Chase Smith\n\nAnne Frank\n\nMargaret Thatcher \n\nSandra Day O�Connor\n\nOprah\n\nLilith\n\nMarilyn Monroe\n\nDita Von Teese, Ruby Bridges, Barbara Jordan, Marie Curie, Rosa Parks, Ida B Wells, Susan B Anthony, Harriet Tubman, ..., Lorde. Etc., I mean, off hand, Marie Curie was denied access to University because she was a woman, educated herself, served as a surgeon in WW1 and became the only person ever to earn two Nobel prizes, Emmeline Pankhurst, Amelia Earhart, Florence Nightingale, Rosa Parks, Joan of Arc, Marie Curie, Lilian Ngoyi, Helen Suzman, Katherine Johnson,  Frida Khalo, Catherine the Great, Eleanor Roosevelt, Mary Wollstonecraft, Valentina Tereshkova.., lmao Elizabeth Schuyler-Hamilton, Hedy Lamar, Amelia Earhart, Marie Curie, Sophie Scholl, Katherine Johnson, Dorothy Vaughn, Mary Jackson, Margaret Hamilton, Ada Lovelace, Anne Bonny, Calamity Jane, George Eliot, Mary Read, Mary Wolstncraft, Mary Shelley... need I go on?! , Marie Curie, Rose Parks\nFrancis Perkins\nJeanette Rankin\nSally Ride\nMarie Curie\nJK Rowling\n\nShall I go on?, Marie Curie: she invented radioactivity, got the Nobel prize twice and found polonium and radium., Well I would start with not calling a woman a �bitch�.\nAfter that I would say Hedy Lamarr, Susan B. Anthony, any of the @HiddenFigures, Marie Curie, Sally Ride, Molly Pitcher, Clara Barton � shall I continue?, Marie Curie literally died doing Nobel Prize winning research., Harriet Tubman, Shirley Chisholm, Rosa Parks, Ida B. Wells, Marie Curie, Maxine Waters, Lena Horne,  Ruby Dee, Angela Davis, My grandmama, my great-aunt Priscilla, my mama, my sister..., Marie Curie. Mother of radioactivity. First and only woman to win the nobel prize twice. Died due to radiation exposure., Rosa Parks, Freya Stark, Ida B Wells, Sally Ride, Marie Curie, Margaret Heafield, Amelia Earheart, Ruth Bader Ginsburg, Abigail Adams, Jane Goodall, Malala how much time you got?, Marie Curie made mobile x-ray unit, learned anatomy & auto repairs, trained women, raised money, ran IRC's medical x-ray division in WWI., Marie Curie. No explanation needed., Rosa Parks, Katherine Johnson, Dorothy Vaughn, Mary Jackson, Bettie Page, Emma Watson, Malala Yousafzai, Emily Pankhurst, Anne Frank, Marie Curie, Joan of Arc, Boudicca... need I go on?, Marie Curie., Malala, Michelle Obama, Frida Kahlo, Emma Watson, Maya Angelou, Marie Curie, Lady Gaga, JK Rowling, Princess Diana, Carrie Fisher, Anne Frank, Laverne Cox...\n\n literally so many women... I could go on for weeks..., Marie Curie, Mary Anning, Dorothy Hodgkin, Jane Austin, Katherine Johnson, my GCSE geography and A Level geology teacher Sheila Tanner...., Marie Curie. She handled radioactive materials with barehand and won the Nobel Prize. How bout that? 
#RealTimeChem, Rosa Parks, Nellie Bly, Harriet Tubman, Jane Adams, Eleanor Roosevelt, Elizabeth Blackwell, Marie Curie, Amelia Earhart, Cleopatra, Queen Elizabeth I, Ella Fitzgerald, Grace Hopper, Dolores Huerta, Shirley Jackson, Joan of Arc, Helen Keller, Sacagawea, Sappho the Greek poet..., Marie Curie., Marie Curie., Marie Curie, Maya Angelou, Rosa Parks, Michelle Obama, my  grandma, my neuroscientist mum, Danica Patrick, the Queen of England, Angela Merkel,  @chrissyteigen @BrookeBCNN @KateBolduan - Me, Rosa Parks, Harriet Tubman, Shirley Chisholm, Ella Fitzgerald, Ruth Bader Ginsburg, Esperonza Spaulding, Sister Rossetta Tharpe, Sarah Vaughn, Aretha Franklin, Eartha Kitt, Carrie Fisher, Patti Smith, Jane Goodall, Marie Curie, Sally Ride, Leontyne Price, Princess Diana, etc..., Whoa hey now, Marie Curie won Nobel Prizes in two different sciences, only person to have done that. Probably had a lovely singing voice, too, before it got demolished by the radioactivity she was discovering for the betterment of humanity., Rosa Parks. Laura Secord. Hillary Rodham Clinton. Nellie McClung. Queen Victoria. Joan of Arc. Myrlie Evers-Williams. Tina Turner. Jane Goodall. Ruth Bader Ginsburg. Viola Desmond. Jennie Trout. Malala  Yousafzai. Margret Atwood. Ameila Earhart. J.K. Rowling. Marie Curie., Marie Curie. Won the Nobel Prize. Twice., Just off the top of my head: Elizabeth Cady Stanton, Clara Barton, Marie Curie, Amelia Earhart, Elizabeth I, Joan of Arc, Helen Keller�., Amelia Earhart, Rosa Parks, Sacagawea, Pocahontas, Susan B. Anthony, Noor Inayat Khan, Marie Curie, Alice Paul, Margaret Thatcher, Harriet Tubman, Maya Angelou, Lucy Stone, Eleanor Roosevelt ..... just to name a few., Marie Curie., The kashariyot. Shirley Chisholm. bell hooks. Golda Meir. Edith Bunker. Hedy Lamar. Marie Curie. Bella Abzug. Michelle Obama. Judith Halberstam. Jo March. Hillary Rodham Clinton., Hedy Lamarr. Marie Curie. Mae Jemison. Sally Ride. Sonia Sotomayor. Ruth Bader Ginsburg. Beyoncé. Mulan. My wife., Marie Curie is the only person to win two Nobel Prizes in two separate scientific fields and died of leukaemia as a result of her research into radionuclides which are used in medicine to save thousands of lives a year, Marie Curie.\n2 Nobel Prizes., Marie Curie who died discovering two elements and isolated radioactive isotopes., Rosa Parks, Marie Curie, Charlize Theron, Noor Inayat Khan, Rihanna, Serena Williams, Ruth Bader Ginsberg, basically all these women on this list:, Marie Curie developed the theory of radioactivity, techniques for isolating isotopes, and discovered two chemical elements. She died of aplastic anemia from exposure to radiation over the course of her research and her work at field hospitals during WWI., Marie Curie won two Nobel prizes for two different fields of science and her notebook is so radioactive that exposure to it would be lethal., How much time have you got? How about Marie Curie, or Amelia Earhart, Nancy Hart? Arlene Limas? Tomoe Gozen? Lucretia Mott? Irina Sendler? Nothing against Ms. Swift, but we could do this all day�, Marie Curie, Nobel prizes in Physics and Chemistry., Marie Curie? 1st woman in Europe to get a PhD and 1st woman to win a Nobel Prize (which she did twice, in Physics and Chemistry). She also developed a theory of a radioactivity & found the elements radium and polonium. A pretty bad bitch., Marie Curie was a badass scientist who won two Nobel prizes and ate radium for breakfast. #thatlastpartwasntsogood, Rosa Parks? Harriet Tubman? Louisa May Alcott? 
Susan B. Anthony? Marie Curie? Amelia Earhart? Cleopatra? Frida Khalo? Mother Theresa? Sandra Day O�Connor? Oprah Winfrey? Sally Ride? Margaret Sanger? Anne Frank? Joan of Arc? Sacajawea? Sappho? Sojourner Truth? Malala Yousafzai?, Cleopatra. Mata Hari. Marie Curie. Susan B. Anthony. Hannah Arendt. . . . \n\nIt's really not hard, 1. Malala Yousafzai, Marie Curie, Rosa Parks, Georgia O'Keeffe, Maria Mitchell, Tanya Tagaq, Billie Holiday, Valentina Tereshkova, Sally Ride, Anne Frank, Maya Angelou, Sacheen Littlefeather, Frida Kahlo, Harper Lee, Marsha P Johnson, Sylvia Rivera, Helen Keller, Margaret Sanger,, Marie Curie, Marie Curie., Ada Lovelace, Jane Austen, Helen Sharman, Margaret Atwood, Mary Beard, Rosalind Franklin, Hedy Lamarr, Marie Curie, Helen Keller, Emmeline Pankhurst, Rosa Parks, Florence Nightingale, Brenda Hale, QE1, Boudicca, Amy Johnson, Elizabeth Fry, Octavia Hill, Mary Wollstonecraft, &c., Marie Curie. Most people are lucky to have 1 Nobel prize, or share it. She has 2., Sojourner Truth, Khutulun, Marie Curie, Queen Lillikulani, Hedy Lamar, Empress Wu Zetain, Hellen Mirren, Raden Ajeng Kartini, Ida B. Wells, Artemisia Gentileschi, Nancy Wake, Tomoe Gozen, Yaa Asantewaa, Boudicca, Juana Inés de la Cruz..., Marie Curie, whose greatest discovery and life�s work also ended up being what killed her and who is the only person to win a Nobel Prize in two different sciences., Joan of Arc, Marie Curie, Queen Elizabeth, Cleopatra, my mom, the lady in front of me on line at the grocery store...\n\nIt's a pretty long list., Rosa Parks. All of the women from Hidden Figures. Marie Curie. Any woman in STEM fields., Marie Curie \nThat's where the list starts"

    Doing this, one also gets the names of many other women. Moreover, if writing history lessons, one should have several sources, right? What about Wikidata like in this other blog post of mine? It should have data for at least the most famous badass women.

    # add a function for getting a silent answer
    quietly_query <- purrr::quietly(WikidataQueryServiceR::query_wikidata)

    # function for getting someone's data
    get_wikidata <- function(name, pb = NULL){
      if(!is.null(pb)) pb$tick()$print()
      Sys.sleep(1)
      item <- WikidataR::find_item(name, language = "en")
      # sometimes people have no Wikidata entry so I need this condition
      if(length(item) > 0){
        entity_code <- item[[1]]$id
        query <- paste0("PREFIX entity: <http://www.wikidata.org/entity/>
                        #partial results
                        SELECT ?propUrl ?propLabel ?valUrl ?valLabel ?picture
                        WHERE
                        {
                        hint:Query hint:optimizer 'None' .
                        {BIND(entity:", entity_code, " AS ?valUrl) .
                        BIND(\"N/A\" AS ?propUrl ) .
                        BIND(\"identity\"@en AS ?propLabel ) .
                        }
                        UNION
                        {entity:", entity_code, " ?propUrl ?valUrl .
                        ?property ?ref ?propUrl .
                        ?property rdf:type wikibase:Property .
                        ?property rdfs:label ?propLabel
                        }
                        ?valUrl rdfs:label ?valLabel
                        FILTER (LANG(?valLabel) = 'en') .
                        OPTIONAL{ ?valUrl wdt:P18 ?picture .}
                        FILTER (lang(?propLabel) = 'en' )
                        }
                        ORDER BY ?propUrl ?valUrl
                        LIMIT 200")
        results <- quietly_query(query)
        results <- results$result
        results$name <- name
        results
      } else {
        NULL
      }
    }

    Yes, I just had to replace all occurrences of “sv” with “en” to get a function for this post. I’d like to try to write an automatic text about badass women.

get_a_string <- function(prop, prep, wikidata){
  answer <- dplyr::filter(wikidata, propLabel == prop) %>%
    .$valLabel %>%
    unique() %>%
    toString()

  if(answer == ""){
    return("")
  } else {
    return(paste(prep, answer))
  }
}

tell_me_about <- function(name){
  wikidata <- get_wikidata(name)

  questions <- c("occupation", "country of citizenship",
                 "field of work", "award received")
  words <- c("a", "from", "known from her work in", "and who was awarded")

  strings <- purrr::map2_chr(questions, words, get_a_string,
                             wikidata = wikidata)
  strings <- strings[strings != ""]

  sentence <- paste(name, "was", toString(strings))
  sentence <- paste0(sentence, ".")
  return(sentence)
}

    Ok, let’s try our automatic history writing function. It won’t work for Dora and Joan, sadly.

    tell_me_about("Lyudmila Pavlichenko")
    ## [1] "Lyudmila Pavlichenko was a historian, sniper, military personnel, from Soviet Union, and who was awarded Medal \"For the Defence of Sevastopol\", Medal \"For the Defence of Odessa\", Hero of the Soviet Union, Order of Lenin, Medal \"For Battle Merit\", Gold Star, Medal \"For the Victory over Germany in the Great Patriotic War 1941�1945\"."
    tell_me_about("Myra Hindley")
    ## [1] "Myra Hindley was a criminal, from United Kingdom."
    tell_me_about("Harriet Tubman")
    ## [1] "Harriet Tubman was a writer, from United States of America, and who was awarded New Jersey Hall of Fame, National Women's Hall of Fame, Maryland Women's Hall of Fame."

    Not many details, clearly, but not too bad for a quickly written history bot, if I can call it that.

    So, happy, Nutella?

    This was my contribution to the meme following Nutella’s viral tweet. I am thankful for the badass women I did end up discovering thanks to the tweets, and am waiting for someone to replace the lyrics of all Taylor Swift’s songs with gems from this Twitter trend.


    Version 0.6-8 of NIMBLE released


    (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

    We released the newest version of NIMBLE on CRAN and on our website a week ago. Version 0.6-8 has a few new features, and more are on the way in the next few months.

    New features include:

    • the proper Gaussian CAR (conditional autoregressive) model can now be used in BUGS code as dcar_proper, which behaves similarly to BUGS’ car.proper distribution;
    • a new nimbleMCMC function that provides one-line invocation of NIMBLE’s MCMC engine, akin to usage of JAGS and WinBUGS through R (see the sketch below);
    • a new runCrossValidate function that will conduct k-fold cross-validation of NIMBLE models fit by MCMC;
    • dynamic indexing in BUGS code is now allowed by default;
    • and a variety of bug fixes and efficiency improvements.

    Please see the NEWS file in the installed package for more details.
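
    For a flavor of the new one-line interface, here is a minimal sketch of nimbleMCMC() on a toy model. The model, data, and settings are illustrative assumptions of mine, not taken from the release notes, so adapt them to your own analysis.

    library(nimble)

    # Toy model: normal data with unknown mean and standard deviation
    # (everything below is a made-up example).
    toyCode <- nimbleCode({
      for (i in 1:N) {
        y[i] ~ dnorm(mu, sd = sigma)
      }
      mu ~ dnorm(0, sd = 100)
      sigma ~ dunif(0, 10)
    })

    set.seed(1)
    y <- rnorm(20, mean = 3, sd = 1)

    # One-line invocation of NIMBLE's MCMC engine.
    fit <- nimbleMCMC(code = toyCode,
                      constants = list(N = 20),
                      data = list(y = y),
                      inits = list(mu = 0, sigma = 1),
                      niter = 5000, nburnin = 1000, nchains = 2,
                      summary = TRUE)
    fit$summary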


    My book ‘Practical Machine Learning with R and Python’ on Amazon


    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    My book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($9.99) and kindle ($6.97/Rs449) versions. In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo!

    This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first three chapters useful, as they touch upon the most important programming constructs in R and Python and deal with equivalent statements in the two languages. Those who are expert in either language will find the equivalent code ideal for brushing up on the other. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

    Here is a look at the topics covered

    Table of Contents
    • Essential R (p. 7)
    • Essential Python for Datascience (p. 54)
    • R vs Python (p. 77)
    • Regression of a continuous variable (p. 96)
    • Classification and Cross Validation (p. 113)
    • Regression techniques and regularization (p. 134)
    • SVMs, Decision Trees and Validation curves (p. 175)
    • Splines, GAMs, Random Forests and Boosting (p. 202)
    • PCA, K-Means and Hierarchical Clustering (p. 234)

    Pick up your copy today!! Hope you have a great time learning as I did while implementing these algorithms!


    visualizing reassortment history using seqcombo


    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    Reassortment is an important strategy for influenza A viruses to introduce an HA subtype that is new to human populations, which creates the possibility of a pandemic.

    The diagram shown above (Figure 2 of doi:10.1038/srep25549) is widely used to illustrate reassortment events. Such diagrams are mostly drawn and edited manually, without a software tool to generate them automatically. Here, I implemented the hybrid_plot function for producing publication-quality figures of reassortment events.

    library(tibble)
    library(ggplot2)
    library(seqcombo)  # provides hybrid_plot()

    n <- 8
    virus_info <- tibble(
        id = 1:7,
        x = c(rep(1990, 4), rep(2000, 2), 2009),
        y = c(1, 2, 3, 5, 1.5, 3, 4),
        segment_color = list(
            rep('purple', n),
            rep('red', n),
            rep('darkgreen', n),
            rep('lightgreen', n),
            c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'red', 'purple', 'red', 'purple'),
            c('darkgreen', 'darkgreen', 'red', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple'),
            c('darkgreen', 'lightgreen', 'lightgreen', 'darkgreen', 'darkgreen', 'purple', 'red', 'purple')
        )
    )

    flow_info <- tibble(from = c(1, 2, 3, 3, 4, 5, 6),
                        to   = c(5, 5, 5, 6, 7, 6, 7))

    hybrid_plot(virus_info, flow_info)

    hybrid_plot requires two tibble data frames: virus information and genetic flow information.

    Users need to provide x and y positions to plot the viruses. This makes sense because geographic and temporal information is usually available in such phylodynamic studies and can be used to set x or y, adding information and helping interpretation of the reassortment events.

    We use a hexagon to represent each virus. Users can set the virus outer boundary color with v_color and the fill with v_fill. The color of the line segments that indicate the genetic flow relationships can be specified via the l_color parameter.

    hybrid_plot(virus_info, flow_info, v_color='firebrick', v_fill='darkgreen', l_color='steelblue')

    We usually have more information to present, for example host information and HA subtype, and this information can be used to color the viruses via either v_color or v_fill.

    virus_info$Host <- c("Avian", "Human", rep("Swine", 4), "Human")

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    The relative virus size can also be specified if a virus_size column is available in the input virus_info data.

    virus_info$virus_size <- c(rep(1, 3), 2, 1, 1, 1.5)

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    If label and label_position columns are available, the virus labels (virus name or other information) will be added automatically.

    virus_info$label <- c("Avian", "Human\nH3N2", "Classic\nswine\nH1N1", "Eurasian swine",
                          "North American swine\n triple reassortant H3N2",
                          "North American swine\n triple reassortant H1N2", "2009 Human H1N1")
    virus_info$label_position <- c('left', 'left', 'left', 'below', 'below', 'upper', 'below')

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host)

    Users can use asp to set the aspect ratio of the hexagons: asp < 1 for thick/short and asp > 1 for thin/tall.

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host, asp=2)

    The output of hybrid_plot is a ggplot object and users can use ggplot2 to modify the details.

    title <- "Reassortment events in evolution of the 2009 influenza A (H1N1) virus"
    caption <- 'Gene segments: PB2, PB1, PA, HA, NP, NA, M, NS'
    color <- c(Avian="purple", Human="red", Swine="darkgreen")

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
        labs(caption=caption, title=title) +
        scale_color_manual(values=color) + scale_fill_manual(values=color) +
        scale_x_continuous(breaks=c(1990, 2000, 2009)) +
        xlab(NULL) + ylab(NULL) + theme_minimal() +
        theme(axis.line.y = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks.y = element_blank(),
              panel.grid.minor = element_blank(),
              panel.grid.major.y = element_blank(),
              legend.position = c(.95, .1)
              )

    Top-down or bottom-up style is also supported.

    x <- virus_info$x
    virus_info$x <- virus_info$y
    virus_info$y <- x
    virus_info$label_position <- c(rep("right", 3), "left", "left", "right", "right")

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host) +
        scale_y_reverse() + scale_x_continuous(limits=c(0, 5.5))

    Users can also use emoji to label the viruses (host information in this example):

    virus_info$label <- c("chicken", "woman", "pig", "pig", "pig", "pig", "woman")

    hybrid_plot(virus_info, flow_info, v_color=~Host, v_fill=~Host,
                parse='emoji', t_size=8, t_color='firebrick') +
        scale_y_reverse()

    In case you don’t have xy-coordinate information, you can use the set_layout function to set the xy positions automatically using a selected layout algorithm.

    virus_info <- set_layout(virus_info, flow_info, layout="layout.kamada.kawai")hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')

    virus_info <- set_layout(virus_info, flow_info, layout="layout.fruchterman.reingold")
    hybrid_plot(virus_info, flow_info, parse='emoji', t_size=8, t_color='firebrick')


    Please let me know if you know of any published reassortment data that contain spatial information; I will demonstrate how to visualize reassortment history on a map.



    Computing wind average in an area using rWind


    (This article was first published on long time ago..., and kindly contributed to R-bloggers)


    Hi all!
    A researcher asked me last week how to compute the wind average for an area using rWind. I wrote a simple function to do this using the dplyr library (following the advice of my friend Javier Fajardo). The function will be added to the rWind package as soon as possible. Meanwhile, you can test the results… enjoy!

     # First, load the new function
     library(dplyr)

     wind.region <- function (X){
       X[, 3] <- X[, 3] %% 360
       X[X[, 3] >= 180, 3] <- X[X[, 3] >= 180, 3] - 360
       avg <- summarise_all(X[, -1], .funs = mean)
       wind_region <- cbind(X[1, 1], avg)
       return(wind_region)
     }

    Once you have loaded the function, let's do an example.

     # Get some wind data and convert it into a raster to be plotted later
     library(rWind)
     library(raster)

     wind_data <- wind.dl(2015, 2, 12, 0, -10, 5, 35, 45)
     wind_fitted_data <- wind.fit(wind_data)
     r_speed <- wind2raster(wind_fitted_data, type = "speed")

    Now, you can use the new function to obtain the wind average in the study area:

     myMean <- wind.region(wind_data)
     myMean

     # Now, you can use wind.fit to get wind speed and direction.
     myMean_fitted <- wind.fit(myMean)
     myMean_fitted

     # Finally, let's plot the results!
     library(rworldmap)
     library(shape)

     plot(r_speed)
     lines(getMap(resolution = "low"), lwd = 4)

     alpha <- arrowDir(myMean_fitted)
     Arrowhead(myMean_fitted$lon, myMean_fitted$lat, angle = alpha,
               arr.length = 2, arr.type = "curved")
     text(myMean_fitted$lon + 1, myMean_fitted$lat + 1,
          paste(round(myMean_fitted$speed, 2), "m/s"), cex = 2)


    #R – Deprecate functions with roxygen2


    (This article was first published on R-feed, and kindly contributed to R-bloggers)

    We show how to use roxygen2 tags and templates for deprecating existing documented functions.

    Introduction

    Package roxygen2 provides a powerful way of documenting packages and the objects inside them. In particular, great flexibility is available for organizing the documentation content in source files and templates, and rendering it as desired in the corresponding help pages.

    We can leverage this flexibility to adapt an existing package's documentation when making a function deprecated (and similarly defunct). As stated in the base R ‘Marking Objects as Deprecated’ documentation (help(Deprecated)):

    The original help page for these functions is often available at help(“oldName-deprecated”) (note the quotes). Functions should be listed in help(“pkg-deprecated”) for an appropriate pkg, including base.

    Deprecate existing functions

    We show how an existing and documented function myFun in package ourPkg can be deprecated, and the documentation adjusted accordingly by re-arranging the roxygen2 documentation content.

    ## myFun.r

    #' @title My function.
    #' @description My function.
    #' @param arg1 My first argument.
    #' @param arg2 My second argument.
    #' @return My return value.
    #' @export
    myFun <- function(arg1, arg2) {
      "My return value"
    }

    Let us assume a new function myNewFun exists as a replacement for the old myFun, which becomes deprecated. We want to achieve the following:

    • list myFun in help("ourPkg-deprecated"), along with its replacement;
    • make its original documentation available via help("myFun-deprecated").

    This can be obtained by

    • creating generic content for help("ourPkg-deprecated");
    • adjusting the existing myFun documentation content.
    Package-level documentation

    Generic content for help("ourPkg-deprecated") is defined in its own source file:

    ## ourPkg-deprecated.r

    #' @title Deprecated functions in package \pkg{ourPkg}.
    #' @description The functions listed below are deprecated and will be defunct in
    #'   the near future. When possible, alternative functions with similar
    #'   functionality are also mentioned. Help pages for deprecated functions are
    #'   available at \code{help("<function>-deprecated")}.
    #' @name ourPkg-deprecated
    #' @keywords internal
    NULL
    Function-specific documentation

    Function-specific content is added to the ourPkg-deprecated help page from the corresponding source file, where we also want to make the original help page available via help("myFun-deprecated").

    The source file of myFun is then modified as follows:

    ## myFun.r

    #' @title My function.
    #' @description My function.
    #' @param arg1 My first argument.
    #' @param arg2 My second argument.
    #' @return My return value.
    #'
    #' @name myFun-deprecated
    #' @usage myFun(arg1, arg2)
    #' @seealso \code{\link{ourPkg-deprecated}}
    #' @keywords internal
    NULL

    #' @rdname ourPkg-deprecated
    #' @section \code{myFun}:
    #' For \code{myFun}, use \code{\link{myNewFun}}.
    #'
    #' @export
    myFun <- function(arg1, arg2) {
      .Deprecated("myNewFun")
      "My return value"
    }

    Two blocks of roxygen2 tags have been added to the existing documentation content.

    • The first block is used to create the myFun-deprecated help page, detached from the myFun object. For this reason we need to explicitly add the original function signature as ‘Usage’ section. We also add a ‘See Also’ link to ourPkg-deprecated.
    • The second block adds myFun-specific content to the ourPkg-deprecated help page. The standard ‘Usage’ section is added automatically, and we create a function-specific section where using myNewFun is suggested. Such content will be collected and shown in help("ourPkg-deprecated") for all deprecated functions with similar tags structure.

    Note also that help(myFun) will point to the ourPkg-deprecated help page, and that @keywords internal prevents the corresponding topics from appearing in the package documentation index.
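
    As a quick, hypothetical sanity check of the mechanics (this session is my own illustration, not part of the original article), calling the deprecated function still returns the old value while .Deprecated() signals the change:

    res <- myFun(1, 2)
    #> Warning message:
    #> 'myFun' is deprecated.
    #> Use 'myNewFun' instead.
    #> See help("Deprecated")

    res
    #> [1] "My return value"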

    Using roxygen2 templates

    Deprecating functions as described above can be automated using templates. One can define a template for each additional block.

    ## template-depr_pkg.r

    #' @name ourPkg-deprecated
    #' @section \code{<%= old %>}:
    #' For \code{<%= old %>}, use \code{\link{<%= new %>}}.
    ## template-depr_fun.r

    #' @name <%= fun %>-deprecated
    #' @usage <%= gsub("\n", "\n#' ", roxygen2:::wrapString(roxygen2:::function_usage(fun, formals(fun)))) %>
    #' @seealso \code{\link{ourPkg-deprecated}}
    #' @keywords internal

    Note the presence of template variables and of some roxygen2 internals for automating the construction of the ‘Usage’ section (handling multiple lines in case of many arguments).

    Deprecating herFun in favor of herNewFun then results in the simple inclusion of the templates in the function source.

    ## herFun.r

    #' @title Her function.
    #' @description Her function.
    #' @param arg1 Her first argument.
    #' @param arg2 Her second argument.
    #' @return Her return value.
    #'
    #' @templateVar fun herFun
    #' @template template-depr_fun
    NULL

    #' @templateVar old herFun
    #' @templateVar new herNewFun
    #' @template template-depr_pkg
    #'
    #' @export
    herFun <- function(arg1, arg2) {
      .Deprecated("herNewFun")
      "Her return value"
    }

    Example package manual and sources

    A complete example of ourPkg is available for download. The package contains functions myFun, yourFun, herFun, hisFun, where all but yourFun are deprecated in favor of *NewFun, using roxygen2 templates for herFun and hisFun.

    The resulting documentation content can be assessed in the corresponding PDF reference manual.


    Building a Telecom Dictionary scraping web using rvest in R


    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    One of the biggest problems in business when carrying out any analysis is the availability of data. That is where, in many cases, web scraping comes in very handy for creating the data that's required. Consider the following case: performing text analysis on textual data collected in a telecom company as part of customer feedback or reviews primarily requires a dictionary of telecom keywords. But such a dictionary is hard to find out of the box. Hence, as an analyst, the most obvious thing to do when such a dictionary doesn't exist is to build one. This article aims to help beginners get started with web scraping using rvest in R while, by the end of the exercise, building a telecom dictionary.

    Disclaimer

    Web scraping is not allowed by some websites. Kindly check the site's Terms of Service before scraping. This post is for educational purposes only.

    A simple Google search for "Telecom Glossary" results in this URL: Atis.org, from which the required telecom dictionary can be built.

    Let us start by loading the required libraries:

    # load the rvest, stringr and dplyr libraries
    library(rvest)
    library(stringr)
    library(dplyr)

    rvest is a web scraping library in R that makes it easier to write common scraping tasks (to scrape useful information from web pages) without getting our heads into XML parsing. rvest can be downloaded from CRAN, and the development version is also available on GitHub.

    It can be seen at the above-mentioned URL that the glossary words are listed alphabetically from A to Z, with a separate link / URL for every (starting) letter. Clicking ‘A’ takes us to this link: atis.org, which lists all the keywords with starting letter ‘A’. If you look closely at the URL, the code written for one letter (here, ‘A’) can easily be replicated for the other letters, since the letter ‘A’ in the URL is the only part that changes in the links for the other alphabet letters.

    Let us assign this URL to an R object url, which can be passed as the parameter to rvest's read_html() function.

    # url whose content is to be scraped
    url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0'

    # extracting the content of the given url
    # url_content <- read_html('http://www.atis.org/glossary/definitionsList.aspx?find=A&kw=0')
    url_content <- read_html(url)

    read_html() parses the html page of the given url (as its parameter) and saves the result as an xml object.

    To reiterate the objective: we are trying to get the list of telecom keywords and, as per the screenshot above, you can see that the keywords are listed as hyperlinks on the given page.

    Hyperlinks in HTML are written with the following syntax:

    <a href="www.google.com">Google</a>

    Google is the link text label that a browser would render, and clicking it would take us to www.google.com. Hence it is evident that the anchor tags on the page are what we are supposed to scrape/extract. But the issue with the current page is that it is not just the keywords that are represented as anchor text (links); there are a lot of other links (anchor tags) on the page as well. Hence, to extract only the required information and filter out the irrelevant information, we need to find a pattern that helps us extract only the keyword links.

    Have a look at this screenshot:

    This screenshot shows that while the keywords are also represented as hyperlinks (anchor text), the differentiator is the ‘id’ element in the URL. Only the links of the keywords have this ‘id’ in the URL, so we can extract two pieces of information from the current page to get only the relevant information, which in our case is the keywords: 1. the href value / actual URL, 2. the link text.

    # extracting all the links from the page
    links <- url_content %>% html_nodes('a') %>% html_attr('href')

    # extracting all the link text from the page
    text <- url_content %>% html_nodes('a') %>% html_text()

    With the example hyperlink discussed earlier, the above code gives two pieces of information.

    www.google.com is saved in links and Google is saved in text.

    With links and text as columns, let us build a rough dictionary (not yet cleaned/filtered).

    # creating a new dictionary of links and text extracted above
    rough_dict <- data.frame(links, text, stringsAsFactors = F)
    head(rough_dict)

                                 links text
    1              http://www.atis.org     
    2 definitionsList.aspx?find=A&kw=0    A
    3 definitionsList.aspx?find=B&kw=0    B
    4 definitionsList.aspx?find=C&kw=0    C
    5 definitionsList.aspx?find=D&kw=0    D
    6 definitionsList.aspx?find=E&kw=0    E

    As displayed above, rough_dict contains both signal (keywords) and noise (irrelevant links), and we have to filter the irrelevant links out using the ‘id’ pattern we identified earlier.

    # filtering glossary terms, leaving out irrelevant information
    fair_dict <- rough_dict %>% filter(str_detect(links, 'id')) %>% select(text)
    tail(fair_dict)

                              text
    657                         AN
    658 axial propagation constant
    659           analog component
    660                axial ratio
    661            analog computer
    662                  axial ray

    And that's how, using str_detect(), we can keep only the links with ‘id’ in them, filter out the rest, and build our fair_dict. As displayed in the above output, we have got 662 keywords just for the letter ‘A’, and the same exercise could be repeated for the other letters available on the site. The only change required is the url object. For example, like this:

    url <- 'http://www.atis.org/glossary/definitionsList.aspx?find=B&kw=0'

    Note the ‘B’ in it; the same could be done for the other available letters. This process could be further improved by making the scraping part a function and looping the function call over a character vector of all the available letters, as sketched below (a full treatment is beyond the scope of this article's objective). The complete code used in this article is available on my GitHub.
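
    Here is a minimal sketch of that function-plus-loop idea, reusing only the code shown above; the wrapper name scrape_letter and the do.call/lapply loop are my own additions rather than part of the original post.

    library(rvest)
    library(stringr)
    library(dplyr)

    scrape_letter <- function(letter) {
      # same URL pattern as above, with the starting letter swapped in
      url <- paste0('http://www.atis.org/glossary/definitionsList.aspx?find=',
                    letter, '&kw=0')
      url_content <- read_html(url)
      links <- url_content %>% html_nodes('a') %>% html_attr('href')
      text  <- url_content %>% html_nodes('a') %>% html_text()
      data.frame(links, text, stringsAsFactors = FALSE) %>%
        filter(str_detect(links, 'id')) %>%
        select(text)
    }

    # loop over A-Z and stack the per-letter results into one dictionary
    full_dict <- do.call(rbind, lapply(LETTERS, scrape_letter))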

      Related Post

      1. Scraping Javascript-rendered web content using R
      2. Analysing Cryptocurrency Market in R
      3. Time Series Analysis in R Part 3: Getting Data from Quandl
      4. Pulling Data Out of Census Spreadsheets Using R
      5. Extracting Tables from PDFs in R using the Tabulizer Package


      On the biases in data


      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

      “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.” — Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

      “When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’” — Hanna Wallach, Senior Researcher at Microsoft Research NYC. 

      The article discusses the consequences of data sets that aren't representative of the populations they are meant to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

      Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world


      The Cost of True Love (a.k.a. The Tidy — and expensive! — Twelve Days of Christmas)


      (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

      [Figure: PNC price index chart]

      I’m in the market for Christmas presents for my true love, @mrshrbrmstr, and thought I’d look to an age-old shopping list for inspiration. Just what would it set me back if I decided to mimic the 12 Days of Christmas in this modern day and age?

      Let’s try to do the whole thing in R (of course!).

      We’ll need to:

      • Grab the lyrics
      • Parse the lyrics
      • Get pricing data
      • Compute some statistics (see the sketch after this list)
      • Make some (hopefully) pretty charts
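
      As a tiny taste of the "compute some statistics" step, here is a sketch of the classic gift arithmetic; this is my own illustration rather than the post's code, which lives in the embedded iframe and on GitHub.

      # Gift k ("partridge" is k = 1, ..., "drummers" is k = 12) is given k at a
      # time on each of days k through 12, i.e. on (13 - k) days.
      gifts <- c("partridges", "turtle doves", "french hens", "calling birds",
                 "gold rings", "geese", "swans", "maids", "ladies dancing",
                 "lords leaping", "pipers", "drummers")
      counts <- seq_along(gifts) * (13 - seq_along(gifts))
      setNames(counts, gifts)

      sum(counts)  # 364 gifts in total over the twelve days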

      This one was complex enough formatting-wise that I needed to iframe it below. Feel free to bust out of the iframe at any time.

      Some good follow-ups to this (if you’re so inclined) would be to predict prices next year and/or clean up the charts a bit.

      Grab the code up on GitHub.

      (Note: ColourLovers API times out occasionally so just try that snippet again if you get an error).

