
Annual Mean Temperature Trends – 12 Airports


(This article was first published on RClimate, and kindly contributed to R-bloggers)

[Animated GIF: annual mean temperature trends at 12 East Coast USA airports, 1950-2015]

This animated GIF shows changes in annual mean temperature at 12 East Coast USA airports that had continuous daily data for the 1950-2015 period. The data were retrieved from Weather Underground using the R weatherData package.

11 of the 12 airports (all but JAX) show statistically significant increases in annual mean temperatures.
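A minimal sketch of how such a trend could be checked for a single airport, assuming its annual means already sit in a hypothetical data frame annual_means with columns year and t_mean:

fit <- lm(t_mean ~ year, data = annual_means)   # linear trend in annual mean temperature
summary(fit)$coefficients["year", ]             # slope (degrees/year), std. error, t value, p-value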

 



rtide: an R package for predicting tide heights (US locations only currently)


(This article was first published on R-project – lukemiller.org, and kindly contributed to R-bloggers)

Joe Thorley at Poisson Consulting has released a new R package, rtide (on which I am listed as a co-author), that provides the ability to predict future (and past) tide heights for 637 different tide stations in the United States and associated territories. The underlying data, consisting of tide harmonic constituents, are collected and released by the National Oceanic and Atmospheric Administration. The author of the definitive open source tide prediction software, XTide, collates those harmonic data into a usable format, and we have harvested the data to create the rtide package, which operates in R without the need to have XTide installed.

An example tide prediction for Monterey Harbor, California, produced by rtide.


You can install and load the rtide package by simply running the following commands at your R console:

install.packages('rtide')  # run this one time to install the package on your computer
library(rtide)             # use this command to load the package whenever you start a new R session

To generate a set of tide height predictions, use a command like the following:

dat = tide_height('Monterey Harbor', from = as.Date('2016-09-05'),
                  to = as.Date('2016-09-07'), minutes = 10, tz = 'PST8PDT')

The results in dat will be a 3-column ‘tibble’ (essentially a data frame) consisting of the station name (Monterey Harbor), the DateTime column (which should be in the given time zone, which we specified as Pacific Daylight Time), and TideHeight, which is in units of meters. The time step between estimates is declared by the minutes argument, and we set it to 10 minutes above.

head(dat)  # show some of the resulting data
# A tibble: 6 x 3
                                Station            DateTime TideHeight
                                  <chr>              <time>      <dbl>
1 Monterey, Monterey Harbor, California 2016-09-05 00:00:00   1.207174
2 Monterey, Monterey Harbor, California 2016-09-05 00:10:00   1.230523
3 Monterey, Monterey Harbor, California 2016-09-05 00:20:00   1.251265
4 Monterey, Monterey Harbor, California 2016-09-05 00:30:00   1.269248
5 Monterey, Monterey Harbor, California 2016-09-05 00:40:00   1.284345
6 Monterey, Monterey Harbor, California 2016-09-05 00:50:00   1.296448

You could then use these predicted tide heights for any number of things, such as figuring out when your field site might be exposed or underwater at a given time, or making pretty plots of the tide cycle at your favorite location. The plot above was generated using the example data produced here and the code below:

library(ggplot2)
library(scales)  # for the date_format function

ggplot(data = dat, aes(x = DateTime, y = TideHeight)) +
  geom_line() +
  scale_x_datetime(name = "September 5-7, 2016",
                   labels = date_format("%H:%M", tz = "PST8PDT")) +
  scale_y_continuous(name = "Tide Height (m)") +
  ggtitle("Monterey Harbor")

The available sites for tide prediction can be found by calling the tide_stations() function. If you’re concerned about the accuracy of the predictions, you can always double-check them against the official NOAA tide predictions available on the http://tidesandcurrents.noaa.gov/ website.
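If you only remember part of a station's name, the returned vector can be searched with ordinary string tools; a small sketch, assuming tide_stations() returns a character vector of station names:

stations <- tide_stations()
grep("Monterey", stations, value = TRUE)   # find the full name of the Monterey station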

Some folks will be wondering whether rtide can produce predictions for other locations around the world (for instance, Canada). At the present time there is no facility to predict the tides for other countries, primarily because other countries do not release their tidal harmonic data into the public domain like the US does. Hopefully this will be rectified in the future via legislative action (don’t hold your breath), but for the present time we are limited to providing predictions for the US.


HW Checker: R, AppScripts, & Google Forms


(This article was first published on data_steve, and kindly contributed to R-bloggers)


For the past few months I’ve been toying with the idea of using googleformr to empower data science-oriented instructors to use data science in their workflow. In this 2-part series of posts, I’ll show how to use AppScripts and R to turn Google Forms into a homework checker.

I was encouraged along this path by Martin Hawksey’s post on AppScripts and R as well as by his direct help in considering some email triggering options with AppScripts.

Toy example

The most basic example I thought worth sharing was an app that had the following:

  • some security (in the form of a password),
  • an email to direct feedback to and to confirm student identity, and
  • a place to push homework submissions to.

With these 3 inputs you can pretty much ensure that only your students can use this form, which keeps out spam and the like.

Password

Google Forms provides some rather nifty data validation tools that are specific to each data type. I used short answer for my password data type, and I leverage the data validation as password validation. In the toy example below, I just used aaa as the required password. You can add these elements to a form item by clicking on the triple dot settings option at bottom right of item space.

While this password validation will work if you are typing answers by hand into the web form, it won’t work with googleformr, because googleformr bypasses the form elements and posts directly to the input node in the form. Still, it is going to be helpful in keeping out people who aren’t your students, ostensibly the only ones with the password. But to be extra careful, you may want to build password validation directly into your own homework-submission function, as seen below.

homework_submit <- function(pwd = NULL, email = NULL, answer){
  try(if (is.null(pwd)) stop("Please enter a password."))
  if (pwd != "aaa") stop("That is not the correct password.")
  ....
}

I don’t wrap the password test in try because, when it errors, the printed error message would echo the condition being tested, and with it the password itself.

Email

I suppose I could have done some fancy email validation in Google Forms, but if I really wanted that, I’d just create a function for submitting homework that gave me more control over email validation at the source. If you’d like to do that, the email validator would look like so:

is_email <- function(x) {
  grepl("([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,14}))",
        x, ignore.case = TRUE)
}

You could also give it a school-specific flavor by setting the school argument to whatever follows the @ sign in your school’s email addresses.

is_school_email <- function(x, school = NULL) {
  if (is.null(school)) {
    is_email(x)
  } else {
    grepl(paste0("([_+a-z0-9-]+(\\.[_+a-z0-9-]+)*@", school, ")"),
          x, ignore.case = TRUE)
  }
}

So the end product could look like this:

homework_submit <- function(pwd = NULL, email = NULL, answer){
  try(if (is.null(pwd)) stop("Please use class password."))
  if (pwd != "aaa") stop("That is not the correct password.")
  try(if (!is_email(email)) stop("Please use real email address."))
  ....
}

But the other reason I don’t do this is that I want students to give me a specific email that I can associate with their identity for recordkeeping and grading. This would need to be collected independently from this form.

In the back end of Google Forms, I have an AppScript that handles email identity checking. If the email is one that my student has registered with me, then it proceeds with checking the value of #1 against my tests. If it isn’t, it automatically emails the student, informing them that they’ll need to resubmit the homework and reminding them to use the email address they’ve registered with me.

After you’ve created your form to collect the homework answers you want from your students, you can use googleformr to build your function to allow students to programmatically submit their answers directly to your google form. This code chunk below serves as a template.

library(googleformr)
url <- "your_google_form_url"

Using the get_form_str function, you can confirm that the structure of the form matches the input entry nodes you get back from googleformr.

get_form_str(url)
Order your post content according to the Questions on Form
     entry.qs  entry.ids
[1,] "pwd *"   "entry.1089329191"
[2,] "Email *" "entry.2028747629"
[3,] "#1 *"    "entry.1639114256"

Once you are satisfied, you can create the function hwchecker with googleformr::gformr handling all the linking and function construction. The resulting function will expect a vector of inputs in the same order as get_form_str displayed.

hwchecker <- gformr(url, custom_reply = "HW being checked!")
hwchecker(c("aaa", "not a real email@some_email_provider.com", "yes"))

My last post explored the details of posting multiple answers to a google form with googleformr. The main thing I re-emphasize here is that functions created with gformr expect a vector of answers as long as and in the same order as the form entries.

Answer #1

In this one example, I’ve simply put yes as an answer, the correct answer in this case. However, I’ve shown in my previous post on googleformr that you can put almost anything into a form item of the data type short answer, including a data frame. The feedback I offer in this toy example is simple: if the student posts an answer other than yes (namely no), they’re told their answer is too low, in the spirit of TRUE = 1 and FALSE = 0.

The possibility of feedback is as flexible as the person developing the app is at writing tests. Even though AppScripts is an extension of JavaScript, the loop and conditional logic structures they use are quite familiar to R developers with any experience.

Make your hwsubmit function

So now, your custom hwsubmit function has password protection, email validation, and a direct link to your google form. We can even throw in a verification that they didn’t forget to include their answers.

Now to create hwsubmit, simply create all the prereqs above and then change the NULL defaults of the arguments pswd and school in the function below to set the password and school-email parameters for your needs. It will return a function that has all the parameters set.

hwsubmit <- function(pswd = NULL, school = NULL){
  try(if (is.null(school)) stop("Assign the school email suffix."))
  function(pwd = NULL, email = NULL, answer = NULL){
    # error handling
    try(if (is.null(pwd)) stop("Please use class password."))
    if (pwd != pswd) stop("That is not the correct password.")
    if (!is_school_email(email, school)) stop("Please use school email.")
    try(if (is.null(answer)) stop("Don't forget to include your answers."))
    if (!is.null(pwd) && pwd == pswd &&
        is_school_email(email, school) && !is.null(answer)){
      hwchecker(c(pwd, email, answer))
    } else {
      message("Check your inputs to homework_submit")
    }
  }
}

When this function is saved into a package and distributed to your students, they will be able to submit their homework as easily as

hwsubmit(pwd    = "aaa"       , email  = "not a real email@some_email_provider.com"       , answer = "yes")

Notice that with hwsubmit, each part has its own parameter for checking, and the parts are then concatenated into a vector inside hwchecker. Of course, if you have multiple answers to submit, you will need to put them in order and wrap them up in c() as a vector.
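For example, a purely hypothetical submission to a form with three answer fields (the email address and the extra answers are made up):

hwsubmit(pwd    = "aaa"
       , email  = "student@your_school.edu"
       , answer = c("yes", "42", "mean(x, na.rm = TRUE)"))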

Wrap Up and Pt.2

In the next post, I’ll detail some of the AppScript choices that I made, which are easily extensible. I personally found using AppScripts to be an interesting opportunity to do some practical JavaScript work without having to build all the website machinery.

Some of the options related to setting event triggers do take some thought, but there are good examples to follow – hopefully, mine included.

Using AppScripts, the range of tasks you can accomplish with Google Apps is quite impressive. All the more so because it is free.



lubridate 1.6.0


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I am pleased to announce lubridate 1.6.0. Lubridate is designed to make working with dates and times as pleasant as possible, and is maintained by Vitalie Spinu. You can install the latest version with:

install.packages("lubridate")

This release includes a range of bug fixes and minor improvements. Some highlights from this release include:

  • period() and duration() constructors now accept character strings and allow a very flexible specification of timespans:
    period("3H 2M 1S")
    #> [1] "3H 2M 1S"
    duration("3 hours, 2 mins, 1 secs")
    #> [1] "10921s (~3.03 hours)"
    # Missing numerals default to 1.
    # Repeated units are summed
    period("hour minute minute")
    #> [1] "1H 2M 0S"

    Period and duration parsing allows for arbitrary abbreviations of time units as long as the specification is unambiguous. For single letter specs, strptime() rules are followed, so m stands for months and M for minutes.

    These same rules allow you to compare strings and durations/periods:

    "2mins 1 sec" > period("2mins")
    #> [1] TRUE
  • Date time rounding (with round_date(), floor_date() and ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:
    ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
    #> [1] "2016-09-12 17:10:00 UTC"
  • The behavior of ceiling_date for Date objects is now more intuitive. In short, dates are now interpreted as time intervals that are physically part of longer unit intervals:
    |day1| ... |day31|day1| ... |day28| ...
    |    January     |   February     | ...

    That means that rounding up 2000-01-01 by a month is done to the boundary between January and February, i.e. 2000-02-01:

    ceiling_date(ymd("2000-01-01"), unit = "month")
    #> [1] "2000-02-01"

    This behavior is controlled by the change_on_boundary argument.

  • It is now possible to compare POSIXct and Date objects:
    ymd_hms("2000-01-01 00:00:01") > ymd("2000-01-01")
    #> [1] TRUE
  • C-level parsing now handles English months and AM/PM indicator regardless of your locale. This means that English date-times are now always handled by lubridate C-level parsing and you don’t need to explicitly switch the locale.
  • New parsing function yq() allows you to parse a year + quarter:
    yq("2016-02")
    #> [1] "2016-04-01"

    The new q format is available in all lubridate parsing functions.

See the release notes for the full list of changes. A big thanks goes to everyone who contributed: @arneschillert, @cderv, @ijlyttle, @jasonelaw, @jonboiser, and @krlmlr.


tidyverse 1.0.0


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

The best place to learn about all the packages in the tidyverse and how they fit together is R for Data Science. Expect to hear more about the tidyverse in the coming months as I work on improved package websites, making citation easier, and providing a common home for discussions about data analysis with the tidyverse.

Installation

You can install tidyverse with

install.packages("tidyverse")

This will install the core tidyverse packages that you are likely to use in almost every analysis:

  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.

It also installs a selection of other tidyverse packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for data manipulation:

Data import:

And modelling:

  • modelr, for simple modelling within a pipeline
  • broom, for turning models into tidy data

These packages will be installed along with tidyverse, but you’ll load them explicitly with library().

Usage

library(tidyverse) will load the core tidyverse packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr. You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ---------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select

tidyverse_conflicts()
#> Conflicts with tidy packages --------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
#> select(): dplyr, MASS

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#> Update now?
#> 
#> 1: Yes
#> 2: No


Collapsing a bipartite co-occurrence network


(This article was first published on R / Notes, and kindly contributed to R-bloggers)

This note is a follow-up to the previous one. It shows how to use student-submitted keywords to find clusters of shared interests between the students.

Dear students

If you enjoyed my previous note, this one might also entertain you. And since your real first names are used in the data, you should be able to tell me later if this note makes sense with regards to the true clusters (i.e. student groups) that you formed in my class.

tl;dr– This note explains how to plot this graph in less than 70 lines of code:

[Figure: the final graph, a one-mode network of students clustered by shared keywords]

The data: Your keywords

Let’s start by loading the same dataset that we used to plot a co-occurrence network out of your research interests:

We will start by extracting your first names, which, very conveniently, are unique (there are no duplicate first names in the data):

The result of that operation is simply a vector of your 17 first names in alphabetical order, because we sorted the data on the name when we imported it.
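A minimal sketch of that step (the full code is in the Gist linked at the end of this note), assuming the data were imported into a data frame d with hypothetical columns name and keywords:

students <- d$name        # one row per student, already sorted by name at import
anyDuplicated(students)   # 0: the first names are unique
length(students)          # 17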

Let’s now obtain a vector of all the different keywords that you submitted, following the same separation rule as used in the previous note: keywords separated by a comma or by the “and” conjunction, as in “Sexuality and Gender”, will be treated as discrete (separate) entities. The code for that operation is a bit more obscure:

The result of that operation is a vector of 34 alphabetically-sorted keywords, starting with the following ones:

Collective Action, Community Empowerment, Community/Neighborhood, Conflict, …
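A hedged sketch of that splitting step, reusing the hypothetical d$keywords column and splitting on commas and on the "and" conjunction:

keywords <- unlist(strsplit(d$keywords, ",\\s*|\\s+and\\s+"))   # split each student's string
keywords <- sort(unique(trimws(keywords)))                      # de-duplicate and sort
length(keywords)                                                # 34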

We are now going to build a matrix with as many rows as there are students ($n = 17$), and as many columns as there are distinct keywords in the complete data ($k = 34$). The matrix, which will be a non-square, asymmetrical matrix, will serve as the basis for the bipartite (or two-mode) network that we are going to build.

The concept: Bipartite/two-mode networks

A two-mode network is a network in which there are two distinct kinds of entities (modes). These networks are fairly common in the social sciences: just think of a network formed by individuals who are members of groups, such as trade unions, political parties or Facebook groups.

The two different “modes” of a two-mode network are referred to as the “primary” and “secondary” modes. In the case at hand, the primary mode will be you, the students, and the secondary mode will be the keywords that you selected as your research interests. The ordering of the modes is arbitrary: it is just common practice to assign human beings (actors) as the primary mode.

The incidence matrix

The matrix that underlies a two-mode network is called an incidence matrix. If there were only 3 students in the class, and if these students had listed only 5 different keywords, the matrix might have looked like this:

$$ A = \left(\begin{array}{ccccc} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \end{array}\right) $$

This example matrix indicates that Student 1, shown on the first row, has listed Keywords 1 and 2 as his/her research interests. Student 2, on the second row, listed Keywords 2, 3 and 5, and Student 3, on the last row, listed Keywords 4 and 5.

This matrix has several properties relevant to network construction, but the only one you need to remember for now is that it is a binary matrix: it contains $0$ when a student did not select a given keyword, and $1$ otherwise.

Building the matrix

Creating a binary incidence matrix from our data only takes a single, and admittedly quite esoteric, line of R code:

The crucial part of the code is where it reads k %in% x, which tests the logical statement $k \in x$, where $k$ runs over all distinct keywords in the data and $x$ is each student’s selection of keywords.

The result of the entire operation is a binary incidence matrix like the one shown in the previous section, except with $n = 17$ rows and $k = 34$ keywords. And just to make sure that we understand the data structure, let’s label the rows and columns of that matrix with the student names and the keywords:
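A compact sketch of that construction, building on the hypothetical objects defined above:

interests <- strsplit(d$keywords, ",\\s*|\\s+and\\s+")                      # one keyword vector per student
A <- t(sapply(interests, function(x) as.integer(keywords %in% trimws(x))))  # the k %in% x test, row by row
dimnames(A) <- list(students, keywords)                                     # label rows and columns
dim(A)                                                                      # 17 x 34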

Visualizing the two-mode network

Let’s now throw that matrix to the same network visualization function that I used in the previous note. The code for the visualization is, again, quite esoteric, because I want to show students and keywords in different graphical styles:

And here’s our two-mode network, shown in all its glory, with students in bold blue and keywords in light grey:

[Figure: the two-mode network, with students in bold blue and keywords in light grey]

In this graph, students are connected to other students by the keywords that they both listed as research interests. Note that there are a few parts of the graph where it looks like students are directly connected to each other, but this is not the case: in a two-mode network like ours, nodes of the same mode never connect directly to each other.

The network conveys useful information about some keywords, like Gender, Migration and Environment, which act as bridges between several students. It also shows a group of densely connected students on the right, and several students, like Gabriella, Alexander and Felicitas, in peripheral positions on the left, as their choice of keywords more rarely matched with that of other students.

Going back to a one-mode network

Let’s now use the information of this two-mode network to connect students to each other. This time, we want to obtain a square, symmetrical matrix that would look like this if there were only 3 students in the class:

$$ B = \left(\begin{array}{ccc} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2 \end{array}\right) $$

This matrix actually contains the same information as the first one that I showed, only in different form: it has the 3 students on both rows and columns, and indicates, for instance, that Student 1 and Student 2 share one common keyword. You can read that information on the first row, second column, or on the second row, first column.

In this matrix, the diagonal indicates the total number of keywords listed by each student (or, if you prefer, the number of keywords that they share with themselves). Since that information is meaningless for our purposes of connecting students to each other, we will later discard it.

We are, however, interested in the other values of the matrix, and even though all these other values are equal to either $0$ or $1$ in my example, that might not be the case in the data, as some students might share more than one keyword with each other.

In other words, what we want here is to obtain a weighted, one-mode adjacency matrix of students, with the weights indicating the number of keywords that they share with each other. Note that we could perhaps find better ways to weigh that matrix, but let’s keep it simple and stick with the simplest weighting structure.

How do we do that, starting with the incidence matrix $A$ that we created in a previous section? Here’s the trick, in the words of Solomon Messing:

To convert a two-mode incidence matrix to a one-mode adjacency matrix, one can simply multiply an incidence matrix by its transpose.

Or, in mathematical notation, to get the one-mode adjacency matrix $B$ of students, all we need to do is to perform $AA’$ on the two-mode incidence matrix of students and keywords. The only condition for that operation to work properly is that students need to be located on the rows of the incidence matrix, which is the case.

So let’s do that, and then convert the result to an igraph network object:
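A minimal sketch of those two steps (the author's exact code is in the Gist linked at the end), assuming the incidence matrix A built above and the igraph package:

library(igraph)
B <- tcrossprod(A)   # A %*% t(A): a students-by-students matrix weighted by shared keywords
g <- graph_from_adjacency_matrix(B, mode = "undirected", weighted = TRUE, diag = FALSE)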

… And we are back with a one-mode network object, only this time, in contrast to the network shown in the previous one, the nodes of the networks are students, not keywords. We could plot the network straight away to show that, but let’s add a final twist to our analysis.

Adding clusters, a.k.a community detection

Since what we have in our network are students connected by 0, 1 or more keywords, let’s try to group the students together based on that information.

In statistical jargon, finding groups of things based on some properties of these things is called cluster analysis. In network theory, the equivalent is called community detection, and one commonly used method to find communities in networks like ours is called the “Louvain method,” named after the university in which it was designed.

It would take too much effort to explain here how the Louvain method works; just take note that it works with networks like ours—that is, one-mode, undirected networks based on a weighted adjacency matrix.

Detecting the Louvain community of each student, i.e. his or her cluster, based on the keywords that he or she shares with other students, takes one line of R code:
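A sketch of that line (plus a peek at the result), using igraph's cluster_louvain on the weighted, undirected graph g from above:

louvain <- cluster_louvain(g)
membership(louvain)   # one cluster id per student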

Let’s now see if the clusters make sense!

Visualizing the one-mode network

Before we proceed to plotting the network, let’s also compute the weighted degree centrality of the nodes, which will be useful to show students with less shared keywords in smaller size:
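A sketch of that computation, again with igraph:

wdeg <- strength(g)           # weighted degree: sum of the edge weights incident to each student
V(g)$size <- 3 * sqrt(wdeg)   # e.g. scale node size by weighted degree before plotting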

Note that the degree distribution that we have computed here also takes into account the weighted structure of the adjacency matrix that underlies our network.

Let’s now finally get to the network plot, with each keyword-clustered group of students shown in a different color:

[Figure: the one-mode student network, with each keyword-based cluster shown in a different color]

In this graph, the intensity of the edges (the ties between the students) reflects the number of keywords that they share with each other. The graph uses a different force-directed layout algorithm than the one used in the previous note, but this is only a matter of taste.

Final comments

Looking at this final graph, what can we say about this year’s batch of GLM students?

If we trust the method that we used to cluster the students together based on their shared research interests, then Akhil, Alexander and Felicitas should be working together. Similarly, the four students from the “blue” cluster (Gabriella, Gabriel, Alice and Charlotte) should form two student pairs between them.

The results are less obvious for other students, but still, I would expect Mohamed to have paired with either Lucien, Margaux, Corentin or Isabelle, and Marina to have paired with either Elena, Miranda, Cosima or Lucile.

We will see in class if that is what happened in reality, but I can answer that question straight away for readers from outside our class: not at all! This year’s GLM students seem to have privileged other forms of ties between them.


The code shown in this note is provided in full in this Gist. If you want to replicate the plot above, you will need to download the data and (install and) load the required packages first.


anytime 0.0.2: Added functionality


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

anytime arrived on CRAN via release 0.0.1 a good two days ago. anytime aims to convert anything in integer, numeric, character, factor, ordered, … format to POSIXct (or Date) objects.
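A couple of hedged examples of the kind of conversion anytime aims at (the exact printed output depends on your time zone):

library(anytime)
anytime("2016-09-15 10:11:12")   # character -> POSIXct
anytime(20160915L)               # integer in yyyymmdd form -> POSIXct
anydate("2016-09-15")            # character -> Date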

This new release 0.0.2 adds two new functions to gather conversion formats and to set new ones. It also fixes a minor build bug and robustifies a conversion that was seen to be not quite right under some time zones.

The NEWS file summarises the release:

Changes in anytime version 0.0.2 (2016-09-15)

  • Refactored to use a simple class wrapped around two vectors with (string) formats and locales; this allows for adding formats; it also adds an accessor for formats (#4, closes #1 and #3).

  • New functions addFormats() and getFormats().

  • Relaxed one test which showed problems on some platforms.

  • Added as.POSIXlt() step to anydate() ensuring all POSIXlt components are set (#6 fixing #5).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


How to add pbapply to R packages


(This article was first published on Peter Solymos - R related posts, and kindly contributed to R-bloggers)

As of today, there are 20 R packages that reverse depend/import/suggest (3/14/3) the pbapply package. Current and future package developers who decide to incorporate the progress bar using pbapply might want to customize the type and style of the progress bar in their packages to better suit the needs of certain functions or to create a distinctive look. Here is a quick guide to help in setting up and customizing the progress bar.

Adding pbapply

The pbapply package has no extra (non r-base-core) dependencies and is lightweight, so adding it as dependency does not represent a major overhead. There are two alternative ways of adding the pbapply package to another R package: Suggests, or Depends/Imports. Here are the recommended and tested ways of adding a progress bar to other R packages (see the Writing R Extensions manual for an official guide).

1. Suggests: pbapply

The user decides whether to install pbapply and the function behavior changes accordingly. This might be preferred if there are only few functions that utilize a progress bar.

pbapply needs to be added to the Suggests field in the DESCRIPTION file, and conditional statements in the code should fall back on base functions in case pbapply is not installed:

out <- if (requireNamespace("pbapply"))
  pbapply::pblapply(X, FUN, ...) else lapply(X, FUN, ...)

See a small R package here for an example (see R CMD check log on Travis CI: Build Status).

2. Depends/Imports: pbapply

In this second case, pbapply needs to be installed and called explicitly via :: or NAMESPACE. This might be preferred if many functions utilize the progress bar.

pbapply needs to be added to the Depends or Imports field in the DESCRIPTION file. Use pbapply functions either as pbapply::pblapply() or specify them in the NAMESPACE (e.g. importFrom(pbapply, pblapply)) and use it as pblapply() (without the ::).

See a small R package here for an example (see R CMD check log on Travis CI: Build Status).

Customizing the progress bar

Aesthetics aside, there are cases when customizing the progress bar is truly necessary. For example, when working with a GUI, the default text-based progress bar might not be appropriate and developers may want a Windows or Tcl/Tk based progress bar.

In such cases, one can specify the progress bar options in the /R/zzz.R file of the package. The following example shows the default settings, but any of those list elements can be modified (see ?pboptions for acceptable values):

.onAttach <- function(libname, pkgname){
    options("pboptions" = list(
        type = if (interactive()) "timer" else "none",
        char = "-",
        txt.width = 50,
        gui.width = 300,
        style = 3,
        initial = 0,
        title = "R progress bar",
        label = "",
        nout = 100L))
    invisible(NULL)
}

Specifying the progress bar options this way will set the options before pbapply is loaded. pbapply will not override these settings. It is possible to specify a partial list of options (from pbapply version 1.3-0 and above).
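For example, a user (or a package, at run time) could override just a couple of settings and leave the rest at their defaults; a small sketch:

library(pbapply)
old <- pboptions(char = "=", style = 1)               # change only the bar character and style
invisible(pblapply(1:5, function(i) Sys.sleep(0.2)))  # any pb* call now uses the new look
pboptions(old)                                        # restore the previous settings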

Suppressing the progress bar

Suppressing the progress bar is sometimes handy. By default, the progress bar is suppressed when !interactive(). This is an important feature, so that Sweave, knitr, and R Markdown documents are not polluted with a really long printout of the progress bar. (Although it is possible to turn the progress bar back on within such documents.)

In an interactive session, put this inside a function to disable the progress bar (and reset it when exiting the function):

pbo <- pboptions(type = "none")
on.exit(pboptions(pbo))

I hope that this little tutorial helps getting a progress bar where it belongs. Suggestions and feature requests are welcome. Leave a comment or visit the GitHub repo.



Fixing “Peer certificate cannot be authenticated”


(This article was first published on R | Exegetic Analytics, and kindly contributed to R-bloggers)

I’m currently getting the following error on a Windows machine:

> devtools::install_github("hadley/readr")
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Peer certificate cannot be authenticated with given CA certificates

The machine in question is sitting behind a gnarly firewall and proxy, which I suspect are the source of the problem. I also need to use --ignore-certificate-errors when running chromium-browser, which points to the same issue.

This seems to resolve the issue:

> library(httr)
> set_config(config(ssl_verifypeer = 0L))
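If you would rather not disable peer verification for the whole session, httr's with_config() should let you scope the override to a single call; a sketch, assuming devtools routes its downloads through httr (as it did at the time):

> with_config(config(ssl_verifypeer = 0L), devtools::install_github("hadley/readr"))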

The post Fixing “Peer certificate cannot be authenticated” appeared first on Exegetic Analytics.


A few thoughts on the existing code parallelization


(This article was first published on Fun with R, and kindly contributed to R-bloggers)

A few weeks ago I worked on some old code parallelization. The whole process made me think about how efficient parallelization of the existing code in R can really be and what should be considered efficient. There is a lot of information about parallelization in R online (thanks to everybody who put time to write tutorials and overviews on that topic, they were all helpful in their own way!), but here are a few of my thoughts.

Ideally, parallelization would consist of adding a few lines of code that distribute old procedures across processors and, thus, make them run in parallel. However, this would require that the existing code is written in a portable manner that allows such a simple addition. In R, this would most likely mean using *apply-family functions instead of loops, minimizing (or avoiding) for and other loops, avoiding nested loops, and "packing" independent chunks of code into individual functions that return results in a standardized format, e.g., a data frame, vector, or list. Hence, to achieve this seamless parallelization one would either have to write the code with portability in mind from the start or, in general, be an eager supporter (and implementer) of good R programming practices.

Unfortunately, every now and then there are circumstances in which having a working version of a program that performs as expected and runs reasonably fast on the given data set trumps code optimization and (some of the) good practices. And in most such cases, once the project is done, the code gets archived without any additional optimization. But what happens when the old code needs to be reused and it does not scale well with the new data sets? If one is lucky (although we can debate whether luck has anything to do with it), pretty efficient parallelization can be achieved by adjusting one or two procedures to make them suitable for distributed processing. But often more effort is required. And that is when one comes to the point where it has to be decided how much effort is really worth putting into parallelizing the existing code versus (re)writing a parallel version of the code itself.

Clearly, there is no universal answer. But there are at least three things to consider when weighing that decision: 1) who wrote the original code (it is probably easier to modify your own code than somebody else's), 2) when the code was written (how much one remembers about the code/project, and whether there are new libraries available now that can improve the code), and 3) the programming practices used, the code length, the quality of code commenting, and the formatting style. Generally, building on your own code is easier, not only because you are more familiar with your coding style, but also because you are familiar with the problem the code is trying to solve. In that case, even if you decide to rewrite the code, you may still be able to reuse large portions of the original code (just make sure that if your goal is clean and optimized code, you reuse only code that fits those criteria). Working with somebody else's code can be tricky, especially if there are no standardized programming practices defined and the code is not well documented. The most important thing here is to be sure to understand what each line of the code intends to do. And of course, in both cases, the new code needs to be tested and the results need to be compared to those obtained with the old code.

Now, let’s talk about some more practical parallelization issues.

R does not support parallelization by default. But as with many other topics, there are a number of great libraries that enable parallelization of your code. CRAN provides a great summary list of libraries for different types of parallel computing: implicit, explicit, Hadoop, grid computing, etc. When thinking about how to parallelize the code, it is very important to think about where the code will run, as this impacts the way in which parallelization has to be addressed. The implementation of parallel code that runs locally on a single multiprocessor computer will probably differ from one that runs on a cluster of computers, as will an implementation for Unix vs. Windows systems.

Probably the easiest way to illustrate this is using the registerDoParallel function from the doParallel package. This package provides a parallel backend for the foreach package and represents a merger of the doMC and doSNOW packages. By default, doParallel uses multicore functionality on Unix-like systems and snow functionality on Windows, as multicore supports multiple workers only on operating systems that support the fork system call.

Let’s say we have an N-core processor (on a local machine) and we want to use N-1 cores to run our program.

On Unix-based systems, we could do it as:

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# ... the rest of the code, including the foreach loop ...
stopCluster(cl)

Here cl represents the cluster, or in this case the number of workers that will be used to execute the program.

Alternatively, we could specify the number of cores we want to use directly as:

registerDoParallel(cores = (detectCores() - 1))

Remember that multicore functionality only runs tasks on a single computer, not a cluster of computers. Also note that from the hardware perspective clusters and cores represent different things and should not be treated as equal; but in a local-machine environment, the execution of both examples will have the same flow.

While both versions of the code work on Unix-based systems, on Windows machines only the first version will work (attempting to use more than one core with parallel results in an error).

Of course, if you use the cluster version, don't forget to stop the cluster with the stopCluster(cl) command. This ensures that cluster objects are not left in a corrupt state or out of sync (for example due to leftover data in the socket connections to the cluster workers). If you're not sure that the existing clusters are in an uncorrupted state, you can always call the stopCluster function first and then create another cluster.

The system where the code will be executed makes a difference in the parallelism implementation. However, it is worth keeping in mind that a clean code that is optimized for one type of parallelism will likely be easier to adjust to another type of parallelism than ad-hoc code.

There are many other things that would be interesting to discuss from the system perspective, but I am not an expert in the field and don't have knowledge to discuss in-depth issues associated with those, so I will just briefly mention two of them.

First, the majority of R packages are not written with parallelization in mind. While this does not mean that they cannot be used in parallel environment, it does impose some restrictions on parallelization. For example, let's talk about parallelization of operations on network nodes. If we want to parallelize an operation on network nodes, we can assign one network node to each processor, perform some calculations on each node (for example, count the number of its neighbors), and merge the results. However, if we want to perform a bit more complex operations on the nodes, e.g., operations that require calculations that involve other network nodes (for example, count the number of graphlets or orbits that the node is involved in), we would not be able to achieve optimal parallelization using the non-parallel tools, as the same calculation would be repeated on multiple nodes (as there is no way for nodes to share the intermediate, common results), or we may not be able to simply merge the results, as that may result in overcounting.

The second issue to consider is the dependency among packages and the overall lack of consistency among the available packages. That means that if you plan to use a package that does not include a parallel implementation of the desired function, you should carefully test whether it performs as expected in a parallel environment. For example, you should check whether the package is written in a way that all of its dependencies will be exported across the clusters, and whether or not it requires additional, non-R libraries (and whether those libraries are installed on all clusters). More complex issues include the way the package deals with resource allocation, data partitioning, data merging, etc.

Of course, these are all issues that should be addressed even if one writes a program from scratch and does not use other packages/libraries.

Now let's go back to a few more practical notes before I finish.

I will focus on the foreach package. This package provides a looping construct for executing R code in parallel (but also sequentially). While it is a looping construct and contains for in its name, it is not a simple, copy-paste replacement for a standard for loop. The main difference between the two is in the way data is handled: foreach does not just execute code repeatedly but also merges and returns the results obtained from each iteration.

For example, within a standard for loop one can execute multiple commands, but in order to merge the results from each iteration, one needs to explicitly add the results to a predefined variable (e.g., a vector or list):

result <- c()
for (i in 1:5){
  a <- i + i
  result <- c(result, a*i)
}

Within the foreach loop one can also execute multiple commands, but it also returns a value. In this sense, the foreach loop behaves like a function, as it expects the last line to be the value that will be returned (or it will return NULL). However, foreach also combines the returned values from each iteration into a list (or other data structures; I will talk about that soon):

result_par <- foreach (i = 1:5) %dopar% {
  a <- i + i
  a*i
}

Therefore, when writing a foreach loop, one should keep in mind 1) that it is expected to return a single variable/object and 2) how the variables/objects from each iteration should be combined.

Both of these things can be addressed pretty simply. If one needs to perform multiple calculations within the loop and wants to save multiple values, those values can be combined into a list:

results <- foreach(i = 1:100, ...) %dopar% {
  ...
  list(result_1, result_2, ... )
}

Depending on what these results represent, one may need to write a custom function to ensure that they merge in an appropriate manner. Assuming that result_1 and result_2 from the above example represent data frames, we can create a new merging function that will combine the results from each iteration into data frames:

rbindList <- function(x, y){
  list(rbind(x[[1]], y[[1]]), rbind(x[[2]], y[[2]]))
}

The only thing left to do is to specify that this function will be used to merge data in the foreach loop: results <- foreach(i = 1:100, .combine = rbindList,....{... list(result_1, result_2, ... )}

When using foreach, it is also important to ensure that all custom functions called within the foreach loop are exported using the .export option. Similarly, one needs to ensure that all packages required for proper execution are exported using the .packages option:

foreach (i = 1:100, ..., .export = c("myFunction1", "myFunction2"),
         .packages = c("foreach", "igraph")) %dopar% {
...

The foreach package also provides an option for nested for loop implementation using the %:% operator. Technically speaking, this operator turns multiple foreach loops into a single loop, creating a single stream of tasks that can all be executed in parallel. I am not going to go into details about the nested foreach loop, but will just say that some of the practices mentioned above also apply to it, specifically the procedures to merge data (which can differ between the loops) and the way the data will be returned from each loop (that is, in which format). This is especially important if one wants to use more than two for loops (which should probably be avoided if possible).
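A minimal sketch of the nesting syntax (assuming a parallel backend has already been registered):

results <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:4, .combine = c) %dopar% {
    i * j
  }
results   # a 3 x 4 matrix of products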

Occasionally one needs to write to or read from files in the for loop. The same can be done within a foreach loop:

result_par <- foreach (i = 1:5) %dopar% {
  a <- i + i
  write.table(a*i, file = "test.txt", row.names = FALSE, col.names = FALSE, append = TRUE)
}

Keep in mind that disc access is slower than processor speed, so disc operations will likely represent a bottleneck and slow down the execution. There are some cases when one needs to write data to a file. For example, if the generated results are too big to stay in memory until all parallel operations are completed, one can decide to execute (1/N)-th of the processes in parallel N times. Something like:

for (part_num in 1:5){
  print((part_num - 1) * 500)
  results_part <- foreach(j = 1:500, ...) %dopar% {
    ...
  }
  write.table(results_part, file = "results.txt", sep = "\t",
              row.names = FALSE, col.names = FALSE, append = TRUE)
}

While the above approach addresses the need for frequent disc access, it requires multiple parallelization calls, and these calls bring some computational overhead due to distributing processes across the clusters/processors, loading the required data, etc. For that reason, one should not make assumptions about the execution speed without testing and comparing one version against the other. Similarly, when working on code parallelization it is important to keep in mind that trying to parallelize things that run efficiently sequentially can also slow down your program and have the opposite effect. This is particularly true for very short jobs, where the parallelization overhead diminishes the parallelization benefits. Again, if you're not sure whether the parallelization of a specific piece of code improves its performance (speed-wise), you can always check the execution time using the system.time() command (and if possible, do this on the system on which you will run the code and with sufficiently big examples).
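A toy timing comparison along those lines (assuming a parallel backend with at least four workers has been registered):

system.time(lapply(1:4, function(i) Sys.sleep(1)))    # sequential: roughly 4 seconds elapsed
system.time(foreach(i = 1:4) %dopar% Sys.sleep(1))    # parallel: roughly 1 second elapsed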


Analyzing World Bank data with WDI, googleVis Motion Charts


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Recently I was surfing the web when I came across a really cool post, New R package to access World Bank data, by Markus Gesmann, on using googleVis and motion charts with World Bank data. The post also introduced me to Hans Rosling, professor at Sweden's Karolinska Institute and creator of the famous Gapminder chart. His "Health and Wealth of Nations" displays global trends through animated charts (a must see!!!). As they say, in Hans Rosling's hands data dances and sings. Take a look at some of his TED talks, e.g. Hans Rosling: New insights on poverty. Rosling developed the breakthrough software behind his Gapminder visualizations. The free software, which can be loaded with any data, was purchased by Google in March 2007.

In this post I recreate some of the Gapminder charts with the help of the R packages WDI and googleVis. The WDI package provides a set of really useful functions to get the data based on World Bank data indicators. googleVis provides motion charts with which you can animate the data.

You can clone/download the code from Github at worldBankAnalysis which is in the form of an Rmd file.

library(WDI)
library(ggplot2)
library(googleVis)
library(plyr)

1.Get the data from 1960 to 2016 for the following

  1. Population – SP.POP.TOTL
  2. GDP in US $ – NY.GDP.MKTP.CD
  3. Life Expectancy at birth (Years) – SP.DYN.LE00.IN
  4. GDP Per capita income – NY.GDP.PCAP.PP.CD
  5. Fertility rate (Births per woman) – SP.DYN.TFRT.IN
  6. Poverty headcount ratio – SI.POV.2DAY
# World population total
population = WDI(indicator = 'SP.POP.TOTL', country = "all", start = 1960, end = 2016)
# GDP in US $
gdp = WDI(indicator = 'NY.GDP.MKTP.CD', country = "all", start = 1960, end = 2016)
# Life expectancy at birth (Years)
lifeExpectancy = WDI(indicator = 'SP.DYN.LE00.IN', country = "all", start = 1960, end = 2016)
# GDP per capita
income = WDI(indicator = 'NY.GDP.PCAP.PP.CD', country = "all", start = 1960, end = 2016)
# Fertility rate (births per woman)
fertility = WDI(indicator = 'SP.DYN.TFRT.IN', country = "all", start = 1960, end = 2016)
# Poverty head count
poverty = WDI(indicator = 'SI.POV.2DAY', country = "all", start = 1960, end = 2016)

2.Rename the columns

names(population)[3] = "Total population"
names(lifeExpectancy)[3] = "Life Expectancy (Years)"
names(gdp)[3] = "GDP (US$)"
names(income)[3] = "GDP per capita income"
names(fertility)[3] = "Fertility (Births per woman)"
names(poverty)[3] = "Poverty headcount ratio"

3.Join the data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries

j1 <- join(population, gdp)
j2 <- join(j1, lifeExpectancy)
j3 <- join(j2, income)
j4 <- join(j3, poverty)
wbData <- join(j4, fertility)

4.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

# This returns a list of 2 matrices
wdi_data = WDI_data
# The 1st matrix is the set of all World Bank indicators
indicators = wdi_data[[1]]
# The 2nd matrix gives the set of countries and regions
countries = wdi_data[[2]]
df = as.data.frame(countries)
# Remove the aggregates
aa <- df$region != "Aggregates"
countries_df <- df[aa, ]
# Subset from the development data only the rows corresponding to countries
bb = subset(wbData, country %in% countries_df$country)
cc = join(bb, countries_df)
dd = complete.cases(cc)
developmentDF = cc[dd, ]

5.Create and display the motion chart

gg <- gvisMotionChart(cc,
                      idvar = "country",
                      timevar = "year",
                      xvar = "GDP",
                      yvar = "Life Expectancy",
                      sizevar = "Population",
                      colorvar = "region")
plot(gg)
cat(gg$html$chart, file = "chart1.html")

Note: Unfortunately it is not possible to embed the motion chart in WordPress. It has to be hosted on a server as a web page. After exploring several possibilities I came up with the following process to display the animated graph. The plot is saved as an html file using ‘cat’ as shown above. The chart1.html page is then hosted as a GitHub page (gh-pages) on GitHub.

Here is the gvisMotionChart

Do give World Bank Motion Chart1 a spin. Here is how to use the Motion Chart

untitled

You can select Life Expectancy, Population, Fertility etc. by clicking the black arrows. The blue arrow shows the ‘play’ button to animate the motion chart. You can also select the countries and change the size of the circles. Do give it a try. Here are some quick analyses from playing around with the motion charts with different parameters chosen

The charts below are screenshots captured by running the motion chart World Bank Motion Chart1

a. Life Expectancy vs Fertility chart

This chart is used by Hans Rosling in his Ted talk. The left chart shows low life expectancy and a high fertility rate for several sub-Saharan and East Asia Pacific countries in the early 1960s. Today fertility has dropped and life expectancy has increased overall. However, the sub-Saharan countries still have a high fertility rate.

pic1

b. Population vs GDP

The chart below shows that India and China have roughly the same GDP from 1973-1994, with the US and Japan well ahead.

pic2

From 1998-2014 China really pulls away from India and Japan, as seen below.

pic3

c. Per capita income vs Life Expectancy

In the 1990s the per capita income and life expectancy of the sub-Saharan countries are low (42-50). Japan and the US have a good life expectancy in the 1990s. In 2014 the per capita income of the sub-Saharan countries is still low, though life expectancy has marginally improved.

pic4

d. Population vs Poverty headcount

pic5

In the early 1990s China had a higher poverty headcount ratio than India. By 2004 China had this all figured out and the poverty headcount ratio dropped significantly. This can also be seen in the chart below.

pop_pov3

In the chart above China shows a drastic reduction in the poverty headcount ratio compared to India. Strangely, Zambia shows an increase in the poverty headcount ratio.

6. Get the data for the 2nd set of indicators

  1. Total population  – SP.POP.TOTL
  2. GDP in US$ – NY.GDP.MKTP.CD
  3. Access to electricity (% population) – EG.ELC.ACCS.ZS
  4. Electricity consumption KWh per capita -EG.USE.ELEC.KH.PC
  5. CO2 emissions -EN.ATM.CO2E.KT
  6. Sanitation Access – SH.STA.ACSN
# World population
population = WDI(indicator='SP.POP.TOTL', country="all", start=1960, end=2016)
# GDP in US $
gdp = WDI(indicator='NY.GDP.MKTP.CD', country="all", start=1960, end=2016)
# Access to electricity (% population)
elecAccess = WDI(indicator='EG.ELC.ACCS.ZS', country="all", start=1960, end=2016)
# Electric power consumption KWh per capita
elecConsumption = WDI(indicator='EG.USE.ELEC.KH.PC', country="all", start=1960, end=2016)
# CO2 emissions
co2Emissions = WDI(indicator='EN.ATM.CO2E.KT', country="all", start=1960, end=2016)
# Access to sanitation (% population)
sanitationAccess = WDI(indicator='SH.STA.ACSN', country="all", start=1960, end=2016)

7. Rename the columns

names(population)[3] = "Total population"
names(gdp)[3] = "GDP US($)"
names(elecAccess)[3] = "Access to Electricity (% popn)"
names(elecConsumption)[3] = "Electric power consumption (KWH per capita)"
names(co2Emissions)[3] = "CO2 emissions"
names(sanitationAccess)[3] = "Access to sanitation (% popn)"

8. Join the individual data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries

j1 <- join(population, gdp)
j2 <- join(j1, elecAccess)
j3 <- join(j2, elecConsumption)
j4 <- join(j3, co2Emissions)
wbData1 <- join(j4, sanitationAccess)

9. Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

# This returns a list of 2 matrixes
wdi_data = WDI_data
# The 1st matrix is the set of all World Bank Indicators
indicators = wdi_data[[1]]
# The 2nd matrix gives the set of countries and regions
countries = wdi_data[[2]]
df = as.data.frame(countries)
# Remove the aggregates
aa <- df$region != "Aggregates"
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
ee = subset(wbData1, country %in% countries_df$country)
ff = join(ee, countries_df)
## Joining by: iso2c, country

10. Create and display the motion chart

gg1 <- gvisMotionChart(ff,
                       idvar = "country",
                       timevar = "year",
                       xvar = "GDP",
                       yvar = "Access to Electricity",
                       sizevar = "Population",
                       colorvar = "region")
plot(gg1)
cat(gg1$html$chart, file = "chart2.html")

This is World Bank Motion Chart2, which has a different set of parameters like Access to Electricity, CO2 emissions etc.

The charts below are screenshots of the motion chart World Bank Motion Chart 2

a. Access to Electricity vs Population

pic6

The above chart shows that in China 100% of the population has access to electricity. India has made decent progress, from 50% in 1990 to 79% in 2012. However, Pakistan seems to have done much better in providing access to electricity, moving from 59% to close to 98%.

b. Power consumption vs population

powercon

The above chart shows power consumption vs population. China and India have proportionally much lower consumption than Norway, the US and Canada.

c. CO2 emissions vs Population

pic7

In 1963 CO2 emissions were fairly low and about comparable for all countries. The US and India have shown a steady increase, while China shows a steep increase. Interestingly, the UK shows a drop in CO2 emissions.

d. Access to sanitation

san

India shows an improvement but has a long way to go, with only 40% of the population having access to sanitation. China has made much better strides, with 80% having access to sanitation in 2015. Strangely, Nigeria shows a drop in sanitation access of almost 20% of the population.

The code is available on Github at worldBankAnalysis.

Conclusion: So there you have it. I have shown some screenshots of some sample parameters of the World Bank indicators. Please try to play around with World Bank Motion Chart1 & World Bank Motion Chart 2 with your own set of parameters and countries.

Also see 1.  Introducing QCSimulator: A 5-qubit quantum computing simulator in R 2. Dabbling with Wiener filter using OpenCV 3. Designing a Social Web Portal 4. Design Principles of Scalable, Distributed Systems 5. Re-introducing cricketr! : An R package to analyze performances of cricketers 6. Natural language processing: What would Shakespeare say?

To see all posts Index of posts

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….


Surveillance Out of the Box – The #Zombie Experiment


(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

We perform a social experiment to investigate whether zombie-related twitter posts can be used as a reliable indicator for an early warning system. We show how such a system can be set up almost out-of-the-box using R – a free software environment for statistical computing and graphics. Warning: This blog entry contains toxic doses of Danish irony and sarcasm as well as disturbing graphs.

Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Introduction

Proposing statistical methods is only mediocre fun if nobody applies them. As an act of desperation the prudent statistician has been forced to provide R packages supplemented with a CRAN, github, useR! or word-of-mouth advertising strategy. To underpin efforts, a reproducibility-crisis has been announced in order to scare decent comma-separated scientists away from using Excel. Social media marketing strategies for your R package include hashtag #rstats twitter announcements, possibly enhanced by a picture or animation showing your package at its best:

Introducing gganimate: #rstats package for adding animation to any ggplot2 figure https://t.co/UBWKHmIc0e pic.twitter.com/oQhQaYBqOj

— David Robinson (@drob) February 1, 2016

Unfortunately, little experience with the interactive aspect of this statistical software marketing strategy appears to be available. In order to fill this scientific advertising gap, this blog post constitutes an advertisement for the out-of-the-box functionality of the surveillance package, hidden as a social experiment. It shows what you can do with R when you combine a couple of packages, wrangle the data, cleverly visualize the results and then team up with the fantastic R community.

The Setup: Detecting a Zombie Attack

As previously explained in a useR! 2015 lightning talk, Max Brooks’ Zombie Survival Guide is very concerned about the early warning of Zombie outbreaks.

However, despite extensive research and recommendations, no reliable service appears to be available for the early detection of such upcoming events. Twitter, on the other hand, has become the media darling for staying informed about news as it unfolds. Hence, continuous monitoring of hashtags like #zombie or #zombieattack appears to be an essential component of your zombie survival strategy.

Tight Clothes, Short Hair and R

Extending the recommendations of the Zombie Survival guide we provide an out-of-the-box (OOTB) monitoring system by using the rtweet R package to obtain all individual tweets containing the hashtags #zombie or #zombieattack.

the_query <- "#zombieattack OR #zombie"
geocode <- ""
# To limit the search to Berlin & surroundings:
# geocode <- "52.520583,13.402765,25km"
# Converted query string which works for storing as file
safe_query <- stringr::str_replace_all(the_query, "[^[:alnum:]]", "X")

In particular, the README of the rtweet package provides helpful information on how to create a twitter app to automatically search tweets using the twitter API. One annoyance of the twitter REST API is that only the tweets of the past 7 days are kept in the index. Hence, your time series are going to be short unless you accumulate data over several queries spread over a time period. Instead of using a fancy database setup for this data collection, we provide a simple R solution based on dplyr and saveRDS – see the underlying R code by clicking on the github logo in the license statement of this post. Basically,

  • all tweets fulfilling the above hashtag search queries are extracted
  • each tweet is extended with a time stamp of the query-time
  • the entire result of each query is stored in a separate RDS file using saveRDS

In the next step, all stored queries are loaded from the RDS files and put together. Subsequently, only the newest time-stamped entry about each tweet is kept – this ensures that the retweet counts are up-to-date and no post is counted twice. All these data wrangling operations are easily conducted using dplyr. Of course a full database solution would have been more elegant, but R does the job just as well as long as it’s not millions of queries. No matter the data backend, at the end of this pipeline we have a database of tweets.
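A minimal sketch of this collect-and-deduplicate step (not the author’s exact code; the file pattern and the status_id column returned by rtweet are assumptions):

library(dplyr)

# Load all stored query results (hypothetical file naming, one RDS file per query)
files <- list.files(filePath, pattern = "^Tweets-.*\\.RDS$", full.names = TRUE)
tw_all <- bind_rows(lapply(files, readRDS))

# Keep only the newest time-stamped entry per tweet (status_id is assumed to be
# the tweet id column), so retweet counts are current and nothing is counted twice
tw <- tw_all %>%
  group_by(status_id) %>%
  filter(query_at == max(query_at)) %>%
  slice(1) %>%
  ungroup()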

# Read the tweet database
tw <- readRDS(file=paste0(filePath, "Tweets-Database-", safe_query, "-", "2016-09-25", ".RDS"))
options(width=300, tibble.width = Inf)
tw %>% select(created_at, retweet_count, screen_name, text, hashtags, query_at)
## # A tibble: 10,974 × 6
##    created_at          retweet_count screen_name    text                                                                                                                                          hashtags  query_at
##    <dttm>                      <int> <chr>          <chr>                                                                                                                                         <list>    <dttm>
## 1  2016-09-25 10:26:28             0 Lovebian       The latest #Zombie Nation! https://t.co/8ZkOFSZH2v Thanks to @NJTVNews @MaxfireXSA @Xtopgun901X                                               <chr [1]> 2016-09-25 10:30:44
## 2  2016-09-25 10:25:49             2 MilesssAwaaay  RT @Shaaooun: I'm gonna turn to a zombie soon! xdxdxdxd #AlmostSurvived #204Days #ITried #Zombie #StuckInMyRoom #Haha\n\n#MediaDoomsDay #Kame    <chr [7]> 2016-09-25 10:30:44
## 3  2016-09-25 10:21:10             6 catZzinthecity RT @ZombieEventsUK: 7 reasons #TheGirlWithAllTheGifts is the best #zombie movie in years https://t.co/MB82ssxss2 via @MetroUK #Metro         <chr [3]> 2016-09-25 10:30:44
## 4  2016-09-25 10:19:41             0 CoolStuff2Get  Think Geek Zombie Plush Slippers https://t.co/0em920WCMh #Zombie #Slippers #MyFeetAreCold https://t.co/iCEkPBykCa                             <chr [3]> 2016-09-25 10:30:44
## 5  2016-09-25 10:19:41             4 TwitchersNews  RT @zOOkerx: Nur der frühe Vogel fängt den #zombie also schaut gemütlich rein bei @booty_pax! Now live #dayz on #twitch \n\nhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 6  2016-09-25 10:17:45             0 ZombieExaminer Washington mall shooting suspect Arcan Cetin was '#Zombie-like' during arrest - USA TODAY https://t.co/itoDXG3L8T https://t.co/q2mURi24DB     <chr [1]> 2016-09-25 10:30:44
## 7  2016-09-25 10:17:44             4 SpawnRTs       RT @zOOkerx: Nur der frühe Vogel fängt den #zombie also schaut gemütlich rein bei @booty_pax! Now live #dayz on #twitch \n\nhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 8  2016-09-25 10:17:23             0 BennyPrabowo   bad miku - bad oni-chan... no mercy\n.\n.\n.\n.\n#left4dead #games #hatsunemiku #fps #zombie #witch https://t.co/YP0nRDFFj7                    <chr [6]> 2016-09-25 10:30:44
## 9  2016-09-25 10:12:53            62 Nblackthorne   RT @PennilessScribe: He would end her pain, but he could no longer live in a world that demanded such sacrifice. #zombie #apocalypse\nhttps: <chr [2]> 2016-09-25 10:30:44
## 10 2016-09-25 10:06:46             0 mthvillaalva   Pak ganern!!! Kakatapos ko lang kumain ng dugo! \n#Zombie https://t.co/Zyd0btVJH4                                                             <chr [1]> 2016-09-25 10:30:44
## # ... with 10,964 more rows

OOTB Zombie Surveillance

We are now ready to prospectively detect changes using the surveillance R package (Salmon, Schumacher, and Höhle 2016).

library("surveillance")

We shall initially focus on the #zombie series as it contains more counts. The first step is to convert the data.frame of individual tweets into a time series of daily counts.

#' Function to convert data.frame to queries. For convenience we store the time series
#' and the data.frame jointly as a list. This allows for easy manipulations later on
#' as we see data.frame and time series to be a joint package.
#'
#' @param tw data.frame containing the linelist of tweets.
#' @param the_query_subset String containing a regexp to restrict the hashtags
#' @return List containing sts object as well as the original data frame.
#'
df_2_timeseries <- function(tw, the_query_subset) {
  tw_subset <- tw %>% filter(grepl(gsub("#", "", the_query_subset), hashtags))
  # Aggregate data per day and convert time series to sts object
  ts <- surveillance::linelist2sts(as.data.frame(tw_subset),
                                   dateCol="created_at_Date", aggregate.by="1 day")
  # Drop first day with observations; due to the moving window of the twitter index, this count is incomplete
  ts <- ts[-1,]
  return(list(tw=tw_subset, ts=ts, the_query_subset=the_query_subset))
}

zombie <- df_2_timeseries(tw, the_query_subset = "#zombie")

It’s easy to visualize the resulting time series using the plotting functionality of the surveillance package.

We see that the counts on the last day are incomplete. This is because the query was performed at 10:30 CEST and not at midnight. We therefore adjust the counts on the last day based on simple inverse probability weighting. This just means that we scale up the counts by the inverse of the fraction the query hour (10:30 CEST) makes up of 24h (see the github code for details). This relies on the assumption that tweets are evenly distributed over the day.
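A minimal sketch of this adjustment (not the author’s exact code), acting directly on the observed slot of the sts object and assuming the 10:30 query time:

# Fraction of the day covered at query time (10:30)
frac_of_day <- (10 + 30/60) / 24

# Scale up the incomplete count on the last day by the inverse of that fraction
n <- nrow(zombie$ts)
zombie$ts@observed[n, ] <- zombie$ts@observed[n, ] / frac_of_day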

We are now ready to apply a surveillance algorithm to the pre-processed time series. We shall pick the so-called C1 version of the EARS algorithm documented in Hutwagner et al. (2003) or Fricker, Hegler, and Dunfee (2008). For a monitored time point \(s\) (here: a particular day, say 2016-09-23), this simple algorithm takes the seven observations before \(s\) in order to compute the mean and standard deviation, i.e. \[ \begin{align*} \bar{y}_s &= \frac{1}{7} \sum_{t=s-7}^{s-1} y_t, \\ \operatorname{sd}_s &= \sqrt{\frac{1}{7-1} \sum_{t=s-7}^{s-1} (y_t - \bar{y}_s)^2}. \end{align*} \] The algorithm then computes the z-statistic \(\operatorname{C1}_s = (y_s - \bar{y}_s)/\operatorname{sd}_s\) for each time point to monitor. Once the value of this statistic is above 3, an alarm is flagged. This means that we assume that the previous 7 observations are what is to be expected when no unusual activity is going on. One can interpret the statistic as a transformation to (standard) normality: once the current observation is too extreme under this model, an alarm is sounded. Such normal approximations are justified given the large number of daily counts in the zombie series we consider, but do not take secular trends or day-of-the-week effects into account. Note that the calculations can also be reversed in order to determine how large the number of observations needs to be in order to generate an alarm.
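For illustration, the C1 statistic for a single monitored day could be hand-rolled as follows (a sketch; `earsC` below does this for us, and `y` is assumed to be the vector of daily counts):

# C1 statistic for day s, using the 7 previous days as baseline
c1_stat <- function(y, s) {
  baseline <- y[(s - 7):(s - 1)]
  # sd() uses the 1/(7-1) denominator shown in the formula above
  (y[s] - mean(baseline)) / sd(baseline)
}

# An alarm is flagged when c1_stat(y, s) > 3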

We now apply the EARS C1 monitoring procedure to the zombie time series starting at the 8th day of the time series. It is important to realize that the result of monitoring a time point in the graphic is obtained by only looking into the past. Hence, the relevant time point to consider today is whether an alarm would have occurred on 2016-09-25. We also show the other time points to see if we could have detected potential alarms earlier.

zombie[["sts"]] <- earsC(zombie$ts,
                         control = list(range = 8:nrow(zombie$ts),
                                        method = "C1", alpha = 1-pnorm(3)))

What a relief! No suspicious zombie activity appears to be ongoing. Actually, it would have taken 511 tweets before we would have raised an alarm on 2016-09-25. This is quite a number.
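A sketch of how such a number can be read off the `earsC` fit (the exact bookkeeping in the post may differ slightly):

# Alarm limit vs observed count on the last monitored day
limit   <- tail(upperbound(zombie$sts), 1)
current <- tail(observed(zombie$sts), 1)
ceiling(limit) - current   # roughly the additional tweets needed to sound the alarm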

As an additional sensitivity analysis we redo the analyses for the #zombieattack hashtag. Here the use of the normal approximation in the computation of the alerts is more questionable. Still, we can get a time series of counts together with the alarm limits.

Also no indication of zombie activity. The number of additional tweets needed before alarm in this case is: 21. Altogether, it looks safe out there…

Summary

R provides ideal functionality to quickly extract and monitor twitter time series. Combining this with statistical process control methods allows you to prospectively monitor the use of hashtags. Twitter has released a dedicated package for this purpose; however, in the case of low-count time series it is better to use count time series monitoring devices as implemented in the surveillance package. Salmon, Schumacher, and Höhle (2016) contains further details on how to proceed in this case.

The important question, however, remains: Does this really work in practice? Can you sleep tight while your R zombie monitor scans twitter? Here is where the social experiment starts: Please help answer this question by retweeting the post below to create a drill alarm situation. More than 511 (!) and 21 tweets, respectively, are needed before an alarm will sound.

(placeholder tweet, this will change in a couple of minutes!!)

Video recording, slides & R code of our (???) MV Time Series webinar now available at https://t.co/XVtLrjbJKZ #biosurveillance #rstats

— Michael Höhle (@m_hoehle) 21. September 2016

I will continuously update the graphs in this post to see how our efforts are reflected in the time series of tweets containing the #zombieattack and #zombie hashtags. Thanks for your help!

References

Fricker, R. D., B. L. Hegler, and D. A. Dunfee. 2008. “Comparing syndromic surveillance detection methods: EARS’ versus a CUSUM-based methodology.” Stat Med 27 (17): 3407–29.

Hutwagner, L., W. Thompson, G. M. Seeman, and T. Treadwell. 2003. “The bioterrorism preparedness and response Early Aberration Reporting System (EARS).” J Urban Health 80 (2 Suppl 1): 89–96.

Salmon, M., D. Schumacher, and M. Höhle. 2016. “Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance.” Journal of Statistical Software 70 (10). doi:10.18637/jss.v070.i10.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....


tint 0.0.1: Tint Is Not Tufte


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new experimental package is now on the ghrr drat. It is named tint which stands for Tint Is Not Tufte. It provides an alternative for Tufte-style html presentation. I wrote a bit more on the package page and the README in the repo— so go read this.

Here is just a little teaser of what it looks like:

and the full underlying document is available too.

For questions or comments use the issue tracker off the GitHub repo. The package may be short-lived as its functionality may end up inside the tufte package.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Re-introducing Radiant: A shiny interface for R


(This article was first published on R(adiant) news, and kindly contributed to R-bloggers)

Radiant is a platform-independent browser-based interface for business analytics in R. I first introduced Radiant through R-bloggers on 5/2/2015 and, according to Dean Attali, the post was reasonably popular. So I decided to write a post about the changes to the tool since then.

Radiant is back on CRAN and the code and documentation have been moved to a GitHub organization radiant-rstats. Note that the app is only available for R 3.3.0 or later.

There have been numerous changes to the functionality and structure of Radiant. The app is now made up of 5 different menus, each in a separate package. The Data menu (radiant.data) includes interfaces for loading, saving, viewing, visualizing, summarizing, transforming, and combining data. It also contains functionality to generate reproducible reports of the analyses conducted in the application. The Design menu (radiant.design) includes interfaces for design of experiments, sampling, and sample size calculation. The Basics menu (radiant.basics) includes interfaces for probability calculation, central limit theorem simulation, comparing means and proportions, goodness-of-fit testing, cross-tabs, and correlation. The Model menu (radiant.model) includes interfaces for linear and logistic regression, Neural Networks, model evaluation, decision analysis, and simulation. The Multivariate menu (radiant.multivariate) includes interfaces for perceptual mapping, factor analysis, cluster analysis, and conjoint analysis. Finally, the radiant package combines the functionality from each of these 5 packages.

More functionality is in the works. For example, naive Bayes, boosted decision trees, random forests, and choice models will be added to the Model menu (radiant.model). I’m also planning to add a Text menu (radiant.text) to provide functionality to view, process, and analyze text.

If you are interested in contributing to, or extending, Radiant, take a look at the code for the radiant.design package on GitHub. This is the simplest menu and should give you a good idea of how you can build on the functionality in the radiant.data package that is the basis for all other packages and menus.

Want to know more about Radiant? Although you could take look at the original Introducing Radiant blog post, quite a few of the links and references have changed. So to make things a bit easier, I’m including an updated version of the original post below.

If you have questions or comments please email me at radiant@rady.ucsd.edu

Key features

  • Explore: Quickly and easily summarize, visualize, and analyze your data
  • Cross-platform: It runs in a browser on Windows, Mac, and Linux
  • Reproducible: Recreate results at any time and share work with others as a state file or an Rmarkdown report
  • Programming: Integrate Radiant’s analysis functions into your own R-code
  • Context: Data and examples focus on business applications

Explore

Radiant is interactive. Results update immediately when inputs are changed (i.e., no separate dialog boxes). This greatly facilitates exploration and understanding of the data.

Cross-platform

Radiant works on Windows, Mac, or Linux. It can run without an Internet connection and no data will leave your computer. You can also run the app as a web application on a server.

Reproducible

Simply saving output is not enough. You need the ability to recreate results for the same data and/or when new data becomes available. Moreover, others may want to review your analyses and results. Save and load the state of the application to continue your work at a later time or on another computer. Share state files with others and create reproducible reports using Rmarkdown.

If you are using Radiant on a server you can even share the url (include the SSUID) with others so they can see what you are working on. Thanks for this feature go to Joe Cheng.

Programming

Although Radiant’s web-interface can handle quite a few data and analysis tasks, you may prefer to write your own code. Radiant provides a bridge to programming in R(studio) by exporting the functions used for analysis. For more information about programming with Radiant see the programming page on the documentation site.

Context

Radiant focuses on business data and decisions. It offers context-relevant tools, examples, and documentation to reduce the business analytics learning curve.

How to install Radiant

  • Required: R version 3.3.0 or later
  • Required: A modern browser (e.g., Chrome or Safari). Internet Explorer (version 11 or higher) or Edge should work as well
  • Recommended: Rstudio

Radiant is available on CRAN. However, to install the latest version of the different packages with complete documentation for offline access open R(studio) and copy-and-paste the command below into the console:

install.packages("radiant",repos="http://radiant-rstats.github.io/minicran/")

Once all packages and dependencies are installed use the following command to launch the app in your default browser:

radiant::radiant()

If you have a recent version of Rstudio installed you can also start the app from the Addins dropdown. That dropdown will also provide an option to upgrade Radiant to the latest version available on the github minicran repo.

If you currently only have R on your computer and want to make sure you have all supporting software installed as well (e.g., Rstudio, MikTex, etc.) open R, copy-and-paste the command below, and follow along as different dialogs are opened:

source("https://raw.githubusercontent.com/radiant-rstats/minicran/gh-pages/install.R")

More detailed instructions are available on the install radiant page.

Documentation

Documentation and tutorials are available at http://radiant-rstats.github.io/docs/ and in the Radiant web interface (the ? icons and the Help menu).

Want some help getting started? Watch the tutorials on the documentation site

Radiant on a server

If you have access to a server you can use shiny-server to run radiant. First, start R on the server with sudo R and install radiant using install.packages("radiant"). Then clone the radiant repo and point shiny-server to the inst/app/ directory.

If you have Rstudio server running and the Radiant package is installed, you can start Radiant from the addins menu as well. To deploy Radiant using Docker take a look at the example and documentation at:

https://github.com/radiant-rstats/docker-radiant

Not ready to install Radiant, either locally or on a server? Try it out on shinyapps.io at the link below:

vnijs.shinyapps.io/radiant

Send questions and comments to: radiant@rady.ucsd.edu.



To leave a comment for the author, please follow the link and comment on their blog: R(adiant) news.


FileTable and storing graphs from Microsoft R Server


(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

FileTable has been around now for quite some time and it is useful for storing files, documents, pictures and binary files in a designated SQL Server table – a FileTable. The best part of FileTable is that one can access it from Windows or any other application as if the files were stored on the file system (because they are), without making any other changes on the client.

And this feature is absolutely handy for using and storing outputs from Microsoft R Server. In this blog post I will focus mainly on persistently storing charts from statistical analysis.

First we need to make sure that FILESTREAM is enabled. Open SQL Server Configuration Manager and navigate to your running SQL Server. Right click, select FILESTREAM and enable FILESTREAM for T-SQL access and I/O access. In addition, allow remote clients access to FILESTREAM data as well.

2016-09-23-21_53_34-sqlquery9-sql-sicn-kastrun-master-spar_si01017988-63_-microsoft-sql-serv

Next step is to enable the configurations in Management Studio.

EXEC sp_configure 'filestream_access_level', 2;
GO
RECONFIGURE;
GO

For this purpose I have decided to have a dedicated database for storing charts created in R. And this database will have FileTable enabled.

USE master;
GO
CREATE DATABASE FileTableRChart
ON PRIMARY
  (NAME = N'FileTableRChart',
   FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart.mdf',
   SIZE = 8192KB, FILEGROWTH = 65536KB),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM
  (NAME = ChartsFG,
   FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\RCharts')
LOG ON
  (NAME = N'FileTableRChart_log',
   FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart_log.ldf',
   SIZE = 8192KB, FILEGROWTH = 65536KB)
GO
ALTER DATABASE FileTableRChart
    SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'RCharts' )

So I will have folder RCharts available as a BLOB storage to my FileTableRChart SQL server database. Adding a table to get all the needed information on my charts.

USE FileTableRChart;
GO
CREATE TABLE ChartsR AS FILETABLE
WITH (FileTable_Directory = 'DocumentTable',
      FileTable_Collate_Filename = database_default);
GO

With the BLOB storage set up, we can now focus on the R code within T-SQL. The following R code will be used to generate histograms with a normal curve for a quick data overview (note, this is just a sample):

x <- data.frame(val = c(1,2,3,6,3,2,3,4,5,6,7,7,6,6,6,5,5,4,8))
y <- data.frame(val = c(1,2,5,8,5,4,2,4,5,6,3,2,3,5,5,6,7,7,8))
x$class <- 'XX'
y$class <- 'YY'
d <- rbind(x,y)

# normal function with counts
gghist <- ggplot(d, aes(x=val)) + geom_histogram(binwidth=2, aes(y=..density.., fill=..count..))
gghist <- gghist + stat_function(fun=dnorm, args=list(mean=mean(d$val), sd=sd(d$val)), colour="red")
gghist <- gghist + ggtitle("Histogram of val with normal curve") + xlab("Variable Val") + ylab("Density of Val")

This returns a diagram that will be further parametrized when inserted into the T-SQL code.

hist_norm_curv

Besides parametrization, I will add a function to loop through all the input variables and generate diagrams for each of the given variables/columns in the SQL Server query passed through the sp_execute_external_script stored procedure.

Final code:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                     fs.[Sale Key] AS SalesID
                    ,c.[City] AS City
                    ,c.[State Province] AS StateProvince
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Stock Item Key] AS StockItem
                    ,fs.[Quantity] AS Quantity
                    ,fs.[Total Including Tax] AS Total
                    ,fs.[Profit] AS Profit
                FROM [Fact].[Sale] AS fs
                JOIN dimension.city AS c
                  ON c.[City Key] = fs.[City Key]
                WHERE fs.[customer key] <> 0'

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(ggplot2)
              library(stringr)
              #library(jpeg)
              cust_data <- Sales
              n <- ncol(cust_data)
              for (i in 1:n)
              {
                path <- ''\\\\SICN-KASTRUN\\mssqlserver\\RCharts\\DocumentTable\\Plot_''
                colid   <- data.frame(val=(cust_data)[i])
                colname <- names(cust_data)[i]
                #print(colname)
                #print(colid)
                gghist <- ggplot(colid, aes(x=val)) + geom_histogram(binwidth=2, aes(y=..density.., fill=..count..))
                gghist <- gghist + stat_function(fun=dnorm, args=list(mean=mean(colid$val), sd=sd(colid$val)), colour="red")
                gghist <- gghist + ggtitle("Histogram of val with normal curve") + xlab("Variable Val") + ylab("Density of Val")
                path <- paste(path, colname, ''.jpg'')
                path <- str_replace_all(path," ","")
                #jpeg(file=path)
                ggsave(path, width = 4, height = 4)
                plot(gghist)
                dev.off()
              }';

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RStat
    ,@input_data_1 = @SQLStat
    ,@input_data_1_name = N'Sales'

I am using the ggsave function, but the jpeg graphics device is also an option. A matter of flavour. The variable path should point to your local FileTable directory.
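For reference, the jpeg-device route inside the same loop would look roughly like this in plain R (a sketch; the width/height values are arbitrary):

# Alternative to ggsave(): open a jpeg device, print the plot, close the device
jpeg(filename = path, width = 400, height = 400)
print(gghist)
dev.off()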

Now I can have graphs and charts stored persistently in the FileTable and retrieve information on the files with a simple query:

SELECT
     FT.Name
    ,IIF(FT.is_directory=1,'Directory','Files') [File Category]
    ,FT.file_type [File Type]
    ,(FT.cached_file_size)/1024.0 [File Size (KB)]
    ,FT.creation_time [Created Time]
    ,FT.file_stream.GetFileNamespacePath(1,0) [File Path]
    ,ISNULL(PT.file_stream.GetFileNamespacePath(1,0),'Root Directory') [Parent Path]
FROM [dbo].[ChartsR] FT
LEFT JOIN [dbo].[ChartsR] PT
  ON FT.path_locator.GetAncestor(1) = PT.path_locator

2016-09-25-13_33_40-using_filetable_to_store_graphs_generated_with_revoscale-sql-sicn-kastrun-file

Going through the charts can now be much easier, for multiple purposes.

2016-09-25-13_35_08-edit-post-tomaztsql-wordpress-com

There might be some security issues: I have used mklink to create a logical drive pointing to FileTable directory.

2016-09-25-12_05_54-administrator_-command-prompt

You might also want to use the Local Group Policy Editor to grant MSSQLLaunchpad access (write permissions) to the FileTable directory.

2016-09-25-13_37_41-local-group-policy-editor

Code is available at GitHub.

Happy R-SQLing!

 

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.



The biggest liars in US politics


(This article was first published on Stat Of Mind, and kindly contributed to R-bloggers)

Anyone that follows US politics will be aware of the tremendous changes and volatility that have struck the US political landscape in the past year. In this post, I leverage third-party data to surface the most frequent liars, and show how to build a containerized Shiny app to visualize direct comparisons between individuals.

http://tlfvincent.github.io/2016/06/11/biggest-political-liars/

To leave a comment for the author, please follow the link and comment on their blog: Stat Of Mind.


Better Model Selection for Evolving Models


(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

For quite some time now I have been using R’s caret package to choose the model for forecasting time series data. The approach is satisfactory as long as the model is not an evolving model (i.e. is not re-trained), or if it evolves rarely. If the model is re-trained often, the approach has significant computational overhead. Interestingly enough, an alternative, more efficient approach also allows for more flexibility in the area of model selection. Let’s first outline how caret chooses a single model. The high level algorithm is outlined here:

Model Selection Algorithm

So let’s say we are training a random forest. For this model, a single parameter, mtry, is optimized:

require(caret)
getModelInfo('rf')$wsrf$parameters
#   parameter   class                         label
#   1    mtry numeric #Randomly Selected Predictors

Let’s assume we are using some form of cross validation. According to the algorithm outline, caret will create a few subsets. On each subset, it will train all models (as many models as different values for mtry there are) and finally it will choose the model behaving best over all cross validation folds. So far so good.

When dealing with time series, using regular cross validation has a future-snooping problem and from my experience general cross validation doesn’t work well in practice for time series data. The results are good on the training set, but the performance on the test set, the hold out, is bad. To address this issue, caret provides the timeslice cross validation method:

require(caret)

history = 1000
initial.window = 800
train.control = trainControl(
                    method = "timeslice",
                    initialWindow = initial.window,
                    horizon = history - initial.window,
                    fixedWindow = T)

When the above train.control is used in training (via the train call), we will end up using 200 models for each set of parameters (each value of mtry in the random forest case). In other words, for a single value of mtry, we will compute:

Window   Training Points   Test Point
1        1..800            801
2        2..801            802
3        3..802            803
...
200      200..999          1000

The training set for each model is the previous 800 points. The test set for a single model is the single point forecast. Now, for each value of mtry we end up with 200 forecasted points; using the accuracy (or any other metric), we select the best performing model over these 200 points. No future-snooping here, because all history points are prior to the points being forecasted.
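Putting the pieces together, the single-model selection step could look roughly like this (a sketch; the predictor matrix x, the labels y and the mtry grid are assumptions, not code from the post):

# Select mtry for a random forest over the 200 sliding windows
# defined by train.control above
set.seed(42)
fit <- train(x = x, y = y,
             method = "rf",
             tuneGrid = data.frame(mtry = c(2, 4, 8)),
             trControl = train.control,
             metric = "Accuracy")
fit$bestTune   # the mtry value that performed best over the 200 forecasts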

Granted, this approach (of doing things on a daily basis) may sound extreme, but it’s useful to illustrate the overhead which is imposed when the model evolves over time, so bear with me.

So far we have dealt with a single model selection. Once the best model is selected, we can forecast the next data point. Then what? What I usually do is walk the time series forward and repeat these steps at certain intervals. This is equivalent to saying something like: “Let’s choose the best model each Friday, use the selected model to predict each day for the next week. Then re-fit it on Friday.” This forward-walking approach has been found useful in trading, but surprisingly, hasn’t been discussed much elsewhere. Abundant time series data is generated everywhere; hence, I feel this evolving model approach deserves at least as much attention as the “fit once, live happily thereafter” approach.

Back to our discussion. To illustrate the inefficiency, consider an even more extreme case – we are selecting the best model every day, using the above parameters, i.e. the best model for each day is selected by tuning the parameters over the previous 200 days. On day n, for a given value of the parameter (mtry), we will train this model over a sequence of 200 sliding windows, each of size 800. Next we will move to day n+1 and we will compute, yet again, this model over a sequence of 200 sliding windows, each of size 800. Most of these operations are repeated (the last 800-point window on day n is the second-to-last 800-point window on day n+1). So just for a single parameter value, we are repeating most of the computation on each step.

At this point, I hope you get the idea. So what is my solution? Simple. For each set of model parameters (each value of mtry), walk the series separately, do the training (no cross validation – we have a single parameter value), do the forecasting and store everything important into, let’s say, an SQLite database. Next, pull out all predictions and walk the combined series. On each step, look at the history, and based on it, decide which model prediction to use for the next step. Assuming we are selecting the model over 5 different values for mtry, here is how the combined data may look for a three-class (0, -1 and 1) classification:

models

Obviously the described approach is going to be orders of magnitude faster, but will deliver very similar (there are differences based on the window sizes) results. It also has an added bonus – once the forecasts are generated, one can experiment with different metrics for model selection on each step, all without re-running the machine learning portion. For instance, instead of model accuracy (the default caret metric for classification), one can compare cumulative returns over the last n days.
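A sketch of the storage side of this idea using RSQLite (the table layout and the objects dates, preds, actuals and mtry_value are assumptions, not code from the post):

library(DBI)      # dbConnect(), dbWriteTable(), dbReadTable()
library(RSQLite)  # the SQLite driver

# Append the walk-forward forecasts for one parameter value (mtry) to a table
con <- dbConnect(SQLite(), "forecasts.db")
dbWriteTable(con, "forecasts",
             data.frame(date = as.character(dates),
                        mtry = mtry_value,
                        prediction = preds,
                        actual = actuals),
             append = TRUE)

# Later: pull everything back and, for each day, keep the prediction of the
# parameter value with the best trailing performance over the last n days
all_fc <- dbReadTable(con, "forecasts")
dbDisconnect(con)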

Still cryptic or curious about the details? My plan is to keep posting details and code as I progress with my Python implementation. Thus, look for the next installments of this series.

The post Better Model Selection for Evolving Models appeared first on Quintuitive.

To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive.


When Trump visits… tweets from his trip to Mexico


(This article was first published on En El Margen - R-English, and kindly contributed to R-bloggers)

I’m sure many of my fellow Mexicans will remember the historically ill-advised (to say the least) decision by our President to invite Donald Trump for a meeting.

Talking to some colleagues, we couldn’t help but notice that maybe in another era this decision would have been good policy. The problem, some concluded, was the influence of social media today. In fact, the Trump debacle did cause outcry among leading political voices online.

I wanted to investigate this further, and thankfully for me, I’ve been using R to collect tweets from a catalog of leading political personalities in Mexico for a personal business project.

Here is a short descriptive look at what the 65 twitter accounts I’m following tweeted between August 27th and September 5th (the Donald announced his visit on August the 30th). I’m sorry I can’t share the dataset, but you get the idea with the code…

library(dplyr)
library(stringr)

# 42 of the 65 accounts tweeted between those dates.
d %>%
  summarise("n" = n_distinct(NOMBRE))
#   n
#  42

We can see how mentions of Trump spike just about the time the visit was announced…

byhour <- d %>%
  mutate("MONTH" = as.numeric(month(T_CREATED)),
         "DAY" = as.numeric(day(T_CREATED)),
         "HOUR" = as.numeric(hour(T_CREATED)),
         "TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  group_by(MONTH, DAY, HOUR) %>%
  summarise("N" = n(),
            "TRUMP_MENTIONS" = sum(TRUMP_MENTION)) %>%
  mutate("PCT_MENTIONS" = TRUMP_MENTIONS/N*100) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-", MONTH, "-", DAY, " ", HOUR, ":00")))

library(ggplot2)
library(eem)
ggplot(byhour,
       aes(x = CHART_DATE,
           y = PCT_MENTIONS)) +
  geom_line(colour = eem_colors[1]) +
  theme_eem() +
  labs(x = "Time",
       y = "Trump mentions \n (% of Tweets)")

Trump tweets by mexican officials, percent

The peak of mentions (as a percentage of tweets) was September 1st at 6 am (75%). But in terms of the number of tweets, it is much more obvious that the outcry followed the announcement and later the visit of the candidate:

ggplot(byhour,
       aes(x = CHART_DATE,
           y = TRUMP_MENTIONS)) +
  geom_line(colour = eem_colors[1]) +
  theme_eem() +
  labs(x = "Time",
       y = "Trump mentions \n (# of Tweets)")

Trump tweets by mexican officials, total

We can also (sort-of) identify the effect of these influencers tweeting. I’m going to add the followers, which are potential viewers, of each tweet mentioning Trump, by hour.

byaudience <- d %>%
  mutate("MONTH" = as.numeric(month(T_CREATED)),
         "DAY" = as.numeric(day(T_CREATED)),
         "HOUR" = as.numeric(hour(T_CREATED)),
         "TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  filter(TRUMP_MENTION > 0) %>%
  group_by(MONTH, DAY, HOUR) %>%
  summarise("TWEETS" = n(),
            "AUDIENCE" = sum(U_FOLLOWERS)) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-", MONTH, "-", DAY, " ", HOUR, ":00")))

ggplot(byaudience,
       aes(x = CHART_DATE,
           y = AUDIENCE)) +
  geom_line(colour = eem_colors[1]) +
  theme_eem() +
  labs(x = "Time",
       y = "Potential audience \n (# of followers)")

Total audience of trump tweets

So clearly, I’m stating the obvious. People were talking. But how did the conversation develop? Let’s first look at the type of tweets (RTs vs tweets drafted individually):

bytype <- d %>%
  mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  # only the tweets that mention trump
  filter(TRUMP_MENTION > 0) %>%
  group_by(T_ISRT) %>%
  summarise("count" = n())

kable(bytype)
T_ISRT   count
FALSE    313
TRUE     164

About 1 in 3 was an RT. Compared to the overall tweets (1,389 out of 3,833), this is not much of a difference, so it wasn’t necessarily an influencer pushing the discourse. In terms of the most mentioned accounts, it was our President in the spotlight:

bymentionchain <- d %>%
  mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  # only the tweets that mention trump
  group_by(TRUMP_MENTION, MENTION_CHAIN) %>%
  summarise("count" = n()) %>%
  ungroup() %>%
  mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "EPN", x = MENTION_CHAIN),
                                  "EPN", MENTION_CHAIN)) %>%
  mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "realDonaldTrump", x = MENTION_CHAIN),
                                  "realDonaldTrump", GROUPED_CHAIN))

ggplot(order_axis(bymentionchain %>%
                    filter(count > 10 & GROUPED_CHAIN != "ND"),
                  axis = GROUPED_CHAIN,
                  column = count),
       aes(x = GROUPED_CHAIN_o,
           y = count)) +
  geom_bar(stat = "identity") +
  theme_eem() +
  labs(x = "Mention chain \n (separated by _|.|_ )", y = "Tweets")

Mentions

How about the actual persons who tweeted? It seems news anchor Joaquin Lopez-Doriga and security analyst Alejandro Hope were the most vocal about the visit (out of the influencers I’m following).

bytweetstar <- d %>%
  mutate("TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump") < 1, 0, 1)) %>%
  group_by(TRUMP_MENTION, NOMBRE) %>%
  summarise("count" = n_distinct(TXT))

## plot with ggplot2

Mentions

I also grouped each person by political affiliation, and I found it confirms the notion that the conversation on the eve of the visit, at least among this very small subset of twitter accounts, was driven by those with no party affiliation or in the “PAN” (opposition party).

byafiliation <- d %>%
  mutate("MONTH" = as.numeric(month(T_CREATED)),
         "DAY" = as.numeric(day(T_CREATED)),
         "HOUR" = as.numeric(hour(T_CREATED)),
         "TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump") > 0, 1, 0)) %>%
  group_by(MONTH, DAY, HOUR, TRUMP_MENTION, AFILIACION) %>%
  summarise("TWEETS" = n()) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-", MONTH, "-", DAY, " ", HOUR, ":00")))

ggplot(byafiliation,
       aes(x = CHART_DATE,
           y = TWEETS,
           group = AFILIACION,
           fill = AFILIACION)) +
  geom_bar(stat = "identity") +
  theme_eem() +
  scale_fill_eem(20) +
  facet_grid(TRUMP_MENTION ~ .) +
  labs(x = "Time", y = "Tweets \n (By mention of Trump)")

Mentions

However, it’s interesting to note the small spike from accounts affiliated with the PRI (party in power) on the day after his visit (Sept. 1st). Maybe they were trying to drive the conversation to another place?

To leave a comment for the author, please follow the link and comment on their blog: En El Margen - R-English.


Machine Learning for Drug Adverse Event Discovery


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us to create clusters based on gender, age, outcome of adverse event, route drug was administered, purpose the drug was used for, body mass index, etc. This can help for quickly discovering hidden associations between drugs and adverse events.

Clustering is an unsupervised learning technique which has wide applications. Some examples where clustering is commonly applied are market segmentation, social network analytics, and astronomical data analysis. Clustering is the grouping of data into sub-groups so that objects within a cluster are highly similar to one another, but very dissimilar to objects in other clusters. For clustering, each pattern is represented as a vector in multidimensional space and a distance measure is used to find the dissimilarity between the instances. In this post, we will see how we can use hierarchical clustering to identify drug adverse events. You can read about hierarchical clustering on Wikipedia.

Data

Let’s create fake drug adverse event data where we can visually identify the clusters and see if our machine learning algorithm can identify the clusters. If we have millions of rows of adverse event data, clustering can help us to summarize the data and get insights quickly.

Let’s assume a drug AAA results in adverse events shown below. We will see in which group (cluster) the drug results in what kind of reactions (adverse events). In the table shown below, I have created four clusters:

  • Route=ORAL, Age=60s, Sex=M, Outcome code=OT, Indication=RHEUMATOID ARTHRITIS and Reaction=VASCULITIC RASH + some noise
  • Route=TOPICAL, Age=early 20s, Sex=F, Outcome code=HO, Indication=URINARY TRACT INFECTION and Reaction=VOMITING + some noise
  • Route=INTRAVENOUS, Age=about 5, Sex=F, Outcome code=LT, Indication=TONSILLITIS and Reaction=VOMITING + some noise
  • Route=OPHTHALMIC, Age=early 50s, Sex=F, Outcome code=DE, Indication=Senile osteoporosis and Reaction=Sepsis + some noise

Below is a preview of my data. You can download the data here

head(my_data)
  route age sex outc_cod              indi_pt              pt
1  ORAL  63   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
2  ORAL  66   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
3  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
4  ORAL  57   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
5  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
6  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH

Hierarchical Clustering

To perform hierarchical clustering, we need to change the text to numeric values so that we can calculate distances. Since age is numeric, we will remove it from the rest of the variables and change the character variables to multidimensional numeric space.

age = my_data$age
my_data = select(my_data, -age)

Create a Matrix

mydata = my_data   # keep a copy of the character-only columns for later use
my_matrix = as.data.frame(do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x))))
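As an aside, caret’s dummyVars() produces an equivalent 0/1 encoding (one indicator column per level; the column names will differ slightly from the table() approach above). A sketch:

library(caret)

# Convert the character columns to factors, then expand each into indicator columns
my_data_f <- as.data.frame(lapply(mydata, as.factor))
dv <- dummyVars(~ ., data = my_data_f)
alt_matrix <- as.data.frame(predict(dv, newdata = my_data_f))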

Now, we can add the age column:

my_matrix$Age = age
head(my_matrix)
  INTRAVENOUS OPHTHALMIC ORAL TOPICAL F M DE HO LT OT RHEUMATOID ARTHRITIS Senile osteoporosis
1           0          0    1       0 0 1  0  0  0  1                    1                   0
2           0          0    1       0 1 0  0  0  0  1                    1                   0
3           0          0    1       0 0 1  0  0  0  1                    1                   0
4           0          0    1       0 0 1  0  0  0  1                    1                   0
5           0          0    1       0 0 1  0  0  0  1                    1                   0
6           0          0    1       0 0 1  0  0  0  1                    1                   0
  TONSILLITIS URINARY TRACT INFECTION Sepsis VASCULITIC RASH VOMITING Age
1           0                       0      0               1        0  63
2           0                       0      0               1        0  66
3           0                       0      0               1        0  66
4           0                       0      0               1        0  57
5           0                       0      0               1        0  66
6           0                       0      0               1        0  66

Let’s normalize our variables using caret package.

library(caret)
preproc = preProcess(my_matrix)
my_matrixNorm = as.matrix(predict(preproc, my_matrix))

Next, let’s calculate distance and apply hierarchical clustering and plot the dendrogram.

distances = dist(my_matrixNorm, method = "euclidean")
clusterdrug = hclust(distances, method = "ward.D")
plot(clusterdrug, labels = FALSE, cex = 0.5, xlab = "", sub = "")

You will get this plot:

fig1

From the dendrogram shown above, we see that four distinct clusters can be created from the fake data we created. Let’s use different colors to identify the four clusters.

library(dendextend)
dend <- as.dendrogram(clusterdrug)
# Color the branches based on the clusters:
dend <- color_branches(dend, k=4) #, groupLabels=iris_species)
# We hang the dendrogram a bit:
dend <- hang.dendrogram(dend, hang_height=0.1)
# reduce the size of the labels:
# dend <- assign_values_to_leaves_nodePar(dend, 0.5, "lab.cex")
dend <- set(dend, "labels_cex", 0.5)
plot(dend)

Here is the plot:

fig2

Now, let’s create cluster groups with four clusters.

clusterGroups = cutree(clusterdrug, k = 4)

Now, let’s add the clusterGroups column to the original data.

my_data = cbind(data.frame(Cluster=clusterGroups), my_data, age)
head(my_data)
  Cluster route sex outc_cod              indi_pt              pt age
1       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  63
2       1  ORAL   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
3       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
4       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  57
5       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
6       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66

Number of Observations in Each Cluster

observationsH = c()
for (i in seq(1, 4)) {
  observationsH = c(observationsH, length(subset(clusterGroups, clusterGroups == i)))
}
observationsH = as.data.frame(list(cluster = c(1:4), Number_of_observations = observationsH))
observationsH
  cluster Number_of_observations
1       1                     20
2       2                     13
3       3                     15
4       4                     24
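
The loop above works, but base R’s table() gives the same counts in a single call:

# Cluster sizes in one line:
table(clusterGroups)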

What is the most common observation in each cluster?

Let’s calculate the column averages for each cluster.

z = do.call(cbind, lapply(1:4, function(i) round(colMeans(subset(my_matrix, clusterGroups == i)), 2)))
colnames(z) = paste0('cluster', seq(1, 4))
z
                     cluster1 cluster2 cluster3 cluster4
INTRAVENOUS              0.00     0.00     1.00     0.00
OPHTHALMIC               0.00     0.00     0.00     0.92
ORAL                     1.00     0.08     0.00     0.08
TOPICAL                  0.00     0.92     0.00     0.00
F                        0.10     0.85     0.80     1.00
M                        0.90     0.15     0.20     0.00
DE                       0.00     0.00     0.00     0.83
.....
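
The same per-cluster means can be obtained in one line with base R’s aggregate(), shown here only as a cross-check; for the 0/1 indicator columns the means are the share of records in each cluster that have that level:

# Per-cluster column means; indicator-column means are within-cluster proportions.
round(aggregate(my_matrix, by = list(cluster = clusterGroups), FUN = mean), 2)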

Next, let’s find the most common observation in each cluster:

# The loop below needs only the five character variables, so drop the
# Cluster and age columns that were added back to my_data above.
char_data = my_data[, !(names(my_data) %in% c("Cluster", "age"))]

Age = z[nrow(z), ]         # the last row of z holds the cluster mean ages
z = z[1:(nrow(z) - 1), ]   # the remaining rows are the indicator columns

my_result = matrix(0, ncol = 4, nrow = ncol(char_data))
for (i in seq(1, 4)) {
  for (j in seq(1, ncol(char_data))) {
    q = names(char_data)[j]
    q = as.vector(as.matrix(unique(char_data[q])))               # levels of this variable
    my_result[j, i] = names(sort(z[q, i], decreasing = TRUE)[1]) # most frequent level
  }
}
colnames(my_result) = paste0('Cluster', seq(1, 4))
rownames(my_result) = names(char_data)
my_result = rbind(Age, my_result)
my_result <- cbind(Attribute = c("Age", "Route", "Sex", "Outcome Code",
                                 "Indication preferred term", "Adverse event"),
                   my_result)
rownames(my_result) <- NULL
my_result
  Attribute                 Cluster1             Cluster2                Cluster3    Cluster4
  Age                       61.8                 17.54                   5.8         44.62
  Route                     ORAL                 TOPICAL                 INTRAVENOUS OPHTHALMIC
  Sex                       M                    F                       F           F
  Outcome Code              OT                   HO                      LT          DE
  Indication preferred term RHEUMATOID ARTHRITIS URINARY TRACT INFECTION TONSILLITIS Senile osteoporosis
  Adverse event             VASCULITIC RASH      VOMITING                VOMITING    Sepsis
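
For readers who prefer the tidyverse, here is a rough sketch of the same summary with dplyr (version 1.0 or later for across()); mode_of is a small helper defined here, not a dplyr function:

library(dplyr)

# Most frequent value of a vector.
mode_of <- function(x) names(sort(table(x), decreasing = TRUE))[1]

my_data %>%
  group_by(Cluster) %>%
  summarise(across(c(route, sex, outc_cod, indi_pt, pt), mode_of),
            mean_age = mean(age))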

Summary

We have created four clusters using hierarchical clustering. Cluster 1, for example, consists mainly of men in their early 60s who developed a vasculitic rash when taking the drug orally for rheumatoid arthritis. The other clusters can be interpreted similarly. Remember, this is not real data; it is simulated data made up to show how clustering can be applied to drug adverse event studies. Still, this short post shows that clustering can be used for knowledge discovery in drug adverse event reactions. Especially when the data has millions of observations and we cannot get any insight visually, clustering becomes handy for summarizing our data, gaining statistical insights, and discovering new knowledge.

    Related Post

    1. GoodReads: Exploratory data analysis and sentiment analysis (Part 2)
    2. GoodReads: Webscraping and Text Analysis with R (Part 1)
    3. Euro 2016 analytics: Who’s playing the toughest game?
    4. Integrating R with Apache Hadoop
    5. Plotting App for ggplot2

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Replicating Plots – Boxplot Exercises

    $
    0
    0

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    R’s boxplot function has a lot of useful parameters that let us change the behaviour and appearance of boxplot graphs. In these exercises we will use those parameters to replicate the visual style of Matlab’s boxplot. Before trying them out, please make sure that you are familiar with the following functions: bxp, boxplot, axis, mtext

    Here is the plot we will be replicating: matlab_boxplot

    We will be using the same iris dataset which is available in R by default in the variable of the same name – iris. The exercises will require you to make incremental changes to the default boxplot style.
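
    If the formula interface is new to you, this is the general pattern the exercises build on. It uses Sepal.Length rather than Sepal.Width so it does not give away Exercise 1:

    # Default R boxplot of a numeric variable stratified by a factor.
    boxplot(Sepal.Length ~ Species, data = iris,
            main = "Default R boxplot", ylab = "Sepal length (cm)")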

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1 Make a default boxplot of Sepal.Width stratified by Species.

    Exercise 2 Change the range of the y-axis so it starts at 2 and ends at 4.5.

    Exercise 3 Modify the boxplot function so it doesn’t draw ticks nor labels of the x and y axes.

    Exercise 4 Add notches (triangular dents around the median representing confidence intervals) to the boxes in the plot.

    Exercise 5 Increase the distance between boxes in the plot.

    Exercise 6 Change the color of the box borders to blue.

    Exercise 7 a. Change the color of the median lines to red. b. Change the line width of the median line to 1.

    Exercise 8 a. Change the color of the outlier points to red. b. Change the symbol of the outlier points to “+”. c. Change the size of the outlier points to 0.8.

    Exercise 9 a. Add the title to the boxplot (try to replicate the style of matlab’s boxplot). b. Add the y-axis label to the boxplot (try to replicate the style of matlab’s boxplot).

    Exercise 10 a. Add x-axis (try to make it resemble the x-axis in the matlab’s boxplot) b. Add y-axis (try to make it resemble the y-axis in the matlab’s boxplot) c. Add the y-axis ticks on the other side.

    NOTE: You can use format(as.character(c(2, 4.5)), drop0trailing=TRUE, justify="right") to obtain the text for y-axis labels.
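
    As a warm-up for Exercises 9 and 10 (not a solution to them), here is a generic sketch of how axis() and mtext() add custom axes and margin text to a plot drawn without them:

    # Draw a plot with no axes, then add them manually.
    plot(iris$Sepal.Length, iris$Sepal.Width, axes = FALSE, xlab = "", ylab = "")
    axis(side = 1)                              # x-axis along the bottom
    axis(side = 2, las = 1)                     # y-axis with horizontal labels
    axis(side = 4, labels = FALSE)              # ticks only, on the right-hand side
    mtext("Sepal length", side = 1, line = 3)   # axis title in the bottom margin
    mtext("Sepal width", side = 2, line = 3)    # axis title in the left margin
    box()                                       # plot frame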

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


