
Upgrading to plotly 4.0 (and above)


(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

By Carson Sievert, lead Plotly R developer

I’m excited to announce that plotly’s R package just sent its first CRAN update in nearly four months. To install the update, run install.packages("plotly").

This update has breaking changes, enables new features, fixes numerous bugs, and takes us from version 3.6.0 to 4.5.2. To see all the changes, I encourage you to read the NEWS file. In this post, I’ll highlight the most important changes, explain why they needed to happen, and provide some tips for fixing errors brought about by this update. As you’ll see, this update is mostly about improving the plot_ly() interface, so ggplotly() users won’t notice much (if any) change. I’ve also started a plotly for R book which provides more narrative than the documentation on https://plot.ly/r (which is now updated to 4.0), more recent examples, and features exclusive to the R package. The first three chapters are nearly finished and replace the package vignettes. The later chapters are still in their beginning stages – they discuss features that are still under development, but I plan to add stability and more documentation in the coming months.

Formula mappings

In the past, you could use an expression to reference variable(s) in a data frame, but this no longer works. Consequently, you might see an error like this when you update:

library(plotly)
plot_ly(mtcars, x = mpg, y = sqrt(wt))
## Error in plot_ly(mtcars, x = mpg, y = sqrt(wt)): object 'wt' not found

plot_ly() now requires a formula (which is basically an expression with a ~ prefixed) when referencing variables. You do not have to use a formula to reference objects that exist in the namespace, but I recommend it, since it helps populate sensible axis/guide title defaults (e.g., compare the output of plot_ly(z = volcano) with plot_ly(z = ~volcano)).

plot_ly(mtcars, x = ~mpg, y = ~sqrt(wt))


There are a number of technical reasons why imposing this change from expressions to formulas is a good idea. If you’re interested in the details, I recommend reading Hadley Wickham’s notes on non-standard evaluation, but here’s the gist of the situation:

  1. Since formulas capture the environment in which they are created, we can be confident that evaluation rules are always correct, no matter the context.
  2. Compared to expressions/symbols, formulas are easier to program with, which makes writing custom functions around plot_ly() easier.
myPlot <- function(x, y, ...) {
  plot_ly(mtcars, x = x, y = y, color = ~factor(cyl), ...)
}
myPlot(~mpg, ~disp, colors = "Dark2")


Also, it’s fairly easy to convert a string to a formula (e.g., as.formula("~sqrt(wt)")). This trick can be quite useful when programming in shiny, where a variable mapping may depend on an input value.
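Here is a minimal sketch of that idea in a shiny app (the input/output names are made up for illustration):

library(shiny)
library(plotly)

ui <- fluidPage(
  selectInput("yvar", "Y variable", choices = names(mtcars), selected = "wt"),
  plotlyOutput("p")
)

server <- function(input, output, session) {
  output$p <- renderPlotly({
    # build the formula from the input string, then map it to y
    plot_ly(mtcars, x = ~mpg, y = as.formula(paste0("~", input$yvar)))
  })
}

shinyApp(ui, server)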

Smarter defaults

Instead of always defaulting to a “scatter” trace, plot_ly() now infers a sensible trace type (and other attribute defaults) based on the information provided. These defaults are determined by inspecting the vector type (e.g., numeric/character/factor/etc) of positional attributes (e.g., x/y). For example, if we supply a discrete variable to x (or y), we get a vertical (or horizontal) bar chart:

subplot(
  plot_ly(diamonds, y = ~cut, color = ~clarity),
  plot_ly(diamonds, x = ~cut, color = ~clarity),
  margin = 0.07
) %>% hide_legend()


Or, if we supply two discrete variables to both x and y:

plot_ly(diamonds, x = ~cut, y = ~clarity)


Also, the order of categories on a discrete axis, by default, is now either alphabetical (for character strings) or matches the ordering of factor levels. This makes it easier to sort categories according to something meaningful, rather than the order in which the categories appear (the old default). If you prefer the old default, use layout(categoryorder = "trace").

library(dplyr)
# order the clarity levels by their median price
d <- diamonds %>%
  group_by(clarity) %>%
  summarise(m = median(price)) %>%
  arrange(m)
diamonds$clarity <- factor(diamonds$clarity, levels = d[["clarity"]])
plot_ly(diamonds, x = ~price, y = ~clarity, type = "box")


plot_ly() now initializes a plot

Previously, plot_ly() always produced at least one trace, even when using add_trace() to add on more traces (if you’re familiar with ggplot2 lingo, a trace is similar to a layer). From now on, you’ll have to specify the type in plot_ly() if you want it to always produce a trace:

subplot(
  plot_ly(economics, x = ~date, y = ~psavert, type = "scatter") %>% 
    add_trace(y = ~uempmed) %>%
    layout(yaxis = list(title = "Two Traces")),
  plot_ly(economics, x = ~date, y = ~psavert) %>% 
    add_trace(y = ~uempmed) %>% 
    layout(yaxis = list(title = "One Trace")),
  titleY = TRUE, shareX = TRUE, nrows = 2
) %>% hide_legend()


Why enforce this change? Often, when composing a plot with multiple traces, you have attributes that are shared across traces (i.e., global) and attributes that are not. By allowing plot_ly() to simply initialize the plot and define global attributes, it makes for a much more natural way to describe such a plot. Consider the next example, where we declare x/y (longitude/latitude) attributes and alpha transparency globally, but alter trace-specific attributes in add_trace()-like functions. This example also takes advantage of a few other new features:

  1. The group_by() function which defines “groups” within a trace (described in more detail in the next section).
  2. New add_*() functions which behave like add_trace(), but are higher-level since they assume a trace type, might set some attribute values (e.g., add_markers() sets the scatter trace mode to markers), and might trigger other data processing (e.g., add_lines() is essentially the same as add_paths(), but guarantees values are sorted along the x-axis).
  3. Scaling is avoided for “AsIs” values (i.e., values wrapped with I()), which makes it easier to directly specify a constant value for a visual attribute (as opposed to mapping data values to visuals).
  4. More support for R’s graphical parameters such as pch for symbols and lty for linetypes.
map_data("world", "canada") %>%
  group_by(group) %>%
  plot_ly(x = ~long, y = ~lat, alpha = 0.1) %>%
  add_polygons(color = I("black"), hoverinfo = "none") %>%
  add_markers(color = I("red"), symbol = I(17),
              text = ~paste(name, "<br />", pop),
              hoverinfo = "text", data = maps::canada.cities) %>%
  hide_legend()


New interpretation of group

The group argument in plot_ly() has been removed in favor of the group_by() function. In the past, the group argument incorrectly created multiple traces. If you want that same behavior, use the new split argument; groups are now used to define “gaps” within a trace. This is more consistent with how ggplot2’s group aesthetic is translated in ggplotly(), and is much more efficient than plotting a trace for each group.

txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.3)


The default hovermode (compare data on hover) isn’t super useful here since we have only 1 trace to compare, so you may want to add layout(hovermode = "closest") when using group_by(). If your group sizes aren’t that large, you may want to use split to generate one trace per group, then set a constant color (using the I() function to avoid scaling).

txhousing %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(split = ~city, color = I("steelblue"), alpha = 0.3)
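And going back to the group_by() version, the “closest” hovermode suggested above is just one extra layout() call (a small sketch reusing the earlier example):

txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~median) %>%
  add_lines(alpha = 0.3) %>%
  layout(hovermode = "closest")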


In the coming months, we will have better ways to identify/highlight groups to help combat overplotting (see here for example). This same interface can be used to coordinate multiple linked plots, which is a powerful tool for exploring multivariate data and presenting multivariate results (see here and here for examples).

New plotly object representation

Prior to version 4.0, plotly functions returned a data frame with special attributes attached (needed to track the plot’s attributes). At the time, I thought this was the right way to enable a “data-plot-pipeline” where a plot is described as a sequence of visual mappings and data manipulations. For a number of reasons, I’ve been convinced otherwise, and decided the central plotly object should inherit from an htmlwidget object instead. This change does not destroy our ability to implement a “data-plot-pipeline”, but it does, in a sense, constrain the set of manipulations we can perform on a plotly object. Below is a simple example of transforming the data underlying a plotly object using dplyr’s mutate() and filter() verbs (the plotly book has a whole section on the data-plot-pipeline, if you’d like to learn more).

library(dplyr)
economics %>%
  plot_ly(x = ~date, y = ~unemploy / pop, showlegend = F) %>%
  add_lines(linetype = I(22)) %>%
  mutate(rate = unemploy / pop) %>% 
  slice(which.max(rate)) %>%
  add_markers(symbol = I(10), size = I(50)) %>%
  add_annotations("peak")


In this context, I’ve often found it helpful to inspect the (most recent) data associated with a particular plot, which you can do via plotly_data():

diamonds %>%
  group_by(cut) %>%
  plot_ly(x = ~price) %>%
  plotly_data()
## Source: local data frame [53,940 x 10]
## Groups: cut [5]
## 
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1   0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2   0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3   0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4   0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5   0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
## 9   0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows

To keep up to date with currently supported data manipulation verbs, please consult the help(reexports) page, and for more examples, check out the examples section under help(plotly_data).

This change in the representation of a plotly object also has important implications for folks using plotly_build() to “manually” access or modify a plot’s underlying spec. Previously, this function returned the JSON spec as an R list, but it now returns more “meta” information about the htmlwidget, so in order to access that same list, you have to grab the “x” element. The new as_widget() function (different from the now deprecated as.widget() function) is designed to turn a plotly spec into an htmlwidget object.

# the style() function provides a more elegant way to do this sort of thing,
# but I know some people like to work with the list object directly...
pl <- plotly_build(qplot(1:10))[["x"]]
pl$data[[1]]$hoverinfo <- "none"
as_widget(pl)
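For reference, a rough equivalent using style() might look like this (assuming the hoverinfo tweak should apply to the first, and only, trace):

p <- ggplotly(qplot(1:10))
style(p, hoverinfo = "none", traces = 1)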


Conclusion

The latest CRAN release upgrades plotly’s R package from version 3.6.0 to 4.5.2. This upgrade includes a number of breaking changes, as well as a ton of new features and bug fixes. The time spent upgrading your code will be worth it, as it unlocks those new features and provides a better foundation for advancing the plot_ly() interface (not to mention the linked highlighting stuff we have on tap). This post should provide the information necessary to fix these breaking changes, but if you have any trouble upgrading, please let us know on http://community.plot.ly. Happy plotting!



A book on RStan in Japanese: Bayesian Statistical Modeling Using Stan and R (Wonderful R, Volume 2)


(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

Bayesian Statistical Modeling Using Stan and R (book cover)

Wonderful, indeed, to have an RStan book in Japanese:

Google translate makes the following of the description posted on Amazon Japan (linked from the title above):

In recent years, understanding of the phenomenon by fitting a mathematical model using a probability distribution on data and prompts the prediction “statistical modeling” has attracted attention. Advantage when compared with the existing approach is both of the goodness of the interpretation of the ease and predictability. Since interpretation is likely to easily connect to the next action after estimating the values ​​in the model. It is rated as very effective technique for data analysis Therefore reality.

In the background, the improvement of the calculation speed of the computer, that the large scale of data becomes readily available, there are advances in stochastic programming language to very simple trial and error of modeling. From among these languages, in this document to introduce Stan is a free software. Stan is a package which is advancing rapidly the development equipped with a superior algorithm, it can easily be used from R because the package for R RStan has been published in parallel. Descriptive power of Stan is high, the hierarchical model and state space model can be written in as little as 30 lines, estimated calculation is also carried out automatically. Further tailor-made extensions according to the analyst of the problem is the easily possible.

In general, dealing with the Bayesian statistics books or not to remain in rudimentary content, what is often difficult application to esoteric formulas many real problem. However, this book is a clear distinction between these books, and finished to a very practical content put the reality of the data analysis in mind. The concept of statistical modeling was wearing through the Stan and R in this document, even if the change is grammar of Stan, even when dealing with other statistical modeling tools, I’m sure a great help.

I’d be happy to replace this with a proper translation if there’s a Japanese speaker out there with some free time (Masanao Yajima translated the citation for us).

Big in Japan?

I’d like to say Stan’s big in Japan, but that idiom implies it’s not so big elsewhere. I can say there’s a very active Twitter community tweeting about Stan in Japanese, which we follow occasionally using Google Translate.

The post A book on RStan in Japanese: Bayesian Statistical Modeling Using Stan and R (Wonderful R, Volume 2) appeared first on Statistical Modeling, Causal Inference, and Social Science.


Introduction to BiclustGUI


BiclustGUI in R

Ewoud De Troyer, University of Hasselt (CenStat)

Biclustering in a GUI & Growing a GUI together!

Introduction

We are happy to announce the first release of the BiclustGUI on CRAN (RcmdrPlugin.BiclustGUI)! This GUI will enable you to quickly try out a wide array of biclustering algorithms and produce some helpful graphs in order to explore your data. Since we made the choice of developing it in the form of a plug-in for R Commander, you can save your R code after the session, which can then be used without GUI intervention. However, for those of you who love using Shiny, have no fear! We have also created a Shiny App including all the biclustering algorithms and the most interesting plots. (For all available diagnostics and graphs, you will have to head over to the BiclustGUI package itself though.)

This blog is meant as a short introduction to the GUI which will highlight some features, give an example or two and showcase what else we have done in this area. A detailed instruction guide about all aspects of the GUI (as well as some short explanation about the included algorithms) can be found in the form of a vignette here. [1]

What is biclustering?

If you are familiar with the concept of biclustering, you can safely skip this section. If you are not, biclustering is actually a fairly easy concept! Say you have a matrix (M): instead of just clustering on a single dimension such as the columns (= finding similar columns based on all the rows), you cluster simultaneously on both dimensions. This means you are trying to discover a subset of columns which are similar on only a subset of rows, or vice versa. The submatrix which contains this local pattern is what we call a bicluster. There are a great many different algorithms to find these submatrices. For example, they can differ in which assumption they make about the pattern inside the bicluster (constant/evolution or additive/multiplicative). They can be based on a specific model, random initialisations, the type of data (binary/continuous),… too much to sum up here! In the GUI we have tried to include a large variety of these different methods.
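As a quick stand-alone illustration (not taken from the GUI itself), here is a minimal sketch that plants two biclusters in a random matrix and searches for them with the Plaid method from the biclust package:

library(biclust)

set.seed(1)
m <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
m[1:20, 1:10]   <- m[1:20, 1:10]   + 3   # planted bicluster 1
m[40:60, 25:35] <- m[40:60, 25:35] + 3   # planted bicluster 2

res <- biclust(m, method = BCPlaid())
res                                      # prints the number of biclusters found
if (res@Number > 0) drawHeatmap(m, res, number = 1)  # heatmap of the first bicluster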

Heatmap of Example Data Matrix with 2 Biclusters

The BiclustGUI R Package

Let me start with some motivation for why we decided to create this GUI and what we tried to achieve. As with many data analysis workflows, there exist a lot of different R packages for biclustering. Since plowing through all of the reference manuals to discover all the functions and their arguments can be quite time-consuming, one of our goals was to alleviate this process. This is why we chose to create the BiclustGUI as an R Commander plug-in. On one side you have the easy point-and-click environment, and on the other side your R code, which uses the functions from the corresponding packages, is being generated. We hope that, by doing this, the GUI is a helpful tool to quickly get into some biclustering exploration, while not limiting you from playing around with the code after its initial generation.

When developing the GUI, we also wanted to provide a unified platform from which all of these packages are connected and can be accessed (both for the methods and diagnostics packages). The table at the end of this section shows which ones are included so far and the Examples section will show you the default look of a biclustering window. This means that as long as you are using the interface, you do not have to worry about which R output object goes where in which argument of another function, even if the functions for the method or plot come from different packages. The generated R code will reflect this correctly.

The final aim, but no less important, was to create a growing GUI. Since new biclustering methods/diagnostics and packages are still being developed, we required a way for the GUI to grow alongside these new developments. The way we tackled this was by including easy “fill-out” scripts which developers can use to create a GUI window themselves for their novel method. As soon as the window fits their needs, they can send us this script so that we can include the new method in the next release of the GUI with minimal work from our side. I’ll talk more about these scripts in a later section, but the general gist is that you fill out some general information about your method and then copy-paste which widgets should be used in your window! These scripts should be fairly easy to get an understanding of, since the tcltk syntax has been completely omitted from them.

All included biclustering and diagnostics packages in the BiclustGUI:
R Package      Biclustering Method                                 Publication
biclust        Plaid                                               Turner et al., 2005
biclust        δ-biclustering                                      Cheng and Church, 2000
biclust        X Motif                                             Murali and Kasif, 2003
biclust        Spectral                                            Kluger et al., 2003
biclust        QuestMotif                                          Kaiser, 2011
biclust        Bimax                                               Prelic et al., 2006
fabia          FABIA                                               Hochreiter et al., 2010
isa2           The Iterative Signature Algorithm                   Bergman et al., 2003
iBBiG          Iterative Binary BIclustering of Genesets           Gusenleitner et al., 2012
rqubic         Qualitative Biclustering                            Li et al., 2009
BicARE         Biclustering Analysis and Results Exploration       Gestraud and Barillot, 2014
s4vd           SSVD (Sparse Singular Value Decomposition)          Lee et al., 2010
s4vd           S4VD (SSVD incorporating stability correction)      Sill et al., 2011

R Package      Diagnostics                                         Publication
BcDiag         Bicluster Diagnostic Plots                          Aregay et al., 2014
superbiclust   Generating Robust Biclusters from a Bicluster Set   Khamiakova, 2013

Installing the GUI

Before we continue to some examples, let’s install the GUI. To do so, please follow the code below:

setRepositories(ind = c(1:5))
install.packages("RcmdrPlugin.BiclustGUI")

This will install the GUI and its dependencies from both CRAN and Bioconductor. Should some issue arise with a package installation, please try to install the affected packages manually. The code for this can be found here.

On the initial start-up of R Commander, you will probably be prompted to install some additional dependencies. This should not take too long!

To launch the GUI, use:

library(RcmdrPlugin.BiclustGUI)

(Note: A development version can be found on GitHub and R-Forge).

What’s R Commander & The BiclustGUI Plug-in?

For those not familiar with R Commander (Rcmdr), it is a basic statistics GUI by John Fox based on the tcltk package. The basic R Commander window contains a script window, where the R code is generated, as well as a console and warning window. The BiclustGUI extends the R Commander window with two extra biclustering menus, as you can see in the figure below:

Basic R Commander Window and Plug-in Menu

Examples

All biclustering windows follow the same structure (see Figure below) of 2 tabs.

  • The first tab, the Biclustering Tab, includes all parameters, a seed box (if required) and a Show Results button to apply the algorithm.
  • The second tab, the Plots & Diagnostics Tab, will contain, as the name suggests, plots and diagnostics for the chosen methods.
  • At the bottom of the second tab, there are also buttons to access more general diagnostics packages. Currently BcDiag and superbiclust are included.

Standard Window

Let’s now look at 2 examples which mimic a short biclustering exploration. Both examples show the general workflow you can expect when using the BiclustGUI.

Plaid Example

  1. Apply Plaid with the chosen parameters.
  2. Go to the second tab.
  3. Generate a Profile Plot.

Plaid Windows

We could have also accessed the BcDiag button on the bottom of the second tab to create the following profile plots:

BcDiag Profile Plots

FABIA Example

  1. Apply FABIA with the chosen parameters. The corresponding R-code will be generated in the script window of R Commander.
  2. Go to the second tab.
  3. Use Biclust Plots button to access plots from the biclust package.
  4. Use the heatmap button to generate the graph in an R Graphics Device.

FABIA Windows

How about future algorithms? Including new methods!

As I explained in an earlier section, we wanted to develop an easy way to include new biclustering R packages in the future. We wanted to make sure the GUI stays up to date with recent developments by growing it as a community. The way we addressed this problem was through “fill-out” template scripts which do not require the original tcltk syntax. I will try to outline the general idea and workings behind these scripts, but by no means will this be a full tutorial! A much more detailed explanation can be found in the second part of the vignette.

The remainder of this section will be a bit more ‘R-code dense’ so if this does not interest you, you can safely skip to the next one, The Shiny App.

Window Function & Window Scripts

In Rcmdr or tcltk, a window is simply an R function. Inside this function the window, its interface and all its elements are defined. Calling this function will make the window appear in your R session. The script we provide will help you make such a function (e.g. newmethod_WINDOW()) in which the GUI is defined. This newly created window function can then be used by R Commander to call it from a menu.

Before we head over to such a “fill-out” script, I should briefly mention that there are actually 2 types of them:

  1. The first type is used to create a biclustering window (with its 2 tabs, biclustering and plots & diagnostics).
  2. The second type is used to create an extra tool window. This can serve as an extension of the first (e.g. linking tool window functions with buttons) or a general diagnostic window.

Both scripts are nearly identical. They only differ in the fact that the first type is for the two-tab (biclustering and plots) window, while the second type is only for a window with one tab, which follows the same rules as the second tab in the first script. Since they are so similar, let’s just focus on the first type.

Structure of a Window Script

The script, provided in newmethod_script.R in the doc folder of the BiclustGUI package, starts with opening the window function (here called newmethod_WINDOW, but you can change this). Next, some objects are initialized, followed by some variables which need to be changed to adapt it to the method you want to implement. These are variables such as the name of the method, the function used for the method, the argument of this function which accepts the data, etc. It is also possible to add some discretize or binarize frames to the window (see the figure below). (See the vignette for more detailed information about all of these variables!)

What follows next is the information that decides what the window will look like for the two tabs, but we will come back to that in a minute. At the very end of the script all the defined variables come together in the cluster_template() function, which translates all of this info to tcltk syntax.

newmethod_WINDOW <- function(){

  new.frames <- .initialize.new.frames()
  grid.config <- .initialize.grid.config()
  grid.rows <- .initialize.grid.rows()

  #####################################################
  # GENERAL INFORMATION ABOUT THE NEW METHOD/WINDOW   #
  #####################################################

  methodname <- "A new method"
  methodfunction <- "methodfunction"
  data.arg <- "d"
  data.matrix <- TRUE
  methodshow <- TRUE
  other.arg <- ""
  methodhelp <- ""

  # Extra Data Conversion Boxes
  data.discr <- FALSE
  data.bin <- FALSE

  # Possibility to give a seed?
  methodseed <- TRUE

  ## COMPATIBILITY? ##
  # BcDiag
  bcdiag.comp <- FALSE
  # SuperBiclust
  superbiclust.comp <- FALSE

  ######################
  # BICLUSTERING TAB   #
  ######################
  # OMITTED FOR NOW (SEE FURTHER)

  ############################
  # PLOTS & DIAGNOSTICS TAB  #
  ############################
  # OMITTED FOR NOW (SEE FURTHER)

  #############################################################
  # USE THE ARGUMENTS IN THE GENERAL CLUSTERTEMPLATE FUNCTION #
  #############################################################

  cluster_template(methodname = methodname, methodfunction = methodfunction,
                   methodhelp = methodhelp, data.arg = data.arg,
                   other.arg = other.arg, methodseed = methodseed,
                   grid.config = grid.config, grid.rows = grid.rows,
                   new.frames = new.frames,
                   superbiclust.comp = superbiclust.comp,
                   bcdiag.comp = bcdiag.comp, data.transf = data.transf,
                   data.discr = data.discr, data.bin = data.bin,
                   methodshow = methodshow, methodsave = methodsave)
}

Discretize and Binarize Frames

Designing the Window (of the 2 tabs)

So the only part that’s left now is to design how the two tabs should look. Both tabs are created in exactly the same way and follow 3 easy steps:

  1. Making the frames.
  2. Configuring the frames into a grid (matrix).
  3. Combining rows into a box.

Let’s look at how this works for the first tab. Note that this part starts by setting the input variable to "clusterTab" to indicate that we are adding information to the first tab.

##########################
# CLUSTERING TAB         #
##########################

input <- "clusterTab"

### 1. ADDING THE FRAMES ###
# Add frames here

### 2. CONFIGURING THE GRID ###
grid.config <- .grid.matrix(input = input,
                            c("frame1", "frame2", "frame3", NA, "frame4", NA),
                            nrow = 3, ncol = 2, byrow = TRUE,
                            grid.config = grid.config)

### 3. COMBINING THE ROWS ###
grid.rows <- .combine.rows(input = input, rows = c(1), title = "A nice box: ",
                           border = TRUE, grid.rows = grid.rows,
                           grid.config = grid.config)
grid.rows <- .combine.rows(input = input, rows = c(2, 3), title = "A nice box:",
                           border = TRUE, grid.rows = grid.rows,
                           grid.config = grid.config)

Step 1 In the first step we create the frames for the method function arguments. A variety of frames can be created:

  • Check Boxes
  • Radio Buttons
  • Entry Fields
  • Sliders
  • Spinboxes

The easiest way to create these is to open the frames_script.R file in the doc folder of the BiclustGUI and simply copy-paste the default version of one of these frames, then adapt the variables as you need them (frame name, argument names, initial values, title, border,…). Later in this section you can find a short example of how to do this. For example, the default script of the entry fields frame looks like this:
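(What follows is a hypothetical sketch of the general shape of such a frame; the actual template, and a screenshot of the resulting widget, are in frames_script.R and the vignette. The variable names below are illustrative assumptions.)

### ENTRY FIELDS FRAME (illustrative sketch; copy the real template from frames_script.R) ###
type <- "entryfields"                           # which widget this frame provides
frame.name <- "entryframe1"                     # name used later in the grid matrix
argument.names <- c("Argument 1", "Argument 2") # labels shown in the GUI
arguments <- c("arg1", "arg2")                  # matching arguments of the method function
initial.values <- c(1, 10)                      # default values shown in the entry fields
title <- "A Title"                              # optional frame title
border <- FALSE                                 # draw a box around the frame?
# ...followed by the (unchanged) call from the template that registers this frame
# in the new.frames object.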

It should be noted that in the second tab, plots & diagnostics, another type of frame can be added, namely manual buttons. These are used to call functions which calculate diagnostics, draw graphs or apply any other function you want. The default script for this frame follows the same pattern.

I would like to draw attention to the buttonfunction and arg.frames variables. The first determines which function is tied to the button; the second determines which frames (meaning which arguments) are tied to this function/button.

Step 2 Next you need to take the names of the frames defined in step 1 and put them into a matrix with the .grid.matrix() function, which is saved in the grid.config object. The way the names are placed in the matrix decides how the frames will appear in the window (although note that frames will always be pulled towards the top-left). Note that, apart from some extra arguments which should not be changed, .grid.matrix() accepts the same input as matrix().

Step 3 The final step allows you to group together, or put a box or title around, one or multiple rows of the earlier defined matrix. This helps to add visual distinction between parts of the window. Its other use is to prevent all frames from trying to fit into one general grid, which sometimes makes frames jump too far to the right. Putting them in a “box of rows” creates a subgrid in which they will try to fit. To this end the function .combine.rows() is used and the result is saved in the grid.rows object. The only arguments which should be changed are rows, title and border. Note that this can be done multiple times to put multiple rows in different boxes.

Making the second tab is completely similar to the steps described above, the only differences being that the input variable is set to "plotdiagTab" and that manual buttons can be added.

Workflow of creating of a new method window

This is how the general workflow of creating a new window looks:

  1. Open newmethod_script.R and start adjusting the information variables at the beginning (as well as the name of newmethod_WINDOW).
  2. Open frames_script.R and apply the 3 window steps for both tabs, copy-pasting and adapting the default frames you need.
  3. Run the window function while the BiclustGUI is launched and check if the design is correct.
  4. Send your new method addition to the maintainer of the BiclustGUI:
    • The script(s) of your new window(s).
    • A function that transforms the output of your biclustering algorithm to the Biclust class in the biclust package. This is an S4 object in which the most important slots are RowxNumber, NumberxCol and Number (a sketch of such a function follows this list). Providing this transformation function will make sure both the functions from BcDiag and superbiclust immediately work with your new method in the framework of the GUI.
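A hypothetical sketch of such a transformation function is given below. The structure of the method’s raw output (newresult) is made up for illustration; the slot names follow the Biclust class mentioned above.

library(biclust)

# newresult is assumed to be a list holding a logical rows-by-biclusters matrix,
# a logical biclusters-by-columns matrix and the number of biclusters found.
newmethod2biclust <- function(newresult) {
  new("Biclust",
      Parameters = list(),
      RowxNumber = newresult$rows,    # logical matrix: rows x biclusters
      NumberxCol = newresult$cols,    # logical matrix: biclusters x columns
      Number     = newresult$number,  # number of biclusters
      info       = list())
}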

Window creation examples

All included biclustering and diagnostics packages are already implemented in the GUI using these scripts. So for more examples, you can always take inspiration from the source code of the GUI, since it follows the same scripts as explained above. The vignette itself also includes some more examples.

To finalize this probably way too elaborate section about window creation, let’s take a look at some excerpts of the script which created the Plaid window!

The script starts by adapting the general information so that the variables fit the Plaid algorithm.

#####################################################
# GENERAL INFORMATION ABOUT THE NEW METHOD/WINDOW   #
#####################################################

methodname <- "Plaid"
methodfunction <- "biclust"
data.arg <- "x"
data.matrix <- TRUE
other.arg <- ",method=BCPlaid()"
methodhelp <- "BCPlaid"
methodseed <- TRUE
data.discr <- FALSE
data.bin <- FALSE
bcdiag.comp <- TRUE
superbiclust.comp <- TRUE

Next, we go on to step 1 of the first tab, defining the frames. In the figure below you can find 2 examples of these frames and how they look in the final interface.

Once step 1 is completed, we head over to steps 2 and 3 to configure the grid and create 2 boxes: one around row 1 and one around rows 2 and 3.

### 2. CONFIGURING THE GRID ###
grid.config <- .grid.matrix(input = input,
                            c("toclusterframe", "modelframe",
                              "backgroundcheckframe", NA,
                              "backgroundentryframe1", "backgroundentryframe2"),
                            byrow = TRUE, nrow = 3, ncol = 2,
                            grid.config = grid.config)

### 3. COMBINING THE ROWS ###
grid.rows <- .combine.rows(input = input, rows = c(1), title = "Plaid Specifications",
                           border = TRUE, grid.rows = grid.rows,
                           grid.config = grid.config)
grid.rows <- .combine.rows(input = input, rows = c(2, 3), title = "Layer Specifications",
                           border = TRUE, grid.rows = grid.rows,
                           grid.config = grid.config)

The same 3 steps are repeated for the second tab, which now includes buttons as well. The following figure shows how the heatmap button pulls its arguments from 2 different frames, a checkbox and an entry field.

At the end everything comes together again in the cluster_template() function which marks the end of the script.

The Shiny App

For those not interested in the R-code behind the biclustering algorithms or maybe for those who would like to introduce non-statisticians to biclustering, we have also created a shiny application which includes all of the currently implemented biclustering methods in the BiclustGUI. It is possible to draw heatmaps, profile plots and export your results. However for other diagnostics or the superbiclust method (= combining many biclustering results into robust biclusters), you will have to fall back to the original BiclustGUI in R Commander. (We do plan to include superbiclust in some form in the future.)

The Shiny App is available:

  1. online on the Shiny Cloud (currently only with limited resources!).
  2. as a stand-alone version right here (direct link for version 1.0.1).
    • Download the zip-file, extract it and double-click LAUNCH.vbs.

Shiny App – iBBiG

The REST Package

A side-product of the BiclustGUI project came in the form of another R package, REST (Rcmdr Easy Script Templates), a tool to create R Commander GUI plug-ins. It is currently also available on CRAN. For this package, we basically took the template scripts (which were specific to biclustering), generalized them and added some extra functionality.

This package is by no means as flexible or powerful as shiny, but it does provide a quick, easy and no-nonsense way to create an R Commander plug-in for your own analysis or R package, without having to bother with any tcltk syntax.

Essentially, it would be possible to completely recreate the BiclustGUI using the REST package.

Book Project

For those interested in more information on the Biclust GUI, I refer to the elaborate vignette included in the BiclustGUI package.

The GUI will also be featured in the upcoming book Applied Biclustering Methods for Big and High Dimensional Data Using R by Kasim, A., Shkedy, Z., Kaiser, S., Hochreiter, S. and Talloen, W. (28th of September 2016). The book handles various applications of biclustering in gene expression experiments, chemoinformatics, molecular modelling, etc. More information can be found here.

Contact

Please direct any questions/suggestions/bugs to ewoud.detroyer[at]uhasselt.be.

We are happy to take any feedback!

[1] Currently the vignette on CRAN still shows version 1.0.4. A more updated vignette can be found inside the package itself in version 1.0.6.

Quick wordclouds from PubMed abstracts – using PMID lists in R


(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

Wordclouds are one of the most visually straightforward, compelling ways of displaying text info in a graph.

Of course, there are a lot of web pages (and even apps) that, given an input text, will plot you some nice tagclouds. However, when you need reproducible results, or need to get complex tasks done – like combined wordclouds from several files – a programming environment may be the best option.

In R, there are (as always) several alternatives to get this done, such as tagcloud and wordcloud.

For this script I used the following packages:

  • RCurl to retrieve a PMID list, stored in my GitHub account as a .csv file.
  • RefManageR and plyr to retrieve and arrange PubMed records. To fetch the info from the internet, we’ll be using the PubMed API (free version, with some limitations).
  • Finally, tm and SnowballC to prepare the data and wordcloud to plot the wordcloud. This part of the script is based on this one from Georeferenced.

One of the advantages of using RefManageR is that you can easily change the field you are importing from, and it usually works flawlessly with the PubMed API.

My biggest problem sources when running this script: download caps, busy hours, and firewalls!

At the beginning of the gist, there is also a handy function that automagically downloads all needed packages for you.

To source the script, simply type in the R console:

library(devtools)
source_url("https://gist.githubusercontent.com/aurora-mareviv/697cbb505189591648224ed640e70fb1/raw/b42ac2e361ede770e118f217494d70c332a64ef8/pmid.tagcloud.R")

The full code is embedded in the original post as a gist.
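As a rough sketch of what such a pipeline can look like (this is not the original gist; the RefManageR/tm/wordcloud calls below are a hedged reconstruction, the PMIDs are placeholders, and it assumes the fetched records carry an abstract field):

library(RefManageR)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

pmids <- c("25101758", "24889601", "24885439")         # placeholder PMIDs
refs  <- GetPubMedByID(pmids)                          # fetch records via the PubMed API
abstracts <- unlist(refs$abstract)                     # assumes an "abstract" field is returned

corpus <- VCorpus(VectorSource(abstracts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# corpus <- tm_map(corpus, stemDocument)               # optional stemming via SnowballC

tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freqs), freqs, max.words = 100, colors = brewer.pal(8, "Dark2"))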

Enjoy!

wordcloud


2016 UK Tour


(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

I’ll be in the UK next week doing three talks in three days:

  • First, I’ll be giving a talk at the London R-Ladies meetup on Monday October 3rd with perhaps the best title yet: Whose Scat Is That? An ‘Easily Digestible’ Introduction to Predictive Modeling and caret.
  • On Tuesday, October 4th I’m giving a talk at the Cambridge RUG on tuning hyperparameters using optimization algorithms. This is an extension of this blog post.
  • Finally, on Wednesday (the 5th) at the fantastic Nonclinical Statistics Conference, I’ll be speaking on Statistical Mediation in Early Discovery by Bayesian Analysis and Visualization. The banner image above is from this talk. Lots of priors and shiny apps.


Fitting a distribution in Stan from scratch


(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Last week the French National Institute of Health and Medical Research (Inserm) organised with the Stan Group a training programme on Bayesian Inference with Stan for Pharmacometrics in Paris.

Daniel Lee and Michael Betancourt, who ran the course over three days, are not only members of Stan’s development team, but also excellent teachers. Both were supported by Eric Novik, who gave an Introduction to Stan talk at the Paris Dataiku User Group last week as well.

Eric Kramer (Dataiku), Daniel Lee, Eric Novik & Michael Betancourt (Stan Group)

I have been playing around with Stan on and off for some time, but as Eric pointed out to me, Stan is not that kind of girl(boy?). Indeed, having spent three days working with Stan has revitalised my relationship. Getting down to the basics has been really helpful and I shall remember, Stan is not drawing samples from a distribution. Instead, it is calculating the joint distribution function (in log space), and evaluating the probability distribution function (in log space).

Thus, here is a little example of fitting a set of random numbers in R to a Normal distribution with Stan. Yet, instead of using the built-in functions for the Normal distribution, I define the log probability function by hand, which I will use in the model block as well, and even generate a random sample, starting with a uniform distribution. However, I do use pre-defined distributions for the priors.

Why do I want to do this? This will be a template for the day when I have to use a distribution which is not predefined in Stan – the actuar package, for example, has some interesting candidates.

Testing

I start off by generating fake data, a sample of 100 random numbers drawn from a Normal distribution with a mean of 4 and a standard deviation of 2. Note, the sample mean of the 100 figures is 4.2 and not 4.

Histogram of 100 random numbers drawn from N(4,2).

I then use the Stan script to fit the data, i.e. to find the parameters \(\mu\) and \(\sigma\), assuming that the data was generated by a Gaussian process.
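The original Stan script is embedded in the post. The sketch below is my own reconstruction of the idea, not the author’s code: the Normal log density is written by hand in the functions block, reused in the model block, and the posterior predictive sample (y_ppc) is generated from a uniform draw via the inverse CDF. The priors and parameter names are assumptions.

library(rstan)

stan_code <- "
functions {
  real myNormalLogDensity(real y, real mu, real sigma) {
    return -0.5 * log(2 * pi()) - log(sigma) - 0.5 * square((y - mu) / sigma);
  }
  real myNormal_rng(real mu, real sigma) {
    real u;
    u = uniform_rng(0, 1);            // start from a uniform draw
    return mu + sigma * inv_Phi(u);   // inverse-CDF transform to a Normal draw
  }
}
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 100);                // vague priors (assumed)
  sigma ~ cauchy(0, 5);
  for (n in 1:N)
    target += myNormalLogDensity(y[n], mu, sigma);
}
generated quantities {
  real y_ppc;
  y_ppc = myNormal_rng(mu, sigma);    // posterior predictive draw
}
"

set.seed(101)                          # with this seed the sample mean will differ
y <- rnorm(100, mean = 4, sd = 2)      # slightly from the 4.2 quoted above
fit <- stan(model_code = stan_code, data = list(N = length(y), y = y),
            chains = 4, iter = 2000)
print(fit, pars = c("mu", "sigma", "y_ppc"))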

Traceplot of 4 chains, including warm-up phase
Histograms of posterior parameter and predictive samples
Comparison of the empirical distributions

The posterior parameter distributions include both \(\mu\) and \(\sigma\) in the 95% credible interval. The distribution of the posterior predictive check (y_ppc) is wider, taking into account the uncertainty of the parameters. The interquartile range and mean of my initial fake data and of the sample from the posterior predictive distribution look very similar. That’s good: my model generates data which looks like the original data.

Bayesian Mixer Meetup

Btw, tonight we have the 4th Bayesian Mixer Meetup in London.

Session Info

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12 (Sierra)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.3-45        rstan_2.12.1       StanHeaders_2.12.0 ggplot2_2.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      codetools_0.2-14 digest_0.6.10    grid_3.3.1
 [5] plyr_1.8.4       gtable_0.2.0     stats4_3.3.1     scales_0.4.0
 [9] labeling_0.3     tools_3.3.1      munsell_0.4.3    inline_0.3.14
[13] colorspace_1.2-6 gridExtra_2.2.1
This post was originally published on mages’ blog.


text2vec 0.4


(This article was first published on Data Science notes, and kindly contributed to R-bloggers)

Introducing text2vec 0.4

Today I’m pleased to announce a new major release of text2vec – text2vec 0.4 – which is already on CRAN.

For those readers who are not familiar with text2vec – it is an R package which provides an efficient framework with a concise API for text analysis and natural language processing.

With this release I also launched the project homepage – http://text2vec.org – where you can find up-to-date documents and tutorials.

Functionality

The core functionality at the moment includes:

  1. Fast text vectorization (creation of document-term matrices) on arbitrary n-grams, using vocabulary or feature hashing
  2. GloVe word embeddings
  3. Topic modeling with:
    • Latent Dirichlet Allocation
    • Latent Semantic Analysis
  4. Similarities/distances between matrices (documents in vector space)

What’s new?

First of all, I would like to express special thanks to the project contributors – Lincoln Mullen, Qin Wenfeng, Zach Mayer and others (and of course to all of those who reported bugs on the issue tracker!).

A lot of work was done in the last 6 months. Most notable changes are:

  • Immutable iterators. The most frustrating and annoying thing in 0.3 was that the create_* functions modified their input objects (in contrast to the usual R behavior with copy-on-modify semantics), so I received a lot of bug reports about it. People just didn’t understand why they were getting empty document-term matrices. That was my big mistake: R users assume that a function won’t modify its arguments. So I rewrote the iterators with R6 classes (thanks to @hadley for the suggestion). Learned a lot.
  • text2vec now has a consistent, pipe-friendly interface for models. Users only need to remember a few main verbs – fit, transform, fit_transform (see the short sketch after this list). More details will be available soon in a separate blog post. Stay tuned.
  • Started to work on models which can be useful for NLP:
    • Latent Dirichlet Allocation. The code for fast collapsed Gibbs sampling is based on the lda package by Jonathan Chang, but with a few tweaks (which will be incorporated into the lda package in its next release). It turned out that LDA from text2vec is ~2x faster than the original (and ~10x faster than topicmodels!)
    • Latent Semantic Analysis (based on the updated irlba package)
    • Tf-Idf was also rewritten to be consistent with the other models’ interface
  • text2vec now contains functions for fast calculation of similarity between documents (actually similarities and distances between matrices):
    • Cosine distance
    • Jaccard distance
    • Relaxed Word Mover’s Distance (which was demonstrated to be very useful in recent Kaggle competitions – 1, 2). A dedicated post/tutorial on that will be available soon. Stay tuned.
  • GloVe word embeddings were also significantly updated:
    • Even faster now – a ~2-3x performance boost from code optimizations and the use of single-precision float arithmetic (don’t forget to enable the -ffast-math option for your C++ compiler)
    • L1 regularization – our new feature (I haven’t seen implementations or papers where researchers tried to add regularization). Higher quality word embeddings for small data sets. More details will be available in a separate blog post. Stay tuned.
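As a short sketch of the verb-style interface mentioned above (using the TfIdf model, which also appears at the end of the tutorial below):

library(text2vec)
data("movie_review")

it    = itoken(movie_review$review, preprocessor = tolower,
               tokenizer = word_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it)
dtm   = create_dtm(it, vocab_vectorizer(vocab))

tfidf = TfIdf$new()                      # create the model
dtm_tfidf = fit_transform(dtm, tfidf)    # fit to the data and transform it in one step
# transform(new_dtm, tfidf) would apply the already-fitted model to new data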

Updated tutorials

Check out tutorials on text2vec.org where I’ll be updating documentation on a regular basis.

Below is the updated introduction to text mining with text2vec. No fancy word clouds. No Jane Austen. Enjoy.

Text analysis pipeline

Most text mining and NLP modeling uses bag-of-words or bag-of-n-grams methods. Despite their simplicity, these models usually demonstrate good performance on text categorization and classification tasks. But in contrast to their theoretical simplicity and practical efficiency, building bag-of-words models involves technical challenges. This is especially the case in R because of its copy-on-modify semantics.

Let’s briefly review some of the steps in a typical text analysis pipeline:

  1. The researcher usually begins by constructing a document-term matrix (DTM) or term-co-occurrence matrix (TCM) from input documents. In other words, the first step is to vectorize text by creating a map from words or n-grams to a vector space.
  2. The researcher fits a model to that DTM. These models might include text classification, topic modeling, similarity search, etc. Fitting the model will include tuning and validating the model.
  3. Finally the researcher applies the model to new data.

In this vignette we will primarily discuss the first step. Texts themselves can take up a lot of memory, but vectorized texts usually do not, because they are stored as sparse matrices. Because of R’s copy-on-modify semantics, it is not easy to iteratively grow a DTM. Thus constructing a DTM, even for a small collection of documents, can be a serious bottleneck for analysts and researchers. It involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory use by a factor of 2 to 4. The text2vec package solves this problem by providing a better way of constructing a document-term matrix.

Let’s demonstrate the package’s core functionality by applying it to a real-world problem – sentiment analysis.

The text2vec package provides the movie_review dataset. It consists of 5000 movie reviews, each of which is marked as positive or negative. We will also use the data.table package for data wrangling.

First of all, let’s split our dataset into two parts – train and test. We will show how to perform data manipulations on the train set and then apply exactly the same manipulations on the test set:

library(text2vec)
library(data.table)
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
all_ids = movie_review$id
train_ids = sample(all_ids, 4000)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[J(train_ids)]
test = movie_review[J(test_ids)]

Vectorization

To represent documents in vector space, we first have to create mappings from terms to term IDs. We call them terms instead of words because they can be arbitrary n-grams, not just single words. We represent a set of documents as a sparse matrix, where each row corresponds to a document and each column corresponds to a term. This can be done in 2 ways: using the vocabulary itself or by feature hashing.

Vocabulary-based vectorization

Let’s first create a vocabulary-based DTM. Here we collect unique terms from all documents and mark each of them with a unique ID using the create_vocabulary() function. We use an iterator to create the vocabulary.

# define preprocessing function and tokenization function
prep_fun = tolower
tok_fun = word_tokenizer

it_train = itoken(train$review,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,
                  progressbar = FALSE)
vocab = create_vocabulary(it_train)

What was done here?

  1. We created an iterator over tokens with the itoken() function. All functions prefixed with create_ work with these iterators. R users might find this idiom unusual, but the iterator abstraction allows us to hide most of details about input and to process data in memory-friendly chunks.
  2. We built the vocabulary with the create_vocabulary() function.

Alternatively, we could create a list of tokens and reuse it in further steps. Each element of the list should represent a document, and each element should be a character vector of tokens.

train_tokens = train$review %>%
  prep_fun %>%
  tok_fun
it_train = itoken(train_tokens,
                  ids = train$id,
                  # turn off progressbar because it won't look nice in rmd
                  progressbar = FALSE)

vocab = create_vocabulary(it_train)
vocab
Number of docs: 4000
0 stopwords:  ...
ngram_min = 1; ngram_max = 1
Vocabulary:
                terms terms_counts doc_counts
    1:     overturned            1          1
    2: disintegration            1          1
    3:         vachon            1          1
    4:     interfered            1          1
    5:      michonoku            1          1
   ---
35592:        penises            2          2
35593:        arabian            1          1
35594:       personal          102         94
35595:            end          921        743
35596:        address           10         10

Note that text2vec provides a few tokenizer functions (see ?tokenizers). These are just simple wrappers for the base::gsub() function and are not very fast or flexible. If you need something smarter or faster you can use the tokenizers package which will cover most use cases, or write your own tokenizer using the stringi package.
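For instance, a hand-rolled tokenizer built on stringi (illustrative only) just needs to return a list of character vectors, one per document:

library(stringi)
# lower-case and split on word boundaries, dropping non-word segments
stringi_tokenizer = function(x) {
  stri_split_boundaries(stri_trans_tolower(x), type = "word", skip_word_none = TRUE)
}
it_stringi = itoken(train$review, tokenizer = stringi_tokenizer,
                    ids = train$id, progressbar = FALSE)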

Now that we have a vocabulary, we can construct a document-term matrix.

vectorizer = vocab_vectorizer(vocab)
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 0.800817 secs

Now we have a DTM and can check its dimensions.

dim(dtm_train)
[1]  4000 35596
identical(rownames(dtm_train),train$id)
[1] TRUE

As you can see, the DTM has as many rows as there are documents and as many columns as there are unique terms.

Now we are ready to fit our first model. Here we will use the glmnet package to fit a logistic regression model with an L1 penalty and 4-fold cross-validation.

library(glmnet)
NFOLDS = 4
t1 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']],
                              family = 'binomial',
                              # L1 penalty
                              alpha = 1,
                              # interested in the area under ROC curve
                              type.measure = "auc",
                              # 4-fold cross-validation
                              nfolds = NFOLDS,
                              # high value is less accurate, but has faster training
                              thresh = 1e-3,
                              # again lower number of iterations for faster training
                              maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 3.485586 secs
plot(glmnet_classifier)

plot of chunk fit_1

print(paste("max AUC =",round(max(glmnet_classifier$cvm),4)))
[1] "max AUC = 0.923"

We have successfully fit a model to our DTM. Now we can check the model’s performance on the test data. Note that we use exactly the same functions for preprocessing and tokenization. Also, we reuse the same vectorizer – the function which maps terms to indices.

# Note that most text2vec functions are pipe friendly!
it_test = test$review %>%
  prep_fun %>%
  tok_fun %>%
  itoken(ids = test$id,
         # turn off progressbar because it won't look nice in rmd
         progressbar = FALSE)

dtm_test = create_dtm(it_test, vectorizer)

preds = predict(glmnet_classifier, dtm_test, type = 'response')[, 1]
glmnet:::auc(test$sentiment, preds)
[1] 0.916697

As we can see, performance on the test data is roughly the same as we expect from cross-validation.

Pruning vocabulary

We can note, however, that the training time for our model was quite high. We can reduce it and also significantly improve accuracy by pruning the vocabulary.

For example, we can find words “a”, “the”, “in”, “I”, “you”, “on”, etc in almost all documents, but they do not provide much useful information. Usually such words are called stop words. On the other hand, the corpus also contains very uncommon terms, which are contained in only a few documents. These terms are also useless, because we don’t have sufficient statistics for them. Here we will remove pre-defined stopwords, very common and very unusual terms.

stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves",
               "you", "your", "yours")
t1 = Sys.time()
vocab = create_vocabulary(it_train, stopwords = stop_words)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 0.439589 secs
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
# create dtm_train with new pruned vocabulary vectorizer
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 0.6738439 secs
dim(dtm_train)
[1] 4000 6585

Note that the new DTM has many fewer columns than the original DTM. This usually leads to both accuracy improvement (because we removed “noise”) and reduction of the training time.

Also, we need to create the DTM for the test data with the same vectorizer:

dtm_test = create_dtm(it_test, vectorizer)
dim(dtm_test)
[1] 1000 6585

N-grams

Can we improve the model? Definitely – we can use n-grams instead of words. Here we will use up to 2-grams:

t1 = Sys.time()
vocab = create_vocabulary(it_train, ngram = c(1L, 2L))
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 1.47972 secs
vocab = vocab %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.5)
bigram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, bigram_vectorizer)
t1 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']],
                              family = 'binomial',
                              alpha = 1, type.measure = "auc",
                              nfolds = NFOLDS, thresh = 1e-3, maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 2.973802 secs
plot(glmnet_classifier)

plot of chunk ngram_dtm_1

print(paste("max AUC =",round(max(glmnet_classifier$cvm),4)))
[1] "max AUC = 0.9217"

Seems that usage of n-grams improved our model a little bit more. Let’s check performance on test dataset:

# apply vectorizer
dtm_test = create_dtm(it_test, bigram_vectorizer)
preds = predict(glmnet_classifier, dtm_test, type = 'response')[, 1]
glmnet:::auc(test$sentiment, preds)
[1] 0.9268974

Further tuning is left up to the reader.

Feature hashing

If you are not familiar with feature hashing (the so-called “hashing trick”) I recommend you start with the Wikipedia article, then read the original paper by a Yahoo! research team. This technique is very fast because we don’t have to perform a lookup over an associative array. Another benefit is that it leads to a very low memory footprint, since we can map an arbitrary number of features into much more compact space. This method was popularized by Yahoo! and is widely used in Vowpal Wabbit.

Here is how to use feature hashing in text2vec.

h_vectorizer = hash_vectorizer(hash_size = 2^14, ngram = c(1L, 2L))
t1 = Sys.time()
dtm_train = create_dtm(it_train, h_vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 1.51502 secs
t1 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']],
                              family = 'binomial',
                              alpha = 1, type.measure = "auc",
                              nfolds = 5, thresh = 1e-3, maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 4.494137 secs
plot(glmnet_classifier)

plot of chunk hash_dtm

print(paste("max AUC =",round(max(glmnet_classifier$cvm),4)))
[1] "max AUC = 0.8937"
dtm_test = create_dtm(it_test, h_vectorizer)
preds = predict(glmnet_classifier, dtm_test, type = 'response')[, 1]
glmnet:::auc(test$sentiment, preds)
[1] 0.9036685

As you can see our AUC is a bit worse but DTM construction time is considerably lower. On large collections of documents this can be a significant advantage.

Basic transformations

Before doing the analysis, it is often useful to transform the DTM. For example, the lengths of the documents in a collection can vary significantly, in which case it can be useful to apply normalization.

Normalization

By “normalization” we mean transforming the rows of the DTM so that values measured on different scales are adjusted to a notionally common scale. When document lengths vary, we can apply “L1” normalization, which transforms each row so that the sum of its values equals 1:

dtm_train_l1_norm = normalize(dtm_train, "l1")

This transformation should improve the quality of the data preparation.
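
As a quick sanity check (a minimal sketch, assuming the Matrix package that text2vec uses for its sparse DTMs is available), every non-empty row of the normalized matrix should now sum to 1:

# each non-empty row of the L1-normalized DTM should sum to 1
summary(Matrix::rowSums(dtm_train_l1_norm))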

TF-IDF

Another popular technique is TF-IDF transformation. We can (and usually should) apply it to our DTM. It will not only normalize DTM, but also increase the weight of terms which are specific to a single document or handful of documents and decrease the weight for terms used in most documents:

vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

# define tfidf model
tfidf = TfIdf$new()
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!
# apply pre-trained tf-idf transformation to test data
dtm_test_tfidf = create_dtm(it_test, vectorizer) %>%
  transform(tfidf)

Note that this is the first time we have touched a model object in text2vec. At this point, the user should remember several important things about text2vec models:

  1. Models can be fitted on a given data (train) and applied to unseen data (test)
  2. Models are mutable: once you pass a model to the fit() or fit_transform() function, it will be modified by the call.
  3. After a model is fitted, it can be applied to new data with the transform(new_data, fitted_model) method.

A more detailed overview of models and the models API will be available soon in a separate vignette.
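
In the meantime, the general pattern can be summarised in a minimal sketch (reusing the TfIdf model from the chunk above; it assumes dtm_train and dtm_test were built with the same vectorizer):

tfidf = TfIdf$new()                                # create an (unfitted) model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)  # fit on the training DTM; tfidf is modified in place
dtm_test_tfidf  = transform(dtm_test, tfidf)       # apply the already-fitted model to new data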

Once we have tf-idf reweighted DTM we can fit our linear classifier again:

t1 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train_tfidf, y = train[['sentiment']],
                              family = 'binomial',
                              alpha = 1, type.measure = "auc",
                              nfolds = NFOLDS, thresh = 1e-3, maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'sec'))
Time difference of 3.033687 secs
plot(glmnet_classifier)

plot of chunk fit_2

print(paste("max AUC =",round(max(glmnet_classifier$cvm),4)))
[1] "max AUC = 0.9146"

Let’s check the model performance on the test dataset:

preds = predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[, 1]
glmnet:::auc(test$sentiment, preds)
[1] 0.9053246

Usually the tf-idf transformation significantly improves performance on most downstream tasks.

What’s next

Try text2vec, share your thoughts in comments. I’m waiting for feedback.

To leave a comment for the author, please follow the link and comment on their blog: Data Science notes.


Making a team survey to get my colleagues hooked on R


(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this post, I will talk about the presentation of R that I did today, in the first week of my PhD. Usually, it is a team-only presentation. Yet, other people came because they were interested in learning more about R.

How I got the idea

I got the idea of doing an R presentation while reading Getting Your Colleagues Hooked on R on R-bloggers. I began by following the 7 tips of that post to make my presentation, which was a good starting point.

After a while, I feared that a general presentation would not get my team interested in R. So, I decided to set up a google form and ask them what they wanted to learn about R. It was a way to make sure that they would care.

Get results automatically

Because I was writing my R Markdown presentation while they were answering the google form, I decided that I should get (and show) the results automatically (only by re-knitting my presentation).

To get the results

I used the gsheet package (one could also use the googlesheets package):

library(pacman)
p_load(magrittr, longurl, gsheet)

responses <- "goo.gl/4zYmrw" %>%expand_urls %>%{gsheet2tbl(.$expanded_url)[, 2]}

To get the different possible choices of the form

I got them directly from reading the website of the google form:

p_load(gsubfn, stringr)

questions <-  "https://goo.gl/forms/LREeX5NORBJlCrcC3" %>%
readLines(encoding ="UTF-8") %>%
strapply(pattern ="\\[\"([^\"]*)\",,,,0\\]") %>%
unlist

I couldn’t get them directly from the Google Sheet because Google doesn’t distinguish between a comma in the name of a choice and the commas used to separate multiple answers. If you know how to specify the separator when generating results from a Google Form, I’d like to know.

To print the results directly in my presentation

I used the chunk option results='asis':

counts <- str_count(responses, coll(questions))
counts.lvl <- counts %>% unique %>% sort(decreasing = TRUE) %>% setdiff(0)

printf <- function(...) cat(sprintf(...))

for (n in counts.lvl) {
  if (n == 2) printf("\n***\n")
  printf("- for **%d** of you:\n", n)
  q.tmp <- questions[counts == n]
  for (q in q.tmp) {
    printf("    - %s\n", q)
  }
}

in order to generate markdown from R code.

Getting the number of R packages on CRAN

I also wanted to show them how many packages we have on CRAN, so I used:

n <- readLines('https://cran.r-project.org/web/packages/') %>%
  gsubfn::strapply(
    paste("Currently, the CRAN package repository",
          "features ([0-9]+) available packages.")) %>%
  unlist

and printed n as inline R code.
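
For example, the R Markdown source might contain a sentence like the following (a sketch; the wording is illustrative):

There are currently `r n` packages available on CRAN.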

Conclusion

You can see the presentation there and the corresponding Rmd file there.

After finishing my presentation, I realized that most of what I presented, I learned it on R-bloggers. So, thanks everyone for the wonderful posts we get to read everyday!

If some of you think about other things that are important to know about R, I’d like to hear about them, just as personal curiosity.

To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.



Extending accessibility of open-source statistical software to the masses A shiny case study


(This article was first published on Educate-R - R, and kindly contributed to R-bloggers)

Extending accessibility of open-source statistical software to the masses: A shiny case study

Brandon LeBeau

University of Iowa

R

  • R is an open source statistical programming language.
    • Pros:
      • Common statistical procedures are found in R
      • Can extend functionality with packages/functions
    • Cons:
      • Need to be comfortable with code

Flexibility of R

  • R is powerful and flexible due to the many user written packages.
  • However, to capture this flexibility:
    • users need to be comfortable with programming
    • users need to find the package
    • users need to understand package specific syntax

R package documentation and examples

https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/summarise

Blog posts

https://blog.rstudio.org/2014/01/17/introducing-dplyr/

Vignettes

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Weaknesses of these types of documentations

  • They still rely on user understanding and reading R code.
  • Not interactive, although the user can copy and paste code into an R session.
  • This type of documentation will not capture the nontraditional useR.
  • Shiny is the path to the nontraditional useR.

What is Shiny

Advantages of Shiny

  • User needs no R knowledge
  • App is viewed in the browser so able to use
    • Javascript
    • HTML
    • CSS
  • Multiple hosting options
  • Flexible Output

Disadvantages of Shiny

  • Need an R developer to create the app.
    • More difficult as the code is somewhat different compared to traditional R code.
    • Shiny uses reactive programming.

Components of Shiny

  1. User Interface (ui.r)
    • What the user sees and interacts with
  2. R Analysis (server.r)
    • The R code running behind the scenes

User Interface

shinyUI(
  fluidPage(    
    titlePanel("Telephones by region"),
    sidebarLayout(      
      sidebarPanel(
        selectInput("region", "Region:", 
                    choices = colnames(WorldPhones)),
        hr(),
        helpText("Data from AT&T (1961) The World's Telephones.")
      ),

      mainPanel(
        plotOutput("phonePlot")  
      )
    )
  )
)

Server File

shinyServer(function(input, output) {

  output$phonePlot <- renderPlot({

    barplot(WorldPhones[ , input$region] * 1000, 
            main = input$region,
            ylab = "Number of Telephones",
            xlab = "Year")
  })
})
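
Together, these two files form a complete app. A minimal sketch of how it would be launched locally (assuming ui.R and server.R are saved in a directory called "myapp"; the directory name is illustrative):

library(shiny)
runApp("myapp")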

Case Study

  • pdfsearch
    • Note, you may need rtools to install this package.
  • The following commands will run the pdfsearch shiny application locally.
install.packages('devtools')
devtools::install_github('lebebr01/pdfsearch')
pdfsearch::run_shiny()

Case Study 2

devtools::install_github('lebebr01/simglm')
simglm::run_shiny()

Conclusions

  • Shiny can give useRs an interactive framework to try out an R package.
  • Benefits include
    • interactivity
    • no errors (for well developed Shiny applications)
    • no need to learn R or package specific syntax
    • only need a browser, no need to have R install locally when hosted on a server.

Questions?

To leave a comment for the author, please follow the link and comment on their blog: Educate-R - R.


tint 0.0.2: Tint Is Not Tufte


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The tint package is now on CRAN. Its name stands for Tint Is Not Tufte and it offers a fresh take on the excellent Tufte-style html (and now also pdf) presentations.

As a little teaser, here is what the html variant looks like:

and the full underlying document is available too.
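
To try it yourself, install the package from CRAN and use it as an R Markdown output format (a minimal sketch; tint::tintHtml is the html format name suggested by the package's own naming, so treat it as an assumption and check the package documentation):

install.packages("tint")
# then, in the YAML header of an R Markdown document:
#   output: tint::tintHtml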

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


In case you missed it: September 2016 roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from September of particular interest to R users. 

The R-Ladies meetups and the Women in R Taskforce support gender diversity in the R community.

Highlights from the Microsoft Data Science Summit include recordings of many presentations about R, and the keynote "The Future of Data Analysis" by Edward Tufte.

An R-based fraud detection model scores credit card transactions in SQL Server at a rate of 1 million records per second.

The Financial Times uses R for quantitative journalism (and made some lovely animations comparing European football teams). 

Part 3 in a series on Deep Learning looks at combining CNNs with RNNs.

There were many real-world applications of R presented at the EARL London conference, including applications of Microsoft R at Investec, British Car Auctions and Beazley Group.

Tips on choosing the right data science tool for a project.

Tidyverse: a collection of packages for working with data in R.

The Linux Data Science Virtual Machine has been upgraded with new tools including Microsoft R Server.

The Pirate's Guide to R: a video and 250-page e-book to learn the R language.

The 2016 O'Reilly Data Science Salary Survey reveals the most-used tools are SQL (70%), R (57%) and Python (54%).

A simple explanation of Convolutional Neural Networks.

A template for building a predictive maintenance application with SQL Server R Services.

The R Consortium awarded a grant of $10,000 to the R Documentation Task Force to design and build the next generation R documentation system.

Scaling R-based applications with DeployR grid nodes and slots.

An R package to extract colour palettes from satellite imagery.

A guide for porting SAS programs for financial data manipulation to R.

How to analyze basketball data and create animations of player movements with R.

Create a more perceptive heatmap colour scale with the viridis package.

General interest stories (not related to R) in the past month included: how a newspaper was printed in 1973, illusions caused by our poor peripheral vision, a chart (to scale!) about climate change, a happier version of the X Files theme, and a short film about the creation of the universe.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Shiny happy people in the land of the Czar


(This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers)

During the summer, we’ve worked silently but relentlessly to set up a departmental server that could run R-Shiny applications. There’s a bunch of us in the department doing work on R and producing packages, so we thought it’d be a good idea to disseminate our research. Which is just as well, as I’ve been nominated “2020 REF Impact Czar”, meaning I’ll have to help collate all the evidence that our work does have an impact on the “real world”…

Anyway, after some teething problems (mainly due to my getting familiar with the system and the remote installation of R and Shiny), I think we’ve now managed to successfully “deploy” (I think that’s the correct technical term) two webapps. These are bmetaweb and BCEAweb.

The first one is the web-interface to our bmeta package for Bayesian meta-analysis (which I developed with my PhD student Christina). The main point of bmeta is to provide a standardised framework for a set of meta-analysis models, depending on the nature of the outcome and some modelling assumptions (eg fixed vs random effects). In addition to running the default models (which are based on rather vague priors and pre-defined model structures), bmeta saves the data and model code (in JAGS), so that people can actually use these templates and modify them to their specific needs.

BCEAweb is the actual mother of the whole project (much as SAVI is then the actual grandmother, as it inspired our work on developing web-interfaces to R packages), and the idea is to use BCEA remotely to post-process the outcome of a (Bayesian) health economic model. BCEAweb works by uploading the simulations from a model and then using R remotely to produce all the relevant output for reporting the results in terms of cost-effectiveness analysis.

One thing we’ve tried very hard to include in both webapps is the possibility of downloading a full report (in .pdf or .docx format) with a summary of the analyses. I think this is really cool and we’ll probably develop more of these, particularly for our work related to statistical methods for health economic evaluations.

Comments welcome, of course!

To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio's blog.


On calculating AUC


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).

R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated. Bob Horton’s article showed how elegantly the points on the ROC plot are expressed in terms of sorting and cumulative summation.

The next step is computing AUC. Obviously computing area is a solved problem. The issue is how you deal with interpolating between points and the conventions of what to do with data that has identical scores. An elegant interpretation of the usual tie breaking rules is: for every point on the ROC curve we must have either all of the data above a given score threshold or none of the data above a given score threshold. This is the issue alluded to when the original article states:

This brings up another limitation of this simple approach; by assuming that the rank order of the outcomes embodies predictive information from the model, it does not properly handle sequences of cases that all have the same score.

This problem is quite easy to explain with an example. Consider the following data.

d <- data.frame(pred = c(1, 1, 2, 2),
                y = c(FALSE, FALSE, TRUE, FALSE))
print(d)
##   pred     y
## 1    1 FALSE
## 2    1 FALSE
## 3    2  TRUE
## 4    2 FALSE

Using code adapted from the original article we can quickly get an interesting summary.

ord <- order(d$pred, decreasing=TRUE) # sort by prediction reversed
labels <- d$y[ord]
data.frame(TPR = cumsum(labels)/sum(labels),
           FPR = cumsum(!labels)/sum(!labels),
           labels = labels,
           pred = d$pred[ord])
##   TPR       FPR labels pred
## 1   1 0.0000000   TRUE    2
## 2   1 0.3333333  FALSE    2
## 3   1 0.6666667  FALSE    1
## 4   1 1.0000000  FALSE    1

The problem is: we need to take all of the points with the same prediction score as an atomic unit (we take all of them or none of them). Notice also TPR is always 1 (an undesirable effect).

We do not really want rows 1 and 3 in our plot or area calculations. In fact the values in row 1 and 3 are not fully determined as they can vary depending on details of tie breaking in the sorting (though the values recorded in rows 2 and 4 can not so vary). Also (especially after deleting rows) we may need to add in ideal points with (FPR,TPR)=(0,0) and (FPR,TPR)=(1,1) to complete our plot and area calculations.

What we want is a plot where ties are handled. Such plots look like the following:

# devtools::install_github('WinVector/WVPlots')
library('WVPlots') # see: https://github.com/WinVector/WVPlots
WVPlots::ROCPlot(d, 'pred', 'y', TRUE, 'example plot')

(ROC plot of the example data, produced by WVPlots::ROCPlot)

There is a fairly elegant way to get the necessary adjusted plotting frame: use differencing (the opposite of cumulative sums) to find where the pred column changes, and limit to those rows.

The code is as follows (also found in our sigr library here):

calcAUC <- function(modelPredictions, yValues) {
  ord <- order(modelPredictions, decreasing = TRUE)
  yValues <- yValues[ord]
  modelPredictions <- modelPredictions[ord]
  x <- cumsum(!yValues)/max(1, sum(!yValues)) # FPR = x-axis
  y <- cumsum(yValues)/max(1, sum(yValues))   # TPR = y-axis
  # each point should be fully after a bunch of points or fully before a
  # decision level. remove dups to achieve this.
  dup <- c(modelPredictions[-1] >= modelPredictions[-length(modelPredictions)],
           FALSE)
  # And add in ideal endpoints just in case (redundancy here is not a problem).
  x <- c(0, x[!dup], 1)
  y <- c(0, y[!dup], 1)
  # sum areas of segments (triangle topped vertical rectangles)
  n <- length(y)
  area <- sum( ((y[-1] + y[-n])/2) * (x[-1] - x[-n]) )
  area
}

This correctly calculates the AUC.

# devtools::install_github('WinVector/sigr')
library('sigr') # see: https://github.com/WinVector/sigr
calcAUC(d$pred, d$y)
## [1] 0.8333333
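
As a quick cross-check (a sketch, assuming the pROC package is installed; it is not used in the original post), an independent implementation agrees on this small example:

library(pROC)
# pROC::auc(response, predictor); ties are handled with the usual tie-breaking rule
auc(as.numeric(d$y), d$pred)  # should also give 0.8333, matching calcAUC() above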

I think this extension maintains the spirit of the original. We have also shown how complexity increases as you move from code known to work on a particular data set at hand, to library code that may be exposed to data with unanticipated structures or degeneracies (this is why Quicksort, which has an elegant description, often has monstrous implementations; please see here for a rant on that topic).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Creating Sample Datasets – Exercises


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Creating sample data is a common task performed in many different scenarios.

R has several base functions that make the sampling process quite easy and fast.

Below is an explanation of the main functions used in the current set of exercises:

1. set.seed() – Although R executes a random mechanism of sample creation, set.seed() function allows us to reproduce the exact sample each time we execute a random-related function.

2. sample() – Sampling function. Its arguments are: x – a vector of values; size – the sample size; replace – whether a chosen value can be used more than once; prob – the probabilities of each value in the input vector.

3. seq()/seq.Date() – Create a sequence of values/dates, ranging from a ‘start’ to an ‘end’ value.

4. rep() – Repeat a value/vector n times.

5. rev() – Revert the values within a vector.

You can get additional explanations for those functions by adding a ‘?’ prior to each function’s name.
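
For a quick illustration of these functions before starting (a sketch; the values are arbitrary):

set.seed(42)                                   # make the sampling reproducible
sample(1:6, size = 10, replace = TRUE)         # ten rolls of a fair die
sample(0:1, size = 10, replace = TRUE,
       prob = c(0.3, 0.7))                     # a biased coin
seq.Date(as.Date("2016-01-01"),
         as.Date("2016-06-01"), by = "month")  # monthly sequence of dates
rep(c(0, 1), times = 5)                        # repeat a vector 5 times
rev(letters[1:5])                              # reverse a vector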

Answers to the exercises are available here. If you have different solutions, feel free to post them.

Exercise 1
  1. Set seed with value 1235.
  2. Create a Bernoulli sample of 100 ‘fair coin’ flips. Populate a variable called fair_coin with the sample results.

Exercise 2
  1. Set seed with value 2312.
  2. Create a sample of 10 integers, based on a vector ranging from 8 through 19. Allow the sample to have repeated values. Populate a variable called hourselect1 with the sample results.

Exercise 3
  1. Create a vector variable called probs with the following probabilities: 0.05, 0.08, 0.16, 0.17, 0.18, 0.14, 0.08, 0.06, 0.03, 0.03, 0.01, 0.01.
  2. Make sure the sum of the vector equals 1.

Exercise 4
  1. Set seed with value 1976.
  2. Create a sample of 10 integers, based on a vector ranging from 8 through 19. Allow the sample to have repeated values and use the probabilities defined in the previous question. Populate a variable called hourselect2 with the sample results.

Exercise 5
Let’s prepare the variables for a biased coin:
  1. Populate a variable called coin with 5 zeros in a row and 5 ones in a row.
  2. Populate a variable called probs having 5 times the value 0.08 in a row and 5 times the value 0.12 in a row.
  3. Make sure the sum of the probabilities in the probs variable equals 1.

Exercise 6
  1. Set seed with value 345124.
  2. Create a biased sample of length 100, having as input the coin vector and the probs vector of probabilities. Populate a variable called biased_coin with the sample results.

Exercise 7
Compare the sum of values in fair_coin and biased_coin.

Exercise 8
  1. Create a ‘Date’ variable called startDate with value 9th of February 2010 and a second ‘Date’ variable called endDate with value 9th of February 2005.
  2. Create a descending sequence of dates containing all the 9ths of the month between those two dates. Populate a variable called seqDates with the sequence of dates.

Exercise 9
Revert the sequence of dates created in the previous question, so they are in ascending order, and place them in a variable called RevSeqDates.

Exercise 10
  1. Set seed with value 10.
  2. Create a sample of 20 unique values from the RevSeqDates vector.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


7th MilanoR meeting + talks live streaming


(This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers)

On 27th of October I’m going to attend the 7th MilanoR meeting featuring the following two talks:

1. Interactive big data analysis with R: SparkR and MongoDB: a friendly walkthrough  by  Thimoty Barbieri and Marco Biglieri

2. Power consumption prediction based on statistical learning techniques by Davide Pandini

This is my first official R event and I’m very much looking forward to it. There has been a strong positive reply from a growing number of people interested in R confirmed by the fact that the tickets were sold out pretty much a few hours after the event was announced.

However, due to this unexpected sudden success in the allocation of the tickets, many people will not be able to attend. Fear not though, the MilanoR staff has just decided to live stream the event on its brand new Facebook page. The event will start around 6.30 PM local time. Feel free to leave a comment and share any thoughts you may have on the topic.

If you would like to know more about the event, the talks and the speakers, check out the following articles:

1. 7th MilanoR Meeting: October 27

2. 7th MilanoR Meeting live on Facebook

To leave a comment for the author, please follow the link and comment on their blog: The Beginner Programmer.



Rblpapi 0.3.5


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of Rblpapi is now on CRAN. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required).
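
For readers new to the package, a minimal sketch of a typical session (this requires a running Bloomberg terminal and a valid licence; the tickers and fields are illustrative, not taken from the release notes):

library(Rblpapi)
blpConnect()  # connect to the local Bloomberg session
# current data points for a couple of securities
bdp(c("ESA Index", "SPY US Equity"), c("PX_LAST", "VOLUME"))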

This is the sixth release since the package first appeared on CRAN last year. This release brings new functionality via a new function (getPortfolio()) and an extended function (getTicks()), as well as several fixes:

Changes in Rblpapi version 0.3.5 (2016-10-25)

  • Add new function getPortfolio to retrieve portfolio data via bds (John in #176)

  • Extend getTicks() to (optionally) return non-numeric data as part of data.frame or data.table (Dirk in #200)

  • Similarly extend getMultipleTicks (Dirk in #202)

  • Correct statement on timestamp for getBars (Closes issue #192)

  • Minor edits to a few files in order to either please R(-devel) CMD check --as-cran, or update documentation

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc should go to the issue tickets system at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


facet_plot: a general solution to associate data with phylogenetic tree


(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

ggtree provides gheatmap for visualizing heatmap with phylogenetic tree and msaplot for visualizing multiple sequence alignment with phylogenetic tree.

We may have different data types and want to visualize and align them with the tree. For example, a dot plot of SNP sites (e.g. using geom_point(shape='|')), a bar plot of trait values (e.g. using geom_barh(stat='identity')), and so on.

To make it easy to associate different types of data with phylogenetic tree, I implemented the facet_plot function which accepts a geom function to draw the input data.frame and display it in an additional panel.

associate tree with different type of data by #ggtree https://t.co/6w755VWytZ pic.twitter.com/K8WViEi13E

— Guangchuang Yu (@guangchuangyu) September 7, 2016

tr <- rtree(30)
p <- ggtree(tr)

d1 <- data.frame(id = tr$tip.label,
                 location = sample(c("GZ", "HK", "CZ"), 30, replace = TRUE))
p1 <- p %<+% d1 + geom_tippoint(aes(color = location))

d2 <- data.frame(id = tr$tip.label, val = rnorm(30, sd = 3))
p2 <- facet_plot(p1, panel = "dot", data = d2, geom = geom_point,
                 aes(x = val), color = 'firebrick') + theme_tree2()

Most of the geoms in ggplot2 draw vertical graphical objects, while for associating graphical objects with a phylogenetic tree we need horizontal versions. Luckily, we have ggstance which provides horizontal versions of geoms, including:

  • geom_barh()
  • geom_histogramh()
  • geom_linerangeh()
  • geom_pointrangeh()
  • geom_errorbarh()
  • geom_crossbarh()
  • geom_boxploth()
  • geom_violinh()

With ggstance, we can associate barplot, boxplot or other graphs to phylogenetic trees.

d3 <- data.frame(id = rep(tr$tip.label, each = 2),
                 value = abs(rnorm(60, mean = 100, sd = 50)),
                 category = rep(LETTERS[1:2], 30))
p3 <- facet_plot(p2, panel = 'Stacked Barplot', data = d3,
                 geom = geom_barh,
                 mapping = aes(x = value, fill = as.factor(category)),
                 stat = 'identity')

d4 = data.frame(id = rep(tr$tip.label, each = 20),
                val = as.vector(sapply(1:30, function(i) rnorm(20, mean = i))))
p4 <- facet_plot(p3, panel = "Boxplot", data = d4, geom_boxploth,
                 mapping = aes(x = val, group = label, color = location))

Citation

G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. doi:10.1111/2041-210X.12628.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.


News from archivist 2.0 on eRum2016 conference


(This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

Ten days ago the eRum2016 conference (European R Users Meeting 2016) finished. It was a huge event that attracted over 250 attendees, both from academia and business. The meeting was a great opportunity to listen to amazing keynotes like Heather Turner, Katarzyna Stapor, Rasmus Bååth, Jakub Glinka, Ulrike Grömping, Przemyslaw Biecek, Romain Francois, Marek Gagolewski, Matthias Templ and Katarzyna Kopczewska. A big thank you goes to the whole organizing committee and especially to dr Maciej Beręsewicz (head)! There were 10 workshops, 2 package sessions, 2 data workflow sessions, 3 methodology sessions, 1 BioR session, 2 business sessions, lightning talks, a poster session and of course a great welcome paRty. I could not miss a chance to present news from the last release (ver 2.0) of our archivist package.

From the eRum’2 Book of Abstracts you can learn that: Open science needs not only reproducible research but also accessible final and partial results. During the speech I will present the most valuable applications of the archivist package. The archivist is an R package for data analysis results management, which helps in managing, sharing, storing, linking and searching for R objects. The archivist package automatically retrieves the object’s meta-data and creates a rich structure that allows for easy management of calculated R objects. The archivist package extends the reproducible research paradigm by creating new ways to retrieve and validate previously calculated objects. These functionalities also result in a variety of opportunities such as: sharing R objects within reports/articles by adding hooks to R objects in table/figure captions; interactive exploration of object repositories; caching function calls; retrieving object’s pedigree along with information on how the object was created; automated tracking of performance of models.

archivist 2.0: (News from) Managing Data Analysis Results Toolkit

My presentation about new features and a present architecture of the archivist package is available on the list of all eRum2016 presentations. If it’s hard to find it, then use this link http://r-addict.com/eRum2016/#/.

I have shown that there are some requirements for data analysis results: they should be easy to access (for further processing), verifiable and reproducible. However, reproducibility from scratch is not always possible, so one could at least improve the results’ accessibility. Reproducibility is sometimes impossible due to different

  • base version of R
  • versions of R packages
  • versions of dependent software
  • global variables

or due to the

  • limitation of the original data
  • insufficient computational machinery

Examples: Can’t gather tibble in R, Can’t install git2r nor devtools R packages on centOS 7.0 64 bit, pandoc version 1.12.3 or higher is required and was not found (R shiny), rmarkdown::render freezes because pandoc freezes when LC_ALL and LANG are unset.

Results’ format proposed in the archivist

If results were presented together with a unique hook, their accessibility could be improved. Hooks can have the format presented below and can be R code that, when executed, downloads the results from the web (in this case from the GitHub repository named eRum2016 that belongs to a user called archivistR):

library(archivist)
# maybe library(survminer)
archivist::aread('archivistR/eRum2016/817107d0e62a9500c4ddb1770bd03378')

plot of chunk unnamed-chunk-2

In this situation the plot can be used in further processing, or the data can be extracted from the plot since it is a ggplot object (which by default stores the data used to produce it). For example, a title can be added:

result <- archivist::aread('archivistR/eRum2016/817107d0e62a9500c4ddb1770bd03378')
library(ggplot2)
result$plot <- result$plot + ggtitle('Extra title')
result

plot of chunk unnamed-chunk-3

Extensions – archivist.github

If you would like to have more archivist functionalities that are synchronized with GitHub’s repository storage system (e.g. automatic push after each object’s archiving) then you might be interested in the extensions of archivist – the archivist.github

If you are interested in more use cases of the archivist package then read our posts and talks history.

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.


Plotting individual observations and group means with ggplot2


(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to share my approach for visualizing individual observations with group means in the same plot. Here are some examples of what we’ll be creating:

init-example-1.png

init-example-2.png

init-example-3.png

I find these sorts of plots to be incredibly useful for visualizing and gaining insight into our data. We often visualize group means only, sometimes with the likes of standard errors bars. Alternatively, we plot only the individual observations using histograms or scatter plots. Separately, these two methods have unique problems. For example, we can’t easily see sample sizes or variability with group means, and we can’t easily see underlying patterns or trends in individual observations. But when individual observations and group means are combined into a single plot, we can produce some powerful visualizations.

 General approach

Below is generic pseudo-code capturing the approach that we’ll cover in this post. Following this will be some worked examples of diving deeper into each component.

# Packages we need
library(ggplot2)
library(dplyr)

# Have an individual-observation data set
id

# Create a group-means data set
gd <- id %>%
        group_by(GROUPING-VARIABLES) %>%
        summarise(
          VAR1 = mean(VAR1),
          VAR2 = mean(VAR2),
          ...
        )

# Plot both data sets
ggplot(id, aes(GEOM-AESTHETICS)) +
  geom_*() +
  geom_*(data = gd)

# Adjust plot to effectively differentiate data layers

 Tidyverse packages

Throughout, we’ll be using packages from the tidyverse: ggplot2 for plotting, and dplyr for working on the data. Let’s load these into our session:

library(ggplot2)
library(dplyr)

 Group means on a single variable

To get started, we’ll examine the logic behind the pseudo code with a simple example of presenting group means on a single variable. Let’s use mtcars as our individual-observation data set, id:

id <- mtcars %>% tibble::rownames_to_column() %>% as_data_frame()
id
#> # A tibble: 32 × 12
#>              rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
#>                <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1
#> 2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1
#> 3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1
#> 4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0
#> 5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0
#> 6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0
#> 7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0
#> 8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0
#> 9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0
#> 10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0
#> # ... with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>

Say we want to plot cars’ horsepower (hp), separately for automatic and manual cars (am). Let’s quickly convert am to a factor variable with proper labels:

id <- id %>% mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

Using the individual observations, we can plot the data as points via:

ggplot(id, aes(x = am, y = hp)) +
  geom_point()

unnamed-chunk-6-1.png

What if we want to visualize the means for these groups of points? We start by computing the mean horsepower for each transmission type into a new group-means data set (gd) as follows:

gd <- id %>%
        group_by(am) %>%
        summarise(hp = mean(hp))
gd
#> # A tibble: 2 × 2
#>          am       hp
#>      <fctr>    <dbl>
#> 1 automatic 160.2632
#> 2    manual 126.8462

There are a few important aspects to this:

  • We group our individual observations by the categorical variable using group_by().
  • We summarise() the variable as its mean().
  • We give the summarized variable the same name in the new data set. E.g., hp = mean(hp) results in hp being in both data sets.

We could plot these means as bars via:

ggplot(gd, aes(x = am, y = hp)) +
  geom_bar(stat = "identity")

unnamed-chunk-8-1.png

The challenge now is to combine these plots.

As the base, we start with the individual-observation plot:

ggplot(id, aes(x = am, y = hp)) +
  geom_point()

unnamed-chunk-9-1.png

Next, to display the group-means, we add a geom layer specifying data = gd. In this case, we’ll specify the geom_bar() layer as above:

ggplot(id, aes(x = am, y = hp)) +
  geom_point() +
  geom_bar(data = gd, stat = "identity")

unnamed-chunk-10-1.png

Although there are some obvious problems, we’ve successfully covered most of our pseudo-code and have individual observations and group means in the one plot.

Before we address the issues, let’s discuss how this works. The main point is that our base layer (ggplot(id, aes(x = am, y = hp))) specifies the variables (am and hp) that are going to be plotted. By including id, it also means that any geom layers that follow without specifying data, will use the individual-observation data. Thus, geom_point() plots the individual points. geom_bar(), however, specifies data = gd, meaning it will try to use information from the group-means data. Because our group-means data has the same variables as the individual data, it can make use of the variables mapped out in our base ggplot() layer.

At this point, the elements we need are in the plot, and it’s a matter of adjusting the visual elements to differentiate the individual and group-means data and display the data effectively overall. Among other adjustments, this typically involves paying careful attention to the order in which the geom layers are added, and making heavy use of the alpha (transparency) values.

For example, we can make the bars transparent to see all of the points by reducing the alpha of the bars:

ggplot(id, aes(x = am, y = hp)) +
  geom_point() +
  geom_bar(data = gd, stat = "identity", alpha = .3)

unnamed-chunk-11-1.png

Here’s a final polished version that includes:

  • Color to the bars and points for visual appeal.
  • ggrepel::geom_text_repel to add car labels to each point.
  • theme_bw() to clean the overall appearance.
  • Proper axis labels.
ggplot(id, aes(x = am, y = hp, color = am, fill = am)) +
  geom_bar(data = gd, stat = "identity", alpha = .3) +
  ggrepel::geom_text_repel(aes(label = rowname), color = "black", size = 2.5, segment.color = "grey") +
  geom_point() +
  guides(color = "none", fill = "none") +
  theme_bw() +
  labs(
    title = "Car horsepower by transmission type",
    x = "Transmission",
    y = "Horsepower"
  )

unnamed-chunk-12-1.png

Notice that, again, we can specify how variables are mapped to aesthetics in the base ggplot() layer (e.g., color = am), and this affects the individual and group-means geom layers because both data sets have the same variables.

 Group means on two variables

Next, we’ll move to overlaying individual observations and group means for two continuous variables. This time we’ll use the iris data set as our individual-observation data:

id <- as_data_frame(iris)
id
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 140 more rows

Let’s say we want to visualize the petal length and width for each iris Species.

Let’s create the group-means data set as follows:

gd <- id %>%
        group_by(Species) %>%
        summarise(Petal.Length = mean(Petal.Length),
                  Petal.Width  = mean(Petal.Width))
gd
#> # A tibble: 3 × 3
#>      Species Petal.Length Petal.Width
#>       <fctr>        <dbl>       <dbl>
#> 1     setosa        1.462       0.246
#> 2 versicolor        4.260       1.326
#> 3  virginica        5.552       2.026

We’ve now got the variable means for each Species in a new group-means data set, gd. The important point, as before, is that there are the same variables in id and gd.

Let’s prepare our base plot using the individual observations, id:

ggplot(id, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()

unnamed-chunk-15-1.png

Let’s use the color aesthetic to distinguish the groups:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()

unnamed-chunk-16-1.png

Now we can add a geom that uses our group means. We’ll use geom_point() again:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  geom_point(data = gd)

unnamed-chunk-17-1.png

Did it work? Well, yes, it did. The problem is that we can’t distinguish the group means from the individual observations because the points look the same. Again, we’ve successfully integrated observations and means into a single plot. The challenge now is to make various adjustments to highlight the difference between the data layers.

To do this, we’ll fade out the observation-level geom layer (using alpha) and increase the size of the group means:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = .4) +
  geom_point(data = gd, size = 4)

unnamed-chunk-18-1.png

Here’s a final polished version for you to play around with:

ggplot(id, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
  geom_point(alpha = .4) +
  geom_point(data = gd, size = 4) +
  theme_bw() +
  guides(color = guide_legend("Species"), shape = guide_legend("Species")) +
  labs(
    title = "Petal size of iris species",
    x = "Length",
    y = "Width"
  )

unnamed-chunk-19-1.png

 Repeated observations

One useful avenue I see for this approach is to visualize repeated observations. For example, colleagues in my department might want to plot depression levels measured at multiple time points for people who receive one of two types of treatment. Typically, they would present the means of the two groups over time with error bars. However, we can improve on this by also presenting the individual trajectories.

As an example, let’s examine changes in healthcare expenditure over five years (from 2001 to 2005) for countries in Oceania and Europe.

Start by gathering our individual observations from my new ourworldindata package for R, which you can learn more about in a previous blogR post:

# Individual-observations data
library(ourworldindata)
id <- financing_healthcare %>%
        filter(continent %in% c("Oceania", "Europe") & between(year, 2001, 2005)) %>%
        select(continent, country, year, health_exp_total) %>%
        na.omit()
id
#> # A tibble: 275 × 4
#>    continent country  year health_exp_total
#>        <chr>   <chr> <int>            <dbl>
#> 1     Europe Albania  2001         198.2242
#> 2     Europe Albania  2002         225.1861
#> 3     Europe Albania  2003         236.3563
#> 4     Europe Albania  2004         263.5986
#> 5     Europe Albania  2005         276.6520
#> 6     Europe Andorra  2001        1432.2798
#> 7     Europe Andorra  2002        1564.6976
#> 8     Europe Andorra  2003        1601.0641
#> 9     Europe Andorra  2004        1661.5608
#> 10    Europe Andorra  2005        1793.9938
#> # ... with 265 more rows

Let’s plot these individual country trajectories:

ggplot(id, aes(x = year, y = health_exp_total)) +
  geom_line()

unnamed-chunk-21-1.png

Hmm, this doesn’t look right. The problem is that we need to group our data by country:

ggplot(id, aes(x = year, y = health_exp_total, group = country)) +
  geom_line()

unnamed-chunk-22-1.png

We now have a separate line for each country. Let’s color these depending on the world region (continent) in which they reside:

ggplot(id, aes(x = year, y = health_exp_total, group = country, color = continent)) +
  geom_line()

unnamed-chunk-23-1.png

If we tried to follow our usual steps by creating group-level data for each world region and adding it to the plot, we would do something like this:

gd <- id %>%
        group_by(continent) %>%
        summarise(health_exp_total = mean(health_exp_total))

ggplot(id, aes(x = year, y = health_exp_total, group = country, color = continent)) +
  geom_line() +
  geom_line(data = gd)

This, however, will lead to a couple of errors, which are both caused by variables being called in the base ggplot() layer, but not appearing in our group-means data, gd.

First, we’re not taking year into account, but we want to! In this case, year must be treated as a second grouping variable, and included in the group_by command. Thus, to compute the relevant group-means, we need to do the following:

gd <- id %>%
        group_by(continent, year) %>%
        summarise(health_exp_total = mean(health_exp_total))
gd
#> Source: local data frame [10 x 3]
#> Groups: continent [?]
#> 
#>    continent  year health_exp_total
#>        <chr> <int>            <dbl>
#> 1     Europe  2001        1196.7948
#> 2     Europe  2002        1311.2303
#> 3     Europe  2003        1375.2729
#> 4     Europe  2004        1465.5530
#> 5     Europe  2005        1550.2395
#> 6    Oceania  2001         398.1582
#> 7    Oceania  2002         414.7088
#> 8    Oceania  2003         448.6919
#> 9    Oceania  2004         475.8466
#> 10   Oceania  2005         501.5413

The second error is because we’re grouping lines by country, but our group means data, gd, doesn’t contain this information. Thus, we need to move aes(group = country) into the geom layer that draws the individual-observation data.

Now, our plot will be:

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  geom_line(aes(group = country)) +
  geom_line(data = gd)

unnamed-chunk-26-1.png

It worked again; we just need to make the necessary adjustments to see the data properly. Here’s a polished final version of the plot. See if you can work it out!

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  geom_line(aes(group = country), alpha = .3) +
  geom_line(data = gd, alpha = .8, size = 3) +
  theme_bw() +
  labs(
    title = "Changes in healthcare spending\nacross countries and world regions",
    x = NULL,
    y = "Total healthcare investment ($)",
    color = NULL
  )

unnamed-chunk-27-1.png

 Final challenge

For me, in a scientific paper, I like to draw time-series like the example above using the line plot described in another blogR post. As a challenge, I’ll leave it to you to draw this sort of neat time series with individual trajectories drawn underneath the mean trajectories with error bars. Don’t hesitate to get in touch if you’re struggling. Even better, succeed and tweet the results to let me know by including @drsimonj!
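
If you want a starting point, here is a minimal sketch of one possible approach (this is not the author's solution; it uses stat_summary() with ggplot2's built-in mean_se() for the error bars, and fun.y, which was the argument name at the time of writing):

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  # faded individual country trajectories underneath
  geom_line(aes(group = country), alpha = .2) +
  # mean trajectory per continent with standard-error bars on top
  stat_summary(fun.data = mean_se, geom = "errorbar", width = .2) +
  stat_summary(fun.y = mean, geom = "line", size = 1) +
  stat_summary(fun.y = mean, geom = "point", size = 2) +
  theme_bw()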

 Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.


September Package Picks


(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

September was a bit of a slow month for new R packages. Only 96 new packages showed up on CRAN. Nevertheless, I have picked out 23 for special mention, which I have listed in 5 categories. I used the same selection criteria as I described in the post for August picks.

Data and Interfaces

  • darksky V1.0.0: Provides an interface to the Dark Sky API which allows you to look up weather anywhere on the globe. 
  • etseed V0.1.0: Provides a client to interface to the etcd key value store, a database written in Go.
  • LendingClub V0.1.2: Lets investors manage their LendingClub investments from R.
  • sparklyr V0.4: Allows R users to connect, provision and interface to Apache Spark. Detailed examples using MLlib and H2O are available on the RStudio site.
  • trelloR: V0.1.0: Provides access to the Trello API. The vignette explains how to retrieve data from public and private Trello boards.
  • XRPython V0.7: A Python interface structured according to the general form of the package XR described in John Chamber’s book Extending R.

Machine Learning

  • ensembleR V0.1.0: Facilitates constructing ensemble models from machine learning models available in the caret package. There is a vignette to get started.
  • exprso V0.1.7: Provides a framework for supervised machine learning customized for biologists. There are several vignettes including a cheatsheet.

cheatsheet

  • lowmemtkmeans V0.1.0: Implements trimmed k-means clustering with low memory use.
  • Textmining V0.0.2: Provides functions for text and topic mining. Full functionality requires installing TreeTagger.

Plots and Visualizations

  • plotluck V1.0.1: Is an intelligent tool built on top of ggplot2 that automatically generates plots from dataframes based on users providing variables to plot.
  • plotwidgets V0.4: Provides functions to produce small, self contained plots for use in larger plots.
library(plotwidgets)
plot.new()
par(usr = c(-1, 1, -1, 1))
hues <- seq(0, 360, by = 30)
pos <- a2xy(hues, r = 0.75)
for (i in 1:length(hues)) {
  cols <- modhueCol(pal, by = hues[i])
  wgPlanets(x = pos$x[i], y = pos$y[i], w = 0.5, h = 0.5, v = v, col = cols)
}
pos <- a2xy(hues[-1], r = 0.4)
text(pos$x, pos$y, hues[-1])

plotwidgets

Statistics

  • Barycenter V1.0.0: Provides algorithms to compute the Wasserstein barycenter, the mean of a set of empirical probability measures.
  • musica V0.1.3: Provides functions for working with multivariate time series and custom time scales. There is a vignette to help you get started.
  • nhstplot V1.0.0: Provides functions to graphically illustrate the most common null hypothesis significance tests. The vignette provides some examples, e.g.:
library(nhstplot)
plotftest(4, 3, 5)

f-test

  • nimble V0.6-1: Allows R programmers to write statistical models in the BUGS language. NIMBLE is built in R but compiles in C++. There is extensive documentation at www.nimble.org
  • Rdice V1.0.1: Allows conducting sophisticated dice rolling and coin tossing experiments including experiments with Efron like Nontransitive dice. Have a look at the vignette.
  • splines2 V0.1.0: Provides functions for constructing a variety of splines that are not available in the splines package, including B-splines, M-splines, I-splines, C-splines, and the integral of B-splines. There is a vignette.
  • scanstatistics V0.1.0: Provides scan statistics functions to detect anomalous clusters in spatial or space-time data. The vignette describes the methodology and presents examples as well.
  • thief V0.2: Implements methods for generating forecasts at different temporal frequencies using hierarchical time series.

Misc

With this post, I am up to date with new CRAN packages. I hope to make my package picks a regular, monthly feature of this blog.

To leave a comment for the author, please follow the link and comment on their blog: RStudio.

