
satRday Cape Town: Call for Submissions


(This article was first published on R | Exegetic Analytics, and kindly contributed to R-bloggers)


satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa.

Keynotes and Workshops

We have a trio of fantastic keynote speakers: Hilary Parker, Jennifer Bryan and Julia Silge, who’ll be dazzling you on the day as well as presenting workshops on the two days prior to the satRday.

Call for Submissions

We’re accepting submissions in four categories:

  • Workshop [90 min],
  • Lightning Talk [5 min],
  • Standard Talk [20 min] and
  • Poster.

Submit your proposals here. The deadline is 16 December, but there’s no reason to wait until the last minute: send us your ideas now so that we can add you to the killer programme.

Registration

Register for the conference and workshops here. The tickets are affordable and you’re going to get extraordinary value for money.

The post satRday Cape Town: Call for Submissions appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R | Exegetic Analytics.



Introducing R-hub, the R package builder service


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you're developing a package for R to share with others — on CRAN, say — you'll want to make sure it works for others. That means testing it on various platforms (Windows, Mac, Linux, and all the versions thereof), and on various versions of R (current, past, and future). But it's likely you only have access to one platform, and installing and managing multiple R versions can be a pain.

R-hub, the online package-building service now in public beta, aims to solve this problem by making it easy to build and test your package on a variety of platforms and R versions. Using the rhub R package, you can, with a single command, upload your package to the cloud-based R-hub service and build and test it on the current, prior, and in-development versions of R, using any or all of these platforms:

[Figure: available R-hub build platforms]

(Mac platforms will be added soon, with Solaris a possible addition in the future.) In addition to running the standard R CMD check package tests for you, R-hub also applies "sanitizers" to the binary builds, to check for common errors in any compiled code (for example, writing to unallocated memory). It also handles the loading of dependent packages for the tests, even if those packages aren't on CRAN or Bioconductor (so you can test with development versions of packages on GitHub, for example).
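
For a sense of the workflow, here is a minimal sketch; the function names, platform name, and e-mail address are assumptions based on the rhub package's documented API and may differ slightly in the beta:

library(rhub)

# one-time setup: validate the e-mail address that will receive the check results
# validate_email("you@example.org")

# submit the package in the current directory to a single platform...
check(path = ".", platform = "windows-x86_64-devel")

# ...or run the set of checks typically expected before a CRAN submission
check_for_cran(path = ".")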

R-hub is for every member of the R community, and is designed to make it easy to create and share R packages with others. In particular, for packages intended for submission to CRAN, R-hub aims to simplify and streamline the process by detecting possible bugs and errors that might appear on other platforms or R versions before you submit.

R-hub was developed by Gábor Csárdi, and funded by a generous grant from the R Consortium. Microsoft is proud to be a founding member of the R Consortium, and also provides Azure credits to support the ongoing operation of R-hub.

R-hub is available as a public beta now; install the rhub package from GitHub to get started. (It will also be available on CRAN soon.) If you find problems or have suggestions, please report an issue at the R-hub GitHub repository. And if you’d like to learn more about R-hub, watch the recording of the R-hub webinar presented by the R Consortium, where Gábor introduces R-hub and demonstrates the process of building and testing a package.

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Online R courses at Udemy for only $10 (until November 1st)


Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (a 50%-90% discount) applies to hundreds of their courses – including many on R programming, data science, machine learning and more.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Non R courses for data science: 


 

Join Hadley Wickham’s Master R Workshop in Melbourne, Australia December 12 & 13


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

It’s nearly summeRtime in Australia! Join RStudio Chief Data Scientist Hadley Wickham for his popular Master R workshop in Melbourne.

Register here:  https://www.eventbrite.com/e/master-r-developer-workshop-melbourne-tickets-22546200292

Melbourne will be Hadley’s first and only scheduled Master R workshop in Australia. Whether you live or work nearby or you just need one more good reason to visit Melbourne in the Southern Hemisphere spring, consider joining him at the Cliftons Melbourne on December 12th and 13th. It’s a rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

Hadley’s workshops usually sell out. This is his final Master R workshop in 2016 and he has no plans to offer another in the area in 2017. If you’re an active R user and have been meaning to take this class, now is the perfect time to do it!

We look forward to seeing you in Melbourne!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


A quick exploration of the ReporteRs package


(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion, unlike some earlier packages. It also allows editing documents by using bookmarks placed within them.

This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr. For example, a report I wrote today maintained my flow while writing R code to create the report.

library(ReporteRs)
library(magrittr)

# cormat, corpval and plt were created earlier in the analysis (plt is a ggplot object)
mydoc <- docx() %>%
  addParagraph(value = 'Correlation matrix', style = 'Titre2') %>%
  addParagraph(value = 'Estimates') %>%
  addFlexTable(FlexTable(cormat)) %>%
  addParagraph(value = 'P-values') %>%
  addFlexTable(FlexTable(corpval)) %>%
  addParagraph(value = 'Boxplots', style = 'Titre2') %>%
  addPlot(fun = print, x = plt, height = 3, width = 5) %>%
  writeDoc(file = 'Report.docx')

Note that plt is a ggplot object and so we actually have to print it rather than just put the object in the addPlot command.

This was my first experience in a while using ReporteRs, and it seemed pretty good to me.

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.


drat 0.1.2: Mostly harmless


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

CRAN requested a release updating any URLs for Omegahat to the (actually working) omegahat.net URL. So that caused this 0.1.2 release which arrived on CRAN yesterday. It contains the requested change along with one or two other mostly minor changes which accumulated since the last release.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users because repositories are what we use. In other words, friends don’t let friends use install_github(). Just kidding. Maybe. Or not.
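
For readers who have not tried it, here is a minimal sketch of both sides of a drat workflow; the account name is Dirk's own drat repository, while the package name and local path are just assumed examples:

# user side: register a drat repository, then install packages published there
drat::addRepo("eddelbuettel")
install.packages("somePackage")    # any package available in that repo

# maintainer side: insert a freshly built package into your own drat repository
drat::insertPackage("drat_0.1.2.tar.gz", repodir = "~/git/drat")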

The NEWS file summarises the release as follows:

Changes in drat version 0.1.2 (2016-10-28)

  • Changes in drat documentation

    • The FAQ vignette added a new question: "Why use drat?"

    • URLs were made canonical, omegahat.net was updated from .org

    • Several files (README.md, Description, help pages) were edited

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


RProtoBuf 0.4.7: Mostly harmless


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

CRAN requested a release updating any URLs for Omegahat to the (actually working) omegahat.net URL. The RProtoBuf package had this in one code comment (errr…) and one bibfile entry. Oh well — so that caused this 0.4.7 release which arrived on CRAN today. It contains the requested change, and pretty much nothing else.

RProtoBuf provides R bindings for the Google Protocol Buffers ("Protobuf") data encoding and serialization library used and released by Google, and deployed as a language and operating-system agnostic protocol by numerous projects.
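
As a quick illustration of what the bindings look like, here is a minimal sketch using the addressbook.proto example schema that ships with the package:

library(RProtoBuf)

# compile the example message definitions included in the package
proto_file <- system.file("proto", "addressbook.proto", package = "RProtoBuf")
readProtoFiles(proto_file)

# create a message, serialize it to raw bytes, and parse it back
p <- new(tutorial.Person, id = 1, name = "Ada")
bytes <- serialize(p, NULL)
read(tutorial.Person, bytes)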

The NEWS file summarises the release as follows:

Changes in RProtoBuf version 0.4.7 (2016-10-27)

  • At the request of CRAN, two documentation instances referring to the Omegahat repository were updated to http://omegahat.net

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Regular Expressions Exercises – Part 1


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A common task performed during data preparation or data analysis is the manipulation of strings.

Regular expressions are meant to assist with such tasks.

A regular expression is a pattern that describes a set of strings.

Regular expressions can range from simple patterns (such as finding a single number) through complex ones (such as identifying UK postcodes).

R implements a set of “regular expression rules” that are largely shared with other programming languages, and it even allows some variations, such as Perl-like regular expressions.

Also, sometimes specific patterns may or may not be found, depending on the system locale.

Those patterns can be used through several base R functions, such as:

  • grep
  • grepl
  • regexpr
  • gregexpr
  • sub
  • gsub
  • strsplit

Since this topic includes both learning a set of rules and several different R functions, I’ll split this subject into a three-part series.

Answers to the exercises are available here.

With regex you can get correct results in more than one way, so if you have different solutions, feel free to post them.

 

Character class

A character class is a list of characters enclosed between square brackets ([ and ]), which matches any *single* character in that list. For example, [0359abC] means "find a pattern with one of the characters 0, 3, 5, 9, a, b or C". There are some "shortcuts" that allow us to find specific ranges of digits or characters:

  • [0-9] means any digit
  • [A-Z] means any upper case character
  • [a-z] means any lower case character
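
A quick demonstration on a throwaway string (deliberately not one of the exercises below):

x <- "abc 123"
grepl("[0-9]", x)          # TRUE: x contains at least one digit
grepl("[A-Z]", x)          # FALSE: no upper case letters
gregexpr("[a-z]", x)[[1]]  # 1 2 3: positions of the lower case letters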

Let’s create a variable called text1 and populate it with the value “The current year is 2016”

Exercise 1: Create a variable called my_pattern and implement the required pattern for finding any digit in the variable text1. Use the function grepl to verify whether there is a digit in the string variable.

Exercise 2: Use the function gregexpr to find all the positions in text1 where there is a digit. Place the results in a variable called string_position.

 

Predefined classes of characters

In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc). For this purpose, there are several predefined classes of characters that save us a lot of typing.

Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale.

Below are some "popular" predefined classes and their meaning:

1. [:alnum:] Alphanumeric characters: [:alpha:] and [:digit:].

2. [:alpha:] Alphabetic characters: [:lower:] and [:upper:] can also be used.

3. [:digit:] Digits: 0 1 2 3 4 5 6 7 8 9.

4. [:blank:] Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
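
Note that these classes are themselves placed inside square brackets when used in a pattern, for example:

grepl("[[:digit:]][[:upper:]]", "R2D2")    # TRUE: "2D" is a digit followed by an upper case letter
regexpr("[[:blank:]]", "find the space")   # 5: position of the first blank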

Exercise 3: Create a variable called my_pattern and implement the required pattern for finding one digit and one uppercase alphanumeric character in the variable text1. This time, combine predefined classes in the regex pattern. Use the function grepl to verify whether the searched pattern exists in the string.

Exercise 4: Use the function regexpr to find the position of the first space in text1. Place the results in a variable called first_space.

 

Special single character

The period (".") matches any single character.

Exercise 5: Create a pattern that checks whether text1 contains a lowercase character, followed by any character, followed by a digit.

Exercise 6: Find the starting position of the string matched above. Place the result in a variable called string_pos2.

 

Special symbols

There are several "special symbols" that assist in the definition of specific patterns. Note that in R you must add an extra backslash when writing these special symbols. The symbol \w matches a 'word' character and \W is its negation. The symbols \d, \s, \D and \S denote the digit and space classes and their negations. As you may have noticed, some special symbols have parallel "predefined classes" (for example, \d is equivalent to [0-9] and to [[:digit:]]).
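
For example (note the doubled backslash required inside R strings):

grepl("\\d", "R 3.3.1")     # TRUE: the string contains a digit
gsub("\\s", "_", "a b c")   # "a_b_c": every space replaced
grepl("\\W", "year_2016")   # FALSE: only 'word' characters (letters, digits, underscore)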

Exercise 7: Find the following pattern: one space followed by two lowercase letters and one more space. Use a function that returns the starting point of the found string and place its result in string_pos3.

 

Metacharacters

There are several metacharacters in the "regex syntax". Here I'll introduce two popular ones: the caret ("^") anchors a pattern to the beginning of the string, and the dollar sign ("$") anchors a pattern to the end of the string.
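
For example, using the value stored in text1:

grepl("^The", "The current year is 2016")    # TRUE: the string starts with "The"
grepl("2016$", "The current year is 2016")   # TRUE: the string ends with "2016"
grepl("^year", "The current year is 2016")   # FALSE: "year" is not at the beginning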

Exercise 8: Using the sub function, replace the pattern found in the previous exercise with the string " is not ". Place the resulting string in the variable text2.

 

Repetition Characters

There are several ways of dealing with repetition of characters in the "regex syntax". Here I'll introduce the curly-brackets syntax:

{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.
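
For example:

grepl("[0-9]{4}", "The current year is 2016")   # TRUE: a run of four digits exists
regexpr("[0-9]{2,}", "year 2016")               # 6: two or more digits, starting at position 6
grepl("^[a-z]{1,3}$", "ab")                     # TRUE: one to three lower case letters and nothing else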

Exercise 9: Find in text2 the following pattern: four digits at the end of the string. Use a function that returns the starting point of the found string and place its result in string_pos4.

Exercise 10: Using the substr function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of text2.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.



The Bayesian approach to ridge regression


(This article was first published on R – On the lambda, and kindly contributed to R-bloggers)

In a previous post, we demonstrated that ridge regression (a form of regularized linear regression that attempts to shrink the beta coefficients toward zero) can be super-effective at combating overfitting and lead to a far more generalizable model. This approach to regularization used penalized maximum likelihood estimation (for which we used the amazing glmnet package). There is, however, another approach… an equivalent approach… but one that allows us greater flexibility in model construction and lends itself more easily to an intuitive interpretation of the uncertainty of our beta coefficient estimates. I’m speaking, of course, of the bayesian approach.

As it turns out, careful selection of the type and shape of our prior distributions with respect to the coefficients can mimic different types of frequentist linear model regularization. For ridge regression, we use normal priors of varying width.

Though it can be shown analytically that shifting the width of normal priors on the beta coefficients is equivalent to L2 penalized maximum likelihood estimation, the math is scary and hard to follow. In this post, we are going to be taking a computational approach to demonstrating the equivalence of the bayesian approach and ridge regression.

This post is going to be a part of a multi-post series investigating other bayesian approaches to linear model regularization including lasso regression facsimiles and hybrid approaches.

mtcars

We are going to be using the venerable mtcars dataset for this demonstration because (a) its multicollinearity and high number of potential predictors relative to its sample size lend themselves fairly well to ridge regression, and (b) we used it in the elastic net blog post 🙂

Before you lose interest… here! Have a figure! An explanation will follow.

[Figure: LOOCV mean squared error as a function of prior width]

After scaling the predictor variables to be 0-centered and have a standard deviation of 1, I specified a model predicting mpg using all available predictors and placed normal priors on the beta coefficients, with the prior standard deviation taking each value from 0.05 to 5 (in steps of 0.025). To fit the model, instead of MCMC estimation via JAGS or Stan, I used quadratic approximation as performed by the awesome rethinking package written by Richard McElreath for his excellent book, Statistical Rethinking. Quadratic approximation uses an optimization algorithm to find the maximum a posteriori (MAP) point of the posterior distribution and approximates the rest of the posterior with a normal distribution about the MAP estimate. I used this method chiefly because, as long as these simulations took with quadratic approximation, they would have taken many orders of magnitude longer with MCMC. Various spot checks confirmed that the quadratic approximation was comparable to the posterior as told by Stan.
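
To make the setup concrete, here is a stripped-down sketch of one such fit using rethinking's quadratic approximation. For brevity it uses only a few predictors and a single hard-coded prior width, so treat it as an illustration of the idea rather than the code behind the figures (which lives in the repository linked at the end of the post):

library(rethinking)                  # McElreath's package; provides map() and precis()

d <- mtcars
d[, -1] <- scale(d[, -1])            # centre and scale the predictors, leave mpg alone

fit <- map(
  alist(
    mpg   ~ dnorm(mu, sigma),
    mu    <- a + b_wt*wt + b_hp*hp + b_disp*disp,   # subset of predictors for brevity
    a     ~ dnorm(20, 10),
    c(b_wt, b_hp, b_disp) ~ dnorm(0, 0.5),          # the "ridge" prior; the post sweeps this sd from 0.05 to 5
    sigma ~ dunif(0, 10)
  ),
  data = d
)
precis(fit)                          # MAP estimates with quadratic-approximation intervals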

As you can see from the figure, as the prior on the coefficients gets tighter, the model performance (as measured by the leave-one-out cross-validated mean squared error) improves—at least until the priors become too strong to be influenced sufficiently by the evidence. The ribbon about the MSE is the 95% credible interval (using a normal likelihood). I know, I know… it’s pretty damn wide.

The dashed vertical line is at the prior width that minimizes the LOOCV MSE. The minimum MSE is, for all practical purposes, identical to that of the highest performing ridge regression model using glmnet. This is good.

Another really fun thing to do with the results is to visualize the movement of the beta coefficient estimates under different penalties. The figure below depicts this. Again, the dashed vertical line marks the highest performing prior width.

[Figure: beta coefficient estimates as a function of prior width]

One last thing: we’ve heretofore only demonstrated that the bayesian approach can perform as well as the L2 penalized MLE… but it’s conceivable that it achieves this by finding a completely different coefficient vector. The figure below shows the same figure as above but I overlaid the coefficient estimates (for each predictor) of the top-performing glmnet model. These are shown as the dashed colored horizontal lines.

[Figure: beta coefficient estimates with the top-performing glmnet coefficients overlaid]

These results are pretty exciting! (if you’re the type to not get invited to parties). Notice that, at the highest performing prior width, the coefficients of the bayesian approach and the glmnet approach are virtually identical.

Sooooo, not only did the bayesian variety produce an equivalently generalizable model (as evinced by equivalent cross-validated MSEs) but also yielded a vector of beta coefficient estimates nearly identical to those estimated by glmnet. This suggests that both the bayesian approach and glmnet‘s approach, using different methods, regularize the model via the same underlying mechanism.

A drawback of the bayesian approach is that its solution takes many orders of magnitude more time to arrive at. Two advantages of the Bayesian approach are (a) the ability to study the posterior distributions of the coefficient estimates and the ease of interpretation they allow, and (b) the enhanced flexibility in model design and the ease with which you can, for example, swap out likelihood functions or construct more complicated hierarchical models.

If you are even the least bit interested in this, I urge you to look at the code (in this git repository) because (a) I worked really hard on it and, (b) it demonstrates cool use of meta-programming, parallelization, and progress bars… if I do say so myself 🙂


To leave a comment for the author, please follow the link and comment on their blog: R – On the lambda.


ratio-of-uniforms [#2]


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Following my earlier post on Kinderman and Monahan’s (1977) ratio-of-uniforms method, I must confess I remain quite puzzled by the approach. Or rather by its consequences. When looking at the set A of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), as discussed in the previous post, it can be represented by its parameterised boundary

u(x)=\sqrt{f(x)},\quad v(x)=x\sqrt{f(x)},\qquad x\in\mathcal{X}
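
As a reminder of how the basic method operates in practice, here is a minimal R sketch for the standard normal, using the unnormalised kernel and the usual bounding box around A:

f <- function(x) exp(-x^2/2)   # unnormalised N(0,1) density
u_max <- sqrt(f(0))            # sup of sqrt(f), here 1
v_max <- sqrt(2/exp(1))        # sup of |x|sqrt(f(x)), attained at x = sqrt(2)
n <- 1e5
u <- runif(n, 0, u_max)
v <- runif(n, -v_max, v_max)
keep <- u^2 <= f(v/u)          # (u,v) falls inside the set A
x <- (v/u)[keep]               # accepted ratios are N(0,1) draws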

Similarly, since the simulation from ƒ(v/u) can also be derived [check Luc Devroye’s Non-uniform random variate generation in the exercise section 7.3] from a uniform on the set B of (u,v)’s in R⁺×X such that 0≤u≤ƒ(v+u), on the set C of (u,v)’s in R⁺×X such that 0≤u³≤ƒ(v/√u)², or on the set D of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), which is actually exactly the same as A [and presumably many other versions!, for which I would like to guess the generic rule of construction], there are many sets on which one can consider running simulations. And one to pick for optimality?! Here are the three sets for a mixture of two normal densities:

For instance, assuming slice sampling is feasible on every one of those three sets, which one is the most efficient? While I have no clear answer to this question, I found on Sunday night that a generic family of transforms is indexed by a differentiable  monotone function h over the positive half-line, with the uniform distribution being taken over the set

\mathfrak{H}=\left\{(u,v);\ 0\le u\le h(f(v/g(u)))\right\}

when the primitive G of g is the inverse of h, i.e., G(h(x))=x. [Here are the slides I gave at the Warwick reading group on Devroye’s book:]


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


RTutor: CO2 Trading and Risk of Firm Relocation


(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

Many economists would agree that the most efficient way to fight global warming would be a world-wide tax or an emission trading system for greenhouse gases. Yet, if only a part of the world implements such a scheme, a reasonable concern is that firms may decide to relocate to other parts of the world, causing job losses and less effective emission reduction.

The European Union addressed this concern in its carbon emission trading system by not auctioning off all emission permits, but granting free emission permits to facilities in economic sectors characterized by high trade intensity or high carbon intensity. It is true that freely allocated permits also provide incentives to reduce carbon emissions (opportunity costs are still equal to the price at which permits are traded). Yet, there are reasons, e.g. fiscal income, to limit the amount of freely given permits.

In their article ‘Industry Compensation under Relocation Risk: A Firm-Level Analysis of the EU Emissions Trading Scheme’ (American Economic Review, 2014), Ralf Martin, Mirabelle Muûls, Laure B. de Preux and Ulrich J. Wagner study the most efficient way to allocate a fixed amount of free permits among facilities in order to minimize the risk of job losses or carbon leakage. Given their available data, they establish simple alternative allocation rules that can be expected to substantially outperform the current allocation rules used by the EU.

As part of his Master’s Thesis at Ulm University, Benjamin Lux has generated a very nice RTutor problem set that allows you to replicate the insights of the paper in an interactive fashion. You learn about the data and institutional background, run explorative regressions and dig into the very well explained optimization procedures to find efficient allocation rules. At the same time you learn some R tricks, like effective usage of some dplyr functions.



As in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked and you can get hints on how to proceed. In addition, you are challenged by very well designed quizzes.

To install the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/b-lux/RTutorCarbonLeakage
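
Put together, the local installation boils down to something like this (a sketch assuming both packages install cleanly from GitHub; see the RTutor page for the full set of dependencies):

# install.packages("devtools")
devtools::install_github("skranz/RTutor")
devtools::install_github("b-lux/RTutorCarbonLeakage")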

There is also an online version hosted on shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours of total usage time per month, so it may be greyed out when you click it.)

https://b-lux.shinyapps.io/RTutorCarbonLeakage/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


Download product information and reviews from Amazon.com


(This article was first published on Renglish – 56north | Skræddersyet dataanalyse, and kindly contributed to R-bloggers)

Rmazon

The goal of Rmazon is to help you download product information and reviews from Amazon.com easily.

Installation

You can install Rmazon from github with:

# install.packages("devtools")
devtools::install_github("56north/Rmazon")

Example – product information

This is a basic example which shows you how to get product information:

# Get product information for 'The Art of R Programming: A Tour of Statistical Software Design'

product_info <- Rmazon::get_product_info("1593273843")

Example – product reviews

This is a basic example which shows you how to get reviews:

# Get reviews for 'The Art of R Programming: A Tour of Statistical Software Design'

reviews <- Rmazon::get_reviews("1593273843")

To leave a comment for the author, please follow the link and comment on their blog: Renglish – 56north | Skræddersyet dataanalyse.


ratio-of-uniforms [#3]


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Being still puzzled (!) by the ratio-of-uniforms approach, mostly failing to catch its relevance for either standard distributions in an era when computing a cosine or an exponential is negligible, or non-standard distributions for which computing bounds and boundaries is out-of-reach, I kept searching for solutions that would include unbounded densities and still produce compact boxes, as this seems essential for accept-reject simulation if not for slice sampling. And after exploring some dead-ends (in tune with running in Venezia!), I came upon the case of the generalised logistic transform

h(\omega)=\omega^a/(1+\omega^a)

which ensures that the [ratio-of-almost-uniform] set I defined in my slides last week

\mathfrak{H}=\left\{(u,v);\ 0\le u\le h(f(v/g(u)))\right\}

is bounded in u. Since the transform g is the derivative of the inverse of h (!),

g(y)=a^{-1}y^{(1-a)/a}/(1-y)^{(1-3a)/a}

the parametrisation of the boundary of H is

u(x)=f(x)^a/(1+f(x)^a)\ v(x)=a^{-1}xf(x)^{(a-1)/a}(1+f(x)^a)^2

which means it remains bounded if (a) a≤1 [to ensure boundedness at infinity] and (b) the limit of v(x) at zero [where I assume the asymptote stands] is bounded. Meaning

\lim_{x\to 0} xf(x)^{2a+1/a-1}<\infty

For instance, this holds for Gamma distributions with shape parameter larger than ½…

Working a wee bit more on the problem led me to realise that resorting to an arbitrary cdf Φ instead of the logistic one could solve the problem for most distributions, including all Gammas. Indeed, the boundary of H is now

u(x)=\Phi(f(x))^a\ v(x)=a^{-1}xf(x)^{(a-1)/a}/\varphi(f(x))

which means it remains bounded if φ has very heavy tails, like 1/x², to handle the explosion when x=0, and an asymptote itself at zero, to handle the limit at infinity when f(x) goes to zero.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


I’ve started writing a ‘book’: Functional programming and unit testing for data munging with R


(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I have started writing a ‘book’ using the awesome bookdown package. In the book I explain and show why using functional programming and putting your functions in your own packages is the way to go when you want to clean, prepare and transform large data sets. It makes testing and documenting your code easier. You don’t need to think about managing paths either. The book is far from complete, but I plan on working on it steadily. For now, you can read an intro to functional programming, unit testing and creating your own packages that will hold your code. I also show how you can write documentation for your functions. I am also looking for feedback, so if you have any suggestions, do not hesitate to shoot me an email or a tweet! You can read the book by clicking here.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


New R package: packagedocs


(This article was first published on Ryan Hafen, and kindly contributed to R-bloggers)

I’m pleased to announce the CRAN release of packagedocs which provides a mechanism for simple generation and automated deployment of nice-looking online R package documentation that plugs into the traditional R package vignette system.

You can see some examples of documentation generated with packagedocs here, here, and here.

A feature list and usage guide is provided below, but it’s probably easiest to get a feel for what the package does by watching this short screencast:

Brief History

packagedocs has been around in one form or another for quite some time and is a successor of, heavily influenced by, and borrows from Hadley Wickham’s staticdocs package. packagedocs was conceived from the desire to have a package’s vignettes and function reference all rendered into a single website with several features for convenience and utility which I’ll list below. We’ve been using it for a few years to render the documentation for packages in our DeltaRho project. Recently Barret Schloerke jumped in and revamped the package to improve and add several features and get it ready to release on CRAN. I didn’t realize until writing this post that staticdocs, which has been relatively dormant for the past few years, is now being actively revamped, and is now known as pkgdown, also worth checking out.

Features

Here are some of the features of the package:

  • All documentation is generated from a single RMarkdown file, "vignettes/docs.Rmd"
  • Documentation is nicely styled and responsive for mobile viewing with a collapsible auto-scrolling table of contents
  • Simple Github / TravisCI hooks are provided for automatically building and deploying documentation to your github pages branch after each commit
  • Once configured, by default any commits pushed to the master branch of repository https://github.com/username/reponame will have docs made automatically available at https://username.github.io/reponame
  • Valid R vignettes are generated that point to the live version of the docs
  • Automatic generation of all R object and function documentation, called the “function reference”
  • Examples in the function reference are evaluated and the output, including graphics, is rendered inline with the documentation
  • The function reference can be organized into groups with custom headings using a yaml configuration file
  • A convenience function is provided for linking references to functions in your vignette directly to the associated function documentation on the generated function reference page
  • Helper functions to initialize, run, and set up your docs for Github deployment
  • Github pages branch is stomped on each commit to prevent repository bloat – NOTE: this is important – if you use this package, make sure you don’t care about version control history in your gh-pages branch!

Installation

From CRAN:

install.packages("packagedocs")

From Github:

devtools::install_github("hafen/packagedocs")

Usage

There are three main functions.

To initialize your packagedocs documentation:

# in current package directory
packagedocs::init_vignettes()

This will create some files in your package’s "vignettes" directory. Edit "vignettes/docs.Rmd" and then, to generate your vignettes, run:

packagedocs::build_vignettes()

To set up your repository to automatically build and deploy to your github pages branch on every commit:

packagedocs::use_travis()

More detail about how to use the package is found in the package’s documentation, generated of course using packagedocs!

To leave a comment for the author, please follow the link and comment on their blog: Ryan Hafen.



Feuilleton


(This article was first published on RStudio, and kindly contributed to R-bloggers)


by Joseph Rickert

Here we offer ephemera, a little light reading and some more challenging material. We hope that at least some of it will become the “talk of the town”.

Worth Reading

Worth a Look

R Resources

100k top Airbnb trips

To leave a comment for the author, please follow the link and comment on their blog: RStudio.


New Expansion of the R Course Finder!


(This article was first published on R-exercises, and kindly contributed to R-bloggers)


On the 1st of September we launched R Course Finder, an online directory that helps you find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people compare these courses before they decide where to spend their valuable time and (sometimes) money.

If you haven’t looked at it yet, go to the R Course Finder now by clicking here.

Over the past month we have further expanded the courses available in the Course Finder. Currently we are at 118 courses on 12 different platforms, and 2 offline Learning Institutes.

We expanded the Course Finder across nearly all the platforms! There were also some courses we were excited about and wanted to highlight:

Highlighted Courses

  • Advanced R Programming
  • This course offered by Coursera is part of the ‘Mastering Software Development in R Specialization’. When indexing it and reading through the syllabus, it got me excited to follow the complete specialization. We also added the other courses, which include one on package building!

  • Statistics with R – Advanced Level
  • This is a Udemy course that starts off somewhat easy, with ANOVA and other mean comparison techniques, but the expert level is reflected in the quick pace and fast steps to more advanced material. Furthermore, you can also find the beginner level in the Course Finder here

But we want to keep going! If you miss a course or know of a different platform, let us know, so we can keep adding to the most complete directory of R courses available online.

How you can help to make R Course Finder better

  • If you miss a course that is not included yet, please post a reminder in the comments and we’ll add it.
  • If you miss an important filter or search functionality, please let us know in the comments below.
  • If you already took one of the courses, please let all of us know about your experiences in the review section, an example is available here.

And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Shiny Server (Pro) 1.5


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Shiny Server 1.5.1.834 and Shiny Server Pro 1.5.1.760 are now available.

The Shiny Server 1.5.x release family upgrades our underlying Node.js engine from 0.10.47 to 6.9.1. The impetus for this change was not stability or performance, but the fact that the 0.10.x release family has reached the end of its life.

We highly recommend that you test on a staging server before upgrading production Shiny Server 1.4.x machines to 1.5. You should always do this for any production-critical software, but it’s particularly important for this release, due to the magnitude of changes to Node.js that we’ve absorbed in one big gulp. (We’ve done thorough end-to-end testing of this release, but there’s no substitute for testing with your own apps, on your own servers.)

Some small bug fixes are also included in this release. See the release notes for more details.

The beginning of the end for Ubuntu 12.04 and Red Hat 5

While we still support Ubuntu 12.04 and Red Hat 5 today, we’ll be moving on from these very old releases in a few months. Both of these distributions will end-of-life in April 2017, and will stop receiving bug fixes and security fixes from their vendors at that time. If you’re using Shiny Server with one of these platforms, we recommend that you start planning your upgrade.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


ShinyProxy 0.7.0


(This article was first published on Open Analytics - Blog, and kindly contributed to R-bloggers)

Friday 11 November 2016 – 19:06

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise or larger organizations.

Our previous post on the how and why of ShinyProxy triggered a lot of encouraging reactions. Here’s our favorite:

[Screenshot of the tweet]

Indeed, choosing Docker opens a world of possibilities for ShinyProxy, and no longer being dependent on a particular version of R or Shiny is only one of the advantages.

We also received a number of useful suggestions and decided to quickly release the new features and fixes as version 0.7.0. Here are the most important ones:

  • allow one user to open multiple applications as requested by this Github issue
  • optional display of logos for apps on the landing page using a new configuration field logo-url
  • fix spurious error message on Jetty ALPN support

Documentation has been updated on the project homepage and as always community support on this new release is available on our support site.

Keep the suggestions coming and have fun with ShinyProxy!

This post is about:

r, shiny, shinyproxy

To leave a comment for the author, please follow the link and comment on their blog: Open Analytics - Blog.


You should re-encode high cardinality categorical variables


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

[Figure: model performance (pseudo R-squared) on training versus held-out test data]

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.

Please read on for how to fix this.

First: remember the disasters you see are better than those you don’t. In the synthetic data we see failure to model a relation (even though there is one, by design). But it could easily be that some column lurking in a complex model is quietly degrading model performance, without being detected by fully ruining the model.

The reason Nina and I have written so much on the possible side-effects of re-encoding high cardinality categorical variables is that you don’t want to introduce more problems as you attempt to fix things. Also once you intervene, by supplying advice or a solution, you feel everything will be your fault. That being said, here is our advice:

Re-encode high cardinality categorical variables using impact or effects based ideas, as we describe and implement in the vtreat R library.
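
To make that concrete, here is a minimal sketch of impact-coding a high cardinality variable with vtreat; the data frame and column names are invented for illustration, and for production work you would use the package's cross-frame variants to avoid the nested model bias discussed below:

library(vtreat)

set.seed(2016)
d <- data.frame(
  zip = sample(sprintf("z%03d", 1:300), 2000, replace = TRUE),   # a 300-level categorical variable
  y   = rnorm(2000)
)

is_cal  <- seq_len(nrow(d)) <= 1000                  # calibration / application split
plan    <- designTreatmentsN(d[is_cal, ], varlist = "zip", outcomename = "y")
d_treat <- prepare(plan, d[!is_cal, ])
head(d_treat)   # numeric re-encodings such as zip_catN (the impact code)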

Get your data science, predictive analytics, or machine learning house in order by fixing how you are treating incoming features and data. This is where the largest opportunities for improvement are available in real-world applications. In particular:

  • Do not ignore large cardinality categorical variables.
  • Do not blindly add large cardinality categorical variables to your model.
  • Do not hash-encode large cardinality categorical variables.
  • Consider using large cardinality categorical variables as join keys to pull in columns from external data sets.

Our advice: use vtreat. Going forward, you will more and more often be competing with models that use this library or similar concepts.

Once you have gotten to this level of operation then worry (as we do) about the statistical details of which processing steps are justifiable, safe, useful, and best. That is the topic we have been studying and writing on in depth (we call the potential bad issues over-fitting and nested model bias). Articles include:

Or invite us in to re-present one of our talks or work with your predictive analytics or data science team to adapt these techniques to your workflow, software, and problem domain. We have gotten very good results with the general methods in our vtreat library, but knowing a specific domain or problem structure can often let you do much more (for example: Nina’s work on y-aware scaling for geometric problems such as nearest neighbor classification and clustering).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

