
sparklyr 1.3: Higher-order Functions, Avro and Custom Serializers


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

sparklyr 1.3 is now available on CRAN, with the following major new features:

  • Higher-order functions for working with Spark SQL arrays and structs
  • Built-in support for Avro data sources
  • Custom serialization of Spark data frames using reader and writer functions written in R

To install sparklyr 1.3 from CRAN, run

install.packages("sparklyr")

In this post, we shall highlight some major new features introduced in sparklyr 1.3, and showcase scenarios where such features come in handy. While a number of enhancements and bug fixes (especially those related to spark_apply(), Apache Arrow, and secondary Spark connections) were also an important part of this release, they will not be the topic of this post, and it will be an easy exercise for the reader to find out more about them from the sparklyr NEWS file.

Higher-order Functions

Higher-order functions are built-in Spark SQL constructs that allow user-defined lambda expressions to be applied efficiently to complex data types such as arrays and structs. As a quick demo to see why higher-order functions are useful, let’s say one day Scrooge McDuck dove into his huge vault of money and found large quantities of pennies, nickels, dimes, and quarters. Having an impeccable taste in data structures, he decided to store the quantities and face values of everything into two Spark SQL array columns:

library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4.5")

coins_tbl <- copy_to(
  sc,
  tibble::tibble(
    quantities = list(c(4000, 3000, 2000, 1000)),
    values = list(c(1, 5, 10, 25))
  )
)

This declares his holdings: 4,000 pennies, 3,000 nickels, 2,000 dimes, and 1,000 quarters. To help Scrooge McDuck calculate the total value of each type of coin in sparklyr 1.3 or above, we can apply hof_zip_with(), the sparklyr equivalent of ZIP_WITH, to the quantities and values columns, combining pairs of elements from the arrays in both columns. As you might have guessed, we also need to specify how to combine those elements, and what better way to accomplish that than the concise one-sided R formula ~ .x * .y, which says we want (quantity * value) for each type of coin? So, we have the following:

result_tbl <- coins_tbl %>%
  hof_zip_with(~ .x * .y, dest_col = total_values) %>%
  dplyr::select(total_values)

result_tbl %>% dplyr::pull(total_values)
[1]  4000 15000 20000 25000

With the result 4000 15000 20000 25000 telling us there are in total $40 worth of pennies, $150 worth of nickels, $200 worth of dimes, and $250 worth of quarters, as expected.

Using another sparklyr function named hof_aggregate(), which performs an AGGREGATE operation in Spark, we can then compute the net worth of Scrooge McDuck based on result_tbl, storing the result in a new column named total. Notice that for this aggregate operation to work, we need to ensure the starting value of the aggregation has a data type (namely, BIGINT) that is consistent with the data type of total_values (which is ARRAY<BIGINT>), as shown below:

result_tbl %>%
  dplyr::mutate(zero = dplyr::sql("CAST (0 AS BIGINT)")) %>%
  hof_aggregate(start = zero, ~ .x + .y, expr = total_values, dest_col = total) %>%
  dplyr::select(total) %>%
  dplyr::pull(total)
[1] 64000

So Scrooge McDuck's net worth is $640.

Other higher-order functions supported by Spark SQL so far include transform, filter, and exists, as documented here, and, similar to the example above, their counterparts (namely, hof_transform(), hof_filter(), and hof_exists()) all exist in sparklyr 1.3, so that they can be integrated with other dplyr verbs in an idiomatic manner in R.
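As a quick illustration, here is a minimal sketch chaining hof_filter() and hof_transform() with dplyr verbs on the coins_tbl from above; the destination column names (large_quantities, quantities_in_dozens) and the 1,500-coin cutoff are purely illustrative assumptions, not part of the original example:

coins_tbl %>%
  # keep only array elements greater than 1500 (an arbitrary cutoff)
  hof_filter(~ .x > 1500, expr = quantities, dest_col = large_quantities) %>%
  # then transform each remaining element, e.g. re-expressing counts in dozens
  hof_transform(~ .x / 12, expr = large_quantities, dest_col = quantities_in_dozens) %>%
  dplyr::select(large_quantities, quantities_in_dozens)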

Avro

Another highlight of the sparklyr 1.3 release is its built-in support for Avro data sources. Apache Avro is a widely used data serialization protocol that combines the efficiency of a binary data format with the flexibility of JSON schema definitions. To make working with Avro data sources simpler, in sparklyr 1.3, as soon as a Spark connection is instantiated with spark_connect(..., package = "avro"), sparklyr will automatically figure out which version of spark-avro package to use with that connection, saving a lot of potential headaches for sparklyr users trying to determine the correct version of spark-avro by themselves. Similar to how spark_read_csv() and spark_write_csv() are in place to work with CSV data, spark_read_avro() and spark_write_avro() methods were implemented in sparklyr 1.3 to facilitate reading and writing Avro files through an Avro-capable Spark connection, as illustrated in the example below:

library(sparklyr)

# The `package = "avro"` option is only supported in Spark 2.4 or higher
sc <- spark_connect(master = "local", version = "2.4.5", package = "avro")

sdf <- sdf_copy_to(
  sc,
  tibble::tibble(
    a = c(1, NaN, 3, 4, NaN),
    b = c(-2L, 0L, 1L, 3L, 2L),
    c = c("a", "b", "c", "", "d")
  )
)

# This example Avro schema is a JSON string that essentially says all columns
# ("a", "b", "c") of `sdf` are nullable.
avro_schema <- jsonlite::toJSON(
  list(
    type = "record",
    name = "topLevelRecord",
    fields = list(
      list(name = "a", type = list("double", "null")),
      list(name = "b", type = list("int", "null")),
      list(name = "c", type = list("string", "null"))
    )
  ),
  auto_unbox = TRUE
)

# persist the Spark data frame from above in Avro format
spark_write_avro(sdf, "/tmp/data.avro", as.character(avro_schema))

# and then read the same data frame back
spark_read_avro(sc, "/tmp/data.avro")
# Source: spark<?> [?? x 3]
      a     b c
  <dbl> <int> <chr>
1     1    -2 "a"
2   NaN     0 "b"
3     3     1 "c"
4     4     3 ""
5   NaN     2 "d"

Custom Serialization

In addition to commonly used data serialization formats such as CSV, JSON, Parquet, and Avro, starting from sparklyr 1.3, customized data frame serialization and deserialization procedures implemented in R can also be run on Spark workers via the newly implemented spark_read() and spark_write() methods. We can see both of them in action through a quick example below, where saveRDS() is called from a user-defined writer function to save all rows within a Spark data frame into 2 RDS files on disk, and readRDS() is called from a user-defined reader function to read the data from the RDS files back to Spark:

library(sparklyr)

sc <- spark_connect(master = "local")

sdf <- sdf_len(sc, 7)
paths <- c("/tmp/file1.RDS", "/tmp/file2.RDS")

spark_write(sdf, writer = function(df, path) saveRDS(df, path), paths = paths)

spark_read(sc, paths, reader = function(path) readRDS(path), columns = c(id = "integer"))
# Source: spark<?> [?? x 1]
     id
  <int>
1     1
2     2
3     3
4     4
5     5
6     6
7     7

Other Improvements

Sparklyr.flint

Sparklyr.flint is a sparklyr extension that aims to make functionalities from the Flint time-series library easily accessible from R. It is currently under active development. One piece of good news is that, while the original Flint library was designed to work with Spark 2.x, a slightly modified fork of it will work well with Spark 3.0, and within the existing sparklyr extension framework. sparklyr.flint can automatically determine which version of the Flint library to load based on the version of Spark it’s connected to. Another bit of good news is, as previously mentioned, sparklyr.flint doesn’t know too much about its own destiny yet. Maybe you can play an active part in shaping its future!

EMR 6.0

This release also features a small but important change that allows sparklyr to correctly connect to the version of Spark 2.4 that is included in Amazon EMR 6.0.

Previously, sparklyr automatically assumed any Spark 2.x it was connecting to was built with Scala 2.11 and attempted to load any required Scala artifacts built with Scala 2.11 as well. This became problematic when connecting to Spark 2.4 from Amazon EMR 6.0, which is built with Scala 2.12. Starting from sparklyr 1.3, this problem can be fixed by simply specifying scala_version = "2.12" when calling spark_connect(), as sketched below.
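A minimal sketch of such a connection call, using the yarn-client master from the original example (adapt it to your own cluster):

library(sparklyr)

# Connect to Spark 2.4 on Amazon EMR 6.0, which is built with Scala 2.12
sc <- spark_connect(master = "yarn-client", scala_version = "2.12")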

Spark 3.0

Last but not least, it is worth mentioning that sparklyr 1.3.0 is known to be fully compatible with the recently released Spark 3.0. We highly recommend upgrading your copy of sparklyr to 1.3.0 if you plan to have Spark 3.0 as part of your data workflow in the future.

Acknowledgement

In chronological order, we want to thank the following individuals for submitting pull requests towards sparklyr 1.3:

We are also grateful for valuable input on the sparklyr 1.3 roadmap, #2434, and #2551 from @javierluraschi, and insightful advice on #1773 and #2514 from @mattpollock and @benmwhite.

Please note if you believe you are missing from the acknowledgement above, it may be because your contribution has been considered part of the next sparklyr release rather than part of the current release. We do make every effort to ensure all contributors are mentioned in this section. In case you believe there is a mistake, please feel free to contact the author of this blog post via e-mail (yitao at rstudio dot com) and request a correction.

If you wish to learn more about sparklyr, we recommend visiting sparklyr.ai, spark.rstudio.com, and some of the previous release posts such as sparklyr 1.2 and sparklyr 1.1.

Thanks for reading!

This post was originally posted on blogs.rstudio.com/ai/



Learn R for Data Analysis with Our New R Courses!


[This article was first published on r-promote-feed – Dataquest, and kindly contributed to R-bloggers.]

Exciting news! We know that R is one of the most important programming languages for anyone who wants to learn data analysis and data science. That’s why we’ve just launched four new R courses — a complete revamp of the first step in our Data Analyst in R career path!

These four new courses are designed to make it easier for you to start learning R from scratch and help you build a better foundation in R programming.

As with all of our courses, they’re also designed to keep you motivated by getting you working hands-on, writing real code and working with real data from day one.

The four new courses are:

The new courses also introduce new guided projects to help you synthesize your new R skills by building real data analysis projects as you learn.

We’ll help you get a local R environment set up on your machine, and then guide you through projects analyzing COVID-19 trend data and looking at books sales data to glean insights about marketing campaigns and the impact reviews have on sales.

In fact, we’re so excited about these new R programming courses and data analysis projects that we’re doing something we’ve never done before:

Learn R for FREE: July 20-27

For a full week, every course in our Data Analyst in R path will be free. This includes the four new courses and all of our existing R courses, including courses on SQL, statistics, and probability.

For one full week, the paywall is completely down, and there are no restrictions! You can complete as many courses as you like, and you’ll earn certificates for each course you complete, just as a subscriber would.

If you haven’t tried working with R before, or you haven’t tried Dataquest before, there’s no better time to try us out! Sign up now— the new courses are already free as a preview, so you don’t have to wait to get started!


I needed a resource for beginners; something to walk me through the basics with clear, detailed instructions. That is exactly what I got in Dataquest’s Introduction to R course.

Because of Dataquest, I started graduate school with a strong foundation in R, which I use every day while working with data.

Ryan Quinn, Doctoral Student at Boston University

How We Teach

Dataquest is different from other online education platforms you may have tried. One of the biggest differences you'll notice is that we don't teach with videos.

We’ve written about some of the science behind this before, but here’s the short version: students who learn hands-on simply perform better than students who learn from video.


All of our courses, including our R courses, are presented like this: a text window on one side that introduces a new concept, and a coding window on the other side where you can immediately experiment and apply what you’ve learned.

This short feedback loop of learning a little bit, applying it, adding a bit more, applying that, and so on, is a core part of our learning platform, and we believe this is the most effective way to teach and learn R.

We want to teach students real-world job skills, which is why we aim to teach the tools data analysts are actually using in the real world. In our R courses, that means getting students comfortable with RStudio, the industry standard tool for working with R.

We know that motivation is also important, which is why all of our courses will get you working with real-world data and doing real data science tasks almost immediately in our first course.

Subsequent courses all make use of new and interesting data sets and ask you to solve real-world data analytics problems while you’re learning the programming skills.

When you reach the end of each course, you’ll be asked to synthesize what you’ve learned by undertaking one of our guided projects.

These are data science projects designed to help you practice your new skills even as you start to build up your data science portfolio.

And while our instructions will help point you in the right direction if you get lost, guided projects are designed to be open-ended, so you can make them completely your own, and take them as far as you’d like.

Why Learn R?

Although Python is a popular data science language, R is also increasingly popular. Either language is a great option for learning data science (here’s a head-to-head comparison of how they handle data science tasks), but learning R will open up a variety of data science positions to you whether or not you’ve already learned some Python.

Almost all of the top tech companies hire R users for data analytics and data science. And because R was originally designed with advanced statistics in mind, some basic data analytics and statistical operations are simpler in R than they are in Python. R also has a very welcoming and helpful online community (using #rstats on Twitter), and some really great open-source packages and libraries for data science (including tidyverse packages like ggplot2 and dplyr).

Everyone who works in data can benefit from learning some R, and with our Data Analyst in R path, it’s now easier than ever to get started. Start learning one of the fastest-growing languages in data science right now, and in five minutes you’ll have written your first R code and be on your way to learning R.

Charlie Custer

Charlie is a student of data science, and also a content marketer at Dataquest. In his free time, he’s learning to mountain bike and making videos about it.

The post Learn R for Data Analysis with Our New R Courses! appeared first on Dataquest.


Creating custom neural networks with nnlib2Rcpp


[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]

For anyone interested, this is a post about creating arbitrary, new, or custom Neural Networks (NN) using the nnlib2Rcpp R package. I apologize for the bare format of this post, but for three days now I have had issues with the online visual text editor.

Let's return to the nnlib2Rcpp R package now. While this package does not even attempt to compete with other popular NN frameworks in terms of features or processing performance, it provides some useful NN models that are ready to use. Furthermore, it can be a versatile basis for experimentation with new or custom NN models, which is what this brief tutorial is about. A warning is necessary at this point:

Warning: the following text contains traces of C++ code. If you are in any way sensitive to consuming C++, please abandon reading immediately.

The NN models in nnlib2Rcpp are created using a collection of C++ classes written for creating NN models called nnlib2. A cut-down class-diagram of the classes (and class-templates) in this collection can be found here. The most important class in the collection is "component" (for all components that constitute a NN). Objects of class "component" can be added to a NN "topology" (hosted in objects of class "nn") and interact with each other. Layers of processing nodes (class "layer"), groups of connections (class "connection_set"), and even entire neural nets (class "nn") are based on class "component". When implementing new components, it is also good to remember that:

– Objects of class “layer” contain objects of class “pe” (processing elements [or nodes]).

– Template “Layer” simplifies creation of homogeneous “layer” sub-classes containing a particular “pe” subclass (i.e. type of nodes).

– Objects of class “connection_set” contain objects of class “connection”.

– Template “Connection_Set” simplifies creation of homogeneous “connection_set” sub-classes containing a particular “connection” subclass (i.e. type of connections).

– Customized and modified NN components and sub-components are to be defined based on these classes and templates.

– All aforementioned classes have an “encode” (training) and a “recall” (mapping) method; both are virtual and can be overridden with custom behavior. Calling “nn” “encode” triggers the “encode” function of all the components in its topology which, in turn, triggers “encode” for “pe” objects (processing nodes) in a “layer” or “connection” objects in a “connection_set”. Similarly for “recall”.

The NN to create in this example will be based on the Perceptron, the most classic of them all. It is not yet implemented in nnlib2Rcpp, so in this example we will play the role of Prof. Rosenblatt and his team [6] and implement a simple multi-class Perceptron ourselves. Unlike Prof. Rosenblatt, you do not have to invent it: you can find information about it on its Wikipedia page [1]. We will implement a simplified (not 100% scientifically sound) variation, with no bias, a fixed learning rate (at 0.3), and connection weights initialized to 0.

Let’s add it to nnlib2Rcpp.

Step 1: setup the tools needed.

To follow this example, you will need to have Rtools [2] and the Rcpp R package [3] installed, and the nnlib2Rcpp package source (version 0.1.4 or above). This can be downloaded from CRAN [4] or the latest version can be found on github [5]. If fetched or downloaded from github, the nnlib2Rcpp.Rproj is a project file for building the package in RStudio. I recommend getting the source from github, unpacking it (if needed) in a directory and then opening the aforementioned nnlib2Rcpp.Rproj in Rstudio. You can then test-build the unmodified package; if it succeeds you can proceed to the next step, adding your own NN components.

Step 2: define the model-specific components and sub-components.

Open the “additional_parts.h” C++ header file found in sub-directory “src” and create the needed classes. Much of the default class behavior is similar to what is required for a Perceptron, so we will focus on what is different and specific to the model. We will need to define (in the “additional_parts.h” file) the following:

(a) a “pe” subclass for the Perceptron processing nodes. All “pe” objects provide three methods, namely “input_function”, “activation_function”, and “threshold_function”; by default, each is applied to the result of the previous one, except for the “input_function” which gathers (by default sums) all incoming values and places result on the internal register variable “input”. The sequence of executing these methods is expected to place the final result in “pe” variable “output”. You may choose (or choose not) to modify these methods if this fits your model and implementation approach. You may also choose to modify “pe” behavior in its “encode” and/or “recall” functions, possibly bypassing the aforementioned methods completely. It may help to see the “pe.h” header file (also in directory “src”) for more insight on the base class. In any case, a C++ implementation for Perceptron processing nodes could be:

class perceptron_pe : public pe
{
public:

DATA threshold_function(DATA value)
 {
 if(value>0) return 1;
 return 0;
 }
};

(b) Next you may want to define a class for layers consisting of “perceptron_pe” objects as defined above; this can be done quickly using the template “Layer”:

typedef Layer< perceptron_pe > perceptron_layer;

(c) Moving on to the connections now. Notice that in a Perceptron, connections are the only elements modified (by updating their weights) during encoding. Among other functionality, each connection knows its source and destination nodes, maintains and updates the weight, modifies transferred data, etc. So a C++ class for such Perceptron connections could be:

class perceptron_connection: public connection
{
public:

// mapping, multiply value coming from source node by weight
// and send it to destination node.
void recall()
 {
 destin_pe().receive_input_value( weight() * source_pe().output );
 }

// training, weights are updated:
void encode()
 {
 weight() = weight() + 0.3 * (destin_pe().input - destin_pe().output) * source_pe().output;
 }
};

For simplicity, during training the learning rate is fixed at 0.3, and the connection assumes that the desired output values will be placed in the "input" registers of the destination nodes before the weights are updated. Note: for compatibility with nnlib2Rcpp version 0.1.4 (the current version on CRAN), the example above assumes that the desired values are placed as input to the processing nodes right before the update of weights (encoding); version 0.1.5 and above provides direct access from R to the "misc" variables that nodes and connections maintain (via the "NN" method "set_misc_values_at"; more on "NN" below). It may have been more elegant to use these "misc" variables for holding desired output in processing nodes instead of "input".

(d) Next, you may want to define a class for groups of such connections, which can be done quickly using the template “Connection_Set”:

typedef Connection_Set< perceptron_connection > perceptron_connection_set;

Step 3: Add the ability to create such components at run-time.

Again in the "additional_parts.h" C++ header file found in directory "src", add code that creates Perceptron layers and groups of connections when a particular name is used. Locate the "generate_custom_layer" function and add to it the line:

if(name == "perceptron") return new perceptron_layer(name,size);

(you will notice other similar definitions are already there). Finally, locate the “generate_custom_connection_set” function and add to it the line:

if(name == "perceptron") return new perceptron_connection_set(name);

(again, you will notice other similar definition examples are already there). Note: as of July 15, 2020 all the aforementioned changes to “additional_parts.h” C++ header are already implemented as an example in the nnlib2Rcpp version 0.1.5 github repo [5].

That is it. You can now build the modified library and then return to the R world to use your newly created Perceptron components in a NN. The “NN” R module in nnlib2Rcpp allows you to combine these (and other) components in a network and then use it in R.

It is now time to see if this cut-down modified Perceptron is any good. In the example below, the iris dataset is used to train it. The example uses the "NN" R module in nnlib2Rcpp to build the network and then trains and tests it. The network topology consists of a generic input layer (component #1) of size 4, i.e. as many nodes as the iris features; a set of connections (component #2) whose weights are initialized to 0 (in create_connections_in_sets below); and a processing layer (component #3) of size 3, i.e. as many nodes as the iris species:

library("nnlib2Rcpp")

# create the NN and define its components
p <- new("NN")
p$add_layer("generic",4)
p$add_connection_set("perceptron")
p$add_layer("perceptron",3)
p$create_connections_in_sets(0,0)

# show the NN topology
p$outline()

# prepare some data based on iris dataset
data_in <- as.matrix(iris[1:4])
iris_cases <- nrow(data_in)
species_id <- unclass(iris[,5])
desired_data_out <- matrix(data=0, nrow=iris_cases, ncol=3)
for(c in 1:iris_cases) desired_data_out[c,species_id[c]]=1

# encode data and desired output (for 30 training epochs)
for(i in 1:30)
for(c in 1:iris_cases)
{
p$input_at(1,data_in[c,])
p$recall_all(TRUE)
p$input_at(3,desired_data_out[c,])
p$encode_at(2)
}

# show the NN
p$print()

# Recall the data to see what species Perceptron returns:
for(c in 1:iris_cases)
{
p$input_at(1,data_in[c,])
p$recall_all(TRUE)
cat("iris case ",c,", desired = ", desired_data_out[c,], " returned = ", p$get_output_from(3),"\n")
}

Checking the output one sees that our Perceptron variation is not THAT bad. At least it recognizes Iris setosa and virginica quite well. However, classification performance on versicolor cases is rather terrible.

iris case 1 , desired = 1 0 0 returned = 1 0 0
iris case 2 , desired = 1 0 0 returned = 1 0 0
iris case 3 , desired = 1 0 0 returned = 1 0 0
…
iris case 148 , desired = 0 0 1 returned = 0 0 1
iris case 149 , desired = 0 0 1 returned = 0 0 1
iris case 150 , desired = 0 0 1 returned = 0 0 1

Anyway, this example was not about classification success but about creating a new NN type in the nnlib2Rcpp R package. I hope it will be useful to some of you out there.

Links (all accessed July 12, 2020):

[1] Perceptron: https://en.wikipedia.org/w/index.php?title=Perceptron&oldid=961536136

[2] RTools: https://cran.r-project.org/bin/windows/Rtools/history.html

[3] Rcpp package: https://cran.r-project.org/web/packages/Rcpp/index.html

[4] nnlib2Rcpp package on CRAN: https://cran.r-project.org/web/packages/nnlib2Rcpp/index.html

[5] nnlib2Rcpp package on github: https://github.com/VNNikolaidis/nnlib2Rcpp

[6] Frank Rosenblatt: https://en.wikipedia.org/wiki/Frank_Rosenblatt

PS. Time permitting, more components will be added to the collection (and added to nnlib2Rcpp), maybe accompanied by posts similar to this one; these will eventually be available in the package. Any interesting or useful NN component that you would like to contribute is welcomed (credit, of course, will go to you, its creator); if so, please contact me using the comments below. (Another project is to create parallel-processing versions of the components, if anyone wants to help).



nnetsauce version 0.5.0, randomized neural networks on GPU


[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers.]

nnetsauce is a general-purpose tool for Statistical/Machine Learning, in which pattern recognition is achieved by using quasi-randomized networks. A new version, 0.5.0, is out on PyPI and for R:

  • Install by using pip (stable version):
pip install nnetsauce --upgrade
  • Install from Github (development version):
pip install git+https://github.com/thierrymoudiki/nnetsauce.git --upgrade
  • Install from Github, in R console:
library(devtools)
devtools::install_github("thierrymoudiki/nnetsauce/R-package")
library(nnetsauce)

This could be the occasion for you to re-read all the previous posts about nnetsauce, or to play with various examples in Python or R. Here are a few other ways to interact with the nnetsauce:

1) Forms

  • If you’re not comfortable with version control yet: a feedback form.

2) Submit Pull Requests on GitHub (e.g., demo notebooks), named using the convention:

yourgithubname_ddmmyy_shortdescriptionofdemo.[ipynb|Rmd]

If it’s a jupyter notebook written in R, then just add _R to the suffix.

3) Reaching out directly via email

  • Use the address: thierry dot moudiki at pm dot me

To those who are contacting me through LinkedIn: no, I'm not declining; please add a short message to your request, so that I know a bit more about who you are and/or how we can envisage working together.


This new version, 0.5.0:

  • contains refactored code for the Base class, and for many other utilities.
  • makes use of randtoolbox for a faster, more scalable generation of quasi-random numbers.
  • contains a (work in progress) implementation of most algorithms on GPUs, using JAX. Most of nnetsauce's GPU-related changes currently target potentially time-consuming operations such as matrix multiplications and matrix inversions. However, to see a GPU effect, you need to have loads of data at hand, and a relatively high n_hidden_features parameter. How do you try it out? By instantiating a class with the option:
backend="gpu"

or

backend="tpu"

An example can be found in this notebook, on GitHub.

nnetsauce's future release is planned to be much faster on CPU, due to the use of Cython, as with mlsauce. There are indeed a lot of parts of nnetsauce which can be cythonized. If you've ever considered joining the project, now is the right time. For example, among other things, I'm looking for a volunteer to do some testing in R + Python on Microsoft Windows. Envisage a smooth onboarding, even if you don't have a lot of experience.


A Dashboard of Shiny Apps


[This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers.]

We build a lot of Shiny apps. Once we have more than a couple of related Shiny apps, it often makes sense to create a dashboard for them. A dashboard of Shiny apps allows users to easily see the available apps and navigate between them. This post covers a simple example of one of these dashboards of Shiny apps. The Shiny apps dashboard looks like this:

The above dashboard of Shiny apps is itself a Shiny app. It displays screenshots and links to 2 other Shiny apps (“Claims Dashboard” on the left and “Interest Rate Walk” on the right). As you would expect, the user can click on the “Live App” buttons to navigate to the actual apps.

The dashboard and each of the Shiny apps in the dashboard use our R package, polished, to secure the app with user authentication. polished manages user authorization on a per-app basis, so we can restrict user access to the dashboard and to specific apps in the dashboard (e.g. we could give user A access to only "Claims Dashboard" but not "Interest Rate Walk", and user B access to both apps). Once the user is signed in to the dashboard, they can seamlessly navigate to all apps that they are authorized to access. Visit polished.tech to learn more about polished and how to integrate it with your Shiny apps.

Sign in to the live dashboard of Shiny apps here using the following credentials:

If you really want to dive into the details, the source code is available here.


rstudio::global() call for talks


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

We’re excited to announce that the call for talks for rstudio::global(2021) is now open! Since we’re rethinking the conference to make the most of the new venue, the talks are going to be a little different to usual.

This year we are particularly interested in talks from people who can’t usually make it in person, or are newer to conference speaking. We’re excited to partner with Articulation Inc to offer free speaker coaching: as long as you have an interesting idea and are willing to put in some work, we’ll help you develop a great talk. (And if you’re an old hand at conference presentations, we’re confident that Articulation can help you get even better!)

Talks will be 20 minutes long, recordings will be due in early December, and you'll also be part of the live program in January (details TBD). We'll provide support to make sure that everyone can produce a high quality video regardless of circumstances.

To apply, as well as the usual title and abstract, you’ll need to create a 60 second video that introduces you and your proposed topic. In the video, you should tell us who you are, why your topic is important, and what attendees will take away from it. We’re particularly interested in hearing about:

  • How you’ve used R (by itself or with other technologies) to solve a challenging problem.

  • Your favourite R package (whether you wrote it or not) and how it significantly eases an entire class of problems or extends R into new domains.

  • Your techniques for teaching R to help it reach new domains and new audiences.

  • Broad reflections on the R community, R packages, or R code.

Applications close August 14, and you’ll hear back from us in mid September.

APPLY NOW!


rstudio::global(2021)


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers.]

We’ve made the difficult decision to cancel rstudio:conf(2021) for the health and safety of our attendees and the broader community 😢. Instead, we’re excited to announce rstudio::global(2021): our first ever virtual event focused on all things R and RStudio!

We have never done a virtual event before and we’re feeling both nervous and excited. We will make rstudio::global() our most inclusive and global event, making the most of the freedom from geographical and economic constraints that comes with an online event. That means that the conference will be free, designed around participation from every time zone, and have speakers from around the world.

We’re still working through the details, but as of today we’re thinking that most talks will be pre-recorded (so you can watch at your leisure), accompanied by a 24 hour live event filled with keynotes, interviews, opportunities to share knowledge, and as much fun as we can possibly squeeze into a virtual event! We don’t know the precise dates yet, but it’s likely to be late January 2021.

We’ll share more over the next few weeks: if you would like to receive notifications about the details, please subscribe below.


(If you already registered for rstudio::conf() as a superfan, we'll be in touch shortly to find out if you'd prefer a refund or to transfer your registration to 2022. If you have any questions in the meantime, please feel free to reach out to conf@rstudio.com.)


Free vtreat Tutorial Videos


[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers.]

I would like to re-share links to our free vtreat data preparation system introduction videos, which show you what sort of machine learning problems vtreat can help you with.

The idea is: instead of attempting to automate all of machine learning, vtreat automates some of the data preparation steps.

In addition we have extensive free task-based documentation both for the Python version and for the R version.

And, chapter 8 of our textbook Practical Data Science with R also teaches the methodology.

Plus we have examples of both the Python and R versions of vtreat being used in the KNIME system here.



RcppArmadillo 0.9.900.2.0


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]


Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 757 other packages on CRAN.

Conrad just released a new minor upstream version 9.900.2 of Armadillo which we packaged and tested as usual first as a ‘release candidate’ build and then as the release. As usual, logs from reverse-depends runs are in the rcpp-logs repo.

All changes in the new release are noted below.

Changes in RcppArmadillo version 0.9.900.2.0 (2020-07-17)

  • Upgraded to Armadillo release 9.900.2 (Nocturnal Misbehaviour)

    • In sort(), fixes for inconsistencies between checks applied to matrix and vector expressions

    • In sort(), remove unnecessary copying when applied in-place to vectors

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


SIMD Revisited


[This article was first published on HighlandR, and kindly contributed to R-bloggers.]

SIMD data without maps –

The Scottish Index of Multiple Deprivation updated for 2020

I have blogged about the SIMD previously. The last time was using data from 2016. Earlier this year, the data was refreshed, and my friend David Henderson was hot off the press with some very nice plots indeed.

Even better, he’d shared his code, so I was tempted to quickly hop onto my laptop and see if I could come up with something new and exciting. I thought it would only take ten minutes, as David had gone through all the pain of the data ingest, but I ended up spending a bit longer than that.

Last time round, I produced quite a lot of maps, and I didn’t see the point in replicating that. In addition, I still haven’t sussed out how to embed an interactive map on this Jekyll site, so I decided to not bother with maps at all.

I did have a vague idea of what I wanted to do, but I couldn’t get it to work. It had been a while since I’d used ggplot2 and I seemed to have forgotten lots.

So instead, I did what anyone would do in these circumstances, and produced what are essentially scatter plots, but they do look rather nice.

Another happy accident this time round was my choice of plot theme.

The last time I worked with SIMD I used theme_ipsum from Bob Rudis’s hrbrthemes

However, my first attempt did not look quite right, which I think is partly down to how that theme aligns text. It was absolutely brilliant for my maps, but it wasn't quite hitting the target this time round. I'd love to say I had a plan and was oscillating back and forth between various native ggplot themes, when I chanced upon theme_ft_rc from Bob's package.

I think it’s lovely:

20200717_simd_domain_rank_plot.png

This is a much condensed image compared to the original, which is so big, I would probably need to get it printed, laminated and stuck on a wall to do it justice.

This, rather than looking at the overall decile / quintile scores, is showing areas ranked across Access, Crime, Education, Employment, Health, Housing and Income.

Bearing in mind this was done back in January, in the depths of the Highland winter, I’m going to throw in a couple of pics here so you can see what influenced my next plot.

Behold, sunrise in Inverness in Jan 2020:

20200717_sunrise2.JPG

Even better, one Sunday evening we were treated to an amazing sunset, which this pic does not truly capture at all:

20200717_sunset1.JPG

Behold, theme_sunset:

20200717_simd_domain_rank_plot_org_blu.png

I used ggdark for this one, and did a simple blue to orange scale. I know it's nothing like the actual sunset above, but you can't expect me to compete with Mother Nature, even with ggplot2 and its derivatives.

Save the bees

Beeswarm plots are quite nice I think.

If you don’t agree, look away now:

Firstly, a parochial view of Highland level data:

20200717_simd_domain_Highland_beeswarm.png

Homing in on working-age populations – this is not a great one really, but I'm putting it in anyway:

20200717_simd_working_age_population.png

Now, the whole of Scotland, at Local Authority level:

20200717_simd_domain_LA_beeswarm.png

Look at Edinburgh and Glasgow. (Scotland's 2 main cities, but not necessarily the best 🙂)

They are almost all mirror images of each other, which can be seen more clearly below:

20200717_simd_domain_big2_rotated_beeswarm.png

Of course, you can't not play around with gganimate:

20200717_two_cities.gif

How would it look if they were combined?

20200717_two_cities_combined.gif

The code is on GitHub here (mine), which is a fork of David's repo.

Thanks again to David, who did all the hard work, and without whom I would not have produced these plots.


Export WordPress to Hugo RMarkdown or Org Mode with R


[This article was first published on Having Fun and Creating Value With the R Language on Lucid Manager, and kindly contributed to R-bloggers.]

I started my first website in 1996 with hand-written HTML. That became a bit of a chore, so about fifteen years ago WordPress became my friend. I recently returned to a static website using Hugo. I tried the WordPress to Hugo exporter, but a lot of HTML artefacts were left in the Markdown output, and each file was in a separate folder. This article explains how to export a WordPress blog to Hugo and customise it with R code.

WordPress has been great to me, but it is slowly becoming a pain: keeping plugins updated, security issues, slow performance, and the annoying block editor. I am also always looking for additional activities I can do with Emacs. Hugo takes a lot of the pain of managing a site away, as you can focus on the content. Emacs provides me with excellent editing functionality.

Convert the content to Markdown or Org Mode

The first step is to export the WordPress posts database to a CSV file. Several plugins are available that help you with this task. Alternatively, you can link directly to the database and extract the data with the RMySQL package (a sketch of this approach follows the list below). I have used the WP All Export plugin to export the data. We need at least the following fields:

  • Title

  • Slug

  • Date

  • Content

  • Categories

  • Tags
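For those who prefer the direct-database route, here is a minimal sketch of pulling those fields with RMySQL; the host, credentials, database name and the standard wp_ table prefix are assumptions you would need to adapt, and the categories and tags (stored in the wp_terms* tables) would require additional joins that are omitted here:

library(DBI)
library(RMySQL)

# Connect to the WordPress database (connection details are placeholders)
con <- dbConnect(MySQL(), host = "localhost", dbname = "wordpress",
                 user = "wp_user", password = "secret")

# Pull the published posts with the fields used in the rest of this article
posts <- dbGetQuery(con, "
  SELECT post_title   AS Title,
         post_name    AS Slug,
         post_date    AS Date,
         post_content AS Content
  FROM wp_posts
  WHERE post_status = 'publish' AND post_type = 'post'")

dbDisconnect(con)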

The content files for Hugo are either Markdown or Org Mode. I prefer to use Org Mode as it provides me with access to the extensive functionality that Emacs has to offer, including writing and evaluating R code. Org Mode is comparable to RMarkdown: you can write and execute code snippets in Org Mode, just like in RMarkdown. Org Mode also has several other advantages, such as a fully-featured task and project management system. This software also has superior editing options compared to anything that RStudio has to offer. In this code, you set your preferred file type with the export variable.

Screenshot of Emacs with R through the Emacs Speaks Statistics package.

The Content field in the WordPress database contains HTML code. The code below reads the exported CSV file and saves each content field as an HTML file. The mighty Pandoc software undertakes the conversion from HTML to Org Mode or Markdown, depending on the export variable, using the post slug as the file name. Any draft posts or pages in the export file will have NA as the file name.

Download from GitHub

## Export WP to Hugo
## Read exported WP content
library(tibble)
library(readr)
library(dplyr)
library(stringr)

posts <- read_csv("Posts-Export-2020-July-17-2245.csv")

## Convert to Org Mode or Markdown
export <- ".org"  # ".org" or ".md"

for (i in 1:nrow(posts)) {
    filename <- paste0(posts$Slug[i], ".html")
    writeLines(posts$Content[i], filename)
    pandoc <- paste0("pandoc -o content/post/", posts$Slug[i], export, " ", filename)
    system(pandoc)
}

## Clean folder
file.remove(list.files(pattern = "*.html"))

The next step is to add the front matter for Hugo. The front matter for this export will contain the title, date and the original URL so that we can create a redirect to the new address.

Export WordPress to Hugo site

Now that we have some content, we need to provide the context in the front matter so that Hugo can build a site. Hugo knows several types of front matter, i.e. TOML, YAML, JSON and Org Mode. This code provides either Org Mode or TOML front matter for Markdown files, depending on how you set the export variable.

## Create Org Mode files
baseurl <- "https://lucidmanager.org"

## Create front matter
if (export == ".org") {
    fm <- tibble(title = paste("#+title:", posts$Title),
                 date = paste("#+date:", as.POSIXct(posts$Date, origin = "1970-01-01")),
                 lastmod = paste("#+lastmod:", Sys.Date()),
                 categories = paste("#+categories[]:", str_replace_all(posts$Categories, " ", "-")),
                 tags = paste("#+tags[]:", str_replace_all(posts$Tags, " ", "-")),
                 draft = "#+draft: true") %>%
        mutate(categories = str_replace_all(categories, "\\|", " "),
               tags = str_replace_all(tags, "\\|", " "))
} else {
    fm <- tibble(opening = "+++",
                 title = paste0('title = "', posts$Title, '"'),
                 date = paste0('date = "', as.POSIXct(posts$Date, origin = "1970-01-01"), '"'),
                 lastmod = paste0('lastmod = "', Sys.Date(), '"'),
                 categories = paste0('categories = ["', posts$Categories, '"]'),
                 tags = paste0('tags = ["', posts$Tags, '"]'),
                 draft = 'draft = "true"',
                 close = "+++") %>%
        mutate(categories = str_replace_all(categories, "\\|", '", "'),
               tags = str_replace_all(tags, "\\|", '", "'))
}

## Load Hugo files and append front matter
for (f in 1:nrow(posts)) {
    filename <- paste0("content/post/", posts$Slug[f], export)
    post <- c(paste(fm[f, ]), "", readLines(filename))
    ## Repoint images
    post <- str_replace_all(post, paste0(baseurl, "/wp-content"), "/images")
    ## R code highlighting: convert fenced code blocks to Hugo highlight
    ## shortcodes (the exact replacement strings are assumed here)
    post <- str_replace_all(post, "``` \\{.*", "{{< highlight r >}}")
    post <- str_replace_all(post, "```", "{{< /highlight >}}")
    ## Remove remaining WordPress artefacts
    post <- str_remove_all(post, ':::|\\{.wp.*|.*\\"\\}')
    ## Write to disk
    writeLines(post, filename)
}

Finalising and Publishing the new site

All you have to do now is to add a theme to your website, and your blog is fully converted. The Hugo website has a great Quick Start page that will get you going.

If you prefer RMarkdown, you can easily modify this code to use RStudio and the blogdown package.

This new site will not be perfect just yet. To show the images, you need to download your wp-content folder and move it to the static/images folder in Hugo. You will also need to change the permalink settings to ensure that no URL changes when you migrate your blog. There will be other bits and pieces that might not have adequately converted, so do check your pages.



Estimating Covid-19 reproduction number with delays and right-truncation by @ellis2013nz


[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

This great preprint recently came out from a team of Katelyn Gostic and others. It uses simulations to test various methods of estimating the effective reproduction number R_t. If you are following the Covid-19 pandemic from a data angle at all, you will no doubt have come across the effective reproduction number and will know that it is an estimate, at a point in time, of the average number of people an infected person infects in turn. It drives the exponential growth of an epidemic, and the way it varies over time is a clear, interpretable way of understanding how that growth rate gets under control, or doesn’t.

The paper is thorough, timely, and transparent – all the code is available, and the criteria on which the authors test different methods are clear and sensible. It's certainly a must-read for anyone thinking of trying to estimate the reproduction number in real time. Some of the key issues that it considers include (in my sequencing, not theirs):

  • the confirmed cases on any day will be a shifting underestimate of the number of actual cases. Most notably, when testing is constrained, high test positivity is very likely an indicator of more untested cases in the community, and the rate of under-counting changes dramatically over time; but there is no agreed way of adjusting for this (in a previous blog I suggested a pragmatic method which treats a day's tests as somewhere between a census, implying one should use just counts, and a random sample, implying one should use just positivity, but it leaves a lot of space for judgement).
  • the impact of the delays between infection, symptoms, testing and confirmation of a case. Some method of allocating today’s cases to various past dates is needed. One common method of doing this – subtracting days according to the distribution of the delay from infection to confirmation – makes things worse.
  • the impact that correcting for that delay will have on our estimates of today’s cases, which will be right-truncated. That is, the infections in recent days will only be reflected fully in future data, so recent days’ estimates of infection numbers will be biased downwards

There’s also a complication about the difference between instantaneous or case/cohort reproduction number. As I understand it, for any given “today” the former is an estimate of the number of people being infected today per active cases today; but the latter is the average number of people who will be infected (today and later) by people who are infected today. The instantaneous measure is more appropriate for real-time surveillance.
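For reference (my addition, following the standard Cori et al. 2013 formulation rather than anything specific in the paper), the instantaneous reproduction number can be written in terms of the incidence series $I_t$ and the generation interval distribution $w_s$ as

$$R_t = \frac{I_t}{\sum_{s=1}^{t} w_s I_{t-s}}$$

that is, today's infections divided by a weighted sum of recent infections, with the weights describing how infectious cases of a given age typically are.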

Cases and test positivity in Victoria

I wanted to understand these issues better, particularly in the context of the recent marked increase in Covid-19 cases in Melbourne, Victoria, Australia where I live. Let’s start with looking at confirmed case numbers and test positivity rates. The Guardian compiles various announcements, media releases, dashboards and the like from the different state authorities around Australia into a tidy Google sheet. Thank you The Guardian. The data are presented as cumulative totals of tests and of cases whenever data comes in – so they are event data, with between zero and four observation events per day per state. This needs a bit of tidying to turn into daily increases in cases and tests.

Here’s the number of cases in Victoria, with and without a modest adjustment for test positivity as set out in my earlier post:

The history here will be familiar for Australians, but for other newcomers, the first burst of infections was driven nearly entirely by international travellers. There was a very low level of community transmission at that point, and the social distancing, test and trace, and lockdown measures pursued by governments at both Federal and State level were effective in suppressing the cases down to single digits. Then, since the second half of June, there has been a big increase in cases, this time locally contracted.

The adjustment I'm making for positivity (the aqua-blue line) is very mild – I'm multiplying the confirmed cases by positivity to the power of 0.1 and then scaling the result so that the cases at the lowest test positivity are left as they originally were. I'm not using the power of 0.5 (ie square root) that I used in my previous post in the US context and that has been taken up by some other analysts. I don't believe testing bottlenecks are as much of an issue here as in the US; I think our testing regime comes much closer to the ideal of a census of cases, or at least a relatively constant proportion of cases, allowing for a big but relatively stable proportion of asymptomatic cases below the radar. To see why, here's the time series of test positivity in Victoria:

That uptick is an issue, but it’s clearly a different order of magnitude of challenge to that faced in the US, where several locations have seen 20% of tests returning positive or higher (even 50% – one in two tests returning positive!), a major flashing red light that many many cases remain undiagnosed. The bottom line – there’s a lot of testing going on here in Victoria at the moment, I’m not sure it’s a major bottleneck for counting cases, and I only want to make a small adjustment for the positivity rate which is so close to zero for a lot of the period of interest.
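Before the full script, here is the positivity adjustment in miniature, on made-up numbers (the object names mirror those in the code below; the values are hypothetical):

# mild positivity adjustment: multiply cases by positivity^0.1, then rescale so that
# the day with the lowest positivity keeps its original count
confirm    <- c(20, 25, 60, 110)            # hypothetical daily confirmed cases
positivity <- c(0.002, 0.003, 0.010, 0.030) # hypothetical smoothed test positivity
k <- 0.1
cases_corrected <- confirm * positivity ^ k / min(positivity ^ k)
round(cases_corrected, 1)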

Here’s the code to download that data from The Guardian, estimate daily changes, model and smooth the test positivity rate and estimate an adjusted rate. It also sets us up for our next look at the issue of dealing with the delay, and with the downwards bias in recent days’ counts resulting from how we deal with the delay.

Post continues after R code

library(tidyverse)
library(googlesheets4)
library(janitor)
library(scales)
library(mgcv)
library(EpiNow2)      # remotes::install_github("epiforecasts/EpiNow2")
library(frs)          # remotes::install_github("ellisp/frs-r-package/pkg")
library(patchwork)
library(glue)
library(surveillance) # for backprojNP()

#-----------------the Victoria data--------------
url <- "https://docs.google.com/spreadsheets/d/1q5gdePANXci8enuiS4oHUJxcxC13d6bjMRSicakychE/edit#gid=1437767505"

gd_orig <- read_sheet(url)

d <- gd_orig %>%
  clean_names() %>%
  filter(state == "VIC") %>%
  # deal with problem of multiple observations some days:
  mutate(date = as.Date(date)) %>%
  group_by(date) %>%
  summarise(tests_conducted_total = max(tests_conducted_total, na.rm = TRUE),
            cumulative_case_count = max(cumulative_case_count, na.rm = TRUE)) %>%
  mutate(tests_conducted_total = ifelse(tests_conducted_total < 0, NA, tests_conducted_total),
         cumulative_case_count = ifelse(cumulative_case_count < 0, NA, cumulative_case_count)) %>%
  ungroup() %>%
  # correct one typo, missing a zero
  mutate(tests_conducted_total = ifelse(date == as.Date("2020-07-10"), 1068000, tests_conducted_total)) %>%
  # remove two bad dates
  filter(!date %in% as.Date(c("2020-06-06", "2020-06-07"))) %>%
  mutate(test_increase = c(tests_conducted_total[1], diff(tests_conducted_total)),
         confirm = c(cumulative_case_count[1], diff(cumulative_case_count)),
         pos_raw = pmin(1, confirm / test_increase)) %>%
  complete(date = seq.Date(min(date), max(date), by = "day"),
           fill = list(confirm = 0)) %>%
  mutate(numeric_date = as.numeric(date),
         positivity = pos_raw) %>%
  filter(date > as.Date("2020-02-01")) %>%
  fill(positivity, .direction = "downup") %>%
  # I don't believe the sqrt "corrected" cases helped here so have a much more modest 0.1.
  # But first we need to model positivity to smooth it, as it's far too spiky otherwise:
  mutate(ps1 = fitted(gam(positivity ~ s(numeric_date), data = ., family = "quasipoisson")),
         ps2 = fitted(loess(positivity ~ numeric_date, data = ., span = 0.1)),
         cases_corrected = confirm * ps1 ^ 0.1 / min(ps1 ^ 0.1)) %>%
  ungroup() %>%
  mutate(smoothed_confirm = fitted(loess(confirm ~ numeric_date, data = ., span = 0.1)))

the_caption <- "Data gathered by The Guardian; analysis by http://freerangestats.info"

# Positivity plot:
d %>%
  ggplot(aes(x = date)) +
  geom_point(aes(y = pos_raw)) +
  geom_line(aes(y = ps2)) +
  scale_y_continuous(label = percent_format(accuracy = 1)) +
  labs(x = "",
       y = "Test positivity",
       title = "Positive test rates for COVID-19 in Melbourne, Victoria",
       caption = the_caption)

# Case numbers plot
d %>%
  select(date, cases_corrected, confirm) %>%
  gather(variable, value, -date) %>%
  mutate(variable = case_when(
    variable == "confirm"         ~ "Recorded cases",
    variable == "cases_corrected" ~ "With small adjustment for test positivity")) %>%
  ggplot(aes(x = date, y = value, colour = variable)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 0.07) +
  labs(x = "",
       y = "Number of new cases per day",
       colour = "",
       caption = the_caption,
       title = "Covid-19 cases per day in Melbourne, Victoria",
       subtitle = "With and without a small adjustment for test positivity. No adjustment for delay.")

Convolution and Deconvolution

One of the obvious challenges for estimating effective reproduction number (R_t) is the delay between infection and becoming a confirmed case. The Gostic et al paper looked at different ways of doing this. They found that with poor information, an imperfect but not-too-bad method is just to left-shift the confirmed cases by subtracting the estimated average delay from infection to reporting. A better method takes into account the distribution of that delay – typically a Poisson or negative binomial random variable of counts of days. However, a common approach to do this by subtracting delays drawn from that distribution is noticeably worse than simply subtracting the mean. In the words of Gostic et al:

“One method infers each individual’s time of infection by subtracting a sample from the delay distribution from each observation time. This is mathematically equivalent to convolving the observation time series with the reversed delay distribution. However, convolution is not the correct inverse operation and adds spurious variance to the imputed incidence curve. The delay distribution has the effect of spreading out infections incident on a particular day across many days of observation; subtracting the delay distribution from the already blurred observations spreads them out further. Instead, deconvolution is needed. In direct analogy with image processing, the subtraction operation blurs, whereas the proper deconvolution sharpens”

To be clear, it’s not just armchair epidemiologists who are wrongly using convolution backwards in time here, but specialists performing Covid-19 surveillance or research for their professional roles. So this paper does a great service in pointing out how it makes things worse. I think that at least one high profile dashboard of R_t estimates has modified its method based on the findings in this paper already. Self-correcting science at work!

Recovering unobserved past infections with toy data

To see this issue in action I made myself two functions blur() and sharpen() which respectively do the simple convolution and deconvolution tasks described above in a deterministic way. The job of blur() is to delay a vector of original cases in accordance with a given distribution. In effect, this spreads out the original vector over a greater (and delayed) time period. The job of sharpen() is to reverse this process – to recover an original vector of observations that, if blurred, would create the original (unobserved) incidence counts.
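In symbols (my notation, not the post's): if $x_t$ is the unobserved incidence on day $t$ and $p_d$ is the probability that a case is delayed by $d$ days, then blur() computes

$$y_t = \sum_{d \ge 0} p_d \, x_{t-d}$$

while sharpen() searches for a non-negative series $x$ whose blurred version matches the observed $y$ as closely as possible.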

I tested these two functions with a super-simple set of six original incidence counts, the vector 4, 6, 8, 9, 7, 5. I blurred these into the future in accordance with expected values of a Poisson distribution with a mean of 3 days. Then I used sharpen() to recover the original values, which it does with near perfect success:

My blur() and sharpen() functions (which are completely deterministic and not fit in my opinion for dealing with real-life random data) are in the frs R package where I store miscellaneous stuff for this blog.

Here's the code that makes those toy example original, delayed and recovered time series:

#------------------understanding convolution----------------------
x <- c(4, 6, 8, 9, 7, 5)

pmf <- dpois(0:14, lambda = 3)
pmf <- pmf / sum(pmf)

# create a lagged version of x, with lags determined by a known probability mass function:
y <- blur(x, pmf, scale_pmf = TRUE)

# recover the original version of x, given its blurred version
# and the original probabilities of delays of various lags:
recovered <- sharpen(y, pmf)

p_conv <- tibble(original_x = x,
                 position = 1:length(x)) %>%
  right_join(recovered, by = "position") %>%
  gather(variable, value, -position) %>%
  mutate(variable = case_when(
    variable == "original_x" ~ "Original (unobserved) values",
    variable == "x"          ~ "Original values recovered",
    variable == "y"          ~ "Values after a delay")) %>%
  filter(position > -1) %>%
  ggplot(aes(x = position, y = value, colour = variable)) +
  geom_point() +
  geom_line() +
  theme(legend.position = "right") +
  labs(title = "Convolution and deconvolution demonstrated with simulated data",
       colour = "",
       x = "Time",
       y = "Value")

… and for completeness, here are the definitions of my blur() and sharpen() functions. Most of the versions of these I’ve seen in R, Python and Stan use loops, but the particular nature of the “multiply every value of a vector by the probabilities for various lags” operation suggested to me it would be simpler to write as a join of two tables. Maybe I spend too much time with databases. I did think at one point in writing these that I seemed to be re-inventing matrix multiplication, and I’m sure there’s a better way than what I’ve got; but the complications of forcing the recovered vector of original infections to match the observed vector were too much for me to come up with a more elegant approach in the hour or so time budget I had for this bit of the exercise.

#' One-dimensional convolution
#'
#' @param x a vector of counts or other numbers
#' @param pmf a vector of probabilities to delay x at various lags. First value should be for lag 0,
#' second for lag 1, etc.
#' @param warnings whether to show warnings when the resulting vector does not add up to the
#' same sum as the original
#' @param scale_pmf whether or not to scale pmf so it adds exactly to one
#' @details \code{blur} and \code{sharpen} are deterministic single dimensional convolution functions
#' for simple convolution by a lagged probability mass function representing the proportion of original
#' cases that are delayed at various lags. They are for illustrative / toy purposes and probably should
#' not be used for actual analysis.
#' @return A vector of length equal to the length of x plus length of pmf minus 1
#' @export
#' @importFrom dplyr left_join
#' @examples
#' x <- c(4,6,8,9,7,5)
#' pmf <- c(0, 0.3, 0.5, 0.2)
#' blur(x, pmf)
blur <- function(x, pmf, warnings = TRUE, scale_pmf = FALSE){
  if(!class(x) %in% c("integer", "numeric") || !is.null(dim(x))){
    stop("x should be a numeric vector of numbers")
  }
  if(!"numeric" %in% class(pmf) || !is.null(dim(pmf))){
    stop("pmf should be a numeric vector of probabilities")
  }
  if(scale_pmf){
    pmf <- pmf / sum(pmf)
  }

  cvd <- data.frame(pmf = pmf,
                    link = 1,
                    lag = 0:(length(pmf) - 1))

  orig <- data.frame(x = x, link = 1)
  orig$position <- 1:nrow(orig)

  combined <- dplyr::left_join(orig, cvd, by = "link")
  combined$z <- with(combined, x * pmf)
  combined$new_position <- with(combined, position + lag)

  y <- aggregate(combined$z, list(combined$new_position), sum)$x

  if(sum(x) != sum(y) & warnings){
    warning("Something went wrong with blur; result did not sum up to original")
  }

  return(y)
}

#' Single dimensional deconvolution
#'
#' @param y a vector of values to be deconvolved
#' @param pmf a vector of probabilities for the original convolution that created y
#' @param warnings passed through to blur
#' @param digits how many digits to round the deconvolved values to. If NULL no rounding occurs.
#' @details \code{blur} and \code{sharpen} are deterministic single dimensional convolution functions
#' for simple convolution by a lagged probability mass function representing the proportion of original
#' cases that are delayed at various lags. They are for illustrative / toy purposes and probably should
#' not be used for actual analysis.
#'
#' Use \code{\link[surveillance]{backprojNP}} for a better, maximum likelihood approach to recovering
#' an unseen set of original cases that result in the observations.
#'
#' \code{sharpen} is the inverse of \code{blur}; it seeks to recover an original vector that, when blurred
#' via \code{pmf}, would produce the actual observations.
#' @return a data frame with columns for x (the inferred original values that were convolved to y),
#' y (which will be padded out with some extra zeroes), and position (which is the numbering of the
#' row relative to the original ordering of y; so position = 1 refers to the first value of y)
#' @seealso \code{\link[surveillance]{backprojNP}}.
#' @export
#' @examples
#' x <- c(4,6,8,9,7,5)
#' pmf <- c(0, 0.3, 0.5, 0.2)
#' # create a convolved version of x:
#' y <- blur(x, pmf)
#' # recover the original version of x, given its blurred version
#' # and the original convolution probabilities:
#' sharpen(y, pmf)
sharpen <- function(y, pmf, warnings = FALSE, digits = NULL){
  y2 <- c(rep(0, length(pmf)), y)
  starter_x <- c(y, rep(0, length(pmf)))

  fn <- function(x){
    x <- x / sum(x) * sum(y2)
    d <- sqrt(sum((blur(x, pmf, warnings = warnings, scale_pmf = TRUE)[1:length(x)] - y2) ^ 2))
    return(d)
  }

  op_res <- optim(starter_x, fn, lower = 0, method = "L-BFGS-B")

  x <- op_res$par
  x <- x / sum(x) * sum(y2)
  if(!is.null(digits)){
    x <- round(x, digits = digits)
  }

  output <- data.frame(x, y = y2)
  output$position <- seq(from = length(y) - nrow(output) + 1, to = length(y), by = 1)

  return(output)
}

A big limitation on these toy blur() and sharpen() functions is that they assume not only that we know the probability mass for each possible day of lagging, but that the process is deterministic. Of course, this isn’t the case. The backprojNP() function in the surveillance R package estimates those unobserved latent series on the assumption that the observations come from a Poisson process. Here’s how that looks when we apply it to my toy data:

This uses a method first developed in the context of understanding the spread of HIV, which has a particularly long and uncertain delay between infection and confirmation of a case.

It doesn't do as well at recovering the original as my sharpen() did, because sharpen() has the luxury of knowing the blurring took place deterministically. With these small numbers that makes a big difference. But backprojNP() does an ok job at recovering at least some of the original structure. It's a bit sharper than the blurred observations. Here's the code applying backprojNP() to my toy data (mostly this is plotting code; I stuck to base graphics to give me a bit more control over the legend):

# Make a "surveillance time series" of our toy observations
y_sts <- sts(y)

# Back-propagate to get an estimate of the original, assuming our observations
# are a Poisson process
x2 <- backprojNP(y_sts, pmf)

par(family = "Roboto")
plot(x2, xaxis.labelFormat = NULL, legend = NULL, lwd = c(1, 1, 3), lty = c(1, 1, 1),
     col = c("grey80", "grey90", "red"), main = "", bty = "l", ylim = c(0, 10),
     xlab = "Time", ylab = "Infections")
title("Recovering / sharpening unobserved infections with simulated toy data and \nsurveillance::backprojNP()",
      adj = 0, family = "Sarala", font.main = 1)
points(1:6, x, col = "orange", cex = 4, pch = "-")
legend(15, 10,
       legend = c("Observed", "Back-propagated", "Original"),
       pch = c(15, NA, NA),
       lty = c(0, 1, 1),
       lwd = c(0, 3, 3),
       col = c("grey80", "red", "orange"),
       pt.cex = c(2, 2, 2),
       cex = 0.8,
       bty = "n")

# note the actual estimates are in x2@upperbound, which adds up to 39, same as the original y

Not only does backprojNP() do a better job at coping with real-life random data, it’s much faster than my sharpen() function. So I should emphasise again that frs::sharpen() is here purely for illustrative purposes, to help me get my head around how this whole single dimensional deconvolution thing works.

Recovering unobserved past infections with real data

OK, so how does this go when we apply it to real Covid-19 data? A quick google suggests that the delay from infection to symptoms is approximately a Poisson distribution with a mean of 6 days. The time from symptoms to becoming a confirmed case is unfortunately very much unknown, and also is likely to change over time (for example, as queues for testing lengthen or shorten, and testing regimes become more or less aggressive). I took a guess for illustrative purposes that it is going to be a Poisson distribution with a mean of about 3 days, shifted one day to the right (because it takes at least one day to get one’s test results). As the sum of two Poisson distributions is another Poisson distribution, this means we can model the overall delay from (unobserved) infection to confirmation as a Poisson distribution with mean of 9 days, plus one day for good luck. In the code below the probability of delay by various lags is defined on this basis in the pmf_covid vector.
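As a quick sanity check of that claim (my own illustration, not part of the original analysis), simulating the two delay components and comparing them with a single Poisson draw gives near-identical means and variances:

set.seed(42)
combined <- 1 + rpois(1e5, 6) + rpois(1e5, 3)  # infection -> symptoms -> confirmation
direct   <- 1 + rpois(1e5, 9)                  # single Poisson with the summed mean
c(mean(combined), mean(direct))  # both close to 10
c(var(combined), var(direct))    # both close to 9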

Here’s what this looks like when I apply it to the Victorian data:

Overall the curve looks like how we’d expect – earlier than the observed cases, and a little steeper (not as much steeper as I expected though – in the course of writing this blog over a few days I noted this curve can change noticeably as individual data points come in, so it should be treated with caution – see later for a better chart).

Note how this chart really draws attention to the final challenge we’ll be looking at – the right truncation of the data. That massive drop in the aqua-blue line is very misleading, because it’s based on there being zero confirmed cases from tomorrow onwards! We’ve got a way to go yet before we want to turn this into estimates of effective reproduction number, but we’re on track.

Here’s the code for that bit of analysis:

pmf_covid <- c(0, dpois(0:20, lambda = 9))

# takes a few minutes
bp_covid <- backprojNP(sts(d$confirm), pmf_covid)

sharpened <- tibble(recovered_x = bp_covid@upperbound) %>%
  mutate(position = 1:n())

p_adj_conv <- d %>%
  mutate(position = 1:n()) %>%
  left_join(sharpened, by = "position") %>%
  select(date, recovered_x, confirm) %>%
  mutate(recovered_x = replace_na(recovered_x, 0)) %>%
  gather(variable, value, -date) %>%
  mutate(variable = case_when(
    variable == "confirm"     ~ "Confirmed cases",
    variable == "recovered_x" ~ "Estimated original cases accounting for delay")) %>%
  ggplot(aes(x = date, y = value, colour = variable)) +
  geom_line(size = 1.5) +
  labs(title = "Back-projection of Victorian Covid-19 infections",
       subtitle = str_wrap("Non-parametric back-projection of incidence cases assuming
         average delay of 10 days between infection and observation, using methods in
         Becker et al (1991). No correction for right truncation of data,
         so the last 15 days will be badly biased downwards.", 100),
       x = "",
       y = "Number of infections",
       colour = "")

print(p_adj_conv)

Adjust for testing, delay and right truncation all at once

Clearly the problem of adjusting for the delay is going to need a more careful evidence base than my casual guess at a Poisson variable with mean of 9 days displaced one day to the right; we’ll need an effective way to manage the smoothing of our inferred original infection estimates; we need a way of estimating the unobserved cases; and a method for dealing with the “right truncation” ie the absence of data that can be mapped to infections today, yesterday, and other recent days.

Following one of the footnotes in Gostic et al (referring to “Many statistical methods are available to adjust for right truncation in epidemiological data”) led me to the very excellent EpiNow2 R package, which tries to do most of this simultaneously. It wraps a Bayesian Gaussian process model implemented in Stan, which is exactly the way I think this problem should be approached. Their documentation helpfully includes instructions on using their package for real-time estimation of Covid-19, exactly my problem. It even includes literature-based estimates of the key characteristics of the appropriate distributions to use for estimating and combining different types of lags (generation, incubation and reporting).

EpiNow2 is one of the tools developed and used by the team behind epiforecasts.io – Sebastian Funk and others at the London School of Hygiene and Tropical Medicine. The guts of the software is this Stan program.

While it’s still under development, it’s very impressive. Hats off to the team. Here’s the full citation, noting a fair bit of cross-over with both the epiforecasts.io team and the Gostic et al paper I started this blog with a link to:

 Sam Abbott, Joel Hellewell, Robin Thompson, Katelyn Gostic, Katharine Sherratt,   Sophie Meakin, James Munday, Nikos Bosse and Sebastian Funk (2020). EpiNow2:   Estimate Realtime Case Counts and Time-varying Epidemiological Parameters.   R package version 0.3.0.

Here’s what I get when I apply their approach out-of-the-box to Victoria’s case counts:

As can be seen, several sets of estimates are returned. Panel B is important – it shows the estimated actual infection counts (the curving ribbon, which reflects a credibility interval) in contrast to the reported confirmed cases (the grey columns). It makes it easy to see the broadening uncertainty as we get to today and even a bit into the future; and also how the estimated infections precede the confirmed cases.

From panel B, the estimates of instantaneous reproduction number follow in straightforward fashion, as seen in the bottom panel. I’ve opted to only show the estimates of R_t from late April onwards because prior to that the cases were dominated by international arrivals. While methods exist to estimate reproduction number appropriately in this circumstance, I’m not sufficiently interested in that bit of history (months ago now…) to go to the effort to do it.

Of course, this chart is cautiously good news for Victorians. The brown ‘nowcasting’ segment of the plot shows a decided downwards trend in estimated infections, timing closely with the extra shutdown and distancing measures brought in just under two weeks ago. And it shows the best estimate of today’s effective reproduction number to be (just) less than 1. Let’s hope that stays the case.

That first fit was with the confirmed case numbers directly from The Guardian. If I instead apply the same approach to the numbers after my mild correction for test positivity, we see a similar picture. Recent estimated case numbers are higher because of the higher test positivity, but the estimate of R_t today is similar – still just below 1.0, and on its way down (albeit with lots of uncertainty).

That uncertainty issue is actually one of my main motivators for writing this blog. The fact is, we don't know what's going to happen here. This is definitely one of those cases when one of the most useful things a forecast or nowcast can do is highlight the range of possibilities that are consistent with the data we have. And critically, what happens in the future – that big blue credibility interval in the last couple of charts – depends on actual people's actual decisions and actions.

“The fault, dear Brutus, is not in our stars, but in ourselves”

That reminds me, a Shakespearean blog post really is on the way.

So the situation in Melbourne is clearly on a knife edge. Today's numbers (Saturday 18 July) are good (in fact exactly what, and when, would be hoped for if the suppression strategy is doing its job), but we know not to pay too much attention to one point in these noisy processes. In a week's time, any number of cases per day from zero to more than 1,000 (the chart is cut off well below its top point) is possible. Let's remember those big, uncertain prediction intervals; not be too confident about which way things are headed; and do our best to point them in the right direction by our own behaviour. Stay safe out there! Practice social distancing, stay at home as much as possible, wear a mask when going out, and pay attention to your local public health experts. New Zealanders are excused from the first three of those four things, because they did the fourth one reasonably well.

Anyway, here’s the code for those final bits of analysis. Running the estimation processes took several hours each on my laptop:

#------------Estimating R with EpiNow2---------------------
# Various delay/lag distributions as per the Covid-19 examples in the EpiNow2 documentation.

reporting_delay <- EpiNow2::bootstrapped_dist_fit(rlnorm(100, log(6), 1))
## Set max allowed delay to 30 days to truncate computation
reporting_delay$max <- 30

generation_time <- list(mean = EpiNow2::covid_generation_times[1, ]$mean,
                        mean_sd = EpiNow2::covid_generation_times[1, ]$mean_sd,
                        sd = EpiNow2::covid_generation_times[1, ]$sd,
                        sd_sd = EpiNow2::covid_generation_times[1, ]$sd_sd,
                        max = 30)

incubation_period <- list(mean = EpiNow2::covid_incubation_period[1, ]$mean,
                          mean_sd = EpiNow2::covid_incubation_period[1, ]$mean_sd,
                          sd = EpiNow2::covid_incubation_period[1, ]$sd,
                          sd_sd = EpiNow2::covid_incubation_period[1, ]$sd_sd,
                          max = 30)

#---------Based on original data-------------

estimates <- EpiNow2::epinow(reported_cases = d,
                             generation_time = generation_time,
                             delays = list(incubation_period, reporting_delay),
                             horizon = 7,
                             samples = 3000,
                             warmup = 600,
                             cores = 4,
                             chains = 4,
                             verbose = TRUE,
                             adapt_delta = 0.95)

# Function for doing some mild polishing to the default plot from epinow();
# assumes existence in global environment of a ggplot2 theme called my_theme,
# defined previously, and takes the output of epinow() as its main argument:
my_plot_estimates <- function(estimates, extra_title = ""){

  my_theme <- my_theme + theme(axis.text.x = element_text(angle = 45, size = 8, hjust = 1))

  p <- estimates$plots$summary

  p1 <- p$patches$plots[[1]] +
    scale_y_continuous(label = comma_format(accuracy = 1)) +
    my_theme +
    theme(legend.position = "none",
          panel.grid.minor = element_blank()) +
    coord_cartesian(ylim = c(0, 1000)) +
    labs(title = glue("Estimated infections based on confirmed cases{extra_title}"),
         x = "") +
    scale_x_date(date_breaks = "1 week", date_labels = "%d %b", limits = range(p$data$date))

  p2 <- p$patches$plots[[2]] +
    scale_y_continuous(label = comma_format(accuracy = 1)) +
    my_theme +
    theme(legend.position = "none",
          panel.grid.minor = element_blank()) +
    coord_cartesian(ylim = c(0, 1000)) +
    labs(title = glue("Estimated infections taking delay{extra_title} into account"),
         x = "") +
    scale_x_date(date_breaks = "1 week", date_labels = "%d %b", limits = range(p$data$date))

  p3 <- p$data %>%
    filter(date > as.Date("2020-04-20")) %>%
    ggplot(aes(x = date, y = median, fill = type)) +
    my_theme +
    geom_hline(yintercept = 1, colour = "steelblue") +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5) +
    geom_ribbon(aes(ymin = bottom, ymax = top), alpha = 0.1) +
    geom_line(aes(colour = type)) +
    theme(legend.position = "none",
          panel.grid.minor = element_blank()) +
    ggplot2::scale_fill_brewer(palette = "Dark2") +
    labs(title = glue("Effective Reproduction Number, correcting for both delay and right truncation{extra_title}"),
         y = bquote("Estimated" ~ R[t]),
         x = "") +
    scale_x_date(date_breaks = "1 week", date_labels = "%d %b", limits = range(p$data$date))

  pc <- p1 + p2 + p3 + plot_layout(ncol = 1)

  return(pc)
}

my_plot_estimates(estimates)

#---------Based on positivity-adjusted-------------

d2 <- select(d, date, cases_corrected) %>%
  mutate(confirm = round(cases_corrected))

estimates2 <- EpiNow2::epinow(reported_cases = d2,
                              generation_time = generation_time,
                              delays = list(incubation_period, reporting_delay),
                              horizon = 7,
                              samples = 3000,
                              warmup = 600,
                              cores = 4,
                              chains = 4,
                              verbose = TRUE,
                              adapt_delta = 0.95)

my_plot_estimates(estimates2, extra_title = " and positivity")


drat 0.1.8: Minor test fix


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


A new version of drat arrived on CRAN today. This is a follow-up release to 0.1.7 from a week ago. It contains a quick follow-up by Felix Ernst to correct one of the tests which misbehaved under the old release of R still being tested at CRAN.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users because repositories with marked releases are the better way to distribute code.

As your mother told you: Friends don’t let friends install random git commit snapshots. Rolled-up releases it is. drat is easy to use, documented by five vignettes and just works.
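For readers who have not used drat before, the consumer side is only a couple of lines; a minimal sketch (the repository name here is just an example, and the package name is hypothetical):

# install.packages("drat")        # once
drat::addRepo("eddelbuettel")     # register a drat repository hosted on GitHub
install.packages("somePackage")   # hypothetical package name, now installable from that repo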

The NEWS file summarises the release as follows:

Changes in drat version 0.1.8 (2020-07-18)

  • The archive pruning test code was corrected for r-oldrel (Felix Ernst in #105 fixing #104).

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



tint 0.1.3: Fixes for html mode, new demo


[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A new version 0.1.3 of the tint package arrived at CRAN today. It corrects some features for html output, notably margin notes and references. It also contains a new example for inline references.

The full list of changes is below.

Changes in tint version 0.1.3 (2020-07-18)

  • A new minimal demo was added showing inline references (Dirk addressing #42).

  • Code for margin notes and references in html mode was updated with thanks to tufte (Dirk in #43 and #44 addressing #40).

  • The README.md was updated with a new ‘See Also’ section and a new badge.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Riddler: Can You Beat MLB Records?


[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

FiveThirtyEight’s Riddler Express

link

From Taylor Firman comes an opportunity to make baseball history:

This year, Major League Baseball announced it will play a shortened 60-game season, as opposed to the typical 162-game season. Baseball is a sport of numbers and statistics, and so Taylor wondered about the impact of the season’s length on some famous baseball records.

Some statistics are more achievable than others in a shortened season. Suppose your true batting average is .350, meaning you have a 35 percent chance of getting a hit with every at-bat. If you have four at-bats per game, what are your chances of batting at least .400 over the course of the 60-game season? And how does this compare to your chances of batting at least .400 over the course of a 162-game season?

Plan

This riddle should be pretty straightforward to solve statistically and with simulations, so I will do both.

Setup

knitr::opts_chunk$set(echo = TRUE, comment = "#>", cache = FALSE, dpi = 400)

library(mustashe)
library(tidyverse)
library(conflicted)

# Handle any namespace conflicts.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")

# Default 'ggplot2' theme.
theme_set(theme_minimal())

# For reproducibility.
set.seed(123)

Statistical solution

This is a simple binomial system: at each at bat, the player either gets a hit or not. If their real batting average is 0.350, that means the probability of getting a hit at each at bat is 35%. Thus, according to the Central Limit Theorem, given a sufficiently large number of at bats, the observed frequency of hits should be close to 35%, because the distribution converges towards the mean. At smaller sample sizes, however, the distribution is broader, meaning that the observed batting average has a greater chance of being further away from the true value.

First, let's answer the riddle. The solution is just the probability of observing a batting average of 0.400 or greater. The probability of exactly 0.400 is computed with dbinom(), and the cumulative probability above 0.400 with pbinom(), setting lower.tail = FALSE to get the upper tail.
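Written out (my addition for reference; with $240 = 60 \times 4$ at-bats and a cutoff of $96 = 0.400 \times 240$ hits), the quantity computed below is the upper tail of a binomial distribution:

$$P(\text{BA} \ge 0.400) = \sum_{k=96}^{240} \binom{240}{k} (0.35)^k (0.65)^{240-k} \approx 0.061$$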

num_at_bats <- 60 * 4
real_batting_average <- 0.350
target_batting_average <- 0.400

prob_at_400 <- dbinom(x = target_batting_average * num_at_bats,
                      size = num_at_bats,
                      prob = real_batting_average)

prob_above_400 <- pbinom(q = target_batting_average * num_at_bats,
                         size = num_at_bats,
                         prob = real_batting_average,
                         lower.tail = FALSE)

prob_at_400 + prob_above_400
#> [1] 0.06083863

Under the described assumptions, there is a 6.1% chance of reaching a batting average of 0.400 in the shorter season.

For comparison, the chance for a normal 162-game season is calculated below. Because $0.400 \times 162 \times 4$ is a non-integer value, an exact 0.400 batting average cannot be obtained. Therefore, only the probability of a batting average greater than 0.400 needs to be calculated.

num_at_bats <- 162 * 4
real_batting_average <- 0.350
target_batting_average <- 0.400

prob_above_400 <- pbinom(q = target_batting_average * num_at_bats,
                         size = num_at_bats,
                         prob = real_batting_average,
                         lower.tail = FALSE)

prob_above_400
#> [1] 0.003789922

Over 162 games, there is a 0.4% chance of achieving a batting average of at least 0.400.

Simulation

The solution to this riddle could also be found by simulating a whole bunch of seasons with the real batting average of 0.350 and then just counting how frequently the simulations resulted in an observed batting average of at least 0.400.

A single season can be simulated using the rbinom() function where n is the number of seasons to simulate, size takes the number of at bats, and prob takes the true batting average. The returned value is a sampled number of hits (“successes”) over the season from the binomial distribution.

The first example shows the observed batting average from a single season.

num_at_bats <- 60 * 4
real_batting_average <- 0.350
target_batting_average <- 0.400

rbinom(n = 1, size = num_at_bats, prob = real_batting_average) / num_at_bats
#> [1] 0.3375

The n = 1 can just be replaced with a large number to simulate a bunch of seasons. The average batting average over these seasons should be close to the true batting average.

n_seasons <- 1e6  # 1 million simulations.

sim_res <- rbinom(n = n_seasons,
                  size = num_at_bats,
                  prob = real_batting_average)
sim_res <- sim_res / num_at_bats

# The average batting average is near the true batting average of 0.350.
mean(sim_res)
#> [1] 0.3500121

The full distribution of batting averages over the 1 million simulations is shown below.

tibble(sims = sim_res) %>%
  ggplot(aes(x = sims)) +
  geom_density(color = "black", fill = "black", adjust = 2,
               alpha = 0.2, size = 1.2) +
  geom_vline(xintercept = target_batting_average,
             color = "tomato", lty = 2, size = 1.2) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.02))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "1 million simulated season batting averages",
       y = "probability density",
       title = "Distribution of simulated batting averages in a 60-game season")

The answer from the simulation is pretty close to the actual answer.

mean(sim_res >= 0.40)
#> [1] 0.060813

One last visualization I want to do demonstrates why the length of the season matters to the distribution. Instead of using rbinom() to simulate the number of successes over the entire season, I use it below to simulate a season's worth of individual at bats, returning a vector of 0's and 1's. I then plotted the cumulative number of hits at each at bat and colored the line by the running batting average.

The coloring shows how the batting average was more volatile when there were fewer at bats.

sampled_at_bats <- rbinom(60 * 4, 1, 0.35)

tibble(at_bat = sampled_at_bats) %>%
  mutate(i = row_number(),
         cum_total = cumsum(at_bat),
         running_avg = cum_total / i) %>%
  ggplot(aes(x = i, y = cum_total)) +
  geom_line(aes(color = running_avg), size = 1.2) +
  scale_color_viridis_c() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = c(0.85, 0.35)) +
  labs(x = "at bat number",
       y = "total number of hits",
       color = "batting average",
       title = "Running batting average over a simulated season")

The following two plots do the same analysis many times to simulate many seasons and color the lines by whether or not the final batting average was at or above 0.400. As there are more games, the running batting averages, which are essentially biased random walks, regress towards the true batting average. (Note that I had to do 500 simulations for the 162-game season to get any simulations with a final batting average above 0.400.)

simulate_season_at_bats <- function(num_at_bats) {
  sampled_at_bats <- rbinom(num_at_bats, size = 1, prob = 0.35)
  tibble(result = sampled_at_bats) %>%
    mutate(at_bat = row_number(),
           cum_total = cumsum(result),
           running_avg = cum_total / at_bat)
}

tibble(season = 1:100) %>%
  mutate(season_results = map(season, ~ simulate_season_at_bats(60 * 4))) %>%
  unnest(season_results) %>%
  group_by(season) %>%
  mutate(above_400 = 0.4 <= running_avg[which.max(at_bat)],
         above_400 = ifelse(above_400, "BA ≥ 0.400", "BA < 0.400")) %>%
  ungroup() %>%
  ggplot(aes(x = at_bat, y = cum_total)) +
  geom_line(aes(group = season, color = above_400, alpha = above_400),
            size = 0.8) +
  geom_hline(yintercept = 0.4 * 60 * 4,
             color = "tomato", lty = 2, size = 1) +
  scale_color_manual(values = c("grey50", "dodgerblue")) +
  scale_alpha_manual(values = c(0.1, 1.0), guide = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = c(0.85, 0.25)) +
  labs(x = "at bat number",
       y = "total number of hits",
       color = NULL,
       title = "Running batting averages over simulated 60-game seasons",
       subtitle = "Blue lines indicate a simulation with a final batting average of at least 0.400.")




RvsPython #1: Webscraping


[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Webscraping is a powerful tool available for efficient data collection. There are ways to do it in both R and Python. I've built the same scraper in R and Python which gathers information about all the White House briefings available on www.whitehouse.gov (don't worry guys – it's legal).

This is based on what I learned from FreeCodeCamp about webscraping in Python (here's the link: https://www.youtube.com/watch?v=87Gx3U0BDlo ).

This blog is about approaches I naturally used with R’s rvest package and Python’s BeautifulSoup library.

Here are two versions of the code which I use to scrape all the briefings.

This webscraper extracts:

1) The date of the briefing
2) The title of the briefing
3) The URL of the briefing
4) The issue type

and puts them in a data frame.

The differences between the way I did this in Python vs R:

Python

(a) I grabbed the data using the lxml parser
(b) Parsing the data was done with the html classes (and cleaned with a small amount of regex)
(c) I used for loops
(d) I had to import other libraries besides bs4

R

(a) I used a CSS selector to get the raw data.
(b) The data was parsed using good ol' regular expressions.
(c) I used sapply().
(d) I just used rvest and the base library.

This is a comparison between how I learned to webscrape in Python and how I learned to do it in R. Let's jump in and see which one is faster!

Python Version with BeautifulSoup

# A simple webscraper providing a dataset of all Whitehouse Breifings
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import lxml


def get_whitehouse_breifings():
    # Generalize to all pages

    orig_link = requests.get("https://www.whitehouse.gov/briefings-statements/")

    orig_content = orig_link.content

    sp = BeautifulSoup(orig_content, 'lxml')

    pages = sp.find_all('a', {'class': 'page-numbers'})

    the_pages = []

    for pg in pages:
        the_pages.append(pg.get_text())

    # Now make set of links

    the_links = []

    for num in range(1, int(max(the_pages)) + 1):
        the_links.append('https://www.whitehouse.gov/briefings-statements/' + 'page/' + str(num) + '/')

    dat = pd.DataFrame()
    for link in the_links:
        link_content = requests.get(link)
        link_content = link_content.content
        sp = BeautifulSoup(link_content, 'lxml')
        h2_links = sp.find_all('h2')
        date_links = sp.find_all('p', {"class": "meta__date"})
        breif_links = sp.find_all('div', {"class": "briefing-statement__content"})

        title = []
        urls = []
        date = []
        breifing_type = []
        for i in h2_links:
            a_tag = i.find('a')
            urls.append(a_tag.attrs['href'])
            title.append(a_tag.get_text())
        for j in date_links:
            d_tag = j.find('time')
            date.append(d_tag.get_text())
        for k in breif_links:
            b_tag = k.find('p')
            b_tag = b_tag.get_text()
            b_tag = re.sub('\\t', '', b_tag)
            b_tag = re.sub('\\n', '', b_tag)
            breifing_type.append(b_tag)

        dt = pd.DataFrame(list(zip(date, title, urls, breifing_type)))

        dat = pd.concat([dat, dt])

    dat.columns = ["Date", "Title", "URL", "Issue Type"]  # label the four collected columns
    return (dat)

Running the code, Python’s Time

import time
start_time=time.time()

pdt = get_whitehouse_breifings()


# Time taken to run code
print("--- %s seconds ---" % (time.time() - start_time))

## --- 162.8423991203308 seconds ---
 

R Version with rvest

library(rvest)

get_whitehouse_breifings<- function(){
  #Preliminary Functions




  pipeit<-function(url,code){
    read_html(url)%>%html_nodes(code)%>%html_text()
  }

  pipelink<-function(url,code){
    read_html(url)%>%html_nodes(code)%>%html_attr("href")
  }


  first_link<-"https://www.whitehouse.gov/briefings-statements/"

  # Get total number of pages

  pages<-pipeit(first_link,".page-numbers")

  pages<-as.numeric(pages[length(pages)])

  #Get all links
  all_pages<-c()

  for (i in 1:pages){
    all_pages[i]<-paste0(first_link,"page/",i,"/")
  }



  urls<-unname(sapply(all_pages,function(x){
        pipelink(x,".briefing-statement__title a")
        })) %>% unlist()

  breifing_content<-unname(sapply(all_pages,function(x){
    pipeit(x,".briefing-statement__content")
  })) %>%  unlist()


  # Data Wrangling

  test<-unname(sapply(breifing_content,function(x) gsub("\\n|\\t","_",x)))

  test<-unname(sapply(test,function(x) strsplit(x,"_")))

  test<-unname(sapply(test,function(x) x[x!=""]))

  breifing_type<-unname(sapply(test,function(x) x[1])) %>% unlist()
  title<-unname(sapply(test,function(x) x[2])) %>% unlist()
  dat<-unname(sapply(test,function(x) x[length(x)])) %>% unlist()


  dt<- data.frame("Date"=dat,"Title"=title,"URL"=urls,"Issue Type"= breifing_type)

  dt
}

Running the code, R's Time

##    user  system elapsed 
##   16.77    4.22  415.95

Analysis and Conclusion:

On my machine Python was waaaaay faster than R. This was primarily because the function I wrote in R had to go over the website a second time to extract links. Could it be sped up if I wrote the code extracting text and links in one step? Very likely. But I would have to change the approach to be similar to how I did it in Python.
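For what it is worth, here is a rough sketch (mine, not code from the post) of that one-pass idea in rvest: read each page once and pull both the link text and the href from the same set of nodes.

library(rvest)

scrape_page_once <- function(url) {
  page  <- read_html(url)
  links <- html_nodes(page, ".briefing-statement__title a")
  data.frame(Title = html_text(links),
             URL   = html_attr(links, "href"),
             stringsAsFactors = FALSE)
}

# one_page <- scrape_page_once("https://www.whitehouse.gov/briefings-statements/page/1/")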

For me rvest seems to be great for "quick and dirty" code (point and click with a CSS selector, put it in a function, iterate across pages; repeat for the next field). BeautifulSoup seems like it's better for more methodical scraping. The approach is naturally more HTML-heavy.

Python requires one to reference the library every time they call a function from it, which I, as a native R user, find frustrating as opposed to just attaching the library to the script.

For R you have to play with the data structure (from lists to vectors) to get the data to be coerced to a dataframe. I didn’t need to do any of this for Python.

I'm sure there's more to write about these libraries (and better ways to do it in both of these languages), but I'm happy that I am acquainted with them both!

Let me know what you think!

P.S. This was uploaded with the RWordpress Package. Check out my Linkedin Post on the topic here.



RObservations #1: Uploading your .Rmd File to WordPress: A Troubleshooter's Guide


[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

As anyone in tech will tell you, having a website where you can showcase your work is a huge plus for getting an edge in the market, networking and building a portfolio. When starting out, this sort of stuff might seem overwhelming. If you're an R user and have done work with R Markdown, the easiest thing to do is to migrate your .Rmd files to your blog.

While there are many blog posts about importing your R files into WordPress, this post shows you how you can do it on your own and troubleshoot some problems along the way. I'm sure the number of possible problems is endless, but this post presents the ones I experienced when I first uploaded a .Rmd file to my WordPress site. Let's start from the beginning, shall we?

What the internet might have already told you:

As you might have heard, to upload your .Rmd file to WordPress, first install the preliminary packages (if you haven't done so already):

install.packages("knitr")
install.packages("devtools")
devtools::install_github(c("duncantl/XMLRPC", "duncantl/RWordPress"))

Then call the knitr and RWordPress libraries and set your options to make sure you're logged in; this snippet of code can probably be found on every blog which discusses the topic:

(WARNING: This code might give errors, so keep reading for the solution!)

library(RWordPress)
library(knitr)

# Set options
options(WordPressLogin = c(user = 'password'),
        WordPressURL = 'https://yoursite.wordpress.com/xmlrpc.php')

Where user is your username (not as a string) and 'password' is your password.

Finally, make sure your working directory is the same as where your .Rmd file is and call the knit2wp function to upload your file to WordPress:

setwd("C:/Users/user/Documents")

knit2wp('Your_RMarkdown_file.Rmd',
        title = "Hey kids! Look at how I posted this on WordPress",
        publish = FALSE)

This should work dandy right????

Let's pick apart the issues that I've had!

1) R is not allowed through your firewall (Error 443):

You might get an error that looks like this:

Error in function (type, msg, asError = TRUE)  :   Unknown SSL protocol error in connection to https://yoursite.wordpress.com/xmlrpc.php:443

After doing some googling I found out that this error occurs because R is not being allowed through your firewall. So if you're a Windows user, click your way through the following steps:

(Start > Control panel > System and security > Windows Defender firewall > Applications and Functions)

If the box for "RStudio R Session" is not checked, check it and retry. This will usually work.

(Thank you to Marina_Anna for her post on the RStudio forum answering this question (here).)

2) Your options are not set properly (WordPress is misspelled?? Huh??)

This is a really annoying issue: the option names need to be written as Wordpress (with a lowercase p) to be understood by knit2wp, so…

The proper way to set your options is…

options(WordpressLogin = c(user = 'password'),
        WordpressURL = 'https://yoursite.wordpress.com/xmlrpc.php')

# Then post your .Rmd file to your WordPress site.
knit2wp('Your_RMarkdown_file.Rmd',
        title = "Hey kids! Look at how I posted this on WordPress",
        publish = FALSE)

… and it should work!

Hope this helps!



Le Monde puzzle [#1152]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The weekly puzzle from Le Monde is a tournament classic:

An even number of teams play one another once a week, with no ties allowed, until every team has played every other team. Four weeks into the tournament, A has won all its games; B, C, and D have won three games each; and the other teams have each won at least one game. What is the minimum number of teams? Show an instance.

By sheer random search

tnmt = function(K = 10, gamz = 4){
  tnmt = t0 = matrix(0, K, K)
  while (!prod(apply(tnmt^2, 1, sum) == 4)){
    tnmt = t0
    for (i in 1:(K - 2)){
      if ((a <- gamz - sum(tnmt[i, ]^2)) > K - i - 1) break()
      if (a > 0){
        j = sample((i + 1):K, a)
        tnmt[i, j] = sample(c(-1, 1), a, rep = TRUE)
        tnmt[j, i] = -tnmt[i, j]}}}
  tnmt}

chck = function(tnmt, gamz = 4){
  sumz = apply(tnmt, 1, sum)
  max(sumz) == gamz &
    sum(sumz == 2) > 2 &
    min(sumz) > -gamz}
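The post does not show how the two functions are combined; a minimal driver loop (my own addition, using the same +1/-1 win/loss coding) would look like this:

set.seed(1)
for (k in 1:1e6) {          # sheer random search, so this may take a while
  g <- tnmt(K = 9)
  if (chck(g)) { print(g); break }
}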

I found that 8 teams did not produce an acceptable tournament out of 10⁶ tries. Here is a solution for 9 teams:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]             -1   -1         1             -1
[2,]             -1         1        -1        -1
[3,]    1    1                   1             -1
[4,]    1                   1         1   -1
[5,]        -1        -1                   1   -1
[6,]   -1        -1                  -1    1
[7,]         1        -1         1         1
[8,]                   1   -1   -1   -1
[9,]    1    1    1         1

where team 9 wins all four of its games, teams 7, 4 and 3 win three games each, and the other five teams win one game each. Which makes sense since this is a zero-sum game: the four top teams have a combined value of 10, and if no other team has two wins, the remaining N-4 teams each contribute -2, so 2(N-4)=10 and N=9 (adding an even number of such two-win teams does not change the value of the game).


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


Time series prediction with FNN-LSTM


[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

training_loop(ds_train)

test_batch <- as_iterator(ds_test) %>% iter_next()
encoded <- encoder(test_batch[[1]])
test_var <- tf$math$reduce_variance(encoded, axis = 0L)
print(test_var %>% as.numeric() %>% round(5))

On to what we'll use as a baseline for comparison.

Vanilla LSTM

Here is the vanilla LSTM, stacking two layers, each, again, of size 32. Dropout and recurrent dropout were chosen individually per dataset, as was the learning rate.

Data preparation

For all experiments, data were prepared in the same way. In every case, we used the first 10000 measurements available in the respective .pkl files provided by Gilpin in his GitHub repository (https://github.com/williamgilpin/fnn/tree/master/datasets). To save on file size and not depend on an external data source, we extracted those first 10000 entries to .csv files downloadable directly from this blog's repo. Should you want to access the complete time series (of considerably greater lengths), just download them from Gilpin's repo and load them using reticulate.

Here is the data preparation code for the first dataset, geyser - all other datasets were treated the same way (a rough sketch of the kind of windowing involved follows the figure below).

Now we're ready to look at how forecasting goes on our four datasets.

Experiments

Geyser dataset

People working with time series may have heard of Old Faithful (https://en.wikipedia.org/wiki/Old_Faithful), a geyser in Wyoming, US that has continually been erupting every 44 minutes to two hours since the year 2004. For the subset of data Gilpin extracted (see the dataset descriptions in the repository's README, https://github.com/williamgilpin/fnn):

geyser_train_test.pkl corresponds to detrended temperature readings from the main runoff pool of the Old Faithful geyser in Yellowstone National Park, downloaded from the GeyserTimes database (https://geysertimes.org/). Temperature measurements start on April 13, 2015 and occur in one-minute increments.

Like we said above, geyser.csv is a subset of these measurements, comprising the first 10000 data points. To choose an adequate timestep for the LSTMs, we inspect the series at various resolutions:
Geyser dataset. Top: First 1000 observations. Bottom: Zooming in on the first 200.

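As promised above, here is a rough sketch of the kind of data preparation involved (this is not the post's original code; the file name geyser.csv, the scaling, and the simple non-overlapping windowing are assumptions for illustration):

library(readr)
library(dplyr)

n_timesteps <- 60

geyser <- read_csv("geyser.csv") %>% pull(1) %>% scale() %>% as.numeric()

# cut the series into pairs of input and target windows, each of length n_timesteps
gen_windows <- function(x, n_timesteps) {
  starts <- seq(1, length(x) - 2 * n_timesteps + 1, by = n_timesteps)
  input  <- t(sapply(starts, function(s) x[s:(s + n_timesteps - 1)]))
  target <- t(sapply(starts, function(s) x[(s + n_timesteps):(s + 2 * n_timesteps - 1)]))
  list(input = input, target = target)
}

windows <- gen_windows(geyser, n_timesteps)
dim(windows$input)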

It seems like the behavior is periodic with a period of about 40-50; a timestep of 60 thus seemed like a good try. Having trained both FNN-LSTM and the vanilla LSTM for 200 epochs, we first inspect the variances of the latent variables on the test set. The value of fnn_multiplier corresponding to this run was 0.7.

   V1     V2        V3          V4       V5       V6       V7       V8       V9      V10
0.258 0.0262 0.0000627 0.000000600 0.000533 0.000362 0.000238 0.000121 0.000518 0.000365

There is a drop in importance between the first two variables and the rest; however, unlike in the Lorenz system, V1 and V2 variances also differ by an order of magnitude.

Now, it’s interesting to compare prediction errors for both models. We are going to make an observation that will carry through to all three datasets to come.

Keeping up the suspense for a while, here is the code used to compute per-timestep prediction errors from both models. The same code will be used for all other datasets.
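As a rough indication of what such a computation involves, here is a minimal sketch (the function name and the prediction/target matrices are illustrative assumptions, not the post's actual objects):

# per-timestep mean squared error, given matrices of shape
# (n_test_sequences, n_timesteps) for predictions and ground truth
per_timestep_mse <- function(preds, targets) {
  apply((preds - targets)^2, 2, mean)
}

# mse_fnn  <- per_timestep_mse(preds_fnn_lstm, targets)  # assumed objects
# mse_lstm <- per_timestep_mse(preds_lstm, targets)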

And here is the actual comparison. One thing especially jumps to the eye: FNN-LSTM forecast error is significantly lower for initial timesteps, first and foremost, for the very first prediction, which from this graph we expect to be pretty good!

Per-timestep prediction error as obtained by FNN-LSTM and a vanilla stacked LSTM. Green: LSTM. Blue: FNN-LSTM.


Interestingly, we see “jumps” in prediction error, for FNN-LSTM, between the very first forecast and the second, and then between the second and the ensuing ones, reminding of the similar jumps in variable importance for the latent code! After the first ten timesteps, vanilla LSTM has caught up with FNN-LSTM, and we won’t interpret further development of the losses based on just a single run’s output.

Instead, let’s inspect actual predictions. We randomly pick sequences from the test set, and ask both FNN-LSTM and vanilla LSTM for a forecast. The same procedure will be followed for the other datasets.

Here are sixteen random picks of predictions on the test set. The ground truth is displayed in pink; blue forecasts are from FNN-LSTM, green ones from vanilla LSTM.

60-step ahead predictions from FNN-LSTM (blue) and vanilla LSTM (green) on randomly selected sequences from the test set. Pink: the ground truth.


What we expect from the error inspection comes true: FNN-LSTM yields significantly better predictions for immediate continuations of a given sequence.

Let’s move on to the second dataset on our list.

Electricity dataset

This is a dataset on power consumption, aggregated over 321 different households in fifteen-minute intervals.

electricity_train_test.pkl corresponds to average power consumption by 321 Portuguese households between 2012 and 2014, in units of kilowatts consumed in fifteen minute increments. This dataset is from the UCI machine learning database. 1

Here, we see a very regular pattern:

Electricity dataset. Top: First 2000 observations. Bottom: Zooming in on 500 observations, skipping the very beginning of the series.


With such regular behavior, we immediately tried to predict a higher number of timesteps (120) – and didn’t have to retract behind that aspiration.

For an fnn_multiplier of 0.5, latent variable variances look like this:

   V1       V2            V3       V4       V5            V6       V7         V8      V9     V10
0.390 0.000637 0.00000000288 1.48e-10 2.10e-11 0.00000000119 6.61e-11 0.00000115 1.11e-4 1.40e-4

We definitely see a sharp drop already after the first variable.

How do prediction errors compare on the two architectures?

Per-timestep prediction error as obtained by FNN-LSTM and a vanilla stacked LSTM. Green: LSTM. Blue: FNN-LSTM.


Here, FNN-LSTM performs better over a long range of timesteps, but again, the difference is most visible for immediate predictions. Will an inspection of actual predictions confirm this view?

60-step ahead predictions from FNN-LSTM (blue) and vanilla LSTM (green) on randomly selected sequences from the test set. Pink: the ground truth.


It does! In fact, forecasts from FNN-LSTM are very impressive on all time scales.

Now that we’ve seen the easy and predictable, let’s approach the weird and difficult.

ECG dataset

Says Gilpin,

ecg_train.pkl and ecg_test.pkl correspond to ECG measurements for two different patients, taken from the PhysioNet QT database. 2

How do these look?

ECG dataset. Top: First 1000 observations. Bottom: Zooming in on the first 400 observations.


To the layperson that I am, these do not look nearly as regular as expected. First experiments showed that both architectures are not capable of dealing with a high number of timesteps. In every try, FNN-LSTM performed better for the very first timestep.

This is also the case for n_timesteps = 12, the final try (after 120, 60 and 30). With an fnn_multiplier of 1, the latent variances obtained amounted to the following:

   V1       V2      V3        V4      V5      V6      V7      V8      V9     V10
0.110 1.16e-11 3.78e-9 0.0000992 9.63e-9 4.65e-5 1.21e-4 9.91e-9 3.81e-9 2.71e-8

There is a gap between the first variable and all other ones; but not much variance is explained by V1 either.

Apart from the very first prediction, vanilla LSTM shows lower forecast errors this time; however, we have to add that this was not consistently observed when experimenting with other timestep settings.

Per-timestep prediction error as obtained by FNN-LSTM and a vanilla stacked LSTM. Green: LSTM. Blue: FNN-LSTM.


Looking at actual predictions, both architectures perform best when a persistence forecast is adequate – in fact, they produce one even when it is not.

60-step ahead predictions from FNN-LSTM (blue) and vanilla LSTM (green) on randomly selected sequences from the test set. Pink: the ground truth.


On this dataset, we certainly would want to explore other architectures better able to capture the presence of high and low frequencies in the data, such as mixture models. But – were we forced to stay with one of these, and could do a one-step-ahead, rolling forecast, we’d go with FNN-LSTM.

Speaking of mixed frequencies – we haven’t seen the extremes yet …

Mouse dataset

“Mouse”, that’s spike rates recorded from a mouse thalamus.

mouse.pkl A time series of spiking rates for a neuron in a mouse thalamus. Raw spike data was obtained from CRCNS and processed with the authors’ code in order to generate a spike rate time series. 3

Mouse dataset. Top: First 2000 observations. Bottom: Zooming in on the first 500 observations.


Obviously, this dataset will be very hard to predict. How, after “long” silence, do you know that a neuron is going to fire?

As usual, we inspect latent code variances (fnn_multiplier was set to 0.4):

Again, we don’t see the first variable explaining much variance. Still, interestingly, when inspecting forecast errors we get a picture very similar to the one obtained on our first, geyser, dataset:

Per-timestep prediction error as obtained by FNN-LSTM and a vanilla stacked LSTM. Green: LSTM. Blue: FNN-LSTM.


So here, the latent code definitely seems to help! With every timestep “more” that we try to predict, prediction performance goes down continuously; or, put the other way round, short-time predictions are expected to be pretty good!

Let’s see:

60-step ahead predictions from FNN-LSTM (blue) and vanilla LSTM (green) on randomly selected sequences from the test set. Pink: the ground truth.


In fact on this dataset, the difference in behavior between both architectures is striking. When nothing is “supposed to happen”, vanilla LSTM produces “flat” curves at about the mean of the data, while FNN-LSTM takes the effort to “stay on track” as long as possible before also converging to the mean. Choosing FNN-LSTM – had we to choose one of these two – would be an obvious decision with this dataset.

Discussion

When, in timeseries forecasting, would we consider FNN-LSTM? Judging by the above experiments, conducted on four very different datasets: Whenever we consider a deep learning approach. Of course, this has been a casual exploration – and it was meant to be, as – hopefully – was evident from the nonchalant and bloomy (sometimes) writing style.

Throughout the text, we’ve emphasized utility: how could this technique be used to improve predictions? But, looking at the above results, a number of interesting questions come to mind. We already speculated (though in an indirect way) whether the number of high-variance variables in the latent code was related to how far we could sensibly forecast into the future. However, even more intriguing is the question of how characteristics of the dataset itself affect FNN efficiency.

Such characteristics could be:

  • How nonlinear is the dataset? (Put differently, how incompatible, as indicated by some form of test algorithm, is it with the hypothesis that the data generation mechanism was a linear one?)

  • To what degree does the system appear to be sensitively dependent on initial conditions? In other words, what is the value of its (estimated, from the observations) highest Lyapunov exponent?

  • What is its (estimated) dimensionality, for example, in terms of correlation dimension?

While it is easy to obtain those estimates, using, for instance, the nonlinearTseries package explicitly modeled after practices described in Kantz & Schreiber’s classic [@Kantz], we don’t want to extrapolate from our tiny sample of datasets, and leave such explorations and analyses to further posts, and/or the interested reader’s ventures :-). In any case, we hope you enjoyed the demonstration of practical usability of an approach that, in the preceding post, was mainly introduced in terms of its conceptual attractiveness.

Thanks for reading!


  1. again, citing from Gilpin’s repository’s README.↩

  2. again, citing from Gilpin’s repository’s README.↩

  3. again, citing from Gilpin’s repository’s README.↩


To leave a comment for the author, please follow the link and comment on their blog: RStudio AI Blog.


Building A Neural Net from Scratch Using R – Part 1


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Akshaj is a budding deep learning researcher who loves to work with R. He has worked as a Research Associate at the Indian Institute of Science and as a Data Scientist at KPMG India.

A lot of deep learning frameworks often abstract away the mechanics behind training a neural network. While this has the advantage of quickly building deep learning models, it has the disadvantage of hiding the details. It is equally important to slow down and understand how neural nets work. In this two-part series, we’ll dig deep and build our own neural net from scratch. This will help us understand, at a basic level, how those big frameworks work. The network we’ll build will contain a single hidden layer and perform binary classification using a vectorized implementation of backpropagation, all written in base-R. We will describe in detail what a single-layer neural network is, how it works, and the equations used to describe it. We will see what kind of data preparation is required to be able to use it with a neural network. Then, we will implement a neural-net step-by-step from scratch and examine the output at each step. Finally, to see how our neural-net fares, we will describe a few metrics used for classification problems and use them.

In this first part, we’ll present the dataset we are going to use, the pre-processing involved, the train-test split, and describe in detail the architecture of the model. Then we’ll build our neural net chunk-by-chunk. It will involve writing functions for initializing parameters and running forward propagation.

In the second part, we’ll implement backpropagation by writing functions to calculate gradients and update the weights. Finally, we’ll make predictions on the test data and see how accurate our model is using metrics such as Accuracy, Recall, Precision, and F1-score. We’ll compare our neural net with a logistic regression model and visualize the difference in the decision boundaries produced by these models.

By the end of this series, you should have a deeper understanding of the math behind neural-networks and the ability to implement it yourself from scratch!

Set Seed

Before we start, let’s set a seed value to ensure reproducibility of the results.

set.seed(69)

Architecture Definition

To understand the matrix multiplications better and keep the numbers digestible, we will describe a very simple 3-layer neural net i.e. a neural net with a single hidden layer. The \(1^{st}\) layer will take in the inputs and the \(3^{rd}\) layer will spit out an output.

The input layer will have two (input) neurons, the hidden layer four (hidden) neurons, and the output layer one (output) neuron.

Our input layer has two neurons because we’ll be passing two features (columns of a dataframe) as the input. A single output neuron because we’re performing binary classification. This means two output classes – 0 and 1. Our output will actually be a probability (a number that lies between 0 and 1). We’ll define a threshold for rounding off this probability to 0 or 1. For instance, this threshold can be 0.5.

In a deep neural net, multiple hidden layers are stacked together (hence the name “deep”). Each hidden layer can contain any number of neurons you want.

In this series, we’re implementing a single-layer neural net which, as the name suggests, contains a single hidden layer.

  • n_x: the size of the input layer (set this to 2).
  • n_h: the size of the hidden layer (set this to 4).
  • n_y: the size of the output layer (set this to 1).

Figure 1: Single layer NNet Architecture. Credits: deeplearning.ai

Neural networks flow from left to right, i.e. input to output. In the above example, we have two features (two columns from the input dataframe) that arrive at the input neurons from the first-row of the input dataframe. These two numbers are then multiplied by a set of weights (randomly initialized at first and later optimized).

An activation function is then applied on the result of this multiplication. This new set of numbers becomes the neurons in our hidden layer. These neurons are again multiplied by another set of weights (randomly initialized) with an activation function applied to this result. The final result we obtain is a single number. This is the prediction of our neural-net. It’s a number that lies between 0 and 1.

Once we have a prediction, we then compare it to the true output. To optimize the weights in order to make our predictions more accurate (because right now our input is being multiplied by random weights to give a random prediction), we need to first calculate how far off our prediction is from the actual value. Once we have this loss, we calculate the gradients with respect to each weight.

The gradients tell us the amount by which we need to increase or decrease each weight parameter in order to minimize the loss. All the weights in the network are updated as we repeat the entire process with the second input sample (second row).

After all the input samples have been used to optimize weights, we say that one epoch has passed. We repeat this process for a number of epochs until our loss stops decreasing.

At this point, you might be wondering what an activation function is. An activation function adds non-linearity to our network and enables it to learn complex features. If you look closely, a neural network consists of a bunch of multiplications and additions. It’s linear, and we know that a linear classification model will not be able to learn complex features in high dimensions.

Here are a few popular activation functions –

Figure 2: Sigmoid Activation Function. Credits - analyticsindiamag

We will use tanh() and sigmoid() activation functions in our neural net. Because tanh() is already available in base-R, we will implement the sigmoid() function ourselves later on.

Dry Run

For now, let’s see how the numbers flow through the above described neural-net by writing out the equations for a single sample (one input row).

For one input sample \(x^{(i)}\) where \(i\) is the row-number:

First, we calculate the output \(Z\) from the input \(x\). We will tune the parameters \(W\) and \(b\). Here, the superscript in square brackets tells us the layer number and the one in parentheses tells us the sample (row) number. For instance, \(z^{[1] (i)}\) is the output of the \(1^{st}\) layer for the \(i^{th}\) input sample.

\[z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}\]

Then we’ll pass this value through the tanh() activation function to get \(a\).

\[a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}\]

After that, we’ll calculate the value for the final output layer using the hidden layer values.

\[z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}\]

Finally, we’ll pass this value through the sigmoid() activation function and obtain our output probability. \[\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}\]

To obtain our prediction class from output probabilities, we round off the values as follows. \[y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}\]
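In R, this rounding-off can be a one-liner (a sketch; A2 here stands for the matrix of output probabilities produced by the forward pass defined further below):

y_pred <- ifelse(A2 > 0.5, 1, 0)  # threshold the output probabilities at 0.5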

Once, we have the prediction probabilities, we’ll compute the loss in order to tune our parameters (\(w\) and \(b\) can be adjusted using gradient-descent).

Given the predictions on all the examples, we will compute the cost \(J\), the cross-entropy loss, as follows: \[J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(\hat{y}^{(i)}\right) + (1-y^{(i)})\log\left(1- \hat{y}^{(i)}\right) \right) \tag{6}\]
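Written out in R, this cost could look like the following sketch (computeCost is an illustrative name; A2 holds the predicted probabilities and y the true labels, both stored as 1 x m matrices as described above):

computeCost <- function(A2, y){
    m <- dim(y)[2]  # number of training samples
    cost <- -(1 / m) * sum(y * log(A2) + (1 - y) * log(1 - A2))
    return (cost)
}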

Once we have our loss, we need to calculate the gradients. I’ve calculated them for you so you don’t have to differentiate anything. We’ll directly use these values –

  • \(dZ^{[2]} = A^{[2]} - Y\)
  • \(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)
  • \(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)
  • \(dZ^{[1]} = W^{[2]^T} dZ^{[2]} * g^{[1]'}(Z^{[1]})\) where \(g\) is the activation function and \(*\) denotes element-wise multiplication.
  • \(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)
  • \(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)

Now that we have the gradients, we will update the weights. We’ll multiply these gradients with a number known as the learning rate. The learning rate is represented by \(\alpha\).

  • \(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)
  • \(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)
  • \(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)
  • \(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)

This process is repeated multiple times until our model converges i.e. we have learned a good set of weights that fit our data well.
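A sketch of what one such update step might look like in R (illustrative names only; the real implementation is the subject of Part 2):

updateParameters <- function(params, grads, learning_rate){
    # move each parameter against its gradient, scaled by the learning rate
    params$W2 <- params$W2 - learning_rate * grads$dW2
    params$b2 <- params$b2 - learning_rate * grads$db2
    params$W1 <- params$W1 - learning_rate * grads$dW1
    params$b1 <- params$b1 - learning_rate * grads$db1
    return (params)
}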

Load and Visualize the Data

Since the goal of the series is to understand how neural networks work behind the scenes, we’ll use a small dataset so that our focus stays on building our neural net.

We’ll use a planar dataset that looks like a flower. The output classes cannot be separated accurately using a straight line.

Construct Dataset

df <- read.csv(file = "planar_flower.csv")

Let’s shuffle our dataset so that our model is invariant to the order of samples. This is good for generalization and will help increase performance on unseen (test) data.

df <- df[sample(nrow(df)), ]
head(df)
##           x1        x2 y
## 209  1.53856  3.242555 0
## 347 -0.05617 -0.808464 0
## 386 -3.85811  1.423514 1
## 112  0.82630  0.044276 1
## 104  0.31350  0.004274 1
## 111  2.28420  0.352476 1

Visualize Data

We have four hundred samples, with two hundred belonging to each class.

Here’s a scatter plot between our input variables. As you can see, the output classes are not easily separable.

Train-Test Split

Now that we have our dataset prepared, let’s go ahead and split it into train and test sets. We’ll put 80% of our data into our train set and the remaining 20% into our test set. (To keep the focus on the neural-net, we will not be using a validation set here.)

train_test_split_index <- 0.8 * nrow(df)

Train and Test Dataset

Because we’ve already shuffled the dataset above, we can go ahead and extract the first 80% rows into train set.

train <- df[1:train_test_split_index,]
head(train)
##           x1        x2 y
## 209  1.53856  3.242555 0
## 347 -0.05617 -0.808464 0
## 386 -3.85811  1.423514 1
## 112  0.82630  0.044276 1
## 104  0.31350  0.004274 1
## 111  2.28420  0.352476 1

Next, we select the last 20% of rows of the shuffled dataset to be our test set.

test <- df[(train_test_split_index+1): nrow(df),]
head(test)
##          x1       x2 y
## 210 -0.0352 -0.03489 0
## 348  2.7257 -0.54170 0
## 19  -2.2235  0.42137 1
## 362  2.3366 -0.40412 0
## 143 -1.4984  3.55267 0
## 4   -3.2264 -0.81648 0

Here, we visualize the number of samples per class in our train and test data sets to ensure that there isn’t a major class imbalance.
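A quick way to run this check (a sketch; the post itself shows bar plots):

table(train$y)  # class counts in the training set
table(test$y)   # class counts in the test set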

Preprocess

Neural networks work best when the input values are standardized. So, we’ll scale all the values to have mean = 0 and standard deviation = 1.

Standardizing input values speeds up the training and ensures faster convergence.

To standardize the input values, we’ll use the scale() function in R. Note that we’re standardizing the input values (X) only and not the output values (y).

X_train <- scale(train[, c(1:2)])

y_train <- train$y
dim(y_train) <- c(length(y_train), 1) # add extra dimension to vector

X_test <- scale(test[, c(1:2)])

y_test <- test$y
dim(y_test) <- c(length(y_test), 1) # add extra dimension to vector

The output below tells us the shape and size of our input data.

## Shape of X_train (row, column):
##  320 2
## Shape of y_train (row, column):
##  320 1
## Number of training samples:
##  320
## Shape of X_test (row, column):
##  80 2
## Shape of y_test (row, column):
##  80 1
## Number of testing samples:
##  80

Because neural nets are made up of a bunch of matrix multiplications, let’s convert our input and output to matrices from dataframes. While dataframes are a good way to represent data in a tabular form, we choose to convert to a matrix type because matrices are smaller than an equivalent dataframe and often speed up the computations.

We will also change the shape of X and y by taking its transpose. This will make the matrix calculations slightly more intuitive as we’ll see in the second part. There’s really no difference though. Some of you might find this way better, while others might prefer the non-transposed way. I feel this makes more sense.
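A tiny illustration of what the transpose changes (purely illustrative):

m <- matrix(1:6, nrow = 3, ncol = 2)  # 3 samples x 2 features
dim(m)     # 3 2
dim(t(m))  # 2 3: now each column holds one sample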

We’re going to use the as.matrix() method to construct out matrix. We’ll fill out matrix row-by-row.

X_train <- as.matrix(X_train, byrow=TRUE)
X_train <- t(X_train)

y_train <- as.matrix(y_train, byrow=TRUE)
y_train <- t(y_train)

X_test <- as.matrix(X_test, byrow=TRUE)
X_test <- t(X_test)

y_test <- as.matrix(y_test, byrow=TRUE)
y_test <- t(y_test)

Here are the shapes of our matrices after taking the transpose.

## Shape of X_train:
##  2 320
## Shape of y_train:
##  1 320
## Shape of X_test:
##  2 80
## Shape of y_test:
##  1 80

Build a neural-net

Now that we’re done processing our data, let’s move on to building our neural net. As discussed above, we will broadly follow the steps outlined below.

  1. Define the neural net architecture.
  2. Initialize the model’s parameters from a random-uniform distribution.
  3. Loop:
    • Implement forward propagation.
    • Compute loss.
    • Implement backward propagation to get the gradients.
    • Update parameters.

Get layer sizes

A neural network optimizes certain parameters to get to the right output. These parameters are initialized randomly. However, the sizes of these matrices depend on the number of neurons in the different layers of the neural-net.

To generate matrices with random parameters, we need to first obtain the size (number of neurons) of all the layers in our neural-net. We’ll write a function to do that. Let’s denote n_x, n_h, and n_y as the number of neurons in input layer, hidden layer, and output layer respectively.

We will obtain these shapes from our input and output data matrices created above.

dim(X)[1] gives us \(2\) because the shape of X is (2, 320). We do the same for dim(y)[1].

getLayerSize <- function(X, y, hidden_neurons, train=TRUE) {
  n_x <- dim(X)[1]
  n_h <- hidden_neurons
  n_y <- dim(y)[1]

  size <- list("n_x" = n_x,
               "n_h" = n_h,
               "n_y" = n_y)

  return(size)
}

As we can see below, the number of neurons is decided based on shape of the input and output matrices.

layer_size <- getLayerSize(X_train, y_train, hidden_neurons = 4)
layer_size
## $n_x
## [1] 2
##
## $n_h
## [1] 4
##
## $n_y
## [1] 1

Initialise parameters

Before we start training our parameters, we need to initialize them. Let’s initialize the parameters based on a random uniform distribution.

The function initializeParameters() takes as argument an input matrix and a list which contains the layer sizes i.e. number of neurons. The function returns the trainable parameters W1, b1, W2, b2.

Our neural-net has 3 layers, which gives us two sets of parameters. The first set is W1 and b1. The second set is W2 and b2. Note that these parameters exist as matrices.

These random weights matrices W1, b1, W2, b2 are created based on the layer sizes of the different layers (n_x, n_h, and n_y).

The sizes of these weights matrices are –

W1 = (n_h, n_x)
b1 = (n_h, 1)
W2 = (n_y, n_h)
b2 = (n_y, 1)

initializeParameters <- function(X, list_layer_size){

    m <- dim(data.matrix(X))[2]

    n_x <- list_layer_size$n_x
    n_h <- list_layer_size$n_h
    n_y <- list_layer_size$n_y

    W1 <- matrix(runif(n_h * n_x), nrow = n_h, ncol = n_x, byrow = TRUE) * 0.01
    b1 <- matrix(rep(0, n_h), nrow = n_h)
    W2 <- matrix(runif(n_y * n_h), nrow = n_y, ncol = n_h, byrow = TRUE) * 0.01
    b2 <- matrix(rep(0, n_y), nrow = n_y)

    params <- list("W1" = W1,
                   "b1" = b1,
                   "W2" = W2,
                   "b2" = b2)

    return (params)
}

For our network, the sizes of our weight matrices are as follows. Remember that the number of input neurons is n_x = 2, hidden neurons n_h = 4, and output neurons n_y = 1. layer_size is calculated above.

init_params <- initializeParameters(X_train, layer_size)
lapply(init_params, function(x) dim(x))
## $W1
## [1] 4 2
##
## $b1
## [1] 4 1
##
## $W2
## [1] 1 4
##
## $b2
## [1] 1 1

Define the Activation Functions

We implement the sigmoid() activation function for the output layer.

sigmoid <- function(x){
    return(1 / (1 + exp(-x)))
}

\[S(x) = \frac {1} {1 + e^{-x}}\]

The tanh() function is already present in R.

\[T(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}\]

Here, we plot both activation functions side-by-side for comparison.
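A minimal sketch of such a side-by-side plot using base R graphics:

par(mfrow = c(1, 2))
curve(1 / (1 + exp(-x)), from = -5, to = 5, main = "sigmoid()", ylab = "S(x)")
curve(tanh(x), from = -5, to = 5, main = "tanh()", ylab = "T(x)")
par(mfrow = c(1, 1))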

Forward Propagation

Now, onto defining the forward propagation. The function forwardPropagation() takes as arguments the input matrix X, the parameters list params, and the list of layer_sizes. We extract the layers sizes and weights from the respective functions defined above. To perform matrix multiplication, we use the %*% operator.

Before we perform the matrix multiplications, we need to reshape the parameters b1 and b2. Why do we do this? Let’s find out. Note that, the parameter shapes are:

  • W1: (4, 2)
  • b1: (4, 1)
  • W2: (1, 4)
  • b2 : (1, 1)

And the layers sizes are:

  • n_x = 2
  • n_h = 4
  • n_y = 1

Finally, shape of input matrix \(X\) (input layer):

  • X: (2, 320)

If we talk about the input => hidden step, the hidden layer, obtained as A1 = activation(Z1) with Z1 = W1 %*% X + b1, works out as follows:

  • For the matrix multiplication of W1 and X, their shapes are correct by default: (4, 2) x (2, 320). The shape of the output matrix W1 %*% X is (4, 320).

  • Now, b1 is of shape (4, 1). Since W1 %*% X is of shape (4, 320), we need to repeat b1 320 times, once for each input sample. We do that using the command rep(b1, m), where m is calculated as dim(X)[2], which selects the second dimension of the shape of X.

  • The shape of A1 is (4, 320).

In the case of hidden => output, the output, obtained as A2 = activation(Z2) with Z2 = W2 %*% A1 + b2, works out as follows:

  • The shapes of W2 and A1 are already correct for us to perform matrix multiplication on them. W2 is (1, 4) and A1 is (4, 320). The output W2 %*% A1 has the shape (1, 320). b2 has a shape of (1, 1). We will again repeat b2 like we did above, so b2 now becomes (1, 320).

  • The shape of A2 is now (1, 320).

We use the tanh() activation for the hidden layer and sigmoid() activation for the output layer.

forwardPropagation <- function(X, params, list_layer_size){

    m <- dim(X)[2]
    n_h <- list_layer_size$n_h
    n_y <- list_layer_size$n_y

    W1 <- params$W1
    b1 <- params$b1
    W2 <- params$W2
    b2 <- params$b2

    # repeat the bias vectors once per input sample so they can be added
    b1_new <- matrix(rep(b1, m), nrow = n_h)
    b2_new <- matrix(rep(b2, m), nrow = n_y)

    Z1 <- W1 %*% X + b1_new
    A1 <- tanh(Z1)       # tanh() activation for the hidden layer
    Z2 <- W2 %*% A1 + b2_new
    A2 <- sigmoid(Z2)    # sigmoid() activation for the output layer

    cache <- list("Z1" = Z1,
                  "A1" = A1,
                  "Z2" = Z2,
                  "A2" = A2)

    return (cache)
}

Even though we only need the value A2 for forward propagation, you’ll notice we return all other calculated values as well. We do this because these values will be needed during backpropagation. Saving them here will reduce the time it takes for backpropagation because we don’t have to calculate them again.

Another thing to notice is the Z and A of a particular layer will always have the same shape. This is because A = activation(Z) which does not change the shape of Z. An activation function only introduces non-linearity in a network.

fwd_prop <- forwardPropagation(X_train, init_params, layer_size)
lapply(fwd_prop, function(x) dim(x))
## $Z1
## [1]   4 320
##
## $A1
## [1]   4 320
##
## $Z2
## [1]   1 320
##
## $A2
## [1]   1 320

End of Part 1

We have reached the end of Part 1. In the next and final part, we will implement backpropagation and evaluate our model. Stay tuned!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

