
Help support GetDFPData


[This article was first published on R on msperlin, and kindly contributed to R-bloggers.]

The shiny version of GetDFPData is currently hosted on a private server at DigitalOcean. A problem with the basic (5 USD) server I was using was the low amount of available memory (RAM) and disk space. Because of that, I had to limit all xlsx queries for the data, otherwise the shiny app would run out of memory. After upgrading R on the server, the xlsx option was no longer working.

Today I tried every trick in the book to keep the 5 USD server and get the code to work. Nothing worked effectively. Microsoft Excel is a very restrictive format, and you should only use it for small projects. If the volume of data is high, as in GetDFPData, you’re going to run into a lot of issues with cell sizes and memory allocation. Despite my explicit recommendation to avoid the Excel format as much as possible, people still use it a lot. Not surprisingly, once I took the “xlsx” option out of the shiny interface, people complained to my email – a lot.

I just upgraded the RAM and disk of the server at DigitalOcean. The xlsx option is back and working. The new bill is 10 USD per month. So far I’ve been paying the bill out of my own pocket, using revenues from my books. GetDFPData has no official financial support and, yes, I’ll continue to finance it as much as I can. But support from those using the shiny interface of the CRAN package is very much welcome and will motivate further development to keep things running smoothly.

If you can, please help by donating a small amount to keep the server financed. Once I reach 12 months of paid bills (around 120 USD), I’ll remove the PayPal donation button and only add it back once the cash runs out.



Using Spark from R for performance with arbitrary code – Part 3 – Using R to construct SQL queries and let Spark execute them


[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers.]

Introduction

In the previous part of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead, focusing on writing functions as combinations of dplyr verbs, and we investigated how the SQL is generated and the Spark plans are created.

In this third part, we will look at how to write R functions that generate SQL queries that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed. We also briefly present wrapping these approaches into functions that can be combined with other Spark operations.

Preparation

The full setup of Spark and sparklyr is not in the scope of this post; please check the previous one for setup instructions and a ready-made Docker image.

# Load packages
suppressPackageStartupMessages({
  library(sparklyr)
  library(dplyr)
  library(nycflights13)
})

# Prepare the data
weather <- nycflights13::weather %>%
  mutate(id = 1L:nrow(nycflights13::weather)) %>%
  select(id, everything())

# Connect
sc <- sparklyr::spark_connect(master = "local")

# Copy the weather dataset to the instance
tbl_weather <- dplyr::copy_to(
  dest = sc,
  df = weather,
  name = "weather",
  overwrite = TRUE
)

# Copy the flights dataset to the instance
tbl_flights <- dplyr::copy_to(
  dest = sc,
  df = nycflights13::flights,
  name = "flights",
  overwrite = TRUE
)

R functions as Spark SQL generators

There are use cases where it is desirable to express the operations directly with SQL instead of combining dplyr verbs, for example when working within multi-language environments where re-usability is important. We can then send the SQL query directly to Spark to be executed. To create such queries, one option is to write R functions that work as query constructors.

Again using a very simple example, a naive implementation of column normalization could look as follows. Note that the use of SELECT * is discouraged and only here for illustration purposes:

normalize_sql <- function(df, colName, newColName) {
  paste0(
    "SELECT",
    "\n  ", df, ".*", ",",
    "\n  (", colName, " - (SELECT avg(", colName, ") FROM ", df, "))",
    " / ",
    "(SELECT stddev_samp(", colName, ") FROM ", df, ") as ", newColName,
    "\n", "FROM ", df
  )
}

Using the weather dataset would then yield the following SQL query when normalizing the temp column:

normalize_temp_query <- normalize_sql("weather", "temp", "normTemp")
cat(normalize_temp_query)

## SELECT
##   weather.*,
##   (temp - (SELECT avg(temp) FROM weather)) / (SELECT stddev_samp(temp) FROM weather) as normTemp
## FROM weather

Now that we have the query created, we can look at how to send it to Spark for execution.

Apache Spark and R logos

Executing the generated queries via Spark

Using DBI as the interface

The R package DBI provides an interface for communication between R and relational database management systems. We can simply use the dbGetQuery() function to execute our query, for instance:

res <- DBI::dbGetQuery(sc, statement = normalize_temp_query)
head(res)

##   id origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1  1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702
## 2  2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546
## 3  3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780
## 4  4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858
## 5  5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858
## 6  6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780
##   wind_gust precip pressure visib           time_hour   normTemp
## 1       NaN      0   1012.0    10 2013-01-01 06:00:00 -0.9130047
## 2       NaN      0   1012.3    10 2013-01-01 07:00:00 -0.9130047
## 3       NaN      0   1012.5    10 2013-01-01 08:00:00 -0.9130047
## 4       NaN      0   1012.2    10 2013-01-01 09:00:00 -0.8624083
## 5       NaN      0   1011.9    10 2013-01-01 10:00:00 -0.9130047
## 6       NaN      0   1012.4    10 2013-01-01 11:00:00 -0.9737203

As we might have noticed thanks to the way the result is printed, a standard data frame is returned, as opposed to tibbles returned by most sparklyr operations.

It is important to note that using dbGetQuery() automatically computes and collects the results into the R session. This is in contrast with the dplyr approach, which constructs the query and only collects the results into the R session when collect() is called, or computes them when compute() is called.
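For contrast, here is a minimal sketch of the lazy dplyr-style flow for the same normalization, assuming mean() and sd() translate to the corresponding Spark window functions as in the previous part of this series; nothing runs on Spark until collect() or compute() is called:

# Build the query lazily with dplyr verbs; nothing is sent to Spark yet
normalized_lazy <- tbl_weather %>%
  mutate(normTemp = (temp - mean(temp)) / sd(temp))

# Execute on Spark and pull the result into the R session only now
normalized_local <- normalized_lazy %>% collect()

# Or materialize the result on the Spark side without collecting it
normalized_remote <- normalized_lazy %>% compute(name = "weather_normalized")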

We will now examine two options for using the prepared query lazily, without collecting the results into the R session.

Invoking sql on a Spark session object

Without going into further detail on the invoke() functionality of sparklyr, which we will focus on in the fourth installment of the series: if we want a “lazy” SQL query that does not get automatically computed and collected when called from R, we can invoke the sql method on a SparkSession class object.

The method takes a string SQL query as input and processes it using Spark, returning the result as a Spark DataFrame. This gives us the ability to only compute and collect the results when desired:

# Use the query "lazily" without execution:
normalized_lazy_ds <- sc %>%
  spark_session() %>%
  invoke("sql", normalize_temp_query)

normalized_lazy_ds

## 
##   org.apache.spark.sql.Dataset
##   [id: int, origin: string ... 15 more fields]

# Collect when needed:
normalized_lazy_ds %>% collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed,
## #   wind_gust, precip, pressure, visib, time_hour, normTemp

Using tbl with dbplyr’s sql

The above method gives us a reference to a Java object as a result, which might be less intuitive to work with for R users. We can also opt to use dbplyr’s sql() function in combination with tbl() to get a more familiar result.

Note that when printing the normalized_lazy_tbl below, the query gets partially executed to provide the first few rows. Only when collect() is called is the entire result set retrieved into the R session:

# Nothing is executed yet
normalized_lazy_tbl <- normalize_temp_query %>%
  dbplyr::sql() %>%
  tbl(sc, .)

# Print the first few rows
normalized_lazy_tbl
## # Source: spark [?? x 17]
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with more rows, and 7 more variables: wind_speed, wind_gust,
## #   precip, pressure, visib, time_hour, normTemp

# Collect the entire result to the R session and print
normalized_lazy_tbl %>% collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed,
## #   wind_gust, precip, pressure, visib, time_hour, normTemp

Wrapping the tbl approach into functions

In the approach above we provided sc in the call to tbl(). When wrapping such processes into a function, it might however be useful to take the specific DataFrame reference as an input instead of the generic Spark connection reference. In that case, we can use the fact that the connection reference is also stored in the DataFrame reference, in the con sub-element of the src element. For instance, looking at our tbl_weather:

class(tbl_weather[["src"]][["con"]])

## [1] "spark_connection"       "spark_shell_connection"
## [3] "DBIConnection"

Putting this together, we can create a simple wrapper function that lazily sends a SQL query to be processed on a particular Spark DataFrame reference:

lazy_spark_query <- function(tbl, qry) {
  qry %>%
    dbplyr::sql() %>%
    dplyr::tbl(tbl[["src"]][["con"]], .)
}

And use it to do the same as we did above with a single function call:

lazy_spark_query(tbl_weather, normalize_temp_query) %>%
  collect()

## # A tibble: 26,115 x 17
##       id origin  year month   day  hour  temp  dewp humid wind_dir
##  1     1 EWR     2013     1     1     1  39.0  26.1  59.4      270
##  2     2 EWR     2013     1     1     2  39.0  27.0  61.6      250
##  3     3 EWR     2013     1     1     3  39.0  28.0  64.4      240
##  4     4 EWR     2013     1     1     4  39.9  28.0  62.2      250
##  5     5 EWR     2013     1     1     5  39.0  28.0  64.4      260
##  6     6 EWR     2013     1     1     6  37.9  28.0  67.2      240
##  7     7 EWR     2013     1     1     7  39.0  28.0  64.4      240
##  8     8 EWR     2013     1     1     8  39.9  28.0  62.2      250
##  9     9 EWR     2013     1     1     9  39.9  28.0  62.2      260
## 10    10 EWR     2013     1     1    10  41    28.0  59.6      260
## # … with 26,105 more rows, and 7 more variables: wind_speed,
## #   wind_gust, precip, pressure, visib, time_hour, normTemp

Combining multiple approaches and functions into lazy datasets

The power of Spark partly comes from lazy execution, and we can take advantage of this in ways that are not immediately obvious. Consider the following function we have shown previously:

lazy_spark_query

## function(tbl, qry) {
##   qry %>%
##     dbplyr::sql() %>%
##     dplyr::tbl(tbl[["src"]][["con"]], .)
## }

Since the output of this function without collection is actually only a translated SQL statement, we can take that output and keep combining it with other operations, for instance:

qry <- normalize_sql("flights", "dep_delay", "dep_delay_norm")

lazy_spark_query(tbl_flights, qry) %>%
  group_by(origin) %>%
  summarise(mean(dep_delay_norm)) %>%
  collect()

## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # A tibble: 3 x 2
##   origin `mean(dep_delay_norm)`
## 1 EWR                    0.0614
## 2 JFK                   -0.0131
## 3 LGA                   -0.0570

The crucial advantage is that even though lazy_spark_query would return the entire updated flights dataset when collected stand-alone, in combination with other operations Spark first figures out how to execute all the operations together efficiently and only then physically executes them, returning only the grouped and aggregated data to the R session.

We can therefore effectively combine multiple approaches to interfacing with Spark while still keeping the benefit of retrieving only very small, aggregated amounts of data to the R session. The effect is quite significant even with a dataset as small as flights (336,776 rows of 19 columns) and with a local Spark instance. The chart below compares executing a query lazily, aggregating within Spark and only retrieving the aggregated data, versus retrieving first and aggregating locally. The third boxplot shows the cost of pure collection on the query itself:

bench <- microbenchmark::microbenchmark(
  times = 20,
  collect_late = lazy_spark_query(tbl_flights, qry) %>%
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)) %>%
    collect(),
  collect_first = lazy_spark_query(tbl_flights, qry) %>%
    collect() %>%
    group_by(origin) %>%
    summarise(mean(dep_delay_norm)),
  collect_only = lazy_spark_query(tbl_flights, qry) %>%
    collect()
)

[Boxplot "Combine and collect late and small vs. early and bigger", time in milliseconds: collect_late median 1048 (949 to 1231); collect_first median 3419.5 (3196 to 4088); collect_only median 3403 (3015 to 3891)]

Where SQL can be better than dbplyr translation

When a translation is not there

We have discussed in the first part that the set of operations translated to Spark SQL via dbplyr may not cover all possible use cases. In such a case, the option to write SQL directly is very useful.

When translation does not provide expected results

In some instances, using dbplyr to translate R operations to Spark SQL can lead to unexpected results. As one example, consider the following integer division on a column of a local data frame.

# id_div_5 is as expected
weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)

## # A tibble: 26,115 x 2
##       id id_div_5
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with 26,105 more rows

As expected, we get the result of integer division in the id_div_5 column. However, applying the very same operation on a Spark DataFrame yields unexpected results:

# id_div_5 is normal division, not integer division
tbl_weather %>%
  mutate(id_div_5 = id %/% 5L) %>%
  select(id, id_div_5)

## # Source: spark [?? x 2]
##       id id_div_5
##  1     1      0.2
##  2     2      0.4
##  3     3      0.6
##  4     4      0.8
##  5     5      1
##  6     6      1.2
##  7     7      1.4
##  8     8      1.6
##  9     9      1.8
## 10    10      2
## # … with more rows

This is due to the fact that a translation for integer division is quite difficult to implement: https://github.com/tidyverse/dbplyr/issues/108. We could certainly figure out a way to fix this particular issue, but the workarounds may prove inefficient:

tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5)

## # Source: spark [?? x 2]
##       id id_div_5
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows

# Not too efficient:
tbl_weather %>%
  mutate(id_div_5 = as.integer(id %/% 5L)) %>%
  select(id, id_div_5) %>%
  explain()

## 
## SELECT `id`, CAST(`id` / 5 AS INT) AS `id_div_5`
## FROM `weather`
## 
## == Physical Plan ==
## *(1) Project [id#24, cast((cast(id#24 as double) / 5.0) as int) AS id_div_5#4273]
## +- InMemoryTableScan [id#24]
##       +- InMemoryRelation [id#24, origin#25, year#26, month#27, day#28, hour#29, temp#30, dewp#31, humid#32, wind_dir#33, wind_speed#34, wind_gust#35, precip#36, pressure#37, visib#38, time_hour#39], StorageLevel(disk, memory, deserialized, 1 replicas)
##             +- Scan ExistingRDD[id#24,origin#25,year#26,month#27,day#28,hour#29,temp#30,dewp#31,humid#32,wind_dir#33,wind_speed#34,wind_gust#35,precip#36,pressure#37,visib#38,time_hour#39]

Using SQL and the knowledge that Hive does provide a built-in DIV arithmetic operator, we can get the desired results very simply and efficiently by writing SQL:

"SELECT `id`, `id` DIV 5 `id_div_5` FROM `weather`" %>%
  dbplyr::sql() %>%
  tbl(sc, .)

## # Source: spark [?? x 2]
##       id id_div_5
##  1     1        0
##  2     2        0
##  3     3        0
##  4     4        0
##  5     5        1
##  6     6        1
##  7     7        1
##  8     8        1
##  9     9        1
## 10    10        2
## # … with more rows

Even though the numeric value of the results is correct here, we may still notice that the class of the returned id_div_5 column is actually numeric instead of integer. Such is the life of developers using data processing interfaces.
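If the integer class matters downstream, one possible tweak (my own sketch, not from the original post) is to add an explicit cast in the SQL, in the same spirit as the CAST shown in the explain() output above; whether the collected column actually comes back as an R integer may still depend on the sparklyr type mapping:

"SELECT `id`, CAST(`id` DIV 5 AS INT) AS `id_div_5` FROM `weather`" %>%
  dbplyr::sql() %>%
  tbl(sc, .)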

When portability is important

Since the languages that provide interfaces to Spark are not limited to R and multi-language setups are quite common, another reason to use SQL statements directly is the portability of such solutions. A SQL statement can be executed by interfaces provided for all languages – Scala, Java, and Python, without the need to rely on R-specific packages such as dbplyr.



GitHub Streak: Round Six


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]

Five years ago I referenced the Seinfeld Streak used in an earlier post about regular updates to the Rcpp Gallery:

This is sometimes called Jerry Seinfeld’s secret to productivity: Just keep at it. Don’t break the streak.

and then showed the first chart of GitHub streaking

github activity october 2013 to october 2014

And four years ago a first follow-up appeared in this post:

github activity october 2014 to october 2015

And three years ago we had a follow-up

github activity october 2015 to october 2016

And two years ago we had another one

github activity october 2016 to october 2017

And last year another one

github activity october 2017 to october 2018

As today is October 12, here is the newest one from 2018 to 2019:

github activity october 2018 to october 2019

Again, special thanks go to Alessandro Pezzè for the Chrome add-on GithubOriginalStreak.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS)


[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers.]

A new R package is in development. Please cite it if you use it.

The post easyMTS R Package: Quick Solver for Mahalanobis-Taguchi System (MTS) appeared first on Quality and Innovation.


easyMTS: My First R Package (Story, and Results)


[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers.]

This weekend I decided to create my first R package… it’s here!

https://github.com/NicoleRadziwill/easyMTS

Although I’ve been using R for 15 years, developing a package has been the one thing slightly out of reach for me. Now that I’ve been through the process once, with a package that’s not completely done (but at least has a firm foundation, and is usable to some degree), I can give you some advice:

  • Make sure you know R Markdown before you begin.
  • Some experience with Git and Github will be useful. Lots of experience will be very, very useful.
  • Write the functions that will go into your package into a file that you can source into another R program and use. If your programs work when you run the code this way, you will have averted many problems early.

The process I used to make this happen was:

I hope you enjoy following along with my process, and that it helps you write packages too. If I can do it, so can you!

The post easyMTS: My First R Package (Story, and Results) appeared first on Quality and Innovation.


Discover Offensive Programming


[This article was first published on NEONIRA, and kindly contributed to R-bloggers.]

Package wyz.code.offensiveProgramming version 1.1.12 is available on CRAN.

If you are interested in reducing the time and effort needed to implement and debug R code, to generate R documentation, and to generate test code, then you may consider using this package. It provides tools to manage

  1. semantic naming of function parameter arguments
  2. function return type
  3. functional test cases

Using this package, you will be able to verify the types of arguments passed to functions without implementing verification code inside your functions, thus reducing their size and your implementation time. The type and length of each parameter are verified on your explicit demand, allowing use at any stage of the software delivery life cycle.

Similarly, expected function returned types can also be verified on demand, either interactively or programmatically.

Browse on-line documentation to know more.

More to come on how to put this into action in the next post.


Cluster multiple time series using K-means


[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers.]

I have recently been confronted with the issue of finding similarities among time series and thought about using k-means to cluster them. To illustrate the method, I’ll be using data from the Penn World Tables, readily available in R (inside the {pwt9} package):

library(tidyverse)
library(lubridate)
library(pwt9)
library(brotools)

First of all, let’s select only the needed columns:

pwt <- pwt9.0 %>%
  select(country, year, avh)

avh contains the average worked hours for a given country and year. The data looks like this:

head(pwt)
##          country year avh
## ABW-1950   Aruba 1950  NA
## ABW-1951   Aruba 1951  NA
## ABW-1952   Aruba 1952  NA
## ABW-1953   Aruba 1953  NA
## ABW-1954   Aruba 1954  NA
## ABW-1955   Aruba 1955  NA

For each country, there’s yearly data on the avh variable. The goal here is to cluster the different countries by looking at how similar they are on the avh variable. Let’s do some further cleaning. The k-means implementation in R expects a wide data frame (currently my data frame is in the long format) and no missing values. These could potentially be imputed, but I can’t be bothered:

pwt_wide <- pwt %>%
  pivot_wider(names_from = year, values_from = avh) %>%
  filter(!is.na(`1950`)) %>%
  mutate_at(vars(-country), as.numeric)

To convert my data frame from long to wide, I use the fresh pivot_wider() function, instead of the less intuitive spread() function.

We’re ready to use the k-means algorithm. To know how many clusters I should aim for, I’ll be using the elbow method (if you’re not familiar with this method, click on the image at the very top of this post):

wss <- map_dbl(1:5, ~{
  kmeans(select(pwt_wide, -country), ., nstart = 50, iter.max = 15)$tot.withinss
})

n_clust <- 1:5

elbow_df <- as.data.frame(cbind("n_clust" = n_clust, "wss" = wss))

ggplot(elbow_df) +
  geom_line(aes(y = wss, x = n_clust), colour = "#82518c") +
  theme_blog()

Looks like 3 clusters is a good choice. Let’s now run the kmeans algorithm:

clusters <- kmeans(select(pwt_wide, -country), centers = 3)

clusters is a list with several interesting items. The item centers contains the “average” time series:

(centers <- rownames_to_column(as.data.frame(clusters$centers), "cluster"))
##   cluster     1950     1951     1952     1953     1954     1955     1956
## 1       1 2110.440 2101.273 2088.947 2074.273 2066.617 2053.391 2034.926
## 2       2 2086.509 2088.571 2084.433 2081.939 2078.756 2078.710 2074.175
## 3       3 2363.600 2350.774 2338.032 2325.375 2319.011 2312.083 2308.483
##       1957     1958     1959     1960     1961     1962     1963     1964
## 1 2021.855 2007.221 1995.038 1985.904 1978.024 1971.618 1963.780 1962.983
## 2 2068.807 2062.021 2063.687 2060.176 2052.070 2044.812 2038.939 2037.488
## 3 2301.355 2294.556 2287.556 2279.773 2272.899 2262.781 2255.690 2253.431
##       1965     1966     1967     1968     1969     1970     1971     1972
## 1 1952.945 1946.961 1928.445 1908.354 1887.624 1872.864 1855.165 1825.759
## 2 2027.958 2021.615 2015.523 2007.176 2001.289 1981.906 1967.323 1961.269
## 3 2242.775 2237.216 2228.943 2217.717 2207.037 2190.452 2178.955 2167.124
##       1973     1974     1975     1976     1977     1978     1979     1980
## 1 1801.370 1770.484 1737.071 1738.214 1713.395 1693.575 1684.215 1676.703
## 2 1956.755 1951.066 1933.527 1926.508 1920.668 1911.488 1904.316 1897.103
## 3 2156.304 2137.286 2125.298 2118.138 2104.382 2089.717 2083.036 2069.678
##       1981     1982     1983     1984     1985     1986     1987     1988
## 1 1658.894 1644.019 1636.909 1632.371 1623.901 1615.320 1603.383 1604.331
## 2 1883.376 1874.730 1867.266 1861.386 1856.947 1849.568 1848.748 1847.690
## 3 2055.658 2045.501 2041.428 2030.095 2040.210 2033.289 2028.345 2029.290
##       1989     1990     1991     1992     1993     1994     1995     1996
## 1 1593.225 1586.975 1573.084 1576.331 1569.725 1567.599 1567.113 1558.274
## 2 1842.079 1831.907 1823.552 1815.864 1823.824 1830.623 1831.815 1831.648
## 3 2031.741 2029.786 1991.807 1974.954 1973.737 1975.667 1980.278 1988.728
##       1997     1998     1999     2000     2001     2002     2003     2004
## 1 1555.079 1555.071 1557.103 1545.349 1530.207 1514.251 1509.647 1522.389
## 2 1835.372 1836.030 1839.857 1827.264 1813.477 1781.696 1786.047 1781.858
## 3 1985.076 1961.219 1966.310 1959.219 1946.954 1940.110 1924.799 1917.130
##       2005     2006     2007     2008     2009     2010     2011     2012
## 1 1514.492 1512.872 1515.299 1514.055 1493.875 1499.563 1503.049 1493.862
## 2 1775.167 1776.759 1773.587 1771.648 1734.559 1736.098 1742.143 1735.396
## 3 1923.496 1912.956 1902.156 1897.550 1858.657 1861.875 1861.608 1850.802
##       2013     2014
## 1 1485.589 1486.991
## 2 1729.973 1729.543
## 3 1848.158 1851.829

clusters also contains the cluster item, which tells me which cluster each country belongs to. I can easily add this to the original data frame:

pwt_wide <- pwt_wide %>%
  mutate(cluster = clusters$cluster)
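As a quick sanity check (a small addition, not in the original post), we can count how many countries fall into each cluster and list them; this is also where the cluster memberships quoted in the plot caption further down come from:

# Number of countries per cluster
count(pwt_wide, cluster)

# Which countries ended up together
split(pwt_wide$country, pwt_wide$cluster)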

Now, let’s prepare the data for visualisation. I have to go back to a long data frame for this:

pwt_long <- pwt_wide %>%
  pivot_longer(cols = c(-country, -cluster), names_to = "year", values_to = "avh") %>%
  mutate(year = ymd(paste0(year, "-01-01")))

centers_long <- centers %>%
  pivot_longer(cols = -cluster, names_to = "year", values_to = "avh") %>%
  mutate(year = ymd(paste0(year, "-01-01")))

And I can now plot the different time series, by cluster and highlight the “average” time series for each cluster as well (yellow line):

ggplot() +
  geom_line(data = pwt_long, aes(y = avh, x = year, group = country), colour = "#82518c") +
  facet_wrap(~cluster, nrow = 1) +
  geom_line(data = centers_long, aes(y = avh, x = year, group = cluster), col = "#b58900", size = 2) +
  theme_blog() +
  labs(title = "Average hours worked in several countries",
       caption = "The different time series have been clustered using k-means.\nCluster 1: Belgium, Switzerland, Germany, Denmark, France, Luxembourg, Netherlands, Norway, Sweden.\nCluster 2: Australia, Colombia, Ireland, Iceland, Japan, Mexico, Portugal, Turkey.\nCluster 3: Argentina, Austria, Brazil, Canada, Cyprus, Spain, Finland, UK, Italy, New Zealand, Peru, USA, Venezuela") +
  theme(plot.caption = element_text(colour = "white"))

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub



A Shiny Intro Survey to an Open Science Course


[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers.]

Last week, we started a new course titled “Statistical Programming and Open Science Methods”. It is being offered under the research program of TRR 266 “Accounting for Transparency” and enables students to conduct data-based research so that others can contribute and collaborate. This involves making research data and methods FAIR (findable, accessible, interoperable and reusable) and results reproducible. All the materials of the course are available on GitHub together with some notes in the README on how to use them for self-guided learning.

The course is over-booked, so running a normal introduction round was not feasible. Yet I was very interested to learn about the students’ backgrounds in statistical programming and their learning objectives. Thus, I decided to construct a quick online survey using the ‘shiny’ package. In addition to collecting data, this also provided me with an opportunity to showcase one of the less obvious applications of statistical programming.

The design of the survey is relatively straightforward. It asks the students about their familiarity with a set of statistical programming languages and then changes the survey dynamically to collect their assessments of how usable those languages are and how easy they are to learn. After that, it presents a list of programming-related terms and asks students to state whether they are reasonably familiar with them. It closes by asking students about their learning objectives for this course and gives them the opportunity to state their name.
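As a rough sketch of how such dynamic follow-up questions can be wired up in shiny (illustrative only: the input names, labels, and choices below are made up and not taken from the course repository), follow-up sliders are rendered only for the languages a student selects:

library(shiny)

ui <- fluidPage(
  checkboxGroupInput("languages", "Which of these languages are you familiar with?",
                     choices = c("R", "Python", "Stata")),
  uiOutput("followups")
)

server <- function(input, output, session) {
  # Render one usability question per selected language
  output$followups <- renderUI({
    lapply(input$languages, function(lang) {
      sliderInput(paste0("usability_", lang),
                  paste0("How usable do you find ", lang, "?"),
                  min = 1, max = 5, value = 3)
    })
  })
}

shinyApp(ui, server)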

Screen shot of survey page

The data is being stored in a SQLite file-based database in the directory of the shiny app. Another app accesses the data and presents a quick evaluation as well as the opportunity to download the anonymized data. You can access the survey here (submit button disabled) and the evaluation app here.
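A minimal sketch of what such storage could look like (the table name, column layout, and helper functions here are hypothetical, not taken from the course code): the survey app appends each submission to a file-based SQLite database with DBI, and the results app reads from the same file.

library(DBI)

# Append one submission (a one-row data frame) to the file-based database
store_response <- function(db_file, response_df) {
  con <- dbConnect(RSQLite::SQLite(), db_file)
  on.exit(dbDisconnect(con))
  if (!dbExistsTable(con, "responses")) {
    dbWriteTable(con, "responses", response_df)
  } else {
    dbAppendTable(con, "responses", response_df)
  }
}

# Read everything back for the evaluation app
read_responses <- function(db_file) {
  con <- dbConnect(RSQLite::SQLite(), db_file)
  on.exit(dbDisconnect(con))
  dbReadTable(con, "responses")
}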

Screen shot of evaluation page

To visualize the learning objectives I used the ‘ggwordcloud’ package. Fancy looking but of limited relevance.

One of those word clouds

The code for the survey and its evaluation is part of the course’s GitHub repository. Feel free to reuse. Some things that might be relevant here:

  • Watch out for SQL injection issues. In my code, I use DBI::sqlInterpolate() for this purpose (see the sketch after this list).
  • The repository contains both shiny apps (app_survey.R and app_results.R) in one directory. Make sure to export them as two separate shiny apps in separate folders.
  • Your result app needs to have access to the database file that the survey app is writing to. When you are hosting this on your own shiny server this can be realized by the results app linking to the database file in the folder of the survey app. If you plan to host your apps on a service like ‘shinyapps.io’ then this will most likely not be feasible. In this case, you might consider switching to an external database.
  • When using shiny in very large courses, your students might experience “Too many users” errors from shiny as it has a limit of 100 concurrent users for a given app. When running your own shiny server you can configure shiny to allow more users but my guess is that you will run into performance issues at some point.
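As a rough illustration of the sqlInterpolate() point above (the table and field names are made up for this sketch, not taken from the course code), user-supplied values are bound as parameters instead of being pasted into the SQL string:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "survey.sqlite3")
dbExecute(con, "CREATE TABLE IF NOT EXISTS responses (student_name TEXT, objective TEXT)")

# A value as it might arrive from a free-text shiny input
objective <- "learn R'); DROP TABLE responses; --"

# sqlInterpolate() quotes the values safely instead of pasting them into the string
sql <- sqlInterpolate(
  con,
  "INSERT INTO responses (student_name, objective) VALUES (?name, ?objective)",
  name = "Jane Doe",
  objective = objective
)
dbExecute(con, sql)

dbGetQuery(con, "SELECT * FROM responses")
dbDisconnect(con)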

This is it. Let me know your thoughts and I would be very happy to get in touch if you are reusing the code for your own class survey. Feel free to comment below. Alternatively, you can reach me via email or twitter.

Enjoy!



Hyper-Parameter Optimization of General Regression Neural Networks


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

A major advantage of General Regression Neural Networks (GRNN) over other types of neural networks is that there is only a single hyper-parameter, namely the sigma. In the previous post (https://statcompute.wordpress.com/2019/07/06/latin-hypercube-sampling-in-hyper-parameter-optimization), I’ve shown how to use the random search strategy to find a close-to-optimal value of the sigma by using various random number generators, including uniform random, Sobol sequence, and Latin hypercube sampling.

In addition to the random search, we can also directly optimize the sigma based on a pre-defined objective function by using the grnn.optmiz_auc() function (https://github.com/statcompute/yager/blob/master/code/grnn.optmiz_auc.R), in which either golden section search (by default) or Brent’s method is employed for the one-dimensional optimization. In the example below, the optimized sigma yields a slightly higher AUC in both the training and hold-out samples. As shown in the plot, the optimized sigma (in red) is right next to the best sigma from the random search.
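As a rough sketch of the idea (my own illustration, not the author's grnn.optmiz_auc() implementation), a one-dimensional search over sigma can be written with base R's optimize(); the objective below simulates a smooth, unimodal hold-out AUC purely for demonstration:

# Stand-in objective: pretend the hold-out AUC is a smooth, unimodal
# function of sigma that peaks near 0.6 (purely illustrative values)
auc_for_sigma <- function(sigma) {
  0.85 - (log(sigma) - log(0.6))^2 / 10
}

# optimize() combines golden section search with successive parabolic
# interpolation, the same family of one-dimensional methods mentioned above
best <- optimize(auc_for_sigma, interval = c(0.1, 2), maximum = TRUE)
best$maximum    # sigma with the highest (simulated) AUC, close to 0.6
best$objective  # the corresponding (simulated) AUC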


[Plot: the optimized sigma (red) shown next to the best sigma from the random search]


Rename Columns | R


[This article was first published on Data Science Using R – FinderDing, and kindly contributed to R-bloggers.]

Often the data you’re working with has abstract column names, such as (x1, x2, x3…). Typically, the first step I take when renaming columns with R is opening my web browser.

For some reason, no matter how many times I’ve done this, it’s just one of those things. (Hoping that writing about it will change that.)

The dataset cars contains data from the 1920s on “Speed and Stopping Distances of Cars”. There are only two columns, shown below.

colnames(datasets::cars)

[1] "speed" "dist"

If we wanted to rename the column “dist” to make it easier to know what the data is/means we can do so in a few different ways.

Using dplyr:

cars %>%
  rename("Stopping Distance (ft)" = dist) %>%
  colnames()

[1] "speed"                  "Stopping Distance (ft)"

cars %>%
  rename("Stopping Distance (ft)" = dist, "Speed (mph)" = speed) %>%
  colnames()

[1] "Speed (mph)"            "Stopping Distance (ft)"

Using base R:

colnames(cars)[2] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"

colnames(cars)[1:2] <- c("Speed (mph)", "Stopping Distance (ft)")
colnames(cars)

[1] "Speed (mph)"            "Stopping Distance (ft)"

Using grep():

colnames(cars)[grep("dist", colnames(cars))] <- "Stopping Distance (ft)"
colnames(cars)

[1] "speed"                  "Stopping Distance (ft)"

Making a background color gradient in ggplot2


[This article was first published on Very statisticious, and kindly contributed to R-bloggers.]

I was recently making some arrangements for the 2020 eclipse in South America, which of course got me thinking of the day we were lucky enough to have a path of totality come to us.

We have a weather station that records local temperature every 5 minutes, so after the eclipse I was able to plot the temperature change over the eclipse as we experienced it at our house. Here is an example of a basic plot I made at the time.

Looking at this now with new eyes, I see it might be nice to replace the gray rectangle with one that goes from light to dark to light as the eclipse progresses to totality and then back. I’ll show how I tackled making a gradient color background in this post.

Load R packages

I’ll load ggplot2 for plotting and dplyr for data manipulation.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

The dataset

My weather station records the temperature in °Fahrenheit every 5 minutes. I downloaded the data from 6 AM to 12 PM local time and cleaned it up a bit. The date-times and temperature are in a dataset I named temp. You can download this below if you’d like to play around with these data.


Download eclipse_temp.csv


Here are the first six lines of this temperature dataset.

head(temp)
# # A tibble: 6 x 2
#   datetime            tempf
# 1 2017-08-21 06:00:00  54.9
# 2 2017-08-21 06:05:00  54.9
# 3 2017-08-21 06:10:00  54.9
# 4 2017-08-21 06:15:00  54.9
# 5 2017-08-21 06:20:00  54.9
# 6 2017-08-21 06:25:00  54.8

I also stored the start and end times of the eclipse and totality in data.frames, which I pulled for my location from this website.

If following along at home, make sure your time zones match for all the date-time variables or, from personal experience 🤣, you’ll run into problems.

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
                     end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
                      end = as.POSIXct("2017-08-21 10:18:52") )
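One way to head off that time-zone pitfall (a small aside, not part of the original code; the zone name is my assumption based on the PDT times quoted above) is to pin every date-time to an explicit time zone when parsing:

# Same start and end times, with the time zone stated explicitly
eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10", tz = "America/Los_Angeles"),
                     end = as.POSIXct("2017-08-21 11:37:19", tz = "America/Los_Angeles") )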

Initial plot

I decided to make a plot of the temperature change during the eclipse only.

To keep the temperature line looking continuous, even though it’s taken every 5 minutes, I subset the data to times close but outside the start and end of the eclipse.

plottemp = filter(temp, between(datetime,
                                as.POSIXct("2017-08-21 09:00:00"),
                                as.POSIXct("2017-08-21 12:00:00") ) )

I then zoomed the plot to only include times encompassed by the eclipse with coord_cartesian(). I removed the x axis expansion in scale_x_datetime().

Since the plot covers only about 2 and a half hours, I make breaks on the x axis every 15 minutes.

ggplot(plottemp) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = "15 min",
                       date_labels = "%H:%M",
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = "Temperature during 2017-08-21 solar eclipse",
          subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
          caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
                                            name = expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() )

Adding color gradient using geom_segment()

I wanted the background of the plot to go from light to dark back to light through time. This means a color gradient should go from left to right across the plot.

Since the gradient will be based on time, I figured I could add a vertical line with geom_segment() for every second of the eclipse and color each segment based on how far that time was from totality.

Make a variable for the color mapping

The first step I took was to make a variable with a row for every second of the eclipse, since I wanted a segment drawn for each second. I used seq.POSIXt for this.

color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )

Then came some hard thinking. How would I make a continuous variable to map to color? 🤔

While I don’t have an actual measurement of light throughout the eclipse, I can show the general idea of a light change with color by using a linear change in color from the start of the eclipse to totality and then another linear change in color from totality to the end of the eclipse.

My first idea for creating a variable was to use information on the current time vs totality start/end. After subtracting the times before totality from totality start and subtracting totality end from times after totality, I realized that the amount of time before totality wasn’t actually the same as the amount of time after totality. Back to the drawing board.

Since I was making a linear change in color, I realized I could make a sequence of values before totality and after totality that covered the same range but had a different total number of values. This would account for the difference in the length of time before and after totality. I ended up making a sequence going from 100 to 0 for times before totality and a sequence from 0 to 100 after totality. Times during totality were assigned a 0.

Here’s one way to get these sequences, using base::replace(). My dataset is in order by time, which is key to this working correctly.

color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color,
                                   time < totality$start,
                                   seq(100, 0, length.out = sum(time < totality$start) ) ),
                   color = replace(color,
                                   time > totality$end,
                                   seq(0, 100, length.out = sum(time > totality$end) ) ) )

Adding one geom_segment() per second

Once I had my color variable I was ready to plot the segments along the x axis. Since the segments needed to go across the full height of the plot, I set y and yend to -Inf and Inf, respectively.

I put this layer first to use it as a background that the temperature line was plotted on top of.

g1 = ggplot(plottemp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = "15 min",
                       date_labels = "%H:%M",
                       expand = c(0, 0) ) +
     coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = "Temperature during 2017-08-21 solar eclipse",
          subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
          caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
                                            name = expression( Temperature~(degree*C)),
                                            breaks = seq(16, 24, by = 1)) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid = element_blank() )

g1

Switching to a gray scale

The default blue color scheme for the segments actually works OK, but I was picturing going from white to dark. I picked gray colors with grDevices::gray.colors() in scale_color_gradient(). In gray.colors(), 0 is black and 1 is white. I didn’t want the colors to go all the way to black, since that would make the temperature line impossible to see during totality. And, of course, it’s not actually pitch black during totality. 😁

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )

Using segments to make a gradient rectangle

I can use this same approach on only a portion of the x axis to give the appearance of a rectangle with gradient fill. Here’s an example using times outside the eclipse.

g2 = ggplot(temp) +
     geom_segment(data = color_dat,
                  aes(x = time, xend = time,
                      y = -Inf, yend = Inf, color = color),
                  show.legend = FALSE) +
     geom_line( aes(datetime, tempf), size = 1 ) +
     scale_x_datetime( date_breaks = "1 hour",
                       date_labels = "%H:%M",
                       expand = c(0, 0) ) +
     labs(y = expression( Temperature~(degree*F) ),
          x = NULL,
          title = "Temperature during 2017-08-21 solar eclipse",
          subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
          caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
     ) +
     scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9 ,
                                            name = expression( Temperature~(degree*C)),
                                            breaks = seq(12, 24, by = 2)) ) +
     scale_color_gradient(low = gray.colors(1, .25),
                          high = gray.colors(1, 1) ) +
     theme_bw(base_size = 14) +
     theme(panel.grid.major.x = element_blank(),
           panel.grid.minor = element_blank() )

g2

Bonus: annotations with curved arrows

This second plot gives me a chance to try out Cédric Scherer’s very cool curved annotation arrow idea for the first time 🎉.

g2 = g2 +
    annotate("text", x = as.POSIXct("2017-08-21 08:00"),
             y = 74,
             label = "Partial eclipse begins\n09:05:10 PDT",
             color = "grey24") +
    annotate("text", x = as.POSIXct("2017-08-21 09:00"),
             y = 57,
             label = "Totality begins\n10:16:55 PDT",
             color = "grey24")

g2

I’ll make a data.frame for the arrow start/end positions. I’m skipping all the work it took to get the positions where I wanted them, which is always iterative for me.

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
                                       "2017-08-21 09:34") ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )

I add arrows with geom_curve(). I changed the size of the arrowhead and made it closed in arrow(). I also thought the arrows looked better with a little less curvature.

g2 +
    geom_curve(data = arrows,
               aes(x = x1, xend = x2,
                   y = y1, yend = y2),
               arrow = arrow(length = unit(0.075, "inches"),
                             type = "closed"),
               curvature = 0.25)

Other ways to make a gradient color background

Based on a bunch of internet searches, it looks like a gradient background in ggplot2 generally takes some work. There are some nice examples out there on how to use rasterGrob() and annotation_custom() to add background gradients, such as in this Stack Overflow question. I haven't researched how to make this go from light to dark and back to light for the uneven time scale like in my example.
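For reference, a minimal sketch of that rasterGrob()/annotation_custom() idea is below; it assumes a simple left-to-right gray gradient (not the light-dark-light pattern above) and reuses the plottemp data from earlier.

# Sketch of the rasterGrob() approach, not what was used for the plots above
library(grid)

# A 1 x 100 matrix of colors becomes a horizontal gradient when stretched
gradient = rasterGrob(t(gray.colors(100, start = 1, end = 0.25)),
                      width = unit(1, "npc"), height = unit(1, "npc"),
                      interpolate = TRUE)

ggplot(plottemp) +
    annotation_custom(gradient, xmin = -Inf, xmax = Inf,
                      ymin = -Inf, ymax = Inf) +
    geom_line( aes(datetime, tempf), size = 1 )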

I’ve also seen approaches involving dataset expansion and drawing many filled rectangles or using rasters, which is like what I did with geom_segment().

Eclipses!

Before actually experiencing totality, it seemed to me like the difference between a 99% and a 100% eclipse wasn’t a big deal. I mean, those numbers are pretty darn close.

I was very wrong. 😜

If you ever are lucky enough to be near a path of totality, definitely try to get there even if it's a little more trouble than the 99.9% partial eclipse. It's an amazing experience. 😻

Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(ggplot2) # 3.2.1
library(dplyr) # 0.8.3

head(temp)

eclipse = data.frame(start = as.POSIXct("2017-08-21 09:05:10"),
                     end = as.POSIXct("2017-08-21 11:37:19") )

totality = data.frame(start = as.POSIXct("2017-08-21 10:16:55"),
                      end = as.POSIXct("2017-08-21 10:18:52") )

plottemp = filter(temp, between(datetime,
                                as.POSIXct("2017-08-21 09:00:00"),
                                as.POSIXct("2017-08-21 12:00:00") ) )

ggplot(plottemp) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "15 min",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
    ) +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C)),
                                           breaks = seq(16, 24, by = 1)) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid = element_blank() )

color_dat = data.frame(time = seq(eclipse$start, eclipse$end, by = "1 sec") )

color_dat = mutate(color_dat,
                   color = 0,
                   color = replace(color,
                                   time < totality$start,
                                   seq(100, 0, length.out = sum(time < totality$start) ) ),
                   color = replace(color,
                                   time > totality$end,
                                   seq(0, 100, length.out = sum(time > totality$end) ) ) )

g1 = ggplot(plottemp) +
    geom_segment(data = color_dat,
                 aes(x = time, xend = time,
                     y = -Inf, yend = Inf, color = color),
                 show.legend = FALSE) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "15 min",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    coord_cartesian(xlim = c(eclipse$start, eclipse$end) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, 09:05:10 - 11:37:19 PDT") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
    ) +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C)),
                                           breaks = seq(16, 24, by = 1)) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid = element_blank() )

g1

g1 + scale_color_gradient(low = gray.colors(1, 0.25),
                          high = gray.colors(1, 1) )

g2 = ggplot(temp) +
    geom_segment(data = color_dat,
                 aes(x = time, xend = time,
                     y = -Inf, yend = Inf, color = color),
                 show.legend = FALSE) +
    geom_line( aes(datetime, tempf), size = 1 ) +
    scale_x_datetime( date_breaks = "1 hour",
                      date_labels = "%H:%M",
                      expand = c(0, 0) ) +
    labs(y = expression( Temperature~(degree*F) ),
         x = NULL,
         title = "Temperature during 2017-08-21 solar eclipse",
         subtitle = expression(italic("Sapsucker Farm, Dallas, OR, USA") ),
         caption = "Eclipse: 2 hours 32 minutes 9 seconds\nTotality: 1 minute 57 seconds"
    ) +
    scale_y_continuous(sec.axis = sec_axis(~ (. - 32) * 5 / 9,
                                           name = expression( Temperature~(degree*C)),
                                           breaks = seq(12, 24, by = 2)) ) +
    scale_color_gradient(low = gray.colors(1, .25),
                         high = gray.colors(1, 1) ) +
    theme_bw(base_size = 14) +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor = element_blank() )

g2

g2 = g2 +
    annotate("text", x = as.POSIXct("2017-08-21 08:00"),
             y = 74,
             label = "Partial eclipse begins\n09:05:10 PDT",
             color = "grey24") +
    annotate("text", x = as.POSIXct("2017-08-21 09:00"),
             y = 57,
             label = "Totality begins\n10:16:55 PDT",
             color = "grey24")

g2

arrows = data.frame(x1 = as.POSIXct( c("2017-08-21 08:35",
                                       "2017-08-21 09:34") ),
                    x2 = c(eclipse$start, totality$start),
                    y1 = c(74, 57.5),
                    y2 = c(72.5, 60) )

g2 +
    geom_curve(data = arrows,
               aes(x = x1, xend = x2,
                   y = y1, yend = y2),
               arrow = arrow(length = unit(0.075, "inches"),
                             type = "closed"),
               curvature = 0.25)

To leave a comment for the author, please follow the link and comment on their blog: Very statisticious on Very statisticious.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Autumn Barnsley Fern


[This article was first published on exploRations in R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

I was playing around generating fractals in R when I realized the monochromatic green Barnsley Fern I had on my screen didn't quite look like the leaves I could see outside my window. It was already Fall. In this post I describe a technique to generate a Barnsley Fern with autumn foliage.

The Barnsley Fern

First, we generate the Barnsley Fern.

#------------ Generate the Barnsley Fern ---------------
library(dplyr)   # for bind_rows(), the pipe, and tibble() used later

# Define the coefficients to calculate the x,y values
coef_names <- list(NULL, c("x","y","z"))
coef_x <- matrix(ncol = 3, byrow = TRUE, dimnames = coef_names,
                 c(0, 0, 0,
                   -0.15, 0.28, 0,
                   0.2, -0.26, 0,
                   0.85, 0.04, 0))
coef_y <- matrix(ncol = 3, byrow = TRUE, dimnames = coef_names,
                 c(0, 0.16, 0,
                   0.26, 0.24, 0.44,
                   0.23, 0.22, 1.6,
                   -0.04, 0.85, 1.6))

# Indicate the percentage of iterations through each row of the coefficients
coef_ctrl <- c(1, 7, 14, 85)

# Initialize a list to collect the generated points
points <- list()
points$x <- 0
points$y <- 0

# Set maximum iterations and reset coefficient tracker
max_i <- 1000000
coef <- NA

# Generate the x,y points as a list and combine as a dataframe
for (i in 2:max_i) {
    rand = runif(1, 0, 100)

    if (rand < coef_ctrl[1]) {
        coef <- 1
    } else if (rand < coef_ctrl[2]) {
        coef <- 2
    } else if (rand < coef_ctrl[3]) {
        coef <- 3
    } else {
        coef <- 4
    }

    points$x[i] <- points$x[i-1]*coef_x[coef, 1] + points$y[i-1]*coef_x[coef, 2] + coef_x[coef, 3]
    points$y[i] <- points$x[i-1]*coef_y[coef, 1] + points$y[i-1]*coef_y[coef, 2] + coef_y[coef, 3]
}

df <- bind_rows(points)

We can plot the fern, and coloring it green is straightforward. We now have a Summer Barnsley Fern.

# Checkout your Summer Barnsley Fern
plot(df$x, df$y, pch = '.', col = "forestgreen", xaxt = "n", yaxt = "n", xlab = NA, ylab = NA,
     main = "Summer Barnsley Fern")

Now we consider the objective of coloring the leaf such that it reflects the patterns we see in the autumn. The edges of a leaf will usually change colors first, so we want the red and orange tints on the sides of the leaf and yellow and green tints in the midsection along the stem. The leaf also tapers from bottom to tip, and as the leaf thins at the top we expect more red and orange tints and less of the green.

To accomplish this, we color the fern in a way that is symmetrical and thus will base the color on the distance each x is from the mean of x coordinates. Because the fern curves as y increases we need to shift the colors to the right to follow the curve. This is accomplished by binning the y coordinates and calculating the mean of the x coordinates for each y bin.

df$ybin <- signif(df$y, 2)

df <- df[-1,] %>%
    group_by(ybin) %>%
    mutate(sd = sd(x), mean = mean(x)) %>%
    ungroup() %>%
    mutate(xdev = abs(mean - x)/sd)

df[is.na(df$xdev),]$xdev <- 0

Because the fern also narrows at the top, we want to use proportionally more of the colors we use for the edges (farthest from the mean). Thus, we will factor in the value of y along with the x distance from the mean to determine the color for the point.

A histogram can help determine what break points to use for the colors.

# not run
# hist(df$xdev + (df$y/10))

We define our autumn color schematic and generate the Autumn Barnsley Fern.

# Set the breakpoints and the colors
color_table <- tibble( value = c(0.5, 0.8, 1.1, 1.5, 1.9, 2.1, 2.3, max(df$xdev + (df$y/10))),
                       color = c("forestgreen", "yellowgreen", "yellow3", "gold2",
                                 "darkgoldenrod2", "darkorange3", "darkorange4", "brown4"))

# Lookup the corresponding color for each of the points.
df$col <- NA
for (r in 1:nrow(color_table) ){
    df$col[df$xdev + (df$y/10) <= color_table$value[r] & is.na(df$col)] <- color_table$color[r]
}

plot(df$x, df$y, pch = '.', col = df$col, xaxt = "n", yaxt = "n", xlab = NA, ylab = NA,
     main = "Autumn Barnsley Fern")


To leave a comment for the author, please follow the link and comment on their blog: exploRations in R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My First R Package (Part 1)


[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

(What does this new package do? Find out here.)

I have had package-o-phobia for years, and have skillfully resisted learning how to build a new R package. However, I do have a huge collection of scripts on my hard drive with functions in them, and I keep a bunch of useful functions up on Github so anyone who wants can source and use them. I source them myself! So, really, there’s no reason to package them up and (god forbid) submit them to CRAN. I’m doing fine without packages!

Reality check: NO. As I’ve been told by so many people, if you have functions you use a lot, you should write a package. You don’t even have to think about a package as something you write so that other people can use. It is perfectly fine to write a package for an audience of one — YOU.

But I kept making excuses for myself until very recently, when I couldn’t find a package to do something I needed to do, and all the other packages were either not getting the same answers as in book examples OR they were too difficult to use. It was time.

So armed with moral support and some exciting code, I began the journey of a thousand miles with the first step, guided by Tomas Westlake and Emil Hvitfeldt and of course Hadley. I already had some of the packages I needed, but did not have the most magical one of all, usethis:

install.packages("usethis")library(usethis)library(roxygen2)library(devtools)

Finding a Package Name

First, I checked to see if the package name I wanted was available. It was not available on CRAN, which was sad:

> available("MTS")Urban Dictionary can contain potentially offensive results,  should they be included? [Y]es / [N]o:1: Y-- MTS -------------------------------------------------------------------------Name valid: ✔Available on CRAN: ✖ Available on Bioconductor: ✔Available on GitHub:  ✖ Abbreviations: http://www.abbreviations.com/MTSWikipedia: https://en.wikipedia.org/wiki/MTSWiktionary: https://en.wiktionary.org/wiki/MTS

My second package name was available though, and I think it’s even better. I’ve written code to easily create and evaluate diagnostic algorithms using the Mahalanobis-Taguchi System (MTS), so my target package name is easyMTS:

> available("easyMTS")-- easyMTS ------------------------------------------------------------Name valid: ✔Available on CRAN: ✔ Available on Bioconductor: ✔Available on GitHub:  ✔ Abbreviations: http://www.abbreviations.com/easyWikipedia: https://en.wikipedia.org/wiki/easyWiktionary: https://en.wiktionary.org/wiki/easySentiment:+++

Create Minimum Viable Package

Next, I set up the directory structure locally. Another RStudio session started up on its own; I’m hoping this is OK.

> create_package("D:/R/easyMTS")
✔ Creating 'D:/R/easyMTS/'
✔ Setting active project to 'D:/R/easyMTS'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: easyMTS
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last [aut, cre] ()
Description: What the package does (one paragraph).
License: What license it uses
Encoding: UTF-8
LazyData: true
✔ Writing 'NAMESPACE'
✔ Writing 'easyMTS.Rproj'
✔ Adding '.Rproj.user' to '.gitignore'
✔ Adding '^easyMTS\\.Rproj$', '^\\.Rproj\\.user$' to '.Rbuildignore'
✔ Opening 'D:/R/easyMTS/' in new RStudio session
✔ Setting active project to ''

Syncing with Github

use_git_config(user.name = "nicoleradziwill", user.email = "nicole.radziwill@gmail.com")
browse_github_token()

This took me to a page on Github where I entered my password, and then had to go down to the bottom of the page to click on the green button that said “Generate Token.” They said I would never be able to see it again, so I gmailed it to myself for easy searchability. Next, I put this token where it is supposed to be locally:

edit_r_environ()

A blank file popped up in RStudio, and I added this line, then saved the file to its default location (not my real token):

GITHUB_PAT=e54545x88f569fff6c89abvs333443433d

Then I had to restart R and confirm it worked:

github_token()

This revealed my token! I must have done the Github setup right. Finally I could proceed with the rest of the git setup:

> use_github()
✔ Setting active project to 'D:/R/easyMTS'
Error: Cannot detect that project is already a Git repository.
Do you need to run `use_git()`?
> use_git()
✔ Initialising Git repo
✔ Adding '.Rhistory', '.RData' to '.gitignore'
There are 5 uncommitted files:
* '.gitignore'
* '.Rbuildignore'
* 'DESCRIPTION'
* 'easyMTS.Rproj'
* 'NAMESPACE'
Is it ok to commit them?
1: No
2: Yeah
3: Not now
Selection: use_github()
Enter an item from the menu, or 0 to exit
Selection: 2
✔ Adding files
✔ Commit with message 'Initial commit'
● A restart of RStudio is required to activate the Git pane
Restart now?
1: No way
2: For sure
3: Nope
Selection: 2

When I tried to commit to Github, it was asking me if the description was OK, but it was NOT. Every time I said no, it kicked me out. Turns out it wanted me to go directly into the DESCRIPTION file and edit it, so I did. I used Notepad because this was crashing RStudio. But this caused a new problem:

Error: Uncommited changes. Please commit to git before continuing.

This is the part of the exercise where it’s great to be living with a software engineer who uses git and Github all the time. He pointed me to a tiny little tab that said “Terminal” in the bottom left corner of RStudio, just to the right of “Console”. He told me to type this, which unstuck me:

THEN, when I went back to the Console, it all worked:

> use_git()
> use_github()
✔ Checking that current branch is 'master'
Which git protocol to use? (enter 0 to exit)
1: ssh   <-- presumes that you have set up ssh keys
2: https <-- choose this if you don't have ssh keys (or don't know if you do)
Selection: 2
● Tip: To suppress this menu in future, put
  `options(usethis.protocol = "https")`
  in your script or in a user- or project-level startup file, '.Rprofile'.
  Call `usethis::edit_r_profile()` to open it for editing.
● Check title and description
  Name:        easyMTS
  Description:
Are title and description ok?
1: Yes
2: Negative
3: No
Selection: 1
✔ Creating GitHub repository
✔ Setting remote 'origin' to 'https://github.com/NicoleRadziwill/easyMTS.git'
✔ Pushing 'master' branch to GitHub and setting remote tracking branch
✔ Opening URL 'https://github.com/NicoleRadziwill/easyMTS'

This post is getting long, so I’ll split it into parts. See you in Part 2.

GO TO PART 2 –>

The post My First R Package (Part 1) appeared first on Quality and Innovation.


To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My First R Package (Part 2)


[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In Part 1, I set up RStudio with usethis, and created my first Minimum Viable R Package (MVRP?) which was then pushed to Github to create a new repository.

I added a README:

> use_readme_rmd()
✔ Writing 'README.Rmd'
✔ Adding '^README\\.Rmd$' to '.Rbuildignore'
● Modify 'README.Rmd'
✔ Writing '.git/hooks/pre-commit'

Things were moving along just fine, until I got this unkind message (what do you mean NOT an R package???!! What have I been doing the past hour?)

> use_testthat()
Error: `use_testthat()` is designed to work with packages.
Project 'easyMTS' is not an R package.

> use_mit_license("Nicole Radziwill")
✔ Setting active project to 'D:/R/easyMTS'
Error: `use_mit_license()` is designed to work with packages.
Project 'easyMTS' is not an R package.

Making easyMTS a Real Package

I sent out a tweet hoping to find some guidance, because Stack Overflow and Google and the RStudio community were coming up blank. As soon as I did, I discovered this button in RStudio:

The first time I ran it, it complained that I needed Rtools, but that Rtools didn’t exist for version 3.6.1. I decided to try finding and installing Rtools anyway because what could I possibly lose. I went to my favorite CRAN repository and found a link for Rtools just under the link for the base install:

I’m on Windows 10, so this downloaded an .exe which I quickly right-clicked on to run… the installer did its thing, and I clicked “Finish”, assuming that all was well. Then I went back into RStudio and tried to do Build -> Clean and Rebuild… and here’s what happened:

IT WORKED!! (I think!!!)

It created a package (top right) and then loaded it into my RStudio session (bottom left)! It loaded the package name into the package console (bottom right)!

I feel like this is a huge accomplishment for now, so I’m going to move to Part 3 of my blog post. We’ll figure out how to close the gaps that I’ve invariably introduced by veering off-tutorial.

The post My First R Package (Part 2) appeared first on Quality and Innovation.


To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My First R Package (Part 3)


[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After refactoring my programming so that it was only about 10 lines of code, using 12 functions I wrote and loaded in via the source command, I went through all the steps in Part 1 of this blog post and Part 2 of this blog post to set up the R package infrastructure using usethis in RStudio. Then things started humming along with the rest of the setup:

> use_mit_license("Nicole Radziwill")
✔ Setting active project to 'D:/R/easyMTS'
✔ Setting License field in DESCRIPTION to 'MIT + file LICENSE'
✔ Writing 'LICENSE.md'
✔ Adding '^LICENSE\\.md$' to '.Rbuildignore'
✔ Writing 'LICENSE'

> use_testthat()
✔ Adding 'testthat' to Suggests field in DESCRIPTION
✔ Creating 'tests/testthat/'
✔ Writing 'tests/testthat.R'
● Call `use_test()` to initialize a basic test file and open it for editing.

> use_vignette("easyMTS")
✔ Adding 'knitr' to Suggests field in DESCRIPTION
✔ Setting VignetteBuilder field in DESCRIPTION to 'knitr'
✔ Adding 'inst/doc' to '.gitignore'
✔ Creating 'vignettes/'
✔ Adding '*.html', '*.R' to 'vignettes/.gitignore'
✔ Adding 'rmarkdown' to Suggests field in DESCRIPTION
✔ Writing 'vignettes/easyMTS.Rmd'
● Modify 'vignettes/easyMTS.Rmd'

> use_citation()
✔ Creating 'inst/'
✔ Writing 'inst/CITATION'
● Modify 'inst/CITATION'

Add Your Dependencies

> use_package("ggplot2")✔ Adding 'ggplot2' to Imports field in DESCRIPTION● Refer to functions with `ggplot2::fun()`> use_package("dplyr")✔ Adding 'dplyr' to Imports field in DESCRIPTION● Refer to functions with `dplyr::fun()`> use_package("magrittr")✔ Adding 'magrittr' to Imports field in DESCRIPTION● Refer to functions with `magrittr::fun()`> use_package("tidyr")✔ Adding 'tidyr' to Imports field in DESCRIPTION● Refer to functions with `tidyr::fun()`> use_package("MASS")✔ Adding 'MASS' to Imports field in DESCRIPTION● Refer to functions with `MASS::fun()`> use_package("qualityTools")✔ Adding 'qualityTools' to Imports field in DESCRIPTION● Refer to functions with `qualityTools::fun()`> use_package("highcharter")Registered S3 method overwritten by 'xts':  method     from  as.zoo.xts zoo Registered S3 method overwritten by 'quantmod':  method            from  as.zoo.data.frame zoo ✔ Adding 'highcharter' to Imports field in DESCRIPTION● Refer to functions with `highcharter::fun()`> use_package("cowplot")✔ Adding 'cowplot' to Imports field in DESCRIPTION● Refer to functions with `cowplot::fun()`

Adding Data to the Package

I want to include two files, one data frame containing 50 observations of a healthy group with 5 predictors each, and another data frame containing 15 observations from an abnormal or unhealthy group (also with 5 predictors). I made sure the two CSV files I wanted to add to the package were in my working directory first by using dir().

> use_data_raw()
✔ Creating 'data-raw/'
✔ Adding '^data-raw$' to '.Rbuildignore'
✔ Writing 'data-raw/DATASET.R'
● Modify 'data-raw/DATASET.R'
● Finish the data preparation script in 'data-raw/DATASET.R'
● Use `usethis::use_data()` to add prepared data to package

> mtsdata1 <- read.csv("MTS-Abnormal.csv") %>% mutate(abnormal=1)
> usethis::use_data(mtsdata1)
✔ Creating 'data/'
✔ Saving 'mtsdata1' to 'data/mtsdata1.rda'

> mtsdata2 <- read.csv("MTS-Normal.csv") %>% mutate(normal=1)
> usethis::use_data(mtsdata2)
✔ Saving 'mtsdata2' to 'data/mtsdata2.rda'

Magically, this added my two files (in .rda format) into my /data directory. (Now, though, I don't know why the /data-raw directory is there… maybe we'll figure that out later.) I decided it was time to commit these to my repository again:

Following the instruction above, I re-knit the README.Rmd and then it was possible to commit everything to Github again. At which point I ended up in a fistfight with git, again saved only by my software engineer partner who uses Github all the time:

I think it should be working. The next test will be if anyone can install this from github using devtools. Let me know if it works for you… it works for me locally, but you know how that goes. The next post will show you how to use it 🙂

install.packages("devtools")install_github("NicoleRadziwill/easyMTS")

The post My First R Package (Part 3) appeared first on Quality and Innovation.


To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Using R: Animal model with simulated data


[This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week’s post just happened to use MCMCglmm as an example of an R package that can get confused by tibble-style data frames. To make that example, I simulated some pedigree and trait data. Just for fun, let’s look at the simulation code, and use MCMCglmm and AnimalINLA to get heritability estimates.

First, here is some AlphaSimR code that creates a small random mating population, and collects trait and pedigree:

library(AlphaSimR)

## Founder population
FOUNDERPOP <- runMacs(nInd = 100,
                      nChr = 20,
                      inbred = FALSE,
                      species = "GENERIC")

## Simulation parameters
SIMPARAM <- SimParam$new(FOUNDERPOP)
SIMPARAM$addTraitA(nQtlPerChr = 100,
                   mean = 100,
                   var = 10)
SIMPARAM$setGender("yes_sys")
SIMPARAM$setVarE(h2 = 0.3)

## Random mating for 9 more generations
generations <- vector(mode = "list", length = 10)
generations[[1]] <- newPop(FOUNDERPOP,
                           simParam = SIMPARAM)

for (gen in 2:10) {
    generations[[gen]] <- randCross(generations[[gen - 1]],
                                    nCrosses = 10,
                                    nProgeny = 10,
                                    simParam = SIMPARAM)
}

## Put them all together
combined <- Reduce(c, generations)

## Extract phenotypes
pheno <- data.frame(animal = combined@id,
                    pheno = combined@pheno[,1])

## Extract pedigree
ped <- data.frame(id = combined@id,
                  dam = combined@mother,
                  sire = combined@father)

ped$dam[ped$dam == 0] <- NA
ped$sire[ped$sire == 0] <- NA

## Write out the files
write.csv(pheno,
          file = "sim_pheno.csv",
          row.names = FALSE,
          quote = FALSE)

write.csv(ped,
          file = "sim_ped.csv",
          row.names = FALSE,
          quote = FALSE)

In turn, we:

  1. Set up a founder population with a AlphaSimR’s generic livestock-like population history, and 20 chromosomes.
  2. Choose simulation parameters: we have an organism with separate sexes, a quantitative trait with an additive polygenic architecture, and we want an environmental variance to give us a heritability of 0.3.
  3. We store away the founders as the first generation, then run a loop to give us nine additional generations of random mating.
  4. Combine the resulting generations into one population.
  5. Extract phenotypes and pedigree into their own data frames.
  6. Optionally, save the latter data frames to files (for the last post).

Now that we have some data, we can fit a quantitative genetic pedigree model (”animal model”) to estimate genetic parameters. We’re going to try two methods to fit it: Markov Chain Monte Carlo and (the unfortunately named) Integrated Nested Laplace Approximation. MCMC explores the posterior distribution by sampling; I’m not sure where I heard it described as ”exploring a mountain by random teleportation”. INLA makes approximations to the posterior that can be integrated numerically; I guess it’s more like building a sculpture of the mountain.

First, a Gaussian animal model in MCMCglmm:

library(MCMCglmm)

## Gamma priors for variances
prior_gamma <- list(R = list(V = 1, nu = 1),
                    G = list(G1 = list(V = 1, nu = 1)))

## Fit the model
model_mcmc <- MCMCglmm(scaled ~ 1,
                       random = ~ animal,
                       family = "gaussian",
                       prior = prior_gamma,
                       pedigree = ped,
                       data = pheno,
                       nitt = 100000,
                       burnin = 10000,
                       thin = 10)

## Calculate heritability from variance components
h2_mcmc_object <- model_mcmc$VCV[, "animal"] /
    (model_mcmc$VCV[, "animal"] + model_mcmc$VCV[, "units"])

## Summarise results from that posterior
h2_mcmc <- data.frame(mean = mean(h2_mcmc_object),
                      lower = quantile(h2_mcmc_object, 0.025),
                      upper = quantile(h2_mcmc_object, 0.975),
                      method = "MCMC",
                      stringsAsFactors = FALSE)

And here is a similar animal model in AnimalINLA:

library(AnimalINLA)

## Format pedigree to AnimalINLA's tastes
ped_inla <- ped
ped_inla$id <- as.numeric(ped_inla$id)
ped_inla$dam <- as.numeric(ped_inla$dam)
ped_inla$dam[is.na(ped_inla$dam)] <- 0
ped_inla$sire <- as.numeric(ped_inla$sire)
ped_inla$sire[is.na(ped_inla$sire)] <- 0

## Turn to relationship matrix
A_inv <- compute.Ainverse(ped_inla)

## Fit the model
model_inla <- animal.inla(response = scaled,
                          genetic = "animal",
                          Ainverse = A_inv,
                          type.data = "gaussian",
                          data = pheno,
                          verbose = TRUE)

## Pull out summaries from the model object
summary_inla <- summary(model_inla)

## Summarise results
h2_inla <- data.frame(mean = summary_inla$summary.hyperparam["Heritability", "mean"],
                      lower = summary_inla$summary.hyperparam["Heritability", "0.025quant"],
                      upper = summary_inla$summary.hyperparam["Heritability", "0.975quant"],
                      method = "INLA",
                      stringsAsFactors = FALSE)

If we wrap this all in a loop, we can see how the estimation methods do on replicate data (full script on GitHub). Here are estimates and intervals from ten replicates (black dots show the actual heritability in the first generation):
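(The replicate loop lives in the full script on GitHub; structurally it is just the simulation and the two model fits above wrapped in functions and repeated. As a sketch only, where simulate_data(), fit_mcmc() and fit_inla() are hypothetical wrappers around the three code blocks above rather than functions from the post:)

## Sketch only: the wrappers stand in for the AlphaSimR, MCMCglmm and AnimalINLA code above
n_replicates <- 10
results <- vector(mode = "list", length = n_replicates)

for (replicate_ix in 1:n_replicates) {
    sim <- simulate_data()                     # returns list(pheno = ..., ped = ...)
    h2_mcmc <- fit_mcmc(sim$pheno, sim$ped)    # one-row data frame, method = "MCMC"
    h2_inla <- fit_inla(sim$pheno, sim$ped)    # one-row data frame, method = "INLA"
    results[[replicate_ix]] <- rbind(h2_mcmc, h2_inla)
}

estimates <- do.call(rbind, results)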

As you can see, the MCMC and INLA estimates agree pretty well and mostly hit the mark. In the one replicate dataset where they falter, they falter together.


To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Did Russia Use Manafort’s Polling Data in 2016 Election?


[This article was first published on sweissblaug, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction:

On August 2, 2016, then Trump campaign manager Paul Manafort gave polling data to Konstantin Kilimnik, a Russian widely assumed to be a spy. Before then, Manafort ordered his protege, Rick Gates, to share polling data with Kilimnik. Gates periodically did so starting in April or May. The Mueller Report stated it did not know why Manafort was insistent on giving this information or whether the Russians used it to further Trump's cause (p. 130; see here for my summary of Mueller Report V1).

One theory says that Manafort wanted to show the good work he was doing to Kilimnik's boss, a Russian oligarch named Deripaska, to whom Manafort owed money. A more sinister hypothesis is that Manafort knew the information would be valuable in the hands of Russians trying to interfere with the election.

This post will analyze whether the Russians used the polling data, irrespective of Manafort's intent. I looked at Russian Facebook ads uncovered by the House Intelligence Committee and tried to identify any changes in messaging after August 2nd. I conclude with a guess about what the shared polling data contained.

Russian Facebook Data:

The House Intelligence Committee released thousands of Russian advertisements by the Internet Research Agency. There have been several analyses of these advertisements that discuss their effectiveness; one good one is by Spangher et al. However, I couldn't find any that showed topics of advertisements over time.

I focused the analysis on data from 2016, which includes the period when Manafort came into the position of campaign manager and the election itself in November. Overall, there are 1,858 Facebook ads captured in this dataset. Below is a time series plot of the number of advertisements per day for 2016.

There are periods of high activity in May / June and in October right before the election.

Change After August 2nd?

Each advertisement has metadata and text associated with it, including date, text, target population, etc. To see if there were any changes through time, and in particular around August 2nd, I tried some topic modeling and text clustering to see if there were any natural changes. I couldn't find any changes or trends using an unsupervised approach.

Instead, I built a predictive model with the response being a binary variable (before/after August 2nd) and the explanatory variables being text features from each ad (over 1,200 words). I then performed variable importance on these words to see which were most predictive. Below I plotted the number of adverts containing the important words divided by the number of advertisements for a particular day, to get a normalized percentage.
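(The actual code is linked at the end of the post; as a rough, hedged sketch of this modeling step, where the ads data frame and its ad_id, date and text columns are assumptions for illustration rather than the real dataset's column names:)

# Sketch only: build a per-ad word-count matrix, fit a classifier for
# before/after 2016-08-02, and rank words by variable importance.
library(dplyr)
library(tidyr)
library(tidytext)
library(randomForest)

word_counts <- ads %>%
    unnest_tokens(word, text) %>%
    count(ad_id, word) %>%
    pivot_wider(names_from = word, values_from = n, values_fill = list(n = 0))

model_dat <- ads %>%
    distinct(ad_id, date) %>%
    mutate(after_aug2 = factor(date > as.Date("2016-08-02"))) %>%
    inner_join(word_counts, by = "ad_id")

fit <- randomForest(x = select(model_dat, -ad_id, -date, -after_aug2),
                    y = model_dat$after_aug2,
                    importance = TRUE)

imp <- importance(fit)
head(imp[order(-imp[, "MeanDecreaseGini"]), ])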

The blue line is when Manafort made contact with Kilimnik initially, and the red line is the August 2nd meeting. There do appear to be large increases in the words associated with African American civil rights topics after 8/2. Specifically, these words were not in the advertisement texts themselves but were in the 'people who liked' description. That is, if you liked 'Martin Luther King' on your profile, then a particular ad would target you.

Another way to look at this information is to see the proportion of these words used before and after 8/2.

The above plot shows the number of times a word appeared before and after 8/2 and P(date > 8/2 | word). For instance, the word 1954, signifying the beginning of civil rights, occurred 4 times before and 376 times after 8/2, which means that just under 99% of its appearances happened after that date. This suggests there was a change in the IRA advertisements where they focused more on targeting people interested in African American civil rights issues.

Conclusions / Discussions

I'm guessing that the contents of the polling data would be something related to African Americans and how those with an interest in the civil rights movement are more susceptible to negative ads.

Do I think the evidence presented here is strong enough to believe the Russians used the polling data? Meh, not really. For a few reasons:

  • All words found here were used a few times before 8/2
  • Gates gave information on a continuous basis. If the Russians used this data, I assume they would have incorporated it as it arrived, and there would not be a discrete change at 8/2
  • I only did this for one date. Perhaps if I did this analysis for other arbitrary dates, I would find other words associated with those dates
I'm not saying that they didn't use the polling data, but I don't think the evidence here is strong enough to say that they did. At a minimum, I think the IRA and Russians adapted ads to target different populations at different points in time. This shows they are sophisticated and probably learn from previous results.

Code


To leave a comment for the author, please follow the link and comment on their blog: sweissblaug.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Assess Variable Importance In GRNN


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Technically speaking, there is no need to evaluate the variable importance and to perform variable selection in the training of a GRNN. It's also been a consensus that the neural network is a black-box model and it is not an easy task to assess the variable importance in a neural network. However, from a practical perspective, it is helpful to understand the individual contribution of each predictor to the overall goodness-of-fit of a GRNN. For instance, the variable importance can help us make up a beautiful business story to decorate our model. In addition, dropping variables with trivial contributions also helps us come up with a more parsimonious model as well as improve the computational efficiency.

In the YAGeR project (https://github.com/statcompute/yager), two functions have been added with the purpose to assess the variable importance in a GRNN. While the grnn.x_imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.x_imp.R) will provide the importance assessment of a single variable, the grnn.imp() function (https://github.com/statcompute/yager/blob/master/code/grnn.imp.R) can give us a full picture of the variable importance for all variables in the GRNN. The returned value “imp1” is calculated as the decrease in AUC with all values for the variable of interest equal to its mean and the “imp2” is calculated as the decrease in AUC with the variable of interest dropped completely. The variable with a higher value of the decrease in AUC is deemed more important.
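As a generic illustration of that "decrease in AUC" idea (a hedged sketch only, not the grnn.x_imp() / grnn.imp() implementation in yager):

# Sketch of the idea -- see the yager repository for the real implementation.
# pred_fun() stands in for whatever scores a fitted GRNN on a data frame, and
# auc() for any AUC helper (e.g. from the pROC or MLmetrics packages).
imp1_decrease_in_auc <- function(df, y, pred_fun, auc, var) {
  auc_full <- auc(y, pred_fun(df))
  df_mean <- df
  df_mean[[var]] <- mean(df[[var]])   # freeze the variable of interest at its mean
  auc_full - auc(y, pred_fun(df_mean))
}
# "imp2" (the decrease in AUC with the variable dropped completely) would
# additionally require refitting the GRNN without that column.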

Below is an example demonstrating how to assess the variable importance in a GRNN. As shown in the output, there are three variables making no contribution to the AUC statistic. It is also noted that dropping the three unimportant variables in the GRNN can actually increase AUC in the hold-out sample. What's more, marginal effects of variables remaining in the GRNN make more sense now, with all showing nice monotonic relationships, in particular "tot_open_tr".


[Plot: variable importance (imp)]

[Plot: marginal effects (margin)]


To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Automatic data types checking in predictive models


[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Automatic data types checking in predictive models

The problem: We have data, and we need to create models (xgboost, random forest, regression, etc). Each one of them has its constraints regarding data types. Many strange errors appear when we are creating models just because of data format.

The new version of funModeling 1.9.3 (Oct 2019) aimed to provide quick and clean assistance on this.

Cover photo by: @franjacquier_

tl;dr;code 💻

Based on some messy data, we want to run a random forest, so before getting some weird errors, we can check…

Example 1:

# install.packages("funModeling")library(funModeling)library(tidyverse)# Load datadata=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')# Call the function:integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")# Any errors?integ_mod_1
## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

Regardless of the "one unique value" warning, the other errors need to be solved in order to create a random forest.


Algorithms have their own data type restrictions and their own error messages, which makes execution a hard debugging task… data_integrity_model will alert us about such errors with a common error message.

Introduction

data_integrity_model is built on top of data_integrity function. We talked about it in the post: Fast data exploration for predictive modeling.

It checks:

  • NA
  • Data types (allow non-numeric? allow character?)
  • High cardinality
  • One unique value

Supported models 🤖

It takes the metadata from a table that is pre-loaded with funModeling.

head(metadata_models)
## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
##   <chr>        <lgl>         <dbl> <lgl>        <lgl>           <lgl>       
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE       
## 2 xgboost      TRUE            Inf FALSE        FALSE           TRUE        
## 3 num_no_na    FALSE           Inf FALSE        FALSE           TRUE        
## 4 no_na        FALSE           Inf TRUE         TRUE            TRUE        
## 5 kmeans       FALSE           Inf TRUE         TRUE            TRUE        
## 6 hclust       FALSE           Inf TRUE         TRUE            TRUE

The idea is that anyone can add the most popular models or some configuration that is not there. There are some redundancies, but the purpose is to focus on the model, not the needed metadata. This way we don't have to think "no NA in random forest"; we just write randomForest.

Some custom configurations:

  • no_na: no NA variables.
  • num_no_na: numeric with no NA (for example, useful when doing deep learning).

Embed in a data flow on production 🚚

Many people ask typical questions when interviewing candidates. I like these ones: "How do you deal with new data?" or "What considerations do you have when you do a deploy?"

Based on our first example:

integ_mod_1
## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

We can check:

integ_mod_1$data_ok
## [1] FALSE

data_ok is a flag that is useful for stopping a process, raising an error if anything goes wrong.
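For example, a scripted pipeline could guard the modeling step with a check like this (using the integ_mod_1 object from Example 1):

if (!integ_mod_1$data_ok) {
  stop("Data integrity check failed: fix the reported variables before fitting the model.")
}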

More examples 🎁

Example 2:

On the mtcars data frame, check if there is any variable with NA:

di2=data_integrity_model(data = mtcars, model_name = "no_na")

# Check:
di2
## ✔ Data model integrity ok!

Good to go?

di2$data_ok
## [1] TRUE

Example 3:

data_integrity_model(data = heart_disease, model_name = "pca")
## 
## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease

Example 4:

data_integrity_model(data = iris, model_name = "kmeans")
## 
## ✖ {Non-numeric detected} Species

Any suggestions?

If you come across any cases which aren’t covered here, you are welcome to contribute: funModeling’s github.

How about time series? I took them as numeric with no NA (model_name = num_no_na). You can add any new model by updating the table metadata_models.
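As a sketch of what such a row could look like (whether data_integrity_model() picks up a locally modified copy of the table depends on the package internals, so treat this as illustrative):

# Sketch: a new entry with the same columns as metadata_models
metadata_models <- dplyr::bind_rows(
  metadata_models,
  dplyr::tibble(name = "my_model",
                allow_NA = FALSE,
                max_unique = Inf,
                allow_factor = TRUE,
                allow_character = FALSE,
                only_numeric = FALSE)
)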

And that’s it.


In case you want to understand more about data types and quality, you can check the Data Science Live Book 📗

Have data fun! 🚀

📬 You can find me at: LinkedIn & Twitter.


To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Shiny 1.4.0


[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Shiny 1.4.0 has been released! This release mostly focuses on under-the-hood fixes, but there are a few user-facing changes as well.

If you’ve written a Shiny app before, you’ve probably encountered errors like this:

div("Hello", "world!", )
#> Error in tag("div", list(...)) : argument is missing, with no default

This is due to a trailing comma in div(). It’s very easy to accidentally add trailing commas when you’re writing and debugging a Shiny application.

In Shiny 1.4.0, you’ll no longer get this error – it will just work with trailing commas. This is true for div() and all other HTML tag functions, like span(), p(), and so on.

The new version of Shiny also lets you control the whitespace between HTML tags. Previously, if there were two adjacent tags, like the a and span in div(a("Visit this link", href="path/"), span(".")), whitespace would always be inserted between them, resulting in output that renders as "Visit this link .".

Here’s what the generated HTML looks like:

div(a("Visit this link", href = "path/"), span("."))
#> <div>
#>   <a href="path/">Visit this link</a>
#>   <span>.</span>
#> </div>

Now, you can use the .noWS parameter to remove the spacing between tags, so you can create output that renders as “Visit this link.”:

div(a("Visit this link", href = "path/", .noWS = "after"), span("."))
#> <div>
#>   <a href="path/">Visit this link</a><span>.</span>
#> </div>

The .noWS parameter can take one or more other values to control whitespace in other ways:

  • "before" suppresses whitespace before the opening tag.
  • "after" suppresses whitespace after the closing tag.
  • "after-begin" suppresses whitespace between the opening tag and its first child. (In the example above, the tags are children of the
    .
  • "after-begin" suppresses whitespace between the last child and the closing tag.

(These changes actually come from version 0.4.0 of the htmltools package, but most users will encounter these functions via Shiny, and the documentation in Shiny has been updated to reflect the changes.)

Breaking changes

We’ve updated from jQuery 1.12.4 to 3.4.1. There’s a small chance that JavaScript code will behave slightly differently with the new version of jQuery, so if you encounter a compatibility issue, you can use the old version of jQuery with options(shiny.jquery.version=1). Note that this option will go away some time in the future, so if you find that you need to use it, please make sure to update your JavaScript code to work with jQuery 3.

For the full set of changes in this version of Shiny, please see this page.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
