
A duck. Giving a look at DuckDB since MonetDBLite was removed from CRAN


[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers.]

You may know that MonetDBLite was removed from CRAN. DuckDB is coming up.


Breaking change

> install.packages('MonetDBLite')
Warning in install.packages :
  package 'MonetDBLite' is not available (for R version 3.6.1)

People who based their work on MonetDBLite may wonder what happened and what to do now, so as not to play a risky game with database and tool choices for future work ("it's really fast, but we may waste some time if we have to replace it with another solution").

It’s the game with open source. Remember big changes in dplyr 0.7. Sometimes we want better tools, and most of the time they become better. It’s really great. And sometimes we don’t have time and energy to adapt our work to tools that became better in a too iterative way. Or in a too subjective way. We want it to work, not break. Keeping code as simple as possible (and avoid nebulous dependencies, so, tidy?) is one of the key point. Stocking data in a database is another one.

All we can say is that "we're walking on works in progress". As with eggshells, more works in progress here probably mean more breaking changes.

Works in progress for packages, also for (embedded) databases!

From Monet to Duck

MonetDBLite's philosophy is to be like a "very, very fast SQLite". But it's time for a change (or so it seems). So we can thank the MonetDBLite developers: it was a nice adventure to play and work with MonetDB's speed! As a question: is there another person, some volunteers, or any possibility to maintain MonetDBLite (still a nice tool)? There is not much information for the moment about what happened, and that's why I write this post.

Here I read that they are now working on a new solution under the MIT License, named DuckDB; see here for more details.

As I’m just a R user and haven’t collaborate to the project, I would just say for short: DuckDB takes good parts from SQLite and PostGreSQL (Parser), see here for complete list, it looks promising. As in MonetDB, philosophy is focused on columns and speed. And dates for instance are handled correctly, not having to convert them in “ISO-8601 – like” character strings.

It can be called from C/C++, Python and R.

Here is a post about python binding.

I also put a link at the bottom of this page which gives some explanation about the name of this new tool and the DuckDB developers' point of view about data manipulation and storage1.

Beginning with DuckDB in R

Create / connect to the db

# remotes::install_github("cwida/duckdb/tools/rpkg", build = FALSE)
library(duckdb)
library(dplyr)
library(DBI)

# Create or connect to the db
con_duck <- dbConnect(duckdb::duckdb(), "~/Documents/data/duckdb/my_first.duckdb")
# con <- dbConnect(duckdb::duckdb(), ":memory:")
con_duck

iris

dbWriteTable(con_duck, "iris", iris)
tbl(con_duck, 'iris')

Put some rows and columns in db

> dim(nycflights13::flights)
[1] 336776     19
> object.size(nycflights13::flights) %>% format(units = "Mb")
[1] "38.8 Mb"

We sample it to get more rows, then duplicate the columns, twice.

# Sample to get a bigger data.frame
df_test <- nycflights13::flights %>%
  sample_n(2e6, replace = TRUE) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols')})) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols_bis')}))
> dim(df_test)
[1] 2000000      76
> object.size(df_test) %>% format(units = "Mb")
[1] "916.4 Mb"

Write in db

tictoc::tic()
dbWriteTable(con_duck, "df_test", df_test)
tictoc::toc()

It takes some time compared to MonetDBLite (no formal benchmark here; I just ran this several times and it was consistent).

# DuckDB      : 23.251 sec elapsed
# SQLite      : 20.23 sec elapsed
# MonetDBLite : 8.4 sec elapsed

All three are pretty fast. Most importantly, if queries are fast, and they are, most of the time we're all right.

I want to say here that for now it's a work in progress, and we have to wait for more communication from the DuckDB developers. I just write this to share the news.

Glimpse

> tbl(con_duck, 'df_test') %>% glimpse()
Observations: ??
Variables: 76
Database: duckdb_connection
$ year                              <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month                             <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day                               <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
$ dep_time                          <int> 1608, 2218, 1920, NA, 1506, 1917, 1034, 655, 1039, 1752, 2018, 1732, 82…
$ sched_dep_time                    <int> 1612, 2127, 1920, 2159, 1500, 1900, 1030, 700, 1045, 1720, 1629, 1728, …
$ dep_delay                         <dbl> -4, 51, 0, NA, 6, 17, 4, -5, -6, 32, 229, 4, -9, -3, -4, -3, 9, 38, 34, …
$ arr_time                          <int> 1904, 2321, 2102, NA, 1806, 2142, 1337, 938, 1307, 2103, 2314, 1934, 11…
$ sched_arr_time                    <int> 1920, 2237, 2116, 2326, 1806, 2131, 1345, 958, 1313, 2025, 1927, 2011, …
$ arr_delay                         <dbl> -16, 44, -14, NA, 0, 11, -8, -20, -6, 38, 227, -37, -16, -12, -10, -39, …
$ carrier                           <chr> "UA", "EV", "9E", "UA", "DL", "DL", "VX", "UA", "UA", "AA", "B6", "UA", …
$ flight                            <int> 1242, 4372, 3525, 424, 2181, 2454, 187, 1627, 1409, 695, 1161, 457, 717…
$ tailnum                           <chr> "N24211", "N13994", "N910XJ", NA, "N329NB", "N3749D", "N530VA", "N37281…
$ origin                            <chr> "EWR", "EWR", "JFK", "EWR", "LGA", "JFK", "EWR", "EWR", "EWR", "JFK", "…
$ dest                              <chr> "FLL", "DCA", "ORD", "BOS", "MCO", "DEN", "SFO", "PBI", "LAS", "AUS", "…
$ air_time                          <dbl> 155, 42, 116, NA, 131, 217, 346, 134, 301, 230, 153, 276, 217, 83, 36, …
$ distance                          <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour                              <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute                            <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour                         <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
…
$ minute_bind_cols                  <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols               <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
$ year_bind_cols_bis                <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month_bind_cols_bis               <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day_bind_cols_bis                 <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
…
$ distance_bind_cols_bind_cols_bis  <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour_bind_cols_bind_cols_bis      <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute_bind_cols_bind_cols_bis    <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols_bind_cols_bis <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…

Count

> tbl(con_duck, 'df_test') %>% count()
# Source:   lazy query [?? x 1]
# Database: duckdb_connection
        n
    <dbl>
1 2000000

Dates

Compared to SQLite, it handles dates/times correctly. No need to convert them to character.

tbl(con_duck, 'df_test') %>% select(time_hour)
# Source:   lazy query [?? x 1]
# Database: duckdb_connection
   time_hour
   <dttm>
 1 2013-11-29 21:00:00.000000
 2 2013-10-08 01:00:00.000000
 3 2013-03-02 00:00:00.000000
 4 2013-05-03 01:00:00.000000
 5 2013-12-18 20:00:00.000000
 6 2013-09-18 23:00:00.000000
 7 2013-07-20 14:00:00.000000
 8 2013-03-07 12:00:00.000000
 9 2013-09-15 14:00:00.000000
10 2013-04-25 21:00:00.000000
# … with more rows
tbl(con_sqlite, 'df_test') %>% select(time_hour)
# Source:   lazy query [?? x 1]
# Database: sqlite 3.22.0 [/Users/guillaumepressiat/Documents/data/sqlite.sqlite]
    time_hour
        <dbl>
 1 1385758800
 2 1381194000
 3 1362182400
 4 1367542800
 5 1387396800
 6 1379545200
 7 1374328800
 8 1362657600
 9 1379253600
10 1366923600
# … with more rows

Some querying

Running some queries

dplyr

It already works nicely with dplyr.

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   collect()
# A tibble: 3 x 2
  Species    `min(Sepal.Width)`
  <chr>                   <dbl>
1 virginica                 2.2
2 setosa                    2.3
3 versicolor                2

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   show_query()
<SQL>
SELECT "Species", MIN("Sepal.Width") AS "min(Sepal.Width)"
FROM "iris"
GROUP BY "Species"

sql

Run query as a string

dbGetQuery(con_duck, 'SELECT "Species", MIN("Sepal.Width") FROM iris GROUP BY "Species"')
     Species min(Sepal.Width)
1  virginica              2.2
2     setosa              2.3
3 versicolor              2.0

Like for all data sources with DBI, if the query is more complex, we can write it comfortably in an external file and launch it like this for example:

dbGetQuery(con_duck, readr::read_file('~/Documents/scripts/script.sql'))

“Little” benchmarks

Collecting this big data frame

This doesn't make much sense but it gives some idea of read speed. We collect df_test into memory, from DuckDB, MonetDB and SQLite.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test')),
+   times = 5)
Unit: seconds
 expr     min       lq     mean   median       uq      max neval
    a 3.58703 3.632507 3.763129 3.676669 3.725148 4.194295     5
> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test')),
+   times = 5)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    b 973.1111 990.3699 1003.417 1010.651 1013.858 1029.097     5
> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test')),
+   times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    d 52.08785 52.08785 52.08785 52.08785 52.08785 52.08785     1

Really good !

Simple count

Count then collect aggregate rows.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    a 50.18014 53.24197 54.87532 54.68203 57.09206 58.94873    20
> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr     min       lq     mean   median       uq     max neval
    b 151.729 157.9267 160.5727 160.8815 163.8343 167.477    20
> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: seconds
 expr      min       lq     mean  median       uq      max neval
    d 2.167202 2.196288 2.205281 2.20486 2.216594 2.253606    20

Faster !

It remains to test joins, filters, sorts, etc.
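As a teaser while waiting for proper benchmarks, here is a minimal sketch (mine, not from the original post) of a lazy join done entirely in DuckDB through dplyr/dbplyr. The airlines lookup table and the chosen columns are illustrative assumptions, and this supposes the DuckDB dbplyr backend translates joins as usual.

# Copy a small lookup table into DuckDB and join it to df_test, lazily, in the database
dbWriteTable(con_duck, "airlines", nycflights13::airlines)

tbl(con_duck, 'df_test') %>%
  select(carrier, dep_delay) %>%
  inner_join(tbl(con_duck, 'airlines'), by = "carrier") %>%
  group_by(name) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()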

Information

I find that there has not been much communication about this work and its R binding so far, so I made this post to highlight it.

MonetDBLite's speed is amazing; will you give DuckDB a try?

In any case thanks to DuckDB developers and welcome to the new duck.

See here https://github.com/cwida/duckdb.

DuckDB developers point’s of view on data management and explanations about “duck” can be found here.

Here we can read more information on ALTREP, MonetDBLite and DuckDB, and the reasons why MonetDBLite was finally abandoned ("RIP MonetDBLite").





Building a Shiny App for Cycling in Ottawa


[This article was first published on R on Will Hipson, and kindly contributed to R-bloggers.]

This is a different kind of post, but one that I think is kind of fun. I currently live in Ottawa, which for those who don't know, is the capital city of Canada. For a capital city, it's fairly small, but it's increasingly urbanizing (we just got light rail transit). Segregated bicycle lanes and paths are becoming more common too, and many of these paths have trackers on them that count how many bicycles cross a particular street or path each day. What's great is that this data is shared publicly by the city.

I started looking into this data, cleaned it up, and eventually put it together in an interactive web app:

Click here to go to the app.

library(tidyverse)
library(leaflet)
library(leafpop)

We’ll start by reading in the data from the GitHub repositiory. There’s a lot of missing data, so much that R gets confused about the data structure of some of the columns. We need to add another argument to read_csv telling it the type of data in each column. The col_types argument takes a letter for each column, with ? meaning that we let R decide what the data is and n meaning ‘numeric’.

bikes <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/bikes_app.csv",
                  col_types = c("?nnnnnnnnnnnnnn"))

bikes
## # A tibble: 3,560 x 15
##    date                alexandra_bridge eastern_canal ottawa_river
##    <dttm>                         <dbl>         <dbl>        <dbl>
##  1 2010-01-01 00:00:00                0             0            0
##  2 2010-01-02 00:00:00                0             0            0
##  3 2010-01-03 00:00:00                0             0            0
##  4 2010-01-04 00:00:00                0             0            0
##  5 2010-01-05 00:00:00                0             0            0
##  6 2010-01-06 00:00:00                0             0            0
##  7 2010-01-07 00:00:00                0             0            0
##  8 2010-01-08 00:00:00                0             0            0
##  9 2010-01-09 00:00:00                0             0            0
## 10 2010-01-10 00:00:00                0             0            0
## # ... with 3,550 more rows, and 11 more variables: western_canal <dbl>,
## #   laurier_bay <dbl>, laurier_lyon <dbl>, laurier_metcalfe <dbl>,
## #   somerset_bridge <dbl>, otrain_young <dbl>, otrain_gladstone <dbl>,
## #   otrain_bayview <dbl>, portage_bridge <dbl>, adawe_crossing_a <dbl>,
## #   adawe_crossing_b <dbl>

Each row is a day and the columns are bicycle counters spread across the city. Let’s start by creating the graphs we want in the Shiny app. It’s easier to do this outside of the Shiny framework first. We’ll start by plotting total bicycle counts over time.

bikes_total <- bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", -date) %>%
  group_by(date) %>%
  mutate(daily_total = sum(count, na.rm = TRUE))

bikes_total
## # A tibble: 49,840 x 4
## # Groups:   date [3,560]
##    date                counter          count daily_total
##    <dttm>              <chr>            <dbl>       <dbl>
##  1 2010-01-01 00:00:00 alexandra_bridge     0           0
##  2 2010-01-01 00:00:00 eastern_canal        0           0
##  3 2010-01-01 00:00:00 ottawa_river         0           0
##  4 2010-01-01 00:00:00 western_canal       NA           0
##  5 2010-01-01 00:00:00 laurier_bay         NA           0
##  6 2010-01-01 00:00:00 laurier_lyon        NA           0
##  7 2010-01-01 00:00:00 laurier_metcalfe    NA           0
##  8 2010-01-01 00:00:00 somerset_bridge     NA           0
##  9 2010-01-01 00:00:00 otrain_young        NA           0
## 10 2010-01-01 00:00:00 otrain_gladstone    NA           0
## # ... with 49,830 more rows

And now to plot it over time:

bikes_total %>%
  ggplot(aes(x = date, y = daily_total)) +
  geom_line(size = .5, alpha = .80, color = "#36648B") +
  scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
  labs(x = NULL,
       y = "Count",
       title = "Total Bicycle Crossings in Ottawa",
       subtitle = "Jan 2010 - Sep 2019") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(hjust = .5),
        axis.text.x = element_text(size = 16))

There’s clear seasonality, with bicycle crossings peaking in the summer months and troughing in the winter. There also appears to be a trend, increasing from 2010 to 2017, then leveling out. Does this mean that bicycling is leveling off in Ottawa? We may want to look at specific counters to get a better sense of this.

bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", -date) %>%
  ggplot(aes(x = date, y = count)) +
  geom_line(size = .5, alpha = .80, color = "#36648B") +
  labs(x = NULL,
       y = "Count",
       title = "Bicycle Crossings in Ottawa by Location",
       subtitle = "Jan 2010 - Sep 2019") +
  facet_wrap(~counter) +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(hjust = .5),
        axis.text.x = element_blank())
## Warning: Removed 2191 rows containing missing values (geom_path).

This graph tells us that we have to be a bit careful about interpreting the total count because some counters are introduced later or go out of commission. The drop in total counts for 2018 could be due to the Western Canal counter going offline that year. What about average counts over time?

bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", -date) %>%
  group_by(date) %>%
  mutate(daily_average = mean(count, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = daily_average)) +
  geom_line(size = .5, alpha = .80, color = "#36648B") +
  scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
  labs(x = NULL,
       y = "Count",
       title = "Average Bicycle Crossings in Ottawa",
       subtitle = "Jan 2010 - Sep 2019") +
  theme_minimal(base_size = 16) +
  theme(plot.title = element_text(hjust = .5),
        axis.text.x = element_text(size = 16))

There may be an upward trend, but it’s less clear compared to the total count. We again have to be careful because earlier years have fewer counters online so the average is based on less data. However, knowing both the total and the average counts gives us a pretty clear picture of how cycling is changing over time in Ottawa.

Maps with Leaflet

Now we’ll add the functionality of an interactive map – one that shows where the counters are located geographically and allows the user to select specific counters. Earlier we loaded up the leaflet and leafpop packages. These will help us construct our map of Ottawa.

We’ll also need the latitude and longitude coordinates of the counters. Using information from the Open Data Ottawa, I found the location of each counter and obtained its latitude and longitude using Google Maps. I also added a bit of descriptive information for each counter. We can put all of this in a dataframe as follows:

coords <- data.frame(
  counter = names(bikes[, 2:15]),
  name = c("Alexandra Bridge", "Eastern Canal Pathway", "Ottawa River Pathway", "Western Canal Pathway",
           "Laurier at Bay", "Laurier at Lyon", "Laurier at Metcalfe", "Somerset Bridge", "OTrain at Young",
           "OTrain at Gladstone", "OTrain at Bayview", "Portage Bridge", "Adawe Crossing A", "Adawe Crossing B"),
  lat = c(45.430366, 45.420924, 45.411959, 45.406280,
          45.415893, 45.417036, 45.419790, 45.420512,
          45.402859, 45.404599, 45.408636, 45.421980,
          45.426282, 45.426575),
  long = c(-75.704761, -75.685060, -75.723424, -75.681814,
           -75.705328, -75.702613, -75.697623, -75.684625,
           -75.712760, -75.714812, -75.723644, -75.713324,
           -75.670234, -75.669765),
  desc = c("Ottawa approach to the NCC Alexandra Bridge Bikeway. This counter was not operational for most of 2010 due to bridge construction. This is one of the more consistent counters, until the internal battery failed in August 2019.",
           "NCC Eastern Canal Pathway approximately 100m north of the Corktown Bridge.",
           "NCC Ottawa River Pathway approximately 100m east of the Prince of Wales Bridge. Canada Day in 2011 boasts the highest single day count of any counter.",
           "NCC Western Canal Pathway approximately 200m north of 'The Ritz'. Out of operation for much of 2018. MEC Bikefest on May 17, 2015 accounts for the large spike that day.",
           "Laurier Segregated Bike lane just west of Bay. Minimal data available due to inactivity after 2014.",
           "Laurier Segregated Bike lane just east of Lyon. No longer in operation since 2016.",
           "Laurier Segregated Bike lane just west of Metcalfe. Construction in late 2012 accounts for unusual dip in counts.",
           "Somerset bridge over O-Train west-bound direction only. Inexplicably large spike in 2012 followed by a typical seasonal pattern. Inactive since late 2018.",
           "O-Train Pathway just north of Young Street. Minimal data available due to inactivity after 2016. See O-Train at Gladstone counter for a better estimate.",
           "O-Train Pathway just north of Gladstone Avenue. In operation since mid-2013. Shows unusual spike in November of 2017.",
           "O-Train Pathway just north of Bayview Station. In operation since mid-2013. Trending upward.",
           "Portage Bridge connecting Gatineau to Ottawa. Installed in late 2013, this counter registered relatively high traffic but seems to have experienced outages during Winter months. Inactive since early 2016.",
           "Adàwe Crossing Bridge bike lane. This counter is one of a pair on this pedestrian bridge. Installed in 2016, it seems to have experienced an outage during the Winter of its inaugural year.",
           "The second of two counters on the Adàwe Crossing Bridge. This counter may pick up more pedestrian than bike traffic, as suggested by the trend over time."))

Now we just pipe the coordinate data into leaflet.

leaflet(data = coords) %>%
  addTiles() %>%
  addMarkers(~long, ~lat)

(Interactive Leaflet map widget: an OpenStreetMap view of Ottawa with a marker at each of the 14 counter locations.)

Leaflet automatically generates a map of size to fit all the markers. There are a few modifications to make though. One is to have it so that when the user hovers the mouse over a marker a label pops up with the name of that counter. Another is to make the map more aesthetically pleasing. Finally, we may want to add some bounds so that the user can’t scroll too far away from the markers.

leaflet(data = coords) %>%
  addTiles() %>%
  addMarkers(~long, ~lat, label = ~name) %>%
  setMaxBounds(-75.65, 45.38, -75.75, 45.46) %>%
  addProviderTiles(providers$CartoDB.Positron)

(Interactive Leaflet map widget: the same 14 markers on CartoDB.Positron tiles, with hover labels showing each counter's name and panning restricted to the Ottawa area.)

Great. So we now have the two components of the app: the time plots and the map. Time to bring in Shiny and put it all together. Now, if you have never used Shiny before, this probably isn’t the easiest example to start with. I’d highly recommend this set of tutorial videos by Garrett Grolemund to get started.

Creating the Shiny App

There are two parts to every Shiny app: the UI or User Interface and the Server. The UI is the look and feel of the app; it's where we tell Shiny what kinds of inputs and outputs we want, how we want to organize the panels, and so on. In contrast, the Server is the engine of the app. We'll start by constructing the UI. It's important to note that it's easier to build a Shiny app in a new R script. So we're basically going to start over in a new script, which means we'll reload the packages and the data as if we were starting fresh:

Create a new R script

We’ll start with the packages and data. We haven’t done anything with the UI or Server yet. We usually want to keep the data outside the UI. We’ll also transform our data as we did earlier to generate the total and average time plots.

library(tidyverse)
library(leaflet)
library(leafpop)
library(shiny)
library(shinythemes)
library(shinyWidgets)

bikes <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/bikes_app.csv",
                  col_types = c("?nnnnnnnnnnnnnn"))

# For ease, I've put the coordinates in a separate file, but you could just as easily rerun the 'coords' object above
coords <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/coords.csv")

bikes_plot <- bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", -date) %>%
  left_join(coords, by = "counter")

bikes_total <- bikes_plot %>%
  group_by(date) %>%
  summarize(count = sum(count, na.rm = TRUE))

bikes_mean <- bikes_plot %>%
  group_by(date) %>%
  summarize(count = mean(count, na.rm = TRUE))

Now, still in the same R script, we can build the UI. It’s going to look a bit strange with parentheses all over the place. It’s just customary Shiny scripting to use hanging parentheses.

Specifying the UI

ui <- fluidPage(theme = shinytheme("flatly"),
  sidebarLayout(  # Layout
    sidebarPanel(id = "Sidebar",  # Side panel
                 h2("Ottawa Bicycle Counters", align = "center", tags$style("#Sidebar{font-family: Verdana;}")),
                 fluidRow(  # Row 1 of side panel
                   htmlOutput("caption"),  # Caption output, provides descriptive text
                   tags$style("#caption{font-size: 16px; height: 200px; font-family: Verdana;}")
                 ),
                 fluidRow(  # Row 2 of side panel
                   htmlOutput("stats"),  # Statistics output, provides descriptive statistics
                   tags$style("#stats{font-size: 16px; height: 125px; font-family: Verdana;}")
                 ),
                 fluidRow(  # Row 3 of side panel
                   switchInput("average",  # User input, allows the user to turn a switch to display the average
                               "Display Average",
                               value = FALSE)
                 ),
                 fluidRow(  # Row 4 of side panel
                   htmlOutput("caption2"),  # More caption output
                   tags$style("#caption2{font-size: 12px; height: 80px; font-family: Verdana;}")
                 ),
                 fluidRow(  # Row 5 of side panel
                   downloadButton("download", "Download Data")  # A button so that users can download the data
                 )
    ),
    mainPanel(id = "Main",  # Main panel (this is where the plots and map go)
              fluidRow(  # Row 1 of main panel
                leafletOutput("map", height = 400)  # Here's the output for the map
              ),
              fluidRow(  # Row 2 of main panel
                plotOutput("timeplot", height = 300)  # Here's the output for the time plots
              )
    )
  )
)

There’s the code for the UI. Starting from the top, we use the FluidPage function and here I’m using the theme flatly. Then I say that I want to use a sidebarLayout. From here, I split the code into a sidebarPanel and a mainPanel. I further split things into fluidRows which just helps to organize the layout. All of the #s are notes, of course, and will not actually be run.

The big thing to notice is that there are inputs and outputs. The only input is a switchInput which lets the user choose whether to display totals or averages. Everything else is an output. Each of these gets a name, for example, I’m calling the leafletOutput map. These names are important, as they will correspond with what we provide in the server part.
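To make that pairing concrete, here is a stripped-down, hypothetical sketch (not part of the app itself) of how a name declared in the UI must match the name assigned in the server:

library(shiny)

ui <- fluidPage(
  plotOutput("timeplot")            # "timeplot" is declared here ...
)

server <- function(input, output) {
  output$timeplot <- renderPlot({   # ... and filled in under the same name here
    plot(1:10)
  })
}

shinyApp(ui, server)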

Specifying the Server

server <- function(input, output) {

  output$map <- renderLeaflet({  # Map output
    leaflet(data = coords) %>%
      addTiles() %>%
      addMarkers(~long, ~lat, label = ~name) %>%
      setMaxBounds(-75.65, 45.38, -75.75, 45.46) %>%
      addProviderTiles(providers$CartoDB.Positron)
  })

  output$caption2 <- renderUI({  # Lower caption output
    str1 <- paste("Created by ", a("Will Hipson.", href = "https://willhipson.netlify.com/"))
    str2 <- paste("Data courtesy of ", a("Open Data Ottawa.", href = "https://open.ottawa.ca/datasets/bicycle-trip-counters"))
    str3 <- "2010-01-01 - 2019-09-30"
    str4 <- "Updated on 2019-10-24"
    HTML(paste(str1, str2, str3, str4, sep = ' '))
  })

  observeEvent(input$map_marker_click, { # If the user clicks a marker, this block is run.
    output$timeplot <- renderPlot({
      if(input$average == TRUE) { # if average is selected we get the average overlayed
        ggplot() +
          geom_line(data = bikes_plot[bikes_plot$lat == input$map_marker_click$lat, ],
                    aes(x = date, y = count), size = .5, alpha = .70, color = "#36648B") +
          geom_line(data = bikes_mean, aes(x = date, y = count), alpha = .50, color = "#9F79EE") +
          scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
          scale_y_continuous(limits = c(0, 6000)) +
          labs(x = NULL,
               y = "Count",
               title = paste(bikes_plot[bikes_plot$lat == input$map_marker_click$lat,]$name)) +
          theme_minimal(base_size = 16) +
          theme(plot.title = element_text(hjust = .5),
                axis.text.x = element_text(size = 16),
                text = element_text(family = "Verdana"))
      } else { # if average is not selected, then it's just the total
        ggplot() +
          geom_line(data = bikes_plot[bikes_plot$lat == input$map_marker_click$lat, ],
                    aes(x = date, y = count), size = .5, alpha = .70, color = "#36648B") +
          scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
          scale_y_continuous(limits = c(0, 6000)) +
          labs(x = NULL,
               y = "Count",
               title = paste(bikes_plot[bikes_plot$lat == input$map_marker_click$lat,]$name)) +
          theme_minimal(base_size = 16) +
          theme(plot.title = element_text(hjust = .5),
                axis.text.x = element_text(size = 16),
                text = element_text(family = "Verdana"))
      }
    })

    output$caption <- renderUI({ # counter specific description
      str1 <- coords[coords$lat == input$map_marker_click$lat, ]$desc
      HTML(str1)
    })

    output$stats <- renderUI({ # counter specific statistics
      str1 <- "Statistics"
      str2 <- paste("Total count: ", format(round(sum(bikes_plot[bikes_plot$lat == input$map_marker_click$lat,]$count, na.rm = TRUE)), big.mark = ","))
      str3 <- paste("Average count: ", format(round(mean(bikes_plot[bikes_plot$lat == input$map_marker_click$lat,]$count, na.rm = TRUE), 1), big.mark = ","))
      str4 <- paste("Busiest day: ", bikes_plot[which.max(bikes_plot[bikes_plot$lat == input$map_marker_click$lat,]$count),]$date)
      HTML(paste(str1, str2, str3, str4, sep = ' '))
    })
  })

  observeEvent(input$map_click, ignoreNULL = FALSE, {  # If the user clicks on the map it goes back to the cumulative data
    output$timeplot <- renderPlot({
      if(input$average == TRUE) {  # if the average is selected, it displays the average
        ggplot(data = bikes_mean, aes(x = date, y = count)) +
          geom_line(size = .5, alpha = .70, color = "#36648B") +
          scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
          labs(x = NULL,
               y = "Count") +
          theme_minimal(base_size = 16) +
          theme(plot.title = element_text(hjust = .5),
                axis.text.x = element_text(size = 16),
                text = element_text(family = "Verdana"))
      } else { # if average is not selected it is the total
        ggplot(data = bikes_total, aes(x = date, y = count)) +
          geom_line(size = .5, alpha = .70, color = "#36648B") +
          scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") +
          labs(x = NULL,
               y = "Count") +
          theme_minimal(base_size = 16) +
          theme(plot.title = element_text(hjust = .5),
                axis.text.x = element_text(size = 16),
                text = element_text(family = "Verdana"))
      }
    })

    output$caption <- renderUI({  # the default caption
      str1 <- "Presenting data from bicycle counters across Ottawa. There are 14 counters spread across the city. The graph below displays how daily counts change over time. Click on a map marker to select a specific counter."
      HTML(str1)
    })

    output$stats <- renderUI({  # Statistics output
      str1 <- "Statistics"
      str2 <- paste("Total count: ", format(round(sum(bikes_total$count, na.rm = TRUE)), big.mark = ","))
      str3 <- paste("Average count: ", format(round(mean(bikes_total$count, na.rm = TRUE), 1), big.mark = ","))
      str4 <- paste("Busiest day: ", bikes_total[which.max(bikes_total$count),]$date)
      HTML(paste(str1, str2, str3, str4, sep = ' '))
    })
  })

  output$download <- downloadHandler( # download button. Will turn the 'bikes' object into a csv file.
    filename = function() {
      paste("ottawa_bikes", ".csv", sep = "")
    },

    content = function(file) {
      write.csv(bikes, file)
    }
  )
}

The code for the server is much busier and it can be overwhelming. Essentially we’re just saying what we want to do with the inputs and outputs. We generate a little code chunk for each output. Look at the first one for map. This is where we generate the map. We say we want to renderLeaflet and then we just copy the code that we made earlier into this block.

Where things get a bit more complicated is when we want our output to change based on user input, for example when the user selects the switch that converts the data to averages. I used if and else statements to modulate the output based on whether 'average' is selected. What happens is that when the user clicks on the switch, the value of input$average changes to TRUE. Using if and else, I just say what I want to happen when 'average' is TRUE and what happens if it's FALSE.

Finally, we want the user to be able to click on specific markers and have the output change to that specific marker. We use the observeEvent function and specify the input, 'map_marker_click'. We also want the user to be able to click off the marker to go back to the default output. Again, we use observeEvent, but now with 'map_click'.
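If you haven't wired leaflet into Shiny before, here is a minimal toy example (an illustration of the mechanism, not the app above): a leaflet output named "map" exposes input$map_marker_click, a list carrying the clicked marker's lat and lng, and input$map_click for clicks elsewhere on the map.

library(shiny)
library(leaflet)

pts <- data.frame(name = c("A", "B"), lat = c(45.42, 45.43), lng = c(-75.70, -75.69))

ui <- fluidPage(leafletOutput("map"), verbatimTextOutput("clicked"))

server <- function(input, output) {
  output$map <- renderLeaflet({
    leaflet(pts) %>% addTiles() %>% addMarkers(~lng, ~lat, label = ~name)
  })
  # Fires each time a marker is clicked; the event is a list with lat/lng
  observeEvent(input$map_marker_click, {
    output$clicked <- renderPrint(input$map_marker_click)
  })
}

shinyApp(ui, server)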

Once we have all the other outputs in place for the downloads and the captions, we put it all together using the shinyApp function.

shinyApp(ui, server)

And there it is, a user-friendly app for exploring bicycling data in Ottawa. Future avenues include building in some time-series forecasting. It would be cool to show the user how the trend is expected to change over time.

One last shout out to Open Data Ottawa for sharing this data!



GRNN vs. GAM


[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

In practice, GRNN is very similar to GAM (Generalized Additive Models) in the sense that they both share the flexibility of approximating non-linear functions. In the example below, both GRNN and GAM were applied to the Kyphosis data that has been widely used in GAM examples, and they revealed very similar patterns of functional relationships between model predictors and the response (red for GRNN and blue for GAM). However, while we have to determine the degrees of freedom for each predictor in order to control the smoothness of a GAM model, there is only one tuning parameter governing the overall fitting of a GRNN model.
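The full GRNN and GAM code lives in a gist embedded in the original post and is not reproduced here. As a rough illustration of the GAM side only (the GRNN side depends on the author's own implementation), a classic smoothing-spline fit to the Kyphosis data shipped with rpart could look like this:

library(gam)

data(kyphosis, package = "rpart")

# Smoothing splines with 3 degrees of freedom per predictor
fit <- gam(Kyphosis ~ s(Age, 3) + s(Number, 3) + s(Start, 3),
           family = binomial, data = kyphosis)

# Partial plots of each smooth term with standard-error bands
plot(fit, se = TRUE)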


(Figure: fitted functional relationships for the Kyphosis data, GRNN in red and GAM in blue.)



littler 0.3.9: More nice new features


[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers.]


The tenth release of littler as a CRAN package is now available, following in the thirteen-ish year history of the package started by Jeff in 2006 and joined by me a few weeks later.

littler is the first command-line interface for R, as it predates Rscript. It allows for piping as well as for shebang scripting via #!, uses command-line arguments more consistently, and still starts faster. It also always loaded the methods package, which Rscript only started to do more recently.
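As a quick illustration (mine, not from the release notes), a littler script is just R code behind a #! line; littler exposes the trailing command-line arguments as the argv vector. The file name and task here are made up.

#!/usr/bin/env r
# hypothetical script 'csvsummary.r': summarise a CSV passed as the first argument
# run as: ./csvsummary.r somefile.csv
dat <- read.csv(argv[1])
print(summary(dat))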

littler lives on Linux and Unix, has its difficulties on macOS due to yet-another-braindeadedness there (who ever thought case-insensitive filesystems as a default were a good idea?) and simply does not exist on Windows (yet – the build system could be extended – see RInside for an existence proof, and volunteers are welcome!). See the FAQ vignette on how to add it to your PATH.

A few examples are highlighted at the Github repo, as well as in the examples vignette.

This release adds several new helper scripts / examples such as a Solaris-checker for rhub, a Sweave runner, and bibtex-record printer for packages. It also extends several existing scripts: render.r can now compact pdf files, build.r does this for package builds, tt.r covers parallel tinytest use, rcc.r reports the exit code from rcmdcheck, update.r corrects which package library directories it looks at, kitten.r can add puppies for tinytest, and thanks to Stefan the dratInsert.r (and render.r) script use call. correctly in stop().

The NEWS file entry is below.

Changes in littler version 0.3.9 (2019-10-27)

  • Changes in examples

    • The use of call. in stop() was corrected (Stefan Widgren in #72).

    • New script cos.r to check (at rhub) on Solaris.

    • New script compactpdf.r to compact pdf files.

    • The build.r script now compacts vignettes and resaves data.

    • The tt.r script now supports parallel tests and side effects.

    • The rcc.r script can now report error codes.

    • The --libloc option to update.r was updated.

    • The render.r script can optionally compact pdfs.

    • New script sweave.r to render (and compact) pdfs.

    • New script pkg2bibtex.r to show bibtex entries.

    • The kitten.r script has a new option --puppy to add tinytest support in purring packages.

CRANberries provides a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page. The code is available via the GitHub repo, from tarballs and now of course all from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter.

Comments and suggestions are welcome at the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



New Introduction to rquery


[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Introduction

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate(), and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

In transform formulations data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.

In rquery’s case the primary set of data operators is as follows:

  • drop_columns
  • select_columns
  • rename_columns
  • select_rows
  • order_rows
  • extend
  • project
  • natural_join
  • convert_records (supplied by the cdata package).

These operations break into a small number of themes:

  • Simple column operations (selecting and re-naming columns).
  • Simple row operations (selecting and re-ordering rows).
  • Creating new columns or replacing columns with new calculated values.
  • Aggregating or summarizing data.
  • Combining results between two data.frames.
  • General conversion of record layouts (supplied by the cdata package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. rquery supplies a high performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the rquery data manipulation operators.

rquery operators

Simple column operations (selecting and re-naming columns)

The simple column operations are as follows.

  • drop_columns
  • select_columns
  • rename_columns

These operations are easy to demonstrate.

We set up some simple data.

d <- data.frame(
  x = c(1, 1, 2),
  y = c(5, 4, 3),
  z = c(6, 7, 8)
)

knitr::kable(d)
| x | y | z |
|---|---|---|
| 1 | 5 | 6 |
| 1 | 4 | 7 |
| 2 | 3 | 8 |

For example: drop_columns works as follows. drop_columns creates a new data.frame without certain columns.

library(rquery)

drop_columns(d, c('y', 'z'))
##    x
## 1: 1
## 2: 1
## 3: 2

In all cases the first argument of an rquery operator is either the data to be processed, or an earlier rquery pipeline to be extended. We will talk about composing rquery operations after we work through examples of all of the basic operations.

We can write the above in piped notation (using the wrapr pipe in this case):

d %.>%
  drop_columns(., c('y', 'z')) %.>%
  knitr::kable(.)
| x |
|---|
| 1 |
| 1 |
| 2 |

Notice the first argument is an explicit “dot” in wrapr pipe notation.

select_columns’s action is also obvious from example.

d %.>%
  select_columns(., c('x', 'y')) %.>%
  knitr::kable(.)
| x | y |
|---|---|
| 1 | 5 |
| 1 | 4 |
| 2 | 3 |

Simple row operations (selecting and re-ordering rows)

The simple row operations are:

  • select_rows
  • order_rows

select_rows keeps the set of rows that meet a given predicate expression.

d %.>%
  select_rows(., x == 1) %.>%
  knitr::kable(.)
| x | y | z |
|---|---|---|
| 1 | 5 | 6 |
| 1 | 4 | 7 |

order_rows re-orders rows by a selection of column names (and allows reverse ordering by naming which columns to reverse in the optional reverse argument). Multiple columns can be selected in the order, each column breaking ties in the earlier comparisons.

d %.>%
  order_rows(., 
             c('x', 'y'),
             reverse = 'x') %.>%
  knitr::kable(.)
| x | y | z |
|---|---|---|
| 2 | 3 | 8 |
| 1 | 4 | 7 |
| 1 | 5 | 6 |

General rquery operations do not depend on row-order and are not guaranteed to preserve row-order, so if you do want to order rows you should make it the last step of your pipeline.

Creating new columns or replacing columns with new calculated values

The important create or replace column operation is:

  • extend

extend accepts arbitrary expressions to create new columns (or replace existing ones). For example:

d %.>%
  extend(., zzz := y / x) %.>%
  knitr::kable(.)
| x | y | z | zzz |
|---|---|---|-----|
| 1 | 5 | 6 | 5.0 |
| 1 | 4 | 7 | 4.0 |
| 2 | 3 | 8 | 1.5 |

We can use = or := for column assignment. In these examples we will use := to keep column assignment clearly distinguishable from argument binding.

extend allows for very powerful per-group operations akin to what SQL calls “window functions”. When the optional partitionby argument is set to a vector of column names then aggregate calculations can be performed per-group. For example.

shift <- data.table::shift

d %.>%
  extend(.,
         max_y := max(y),
         shift_z := shift(z),
         row_number := row_number(),
         cumsum_z := cumsum(z),
         partitionby = 'x',
         orderby = c('y', 'z')) %.>%
  knitr::kable(.)
| x | y | z | max_y | shift_z | row_number | cumsum_z |
|---|---|---|-------|---------|------------|----------|
| 1 | 4 | 7 |     5 |      NA |          1 |        7 |
| 1 | 5 | 6 |     5 |       7 |          2 |       13 |
| 2 | 3 | 8 |     3 |      NA |          1 |        8 |

Notice the aggregates were performed per-partition (a set of rows with matching partition key values, specified by partitionby) and in the order determined by the orderby argument (without the orderby argument order is not guaranteed, so always set orderby for windowed operations that depend on row order!).

More on the window functions can be found here.

Aggregating or summarizing data

The main aggregation method for rquery is:

  • project

project performs per-group calculations, and returns only the grouping columns (specified by groupby) and derived aggregates. For example:

d %.>%
  project(.,
         max_y := max(y),
         count := n(),
         groupby = 'x') %.>%
  knitr::kable(.)
| x | max_y | count |
|---|-------|-------|
| 1 |     5 |     2 |
| 2 |     3 |     1 |

Notice we only get one row for each unique combination of the grouping variables. We can also aggregate into a single row by not specifying any groupby columns.

d %.>%
  project(.,
         max_y := max(y),
         count := n()) %.>%
  knitr::kable(.)
| max_y | count |
|-------|-------|
|     5 |     3 |

Combining results between two data.frames

To combine multiple tables in rquery one uses what we call the natural_join operator. In the rquery natural_join, rows are matched by column keys and any two columns with the same name are coalesced (meaning the first table with a non-missing value supplies the answer). This is easiest to demonstrate with an example.

Let’s set up new example tables.

d_left <- data.frame(
  k = c('a', 'a', 'b'),
  x = c(1, NA, 3),
  y = c(1, NA, NA),
  stringsAsFactors = FALSE
)

knitr::kable(d_left)
| k | x  | y  |
|---|----|----|
| a |  1 |  1 |
| a | NA | NA |
| b |  3 | NA |
d_right <- data.frame(
  k = c('a', 'b', 'q'),
  y = c(10, 20, 30),
  stringsAsFactors = FALSE
)

knitr::kable(d_right)
| k | y  |
|---|----|
| a | 10 |
| b | 20 |
| q | 30 |

To perform a join we specify which set of columns form our row-matching condition (using the by argument) and what type of join we want (using the jointype argument). For example we can use jointype = 'LEFT' to augment our d_left table with additional values from d_right.

natural_join(d_left, d_right,
             by = 'k',
             jointype = 'LEFT') %.>%
  knitr::kable(.)
| k | x  | y  |
|---|----|----|
| a |  1 |  1 |
| a | NA | 10 |
| b |  3 | 20 |

In a left join (as above), if the right table has unique keys then we get a table with the same structure as the left table, but with more information per row. This is a very useful type of join in data science projects. Notice columns with matching names are coalesced into each other, which we interpret as "take the value from the left table, unless it is missing."

General conversion of record layouts

Record transformation is “simple once you get it”. However, we suggest reading up on that as a separate topic here.
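As a flavour of what such a record transform looks like (a sketch of mine, not from this article), cdata's unpivot_to_blocks helper moves the x/y/z measurement columns of d into a tall key/value layout:

library(cdata)

d_tall <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn   = "variable",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("x", "y", "z"))

d_tall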

Composing operations

We could, of course, perform complicated data manipulation by sequencing rquery operations. For example to select one row with minimal y per-x group we could work in steps as follows.

. <- d
. <- extend(.,
            row_number := row_number(),
            partitionby = 'x',
            orderby = c('y', 'z'))
. <- select_rows(.,
                 row_number == 1)
. <- drop_columns(.,
                  "row_number")
knitr::kable(.)
| x | y | z |
|---|---|---|
| 1 | 4 | 7 |
| 2 | 3 | 8 |

The above discipline has the advantage that it is easy to debug, as we can run line by line and inspect intermediate values. We can even use the Bizarro pipe to make this look like a pipeline of operations.

d ->.;
  extend(.,
         row_number := row_number(),
         partitionby = 'x',
         orderby = c('y', 'z')) ->.;
  select_rows(.,
              row_number == 1)  ->.;
  drop_columns(.,
               "row_number")    ->.;
  knitr::kable(.)
| x | y | z |
|---|---|---|
| 1 | 4 | 7 |
| 2 | 3 | 8 |

Or we can use the wrapr pipe on the data, which we call “immediate mode” (for more on modes please see here).

d %.>%
  extend(.,
         row_number := row_number(),
         partitionby = 'x',
         orderby = c('y', 'z')) %.>%
  select_rows(.,
              row_number == 1)  %.>%
  drop_columns(.,
               "row_number")    %.>%
  knitr::kable(.)
| x | y | z |
|---|---|---|
| 1 | 4 | 7 |
| 2 | 3 | 8 |

rquery operators can also act on rquery pipelines instead of acting on data. We can write our operations as follows:

ops <- local_td(d) %.>%
  extend(.,
         row_number := row_number(),
         partitionby = 'x',
         orderby = c('y', 'z')) %.>%
  select_rows(.,
              row_number == 1)  %.>%
  drop_columns(.,
               "row_number")

cat(format(ops))
## mk_td("d", c(
##   "x",
##   "y",
##   "z")) %.>%
##  extend(.,
##   row_number := row_number(),
##   partitionby = c('x'),
##   orderby = c('y', 'z'),
##   reverse = c()) %.>%
##  select_rows(.,
##    row_number == 1) %.>%
##  drop_columns(.,
##    c('row_number'))

And we can re-use this pipeline, both on local data and to generate SQL to be run in remote databases. Applying this operator pipeline to our data.frame d is performed as follows.

d %.>% 
  ops %.>%
  knitr::kable(.)
| x | y | z |
|---|---|---|
| 1 | 4 | 7 |
| 2 | 3 | 8 |

What we are trying to illustrate above: there is a continuum of notations possible between:

  • Working over values with explicit intermediate variables.
  • Working over values with a pipeline.
  • Working over operators with a pipeline.

Being able to see these as all related gives some flexibility in decomposing problems into solutions. We have some more advanced notes on the differences in working modalities here and here.

Conclusion

rquery supplies a very teachable grammar of data manipulation based on Codd’s relational algebra and experience with pipelined data transforms (such as base::transform(), dplyr, and data.table).

For in-memory situations rquery uses data.table as the implementation provider (through the small adapter package rqdatatable) and is routinely faster than any other R data manipulation system except data.table itself.

For bigger than memory situations rquery can translate to any sufficiently powerful SQL dialect, allowing rquery pipelines to be executed on PostgreSQL, Apache Spark, or Google BigQuery.
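As a sketch of that workflow (not from the article; it assumes a DBI connection whose SQL dialect supports window functions, with recent SQLite used here purely for illustration), the ops pipeline built above can be copied to a database, inspected as SQL, and executed there:

library(rquery)
library(DBI)

raw_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db <- rquery_db_info(connection = raw_con, is_dbi = TRUE)

rq_copy_to(db, "d", d, temporary = TRUE, overwrite = TRUE)

cat(to_sql(ops, db))   # inspect the SQL generated for this database
execute(db, ops)       # run the pipeline in the database and collect the result

DBI::dbDisconnect(raw_con)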

In addition the data_algebra Python package supplies a nearly identical system for working with data in Python. The two systems can even share data manipulation code between each other (allowing very powerful R/Python inter-operation or helping port projects from one to the other).



The Hitchhiker’s Guide to Plotnine


[This article was first published on Pachá, and kindly contributed to R-bloggers.]

About the book

Jodie Burchell and yours truly have published a book, The Hitchhiker’s Guide to Plotnine, on graphing in Python using plotnine, a comprehensive port of ggplot2.

This book will help you easily build beautiful plots. If you’d like to create highly customised plots, including replicating the styles of XKCD and The Economist, this is your book.

Like the previous book, The Hitchhiker’s Guide to Ggplot2, the suggested price is $9.99 (you can pay as low as $4.99 or get it for free), and there is an option to download a combo containing the book and the jupyter notebooks with the exercises for a different price.


To leave a comment for the author, please follow the link and comment on their blog: Pachá.


Accupedo vs. Fitbit Part 1: Convergent Validity of Hourly Step Counts with R


[This article was first published on Method Matters Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In this post, we will investigate the relationship between hourly step count data from two different sources: the Accupedo app on my phone and the Fitbit I wear on my wrist. We will use several visualization and analytical techniques to understand the correspondence (or lack thereof) between the two measurements. Do these devices give similar readings of hourly step counts? When are the two measurements more likely to agree or disagree with one another?

The Data

The data come from two sources: the Accupedo app on my phone and from the Fitbit (model Alta HR) that I wear on my wrist. Both data sources are accessible (with a little work) via R: you can see my write up of how to access data from Accupedo here and my post on how to access data from Fitbit here.

I got the Fitbit in March 2018, and the data from both devices were extracted in mid-December 2018. I was able to match 273 days for which I had step counts for both Accupedo and Fitbit. The data contain the hourly steps for the hours from 6 AM to 11 PM. For each hour, I also record the cumulative daily steps (i.e. the number of steps I’ve taken so far during the day). In total, the dataset contains 4,914 observations of hourly step counts for the 273 days for which we have data (i.e. 18 observations per day).
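As a quick arithmetic sanity check on those figures (a trivial sketch, assuming the merged_data object introduced below):

# 273 matched days x 18 hourly observations per day
273 * 18                       # 4914
nrow(merged_data) == 273 * 18  # should be TRUE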

You can find the data and all the code from this blog post on Github here.

The head of the dataset (named merged_data) looks like this:

date daily_total_apedo hour hourly_steps_apedo cumulative_daily_steps_apedo daily_total_fbit hourly_steps_fbit cumulative_daily_steps_fbit dow week_weekend hour_diff_apedo_fbit
2018-03-20 16740 6 0 0 15562 281 281 Tue Weekday -281
2018-03-20 16740 7 977 977 15562 1034 1315 Tue Weekday -57
2018-03-20 16740 8 341 1318 15562 1605 2920 Tue Weekday -1264
2018-03-20 16740 9 1741 3059 15562 223 3143 Tue Weekday 1518
2018-03-20 16740 10 223 3282 15562 287 3430 Tue Weekday -64
2018-03-20 16740 11 226 3508 15562 188 3618 Tue Weekday 38
2018-03-20 16740 12 283 3791 15562 1124 4742 Tue Weekday -841
2018-03-20 16740 13 1587 5378 15562 525 5267 Tue Weekday 1062
2018-03-20 16740 14 431 5809 15562 372 5639 Tue Weekday 59
2018-03-20 16740 15 624 6433 15562 392 6031 Tue Weekday 232

Hourly Step Counts

Correspondence Plot

In this post, we will explore the hourly step counts. We can make a scatterplot showing the correspondence between the hourly step counts (coloring the points by type of day – week vs. weekend), and compute their correlation, using the following code:

# plot hourly steps against one another
# regression lines show that Accupedo over-counts;
# this over-counting is less strong on weekends
library(ggplot2)

ggplot(data = merged_data, aes(x = hourly_steps_apedo, y = hourly_steps_fbit, color = week_weekend)) +
  geom_point(alpha = .5) +
  geom_abline(intercept = 0, slope = 1, color = 'blue', linetype = 2, size = 2, show.legend = TRUE) +
  geom_smooth(method = "lm", fill = NA) +
  labs(x = "Accupedo", y = "Fitbit") +
  scale_color_manual(values = c("black", "red")) +
  labs(color = 'Week/Weekend')

# what's the correlation between the two columns?
cor.test(merged_data$hourly_steps_apedo, merged_data$hourly_steps_fbit)

Which returns the following plot:

hourly scatterplot

There are a number of things to note in this figure. The dashed blue line is the identity line – if both Accupedo and Fitbit recorded the same number of steps every hour, all of the points would lie on this line. Clearly they do not – it seems like there is some substantial disagreement between the devices on the hourly step counts!

I have colored the points separately for weekdays and weekend days, and drawn separate regression lines for each type of day. Both regression lines lie below the identity line, indicating that Accupedo has higher step counts on both types of days. Because the red line is closer to the identity line, the analysis indicates that correspondence between the hourly device measurements is higher (i.e. the values are closer together) on weekends than on weekdays.
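To put rough numbers on those two regression lines, here is a small sketch (not in the original post) that fits them explicitly, using the same columns of merged_data:

# separate weekday/weekend fits; slopes below 1 mean Fitbit records fewer
# steps than Accupedo for a given hour
fits <- lapply(split(merged_data, merged_data$week_weekend),
               function(d) lm(hourly_steps_fbit ~ hourly_steps_apedo, data = d))
lapply(fits, coef)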

We can also note a number of points with positive values on the y axis and a value of zero on the x axis – i.e. a fairly solid line extending upwards at the left-most side of the plot. These indicate hours in which Fitbit recorded steps but Accupedo recorded none. There are very few points with the opposite pattern – i.e. hours when Accupedo counted steps but Fitbit did not.

Finally, the code returns a correlation value of .52, which is highly statistically significant but not the strength of a relationship you’d expect if you had two measurements of exactly the same thing.

Bland Altman Plot

Another way of examining the correspondence between two measurements is the Bland-Altman plot. The Bland-Altman plot displays the mean of the measurements on the x-axis, and the difference between the measurements on the y-axis. A horizontal line (in red in the plot below) is drawn on the plot to indicate the mean difference between the measurements. In addition, two lines (in blue in the plot below) are drawn at +/- 1.96 standard deviations above and below the mean difference, respectively.
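Before calling the package, it may help to compute those reference lines by hand; this minimal sketch reproduces the quantities that BlandAltmanLeh draws for us:

# mean difference (the red line) and +/- 1.96 SD limits of agreement (the blue lines)
diffs <- merged_data$hour_diff_apedo_fbit
mean(diffs)
mean(diffs) + c(-1.96, 1.96) * sd(diffs)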

We will use the excellent BlandAltmanLeh package in R to make the Bland-Altman plot. Note that it takes some additional work to get the plot to have the same color scheme as our above correlation plot, with separate colors for weekdays and weekends, and transparency in the points.

# Bland Altman plot - color the points by weekday/weekend
# and make points semi-transparent
library(BlandAltmanLeh)

trans_red <- rgb(1, 0, 0, alpha = 0.5)
trans_blk <- rgb(0, 0, 0, alpha = 0.5)
week_weekday_color <- ifelse(merged_data$week_weekend == 'Weekday', trans_blk, trans_red)

bland.altman.plot(merged_data$hourly_steps_apedo,
                  merged_data$hourly_steps_fbit,
                  conf.int = .95,
                  main = "Bland Altman Plot: Hourly Step Counts",
                  xlab = "Mean of Measurements",
                  ylab = "Differences",
                  pch = 19,
                  col = week_weekday_color)
legend(x = "topright", legend = c("Weekday", "Weekend"), fill = 1:2)

Which gives us the following plot:

Bland Altman hourly

There are a couple of things to notice here. The first is that the mean of the difference of the hourly step counts – shown on the y axis – is close to zero (the exact value is 19.82, as we’ll see below). The y axis shows the value of the Accupedo steps minus the Fitbit steps, and so the positive average difference indicates that Accupedo gives higher step counts than Fitbit. However, the size of this hourly difference is small.

We can also see that the smaller the average of the step counts, the closer the correspondence between the two measurements. As the mean of the step counts gets higher (e.g. in hours where I walk a greater number of steps), the disagreement between the measurements is greater. This disagreement is fairly symmetrical at average step counts of less than 3,000 – in this range, neither device systematically records higher or lower numbers. Indeed, the differences in this range seem to be somewhat random – at times the Accupedo step counts are higher and at times the Fitbit step counts are higher.

At high levels of mean step counts, however, there seems to be a slight trend for Accupedo to give higher values. Indeed, past values of 3,000 on the x-axis, virtually all of the points outside of the 1.96 standard deviation lines are on the upper part of the plot, which indicates hours that the Accupedo measurement is greater than that given by Fitbit. However, of our 4,914 hourly observations, only a handful have step counts above 3,000, and so the influence of these observations on the overall mean difference is small.

We can again notice the observations for which the Fitbit records steps but Accupedo doesn’t. These appear as the diagonal line sloping downwards to the right at the bottom half of the plot. Finally, there is not much visual evidence here for enormous differences between weekdays and weekend days.

Testing the Statistical Significance of the Hourly Differences

A corresponding statistical analysis that often accompanies the Bland-Altman plot is a one-sample t-test comparing the mean difference of the measurements against zero (null hypothesis: the mean difference between the measurements is equal to zero). Although I’m not super-convinced of the utility of p-values, let’s go ahead and conduct this test. The difference score is contained in our dataset (shown above) in the column hour_diff_apedo_fbit. The mean of the differences is 19.82 and the standard deviation is 1080.42.

# calculate the mean, standard deviation
# and the one-sample t-test against zero
mean(merged_data$hour_diff_apedo_fbit)
sd(merged_data$hour_diff_apedo_fbit)
t.test(merged_data$hour_diff_apedo_fbit, mu = 0, alternative = "two.sided", conf.level = 0.95)

This test indicates that the difference in hourly step counts between Accupedo and Fitbit is not statistically significant, t(4913) = 1.29, p = .20.

More interesting is the effect size of this comparison. Effect sizes give a measure of the magnitude of an observed difference. There are many such measures, but we can compute Cohen’s D from the values we have above. Cohen’s D gives the size of a difference scaled to the standard deviation of that difference. Here, we simply take the mean difference score (19.82) and divide it by the standard deviation of that score (1080.42), which gives us a Cohen’s D value of .02. The average value of the hourly differences is dwarfed by the size of the variation in the hourly differences; this indicates that the magnitude of the average hourly step count differences is extremely small, and by all indications not meaningfully different from zero.
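For completeness, the same Cohen’s D calculation expressed as a couple of lines of R (a sketch using the difference column from the dataset above):

# Cohen's D: mean of the paired differences divided by their standard deviation
cohens_d <- mean(merged_data$hour_diff_apedo_fbit) / sd(merged_data$hour_diff_apedo_fbit)
cohens_d  # roughly 19.82 / 1080.42 = 0.02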

Differences in Hourly Step Counts Across the Day

The above plots mix data across all hours of the day in order to examine the global correspondence between hourly step counts. I was curious to visualize the difference in the step count measurements across the hours of the day, to see if there were any systematic differences within certain times of the day.

To make this plot, we’ll use the excellent ggridges package, visualizing the density distributions of step count differences separately for each hour of the day, with separate panels for weekdays and weekends.

library(ggridges)

# plot distributions for each hour
# separate week/weekend with facet
ggplot(data = merged_data, aes(x = hour_diff_apedo_fbit, y = as.factor(hour), fill = week_weekend)) +
  geom_density_ridges() +
  geom_vline(xintercept = 0, color = 'darkblue', linetype = 3, size = 1) +
  coord_flip() +
  facet_wrap(~week_weekend) +
  labs(y = "Hour of Day", x = "Difference Hourly Steps (Accupedo - Fitbit)") +
  scale_fill_manual(values = c("black", "red")) +
  labs(fill = 'Week/Weekend')

Which gives us the following plot:

ggridges hourly

The distributions are for the most part neatly centered around zero, indicating few systematic differences across the hours of the day. There are some exceptions for the earliest hours of the day. Between 6 and 8 AM during the week and on weekends, the Fitbit counts more steps per hour (I guess I carry my phone around less just after waking up).
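A small numeric check of that early-morning pattern (not part of the original analysis), using the same merged_data columns:

# mean hourly difference (Accupedo - Fitbit) for the early hours of the day;
# negative values mean Fitbit counted more steps
library(dplyr)
merged_data %>%
  group_by(week_weekend, hour) %>%
  summarise(mean_diff = mean(hour_diff_apedo_fbit)) %>%
  filter(hour <= 8)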

Summary and Conclusion

In this post, we examined the convergent validity of hourly step counts from the Accupedo app on my phone and the Fitbit I wear on my wrist. The correlation between these two devices’ records was .52, which is a rather weak relationship for two measurements of the same thing. The correspondence between the two measurements was higher on weekends than on weekdays, but regardless of the type of day, Accupedo gives numerically higher estimates of my hourly step count than does Fitbit.

We then constructed a Bland-Altman plot to compare the two measurements. This plot revealed that the average difference between the Accupedo and Fitbit hourly step counts was 19.82 steps; in other words, on average Accupedo counted 19.82 more steps each hour than did Fitbit. This difference was not statistically significantly different from zero, and the effect size analysis revealed that the magnitude of the difference between the devices was very small compared to the variation in the observed differences. The Bland-Altman plot also revealed that the disagreement between Accupedo and Fitbit was smaller when I walked fewer steps per hour. At very high step counts, the disagreement between the two devices’ readings was much larger.

Finally, the analysis of the density distributions of the differences in hourly step counts revealed few systematic differences between the Accupedo and Fitbit measurements across hours of the day.

In sum, this analysis offers some bad news and some good news, when considering the convergent validity of hourly step counts between Accupedo and Fitbit. On the one hand, the correlation between the hourly readings was only .52, which is a rather modest correlation. On the other hand, the average difference was small and not meaningfully different from zero. The differences in the measurements appear somewhat random – some hours Accupedo records higher step counts, and some hours Fitbit records higher step counts (although at higher step counts, Accupedo seems to give increasingly larger step counts than Fitbit).

It’s hard to say which device is more accurate, because we don’t have an objective gold standard to compare our measurements against. However, it’s safe to say that Accupedo and Fitbit give fairly similar readings, and that the differences in recorded hourly step counts are small on average.

Coming Up Next

In the next post, we’ll continue with our exploration of the step count measurements in this dataset. We will examine the cumulative daily step counts recorded by Accupedo and Fitbit, to see how they compare. This analysis focuses on the data that the user typically sees when interacting with the apps: the number of steps walked so far during a day. The results have both similarities and dissimilarities in comparison to what we’ve seen here.

Stay tuned!

P.S.

Welcome to the new home of this blog! It took a while to transfer everything from Blogger to Github, but I’m very pleased with the results. I hope to write a blog post at some point describing the process of transferring everything over.


To leave a comment for the author, please follow the link and comment on their blog: Method Matters Blog.


Le Monde puzzle [#1115]


[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A two-person game as the current weekly Le Monde mathematical puzzle:

Two players Amaruq and Atiqtalik are in a game with n tokens where Amaruq chooses a number 1

The run of a brute force R code like

B=rep(-1,200);B[1:9]=1
for (i in 10:200){
    v=matrix(-2,9,9)
    for (b in 2:9){
       for (a in (2:9)[-b+1])
       for (d in c(1,a,b)){
        e=i-d-c(1,a,b)
        if (max(!e)){v[a,b]=max(-1,v[a,b])}else{
         if (max(e)>0) v[a,b]=max(v[a,b],min(B[e[which(e>0)]]))}}
     B[i]=max(B[i],min(v[v[,b]>-2,b]))}

always produces 1’s in B, which means the first player wins no matter what… I thus found out (from the published solution) that my interpretation of the game rules was wrong. The values A and B are fixed once for all, and each player only has the choice between withdrawing 1, A, or B on her turn. The following code shows that Amaruq loses both times.

B=rep(1,210)
for(b in(2:9))
 for(a in(2:9)[-b+1])
  for(i in(2:210)){
   be=-2
   for(d in c(1,a,b)){
    if (d==i){best=1}else{
      e=i-d-c(1,a,b)
      if (max(!e)){be=max(-1,be)}else{
       if (max(e)>0)be=max(be,min(B[e[which(e>0)]]))}}}
   B[i]=be}

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.



Reconstructing Images Using PCA


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A decade or more ago I read a nice worked example from the political scientist Simon Jackman demonstrating how to do Principal Components Analysis. PCA is one of the basic techniques for reducing data with multiple dimensions to some much smaller subset that nevertheless represents or condenses the information we have in a useful way. In a PCA approach, we transform the data in order to find the “best” set of underlying components. We want the dimensions we choose to be orthogonal to one another—that is, linearly uncorrelated. PCA is an inductive approach to data analysis. Because of the way it works, we’re arithmetically guaranteed to find a set of components that “explain” all the variance we observe. The substantive explanatory question is whether the main components uncovered by PCA have a plausible interpretation.

I was reminded of all of this on Friday because some of my first-year undergrad students are doing an “Algorithms for Data Science” course, and the topic of PCA came up there. Some students not in that class wanted some intuitions about what PCA was. The thing I remembered about Jackman’s discussion was that he had the nice idea of doing PCA on an image, in order to show both how you could reconstruct the whole image from the PCA, if you wanted, and more importantly to provide some intuition about what the first few components of a PCA picked up on. His discussion doesn’t seem to be available anymore, so this afternoon I rewrote the example myself. I’ll use the same image he did. This one:

Elvis meets Nixon

Elvis Meets Nixon.

Setup

The Imager Library is our friend here. It’s a great toolkit for processing images in R, and it’s friendly to tidyverse packages, too.

library(imager)
library(here)
library(dplyr)
library(broom)
library(ggplot2)

Load the image

Our image is in a subfolder of our project directory. The load.image() function is from Imager, and imports the image as a cimg object. The library provides a method to convert these objects to a long-form data frame. Our image is greyscale, which makes it easier to work with. It’s 800 pixels wide by 633 pixels tall.

img <- load.image(here("img/elvis-nixon.jpeg"))
str(img)
##  'cimg' num [1:800, 1:633, 1, 1] 0.914 0.929 0.91 0.906 0.898 ...
dim(img)
## [1] 800 633   1   1
img_df_long <- as.data.frame(img)
head(img_df_long)
##   x y     value
## 1 1 1 0.9137255
## 2 2 1 0.9294118
## 3 3 1 0.9098039
## 4 4 1 0.9058824
## 5 5 1 0.8980392
## 6 6 1 0.8862745

Each x-y pair is a location in the 800 by 633 pixel grid, and the value is a grayscale value ranging from zero to one. To do a PCA we will need a matrix of data in wide format, though—one that reproduces the shape of the image, a row-and-column grid of pixels, each with some level of gray. We’ll widen it using pivot_wider:

img_df <- tidyr::pivot_wider(img_df_long,
                             names_from = y,
                             values_from = value)
dim(img_df)
## [1] 800 634

So now it’s the right shape. Here are the first few rows and columns.

img_df[1:5,1:5]
## # A tibble: 5 x 5
##       x   `1`   `2`   `3`   `4`
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 0.914 0.914 0.914 0.910
## 2     2 0.929 0.929 0.925 0.918
## 3     3 0.910 0.910 0.902 0.894
## 4     4 0.906 0.902 0.898 0.894
## 5     5 0.898 0.894 0.890 0.886

The values stretch off in both directions. Notice the x column there, which names the rows. We’ll drop that when we do the PCA.

Do the PCA

Next, we do the PCA, dropping the x column and feeding the 800×633 matrix to Base R’s prcomp() function.

img_pca <- img_df %>%
  dplyr::select(-x) %>%
  prcomp(scale = TRUE, center = TRUE)

There are a lot of components—633 of them altogether, in fact, so I’m only going to show the first twelve and the last six here. You can see that by component 12 we’re already up to almost 87% of the total variance “explained”.

summary(img_pca)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     15.2124 10.9823 7.54308 5.57239 4.77759 4.55531
## Proportion of Variance  0.3656  0.1905 0.08989 0.04905 0.03606 0.03278
## Cumulative Proportion   0.3656  0.5561 0.64601 0.69506 0.73112 0.76391
##                           PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     4.0649 3.66116 3.36891 3.27698 2.82984 2.49643
## Proportion of Variance 0.0261 0.02118 0.01793 0.01696 0.01265 0.00985
## Cumulative Proportion  0.7900 0.81118 0.82911 0.84608 0.85873 0.86857
## [A lot more components]
##                           PC628    PC629    PC630    PC631    PC632
## Standard deviation     0.001125 0.001104 0.001097 0.001037 0.000993
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.000000 1.000000 1.000000 1.000000 1.000000
##                            PC633
## Standard deviation     0.0009215
## Proportion of Variance 0.0000000
## Cumulative Proportion  1.0000000
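The same cumulative figure can be pulled straight out of the PCA object; a quick sketch, not part of the original post:

# proportion of variance is the squared standard deviations scaled to sum to 1
var_explained <- img_pca$sdev^2 / sum(img_pca$sdev^2)
cumsum(var_explained)[12]  # roughly 0.87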

We can tidy the output of prcomp with broom’s tidy function, just to get a summary scree plot showing the variance “explained” by each component.

pca_tidy <- tidy(img_pca, matrix = "pcs")

pca_tidy %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_line() +
  labs(x = "Principal Component", y = "Variance Explained")
Scree plot of the PCA

Scree plot of the PCA.

Reversing the PCA

Now the fun bit. The object produced by prcomp() has a few pieces inside:

names(img_pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

What are these? sdev contains the standard deviations of the principal components. rotation is a matrix whose rows correspond to the columns of the original data and whose columns are the principal components (the loadings). x is a matrix containing the rotated data—that is, the centered (and scaled) data multiplied by the rotation matrix—so each of its columns holds the scores for one principal component. Finally, center and scale are vectors with the centering and scaling values used for each column of the original data.
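A quick look at the dimensions of those pieces helps keep them straight (a small check, not in the original post):

dim(img_pca$x)          # 800 x 633: the scores (rotated data)
dim(img_pca$rotation)   # 633 x 633: the loadings
length(img_pca$center)  # 633: one centering value per column
length(img_pca$scale)   # 633: one scaling value per column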

Now, to get from this information back to the original data matrix, we need to multiply x by the transpose of the rotation matrix, and then revert the centering and scaling steps. If we multiply by the transpose of the full rotation matrix (and then un-center and un-scale), we’ll recover the original data matrix exactly. But we can also choose to use just the first few principal components, instead. There are 633 components in all (corresponding to the number of rows in the original data matrix), but the scree plot suggests that most of the data is “explained” by a much smaller number of components than that.
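As a sanity-check sketch (not in the original post), multiplying by the transpose of the full rotation matrix and then undoing the scaling and centering should reproduce the original matrix up to floating-point error:

recon_full <- img_pca$x %*% t(img_pca$rotation)
recon_full <- scale(recon_full, center = FALSE, scale = 1 / img_pca$scale)
recon_full <- scale(recon_full, scale = FALSE, center = -1 * img_pca$center)
max(abs(recon_full - as.matrix(dplyr::select(img_df, -x))))  # effectively zero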

Here’s a function that takes a PCA object created by prcomp() and returns an approximation of the original data, calculated by some number (n_comp) of principal components. It returns its results in long format, in a way that mirrors what the Imager library wants. This will make plotting easier in a minute.

reverse_pca <- function(n_comp = 20, pca_object = img_pca){
  ## The pca_object is an object created by base R's prcomp() function.

  ## Multiply the matrix of rotated data by the transpose of the matrix
  ## of eigenvectors (i.e. the component loadings) to get back to a
  ## matrix of original data values
  recon <- pca_object$x[, 1:n_comp] %*% t(pca_object$rotation[, 1:n_comp])

  ## Reverse any scaling and centering that was done by prcomp()
  if(all(pca_object$scale != FALSE)){
    ## Rescale by the reciprocal of the scaling factor, i.e. back to
    ## original range.
    recon <- scale(recon, center = FALSE, scale = 1/pca_object$scale)
  }

  if(all(pca_object$center != FALSE)){
    ## Remove any mean centering by adding the subtracted mean back in
    recon <- scale(recon, scale = FALSE, center = -1 * pca_object$center)
  }

  ## Make it a data frame that we can easily pivot to long format
  ## (because that's the format that the excellent imager library wants
  ## when drawing image plots with ggplot)
  recon_df <- data.frame(cbind(1:nrow(recon), recon))
  colnames(recon_df) <- c("x", 1:(ncol(recon_df)-1))

  ## Return the data to long form
  recon_df_long <- recon_df %>%
    tidyr::pivot_longer(cols = -x,
                        names_to = "y",
                        values_to = "value") %>%
    mutate(y = as.numeric(y)) %>%
    arrange(y) %>%
    as.data.frame()

  recon_df_long
}

Let’s put the function to work by mapping it to our PCA object, and reconstructing our image based on the first 2, 3, 4, 5, 10, 20, 50, and 100 principal components.

## The sequence of PCA components we want
library(purrr)  # map_dfr() comes from purrr

n_pcs <- c(2:5, 10, 20, 50, 100)
names(n_pcs) <- paste("First", n_pcs, "Components", sep = "_")

## map reverse_pca()
recovered_imgs <- map_dfr(n_pcs,
                          reverse_pca,
                          .id = "pcs") %>%
  mutate(pcs = stringr::str_replace_all(pcs, "_", " "),
         pcs = factor(pcs, levels = unique(pcs), ordered = TRUE))

This gives us a very long tibble with an index (pcs) for the number of components used to reconstruct the image. In essence it’s eight images stacked on top of one another. Each image has been reconstituted using some number of components, from a very small number (2) to a larger number (100). Now we can plot each resulting image in a small multiple.

p <- ggplot(data = recovered_imgs,
            mapping = aes(x = x, y = y, fill = value))

p_out <- p + geom_raster() +
  scale_y_reverse() +
  scale_fill_gradient(low = "black", high = "white") +
  facet_wrap(~ pcs, ncol = 2) +
  guides(fill = FALSE) +
  labs(title = "Recovering the content of an 800x600 pixel image\nfrom a Principal Components Analysis of its pixels") +
  theme(strip.text = element_text(face = "bold", size = rel(1.2)),
        plot.title = element_text(size = rel(1.5)))

p_out
Elvis meets Nixon, as recaptured by varying numbers of principal components.

Elvis Meets Nixon, as recaptured by varying numbers of principal components.

There’s a lot more one could do with this, especially if I knew rather more linear algebra than I in fact do haha. But at any rate we can see that it’s pretty straightforward to use R to play around with PCA and images in a tidy framework.


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


Data Science – A New Hope for Good


[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Good triumphing over evil in the end is the stuff of every good fairy tale or Hollywood storyline, but in real life, as we all know, it’s usually the tales of political doom and gloom across the world that dominate our screens with stories of good remaining well away from the spotlight.  Good, it seems, does not make for high viewing figures.

And stories about data are no exception to this rule.

Barely a day goes by without a story alarming the general public about their privacy and how their information is being used.  Think about the investigative documentary about the Cambridge Analytica Scandal, The Great Hack.  Think about banking information leaks or how Facebook is using your personal details and preferences.

It’s easy to forget that data science is also shaping the way we live, improving lives for the better and providing services we could only have dreamed of decades ago.

And it’s for this reason, that we at Mango decided to celebrate #Data4Good week, showcasing all of the different ways data science and analytics can be used for good in the world.

When The Economist declared in 2017 that data was more valuable than oil, few people truly understood its power and how this was possible.  Fast-forward to the current day, and the picture of data usage is becoming clearer.

In September, we were fortunate enough to secure some incredible speakers at our annual EARL Conference in London who shared stories of how data science has benefited services from local communities, to healthcare and even helping progress peace talks in war stricken areas.  We have shared some of these stories via Twitter and I would urge you to take a look.

 

R in the NHS

 When it comes to healthcare, analysis and prediction can be used to better inform decisions about healthcare provision, streamline and automate tasks, dig into complex problems, and predict changes in the healthcare the NHS provides to its patients. Because of this, many non-profit organisations want to harness the power of data science.

During EARL, Edward Watkinson, an analyst, economist and data scientist currently working for the Royal Free London Hospital Group, took to the stage to explain how they adopted R as the core tool on their analytical workbench for helping to run their hospitals, and to show how useful it has been in the cash-strapped NHS.

You can watch Edward’s ten-minute lightning talk here.

 

Helping local communities

Another great use case for data science being used for good is how it can help local communities. David Baker, research and evaluation residential volunteer worker at Toynbee Hall, took to the stage at EARL to explain how Toynbee Hall has adopted R as a tool within the charity sector.

By way of background, since its inception in 1884, evidence-based research has been central to Toynbee Hall’s mission as a charity throughout East London communities. It has had a hand in creating first data visualisations for public good, publishing a series of poverty maps, and regularly engages with the local community to solve problems.

R has allowed for rapid analyses of data on a diversity of projects at Toynbee Hall. Additionally, Baker explained how embracing open source software allowed the team to host a series of data hackathons, which let them recruit freelance data scientists to help analyse publicly available datasets, contributing to the materials they use in their policy advocacy campaigns.

You can catch-up on David’s talk here.

 

Some amazing work has been done through the power of data science and analytics, and it’s continually changing the world around us. These stories won’t ever make the news, but it’s reassuring to remind ourselves that sometimes, good really can triumph over evil in the real world as well as in the movies.  We hope that people are beginning to see that data science is more than just a buzzword – it’s a new hope for good.


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


Illuminating the Illuminated Part Two: Ipsa Scientia Potestas Est


[This article was first published on Weird Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

In the previous post in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.

Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.

One of the more well-known arguments for the authenticity of the manuscript, in addition to its manufacture with period parchment and inks, is that the text appears to follow certain statistical properties associated with human language, and which were unknown at the time of its creation.

The most well-known of these properties is that the frequency of words in the Voynich Manuscript have been claimed to follow a phenomenon known as Zipf’s Law, whereby the frequency of a word’s occurrence in the text is inversely proportional to its rank in the list of words ordered by frequency.

In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs presented by the Voynich manuscript.

Unnatural Laws

Zipf’s Law is an example of a discrete power law probability distribution. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.

In its original context of human language, Zipf’s Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, and three times as common as the third most common word. More precisely, this law holds for much of the corpus, as it tends to break down somewhat at both the most-frequent and least-frequent words. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.
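As a toy illustration of that statement (not drawn from the manuscript itself), an idealised Zipfian distribution makes frequency proportional to the reciprocal of rank:

# idealised Zipf's Law: frequency proportional to 1 / rank
rank <- 1:1000
freq <- (1 / rank) / sum(1 / rank)
round(freq[1:5] / freq[1], 3)  # 1.000 0.500 0.333 0.250 0.200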

The most well-known, if highly flawed, method to determine whether a distribution follows a power law is to plot it with both axes expressed as a log-scale: a so-called log-log plot. A power law, represented in such a way, will appear linear. Unfortunately, a hideous menagerie of other distributions will also appear linear in such a setting.

More generally, it is rarely sensible to claim that any natural phenomenon follows a given distribution or model, but instead to demonstrate that a distribution presents a useful model for a given set of observations. Indeed, it is possible to fit any set of observations to a power law, with the assumption that the fit will be poor. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more Bayesian approach would advocate building a range of models, demonstrating that the power law is most accurate. All truth, it seems, is relative.

Faced with the awful statistical horror of the universe, we are reduced to seeking evidence against a phenomenon’s adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines the Voynich Manuscript.

Fitted Power Law of Voynich Corpus

Fitted Power Law of Voynich Corpus | (PDF Version)

A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf’s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.

The poweRlaw package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to disprove the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the model is not drawn from a power law.

This is an inversion of the more typical frequentist null hypothesis scenario. Typically, in such approaches, we hope for a low p-value, typically below 0.05 or even 0.001, showing that the chance of the observations being consistent with the null hypothesis is extremely low. For this test, we instead hope that our p-value is insufficiently low to make such a claim, and thus that a power law is consistent with the data.
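A minimal sketch of that test with poweRlaw (the full plotting script appears further below; word_counts here stands in for the vector of Voynich word frequencies):

library(poweRlaw)

# word_counts: an integer vector of word frequencies, assumed already computed
pl <- displ$new(word_counts)    # discrete power law model
pl$setXmin(estimate_xmin(pl))   # estimate the lower cut-off and exponent
bs <- bootstrap_p(pl, no_of_sims = 1000)
bs$p                            # p > 0.1: a power law cannot be ruled out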

The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of 0.151. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.

Voynich Power Law Fit and Plot

voynich_powerlaw.r

library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( drlib )
library( poweRlaw )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/share/fonts/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <- read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
  rename( folio = X1, text = X2 )

# Tokenize
voynich_words <-
  voynich_tbl %>%
  unnest_tokens( word, text )

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
  voynich_words %>%
  count( word, sort=TRUE ) %>%
  mutate( word = reorder( word, n ) ) %>%
  mutate( freq = n / sum(n) )

voynich_word_counts <-
  voynich_words %>%
  count( word, folio, sort = TRUE )

# (Following the poweRlaw vignette)
# Create a discrete power law distribution object from the word counts
voynich_powerlaw <-
  voynich_common %>%
  extract2( "n" ) %>%
  displ$new()

# Estimate the lower bound
voynich_powerlaw_xmin <- estimate_xmin( voynich_powerlaw )

# Set the parameters of the voynich_powerlaw to the estimated values
voynich_powerlaw$setXmin( voynich_powerlaw_xmin )

# Estimate parameters of the power law distribution
voynich_powerlaw_est <- estimate_pars( voynich_powerlaw )

# Calculate p-value of power law. See Section 4.2 of "Power-Law Distributions in Empirical Data" by Clauset et al.
# If the p-value is _greater_ than 0.1 then we cannot rule out a power-law distribution.
voynich_powerlaw_bootstrap_p <- bootstrap_p( voynich_powerlaw, no_of_sims=1000, threads=7 )
# p=0.143 power law cannot be ruled out

# Parameter uncertainty via boostrapping
voynich_powerlaw_bootstrap <- bootstrap( voynich_powerlaw, no_of_sims=1000, threads=7 )

# Plot data and power law fit
voynich_powerlaw_plot_data <- plot( voynich_powerlaw, draw = F ) %>%
  mutate( log_x = log( x ), log_y = log( y ) )
voynich_powerlaw_fit_data <- lines( voynich_powerlaw, col=2, draw = F ) %>%
  mutate( log_x = log( x ), log_y = log( y ) )

# Plot the fitted power law data.
gp <-
  ggplot( voynich_powerlaw_plot_data ) +
  geom_point( aes( x = log( x ), y = log( y ) ), colour="#8a0707" ) +
  geom_line( data = voynich_powerlaw_fit_data,
             aes( x = log( x ), y = log( y ) ),
             colour="#0b6788" ) +
  labs( x = "Log Rank", y = "Log Frequency" )

gp <-
  gp +
  theme(
    panel.background = element_rect( fill = "transparent", colour = "transparent" ),
    plot.background = element_rect( fill = "transparent", colour = "transparent" ),
    plot.title = element_text( family="bold_font", colour="#3c3f4a", size=22 ),
    plot.subtitle = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
    axis.text = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
    axis.title.x = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
    axis.title.y = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
    axis.line = element_line( colour="#3c3f4a" ),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank()
  )

# Remove legend from internal plot
theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )

# Cowplot trick for ggtitle
title <- ggdraw() +
  draw_label(
    "Fitted Power Law of Voynich Corpus",
    fontfamily="bold_font", colour = "#3c3f4a", size=20,
    hjust=0, vjust=1, x=0.02, y=0.88 ) +
  draw_label(
    "http://www.weirddatascience.net | @WeirdDataSci",
    fontfamily="main_font", colour = "#3c3f4a", size=12,
    hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <- ggdraw() +
  draw_label(
    "Data: http://www.voynich.nu",
    fontfamily="main_font", colour = "#3c3f4a", size=14,
    hjust=1, x=0.98 )

# Combine plots
tgp <- plot_grid( title, gp, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

# Add parchment underlay
parchment_plot <- ggdraw() +
  draw_image( "img/parchment.jpg", scale=1.4 ) +
  draw_plot( tgp )

save_plot( "output/voynich_power_law.pdf", parchment_plot,
           base_width = 16,
           base_height = 9,
           base_aspect_ratio = 1.78 )

Led further down twisting paths by this initial taste of success, we can now present the Voynich corpus against other human-language corpora to gain a faint impression of how similar or different it is to known languages. The following plot compares the frequency of words in the Voynich Manuscript to those of the twenty most popular languages in Wikipedia, taken from the dataset available here.

Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora

Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora | (PDF Version)

Wikipedia Word Frequency Plot


voynich_zipf_wikipedia.r

library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( drlib )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/share/fonts/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )

message( "Reading raw Voynich data..." )
voynich_tbl <- read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
  rename( folio = X1, text = X2 )

# Tokenize
# (Remove words of 3 letters or less)
# Stemming and stopword removal apparently not so effective anyway,
# according to Schofield et al.:
voynich_words <-
  voynich_tbl %>%
  unnest_tokens( word, text )

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
  voynich_words %>%
  count( word, sort=TRUE ) %>%
  mutate( word = reorder( word, n ) ) %>%
  mutate( freq = n / sum(n) )

# Plot a log-log plot of Voynich word frequencies.
voynich_word_counts <-
  voynich_words %>%
  count( word, folio, sort = TRUE )

# Load other languages.
# Select frequency counts.
# Convert to long format, then normalise per-language.
message( "Loading common language statistics..." )
wiki_language <- read.csv( "data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv" ) %>%
  head( 10000 ) %>%
  as_tibble %>%
  select( matches( "*_FREQ" ) ) %>%
  gather( key = "language", value = "count" ) %>%
  mutate( language = str_replace( language, "_FREQ", "" ) ) %>%
  group_by( language ) %>%
  transmute( freq = count / sum( count ) ) %>%
  ungroup

wiki_language_words <- read.csv( "data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv" ) %>%
  head( 10000 ) %>%
  as_tibble

# Combine with Voynich, assigning it the unassigned ISO 3166-1 alpha-2 code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
  voynich_common %>%
  transmute( language = "vy", freq = freq )

# Combine, then add per-language rank information
message( "Processing common and Voynich language statistics..." )
all_languages <- bind_rows( wiki_language, voynich_language ) %>%
  mutate( colour = ifelse( str_detect( `language`, "vy" ), "red", "grey" ) ) %>%
  group_by( language ) %>%
  transmute( log_rank=log( row_number() ), log_freq=log( freq ), colour ) %>%
  ungroup

# Plot a log-log plot of all language word frequencies.
message( "Plotting common and Voynich language statistics..." )
voynich_wikipedia_plot <-
  all_languages %>%
  ggplot( aes( x=log_rank, y=log_freq, colour=colour ) ) +
  geom_point( alpha=0.4, shape=20 ) +
  scale_color_manual( values=c( "#3c3f4a", "#8a0707" ) ) +
  theme(
    axis.title.y = element_text( angle = 90, family="main_font", size=12 ),
    axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.line.x = element_line( color = "#3c3f4a" ),
    axis.line.y = element_line( color = "#3c3f4a" ),
    plot.title = element_blank(),
    plot.subtitle = element_blank(),
    plot.background = element_rect( fill = "transparent" ),
    panel.background = element_rect( fill = "transparent" ) # bg of the panel
  ) +
  #scale_colour_viridis_d( option="cividis", begin=0.4 ) +
  guides( colour="none" ) +
  labs( y="Log Frequency", x="Log Rank" )

theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )

# Cowplot trick for ggtitle
title <- ggdraw() +
  draw_label(
    "Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora",
    fontfamily="bold_font", colour = "#3c3f4a", size=20,
    hjust=0, vjust=1, x=0.02, y=0.88 ) +
  draw_label(
    "http://www.weirddatascience.net | @WeirdDataSci",
    fontfamily="bold_font", colour = "#3c3f4a", size=12,
    hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <- ggdraw() +
  draw_label(
    "Data from: http://www.voynich.nu | http://wikipedia.org",
    fontfamily="bold_font", colour = "#3c3f4a", size=12,
    hjust=1, x=0.98 )

tgp <- plot_grid( title, voynich_wikipedia_plot, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

voynich_wikipedia_plot <- ggdraw() +
  draw_image( "img/parchment.jpg", scale=1.4 ) +
  draw_plot( tgp )

save_plot( "output/voynich_wikipedia_plot.pdf", voynich_wikipedia_plot,
           base_width = 16,
           base_height = 9,
           base_aspect_ratio = 1.78 )

The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia, the most striking difference being the clustering of Voynich word frequencies in the lower half of the diagram, which results from the smaller corpus of words in the Voynich Manuscript. This causes lower-frequency words, in particular, to occur an identical number of times, producing vertical leaps in the frequency graph towards the lower end.

To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the United Nations Declaration of Human Rights.

Voynich Manuscript Rank Frequency Distribution against UNDHR Translations

Voynich Manuscript Rank Frequency Distribution against UNDHR Translations | (PDF Version)

UNDHR Word Frequency Plot

voynich_zipf_udhr.r

library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/share/fonts/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/share/fonts/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <- read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
  rename( folio = X1, text = X2 )

# Tokenize
# (Remove words of 3 letters or less)
# Stemming and stopword removal apparently not so effective anyway,
# according to Schofield et al.:
voynich_words <-
  voynich_tbl %>%
  unnest_tokens( word, text )

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
  voynich_words %>%
  count( word, sort=TRUE ) %>%
  mutate( word = reorder( word, n ) ) %>%
  mutate( freq = n / sum(n) )

# Combine with Voynich, assigning it the unassigned ISO 3166-1 alpha-2 code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
  voynich_common %>%
  transmute( language = "vy", freq = freq )

voynich_word_counts <-
  voynich_words %>%
  count( word, folio, sort = TRUE )

# UDHR corpus comparison (smaller text)
udhr_corpus_files <- list.files( "data/udhr/udhr_txt", pattern="*.txt", full.names=TRUE )

# Helper function to read in a text file and calculate a frequency table
table_frequency_mapper <- function( x ) {

  # Read file and extract language code from filename
  udhr_text <- read_lines( x, skip=6, skip_empty_rows=TRUE )
  language <- basename( x ) %>% str_replace( "udhr_", "" ) %>% str_replace( ".txt", "" )

  # Tokenize and remove punctuation
  udhr_words <-
    udhr_text %>%
    str_flatten %>%
    str_remove_all( "[.,]" ) %>%
    str_split( "\\s+" ) %>%
    extract2( 1 ) %>%
    { tibble( word=. ) }

  # Most common words
  udhr_common <-
    udhr_words %>%
    count( word, sort=TRUE ) %>%
    mutate( word = reorder( word, n ), language )
}

voynich_corpus <-
  voynich_language %>%
  transmute( language, log_rank=log( row_number() ), log_freq=log( freq ), colour="Voynich Text" )

udhr_corpus <-
  udhr_corpus_files %>%
  map( table_frequency_mapper ) %>%
  bind_rows %>%
  group_by( language ) %>%
  transmute( log_rank=log( row_number() ), log_freq=log( n / sum(n) ), colour="Known UDHR Language" ) %>%
  ungroup

voynich_udhr_corpus <-
  bind_rows( udhr_corpus, voynich_corpus )

voynich_udhr_frequency_plot <-
  voynich_udhr_corpus %>%
  ggplot( aes( x=log_rank, y=log_freq, colour=colour ) ) +
  geom_point( alpha=0.4, shape=19 ) +
  scale_color_manual( values=c( "Known UDHR Language" = "#3c3f4a", "Voynich Text" = "#8a0707" ) ) +
  theme(
    axis.title.y = element_text( angle = 90, family="main_font", size=12 ),
    axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
    axis.line.x = element_line( color = "#3c3f4a" ),
    axis.line.y = element_line( color = "#3c3f4a" ),
    plot.title = element_blank(),
    plot.subtitle = element_blank(),
    plot.background = element_rect( fill = "transparent" ),
    panel.background = element_rect( fill = "transparent" ), # bg of the panel
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.text = element_text( family="bold_font", colour="#3c3f4a", size=10 ),
    legend.title = element_blank(),
    legend.key.height = unit( 1.2, "lines" ),
    legend.position=c( .85, .5 )
  ) +
  labs( y="Log Frequency", x="Log Rank" )

theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )

# Cowplot trick for ggtitle
title <- ggdraw() +
  draw_label(
    "Voynich Manuscript Rank Frequency Distribution against UNDHR Translations",
    fontfamily="bold_font", colour = "#3c3f4a", size=20,
    hjust=0, vjust=1, x=0.02, y=0.88 ) +
  draw_label(
    "http://www.weirddatascience.net | @WeirdDataSci",
    fontfamily="bold_font", colour = "#3c3f4a", size=12,
    hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <- ggdraw() +
  draw_label(
    "Data from: http://www.voynich.nu | http://unicode.org/udhr/",
    fontfamily="bold_font", colour = "#3c3f4a", size=12,
    hjust=1, x=0.98 )

tgp <- plot_grid( title, voynich_udhr_frequency_plot, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

voynich_udhr_plot <- ggdraw() +
  draw_image( "img/parchment.jpg", scale=1.4 ) +
  draw_plot( tgp )

save_plot( "output/voynich_udhr_plot.pdf", voynich_udhr_plot,
           base_width = 16,
           base_height = 9,
           base_aspect_ratio = 1.78 )

A Refined Randomness

The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.

Sadly, however, as with all too many arguments in the literature regarding power law distributions arising in nature, there is a complication to this argument that again highlights the difference between proof and the failure to disprove. Certainly, if a power law had proved incompatible with the Voynich Manuscript then we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.

Rugg has argued that certain random mechanisms can produce text that adheres to Zipf’s Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by Li (1992).

Twisting Paths

These analyses can only present a dim outline of the text itself, and we resist the awful temptation to attempt any form of decipherment. Certainly, the evidence here seems convincing enough that the Voynich Manuscript does represent a human language, but the statistics presented here are of little use in such an effort. It is likely, of course, that the most frequent words in the manuscript may, under certain assumptions, correspond to the most common words or particles in many languages — the definite article, the indefinite article, conjunctions, pronouns, and similar. Without deeper knowledge of the language, however, and with the range of scribing conventions and shortcuts commonplace in texts of the period, these techniques are too limited to do more than tantalise us with what we may never know.

Credible Conclusions

Subjecting the text of the Voynich Manuscript to the crude frequency analyses presented here can support, although not prove, the view that the manuscript, regardless of its true content, is not simply random gibberish. Nor is the text likely to be the result of a simple mechanical process designed without knowledge of the statistical patterns of human languages. Neither is it likely to be any form of cryptogram more sophisticated than the simplest ciphers, as these would have tended to compromise the statistical properties that we have observed.

The demonstrable following of Zipf’s Law, and the adherence to a Gamma distribution of similar shape to known languages, strongly suggests that the text is likely a representation of some natural language.

In the next post we will attempt blindly to wrench more secrets from the text itself through application of modern textual analysis techniques. Until then the Voynich Manuscript remains, silently obscure, beyond the reach of our faltering science.



Renaming all files in a folder in R


[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

I hate the way files are named by a camera. While it was cool to learn for this post that DSCN stands for “Digital Still Capture – Nikon,” it means nothing to me!

For this post, I will be renaming the files that I took from Worden Ledges into a more “human-readable” name.

Ready to “automate the boring stuff with R?” Check out my course, R Explained for Excel Users.

Vectorization for the efficiency

I thought that this would require a loop or even an apply() function, but it turns out all that’s needed is a list of the file names. To rename the files, we will simply list all the current files, list the names of the new files that we want, then switch them around.

1. List files in the folder

I have saved these photos under C:/Ledges on my computer. Using the list.files() function, I see them all “listed.”

Actually, this is a vector, not a list, which is its own thing in R. This will make a big difference later on. Too bad that vector.files() doesn’t quite have the same ring!

> old_files <- list.files("C:/Ledges", pattern = "*.JPG", full.names = TRUE)
> old_files
 [1] "C:/Ledges/DSCN7155.JPG" "C:/Ledges/DSCN7156.JPG" "C:/Ledges/DSCN7157.JPG" "C:/Ledges/DSCN7158.JPG"
 [5] "C:/Ledges/DSCN7160.JPG" "C:/Ledges/DSCN7161.JPG" "C:/Ledges/DSCN7162.JPG" "C:/Ledges/DSCN7163.JPG"
 [9] "C:/Ledges/DSCN7164.JPG" "C:/Ledges/DSCN7165.JPG" "C:/Ledges/DSCN7166.JPG" "C:/Ledges/DSCN7167.JPG"
[13] "C:/Ledges/DSCN7168.JPG" "C:/Ledges/DSCN7169.JPG" "C:/Ledges/DSCN7170.JPG" "C:/Ledges/DSCN7171.JPG"
[17] "C:/Ledges/DSCN7172.JPG" "C:/Ledges/DSCN7174.JPG" "C:/Ledges/DSCN7175.JPG" "C:/Ledges/DSCN7176.JPG"
[21] "C:/Ledges/DSCN7177.JPG" "C:/Ledges/DSCN7178.JPG" "C:/Ledges/DSCN7179.JPG" "C:/Ledges/DSCN7180.JPG"
[25] "C:/Ledges/DSCN7181.JPG" "C:/Ledges/DSCN7182.JPG" "C:/Ledges/DSCN7183.JPG" "C:/Ledges/DSCN7184.JPG"
[29] "C:/Ledges/DSCN7185.JPG" "C:/Ledges/DSCN7186.JPG"

2. Create vector of new files

Now we can name all the new files that we want. For example, instead of DSCN7155.JPG, I want a file name like ledges_1.JPG.

Using 1:length(old_files) gives us a vector of the exact same length as old_files.

I have saved these in the folder C:/LedgesR.

> new_files <- paste0("C:/LedgesR/ledges_",1:length(old_files),".JPG")
> new_files
 [1] "C:/LedgesR/ledges_1.JPG"  "C:/LedgesR/ledges_2.JPG"  "C:/LedgesR/ledges_3.JPG"
 [4] "C:/LedgesR/ledges_4.JPG"  "C:/LedgesR/ledges_5.JPG"  "C:/LedgesR/ledges_6.JPG"
 [7] "C:/LedgesR/ledges_7.JPG"  "C:/LedgesR/ledges_8.JPG"  "C:/LedgesR/ledges_9.JPG"
[10] "C:/LedgesR/ledges_10.JPG" "C:/LedgesR/ledges_11.JPG" "C:/LedgesR/ledges_12.JPG"
[13] "C:/LedgesR/ledges_13.JPG" "C:/LedgesR/ledges_14.JPG" "C:/LedgesR/ledges_15.JPG"
[16] "C:/LedgesR/ledges_16.JPG" "C:/LedgesR/ledges_17.JPG" "C:/LedgesR/ledges_18.JPG"
[19] "C:/LedgesR/ledges_19.JPG" "C:/LedgesR/ledges_20.JPG" "C:/LedgesR/ledges_21.JPG"
[22] "C:/LedgesR/ledges_22.JPG" "C:/LedgesR/ledges_23.JPG" "C:/LedgesR/ledges_24.JPG"
[25] "C:/LedgesR/ledges_25.JPG" "C:/LedgesR/ledges_26.JPG" "C:/LedgesR/ledges_27.JPG"
[28] "C:/LedgesR/ledges_28.JPG" "C:/LedgesR/ledges_29.JPG" "C:/LedgesR/ledges_30.JPG"

3. Copy from old files to new files

Now all we’ll do is copy the files from the old file locations to the new. A TRUE output indicates a successful transfer.

> file.copy(from = old_files, to = new_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Now we can open and see that these have more user-friendly names. 

One nice thing is that because the original files are named sequentially (i.e., DSCN7155 comes before DSCN7156, etc.), our new files will be sequential too (i.e., they become ledges_1, ledges_2, etc.).
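One caveat: the counter above is not zero-padded, so a plain alphabetical listing will put ledges_10.JPG before ledges_2.JPG. If that bothers you, a padded variant (a small sketch reusing the old_files vector from above) keeps everything in order:

# zero-pad the counter so alphabetical and numeric order agree
new_files <- paste0("C:/LedgesR/ledges_", sprintf("%02d", seq_along(old_files)), ".JPG")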

4. Clear out the old files

There is no Ctrl + Z on deleting files in R! That’s why I like to copy our files to a new location before deleting the source. We can remove the old files with the file.remove() function.

> file.remove(old_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

More than one way to name a file

There are doubtless other (likely even better) ways to do this in R, so how would you do it? One candidate might, for example, be file.path(); however, I found paste0() to do exactly what I wanted.
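One hypothetical alternative (not what this post does): if you are comfortable renaming in place rather than copying first, file.path() and file.rename() collapse steps 3 and 4 into a single call, with the caveat that there is no undo.

old_files <- list.files("C:/Ledges", pattern = "*.JPG", full.names = TRUE)
new_files <- file.path("C:/Ledges", paste0("ledges_", seq_along(old_files), ".JPG"))

# renames in place; nothing is copied, so double-check the names first
file.rename(from = old_files, to = new_files)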

The complete code is below. 
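(Reconstructed from the snippets above, so treat this as a sketch rather than the verbatim original.)

# 1. list the existing photos
old_files <- list.files("C:/Ledges", pattern = "*.JPG", full.names = TRUE)

# 2. build the new, human-readable names
new_files <- paste0("C:/LedgesR/ledges_", 1:length(old_files), ".JPG")

# 3. copy to the new names, then 4. remove the originals
file.copy(from = old_files, to = new_files)
file.remove(old_files)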


To leave a comment for the author, please follow the link and comment on their blog: George J. Mount.


How to Grow Your Own Data Scientists – a practical guide for the data-driven C-Suite


[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Data today is the fuel driving the modern business world. It therefore stands to reason that the ability to read and speak data should be a fairly mainstream skill. Except it isn’t - yet. A 2018 report by Qlik suggests that just 24% of business decision makers were fully confident in their abilities with data. This is despite the fact that, according to the 2018 Gartner CIO Agenda, CIOs globally ranked analytics and business intelligence as the most critical technology to achieve the organisation’s business goals, with data and analytics skills topping the list as the most sought-after talent.

As more organisations embrace data-driven digital transformation, it’s clear that the need to upskill and resource data science teams has become far more pronounced. With the gap seeming only to become wider, how can the C-suite continue to leverage data-driven digital transformation if there are insufficient resources to support it? Given the widening gulf between the skills emerging from tertiary education and the demand for data literacy, it’s becoming incumbent upon the businesses themselves, led by the C-suite, to champion the drive towards a more data-savvy future.

Fortunately, a positive trend that is taking rapid shape is the emergence of data science and analytics capabilities across a much wider range of sectors. However, if that innovation is taking place in siloes, separated from the business, it likely isn’t delivering the results you need. So, what should a data-driven C-Suite do?

 

Nurture existing resources

Pulling together existing disparate data science resources into a single, connected community of practice creates a secure foundation for the C-suite from which to grow its data scientists. If this single entity has a common understanding of the skill sets it has within the business already, the best practice examples for approaching different business scenarios, and an awareness of new tools and solutions that could help, it provides the most solid basis for working out where the talent pool needs to be extended with new hires or training.

Similarly, it’s important to encourage knowledge-sharing and innovation within this community. Organising team hackathons to boost cross-function collaboration and new ideas can be a great example of this, while hosting internal events which showcase successes can help motivate the team to deliver creative new solutions.

 

Encourage collaboration and knowledge-sharing

Building relationships between the data team and the business is critical for two things: ensuring the data science team understands the business’ problems and is producing useful insights, and for helping to “demystify” the data science process. Ensuring that the business as a whole has insight into the data available, how it can and cannot be used, and encouraging a dialogue between technical and non-technical professionals will foster curiosity and trust that ensures a productive data-driven culture.

How can the C-suite go about doing this? By organising short workshops that focus on delivering real ideas. If the existing data scientists can show the business more broadly how data science techniques can lead to real business outcomes that improve success for a business area, it is likely to encourage enthusiasm and optimism about the potential of data. As a result, this is more likely to drive members of the business team to look further into the types of analytics they might be able to learn and apply, as well as seek out both at-work and external training to support this.

 

Understand that everyone, to some extent, needs to be data literate

Drew Conway’s famous Venn diagram on what data science is made of stressed the importance of substantive expertise as an integral part of data science – and encouraging a culture of collaboration between business and analytics function ensures that this is the case. However, an argument can also be made in the reverse. Ensuring that the business has enough knowledge of data science to be able to accurately reflect and act on the findings of data-driven insights is just as important. Without this, insights discovered by the data science community will not be able to have the fullest possible impact on the business, or, at worst, could even end up being misunderstood and misinterpreted. Enabling training at all levels of data awareness will be critical – and this should even include training on how to use information to guide decision-making, and other non-technical topics.

This sort of training, repeatedly reinforced, will be essential for cutting through any data apathy that exists within the organisation. It’s important that even those resistant to change understand that the business is developing to be data-driven, and that a culture shift will come as part of this. Having a connected community of evangelists, both in the form of technical experts, and business enthusiasts who can continue to spread the message will be invaluable.

With these three steps, C-suite executives will find it far easier to grow data scientists throughout the organisation and invest more effectively in hiring new talent to fill any skills gaps. By encouraging a culture of data curiosity, and an awareness of the power and potential of data, as well as an interest in learning more, businesses are creating fertile ground to inspire a new generation of data scientists from any background.

Photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash


To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.


The Chaos Game: an experiment about fractals, recursivity and creative coding


[This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Mathematics, rightly viewed, possesses not only truth, but supreme beauty (Bertrand Russell)

You have a pentagon defined by its five vertices. Now, follow these steps:

  • Step 0: take a point inside the pentagon (it can be its center if you want to keep it easy). Keep this point in a safe place.
  • Step 1: choose a vertex randomly and take the midpoint between both of them (the vertex and the original point). Keep this new point as well. Repeat Step 1 one more time.
  • Step 2: compare the last two vertices that you have chosen. If they are the same, choose another vertex with this condition: if it’s not a neighbor of the last vertex you chose, keep it; if it is a neighbor, choose another vertex randomly until you pick one that is not a neighbor. Then, take the midpoint between the last point you obtained and this new vertex. Keep this new point as well.
  • Step 3: repeat Step 2 a number of times and, after that, plot the set of points that you obtained (a minimal R sketch of these steps follows below).
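To make the algorithm concrete, here is a minimal base R sketch of those steps (not the Rcpp implementation used for the images below, and only 10,000 points so it runs in a blink):

set.seed(1)

n_points <- 10000
vertices <- cbind(cos(2 * pi * (0:4) / 5), sin(2 * pi * (0:4) / 5))  # pentagon vertices

is_neighbor <- function(a, b) min(abs(a - b), 5 - abs(a - b)) == 1

pts     <- matrix(NA_real_, nrow = n_points, ncol = 2)
current <- c(0, 0)                        # Step 0: start at the center
prev    <- sample(5, 2, replace = TRUE)   # the two most recently chosen vertices

for (i in seq_len(n_points)) {
  v <- sample(5, 1)
  # Step 2: if the last two vertices were identical, redraw until the new
  # vertex is not a neighbor of them
  if (prev[1] == prev[2]) {
    while (is_neighbor(v, prev[2])) v <- sample(5, 1)
  }
  current  <- (current + vertices[v, ]) / 2   # midpoint with the chosen vertex
  pts[i, ] <- current
  prev     <- c(prev[2], v)
}

plot(pts, pch = ".", asp = 1, axes = FALSE, xlab = "", ylab = "")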

If you repeat these steps 10 million times, you will obtain this stunning image:

I love the incredible ability of maths to create beauty. More concretely, I love the fact of how repeating extremely simple operations can bring you to unexpected places. Would you expect that the image created with the initial naive algorithm would be that? I wouldn’t. Even knowing the result I cannot imagine how those simple steps can produce it.

The image generated by all the points repeats itself at different scales. This characteristic, called self-similarity, is a property of fractals and makes them extremely attractive. Step 2 is the key one in defining the shape of the image. Apart from comparing the two previous vertices as defined in the algorithm above, I implemented two other versions:

  • one version where the currently chosen vertex cannot be the same as the previously chosen vertex.
  • another one where the currently chosen vertex cannot neighbor the previously chosen vertex if the three previously chosen vertices are the same (note that this implementation is the same as the original but comparing the three previous vertices instead of two).

These images are the result of applying the three versions of the algorithm to a square, a pentagon, a hexagon and a heptagon (a row for each polygon and a column for each algorithm):

From a technical point of view I used Rcpp to generate the set of points. Since each iteration depends on the previous one, the loop cannot easily be vectorised, and C++ is a perfect option to avoid the bottleneck of iterating in plain R. In this case, instead of writing the C++ directly inside the R file with cppFunction(), I used a stand-alone C++ file called chaos_funcs.cpp to write the C++ code, which I load into R using sourceCpp().

Some days ago, I gave a tutorial at the coding club of the University Carlos III in Madrid where we worked with the integration of C++ and R to create beautiful images of strange attractors. The tutorial and the code we developed is here. You can also find the code of this experiment here. Enjoy!


To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


Spelunking macOS ‘ScreenTime’ App Usage with R


[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Apple has had Screen Time on macOS for some time now, and that means it has to store this data somewhere. Thankfully, Sarah Edwards has foraged through the macOS filesystem for us and explained where these bits of knowledge are in her post, Knowledge is Power! Using the macOS/iOS knowledgeC.db Database to Determine Precise User and Application Usage, which ultimately reveals that the data lurks in ~/Library/Application Support/Knowledge/knowledgeC.db. Sarah also has a neat little Python utility dubbed APOLLO (Apple Pattern of Life Lazy Output’er) which has a smattering of canned knowledgeC.db SQL queries that cover a myriad of tracked items.

Today, we’ll show how to work with this database in R and the {tidyverse} to paint our own pictures of application usage.

There are quite a number of tables in the knowledgeC.db SQLite 3 database:

That visual schema was created in OmniGraffle via a small R script that uses the OmniGraffle automation framework. The OmniGraffle source files are also available upon request.

Most of the interesting bits (for any tracking-related spelunking) are in the ZOBJECT table and to get a full picture of usage we’ll need to join it with some other tables that are connected via a few foreign keys:

There are a few ways to do this in {tidyverse} R. The first is an extended straight SQL riff off of one of Sarah’s original queries:

library(hrbrthemes) # for ggplot2 machinations
library(tidyverse)

# source the knowledge db
kdb <- src_sqlite("~/Library/Application Support/Knowledge/knowledgeC.db")

tbl(
  kdb, 
  sql('SELECT
    ZOBJECT.ZVALUESTRING AS "app", 
    (ZOBJECT.ZENDDATE - ZOBJECT.ZSTARTDATE) AS "usage",  
    CASE ZOBJECT.ZSTARTDAYOFWEEK 
      WHEN "1" THEN "Sunday"
      WHEN "2" THEN "Monday"
      WHEN "3" THEN "Tuesday"
      WHEN "4" THEN "Wednesday"
      WHEN "5" THEN "Thursday"
      WHEN "6" THEN "Friday"
      WHEN "7" THEN "Saturday"
    END "dow",
    ZOBJECT.ZSECONDSFROMGMT/3600 AS "tz",
    DATETIME(ZOBJECT.ZSTARTDATE + 978307200, \'UNIXEPOCH\') as "start_time", 
    DATETIME(ZOBJECT.ZENDDATE + 978307200, \'UNIXEPOCH\') as "end_time",
    DATETIME(ZOBJECT.ZCREATIONDATE + 978307200, \'UNIXEPOCH\') as "created_at", 
    CASE ZMODEL
      WHEN ZMODEL THEN ZMODEL
      ELSE "Other"
    END "source"
  FROM
    ZOBJECT 
    LEFT JOIN ZSTRUCTUREDMETADATA ON ZOBJECT.ZSTRUCTUREDMETADATA = ZSTRUCTUREDMETADATA.Z_PK 
    LEFT JOIN ZSOURCE ON ZOBJECT.ZSOURCE = ZSOURCE.Z_PK 
    LEFT JOIN ZSYNCPEER ON ZSOURCE.ZDEVICEID = ZSYNCPEER.ZDEVICEID
  WHERE
    ZSTREAMNAME = "/app/usage"'
  )) -> usage

usage
## # Source:   SQL [?? x 8]
## # Database: sqlite 3.29.0 [/Users/johndoe/Library/Application Support/Knowledge/knowledgeC.db]
##    app                      usage dow         tz start_time          end_time            created_at         source       
##                                                                                  
##  1 com.bitrock.appinstaller    15 Friday      -4 2019-10-05 01:11:27 2019-10-05 01:11:42 2019-10-05 01:11:… MacBookPro13…
##  2 com.tinyspeck.slackmacg…  4379 Tuesday     -4 2019-10-01 13:19:24 2019-10-01 14:32:23 2019-10-01 14:32:… Other        
##  3 com.tinyspeck.slackmacg…  1167 Tuesday     -4 2019-10-01 18:19:24 2019-10-01 18:38:51 2019-10-01 18:38:… Other        
##  4 com.tinyspeck.slackmacg…  1316 Tuesday     -4 2019-10-01 19:13:49 2019-10-01 19:35:45 2019-10-01 19:35:… Other        
##  5 com.tinyspeck.slackmacg… 12053 Thursday    -4 2019-10-03 12:25:18 2019-10-03 15:46:11 2019-10-03 15:46:… Other        
##  6 com.tinyspeck.slackmacg…  1258 Thursday    -4 2019-10-03 15:50:16 2019-10-03 16:11:14 2019-10-03 16:11:… Other        
##  7 com.tinyspeck.slackmacg…  2545 Thursday    -4 2019-10-03 16:24:30 2019-10-03 17:06:55 2019-10-03 17:06:… Other        
##  8 com.tinyspeck.slackmacg…   303 Thursday    -4 2019-10-03 17:17:10 2019-10-03 17:22:13 2019-10-03 17:22:… Other        
##  9 com.tinyspeck.slackmacg…  9969 Thursday    -4 2019-10-03 17:33:38 2019-10-03 20:19:47 2019-10-03 20:19:… Other        
## 10 com.tinyspeck.slackmacg…  2813 Thursday    -4 2019-10-03 20:19:52 2019-10-03 21:06:45 2019-10-03 21:06:… Other        
## # … with more rows

Before explaining what that query does, let’s rewrite it {dbplyr}-style:

tbl(kdb, "ZOBJECT") %>%   mutate(    created_at = datetime(ZCREATIONDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),    start_dow = case_when(      ZSTARTDAYOFWEEK == 1 ~ "Sunday",      ZSTARTDAYOFWEEK == 2 ~ "Monday",      ZSTARTDAYOFWEEK == 3 ~ "Tuesday",      ZSTARTDAYOFWEEK == 4 ~ "Wednesday",      ZSTARTDAYOFWEEK == 5 ~ "Thursday",      ZSTARTDAYOFWEEK == 6 ~ "Friday",      ZSTARTDAYOFWEEK == 7 ~ "Saturday"    ),    start_time = datetime(ZSTARTDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),    end_time = datetime(ZENDDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),    usage = (ZENDDATE - ZSTARTDATE),    tz = ZSECONDSFROMGMT/3600   ) %>%   left_join(tbl(kdb, "ZSTRUCTUREDMETADATA"), c("ZSTRUCTUREDMETADATA" = "Z_PK")) %>%   left_join(tbl(kdb, "ZSOURCE"), c("ZSOURCE" = "Z_PK")) %>%   left_join(tbl(kdb, "ZSYNCPEER"), "ZDEVICEID") %>%   filter(ZSTREAMNAME == "/app/usage")  %>%   select(    app = ZVALUESTRING, created_at, start_dow, start_time, end_time, usage, tz, source = ZMODEL  ) %>%   mutate(source = ifelse(is.na(source), "Other", source)) %>%   collect() %>%   mutate_at(vars(created_at, start_time, end_time), as.POSIXct) -> usage

What we’re doing is pulling out the day of week, start/end usage times & timezone info, app bundle id, source of the app interactions and the total usage time for each entry along with when that entry was created. We need to do some maths since Apple stores time-y whime-y info in its own custom format, plus we need to convert numeric DOW to labeled DOW.
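The magic number in those queries is worth a note: Core Data timestamps count seconds from 2001-01-01 00:00:00 UTC rather than from the Unix epoch, which is why 978307200 (the number of seconds between the two epochs) gets added before conversion. A tiny sketch of that conversion on its own:

# Core Data timestamps are seconds since 2001-01-01 00:00:00 UTC
core_data_to_posix <- function(x, tz = "UTC") {
  as.POSIXct(x + 978307200, origin = "1970-01-01", tz = tz)
}

core_data_to_posix(0)
## [1] "2001-01-01 UTC"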

The bundle ids are pretty readable, but they’re not really intended for human consumption, so we’ll make a translation table for the bundle id to app name by using the mdls command.

list.files(
  c("/Applications", "/System/Library/CoreServices", "/Applications/Utilities", "/System/Applications"), # main places apps are stored (there are potentially more but this is sufficient for our needs)
  pattern = "\\.app$", 
  full.names = TRUE
) -> apps

x <- sys::exec_internal("mdls", c("-name", "kMDItemCFBundleIdentifier", "-r", apps))

# mdls null (\0) terminates each entry so we have to do some raw surgery to get it into a format we can use
x$stdout[x$stdout == as.raw(0)] <- as.raw(0x0a)

tibble(
  name = gsub("\\.app$", "", basename(apps)),
  app = read_lines(x$stdout) 
) -> app_trans

app_trans
## # A tibble: 270 x 2
##    name                    app                                    
##                                                         
##  1 1Password 7             com.agilebits.onepassword7             
##  2 Adium                   com.adiumX.adiumX                      
##  3 Agenda                  com.momenta.agenda.macos               
##  4 Alfred 4                com.runningwithcrayons.Alfred          
##  5 Amazon Music            com.amazon.music                       
##  6 Android File Transfer   com.google.android.mtpviewer           
##  7 Awsaml                  com.rapid7.awsaml                      
##  8 Bartender 2             com.surteesstudios.Bartender           
##  9 BBEdit                  com.barebones.bbedit                   
## 10 BitdefenderVirusScanner com.bitdefender.BitdefenderVirusScanner
## # … with 260 more rows

The usage info goes back ~30 days, so let’s do a quick summary of the top 10 apps and their total usage (in hours):

usage %>% 
  group_by(app) %>% 
  summarise(first = min(start_time), last = max(end_time), total = sum(usage, na.rm=TRUE)) %>% 
  ungroup() %>% 
  mutate(total = total / 60 / 60) %>% # hours
  arrange(desc(total)) %>% 
  left_join(app_trans) -> overall_usage

overall_usage %>% 
  slice(1:10) %>% 
  left_join(app_trans) %>%
  mutate(name = fct_inorder(name) %>% fct_rev()) %>%
  ggplot(aes(x=total, y=name)) + 
  geom_segment(aes(xend=0, yend=name), size=5, color = ft_cols$slate) +
  scale_x_comma(position = "top") +
  labs(
    x = "Total Usage (hrs)", y = NULL,
    title = glue::glue('App usage in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days')
  ) +
  theme_ft_rc(grid="X")

There’s a YUGE flaw in the current way macOS tracks application usage. Unlike iOS where apps really don’t run simultaneously (with iPadOS they kinda can/do, now), macOS apps are usually started and left open along with other apps. Apple doesn’t do a great job identifying only active app usage activity so many of these usage numbers are heavily inflated. Hopefully that will be fixed by macOS 10.15.

We have more data at our disposal, so let’s see when these apps get used. To do that, we’ll use segments to plot individual usage tracks and color them by weekday/weekend usage (still limiting to top 10 for blog brevity):

usage %>%   filter(app %in% overall_usage$app[1:10]) %>%   left_join(app_trans) %>%  mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>%   ggplot() +  geom_segment(    aes(      x = start_time, xend = end_time, y = name, yend = name,       color = ifelse(start_dow %in% c("Saturday", "Sunday"), "Weekend", "Weekday")    ),    size = 10,  ) +  scale_x_datetime(position = "top") +  scale_colour_manual(    name = NULL,    values = c(      "Weekend" = ft_cols$light_blue,       "Weekday" = ft_cols$green    )  ) +  guides(    colour = guide_legend(override.aes = list(size = 1))  ) +  labs(    x = NULL, y = NULL,    title = glue::glue('Top 10 App usage on this Mac in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days'),    subtitle = "Each segment represents that app being 'up' (Open to Quit).\nUnfortunately, this is what Screen Time uses for its calculations on macOS"  ) +  theme_ft_rc(grid="X") +  theme(legend.position = c(1, 1.25)) +  theme(legend.justification = "right")

I’m not entirely sure “on this Mac” is completely accurate since I think this syncs across all active Screen Time devices due to this (n is in seconds):

count(usage, source, wt=usage, sort=TRUE)
## # A tibble: 2 x 2
##   source               n
##                         
## 1 Other          4851610
## 2 MacBookPro13,2 1634137

The “Other” appears to be the work-dev Mac but it doesn’t have the identifier mapped so I think that means it’s the local one and that the above chart is looking at Screen Time across all devices. I literally (right before this sentence) enabled Screen Time on my iPhone so we’ll see if that ends up in the database and I’ll post a quick update if it does.

We’ll take one last look by day of week and use a heatmap to see the results:

count(usage, start_dow, app, wt=usage/60/60) %>%   left_join(app_trans) %>%  filter(app %in% overall_usage$app[1:10]) %>%   mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>%   mutate(start_dow = factor(start_dow, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>%   ggplot() +  geom_tile(aes(start_dow, name, fill = n), color = "#252a32", size = 0.75) +  scale_x_discrete(expand = c(0, 0.5), position = "top") +  scale_y_discrete(expand = c(0, 0.5)) +  scale_fill_viridis_c(direction = -1, option = "magma", name = "Usage (hrs)") +  labs(    x = NULL, y = NULL,    title = "Top 10 App usage by day of week"  ) +  theme_ft_rc(grid="")

I really need to get into the habit of using the RStudio Server access features of RSwitch over Chrome so I can get RSwitch into the top 10, but some habits (and bookmarks) die hard.

FIN

Apple’s Screen Time also tracks “category”, which is something we can pick up from each application’s embedded metadata. We’ll do that in a follow-up post along with seeing whether we can capture iOS usage now that I’ve enabled Screen Time on those devices as well.

Keep spelunking the knowledgeC.db table(s) and blog about, or reply in the comments with, any interesting nuggets you find.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.



Dogs of New York


[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

The other week I took a few publicly-available datasets that I use for teaching data visualization and bundled them up into an R package called nycdogs. The package has datasets on various aspects of dog ownership in New York City, and amongst other things you can draw maps with it at the zip code level. The package homepage has installation instructions and an example.
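For reference, installation follows the usual GitHub-package routine (the repository name below is an assumption; check the package homepage for the canonical instructions):

# install.packages("remotes")
remotes::install_github("kjhealy/nycdogs")
library(nycdogs)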

Using this data, I made a poster called Dogs of New York. It’s a small multiple map of the relative prevalence of the twenty five most common dog breeds in the city, based on information from the city’s dog license database. This morning, reflecting on a series of conversations I had with people about the original poster, I tweaked it a little further. In the original version, I used a discrete version of one of the Viridis palettes, initially the “Inferno” variant, and subsequently the “Plasma” one. These palettes are really terrific in everyday use for visualizations of all kinds, because they are vivid, colorblind-friendly, and perceptually uniform across their scale. I use them all the time, for instance with the Mortality in France poster I made last year.

When it came to using this palette in a map, though, I ran into an interesting problem. Here’s a detail from the “Plasma” palette version of the poster.

Dogs of New York - Plasma version

Detail from the Plasma version of Dogs of New York

Now, these colors are vivid. But when I showed it to people, opinion was pretty evenly split between people who intuitively saw the darker, purplish areas as signifying “more”, and people who intuitively saw the warmer, yellowish areas as signifying “more”. So, for example, a number of people asked if I could make the map with the colors range flipped, with yellow meaning “more” or “high” (and indeed, in the very first version of the map I originally had done this). A friend with conflicting intuitions incisively noted that she associated darker colors with “more”, in contrast to lighter colors, but also associated warmer colors with “more”. The fact that the scale moved from a cooler, darker color through to a different, warmer, lighter color was thus confusing. The warm and vivid quality of the yellow end of the Plasma spectrum seemed to be particularly prone to this confusion.

I think the small-multiple character of the graph exacerbated this confusion. It shows the selected dog breeds from most (top left) to least (bottom right) common, whereas the guide to the scale (in the top right) showed the scale running from low to high values, or least to most common.

In the end, after a bit of experimentation I decided to redo the figure in a one-hue HCL scale, the “Oranges” palette from the Colorspace package. I also reversed the guide and added a few other small details to aid the viewer in interpreting the graph. Here’s the final version.

Dogs of New York Poster

Dogs of New York

The palette isn’t quite as immediately vivid as the Viridis version, but it seems to solve the problems of interpretation, and that’s more important. The image is made to be blown up to quite a large size, which I plan on doing myself. To that end, here’s a PDF version of the poster as well.
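For anyone wanting to try the same swap, here is a minimal sketch (not the poster's actual code; p stands in for any ggplot object with a continuous fill aesthetic):

library(ggplot2)
library(colorspace)

# p: an existing choropleth with a continuous fill (assumed)
p +
  scale_fill_continuous_sequential(palette = "Oranges") +
  guides(fill = guide_colorbar(reverse = TRUE)) # run the legend from high to low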


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.


(Re)introducing skimr v2 – A year in the life of an open source R project


[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Theme song: PSA by Jay-Z

We announced the testing version of skimr v2 on June 19, 2018. After more than a year of (admittedly intermittent) work, we’re thrilled to be able to say that the package is ready to go to CRAN. So, what happened over the last year? And why are we so excited for v2?

Wait, what is a “skimr”?

skimr is an R package for summarizing your data. It extends tidyverse packages, and dplyr in particular, so that you can get a broad set of summary statistics with a single function call. You can install a pre-release version from the package’s GitHub repo.

devtools::install_github("ropensci/skimr")

skimr is also on CRAN, and v2 should be appearing there soon. For those of you that might have never seen skimr, here’s a typical call.

library(skimr)
library(dplyr)
options(width = 90)

skim(iris)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  4
## ________________________
## Group variables            None
##
## ── Variable type: factor ─────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
## 1 Sepal.Length          0             1  5.84 0.828   4.3   5.1  5.8    6.4   7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1  3.06 0.436   2     2.8  3      3.3   4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1  3.76 1.77    1     1.6  4.35   5.1   6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1  1.20 0.762   0.1   0.3  1.3    1.8   2.5 ▇▁▇▅▃

Setting the stage

Before we can talk about the last year of skimr development, we need to lay out the timeline that got us to this point. For those deeply enmeshed in skimr lore, all dozens of you, bear with.

skimr was originally an rOpenSci unconf17 project, a big collaboration between eight different participants that resulted in a conceptual outline of the package and a basic working version. Participating in the unconf was a truly magical experience, with everyone bringing a tremendous amount of energy and ideas to the project, and implementation happening over a flurry of “fancy git commits”.

About six months later, we released our first version on CRAN. The time between these two milestones was mostly spent on fleshing out all of the different ideas that were generated during the unconf (like handling grouped data frames) and fixing all the bugs we discovered along the way.

Getting the package on CRAN opened the gates for bug reports and feature requests on GitHub. About the same time we pushed our first version to CRAN, Elin got skimr’s rOpenSci package peer review started (thank you Jennifer and Jim!), opening another incredibly useful channel for collecting feedback on the package. All of these new ideas and suggestions gave us the opportunity to really push skimr to the next level, but doing that would require rethinking the package, from the ground up.

A month after finishing the peer review (and six months after the process began), we announced v2. Over the first phase of skimr’s life, we accumulated 700 commits, two releases, 400 GitHub stars, 95 percent code coverage and a lifetime’s worth of unicode rendering bugs!

Just kidding! We love our little histograms, even when they don’t love us back!

Getting it right

Under normal circumstances (i.e. not during a hackathon), most software engineering projects begin with a design phase and series of increasingly detailed design docs. skimr is only a few hundred lines of code, which means “increasingly detailed design docs” translates to one doc. But we did actually write it! It’s here. And it still does a good job of laying out some of the big ideas we were interested in taking on for v2.

  • Eliminating frictions that resulted from differences in the way we stored data vs how it was displayed to users
  • Getting away from using a global environment to configure skimr
  • Making it easier for others to extend skimr
  • Creating more useful ways to use skimr

Better internal data structures

In v1, skimr stored all of its data in a “long format” data frame. Although hidden from the user by its print methods, this format would appear any time you’d try to do something with the results of a skim() call. It looked something like this:

skim(mtcars) %>% dplyr::filter(stat=="hist")

# A tibble: 11 x 6
   variable type    stat  level value formatted
             
 1 mpg      numeric hist  .all     NA ▃▇▇▇▃▂▂▂
 2 cyl      numeric hist  .all     NA ▆▁▁▃▁▁▁▇
 3 disp     numeric hist  .all     NA ▇▆▁▂▅▃▁▂
 4 hp       numeric hist  .all     NA ▃▇▃▅▂▃▁▁
 5 drat     numeric hist  .all     NA ▃▇▁▅▇▂▁▁
 6 wt       numeric hist  .all     NA ▃▃▃▇▆▁▁▂
 7 qsec     numeric hist  .all     NA ▃▂▇▆▃▃▁▁
 8 vs       numeric hist  .all     NA ▇▁▁▁▁▁▁▆
 9 am       numeric hist  .all     NA ▇▁▁▁▁▁▁▆
10 gear     numeric hist  .all     NA ▇▁▁▆▁▁▁▂
11 carb     numeric hist  .all     NA ▆▇▂▇▁▁▁▁

Big ups to anyone who looked at the rendered output and saw that this was how you actually filtered the results. Hopefully there are even better applications of your near-telepathic abilities.

Now, working with skimr is a bit more sane.

skimmed <- iris %>%
  skim() %>%
  dplyr::filter(numeric.sd > 1)

skimmed

## ── Data Summary ────────────────────────
##                            Values
## Name                       Piped data
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

And

dplyr::glimpse(skimmed)

## Observations: 1
## Variables: 15
## $ skim_type          "numeric"
## $ skim_variable      "Petal.Length"
## $ n_missing          0
## $ complete_rate      1
## $ factor.ordered     NA
## $ factor.n_unique    NA
## $ factor.top_counts  NA
## $ numeric.mean       3.758
## $ numeric.sd         1.765298
## $ numeric.p0         1
## $ numeric.p25        1.6
## $ numeric.p50        4.35
## $ numeric.p75        5.1
## $ numeric.p100       6.9
## $ numeric.hist       "▇▁▆▇▂"

It’s still not perfect, as you need to rely on a pseudo-namespace to refer to the column that you want. But this is unfortunately a necessary trade-off. As the Rstats Bible, errr Hadley Wickham’s Advanced R, states, all elements of an atomic vector must have the same type. This normally isn’t something that you have to think too much about, that is until you try to combine the means of all your Date columns with the means of your numeric columns and everything comes out utterly garbled. So instead of that basket of laughs, we prefix column names by their data type.
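A tiny, hypothetical illustration of that coercion problem (not skimr code): combine a Date summary and a numeric summary in one vector and the date silently degrades to a day count.

date_mean    <- mean(as.Date(c("2019-10-01", "2019-10-03")))  # a Date: "2019-10-02"
numeric_mean <- mean(c(1.5, 2.5))                             # a number: 2

# c() must return a single type, so the Date loses its class
c(numeric_mean, date_mean)
## [1]     2 18171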

There’s a couple of other nuances here:

  • The data frame skim() produces always starts off with some metadata columns
  • Functions that always produce the same type of result, regardless of input type, can be treated as base_skimmers and don’t need a namespace

Manipulating internal data

A better representation of internal data comes with better tools for reshaping the data and extracting it for other contexts. A common request in v1 was tooling to handle the skimr subtables separately. We now do this with partition(). It replaces the v1 function skim_to_list().

partition(skimmed)

## $numeric
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

You can undo a call to partition() with bind(), which joins the subtables into the original skim_df object and properly accounts for metadata. You can skip a step with the function yank(), which calls partition() and pulls out a particular subtable.
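As a quick aside (a sketch based on the description above, not output shown in the original post), the two functions round-trip:

# partition() splits by type; bind() reassembles the original skim_df
bind(partition(skimmed))

And yank() goes straight to a single subtable: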

yank(skimmed, "numeric")

##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist
## 1 Petal.Length          0             1  3.76  1.77     1   1.6  4.35   5.1   6.9 ▇▁▆▇▂

Last, we support something close to the older format with the to_long() function, which can be used for something close to backwards compatibility. Being realistic about open source sustainability means that we are not able to support 100% backwards compatibility in v2, even with new functions. Meanwhile you can keep using v1 if you are happy with it. However, because skimr’s dependencies are under ongoing development, sooner or later skimr v1 will no longer work with updates to them.

Working with dplyr

Using skimr in a dplyr pipeline was part of the original package design, and we’ve needed to devote some extra love to making sure that everything is as seamless as possible. Part of this is due to the object produced by skim(), which we call a skim_df. It’s a little weird in that it needs both metadata and columns in the underlying data frame.

In practice, this means that you can coerce it into a different type through normal dplyr operations. Here’s one:

select(skimmed, numeric.mean)

## # A tibble: 1 x 1
##   numeric.mean
##          
## 1         3.76

To get around this, we’ve added some helper functions and methods. The more skimr-like replacement for select() is focus(), which preserves metadata columns.

focus(skimmed, numeric.mean)

## ── Data Summary ────────────────────────
##                            Values
## Name                       Piped data
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable  mean
## 1 Petal.Length   3.76

Configuring and extending skimr

Most of skimr’s magic, to steal a term, comes from the fact that you can do most everything with one function. But believe it or not, there’s actually a bit more to the package.

One big one is customization. We like the skimr defaults, but that doesn’t guarantee you will. So what if you want to do something different? We have a function factory for that!

my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))
my_skim(faithful)

## ── Data Summary ────────────────────────
##                            Values
## Name                       faithful
## Number of rows             272
## Number of columns          2
## _______________________
## Column type frequency:
##   numeric                  2
## ________________________
## Group variables            None
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p50  p100 hist    iqr
## 1 eruptions             0             1  3.49  1.14   1.6     4   5.1 ▇▂▂▇▇  2.29
## 2 waiting               0             1 70.9  13.6   43      76  96   ▃▃▂▇▂ 24

Those of you familiar with customizing skim() in v1 will notice a couple differences:

  • we now have an object called sfl() for managing skimr function lists; more below
  • instead of setting global options, we now have a function factory

Yes! A function factory. skim_with() gives us a new function each time we call it, and the returned function is configured by the arguments in skim_with(). This works the same way as ecdf() in the stats package or colorRamp in grDevices. Creating new functions has a few advantages over the previous approach.

  • you can export a skim() function in a package or create it in a .Rprofile
  • you avoid a bunch of potential side effects from setting options with skim_with()

The other big change is how we now handle different data types. Although many will never see it, a key piece of skimr customization comes from the get_skimmers() generic. It’s used to detect different column types in your data and set the appropriate summary functions for that type. It’s also designed to work with sfl(). Here’s an example from the “Supporting additional objects” vignette, in which we create some skimmers for sf data types:

get_skimmers.sfc_POINT <- function(column) {
  sfl(
    skim_type = "sfc_POINT",
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.))
  )
}

Unlike in skim_with(), users must provide a skim_type value when creating new methods. With that, you can export this method in a new package (be sure to import the generic), and the new default skimmer is added when you load the package.

get_default_skimmer_names()

...
$sfc_POINT
[1] "missing"  "complete" "n"        "n_unique" "valid"
...

Even if you don’t go the full route of supporting a new data type, creating a couple of skimr function lists has other benefits. For example, you can add some to your .Rprofile as a way to quickly configure skimr interactively.

sfc_point_sfl <- sfl(
  n_unique = n_unique,
  valid = ~ sum(sf::st_is_valid(.))
)

my_skimmer <- skim_with(sfc_POINT = sfc_point_sfl)

Using skimr in other contexts

In skimr v1, we developed some slightly hacky approaches to getting nicer skim() output in RMarkdown docs. These have been removed in favor of the actually-supported knit_print API. Now, calling skim() within an RMarkdown doc should produce something nice by default.

skim(chickwts)
Data summary
Name                       chickwts
Number of rows             71
Number of columns          2
_______________________
Column type frequency:
  factor                   1
  numeric                  1
________________________
Group variables            None

Variable type: factor

  skim_variable n_missing complete_rate ordered n_unique top_counts
1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12

Variable type: numeric

  skim_variable n_missing complete_rate   mean    sd  p0   p25 p50   p75 p100 hist
1 weight                0             1 261.31 78.07 108 204.5 258 323.5  423 ▆▆▇▇▃

You get a nice html version of both the summary header and the skimr subtables for each type of data.

In this context, you configure the output the same way you handle other knitr code chunks.

This means that we’re dropping direct support for kable.skim_df() and pander.skim_df(). But you can still get pretty similar results to these functions by using the reshaping functions described above to get subtables. You can also still use Pander and other nice rendering packages on an ad hoc basis as you would for other data frames or tibbles.

We also have a similarly-nice rendered output in Jupyter and RMarkdown notebooks. In the latter, the summary is separated from the rest of the output when working interactively. We like it that way, but we’d be happy to hear what the rest of you think!

Wait, that took over a year?

Well, we think that’s a lot! But to be fair, it wasn’t exactly simple to keep up with skimr. Real talk, open source development takes up a lot of time, and the skimr developers have additional important priorities. Michael’s family added a new baby, and despite swearing up and down otherwise, he got absolutely nothing not-baby-related done during his paternity leave (take note new dads!). Elin ended up taking a much bigger role on at Lehman, really limiting time for any other work.

Even so, these are just the highlights in the normal ebb and flow of this sort of work. Since it’s no one’s real job, it might not always be the first focus. And that’s OK! We’ve been really lucky to have a group of new users that have been very patient with this slow development cycle while still providing really good feedback throughout. Thank you all!

We’re really excited about this next step in the skimr journey. We’ve put a huge amount of work into this new version. Hopefully it shows. And hopefully it inspires some of you to send more feedback and help us find even more ways to improve!

If you want to learn more about skimr, check out our GitHub repo. GitHub is also the best place to file issues. We’d love to hear from you!


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.


Any one interested in a function to quickly generate data with many predictors?


[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

A couple of months ago, I was contacted about the possibility of creating a simple function in simstudy to generate a large dataset that could include possibly 10’s or 100’s of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other “regularization” methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.

I’m presenting a new function here as a work-in-progress. I am putting it out there in case other folks have opinions about what might be most useful; feel free to let me know if you do. If not, I am likely to include something very similar to this in the next iteration of simstudy, which will be version 0.1.16.

Function genMultPred

In its latest iteration, the new function has three interesting arguments. The first two are predNorm and predBin, which are each vectors of length 2. The first value indicates the number of predictors to generate with either a standard normal distribution or a binary distribution, respectively. The second value in each vector represents the number of variables that will actually be predictive of the outcome. (Obviously, the second value cannot be greater than the first value.)

The third interesting argument is corStrength, which is a non-negative number indicating the overall strength of the correlation between the predictors. When corStrength is set to 0 (which is the default), the variables are generated assuming independence. When corStrength is non-zero, a random correlation matrix is generated using package clusterGeneration [Weiliang Qiu and Harry Joe. (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).] The corStrength value is passed on to the argument ratioLambda in the function genPositiveDefMat. As the value of corStrength increases, higher levels of correlation are induced in the random correlation matrix for the predictors.

Currently, the outcome can only have one of three distributions: normal, binomial, or Poisson.

One possible enhancement would be to allow the distributions of the predictors to have more flexibility. However, I’m not sure the added complexity would be worth it. Again, you could always take the more standard simstudy approach of function genData if you wanted more flexibility.
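For contrast, a minimal sketch of that standard approach (a generic example, not tied to the function below): define the variables explicitly, then generate.

library(simstudy)

# explicit definitions: two predictors and an outcome that depends only on x1
def <- defData(varname = "x1", formula = 0, variance = 1, dist = "normal")
def <- defData(def, varname = "x2", formula = 0.5, dist = "binary")
def <- defData(def, varname = "y", formula = "2 * x1", variance = 1, dist = "normal")

dd <- genData(500, def)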

Here’s the function, in case you want to take a look under the hood:

genMultPred <- function(n, predNorm, predBin, 
                        dist = "normal", sdy = 1, corStrength = 0) {
  
  normNames <- paste0("n", 1:predNorm[1])
  binNames <- paste0("b", 1:predBin[1])
  
  ## Create the definition tables to be used by genData
  
  defn <- data.table(varname = normNames,
                     formula = 0,
                     variance = 1,
                     dist = "normal",
                     link = "identity")
  
  defb <- data.table(varname = binNames,
                     formula = 0.5,
                     variance = NA,
                     dist = "binary",
                     link = "identity")
  
  defx <- rbind(defn, defb)
  attr(defx, which = "id") <- "id"
  
  ## Create the coefficient values - all normally distributed
  
  ncoefs <- rnorm(predNorm[1], 0, 1)
  setzero <- sample(1:predNorm[1], (predNorm[1] - predNorm[2]), 
                    replace = FALSE)
  ncoefs[setzero] <- 0
  
  bcoefs <- rnorm(predBin[1], 0, 1)
  setzero <- sample(1:predBin[1], (predBin[1] - predBin[2]), 
                    replace = FALSE)
  bcoefs[setzero] <- 0
  
  coefs <- c(ncoefs, bcoefs)
  names(coefs) <- c(normNames, binNames)
  
  ## Generate the predictors
  
  if (corStrength <= 0) {     # predictors are independent
    
    dx <- genData(n, defx)
    
  } else {
    
    rLambda <- max(1, corStrength)
    covx <- cov2cor(genPositiveDefMat(nrow(defx), 
                        lambdaLow = 1, ratioLambda = rLambda)$Sigma)
    dx <- genCorFlex(n, defx, corMatrix = covx)
    
  }
  
  ## Generate the means (given the predictors)
  
  mu <- as.matrix(dx[,-"id"]) %*% coefs
  dx[, mu := mu]
  
  ## Generate the outcomes based on the means
  
  if (dist == "normal") {
    dx[, y := rnorm(n, mu, sdy)]
  } else if (dist == "binary") {
    dx[, y := rbinom(n, 1, 1/(1 + exp(-mu)))]  # link = logit
  } else if (dist == "poisson") {
    dx[, y := rpois(n, exp(mu))]               # link = log
  } 
  
  dx[, mu := NULL]
  
  return(list(data = dx[], coefs = coefs))
}

A brief example

Here is an example with 7 normally distributed covariates and 4 binary covariates. Only 3 of the continuous covariates and 2 of the binary covariates will actually be predictive.

library(simstudy)
library(clusterGeneration)

set.seed(732521)

dx <- genMultPred(250, c(7, 3), c(4, 2))

The function returns a list of two objects. The first is a data.table containing the generated predictors and outcome:

round(dx$data, 2)
##       id    n1    n2    n3    n4    n5    n6    n7 b1 b2 b3 b4     y
##   1:   1  0.15  0.12 -0.07 -1.38 -0.05  0.58  0.57  1  1  0  1 -1.07
##   2:   2  1.42 -0.64  0.08  0.83  2.01  1.18  0.23  1  1  0  0  4.42
##   3:   3 -0.71  0.77  0.94  1.59 -0.53 -0.05  0.26  0  0  0  0  0.09
##   4:   4  0.35 -0.80  0.90 -0.79 -1.72 -0.16  0.09  0  0  1  1 -0.58
##   5:   5 -0.22 -0.72  0.62  1.40  0.17  2.21 -0.45  0  1  0  1 -2.18
##  ---                                                                
## 246: 246 -1.04  1.62  0.40  1.46  0.80 -0.77 -1.27  0  0  0  0 -1.19
## 247: 247 -0.85  1.56  1.39 -1.25 -0.82 -0.63  0.13  0  1  0  0 -0.70
## 248: 248  0.72 -0.83 -0.04 -1.38  0.61 -0.71 -0.06  1  0  1  1  0.74
## 249: 249 -0.15  1.62 -1.01 -0.79 -0.53  0.44 -0.46  1  1  1  1  0.95
## 250: 250 -0.59  0.34 -0.31  0.18 -0.86 -0.90  0.22  1  0  1  0 -1.90

The second object is the set of coefficients that determine the average response conditional on the predictors:

round(dx$coefs, 2)
##    n1    n2    n3    n4    n5    n6    n7    b1    b2    b3    b4 
##  2.48  0.62  0.28  0.00  0.00  0.00  0.00  0.00  0.00  0.53 -1.21

Finally, we can “recover” the original coefficients with linear regression:

lmfit <- lm(y ~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + b1 + b2 + b3 + b4, 
            data = dx$data)

Here’s a plot showing the 95% confidence intervals of the estimates along with the true values. The yellow lines are covariates where there is truly no association.

 

Addendum: correlation among predictors

Here is a pair of examples using the corStrength argument. In the first case, the observed correlations are close to 0, whereas in the second case, the correlations range from -0.50 to 0.25. The impact of corStrength will vary depending on the number of potential predictors.

set.seed(291212)

# Case 1
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 0)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)
##       n1    n2    n3    n4    b1    b2
## n1  1.00 -0.02  0.02  0.03 -0.01 -0.01
## n2 -0.02  1.00 -0.01  0.03 -0.03  0.00
## n3  0.02 -0.01  1.00  0.00 -0.04 -0.01
## n4  0.03  0.03  0.00  1.00  0.06 -0.01
## b1 -0.01 -0.03 -0.04  0.06  1.00 -0.01
## b2 -0.01  0.00 -0.01 -0.01 -0.01  1.00
# Case 2
dx <- genMultPred(1000, c(4, 2), c(2, 1), corStrength = 50)
round(cor(as.matrix(dx$data[, -c(1, 8)])), 2)
##       n1    n2    n3    n4    b1    b2
## n1  1.00  0.09  0.08 -0.32  0.25  0.04
## n2  0.09  1.00 -0.29 -0.47 -0.05 -0.02
## n3  0.08 -0.29  1.00 -0.46 -0.01 -0.01
## n4 -0.32 -0.47 -0.46  1.00 -0.20 -0.05
## b1  0.25 -0.05 -0.01 -0.20  1.00 -0.04
## b2  0.04 -0.02 -0.01 -0.05 -0.04  1.00

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Mocking is catching


[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

When writing unit tests for a package, you might find yourself wondering about how to best test the behaviour of your package

  • when the data it’s supposed to munge has this or that quirk,

  • when the operating system is Windows,

  • when a package enhancing its functionality is not there,

  • when a web API returns an error;

or you might even wonder how to test at least part of that package of yours that calls a web API or local database… without accessing the web API or local database during testing.

In some of these cases, the programming concept you’re after is mocking, i.e. making a function act as if something were a certain way! In this blog post we shall offer a round-up of resources around mocking, or not mocking, when unit testing an R package.

also YES you know i'm trying to hit 100% but once i read i would have to do something called "mocking" my brain was like ok let's go on twitter instead

— Sharla Gelfand (@sharlagelfand) August 1, 2019

Please keep reading, do not flee to Twitter! 😉 (The talented Sharla did end up using mocking for her package!)

Packages for mocking

General mocking

Nowadays, when using testthat for testing, the recommended tool for mocking is the mockery package, not testthat’s own with_mock() function. To read how they differ in their implementation of mocking, refer to this issue and that section of the mockery README. In brief, with mockery you can stub (i.e. replace) a function in a given environment, e.g. the environment of another function. Let’s create a small toy example to illustrate that.

# a function that says encoding is a pain
# when the OS is Windows
is_encoding_a_pain <- function(){
  if (Sys.info()[["sysname"]] == "Windows"){
    return("YES")
  } else {
    return("no")
  }
}

# The post was rendered on Ubuntu
Sys.info()[["sysname"]]
## [1] "Linux"
# So, is encoding a pain?
is_encoding_a_pain()
## [1] "no"
# stub/replace Sys.info() in is_encoding_a_pain()
# with a mock that answers "Windows"
mockery::stub(where = is_encoding_a_pain,
              what = "Sys.info",
              how = c(sysname = "Windows"))

# Different output
is_encoding_a_pain()
## [1] "YES"
# NOT changed
Sys.info()[["sysname"]]
## [1] "Linux"

Let’s also look at a real life example, from keyring tests:

test_that("auto windows", {  mockery::stub(default_backend_auto, "Sys.info", c(sysname = "Windows"))  expect_equal(default_backend_auto(), backend_wincred)})

What happens after the call to mockery::stub() is that inside the test, when default_backend_auto() is called, it won’t use the actual Sys.info() but instead a mock that returns c(sysname = "Windows") so the test can assess what default_backend_auto() returns on Windows… without the test being run on a Windows machine. 😎

Instead of directly defining the return value as is the case in this example, one could stub the function with a function, as seen in one of the tests for the remotes package.
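That pattern looks roughly like the following minimal sketch, where the mock is a function and can therefore react to its arguments (the helper is made up, this is not the actual remotes test):

library(testthat)

# A made-up helper that reads a config file if it exists
read_config <- function(path) {
  if (file.exists(path)) readLines(path) else character(0)
}

test_that("a missing config file yields an empty result", {
  # The stub is a function rather than a fixed return value
  mockery::stub(read_config, "file.exists", function(path) FALSE)
  expect_identical(read_config("settings.cfg"), character(0))
})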

To find more examples of how to use mockery in tests, you can use GitHub search in combination with R-hub’s CRAN source code mirror: https://github.com/search?l=&q=%22mockery%3A%3Astub%22+user%3Acran&type=Code

Web mocking

In the case of a package doing HTTP requests, you might want to test what happens when an error code is received for instance. To do that, you can use either httptest or webmockr (compatible with both httr and crul).
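As a rough sketch of the webmockr side (the URL and the 503 status below are invented for the example; httptest offers a comparable workflow for httr):

library(webmockr)
library(crul)

webmockr::enable()  # from here on, HTTP requests made with crul are intercepted

# Pretend the service is down: any GET to this URL returns a 503
stub <- stub_request("get", "https://api.example.org/data")
to_return(stub, status = 503)

res <- crul::HttpClient$new("https://api.example.org/data")$get()
res$status_code  # expected: 503, without any real request being sent

webmockr::disable()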

Temporarily modify the global state

To test what happens when, say, an environment variable has a particular value, one can set it temporarily within a test using the withr package. You could argue it’s not technically mocking, but it’s a useful trick. You can see it in action in keyring’s tests.
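For instance, a small sketch (the variable name and value are made up, and it assumes the variable is unset outside the test):

library(testthat)
library(withr)

test_that("an environment variable can be changed just for a test", {
  with_envvar(c("MYPKG_API_KEY" = "fake-key"), {
    # Inside the block, the temporary value is visible
    expect_identical(Sys.getenv("MYPKG_API_KEY"), "fake-key")
  })
  # Outside the block, the previous (unset) state is restored
  expect_identical(Sys.getenv("MYPKG_API_KEY"), "")
})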

To mock or… not to mock

Sometimes, you might not need mocking and can resort to an alternative approach instead, using the real thing/situation. You could say it’s a less “unit” approach and requires more work.

Fake input data

For, say, a plotting or modelling library, you can tailor-make data. Comparing approaches or packages for creating fake data is beyond the scope of this post, so let’s just name a few packages:

Stored data from a web API / a database

As explained in this discussion about testing web API packages, when testing a package accessing and munging web data you might want to separate testing of the data access and of the data munging, on the one hand because failures will be easier to trace back to a problem in the web API vs. your code, on the other hand to be able to selectively turn off some tests based on internet connection, the presence of an API key, etc. Storing and replaying HTTP requests is supported by:
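Whatever tool you pick, the separation itself can be sketched like this (the fixture path and the two functions are invented for illustration):

library(testthat)
library(jsonlite)

test_that("the munging step works on a stored API response", {
  # A response saved once from the real API and kept under tests/testthat/
  stored <- read_json(test_path("fixtures", "users.json"))
  result <- munge_users(stored)   # hypothetical function: munging only, no HTTP
  expect_s3_class(result, "data.frame")
})

test_that("the live API is reachable and returns the expected shape", {
  skip_if_offline()               # the access test only runs with a connection
  live <- fetch_users()           # hypothetical function that calls the web API
  expect_type(live, "list")
})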

What about applying the same idea to packages using a database connection?

Different operating systems

Say you want to be sure your package builds correctly on another operating system… you can use the R-hub package builder 😁 or maybe a continuous integration service.

Different system configurations or libraries

Regarding the case where you want to test your package when a suggested dependency is or is not installed, you can use the configuration script of a continuous integration service to have at least one build without that dependency:

I’d suggest skip_if_not_installed() plus some Travis wrangling so you have a build that doesn’t have the package installed (probably by running remove.packages() in before_script)

— Hadley Wickham (@hadleywickham) October 4, 2019
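On the testthat side, that advice looks roughly like this (a sketch; the suggested package and the helper are placeholders):

library(testthat)

test_that("the plotting helper works when ggplot2 is available", {
  # On the CI build where the suggested package was removed,
  # this test is skipped rather than failed
  skip_if_not_installed("ggplot2")
  expect_s3_class(my_plot_helper(mtcars), "ggplot")  # hypothetical helper
})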

Conclusion

In this post we offered a round-up of resources around mocking when unit testing R packages, as well as around not mocking. To learn about more packages for testing your package, refer to the list published on Locke Data’s blog. Now, what if you’re not sure about the best approach for that quirky thing you want to test, mocking or not mocking, and how exactly? Well, you can fall back on two methods: Reading the source code of other packages, and Asking for help! Good luck! 🚀


To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.


Sept 2019: “Top 40” New R Packages


[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

One hundred and thirteen new packages made it to CRAN in September. Here are my “Top 40” picks in eight categories: Computational Methods, Data, Economics, Machine Learning, Statistics, Time Series, Utilities, and Visualization.

Computational Methods

eRTG3D v0.6.2: Provides functions to create realistic random trajectories in a 3-D space between two given fixed points (conditional empirical random walks), based on empirical distribution functions extracted from observed trajectories (training data), and thus reflect the geometrical movement characteristics of the mover. There are several small vignettes, including sample data sets, linkage to the sf package, and point cloud analysis.

freealg v1.0: Implements the free algebra in R: multivariate polynomials with non-commuting indeterminates. See the vignette for the math.

HypergeoMat v3.0.0: Implements Koev & Edelman’s algorithm (2006) to evaluate the hypergeometric functions of a matrix argument, which appear in random matrix theory. There is a vignette.

opart v2019.1.0: Provides a reference implementation in C of the standard optimal partitioning algorithm, with square-error loss and Poisson loss functions, as described by Maidstone (2016), Hocking (2016), Rigaill (2016), and Fearnhead (2016); its time complexity scales quadratically with the number of data points. There are vignettes for Gaussian and Poisson squared error loss.

Data

cde v0.4.1: Facilitates searching, download and plotting of Water Framework Directive (WFD) reporting data for all water bodies within the UK Environment Agency area. This package has been peer-reviewed by rOpenSci. There is a Getting Started Guide and a vignette on output reference.

eph v0.1.1: Provides tools to download and manipulate data from the Argentina Permanent Household Survey. The implemented methods are based on INDEC (2016).

leri v0.0.1: Fetches Landscape Evaporative Response Index (LERI) data using the raster package. The LERI product measures anomalies in actual evapotranspiration, to support drought monitoring and early warning systems. See the vignette for examples.

rwhatsapp v0.2.0: Provides functions to parse and digest history files from the popular messenger service WhatsApp. There is a vignette.

tidyUSDA v0.2.1: Provides a consistent API to pull United States Department of Agriculture census and survey data from the National Agricultural Statistics Service (NASS) QuickStats service. See the vignette.

Economics

bunching v0.8.4: Implements the bunching estimator from economic theory for kinks and knots. There is a vignette on Theory, and another with Examples.

fixest v0.1.2: Provides fast estimation of econometric models with multiple fixed-effects, including ordinary least squares (OLS), generalized linear models (GLM), and the negative binomial. The method to obtain the fixed-effects coefficients is based on Berge (2018). There is a vignette.

raceland v1.0.3: Implements a computational framework for a pattern-based, zoneless analysis and visualization of (ethno)racial topography, for analyzing residential segregation and racial diversity. There is a vignette describing the Computational Framework, one describing Patterns of Racial Landscapes, and a third on SocScape Grids.

Machine Learning

biclustermd v0.1.0: Implements biclustering, a statistical learning technique that simultaneously partitions and clusters the rows and columns of a data matrix in a manner that can deal with missing values. See the vignette for examples.

bbl v0.1.5: Implements supervised learning using Boltzmann Bayes model inference, enabling the classification of data into multiple response groups based on a large number of discrete predictors that can take factor values of heterogeneous levels. See Woo et al. (2016) for background, and the vignette for how to use the package.

corporaexplorer v0.6.3: Implements Shiny apps to dynamically explore collections of texts. Look here for more information.

fairness v1.0.1: Offers various metrics to assess and visualize the algorithmic fairness of predictive and classification models using methods described by Calders and Verwer (2010), Chouldechova (2017), Feldman et al. (2015), Friedler et al. (2018), and Zafar et al. (2017). There is a tutorial for the package.

imagefluency v0.2.1: Provides functions to collect image statistics based on processing fluency theory that include scores for several basic aesthetic principles that facilitate fluent cognitive processing of images: contrast, complexity / simplicity, self-similarity, symmetry, and typicality. See Mayer & Landwehr (2018) and Mayer & Landwehr (2018) for the theoretical background, and the vignette for an introduction.

ineqJD v1.0: Provides functions to compute and decompose Gini, Bonferroni, and Zenga 2007 point and synthetic concentration indexes. See Zenga M. (2015), Zenga & Valli (2017), and Zenga & Valli (2018) for more information.

lmds v0.1.0: Implements Landmark Multi-Dimensional Scaling (LMDS), a dimensionality reduction method scaleable to large numbers of samples, because rather than calculating a complete distance matrix between all pairs of samples, it only calculates the distances between a set of landmarks and the samples. See the README for an example.

modelStudio v0.1.7: Implements an interactive platform to help interpret machine learning models. There is a vignette, and look here for a demo of the interactive features.

nlpred v1.0: Provides methods for obtaining improved estimates of non-linear cross-validated risks using targeted minimum loss-based estimation, estimating equations, and one-step estimation. Cross-validated area under the receiver operating characteristic curve (LeDell et al. (2015)) and other metrics are included. There is a vignette on small sample estimates.

pyMTurkR v1.1: Provides access to the latest Amazon Mechanical Turk’ (‘MTurk’) Requester API (version ‘2017–01–17’), replacing the now deprecated MTurkR package.

stagedtrees v1.0.0: Creates and fits staged event tree probability models, probabilistic graphical models capable of representing asymmetric conditional independence statements among categorical variables. See Görgen et al. (2018), Thwaites & Smith (2017), Barclay et al. (2013), and Smith & Anderson (doi:10.1016/j.artint.2007.05.004) for background, and look here for an overview.

Statistics

confoundr v1.2: Implements three covariate-balance diagnostics for time-varying confounding and selection bias in complex longitudinal data, as described in Jackson (2016) and Jackson (2019). There is a Demo vignette and another describing selection bias from dependent censoring.

distributions3 v0.1.1: Provides tools to create and manipulate probability distributions using S3. Generics random(), pdf(), cdf(), and quantile() provide replacements for base R’s r/d/p/q style functions. The documentation for each distribution contains detailed mathematical notes. There are several vignettes: Intro to hypothesis testing, One-sample sign tests, One-sample T confidence interval, One-sample T-tests, Z confidence interval for a mean, One-sample Z-tests for a proportion, One-sample Z-tests, Paired tests, and Two-sample Z-tests.
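For instance, a quick sketch of those generics in action (Normal() is one of the distribution constructors the package exports):

library(distributions3)

Z <- Normal(mu = 0, sigma = 1)  # a standard normal distribution object

random(Z, 3)        # like rnorm(3)
pdf(Z, 0)           # like dnorm(0)
cdf(Z, 1.96)        # like pnorm(1.96)
quantile(Z, 0.975)  # like qnorm(0.975)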

dobin v0.8.4: Implements a dimension reduction technique for outlier detection, constructing a set of basis vectors that bring outliers to the forefront using fewer basis vectors. See Kandanaarachchi & Hyndman (2019) for background, and the vignette for a brief introduction.

glmpca v0.1.0: Implements a generalized version of principal components analysis (GLM-PCA) for dimension reduction of non-normally distributed data, such as counts or binary matrices. See Townes et al. (2019) and Townes (2019) for details, and the vignette for examples.

immuneSIM v0.8.7: Provides functions to simulate full B-cell and T-cell receptor repertoires using an in-silico recombination process that includes a wide variety of tunable parameters to introduce noise and biases. See Weber et al. (2019) for background, and look here for information about the package.

irrCAC v1.0: Provides functions to calculate various chance-corrected agreement coefficients (CAC) among two or more raters, including Cohen’s kappa, Conger’s kappa, Fleiss’ kappa, Brennan-Prediger coefficient, Gwet’s AC1/AC2 coefficients, and Krippendorff’s alpha. There are vignettes on benchmarking, Calculating Chance-corrected Agreement Coefficients, and Computing weighted agreement coefficients.

LPBlg v1.2: Given a postulated model and a set of data, provides functions that estimate a density and derive a deviance test to assess whether the data distribution deviates significantly from the postulated model. See Algeri S. (2019) for details.

SynthTools v1.0.0: Provides functions to support experimentation with partially synthetic data sets. Confidence interval and standard error formulas have options for either synthetic data sets or multiple imputed data sets. For more information, see Reiter & Raghunathan (2007).

Time Series

fable v0.1.0: Provides a collection of commonly used univariate and multivariate time series forecasting models, including automatically selected exponential smoothing (ETS) and autoregressive integrated moving average (ARIMA) models. There is an Introduction and a vignette on transformations.
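A minimal sketch of the fable workflow, following the package's own examples (it assumes the tsibble and tsibbledata packages are available):

library(fable)
library(tsibble)
library(dplyr)

tsibbledata::aus_production %>%
  model(ets = ETS(Beer), arima = ARIMA(Beer)) %>%
  forecast(h = "2 years")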

nsarfima v0.1.0.0: Provides routines for fitting and simulating data under autoregressive fractionally integrated moving average (ARFIMA) models, without the constraint of stationarity. Two fitting methods are implemented: a pseudo-maximum likelihood method and a minimum distance estimator. See Mayoral (2007) and Beran (1995) for reference.

Utilities

nc v2019.9.16: Provides functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. There is a vignette on capture first match and another on capture all match.

pins v0.2.0: Provides functions that “pin” remote resources into a local cache in order to work offline, improve speed, avoid recomputing, and discover and share resources in local folders, GitHub, Kaggle and RStudio Connect. There is a Getting Started Guide and vignettes on Extending Boards, Using GitHub Boards, Using Kaggle Boards, Using RStudio Connect Boards, Using Website Boards, Using Pins in RStudio, Understanding Boards, and Extending Pins.

queryparser v0.1.1: Provides functions to translate SQL SELECT statements into lists of R expressions.
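For example (a small sketch; the table name is arbitrary):

library(queryparser)

parse_query("SELECT origin, dest FROM flights WHERE distance > 1000")
# returns a named list of unevaluated R expressions, one element per clause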

rawr v0.1.0: Retrieves pure R code from popular R websites, including github, kaggle, datacamp, and R blogs made using blogdown.

Visualization

FunnelPlotR v0.2.1: Implements Spiegelhalter (2005) Funnel plots for reporting standardized ratios, with overdispersion adjustment. The vignette offers examples.

ggBubbles v0.1.4: Implements mini bubble plots to display more information for discrete data than traditional bubble plots do. The vignette provides examples.

gghalves v0.0.1: Implements a ggplot2 extension for easy plotting of half-half geom combinations: think half boxplot and half jitterplot, or half violinplot and half dotplot.



To leave a comment for the author, please follow the link and comment on their blog: R Views.



