[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers].

You may know that MonetDBLite was removed from CRAN. DuckDB is coming up.

Breaking change

> install.packages('MonetDBLite')
Warning in install.packages :
  package ‘MonetDBLite’ is not available (for R version 3.6.1)

People who built their work on MonetDBLite may ask what happened and what to do now. Nobody wants to play a risky game with database and tool choices for future work (“It’s really fast, but we may waste some time if we have to replace it with another solution”).

That’s the game with open source. Remember the big changes in dplyr 0.7. Sometimes we want better tools, and most of the time they do get better, which is really great. But sometimes we don’t have the time and energy to adapt our work to tools that improve too iteratively, or too subjectively. We want our code to work, not break. Keeping code as simple as possible (and avoiding nebulous dependencies, so, tidy?) is one of the key points. Storing data in a database is another.

All we can say is that “we’re walking on works in progress”. As with eggshells, the more works in progress underfoot, the more breaking changes we can expect.

Works in progress for packages, also for (embedded) databases!

From Monet to Duck

MonetDBLite’s philosophy is to be like a “very very fast SQLite”. But it’s time for a change (or so it seems). We can thank the MonetDBLite developers, as it was a nice adventure to play and work with MonetDB’s speed! One question remains: is there anyone, perhaps some volunteers, able to maintain MonetDBLite (it is a nice tool, after all)? There is not much information for the moment about what happened, and that’s why I am writing this post.

I read here that they are now working on a new solution named DuckDB, released under the MIT License; see here for more details.

As I’m just an R user and haven’t contributed to the project, I will keep it short: DuckDB takes good parts from SQLite and PostgreSQL (the parser); see here for the complete list. It looks promising. As in MonetDB, the philosophy is focused on columns and speed. And dates, for instance, are handled correctly, with no need to convert them to “ISO-8601-like” character strings.

It can be called from C/C++, Python and R.

Here is a post about the Python binding.

I also put a link at the bottom of this page which gives some explanations about the name of this new tool and the DuckDB developers’ point of view on data manipulation and storage¹.

Beginning with DuckDB in R

Create / connect to the db

# remotes::install_github("cwida/duckdb/tools/rpkg", build = FALSE)
library(duckdb)
library(dplyr)
library(DBI)

# Create or connect to the db
con_duck <- dbConnect(duckdb::duckdb(), "~/Documents/data/duckdb/my_first.duckdb")
# con_duck <- dbConnect(duckdb::duckdb(), ":memory:")
con_duck

iris

dbWriteTable(con_duck, "iris", iris)
tbl(con_duck, 'iris')
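As a minimal sketch of the same connect / write / read cycle, the snippet below uses an in-memory database so that nothing is left on disk. The name `con_tmp` and the choice of `:memory:` are assumptions made for this illustration, not part of the original setup.

```r
library(DBI)

# In-memory database: an assumption of this sketch, nothing is written to disk
con_tmp <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Copy a data frame into the database, then check that it arrived
dbWriteTable(con_tmp, "iris", iris)
tables <- dbListTables(con_tmp)
n_rows <- dbGetQuery(con_tmp, "SELECT COUNT(*) AS n FROM iris")$n

# Close the connection and shut down the embedded database
dbDisconnect(con_tmp, shutdown = TRUE)
```

Shutting down explicitly matters more with a file-backed database, since it releases the file for other processes.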

Put some rows and columns in db

> dim(nycflights13::flights)
[1] 336776     19
> object.size(nycflights13::flights) %>% format(units = "Mb")
[1] "38.8 Mb"

Sampling it to get more rows, then duplicating the columns, twice.

# Sample to get a bigger data.frame
df_test <- nycflights13::flights %>%
  sample_n(2e6, replace = TRUE) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols')})) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols_bis')}))

> dim(df_test)
[1] 2000000      76
> object.size(df_test) %>% format(units = "Mb")
[1] "916.4 Mb"

Write in db

tictoc::tic()
dbWriteTable(con_duck, "df_test", df_test)
tictoc::toc()

It takes some time compared to MonetDBLite (no formal benchmark here, I just ran this several times and the timings were consistent).

# DuckDB      : 23.251 sec elapsed
# SQLite      : 20.23 sec elapsed
# MonetDBLite : 8.4 sec elapsed

All three are pretty fast. Most importantly, if queries are fast, and they are, most of the time we’re all right.

I want to say here that, for now, it’s a work in progress, and we have to wait for more communication from the DuckDB developers. I just wrote this to share the news.

Glimpse

> tbl(con_duck, 'df_test') %>% glimpse()
Observations: ??
Variables: 76
Database: duckdb_connection
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
$ dep_time <int> 1608, 2218, 1920, NA, 1506, 1917, 1034, 655, 1039, 1752, 2018, 1732, 82…
$ sched_dep_time <int> 1612, 2127, 1920, 2159, 1500, 1900, 1030, 700, 1045, 1720, 1629, 1728, …
$ dep_delay <dbl> -4, 51, 0, NA, 6, 17, 4, -5, -6, 32, 229, 4, -9, -3, -4, -3, 9, 38, 34, …
$ arr_time <int> 1904, 2321, 2102, NA, 1806, 2142, 1337, 938, 1307, 2103, 2314, 1934, 11…
$ sched_arr_time <int> 1920, 2237, 2116, 2326, 1806, 2131, 1345, 958, 1313, 2025, 1927, 2011, …
$ arr_delay <dbl> -16, 44, -14, NA, 0, 11, -8, -20, -6, 38, 227, -37, -16, -12, -10, -39, …
$ carrier <chr> "UA", "EV", "9E", "UA", "DL", "DL", "VX", "UA", "UA", "AA", "B6", "UA", …
$ flight <int> 1242, 4372, 3525, 424, 2181, 2454, 187, 1627, 1409, 695, 1161, 457, 717…
$ tailnum <chr> "N24211", "N13994", "N910XJ", NA, "N329NB", "N3749D", "N530VA", "N37281…
$ origin <chr> "EWR", "EWR", "JFK", "EWR", "LGA", "JFK", "EWR", "EWR", "EWR", "JFK", "…
$ dest <chr> "FLL", "DCA", "ORD", "BOS", "MCO", "DEN", "SFO", "PBI", "LAS", "AUS", "…
$ air_time <dbl> 155, 42, 116, NA, 131, 217, 346, 134, 301, 230, 153, 276, 217, 83, 36, …
$ distance <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
......
$ minute_bind_cols <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
$ year_bind_cols_bis <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month_bind_cols_bis <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day_bind_cols_bis <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
......
$ distance_bind_cols_bind_cols_bis <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour_bind_cols_bind_cols_bis <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute_bind_cols_bind_cols_bis <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols_bind_cols_bis <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…

Count

> tbl(con_duck, 'df_test') %>% count()
# Source:   lazy query [?? x 1]
# Database: duckdb_connection
        n
    <dbl>
1 2000000

Dates

Compared to SQLite, it handles dates/times correctly. No need to convert them to character.

tbl(con_duck, 'df_test') %>% select(time_hour)

# Source:   lazy query [?? x 1]
# Database: duckdb_connection
   time_hour
   <dttm>
 1 2013-11-29 21:00:00.000000
 2 2013-10-08 01:00:00.000000
 3 2013-03-02 00:00:00.000000
 4 2013-05-03 01:00:00.000000
 5 2013-12-18 20:00:00.000000
 6 2013-09-18 23:00:00.000000
 7 2013-07-20 14:00:00.000000
 8 2013-03-07 12:00:00.000000
 9 2013-09-15 14:00:00.000000
10 2013-04-25 21:00:00.000000
# … with more rows

tbl(con_sqlite, 'df_test') %>% select(time_hour)

# Source:   lazy query [?? x 1]
# Database: sqlite 3.22.0 [/Users/guillaumepressiat/Documents/data/sqlite.sqlite]
    time_hour
        <dbl>
 1 1385758800
 2 1381194000
 3 1362182400
 4 1367542800
 5 1387396800
 6 1379545200
 7 1374328800
 8 1362657600
 9 1379253600
10 1366923600
# … with more rows
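To make the difference concrete, here is a small sketch of the extra step SQLite users need after collecting: the numeric values are seconds since 1970-01-01 and have to be turned back into date-times by hand. The UTC time zone is an assumption of this example.

```r
# First value returned by SQLite above: seconds since 1970-01-01 00:00:00 UTC
secs <- 1385758800

# Rebuild a date-time by hand; the UTC time zone is an assumption of this sketch
time_hour <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
formatted <- format(time_hour, "%Y-%m-%d %H:%M:%S")
formatted
# "2013-11-29 21:00:00", the same instant as the first row returned by DuckDB
```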

Some querying

Running some queries

dplyr

It already works nicely with dplyr.

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   collect()

# A tibble: 3 x 2
  Species    `min(Sepal.Width)`
  <chr>                   <dbl>
1 virginica                 2.2
2 setosa                    2.3
3 versicolor                2

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   show_query()

<SQL>
SELECT "Species", MIN("Sepal.Width") AS "min(Sepal.Width)"
FROM "iris"
GROUP BY "Species"

sql

Run query as a string

dbGetQuery(con_duck, 'SELECT "Species", MIN("Sepal.Width") FROM iris GROUP BY "Species"')

     Species min(Sepal.Width)
1  virginica              2.2
2     setosa              2.3
3 versicolor              2.0

As with all data sources accessed through DBI, if the query is more complex, we can write it comfortably in an external file and run it like this, for example:

dbGetQuery(con_duck, readr::read_file('~/Documents/scripts/script.sql'))
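A self-contained sketch of this pattern is given below. The script file is a temporary file created on the fly, and `con_tmp`, `sql_file` and the `min_width` alias are hypothetical names used only for the illustration. Base R's `readLines()` stands in for `readr::read_file()` to avoid an extra dependency.

```r
library(DBI)

# In-memory database used only for this illustration
con_tmp <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con_tmp, "iris", iris)

# Hypothetical script file, created here only so the example runs on its own
sql_file <- tempfile(fileext = ".sql")
writeLines(c(
  'SELECT "Species", MIN("Sepal.Width") AS min_width',
  'FROM iris',
  'GROUP BY "Species"',
  'ORDER BY "Species"'
), sql_file)

# Read the whole file into one string and send it as a single query
query <- paste(readLines(sql_file), collapse = "\n")
res <- dbGetQuery(con_tmp, query)

dbDisconnect(con_tmp, shutdown = TRUE)
```

Keeping long queries in `.sql` files also lets an SQL editor highlight and format them, which is harder with strings embedded in R code.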

“Little” benchmarks

Collecting this big data frame

This makes little sense in practice but gives some idea of read speed. We collect df_test into memory from DuckDB, MonetDBLite and SQLite.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test')),
+   times = 5)
Unit: seconds
 expr     min       lq     mean   median       uq     max neval
    a 3.58703 3.632507 3.763129 3.676669 3.725148 4.19429     5

> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test')),
+   times = 5)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    b 973.1111 990.3699 1003.417 1010.651 1013.858 1029.097     5

> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test')),
+   times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    d 52.08785 52.08785 52.08785 52.08785 52.08785 52.08785     1

Really good!

Simple count

Count then collect aggregate rows.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    a 50.18014 53.24197 54.87532 54.68203 57.09206 58.94873    20

> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr     min       lq     mean   median       uq     max neval
    b 151.729 157.9267 160.5727 160.8815 163.8343 167.477    20

> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: seconds
 expr      min       lq     mean  median       uq      max neval
    d 2.167202 2.196288 2.205281 2.20486 2.216594 2.253606    20

Faster!

It remains to test joins, filters, sorts, etc.
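As a starting point for such tests, here is a hedged sketch combining a filter, a join and a sort. It assumes the nycflights13 and dbplyr packages are installed, and uses a fresh in-memory database (`con_tmp`) rather than the connections above. The single `system.time()` call is only a rough timing, not a benchmark, and no result is claimed here.

```r
library(DBI)
library(dplyr)

# In-memory database with two small tables to join (assumption of this sketch)
con_tmp <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con_tmp, "flights", nycflights13::flights)
dbWriteTable(con_tmp, "airlines", nycflights13::airlines)

# Lazy query: filter, join on the carrier code, sort, keep the ten longest delays
q <- tbl(con_tmp, "flights") %>%
  filter(month == 1, !is.na(dep_delay)) %>%
  inner_join(tbl(con_tmp, "airlines"), by = "carrier") %>%
  arrange(desc(dep_delay)) %>%
  select(name, flight, dep_delay) %>%
  head(10)

# Rough timing of one execution; the query only runs in the database at collect()
timing <- system.time(top_delays <- collect(q))

dbDisconnect(con_tmp, shutdown = TRUE)
```

Wrapping `collect(q)` in `microbenchmark::microbenchmark()`, as in the sections above, would give a more stable comparison across the three databases.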

Information

I find that there is not much communication for the moment about this work and its R binding, so I wrote this post to highlight it.

MonetDBLite speed is amazing; will you give DuckDB a try?

In any case thanks to DuckDB developers and welcome to the new duck.

See here https://github.com/cwida/duckdb.

The DuckDB developers’ point of view on data management, and explanations about the “duck”, can be found here.

Here we can read more information on ALTREP, MonetDBLite and DuckDB, and the reasons why MonetDBLite was finally abandoned (“RIP MonetDBLite”).

What the duck? See the explanation on slide 25: https://db.in.tum.de/teaching/ss19/moderndbs/duckdb-tum.pdf?lang=de
