A duck. Giving a look at DuckDB since MonetDBLite was removed from CRAN

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers.]

You may know that MonetDBLite was removed from CRAN. DuckDB is coming up.


Breaking change

> install.packages('MonetDBLite')
Warning in install.packages :
  package ‘MonetDBLite’ is not available (for R version 3.6.1)

People who built their work on MonetDBLite may wonder what happened and what to do next. Nobody wants to play a risky game with database and tool choices for future work… (“It’s really fast, but we may waste time if we have to replace it with another solution.”)

That’s the game with open source. Remember the big changes in dplyr 0.7. Sometimes we want better tools, and most of the time they do get better, which is great. But sometimes we don’t have the time and energy to adapt our work to tools that improve too iteratively, or too subjectively. We want things to work, not break. Keeping code as simple as possible (and avoiding nebulous dependencies, so, tidy?) is one of the key points. Storing data in a database is another.

All we can say is that “we’re walking on works in progress”. As with eggshells, more works in progress probably means more breaking changes.

Works in progress for packages, also for (embedded) databases!

From Monet to Duck

MonetDBLite’s philosophy is to be like a “very very fast SQLite”. But it’s time for a change (or so it seems). We can thank the MonetDBLite developers: it was a nice adventure to play and work with MonetDB’s speed! One open question: are there other people, volunteers, or possibilities to keep maintaining MonetDBLite (it is such a nice tool)? There is not much information for the moment about what happened, and that’s why I am writing this post.

Here, I read that they are now working on a new solution, under MIT License, named DuckDB, see here for more details.

As I’m just an R user and haven’t contributed to the project, I’ll keep it short: DuckDB takes the good parts from SQLite and PostgreSQL (the parser); see here for the complete list. It looks promising. As in MonetDB, the philosophy is focused on columns and speed. And dates, for instance, are handled correctly, with no need to convert them to “ISO-8601-like” character strings.

It can be called from C/C++, Python and R.

Here is a post about the Python bindings.

I also put a link at the bottom of this page that gives some explanations about the name of this new tool and the DuckDB developers’ point of view on data manipulation and storage.

Beginning with DuckDB in R

Create / connect to the db

# remotes::install_github("cwida/duckdb/tools/rpkg", build = FALSE)
library(duckdb)
library(dplyr)
library(DBI)

# Create or connect to the db
con_duck <- dbConnect(duckdb::duckdb(), "~/Documents/data/duckdb/my_first.duckdb")
# con <- dbConnect(duckdb::duckdb(), ":memory:")
con_duck

iris

dbWriteTable(con_duck, "iris", iris)
tbl(con_duck, 'iris')
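To try the same steps without touching the disk, here is a minimal sketch using an in-memory database. It assumes the duckdb and DBI packages are installed; the names `con`, `tables` and `n_rows` are only illustrative and are not part of the original session.

```r
library(DBI)

# Assumption: duckdb is installed; ":memory:" creates a throwaway database
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Write the built-in iris data frame as a table
dbWriteTable(con, "iris", iris)

# Check that the table exists and count its rows
tables <- dbListTables(con)
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM iris")$n

print(tables)
print(n_rows)

# Close the connection and shut down the embedded database
dbDisconnect(con, shutdown = TRUE)
```

Nothing is persisted with an in-memory database, so this is only useful for quick experiments; for real work the file-based connection shown above is the one to keep.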

Put some rows and columns in db

> dim(nycflights13::flights)
[1] 336776     19
> object.size(nycflights13::flights) %>% format(units = "Mb")
[1] "38.8 Mb"

Sample it to get more rows, then duplicate the columns, twice.

# Sample to get a bigger data.frame
df_test <- nycflights13::flights %>%
  sample_n(2e6, replace = TRUE) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols')})) %>%
  bind_cols(., rename_all(., function(x){paste0(x, '_bind_cols_bis')}))

> dim(df_test)
[1] 2000000      76
> object.size(df_test) %>% format(units = "Mb")
[1] "916.4 Mb"

Write in db

tictoc::tic()
dbWriteTable(con_duck, "df_test", df_test)
tictoc::toc()

It takes some time compared to MonetDBLite (no formal benchmark here, I just ran this several times and the timings were consistent).

# DuckDB      : 23.251 sec elapsed
# SQLite      : 20.23 sec elapsed
# MonetDBLite : 8.4 sec elapsed
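For anyone who wants to repeat this kind of write timing on their own machine, here is a rough sketch with base R's `system.time()`. The helper `time_write()` and the small `small_df` data frame are hypothetical and only for illustration; this is not the code that produced the timings above, and the numbers will differ.

```r
library(DBI)

# Hypothetical helper: seconds elapsed while writing a data frame
# through any DBI connection
time_write <- function(con, name, df) {
  system.time(dbWriteTable(con, name, df, overwrite = TRUE))[["elapsed"]]
}

# Assumption: duckdb is installed; an in-memory database keeps the sketch self-contained
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Small made-up data frame, far smaller than df_test
small_df <- data.frame(id = seq_len(1e5), value = rnorm(1e5))

write_secs <- time_write(con, "small_df", small_df)
n_written <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM small_df")$n

print(write_secs)

dbDisconnect(con, shutdown = TRUE)
```

The same helper could be called with an SQLite or MonetDBLite connection to compare backends, as long as the connection is a regular DBI connection.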

All three are pretty fast. Most importantly, if queries are fast, and they are, most of the time we’re all right.

I want to say here that, for now, it’s a work in progress; we have to wait for more communication from the DuckDB developers. I’m just writing this to share the news.

Glimpse

> tbl(con_duck, 'df_test') %>% glimpse()
Observations: ??
Variables: 76
Database: duckdb_connection
$ year                               <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month                              <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day                                <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
$ dep_time                           <int> 1608, 2218, 1920, NA, 1506, 1917, 1034, 655, 1039, 1752, 2018, 1732, 82…
$ sched_dep_time                     <int> 1612, 2127, 1920, 2159, 1500, 1900, 1030, 700, 1045, 1720, 1629, 1728, …
$ dep_delay                          <dbl> -4, 51, 0, NA, 6, 17, 4, -5, -6, 32, 229, 4, -9, -3, -4, -3, 9, 38, 34, …
$ arr_time                           <int> 1904, 2321, 2102, NA, 1806, 2142, 1337, 938, 1307, 2103, 2314, 1934, 11…
$ sched_arr_time                     <int> 1920, 2237, 2116, 2326, 1806, 2131, 1345, 958, 1313, 2025, 1927, 2011, …
$ arr_delay                          <dbl> -16, 44, -14, NA, 0, 11, -8, -20, -6, 38, 227, -37, -16, -12, -10, -39, …
$ carrier                            <chr> "UA", "EV", "9E", "UA", "DL", "DL", "VX", "UA", "UA", "AA", "B6", "UA", …
$ flight                             <int> 1242, 4372, 3525, 424, 2181, 2454, 187, 1627, 1409, 695, 1161, 457, 717…
$ tailnum                            <chr> "N24211", "N13994", "N910XJ", NA, "N329NB", "N3749D", "N530VA", "N37281…
$ origin                             <chr> "EWR", "EWR", "JFK", "EWR", "LGA", "JFK", "EWR", "EWR", "EWR", "JFK", "…
$ dest                               <chr> "FLL", "DCA", "ORD", "BOS", "MCO", "DEN", "SFO", "PBI", "LAS", "AUS", "…
$ air_time                           <dbl> 155, 42, 116, NA, 131, 217, 346, 134, 301, 230, 153, 276, 217, 83, 36, …
$ distance                           <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour                               <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute                             <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour                          <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
...
...
$ minute_bind_cols                   <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols                <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…
$ year_bind_cols_bis                 <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
$ month_bind_cols_bis                <int> 11, 10, 3, 5, 12, 9, 7, 3, 9, 4, 7, 6, 1, 1, 9, 10, 9, 8, 4, 1, 4, 9, 6…
$ day_bind_cols_bis                  <int> 29, 7, 1, 2, 18, 18, 20, 7, 15, 25, 22, 1, 29, 18, 30, 27, 27, 22, 19, …
...
...
$ distance_bind_cols_bind_cols_bis   <dbl> 1065, 199, 740, 200, 950, 1626, 2565, 1023, 2227, 1521, 1035, 2133, 138…
$ hour_bind_cols_bind_cols_bis       <dbl> 16, 21, 19, 21, 15, 19, 10, 7, 10, 17, 16, 17, 8, 14, 8, 19, 15, 16, 20…
$ minute_bind_cols_bind_cols_bis     <dbl> 12, 27, 20, 59, 0, 0, 30, 0, 45, 20, 29, 28, 35, 50, 25, 0, 35, 55, 0, …
$ time_hour_bind_cols_bind_cols_bis  <dttm> 2013-11-29 21:00:00, 2013-10-08 01:00:00, 2013-03-02 00:00:00, 2013-05…

Count

> tbl(con_duck, 'df_test') %>% count()
# Source:   lazy query [?? x 1]
# Database: duckdb_connection
        n
    <dbl>
1 2000000

Dates

Compared to SQLite, it handles dates/times correctly. No need to convert them to character.

tbl(con_duck, 'df_test') %>% select(time_hour)

# Source:   lazy query [?? x 1]
# Database: duckdb_connection
   time_hour
   <dttm>
 1 2013-11-29 21:00:00.000000
 2 2013-10-08 01:00:00.000000
 3 2013-03-02 00:00:00.000000
 4 2013-05-03 01:00:00.000000
 5 2013-12-18 20:00:00.000000
 6 2013-09-18 23:00:00.000000
 7 2013-07-20 14:00:00.000000
 8 2013-03-07 12:00:00.000000
 9 2013-09-15 14:00:00.000000
10 2013-04-25 21:00:00.000000
# … with more rows

tbl(con_sqlite, 'df_test') %>% select(time_hour)

# Source:   lazy query [?? x 1]
# Database: sqlite 3.22.0 [/Users/guillaumepressiat/Documents/data/sqlite.sqlite]
    time_hour
        <dbl>
 1 1385758800
 2 1381194000
 3 1362182400
 4 1367542800
 5 1387396800
 6 1379545200
 7 1374328800
 8 1362657600
 9 1379253600
10 1366923600
# … with more rows
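The date handling can be checked with a small round trip. This is a sketch under assumptions: the duckdb and DBI packages are installed, and the table `df_dates` with its two made-up rows is purely illustrative.

```r
library(DBI)

# Assumption: duckdb is installed; in-memory database for a throwaway test
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Tiny illustrative data frame with a date-time column
df_dates <- data.frame(
  id = 1:2,
  time_hour = as.POSIXct(c("2013-11-29 21:00:00", "2013-10-08 01:00:00"), tz = "UTC")
)
dbWriteTable(con, "df_dates", df_dates)

# Read the column back: it should still be a date-time, not a number or a string
back <- dbGetQuery(con, "SELECT id, time_hour FROM df_dates ORDER BY id")
print(class(back$time_hour))

# Filter directly on the timestamp in SQL, no character conversion needed
late <- dbGetQuery(
  con,
  "SELECT id FROM df_dates WHERE time_hour >= TIMESTAMP '2013-11-01 00:00:00'"
)
print(late)

dbDisconnect(con, shutdown = TRUE)
```

Because the column is stored as a real timestamp, comparisons like the one in the `WHERE` clause work on the database side without any conversion step in R.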

Some querying

Running some queries

dplyr

It already works nicely with dplyr.

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   collect()

# A tibble: 3 x 2
  Species    `min(Sepal.Width)`
  <chr>                   <dbl>
1 virginica                 2.2
2 setosa                    2.3
3 versicolor                2

> tbl(con_duck, 'iris') %>%
+   group_by(Species) %>%
+   summarise(min(Sepal.Width)) %>%
+   show_query()

<SQL>
SELECT "Species", MIN("Sepal.Width") AS "min(Sepal.Width)"
FROM "iris"
GROUP BY "Species"

sql

Run query as a string

dbGetQuery(con_duck, 'SELECT "Species", MIN("Sepal.Width") FROM iris GROUP BY "Species"')

     Species min(Sepal.Width)
1  virginica              2.2
2     setosa              2.3
3 versicolor              2.0

As with all DBI data sources, if the query is more complex, we can write it comfortably in an external file and run it like this, for example:

dbGetQuery(con_duck,readr::read_file('~/Documents/scripts/script.sql'))
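Here is a self-contained sketch of the same idea that can be run as is. It writes a query to a temporary file, reads it back with base R and sends it to DuckDB. The temporary file and the alias `min_sepal_width` are assumptions made for illustration, not part of the original script.

```r
library(DBI)

# Assumption: duckdb is installed; in-memory database with the iris table
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con, "iris", iris)

# Write an illustrative query to a temporary .sql file
sql_file <- tempfile(fileext = ".sql")
writeLines(c(
  'SELECT "Species", MIN("Sepal.Width") AS min_sepal_width',
  'FROM iris',
  'GROUP BY "Species"',
  'ORDER BY "Species"'
), sql_file)

# Read the file back as a single string and run it
query <- paste(readLines(sql_file), collapse = "\n")
res <- dbGetQuery(con, query)
print(res)

dbDisconnect(con, shutdown = TRUE)
```

Using `readLines()` plus `paste()` avoids the readr dependency; `readr::read_file()` as used above does the same job in one call.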

“Little” benchmarks

Collecting this big data frame

This makes little sense in practice but gives some idea of read speed. We collect df_test into memory from DuckDB, MonetDBLite and SQLite.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test')),
+   times = 5)
Unit: seconds
 expr     min       lq     mean   median       uq     max neval
    a 3.58703 3.632507 3.763129 3.676669 3.725148 4.19429     5

> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test')),
+   times = 5)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    b 973.1111 990.3699 1003.417 1010.651 1013.858 1029.097     5

> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test')),
+   times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    d 52.08785 52.08785 52.08785 52.08785 52.08785 52.08785     1

Really good!

Simple count

Count then collect aggregate rows.

> microbenchmark::microbenchmark(
+   a = collect(tbl(con_duck, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
    a 50.18014 53.24197 54.87532 54.68203 57.09206 58.94873    20

> microbenchmark::microbenchmark(
+   b = collect(tbl(con_monet, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: milliseconds
 expr     min       lq     mean   median       uq     max neval
    b 151.729 157.9267 160.5727 160.8815 163.8343 167.477    20

> microbenchmark::microbenchmark(
+   d = collect(tbl(con_sqlite, 'df_test') %>% count(year, month)),
+   times = 20)
Unit: seconds
 expr      min       lq     mean  median       uq      max neval
    d 2.167202 2.196288 2.205281 2.20486 2.216594 2.253606    20

Faster!

It remains to test joins, filters, sorts, etc.
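As a starting point for such tests, here is a small sketch of a join, a filter and a sort sent as plain SQL. It is not a benchmark. The two toy tables `flights_small` and `carriers_small` and their values are made up for illustration, and the duckdb and DBI packages are assumed to be installed.

```r
library(DBI)

# Assumption: duckdb is installed; in-memory database with two toy tables
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Made-up data, loosely shaped like the flights table
flights_small <- data.frame(
  carrier = c("UA", "DL", "UA", "AA"),
  dep_delay = c(-4, 51, 17, 32),
  stringsAsFactors = FALSE
)
carriers_small <- data.frame(
  carrier = c("UA", "DL", "AA"),
  name = c("United", "Delta", "American"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "flights_small", flights_small)
dbWriteTable(con, "carriers_small", carriers_small)

# Join on carrier, keep delayed flights only, then sort by mean delay
res <- dbGetQuery(con, "
  SELECT c.name, COUNT(*) AS n, AVG(f.dep_delay) AS mean_delay
  FROM flights_small f
  JOIN carriers_small c ON f.carrier = c.carrier
  WHERE f.dep_delay > 0
  GROUP BY c.name
  ORDER BY mean_delay DESC
")
print(res)

dbDisconnect(con, shutdown = TRUE)
```

Wrapping the `dbGetQuery()` call in `microbenchmark::microbenchmark()` on a larger table, as in the sections above, would turn this into a real comparison between backends.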

Information

I find that there is not much communication for the moment about this work and its R bindings, so I wrote this post to highlight it.

MonetDBLite’s speed is amazing; will you give DuckDB a try?

In any case thanks to DuckDB developers and welcome to the new duck.

See here https://github.com/cwida/duckdb.

The DuckDB developers’ point of view on data management and explanations about the “duck” can be found here.

Here we can read more information on ALTREP, MonetDBLite and DuckDB, and the reasons why MonetDBLite was finally abandoned (“RIP MonetDBLite”).

