Programming with data.table

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Getting started with multiple bare variable names in data.table functions

Flexible functions in data.table

I’m getting slightly more experienced with data.table, and I really like it.

My learning method was to get pretty deep for a month, reading everything I could and replicating my dplyr code in data.table.

I then stopped using it for a month, and carried on with dplyr.

Then I tried switching back to data.table again. Some of it stuck, some of it didn’t, but I persevered. I’m still struggling with joining tables (for some reason the default right joins really throw my mental model), but I really enjoy working with it, and I know there is a lot more for me to learn.

When used interactively, there are some nice little shortcuts that allow you to explore a dataset reasonably quickly, and I have been able to create some little helper functions without too much effort.

However, I am passing in column names wrapped in quotes, which shouldn’t really be a big deal, but working with dplyr for so long has spoiled me.

So this post is a way to note some potential ways round it.

N.B. I’m not a data.table expert, and some of this is probably horrendous. Use the comments below, or reach out otherwise, and educate me. It will be appreciated.

Let’s get set up with the flights dataset:

library(nycflights13)
library(data.table)
data(flights) # bring flights into the environment
setDT(flights)

Normal use and a brief .SD explainer

flights[, head(.SD, 5), .SDcols = 'dep_delay']
##    dep_delay
## 1:         2
## 2:         4
## 3:         2
## 4:        -1
## 5:        -6

This does nothing earth-shattering, just grabbing the first few rows from the ‘dep_delay’ column. .SD stands for “Subset of Data”: inside the square brackets it is itself a data.table holding the columns being worked on, and I specify which columns those are with .SDcols (note, not .SDCols as my brain seems to want to type).
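To make the .SD idea a bit more concrete, here is a minimal sketch on a made-up four-row table (toy, grp, x and y are my own illustrative names, not part of flights). It assumes only that data.table is installed, and shows .SD limited by .SDcols, first on the whole table and then once per group with by.

```r
library(data.table)

# Toy table, invented for illustration (not the flights data)
toy <- data.table(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 2, 3, 4),
  y   = c(10, 20, 30, 40)
)

# .SD restricted to x and y; head() then keeps the first row
first_row <- toy[, head(.SD, 1), .SDcols = c("x", "y")]

# With by, .SD is rebuilt for each group, so this gives the
# mean of x and y within each value of grp
group_means <- toy[, lapply(.SD, mean), by = grp, .SDcols = c("x", "y")]

print(first_row)
print(group_means)
```

The grouping column is not part of .SD, which is why grp never needs to be listed in .SDcols here.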

You can of course pass in multiple column names like this:

flights[, head(.SD, 5), .SDcols = c('dep_delay', 'carrier', 'sched_dep_time')]
##    dep_delay carrier sched_dep_time
## 1:         2      UA            515
## 2:         4      UA            529
## 3:         2      AA            540
## 4:        -1      B6            545
## 5:        -6      DL            600

Or you can do this:

columns_of_interest <- c('dep_delay', 'carrier', 'sched_dep_time')
flights[, head(.SD, 5), .SDcols = columns_of_interest]
##    dep_delay carrier sched_dep_time
## 1:         2      UA            515
## 2:         4      UA            529
## 3:         2      AA            540
## 4:        -1      B6            545
## 5:        -6      DL            600

Single column functions – quoted column names

Of course we don’t want to have to do this repeatedly so we can create a function.

Here is a simple one, which will return unique values for a column of our choosing. There are a few ways we can do this by passing in a quoted column name:

unique_dots <- function(DT, target_col) {
  vec <- unique(DT[, ..target_col])
  vec
}

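See the two dots before ‘target_col’ in the function body. That’s the magic right there: the .. prefix tells data.table to look up target_col as a variable in the calling scope, rather than as a column of DT. Don’t believe me?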

unique_dots(flights, 'dep_delay')
##      dep_delay
##   1:         2
##   2:         4
##   3:        -1
##   4:        -6
##   5:        -4
##  ---
## 524:       358
## 525:       602
## 526:       593
## 527:      1014
## 528:       422
unique_dots(flights, 'sched_dep_time')
##       sched_dep_time
##    1:            515
##    2:            529
##    3:            540
##    4:            545
##    5:            600
##   ---
## 1017:           1058
## 1018:            516
## 1019:           2153
## 1020:           2246
## 1021:           2208
unique_dots(flights, 'carrier')
##     carrier
##  1:      UA
##  2:      AA
##  3:      B6
##  4:      DL
##  5:      EV
##  6:      MQ
##  7:      US
##  8:      WN
##  9:      VX
## 10:      FL
## 11:      AS
## 12:      9E
## 13:      F9
## 14:      HA
## 15:      YV
## 16:      OO

Cool, we have a function that works.

But wait, we can also do this:

# using with = FALSE
unique_with <- function(DT, target_col) {
  vec <- unique(DT[, target_col, with = FALSE])
  vec
}
unique_with(flights, 'dep_delay')
##      dep_delay
##   1:         2
##   2:         4
##   3:        -1
##   4:        -6
##   5:        -4
##  ---
## 524:       358
## 525:       602
## 526:       593
## 527:      1014
## 528:       422
unique_with(flights, 'sched_dep_time')
##       sched_dep_time
##    1:            515
##    2:            529
##    3:            540
##    4:            545
##    5:            600
##   ---
## 1017:           1058
## 1018:            516
## 1019:           2153
## 1020:           2246
## 1021:           2208
unique_with(flights, 'carrier')
##     carrier
##  1:      UA
##  2:      AA
##  3:      B6
##  4:      DL
##  5:      EV
##  6:      MQ
##  7:      US
##  8:      WN
##  9:      VX
## 10:      FL
## 11:      AS
## 12:      9E
## 13:      F9
## 14:      HA
## 15:      YV
## 16:      OO

And a cursory check that the results are the same for both functions:

all.equal(unique_dots(flights,'dep_delay'),unique_with(flights,'dep_delay'))
## [1] TRUE

Well, that all seems marvellous.
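For anyone wondering what happens when a variable and a column share a name, here is a small sketch on a made-up table (toy and its columns are purely illustrative, and the column called cols is there deliberately to collide with the variable). As far as I can tell, both the .. prefix and with = FALSE use the value of the variable, so the clashing column is left alone.

```r
library(data.table)

# Toy table with a column deliberately named like the variable below
toy <- data.table(cols = 1:3, a = 4:6, b = c("x", "y", "z"))

# A character vector of column names, held in a variable called cols
cols <- c("a", "b")

# Both forms use the *value* of cols, i.e. they select columns a and b
via_dots <- toy[, ..cols]
via_with <- toy[, cols, with = FALSE]

print(via_dots)
print(all.equal(via_dots, via_with))
```

This also shows that either approach accepts a vector of several quoted names, not just a single one.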

But wait, there’s even more. We can pass in a quoted column name and use ‘get’. Note, I wrapped the call to get in .() to return a data.table, rather than a vector. The price is that the column comes back named V1 instead of keeping its original name.

unique_get <- function(DT, target_col) {
  vec <- unique(DT[, .(get(target_col))]) # ugly but returns a DT
  vec
}

A marginally less horrible way would be this, which returns a vector:

unique_get2 <- function(DT, target_col) {
  vec <- unique(DT[, get(target_col)])
  vec
}

Anyway, despite the hideousness, it still works:

unique_get(flights, 'dep_delay')
##        V1
##   1:    2
##   2:    4
##   3:   -1
##   4:   -6
##   5:   -4
##  ---
## 524:  358
## 525:  602
## 526:  593
## 527: 1014
## 528:  422
unique_get(flights, 'sched_dep_time')
##         V1
##    1:  515
##    2:  529
##    3:  540
##    4:  545
##    5:  600
##   ---
## 1017: 1058
## 1018:  516
## 1019: 2153
## 1020: 2246
## 1021: 2208
unique_get(flights, 'carrier')
##     V1
##  1: UA
##  2: AA
##  3: B6
##  4: DL
##  5: EV
##  6: MQ
##  7: US
##  8: WN
##  9: VX
## 10: FL
## 11: AS
## 12: 9E
## 13: F9
## 14: HA
## 15: YV
## 16: OO
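The V1 header in that output bugs me a little, so here is one possible way to hang on to the original column name. This is only a sketch on a made-up table: unique_get_named and unique_vec are hypothetical names of my own, and there may well be tidier idioms. The first wraps get() in a list that is named with setNames(), the second sidesteps j altogether and pulls the column out as a vector with [[ ]].

```r
library(data.table)

# Toy table, invented for illustration
toy <- data.table(
  carrier = c("UA", "AA", "UA", "B6"),
  delay   = c(2, 4, 2, -1)
)

# Hypothetical variant of unique_get: name the list returned from j,
# so the result keeps the column name instead of V1
unique_get_named <- function(DT, target_col) {
  unique(DT[, setNames(list(get(target_col)), target_col)])
}

# Hypothetical variant of unique_get2: [[ ]] extracts the column as a vector
unique_vec <- function(DT, target_col) {
  unique(DT[[target_col]])
}

res <- unique_get_named(toy, "carrier")
print(res)
print(unique_vec(toy, "delay"))
```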

Enough of this. Give me multiple unquoted column names

No, I will not do that. Instead, have a function that takes a single unquoted column name:

bare_col <- function(dt, n, target_col) {
  target_col <- deparse(substitute(target_col))
  dt[, head(.SD, n), .SDcols = target_col]
}

If you are thinking, “Dude, this is standard base R stuff” then yes, you are correct. Which is kind of the point.. Does it work? Oh yes..

bare_col(flights, 5, dep_delay)
##    dep_delay
## 1:         2
## 2:         4
## 3:         2
## 4:        -1
## 5:        -6
bare_col(flights, 20, origin)
##     origin
##  1:    EWR
##  2:    LGA
##  3:    JFK
##  4:    JFK
##  5:    LGA
##  6:    EWR
##  7:    EWR
##  8:    LGA
##  9:    JFK
## 10:    LGA
## 11:    JFK
## 12:    JFK
## 13:    JFK
## 14:    EWR
## 15:    LGA
## 16:    JFK
## 17:    EWR
## 18:    LGA
## 19:    LGA
## 20:    EWR
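In case the deparse(substitute()) step looks like a magic spell, here it is on its own in base R, with no data.table involved. show_name is a hypothetical helper of mine, used only to show what string the function ends up with.

```r
# Base R only: substitute() captures the expression the caller typed,
# and deparse() turns that expression into a string
show_name <- function(x) deparse(substitute(x))

show_name(dep_delay)   # the bare name becomes "dep_delay"

# Caveat: the argument is never evaluated, so a variable that holds a
# column name is captured by its own name, not by its value
my_col <- "dep_delay"
show_name(my_col)      # gives "my_col", not "dep_delay"
```

The caveat matters for bare_col above: passing it a variable that holds a column name would make it look for a column with the variable's name, so the quoted-name functions are still the ones to reach for when the column is chosen programmatically.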

I literally hate you. Give me multiple unquoted columns now..

Well, seeing as you asked nicely.. As a reminder, we can do this kind of thing with quotes:

flights[, head(.SD, 10), .SDcols = c('origin', 'distance', 'tailnum')]
##     origin distance tailnum
##  1:    EWR     1400  N14228
##  2:    LGA     1416  N24211
##  3:    JFK     1089  N619AA
##  4:    JFK     1576  N804JB
##  5:    LGA      762  N668DN
##  6:    EWR      719  N39463
##  7:    EWR     1065  N516JB
##  8:    LGA      229  N829AS
##  9:    JFK      944  N593JB
## 10:    LGA      733  N3ALAA

And we can do this..

getcols <- function(dt, n, ...) {
  sdcols <- eval(substitute(alist(...)))
  sdcols <- sapply(as.list(sdcols), deparse)
  dt[, head(.SD, n), .SDcols = sdcols]
}

And look, no quotes necessary:

getcols(flights, 10, origin, distance, tailnum)
##     origin distance tailnum
##  1:    EWR     1400  N14228
##  2:    LGA     1416  N24211
##  3:    JFK     1089  N619AA
##  4:    JFK     1576  N804JB
##  5:    LGA      762  N668DN
##  6:    EWR      719  N39463
##  7:    EWR     1065  N516JB
##  8:    LGA      229  N829AS
##  9:    JFK      944  N593JB
## 10:    LGA      733  N3ALAA
getcols(flights, 20, dep_time, sched_dep_time, carrier)
##     dep_time sched_dep_time carrier
##  1:      517            515      UA
##  2:      533            529      UA
##  3:      542            540      AA
##  4:      544            545      B6
##  5:      554            600      DL
##  6:      554            558      UA
##  7:      555            600      B6
##  8:      557            600      EV
##  9:      557            600      B6
## 10:      558            600      AA
## 11:      558            600      B6
## 12:      558            600      B6
## 13:      558            600      UA
## 14:      558            600      UA
## 15:      559            600      AA
## 16:      559            559      B6
## 17:      559            600      UA
## 18:      600            600      B6
## 19:      600            600      MQ
## 20:      601            600      B6

[Image: 2020-01-20-boom.gif]

This also works (the same thing, minus the as.list call):

getcols2 <- function(dt, n, ...) {
  sdcols <- eval(substitute(alist(...)))
  sdcols <- sapply(sdcols, deparse)
  dt[, head(.SD, n), .SDcols = sdcols]
}
getcols2(flights, 10, origin, distance, tailnum)
##     origin distance tailnum
##  1:    EWR     1400  N14228
##  2:    LGA     1416  N24211
##  3:    JFK     1089  N619AA
##  4:    JFK     1576  N804JB
##  5:    LGA      762  N668DN
##  6:    EWR      719  N39463
##  7:    EWR     1065  N516JB
##  8:    LGA      229  N829AS
##  9:    JFK      944  N593JB
## 10:    LGA      733  N3ALAA
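Putting the two halves of this post together, here is a rough sketch of the unique function from earlier, reworked to take any number of bare column names. unique_bare and the toy table are hypothetical names of mine, and capturing the dots with substitute(list(...)) is just one of several base R ways to do it, so treat this as a starting point rather than the right answer.

```r
library(data.table)

# Toy table, invented for illustration
toy <- data.table(
  origin  = c("EWR", "LGA", "EWR", "JFK"),
  carrier = c("UA", "UA", "UA", "B6"),
  delay   = c(2, 4, 2, -1)
)

# Hypothetical helper: unique combinations of any number of bare column names
unique_bare <- function(dt, ...) {
  # substitute(list(...)) captures the unevaluated arguments as a call;
  # dropping its first element (the symbol `list`) leaves the column symbols
  args <- as.list(substitute(list(...)))[-1]
  # Turn each symbol into a string, one string per column
  cols <- vapply(args, deparse, character(1))
  # The .. prefix picks up cols from this function's scope
  unique(dt[, ..cols])
}

res <- unique_bare(toy, origin, carrier)
print(res)
```

Using vapply rather than sapply is a small safety net: it insists on exactly one string per argument, so something odd passed through the dots fails loudly instead of quietly producing a list.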

Again, usual disclaimers apply. I’m not a data.table expert. Indeed I’m not even a full-time R user, much to my general displeasure. Which is why I’m faffing about with this at midnight on a Sunday. Anyway, I digress… there are no doubt a load of better ways of doing this, but this will hopefully serve as a starter. If you have better ways of creating a flexible function that will accept multiple unknown columns, don’t be shy in sharing them.

Thanks 🙂

Until then, I’ll be getting down with my newfound flexi function ability:

[Image: 2020-01-20-mac.gif]
