Quantcast
Channel: R-bloggers
Viewing all articles
Browse latest Browse all 12095

Imputing missing values in parallel using {furrr}

$
0
0

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Today I saw this tweet on my timeline:

For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.

Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew

— Davis Vaughan (@dvaughan32) April 13, 2018

and as a heavy {purrr} user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out {furrr}. For those unfamiliar with {purrr}, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: {purrr} contains so-called higher order functions, which are functions that take other functions as argument. One such function is map(). Consider the following simple example:

numbers <- seq(1, 10)

If you want the square root of this numbers, you can of course simply use the sqrt() function, because it is vectorized:

sqrt(numbers)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if sqrt() was not vectorized:

sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
  sqrt_numbers[i] <- sqrt(numbers[i])
}

First, you need to initialize a container, and then you have to populate the sqrt_numbers list with the results. Using, {purrr} is way easier:

library(tidyverse)
map(numbers, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 2.645751
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 3
## 
## [[10]]
## [1] 3.162278

map() is only one of the nice functions that are bundled inside {purrr}. Mastering {purrr} can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice} package. {mice} comes with an example dataset called boys, let’s take a look at it:

library(mice)

data(boys)

brotools::describe(boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 9 x 12
##   variable type    n_missing   mean    sd mode      min   max   q25 median
##                         
## 1 age      Numeric         0   9.16  6.89    0.0350  21.2  1.58   10.5
## 2 bmi      Numeric        21  18.1   3.05   11.8     31.7 15.9    17.4
## 3 hc       Numeric        46  51.5   5.91   33.7     65.0 48.1    53.0
## 4 hgt      Numeric        20 132    46.5    50.0    198   84.9   147  
## 5 tv       Numeric       522  11.9   7.99    1.00    25.0  4.00   12.0
## 6 wgt      Numeric         4  37.2  26.0     3.14   117   11.7    34.7
## 7 gen      Factor        503  NA    NA      NA       NA   NA      NA  
## 8 phb      Factor        503  NA    NA      NA       NA   NA      NA  
## 9 reg      Factor          3  NA    NA    south NA       NA   NA      NA  
## # ... with 2 more variables: q75 , n_unique 

In the code above I use the describe() function from my personal package to get some summary statistics of the boys dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the mice function, it is very easy to impute them:

start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)
## Time difference of 3.290611 mins

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The mice() function returns a mids object. If you want to look at the data, you have to use the complete() function (careful, there is also a complete() function in the {tidyr} package, so to avoid problems, I suggest you explicitely call mice::complete()):

imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 11 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##                         
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.03   11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.3   53.2 
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146   
##  5 tv       Numer…         0   8.39  8.09    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6 
##  7 .id      Factor         0  NA    NA    3     NA       NA   NA     NA   
##  8 .imp     Factor         0  NA    NA    1     NA       NA   NA     NA   
##  9 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
## 10 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 11 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 , n_unique 

As expected, no more missing values. The “long” argument inside mice::complete() is needed if you want the complete() function to return a long dataset. Doing the above “manually” using {purrr} is possible with the following code:

start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 3.393966 mins

What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE) to a list of 1s, and creates 10 imputed data sets. m = . means that m will be equal to whatever is inside the list we are mapping our function over, so 1, then 1 then another 1 etc…. It took around the same amount of time as using mice() directly.

imp_boys_purrr is now a list of 10 mids objects. We thus need to map mice::complete() to imp_boys_purrr to get the data:

imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)

Now, imp_boys_purrr_complete is a list of 10 datasets. Let’s map brotools::describe() to it:

map(imp_boys_purrr_complete, brotools::describe)
## [[1]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.90   33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   84.0  146    175  
## 5 tv       Numeric   8.35  8.00    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0     3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[2]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.88   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.5  145    175  
## 5 tv       Numeric   8.37  8.02    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.9   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P2    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[3]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.87   33.7     65.0 48.5   53.3   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  145    175  
## 5 tv       Numeric   8.46  8.14    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[4]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.02   11.8     31.7 15.9   17.5   19.4
## 3 hc       Numeric  51.7   5.93   33.7     65.0 48.5   53.4   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   82.9  145    175  
## 5 tv       Numeric   8.45  8.11    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0     3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[5]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03   11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.91   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  146    175  
## 5 tv       Numeric   8.21  8.02    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[6]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.89   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   83.0  146    175  
## 5 tv       Numeric   8.44  8.24    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[7]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.88   33.7     65.0 48.2   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.5  146    175  
## 5 tv       Numeric   8.47  8.15    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[8]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04   11.8     31.7 15.9   17.4   19.4
## 3 hc       Numeric  51.6   5.85   33.7     65.0 48.2   53.3   56.0
## 4 hgt      Numeric 131    46.5    50.0    198   83.0  146    175  
## 5 tv       Numeric   8.36  8.06    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[9]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.90   33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.9  146    175  
## 5 tv       Numeric   8.57  8.25    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 
## 
## [[10]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##                     
## 1 age      Numeric   9.16  6.89    0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04   11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.89   33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.6    50.0    198   83.0  146    175  
## 5 tv       Numeric   8.49  8.18    1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1     3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing , n_unique 

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of purrr::map(), called map2():

imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))

map2() applies a function, say f(), to 2 lists sequentially: f(x_1, y_1), then f(x_2, y_2), etc… So here I map mutate() to create a new column, imp_id in each dataset. Now let’s bind the rows and take a look at the data:

imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##                         
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.04   11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.3   53.2 
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146   
##  5 tv       Numer…         0   8.42  8.11    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6 
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA   
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 , n_unique 

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using purrr::map() and simply use furrr::future_map() to parallelize the imputation process:

library(furrr)
## Loading required package: future
plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 33.73772 secs

Boooom! Much faster! And simply by loading {furrr}, then using plan(multiprocess) to run the code in parallel (if you forget that, the code will run on a single core) and using future_map() instead of map().

Let’s take a look at the data:

imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##                         
##  1 age      Numer…         0   9.16  6.89    0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.04   11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89   33.7     65.0 48.4   53.2 
##  4 hgt      Numer…         0 131    46.5    50.0    198   83.0  146   
##  5 tv       Numer…         0   8.35  8.09    1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0     3.14   117   11.7   34.6 
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA   
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 , n_unique 

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with {furrr} you must install it from Github, as it is not yet available on CRAN:

devtools::install_github("DavisVaughan/furrr")

If you are not comfortable with map() (and thus future_map()) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:

devtools::install_github("b-rodrigues/parlMICE")

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Viewing all articles
Browse latest Browse all 12095

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>