window blocking performance #70

@bengoehring

Hello,

Thank you for making and maintaining such a helpful package.

I am reaching out with a conceptual question about the window blocking option in blockData, along with a possible performance improvement. Both stem from trying and failing to window block two datasets, one with a few million rows and one with about 10 million rows, on a cluster with 10 cores and 180GB of RAM. The job timed out after 24 hours.

Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists, where N is the number of unique values of the window blocking variable in dataset A. Each of the N lists would contain two vectors of indices. The first vector would hold the indices of dataset A where the window blocking variable equals the nth unique value of that variable in dataset A. The second vector would hold the indices of dataset B where the window blocking variable is within ±K of that nth unique value.

I hope that makes sense.
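
To make the expected structure concrete, here is a minimal base R sketch on made-up toy vectors. The names a_vals, b_vals, and k and the values themselves are assumptions for illustration only. This is not how blockData works internally, only the output shape I would expect from the documentation:

```r
# Toy values of the blocking variable (hypothetical data, not from diamonds)
a_vals <- c(1990, 1991, 1991, 1995)
b_vals <- c(1989, 1990, 1992, 1996, 2000)
k <- 1  # window size (assumed)

# One block per unique value of the blocking variable in dataset A
a_unique <- sort(unique(a_vals))
expected_blocks <- lapply(a_unique, function(v) {
  list(
    dfA.inds = which(a_vals == v),           # rows of A equal to v
    dfB.inds = which(abs(b_vals - v) <= k)   # rows of B within +/- k of v
  )
})

length(expected_blocks)  # 3, one block per unique value in A
expected_blocks[[2]]     # dfA.inds: 2 3, dfB.inds: 2 3
```

With three unique values in A, this gives three blocks, and the last row of B (2000) falls in none of them because it is outside every window.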

It appears, however, that the window blocking option is doing something different. For instance, if I run:

library(tidyverse)
library(microbenchmark)
library(fastLink)

# make test datasets for window blocking
test_a <- diamonds
test_b <- arrange(diamonds, cut)

window_fast <- blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth')

length(unique(test_a$depth))
#184

length(window_fast)
#3390

The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?

I am guessing I am just misunderstanding something, but in case the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears to outperform the default window blocking option in speed (~50 times faster in this example). Please let me know if I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!

my_window_blocking <- function(data_a,
                               data_b,
                               window_blocking_var,
                               window_size = 1) {
  
  # return the vector containing the window blocking variable in each dataset 
  data_a_window_values <- pull(select(data_a,
                                      {{window_blocking_var}}))
  data_b_window_values <- pull(select(data_b,
                                      {{window_blocking_var}}))
  
  # unique values of the first dataset's window blocking variable for looping
  data_a_unique_vals <- sort(unique(data_a_window_values))
  
  # find a and b indices 
  out_a_indices <- vector('list',
                          length = length(data_a_unique_vals))
  out_b_indices <- vector('list',
                          length = length(data_a_unique_vals))
  
  for (i in seq_along(data_a_unique_vals)) {
    # a indices
    out_a_indices[[i]] <- which(data_a_window_values == data_a_unique_vals[i])
    
    # b indices
    min_window_value <- data_a_unique_vals[i] - window_size
    max_window_value <- data_a_unique_vals[i] + window_size
    
    # range comparison, so non-integer values inside the window also match
    out_b_indices[[i]] <- which(data_b_window_values >= min_window_value &
                                  data_b_window_values <= max_window_value)
  }
  
  # return final data
  final_list <- vector('list',
                       length = length(data_a_unique_vals))
  
  for (k in seq_along(data_a_unique_vals)) {
    final_list[[k]][[1]] <- out_a_indices[[k]]
    final_list[[k]][[2]] <- out_b_indices[[k]]
    
    names(final_list[[k]]) <- c('dfA.inds', 
                                'dfB.inds')
  }
  return(final_list)
}

microbenchmark(my_window_blocking(test_a, 
                                  test_b,
                                  depth),
               blockData(test_a, 
                         test_b,
                         varnames = "depth",
                         window.block = 'depth'),
               times = 10)

Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.

Best,
Ben
