Hello,
Thank you for making and maintaining such a helpful package.
I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows and a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. It timed out after 24 hours.
Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is +/- K the nth unique value of the window blocking variable in dataset A.
I hope that makes sense.
It appears, however, that the window blocking option is doing something different. For instance, If I run:
library(tidyverse)
library(microbenchmark)
library(fastLink)
# make test datasets for window blocking
test_a <- diamonds
test_b <- arrange(diamonds, cut)
window_fast <- blockData(test_a,
test_b,
varnames = "depth",
window.block = 'depth')
length(unique(test_a$depth))
#84
length(window_fast)
#3390
The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?
I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know If I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!
my_window_blocking <- function(data_a,
data_b,
window_blocking_var,
window_size = 1) {
# return the vector containing the window blocking variable in each dataset
data_a_window_values <- pull(select(data_a,
{{window_blocking_var}}))
data_b_window_values <- pull(select(data_b,
{{window_blocking_var}}))
# unique values of the first datasets window blocking variable for looping
data_a_unique_vals <- sort(unique(data_a_window_values))
# find a and b indices
out_a_indices <- vector('list',
length = length(data_a_unique_vals))
out_b_indices <- vector('list',
length = length(data_a_unique_vals))
for(i in 1:length(data_a_unique_vals)) {
# a indices
out_a_indices[[i]] <- which(data_a_window_values == data_a_unique_vals[i])
# b indices
min_window_value <- data_a_unique_vals[i] - window_size
max_window_value <- data_a_unique_vals[i] + window_size
out_b_indices[[i]] <- which(data_b_window_values %in% seq(min_window_value,
max_window_value))
}
# return final data
final_list <- vector('list',
length = length(data_a_unique_vals))
for (k in 1:length(data_a_unique_vals)) {
final_list[[k]][[1]] <- out_a_indices[[k]]
final_list[[k]][[2]] <- out_b_indices[[k]]
names(final_list[[k]]) <- c('dfA.inds',
'dfB.inds')
}
return(final_list)
}
microbenchmark(my_window_blocking(test_a,
test_b,
depth),
blockData(test_a,
test_b,
varnames = "depth",
window.block = 'depth'),
times = 10)
Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.
Best,
Ben
Hello,
Thank you for making and maintaining such a helpful package.
I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows and a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. It timed out after 24 hours.
Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is +/- K the nth unique value of the window blocking variable in dataset A.
I hope that makes sense.
It appears, however, that the window blocking option is doing something different. For instance, If I run:
The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?
I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know If I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!
Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.
Best,
Ben