Adds function to validate crosswalk geoids#520
Conversation
Adds a function to check that all GEOIDs in a crosswalk file are present in a corresponding metric file. The function supports flexible column naming, filtering by year, and provides informative warnings if missing GEOIDs are found.
There was a problem hiding this comment.
Thank you for the great work on developing this check! The function works really well and I have some thoughts that I'd love your input on.
- If I provide a function with a year that doesn't exist in the metric file, it prints the entire list of missing GEOIDs for that year. It might be easier to understand the output from a user perspective if there was a check that verifies whether the year exists in the metric file or not. This way we can avoid printing hundreds of GEOIDs and provide the user with a more informative message i.e. they haven't provided data for the year they want to verify.
- If the user provides an incorrect column name, then an error is thrown which says "object" is not found. It be clearer to inform the user that the column they want to test for doesn't exist in the data so they can fix the function call/data.
- I wonder if the function could be a little more specific in terms of verifying year based GEOIDs. For example, if a GEOID doesn't exist in "pre-2022" crosswalk and the metric file also has only "pre-2022" years, the function will still flag that GEOID as missing because it is present for the "2022" crosswalk. Does it make sense for the function to distinguish between years the user wants to test and corresponding GEOIDs? Or do you think it is preferable to have the user figure this out?
Please let me know if I can clarify anything. Thanks again for putting this together!
- Added new argument `crosswalk_years` to explicitly define the valid years a crosswalk applies to - If both `crosswalk_years` and `years` are supplied, the function now: - Warns if any metric years fall outside the crosswalk's valid range - Filters the metric file to include only the overlapping years - Prevents false positives for missing GEOIDs due to incompatible temporal scope - Ensures user intent and awareness when working with non-standard crosswalk metadata
|
Hi @ridhi96, thank you for your thoughtful and helpful feedback! I've incorporated your suggestions into the latest version of the function:
Let me know if you think this can be improved further — appreciate your review! |
| stop("Could not infer geography columns. Expecting one of: county or place.") | ||
| } | ||
|
|
||
| crosswalk_geo_cols <- crosswalk_col_names %||% infer_geo_cols(crosswalk) |
There was a problem hiding this comment.
I don't think this is handling the comment I had about the user specifying a wrong column name. It still throws an object not found error when either code line 112 or 116 is executed.
There was a problem hiding this comment.
Would you mind sharing the example you used to get this error? It works fine for me with
result <- validate_geographies(
crosswalk_path = here::here(
"geographic-crosswalks",
"data",
"crosswalk_puma_to_county.csv"
),
metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv")
)
There was a problem hiding this comment.
Sorry I wasn't clear earlier. The issue comes up when a user misspells or provides an incorrect column name which doesn't exist in the data (either metric or crosswalk). For example, I get an error when I run the following snippet.
result <- validate_geographies(
crosswalk_path = here::here(
"geographic-crosswalks",
"data",
"crosswalk_puma_to_county.csv"
),
metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv"),
metric_col_names = "stte"
)
Output:
Error in `mutate()`:
ℹ In argument: `geoid = str_c(stte)`.
Caused by error:
! object 'stte' not found
It would be helpful to anticipate this error and inform the user accordingly.
There was a problem hiding this comment.
Thank you for clarifying! I pushed a commit with an informative error that uses fuzzy matching to say
! The following metric columns do not exist:
stte
Did you mean:
stte → state
Show Traceback
| rename(state = any_of(c("state", "statefip"))) | ||
|
|
||
| # Optional: filter metric by years | ||
| if (!is.null(years)) { |
There was a problem hiding this comment.
If I want to subset/check data for a specific year, this doesn't solve the problem of GEOIDs getting flagged as missing for crosswalks like crosswalk_puma__to_county.csv which don't distinguish geographies on an year basis. To recreate my concern, you can run this code snippet:
validate_geographies(
here::here("geographic-crosswalks","data","crosswalk_puma_to_county.csv"),
here::here("08_education", "data", "final","digital_access_county_all_longitudinal.csv"),
years = c("2018")
)
There was a problem hiding this comment.
I see what you're saying. When I run this snippet I get back:
missing_geoids
[1] "09110" "09120" "09130" "09140" "09150"
[6] "09160" "09170" "09180" "09190"
Which I believe are the CT planning regions.
However, I don't really know how to solve this without there being a year indicator in the crosswalk file. Do you have any suggestions? My instinct is to just give a warning and check whether these missing geoids are expected or not. What do you think?
There was a problem hiding this comment.
I agree that a warning here would be helpful specifically for checking years when CT planning regions don't exist.
There was a problem hiding this comment.
With the implementation of your suggestion of breaking the year comparison into two options depending on the way crosswalk is set up, I was able to get this to not return an error! Now it correctly returns that
✔ All geoids in `crosswalk_puma_to_county.csv` are present in `digital_access_county_all_longitudinal.csv`.
|
|
||
| if (!is.null(crosswalk_years)) { | ||
| out_of_scope_years <- setdiff(years, crosswalk_years) | ||
| effective_years <- intersect(years, crosswalk_years) |
There was a problem hiding this comment.
I think this effective_years piece of code could lead to bugs. For example, I executed this:
validate_geographies(
here::here("geographic-crosswalks","data","crosswalk_puma_to_county.csv"),
here::here("08_education", "data", "final","digital_access_county_all_longitudinal.csv"),
years = c("2018"),
crosswalk_years = c("2020")
)
The result was a long list of GEOIDs being printed as missing which I'm guessing is because the intersection of years and crosswalk_years is NULL. So, even if your year is in the data, and cross-walk year is valid, it could lead to hard to understand outputs. (I think this could especially cause issues with crosswalk files that aren't year specific.)
There was a problem hiding this comment.
You are right that the issue here is that the intersection between years and crosswalk_years, however that is caused because crosswalk_puma_to_county.csv does not have a year column. Instead, it uses crosswalk_period with values "2022" or "pre-2022". I can add some conditional logic to search for that column and create a vector of what those years would be. What do you think?
There was a problem hiding this comment.
I think a possible solution is to compare years in the metric data with the crosswalk years in a more direct manner. I'd suggest breaking the comparison into two options depending on the way crosswalk is set up.
- If a crosswalk has a
yearsoryearcolumn and lists years specifically then we subset metric and crosswalk data by matching the years for comparison for each unique year in the data. - If a crosswalk has a
crosswalk_periodcolumn, then we should subset metric data for each unique year and compare it against the relevant crosswalk period. Per my understanding, we only have2022andpre-2022periods in some files and other crosswalks don't have this so I like your suggestion of searching for this column and then comparing with metric. Essentially, we uniquely test each metric year in the data by checking the period it falls in.
There was a problem hiding this comment.
Great suggestion! I added this logic in my latest commit!
| #' @param crosswalk_col_names Optional character vector of column names in the crosswalk to build the GEOID. Defaults to inferred `c("state", "county")` or `c("state", "place")`. | ||
| #' @param metric_col_names Optional character vector of column names in the metric file to build the GEOID. Same logic as `crosswalk_col_names`. | ||
| #' @param years Optional numeric vector of years to restrict the check. The metric file must contain a `year` column if this is specified. | ||
| #' |
There was a problem hiding this comment.
Adding an explanation for the usage of crosswalk_years parameter like others would be great!
There was a problem hiding this comment.
Good catch! I missed this and will add documentation
Adds validation to ensure geography columns exist in both the crosswalk and metric datasets, providing informative error messages with suggestions for correction if columns are missing.
- Renamed *_col_names → *_geo_cols for clarity and consistency. - Improved documentation to clarify expected structure and typical use of geography column inputs. - Added validation to ensure both state and county/place columns are provided when user specifies custom geoid inputs. - Enhanced error messaging using cli::cli_abort() for clearer, top-level feedback. - Refactored year filtering logic into filter_by_crosswalk_year_logic() to: Handle year columns directly. Support crosswalk_period logic (e.g. "2022", "pre-2022"). Warn on unmatched periods and years. - Preserved default inference of common column names (state, county, place), with fallbacks for variants like statefip, statefp, etc.
|
Hi @ridhi96, I think this is ready for another review when you have time/funding. In particular, my last three commits:
It would be helpful to get feedback on whether a) i've fully addressed your review and b) if unit tests seem appropriate here given how complex the function is? if so, do you have any top priorities on things to test? |
ridhi96
left a comment
There was a problem hiding this comment.
Hi @malcalakovalski,
Thank you for taking my suggestions and updating the code! It works really well and the error messages are very informative!
I'm not sure if we have the budget to add unit tests, but here's a list of things I tested the function against (in case it is helpful to note this somewhere for future reference):
- Provided the function with a metric year that doesn't exist in the data file.
- Provided the function with an incorrect column name e.g., "stte" instead of "state"
- For a county metric file, provided the function with a place crosswalk file. In this case, the function printed all GEOIDs which aren't present.
- Tested a few different county and place metric files.
There's one more thing I'd like to flag but I think it's okay even if we don't address it because it will surface/be fixed elsewhere in the metric finalization process. If there's a metric with multiple years and if one of the years is missing some GEOIDs, then the function won't flag them as missing. This is because after making sure years in metric data and crosswalk file match, we compare distinct GEOIDs across the years.
I think we're pretty close to getting this through, thank you for the hard work in implementing this check!
| crosswalk_geo_cols = NULL, | ||
| metric_geo_cols = NULL, | ||
| years = NULL, | ||
| crosswalk_years = NULL) { |
There was a problem hiding this comment.
crosswalk_years parameter is not really used anywhere in the code (apart from being declared) so it's probably best to remove.
There was a problem hiding this comment.
You're right, this was leftover from a previous version of the function. I addressed it now!
| missing_geoids = list(missing_geoids) | ||
| ) %>% invisible() | ||
| } | ||
| filter_by_crosswalk_year_logic <- function(crosswalk, metric, years = NULL, crosswalk_years = NULL) { |
There was a problem hiding this comment.
Because the code is very complex, I'd suggest adding docstrings for the helper functions as well.
|
@ridhi96 Thank you for the thoughtful re-review! I have addressed your suggestion of removing the While I think your suggested unit tests sound super helpful, I opened a separate issue #537 to address that as we don't have time/budget in this round |
Closes #519.
Adds a function to check that all GEOIDs in a crosswalk file are present in a corresponding metric file. The function supports flexible column naming, filtering by year, and provides informative warnings if missing GEOIDs are found.