Skip to content

Adds function to validate crosswalk geoids#520

Open
malcalakovalski wants to merge 7 commits intomainfrom
iss519
Open

Adds function to validate crosswalk geoids#520
malcalakovalski wants to merge 7 commits intomainfrom
iss519

Conversation

@malcalakovalski
Copy link
Contributor

@malcalakovalski malcalakovalski commented Jun 18, 2025

Closes #519.

Adds a function to check that all GEOIDs in a crosswalk file are present in a corresponding metric file. The function supports flexible column naming, filtering by year, and provides informative warnings if missing GEOIDs are found.

Adds a function to check that all GEOIDs in a crosswalk file
are present in a corresponding metric file. The
function supports flexible column naming, filtering by year,
and provides informative warnings if missing GEOIDs are found.
@malcalakovalski malcalakovalski requested a review from cdsolari June 18, 2025 19:00
@malcalakovalski malcalakovalski marked this pull request as ready for review June 18, 2025 19:00
@cdsolari cdsolari requested review from ridhi96 and removed request for cdsolari June 20, 2025 14:23
Copy link
Contributor

@ridhi96 ridhi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @malcalakovalski

Thank you for the great work on developing this check! The function works really well and I have some thoughts that I'd love your input on.

  • If I provide a function with a year that doesn't exist in the metric file, it prints the entire list of missing GEOIDs for that year. It might be easier to understand the output from a user perspective if there was a check that verifies whether the year exists in the metric file or not. This way we can avoid printing hundreds of GEOIDs and provide the user with a more informative message i.e. they haven't provided data for the year they want to verify.
  • If the user provides an incorrect column name, then an error is thrown which says "object" is not found. It be clearer to inform the user that the column they want to test for doesn't exist in the data so they can fix the function call/data.
  • I wonder if the function could be a little more specific in terms of verifying year based GEOIDs. For example, if a GEOID doesn't exist in "pre-2022" crosswalk and the metric file also has only "pre-2022" years, the function will still flag that GEOID as missing because it is present for the "2022" crosswalk. Does it make sense for the function to distinguish between years the user wants to test and corresponding GEOIDs? Or do you think it is preferable to have the user figure this out?

Please let me know if I can clarify anything. Thanks again for putting this together!

- Added new argument `crosswalk_years` to explicitly define the valid years a crosswalk applies to
- If both `crosswalk_years` and `years` are supplied, the function now:
  - Warns if any metric years fall outside the crosswalk's valid range
  - Filters the metric file to include only the overlapping years
- Prevents false positives for missing GEOIDs due to incompatible temporal scope
- Ensures user intent and awareness when working with non-standard crosswalk metadata
@malcalakovalski
Copy link
Contributor Author

Hi @ridhi96, thank you for your thoughtful and helpful feedback!

I've incorporated your suggestions into the latest version of the function:

  • The function now checks whether the user-specified years exist in the metric file. If none of them do, it stops early with an informative message rather than proceeding to print a long list of missing GEOIDs. If only some of the years are valid, it warns and continues with the overlap only. This makes the output cleaner and more intuitive for users.
  • The function now accepts a new argument crosswalk_years, which allows users to explicitly specify the range of years the crosswalk applies to. This avoids having to infer the year from the filename, which—as we discussed—is unreliable.
  • If both years (from the metric) and crosswalk_years are provided, the function now:
    • Warns the user if any of the specified metric years fall outside the range of the crosswalk
    • Filters the metric file to only include data for the years that overlap with the crosswalk, to avoid reporting GEOIDs as missing when they are simply out of scope
  • This ensures that the check remains precise and avoids misleading results, while keeping the logic explicit and transparent for users.

Let me know if you think this can be improved further — appreciate your review!

stop("Could not infer geography columns. Expecting one of: county or place.")
}

crosswalk_geo_cols <- crosswalk_col_names %||% infer_geo_cols(crosswalk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is handling the comment I had about the user specifying a wrong column name. It still throws an object not found error when either code line 112 or 116 is executed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind sharing the example you used to get this error? It works fine for me with

result <- validate_geographies(
 crosswalk_path = here::here(
   "geographic-crosswalks",
   "data",
   "crosswalk_puma_to_county.csv"
 ),
 metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv")
)

Copy link
Contributor

@ridhi96 ridhi96 Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear earlier. The issue comes up when a user misspells or provides an incorrect column name which doesn't exist in the data (either metric or crosswalk). For example, I get an error when I run the following snippet.

result <- validate_geographies(
 crosswalk_path = here::here(
   "geographic-crosswalks",
   "data",
   "crosswalk_puma_to_county.csv"
 ),
 metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv"),
metric_col_names = "stte"
)

Output:

Error in `mutate()`:                                                                                                                       
ℹ In argument: `geoid = str_c(stte)`.
Caused by error:
! object 'stte' not found

It would be helpful to anticipate this error and inform the user accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarifying! I pushed a commit with an informative error that uses fuzzy matching to say

! The following metric columns do not exist:
stte

Did you mean:
stte → state
Show Traceback

rename(state = any_of(c("state", "statefip")))

# Optional: filter metric by years
if (!is.null(years)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I want to subset/check data for a specific year, this doesn't solve the problem of GEOIDs getting flagged as missing for crosswalks like crosswalk_puma__to_county.csv which don't distinguish geographies on an year basis. To recreate my concern, you can run this code snippet:

validate_geographies(
here::here("geographic-crosswalks","data","crosswalk_puma_to_county.csv"), 
here::here("08_education", "data", "final","digital_access_county_all_longitudinal.csv"), 
years = c("2018")
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're saying. When I run this snippet I get back:

missing_geoids
[1] "09110" "09120" "09130" "09140" "09150"
[6] "09160" "09170" "09180" "09190"

Which I believe are the CT planning regions.

However, I don't really know how to solve this without there being a year indicator in the crosswalk file. Do you have any suggestions? My instinct is to just give a warning and check whether these missing geoids are expected or not. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a warning here would be helpful specifically for checking years when CT planning regions don't exist.

Copy link
Contributor Author

@malcalakovalski malcalakovalski Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the implementation of your suggestion of breaking the year comparison into two options depending on the way crosswalk is set up, I was able to get this to not return an error! Now it correctly returns that

✔ All geoids in `crosswalk_puma_to_county.csv` are present in `digital_access_county_all_longitudinal.csv`.


if (!is.null(crosswalk_years)) {
out_of_scope_years <- setdiff(years, crosswalk_years)
effective_years <- intersect(years, crosswalk_years)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this effective_years piece of code could lead to bugs. For example, I executed this:

validate_geographies(
here::here("geographic-crosswalks","data","crosswalk_puma_to_county.csv"), 
here::here("08_education", "data", "final","digital_access_county_all_longitudinal.csv"), 
years = c("2018"),
crosswalk_years = c("2020")
)

The result was a long list of GEOIDs being printed as missing which I'm guessing is because the intersection of years and crosswalk_years is NULL. So, even if your year is in the data, and cross-walk year is valid, it could lead to hard to understand outputs. (I think this could especially cause issues with crosswalk files that aren't year specific.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that the issue here is that the intersection between years and crosswalk_years, however that is caused because crosswalk_puma_to_county.csv does not have a year column. Instead, it uses crosswalk_period with values "2022" or "pre-2022". I can add some conditional logic to search for that column and create a vector of what those years would be. What do you think?

Copy link
Contributor

@ridhi96 ridhi96 Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a possible solution is to compare years in the metric data with the crosswalk years in a more direct manner. I'd suggest breaking the comparison into two options depending on the way crosswalk is set up.

  • If a crosswalk has a years or year column and lists years specifically then we subset metric and crosswalk data by matching the years for comparison for each unique year in the data.
  • If a crosswalk has a crosswalk_period column, then we should subset metric data for each unique year and compare it against the relevant crosswalk period. Per my understanding, we only have 2022 and pre-2022 periods in some files and other crosswalks don't have this so I like your suggestion of searching for this column and then comparing with metric. Essentially, we uniquely test each metric year in the data by checking the period it falls in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! I added this logic in my latest commit!

#' @param crosswalk_col_names Optional character vector of column names in the crosswalk to build the GEOID. Defaults to inferred `c("state", "county")` or `c("state", "place")`.
#' @param metric_col_names Optional character vector of column names in the metric file to build the GEOID. Same logic as `crosswalk_col_names`.
#' @param years Optional numeric vector of years to restrict the check. The metric file must contain a `year` column if this is specified.
#'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding an explanation for the usage of crosswalk_years parameter like others would be great!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I missed this and will add documentation

Adds validation to ensure geography columns exist in both the
crosswalk and metric datasets, providing informative error
messages with suggestions for correction if columns are missing.
- Renamed *_col_names → *_geo_cols for clarity and consistency.
- Improved documentation to clarify expected structure and typical use of geography column inputs.
- Added validation to ensure both state and county/place columns are provided when user specifies custom geoid inputs.
- Enhanced error messaging using cli::cli_abort() for clearer, top-level feedback.
- Refactored year filtering logic into filter_by_crosswalk_year_logic() to:
Handle year columns directly.
Support crosswalk_period logic (e.g. "2022", "pre-2022").
Warn on unmatched periods and years.
- Preserved default inference of common column names (state, county, place), with fallbacks for variants like statefip, statefp, etc.
@malcalakovalski
Copy link
Contributor Author

malcalakovalski commented Aug 5, 2025

Hi @ridhi96, I think this is ready for another review when you have time/funding.

In particular, my last three commits:

  • Renamed *_col_names → *_geo_cols for clarity and consistency.
  • Improved documentation to clarify expected structure and typical use of geography column inputs.
  • Added documentation for crosswalk year
  • Implemented fuzzy matching for one a user misspecifies a column name like in your example
  • Added validation to ensure both state and county/place columns are provided when user specifies custom geoid inputs.
  • Enhanced error messaging using cli::cli_abort() for clearer, top-level feedback.
  • Refactored year filtering logic into filter_by_crosswalk_year_logic() to:
    Handle year columns directly.
    Support crosswalk_period logic (e.g. "2022", "pre-2022").
    Warn on unmatched periods and years.
  • Preserved default inference of common column names (state, county, place), with fallbacks for variants like statefip, statefp, etc.

It would be helpful to get feedback on whether a) i've fully addressed your review and b) if unit tests seem appropriate here given how complex the function is? if so, do you have any top priorities on things to test?

@awunderground awunderground changed the base branch from version2026 to tranche1 October 21, 2025 13:09
Copy link
Contributor

@ridhi96 ridhi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @malcalakovalski,

Thank you for taking my suggestions and updating the code! It works really well and the error messages are very informative!

I'm not sure if we have the budget to add unit tests, but here's a list of things I tested the function against (in case it is helpful to note this somewhere for future reference):

  • Provided the function with a metric year that doesn't exist in the data file.
  • Provided the function with an incorrect column name e.g., "stte" instead of "state"
  • For a county metric file, provided the function with a place crosswalk file. In this case, the function printed all GEOIDs which aren't present.
  • Tested a few different county and place metric files.

There's one more thing I'd like to flag but I think it's okay even if we don't address it because it will surface/be fixed elsewhere in the metric finalization process. If there's a metric with multiple years and if one of the years is missing some GEOIDs, then the function won't flag them as missing. This is because after making sure years in metric data and crosswalk file match, we compare distinct GEOIDs across the years.

I think we're pretty close to getting this through, thank you for the hard work in implementing this check!

crosswalk_geo_cols = NULL,
metric_geo_cols = NULL,
years = NULL,
crosswalk_years = NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crosswalk_years parameter is not really used anywhere in the code (apart from being declared) so it's probably best to remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this was leftover from a previous version of the function. I addressed it now!

missing_geoids = list(missing_geoids)
) %>% invisible()
}
filter_by_crosswalk_year_logic <- function(crosswalk, metric, years = NULL, crosswalk_years = NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the code is very complex, I'd suggest adding docstrings for the helper functions as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@malcalakovalski
Copy link
Contributor Author

malcalakovalski commented Oct 29, 2025

@ridhi96 Thank you for the thoughtful re-review! I have addressed your suggestion of removing the crosswalk_years argument and adding docstring documentation to the helper functions in the file.

While I think your suggested unit tests sound super helpful, I opened a separate issue #537 to address that as we don't have time/budget in this round

Base automatically changed from tranche1 to main December 22, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants