Adds function to validate crosswalk geoids by malcalakovalski · Pull Request #520 · UI-Research/mobility-from-poverty

malcalakovalski · 2025-06-18T18:59:37Z

Closes #519.

Adds a function to check that all GEOIDs in a crosswalk file are present in a corresponding metric file. The function supports flexible column naming, filtering by year, and provides informative warnings if missing GEOIDs are found.
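As a rough illustration of the core check described above, here is a minimal base-R sketch. The helper names `build_geoid()` and `find_missing_geoids()`, the default column names, and the toy data are assumptions made for illustration; they are not taken from the PR's actual code.

```r
# Hypothetical helper: build a GEOID by pasting geography columns together.
build_geoid <- function(data, geo_cols) {
  do.call(paste0, unname(as.list(data[geo_cols])))
}

# Hypothetical helper: return crosswalk GEOIDs that never appear in the metric data.
find_missing_geoids <- function(crosswalk, metric,
                                crosswalk_geo_cols = c("state", "county"),
                                metric_geo_cols = c("state", "county")) {
  crosswalk_geoids <- unique(build_geoid(crosswalk, crosswalk_geo_cols))
  metric_geoids <- unique(build_geoid(metric, metric_geo_cols))
  missing_geoids <- setdiff(crosswalk_geoids, metric_geoids)
  if (length(missing_geoids) > 0) {
    warning(length(missing_geoids),
            " crosswalk GEOID(s) missing from the metric data.",
            call. = FALSE)
  }
  missing_geoids
}

# Toy data (made up for this sketch).
crosswalk <- data.frame(
  state = c("09", "09", "10"),
  county = c("001", "003", "001"),
  stringsAsFactors = FALSE
)
metric <- data.frame(
  state = c("09", "10"),
  county = c("001", "001"),
  year = 2018,
  stringsAsFactors = FALSE
)

missing <- suppressWarnings(find_missing_geoids(crosswalk, metric))
```

Here `missing` holds the one crosswalk GEOID with no match in the toy metric data.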

ridhi96

Hi @malcalakovalski

Thank you for the great work on developing this check! The function works really well and I have some thoughts that I'd love your input on.

If I provide the function with a year that doesn't exist in the metric file, it prints the entire list of missing GEOIDs for that year. The output might be easier to understand from a user's perspective if there were a check that verifies whether the year exists in the metric file. This way we can avoid printing hundreds of GEOIDs and give the user a more informative message, i.e., that they haven't provided data for the year they want to verify.
If the user provides an incorrect column name, an error is thrown saying the object is not found. It would be clearer to inform the user that the column they want to test doesn't exist in the data so they can fix the function call or the data.
I wonder if the function could be a little more specific in verifying year-based GEOIDs. For example, if a GEOID doesn't exist in the "pre-2022" crosswalk and the metric file also has only "pre-2022" years, the function will still flag that GEOID as missing because it is present in the "2022" crosswalk. Does it make sense for the function to distinguish between the years the user wants to test and the corresponding GEOIDs? Or do you think it is preferable to have the user figure this out?

Please let me know if I can clarify anything. Thanks again for putting this together!

- Added new argument `crosswalk_years` to explicitly define the valid years a crosswalk applies to
- If both `crosswalk_years` and `years` are supplied, the function now:
  - Warns if any metric years fall outside the crosswalk's valid range
  - Filters the metric file to include only the overlapping years
  - Prevents false positives for missing GEOIDs due to incompatible temporal scope
  - Ensures user intent and awareness when working with non-standard crosswalk metadata

malcalakovalski · 2025-06-27T17:29:44Z

Hi @ridhi96, thank you for your thoughtful and helpful feedback!

I've incorporated your suggestions into the latest version of the function:

The function now checks whether the user-specified years exist in the metric file. If none of them do, it stops early with an informative message rather than proceeding to print a long list of missing GEOIDs. If only some of the years are valid, it warns and continues with the overlap only. This makes the output cleaner and more intuitive for users.
The function now accepts a new argument crosswalk_years, which allows users to explicitly specify the range of years the crosswalk applies to. This avoids having to infer the year from the filename, which—as we discussed—is unreliable.
If both years (from the metric) and crosswalk_years are provided, the function now:
- Warns the user if any of the specified metric years fall outside the range of the crosswalk
- Filters the metric file to only include data for the years that overlap with the crosswalk, to avoid reporting GEOIDs as missing when they are simply out of scope
This ensures that the check remains precise and avoids misleading results, while keeping the logic explicit and transparent for users.
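The year check described above could look roughly like the following base-R sketch. The function name `resolve_years()` and the message wording are hypothetical, and the PR's own implementation may differ.

```r
# Hypothetical helper: keep only requested years that exist in the metric data.
# Stops if none exist; warns if only some do.
resolve_years <- function(metric, years) {
  if (!"year" %in% names(metric)) {
    stop("`years` was supplied but the metric data has no `year` column.",
         call. = FALSE)
  }
  available <- unique(as.character(metric$year))
  requested <- as.character(years)
  found <- intersect(requested, available)
  if (length(found) == 0) {
    stop("None of the requested years (", paste(requested, collapse = ", "),
         ") exist in the metric data.", call. = FALSE)
  }
  absent <- setdiff(requested, found)
  if (length(absent) > 0) {
    warning("Years not found in the metric data and ignored: ",
            paste(absent, collapse = ", "), call. = FALSE)
  }
  found
}

# Toy data (made up for this sketch).
metric <- data.frame(
  geoid = c("09001", "09001"),
  year = c(2018, 2021),
  stringsAsFactors = FALSE
)

# Partial overlap: warns about 2019 and continues with 2018 only.
kept <- suppressWarnings(resolve_years(metric, c(2018, 2019)))

# No overlap: stops early instead of listing every GEOID as missing.
failed <- tryCatch(resolve_years(metric, 1999),
                   error = function(e) conditionMessage(e))
```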

Let me know if you think this can be improved further — appreciate your review!

ridhi96 · 2025-07-01T00:15:43Z

functions/testing/check_crosswalk_geoids_present.R

+    stop("Could not infer geography columns. Expecting one of: county or place.")
+  }
+
+  crosswalk_geo_cols <- crosswalk_col_names %||% infer_geo_cols(crosswalk)


I don't think this is handling the comment I had about the user specifying a wrong column name. It still throws an object not found error when either code line 112 or 116 is executed.

Would you mind sharing the example you used to get this error? It works fine for me with

result <- validate_geographies(
  crosswalk_path = here::here("geographic-crosswalks", "data", "crosswalk_puma_to_county.csv"),
  metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv")
)

Sorry I wasn't clear earlier. The issue comes up when a user misspells or provides an incorrect column name which doesn't exist in the data (either metric or crosswalk). For example, I get an error when I run the following snippet.

result <- validate_geographies(
  crosswalk_path = here::here("geographic-crosswalks", "data", "crosswalk_puma_to_county.csv"),
  metric_path = here::here("01_financial-well-being", "final", "households_house_value_race_ethnicity_all_county.csv"),
  metric_col_names = "stte"
)

Output:

Error in `mutate()`:
ℹ In argument: `geoid = str_c(stte)`.
Caused by error:
! object 'stte' not found

It would be helpful to anticipate this error and inform the user accordingly.

Thank you for clarifying! I pushed a commit with an informative error that uses fuzzy matching to say

! The following metric columns do not exist: stte
Did you mean: stte → state
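For readers curious how such a suggestion can be produced, here is a minimal sketch using edit distance from base R's `utils::adist()`. The function name `check_columns_exist()` is hypothetical, and the sketch raises the error with `stop()` rather than whatever mechanism the PR uses, so treat it as one possible approach and not the committed code.

```r
# Hypothetical helper: error early when requested columns are absent,
# suggesting the closest existing column name for each one.
check_columns_exist <- function(data, cols, label = "metric") {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) == 0) {
    return(invisible(TRUE))
  }
  # Rows are the missing names, columns are the existing names.
  distances <- utils::adist(missing_cols, names(data))
  # For each missing name, pick the existing name with the smallest edit distance.
  suggestions <- names(data)[apply(distances, 1, which.min)]
  stop(
    "The following ", label, " columns do not exist: ",
    paste(missing_cols, collapse = ", "),
    "\nDid you mean: ",
    paste0(missing_cols, " -> ", suggestions, collapse = ", "),
    call. = FALSE
  )
}

# Toy data (made up for this sketch).
metric <- data.frame(
  state = "09", county = "001", year = 2018,
  stringsAsFactors = FALSE
)

msg <- tryCatch(check_columns_exist(metric, "stte"),
                error = function(e) conditionMessage(e))
ok <- check_columns_exist(metric, c("state", "county"))
```

Running the check before any `mutate()` call means the user sees a message about the column name and not a low-level "object not found" error.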

ridhi96 · 2025-07-01T00:19:21Z

functions/testing/check_crosswalk_geoids_present.R

+    rename(state = any_of(c("state", "statefip")))
+
+  # Optional: filter metric by years
+  if (!is.null(years)) {


If I want to subset/check data for a specific year, this doesn't solve the problem of GEOIDs getting flagged as missing for crosswalks like crosswalk_puma_to_county.csv, which don't distinguish geographies on a yearly basis. To recreate my concern, you can run this code snippet:

validate_geographies(
  here::here("geographic-crosswalks", "data", "crosswalk_puma_to_county.csv"),
  here::here("08_education", "data", "final", "digital_access_county_all_longitudinal.csv"),
  years = c("2018")
)

I see what you're saying. When I run this snippet I get back:

missing_geoids
[1] "09110" "09120" "09130" "09140" "09150"
[6] "09160" "09170" "09180" "09190"

Which I believe are the CT planning regions.

However, I don't really know how to solve this without there being a year indicator in the crosswalk file. Do you have any suggestions? My instinct is to just give a warning and check whether these missing geoids are expected or not. What do you think?

I agree that a warning here would be helpful specifically for checking years when CT planning regions don't exist.

With the implementation of your suggestion of breaking the year comparison into two options depending on the way crosswalk is set up, I was able to get this to not return an error! Now it correctly returns that

✔ All geoids in `crosswalk_puma_to_county.csv` are present in `digital_access_county_all_longitudinal.csv`.

ridhi96 · 2025-07-01T00:29:11Z

functions/testing/check_crosswalk_geoids_present.R

+
+    if (!is.null(crosswalk_years)) {
+      out_of_scope_years <- setdiff(years, crosswalk_years)
+      effective_years <- intersect(years, crosswalk_years)


I think this effective_years piece of code could lead to bugs. For example, I executed this:

validate_geographies(
  here::here("geographic-crosswalks", "data", "crosswalk_puma_to_county.csv"),
  here::here("08_education", "data", "final", "digital_access_county_all_longitudinal.csv"),
  years = c("2018"),
  crosswalk_years = c("2020")
)

The result was a long list of GEOIDs printed as missing, which I'm guessing is because the intersection of years and crosswalk_years is empty. So even if the year is in the data and the crosswalk year is valid, this could lead to hard-to-understand output. (I think this could especially cause issues with crosswalk files that aren't year-specific.)

You are right that the issue here is the intersection between years and crosswalk_years. However, that happens because crosswalk_puma_to_county.csv does not have a year column. Instead, it uses crosswalk_period with values "2022" or "pre-2022". I can add some conditional logic to search for that column and create a vector of the years each period covers. What do you think?

I think a possible solution is to compare years in the metric data with the crosswalk years in a more direct manner. I'd suggest breaking the comparison into two options depending on the way crosswalk is set up.

If a crosswalk has a years or year column and lists years specifically then we subset metric and crosswalk data by matching the years for comparison for each unique year in the data.

If a crosswalk has a crosswalk_period column, then we should subset metric data for each unique year and compare it against the relevant crosswalk period. Per my understanding, we only have 2022 and pre-2022 periods in some files and other crosswalks don't have this so I like your suggestion of searching for this column and then comparing with metric. Essentially, we uniquely test each metric year in the data by checking the period it falls in.
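The second option can be sketched in base R as follows. The cutoff rule (years before 2022 map to "pre-2022", 2022 and later map to "2022"), the function names, and the toy data are assumptions for illustration; the thread only establishes that the two period values exist.

```r
# Assumption: years before 2022 belong to "pre-2022"; 2022 and later to "2022".
year_to_period <- function(year) {
  ifelse(as.integer(year) >= 2022, "2022", "pre-2022")
}

# Hypothetical helper: for each metric year, compare the metric GEOIDs with
# the crosswalk GEOIDs of the period that year falls in.
missing_by_year <- function(crosswalk, metric) {
  years <- sort(unique(metric$year))
  result <- lapply(years, function(yr) {
    period <- year_to_period(yr)
    expected <- unique(crosswalk$geoid[crosswalk$crosswalk_period == period])
    observed <- unique(metric$geoid[metric$year == yr])
    setdiff(expected, observed)
  })
  names(result) <- as.character(years)
  result
}

# Toy data (made up): two pre-2022 counties and two 2022 planning regions.
crosswalk <- data.frame(
  geoid = c("09001", "09003", "09110", "09120"),
  crosswalk_period = c("pre-2022", "pre-2022", "2022", "2022"),
  stringsAsFactors = FALSE
)
metric <- data.frame(
  geoid = c("09001", "09003", "09110"),
  year = c(2018, 2018, 2022),
  stringsAsFactors = FALSE
)

result <- missing_by_year(crosswalk, metric)
```

With this toy data, 2018 is compared only against the "pre-2022" rows, so the planning-region codes are not flagged for that year.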

Great suggestion! I added this logic in my latest commit!

ridhi96 · 2025-07-01T00:33:46Z

functions/testing/check_crosswalk_geoids_present.R

+#' @param crosswalk_col_names Optional character vector of column names in the crosswalk to build the GEOID. Defaults to inferred `c("state", "county")` or `c("state", "place")`.
+#' @param metric_col_names Optional character vector of column names in the metric file to build the GEOID. Same logic as `crosswalk_col_names`.
+#' @param years Optional numeric vector of years to restrict the check. The metric file must contain a `year` column if this is specified.
+#'


Adding an explanation for the usage of crosswalk_years parameter like others would be great!

Good catch! I missed this and will add documentation

Adds validation to ensure geography columns exist in both the crosswalk and metric datasets, providing informative error messages with suggestions for correction if columns are missing.

- Renamed *_col_names → *_geo_cols for clarity and consistency.
- Improved documentation to clarify expected structure and typical use of geography column inputs.
- Added validation to ensure both state and county/place columns are provided when user specifies custom geoid inputs.
- Enhanced error messaging using cli::cli_abort() for clearer, top-level feedback.
- Refactored year filtering logic into filter_by_crosswalk_year_logic() to:
  - Handle year columns directly.
  - Support crosswalk_period logic (e.g. "2022", "pre-2022").
  - Warn on unmatched periods and years.
- Preserved default inference of common column names (state, county, place), with fallbacks for variants like statefip, statefp, etc.

malcalakovalski · 2025-08-05T18:25:09Z

Hi @ridhi96, I think this is ready for another review when you have time/funding.

In particular, my last three commits:

- Renamed *_col_names → *_geo_cols for clarity and consistency.
- Improved documentation to clarify expected structure and typical use of geography column inputs.
- Added documentation for crosswalk year.
- Implemented fuzzy matching for when a user misspecifies a column name, as in your example.
- Added validation to ensure both state and county/place columns are provided when user specifies custom geoid inputs.
- Enhanced error messaging using cli::cli_abort() for clearer, top-level feedback.
- Refactored year filtering logic into filter_by_crosswalk_year_logic() to:
  - Handle year columns directly.
  - Support crosswalk_period logic (e.g. "2022", "pre-2022").
  - Warn on unmatched periods and years.
- Preserved default inference of common column names (state, county, place), with fallbacks for variants like statefip, statefp, etc.

It would be helpful to get feedback on whether (a) I've fully addressed your review and (b) unit tests seem appropriate here given how complex the function is. If so, do you have any top priorities for things to test?

ridhi96

Hi @malcalakovalski,

Thank you for taking my suggestions and updating the code! It works really well and the error messages are very informative!

I'm not sure if we have the budget to add unit tests, but here's a list of things I tested the function against (in case it is helpful to note this somewhere for future reference):

Provided the function with a metric year that doesn't exist in the data file.
Provided the function with an incorrect column name e.g., "stte" instead of "state"
For a county metric file, provided the function with a place crosswalk file. In this case, the function printed all GEOIDs which aren't present.
Tested a few different county and place metric files.

There's one more thing I'd like to flag but I think it's okay even if we don't address it because it will surface/be fixed elsewhere in the metric finalization process. If there's a metric with multiple years and if one of the years is missing some GEOIDs, then the function won't flag them as missing. This is because after making sure years in metric data and crosswalk file match, we compare distinct GEOIDs across the years.
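To make the limitation concrete, the toy example below contrasts a pooled comparison of distinct GEOIDs with a per-year comparison. The data and variable names are made up for illustration and do not come from the PR.

```r
# Toy data (made up): "09003" is present in 2018 but absent in 2019.
crosswalk_geoids <- c("09001", "09003")
metric <- data.frame(
  geoid = c("09001", "09003", "09001"),
  year = c(2018, 2018, 2019),
  stringsAsFactors = FALSE
)

# Pooled check: distinct GEOIDs across all years, so nothing looks missing.
pooled_missing <- setdiff(crosswalk_geoids, unique(metric$geoid))

# Per-year check: split the metric GEOIDs by year and compare each year separately.
per_year_missing <- lapply(
  split(metric$geoid, metric$year),
  function(geoids) setdiff(crosswalk_geoids, geoids)
)
```

The pooled result is empty, while the per-year result shows the gap in 2019, which is the case the comment above says would go unflagged.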

I think we're pretty close to getting this through, thank you for the hard work in implementing this check!

ridhi96 · 2025-10-21T22:12:48Z

functions/testing/check_crosswalk_geoids_present.R

+                                 crosswalk_geo_cols = NULL,
+                                 metric_geo_cols = NULL,
+                                 years = NULL,
+                                 crosswalk_years = NULL) {


The crosswalk_years parameter is not really used anywhere in the code (apart from being declared), so it's probably best to remove it.

You're right, this was leftover from a previous version of the function. I addressed it now!

ridhi96 · 2025-10-21T22:13:36Z

functions/testing/check_crosswalk_geoids_present.R

+    missing_geoids = list(missing_geoids)
+  ) %>% invisible()
+}
+filter_by_crosswalk_year_logic <- function(crosswalk, metric, years = NULL, crosswalk_years = NULL) {


Because the code is very complex, I'd suggest adding docstrings for the helper functions as well.

malcalakovalski · 2025-10-29T18:49:39Z

@ridhi96 Thank you for the thoughtful re-review! I have addressed your suggestion of removing the crosswalk_years argument and adding docstring documentation to the helper functions in the file.

While I think your suggested unit tests sound super helpful, I opened a separate issue #537 to address that as we don't have time/budget in this round

Adds function to validate crosswalk geoids

9ca1cc0


malcalakovalski requested a review from cdsolari June 18, 2025 19:00

malcalakovalski marked this pull request as ready for review June 18, 2025 19:00

cdsolari requested review from ridhi96 and removed request for cdsolari June 20, 2025 14:23

ridhi96 reviewed Jun 26, 2025

View reviewed changes

ridhi96 requested changes Jul 1, 2025

View reviewed changes

malcalakovalski added 3 commits August 1, 2025 13:58

Validates crosswalk and metric geography columns

de0f401


Adds crosswalk year documentation

7931aae

malcalakovalski requested a review from ridhi96 September 17, 2025 15:55

awunderground changed the base branch from version2026 to tranche1 October 21, 2025 13:09

ridhi96 requested changes Oct 21, 2025

View reviewed changes

malcalakovalski added 2 commits October 29, 2025 14:47

Remove superfluous crosswalk_years argument

9d98388

Add docstring documentation to helper functions

dd5e30a

malcalakovalski mentioned this pull request Oct 29, 2025

Create unit tests for function to validate crosswalk geoids #537

Open

Base automatically changed from tranche1 to main December 22, 2025 12:39
