Mixed data type fields in Firestore lead to extra fields downstream in BigQuery #1262

jacobmpeters · 2025-03-18T16:44:08Z

jacobmpeters
Mar 18, 2025
Collaborator

A recurring issue is that some fields in Firestore end up holding values of differing data types across documents.

For example, if the data look something like this in Firestore...

[
  {
    "Connect_ID": 1,
    "x": "1"  // Stored as string
  },
  {
    "Connect_ID": 2,
    "x": 2  // Stored as integer
  },
  {
    "Connect_ID": 3,
    "x": ""  // Stored as string (empty)
  }
]

we get a complex table like this...

Connect_ID	x.string	x.integer	x.provided
1	"1"	null	"string"
2	null	2	"integer"
3	""	null	"string"

...instead of something clean like this...

Connect_ID	x
1	1
2	2
3	null

This gets particularly difficult to work with for nested structs.
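As a rough illustration of the mapping between the two tables above, here is a minimal Python sketch. It assumes each BigQuery row arrives as a flat dict keyed by the split column names (`x.string`, `x.integer`), and that empty or non-numeric strings should become null. The helper name `coalesce_mixed` is hypothetical, not something in our code base.

```python
from typing import Any, Optional


def coalesce_mixed(row: dict[str, Any], field: str) -> Optional[int]:
    """Collapse `<field>.integer` / `<field>.string` into one integer or None."""
    as_int = row.get(f"{field}.integer")
    if as_int is not None:
        return int(as_int)
    as_str = row.get(f"{field}.string")
    if as_str is None or as_str.strip() == "":
        return None  # assumption: empty strings are treated as missing
    try:
        return int(as_str.strip())
    except ValueError:
        return None  # assumption: non-numeric strings are treated as missing


# Rows shaped like the "complex" table above.
rows = [
    {"Connect_ID": 1, "x.string": "1", "x.integer": None, "x.provided": "string"},
    {"Connect_ID": 2, "x.string": None, "x.integer": 2, "x.provided": "integer"},
    {"Connect_ID": 3, "x.string": "", "x.integer": None, "x.provided": "string"},
]
clean = [{"Connect_ID": r["Connect_ID"], "x": coalesce_mixed(r, "x")} for r in rows]
```

This only covers the flat STRING/INTEGER case. Nested structs would need the same logic applied per leaf field.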

This SQL query finds all examples of these issues in production:

-- Find all variables in the Connect dataset that have mixed data types.
-- + Mixed data types include suffixes like ".string", ".integer", etc., which indicate the data type of the value.
-- + The '.provided' field specifies the data type in which the value is given.

SELECT table_catalog, table_schema, table_name, column_name, field_path
FROM `nih-nci-dceg-connect-prod-6d04`.`Connect`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  REGEXP_CONTAINS(LOWER(field_path), '\\.(provided|string|integer|entity)')  -- escape the dot so it only matches a literal '.'
  AND NOT REGEXP_CONTAINS(field_path, '\\.__key__')  -- we can ignore the '.__key__' fields which don't matter
ORDER BY table_name, column_name, field_path;

This is a CSV file with the results: https://nih.app.box.com/folder/312339347849

I would like to correct these mixed data fields in Firestore directly rather than fixing them as they appear in BigQuery. It would also be great to impose some sort of type checking on these fields so that these issues don't keep popping up and unexpectedly disrupting our analytic workflows.
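To make the type-checking idea concrete, here is a minimal sketch of a check that could run before a document is written. Everything in it is an assumption for illustration: the `EXPECTED_TYPES` schema, the field names in it, and the `type_violations` helper are hypothetical, and the real check would need to live wherever the writes actually happen.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
EXPECTED_TYPES: dict[str, type] = {"Connect_ID": int, "x": int}


def type_violations(doc: dict[str, Any], schema: dict[str, type] = EXPECTED_TYPES) -> list[str]:
    """Return one message per field whose value does not match the schema."""
    problems = []
    for field, expected in schema.items():
        value = doc.get(field)
        if value is None:
            continue  # assumption: missing or null values are allowed
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

For example, `type_violations({"Connect_ID": 1, "x": "1"})` reports the string in `x`, while a document with `"x": 2` passes with no messages.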

Does anyone have a good idea about the source of these issues and why they continue to arise? @we-ai @anthonypetersen @JoeArmani

Cleaning these fields up is a priority for our PR2 investigator-facing data warehouse work.

we-ai · 2025-03-18T17:35:49Z

we-ai
Mar 18, 2025
Maintainer

If mixed STRING and INTEGER types have popped up again after previous cleanups, there must be one or more bugs that save data in the wrong type. We may need to list each instance, find the root causes (likely in the code base), and fix them. Cleaning up mixed STRING and INTEGER types is easy, since integers can be easily converted to strings and vice versa.

I agree that mixed STRING and RECORD (object) types are more difficult. We previously discussed this, but didn't reach a consensus on how to do cleanups. Also, we need to find out and fix the root causes of these.

2 replies

jacobmpeters Mar 19, 2025
Collaborator Author

We might determine that the responses with the STRING values when a RECORD is expected are a tiny fraction of the data and can be dropped. I am doubtful that we can map them back to the original question in some cases. I think we need to assign an analyst to this issue to be sure before moving forward.

jacobmpeters Mar 19, 2025
Collaborator Author

We should also determine whether these are all old issues or if there are any new instances.

anthonypetersen · 2025-03-18T18:02:55Z

anthonypetersen
Mar 18, 2025
Maintainer

The two biospecimen variables are tied to Blood and Urine Accession IDs. As Warren mentioned, we need to figure out where in code these are being set incorrectly as well as do a mass data correction update.

@jacobmpeters can you please explain what the survey specific ones look like in data? Are these nested questions that are supposed to have objects, but some got sent as just strings?

The final three lines are a bug that I can clean up.

@jacobmpeters for the notifications error code one... are the ones that are strings just instances of empty strings?

5 replies

jacobmpeters Mar 18, 2025
Collaborator Author

For the errorCode one, there are instances where there are empty strings and where there are 5-digit codes wrapped in strings.

jacobmpeters Mar 18, 2025
Collaborator Author

> can you please explain what the survey specific ones look like in data? Are these nested questions that are supposed to have objects, but some got sent as just strings?

I'll try to take a closer look at a couple examples when I have some bandwidth and get back to you.

anthonypetersen Mar 18, 2025
Maintainer

> For the errorCode one, there are instances where there are empty strings and where there are 5-digit codes wrapped in strings.

That makes sense. Do we see any reason that the notifications table would be in PR2?

jacobmpeters Mar 18, 2025
Collaborator Author

No, I don't think there are any plans to include the notifications table in PR2. I just flagged all related issues, so this one is not a priority. The survey data is definitely the priority.

I have been working on a pipeline to clean up our data for PR2, but this issue is a game of whack-a-mole and it's hard to define an ETL process to handle all of these edge cases, so I think we'll need to try to fix it upstream of BQ.

anthonypetersen Mar 18, 2025
Maintainer

That makes sense, either way it's good to know all of the issues.

we-ai · 2025-03-18T18:32:27Z

we-ai
Mar 18, 2025
Maintainer

Related discussions earlier: issue#938.

0 replies

brotzmanmj · 2025-03-18T19:52:34Z

brotzmanmj
Mar 18, 2025
Collaborator

From above 'The two biospecimen variables are tied to Blood and Urine Accession IDs. As Warren mentioned, we need to figure out where in code these are being set incorrectly as well as do a mass data correction update.'

These are scanned into the clinical biospec dashboard by the sites. They exist independently of us (they are assigned by the health care systems), we are capturing what they scan and need to accept it. The dictionary says they are 'numeric'. I believe the expected behavior is that when the sites scan something that has a leading or trailing character, the dashboard removes it when it stores the data, but I am not certain. We would need someone to help us look at the data and the interface and check what it is doing. Thanks.

1 reply

we-ai Mar 18, 2025
Maintainer

The strings can be the fallback values. Whenever there are conflicts between integers and strings, converting integers to strings is always fine, but converting strings to integers may not work.
In cases where we cannot control the data inputs, using string values can be easier for us.
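A minimal sketch of that asymmetry, assuming values arrive as either Python `int` or `str`. The helper names `to_string` and `to_integer` and the sample values are hypothetical.

```python
import re
from typing import Optional, Union


def to_string(value: Union[int, str]) -> str:
    """Integer -> string never fails."""
    return str(value)


def to_integer(value: Union[int, str]) -> Optional[int]:
    """String -> integer only works for plain digit strings; otherwise None."""
    if isinstance(value, int):
        return value
    text = value.strip()
    # Accept only an optional minus sign followed by ASCII digits.
    if re.fullmatch(r"-?[0-9]+", text) is None:
        return None
    return int(text)  # note: leading zeros (e.g. "00123") are lost here
```

With these helpers, `to_string(2)` gives `"2"`, while `to_integer("A123")` and `to_integer("")` both give `None`, which is why strings are the safer common type when the inputs are outside our control.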

jacobmpeters · 2025-03-19T16:22:20Z

jacobmpeters
Mar 19, 2025
Collaborator Author

@FrogGirl1123 I think we might benefit from assigning someone from the Analytics Team to develop a simple report to detect mixed data type issues as they arise and provide DevOps with the necessary guidance to correct the issues. I think @KELSEYDOWLING7 and I are both too swamped at the moment.
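The core of such a report could look something like the sketch below. It assumes the documents have already been exported as plain Python dicts; how they are fetched is out of scope here. The function names are hypothetical. It walks nested objects and reports every field path where more than one value type was seen.

```python
from collections import defaultdict
from typing import Any, Iterable


def _collect(doc: dict[str, Any], prefix: str, seen: dict[str, set[str]]) -> None:
    """Record the type name seen at each dotted field path, recursing into objects."""
    for key, value in doc.items():
        if value is None:
            continue  # assumption: nulls do not count as a conflicting type
        path = f"{prefix}{key}"
        seen[path].add(type(value).__name__)
        if isinstance(value, dict):
            _collect(value, path + ".", seen)


def mixed_type_report(docs: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
    """Return only the field paths that hold more than one type across documents."""
    seen: dict[str, set[str]] = defaultdict(set)
    for doc in docs:
        _collect(doc, "", seen)
    return {path: types for path, types in seen.items() if len(types) > 1}


# The example documents from the top of this thread.
docs = [
    {"Connect_ID": 1, "x": "1"},
    {"Connect_ID": 2, "x": 2},
    {"Connect_ID": 3, "x": ""},
]
report = mixed_type_report(docs)
```

For the example documents, `report` flags only `x`, with types `int` and `str`. A field that is sometimes a string and sometimes an object would show up as `str` and `dict`.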

0 replies

FrogGirl1123 · 2025-03-27T20:29:59Z

FrogGirl1123
Mar 27, 2025
Collaborator

Let's assign @hullingsag. Autumn, please work with Jake to create a weekly report that DevOps can use to correct this issue. @jacobmpeters I don't have the ability to do more than tag Autumn, so please assign her.

2 replies

jacobmpeters Mar 27, 2025
Collaborator Author

@hullingsag I'm happy to meet with you to kick this off. It should be pretty straightforward.

hullingsag Mar 28, 2025
Collaborator

Thanks @jacobmpeters! I'll reach out on Teams.
