Validate CSV data blazingly-fast using JSON Schema Validation (Draft 2020-12) (e.g. up to 780,031 rows/second[^1] using NYC's 311 schema generated by the schema command) & put invalid records into a separate file along with a detailed validation error report.

Supports several custom JSON Schema formats & keywords:
- currency - custom format with ISO-4217 validation
- dynamicEnum - custom keyword that supports enum validation against a CSV on the filesystem or a URL (http/https/ckan & dathere URL schemes supported)
- uniqueCombinedWith - custom keyword to validate uniqueness across multiple columns for composite key validation

If no JSON Schema file is provided, validates if a CSV conforms to the RFC 4180 standard and is UTF-8 encoded.
Table of Contents | Source: src/cmd/validate.rs | 📇🚀🌐📚🗄️
Description | Examples | Usage | Arguments | Validate Options | Fancy Regex Options | Options For Both Regex Engines | Email Validation Options | Common Options
Description ↩
Validates CSV data using two main modes:
This mode is invoked if a JSON Schema file (draft 2020-12) is provided.
The CSV data is validated against the JSON Schema. If the CSV data is valid, no output files are created and the command returns an exit code of 0.
If invalid records are found, they are put into an "invalid" file, with the rest of the records put into a "valid" file.
A "validation-errors.tsv" report is also created with the following columns:
- row_number: the row number of the invalid record
- field: the field name of the invalid field
- error: a validation error message detailing why the field is invalid
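For illustration, a hypothetical validation-errors.tsv might look like this (the row numbers, field names, and error messages below are invented examples, not actual qsv output):

```tsv
row_number	field	error
5	age	"seventeen" is not of type "integer"
12	zip_code	"1234" does not match "^[0-9]{5}$"
```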
It uses the JSON Schema Validation Specification (draft 2020-12) to validate the CSV. It validates the structure of the file, as well as the data types and domain/range of the fields. See https://json-schema.org/draft/2020-12/json-schema-validation.html
qsv supports a custom format - currency. This format will only accept a valid currency, defined as:
- ISO Currency Symbol (optional): This is the ISO 4217 three-character code or currency symbol (e.g. USD, EUR, JPY, $, €, ¥, etc.)
- Amount: This is the numerical value of the currency. More than 2 decimal places are allowed.
- Formats: Valid currency formats include: Standard: $1,000.00 or USD1000.00 Negative amounts: ($100.00) or -$100.00 Different styles: 1.000,00 (used in some countries for euros)
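To use the custom format in a schema, declare it with the standard JSON Schema "format" keyword. A minimal sketch (the "unit_price" property name is an illustrative assumption, not from the source):

```json
{
  "properties": {
    "unit_price": {
      "type": "string",
      "format": "currency"
    }
  }
}
```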
qsv also supports two custom keywords - `dynamicEnum` and `uniqueCombinedWith`.

dynamicEnum allows for dynamic validation against a reference CSV file.
It can be used to validate against a set of values unknown at the time of schema creation or
when the set of valid values is dynamic or too large to hardcode into the JSON Schema with enum.
The reference CSV file can be local or a URL (http/https, dathere & ckan schemes supported).
The "dynamicEnum" value has the form:
// qsvlite binary variant only supports URIs which can be files on the local filesystem
// or remote files (http and https schemes supported)
dynamicEnum = "URI|colname" where colname is the column name or column index (0-based)

// use data.csv from the current working directory; use the 1st column for validation
dynamicEnum = "data.csv"

// use data.csv in the /lookup_dir directory; use the column "Agency" for validation
dynamicEnum = "/lookup_dir/data.csv|Agency"

// get data.csv; use the 3rd column for validation (2 as the col index is 0-based)
dynamicEnum = "https://example.com/data.csv|2"

// on other qsv binary variants, dynamicEnum has expanded caching functionality
dynamicEnum = "[cache_name;cache_age]|URI|colname" where cache_name and cache_age are optional

// use data.csv from current working directory; cache it as data with a default
// cache age of 3600 seconds i.e. the cached data.csv expires after 1 hour
dynamicEnum = "data.csv"

// get data.csv; cache it as custom_name, cache age 600 seconds
dynamicEnum = "custom_name;600|https://example.com/data.csv"

// get data.csv; cache it as data, cache age 800 seconds
dynamicEnum = ";800|https://example.com/data.csv"

// get the top matching result for nyc_neighborhoods (signaled by trailing ?),
// cache it as nyc_neighborhood_data.csv (NOTE: cache name is required when using the CKAN scheme)
// with a default cache age of 3600 seconds
// be sure to set --ckan-api, otherwise it will default to datHere's CKAN (data.dathere.com)
dynamicEnum = "nyc_neighborhood_data|ckan://nyc_neighborhoods?"

// get CKAN resource with id 1234567, cache it as resname, 3600 secs cache age
// note that if the resource is a private resource, you'll need to set --ckan-token
dynamicEnum = "resname|ckan://1234567"

// same as above but with a cache age of 100 seconds; use the borough column for validation
dynamicEnum = "resname;100|ckan://1234567|borough"

// get us_states.csv from datHere lookup tables
dynamicEnum = "dathere://us_states.csv"
If colname is not specified, the first column of the CSV file is read and used for validation.
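Within a schema, dynamicEnum is attached to a field's subschema like any other keyword. A minimal sketch, assuming a hypothetical "agency" column validated against an illustrative local lookup file:

```json
{
  "properties": {
    "agency": {
      "type": "string",
      "dynamicEnum": "lookup/agencies.csv|Agency"
    }
  }
}
```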
uniqueCombinedWith allows you to validate that combinations of values across specified columns
are unique. It can be used with either column names or column indices (0-based). For example:
// Validate that combinations of name and email are unique
uniqueCombinedWith = ["name", "email"]

// Validate that combinations of columns at indices 1 and 2 are unique
uniqueCombinedWith = [1, 2]

// Validate that the combinations of named and indexed columns are unique
uniqueCombinedWith = ["name", 2]
When a duplicate combination is found, the validation will fail and the error message will indicate which columns had duplicate combinations (named columns first, then indexed columns). The invalid records will be written to the .invalid file, while valid records will be written to the .valid file.
uniqueCombinedWith complements the standard uniqueItems keyword, which can only validate
uniqueness across a single column.
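As a hedged sketch of how this might appear in a schema (placing the keyword at the top level alongside properties is an assumption here; check a schema generated by qsv schema for the exact layout):

```json
{
  "properties": {
    "name": { "type": "string" },
    "email": { "type": "string" }
  },
  "uniqueCombinedWith": ["name", "email"]
}
```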
You can create a JSON Schema file from a reference CSV file using the qsv schema command.
Once the schema is created, you can fine-tune it to your needs and use it to validate other CSV
files that have the same structure.
Be sure to select a "training" CSV file that is representative of the data you want to validate when creating a schema. The data types, domain/range and regular expressions inferred from the reference CSV file should be appropriate for the data you want to validate.
Typically, after creating a schema, you should edit it to fine-tune each field's inferred validation rules.
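For instance, you might tighten an inferred field by hand. A hypothetical edit using standard JSON Schema keywords (the field name and bounds are illustrative):

```json
{
  "properties": {
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 120
    }
  }
}
```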
For example, suppose we created a JSON Schema file called "reference.schema.json" using the schema command and want to validate "mydata.csv", which we know has validation errors. Running

qsv validate mydata.csv reference.schema.json

produces the output files:
- mydata.csv.valid
- mydata.csv.invalid
- mydata.csv.validation-errors.tsv
With an exit code of 1 to indicate a validation error.
If we validate another CSV file, "mydata2.csv", which we know is valid, there are no output files, and the exit code is 0.
If piped from stdin, the filenames will use stdin.csv as the base filename. For example:
cat mydata.csv | qsv validate reference.schema.json
- stdin.csv.valid
- stdin.csv.invalid
- stdin.csv.validation-errors.tsv
validate also has a schema subcommand to validate JSON Schema files themselves. E.g.
qsv validate schema myjsonschema.json
// ignore format validation
qsv validate schema --no-format-validation myjsonschema.json
If run without a JSON Schema file, the CSV is validated for RFC 4180 CSV standard compliance (see https://github.com/dathere/qsv#rfc-4180-csv-standard).
It also confirms if the CSV is UTF-8 encoded.
For both modes, returns exit code 0 when the CSV file is valid, and an exit code > 0 otherwise. If all records are valid, no output files are produced.
Examples ↩
Validate a CSV file. Use this to check if a CSV file is readable by qsv.
qsv validate data.csv

Validate a TSV file against a JSON Schema

qsv validate data.tsv schema.json

Validate multiple CSV files using various dialects against a JSON Schema

qsv validate data1.csv data2.tab data3.ssv schema.json

Validate all CSV files in a directory against a JSON Schema

qsv validate /path/to/csv_directory schema.json

Validate CSV files listed in a '.infile-list' file against a JSON Schema

qsv validate files.infile-list schema.json

For more examples, see the tests included in this file (denoted by '#[test]') or see
https://github.com/dathere/qsv/blob/master/tests/test_validate.rs.
Usage ↩
qsv validate schema [--no-format-validation] [<json-schema>]
qsv validate [options] [<input>...]
qsv validate [options] [<input>] <json-schema>
qsv validate --help

Arguments ↩
| Argument | Description |
|---|---|
<input> |
... Input CSV file(s) to validate. If not provided, will read from stdin. If input is a directory, all files in the directory will be validated. If the input is a file with a '.infile-list' extension, the file will be read as a list of input files. If the inputs are snappy-compressed files, they will be decompressed automatically. Extended Input Support is only available for RFC 4180 validation mode. |
<json-schema> |
JSON Schema file to validate against. If not provided, validate will run in RFC 4180 validation mode. The file can be a local file or a URL (http and https schemes supported). |
Validate Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑trim |
flag | Trim leading and trailing whitespace from fields before validating. | |
‑‑no‑format‑validation |
flag | Disable JSON Schema format validation. Ignores all JSON Schema "format" keywords (e.g. date, email, uri, currency, etc.). This is useful when you want to validate the structure of the CSV file without worrying about the data types and domain/range of the fields. | |
‑‑fail‑fast |
flag | Stops on first error. | |
‑‑valid |
string | Valid record output file suffix. | valid |
‑‑invalid |
string | Invalid record output file suffix. | invalid |
‑‑json |
flag | When validating without a JSON Schema, return the RFC 4180 check as a JSON file instead of a message. | |
‑‑pretty‑json |
flag | Same as --json, but pretty printed. | |
‑‑valid‑output |
string | Changes validation mode behavior so that if ALL rows are valid, they are passed to the output, the exit code is 1, and stderr is set to the number of valid rows. Setting this overrides the default behavior of creating a valid file only when there are invalid records. To send valid records to stdout, use - as the filename. |
|
‑j,‑‑jobs |
string | The number of jobs to run in parallel. When not set, the number of jobs is set to the number of CPUs detected. | |
‑b,‑‑batch |
string | The number of rows per batch to load into memory, before running in parallel. Automatically determined for CSV files with more than 50000 rows. Set to 0 to load all rows in one batch. Set to 1 to force batch optimization even for files with less than 50000 rows. | 50000 |
Fancy Regex Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑fancy‑regex |
flag | Use the fancy regex engine instead of the default regex engine for validation. The fancy engine supports advanced regex features such as lookaround and backreferences, but is not as performant as the default regex engine which guarantees linear-time matching, prevents DoS attacks, and is more efficient for simple patterns. | |
‑‑backtrack‑limit |
string | Set the approximate number of backtracking steps allowed. This is only used when --fancy-regex is set. | 1000000 |
Options For Both Regex Engines ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑size‑limit |
string | Set the approximate size limit, in megabytes, of a compiled regex. | 50 |
‑‑dfa‑size‑limit |
string | Set the approximate capacity, in megabytes, of the cache of transitions used by the engine's lazy Deterministic Finite Automaton (DFA). | 10 |
‑‑timeout |
string | Timeout for downloading json-schemas on URLs and for 'dynamicEnum' lookups on URLs. If 0, no timeout is used. | 30 |
‑‑cache‑dir |
string | The directory to use for caching downloaded dynamicEnum resources. If the directory does not exist, qsv will attempt to create it. If the QSV_CACHE_DIR envvar is set, it will be used instead. Not available on qsvlite. | ~/.qsv-cache |
‑‑ckan‑api |
string | The URL of the CKAN API to use for downloading dynamicEnum resources with the "ckan://" scheme. If the QSV_CKAN_API envvar is set, it will be used instead. Not available on qsvlite. | https://data.dathere.com/api/3/action |
‑‑ckan‑token |
string | The CKAN API token to use. Only required if downloading private resources. If the QSV_CKAN_TOKEN envvar is set, it will be used instead. Not available on qsvlite. |
Email Validation Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑email‑required‑tld |
flag | Require the email to have a valid Top-Level Domain (TLD) (e.g. .com, .org, .net, etc.). e.g. "john.doe@example" is VALID if this option is NOT set. | |
‑‑email‑display‑text |
flag | Allow display text in emails. e.g. "John Doe john.doe@example.com" is INVALID if this option is NOT set. | |
‑‑email‑min‑subdomains |
string | Minimum number of subdomains required in the email. e.g. "jdoe@example.com" is INVALID if this option is set to 3, but "jdoe@sub.example.com" is VALID. | 2 |
‑‑email‑domain‑literal |
flag | Allow domain literals in emails. e.g. "john.doe@[127.0.0.1]" is VALID if this option is set. |
Common Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑h,‑‑help |
flag | Display this message | |
‑n,‑‑no‑headers |
flag | When set, the first row will not be interpreted as headers. It will be validated with the rest of the rows. Otherwise, the first row will always appear as the header row in the output. Note that this option is only valid when running in RFC 4180 validation mode as JSON Schema validation requires headers. | |
‑d,‑‑delimiter |
string | The field delimiter for reading CSV data. Must be a single character. | |
‑p,‑‑progressbar |
flag | Show progress bars. Not valid for stdin. | |
‑q,‑‑quiet |
flag | Do not display validation summary message. |